1. 04 Feb, 2015 34 commits
    • sched: Disable energy-unfriendly nohz kicks · f05d25b9
      Morten Rasmussen authored
      
      
      With energy-aware scheduling enabled, nohz_kick_needed() generates many
      nohz idle-balance kicks which lead to nothing when multiple tasks get
      packed on a single cpu to save energy. This causes unnecessary wake-ups
      and hence wastes energy. Make these conditions depend on !energy_aware()
      for now until the energy-aware nohz story gets sorted out.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Enable active migration for cpus of lower capacity · c13fa615
      Morten Rasmussen authored
      
      
      Add an extra criterion to need_active_balance() to kick off active load
      balance if the source cpu is overutilized and has lower capacity than
      the destination cpus.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Turn off fast idling of cpus on a partially loaded system · add7d058
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      
      We do not want to miss out on the ability to do energy-aware idle load
      balancing if the system is only partially loaded since the operational
      range of energy-aware scheduling corresponds to a partially loaded
      system. We might want to pull a single remaining task from a potential
      src cpu towards an idle destination cpu if the energy model tells us
      this is worth doing to save energy.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • sched: Skip cpu as lb src which has one task and capacity gte the dst cpu · 7369d2b4
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      
      Skip cpu as a potential src (costliest) in case it has only one task
      running and its original capacity is greater than or equal to the
      original capacity of the dst cpu.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • sched: Tipping point from energy-aware to conventional load balancing · bdd70613
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      
      Energy-aware load balancing is based on cpu usage, so the upper bound of
      its operational range is a fully utilized cpu. Above this tipping point
      it makes more sense to use weighted_cpuload() to preserve smp_nice.
      This patch implements the tipping point detection in update_sg_lb_stats():
      if one cpu is over-utilized, the current energy-aware load balance
      operation falls back into the conventional weighted-load based one.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • sched: Introduce energy awareness into detach_tasks · 3aafbd09
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      
      Energy-aware load balancing does not rely on env->imbalance; instead it
      evaluates the system-wide energy difference for each task on the src rq
      if it were moved to the dst rq. If this energy difference is less than
      zero the task is actually moved from the src to the dst rq.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • sched: Introduce energy awareness into find_busiest_queue · 42a039b5
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      
      If, after the gathering of sched domain statistics, the current load
      balancing operation is still in energy-aware mode and a least efficient
      sched group has been found, detect the least efficient cpu by comparing
      the cpu efficiency (ratio between cpu usage and cpu energy consumption)
      among all cpus of the least efficient sched group.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • sched: Introduce energy awareness into find_busiest_group · cbd6cef7
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      
      If, after the gathering of sched domain statistics, the current load
      balancing operation is still in energy-aware mode, just return the
      least efficient (costliest) reference. This implies the system is
      considered to be balanced when no least efficient sched group was
      found.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • sched: Introduce energy awareness into update_sd_lb_stats · b6ab76d3
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      
      Energy-aware load balancing has to work alongside the conventional load
      based functionality. This includes the tipping point feature, i.e. being
      able to fall back from energy aware to the conventional load based
      functionality during an ongoing load balancing action.
      That is why this patch introduces an additional reference to hold the
      least efficient sched group (costliest) as well as its statistics in the
      form of an extra sg_lb_stats structure (costliest_stat).
      The function update_sd_pick_costliest() is used to assign the least
      efficient sched group, parallel to the existing update_sd_pick_busiest().
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • sched: Introduce energy awareness into update_sg_lb_stats · c758855f
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      
      To be able to identify the least efficient (costliest) sched group
      introduce group_eff as the efficiency of the sched group into sg_lb_stats.
      The group efficiency is defined as the ratio between the group usage and
      the group energy consumption.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • sched: Infrastructure to query if load balancing is energy-aware · 90f23141
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      
      Energy-aware load balancing should only happen if the ENERGY_AWARE feature
      is turned on and the sched domain on which the load balancing is performed
      contains energy data.
      During a load balance action there is also a need to be able to query
      whether we should continue to load balance energy-aware or have reached
      the tipping point, which forces us to fall back to the conventional load
      balancing functionality.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • sched: Determine the current sched_group idle-state · 243cf415
      Morten Rasmussen authored
      
      
      To estimate the energy consumption of a sched_group in
      sched_group_energy() it is necessary to know which idle-state the group
      is in when it is idle. For now, it is assumed that this is the current
      idle-state (though it might be wrong). Based on the individual cpu
      idle-states, group_idle_state() finds the group idle-state.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Count number of shallower idle-states in struct sched_group_energy · 565ccb57
      Morten Rasmussen authored
      
      
      cpuidle associates all idle-states with each cpu while the energy model
      associates them with the sched_group covering the cpus coordinating
      entry to the idle-state. To get idle-state power consumption it is
      therefore necessary to translate from cpuidle idle-state index to energy
      model index. For this purpose it is helpful to know how many idle-states
      are listed in lower level sched_groups (in struct sched_group_energy).
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched, cpuidle: Track cpuidle state index in the scheduler · db8d6fb6
      Morten Rasmussen authored
      
      
      The idle-state of each cpu is currently pointed to by rq->idle_state,
      but there isn't any information in struct cpuidle_state that can be
      used to look up the idle-state energy model data stored in struct
      sched_group_energy. For this purpose it is necessary to store the
      idle-state index as well. Ideally, the idle-state data should be unified.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Bias new task wakeups towards higher capacity cpus · 5ea824fc
      Morten Rasmussen authored
      
      
      Make wake-ups of new tasks (find_idlest_group) aware of any differences
      in cpu compute capacity so new tasks don't get handed off to cpus with
      lower capacity.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Energy-aware wake-up task placement · cc8d42bb
      Morten Rasmussen authored
      
      
      Let available compute capacity and estimated energy impact select the
      wake-up target cpu when energy-aware scheduling is enabled.
      energy_aware_wake_cpu() attempts to find a group of cpus with sufficient
      compute capacity to accommodate the task, and then a cpu with enough
      spare capacity to handle the task within that group. Preference is given
      to cpus with enough spare capacity at the current OPP. Finally, the
      energy impact of the new target and the previous task cpu is compared to
      select the wake-up target cpu.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Estimate energy impact of scheduling decisions · 0a6e3510
      Morten Rasmussen authored
      
      
      Add a generic energy-aware helper function, energy_diff(), that
      calculates the energy impact of adding, removing, or migrating
      utilization in the system.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Extend sched_group_energy to test load-balancing decisions · 24736b79
      Morten Rasmussen authored
      
      
      Extend sched_group_energy() to support energy prediction with usage
      (tasks) added/removed from a specific cpu or migrated between a pair of
      cpus. Useful for load-balancing decision making.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Calculate energy consumption of sched_group · 716bd56f
      Morten Rasmussen authored
      
      
      For energy-aware load-balancing decisions it is necessary to know the
      energy consumption estimates of groups of cpus. This patch introduces a
      basic function, sched_group_energy(), which estimates the energy
      consumption of the cpus in the group and any resources shared by the
      members of the group.
      
      NOTE: The function has five levels of indentation and breaks the 80
      character limit. Refactoring is necessary.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Highest energy aware balancing sched_domain level pointer · eb3af42f
      Morten Rasmussen authored
      
      
      Add another member to the family of per-cpu sched_domain shortcut
      pointers. This one, sd_ea, points to the highest level at which an
      energy model is provided. At this level and all levels below, all
      sched_groups have energy model data attached.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Use capacity_curr to cap utilization in get_cpu_usage() · c6a4f6ed
      Morten Rasmussen authored
      
      
      With scale-invariant usage tracking get_cpu_usage() should never return
      a usage above the current compute capacity of the cpu (capacity_curr).
      The scaling of the utilization tracking contributions should generally
      cause the cpu utilization to saturate at capacity_curr, but it may
      temporarily exceed this value in certain situations. This patch changes
      the cap from capacity_orig to capacity_curr.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Relocated get_cpu_usage() · 857c1377
      Morten Rasmussen authored
      
      
      Move get_cpu_usage() to an earlier position in fair.c.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Compute cpu capacity available at current frequency · cefa515b
      Morten Rasmussen authored
      
      
      capacity_orig_of() returns the max available compute capacity of a cpu.
      For scale-invariant utilization tracking and energy-aware scheduling
      decisions it is useful to know the compute capacity available at the
      current OPP of a cpu.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • arm: topology: Define TC2 energy and provide it to the scheduler · 8f965e03
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      
      This patch is only here to be able to test provisioning of energy related
      data from an arch topology shim layer to the scheduler. Since there is no
      code today which deals with extracting energy related data from the dtb or
      acpi and processing it in the topology shim layer, the content of the
      sched_group_energy structures as well as the idle_state and capacity_state
      arrays is hard-coded here.
      
      This patch defines the sched_group_energy structure as well as the
      idle_state and capacity_state array for the cluster (relates to sched
      groups (sgs) in DIE sched domain level) and for the core (relates to sgs
      in MC sd level) for a Cortex A7 as well as for a Cortex A15.
      It further provides related implementations of the sched_domain_energy_f
      functions (cpu_cluster_energy() and cpu_core_energy()).
      
      To be able to propagate this information from the topology shim layer to
      the scheduler, the elements of the arm_topology[] table have been
      provisioned with the appropriate sched_domain_energy_f functions.
      
      cc: Russell King <linux@arm.linux.org.uk>
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • sched: Introduce SD_SHARE_CAP_STATES sched_domain flag · a4f258d4
      Morten Rasmussen authored
      
      
      cpufreq currently keeps it a secret which cpus are sharing a clock
      source. The scheduler needs to know about clock domains as well
      to become more energy aware. The SD_SHARE_CAP_STATES domain flag
      indicates whether cpus belonging to the sched_domain share capacity
      states (P-states).
      
      There is no connection with cpufreq (yet). The flag must be set by
      the arch specific topology code.
      
      cc: Russell King <linux@arm.linux.org.uk>
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Allocate and initialize energy data structures · d66b7a0e
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      
      The per sched group sched_group_energy structure plus the related
      idle_state and capacity_state arrays are allocated like the other sched
      domain (sd) hierarchy data structures. This includes the freeing of
      sched_group_energy structures which are not used.
      
      One problem is that the number of elements of the idle_state and the
      capacity_state arrays is not fixed and has to be retrieved in
      __sdt_alloc() to allocate memory for the sched_group_energy structure and
      the two arrays in one chunk. The array pointers (idle_states and
      cap_states) are initialized here to point to the correct place inside the
      memory chunk.
      
      The new function init_sched_energy() initializes the sched_group_energy
      structure and the two arrays in case the sd topology level contains energy
      information.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • sched: Introduce energy data structures · e018419a
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      
      The struct sched_group_energy represents the per sched_group related
      data which is needed for energy aware scheduling. It contains:
      
        (1) atomic reference counter for scheduler internal bookkeeping of
            data allocation and freeing
        (2) number of elements of the idle state array
        (3) pointer to the idle state array which comprises 'power consumption'
            for each idle state
        (4) number of elements of the capacity state array
        (5) pointer to the capacity state array which comprises 'compute
            capacity and power consumption' tuples for each capacity state
      
      Allocation and freeing of struct sched_group_energy utilizes the existing
      infrastructure of the scheduler which is currently used for the other sd
      hierarchy data structures (e.g. struct sched_domain) as well. That's why
      struct sd_data is provisioned with a per cpu struct sched_group_energy
      double pointer.
      
      The struct sched_group obtains a pointer to a struct sched_group_energy.
      
      The function pointer sched_domain_energy_f is introduced into struct
      sched_domain_topology_level which will allow the arch to pass a particular
      struct sched_group_energy from the topology shim layer into the scheduler
      core.
      
      The function pointer sched_domain_energy_f has an 'int cpu' parameter
      since the folding of two adjacent sd levels via sd degenerate doesn't work
      for all sd levels. I.e. it is not possible for example to use this feature
      to provide per-cpu energy in sd level DIE on ARM's TC2 platform.
      
      It was discussed that the folding of sd levels approach is preferable
      over the cpu parameter approach, simply because the user (the arch
      specifying the sd topology table) can introduce less errors. But since
      it is not working, the 'int cpu' parameter is the only way out. It's
      possible to use the folding of sd levels approach for
      sched_domain_flags_f and the cpu parameter approach for the
      sched_domain_energy_f at the same time though. With the use of the
      'int cpu' parameter, an extra check function has to be provided to make
      sure that all cpus spanned by a sched group are provisioned with the same
      energy data.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • sched: Make energy awareness a sched feature · 0fbd4fdf
      Morten Rasmussen authored
      
      
      This patch introduces the ENERGY_AWARE sched feature, which is
      implemented using jump labels when SCHED_DEBUG is defined. It is
      statically set false when SCHED_DEBUG is not defined. Hence this doesn't
      allow energy awareness to be enabled without SCHED_DEBUG. This
      sched_feature knob will be replaced later with a more appropriate
      control knob when things have matured a bit.
      
      ENERGY_AWARE is based on per-entity load-tracking, hence FAIR_GROUP_SCHED
      must be enabled. This dependency isn't checked at compile time yet.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Documentation for scheduler energy cost model · db770f39
      Morten Rasmussen authored
      
      
      This documentation patch provides an overview of the experimental
      scheduler energy costing model, associated data structures, and a
      reference recipe on how platforms can be characterized to derive energy
      models.
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Include blocked utilization in usage tracking · 6814d05e
      Morten Rasmussen authored
      
      
      Add the blocked utilization contribution to group sched_entity
      utilization (se->avg.utilization_avg_contrib) and to get_cpu_usage().
      With this change cpu usage now includes recent usage by currently
      non-runnable tasks, hence it provides a more stable view of the cpu
      usage. It does, however, also mean that the meaning of usage is changed:
      A cpu may be momentarily idle while usage >0. It can no longer be
      assumed that cpu usage >0 implies runnable tasks on the rq.
      cfs_rq->utilization_load_avg or nr_running should be used instead to get
      the current rq status.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Track blocked utilization contributions · f49ff016
      Morten Rasmussen authored
      
      
      Introduces the blocked utilization, the utilization counter-part to
      cfs_rq->utilization_load_avg. It is the sum of sched_entity utilization
      contributions of entities that were recently on the cfs_rq that are
      currently blocked. Combined with sum of contributions of entities
      currently on the cfs_rq or currently running
      (cfs_rq->utilization_load_avg) this can provide a more stable average
      view of the cpu usage.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Get rid of scaling usage by cpu_capacity_orig · e10cda6c
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      
      Now that cfs_rq::utilization_load_avg is not only frequency invariant
      but also cpu (uarch plus max system frequency) invariant, both the
      frequency and the cpu scaling happen as part of the load tracking.
      So cfs_rq::utilization_load_avg no longer has to be scaled by the
      original capacity of the cpu.
      
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • arm: Cpu invariant scheduler load-tracking support · f272b2c8
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      
      Reuses the existing infrastructure for cpu_scale to provide the scheduler
      with a cpu scaling correction factor for more accurate load-tracking.
      This factor comprises a micro-architectural part, which is based on the
      cpu efficiency value of a cpu as well as a platform-wide max frequency
      part, which relates to the dtb property clock-frequency of a cpu node.
      
      The calculation of cpu_scale, return value of arch_scale_cpu_capacity,
      changes from:
      
          capacity / middle_capacity
      
          with capacity = (clock_frequency >> 20) * cpu_efficiency
      
      to:
      
          SCHED_CAPACITY_SCALE * cpu_perf / max_cpu_perf
      
      The range of the cpu_scale value changes from
      [0..3*SCHED_CAPACITY_SCALE/2] to [0..SCHED_CAPACITY_SCALE].
      
      The functionality to calculate the middle_capacity which corresponds to an
      'average' cpu has been taken out since the scaling is now done
      differently.
      
      In the case that either the cpu efficiency or the clock-frequency value
      for a cpu is missing, no cpu scaling is done for any cpu.
      
      The platform-wide max frequency part of the factor should not be confused
      with the frequency invariant scheduler load-tracking support which deals
      with frequency related scaling due to DFVS functionality on a cpu.
      
      Cc: Russell King <linux@arm.linux.org.uk>
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • arm: vexpress: Add CPU clock-frequencies to TC2 device-tree · bb158217
      Dietmar Eggemann authored and Morten Rasmussen committed
      To enable the parsing of clock frequency and cpu efficiency values
      inside parse_dt_topology() [arch/arm/kernel/topology.c] to scale the
      relative capacity of the cpus, the clock-frequency property has to be
      provided within the cpu nodes of the dts file.
      
      The patch is a copy of commit 8f15973e ("ARM: vexpress: Add CPU
      clock-frequencies to TC2 device-tree") taken from the Linaro Stable
      Kernel (LSK), massaged into mainline.
      
      Cc: Jon Medhurst <tixy@linaro.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
  2. 02 Feb, 2015 6 commits
    • arm: Frequency invariant scheduler load-tracking support · bf311f1e
      Morten Rasmussen authored and committed
      
      
      Implements arch-specific function to provide the scheduler with a
      frequency scaling correction factor for more accurate load-tracking. The
      factor is:
      
      	current_freq(cpu) * SCHED_CAPACITY_SCALE / max_freq(cpu)
      
      This implementation only provides frequency invariance. No
      micro-architecture invariance yet.
      
      Cc: Russell King <linux@arm.linux.org.uk>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • cpufreq: Architecture specific callback for frequency changes · 9bd38f13
      Morten Rasmussen authored and committed
      
      
      Architectures that don't have any other means for tracking cpu frequency
      changes need a callback from cpufreq to implement a scaling factor to
      enable scale-invariant per-entity load-tracking in the scheduler.
      
      To compute the scale invariance correction factor the architecture would
      need to know both the max frequency and the current frequency. This
      patch defines weak functions for setting both from cpufreq.
      
      Related architecture specific functions use weak function definitions.
      The same approach is followed here.
      
      These callbacks can be used to implement frequency scaling of cpu
      capacity later.
      
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Make usage tracking cpu scale-invariant · 23909cd1
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      
      Besides the existing frequency scale-invariance correction factor, apply
      cpu scale-invariance correction factor to usage tracking.
      
      Cpu scale-invariance takes into consideration cpu performance deviations
      due to micro-architectural differences (i.e. instructions per second)
      between cpus in HMP systems (e.g. big.LITTLE) as well as differences in
      the frequency value of the highest OPP between cpus in SMP systems.
      
      Each segment of the sched_avg::running_avg_sum geometric series is now
      scaled by the cpu performance factor too, so the
      sched_avg::utilization_avg_contrib of each entity will be invariant
      with respect to the particular cpu of the HMP/SMP system on which it
      is gathered.
      
      As a result, the usage level returned by get_cpu_usage() stays
      relative to the max cpu performance of the system.
      
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      23909cd1
    • sched: Make load tracking frequency scale-invariant · 51318de5
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      
      Apply frequency scale-invariance correction factor to load tracking.
      Each segment of the sched_avg::runnable_avg_sum geometric series is now
      scaled by the current frequency so the sched_avg::load_avg_contrib of each
      entity will be invariant with frequency scaling. As a result,
      cfs_rq::runnable_load_avg which is the sum of sched_avg::load_avg_contrib,
      becomes invariant too. So the load level that is returned by
      weighted_cpuload, stays relative to the max frequency of the cpu.
      
      Then, we want to keep the load-tracking values in a 32-bit type, which
      implies that the max value of sched_avg::{runnable|running}_avg_sum
      must be lower than 2^32/88761 = 48388 (88761 is the max weight of a
      task). As LOAD_AVG_MAX = 47742, arch_scale_freq_capacity() must return
      a value less than (48388/47742) << SCHED_CAPACITY_SHIFT = 1037
      (SCHED_CAPACITY_SCALE = 1024). So we define the range as
      [0..SCHED_CAPACITY_SCALE] in order to avoid overflow.
      
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
      51318de5
    • sched: move cfs task on a CPU with higher capacity · 5a08f925
      Vincent Guittot authored and Morten Rasmussen committed
      
      
      When a CPU is used to handle a lot of IRQs or some RT tasks, the
      remaining capacity for CFS tasks can be significantly reduced. Once we
      detect such a situation by comparing cpu_capacity_orig and
      cpu_capacity, we trigger an idle load balance to check whether it's
      worth moving its tasks to an idle CPU.
      
      Once the idle load balance has selected the busiest CPU, it will look
      for an active load balance in only two cases:
      - there is only 1 task on the busiest CPU.
      - we haven't been able to move a task from the busiest rq.
      
      A CPU with reduced capacity is covered by the 1st case, and it's worth
      actively migrating its task if the idle CPU has full capacity. This
      test has been added in need_active_balance().
      
      As a sidenote, this will not generate more spurious idle load balances
      because we already trigger one if there is more than 1 busy cpu. If
      this cpu is the only one that has a task, we will trigger the idle
      load balance once to migrate the task.
      
      The nohz_kick_needed function has been cleaned up a bit while adding
      the new test.
      
      env.src_cpu and env.src_rq must be set unconditionally because they
      are used in need_active_balance, which is called even if
      busiest->nr_running equals 1.
      
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      5a08f925
    • sched: add SD_PREFER_SIBLING for SMT level · 6bf0fb9a
      Vincent Guittot authored and Morten Rasmussen committed
      
      
      Add the SD_PREFER_SIBLING flag for SMT level in order to ensure that
      the scheduler will put at least 1 task per core.
      
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Reviewed-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
      6bf0fb9a