1. 04 Feb, 2015 24 commits
    • sched: Disable energy-unfriendly nohz kicks · f05d25b9
      Morten Rasmussen authored
      
      
      With energy-aware scheduling enabled, nohz_kick_needed() generates many
      nohz idle-balance kicks which lead to nothing when multiple tasks get
      packed on a single cpu to save energy. This causes unnecessary wake-ups
      and hence wastes energy. Make these conditions depend on !energy_aware()
      for now, until the energy-aware nohz story gets sorted out.
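
      A minimal sketch of the shape of the change; the guard placement is
      illustrative, not the exact diff:

          /* in nohz_kick_needed(): only kick for this reason when
           * energy-aware scheduling is not in use */
          if (rq->nr_running >= 2 && !energy_aware())
              goto need_kick;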
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      f05d25b9
    • sched: Enable active migration for cpus of lower capacity · c13fa615
      Morten Rasmussen authored
      
      
      Add an extra criterion to need_active_balance() to kick off active load
      balance if the source cpu is overutilized and has lower capacity than
      the destination cpus.
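
      A sketch of the extra test, assuming a cpu_overutilized() helper and
      the capacity_orig_of() accessor used elsewhere in this series:

          /* in need_active_balance(): the source cpu is overutilized
           * and weaker than the destination, so force an active balance */
          if (energy_aware() && cpu_overutilized(env->src_cpu) &&
              capacity_orig_of(env->src_cpu) < capacity_orig_of(env->dst_cpu))
              return 1;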
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      c13fa615
    • sched: Turn off fast idling of cpus on a partially loaded system · add7d058
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      
      We do not want to miss out on the ability to do energy-aware idle load
      balancing if the system is only partially loaded, since the operational
      range of energy-aware scheduling corresponds to a partially loaded
      system. We might want to pull a single remaining task from a potential
      src cpu towards an idle destination cpu if the energy model tells us
      this is worth doing to save energy.
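
      An illustrative sketch of the idle_balance() bail-out with the new
      energy_aware() condition (exact placement assumed):

          /* don't skip idle balancing on a partially loaded system
           * when energy-aware scheduling is enabled */
          if (this_rq->avg_idle < sysctl_sched_migration_cost ||
              (!energy_aware() && !this_rq->rd->overload))
              goto out;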
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      add7d058
    • sched: Skip cpu as lb src which has one task and capacity gte the dst cpu · 7369d2b4
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      
      Skip cpu as a potential src (costliest) in case it has only one task
      running and its original capacity is greater than or equal to the
      original capacity of the dst cpu.
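
      A sketch of the skip condition inside find_busiest_queue()'s per-cpu
      loop (loop context assumed):

          /* one task and at least as much original capacity as the
           * destination: not a useful energy-aware src candidate */
          if (energy_aware() && rq->nr_running == 1 &&
              capacity_orig_of(cpu) >= capacity_orig_of(env->dst_cpu))
              continue;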
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      7369d2b4
    • sched: Tipping point from energy-aware to conventional load balancing · bdd70613
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      
      Energy-aware load balancing is based on cpu usage, so the upper bound of
      its operational range is a fully utilized cpu. Above this tipping point
      it makes more sense to use weighted_cpuload to preserve smp_nice.
      This patch implements the tipping point detection in update_sg_lb_stats:
      if one cpu is over-utilized, the current energy-aware load balance
      operation falls back into the conventional weighted load based one.
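
      The detection can be sketched as below, inside update_sg_lb_stats()'s
      per-cpu loop; the use_ea flag stands in for however the fallback is
      actually recorded:

          if (cpu_overutilized(i))
              env->use_ea = false;    /* hypothetical flag: fall back */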
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      bdd70613
    • sched: Introduce energy awareness into detach_tasks · 3aafbd09
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      
      Energy-aware load balancing does not rely on env->imbalance; instead it
      evaluates the system-wide energy difference for each task on the src rq
      that would result from moving it to the dst rq. If this energy
      difference is less than zero, the task is actually moved from the src
      to the dst rq.
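
      A sketch of the energy-aware branch in detach_tasks(), where
      energy_diff_task() is an assumed helper built on energy_diff():

          /* move the task only if the system-wide energy difference
           * of migrating it from src to dst is negative */
          if (energy_diff_task(env->dst_cpu, p) >= 0)
              goto next;    /* moving p would not save energy */
          detach_task(p, env);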
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      3aafbd09
    • sched: Introduce energy awareness into find_busiest_queue · 42a039b5
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      
      If, after the gathering of sched domain statistics, the current load
      balancing operation is still in energy-aware mode and a least efficient
      sched group has been found, detect the least efficient cpu by comparing
      the cpu efficiency (the ratio between cpu usage and cpu energy
      consumption) among all cpus of the least efficient sched group.
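
      Sketched per-cpu selection, with cpu_energy() as a hypothetical
      per-cpu energy estimate:

          /* lowest usage-per-energy ratio == least efficient cpu */
          eff = (get_cpu_usage(i) << SCHED_CAPACITY_SHIFT) / cpu_energy(i);
          if (eff < min_eff) {
              min_eff = eff;
              busiest = cpu_rq(i);
          }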
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      42a039b5
    • sched: Introduce energy awareness into find_busiest_group · cbd6cef7
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      
      If, after the gathering of sched domain statistics, the current load
      balancing operation is still in energy-aware mode, just return the
      least efficient (costliest) reference. That implies the system is
      considered to be balanced in case no least efficient sched group was
      found.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      cbd6cef7
    • sched: Introduce energy awareness into update_sd_lb_stats · b6ab76d3
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      
      Energy-aware load balancing has to work alongside the conventional load
      based functionality. This includes the tipping point feature, i.e. being
      able to fall back from energy-aware to the conventional load based
      functionality during an ongoing load balancing action.
      That is why this patch introduces an additional reference to hold the
      least efficient sched group (costliest), as well as its statistics in
      the form of an extra sg_lb_stats structure (costliest_stat).
      The function update_sd_pick_costliest is used to assign the least
      efficient sched group, parallel to the existing update_sd_pick_busiest.
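
      A sketch of the new pick helper mirroring update_sd_pick_busiest()
      (field names follow the commit text; details assumed):

          static bool update_sd_pick_costliest(struct sd_lb_stats *sds,
                                               struct sg_lb_stats *sgs)
          {
              /* lower efficiency == more energy per unit of work */
              return sgs->group_eff < sds->costliest_stat.group_eff;
          }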
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      b6ab76d3
    • sched: Introduce energy awareness into update_sg_lb_stats · c758855f
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      
      To be able to identify the least efficient (costliest) sched group,
      introduce group_eff, the efficiency of the sched group, into
      sg_lb_stats. The group efficiency is defined as the ratio between the
      group usage and the group energy consumption.
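
      The efficiency computation can be sketched in fixed-point C, with
      group_energy assumed to come from sched_group_energy():

          /* scale before dividing to keep fixed-point precision */
          sgs->group_eff = (sgs->group_usage << SCHED_CAPACITY_SHIFT) /
                           group_energy;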
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      c758855f
    • sched: Infrastructure to query if load balancing is energy-aware · 90f23141
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      
      Energy-aware load balancing should only happen if the ENERGY_AWARE
      feature is turned on and the sched domain on which the load balancing
      is performed contains energy data.
      There is also a need, during a load balance action, to be able to query
      whether we should continue to load balance energy-aware or whether we
      have reached the tipping point which forces us to fall back to the
      conventional load balancing functionality.
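
      A sketch of the two queries; sge stands for an assumed per-group
      energy data pointer and use_ea for an assumed per-action flag:

          static inline bool sd_energy_aware(struct sched_domain *sd)
          {
              return sched_feat(ENERGY_AWARE) && sd->groups->sge;
          }

          /* valid while a load-balance action is in flight */
          static inline bool lb_energy_aware(struct lb_env *env)
          {
              return env->use_ea;    /* cleared at the tipping point */
          }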
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      90f23141
    • sched: Determine the current sched_group idle-state · 243cf415
      Morten Rasmussen authored
      
      
      To estimate the energy consumption of a sched_group in
      sched_group_energy() it is necessary to know which idle-state the group
      is in when it is idle. For now, it is assumed that this is the current
      idle-state (though it might be wrong). Based on the individual cpu
      idle-states, group_idle_state() finds the group idle-state.
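
      A sketch, assuming a helper that reports a cpu's current idle-state
      index (0 meaning the shallowest state):

          static int group_idle_state(struct sched_group *sg)
          {
              int i, state = INT_MAX;

              /* the group can be no deeper asleep than its most
               * awake member */
              for_each_cpu(i, sched_group_cpus(sg))
                  state = min(state, idle_get_state_idx(cpu_rq(i)));

              return state;
          }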
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      243cf415
    • sched: Bias new task wakeups towards higher capacity cpus · 5ea824fc
      Morten Rasmussen authored
      
      
      Make wake-ups of new tasks (find_idlest_group) aware of any differences
      in cpu compute capacity so new tasks don't get handed off to cpus with
      lower capacity.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      5ea824fc
    • sched: Energy-aware wake-up task placement · cc8d42bb
      Morten Rasmussen authored
      
      
      Let available compute capacity and estimated energy impact select the
      wake-up target cpu when energy-aware scheduling is enabled.
      energy_aware_wake_cpu() attempts to find a group of cpus with sufficient
      compute capacity to accommodate the task, and then a cpu within that
      group with enough spare capacity to handle the task. Preference is given
      to cpus with enough spare capacity at the current OPP. Finally, the
      energy impact of the new target and of the previous task cpu is compared
      to select the wake-up target cpu.
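
      The selection logic in outline; helper and variable names are
      assumptions, not the exact implementation:

          static int energy_aware_wake_cpu(struct task_struct *p)
          {
              /* 1) smallest group with enough capacity for p;
               * 2) within it, prefer a cpu with spare capacity at the
               *    current OPP, else enough spare capacity overall;
               * 3) keep the candidate only if it beats prev_cpu. */
              if (energy_diff_task(target_cpu, p) <
                  energy_diff_task(prev_cpu, p))
                  return target_cpu;
              return prev_cpu;
          }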
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      cc8d42bb
    • sched: Estimate energy impact of scheduling decisions · 0a6e3510
      Morten Rasmussen authored
      
      
      Adds a generic energy-aware helper function, energy_diff(), that
      calculates the energy impact of adding, removing, or migrating
      utilization in the system.
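
      A sketch of the helper's contract; the struct fields are assumptions:

          struct energy_env {
              struct sched_group *sg_top;  /* group spanning the cpus */
              int src_cpu, dst_cpu;        /* -1: pure add or remove  */
              int usage_delta;             /* utilization to move     */
          };

          /* predicted energy after the change minus energy before;
           * a negative return value means the change saves energy */
          static int energy_diff(struct energy_env *eenv);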
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      0a6e3510
    • sched: Extend sched_group_energy to test load-balancing decisions · 24736b79
      Morten Rasmussen authored
      
      
      Extend sched_group_energy() to support energy prediction with usage
      (tasks) added to/removed from a specific cpu or migrated between a pair
      of cpus. This is useful for load-balancing decision making.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      24736b79
    • sched: Calculate energy consumption of sched_group · 716bd56f
      Morten Rasmussen authored
      
      
      For energy-aware load-balancing decisions it is necessary to know the
      energy consumption estimates of groups of cpus. This patch introduces a
      basic function, sched_group_energy(), which estimates the energy
      consumption of the cpus in the group and any resources shared by the
      members of the group.
      
      NOTE: The function has five levels of indentation and breaks the 80
      character limit. Refactoring is necessary.
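
      The estimate itself can be sketched as a busy/idle power split per
      group; names and fixed-point details are illustrative:

          /* group_util in [0..SCHED_CAPACITY_SCALE] */
          busy_energy  = (group_util * cap_state->power)
                             >> SCHED_CAPACITY_SHIFT;
          idle_energy  = ((SCHED_CAPACITY_SCALE - group_util)
                             * idle_state->power) >> SCHED_CAPACITY_SHIFT;
          total_energy += busy_energy + idle_energy;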
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      716bd56f
    • sched: Use capacity_curr to cap utilization in get_cpu_usage() · c6a4f6ed
      Morten Rasmussen authored
      
      
      With scale-invariant usage tracking get_cpu_usage() should never return
      a usage above the current compute capacity of the cpu (capacity_curr).
      The scaling of the utilization tracking contributions should generally
      cause the cpu utilization to saturate at capacity_curr, but it may
      temporarily exceed this value in certain situations. This patch changes
      the cap from capacity_orig to capacity_curr.
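
      In effect (sketch):

          /* was: min(usage, capacity_orig_of(cpu)) */
          return min(usage, capacity_curr_of(cpu));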
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      c6a4f6ed
    • sched: Relocated get_cpu_usage() · 857c1377
      Morten Rasmussen authored
      
      
      Move get_cpu_usage() to an earlier position in fair.c.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      857c1377
    • sched: Compute cpu capacity available at current frequency · cefa515b
      Morten Rasmussen authored
      
      
      capacity_orig_of() returns the max available compute capacity of a cpu.
      For scale-invariant utilization tracking and energy-aware scheduling
      decisions it is useful to know the compute capacity available at the
      current OPP of a cpu.
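
      A sketch of the helper, combining the max capacity with the frequency
      scale factor:

          static unsigned long capacity_curr_of(int cpu)
          {
              return cpu_rq(cpu)->cpu_capacity_orig *
                     arch_scale_freq_capacity(NULL, cpu)
                     >> SCHED_CAPACITY_SHIFT;
          }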
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      cefa515b
    • sched: Make energy awareness a sched feature · 0fbd4fdf
      Morten Rasmussen authored
      
      
      This patch introduces the ENERGY_AWARE sched feature, which is
      implemented using jump labels when SCHED_DEBUG is defined. It is
      statically set to false when SCHED_DEBUG is not defined, so energy
      awareness cannot be enabled without SCHED_DEBUG. This sched_feature
      knob will be replaced later with a more appropriate control knob when
      things have matured a bit.

      ENERGY_AWARE is based on per-entity load-tracking, hence
      FAIR_GROUP_SCHED must be enabled. This dependency isn't checked at
      compile time yet.
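
      The knob and its query helper, sketched:

          /* kernel/sched/features.h */
          SCHED_FEAT(ENERGY_AWARE, false)

          /* kernel/sched/fair.c */
          static inline bool energy_aware(void)
          {
              return sched_feat(ENERGY_AWARE);
          }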
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      0fbd4fdf
    • sched: Include blocked utilization in usage tracking · 6814d05e
      Morten Rasmussen authored
      
      
      Add the blocked utilization contribution to group sched_entity
      utilization (se->avg.utilization_avg_contrib) and to get_cpu_usage().
      With this change cpu usage now includes recent usage by currently
      non-runnable tasks, hence it provides a more stable view of the cpu
      usage. It does, however, also mean that the meaning of usage is changed:
      a cpu may be momentarily idle while usage > 0. It can no longer be
      assumed that cpu usage > 0 implies runnable tasks on the rq.
      cfs_rq->utilization_load_avg or nr_running should be used instead to get
      the current rq status.
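
      The change to get_cpu_usage() in sketch form:

          unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg +
                                cpu_rq(cpu)->cfs.utilization_blocked_avg;

          /* blocked contributions can push the sum past the cap */
          return min(usage, (unsigned long)capacity_orig_of(cpu));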
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      6814d05e
    • sched: Track blocked utilization contributions · f49ff016
      Morten Rasmussen authored
      
      
      Introduces the blocked utilization, the utilization counterpart to
      cfs_rq->utilization_load_avg. It is the sum of the sched_entity
      utilization contributions of entities that were recently on the cfs_rq
      and are currently blocked. Combined with the sum of the contributions
      of entities currently on the cfs_rq or currently running
      (cfs_rq->utilization_load_avg), this can provide a more stable average
      view of the cpu usage.
      
      cc: Ingo Molnar <mingo@redhat.com>
      cc: Peter Zijlstra <peterz@infradead.org>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      f49ff016
    • sched: Get rid of scaling usage by cpu_capacity_orig · e10cda6c
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      
      Now that cfs_rq::utilization_load_avg is not only frequency invariant
      but also cpu (uarch plus max system frequency) invariant, both
      frequency and cpu scaling happen as part of the load tracking.
      So cfs_rq::utilization_load_avg does not have to be scaled by the
      original capacity of the cpu again.
      
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      e10cda6c
  2. 02 Feb, 2015 11 commits
    • sched: Make usage tracking cpu scale-invariant · 23909cd1
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      
      Besides the existing frequency scale-invariance correction factor, apply
      a cpu scale-invariance correction factor to usage tracking.

      Cpu scale-invariance takes into consideration cpu performance deviations
      due to micro-architectural differences (i.e. instructions per second)
      between cpus in HMP systems (e.g. big.LITTLE) and differences in the
      frequency value of the highest OPP between cpus in SMP systems.
      
      Each segment of the sched_avg::running_avg_sum geometric series is now
      scaled by the cpu performance factor too, so the
      sched_avg::utilization_avg_contrib of each entity will be invariant
      across the particular cpus of the HMP/SMP system it is gathered on.

      So the usage level that is returned by get_cpu_usage stays relative to
      the max cpu performance of the system.
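
      Per-segment scaling in sketch form, with both factors applied:

          unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
          unsigned long scale_cpu  = arch_scale_cpu_capacity(NULL, cpu);

          /* each accrued segment shrinks on slower/smaller cpus */
          contrib = (contrib * scale_freq) >> SCHED_CAPACITY_SHIFT;
          contrib = (contrib * scale_cpu)  >> SCHED_CAPACITY_SHIFT;
          sa->running_avg_sum += contrib;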
      
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      23909cd1
    • sched: Make load tracking frequency scale-invariant · 51318de5
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      
      Apply the frequency scale-invariance correction factor to load tracking.
      Each segment of the sched_avg::runnable_avg_sum geometric series is now
      scaled by the current frequency, so the sched_avg::load_avg_contrib of
      each entity will be invariant with frequency scaling. As a result,
      cfs_rq::runnable_load_avg, which is the sum of
      sched_avg::load_avg_contrib, becomes invariant too. So the load level
      that is returned by weighted_cpuload stays relative to the max frequency
      of the cpu.

      Then, we want to keep the load tracking values in a 32-bit type, which
      implies that the max value of sched_avg::{runnable|running}_avg_sum must
      be lower than 2^32/88761 = 48388 (88761 is the max weight of a task). As
      LOAD_AVG_MAX = 47742, arch_scale_freq_capacity must return a value less
      than (48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_SCALE_CAPACITY =
      1024). So we define the range as [0..SCHED_SCALE_CAPACITY] in order to
      avoid overflow.
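
      In sketch form, including the bound from the paragraph above:

          unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);

          /* scale_freq <= SCHED_CAPACITY_SCALE (1024) < 1037, so the
           * scaled sum stays below 2^32 / 88761 and fits in 32 bits */
          sa->runnable_avg_sum += (delta_w * scale_freq)
                                      >> SCHED_CAPACITY_SHIFT;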
      
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
      51318de5
    • sched: move cfs task on a CPU with higher capacity · 5a08f925
      Vincent Guittot authored and Morten Rasmussen committed
      
      
      When a CPU is used to handle a lot of IRQs or some RT tasks, the
      remaining capacity for CFS tasks can be significantly reduced. Once we
      detect such a situation by comparing cpu_capacity_orig and cpu_capacity,
      we trigger an idle load balance to check whether it's worth moving its
      tasks on an idle CPU.

      Once the idle load balance has selected the busiest CPU, it will look
      for an active load balance in only two cases:
      - there is only 1 task on the busiest CPU.
      - we haven't been able to move a task off the busiest rq.

      A CPU with reduced capacity is included in the 1st case, and it's worth
      actively migrating its task if the idle CPU has got full capacity. This
      test has been added in need_active_balance.

      As a sidenote, this will not generate more spurious ilbs because we
      already trigger an ilb if there is more than 1 busy cpu. If this cpu is
      the only one that has a task, we will trigger the ilb once for migrating
      the task.

      The nohz_kick_needed function has been cleaned up a bit while adding the
      new test.

      env.src_cpu and env.src_rq must be set unconditionally because they are
      used in need_active_balance, which is called even if busiest->nr_running
      equals 1.
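
      The need_active_balance() test takes roughly this shape (details
      assumed):

          /* busiest cpu runs a single task but has noticeably less
           * capacity left for CFS than the idle destination */
          if ((env->idle != CPU_NOT_IDLE) &&
              (env->src_rq->cfs.h_nr_running == 1) &&
              (capacity_of(env->src_cpu) * sd->imbalance_pct <
               capacity_of(env->dst_cpu) * 100))
              return 1;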
      
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      5a08f925
    • sched: replace capacity_factor by usage · e5aa6306
      Vincent Guittot authored and Morten Rasmussen committed
      The scheduler tries to compute how many tasks a group of CPUs can handle
      by assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity
      is SCHED_CAPACITY_SCALE. group_capacity_factor divides the capacity of
      the group by SCHED_LOAD_SCALE to estimate how many tasks can run in the
      group. Then, it compares this value with the sum of nr_running to decide
      if the group is overloaded or not. But group_capacity_factor hardly
      works for SMT systems: it sometimes works for big cores but fails to do
      the right thing for little cores.
      
      Below are two examples to illustrate the problem that this patch solves:
      
      1- If the original capacity of a CPU is less than SCHED_CAPACITY_SCALE
      (640 as an example), a group of 3 CPUs will have a max capacity_factor
      of 2 (div_round_closest(3*640/1024) = 2), which means that it will be
      seen as overloaded even if we have only one task per CPU.
      
      2- If the original capacity of a CPU is greater than
      SCHED_CAPACITY_SCALE (1512 as an example), a group of 4 CPUs will have a
      capacity_factor of 4 (at max, and thanks to the fix [0] for SMT systems
      that prevents the appearance of ghost CPUs), but if one CPU is fully
      used by rt tasks (and its capacity is reduced to nearly nothing), the
      capacity factor of the group will still be 4
      (div_round_closest(3*1512/1024) = 5, which is capped to 4 by [0]).
      
      So, this patch tries to solve this issue by removing capacity_factor and
      replacing it with the following 2 metrics:
      - The available CPU capacity for CFS tasks, which is already used by
        load_balance.
      - The usage of the CPU by the CFS tasks. For the latter,
        utilization_avg_contrib has been re-introduced to compute the usage of
        a CPU by CFS tasks.
      
      group_capacity_factor and group_has_free_capacity have been removed and
      replaced by group_no_capacity. We compare the number of tasks with the
      number of CPUs and we evaluate the level of utilization of the CPUs to
      decide if a group is overloaded or if a group has capacity to handle
      more tasks.
      
      For SD_PREFER_SIBLING, a group is tagged overloaded if it has more than
      1 task so that it will be selected in priority (among the overloaded
      groups). Since [1], SD_PREFER_SIBLING is no longer involved in the
      computation of load_above_capacity because local is not overloaded.
      
      Finally, the sched_group->sched_group_capacity->capacity_orig field has
      been removed because it is no longer used during load balance.
      
      [1] https://lkml.org/lkml/2014/8/12/295
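
      The replacement check can be sketched as below; names follow the
      commit text, details are assumed:

          static bool group_is_overloaded(struct lb_env *env,
                                          struct sg_lb_stats *sgs)
          {
              /* fewer tasks than cpus can never overload a group */
              if (sgs->sum_nr_running <= sgs->group_weight)
                  return false;

              /* usage (with margin) exceeding capacity does */
              return sgs->group_usage * env->sd->imbalance_pct >
                     sgs->group_capacity * 100;
          }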
      
      
      
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      [Fixed merge conflict on v3.19-rc6: Morten Rasmussen
      <morten.rasmussen@arm.com>]
      e5aa6306
    • sched: get CPU's usage statistic · 9adcac16
      Vincent Guittot authored and Morten Rasmussen committed
      
      
      Monitor the usage level of each group at each sched_domain level. The
      usage is the portion of cpu_capacity_orig that is currently used on a
      CPU or group of CPUs. We use utilization_load_avg to evaluate the usage
      level of each group.

      utilization_load_avg only takes into account the running time of the CFS
      tasks on a CPU, with a maximum value of SCHED_LOAD_SCALE when the CPU is
      fully utilized. Nevertheless, we must cap utilization_load_avg, which
      can be temporarily greater than SCHED_LOAD_SCALE after the migration of
      a task onto this CPU and until the metrics stabilize.
      
      utilization_load_avg is in the range [0..SCHED_LOAD_SCALE] to reflect
      the running load on the CPU, whereas the available capacity for CFS
      tasks is in the range [0..cpu_capacity_orig]. In order to test whether a
      CPU is fully utilized by CFS tasks, we have to scale the utilization
      into the cpu_capacity_orig range of the CPU to get the usage of the
      latter. The usage can then be compared with the available capacity
      (i.e. cpu_capacity) to deduce the usage level of a CPU.
      
      The frequency scaling invariance of the usage is not taken into account
      in this patch; it will be solved in another patch which will deal with
      frequency scaling invariance of the running_load_avg.
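
      The scaling described above, in sketch form:

          static int get_cpu_usage(int cpu)
          {
              unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
              unsigned long capacity = capacity_orig_of(cpu);

              /* cap the transient post-migration overshoot */
              if (usage >= SCHED_LOAD_SCALE)
                  return capacity;

              /* map [0..SCHED_LOAD_SCALE] onto [0..capacity_orig] */
              return (usage * capacity) >> SCHED_LOAD_SHIFT;
          }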
      
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
      9adcac16
    • sched: add per rq cpu_capacity_orig · eb203dda
      Vincent Guittot authored and Morten Rasmussen committed
      
      
      This new field, cpu_capacity_orig, reflects the original capacity of a
      CPU before being altered by rt tasks and/or IRQs.

      cpu_capacity_orig will be used:
      - to detect when the capacity of a CPU has been noticeably reduced so we
        can trigger a load balance to look for a CPU with better capacity. As
        an example, we can detect when a CPU handles a significant amount of
        irq (with CONFIG_IRQ_TIME_ACCOUNTING) but is seen as an idle CPU by
        the scheduler whereas CPUs which are really idle are available;
      - to evaluate the available capacity for CFS tasks.
      
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
      Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
      eb203dda
    • sched: make scale_rt invariant with frequency · e25ec760
      Vincent Guittot authored and Morten Rasmussen committed
      
      
      The average running time of RT tasks is used to estimate the remaining compute
      capacity for CFS tasks. This remaining capacity is the original capacity scaled
      down by a factor (aka scale_rt_capacity). This estimation of available capacity
      must also be invariant with frequency scaling.
      
      A frequency scaling factor is applied on the running time of the RT tasks for
      computing scale_rt_capacity.
      
      In sched_rt_avg_update, we scale the RT execution time like below:
      rq->rt_avg += rt_delta * arch_scale_freq_capacity() >> SCHED_CAPACITY_SHIFT
      
      Then, scale_rt_capacity can be summarized by:
      scale_rt_capacity = SCHED_CAPACITY_SCALE -
      		((rq->rt_avg << SCHED_CAPACITY_SHIFT) / period)
      
      We can optimize by removing the right and left shifts in the computation
      of rq->rt_avg and scale_rt_capacity.

      The call to arch_scale_freq_capacity in the rt scheduling path might be
      a concern for RT folks because I'm not sure whether we can rely on
      arch_scale_freq_capacity to be short and efficient?
      
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
      e25ec760
    • sched: Make sched entity usage tracking frequency-invariant · 9c209ef7
      Morten Rasmussen authored
      
      
      Apply the frequency scale-invariance correction factor to usage
      tracking. Each segment of the running_load_avg geometric series is now
      scaled by the current frequency, so the utilization_avg_contrib of each
      entity will be invariant with frequency scaling. As a result,
      utilization_load_avg, which is the sum of utilization_avg_contrib,
      becomes invariant too. So the usage level that is returned by
      get_cpu_usage stays relative to the max frequency, as is the
      cpu_capacity it is compared against.

      Then, we want to keep the load tracking values in a 32-bit type, which
      implies that the max value of {runnable|running}_avg_sum must be lower
      than 2^32/88761 = 48388 (88761 is the max weight of a task). As
      LOAD_AVG_MAX = 47742, arch_scale_freq_capacity must return a value less
      than (48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_SCALE_CAPACITY
      = 1024). So we define the range as [0..SCHED_SCALE_CAPACITY] in order to
      avoid overflow.
      
      cc: Paul Turner <pjt@google.com>
      cc: Ben Segall <bsegall@google.com>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      9c209ef7
    • sched: remove frequency scaling from cpu_capacity · 076171b3
      Vincent Guittot authored and Morten Rasmussen committed
      
      
      Now that arch_scale_cpu_capacity has been introduced to scale the
      original capacity, arch_scale_freq_capacity is no longer used (it was
      previously used by the ARM arch). Remove arch_scale_freq_capacity from
      the computation of cpu_capacity. Frequency invariance will be handled in
      the load tracking and not in the CPU capacity;
      arch_scale_freq_capacity will be revisited for scaling load with the
      current frequency of the CPUs in a later patch.
      
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
      076171b3
    • sched: Track group sched_entity usage contributions · 62544681
      Morten Rasmussen authored
      
      
      Adds usage contribution tracking for group entities. Unlike
      se->avg.load_avg_contrib, se->avg.utilization_avg_contrib for group
      entities is the sum of se->avg.utilization_avg_contrib for all entities
      on the group runqueue. It is _not_ influenced in any way by the task
      group h_load. Hence it represents the actual cpu usage of the group, not
      its intended load contribution, which may differ significantly from the
      utilization on lightly utilized systems.
      
      cc: Paul Turner <pjt@google.com>
      cc: Ben Segall <bsegall@google.com>
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      62544681
    • sched: add utilization_avg_contrib · 90839680
      Vincent Guittot authored and Morten Rasmussen committed
      
      
      Add new statistics which reflect the average time a task is running on
      the CPU and the sum of these running times for the tasks on a runqueue.
      The latter is named utilization_load_avg.

      This patch is based on the usage metric that was proposed in the first
      versions of the per-entity load tracking patchset by Paul Turner
      <pjt@google.com> but was removed afterwards. This version differs from
      the original one in that it is not linked to task_group.

      The rq's utilization_load_avg will be used to check if a rq is
      overloaded or not, instead of trying to compute how many tasks a group
      of CPUs can handle.

      Rename runnable_avg_period to avg_period as it is now used with both
      runnable_avg_sum and running_avg_sum.

      Add some descriptions of the variables to explain their differences.
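
      The resulting per-entity statistics, sketched (other bookkeeping
      fields of the real struct elided):

          struct sched_avg {
              u32 runnable_avg_sum;   /* time runnable or running */
              u32 running_avg_sum;    /* time actually running    */
              u32 avg_period;         /* was runnable_avg_period  */
              unsigned long load_avg_contrib;
              unsigned long utilization_avg_contrib;
          };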
      
      cc: Paul Turner <pjt@google.com>
      cc: Ben Segall <bsegall@google.com>
      
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
      90839680
  3. 09 Jan, 2015 2 commits
    • sched/fair: Fix RCU stall upon -ENOMEM in sched_create_group() · 7f1a169b
      Tetsuo Handa authored
      
      
      When alloc_fair_sched_group() in sched_create_group() fails,
      free_sched_group() is called, and free_fair_sched_group() is called by
      free_sched_group(). Since destroy_cfs_bandwidth() is then called by
      free_fair_sched_group() without init_cfs_bandwidth() ever having been
      called, an RCU stall occurs at hrtimer_cancel():
      
        INFO: rcu_sched self-detected stall on CPU { 1}  (t=60000 jiffies g=13074 c=13073 q=0)
        Task dump for CPU 1:
        (fprintd)       R  running task        0  6249      1 0x00000088
        ...
        Call Trace:
         <IRQ>  [<ffffffff81094988>] sched_show_task+0xa8/0x110
         [<ffffffff81097acd>] dump_cpu_task+0x3d/0x50
         [<ffffffff810c3a80>] rcu_dump_cpu_stacks+0x90/0xd0
         [<ffffffff810c7751>] rcu_check_callbacks+0x491/0x700
         [<ffffffff810cbf2b>] update_process_times+0x4b/0x80
         [<ffffffff810db046>] tick_sched_handle.isra.20+0x36/0x50
         [<ffffffff810db0a2>] tick_sched_timer+0x42/0x70
         [<ffffffff810ccb19>] __run_hrtimer+0x69/0x1a0
         [<ffffffff810db060>] ? tick_sched_handle.isra.20+0x50/0x50
         [<ffffffff810ccedf>] hrtimer_interrupt+0xef/0x230
         [<ffffffff810452cb>] local_apic_timer_interrupt+0x3b/0x70
         [<ffffffff8164a465>] smp_apic_timer_interrupt+0x45/0x60
         [<ffffffff816485bd>] apic_timer_interrupt+0x6d/0x80
         <EOI>  [<ffffffff810cc588>] ? lock_hrtimer_base.isra.23+0x18/0x50
         [<ffffffff81193cf1>] ? __kmalloc+0x211/0x230
         [<ffffffff810cc9d2>] hrtimer_try_to_cancel+0x22/0xd0
         [<ffffffff81193cf1>] ? __kmalloc+0x211/0x230
         [<ffffffff810ccaa2>] hrtimer_cancel+0x22/0x30
         [<ffffffff810a3cb5>] free_fair_sched_group+0x25/0xd0
         [<ffffffff8108df46>] free_sched_group+0x16/0x40
         [<ffffffff810971bb>] sched_create_group+0x4b/0x80
         [<ffffffff810aa383>] sched_autogroup_create_attach+0x43/0x1c0
         [<ffffffff8107dc9c>] sys_setsid+0x7c/0x110
         [<ffffffff81647729>] system_call_fastpath+0x12/0x17
      
      Check whether init_cfs_bandwidth() was called before calling
      destroy_cfs_bandwidth().
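
      With the check moved as the bracketed note below mentions, the fix
      takes roughly this shape:

          static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
          {
              /* init_cfs_bandwidth() was never called: the list head
               * is still zeroed and the hrtimers were never set up */
              if (!cfs_b->throttled_cfs_rq.next)
                  return;

              hrtimer_cancel(&cfs_b->period_timer);
              hrtimer_cancel(&cfs_b->slack_timer);
          }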
      
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      [ Move the check into destroy_cfs_bandwidth() to aid compilability. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/201412252210.GCC30204.SOMVFFOtQJFLOH@I-love.SAKURA.ne.jp
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7f1a169b
    • sched: Fix odd values in effective_load() calculations · 32a8df4e
      Yuyang Du authored
      
      
      In effective_load() we have (long w * unsigned long tg->shares) / long
      W; when w is negative, it is cast to unsigned long and hence the product
      is insanely large. Fix this by casting tg->shares to long.
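
      The offending expression and the fix, sketched:

          /* w may be negative; without the cast the multiplication
           * is done as unsigned long and wraps to a huge value */
          wl = (w * (long)tg->shares) / W;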
      
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20141219002956.GA25405@intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      32a8df4e
  4. 16 Nov, 2014 3 commits