1. 16 Jun, 2014 16 commits
    • sched: Disable wake_affine to broaden the scope of wakeup target cpus · 387e6491
      Morten Rasmussen authored
      
      SD_WAKE_AFFINE is currently set by default on all levels which means
      that wakeups are always handled inside the lowest level sched_domain.
      That means a tiny periodic task is very likely to stay on the cpu it was
      forked on forever. To save energy we need to revisit the task placement
      decision every now and again to ensure that we don't keep waking the
      same cpu if there are cheaper alternatives.
      
      One way is to simply disable wake_affine and rely on the fork/exec
      balancing mechanism (find_idlest_{group, cpu}). This is what this patch
      does.
      
      An alternative is to let the platform remove the SD_WAKE_AFFINE flag
      from lower levels to increase the search space for
      select_idle_sibling().
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Use energy to guide wakeup task placement · 7961cf80
      Morten Rasmussen authored
      
      Attempt to pick the most energy efficient wakeup cpu in
      find_idlest_{group, cpu}(). Finding the optimum target would require an
      exhaustive search through all cpus in the groups. Instead, the target
      group is determined based on load, and the energy cost is probed on a
      single cpu in each group. The target cpu is the cpu with the lowest
      energy cost.
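      The selection step described above can be sketched as follows. This is a
      toy model, not the kernel implementation: energy_cost() and
      find_lowest_energy_cpu() are hypothetical helpers, and the load-based
      group selection is omitted.

```c
#include <limits.h>

/* Hypothetical per-cpu energy probe; in the patch series this would
 * query the sched_energy data attached to the sched_domain hierarchy. */
static int energy_cost(const int cost[], int cpu)
{
	return cost[cpu];
}

/* Pick the cpu with the lowest probed energy cost among the candidate
 * cpus of the chosen group. */
static int find_lowest_energy_cpu(const int cost[], const int group[], int n)
{
	int i, best = -1, best_cost = INT_MAX;

	for (i = 0; i < n; i++) {
		int c = energy_cost(cost, group[i]);

		if (c < best_cost) {
			best_cost = c;
			best = group[i];
		}
	}
	return best;
}
```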
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Use energy model in select_idle_sibling · 95c58189
      Morten Rasmussen authored
      
      Make select_idle_sibling() consider energy when picking an idle cpu.
      
      Only idle cpus are still considered. A more aggressive energy conserving
      approach could go further and consider partly utilized cpus.
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Take task wakeups into account in energy estimates · 701d613e
      Morten Rasmussen authored
      
      The energy cost of waking a cpu and sending it back to sleep can be
      quite significant for short running frequently waking tasks if placed on
      an idle cpu in a deep sleep state. By factoring task wakeups in such
      tasks can be placed on cpus where the wakeup energy cost is lower. For
      example, partly utilized cpus in a shallower idle state, or cpus in a
      cluster/die that is already awake.
      
      Current utilization of the target cpu is factored in to estimate how
      many task wakeups translate into cpu wakeups (idle exits). It is a very
      naive approach, but it is virtually impossible to get an accurate estimate.
      
      wake_energy(task) = unused_util(cpu) * wakeups(task) * wakeup_energy(cpu)
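      
      As a toy illustration of the estimate above (the 1024 fixed-point scale
      for unused_util is an assumption for this sketch, not the kernel's
      representation):

```c
/* wake_energy(task) = unused_util(cpu) * wakeups(task) * wakeup_energy(cpu)
 * unused_util is a fraction scaled by 1024: the busier the target cpu
 * already is, the fewer task wakeups actually become idle exits. */
static long wake_energy(long unused_util, long wakeups, long wakeup_energy)
{
	return unused_util * wakeups * wakeup_energy / 1024;
}
```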
      
      There is no per cpu wakeup tracking, so we can't estimate the energy
      savings when removing tasks from a cpu. It is also nearly impossible to
      figure out which task is the cause of cpu wakeups if multiple tasks are
      scheduled on the same cpu.
      
      Support for multiple idle-states per sched_group (e.g. WFI and core
      shutdown on ARM) is not implemented yet. wakeup_energy in struct
      sched_energy needs to become a table instead, and cpuidle needs to tell
      us which state is the most likely.
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Task wakeup tracking · 14e7d498
      Morten Rasmussen authored
      
      Track task wakeup rate in wakeup_avg_sum by counting wakeups. Note that
      this is _not_ cpu wakeups (idle exits). Task wakeups only cause cpu
      wakeups if the cpu is idle when the task wakeup occurs.
      
      The wakeup rate decays over time at the same rate as used for the
      existing entity load tracking. Unlike runnable_avg_sum, wakeup_avg_sum
      is counting events, not time, and is therefore theoretically unbounded
      and should be used with care.
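      
      A much simplified integer model of the decaying event count (the real
      code reuses the runnable_avg decay machinery, where the sum halves
      every 32 periods of roughly 1 ms; here one "window" collapses those 32
      periods into a single halving):

```c
/* Decay an unbounded event count: halve once per elapsed window.
 * This is a sketch, not the kernel's per-period geometric decay. */
static unsigned long decay_wakeup_sum(unsigned long sum, int windows)
{
	while (windows-- > 0)
		sum >>= 1;	/* halve per elapsed window */
	return sum;
}
```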
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Energy model functions · 879e0df4
      Morten Rasmussen authored
      
      Introduces energy_diff_util(), which estimates the energy impact of
      adding or removing utilization from a specific cpu. The calculation is based on
      the energy information provided by the platform through sched_energy
      data in the sched_domain hierarchy.
      
      Task and cpu utilization is currently based on load_avg_contrib and
      weighted_cpuload() which are actually load, not utilization.  We don't
      have a solution for utilization yet. There are several other loose ends
      that need to be addressed, such as load/utilization invariance and
      proper representation of compute capacity. However, the energy model is
      there.
      
      The energy cost model only considers utilization (busy time) and idle
      energy (remaining time) for now. The basic idea is to determine the
      energy cost at each level in the sched_domain hierarchy.
      
      	for_each_domain(cpu, sd) {
      		sg = sched_group_of(cpu)
      		energy_before = curr_util(sg) * busy_power(sg)
      				+ (1 - curr_util(sg)) * idle_power(sg)
      		energy_after = new_util(sg) * busy_power(sg)
      				+ (1 - new_util(sg)) * idle_power(sg)
      		energy_diff += energy_before - energy_after
      	}
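      
      The loop body above can be made concrete for a single group as follows
      (a toy model: utilization as a 0..1024 fixed-point fraction is an
      assumed scale, and the helper names are illustrative):

```c
/* Energy of one sched_group: the busy fraction of time costs
 * busy_power, the remaining (idle) fraction costs idle_power.
 * util is scaled by 1024. */
static long sg_energy(long util, long busy_power, long idle_power)
{
	return (util * busy_power + (1024 - util) * idle_power) / 1024;
}

/* energy_before - energy_after, as in the pseudocode: a positive
 * result means the utilization change saves energy. */
static long energy_diff(long curr_util, long new_util,
			long busy_power, long idle_power)
{
	return sg_energy(curr_util, busy_power, idle_power) -
	       sg_energy(new_util, busy_power, idle_power);
}
```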
      
      The idle power estimate currently only supports a single idle state per
      power (sub-)domain. Extending the support to multiple states requires a
      way of predicting which state is going to be the most likely. This
      prediction could be provided by cpuidle. Wake-up energy is added later
      in this series.
      
      Assumptions and the basic algorithm are described in the code comments.
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched, cpufreq: Current compute capacity hack for ARM TC2 · 0e2698e4
      Morten Rasmussen authored
      
      Hack to report different cpu capacities for big and little cpus.
      This is for experimentation on ARM TC2 _only_. A proper solution
      has to address this problem.
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched, cpufreq: Introduce current cpu compute capacity into scheduler · bfb2fc9c
      Morten Rasmussen authored
      
      The scheduler is currently unaware of frequency changes and the current
      compute capacity offered by the cpus. This patch is not the solution.
      It is a hack to give us something to experiment with for now.
      
      A proper solution could be based on the frequency invariant load
      tracking proposed in the past: https://lkml.org/lkml/2013/4/16/289
      
      This patch should _not_ be considered safe.
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Introduce SD_SHARE_CAP_STATES sched_domain flag · 335a231d
      Morten Rasmussen authored
      
      cpufreq is currently keeping it a secret which cpus are sharing
      clock source. The scheduler needs to know about clock domains as well
      to become more energy aware. The SD_SHARE_CAP_STATES domain indicates
      whether cpus belonging to the domain share capacity states (P-states).
      
      There is no connection with cpufreq (yet). The flag must be set by
      the arch specific topology code.
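      
      The arch would set the flag via a sched_domain_flags_f-style callback in
      its topology table, roughly like this sketch (the flag values here are
      illustrative placeholders, not the real kernel values, and combining the
      flag with SD_SHARE_PKG_RESOURCES at the MC level is just an example):

```c
/* Illustrative flag values only; the real definitions live in the
 * scheduler headers. */
#define SD_SHARE_PKG_RESOURCES	0x0200
#define SD_SHARE_CAP_STATES	0x0800

/* Flags callback for a domain level whose cpus share a clock source
 * and therefore share capacity (P-)states. */
static int cpu_corepower_flags(void)
{
	return SD_SHARE_PKG_RESOURCES | SD_SHARE_CAP_STATES;
}
```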
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Introduce system-wide sched_energy · 5dc2739a
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      The Energy-aware algorithm needs system wide sched energy information on
      certain platforms (e.g. a one socket system with multiple cpus).
      
      In such a system, the sched energy data is only attached to the sched
      groups for the individual cpus in the sched domain MC level.
      
      For those systems, this patch adds a _hack_ to provide system-wide sched
      energy data via the sched_domain_topology_level table.
      
      The problem is that the sched_domain_topology_level table is not an
      interface to provide system-wide data but we want to keep the
      configuration of all sched energy related data in one place.
      
      The sched_domain_energy_f of the last entry (the one which is
      initialized with {NULL, }) of the sched_domain_topology_level table is
      set to cpu_sys_energy(). Since the sched_domain_mask_f of this entry
      stays NULL it is still not considered for the existing scheduler set-up
      code (see for_each_sd_topology()).
      
      A second call to init_sched_energy() with a struct sched_domain pointer
      equal NULL as an argument will initialize the system-wide sched energy
      structure sse.
      
      For the example platform (ARM TC2 (MC and DIE sd level)), the
      system-wide sched_domain_energy_f returns NULL, so struct sched_energy
      *sse stays NULL.
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • arm: topology: Define TC2 sched energy and provide it to scheduler · 39bd6abf
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      !!! This patch is only here to be able to test provisioning of sched
      energy related data from an arch topology shim layer to the scheduler.
      Since there is no code today which extracts sched energy related data
      from the dtb or acpi and processes it in the topology shim layer, the
      struct sched_energy and the related struct capacity_state arrays are
      hard-coded here !!!
      
      This patch defines the struct sched_energy and the related struct
      capacity_state array for the cluster (relates to sg's in DIE sd level)
      and for the core (relates to sg's in MC sd level) for a Cortex A7 as
      well as for a Cortex A15. It further provides related implementations of
      the sched_domain_energy_f functions (cpu_cluster_energy() and
      cpu_core_energy()).
      
      To be able to propagate this information from the topology shim layer to
      the scheduler, the elements of the arm_topology[] table have been
      provisioned with the appropriate sched_domain_energy_f functions.
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • sched: Add sd energy procfs interface · d285705d
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      This patch makes the values of the sd energy data structure available via
      procfs.  The related files are placed as sub-directory named 'energy'
      inside the /proc/sys/kernel/sched_domain/cpuX/domainY/groupZ directory for
      those cpu/domain/group tuples which have sd energy information.
      
      The following example depicts the contents of
      /proc/sys/kernel/sched_domain/cpu0/domain0/group[01] for a system which
      has sd energy information attached to domain level 0.
      
      ├── cpu0
      │   ├── domain0
      │   │   ├── busy_factor
      │   │   ├── busy_idx
      │   │   ├── cache_nice_tries
      │   │   ├── flags
      │   │   ├── forkexec_idx
      │   │   ├── group0
      │   │   │   └── energy
      │   │   │       ├── cap_states
      │   │   │       ├── idle_power
      │   │   │       ├── max_capacity
      │   │   │       ├── nr_cap_states
      │   │   │       └── wakeup_energy
      │   │   ├── group1
      │   │   │   └── energy
      │   │   │       ├── cap_states
      │   │   │       ├── idle_power
      │   │   │       ├── max_capacity
      │   │   │       ├── nr_cap_states
      │   │   │       └── wakeup_energy
      │   │   ├── idle_idx
      │   │   ├── imbalance_pct
      │   │   ├── max_interval
      │   │   ├── max_newidle_lb_cost
      │   │   ├── min_interval
      │   │   ├── name
      │   │   ├── newidle_idx
      │   │   └── wake_idx
      │   └── domain1
      │       ├── busy_factor
      │       ├── busy_idx
      │       ├── cache_nice_tries
      │       ├── flags
      │       ├── forkexec_idx
      │       ├── idle_idx
      │       ├── imbalance_pct
      │       ├── max_interval
      │       ├── max_newidle_lb_cost
      │       ├── min_interval
      │       ├── name
      │       ├── newidle_idx
      │       └── wake_idx
      
      The files 'idle_power', 'max_capacity', 'nr_cap_states' and 'wakeup_energy'
      contain a scalar value whereas 'cap_states' contains a vector of
      (compute capacity, power consumption @ this compute capacity) tuples.
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • sched: Allocate and initialize sched energy · 432df25c
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      The per sg struct sched_group_energy structure plus the related struct
      capacity_state array are allocated like the other sd hierarchy data
      structures (e.g. struct sched_group).  This includes the freeing of
      struct sched_group_energy structures which are not used.
      
      One problem is that the sd energy information consists of two structures
      per sg, the actual struct sched_group_energy and the related
      capacity_state array and that the number of elements of this array can be
      configured (see struct sched_group_energy.nr_cap_states).  That means
      that the number of capacity states has to be figured out in __sdt_alloc()
      and since both data structures are allocated at the same time, struct
      sched_group_energy.cap_states is initialized to point to the start of the
      capacity state array memory.
      
      The new function init_sched_energy() initializes the per sg struct
      sched_group_energy and the struct capacity_state array in case the struct
      sched_domain_topology_level contains sd energy information.
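      
      The co-allocation described above follows a common pattern: allocate the
      struct and its variable-length array in one chunk and point the member
      at the array part. A plain C sketch (malloc stands in for the kernel
      allocator; struct layouts are simplified):

```c
#include <stdlib.h>

struct capacity_state {
	unsigned long cap;	/* compute capacity */
	unsigned long power;	/* power at this capacity */
};

struct sched_group_energy {
	unsigned int nr_cap_states;
	struct capacity_state *cap_states;	/* points into same chunk */
};

/* One allocation for both the struct and its capacity_state array;
 * cap_states is initialized to the start of the array memory, as the
 * commit message describes for __sdt_alloc(). */
static struct sched_group_energy *alloc_sge(unsigned int nr_cap_states)
{
	struct sched_group_energy *sge;

	sge = malloc(sizeof(*sge) +
		     nr_cap_states * sizeof(struct capacity_state));
	if (!sge)
		return NULL;
	sge->nr_cap_states = nr_cap_states;
	sge->cap_states = (struct capacity_state *)(sge + 1);
	return sge;
}
```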
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • sched: Introduce sd energy data structures · 1320b9ce
      Dietmar Eggemann authored and Morten Rasmussen committed
      
      The struct sched_energy represents the per scheduler group related data
      which is needed for the energy aware scheduler.
      
      It contains a pointer to a struct capacity_state array which contains
      (compute capacity, power consumption @ this compute capacity) tuples.
      
      The struct sched_group_energy wraps struct sched_energy and an atomic
      reference counter; the latter is used for scheduler internal bookkeeping
      of data allocation and freeing.
      
      Allocation and freeing of struct sched_group_energy uses the existing
      infrastructure of the scheduler which is currently used for the other sd
      hierarchy data structures (e.g. struct sched_domain).  That's why struct
      sd_data is provisioned with a per cpu struct sched_group_energy double
      pointer.
      
      The struct sched_group gets a pointer to a struct sched_group_energy.
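      
      The layering described above can be summarized in a sketch. Field names
      follow the commit message; exact types and layout are assumptions, and
      atomic_int stands in for the kernel's atomic_t:

```c
#include <stdatomic.h>

struct capacity_state {
	unsigned long cap;	/* compute capacity */
	unsigned long power;	/* power consumption at this capacity */
};

struct sched_energy {
	unsigned int nr_cap_states;
	struct capacity_state *cap_states;	/* (capacity, power) tuples */
};

/* Wraps sched_energy with the reference counter used for scheduler
 * internal bookkeeping of allocation and freeing. */
struct sched_group_energy {
	atomic_int ref;
	struct sched_energy data;
};
```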
      
      The function ptr sched_domain_energy_f is introduced into struct
      sched_domain_topology_level, which allows the arch to pass a
      particular struct sched_energy from the topology shim layer into the
      scheduler core.
      
      The function ptr sched_domain_energy_f has an 'int cpu' parameter since
      the folding of two adjacent sd levels via sd degenerate doesn't work
      for all sd levels.  E.g. it is not possible to use this feature to
      provide per-cpu sd energy in sd level DIE (former CPU) on ARM's TC2
      platform.
      
      It was discussed that the folding of sd levels approach is preferable
      over the cpu parameter approach, simply because the user (the arch
      specifying the sd topology table) can introduce fewer errors. But since
      it is not working, the 'int cpu' parameter is the only way out. It is
      possible, though, to use the folding of sd levels approach for
      sched_domain_flags_f and the cpu parameter approach for
      sched_domain_energy_f in the same set-up. With the use of the
      'int cpu' parameter, an extra check function has to be provided to
      make sure that all cpus spanned by a scheduler building block (e.g. a
      sched domain or a group) are provisioned with the same energy data.
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    • sched: Introduce CONFIG_SCHED_ENERGY · bbfa38e2
      Morten Rasmussen authored
      
      The Energy-aware scheduler implementation is guarded by
      CONFIG_SCHED_ENERGY.
      
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
    • sched: Documentation for scheduler energy cost model · c7aadf15
      Morten Rasmussen authored
      
      This documentation patch provides a brief overview of the experimental
      scheduler energy costing model and associated data structures.
      
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
  2. 12 Jun, 2014 1 commit
  3. 05 Jun, 2014 18 commits
  4. 22 May, 2014 5 commits