- 04 Feb, 2015 34 commits
-
-
Morten Rasmussen authored
With energy-aware scheduling enabled, nohz_kick_needed() generates many nohz idle-balance kicks which achieve nothing when multiple tasks get packed on a single cpu to save energy. This causes unnecessary wake-ups and hence wastes energy. Make these conditions depend on !energy_aware() for now, until the energy-aware nohz story gets sorted out. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
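A minimal sketch of the gating described above, assuming the energy_aware() predicate introduced elsewhere in the series; the kick condition shown is illustrative, not the exact upstream check:

```c
/* Sketch: suppress load-based nohz kicks while EAS is packing tasks. */
static bool nohz_kick_needed_sketch(struct rq *rq)
{
	/* With EAS packing small tasks, these kicks would only undo the packing. */
	if (energy_aware())
		return false;

	/* Conventional trigger: more than one runnable task on this cpu. */
	return rq->nr_running >= 2;
}
```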
-
Morten Rasmussen authored
Add an extra criterion to need_active_balance() to kick off active load balance if the source cpu is overutilized and has lower capacity than the destination cpu. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
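A sketch of how such a criterion could look; cpu_overutilized() and capacity_orig_of() follow the naming used elsewhere in the series, and the exact placement inside need_active_balance() is an assumption:

```c
/* Hypothetical extra test for need_active_balance() (kernel/sched/fair.c). */
static int need_active_balance_capacity(struct lb_env *env)
{
	/*
	 * The src cpu runs beyond its capacity and the dst cpu is bigger:
	 * actively pulling the running task is likely a win.
	 */
	if (cpu_overutilized(env->src_cpu) &&
	    capacity_orig_of(env->src_cpu) < capacity_orig_of(env->dst_cpu))
		return 1;

	return 0;
}
```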
-
We do not want to miss out on the ability to do energy-aware idle load balancing if the system is only partially loaded, since the operational range of energy-aware scheduling corresponds to a partially loaded system. We might want to pull a single remaining task from a potential src cpu towards an idle destination cpu if the energy model tells us this is worth doing to save energy. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
-
Skip a cpu as a potential src (costliest) if it has only one task running and its original capacity is greater than or equal to the original capacity of the dst cpu. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
-
Energy-aware load balancing is based on cpu usage, so the upper bound of its operational range is a fully utilized cpu. Above this tipping point it makes more sense to use weighted_cpuload() to preserve smp_nice. This patch implements the tipping-point detection in update_sg_lb_stats(): if one cpu is over-utilized, the current energy-aware load-balance operation falls back into the conventional weighted-load based one. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
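A sketch of the over-utilization (tipping point) test; whether the series applies a margin on top of the raw comparison is not stated here, so none is used:

```c
/* Sketch: a cpu is over-utilized once its usage reaches its capacity. */
static bool cpu_overutilized_sketch(int cpu)
{
	/*
	 * Above this point usage saturates and no longer reflects true
	 * demand, so load balancing falls back to weighted_cpuload().
	 */
	return get_cpu_usage(cpu) >= capacity_orig_of(cpu);
}
```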
-
Energy-aware load balancing does not rely on env->imbalance; instead, it evaluates the system-wide energy difference for each task on the src rq were it moved to the dst rq. If this energy difference is less than zero, the task is actually moved from the src rq to the dst rq. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
-
If, after the gathering of sched domain statistics, the current load-balancing operation is still in energy-aware mode and a least efficient sched group has been found, detect the least efficient cpu by comparing the cpu efficiency (the ratio between cpu usage and cpu energy consumption) among all cpus of the least efficient sched group. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
-
If, after the gathering of sched domain statistics, the current load-balancing operation is still in energy-aware mode, just return the least efficient (costliest) reference. That implies the system is considered to be balanced if no least efficient sched group was found. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
-
Energy-aware load balancing has to work alongside the conventional load-based functionality. This includes the tipping-point feature, i.e. being able to fall back from energy-aware to the conventional load-based functionality during an ongoing load-balancing action. That is why this patch introduces an additional reference to hold the least efficient sched group (costliest) as well as its statistics in the form of an extra sg_lb_stats structure (costliest_stat). The function update_sd_pick_costliest() is used to assign the least efficient sched group, in parallel to the existing update_sd_pick_busiest(). cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
-
To be able to identify the least efficient (costliest) sched group, introduce group_eff, the efficiency of the sched group, into sg_lb_stats. The group efficiency is defined as the ratio between the group usage and the group energy consumption. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
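A sketch of the ratio as a fixed-point computation; the scaling shift and the handling of a (near) zero-energy group are assumptions:

```c
/* Hypothetical computation of sg_lb_stats::group_eff. */
static unsigned long group_efficiency(unsigned long group_usage,
				      unsigned long group_energy)
{
	/* An idle group consumes ~nothing: treat it as maximally efficient. */
	if (!group_energy)
		return ULONG_MAX;

	/* Fixed-point usage-per-energy ratio. */
	return (group_usage << SCHED_CAPACITY_SHIFT) / group_energy;
}
```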
-
Energy-aware load balancing should only happen if the ENERGY_AWARE feature is turned on and the sched domain on which the load balancing is performed contains energy data. There is also a need, during a load-balance action, to be able to query whether we should continue to load balance energy-aware or whether we have reached the tipping point, which forces us to fall back to the conventional load-balancing functionality. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
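A sketch of the two queries described: the static enablement check and the dynamic tipping-point check during a load-balance action. The sge pointer on struct sched_group and the overutilized flag in struct lb_env are assumptions kept consistent with the rest of this log:

```c
/* Is energy-aware scheduling enabled at all? */
static inline bool energy_aware(void)
{
	return sched_feat(ENERGY_AWARE);
}

/* Should this load-balance action (still) run in energy-aware mode? */
static inline bool lb_energy_aware(struct lb_env *env)
{
	/* The sd must carry energy data and the tipping point not be hit. */
	return energy_aware() && env->sd->groups->sge &&
	       !env->overutilized;
}
```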
-
Morten Rasmussen authored
To estimate the energy consumption of a sched_group in sched_group_energy() it is necessary to know which idle-state the group is in when it is idle. For now, it is assumed that this is the current idle-state (though it might be wrong). Based on the individual cpu idle-states, group_idle_state() finds the group idle-state. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
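A sketch of group_idle_state() under the stated assumption: the group state is derived from the current per-cpu idle-state indices, and a coordinated group state can be no deeper than its shallowest member. idle_get_state_idx() follows the naming used elsewhere in the series:

```c
/* Sketch: the group idle-state is the shallowest member state. */
static int group_idle_state(struct sched_group *sg)
{
	int i, state = INT_MAX;

	for_each_cpu(i, sched_group_cpus(sg))
		state = min(state, idle_get_state_idx(cpu_rq(i)));

	return state;
}
```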
-
Morten Rasmussen authored
cpuidle associates all idle-states with each cpu, while the energy model associates them with the sched_group covering the cpus coordinating entry to the idle-state. To get idle-state power consumption it is therefore necessary to translate from the cpuidle idle-state index to the energy model index. For this purpose it is helpful to know how many idle-states are listed in lower-level sched_groups (in struct sched_group_energy). cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
-
Morten Rasmussen authored
The idle-state of each cpu is currently pointed to by rq->idle_state, but there isn't any information in struct cpuidle_state that can be used to look up the idle-state energy model data stored in struct sched_group_energy. For this purpose it is necessary to store the idle-state index as well. Ideally, the idle-state data should be unified. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
-
Morten Rasmussen authored
Make wake-ups of new tasks (find_idlest_group) aware of any differences in cpu compute capacity so new tasks don't get handed off to a cpu with lower capacity. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
-
Morten Rasmussen authored
Let available compute capacity and estimated energy impact select the wake-up target cpu when energy-aware scheduling is enabled. energy_aware_wake_cpu() attempts to find a group of cpus with sufficient compute capacity to accommodate the task, and then a cpu with enough spare capacity to handle the task within that group. Preference is given to cpus with enough spare capacity at the current OPP. Finally, the energy impact of the new target and of the previous task cpu is compared to select the wake-up target cpu. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
-
Morten Rasmussen authored
Add a generic energy-aware helper function, energy_diff(), that calculates the energy impact of adding, removing, or migrating utilization in the system. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
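A sketch of the helper's shape; struct energy_env and the boolean "apply the change" parameter to sched_group_energy() are assumptions standing in for however the prediction is actually parameterized:

```c
/* Hypothetical energy_diff(): predicted energy after minus before. */
struct energy_env {
	struct sched_group *sg_top;	/* topmost group covering src and dst */
	unsigned long util_delta;	/* utilization added/removed/moved */
	int src_cpu;
	int dst_cpu;
};

static int energy_diff(struct energy_env *eenv)
{
	unsigned int before = sched_group_energy(eenv, false);
	unsigned int after  = sched_group_energy(eenv, true);

	/* A negative result predicts that the change saves energy. */
	return (int)after - (int)before;
}
```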
-
Morten Rasmussen authored
Extend sched_group_energy() to support energy prediction with usage (tasks) added/removed from a specific cpu or migrated between a pair of cpus. This is useful for load-balancing decision making. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
-
Morten Rasmussen authored
For energy-aware load-balancing decisions it is necessary to know the energy consumption estimates of groups of cpus. This patch introduces a basic function, sched_group_energy(), which estimates the energy consumption of the cpus in the group and of any resources shared by the members of the group. NOTE: The function has five levels of indentation and breaks the 80-character limit. Refactoring is necessary. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
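A sketch of the core of the estimate, assuming the group's utilization has already been normalized to SCHED_CAPACITY_SCALE and the current capacity and idle state indices are known; the recursion over nested groups and shared resources is omitted. Field names follow struct sched_group_energy as described further down this log:

```c
/* Sketch: busy energy plus idle energy, weighted by group utilization. */
static unsigned long group_energy_sketch(struct sched_group *sg,
					 unsigned long norm_util,
					 int cap_idx, int idle_idx)
{
	struct sched_group_energy *sge = sg->sge;
	unsigned long busy_energy, idle_energy;

	/* norm_util in [0..SCHED_CAPACITY_SCALE]: fraction of time busy. */
	busy_energy = sge->cap_states[cap_idx].power * norm_util;
	idle_energy = sge->idle_states[idle_idx].power *
		      (SCHED_CAPACITY_SCALE - norm_util);

	return (busy_energy + idle_energy) >> SCHED_CAPACITY_SHIFT;
}
```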
-
Morten Rasmussen authored
Add another member to the family of per-cpu sched_domain shortcut pointers. This one, sd_ea, points to the highest level at which energy model is provided. At this level and all levels below all sched_groups have energy model data attached. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
-
Morten Rasmussen authored
With scale-invariant usage tracking, get_cpu_usage() should never return a usage above the current compute capacity of the cpu (capacity_curr). The scaling of the utilization tracking contributions should generally cause cpu utilization to saturate at capacity_curr, but it may temporarily exceed this value in certain situations. This patch changes the cap from capacity_orig to capacity_curr. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
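A sketch of the changed cap; capacity_curr_of() is the helper introduced elsewhere in the series:

```c
/* Sketch: cap usage at the current rather than the original capacity. */
static unsigned long get_cpu_usage_sketch(int cpu)
{
	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;

	/* Scale-invariant tracking may transiently overshoot capacity_curr. */
	return min(usage, capacity_curr_of(cpu));
}
```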
-
Morten Rasmussen authored
Move get_cpu_usage() to an earlier position in fair.c. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
-
Morten Rasmussen authored
capacity_orig_of() returns the max available compute capacity of a cpu. For scale-invariant utilization tracking and energy-aware scheduling decisions it is useful to know the compute capacity available at the current OPP of a cpu. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
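A sketch of such a helper: it scales the max capacity by the current frequency factor provided by arch_scale_freq_capacity(); the exact form is an assumption:

```c
/* Sketch: compute capacity available at the current OPP of a cpu. */
static unsigned long capacity_curr_of(int cpu)
{
	return cpu_rq(cpu)->cpu_capacity_orig *
	       arch_scale_freq_capacity(NULL, cpu) >> SCHED_CAPACITY_SHIFT;
}
```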
-
This patch is only here to be able to test provisioning of energy-related data from an arch topology shim layer to the scheduler. Since there is no code today which deals with extracting energy-related data from the dtb or acpi and processing it in the topology shim layer, the contents of the sched_group_energy structures as well as the idle_state and capacity_state arrays are hard-coded here.

This patch defines the sched_group_energy structure as well as the idle_state and capacity_state arrays for the cluster (relating to sched groups (sgs) at DIE sched-domain level) and for the core (relating to sgs at MC sd level) for a Cortex-A7 as well as for a Cortex-A15. It further provides related implementations of the sched_domain_energy_f functions (cpu_cluster_energy() and cpu_core_energy()).

To be able to propagate this information from the topology shim layer to the scheduler, the elements of the arm_topology[] table have been provisioned with the appropriate sched_domain_energy_f functions. cc: Russell King <linux@arm.linux.org.uk> Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
-
Morten Rasmussen authored
cpufreq currently keeps it a secret which cpus share a clock source. The scheduler needs to know about clock domains as well to become more energy-aware. The SD_SHARE_CAP_STATES domain flag indicates whether the cpus belonging to the sched_domain share capacity states (P-states). There is no connection with cpufreq (yet). The flag must be set by the arch-specific topology code. cc: Russell King <linux@arm.linux.org.uk> cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
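A sketch of how an arch could set the flag for a level whose cpus share a clock (e.g. per cluster); the function name and the flag combination are illustrative:

```c
/* Hypothetical arch topology flags function for a per-cluster clock. */
static inline int cpu_corepower_flags(void)
{
	/* Cpus in this domain share P-states (one clock per cluster). */
	return SD_SHARE_PKG_RESOURCES | SD_SHARE_CAP_STATES;
}
```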
-
The per-sched_group sched_group_energy structure plus the related idle_state and capacity_state arrays are allocated like the other sched domain (sd) hierarchy data structures. This includes the freeing of sched_group_energy structures which are not used.

One problem is that the number of elements of the idle_state and capacity_state arrays is not fixed and has to be retrieved in __sdt_alloc() to allocate memory for the sched_group_energy structure and the two arrays in one chunk. The array pointers (idle_states and cap_states) are initialized there to point to the correct places inside the memory chunk.

The new function init_sched_energy() initializes the sched_group_energy structure and the two arrays in case the sd topology level contains energy information. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
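A sketch of the single-chunk allocation and pointer fix-up described; how __sdt_alloc() retrieves the counts is abstracted into the nr_idle/nr_cap parameters:

```c
/* Sketch: struct and both arrays in one allocation, pointers fixed up. */
static struct sched_group_energy *alloc_sge(int cpu, int nr_idle, int nr_cap)
{
	struct sched_group_energy *sge;

	sge = kzalloc_node(sizeof(*sge) +
			   nr_idle * sizeof(struct idle_state) +
			   nr_cap * sizeof(struct capacity_state),
			   GFP_KERNEL, cpu_to_node(cpu));
	if (!sge)
		return NULL;

	/* Both arrays live right behind the struct in the same chunk. */
	sge->idle_states = (struct idle_state *)(sge + 1);
	sge->cap_states = (struct capacity_state *)(sge->idle_states + nr_idle);
	sge->nr_idle_states = nr_idle;
	sge->nr_cap_states = nr_cap;

	return sge;
}
```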
-
The struct sched_group_energy represents the per-sched_group related data which is needed for energy-aware scheduling. It contains:

(1) an atomic reference counter for scheduler-internal bookkeeping of data allocation and freeing
(2) the number of elements of the idle state array
(3) a pointer to the idle state array, which comprises the power consumption for each idle state
(4) the number of elements of the capacity state array
(5) a pointer to the capacity state array, which comprises (compute capacity, power consumption) tuples for each capacity state

Allocation and freeing of struct sched_group_energy utilizes the existing infrastructure of the scheduler which is currently used for the other sd hierarchy data structures (e.g. struct sched_domain) as well. That's why struct sd_data is provisioned with a per-cpu struct sched_group_energy double pointer. The struct sched_group obtains a pointer to a struct sched_group_energy.

The function pointer sched_domain_energy_f is introduced into struct sched_domain_topology_level, which will allow the arch to pass a particular struct sched_group_energy from the topology shim layer into the scheduler core. sched_domain_energy_f has an 'int cpu' parameter since the folding of two adjacent sd levels via sd degenerate doesn't work for all sd levels; it is not possible, for example, to use this feature to provide per-cpu energy at sd level DIE on ARM's TC2 platform.

It was discussed that the folding-of-sd-levels approach is preferable over the cpu-parameter approach, simply because the user (the arch specifying the sd topology table) can introduce fewer errors. But since it is not working, the 'int cpu' parameter is the only way out. It is possible to use the folding of sd levels for sched_domain_flags_f and the cpu parameter for sched_domain_energy_f at the same time, though. With the use of the 'int cpu' parameter, an extra check function has to be provided to make sure that all cpus spanned by a sched group are provisioned with the same energy data. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
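A sketch of the structure following the five items above; member names are assumptions kept consistent with the other sketches in this log:

```c
/* Sketch of the energy-data structures described above. */
struct idle_state {
	unsigned long power;	/* power consumption in this idle state */
};

struct capacity_state {
	unsigned long cap;	/* compute capacity */
	unsigned long power;	/* power consumption at this capacity */
};

struct sched_group_energy {
	atomic_t ref;				/* (1) alloc/free bookkeeping */
	unsigned int nr_idle_states;		/* (2) */
	struct idle_state *idle_states;		/* (3) */
	unsigned int nr_cap_states;		/* (4) */
	struct capacity_state *cap_states;	/* (5) */
};

/* Per-level hook letting the arch hand energy data to the scheduler. */
typedef const struct sched_group_energy *(*sched_domain_energy_f)(int cpu);
```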
-
Morten Rasmussen authored
This patch introduces the ENERGY_AWARE sched feature, which is implemented using jump labels when SCHED_DEBUG is defined. It is statically set to false when SCHED_DEBUG is not defined, so this doesn't allow energy awareness to be enabled without SCHED_DEBUG. This sched_feature knob will be replaced later with a more appropriate control knob when things have matured a bit. ENERGY_AWARE is based on per-entity load-tracking, hence FAIR_GROUP_SCHED must be enabled. This dependency isn't checked at compile time yet. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
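The knob plausibly looks like the existing entries in kernel/sched/features.h; the comment text is an assumption:

```c
/*
 * Energy-aware scheduling. Use a platform energy model to guide
 * scheduling decisions, optimizing for energy efficiency.
 */
SCHED_FEAT(ENERGY_AWARE, false)
```

Call sites then test sched_feat(ENERGY_AWARE), which compiles to a jump label under SCHED_DEBUG and to a constant false otherwise, matching the description above.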
-
Morten Rasmussen authored
This documentation patch provides an overview of the experimental scheduler energy costing model, associated data structures, and a reference recipe on how platforms can be characterized to derive energy models. Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
-
Morten Rasmussen authored
Add the blocked utilization contribution to group sched_entity utilization (se->avg.utilization_avg_contrib) and to get_cpu_usage(). With this change, cpu usage now includes recent usage by currently non-runnable tasks, hence it provides a more stable view of the cpu usage. It does, however, also mean that the meaning of usage changes: a cpu may be momentarily idle while usage > 0. It can no longer be assumed that cpu usage > 0 implies runnable tasks on the rq; cfs_rq->utilization_load_avg or nr_running should be used instead to get the current rq status. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
-
Morten Rasmussen authored
Introduce blocked utilization, the utilization counterpart to cfs_rq->utilization_load_avg. It is the sum of the sched_entity utilization contributions of entities that were recently on the cfs_rq and are currently blocked. Combined with the sum of contributions of entities currently on the cfs_rq or currently running (cfs_rq->utilization_load_avg), this can provide a more stable average view of the cpu usage. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
-
Since cfs_rq::utilization_load_avg is now not only frequency-invariant but also cpu-invariant (uarch plus max system frequency), both frequency and cpu scaling happen as part of the load tracking. So cfs_rq::utilization_load_avg does not have to be scaled by the original capacity of the cpu again. Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
-
Reuse the existing infrastructure for cpu_scale to provide the scheduler with a cpu scaling correction factor for more accurate load-tracking. This factor comprises a micro-architectural part, which is based on the cpu efficiency value of a cpu, as well as a platform-wide max-frequency part, which relates to the dtb property clock-frequency of a cpu node.

The calculation of cpu_scale, the return value of arch_scale_cpu_capacity(), changes from:

  capacity / middle_capacity, with capacity = (clock_frequency >> 20) * cpu_efficiency

to:

  SCHED_CAPACITY_SCALE * cpu_perf / max_cpu_perf

The range of the cpu_scale value changes from [0..3*SCHED_CAPACITY_SCALE/2] to [0..SCHED_CAPACITY_SCALE]. The functionality to calculate the middle_capacity, which corresponds to an 'average' cpu, has been taken out since the scaling is now done differently. In case either the cpu efficiency or the clock-frequency value for a cpu is missing, no cpu scaling is done for any cpu.

The platform-wide max-frequency part of the factor should not be confused with the frequency-invariant scheduler load-tracking support, which deals with frequency-related scaling due to DVFS functionality on a cpu. Cc: Russell King <linux@arm.linux.org.uk> Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
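A sketch of the new derivation under the definitions given above (cpu_perf = (clock_frequency >> 20) * cpu_efficiency, max_cpu_perf its system-wide maximum); the helper name is hypothetical:

```c
/* Sketch: cpu_scale = SCHED_CAPACITY_SCALE * cpu_perf / max_cpu_perf. */
static unsigned long compute_cpu_scale(unsigned long cpu_perf,
				       unsigned long max_cpu_perf)
{
	/* 1024 for the fastest cpu, proportionally less for the others. */
	return (cpu_perf << SCHED_CAPACITY_SHIFT) / max_cpu_perf;
}
```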
-
To enable the parsing of the clock frequency and cpu efficiency values inside parse_dt_topology() [arch/arm/kernel/topology.c] to scale the relative capacity of the cpus, the clock-frequency property has to be provided within the cpu nodes of the dts file. The patch is a copy of commit 8f15973e ("ARM: vexpress: Add CPU clock-frequencies to TC2 device-tree") taken from the Linaro Stable Kernel (LSK), massaged into mainline. Cc: Jon Medhurst <tixy@linaro.org> Cc: Russell King <linux@arm.linux.org.uk> Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
-
- 02 Feb, 2015 6 commits
-
-
Implement an arch-specific function to provide the scheduler with a frequency scaling correction factor for more accurate load-tracking. The factor is: current_freq(cpu) * SCHED_CAPACITY_SCALE / max_freq(cpu). This implementation only provides frequency invariance, no micro-architecture invariance yet. Cc: Russell King <linux@arm.linux.org.uk> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
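A sketch of the arm implementation of the factor above; the per-cpu current/max frequency variables are assumptions (they would be filled in via the cpufreq callbacks from the next entry):

```c
/* Sketch: frequency scaling correction factor for load tracking. */
static DEFINE_PER_CPU(unsigned long, cpu_curr_freq);
static DEFINE_PER_CPU(unsigned long, cpu_max_freq);

unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
{
	if (!per_cpu(cpu_max_freq, cpu))
		return SCHED_CAPACITY_SCALE;	/* no data: no scaling */

	return (per_cpu(cpu_curr_freq, cpu) << SCHED_CAPACITY_SHIFT) /
	       per_cpu(cpu_max_freq, cpu);
}
```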
-
Architectures that don't have any other means of tracking cpu frequency changes need a callback from cpufreq to implement a scaling factor that enables scale-invariant per-entity load-tracking in the scheduler. To compute the scale-invariance correction factor the architecture needs to know both the max frequency and the current frequency. This patch defines weak functions for setting both from cpufreq. Related architecture-specific functions already use weak function definitions, and the same approach is followed here. These callbacks can be used to implement frequency scaling of cpu capacity later. Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by:
Morten Rasmussen <morten.rasmussen@arm.com>
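A sketch of the weak-default pattern described; the names and signatures are assumptions. cpufreq would call these from its frequency-transition and policy-update paths, and an interested arch overrides them with strong definitions:

```c
/* Default no-op hooks; an arch that cares provides strong definitions. */
void __weak arch_scale_set_curr_freq(int cpu, unsigned long freq) {}
void __weak arch_scale_set_max_freq(int cpu, unsigned long freq) {}
```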
-
Besides the existing frequency scale-invariance correction factor, apply a cpu scale-invariance correction factor to usage tracking. Cpu scale-invariance takes into consideration cpu performance deviations due to micro-architectural differences (i.e. instructions per second) between cpus in HMP systems (e.g. big.LITTLE), as well as differences in the frequency value of the highest OPP between cpus in SMP systems. Each segment of the sched_avg::running_avg_sum geometric series is now scaled by the cpu performance factor too, so the sched_avg::utilization_avg_contrib of each entity will be invariant from the particular cpu of the HMP/SMP system it is gathered on. As a result, the usage level that is returned by get_cpu_usage() stays relative to the max cpu performance of the system. Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>
-
Apply a frequency scale-invariance correction factor to load tracking. Each segment of the sched_avg::runnable_avg_sum geometric series is now scaled by the current frequency, so the sched_avg::load_avg_contrib of each entity will be invariant with frequency scaling. As a result, cfs_rq::runnable_load_avg, which is the sum of sched_avg::load_avg_contrib, becomes invariant too. So the load level that is returned by weighted_cpuload() stays relative to the max frequency of the cpu.

Then, we want to keep the load-tracking values in a 32-bit type, which implies that the max value of sched_avg::{runnable|running}_avg_sum must be lower than 2^32/88761 = 48388 (88761 is the max weight of a task). As LOAD_AVG_MAX = 47742, arch_scale_freq_capacity() must return a value less than (48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_CAPACITY_SCALE = 1024). So we define the range as [0..SCHED_CAPACITY_SCALE] in order to avoid overflow. Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by:
Vincent Guittot <vincent.guittot@linaro.org>
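A sketch of the per-segment scaling, with the bound above keeping the 32-bit sum from overflowing; the helper name is hypothetical:

```c
/* Sketch: scale a runnable/running segment by the current frequency. */
static u32 scale_freq_delta(u32 delta, unsigned long freq_factor)
{
	/* freq_factor in [0..SCHED_CAPACITY_SCALE], so the sum stays < 2^32. */
	return (delta * freq_factor) >> SCHED_CAPACITY_SHIFT;
}
```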
-
When a CPU is used to handle a lot of IRQs or some RT tasks, the remaining capacity for CFS tasks can be significantly reduced. Once we detect such a situation by comparing cpu_capacity_orig and cpu_capacity, we trigger an idle load balance to check if it's worth moving its tasks to an idle CPU.

Once the idle load balance has selected the busiest CPU, it will look for an active load balance in only two cases:
- there is only 1 task on the busiest CPU.
- we haven't been able to move a task off the busiest rq.

A CPU with reduced capacity is included in the 1st case, and it's worth actively migrating its task if the idle CPU has got full capacity. This test has been added in need_active_balance().

As a side note, this will not generate more spurious ilb because we already trigger an ilb if there is more than 1 busy cpu. If this cpu is the only one that has a task, we will trigger the ilb once to migrate the task.

The nohz_kick_needed() function has been cleaned up a bit while adding the new test. env.src_cpu and env.src_rq must be set unconditionally because they are used in need_active_balance(), which is called even if busiest->nr_running equals 1. Signed-off-by:
Vincent Guittot <vincent.guittot@linaro.org>
-
Add the SD_PREFER_SIBLING flag for the SMT level in order to ensure that the scheduler will put at least 1 task per core. Signed-off-by:
Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by:
Preeti U. Murthy <preeti@linux.vnet.ibm.com>
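The change plausibly lands in sd_init(); the exact context is an assumption:

```c
/* Sketch: in sd_init() (kernel/sched/core.c). */
if (sd->flags & SD_SHARE_CPUCAPACITY) {
	/* SMT level: spread tasks so each core gets one task first. */
	sd->flags |= SD_PREFER_SIBLING;
}
```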
-