    cpuset: restore sanity to cpuset_cpus_allowed_fallback() · d477f8c2
    Joel Savitz authored
    
    
    In the case that a process is constrained by taskset(1) (i.e.
    sched_setaffinity(2)) to a subset of available cpus, and all of those are
    subsequently offlined, the scheduler will set tsk->cpus_allowed to
    the current value of task_cs(tsk)->effective_cpus.
    
    This is done via a call to do_set_cpus_allowed() in the context of
    cpuset_cpus_allowed_fallback() made by the scheduler when this case is
    detected. This is the only call made to cpuset_cpus_allowed_fallback()
    in the latest mainline kernel.
    
    However, this is not sane behavior.
    
    I will demonstrate this on a system running the latest upstream kernel
    with the following initial configuration:
    
    	# grep -i cpu /proc/$$/status
    	Cpus_allowed:	ffffffff,ffffffff
    	Cpus_allowed_list:	0-63
    
    (Where cpus 32-63 are provided via smt.)
    
    If we limit our current shell process to cpu2 only and then offline it
    and reonline it:
    
    	# taskset -p 4 $$
    	pid 2272's current affinity mask: ffffffffffffffff
    	pid 2272's new affinity mask: 4
    
    	# echo off > /sys/devices/system/cpu/cpu2/online
    	# dmesg | tail -3
    	[ 2195.866089] process 2272 (bash) no longer affine to cpu2
    	[ 2195.872700] IRQ 114: no longer affine to CPU2
    	[ 2195.879128] smpboot: CPU 2 is now offline
    
    	# echo on > /sys/devices/system/cpu/cpu2/online
    	# dmesg | tail -1
    	[ 2617.043572] smpboot: Booting Node 0 Processor 2 APIC 0x4
    
    We see that our current process now has an affinity mask containing
    every cpu available on the system _except_ the one we originally
    constrained it to:
    
    	# grep -i cpu /proc/$$/status
    	Cpus_allowed:   ffffffff,fffffffb
    	Cpus_allowed_list:      0-1,3-63
    
    This is not sane behavior: not only can the scheduler now place the
    process on previously forbidden cpus, it can no longer schedule it on
    the cpu it was originally constrained to!
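To make the mask arithmetic above easier to check, here is a minimal sketch that decodes a Cpus_allowed value into cpu ids and prints it in the Cpus_allowed_list range style. The helper names parse_cpus_allowed and format_cpu_list are hypothetical and exist only for this illustration; the sketch assumes the hex bitmap is printed in comma-separated 32-bit groups, most significant group first, as in the output shown.

```python
def parse_cpus_allowed(mask_text):
    """Turn a Cpus_allowed value such as 'ffffffff,fffffffb' into a set of cpu ids."""
    # Assumption: comma-separated 32-bit hex groups, most significant first,
    # so dropping the commas yields a single hexadecimal number.
    value = int(mask_text.replace(",", ""), 16)
    return {cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1}


def format_cpu_list(cpus):
    """Render a set of cpu ids in the range style used by Cpus_allowed_list."""
    ranges = []
    for cpu in sorted(cpus):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu  # extend the current run
        else:
            ranges.append([cpu, cpu])  # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


# The mask passed to taskset(1) and the mask observed after the fallback.
taskset_mask = parse_cpus_allowed("4")
after_fallback = parse_cpus_allowed("ffffffff,fffffffb")
print(format_cpu_list(taskset_mask))
print(format_cpu_list(after_fallback))
```

The two decoded sets do not intersect, which is exactly the problem: the only cpu the user asked for is the only one missing.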
    
    Other cases result in even more exotic affinity masks. Take for instance
    a process with an affinity mask containing only cpus provided by smt at
    the moment that smt is toggled, in a configuration such as the following:
    
    	# taskset -p f000000000 $$
    	# grep -i cpu /proc/$$/status
    	Cpus_allowed:	000000f0,00000000
    	Cpus_allowed_list:	36-39
    
    A double toggle of smt results in the following behavior:
    
    	# echo off > /sys/devices/system/cpu/smt/control
    	# echo on > /sys/devices/system/cpu/smt/control
    	# grep -i cpus /proc/$$/status
    	Cpus_allowed:	ffffff00,ffffffff
    	Cpus_allowed_list:	0-31,40-63
    
    This is even less sane than the previous case, as the new affinity mask
    excludes not only the cpus that were actually in the mask (36-39) but
    also every smt-provided cpu with a lower id (32-35).
    
    With this patch applied, both of these cases end in the following state:
    
    	# grep -i cpu /proc/$$/status
    	Cpus_allowed:	ffffffff,ffffffff
    	Cpus_allowed_list:	0-63
    
    The original policy is discarded. Though not ideal, it is the simplest way
    to restore sanity to this fallback case without reinventing the cpuset
    wheel that rolls down the kernel just fine in cgroup v2. A user who wishes
    for the previous affinity mask to be restored in this fallback case can use
    that mechanism instead.
    
    This patch modifies scheduler behavior by instead resetting the mask to
    task_cs(tsk)->cpus_allowed by default, and to cpu_possible_mask in legacy
    mode. I tested the cases above in both modes.
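The difference between the old and new fallback can be sketched with a toy model. This is not kernel code: simulate_fallback and its policy strings are hypothetical, the task is assumed to sit in a cpuset spanning all 64 cpus, and the smt siblings are assumed to go offline in ascending order. Under those assumptions the model reproduces the masks shown above.

```python
def simulate_fallback(possible, cpuset_cpus, task_mask, offline_order, policy):
    """Toy model of the affinity fallback (illustration only).

    policy: "old"    copies the cpuset cpus still online (effective_cpus),
            "new"    copies the cpuset's configured cpus_allowed,
            "legacy" copies every possible cpu.
    """
    online = set(possible)
    mask = set(task_mask)
    for cpu in offline_order:
        online.discard(cpu)
        if mask & online:
            continue  # the task still has an allowed cpu that is online
        # Last resort: no allowed cpu is online, so the mask is replaced.
        if policy == "old":
            mask = set(cpuset_cpus) & online
        elif policy == "new":
            mask = set(cpuset_cpus)
        else:
            mask = set(possible)
    # Re-onlining cpus afterwards does not touch the mask again.
    return mask


ALL = set(range(64))

# Case 1: task pinned to cpu2, then cpu2 is offlined.
old_single = simulate_fallback(ALL, ALL, {2}, [2], "old")
new_single = simulate_fallback(ALL, ALL, {2}, [2], "new")

# Case 2: task pinned to 36-39, then smt siblings 32-63 are offlined.
# Assumption: they are taken down in ascending order.
smt_off = list(range(32, 64))
old_smt = simulate_fallback(ALL, ALL, {36, 37, 38, 39}, smt_off, "old")
new_smt = simulate_fallback(ALL, ALL, {36, 37, 38, 39}, smt_off, "new")
legacy_smt = simulate_fallback(ALL, ALL, {36, 37, 38, 39}, smt_off, "legacy")
```

In the model, the old policy freezes whatever happened to be online at the instant the last allowed cpu disappeared, which is why the resulting mask depends on offlining order. The new policies are independent of that timing.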
    
    Note that the scheduler uses this fallback mechanism if and only if
    _every_ other valid avenue has been traveled, and it is the last resort
    before calling BUG().
    
    Suggested-by: Waiman Long <longman@redhat.com>
    Suggested-by: Phil Auld <pauld@redhat.com>
    Signed-off-by: Joel Savitz <jsavitz@redhat.com>
    Acked-by: Phil Auld <pauld@redhat.com>
    Acked-by: Waiman Long <longman@redhat.com>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Tejun Heo <tj@kernel.org>