1. 19 Apr, 2019 1 commit
    • cgroup: cgroup v2 freezer · 76f969e8
      Roman Gushchin authored
      
      
      Cgroup v1 implements the freezer controller, which provides an ability
      to stop the workload in a cgroup and temporarily free up some
      resources (cpu, io, network bandwidth and, potentially, memory)
      for some other tasks. Cgroup v2 lacks this functionality.
      
      This patch implements freezer for cgroup v2.
      
      Cgroup v2 freezer tries to put tasks into a state similar to jobctl
      stop. This means that tasks can be killed, ptraced (using
      PTRACE_SEIZE*), and interrupted. It is possible to attach to
      a frozen task, get some information (e.g. read registers) and detach.
      It's also possible to migrate frozen tasks to another cgroup.
      
      This distinguishes the cgroup v2 freezer from the cgroup v1 freezer, which
      mostly tried to imitate the system-wide freezer. However, while
      uninterruptible sleep is fine when all tasks are going to be frozen
      (the hibernation case), it's not an acceptable state when only some
      subset of the system is frozen.
      
      Cgroup v2 freezer does not support freezing kthreads.
      If a non-root cgroup contains a kthread, the cgroup can still be frozen,
      but the kthread will remain running, the cgroup will be shown
      as non-frozen, and the notification will not be delivered.
      
      * PTRACE_ATTACH does not work because non-fatal signal delivery
      is blocked in the frozen state.
      
      There are some interface differences between the cgroup v1 and cgroup v2
      freezers too, which are required to conform to the cgroup v2 interface
      design principles:
      1) There is no separate controller that has to be turned on:
      the functionality is always available and is represented by
      the cgroup.freeze and cgroup.events cgroup control files.
      2) The desired state is defined by the cgroup.freeze control file.
      Any hierarchical configuration is allowed.
      3) The interface is asynchronous. The actual state is available
      from the cgroup.events control file ("frozen" field). There are no
      dedicated transitional states.
      4) It's allowed to make any changes to the cgroup hierarchy
      (create new cgroups, remove old cgroups, move tasks between cgroups)
      regardless of whether some cgroups are frozen.
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
      Cc: kernel-team@fb.com
      76f969e8
  2. 06 Mar, 2019 2 commits
    • mm/cma: add PF flag to force non cma alloc · d7fefcc8
      Aneesh Kumar K.V authored
      Patch series "mm/kvm/vfio/ppc64: Migrate compound pages out of CMA
      region", v8.
      
      ppc64 uses the CMA area for the allocation of the guest page table (hash
      page table).  We won't be able to start the guest if we fail to allocate the
      hash page table.  We have observed hash table allocation failures because
      we failed to migrate pages out of the CMA region because they were pinned.
      This happens when we are using VFIO.  VFIO on ppc64 pins the entire guest
      RAM.  If the guest RAM pages get allocated out of the CMA region, we won't
      be able to migrate those pages.  The pages are also pinned for the
      lifetime of the guest.
      
      Currently we support migration of non-compound pages only.  With THP and
      with the addition of hugetlb migration we can end up allocating compound
      pages from the CMA region.  This patch series adds support for migrating
      compound pages.
      
      This patch (of 4):
      
      Add PF_MEMALLOC_NOCMA, which makes sure any allocation in that context is
      marked non-movable and hence cannot be satisfied from the CMA region.

      This is useful with get_user_pages_longterm, where we want to take a page
      pin after migrating pages out of the CMA region.  Marking the section with
      PF_MEMALLOC_NOCMA ensures that we avoid unnecessary page migration
      later.
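
      A sketch of the usage pattern is below. It assumes the scope helpers follow
      the existing memalloc_noio_save()/memalloc_noio_restore() naming convention,
      i.e. memalloc_nocma_save()/memalloc_nocma_restore(); treat the helper names
      as an assumption rather than the final API.

          /* Sketch: allocations inside the scope lose __GFP_MOVABLE and are
           * therefore not satisfied from the CMA region. Helper names are
           * assumed to mirror the memalloc_noio_*() scope API. */
          #include <linux/gfp.h>
          #include <linux/sched/mm.h>

          static struct page *alloc_page_outside_cma(void)
          {
                  unsigned int flags;
                  struct page *page;

                  flags = memalloc_nocma_save();          /* sets PF_MEMALLOC_NOCMA */
                  page = alloc_page(GFP_HIGHUSER_MOVABLE);
                  memalloc_nocma_restore(flags);          /* clears it again */

                  return page;
          }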
      
      Link: http://lkml.kernel.org/r/20190114095438.32470-2-aneesh.kumar@linux.ibm.com
      
      
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7fefcc8
    • mm, compaction: capture a page under direct compaction · 5e1f0f09
      Mel Gorman authored
      Compaction is inherently race-prone as a suitable page freed during
      compaction can be allocated by any parallel task.  This patch uses a
      capture_control structure to isolate a page immediately when it is freed
      by a direct compactor in the slow path of the page allocator.  The
      intent is to avoid redundant scanning.
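
      Conceptually, the capture works roughly as in the simplified sketch below
      (illustrative only; the field layout and the check are reduced to the bare
      minimum and are not the actual mm/ code):

          /* A direct compactor registers a capture_control on its task; the page
           * free path hands a suitable page straight to it instead of putting it
           * back on the free lists, so the compactor does not have to rescan. */
          struct capture_control {
                  struct compact_control *cc;     /* the direct compactor's request */
                  struct page *page;              /* captured page, if any */
          };

          /* called from the page free path (simplified) */
          static bool compaction_capture(struct capture_control *capc,
                                         struct page *page, unsigned int order)
          {
                  if (!capc || capc->page)
                          return false;           /* nobody waiting, or already done */
                  if (order < capc->cc->order)
                          return false;           /* too small for the pending request */

                  capc->page = page;              /* isolate the page immediately */
                  return true;
          }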
      
                                           5.0.0-rc1              5.0.0-rc1
                                     selective-v3r17          capture-v3r19
      Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
      Amean     fault-both-3      2582.11 (   0.00%)     2563.68 (   0.71%)
      Amean     fault-both-5      4500.26 (   0.00%)     4233.52 (   5.93%)
      Amean     fault-both-7      5819.53 (   0.00%)     6333.65 (  -8.83%)
      Amean     fault-both-12     9321.18 (   0.00%)     9759.38 (  -4.70%)
      Amean     fault-both-18     9782.76 (   0.00%)    10338.76 (  -5.68%)
      Amean     fault-both-24    15272.81 (   0.00%)    13379.55 *  12.40%*
      Amean     fault-both-30    15121.34 (   0.00%)    16158.25 (  -6.86%)
      Amean     fault-both-32    18466.67 (   0.00%)    18971.21 (  -2.73%)
      
      Latency is only moderately affected but the devil is in the details.  A
      closer examination indicates that base page fault latency is reduced but
      latency of huge pages is increased as it takes greater care to succeed.
      Part of the "problem" is that allocation success rates are close to 100%
      even when under pressure and compaction gets harder.
      
                                      5.0.0-rc1              5.0.0-rc1
                                selective-v3r17          capture-v3r19
      Percentage huge-3        96.70 (   0.00%)       98.23 (   1.58%)
      Percentage huge-5        96.99 (   0.00%)       95.30 (  -1.75%)
      Percentage huge-7        94.19 (   0.00%)       97.24 (   3.24%)
      Percentage huge-12       94.95 (   0.00%)       97.35 (   2.53%)
      Percentage huge-18       96.74 (   0.00%)       97.30 (   0.58%)
      Percentage huge-24       97.07 (   0.00%)       97.55 (   0.50%)
      Percentage huge-30       95.69 (   0.00%)       98.50 (   2.95%)
      Percentage huge-32       96.70 (   0.00%)       99.27 (   2.65%)
      
      And scan rates are reduced as expected by 6% for the migration scanner
      and 29% for the free scanner indicating that there is less redundant
      work.
      
      Compaction migrate scanned    20815362    19573286
      Compaction free scanned       16352612    11510663
      
      [mgorman@techsingularity.net: remove redundant check]
        Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net
      Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.net
      
      
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5e1f0f09
  3. 25 Feb, 2019 1 commit
    • Revert "x86/fault: BUG() when uaccess helpers fault on kernel addresses" · 53a41cb7
      Linus Torvalds authored
      This reverts commit 9da3f2b7.
      
      It was well-intentioned, but wrong.  Overriding the exception tables for
      instructions for random reasons is just wrong, and that is what the new
      code did.
      
      It caused problems for tracing, and it caused problems for strncpy_from_user(),
      because the new checks made perfectly valid use cases break, rather than
      catch things that did bad things.
      
      Unchecked user space accesses are a problem, but that's not a reason to
      add invalid checks that then people have to work around with silly flags
      (in this case, that 'kernel_uaccess_faults_ok' flag, which is just an
      odd way to say "this commit was wrong" and was sprinkled into random
      places to hide the wrongness).
      
      The real fix to unchecked user space accesses is to get rid of the
      special "let's not check __get_user() and __put_user() at all" logic.
      Make __{get|put}_user() be just aliases to the regular {get|put}_user()
      functions, and make it impossible to access user space without having
      the proper checks in places.
      
      The raison d'être of the special double-underscore versions used to be
      that the range check was expensive, and if you did multiple user
      accesses, you'd do the range check up front (like the signal frame
      handling code, for example).  But SMAP (on x86) and PAN (on ARM) have
      made that optimization pointless, because the _real_ expense is the "set
      CPU flag to allow user space access".
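
      In other words, the pattern in question looks like the sketch below
      (illustrative, with a hypothetical frame layout); once SMAP/PAN make the
      user-access CPU flag the dominant cost, the checked and "unchecked"
      variants cost essentially the same:

          struct frame_example { int a, b; };     /* hypothetical layout */

          static int read_frame(struct frame_example __user *frame, int *a, int *b)
          {
                  /* Historical pattern: one access_ok() up front, then several
                   * "unchecked" __get_user() accesses. */
                  if (!access_ok(frame, sizeof(*frame)))
                          return -EFAULT;
                  if (__get_user(*a, &frame->a) || __get_user(*b, &frame->b))
                          return -EFAULT;

                  /* With __get_user() aliased to get_user(), each access carries
                   * its own range check, and it becomes impossible to access user
                   * space without the proper checks in place. */
                  return 0;
          }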
      
      Do let's not break the valid cases to catch invalid cases that shouldn't
      even exist.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tobin C. Harding <tobin@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Jann Horn <jannh@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      53a41cb7
  4. 04 Feb, 2019 4 commits
    • sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock() · c546951d
      Andrea Parri authored
      
      
      move_queued_task() synchronizes with task_rq_lock() as follows:
      
      	move_queued_task()		task_rq_lock()
      
      	[S] ->on_rq = MIGRATING		[L] rq = task_rq()
      	WMB (__set_task_cpu())		ACQUIRE (rq->lock);
      	[S] ->cpu = new_cpu		[L] ->on_rq
      
      where "[L] rq = task_rq()" is ordered before "ACQUIRE (rq->lock)" by an
      address dependency and, in turn, "ACQUIRE (rq->lock)" is ordered before
      "[L] ->on_rq" by the ACQUIRE itself.
      
      Use READ_ONCE() to load ->cpu in task_rq() (c.f., task_cpu()) to honor
      this address dependency.  Also, mark the accesses to ->cpu and ->on_rq
      with READ_ONCE()/WRITE_ONCE() to comply with the LKMM.
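
      Condensed to its essentials, the resulting access pattern looks like the
      sketch below (simplified; not the full move_queued_task()/task_rq_lock()
      bodies):

          /* writer side, move_queued_task()-like */
          WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);     /* [S] ->on_rq */
          smp_wmb();                                      /* WMB (__set_task_cpu()) */
          WRITE_ONCE(p->cpu, new_cpu);                    /* [S] ->cpu = new_cpu */

          /* reader side, task_rq_lock()-like */
          rq = cpu_rq(READ_ONCE(p->cpu));                 /* [L] rq = task_rq(), address dep. */
          raw_spin_lock(&rq->lock);                       /* ACQUIRE (rq->lock) */
          on_rq = READ_ONCE(p->on_rq);                    /* [L] ->on_rq, ordered by ACQUIRE */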
      
      Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Link: https://lkml.kernel.org/r/20190121155240.27173-1-andrea.parri@amarulasolutions.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c546951d
    • sched/fair: Update scale invariance of PELT · 23127296
      vingu-linaro authored
      
      
      The current implementation of load tracking invariance scales the
      contribution with the current frequency and uarch performance (only for
      utilization) of the CPU. One main result of this formula is that the
      figures are capped by the current capacity of the CPU. Another one is that
      the load_avg is not invariant because it is not scaled with uarch.
      
      The util_avg of a periodic task that runs r time slots every p time slots
      varies in the range:

          U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)

      where U is the max util_avg value = SCHED_CAPACITY_SCALE
      
      At a lower capacity, the range becomes:

          U * C * (1-y^r')/(1-y^p) * y^i' < Utilization <  U * C * (1-y^r')/(1-y^p)

      where C reflects the compute capacity ratio between the current capacity
      and the max capacity.

      So C tries to compensate for changes in (1-y^r'), but it can't be accurate.
      
      Instead of scaling the contribution value of the PELT algorithm, we should
      scale the running time. The PELT signal aims to track the amount of
      computation of tasks and/or of the rq, so it seems more correct to scale
      the running time to reflect the effective amount of computation done since
      the last update.
      
      In order to be fully invariant, we need to apply the same amount of
      running time and idle time whatever the current capacity. Because running
      at a lower capacity implies that the task will run longer, we have to
      ensure that the same amount of idle time will be applied when the system
      becomes idle and no idle time has been "stolen". But reaching the maximum
      utilization value (SCHED_CAPACITY_SCALE) means that the task is seen as an
      always-running task whatever the capacity of the CPU (even at max compute
      capacity). In this case, we can discard this "stolen" idle time, which
      becomes meaningless.
      
      In order to achieve this time scaling, a new clock_pelt is created per rq.
      The increase of this clock scales with the current capacity when something
      is running on the rq and synchronizes with clock_task when the rq is idle.
      With this mechanism, we ensure the same running and idle time whatever the
      current capacity. This also makes it possible to simplify the PELT
      algorithm by removing all references to uarch and frequency and applying
      the same contribution to utilization and loads. Furthermore, the scaling is
      done only once per update of the clock (update_rq_clock_task()) instead of
      during each update of the sched_entities and cfs/rt/dl_rq of the rq as in
      the current implementation. This is interesting when cgroups are involved,
      as shown in the results below:
      
      The tests were run on a hikey (octo Arm64 platform) with the performance
      cpufreq governor and only the shallowest c-state enabled, to remove the
      variance generated by those power features so that we only track the
      impact of the PELT algorithm.

      Each test runs 16 times:
      
      	./perf bench sched pipe
      	(higher is better)
      	kernel	tip/sched/core     + patch
      	        ops/seconds        ops/seconds         diff
      	cgroup
      	root    59652(+/- 0.18%)   59876(+/- 0.24%)    +0.38%
      	level1  55608(+/- 0.27%)   55923(+/- 0.24%)    +0.57%
      	level2  52115(+/- 0.29%)   52564(+/- 0.22%)    +0.86%
      
      	hackbench -l 1000
      	(lower is better)
      	kernel	tip/sched/core     + patch
      	        duration(sec)      duration(sec)        diff
      	cgroup
      	root    4.453(+/- 2.37%)   4.383(+/- 2.88%)     -1.57%
      	level1  4.859(+/- 8.50%)   4.830(+/- 7.07%)     -0.60%
      	level2  5.063(+/- 9.83%)   4.928(+/- 9.66%)     -2.66%
      
      The responsiveness of PELT is also improved when the CPU is not running at
      max capacity with this new algorithm. I have put below some examples of the
      duration needed to reach some typical load values according to the capacity
      of the CPU, with the current implementation and with this patch. These
      values have been computed based on the geometric series and the half-period
      value:
      
        Util (%)     max capacity  half capacity(mainline)  half capacity(w/ patch)
        972 (95%)    138ms         not reachable            276ms
        486 (47.5%)  30ms          138ms                     60ms
        256 (25%)    13ms           32ms                     26ms
      
      On my hikey (octo Arm64 platform) with the schedutil governor, the time to
      reach the max OPP when starting from a null utilization decreases from
      223ms with the current scale invariance down to 121ms with the new
      algorithm.
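
      A rough sketch of the time-scaling step described above (illustrative
      only; rq_is_idle() and the scaling-factor parameters are placeholders, not
      the actual update_rq_clock_pelt() code):

          /* Scale the elapsed running time by the current compute capacity so
           * that PELT sees the effective amount of computation (sketch). */
          static void advance_clock_pelt(struct rq *rq, u64 delta,
                                         unsigned long cpu_cap, unsigned long freq_cap)
          {
                  if (rq_is_idle(rq)) {
                          /* sync with clock_task so no idle time is "stolen" */
                          rq->clock_pelt = rq_clock_task(rq);
                          return;
                  }

                  /* running slower means less effective work per unit of time */
                  delta = (delta * cpu_cap)  >> SCHED_CAPACITY_SHIFT;
                  delta = (delta * freq_cap) >> SCHED_CAPACITY_SHIFT;
                  rq->clock_pelt += delta;
          }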
      
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten.Rasmussen@arm.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: patrick.bellasi@arm.com
      Cc: pjt@google.com
      Cc: pkondeti@codeaurora.org
      Cc: quentin.perret@arm.com
      Cc: rjw@rjwysocki.net
      Cc: srinivas.pandruvada@linux.intel.com
      Cc: thara.gopinath@linaro.org
      Link: https://lkml.kernel.org/r/1548257214-13745-3-git-send-email-vincent.guittot@linaro.org
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      23127296
    • sched/core: Convert task_struct.stack_refcount to refcount_t · f0b89d39
      Elena Reshetova authored
      atomic_t variables are currently used to implement reference
      counters with the following properties:
      
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to use-after-free situation and be exploitable.
      
      The variable task_struct.stack_refcount is used as a pure reference counter.
      Convert it to refcount_t and fix up the operations.
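
      The conversion itself follows the usual atomic_t -> refcount_t pattern,
      roughly (sketch):

          /* before: open-coded atomic_t reference counting */
          atomic_set(&tsk->stack_refcount, 1);
          got = atomic_inc_not_zero(&tsk->stack_refcount);        /* try_get_task_stack() */
          if (atomic_dec_and_test(&tsk->stack_refcount))          /* put_task_stack() */
                  release_task_stack(tsk);

          /* after: refcount_t saturates on over/underflow instead of wrapping */
          refcount_set(&tsk->stack_refcount, 1);
          got = refcount_inc_not_zero(&tsk->stack_refcount);      /* try_get_task_stack() */
          if (refcount_dec_and_test(&tsk->stack_refcount))        /* put_task_stack() */
                  release_task_stack(tsk);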
      
      ** Important note for maintainers:
      
      Some functions from refcount_t API defined in lib/refcount.c
      have different memory ordering guarantees than their atomic
      counterparts.
      
      The full comparison can be seen in
      https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
      in a state to be merged into the documentation tree.
      
      Normally the differences should not matter since refcount_t provides
      enough guarantees to satisfy the refcounting use cases, but in
      some rare cases it might matter.
      
      Please double check that you don't have some undocumented
      memory guarantees for this variable usage.
      
      For the task_struct.stack_refcount it might make a difference
      in the following places:
      
       - try_get_task_stack(): increment in refcount_inc_not_zero() only
         guarantees control dependency on success vs. fully ordered
         atomic counterpart
       - put_task_stack(): decrement in refcount_dec_and_test() only
         provides RELEASE ordering and control dependency on success
         vs. fully ordered atomic counterpart
      
      Suggested-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: David Windsor <dwindsor@gmail.com>
      Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
      Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: viro@zeniv.linux.org.uk
      Link: https://lkml.kernel.org/r/1547814450-18902-6-git-send-email-elena.reshetova@intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f0b89d39
    • sched/core: Convert task_struct.usage to refcount_t · ec1d2819
      Elena Reshetova authored
      atomic_t variables are currently used to implement reference
      counters with the following properties:
      
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to use-after-free situation and be exploitable.
      
      The variable task_struct.usage is used as a pure reference counter.
      Convert it to refcount_t and fix up the operations.
      
      ** Important note for maintainers:
      
      Some functions from refcount_t API defined in lib/refcount.c
      have different memory ordering guarantees than their atomic
      counterparts.
      
      The full comparison can be seen in
      https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
      in a state to be merged into the documentation tree.
      
      Normally the differences should not matter since refcount_t provides
      enough guarantees to satisfy the refcounting use cases, but in
      some rare cases it might matter.
      
      Please double check that you don't have some undocumented
      memory guarantees for this variable usage.
      
      For the task_struct.usage it might make a difference
      in the following places:
      
       - put_task_struct(): decrement in refcount_dec_and_test() only
         provides RELEASE ordering and control dependency on success
         vs. fully ordered atomic counterpart
      
      Suggested-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: David Windsor <dwindsor@gmail.com>
      Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
      Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: viro@zeniv.linux.org.uk
      Link: https://lkml.kernel.org/r/1547814450-18902-5-git-send-email-elena.reshetova@intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ec1d2819
  5. 03 Feb, 2019 1 commit
  6. 02 Feb, 2019 1 commit
    • x86/resctrl: Avoid confusion over the new X86_RESCTRL config · e6d42931
      Johannes Weiner authored
      
      
      "Resource Control" is a very broad term for this CPU feature, and a term
      that is also associated with containers, cgroups etc. This can easily
      cause confusion.
      
      Make the user prompt more specific. Match the config symbol name.
      
       [ bp: In the future, the corresponding ARM arch-specific code will be
         under ARM_CPU_RESCTRL and the arch-agnostic bits will be carved out
         under the CPU_RESCTRL umbrella symbol. ]
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Babu Moger <Babu.Moger@amd.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: linux-doc@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pu Wen <puwen@hygon.cn>
      Cc: Reinette Chatre <reinette.chatre@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190130195621.GA30653@cmpxchg.org
      e6d42931
  7. 29 Jan, 2019 2 commits
    • x86/speculation: Add PR_SPEC_DISABLE_NOEXEC · 71368af9
      Waiman Long authored
      
      
      With the default SPEC_STORE_BYPASS_SECCOMP/SPEC_STORE_BYPASS_PRCTL mode,
      the TIF_SSBD bit will be inherited when a new task is fork'ed or cloned.
      It will also remain when a new program is execve'ed.
      
      Only a certain class of applications (like Java) that can run on behalf of
      multiple users on a single thread will require disabling speculative store
      bypass for security purposes. Those applications will call prctl(2) at
      startup time to disable SSB. They won't rely on SSB possibly having already
      been disabled. Other applications that don't need SSBD will just move on
      without checking whether SSBD has been turned on or not.
      
      The fact that TIF_SSBD is inherited across the execve(2) boundary means
      that applications which don't need SSBD, but whose predecessors had SSBD
      on, will be unwittingly impacted performance-wise, especially if they
      write to memory a lot.
      
      To remedy this problem, a new PR_SPEC_DISABLE_NOEXEC argument for the
      PR_SET_SPECULATION_CTRL option of prctl(2) is added to allow applications
      to specify that the SSBD feature bit on the task structure should be
      cleared whenever a new program is being execve'ed.
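
      Illustrative usage (the PR_SPEC_* constants come from <linux/prctl.h> via
      <sys/prctl.h> on headers that are new enough):

          #include <sys/prctl.h>

          /* Disable speculative store bypass for this task, but have the SSBD
           * bit cleared again when a new program is execve'ed. */
          static int disable_ssb_until_exec(void)
          {
                  return prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS,
                               PR_SPEC_DISABLE_NOEXEC, 0, 0);
          }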
      
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: linux-doc@vger.kernel.org
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: KarimAllah Ahmed <karahmed@amazon.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Link: https://lkml.kernel.org/r/1547676096-3281-1-git-send-email-longman@redhat.com
      71368af9
    • sched: Remove stale PF_MUTEX_TESTER bit · 15917dc0
      Thomas Gleixner authored
      
      
      The RTMUTEX tester was removed long ago but the PF bit stayed
      around. Remove it and free up the space.
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      15917dc0
  8. 25 Jan, 2019 1 commit
  9. 12 Jan, 2019 1 commit
    • umh: add exit routine for UMH process · 73ab1cb2
      Taehee Yoo authored
      
      
      A UMH process which is created by fork_usermode_blob(), such as the
      bpfilter one, needs to release the members of its umh_info when the
      process is terminated.
      But do_exit() does not release the members of the umh_info, hence a module
      which uses UMH needs its own code to detect whether the UMH process has
      terminated or not.
      But this implementation needs extra code for checking the status of the
      UMH process, and it eventually makes the code more complex.
      
      A new PF_UMH flag is added and is used to identify UMH processes.
      The exit_umh() routine itself does not release the members of the umh_info.
      Hence the umh_info->cleanup callback should release both the members of
      the umh_info and the private data.
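
      A hypothetical sketch of the module side (the umh_info field and callback
      signature shown here are assumptions for illustration, not taken from this
      changelog):

          /* Hypothetical module-side usage: free private state once the UMH
           * process exits; names other than fork_usermode_blob() are assumed. */
          struct my_umh_state {
                  struct umh_info info;
                  void *private_data;
          };

          static void my_umh_cleanup(struct umh_info *info)
          {
                  struct my_umh_state *s = container_of(info, struct my_umh_state, info);

                  kfree(s->private_data);
                  kfree(s);               /* also releases the umh_info container */
          }

          /* after a successful fork_usermode_blob(blob, len, &s->info): */
          /*         s->info.cleanup = my_umh_cleanup;                    */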
      
      Suggested-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      73ab1cb2
  10. 09 Jan, 2019 1 commit
  11. 03 Dec, 2018 1 commit
    • sched: Fix various typos in comments · dfcb245e
      Ingo Molnar authored
      
      
      Go over the scheduler source code and fix common typos
      in comments - and a typo in an actual variable name.
      
      No change in functionality intended.
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      dfcb245e
  12. 28 Nov, 2018 2 commits
    • x86/speculation: Add prctl() control for indirect branch speculation · 9137bb27
      Thomas Gleixner authored
      
      
      Add the PR_SPEC_INDIRECT_BRANCH option for the PR_GET_SPECULATION_CTRL and
      PR_SET_SPECULATION_CTRL prctls to allow fine grained per task control of
      indirect branch speculation via STIBP and IBPB.
      
      Invocations:
       Check indirect branch speculation status with
       - prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, 0, 0, 0);
      
       Enable indirect branch speculation with
       - prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_ENABLE, 0, 0);
      
       Disable indirect branch speculation with
       - prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_DISABLE, 0, 0);
      
       Force disable indirect branch speculation with
       - prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_FORCE_DISABLE, 0, 0);
      
      See Documentation/userspace-api/spec_ctrl.rst.
      
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181125185005.866780996@linutronix.de
      9137bb27
    • function_graph: Use new curr_ret_depth to manage depth instead of curr_ret_stack · 39eb456d
      Steven Rostedt (VMware) authored
      Currently, the depth of the ret_stack is determined by curr_ret_stack index.
      The issue is that there's a race between setting of the curr_ret_stack and
      calling of the callback attached to the return of the function.
      
      Commit 03274a3f ("tracing/fgraph: Adjust fgraph depth before calling
      trace return callback") moved the calling of the callback to after the
      setting of the curr_ret_stack, even stating that it was safe to do so, when
      in fact, it was the reason there was a barrier() there (yes, I should have
      commented that barrier()).
      
      Not only does the curr_ret_stack keep track of the current call graph depth,
      it also keeps the ret_stack content from being overwritten by new data.
      
      The function profiler uses the "subtime" variable of the ret_stack
      structure, and by moving the curr_ret_stack, it allows interrupts to use
      the same structure it was using, corrupting the data and breaking the
      profiler.
      
      To fix this, there need to be two variables to handle the call stack depth
      and the pointer to where the ret_stack is being used, as they need to
      change at two different locations.
      
      Cc: stable@kernel.org
      Fixes: 03274a3f ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      39eb456d
  13. 22 Nov, 2018 1 commit
    • x86/resctrl: Rename the config option INTEL_RDT to RESCTRL · 6fe07ce3
      Babu Moger authored
      
      
      The resource control feature is supported by both Intel and AMD. So,
      rename CONFIG_INTEL_RDT to the vendor-neutral CONFIG_RESCTRL.
      
      Now CONFIG_RESCTRL will be used for both Intel and AMD to enable
      Resource Control support. Update the texts in config and condition
      accordingly.
      
       [ bp: Simplify Kconfig text. ]
      
      Signed-off-by: Babu Moger <babu.moger@amd.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Dmitry Safonov <dima@arista.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <linux-doc@vger.kernel.org>
      Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Pu Wen <puwen@hygon.cn>
      Cc: <qianyue.zj@alibaba-inc.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Reinette Chatre <reinette.chatre@intel.com>
      Cc: Rian Hunter <rian@alum.mit.edu>
      Cc: Sherry Hurwitz <sherry.hurwitz@amd.com>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Lendacky <Thomas.Lendacky@amd.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: <xiaochen.shen@intel.com>
      Link: https://lkml.kernel.org/r/20181121202811.4492-9-babu.moger@amd.com
      6fe07ce3
  14. 12 Nov, 2018 1 commit
    • rcu: Speed up expedited GPs when interrupting RCU reader · 05f41571
      Paul E. McKenney authored
      
      
      In PREEMPT kernels, an expedited grace period might send an IPI to a
      CPU that is executing an RCU read-side critical section.  In that case,
      it would be nice if the rcu_read_unlock() directly interacted with the
      RCU core code to immediately report the quiescent state.  And this does
      happen in the case where the reader has been preempted.  But it would
      also be a nice performance optimization if immediate reporting also
      happened in the preemption-free case.
      
      This commit therefore adds an ->exp_hint field to the task_struct structure's
      ->rcu_read_unlock_special field.  The IPI handler sets this hint when
      it has interrupted an RCU read-side critical section, and this causes
      the outermost rcu_read_unlock() call to invoke rcu_read_unlock_special(),
      which, if preemption is enabled, reports the quiescent state immediately.
      If preemption is disabled, then the report is required to be deferred
      until preemption (or bottom halves or interrupts or whatever) is re-enabled.
      
      Because this is a hint, it does nothing for more complicated cases.  For
      example, if the IPI interrupts an RCU reader, but interrupts are disabled
      across the rcu_read_unlock(), but another rcu_read_lock() is executed
      before interrupts are re-enabled, the hint will already have been cleared.
      If you do crazy things like this, reporting will be deferred until some
      later RCU_SOFTIRQ handler, context switch, cond_resched(), or similar.
      
      Reported-by: Joel Fernandes <joel@joelfernandes.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      05f41571
  15. 26 Oct, 2018 2 commits
  16. 03 Oct, 2018 1 commit
    • signal: Distinguish between kernel_siginfo and siginfo · ae7795bc
      Eric W. Biederman authored
      
      
      Linus recently observed that if we did not worry about the padding
      member in struct siginfo it is only about 48 bytes, and 48 bytes is
      much nicer than 128 bytes for allocating on the stack and copying
      around in the kernel.
      
      The obvious thing of only adding the padding when userspace is
      including siginfo.h won't work as there are sigframe definitions in
      the kernel that embed struct siginfo.
      
      So split siginfo in two: kernel_siginfo and siginfo, keeping the
      traditional name for the userspace definition, while the version that
      is used internally in the kernel, and ultimately will not be padded to
      128 bytes, is called kernel_siginfo.
      
      The definition of struct kernel_siginfo I have put in include/signal_types.h
      
      A set of buildtime checks has been added to verify the two structures have
      the same field offsets.
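
      The checks can be as simple as offset comparisons along these lines (a
      sketch; the macro and function names are illustrative):

          /* Sketch: fail the build if the two layouts ever diverge. */
          #define CHECK_SI_OFFSET(field)                                  \
                  BUILD_BUG_ON(offsetof(struct siginfo, field) !=         \
                               offsetof(struct kernel_siginfo, field))

          static inline void check_siginfo_layout(void)
          {
                  CHECK_SI_OFFSET(si_signo);
                  CHECK_SI_OFFSET(si_errno);
                  CHECK_SI_OFFSET(si_code);
                  /* ... one check per field ... */
          }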
      
      To make it easy to verify the change, kernel_siginfo retains the same
      size as siginfo.  The reduction in size comes in a following change.
      
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      ae7795bc
  17. 04 Sep, 2018 2 commits
    • fs/proc: Show STACKLEAK metrics in the /proc file system · c8d12627
      Alexander Popov authored
      
      
      Introduce CONFIG_STACKLEAK_METRICS providing STACKLEAK information about
      tasks via the /proc file system. In particular, /proc/<pid>/stack_depth
      shows the maximum kernel stack consumption for the current and previous
      syscalls. Although this information is not precise, it can be useful for
      estimating the STACKLEAK performance impact for your workloads.
      
      Suggested-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Alexander Popov <alex.popov@linux.com>
      Tested-by: Laura Abbott <labbott@redhat.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      c8d12627
    • x86/entry: Add STACKLEAK erasing the kernel stack at the end of syscalls · afaef01c
      Alexander Popov authored
      The STACKLEAK feature (initially developed by PaX Team) has the following
      benefits:
      
      1. Reduces the information that can be revealed through kernel stack leak
         bugs. The idea of erasing the thread stack at the end of syscalls is
         similar to CONFIG_PAGE_POISONING and memzero_explicit() in kernel
         crypto, which all comply with FDP_RIP.2 (Full Residual Information
         Protection) of the Common Criteria standard.
      
      2. Blocks some uninitialized stack variable attacks (e.g. CVE-2017-17712,
         CVE-2010-2963). That kind of bug should be killed by improving C
         compilers in the future, which might take a long time.
      
      This commit introduces the code filling the used part of the kernel
      stack with a poison value before returning to userspace. Full
      STACKLEAK feature also contains the gcc plugin which comes in a
      separate commit.
      
      The STACKLEAK feature is ported from grsecurity/PaX. More information at:
        https://grsecurity.net/
        https://pax.grsecurity.net/
      
      
      
      This code is modified from Brad Spengler/PaX Team's code in the last
      public patch of grsecurity/PaX based on our understanding of the code.
      Changes or omissions from the original code are ours and don't reflect
      the original grsecurity/PaX code.
      
      Performance impact:
      
      Hardware: Intel Core i7-4770, 16 GB RAM
      
      Test #1: building the Linux kernel on a single core
              0.91% slowdown
      
      Test #2: hackbench -s 4096 -l 2000 -g 15 -f 25 -P
              4.2% slowdown
      
      So the STACKLEAK description in Kconfig includes: "The tradeoff is the
      performance impact: on a single CPU system kernel compilation sees a 1%
      slowdown, other systems and workloads may vary and you are advised to
      test this feature on your expected workload before deploying it".
      
      Signed-off-by: Alexander Popov <alex.popov@linux.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      afaef01c
  18. 03 Sep, 2018 1 commit
    • x86/fault: BUG() when uaccess helpers fault on kernel addresses · 9da3f2b7
      Jann Horn authored
      There have been multiple kernel vulnerabilities that permitted userspace to
      pass completely unchecked pointers through to userspace accessors:
      
       - the waitid() bug - commit 96ca579a ("waitid(): Add missing
         access_ok() checks")
       - the sg/bsg read/write APIs
       - the infiniband read/write APIs
      
      These don't happen all that often, but when they do happen, it is hard to
      test for them properly; and it is probably also hard to discover them with
      fuzzing. Even when an unmapped kernel address is supplied to such buggy
      code, it just returns -EFAULT instead of doing a proper BUG() or at least
      WARN().
      
      Try to make such misbehaving code a bit more visible by refusing to do a
      fixup in the pagefault handler code when a userspace accessor causes a #PF
      on a kernel address and the current context isn't whitelisted.
      
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: kernel-hardening@lists.openwall.com
      Cc: dvyukov@google.com
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: Borislav Petkov <bp@alien8.de>
      Link: https://lkml.kernel.org/r/20180828201421.157735-7-jannh@google.com
      9da3f2b7
  19. 30 Aug, 2018 1 commit
  20. 22 Aug, 2018 1 commit
    • kernel/hung_task.c: allow to set checking interval separately from timeout · a2e51445
      Dmitry Vyukov authored
      Currently the task hung checking interval is equal to the timeout; as a
      result, a hang is detected anywhere between timeout and 2*timeout.  This is
      fine for most interactive environments, but this hurts automated testing
      setups (syzbot).  In an automated setup we need to strictly order CPU
      lockup < RCU stall < workqueue lockup < task hung < silent loss, so that an
      RCU stall is not detected as a task hang and a task hang is not detected as
      silent machine loss.  The large variance in task hung detection timeout
      requires setting the silent machine loss timeout to a very large value
      (e.g. if task hung is 3 mins, then silent loss needs to be set to ~7 mins).
      The additional 3 minutes significantly reduce testing efficiency because
      usually we crash the kernel within a minute, and this can add hours to the
      bug localization process as it needs to do dozens of tests.
      
      Allow setting the checking interval separately from the timeout.  This
      makes it possible to set the timeout to, say, 3 minutes, but the checking
      interval to 10 secs.
      
      The interval is controlled via a new hung_task_check_interval_secs sysctl,
      similar to the existing hung_task_timeout_secs sysctl.  The default value
      of 0 results in the current behavior: checking interval is equal to
      timeout.
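
      For example (illustrative), a test harness could configure a 3-minute
      timeout with a 10-second scan:

          /* Illustrative: detect hangs after 3 minutes but scan every 10 seconds. */
          #include <stdio.h>

          static void write_sysctl(const char *path, const char *val)
          {
                  FILE *f = fopen(path, "w");

                  if (f) {
                          fputs(val, f);
                          fclose(f);
                  }
          }

          int main(void)
          {
                  write_sysctl("/proc/sys/kernel/hung_task_timeout_secs", "180");
                  write_sysctl("/proc/sys/kernel/hung_task_check_interval_secs", "10");
                  return 0;
          }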
      
      [akpm@linux-foundation.org: update hung_task_timeout_max's comment]
      Link: http://lkml.kernel.org/r/20180611111004.203513-1-dvyukov@google.com
      
      
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a2e51445
  21. 17 Aug, 2018 3 commits
    • mm: introduce CONFIG_MEMCG_KMEM as combination of CONFIG_MEMCG && !CONFIG_SLOB · 84c07d11
      Kirill Tkhai authored
      Introduce a new config option, which is used to replace the repeating
      CONFIG_MEMCG && !CONFIG_SLOB pattern.  The next patches add a little more
      memcg+kmem related code, so let's keep the defines clean.
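
      In other words, the preprocessor pattern collapses from the open-coded
      check to the new symbol (the guarded function here is just a placeholder):

          /* before */
          #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
          static void memcg_kmem_example(void) { /* kmem-specific code */ }
          #endif

          /* after */
          #ifdef CONFIG_MEMCG_KMEM
          static void memcg_kmem_example(void) { /* kmem-specific code */ }
          #endif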
      
      Link: http://lkml.kernel.org/r/153063053670.1818.15013136946600481138.stgit@localhost.localdomain
      
      
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: Shakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      84c07d11
    • memcg, oom: move out_of_memory back to the charge path · 29ef680a
      Michal Hocko authored
      Commit 3812c8c8 ("mm: memcg: do not trap chargers with full
      callstack on OOM") has changed the ENOMEM semantic of memcg charges.
      Rather than invoking the oom killer from the charging context it delays
      the oom killer to the page fault path (pagefault_out_of_memory).  This
      in turn means that many users (e.g.  slab or g-u-p) will get ENOMEM when
      the corresponding memcg hits the hard limit and the memcg is OOM.
      This behavior is inconsistent with the !memcg case, where the oom killer
      is invoked from the allocation context and the allocator keeps retrying
      until it succeeds.
      
      The difference in the behavior is user visible.  mmap(MAP_POPULATE)
      might result in not fully populated ranges while the mmap return code
      doesn't tell that to the userspace.  Random syscalls might fail with
      ENOMEM etc.
      
      The primary motivation of the different memcg oom semantic was the
      deadlock avoidance.  Things have changed since then, though.  We have an
      async oom teardown by the oom reaper now and so we do not have to rely
      on the victim to tear down its memory anymore.  Therefore we can return
      to the original semantic as long as the memcg oom killer is not handed
      over to user space.
      
      There is still one thing to be careful about here though.  If the oom
      killer is not able to make any forward progress - e.g.  because there is
      no eligible task to kill - then we have to bail out of the charge path
      to prevent from same class of deadlocks.  We have basically two options
      here.  Either we fail the charge with ENOMEM or force the charge and
      allow overcharge.  The first option has been considered more harmful
      than useful because rare inconsistencies in the ENOMEM behavior are hard
      to test for and error prone.  Basically the same reason why the page
      allocator doesn't fail allocations under such conditions.  The latter
      might allow runaways, but those should be really unlikely unless somebody
      misconfigures the system, e.g. by allowing tasks to be migrated away from
      the memcg to a different unlimited memcg with move_charge_at_immigrate
      disabled.
      
      Link: http://lkml.kernel.org/r/20180628151101.25307-1-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      29ef680a
    • fs: fsnotify: account fsnotify metadata to kmemcg · d46eb14b
      Shakeel Butt authored
      Patch series "Directed kmem charging", v8.
      
      The Linux kernel's memory cgroup allows limiting the memory usage of the
      jobs running on the system to provide isolation between the jobs.  All
      the kernel memory allocated in the context of the job and marked with
      __GFP_ACCOUNT will also be included in the memory usage and be limited
      by the job's limit.
      
      The kernel memory can only be charged to the memcg of the process in
      whose context the kernel memory was allocated.  However there are cases
      where the allocated kernel memory should be charged to a memcg
      different from the current process's memcg.  This patch series
      contains two such concrete use-cases, i.e. fsnotify and buffer_head.
      
      The fsnotify event objects can consume a lot of system memory for large
      or unlimited queues if there is either no listener or a slow one.  The
      events are allocated in the context of the event producer.  However they
      should be charged to the event consumer.  Similarly the buffer_head
      objects can be allocated in a memcg different from the memcg of the page
      for which buffer_head objects are being allocated.
      
      To solve this issue, this patch series introduces a mechanism to charge
      kernel memory to a given memcg.  In the case of fsnotify events, the memcg
      of the consumer can be used for charging, and for buffer_head, the memcg
      of the page can be charged.  For directed charging, the caller can use
      the scope API memalloc_[un]use_memcg() to specify the memcg to charge
      for all the __GFP_ACCOUNT allocations within the scope.
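
      Usage of the scope API then looks roughly like this (sketch; the cache
      pointer passed in is a placeholder):

          /* Charge an allocation made in the producer's context to the
           * listener's memcg (the memcg reference saved in fsnotify_group). */
          static void *alloc_event_for_listener(struct fsnotify_group *group,
                                                struct kmem_cache *event_cachep)
          {
                  void *event;

                  memalloc_use_memcg(group->memcg);       /* begin remote-charging scope */
                  event = kmem_cache_alloc(event_cachep, GFP_KERNEL_ACCOUNT);
                  memalloc_unuse_memcg();                 /* end the scope */

                  return event;
          }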
      
      This patch (of 2):
      
      A lot of memory can be consumed by the events generated for huge or
      unlimited queues if there is either no listener or a slow one.  This can
      cause system-level memory pressure or OOMs.  So, it's better to account
      the fsnotify kmem caches to the memcg of the listener.
      
      However the listener can be in a different memcg than the memcg of the
      producer and these allocations happen in the context of the event
      producer.  This patch introduces a remote memcg charging API which the
      producer can use to charge the allocations to the memcg of the listener.
      
      There are seven fsnotify kmem caches and among them allocations from
      dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
      inotify_inode_mark_cachep happen in the context of a syscall from the
      listener.  So, SLAB_ACCOUNT is enough for these caches.
      
      The objects from fsnotify_mark_connector_cachep are not accounted as
      they are small compared to the notification mark or events and it is
      unclear whom to account connector to since it is shared by all events
      attached to the inode.
      
      The allocations from the event caches happen in the context of the event
      producer.  For such caches we will need to remote charge the allocations
      to the listener's memcg.  Thus we save the memcg reference in the
      fsnotify_group structure of the listener.
      
      This patch has also moved the members of fsnotify_group around to keep the
      size the same, at least for a 64-bit build, even with the additional
      member, by filling the holes.
      
      [shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
        Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
      
      
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d46eb14b
  22. 25 Jul, 2018 1 commit
  23. 21 Jul, 2018 3 commits
    • pid: Implement PIDTYPE_TGID · 6883f81a
      Eric W. Biederman authored
      
      
      Everywhere except in the pid array we distinguish between a task's pid and
      a task's tgid (thread group id).  Even in the enumeration we want that
      distinction sometimes, so we have added __PIDTYPE_TGID.  With leader_pid
      we almost have an implementation of PIDTYPE_TGID in struct signal_struct.
      
      Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
      into the pids array.  Then remove the __PIDTYPE_TGID special case and the
      leader_pid in signal_struct.
      
      The net size increase is just an extra pointer added to struct pid and
      an extra pair of pointers of an hlist_node added to task_struct.
      
      The effect on code maintenance is the removal of a number of special
      cases today and the potential to remove many more as PIDTYPE_TGID gets
      used to its fullest.  The long-term potential is allowing zombie thread
      group leaders to exit, which will remove a lot more special cases in
      the code.
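
      For reference, the resulting enumeration looks roughly like this (see
      include/linux/pid.h for the authoritative definition):

          enum pid_type {
                  PIDTYPE_PID,
                  PIDTYPE_TGID,   /* new first-class member */
                  PIDTYPE_PGID,
                  PIDTYPE_SID,
                  PIDTYPE_MAX,
          };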
      
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      6883f81a
    • Eric W. Biederman's avatar
      pids: Move the pgrp and session pid pointers from task_struct to signal_struct · 2c470475
      Eric W. Biederman authored
      
      
      To access these fields the code always has to go to the group leader,
      so going to the signal struct is no loss and is actually a fundamental
      simplification.
      
      This saves a little bit of memory by allocating the pid pointer array
      only once instead of once for every thread, and, even better, it
      removes a few potential races caused by the fact that group_leader can
      be changed by de_thread() while signal_struct cannot.
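
      A rough sketch of the resulting shape (approximate, not the exact
      diff):

          struct signal_struct {
                  ...
                  /* PIDs shared by the whole thread group. */
                  struct pid *pids[PIDTYPE_MAX];
                  ...
          };

          static inline struct pid *task_pgrp(struct task_struct *task)
          {
                  /* Per-process state: no need to chase group_leader. */
                  return task->signal->pids[PIDTYPE_PGID];
          }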
      
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      2c470475
    • Eric W. Biederman's avatar
      pids: Compute task_tgid using signal->leader_pid · 7a36094d
      Eric W. Biederman authored
      
      
      The cost is the same, and this removes the need to worry about
      complications that come from de_thread() and group_leader changing.
      
      __task_pid_nr_ns has been updated to take advantage of this change.
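
      Roughly, the resulting helper amounts to the following (an approximate
      sketch, not the literal diff):

          static inline struct pid *task_tgid(struct task_struct *task)
          {
                  /* signal_struct is stable across de_thread(), unlike
                   * group_leader. */
                  return task->signal->leader_pid;
          }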
      
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      7a36094d
  24. 17 Jul, 2018 1 commit
  25. 09 Jul, 2018 1 commit
    • Josef Bacik's avatar
      blkcg: add generic throttling mechanism · d09d8df3
      Josef Bacik authored
      
      
      Since IO can be issued from literally anywhere, it's almost impossible
      to do throttling without having some sort of adverse effect somewhere
      else in the system because of locking or other dependencies.  The best
      way to solve this is to do the throttling when we know we aren't
      holding any other kernel resources.  Do this by tracking throttling on
      a per-blkg basis, and if throttling is required, flag the task so that
      it checks before returning to user space and possibly sleeps there.
      
      This is to address the case where a process is doing work that
      generates IO which can't be throttled, whether that is directly with a
      lot of REQ_META IO, or indirectly by allocating so much memory that it
      is swamping the disk with REQ_SWAP.  We can't use task_work_add() as we
      don't want to induce a memory allocation in the IO path, so simply
      saving the request queue in the task and flagging it to do the
      notify_resume work achieves the same result without the overhead of a
      memory allocation.
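
      A rough sketch of the pattern (function and field names approximate;
      see the patch for the real interface):

          /* Block layer: a blkg is over its limit, ask the task to throttle
           * itself on its way back to user space. */
          void blkcg_schedule_throttle(struct request_queue *q, bool use_memdelay)
          {
                  if (unlikely(current->flags & PF_KTHREAD))
                          return;
                  current->throttle_queue = q;          /* saved in task_struct */
                  current->use_memdelay = use_memdelay;
                  set_notify_resume(current);           /* no allocation needed */
          }

          /* Called from the notify_resume path with no locks held, so it is
           * safe to sleep here for the throttling delay. */
          void blkcg_maybe_throttle_current(void);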
      
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d09d8df3
  26. 03 Jul, 2018 1 commit
    • Peter Zijlstra's avatar
      kthread, sched/core: Fix kthread_parkme() (again...) · 1cef1150
      Peter Zijlstra authored
      Gaurav reports that commit 85f1abe0 ("kthread, sched/wait: Fix
      kthread_parkme() completion issue") isn't working for him, because of
      the following race:
      
      > controller Thread                               CPUHP Thread
      > takedown_cpu
      > kthread_park
      > kthread_parkme
      > Set KTHREAD_SHOULD_PARK
      >                                                 smpboot_thread_fn
      >                                                 set Task interruptible
      >
      >
      > wake_up_process
      >  if (!(p->state & state))
      >                 goto out;
      >
      >                                                 Kthread_parkme
      >                                                 SET TASK_PARKED
      >                                                 schedule
      >                                                 raw_spin_lock(&rq->lock)
      > ttwu_remote
      > waiting for __task_rq_lock
      >                                                 context_switch
      >
      >                                                 finish_lock_switch
      >
      >
      >
      >                                                 Case TASK_PARKED
      >                                                 kthread_park_complete
      >
      >
      > SET Running
      
      Furthermore, Oleg noticed that the whole scheduler TASK_PARKED handling
      is buggered: while the TASK_DEAD case is done with preemption disabled,
      the current code can still complete early on preemption :/
      
      So basically revert that earlier fix and go with a variant of the
      alternative mentioned in the commit.  Promote TASK_PARKED to a special
      state to avoid the store-store issue on task->state that leads to the
      WARN in kthread_unpark() -> __kthread_bind().
      
      In addition, add wait_task_inactive() to kthread_park() to ensure the
      task really is PARKED when we return from kthread_park().  This avoids
      the whole "kthread still gets migrated" nonsense -- although it would
      be really good to get this done differently.
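
      A rough sketch of the result (approximate; see the patch for the exact
      code):

          static void __kthread_parkme(struct kthread *self)
          {
                  for (;;) {
                          /* Special state: serialized against concurrent
                           * wakeups, avoiding the store-store issue on
                           * task->state. */
                          set_special_state(TASK_PARKED);
                          if (!test_bit(KTHREAD_SHOULD_PARK, &self->flags))
                                  break;
                          complete(&self->parked);
                          schedule();
                  }
                  __set_current_state(TASK_RUNNING);
          }

          /* In kthread_park(), after setting KTHREAD_SHOULD_PARK and waking
           * the thread: */
          wait_for_completion(&kthread->parked);
          wait_task_inactive(k, TASK_PARKED);   /* really parked on return */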
      
      Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 85f1abe0
      
       ("kthread, sched/wait: Fix kthread_parkme() completion issue")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1cef1150
  27. 22 Jun, 2018 1 commit
    • Will Deacon's avatar
      rseq: Avoid infinite recursion when delivering SIGSEGV · 784e0300
      Will Deacon authored
      
      
      When delivering a signal to a task that is using rseq, we call into
      __rseq_handle_notify_resume() so that the registers pushed in the
      sigframe are updated to reflect the state of the restartable sequence
      (for example, ensuring that the signal returns to the abort handler if
      necessary).
      
      However, if the rseq management fails due to an unrecoverable fault when
      accessing userspace or certain combinations of RSEQ_CS_* flags, then we
      will attempt to deliver a SIGSEGV. This has the potential for infinite
      recursion if the rseq code continuously fails on signal delivery.
      
      Avoid this problem by using force_sigsegv() instead of force_sig(), which
      is explicitly designed to reset the SEGV handler to SIG_DFL in the case
      of a recursive fault. In doing so, remove rseq_signal_deliver() from the
      internal rseq API and have an optional struct ksignal * parameter to
      rseq_handle_notify_resume() instead.
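
      A rough sketch of the resulting error path (approximate, not the
      literal diff):

          void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
          {
                  struct task_struct *t = current;
                  int ret, sig;

                  if (unlikely(t->flags & PF_EXITING))
                          return;
                  ret = rseq_ip_fixup(regs);
                  if (unlikely(ret < 0))
                          goto error;
                  if (unlikely(rseq_update_cpu_id(t)))
                          goto error;
                  return;

          error:
                  sig = ksig ? ksig->sig : 0;
                  /* force_sigsegv() resets the SEGV handler to SIG_DFL, so a
                   * recursive fault kills the task instead of looping. */
                  force_sigsegv(sig, t);
          }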
      
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: peterz@infradead.org
      Cc: paulmck@linux.vnet.ibm.com
      Cc: boqun.feng@gmail.com
      Link: https://lkml.kernel.org/r/1529664307-983-1-git-send-email-will.deacon@arm.com
      784e0300
  28. 21 Jun, 2018 1 commit
    • Mathieu Desnoyers's avatar
      rseq/cleanup: Do not abort rseq c.s. in child on fork() · 9a789fcf
      Mathieu Desnoyers authored
      
      
      Considering that we explicitly forbid system calls in rseq critical
      sections, it is not valid to issue a fork or clone system call within a
      rseq critical section, so rseq_fork() is not required to restart an
      active rseq c.s. in the child process.
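
      For reference, a sketch of what rseq_fork() then boils down to
      (approximate; field names as used by the rseq series):

          static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
          {
                  if (clone_flags & CLONE_THREAD) {
                          /* New threads register rseq themselves. */
                          t->rseq = NULL;
                          t->rseq_sig = 0;
                          t->rseq_event_mask = 0;
                  } else {
                          /* fork(): inherit the registration unchanged; no
                           * abort of an rseq c.s. is needed, since fork()
                           * cannot legally be issued from inside one. */
                          t->rseq = current->rseq;
                          t->rseq_sig = current->rseq_sig;
                          t->rseq_event_mask = current->rseq_event_mask;
                  }
          }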
      
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-kselftest@vger.kernel.org
      Link: https://lore.kernel.org/lkml/20180619133230.4087-4-mathieu.desnoyers@efficios.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9a789fcf