1. 16 Nov, 2017 3 commits
  2. 02 Nov, 2017 1 commit
    • Greg Kroah-Hartman's avatar
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman authored
      
      
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: default avatarKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: default avatarPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  3. 04 Oct, 2017 1 commit
    • Johannes Weiner's avatar
      mm: memcontrol: use vmalloc fallback for large kmem memcg arrays · f80c7dab
      Johannes Weiner authored
      For quick per-memcg indexing, slab caches and list_lru structures
      maintain linear arrays of descriptors.  As the number of concurrent
      memory cgroups in the system goes up, this requires large contiguous
      allocations (8k cgroups = order-5, 16k cgroups = order-6 etc.) for every
      existing slab cache and list_lru, which can easily fail on loaded
      systems.  E.g.:
      
        mkdir: page allocation failure: order:5, mode:0x14040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null)
        CPU: 1 PID: 6399 Comm: mkdir Not tainted 4.13.0-mm1-00065-g720bbe532b7c-dirty #481
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
        Call Trace:
         ? __alloc_pages_direct_compact+0x4c/0x110
         __alloc_pages_nodemask+0xf50/0x1430
         alloc_pages_current+0x60/0xc0
         kmalloc_order_trace+0x29/0x1b0
         __kmalloc+0x1f4/0x320
         memcg_update_all_list_lrus+0xca/0x2e0
         mem_cgroup_css_alloc+0x612/0x670
         cgroup_apply_control_enable+0x19e/0x360
         cgroup_mkdir+0x322/0x490
         kernfs_iop_mkdir+0x55/0x80
         vfs_mkdir+0xd0/0x120
         SyS_mkdirat+0x6c/0xe0
         SyS_mkdir+0x14/0x20
         entry_SYSCALL_64_fastpath+0x18/0xad
        Mem-Info:
        active_anon:2965 inactive_anon:19 isolated_anon:0
         active_file:100270 inactive_file:98846 isolated_file:0
         unevictable:0 dirty:0 writeback:0 unstable:0
         slab_reclaimable:7328 slab_unreclaimable:16402
         mapped:771 shmem:52 pagetables:278 bounce:0
         free:13718 free_pcp:0 free_cma:0
      
      This output is from an artificial reproducer, but we have repeatedly
      observed order-7 failures in production in the Facebook fleet.  These
      systems become useless as they cannot run more jobs, even though there
      is plenty of memory to allocate 128 individual pages.
      
      Use kvmalloc and kvzalloc to fall back to vmalloc space if these arrays
      prove too large for allocating them physically contiguous.
      
      Link: http://lkml.kernel.org/r/20170918184919.20644-1-hannes@cmpxchg.org
      
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f80c7dab
  4. 06 Jul, 2017 1 commit
    • Kees Cook's avatar
      mm: allow slab_nomerge to be set at build time · 7660a6fd
      Kees Cook authored
      Some hardened environments want to build kernels with slab_nomerge
      already set (so that they do not depend on remembering to set the kernel
      command line option).  This is desired to reduce the risk of kernel heap
      overflows being able to overwrite objects from merged caches and changes
      the requirements for cache layout control, increasing the difficulty of
      these attacks.  By keeping caches unmerged, these kinds of exploits can
      usually only damage objects in the same cache (though the risk to
      metadata exploitation is unchanged).
      
      Link: http://lkml.kernel.org/r/20170620230911.GA25238@beast
      
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: David Windsor <dave@nullcore.net>
      Cc: Eric Biggers <ebiggers3@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: David Windsor <dave@nullcore.net>
      Cc: Eric Biggers <ebiggers3@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Daniel Mack <daniel@zonque.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7660a6fd
  5. 18 Apr, 2017 1 commit
    • Paul E. McKenney's avatar
      mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU · 5f0d5a3a
      Paul E. McKenney authored
      
      
      A group of Linux kernel hackers reported chasing a bug that resulted
      from their assumption that SLAB_DESTROY_BY_RCU provided an existence
      guarantee, that is, that no block from such a slab would be reallocated
      during an RCU read-side critical section.  Of course, that is not the
      case.  Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
      slab of blocks.
      
      However, there is a phrase for this, namely "type safety".  This commit
      therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
      to avoid future instances of this sort of confusion.
      Signed-off-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: <linux-mm@kvack.org>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      [ paulmck: Add comments mentioning the old name, as requested by Eric
        Dumazet, in order to help people familiar with the old name find
        the new one. ]
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      5f0d5a3a
  6. 25 Feb, 2017 1 commit
    • Greg Thelen's avatar
      kasan: drain quarantine of memcg slab objects · f9fa1d91
      Greg Thelen authored
      Per memcg slab accounting and kasan have a problem with kmem_cache
      destruction.
       - kmem_cache_create() allocates a kmem_cache, which is used for
         allocations from processes running in root (top) memcg.
       - Processes running in non root memcg and allocating with either
         __GFP_ACCOUNT or from a SLAB_ACCOUNT cache use a per memcg
         kmem_cache.
       - Kasan catches use-after-free by having kfree() and kmem_cache_free()
         defer freeing of objects. Objects are placed in a quarantine.
       - kmem_cache_destroy() destroys root and non root kmem_caches. It takes
         care to drain the quarantine of objects from the root memcg's
         kmem_cache, but ignores objects associated with non root memcg. This
         causes leaks because quarantined per memcg objects refer to per memcg
         kmem cache being destroyed.
      
      To see the problem:
      
       1) create a slab cache with kmem_cache_create(,,,SLAB_ACCOUNT,)
       2) from non root memcg, allocate and free a few objects from cache
       3) dispose of the cache with kmem_cache_destroy() kmem_cache_destroy()
          will trigger a "Slab cache still has objects" warning indicating
          that the per memcg kmem_cache structure was leaked.
      
      Fix the leak by draining kasan quarantined objects allocated from non
      root memcg.
      
      Racing memcg deletion is tricky, but handled.  kmem_cache_destroy() =>
      shutdown_memcg_caches() => __shutdown_memcg_cache() => shutdown_cache()
      flushes per memcg quarantined objects, even if that memcg has been
      rmdir'd and gone through memcg_deactivate_kmem_caches().
      
      This leak only affects destroyed SLAB_ACCOUNT kmem caches when kasan is
      enabled.  So I don't think it's worth patching stable kernels.
      
      Link: http://lkml.kernel.org/r/1482257462-36948-1-git-send-email-gthelen@google.com
      
      Signed-off-by: default avatarGreg Thelen <gthelen@google.com>
      Reviewed-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Acked-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f9fa1d91
  7. 23 Feb, 2017 11 commits
    • Tejun Heo's avatar
      slab: use memcg_kmem_cache_wq for slab destruction operations · 17cc4dfe
      Tejun Heo authored
      If there's contention on slab_mutex, queueing the per-cache destruction
      work item on the system_wq can unnecessarily create and tie up a lot of
      kworkers.
      
      Rename memcg_kmem_cache_create_wq to memcg_kmem_cache_wq and make it
      global and use that workqueue for the destruction work items too.  While
      at it, convert the workqueue from an unbound workqueue to a per-cpu one
      with concurrency limited to 1.  It's generally preferable to use per-cpu
      workqueues and concurrency limit of 1 is safe enough.
      
      This is suggested by Joonsoo Kim.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-11-tj@kernel.org
      
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarJay Vana <jsvana@fb.com>
      Acked-by: default avatarVladimir Davydov <vdavydov@tarantool.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      17cc4dfe
    • Tejun Heo's avatar
      slab: remove synchronous synchronize_sched() from memcg cache deactivation path · 01fb58bc
      Tejun Heo authored
      With kmem cgroup support enabled, kmem_caches can be created and
      destroyed frequently and a great number of near empty kmem_caches can
      accumulate if there are a lot of transient cgroups and the system is not
      under memory pressure.  When memory reclaim starts under such
      conditions, it can lead to consecutive deactivation and destruction of
      many kmem_caches, easily hundreds of thousands on moderately large
      systems, exposing scalability issues in the current slab management
      code.  This is one of the patches to address the issue.
      
      slub uses synchronize_sched() to deactivate a memcg cache.
      synchronize_sched() is an expensive and slow operation and doesn't scale
      when a huge number of caches are destroyed back-to-back.  While there
      used to be a simple batching mechanism, the batching was too restricted
      to be helpful.
      
      This patch implements slab_deactivate_memcg_cache_rcu_sched() which slub
      can use to schedule sched RCU callback instead of performing
      synchronize_sched() synchronously while holding cgroup_mutex.  While
      this adds online cpus, mems and slab_mutex operations, operating on
      these locks back-to-back from the same kworker, which is what's gonna
      happen when there are many to deactivate, isn't expensive at all and
      this gets rid of the scalability problem completely.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-9-tj@kernel.org
      
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarJay Vana <jsvana@fb.com>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      01fb58bc
    • Tejun Heo's avatar
      slab: introduce __kmemcg_cache_deactivate() · c9fc5864
      Tejun Heo authored
      __kmem_cache_shrink() is called with %true @deactivate only for memcg
      caches.  Remove @deactivate from __kmem_cache_shrink() and introduce
      __kmemcg_cache_deactivate() instead.  Each memcg-supporting allocator
      should implement it and it should deactivate and drain the cache.
      
      This is to allow memcg cache deactivation behavior to further deviate
      from simple shrinking without messing up __kmem_cache_shrink().
      
      This is pure reorganization and doesn't introduce any observable
      behavior changes.
      
      v2: Dropped unnecessary ifdef in mm/slab.h as suggested by Vladimir.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-8-tj@kernel.org
      
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c9fc5864
    • Tejun Heo's avatar
      slab: implement slab_root_caches list · 510ded33
      Tejun Heo authored
      With kmem cgroup support enabled, kmem_caches can be created and
      destroyed frequently and a great number of near empty kmem_caches can
      accumulate if there are a lot of transient cgroups and the system is not
      under memory pressure.  When memory reclaim starts under such
      conditions, it can lead to consecutive deactivation and destruction of
      many kmem_caches, easily hundreds of thousands on moderately large
      systems, exposing scalability issues in the current slab management
      code.  This is one of the patches to address the issue.
      
      slab_caches currently lists all caches including root and memcg ones.
      This is the only data structure which lists the root caches and
      iterating root caches can only be done by walking the list while
      skipping over memcg caches.  As there can be a huge number of memcg
      caches, this can become very expensive.
      
      This also can make /proc/slabinfo behave very badly.  seq_file processes
      reads in 4k chunks and seeks to the previous Nth position on slab_caches
      list to resume after each chunk.  With a lot of memcg cache churns on
      the list, reading /proc/slabinfo can become very slow and its content
      often ends up with duplicate and/or missing entries.
      
      This patch adds a new list slab_root_caches which lists only the root
      caches.  When memcg is not enabled, it becomes just an alias of
      slab_caches.  memcg specific list operations are collected into
      memcg_[un]link_cache().
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-7-tj@kernel.org
      
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarJay Vana <jsvana@fb.com>
      Acked-by: default avatarVladimir Davydov <vdavydov@tarantool.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      510ded33
    • Tejun Heo's avatar
      slab: link memcg kmem_caches on their associated memory cgroup · bc2791f8
      Tejun Heo authored
      With kmem cgroup support enabled, kmem_caches can be created and
      destroyed frequently and a great number of near empty kmem_caches can
      accumulate if there are a lot of transient cgroups and the system is not
      under memory pressure.  When memory reclaim starts under such
      conditions, it can lead to consecutive deactivation and destruction of
      many kmem_caches, easily hundreds of thousands on moderately large
      systems, exposing scalability issues in the current slab management
      code.  This is one of the patches to address the issue.
      
      While a memcg kmem_cache is listed on its root cache's ->children list,
      there is no direct way to iterate all kmem_caches which are assocaited
      with a memory cgroup.  The only way to iterate them is walking all
      caches while filtering out caches which don't match, which would be most
      of them.
      
      This makes memcg destruction operations O(N^2) where N is the total
      number of slab caches which can be huge.  This combined with the
      synchronous RCU operations can tie up a CPU and affect the whole machine
      for many hours when memory reclaim triggers offlining and destruction of
      the stale memcgs.
      
      This patch adds mem_cgroup->kmem_caches list which goes through
      memcg_cache_params->kmem_caches_node of all kmem_caches which are
      associated with the memcg.  All memcg specific iterations, including
      stat file access, are updated to use the new list instead.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-6-tj@kernel.org
      
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarJay Vana <jsvana@fb.com>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bc2791f8
    • Tejun Heo's avatar
      slab: reorganize memcg_cache_params · 9eeadc8b
      Tejun Heo authored
      We're going to change how memcg caches are iterated.  In preparation,
      clean up and reorganize memcg_cache_params.
      
      * The shared ->list is replaced by ->children in root and
        ->children_node in children.
      
      * ->is_root_cache is removed.  Instead ->root_cache is moved out of
        the child union and now used by both root and children.  NULL
        indicates root cache.  Non-NULL a memcg one.
      
      This patch doesn't cause any observable behavior changes.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-5-tj@kernel.org
      
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9eeadc8b
    • Tejun Heo's avatar
      slab: remove synchronous rcu_barrier() call in memcg cache release path · 657dc2f9
      Tejun Heo authored
      With kmem cgroup support enabled, kmem_caches can be created and
      destroyed frequently and a great number of near empty kmem_caches can
      accumulate if there are a lot of transient cgroups and the system is not
      under memory pressure.  When memory reclaim starts under such
      conditions, it can lead to consecutive deactivation and destruction of
      many kmem_caches, easily hundreds of thousands on moderately large
      systems, exposing scalability issues in the current slab management
      code.  This is one of the patches to address the issue.
      
      SLAB_DESTORY_BY_RCU caches need to flush all RCU operations before
      destruction because slab pages are freed through RCU and they need to be
      able to dereference the associated kmem_cache.  Currently, it's done
      synchronously with rcu_barrier().  As rcu_barrier() is expensive
      time-wise, slab implements a batching mechanism so that rcu_barrier()
      can be done for multiple caches at the same time.
      
      Unfortunately, the rcu_barrier() is in synchronous path which is called
      while holding cgroup_mutex and the batching is too limited to be
      actually helpful.
      
      This patch updates the cache release path so that the batching is
      asynchronous and global.  All SLAB_DESTORY_BY_RCU caches are queued
      globally and a work item consumes the list.  The work item calls
      rcu_barrier() only once for all caches that are currently queued.
      
      * release_caches() is removed and shutdown_cache() now either directly
        release the cache or schedules a RCU callback to do that.  This
        makes the cache inaccessible once shutdown_cache() is called and
        makes it impossible for shutdown_memcg_caches() to do memcg-specific
        cleanups afterwards.  Move memcg-specific part into a helper,
        unlink_memcg_cache(), and make shutdown_cache() call it directly.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-4-tj@kernel.org
      
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarJay Vana <jsvana@fb.com>
      Acked-by: default avatarVladimir Davydov <vdavydov@tarantool.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      657dc2f9
    • Tejun Heo's avatar
      slub: separate out sysfs_slab_release() from sysfs_slab_remove() · bf5eb3de
      Tejun Heo authored
      Separate out slub sysfs removal and release, and call the former earlier
      from __kmem_cache_shutdown().  There's no reason to defer sysfs removal
      through RCU and this will later allow us to remove sysfs files way
      earlier during memory cgroup offline instead of release.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-3-tj@kernel.org
      
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bf5eb3de
    • Tejun Heo's avatar
      Revert "slub: move synchronize_sched out of slab_mutex on shrink" · 290b6a58
      Tejun Heo authored
      Patch series "slab: make memcg slab destruction scalable", v3.
      
      With kmem cgroup support enabled, kmem_caches can be created and
      destroyed frequently and a great number of near empty kmem_caches can
      accumulate if there are a lot of transient cgroups and the system is not
      under memory pressure.  When memory reclaim starts under such
      conditions, it can lead to consecutive deactivation and destruction of
      many kmem_caches, easily hundreds of thousands on moderately large
      systems, exposing scalability issues in the current slab management
      code.
      
      I've seen machines which end up with hundred thousands of caches and
      many millions of kernfs_nodes.  The current code is O(N^2) on the total
      number of caches and has synchronous rcu_barrier() and
      synchronize_sched() in cgroup offline / release path which is executed
      while holding cgroup_mutex.  Combined, this leads to very expensive and
      slow cache destruction operations which can easily keep running for half
      a day.
      
      This also messes up /proc/slabinfo along with other cache iterating
      operations.  seq_file operates on 4k chunks and on each 4k boundary
      tries to seek to the last position in the list.  With a huge number of
      caches on the list, this becomes very slow and very prone to the list
      content changing underneath it leading to a lot of missing and/or
      duplicate entries.
      
      This patchset addresses the scalability problem.
      
      * Add root and per-memcg lists.  Update each user to use the
        appropriate list.
      
      * Make rcu_barrier() for SLAB_DESTROY_BY_RCU caches globally batched
        and asynchronous.
      
      * For dying empty slub caches, remove the sysfs files after
        deactivation so that we don't end up with millions of sysfs files
        without any useful information on them.
      
      This patchset contains the following nine patches.
      
       0001-Revert-slub-move-synchronize_sched-out-of-slab_mutex.patch
       0002-slub-separate-out-sysfs_slab_release-from-sysfs_slab.patch
       0003-slab-remove-synchronous-rcu_barrier-call-in-memcg-ca.patch
       0004-slab-reorganize-memcg_cache_params.patch
       0005-slab-link-memcg-kmem_caches-on-their-associated-memo.patch
       0006-slab-implement-slab_root_caches-list.patch
       0007-slab-introduce-__kmemcg_cache_deactivate.patch
       0008-slab-remove-synchronous-synchronize_sched-from-memcg.patch
       0009-slab-remove-slub-sysfs-interface-files-early-for-emp.patch
       0010-slab-use-memcg_kmem_cache_wq-for-slab-destruction-op.patch
      
      0001 reverts an existing optimization to prepare for the following
      changes.  0002 is a prep patch.  0003 makes rcu_barrier() in release
      path batched and asynchronous.  0004-0006 separate out the lists.
      0007-0008 replace synchronize_sched() in slub destruction path with
      call_rcu_sched().  0009 removes sysfs files early for empty dying
      caches.  0010 makes destruction work items use a workqueue with limited
      concurrency.
      
      This patch (of 10):
      
      Revert 89e364db ("slub: move synchronize_sched out of slab_mutex on
      shrink").
      
      With kmem cgroup support enabled, kmem_caches can be created and destroyed
      frequently and a great number of near empty kmem_caches can accumulate if
      there are a lot of transient cgroups and the system is not under memory
      pressure.  When memory reclaim starts under such conditions, it can lead
      to consecutive deactivation and destruction of many kmem_caches, easily
      hundreds of thousands on moderately large systems, exposing scalability
      issues in the current slab management code.  This is one of the patches to
      address the issue.
      
      Moving synchronize_sched() out of slab_mutex isn't enough as it's still
      inside cgroup_mutex.  The whole deactivation / release path will be
      updated to avoid all synchronous RCU operations.  Revert this insufficient
      optimization in preparation to ease future changes.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-2-tj@kernel.org
      
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarJay Vana <jsvana@fb.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      290b6a58
    • Vlastimil Babka's avatar
      mm, slab: rename kmalloc-node cache to kmalloc-<size> · af3b5f87
      Vlastimil Babka authored
      SLAB as part of its bootstrap pre-creates one kmalloc cache that can fit
      the kmem_cache_node management structure, and puts it into the generic
      kmalloc cache array (e.g. for 128b objects).  The name of this cache is
      "kmalloc-node", which is confusing for readers of /proc/slabinfo as the
      cache is used for generic allocations (and not just the kmem_cache_node
      struct) and it appears as the kmalloc-128 cache is missing.
      
      An easy solution is to use the kmalloc-<size> name when pre-creating the
      cache, which we can get from the kmalloc_info array.
      
      Example /proc/slabinfo before the patch:
      
        ...
        kmalloc-256         1647   1984    256   16    1 : tunables  120   60    8 : slabdata    124    124    828
        kmalloc-192         1974   1974    192   21    1 : tunables  120   60    8 : slabdata     94     94    133
        kmalloc-96          1332   1344    128   32    1 : tunables  120   60    8 : slabdata     42     42    219
        kmalloc-64          2505   5952     64   64    1 : tunables  120   60    8 : slabdata     93     93    715
        kmalloc-32          4278   4464     32  124    1 : tunables  120   60    8 : slabdata     36     36    346
        kmalloc-node        1352   1376    128   32    1 : tunables  120   60    8 : slabdata     43     43     53
        kmem_cache           132    147    192   21    1 : tunables  120   60    8 : slabdata      7      7      0
      
      After the patch:
      
        ...
        kmalloc-256         1672   2160    256   16    1 : tunables  120   60    8 : slabdata    135    135    807
        kmalloc-192         1992   2016    192   21    1 : tunables  120   60    8 : slabdata     96     96    203
        kmalloc-96          1159   1184    128   32    1 : tunables  120   60    8 : slabdata     37     37    116
        kmalloc-64          2561   4864     64   64    1 : tunables  120   60    8 : slabdata     76     76    785
        kmalloc-32          4253   4340     32  124    1 : tunables  120   60    8 : slabdata     35     35    270
        kmalloc-128         1256   1280    128   32    1 : tunables  120   60    8 : slabdata     40     40     39
        kmem_cache           125    147    192   21    1 : tunables  120   60    8 : slabdata      7      7      0
      
      [vbabka@suse.cz: export the whole kmalloc_info structure instead of just a name accessor, per Christoph Lameter]
        Link: http://lkml.kernel.org/r/54e80303-b814-4232-66d4-95b34d3eb9d0@suse.cz
      Link: http://lkml.kernel.org/r/20170203181008.24898-1-vbabka@suse.cz
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarMatthew Wilcox <mawilcox@microsoft.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      af3b5f87
    • Grygorii Maistrenko's avatar
      slub: do not merge cache if slub_debug contains a never-merge flag · c6e28895
      Grygorii Maistrenko authored
      In case CONFIG_SLUB_DEBUG_ON=n, find_mergeable() gets debug features from
      commandline but never checks if there are features from the
      SLAB_NEVER_MERGE set.
      
      As a result selected by slub_debug caches are always mergeable if they
      have been created without a custom constructor set or without one of the
      SLAB_* debug features on.
      
      This moves the SLAB_NEVER_MERGE check below the flags update from
      commandline to make sure it won't merge the slab cache if one of the debug
      features is on.
      
      Link: http://lkml.kernel.org/r/20170101124451.GA4740@lp-laptop-d
      
      Signed-off-by: default avatarGrygorii Maistrenko <grygoriimkd@gmail.com>
      Reviewed-by: default avatarPekka Enberg <penberg@kernel.org>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c6e28895
  8. 13 Dec, 2016 2 commits
  9. 11 Nov, 2016 1 commit
  10. 26 Jul, 2016 2 commits
    • Vladimir Davydov's avatar
      mm: charge/uncharge kmemcg from generic page allocator paths · 4949148a
      Vladimir Davydov authored
      Currently, to charge a non-slab allocation to kmemcg one has to use
      alloc_kmem_pages helper with __GFP_ACCOUNT flag.  A page allocated with
      this helper should finally be freed using free_kmem_pages, otherwise it
      won't be uncharged.
      
      This API suits its current users fine, but it turns out to be impossible
      to use along with page reference counting, i.e.  when an allocation is
      supposed to be freed with put_page, as it is the case with pipe or unix
      socket buffers.
      
      To overcome this limitation, this patch moves charging/uncharging to
      generic page allocator paths, i.e.  to __alloc_pages_nodemask and
      free_pages_prepare, and zaps alloc/free_kmem_pages helpers.  This way,
      one can use any of the available page allocation functions to get the
      allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT,
      just like in case of kmalloc and friends.  A charged page will be
      automatically uncharged on free.
      
      To make it possible, we need to mark pages charged to kmemcg somehow.
      To avoid introducing a new page flag, we make use of page->_mapcount for
      marking such pages.  Since pages charged to kmemcg are not supposed to
      be mapped to userspace, it should work just fine.  There are other
      (ab)users of page->_mapcount - buddy and balloon pages - but we don't
      conflict with them.
      
      In case kmemcg is compiled out or not used at runtime, this patch
      introduces no overhead to generic page allocator paths.  If kmemcg is
      used, it will be plus one gfp flags check on alloc and plus one
      page->_mapcount check on free, which shouldn't hurt performance, because
      the data accessed are hot.
      
      Link: http://lkml.kernel.org/r/a9736d856f895bcb465d9f257b54efe32eda6f99.1464079538.git.vdavydov@virtuozzo.com
      
      Signed-off-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4949148a
    • Thomas Garnier's avatar
      mm: reorganize SLAB freelist randomization · 7c00fce9
      Thomas Garnier authored
      The kernel heap allocators are using a sequential freelist making their
      allocation predictable.  This predictability makes kernel heap overflow
      easier to exploit.  An attacker can careful prepare the kernel heap to
      control the following chunk overflowed.
      
      For example these attacks exploit the predictability of the heap:
       - Linux Kernel CAN SLUB overflow (https://goo.gl/oMNWkU)
       - Exploiting Linux Kernel Heap corruptions (http://goo.gl/EXLn95)
      
      ***Problems that needed solving:
       - Randomize the Freelist (singled linked) used in the SLUB allocator.
       - Ensure good performance to encourage usage.
       - Get best entropy in early boot stage.
      
      ***Parts:
       - 01/02 Reorganize the SLAB Freelist randomization to share elements
         with the SLUB implementation.
       - 02/02 The SLUB Freelist randomization implementation. Similar approach
         than the SLAB but tailored to the singled freelist used in SLUB.
      
      ***Performance data:
      
      slab_test impact is between 3% to 4% on average for 100000 attempts
      without smp.  It is a very focused testing, kernbench show the overall
      impact on the system is way lower.
      
      Before:
      
        Single thread testing
        =====================
        1. Kmalloc: Repeatedly allocate then free test
        100000 times kmalloc(8) -> 49 cycles kfree -> 77 cycles
        100000 times kmalloc(16) -> 51 cycles kfree -> 79 cycles
        100000 times kmalloc(32) -> 53 cycles kfree -> 83 cycles
        100000 times kmalloc(64) -> 62 cycles kfree -> 90 cycles
        100000 times kmalloc(128) -> 81 cycles kfree -> 97 cycles
        100000 times kmalloc(256) -> 98 cycles kfree -> 121 cycles
        100000 times kmalloc(512) -> 95 cycles kfree -> 122 cycles
        100000 times kmalloc(1024) -> 96 cycles kfree -> 126 cycles
        100000 times kmalloc(2048) -> 115 cycles kfree -> 140 cycles
        100000 times kmalloc(4096) -> 149 cycles kfree -> 171 cycles
        2. Kmalloc: alloc/free test
        100000 times kmalloc(8)/kfree -> 70 cycles
        100000 times kmalloc(16)/kfree -> 70 cycles
        100000 times kmalloc(32)/kfree -> 70 cycles
        100000 times kmalloc(64)/kfree -> 70 cycles
        100000 times kmalloc(128)/kfree -> 70 cycles
        100000 times kmalloc(256)/kfree -> 69 cycles
        100000 times kmalloc(512)/kfree -> 70 cycles
        100000 times kmalloc(1024)/kfree -> 73 cycles
        100000 times kmalloc(2048)/kfree -> 72 cycles
        100000 times kmalloc(4096)/kfree -> 71 cycles
      
      After:
      
        Single thread testing
        =====================
        1. Kmalloc: Repeatedly allocate then free test
        100000 times kmalloc(8) -> 57 cycles kfree -> 78 cycles
        100000 times kmalloc(16) -> 61 cycles kfree -> 81 cycles
        100000 times kmalloc(32) -> 76 cycles kfree -> 93 cycles
        100000 times kmalloc(64) -> 83 cycles kfree -> 94 cycles
        100000 times kmalloc(128) -> 106 cycles kfree -> 107 cycles
        100000 times kmalloc(256) -> 118 cycles kfree -> 117 cycles
        100000 times kmalloc(512) -> 114 cycles kfree -> 116 cycles
        100000 times kmalloc(1024) -> 115 cycles kfree -> 118 cycles
        100000 times kmalloc(2048) -> 147 cycles kfree -> 131 cycles
        100000 times kmalloc(4096) -> 214 cycles kfree -> 161 cycles
        2. Kmalloc: alloc/free test
        100000 times kmalloc(8)/kfree -> 66 cycles
        100000 times kmalloc(16)/kfree -> 66 cycles
        100000 times kmalloc(32)/kfree -> 66 cycles
        100000 times kmalloc(64)/kfree -> 66 cycles
        100000 times kmalloc(128)/kfree -> 65 cycles
        100000 times kmalloc(256)/kfree -> 67 cycles
        100000 times kmalloc(512)/kfree -> 67 cycles
        100000 times kmalloc(1024)/kfree -> 64 cycles
        100000 times kmalloc(2048)/kfree -> 67 cycles
        100000 times kmalloc(4096)/kfree -> 67 cycles
      
      Kernbench, before:
      
        Average Optimal load -j 12 Run (std deviation):
        Elapsed Time 101.873 (1.16069)
        User Time 1045.22 (1.60447)
        System Time 88.969 (0.559195)
        Percent CPU 1112.9 (13.8279)
        Context Switches 189140 (2282.15)
        Sleeps 99008.6 (768.091)
      
      After:
      
        Average Optimal load -j 12 Run (std deviation):
        Elapsed Time 102.47 (0.562732)
        User Time 1045.3 (1.34263)
        System Time 88.311 (0.342554)
        Percent CPU 1105.8 (6.49444)
        Context Switches 189081 (2355.78)
        Sleeps 99231.5 (800.358)
      
      This patch (of 2):
      
      This commit reorganizes the previous SLAB freelist randomization to
      prepare for the SLUB implementation.  It moves functions that will be
      shared to slab_common.
      
      The entropy functions are changed to align with the SLUB implementation,
      now using get_random_(int|long) functions.  These functions were chosen
      because they provide a bit more entropy early on boot and better
      performance when specific arch instructions are not available.
      
      [akpm@linux-foundation.org: fix build]
      Link: http://lkml.kernel.org/r/1464295031-26375-2-git-send-email-thgarnie@google.com
      
      Signed-off-by: default avatarThomas Garnier <thgarnie@google.com>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7c00fce9
  11. 23 Jul, 2016 1 commit
    • Johannes Weiner's avatar
      mm: memcontrol: fix cgroup creation failure after many small jobs · 73f576c0
      Johannes Weiner authored
      The memory controller has quite a bit of state that usually outlives the
      cgroup and pins its CSS until said state disappears.  At the same time
      it imposes a 16-bit limit on the CSS ID space to economically store IDs
      in the wild.  Consequently, when we use cgroups to contain frequent but
      small and short-lived jobs that leave behind some page cache, we quickly
      run into the 64k limitations of outstanding CSSs.  Creating a new cgroup
      fails with -ENOSPC while there are only a few, or even no user-visible
      cgroups in existence.
      
      Although pinning CSSs past cgroup removal is common, there are only two
      instances that actually need an ID after a cgroup is deleted: cache
      shadow entries and swapout records.
      
      Cache shadow entries reference the ID weakly and can deal with the CSS
      having disappeared when it's looked up later.  They pose no hurdle.
      
      Swap-out records do need to pin the css to hierarchically attribute
      swapins after the cgroup has been deleted; though the only pages that
      remain swapped out after offlining are tmpfs/shmem pages.  And those
      references are under the user's control, so they are manageable.
      
      This patch introduces a private 16-bit memcg ID and switches swap and
      cache shadow entries over to using that.  This ID can then be recycled
      after offlining when the CSS remains pinned only by objects that don't
      specifically need it.
      
      This script demonstrates the problem by faulting one cache page in a new
      cgroup and deleting it again:
      
        set -e
        mkdir -p pages
        for x in `seq 128000`; do
          [ $((x % 1000)) -eq 0 ] && echo $x
          mkdir /cgroup/foo
          echo $$ >/cgroup/foo/cgroup.procs
          echo trex >pages/$x
          echo $$ >/cgroup/cgroup.procs
          rmdir /cgroup/foo
        done
      
      When run on an unpatched kernel, we eventually run out of possible IDs
      even though there are no visible cgroups:
      
        [root@ham ~]# ./cssidstress.sh
        [...]
        65000
        mkdir: cannot create directory '/cgroup/foo': No space left on device
      
      After this patch, the IDs get released upon cgroup destruction and the
      cache and css objects get released once memory reclaim kicks in.
      
      [hannes@cmpxchg.org: init the IDR]
        Link: http://lkml.kernel.org/r/20160621154601.GA22431@cmpxchg.org
      Fixes: b2052564 ("mm: memcontrol: continue cache reclaim from offlined groups")
      Link: http://lkml.kernel.org/r/20160617162516.GD19084@cmpxchg.org
      
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: default avatarJohn Garcia <john.garcia@mesosphere.io>
      Reviewed-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Nikolay Borisov <kernel@kyup.com>
      Cc: <stable@vger.kernel.org>	[3.19+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      73f576c0
  12. 21 May, 2016 1 commit
    • Alexander Potapenko's avatar
      mm: kasan: initial memory quarantine implementation · 55834c59
      Alexander Potapenko authored
      Quarantine isolates freed objects in a separate queue.  The objects are
      returned to the allocator later, which helps to detect use-after-free
      errors.
      
      When the object is freed, its state changes from KASAN_STATE_ALLOC to
      KASAN_STATE_QUARANTINE.  The object is poisoned and put into quarantine
      instead of being returned to the allocator, therefore every subsequent
      access to that object triggers a KASAN error, and the error handler is
      able to say where the object has been allocated and deallocated.
      
      When it's time for the object to leave quarantine, its state becomes
      KASAN_STATE_FREE and it's returned to the allocator.  From now on the
      allocator may reuse it for another allocation.  Before that happens,
      it's still possible to detect a use-after free on that object (it
      retains the allocation/deallocation stacks).
      
      When the allocator reuses this object, the shadow is unpoisoned and old
      allocation/deallocation stacks are wiped.  Therefore a use of this
      object, even an incorrect one, won't trigger ASan warning.
      
      Without the quarantine, it's not guaranteed that the objects aren't
      reused immediately, that's why the probability of catching a
      use-after-free is lower than with quarantine in place.
      
      Quarantine isolates freed objects in a separate queue.  The objects are
      returned to the allocator later, which helps to detect use-after-free
      errors.
      
      Freed objects are first added to per-cpu quarantine queues.  When a
      cache is destroyed or memory shrinking is requested, the objects are
      moved into the global quarantine queue.  Whenever a kmalloc call allows
      memory reclaiming, the oldest objects are popped out of the global queue
      until the total size of objects in quarantine is less than 3/4 of the
      maximum quarantine size (which is a fraction of installed physical
      memory).
      
      As long as an object remains in the quarantine, KASAN is able to report
      accesses to it, so the chance of reporting a use-after-free is
      increased.  Once the object leaves quarantine, the allocator may reuse
      it, in which case the object is unpoisoned and KASAN can't detect
      incorrect accesses to it.
      
      Right now quarantine support is only enabled in SLAB allocator.
      Unification of KASAN features in SLAB and SLUB will be done later.
      
      This patch is based on the "mm: kasan: quarantine" patch originally
      prepared by Dmitry Chernenkov.  A number of improvements have been
      suggested by Andrey Ryabinin.
      
      [glider@google.com: v9]
        Link: http://lkml.kernel.org/r/1462987130-144092-1-git-send-email-glider@google.com
      
      Signed-off-by: default avatarAlexander Potapenko <glider@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      55834c59
  13. 25 Mar, 2016 2 commits
    • Alexander Potapenko's avatar
      mm, kasan: add GFP flags to KASAN API · 505f5dcb
      Alexander Potapenko authored
      
      
      Add GFP flags to KASAN hooks for future patches to use.
      
      This patch is based on the "mm: kasan: unified support for SLUB and SLAB
      allocators" patch originally prepared by Dmitry Chernenkov.
      Signed-off-by: default avatarAlexander Potapenko <glider@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      505f5dcb
    • Alexander Potapenko's avatar
      mm, kasan: SLAB support · 7ed2f9e6
      Alexander Potapenko authored
      
      
      Add KASAN hooks to SLAB allocator.
      
      This patch is based on the "mm: kasan: unified support for SLUB and SLAB
      allocators" patch originally prepared by Dmitry Chernenkov.
      Signed-off-by: default avatarAlexander Potapenko <glider@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7ed2f9e6
  14. 17 Mar, 2016 3 commits
  15. 15 Mar, 2016 1 commit
    • Jesper Dangaard Brouer's avatar
      mm: new API kfree_bulk() for SLAB+SLUB allocators · ca257195
      Jesper Dangaard Brouer authored
      
      
      This patch introduce a new API call kfree_bulk() for bulk freeing memory
      objects not bound to a single kmem_cache.
      
      Christoph pointed out that it is possible to implement freeing of
      objects, without knowing the kmem_cache pointer as that information is
      available from the object's page->slab_cache.  Proposing to remove the
      kmem_cache argument from the bulk free API.
      
      Jesper demonstrated that these extra steps per object comes at a
      performance cost.  It is only in the case CONFIG_MEMCG_KMEM is compiled
      in and activated runtime that these steps are done anyhow.  The extra
      cost is most visible for SLAB allocator, because the SLUB allocator does
      the page lookup (virt_to_head_page()) anyhow.
      
      Thus, the conclusion was to keep the kmem_cache free bulk API with a
      kmem_cache pointer, but we can still implement a kfree_bulk() API fairly
      easily.  Simply by handling if kmem_cache_free_bulk() gets called with a
      kmem_cache NULL pointer.
      
      This does increase the code size a bit, but implementing a separate
      kfree_bulk() call would likely increase code size even more.
      
      Below benchmarks cost of alloc+free (obj size 256 bytes) on CPU i7-4790K
      @ 4.00GHz, no PREEMPT and CONFIG_MEMCG_KMEM=y.
      
      Code size increase for SLAB:
      
       add/remove: 0/0 grow/shrink: 1/0 up/down: 74/0 (74)
       function                                     old     new   delta
       kmem_cache_free_bulk                         660     734     +74
      
      SLAB fastpath: 87 cycles(tsc) 21.814
        sz - fallback             - kmem_cache_free_bulk - kfree_bulk
         1 - 103 cycles 25.878 ns -  41 cycles 10.498 ns - 81 cycles 20.312 ns
         2 -  94 cycles 23.673 ns -  26 cycles  6.682 ns - 42 cycles 10.649 ns
         3 -  92 cycles 23.181 ns -  21 cycles  5.325 ns - 39 cycles 9.950 ns
         4 -  90 cycles 22.727 ns -  18 cycles  4.673 ns - 26 cycles 6.693 ns
         8 -  89 cycles 22.270 ns -  14 cycles  3.664 ns - 23 cycles 5.835 ns
        16 -  88 cycles 22.038 ns -  14 cycles  3.503 ns - 22 cycles 5.543 ns
        30 -  89 cycles 22.284 ns -  13 cycles  3.310 ns - 20 cycles 5.197 ns
        32 -  88 cycles 22.249 ns -  13 cycles  3.420 ns - 20 cycles 5.166 ns
        34 -  88 cycles 22.224 ns -  14 cycles  3.643 ns - 20 cycles 5.170 ns
        48 -  88 cycles 22.088 ns -  14 cycles  3.507 ns - 20 cycles 5.203 ns
        64 -  88 cycles 22.063 ns -  13 cycles  3.428 ns - 20 cycles 5.152 ns
       128 -  89 cycles 22.483 ns -  15 cycles  3.891 ns - 23 cycles 5.885 ns
       158 -  89 cycles 22.381 ns -  15 cycles  3.779 ns - 22 cycles 5.548 ns
       250 -  91 cycles 22.798 ns -  16 cycles  4.152 ns - 23 cycles 5.967 ns
      
      SLAB when enabling MEMCG_KMEM runtime:
       - kmemcg fastpath: 130 cycles(tsc) 32.684 ns (step:0)
       1 - 148 cycles 37.220 ns -  66 cycles 16.622 ns - 66 cycles 16.583 ns
       2 - 141 cycles 35.510 ns -  51 cycles 12.820 ns - 58 cycles 14.625 ns
       3 - 140 cycles 35.017 ns -  37 cycles 9.326 ns - 33 cycles 8.474 ns
       4 - 137 cycles 34.507 ns -  31 cycles 7.888 ns - 33 cycles 8.300 ns
       8 - 140 cycles 35.069 ns -  25 cycles 6.461 ns - 25 cycles 6.436 ns
       16 - 138 cycles 34.542 ns -  23 cycles 5.945 ns - 22 cycles 5.670 ns
       30 - 136 cycles 34.227 ns -  22 cycles 5.502 ns - 22 cycles 5.587 ns
       32 - 136 cycles 34.253 ns -  21 cycles 5.475 ns - 21 cycles 5.324 ns
       34 - 136 cycles 34.254 ns -  21 cycles 5.448 ns - 20 cycles 5.194 ns
       48 - 136 cycles 34.075 ns -  21 cycles 5.458 ns - 21 cycles 5.367 ns
       64 - 135 cycles 33.994 ns -  21 cycles 5.350 ns - 21 cycles 5.259 ns
       128 - 137 cycles 34.446 ns -  23 cycles 5.816 ns - 22 cycles 5.688 ns
       158 - 137 cycles 34.379 ns -  22 cycles 5.727 ns - 22 cycles 5.602 ns
       250 - 138 cycles 34.755 ns -  24 cycles 6.093 ns - 23 cycles 5.986 ns
      
      Code size increase for SLUB:
       function                                     old     new   delta
       kmem_cache_free_bulk                         717     799     +82
      
      SLUB benchmark:
       SLUB fastpath: 46 cycles(tsc) 11.691 ns (step:0)
        sz - fallback             - kmem_cache_free_bulk - kfree_bulk
         1 -  61 cycles 15.486 ns -  53 cycles 13.364 ns - 57 cycles 14.464 ns
         2 -  54 cycles 13.703 ns -  32 cycles  8.110 ns - 33 cycles 8.482 ns
         3 -  53 cycles 13.272 ns -  25 cycles  6.362 ns - 27 cycles 6.947 ns
         4 -  51 cycles 12.994 ns -  24 cycles  6.087 ns - 24 cycles 6.078 ns
         8 -  50 cycles 12.576 ns -  21 cycles  5.354 ns - 22 cycles 5.513 ns
        16 -  49 cycles 12.368 ns -  20 cycles  5.054 ns - 20 cycles 5.042 ns
        30 -  49 cycles 12.273 ns -  18 cycles  4.748 ns - 19 cycles 4.758 ns
        32 -  49 cycles 12.401 ns -  19 cycles  4.821 ns - 19 cycles 4.810 ns
        34 -  98 cycles 24.519 ns -  24 cycles  6.154 ns - 24 cycles 6.157 ns
        48 -  83 cycles 20.833 ns -  21 cycles  5.446 ns - 21 cycles 5.429 ns
        64 -  75 cycles 18.891 ns -  20 cycles  5.247 ns - 20 cycles 5.238 ns
       128 -  93 cycles 23.271 ns -  27 cycles  6.856 ns - 27 cycles 6.823 ns
       158 - 102 cycles 25.581 ns -  30 cycles  7.714 ns - 30 cycles 7.695 ns
       250 - 107 cycles 26.917 ns -  38 cycles  9.514 ns - 38 cycles 9.506 ns
      
      SLUB when enabling MEMCG_KMEM runtime:
       - kmemcg fastpath: 71 cycles(tsc) 17.897 ns (step:0)
       1 - 85 cycles 21.484 ns -  78 cycles 19.569 ns - 75 cycles 18.938 ns
       2 - 81 cycles 20.363 ns -  45 cycles 11.258 ns - 44 cycles 11.076 ns
       3 - 78 cycles 19.709 ns -  33 cycles 8.354 ns - 32 cycles 8.044 ns
       4 - 77 cycles 19.430 ns -  28 cycles 7.216 ns - 28 cycles 7.003 ns
       8 - 101 cycles 25.288 ns -  23 cycles 5.849 ns - 23 cycles 5.787 ns
       16 - 76 cycles 19.148 ns -  20 cycles 5.162 ns - 20 cycles 5.081 ns
       30 - 76 cycles 19.067 ns -  19 cycles 4.868 ns - 19 cycles 4.821 ns
       32 - 76 cycles 19.052 ns -  19 cycles 4.857 ns - 19 cycles 4.815 ns
       34 - 121 cycles 30.291 ns -  25 cycles 6.333 ns - 25 cycles 6.268 ns
       48 - 108 cycles 27.111 ns -  21 cycles 5.498 ns - 21 cycles 5.458 ns
       64 - 100 cycles 25.164 ns -  20 cycles 5.242 ns - 20 cycles 5.229 ns
       128 - 155 cycles 38.976 ns -  27 cycles 6.886 ns - 27 cycles 6.892 ns
       158 - 132 cycles 33.034 ns -  30 cycles 7.711 ns - 30 cycles 7.728 ns
       250 - 130 cycles 32.612 ns -  38 cycles 9.560 ns - 38 cycles 9.549 ns
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ca257195
  16. 19 Feb, 2016 1 commit
    • Dmitry Safonov's avatar
      mm: slab: free kmem_cache_node after destroy sysfs file · 52b4b950
      Dmitry Safonov authored
      When slub_debug alloc_calls_show is enabled we will try to track
      location and user of slab object on each online node, kmem_cache_node
      structure and cpu_cache/cpu_slub shouldn't be freed till there is the
      last reference to sysfs file.
      
      This fixes the following panic:
      
         BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
         IP:  list_locations+0x169/0x4e0
         PGD 257304067 PUD 438456067 PMD 0
         Oops: 0000 [#1] SMP
         CPU: 3 PID: 973074 Comm: cat ve: 0 Not tainted 3.10.0-229.7.2.ovz.9.30-00007-japdoll-dirty #2 9.30
         Hardware name: DEPO Computers To Be Filled By O.E.M./H67DE3, BIOS L1.60c 07/14/2011
         task: ffff88042a5dc5b0 ti: ffff88037f8d8000 task.ti: ffff88037f8d8000
         RIP: list_locations+0x169/0x4e0
         Call Trace:
           alloc_calls_show+0x1d/0x30
           slab_attr_show+0x1b/0x30
           sysfs_read_file+0x9a/0x1a0
           vfs_read+0x9c/0x170
           SyS_read+0x58/0xb0
           system_call_fastpath+0x16/0x1b
         Code: 5e 07 12 00 b9 00 04 00 00 3d 00 04 00 00 0f 4f c1 3d 00 04 00 00 89 45 b0 0f 84 c3 00 00 00 48 63 45 b0 49 8b 9c c4 f8 00 00 00 <48> 8b 43 20 48 85 c0 74 b6 48 89 df e8 46 37 44 00 48 8b 53 10
         CR2: 0000000000000020
      
      Separated __kmem_cache_release from __kmem_cache_shutdown which now
      called on slab_kmem_cache_release (after the last reference to sysfs
      file object has dropped).
      
      Reintroduced locking in free_partial as sysfs file might access cache's
      partial list after shutdowning - partial revert of the commit
      69cb8e6b ("slub: free slabs without holding locks").  Zap
      __remove_partial and use remove_partial (w/o underscores) as
      free_partial now takes list_lock which s partial revert for commit
      1e4dd946
      
       ("slub: do not assert not having lock in removing freed
      partial")
      Signed-off-by: default avatarDmitry Safonov <dsafonov@virtuozzo.com>
      Suggested-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      52b4b950
  17. 21 Jan, 2016 2 commits
  18. 15 Jan, 2016 1 commit
    • Vladimir Davydov's avatar
      slab: add SLAB_ACCOUNT flag · 230e9fc2
      Vladimir Davydov authored
      
      
      Currently, if we want to account all objects of a particular kmem cache,
      we have to pass __GFP_ACCOUNT to each kmem_cache_alloc call, which is
      inconvenient.  This patch introduces SLAB_ACCOUNT flag which if passed
      to kmem_cache_create will force accounting for every allocation from
      this cache even if __GFP_ACCOUNT is not passed.
      
      This patch does not make any of the existing caches use this flag - it
      will be done later in the series.
      
      Note, a cache with SLAB_ACCOUNT cannot be merged with a cache w/o
      SLAB_ACCOUNT, because merged caches share the same kmem_cache struct and
      hence cannot have different sets of SLAB_* flags.  Thus using this flag
      will probably reduce the number of merged slabs even if kmem accounting
      is not used (only compiled in).
      Signed-off-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Suggested-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      230e9fc2
  19. 22 Nov, 2015 1 commit
    • Jesper Dangaard Brouer's avatar
      slab/slub: adjust kmem_cache_alloc_bulk API · 865762a8
      Jesper Dangaard Brouer authored
      
      
      Adjust kmem_cache_alloc_bulk API before we have any real users.
      
      Adjust API to return type 'int' instead of previously type 'bool'.  This
      is done to allow future extension of the bulk alloc API.
      
      A future extension could be to allow SLUB to stop at a page boundary, when
      specified by a flag, and then return the number of objects.
      
      The advantage of this approach, would make it easier to make bulk alloc
      run without local IRQs disabled.  With an approach of cmpxchg "stealing"
      the entire c->freelist or page->freelist.  To avoid overshooting we would
      stop processing at a slab-page boundary.  Else we always end up returning
      some objects at the cost of another cmpxchg.
      
      To keep compatible with future users of this API linking against an older
      kernel when using the new flag, we need to return the number of allocated
      objects with this API change.
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      865762a8
  20. 06 Nov, 2015 3 commits