    Revert "slub: move synchronize_sched out of slab_mutex on shrink"
    Patch series "slab: make memcg slab destruction scalable", v3.
    
    With kmem cgroup support enabled, kmem_caches can be created and
    destroyed frequently and a great number of near empty kmem_caches can
    accumulate if there are a lot of transient cgroups and the system is not
    under memory pressure.  When memory reclaim starts under such
    conditions, it can lead to consecutive deactivation and destruction of
    many kmem_caches, easily hundreds of thousands on moderately large
    systems, exposing scalability issues in the current slab management
    code.
    
    I've seen machines which end up with hundreds of thousands of caches and
    many millions of kernfs_nodes.  The current code is O(N^2) on the total
    number of caches and has synchronous rcu_barrier() and
    synchronize_sched() in cgroup offline / release path which is executed
    while holding cgroup_mutex.  Combined, this leads to very expensive and
    slow cache destruction operations which can easily keep running for half
    a day.
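
    To make the cost concrete, the problematic pattern has roughly the shape
    sketched below.  This is an illustration only, not the actual slab code:
    "dying_caches", "big_lock" and destroy_cache() are placeholders; the point
    is one synchronous grace-period wait (plus an rcu_barrier() for
    SLAB_DESTROY_BY_RCU caches) per cache, all serialized behind one mutex.

      #include <linux/list.h>
      #include <linux/mutex.h>
      #include <linux/rcupdate.h>
      #include <linux/slab.h>

      /* stand-in for the real cache teardown, details omitted */
      static void destroy_cache(struct kmem_cache *s)
      {
      }

      static void release_dying_caches(struct list_head *dying_caches,
                                       struct mutex *big_lock)
      {
              struct kmem_cache *s, *tmp;

              mutex_lock(big_lock);
              list_for_each_entry_safe(s, tmp, dying_caches, list) {
                      synchronize_sched();    /* blocks for a full grace period */
                      rcu_barrier();          /* waits for pending RCU callbacks */
                      destroy_cache(s);
              }
              mutex_unlock(big_lock);
      }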
    
    This also messes up /proc/slabinfo along with other cache iterating
    operations.  seq_file operates on 4k chunks and on each 4k boundary
    tries to seek to the last position in the list.  With a huge number of
    caches on the list, this becomes very slow and very prone to the list
    content changing underneath it leading to a lot of missing and/or
    duplicate entries.
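
    For reference, the slabinfo iterator is roughly the sketch below
    (simplified; slab_mutex and slab_caches here stand in for the slab-internal
    symbols of the same names, and the statistics printing is omitted).
    seq_list_start() linearly skips *pos entries from the list head on every
    ->start() call, which is where both the per-chunk cost and the sensitivity
    to concurrent list changes come from.

      #include <linux/list.h>
      #include <linux/mutex.h>
      #include <linux/seq_file.h>

      /* local stand-ins for the slab-internal slab_mutex / slab_caches */
      static DEFINE_MUTEX(slab_mutex);
      static LIST_HEAD(slab_caches);

      static void *slab_seq_start(struct seq_file *m, loff_t *pos)
      {
              /* each new 4k output chunk re-walks the list to position *pos */
              mutex_lock(&slab_mutex);
              return seq_list_start(&slab_caches, *pos);
      }

      static void *slab_seq_next(struct seq_file *m, void *p, loff_t *pos)
      {
              return seq_list_next(p, &slab_caches, pos);
      }

      static void slab_seq_stop(struct seq_file *m, void *p)
      {
              mutex_unlock(&slab_mutex);
      }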
    
    This patchset addresses the scalability problem.
    
    * Add root and per-memcg lists.  Update each user to use the
      appropriate list (see the sketch after this list).
    
    * Make rcu_barrier() for SLAB_DESTROY_BY_RCU caches globally batched
      and asynchronous.
    
    * For dying empty slub caches, remove the sysfs files after
      deactivation so that we don't end up with millions of sysfs files
      without any useful information on them.
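
    For orientation, the list linkage the series ends up with looks roughly
    like the sketch below.  Field names are illustrative (the authoritative
    layout is in patches 0004-0006); the idea is that root caches, the
    children of a root cache, and the caches of a memcg can each be walked
    without scanning every cache in the system.

      #include <linux/list.h>

      struct kmem_cache;
      struct mem_cgroup;

      struct memcg_cache_params {
              struct kmem_cache *root_cache;
              union {
                      struct {        /* valid for root caches */
                              struct list_head __root_caches_node;
                              struct list_head children;
                      };
                      struct {        /* valid for per-memcg child caches */
                              struct mem_cgroup *memcg;
                              struct list_head children_node;
                              struct list_head kmem_caches_node;
                      };
              };
      };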
    
    This patchset contains the following ten patches.
    
     0001-Revert-slub-move-synchronize_sched-out-of-slab_mutex.patch
     0002-slub-separate-out-sysfs_slab_release-from-sysfs_slab.patch
     0003-slab-remove-synchronous-rcu_barrier-call-in-memcg-ca.patch
     0004-slab-reorganize-memcg_cache_params.patch
     0005-slab-link-memcg-kmem_caches-on-their-associated-memo.patch
     0006-slab-implement-slab_root_caches-list.patch
     0007-slab-introduce-__kmemcg_cache_deactivate.patch
     0008-slab-remove-synchronous-synchronize_sched-from-memcg.patch
     0009-slab-remove-slub-sysfs-interface-files-early-for-emp.patch
     0010-slab-use-memcg_kmem_cache_wq-for-slab-destruction-op.patch
    
    0001 reverts an existing optimization to prepare for the following
    changes.  0002 is a prep patch.  0003 makes rcu_barrier() in release
    path batched and asynchronous.  0004-0006 separate out the lists.
    0007-0008 replace synchronize_sched() in slub destruction path with
    call_rcu_sched().  0009 removes sysfs files early for empty dying
    caches.  0010 makes destruction work items use a workqueue with limited
    concurrency.
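
    As an illustration of the last point, a workqueue capped at one in-flight
    work item can be set up as below.  This is a generic sketch with made-up
    names, not the exact hunk from patch 0010.

      #include <linux/errno.h>
      #include <linux/workqueue.h>

      static struct workqueue_struct *cache_destroy_wq;

      static int cache_destroy_wq_init(void)
      {
              /*
               * max_active == 1: at most one destruction work item runs
               * at a time, so a burst of dying cgroups cannot flood the
               * system with concurrent cache-destruction work.
               */
              cache_destroy_wq = alloc_workqueue("cache_destroy", 0, 1);
              return cache_destroy_wq ? 0 : -ENOMEM;
      }

      /* destruction paths then use queue_work(cache_destroy_wq, &work) */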
    
    This patch (of 10):
    
    Revert 89e364db ("slub: move synchronize_sched out of slab_mutex on
    shrink").
    
    With kmem cgroup support enabled, kmem_caches can be created and destroyed
    frequently and a great number of near empty kmem_caches can accumulate if
    there are a lot of transient cgroups and the system is not under memory
    pressure.  When memory reclaim starts under such conditions, it can lead
    to consecutive deactivation and destruction of many kmem_caches, easily
    hundreds of thousands on moderately large systems, exposing scalability
    issues in the current slab management code.  This is one of the patches to
    address the issue.
    
    Moving synchronize_sched() out of slab_mutex isn't enough as it's still
    inside cgroup_mutex.  The whole deactivation / release path will be
    updated to avoid all synchronous RCU operations.  Revert this insufficient
    optimization in preparation for the broader rework that follows.
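
    For context, the direction the rest of the series takes is to replace the
    inline wait with call_rcu_sched() plus a work item, along the lines of the
    placeholder sketch below.  The struct and function names are made up for
    illustration; the real hooks land in patches 0007-0008.

      #include <linux/kernel.h>
      #include <linux/rcupdate.h>
      #include <linux/slab.h>
      #include <linux/workqueue.h>

      /* illustrative carrier for one deferred deactivation */
      struct deact_request {
              struct kmem_cache *s;
              struct rcu_head rcu;
              struct work_struct work;
      };

      static void deact_workfn(struct work_struct *work)
      {
              struct deact_request *req =
                      container_of(work, struct deact_request, work);

              /* workqueue context: the sleeping part is safe here */
              kmem_cache_shrink(req->s);
              kfree(req);
      }

      static void deact_rcufn(struct rcu_head *rcu)
      {
              struct deact_request *req =
                      container_of(rcu, struct deact_request, rcu);

              /* RCU callback context cannot sleep: punt to a workqueue */
              INIT_WORK(&req->work, deact_workfn);
              schedule_work(&req->work);
      }

      static void deactivate_cache_async(struct deact_request *req)
      {
              /* returns immediately instead of blocking for a grace period */
              call_rcu_sched(&req->rcu, deact_rcufn);
      }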
    
    Link: http://lkml.kernel.org/r/20170117235411.9408-2-tj@kernel.org
    
    
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: Jay Vana <jsvana@fb.com>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>