    mm: memcg/slab: postpone kmem_cache memcg pointer initialization to memcg_link_cache() · c03914b7
    Roman Gushchin authored
    Patch series "mm: reparent slab memory on cgroup removal", v7.
    
    # Why do we need this?
    
    We've noticed that the number of dying cgroups is steadily growing on most
    of our hosts in production.  The following investigation revealed an issue
    in the userspace memory reclaim code [1], accounting of kernel stacks [2],
    and also the main reason: slab objects.
    
    The underlying problem is quite simple: any page charged to a cgroup holds
    a reference to it, so the cgroup can't be reclaimed unless all charged
    pages are gone.  If a slab object is actively used by other cgroups, it
    won't be reclaimed, and will prevent the origin cgroup from being
    reclaimed.
    
    Slab objects, and first of all the vfs cache, are shared between cgroups
    that use the same underlying fs and, more importantly, between multiple
    generations of the same workload.  So if something runs periodically,
    each time in a new cgroup (as systemd does), we accumulate multiple
    dying cgroups.
    
    Strictly speaking pagecache isn't different here, but there is a key
    difference: we disable protection and apply some extra pressure on LRUs of
    dying cgroups, and these LRUs contain all charged pages.  My experiments
    show that with kernel memory accounting disabled the number of dying
    cgroups stabilizes at a relatively small number (~100, depending on
    memory pressure and cgroup creation rate), while with kernel memory
    accounting enabled it grows steadily to several thousand.
    
    Memory cgroups are quite complex and big objects (mostly due to percpu
    stats), so this leads to noticeable memory losses.  Memory occupied by
    dying cgroups is measured in hundreds of megabytes.  I've even seen a
    host with more than 100Gb of memory wasted on dying cgroups.  This
    degrades performance as uptime grows, and generally limits the usage
    of cgroups.
    
    My previous attempt [3] to fix the problem by applying extra pressure on
    slab shrinker lists caused regressions with xfs and ext4, and has been
    reverted [4].  The subsequent attempts to find the right balance [5, 6]
    were not successful.
    
    So instead of trying to find a balance that may not exist, let's
    reparent accounted slab caches to the parent cgroup on cgroup removal.
    
    # Implementation approach
    
    There is, however, a significant problem with reparenting slab memory:
    there is no list of charged pages.  Some of them are in shrinker lists,
    but not all.  Introducing a new list is really not an option.
    
    But fortunately there is a way forward: every slab page has a stable
    pointer to the corresponding kmem_cache.  So the idea is to reparent
    kmem_caches instead of slab pages.
    
    It's actually simpler and cheaper, but requires some underlying changes:
    1) Make kmem_caches hold a single reference to the memory cgroup,
       instead of a separate reference per slab page.
    2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
       page->kmem_cache->memcg indirection instead. It's used only on
       slab page release, so performance overhead shouldn't be a big issue.
    3) Introduce a refcounter for non-root slab caches. It's required to
       be able to destroy kmem_caches when they become empty and release
       the associated memory cgroup.
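
    The three changes above can be sketched as a small userspace model.
    This is only an illustration under stated assumptions: the types and
    helpers below (struct memcg, struct cache, cache_reparent() and so on)
    are hypothetical and are not the kernel's real definitions, and plain
    integer counters stand in for whatever refcounting primitive the
    series actually uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model types; names are illustrative, not kernel APIs. */
struct memcg {
    struct memcg *parent;
    int refs;               /* references pinning this cgroup */
};

struct cache {
    struct memcg *memcg;    /* (1) the single reference held by the cache */
    int refs;               /* (3) one "active" ref plus one per slab page */
};

struct slab_page {
    struct cache *cache;    /* stable pointer; no per-page memcg pointer */
};

static void cache_init(struct cache *c, struct memcg *m)
{
    c->memcg = m;
    m->refs++;              /* one cgroup reference, however many pages follow */
    c->refs = 1;            /* the "active" reference */
}

static void page_alloc(struct slab_page *p, struct cache *c)
{
    p->cache = c;
    c->refs++;              /* the page pins the cache, not the cgroup */
}

/* (2) Resolve the owning cgroup through the cache indirection. */
static struct memcg *page_memcg(const struct slab_page *p)
{
    return p->cache->memcg;
}

/* Drop one cache reference; the last one releases the cgroup reference. */
static void cache_put(struct cache *c)
{
    if (--c->refs == 0) {
        c->memcg->refs--;
        c->memcg = NULL;
    }
}

static void page_free(struct slab_page *p)
{
    struct cache *c = p->cache;

    p->cache = NULL;
    cache_put(c);
}

/*
 * On cgroup removal: repoint the cache at the parent and drop the
 * "active" reference. No slab page is touched. The caller is assumed
 * to pass a cache whose cgroup has a parent (a non-root cgroup).
 */
static void cache_reparent(struct cache *c)
{
    struct memcg *old = c->memcg;
    struct memcg *parent = old->parent;

    parent->refs++;
    c->memcg = parent;
    old->refs--;            /* the dying cgroup is no longer pinned */
    cache_put(c);
}
```

    In this model the dying cgroup's reference count drops to zero as soon
    as its caches are reparented, even while slab pages are still in use,
    and each cache goes away on its own once its last page is freed.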
    
    There is a bonus: currently all memcg kmem_caches are released together
    with the memory cgroup itself.  This patchset allows individual
    kmem_caches to be released as soon as they become inactive and free.
    
    Some additional implementation details are provided in corresponding
    commit messages.
    
    # Results
    
    Below is the average number of dying cgroups on two groups of our
    production hosts.  They run some sort of web frontend workload, and the
    memory pressure is moderate.  As we can see, with kernel memory
    reparenting the number stabilizes in the 60s range, whereas with the
    original version it grows almost linearly and shows no sign of
    plateauing.  The difference in slab and percpu usage between patched
    and unpatched versions also grows linearly.  In 7 days it exceeded 200Mb.
    
    day           0    1    2    3    4    5    6    7
    original     56  362  628  752 1070 1250 1490 1560
    patched      23   46   51   55   60   57   67   69
    mem diff(Mb) 22   74  123  152  164  182  214  241
    
    # Links
    
    [1]: commit 68600f62 ("mm: don't miss the last page because of round-off error")
    [2]: commit 9b6f7e16 ("mm: rework memcg kernel stack accounting")
    [3]: commit 172b06c3 ("mm: slowly shrink slabs with a relatively small number of objects")
    [4]: commit a9a238e8 ("Revert "mm: slowly shrink slabs with a relatively small number of objects"")
    [5]: https://lkml.org/lkml/2019/1/28/1865
    [6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2
    
    This patch (of 10):
    
    Initialize kmem_cache->memcg_params.memcg pointer in memcg_link_cache()
    rather than in init_memcg_params().
    
    Once kmem_cache holds a reference to the memory cgroup, this will
    simplify the refcounting.
    
    For non-root kmem_caches memcg_link_cache() is always called before the
    kmem_cache becomes visible to a user, so it's safe.
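
    The effect of the move can be sketched with a simplified userspace
    model.  This is not the actual diff: the model_* functions and *_model
    types below are hypothetical stand-ins for init_memcg_params(),
    memcg_link_cache() and the kernel structures, and the list handling
    done by the real functions is reduced to a single flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures named in the text. */
struct kmem_cache_model;

struct mem_cgroup_model {
    int id;
};

struct memcg_params_model {
    struct kmem_cache_model *root_cache;  /* NULL for a root cache */
    struct mem_cgroup_model *memcg;       /* owner of a non-root cache */
};

struct kmem_cache_model {
    struct memcg_params_model memcg_params;
    int linked;                           /* 1 once visible to users */
};

/* Models init_memcg_params(): the memcg pointer is no longer set here. */
static void model_init_memcg_params(struct kmem_cache_model *s,
                                    struct kmem_cache_model *root_cache)
{
    s->memcg_params.root_cache = root_cache;
    s->memcg_params.memcg = NULL;
    s->linked = 0;
}

/*
 * Models memcg_link_cache(): for a non-root cache, the memcg pointer is
 * set at link time, before the cache is marked as visible.
 */
static void model_memcg_link_cache(struct kmem_cache_model *s,
                                   struct mem_cgroup_model *memcg)
{
    if (s->memcg_params.root_cache != NULL)
        s->memcg_params.memcg = memcg;
    s->linked = 1;
}
```

    Because the pointer is assigned in the same step that publishes the
    cache, no user of a linked non-root cache can observe it unset in this
    model.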
    
    Link: http://lkml.kernel.org/r/20190611231813.3148843-2-guro@fb.com
    
    
    Signed-off-by: Roman Gushchin <guro@fb.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Waiman Long <longman@redhat.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Andrei Vagin <avagin@gmail.com>
    Cc: Qian Cai <cai@lca.pw>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>