1. 16 Nov, 2019 1 commit
  2. 14 Oct, 2019 2 commits
    • mm/slub.c: init_on_free=1 should wipe freelist ptr for bulk allocations · 0f181f9f
      Alexander Potapenko authored
      slab_alloc_node() already zeroed out the freelist pointer if
      init_on_free was on.  Thibaut Sautereau noticed that the same needs to
      be done for kmem_cache_alloc_bulk(), which performs the allocations
      separately.
      
      kmem_cache_alloc_bulk() is currently used in two places in the kernel,
      so this change is unlikely to have a major performance impact.
      
      SLAB doesn't require a similar change, as auto-initialization makes the
      allocator store the freelist pointers off-slab.
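      As a rough, hedged sketch of the idea (the helper name is illustrative and
      the s->offset freelist layout is assumed, not verified against the exact
      upstream hunk): the wipe amounts to re-zeroing the word inside each object
      where SLUB keeps its inline freelist pointer, and the bulk path has to do
      that once per returned object.

        /* Hedged sketch, not the exact upstream code. */
        static __always_inline void maybe_wipe_obj_freeptr(struct kmem_cache *s,
                                                           void *obj)
        {
                /* The freelist pointer lives inside the object at s->offset. */
                if (unlikely(slab_want_init_on_free(s)) && obj)
                        memset((char *)obj + s->offset, 0, sizeof(void *));
        }

        /* ...and in kmem_cache_alloc_bulk(), once per allocated object: */
        for (i = 0; i < size; i++)
                maybe_wipe_obj_freeptr(s, p[i]);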
      
      Link: http://lkml.kernel.org/r/20191007091605.30530-1-glider@google.com
      Fixes: 6471384a ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Reported-by: Thibaut Sautereau <thibaut@sautereau.fr>
      Reported-by: Kees Cook <keescook@chromium.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slub: fix a deadlock in show_slab_objects() · e4f8e513
      Qian Cai authored
      A long time ago we fixed a similar deadlock in show_slab_objects() [1].
      However, apparently due to commits like 01fb58bc ("slab: remove
      synchronous synchronize_sched() from memcg cache deactivation path") and
      03afc0e2 ("slab: get_online_mems for kmem_cache_{create,destroy,shrink}"),
      this kind of deadlock is back: simply reading files in /sys/kernel/slab
      generates the lockdep splat below.
      
      Since "mem_hotplug_lock" here is only taken to obtain a stable online node
      mask while racing with NUMA node hotplug, in the worst case the results
      may be miscalculated during a NUMA node hotplug, but they will be
      corrected by later reads of the same files.
      
        WARNING: possible circular locking dependency detected
        ------------------------------------------------------
        cat/5224 is trying to acquire lock:
        ffff900012ac3120 (mem_hotplug_lock.rw_sem){++++}, at:
        show_slab_objects+0x94/0x3a8
      
        but task is already holding lock:
        b8ff009693eee398 (kn->count#45){++++}, at: kernfs_seq_start+0x44/0xf0
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (kn->count#45){++++}:
               lock_acquire+0x31c/0x360
               __kernfs_remove+0x290/0x490
               kernfs_remove+0x30/0x44
               sysfs_remove_dir+0x70/0x88
               kobject_del+0x50/0xb0
               sysfs_slab_unlink+0x2c/0x38
               shutdown_cache+0xa0/0xf0
               kmemcg_cache_shutdown_fn+0x1c/0x34
               kmemcg_workfn+0x44/0x64
               process_one_work+0x4f4/0x950
               worker_thread+0x390/0x4bc
               kthread+0x1cc/0x1e8
               ret_from_fork+0x10/0x18
      
        -> #1 (slab_mutex){+.+.}:
               lock_acquire+0x31c/0x360
               __mutex_lock_common+0x16c/0xf78
               mutex_lock_nested+0x40/0x50
               memcg_create_kmem_cache+0x38/0x16c
               memcg_kmem_cache_create_func+0x3c/0x70
               process_one_work+0x4f4/0x950
               worker_thread+0x390/0x4bc
               kthread+0x1cc/0x1e8
               ret_from_fork+0x10/0x18
      
        -> #0 (mem_hotplug_lock.rw_sem){++++}:
               validate_chain+0xd10/0x2bcc
               __lock_acquire+0x7f4/0xb8c
               lock_acquire+0x31c/0x360
               get_online_mems+0x54/0x150
               show_slab_objects+0x94/0x3a8
               total_objects_show+0x28/0x34
               slab_attr_show+0x38/0x54
               sysfs_kf_seq_show+0x198/0x2d4
               kernfs_seq_show+0xa4/0xcc
               seq_read+0x30c/0x8a8
               kernfs_fop_read+0xa8/0x314
               __vfs_read+0x88/0x20c
               vfs_read+0xd8/0x10c
               ksys_read+0xb0/0x120
               __arm64_sys_read+0x54/0x88
               el0_svc_handler+0x170/0x240
               el0_svc+0x8/0xc
      
        other info that might help us debug this:
      
        Chain exists of:
          mem_hotplug_lock.rw_sem --> slab_mutex --> kn->count#45
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(kn->count#45);
                                       lock(slab_mutex);
                                       lock(kn->count#45);
          lock(mem_hotplug_lock.rw_sem);
      
         *** DEADLOCK ***
      
        3 locks held by cat/5224:
         #0: 9eff00095b14b2a0 (&p->lock){+.+.}, at: seq_read+0x4c/0x8a8
         #1: 0eff008997041480 (&of->mutex){+.+.}, at: kernfs_seq_start+0x34/0xf0
         #2: b8ff009693eee398 (kn->count#45){++++}, at:
        kernfs_seq_start+0x44/0xf0
      
        stack backtrace:
        Call trace:
         dump_backtrace+0x0/0x248
         show_stack+0x20/0x2c
         dump_stack+0xd0/0x140
         print_circular_bug+0x368/0x380
         check_noncircular+0x248/0x250
         validate_chain+0xd10/0x2bcc
         __lock_acquire+0x7f4/0xb8c
         lock_acquire+0x31c/0x360
         get_online_mems+0x54/0x150
         show_slab_objects+0x94/0x3a8
         total_objects_show+0x28/0x34
         slab_attr_show+0x38/0x54
         sysfs_kf_seq_show+0x198/0x2d4
         kernfs_seq_show+0xa4/0xcc
         seq_read+0x30c/0x8a8
         kernfs_fop_read+0xa8/0x314
         __vfs_read+0x88/0x20c
         vfs_read+0xd8/0x10c
         ksys_read+0xb0/0x120
         __arm64_sys_read+0x54/0x88
         el0_svc_handler+0x170/0x240
         el0_svc+0x8/0xc
      
      I think it is important to mention that this doesn't expose
      show_slab_objects() to use-after-free.  The only path that could really
      race here is the slab hotplug notifier callback __kmem_cache_shrink()
      (via slab_mem_going_offline_callback()), but that path doesn't actually
      destroy kmem_cache_node data structures.
      
      [1] http://lkml.iu.edu/hypermail/linux/kernel/1101.0/02850.html
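      A hedged sketch of the resulting shape of show_slab_objects() (heavily
      simplified; count_objects_on_node() is a hypothetical stand-in for the
      existing counting logic): the get_online_mems()/put_online_mems() pair
      is simply dropped and replaced by a comment.

        static ssize_t show_slab_objects(struct kmem_cache *s, char *buf,
                                         unsigned long flags)
        {
                unsigned long total = 0;
                int node;

                /*
                 * Taking mem_hotplug_lock here would invert against the
                 * existing kn->count -> slab_mutex -> mem_hotplug_lock chain.
                 * A transiently stale online-node mask is acceptable; the
                 * numbers are recomputed on the next read of the file.
                 */
                for_each_online_node(node)
                        total += count_objects_on_node(s, node);

                return sprintf(buf, "%lu\n", total);
        }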
      
      [akpm@linux-foundation.org: add comment explaining why we don't need mem_hotplug_lock]
      Link: http://lkml.kernel.org/r/1570192309-10132-1-git-send-email-cai@lca.pw
      Fixes: 01fb58bc ("slab: remove synchronous synchronize_sched() from memcg cache deactivation path")
      Fixes: 03afc0e2 ("slab: get_online_mems for kmem_cache_{create,destroy,shrink}")
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 07 Oct, 2019 1 commit
    • mm, sl[ou]b: improve memory accounting · 6a486c0a
      Vlastimil Babka authored
      Patch series "guarantee natural alignment for kmalloc()", v2.
      
      This patch (of 2):
      
      SLOB currently doesn't account its pages at all, so in /proc/meminfo the
      Slab field shows zero.  Modifying a counter on page allocation and
      freeing should be acceptable even for the small system scenarios SLOB is
      intended for.  Since reclaimable caches are not separated in SLOB,
      account everything as unreclaimable.
      
      SLUB currently doesn't account kmalloc() and kmalloc_node() allocations
      larger than an order-1 page, which are passed directly to the page
      allocator.  As they also don't appear in /proc/slabinfo, it might look
      like a memory leak.  For consistency, account them as well.  (SLAB
      doesn't actually use the page allocator directly, so there is no change
      there.)
      
      Ideally SLOB and SLUB would be handled in separate patches, but due to
      the shared kmalloc_order() function and different kfree()
      implementations, it's easier to patch both at once to prevent
      inconsistencies.
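      As a hedged illustration of the kind of accounting added (the counter
      name NR_SLAB_UNRECLAIMABLE and the exact hook sites are assumptions, not
      the verified upstream hunk):

        /* On the large-kmalloc allocation side: */
        page = alloc_pages(flags | __GFP_COMP, order);
        if (page)
                mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
                                    1 << order);        /* account on alloc */

        /* ...and symmetrically when such a page is freed: */
        mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
                            -(1 << order));             /* un-account on free */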
      
      Link: http://lkml.kernel.org/r/20190826111627.7505-2-vbabka@suse.cz
      
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 24 Sep, 2019 3 commits
  5. 31 Jul, 2019 1 commit
    • mm: slub: Fix slab walking for init_on_free · 1b7e816f
      Laura Abbott authored
      To properly clear the slab on free with slab_want_init_on_free, we walk
      the list of free objects using get_freepointer/set_freepointer.
      
      The value we get from get_freepointer() may not be valid.  This isn't an
      issue since an actual value will get written later, but it means there's
      a chance of triggering a bug if we use this value with set_freepointer():
      
        kernel BUG at mm/slub.c:306!
        invalid opcode: 0000 [#1] PREEMPT PTI
        CPU: 0 PID: 0 Comm: swapper Not tainted 5.2.0-05754-g6471384a #4
        RIP: 0010:kfree+0x58a/0x5c0
        Code: 48 83 05 78 37 51 02 01 0f 0b 48 83 05 7e 37 51 02 01 48 83 05 7e 37 51 02 01 48 83 05 7e 37 51 02 01 48 83 05 d6 37 51 02 01 <0f> 0b 48 83 05 d4 37 51 02 01 48 83 05 d4 37 51 02 01 48 83 05 d4
        RSP: 0000:ffffffff82603d90 EFLAGS: 00010002
        RAX: ffff8c3976c04320 RBX: ffff8c3976c04300 RCX: 0000000000000000
        RDX: ffff8c3976c04300 RSI: 0000000000000000 RDI: ffff8c3976c04320
        RBP: ffffffff82603db8 R08: 0000000000000000 R09: 0000000000000000
        R10: ffff8c3976c04320 R11: ffffffff8289e1e0 R12: ffffd52cc8db0100
        R13: ffff8c3976c01a00 R14: ffffffff810f10d4 R15: ffff8c3976c04300
        FS:  0000000000000000(0000) GS:ffffffff8266b000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: ffff8c397ffff000 CR3: 0000000125020000 CR4: 00000000000406b0
        Call Trace:
         apply_wqattrs_prepare+0x154/0x280
         apply_workqueue_attrs_locked+0x4e/0xe0
         apply_workqueue_attrs+0x36/0x60
         alloc_workqueue+0x25a/0x6d0
         workqueue_init_early+0x246/0x348
         start_kernel+0x3c7/0x7ec
         x86_64_start_reservations+0x40/0x49
         x86_64_start_kernel+0xda/0xe4
         secondary_startup_64+0xb6/0xc0
        Modules linked in:
        ---[ end trace f67eb9af4d8d492b ]---
      
      Fix this by ensuring the value we set with set_freepointer is either NULL
      or another value in the chain.
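      A hedged sketch of that idea (simplified; head/tail and the helpers are
      assumed from the surrounding free path, and this is not the exact
      upstream hunk): rebuild the freelist explicitly while wiping, so that
      set_freepointer() only ever receives NULL or an object already visited.

        void *object = head, *prev = NULL;

        for (;;) {
                void *next = (object == tail) ? NULL : get_freepointer(s, object);

                memset(object, 0, s->object_size);      /* init_on_free wipe */
                set_freepointer(s, object, prev);       /* NULL or in-chain */
                prev = object;
                if (!next)
                        break;
                object = next;
        }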
      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Signed-off-by: Laura Abbott <labbott@redhat.com>
      Fixes: 6471384a ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 12 Jul, 2019 7 commits
    • mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options · 6471384a
      Alexander Potapenko authored
      Patch series "add init_on_alloc/init_on_free boot options", v10.
      
      Provide init_on_alloc and init_on_free boot options.
      
      These are aimed at preventing possible information leaks and making the
      control-flow bugs that depend on uninitialized values more deterministic.
      
      Enabling either of the options guarantees that the memory returned by the
      page allocator and SL[AU]B is initialized with zeroes.  SLOB allocator
      isn't supported at the moment, as its emulation of kmem caches complicates
      handling of SLAB_TYPESAFE_BY_RCU caches correctly.
      
      Enabling init_on_free also guarantees that pages and heap objects are
      initialized right after they're freed, so it won't be possible to access
      stale data by using a dangling pointer.
      
      As suggested by Michal Hocko, right now we don't let heap users disable
      initialization for certain allocations.  There's not enough evidence that
      doing so can speed up real-life cases, and introducing ways to opt out
      may result in things going out of control.
      
      This patch (of 2):
      
      The new options are needed to prevent possible information leaks and make
      control-flow bugs that depend on uninitialized values more deterministic.
      
      This is expected to be on-by-default on Android and Chrome OS.  And it
      gives the opportunity for anyone else to use it under distros too via the
      boot args.  (The init_on_free feature is regularly requested by folks
      where memory forensics is included in their threat models.)
      
      init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
      objects with zeroes.  Initialization is done at allocation time at the
      places where checks for __GFP_ZERO are performed.
      
      init_on_free=1 makes the kernel initialize freed pages and heap objects
      with zeroes upon their deletion.  This helps to ensure sensitive data
      doesn't leak via use-after-free accesses.
      
      Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
      returns zeroed memory.  The two exceptions are slab caches with
      constructors and SLAB_TYPESAFE_BY_RCU flag.  Those are never
      zero-initialized to preserve their semantics.
      
      Both init_on_alloc and init_on_free default to zero, but those defaults
      can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
      CONFIG_INIT_ON_FREE_DEFAULT_ON.
      
      If either SLUB poisoning or page poisoning is enabled, those options take
      precedence over init_on_alloc and init_on_free: initialization is only
      applied to unpoisoned allocations.
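      A hedged sketch of how such gating is typically wired up (static keys
      assumed; the poisoning precedence checks described above are omitted, and
      these are not necessarily the exact upstream helpers):

        DECLARE_STATIC_KEY_FALSE(init_on_alloc);
        DECLARE_STATIC_KEY_FALSE(init_on_free);

        static inline bool want_init_on_alloc(gfp_t flags)
        {
                if (static_branch_unlikely(&init_on_alloc))
                        return true;
                /* piggy-back on the places that already check __GFP_ZERO */
                return flags & __GFP_ZERO;
        }

        static inline bool want_init_on_free(void)
        {
                return static_branch_unlikely(&init_on_free);
        }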
      
      Slowdown for the new features compared to init_on_free=0, init_on_alloc=0:
      
      hackbench, init_on_free=1:  +7.62% sys time (st.err 0.74%)
      hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)
      
      Linux build with -j12, init_on_free=1:  +8.38% wall time (st.err 0.39%)
      Linux build with -j12, init_on_free=1:  +24.42% sys time (st.err 0.52%)
      Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
      Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)
      
      The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline
      is within the standard error.
      
      The new features are also going to pave the way for hardware memory
      tagging (e.g.  arm64's MTE), which will require both on_alloc and on_free
      hooks to set the tags for heap objects.  With MTE, tagging will have the
      same cost as memory initialization.
      
      Although init_on_free is rather costly, there are paranoid use-cases where
      in-memory data lifetime is desired to be minimized.  There are various
      arguments for/against the realism of the associated threat models, but
      given that we'll need the infrastructure for MTE anyway, and there are
      people who want wipe-on-free behavior no matter what the performance cost,
      it seems reasonable to include it in this series.
      
      [glider@google.com: v8]
        Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com
      [glider@google.com: v9]
        Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com
      [glider@google.com: v10]
        Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com
      Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.com
      
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>		[page and dmapool parts
      Acked-by: James Morris <jamorris@linux.microsoft.com>]
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Sandeep Patil <sspatil@android.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: unify SLAB and SLUB page accounting · 6cea1d56
      Roman Gushchin authored
      Currently the page accounting code is duplicated in the SLAB and SLUB
      internals.  Let's move it into new (un)charge_slab_page helpers in
      slab_common.c.  These helpers are responsible for statistics (global and
      memcg-aware) and memcg charging, so they replace the direct
      memcg_(un)charge_slab() calls.
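      A hedged sketch of what such shared helpers look like (helper names such
      as cache_vmstat_idx() and memcg_charge_slab() are assumptions based on
      the description, not a verified copy of the patch):

        static __always_inline int charge_slab_page(struct page *page, gfp_t gfp,
                                                    int order, struct kmem_cache *s)
        {
                if (is_root_cache(s)) {
                        /* root caches: global vmstat accounting only */
                        mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
                                            1 << order);
                        return 0;
                }
                return memcg_charge_slab(page, gfp, order, s);
        }

        static __always_inline void uncharge_slab_page(struct page *page, int order,
                                                       struct kmem_cache *s)
        {
                if (is_root_cache(s)) {
                        mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
                                            -(1 << order));
                        return;
                }
                memcg_uncharge_slab(page, order, s);
        }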
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-6-guro@fb.com
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: generalize postponed non-root kmem_cache deactivation · 43486694
      Roman Gushchin authored
      Currently SLUB uses a work scheduled after an RCU grace period to
      deactivate a non-root kmem_cache.  This mechanism can be reused for the
      release of kmem_caches, but requires generalization for the SLAB case.

      Introduce a kmemcg_cache_deactivate() function, which calls the
      allocator-specific __kmemcg_cache_deactivate() and schedules execution of
      __kmemcg_cache_deactivate_after_rcu(), with all necessary locks, in a
      worker context after an RCU grace period.
      
      Here is the new calling scheme:
        kmemcg_cache_deactivate()
          __kmemcg_cache_deactivate()                  SLAB/SLUB-specific
          kmemcg_rcufn()                               rcu
            kmemcg_workfn()                            work
              __kmemcg_cache_deactivate_after_rcu()    SLAB/SLUB-specific
      
      instead of:
        __kmemcg_cache_deactivate()                    SLAB/SLUB-specific
          slab_deactivate_memcg_cache_rcu_sched()      SLUB-only
            kmemcg_rcufn()                             rcu
              kmemcg_workfn()                          work
                kmemcg_cache_deact_after_rcu()         SLUB-only
      
      For consistency, all allocator-specific functions start with "__".
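      A hedged, simplified sketch of the generalized entry point (locking and
      the dying-cache checks are omitted; field names are assumptions):

        static void kmemcg_cache_deactivate(struct kmem_cache *s)
        {
                if (WARN_ON_ONCE(is_root_cache(s)))
                        return;

                __kmemcg_cache_deactivate(s);   /* SLAB/SLUB-specific part */

                /* finish after an RCU grace period, in a worker context */
                s->memcg_params.work_fn = __kmemcg_cache_deactivate_after_rcu;
                call_rcu(&s->memcg_params.rcu_head, kmemcg_rcufn);
        }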
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-4-guro@fb.com
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: postpone kmem_cache memcg pointer initialization to memcg_link_cache() · c03914b7
      Roman Gushchin authored
      Patch series "mm: reparent slab memory on cgroup removal", v7.
      
      # Why do we need this?
      
      We've noticed that the number of dying cgroups is steadily growing on most
      of our hosts in production.  The following investigation revealed an issue
      in the userspace memory reclaim code [1], accounting of kernel stacks [2],
      and also the main reason: slab objects.
      
      The underlying problem is quite simple: any page charged to a cgroup holds
      a reference to it, so the cgroup can't be reclaimed unless all charged
      pages are gone.  If a slab object is actively used by other cgroups, it
      won't be reclaimed, and will prevent the origin cgroup from being
      reclaimed.
      
      Slab objects, and first of all the vfs cache, are shared between cgroups
      that use the same underlying fs and, what's even more important, between
      multiple generations of the same workload.  So if something runs
      periodically in a new cgroup each time (as systemd works), we accumulate
      multiple dying cgroups.
      
      Strictly speaking pagecache isn't different here, but there is a key
      distinction: we disable protection and apply some extra pressure on the
      LRUs of dying cgroups, and these LRUs contain all charged pages.  My
      show that with the disabled kernel memory accounting the number of dying
      cgroups stabilizes at a relatively small number (~100, depends on memory
      pressure and cgroup creation rate), and with kernel memory accounting it
      grows pretty steadily up to several thousands.
      
      Memory cgroups are quite complex and big objects (mostly due to percpu
      stats), so it leads to noticeable memory losses.  Memory occupied by dying
      cgroups is measured in hundreds of megabytes.  I've even seen a host with
      more than 100Gb of memory wasted for dying cgroups.  It leads to a
      degradation of performance with the uptime, and generally limits the usage
      of cgroups.
      
      My previous attempt [3] to fix the problem by applying extra pressure on
      slab shrinker lists caused regressions with xfs and ext4, and has been
      reverted [4].  The following attempts to find the right balance [5, 6]
      were not successful.

      So instead of trying to find a maybe non-existing balance, let's reparent
      accounted slab caches to the parent cgroup on cgroup removal.
      
      # Implementation approach
      
      There is however a significant problem with reparenting of slab memory:
      there is no list of charged pages.  Some of them are in shrinker lists,
      but not all.  Introducing a new list is really not an option.
      
      But fortunately there is a way forward: every slab page has a stable
      pointer to the corresponding kmem_cache.  So the idea is to reparent
      kmem_caches instead of slab pages.
      
      It's actually simpler and cheaper, but requires some underlying changes:
      1) Make kmem_caches hold a single reference to the memory cgroup,
         instead of a separate reference per every slab page.
      2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
         page->kmem_cache->memcg indirection instead. It's used only on
         slab page release, so performance overhead shouldn't be a big issue.
      3) Introduce a refcounter for non-root slab caches. It's required to
         be able to destroy kmem_caches when they become empty and release
         the associated memory cgroup.
      
      There is a bonus: currently we release all memcg kmem_caches together
      with the memory cgroup itself.  This patchset allows individual
      kmem_caches to be released as soon as they become inactive and free.
      
      Some additional implementation details are provided in corresponding
      commit messages.
      
      # Results
      
      Below is the average number of dying cgroups on two groups of our
      production hosts.  They do run some sort of web frontend workload, the
      memory pressure is moderate.  As we can see, with the kernel memory
      reparenting the number stabilizes in 60s range; however with the original
      version it grows almost linearly and doesn't show any signs of plateauing.
      The difference in slab and percpu usage between patched and unpatched
      versions also grows linearly.  In 7 days it exceeded 200Mb.
      
      day           0    1    2    3    4    5    6    7
      original     56  362  628  752 1070 1250 1490 1560
      patched      23   46   51   55   60   57   67   69
      mem diff(Mb) 22   74  123  152  164  182  214  241
      
      # Links
      
      [1]: commit 68600f62 ("mm: don't miss the last page because of round-off error")
      [2]: commit 9b6f7e16 ("mm: rework memcg kernel stack accounting")
      [3]: commit 172b06c3 ("mm: slowly shrink slabs with a relatively small number of objects")
      [4]: commit a9a238e8 ("Revert "mm: slowly shrink slabs with a relatively small number of objects"")
      [5]: https://lkml.org/lkml/2019/1/28/1865
      [6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2
      
      This patch (of 10):
      
      Initialize kmem_cache->memcg_params.memcg pointer in memcg_link_cache()
      rather than in init_memcg_params().
      
      Once kmem_cache holds a reference to the memory cgroup, this will
      simplify the refcounting.
      
      For non-root kmem_caches memcg_link_cache() is always called before the
      kmem_cache becomes visible to a user, so it's safe.
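      A hedged sketch of where the assignment ends up (list and field names are
      assumptions based on the description, not a verified copy of the patch):

        void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
        {
                if (is_root_cache(s)) {
                        list_add(&s->root_caches_node, &slab_root_caches);
                } else {
                        /* moved here from init_memcg_params() */
                        s->memcg_params.memcg = memcg;
                        list_add(&s->memcg_params.children_node,
                                 &s->memcg_params.root_cache->memcg_params.children);
                        list_add(&s->memcg_params.kmem_caches_node,
                                 &s->memcg_params.memcg->kmem_caches);
                }
        }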
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-2-guro@fb.com
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slab: refactor common ksize KASAN logic into slab_common.c · 10d1f8cb
      Marco Elver authored
      This refactors common code of ksize() between the various allocators into
      slab_common.c: __ksize() is the allocator-specific implementation without
      instrumentation, whereas ksize() includes the required KASAN logic.
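      A hedged sketch of the resulting split in slab_common.c (not the exact
      upstream code):

        size_t ksize(const void *objp)
        {
                size_t size;

                if (WARN_ON_ONCE(!objp))
                        return 0;

                size = __ksize(objp);   /* allocator-specific, uninstrumented */
                /*
                 * Callers may use the whole allocated area, so unpoison it for
                 * KASAN before reporting the usable size.
                 */
                kasan_unpoison_shadow(objp, size);
                return size;
        }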
      
      Link: http://lkml.kernel.org/r/20190626142014.141844-5-elver@google.com
      
      Signed-off-by: Marco Elver <elver@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slub: don't panic for memcg kmem cache creation failure · cb097cd4
      Shakeel Butt authored
      Currently for CONFIG_SLUB, if the creation of a memcg kmem cache fails
      and the corresponding root kmem cache has the SLAB_PANIC flag, the kernel
      will crash.  This is unnecessary, as the kernel can handle creation
      failures of memcg kmem caches.  Additionally, CONFIG_SLAB does not
      implement this behavior.  So, to keep the behavior consistent between
      SLAB and SLUB, remove the panic for memcg kmem cache creation failures.
      A root kmem cache creation failure with SLAB_PANIC still correctly panics
      for both SLAB and SLUB.
      
      Link: http://lkml.kernel.org/r/20190619232514.58994-1-shakeelb@google.com
      
      Reported-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slub.c: avoid double string traverse in kmem_cache_flags() · 9cf3a8d8
      Yury Norov authored
      If ',' is not found, kmem_cache_flags() calls strlen() to find the end of
      line.  We can do it in a single pass using strchrnul().
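      A hedged before/after sketch (strchrnul() returns a pointer to the first
      ',' or, if there is none, to the terminating NUL):

        /* Before: a second pass via strlen() when no ',' is present. */
        end = strchr(iter, ',');
        if (!end)
                end = iter + strlen(iter);

        /* After: a single pass over the string. */
        end = strchrnul(iter, ',');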
      
      Link: http://lkml.kernel.org/r/20190501053111.7950-1-ynorov@marvell.com
      
      Signed-off-by: Yury Norov <ynorov@marvell.com>
      Acked-by: Aaron Tomlin <atomlin@redhat.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 14 May, 2019 4 commits
  8. 29 Apr, 2019 1 commit
    • mm/slub: Simplify stack trace retrieval · 79716799
      Thomas Gleixner authored
      
      
      Replace the indirection through struct stack_trace with an invocation of
      the storage array based interface.
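      A hedged before/after sketch of the two interfaces as used by the SLUB
      track code (the p->addrs storage array is assumed from context):

        /* Before: indirection through struct stack_trace. */
        struct stack_trace trace = {
                .max_entries    = TRACK_ADDRS_COUNT,
                .entries        = p->addrs,
                .skip           = 3,
        };
        save_stack_trace(&trace);

        /* After: fill the storage array directly. */
        unsigned int nr_entries;

        nr_entries = stack_trace_save(p->addrs, TRACK_ADDRS_COUNT, 3);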
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: linux-mm@kvack.org
      Cc: David Rientjes <rientjes@google.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: kasan-dev@googlegroups.com
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: iommu@lists.linux-foundation.org
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: linux-btrfs@vger.kernel.org
      Cc: dm-devel@redhat.com
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: intel-gfx@lists.freedesktop.org
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: dri-devel@lists.freedesktop.org
      Cc: David Airlie <airlied@linux.ie>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: linux-arch@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190425094801.771410441@linutronix.de
  9. 14 Apr, 2019 1 commit
  10. 29 Mar, 2019 1 commit
    • mm: add support for kmem caches in DMA32 zone · 6d6ea1e9
      Nicolas Boichat authored
      Patch series "iommu/io-pgtable-arm-v7s: Use DMA32 zone for page tables",
      v6.
      
      This is a followup to the discussion in [1], [2].
      
      IOMMUs using ARMv7 short-descriptor format require page tables (level 1
      and 2) to be allocated within the first 4GB of RAM, even on 64-bit
      systems.
      
      For L1 tables that are bigger than a page, we can just use
      __get_free_pages with GFP_DMA32 (on arm64 systems only, arm would still
      use GFP_DMA).
      
      For L2 tables that only take 1KB, it would be a waste to allocate a full
      page, so we considered 3 approaches:
       1. This series, adding support for GFP_DMA32 slab caches.
       2. genalloc, which requires pre-allocating the maximum number of L2 page
          tables (4096, so 4MB of memory).
       3. page_frag, which is not very memory-efficient as it is unable to reuse
          freed fragments until the whole page is freed. [3]
      
      This series is the most memory-efficient approach.
      
      stable@ note:
        We confirmed that this is a regression, and IOMMU errors happen on 4.19
        and linux-next/master on MT8173 (elm, Acer Chromebook R13). The issue
        most likely starts from commit ad67f5a6 ("arm64: replace ZONE_DMA
        with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number of Mediatek
        platforms (and maybe others?).
      
      [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html
      [2] https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031696.html
      [3] https://patchwork.codeaurora.org/patch/671639/
      
      This patch (of 3):
      
      IOMMUs using ARMv7 short-descriptor format require page tables to be
      allocated within the first 4GB of RAM, even on 64-bit systems.  On arm64,
      this is done by passing GFP_DMA32 flag to memory allocation functions.
      
      For IOMMU L2 tables that only take 1KB, it would be a waste to allocate
      a full page using get_free_pages, so we considered 3 approaches:
       1. This patch, adding support for GFP_DMA32 slab caches.
       2. genalloc, which requires pre-allocating the maximum number of L2
          page tables (4096, so 4MB of memory).
       3. page_frag, which is not very memory-efficient as it is unable
          to reuse freed fragments until the whole page is freed.
      
      This change makes it possible to create a custom cache in DMA32 zone using
      kmem_cache_create, then allocate memory using kmem_cache_alloc.
      
      We do not create a DMA32 kmalloc cache array, as there are currently no
      users of kmalloc(..., GFP_DMA32).  These calls will continue to trigger a
      warning, as we keep GFP_DMA32 in GFP_SLAB_BUG_MASK.
      
      This implies that calls to kmem_cache_*alloc on a SLAB_CACHE_DMA32
      kmem_cache must _not_ use GFP_DMA32 (it is anyway redundant and
      unnecessary).
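      A hedged usage sketch (cache name, size, and alignment are illustrative):
      the DMA32 constraint is declared once at cache creation time, and
      allocations then use ordinary GFP flags without GFP_DMA32.

        /* A 1KB, 1KB-aligned cache whose pages come from ZONE_DMA32. */
        struct kmem_cache *l2_tables;

        l2_tables = kmem_cache_create("io-pgtable-l2", SZ_1K, SZ_1K,
                                      SLAB_CACHE_DMA32, NULL);

        /* Note: no GFP_DMA32 here -- it would still trip GFP_SLAB_BUG_MASK. */
        void *table = kmem_cache_zalloc(l2_tables, GFP_KERNEL);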
      
      Link: http://lkml.kernel.org/r/20181210011504.122604-2-drinkcat@chromium.org
      
      Signed-off-by: Nicolas Boichat <drinkcat@chromium.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Sasha Levin <Alexander.Levin@microsoft.com>
      Cc: Huaisheng Ye <yehs1@lenovo.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Yong Wu <yong.wu@mediatek.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Tomasz Figa <tfiga@google.com>
      Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Hsin-Yi Wang <hsinyi@chromium.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 06 Mar, 2019 5 commits
  12. 21 Feb, 2019 7 commits
    • slub: fix a crash with SLUB_DEBUG + KASAN_SW_TAGS · 6373dca1
      Qian Cai authored
      In process_slab(), "p = get_freepointer()" could return a tagged pointer,
      but "addr = page_address()" always returns a native pointer.  As a
      result, slab_index() is messed up here:

          return (p - addr) / s->size;

      All other callers of slab_index() have the same situation where "addr"
      comes from page_address(), so we just need to untag "p".
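      A hedged sketch of the shape of the fix (assuming kasan_reset_tag() as
      the untagging helper):

        static inline unsigned int slab_index(void *p, struct kmem_cache *s,
                                              void *addr)
        {
                /* 'addr' comes from page_address() and is untagged; match 'p' */
                return (kasan_reset_tag(p) - addr) / s->size;
        }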
      
          # cat /sys/kernel/slab/hugetlbfs_inode_cache/alloc_calls
      
          Unable to handle kernel paging request at virtual address 2bff808aa4856d48
          Mem abort info:
            ESR = 0x96000007
            Exception class = DABT (current EL), IL = 32 bits
            SET = 0, FnV = 0
            EA = 0, S1PTW = 0
          Data abort info:
            ISV = 0, ISS = 0x00000007
            CM = 0, WnR = 0
          swapper pgtable: 64k pages, 48-bit VAs, pgdp = 0000000002498338
          [2bff808aa4856d48] pgd=00000097fcfd0003, pud=00000097fcfd0003, pmd=00000097fca30003, pte=00e8008b24850712
          Internal error: Oops: 96000007 [#1] SMP
          CPU: 3 PID: 79210 Comm: read_all Tainted: G             L    5.0.0-rc7+ #84
          Hardware name: HPE Apollo 70             /C01_APACHE_MB         , BIOS L50_5.13_1.0.6 07/10/2018
          pstate: 00400089 (nzcv daIf +PAN -UAO)
          pc : get_map+0x78/0xec
          lr : get_map+0xa0/0xec
          sp : aeff808989e3f8e0
          x29: aeff808989e3f940 x28: ffff800826200000
          x27: ffff100012d47000 x26: 9700000000002500
          x25: 0000000000000001 x24: 52ff8008200131f8
          x23: 52ff8008200130a0 x22: 52ff800820013098
          x21: ffff800826200000 x20: ffff100013172ba0
          x19: 2bff808a8971bc00 x18: ffff1000148f5538
          x17: 000000000000001b x16: 00000000000000ff
          x15: ffff1000148f5000 x14: 00000000000000d2
          x13: 0000000000000001 x12: 0000000000000000
          x11: 0000000020000002 x10: 2bff808aa4856d48
          x9 : 0000020000000000 x8 : 68ff80082620ebb0
          x7 : 0000000000000000 x6 : ffff1000105da1dc
          x5 : 0000000000000000 x4 : 0000000000000000
          x3 : 0000000000000010 x2 : 2bff808a8971bc00
          x1 : ffff7fe002098800 x0 : ffff80082620ceb0
          Process read_all (pid: 79210, stack limit = 0x00000000f65b9361)
          Call trace:
           get_map+0x78/0xec
           process_slab+0x7c/0x47c
           list_locations+0xb0/0x3c8
           alloc_calls_show+0x34/0x40
           slab_attr_show+0x34/0x48
           sysfs_kf_seq_show+0x2e4/0x570
           kernfs_seq_show+0x12c/0x1a0
           seq_read+0x48c/0xf84
           kernfs_fop_read+0xd4/0x448
           __vfs_read+0x94/0x5d4
           vfs_read+0xcc/0x194
           ksys_read+0x6c/0xe8
           __arm64_sys_read+0x68/0xb0
           el0_svc_handler+0x230/0x3bc
           el0_svc+0x8/0xc
          Code: d3467d2a 9ac92329 8b0a0e6a f9800151 (c85f7d4b)
          ---[ end trace a383a9a44ff13176 ]---
          Kernel panic - not syncing: Fatal exception
          SMP: stopping secondary CPUs
          SMP: failed to stop secondary CPUs 1-7,32,40,127
          Kernel Offset: disabled
          CPU features: 0x002,20000c18
          Memory Limit: none
          ---[ end Kernel panic - not syncing: Fatal exception ]---
      
      Link: http://lkml.kernel.org/r/20190220020251.82039-1-cai@lca.pw
      
      Signed-off-by: Qian Cai <cai@lca.pw>
      Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slub: fix SLAB_CONSISTENCY_CHECKS + KASAN_SW_TAGS · 338cfaad
      Qian Cai authored
      Enabling SLUB_DEBUG's SLAB_CONSISTENCY_CHECKS together with KASAN_SW_TAGS
      triggers the endless false positives below during boot, because
      check_valid_pointer() checks tagged pointers, whose addresses are never
      valid within slab pages:
      
        BUG radix_tree_node (Tainted: G    B            ): Freelist Pointer check fails
        -----------------------------------------------------------------------------
      
        INFO: Slab objects=69 used=69 fp=0x          (null) flags=0x7ffffffc000200
        INFO: Object @offset=15060037153926966016 fp=0x
      
        Redzone: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 18 6b 06 00 08 80 ff d0  .........k......
        Object : 18 6b 06 00 08 80 ff d0 00 00 00 00 00 00 00 00  .k..............
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        Redzone: bb bb bb bb bb bb bb bb                          ........
        Padding: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
        CPU: 0 PID: 0 Comm: swapper/0 Tainted: G    B             5.0.0-rc5+ #18
        Call trace:
          dump_backtrace+0x0/0x450
          show_stack+0x20/0x2c
          __dump_stack+0x20/0x28
          dump_stack+0xa0/0xfc
          print_trailer+0x1bc/0x1d0
          object_err+0x40/0x50
          alloc_debug_processing+0xf0/0x19c
          ___slab_alloc+0x554/0x704
          kmem_cache_alloc+0x2f8/0x440
          radix_tree_node_alloc+0x90/0x2fc
          idr_get_free+0x1e8/0x6d0
          idr_alloc_u32+0x11c/0x2a4
          idr_alloc+0x74/0xe0
          worker_pool_assign_id+0x5c/0xbc
          workqueue_init_early+0x49c/0xd50
          start_kernel+0x52c/0xac4
        FIX radix_tree_node: Marking all objects used
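      The shape of the fix, as a hedged sketch (not the verbatim upstream
      hunk): untag the object pointer before doing range/alignment arithmetic
      against page_address(), which always yields a native address.

        static inline int check_valid_pointer(struct kmem_cache *s,
                                              struct page *page, void *object)
        {
                void *base;

                if (!object)
                        return 1;

                base = page_address(page);
                object = kasan_reset_tag(object);       /* compare like with like */
                if (object < base || object >= base + page->objects * s->size ||
                    (object - base) % s->size)
                        return 0;

                return 1;
        }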
      
      Link: http://lkml.kernel.org/r/20190209044128.3290-1-cai@lca.pw
      
      Signed-off-by: Qian Cai <cai@lca.pw>
      Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kasan, slub: fix more conflicts with CONFIG_SLAB_FREELIST_HARDENED · d36a63a9
      Andrey Konovalov authored
      When CONFIG_KASAN_SW_TAGS is enabled, ptr_addr might be tagged.  Normally
      this doesn't cause any issues, as both set_freepointer() and
      get_freepointer() are called with a pointer carrying the same tag.
      However, there are some issues with the CONFIG_SLUB_DEBUG code.  For
      example, when __free_slub() iterates over objects in a cache, it passes
      untagged pointers to check_object().  check_object() in turn calls
      get_freepointer() with an untagged pointer, which causes the freepointer
      to be restored incorrectly.
      
      Add kasan_reset_tag to freelist_ptr(). Also add a detailed comment.
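      A hedged sketch of freelist_ptr() with the added tag reset (close to, but
      not guaranteed to be, the exact upstream code):

        static inline void *freelist_ptr(const struct kmem_cache *s, void *ptr,
                                         unsigned long ptr_addr)
        {
        #ifdef CONFIG_SLAB_FREELIST_HARDENED
                /*
                 * ptr_addr may carry a KASAN tag while SLUB_DEBUG code passes
                 * untagged addresses; strip the tag so the XOR mask is the
                 * same on the store and the load side.
                 */
                return (void *)((unsigned long)ptr ^ s->random ^
                                (unsigned long)kasan_reset_tag((void *)ptr_addr));
        #else
                return ptr;
        #endif
        }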
      
      Link: http://lkml.kernel.org/r/bf858f26ef32eb7bd24c665755b3aee4bc58d0e4.1550103861.git.andreyknvl@google.com
      
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Reported-by: Qian Cai <cai@lca.pw>
      Tested-by: Qian Cai <cai@lca.pw>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kasan, slub: fix conflicts with CONFIG_SLAB_FREELIST_HARDENED · 18e50661
      Andrey Konovalov authored
      CONFIG_SLAB_FREELIST_HARDENED hashes the freelist pointer with the
      address of the object where the pointer gets stored.  With tag-based
      KASAN we don't account for that when building the freelist, as we call
      set_freepointer() with the first argument untagged.  This patch changes
      the code to properly propagate tags throughout the loop.
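      A hedged illustration of the invariant involved (not the upstream diff):
      with freelist hardening, the stored value is ptr XOR s->random XOR the
      storage address, so the address used when writing a link must carry the
      same tag bits as the address used when reading it back.

        unsigned long slot = (unsigned long)object + s->offset; /* tagged address */

        void *stored   = freelist_ptr(s, next,   slot);         /* encode on store */
        void *restored = freelist_ptr(s, stored, slot);         /* decode on load  */
        /*
         * restored == next only because the same 'slot' was used both times;
         * an untagged copy of 'object' in either place breaks the round trip.
         */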
      
      Link: http://lkml.kernel.org/r/3df171559c52201376f246bf7ce3184fe21c1dc7.1549921721.git.andreyknvl@google.com
      
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Reported-by: Qian Cai <cai@lca.pw>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Evgeniy Stepanov <eugenis@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kasan, slub: move kasan_poison_slab hook before page_address · a7101224
      Andrey Konovalov authored
      With tag based KASAN page_address() looks at the page flags to see whether
      the resulting pointer needs to have a tag set.  Since we don't want to set
      a tag when page_address() is called on SLAB pages, we call
      page_kasan_tag_reset() in kasan_poison_slab().  However in allocate_slab()
      page_address() is called before kasan_poison_slab().  Fix it by changing
      the order.
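      A hedged sketch of the reordering in allocate_slab() (surrounding code
      omitted):

        /*
         * Before (problematic): the address is taken while page_address()
         * may still return a tagged pointer.
         */
        start = page_address(page);
        kasan_poison_slab(page);

        /* After: poison first, which also resets the page's KASAN tag ... */
        kasan_poison_slab(page);
        start = page_address(page);     /* ... so this pointer is untagged */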
      
      [andreyknvl@google.com: fix compilation error when CONFIG_SLUB_DEBUG=n]
        Link: http://lkml.kernel.org/r/ac27cc0bbaeb414ed77bcd6671a877cf3546d56e.1550066133.git.andreyknvl@google.com
      Link: http://lkml.kernel.org/r/cd895d627465a3f1c712647072d17f10883be2a1.1549921721.git.andreyknvl@google.com
      
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Evgeniy Stepanov <eugenis@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kmemleak: account for tagged pointers when calculating pointer range · a2f77575
      Andrey Konovalov authored
      kmemleak keeps two global variables, min_addr and max_addr, which store
      the range of valid (encountered by kmemleak) pointer values, which it
      later uses to speed up pointer lookup when scanning blocks.
      
      With tagged pointers this range will get bigger than it needs to be.  This
      patch makes kmemleak untag pointers before saving them to min_addr and
      max_addr and when performing a lookup.
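      A hedged sketch of the idea (variable names assumed):

        /* Strip the tag before updating or checking the scan range. */
        untagged_ptr = (unsigned long)kasan_reset_tag((void *)ptr);
        min_addr = min(min_addr, untagged_ptr);
        max_addr = max(max_addr, untagged_ptr + size);

        /* ...and the lookup side compares untagged values as well: */
        if (untagged_ptr < min_addr || untagged_ptr >= max_addr)
                continue;       /* outside any block kmemleak has recorded */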
      
      Link: http://lkml.kernel.org/r/16e887d442986ab87fe87a755815ad92fa431a5f.1550066133.git.andreyknvl@google.com
      
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Tested-by: Qian Cai <cai@lca.pw>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Evgeniy Stepanov <eugenis@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kasan, kmemleak: pass tagged pointers to kmemleak · 53128245
      Andrey Konovalov authored
      Right now we call kmemleak hooks before assigning tags to pointers in
      KASAN hooks.  As a result, when an object gets allocated, kmemleak sees a
      differently tagged pointer compared to the one it sees when the object
      gets freed.  Fix it by calling KASAN hooks before kmemleak's ones.
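      A hedged, simplified sketch of the resulting ordering in the post-alloc
      hook: KASAN assigns the tag first, so kmemleak records the same tagged
      pointer it will later see on free.

        for (i = 0; i < size; i++) {
                /* may return a (re)tagged pointer */
                p[i] = kasan_slab_alloc(s, p[i], flags);
                /* kmemleak now sees the pointer that will eventually be freed */
                kmemleak_alloc_recursive(p[i], s->object_size, 1, s->flags,
                                         flags);
        }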
      
      Link: http://lkml.kernel.org/r/cd825aa4897b0fc37d3316838993881daccbe9f5.1549921721.git.andreyknvl@google.com
      
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Reported-by: Qian Cai <cai@lca.pw>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Evgeniy Stepanov <eugenis@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 09 Jan, 2019 1 commit
  14. 28 Dec, 2018 5 commits