1. 25 Mar, 2016 2 commits
    • mm, kasan: add GFP flags to KASAN API · 505f5dcb
      Alexander Potapenko authored
      
      
      Add GFP flags to KASAN hooks for future patches to use.
      
      This patch is based on the "mm: kasan: unified support for SLUB and SLAB
      allocators" patch originally prepared by Dmitry Chernenkov.
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      505f5dcb
    • mm, kasan: SLAB support · 7ed2f9e6
      Alexander Potapenko authored
      
      
      Add KASAN hooks to SLAB allocator.
      
      This patch is based on the "mm: kasan: unified support for SLUB and SLAB
      allocators" patch originally prepared by Dmitry Chernenkov.
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7ed2f9e6
  2. 17 Mar, 2016 4 commits
    • mm: convert printk(KERN_<LEVEL> to pr_<level> · 1170532b
      Joe Perches authored
      
      
      Most of the mm subsystem uses pr_<level> so make it consistent.
      
      Miscellanea:
      
       - Realign arguments
       - Add missing newline to format
       - kmemleak-test.c has a "kmemleak: " prefix added to the
         "Kmemleak testing" logging message via pr_fmt
      Signed-off-by: Joe Perches <joe@perches.com>
      Acked-by: Tejun Heo <tj@kernel.org>	[percpu]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1170532b
    • mm: coalesce split strings · 756a025f
      Joe Perches authored
      
      
      Kernel style prefers a single string over split strings when the string is
      'user-visible'.
      
      Miscellanea:
      
       - Add a missing newline
       - Realign arguments
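      
      For example (illustrative only, not a hunk from this commit):
      
        /* before: the user-visible string is split across lines */
        pr_warn("this feature is deprecated and "
                "will be removed in a future release\n");
        /* after: a single string, even if the line exceeds 80 columns,
         * so the message can be grepped for in the source */
        pr_warn("this feature is deprecated and will be removed in a future release\n");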
      Signed-off-by: Joe Perches <joe@perches.com>
      Acked-by: Tejun Heo <tj@kernel.org>	[percpu]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      756a025f
    • mm: thp: set THP defrag by default to madvise and add a stall-free defrag option · 444eb2a4
      Mel Gorman authored
      
      
      THP defrag is enabled by default: on a THP allocation failure it will
      direct reclaim/compact but not wake kswapd.  The problem is that THP
      allocation requests potentially enter reclaim/compaction.  This
      potentially incurs a severe stall that is not guaranteed to be offset by
      reduced TLB misses.  While there has been considerable effort to reduce
      the impact of reclaim/compaction, it is still a high cost and workloads
      that should fit in memory fail to do so.  Specifically, a simple
      anon/file streaming workload will enter direct reclaim, at least on
      NUMA, even though the working set size is 80% of RAM.  It's been years
      and it's time to throw in the towel.
      
      First, this patch defines THP defrag as follows;
      
       madvise: A failed allocation will direct reclaim/compact if the application requests it
       never:   Neither reclaim/compact nor wake kswapd
       defer:   A failed allocation will wake kswapd/kcompactd
       always:  A failed allocation will direct reclaim/compact (historical behaviour)
                khugepaged defrag will enter direct reclaim but not wake kswapd.
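      
      A rough sketch of how these modes translate into a fault-time gfp mask
      (illustrative only; the enum and function names below are assumptions,
      not identifiers from this patch):
      
        /* Sketch only -- names are illustrative, not the kernel's. */
        enum thp_defrag_mode { DEFRAG_ALWAYS, DEFRAG_MADVISE, DEFRAG_DEFER, DEFRAG_NEVER };
      
        static gfp_t thp_fault_gfp(enum thp_defrag_mode mode, bool vma_madvised)
        {
                gfp_t gfp = GFP_TRANSHUGE & ~__GFP_RECLAIM;  /* no stall, no wakeup */
      
                switch (mode) {
                case DEFRAG_ALWAYS:
                        return gfp | __GFP_DIRECT_RECLAIM;   /* historical: may stall */
                case DEFRAG_MADVISE:
                        /* stall only if the application asked for THP on this VMA */
                        return vma_madvised ? gfp | __GFP_DIRECT_RECLAIM : gfp;
                case DEFRAG_DEFER:
                        return gfp | __GFP_KSWAPD_RECLAIM;   /* kick kswapd/kcompactd */
                case DEFRAG_NEVER:
                default:
                        return gfp;                          /* fail fast, no background work */
                }
        }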
      
      Next it sets the default defrag option to be "madvise" to only enter
      direct reclaim/compaction for applications that specifically requested
      it.
      
      Lastly, it removes a check from the page allocator slowpath that is
      related to __GFP_THISNODE to allow "defer" to work.  The callers that
      really care are slub/slab and they are updated accordingly.  The slab
      one may be surprising because it also corrects a comment, as kswapd was
      never woken up by that path.
      
      This means that by default a THP fault will no longer stall for most
      applications, which is the ideal for most users: they get THP if it is
      immediately available.  There are still options for users that prefer a
      stall at startup of a new application, by either restoring historical
      behaviour with "always" or picking a half-way point with "defer" where
      kswapd does some of the work in the background and wakes kcompactd if
      necessary.  THP defrag for khugepaged remains enabled and will enter
      direct reclaim but not wake kswapd or kcompactd.
      
      After this patch a THP allocation failure will quickly fall back and rely
      on khugepaged to recover the situation at some time in the future.  In
      some cases this will reduce THP usage, but the benefit of THP is hard to
      measure and not a universal win, whereas a stall for reclaim/compaction
      is definitely measurable and can be painful.
      
      The first test for this is using "usemem" to read a large file and write
      a large anonymous mapping (to avoid the zero page) multiple times.  The
      total size of the mappings is 80% of RAM and the benchmark simply
      measures how long it takes to complete.  It uses multiple threads to see
      if that is a factor.  On UMA, the performance is almost identical so is
      not reported but on NUMA, we see this
      
      usemem
                                         4.4.0                 4.4.0
                                kcompactd-v1r1         nodefrag-v1r3
      Amean    System-1       102.86 (  0.00%)       46.81 ( 54.50%)
      Amean    System-4        37.85 (  0.00%)       34.02 ( 10.12%)
      Amean    System-7        48.12 (  0.00%)       46.89 (  2.56%)
      Amean    System-12       51.98 (  0.00%)       56.96 ( -9.57%)
      Amean    System-21       80.16 (  0.00%)       79.05 (  1.39%)
      Amean    System-30      110.71 (  0.00%)      107.17 (  3.20%)
      Amean    System-48      127.98 (  0.00%)      124.83 (  2.46%)
      Amean    Elapsd-1       185.84 (  0.00%)      105.51 ( 43.23%)
      Amean    Elapsd-4        26.19 (  0.00%)       25.58 (  2.33%)
      Amean    Elapsd-7        21.65 (  0.00%)       21.62 (  0.16%)
      Amean    Elapsd-12       18.58 (  0.00%)       17.94 (  3.43%)
      Amean    Elapsd-21       17.53 (  0.00%)       16.60 (  5.33%)
      Amean    Elapsd-30       17.45 (  0.00%)       17.13 (  1.84%)
      Amean    Elapsd-48       15.40 (  0.00%)       15.27 (  0.82%)
      
      For a single thread, the benchmark completes 43.23% faster with this
      patch applied, with smaller benefits as the thread count increases.
      Similarly, notice the large reduction in system CPU usage in most cases.
      The overall CPU time is
      
                     4.4.0       4.4.0
              kcompactd-v1r1 nodefrag-v1r3
      User        10357.65    10438.33
      System       3988.88     3543.94
      Elapsed      2203.01     1634.41
      
      Which is substantial. Now, the reclaim figures
      
                                       4.4.0       4.4.0
                          kcompactd-v1r1  nodefrag-v1r3
      Minor Faults                 128458477   278352931
      Major Faults                   2174976         225
      Swap Ins                      16904701           0
      Swap Outs                     17359627           0
      Allocation stalls                43611           0
      DMA allocs                           0           0
      DMA32 allocs                  19832646    19448017
      Normal allocs                614488453   580941839
      Movable allocs                       0           0
      Direct pages scanned          24163800           0
      Kswapd pages scanned                 0           0
      Kswapd pages reclaimed               0           0
      Direct pages reclaimed        20691346           0
      Compaction stalls                42263           0
      Compaction success                 938           0
      Compaction failures              41325           0
      
      This patch eliminates almost all swapping and direct reclaim activity.
      There is still overhead but it's from NUMA balancing which does not
      identify that it's pointless trying to do anything with this workload.
      
      I also tried the thpscale benchmark which forces a corner case where
      compaction can be used heavily and measures the latency of whether base
      or huge pages were used
      
      thpscale Fault Latencies
                                             4.4.0                 4.4.0
                                    kcompactd-v1r1         nodefrag-v1r3
      Amean    fault-base-1      5288.84 (  0.00%)     2817.12 ( 46.73%)
      Amean    fault-base-3      6365.53 (  0.00%)     3499.11 ( 45.03%)
      Amean    fault-base-5      6526.19 (  0.00%)     4363.06 ( 33.15%)
      Amean    fault-base-7      7142.25 (  0.00%)     4858.08 ( 31.98%)
      Amean    fault-base-12    13827.64 (  0.00%)    10292.11 ( 25.57%)
      Amean    fault-base-18    18235.07 (  0.00%)    13788.84 ( 24.38%)
      Amean    fault-base-24    21597.80 (  0.00%)    24388.03 (-12.92%)
      Amean    fault-base-30    26754.15 (  0.00%)    19700.55 ( 26.36%)
      Amean    fault-base-32    26784.94 (  0.00%)    19513.57 ( 27.15%)
      Amean    fault-huge-1      4223.96 (  0.00%)     2178.57 ( 48.42%)
      Amean    fault-huge-3      2194.77 (  0.00%)     2149.74 (  2.05%)
      Amean    fault-huge-5      2569.60 (  0.00%)     2346.95 (  8.66%)
      Amean    fault-huge-7      3612.69 (  0.00%)     2997.70 ( 17.02%)
      Amean    fault-huge-12     3301.75 (  0.00%)     6727.02 (-103.74%)
      Amean    fault-huge-18     6696.47 (  0.00%)     6685.72 (  0.16%)
      Amean    fault-huge-24     8000.72 (  0.00%)     9311.43 (-16.38%)
      Amean    fault-huge-30    13305.55 (  0.00%)     9750.45 ( 26.72%)
      Amean    fault-huge-32     9981.71 (  0.00%)    10316.06 ( -3.35%)
      
      The average time to fault pages is substantially reduced in the majority
      of cases, but with the obvious caveat that fewer THPs are actually used
      in this adverse workload
      
                                         4.4.0                 4.4.0
                                kcompactd-v1r1         nodefrag-v1r3
      Percentage huge-1         0.71 (  0.00%)       14.04 (1865.22%)
      Percentage huge-3        10.77 (  0.00%)       33.05 (206.85%)
      Percentage huge-5        60.39 (  0.00%)       38.51 (-36.23%)
      Percentage huge-7        45.97 (  0.00%)       34.57 (-24.79%)
      Percentage huge-12       68.12 (  0.00%)       40.07 (-41.17%)
      Percentage huge-18       64.93 (  0.00%)       47.82 (-26.35%)
      Percentage huge-24       62.69 (  0.00%)       44.23 (-29.44%)
      Percentage huge-30       43.49 (  0.00%)       55.38 ( 27.34%)
      Percentage huge-32       50.72 (  0.00%)       51.90 (  2.35%)
      
                                       4.4.0       4.4.0
                          kcompactd-v1r1  nodefrag-v1r3
      Minor Faults                  37429143    47564000
      Major Faults                      1916        1558
      Swap Ins                          1466        1079
      Swap Outs                      2936863      149626
      Allocation stalls                62510           3
      DMA allocs                           0           0
      DMA32 allocs                   6566458     6401314
      Normal allocs                216361697   216538171
      Movable allocs                       0           0
      Direct pages scanned          25977580       17998
      Kswapd pages scanned                 0     3638931
      Kswapd pages reclaimed               0      207236
      Direct pages reclaimed         8833714          88
      Compaction stalls               103349           5
      Compaction success                 270           4
      Compaction failures             103079           1
      
      Note again that while this does swap as it's an aggressive workload, the
      direct reclaim activity and allocation stalls are substantially reduced.
      There is some kswapd activity but ftrace showed that the kswapd activity
      was due to normal wakeups from 4K pages being allocated.
      Compaction-related stalls and activity are almost eliminated.
      
      I also tried the stutter benchmark.  For this, I do not have figures for
      NUMA but it's something that does impact UMA so I'll report what is
      available
      
      stutter
                                       4.4.0                 4.4.0
                              kcompactd-v1r1         nodefrag-v1r3
      Min         mmap      7.3571 (  0.00%)      7.3438 (  0.18%)
      1st-qrtle   mmap      7.5278 (  0.00%)     17.9200 (-138.05%)
      2nd-qrtle   mmap      7.6818 (  0.00%)     21.6055 (-181.25%)
      3rd-qrtle   mmap     11.0889 (  0.00%)     21.8881 (-97.39%)
      Max-90%     mmap     27.8978 (  0.00%)     22.1632 ( 20.56%)
      Max-93%     mmap     28.3202 (  0.00%)     22.3044 ( 21.24%)
      Max-95%     mmap     28.5600 (  0.00%)     22.4580 ( 21.37%)
      Max-99%     mmap     29.6032 (  0.00%)     25.5216 ( 13.79%)
      Max         mmap   4109.7289 (  0.00%)   4813.9832 (-17.14%)
      Mean        mmap     12.4474 (  0.00%)     19.3027 (-55.07%)
      
      This benchmark is trying to fault an anonymous mapping while there is a
      heavy IO load -- a scenario that desktop users used to complain about
      frequently.  This shows a mix because the ideal case of mapping with THP
      is not hit as often.  However, note that 99% of the mappings complete
      13.79% faster.  The CPU usage here is particularly interesting
      
                     4.4.0       4.4.0
              kcompactd-v1r1  nodefrag-v1r3
      User           67.50        0.99
      System       1327.88       91.30
      Elapsed      2079.00     2128.98
      
      And once again we look at the reclaim figures
      
                                       4.4.0       4.4.0
                          kcompactd-v1r1  nodefrag-v1r3
      Minor Faults                 335241922  1314582827
      Major Faults                       715         819
      Swap Ins                             0           0
      Swap Outs                            0           0
      Allocation stalls               532723           0
      DMA allocs                           0           0
      DMA32 allocs                1822364341  1177950222
      Normal allocs               1815640808  1517844854
      Movable allocs                       0           0
      Direct pages scanned          21892772           0
      Kswapd pages scanned          20015890    41879484
      Kswapd pages reclaimed        19961986    41822072
      Direct pages reclaimed        21892741           0
      Compaction stalls              1065755           0
      Compaction success                 514           0
      Compaction failures            1065241           0
      
      Allocation stalls and all direct reclaim activity are eliminated, as are
      compaction-related stalls.
      
      THP gives impressive gains in some cases but only if they are quickly
      available.  We're not going to reach the point where they are completely
      free, so let's finally take the costs out of the fast paths and defer the
      cost to kswapd, kcompactd and khugepaged where it belongs.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      444eb2a4
    • mm: memcontrol: report slab usage in cgroup2 memory.stat · 27ee57c9
      Vladimir Davydov authored
      
      
      Show how much memory is used for storing reclaimable and unreclaimable
      in-kernel data structures allocated from slab caches.
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      27ee57c9
  3. 15 Mar, 2016 26 commits
    • mm, sl[au]b: print gfp_flags as strings in slab_out_of_memory() · 5b3810e5
      Vlastimil Babka authored
      
      
      We can now print gfp_flags in a more human-readable way.  Make use of
      this in slab_out_of_memory() for SLUB and SLAB.  Also convert the SLAB
      variant to pr_warn() along the way.
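      
      The key enabler is the printk "%pGg" extension, which prints a gfp mask
      symbolically.  An illustrative call (the message text is made up, not
      the actual warning emitted by slab_out_of_memory()):
      
        gfp_t gfpflags = GFP_KERNEL | __GFP_NOWARN;
      
        /* prints e.g. "... gfp=GFP_KERNEL|__GFP_NOWARN" instead of a raw hex mask */
        pr_warn("example: allocation failed, gfp=%#x(%pGg)\n",
                gfpflags, &gfpflags);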
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b3810e5
    • mm/slab: re-implement pfmemalloc support · f68f8ddd
      Joonsoo Kim authored
      
      
      Current implementation of pfmemalloc handling in SLAB has some problems.
      
      1) pfmemalloc_active is set to true when there is just one or more
         pfmemalloc slabs in the system, but it is cleared when there is no
         pfmemalloc slab in one arbitrary kmem_cache.  So, pfmemalloc_active
         could be wrongly cleared.
      
      2) The partial and free lists are not searched when a non-pfmemalloc
         object is not found in the cpu cache.  Instead, a new slab is
         allocated, which is not optimal.
      
      3) Even after sk_memalloc_socks() is disabled, the cpu cache would keep
         pfmemalloc objects tagged with SLAB_OBJ_PFMEMALLOC.  The tag isn't
         cleared when sk_memalloc_socks() is disabled, so it could cause
         problems.
      
      4) If the cpu cache is filled with pfmemalloc objects, it would slow
         down non-pfmemalloc allocations.
      
      To me, the current pointer tagging approach looks complex and fragile,
      so this patch re-implements the whole thing instead of fixing the
      problems one by one.
      
      The design principles for the new implementation are:
      
      1) Don't disrupt non-pfmemalloc allocation in the fast path even if
         sk_memalloc_socks() is enabled.  It is a more likely case than
         pfmemalloc allocation.
      
      2) Ensure that pfmemalloc slab is used only for pfmemalloc allocation.
      
      3) Don't consider performance of pfmemalloc allocation in memory
         deficiency state.
      
      As a result, all pfmemalloc alloc/free in a memory-tight state will be
      handled in the slow path.  If there is a non-pfmemalloc free object, it
      will be returned first even for a pfmemalloc user in the fast path, so
      that the performance of pfmemalloc users isn't affected in the normal
      case and pfmemalloc objects will be kept as long as possible.
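      
      A purely conceptual sketch of the stated policy (the helper names here
      are invented for illustration and are not the patch's actual code):
      
        /* Fast path hands out ordinary objects to everyone, including
         * pfmemalloc users; pfmemalloc slabs are only touched in the slow
         * path, and only for callers allowed to use reserves. */
        void *alloc_object(struct kmem_cache *c, gfp_t flags)
        {
                void *obj = pop_cpu_cache(c);        /* never pfmemalloc-backed */
      
                if (obj)
                        return obj;
      
                return slow_path_alloc(c, flags, gfp_pfmemalloc_allowed(flags));
        }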
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Tested-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f68f8ddd
    • mm/slab: avoid returning values by reference · 70f75067
      Joonsoo Kim authored
      
      
      Returning values by reference is bad practice.  Instead, just use the
      function return value.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Suggested-by: Christoph Lameter <cl@linux.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      70f75067
    • mm/slab: introduce new slab management type, OBJFREELIST_SLAB · b03a017b
      Joonsoo Kim authored
      
      
      SLAB needs an array to manage freed objects in a slab.  It is only used
      if some objects are freed, so we can use a free object itself as this
      array.  This requires an additional branch in a somewhat critical lock
      path to check whether it is the first freed object or not, but that's
      all we need.  The benefit is that we can save the extra memory usage and
      reduce some of the computational overhead incurred by allocating a
      management array when a new slab is created.
      
      The code change is more complex than one would expect from the idea, in
      order to handle the debugging features efficiently.  If you want to see
      the core idea only, please skip the '#if DEBUG' blocks in the patch.
      
      Although this idea can apply to all caches whose size is larger than the
      management array size, it isn't applied to caches which have a
      constructor.  If such a cache's object were used for the management
      array, the constructor would have to be called for it before that object
      is returned to the user.  I guess that the overhead would overwhelm the
      benefit in that case, so this idea isn't applied to them, at least for
      now.
      
      In summary, from now on, the slab management type is determined by the
      following logic:
      
      1) if the management array size is smaller than the object size and
         there is no ctor, it becomes OBJFREELIST_SLAB.
      
      2) if the management array size is smaller than the leftover, it becomes
         NORMAL_SLAB, which uses the leftover as the array.
      
      3) if OFF_SLAB saves more memory than option 4), it becomes OFF_SLAB.
         It allocates the management array from another cache, so some memory
         is wasted.
      
      4) otherwise it becomes NORMAL_SLAB.  It uses dedicated internal memory
         in the slab as the management array, so it wastes memory.
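      
      A conceptual sketch of that selection order (the enum and helper names
      are illustrative, not the patch's actual identifiers):
      
        enum slab_mgmt { OBJFREELIST_SLAB, NORMAL_SLAB_LEFTOVER, OFF_SLAB_TYPE, NORMAL_SLAB };
      
        static enum slab_mgmt pick_freelist_type(size_t freelist_size, size_t obj_size,
                                                 size_t leftover, bool has_ctor,
                                                 bool off_slab_saves_memory)
        {
                if (freelist_size < obj_size && !has_ctor)
                        return OBJFREELIST_SLAB;     /* reuse a freed object */
                if (freelist_size <= leftover)
                        return NORMAL_SLAB_LEFTOVER; /* array fits in leftover */
                if (off_slab_saves_memory)
                        return OFF_SLAB_TYPE;        /* array from another cache */
                return NORMAL_SLAB;                  /* dedicated space in slab */
        }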
      
      In my system, without CONFIG_DEBUG_SLAB enabled, almost all caches become
      OBJFREELIST_SLAB or NORMAL_SLAB (using the leftover), which don't waste
      memory.  The following is the number of caches with each slab management
      type.
      
      TOTAL = OBJFREELIST + NORMAL(leftover) + NORMAL + OFF
      
      /Before/
      126 = 0 + 60 + 25 + 41
      
      /After/
      126 = 97 + 12 + 15 + 2
      
      The result shows that the number of caches that don't waste memory
      increases from 60 to 109.
      
      I did some benchmarking and it looks like the benefits outweigh the
      losses.
      
      Kmalloc: Repeatedly allocate then free test
      
      /Before/
      [    0.286809] 1. Kmalloc: Repeatedly allocate then free test
      [    1.143674] 100000 times kmalloc(32) -> 116 cycles kfree -> 78 cycles
      [    1.441726] 100000 times kmalloc(64) -> 121 cycles kfree -> 80 cycles
      [    1.815734] 100000 times kmalloc(128) -> 168 cycles kfree -> 85 cycles
      [    2.380709] 100000 times kmalloc(256) -> 287 cycles kfree -> 95 cycles
      [    3.101153] 100000 times kmalloc(512) -> 370 cycles kfree -> 117 cycles
      [    3.942432] 100000 times kmalloc(1024) -> 413 cycles kfree -> 156 cycles
      [    5.227396] 100000 times kmalloc(2048) -> 622 cycles kfree -> 248 cycles
      [    7.519793] 100000 times kmalloc(4096) -> 1102 cycles kfree -> 452 cycles
      
      /After/
      [    1.205313] 100000 times kmalloc(32) -> 117 cycles kfree -> 78 cycles
      [    1.510526] 100000 times kmalloc(64) -> 124 cycles kfree -> 81 cycles
      [    1.827382] 100000 times kmalloc(128) -> 130 cycles kfree -> 84 cycles
      [    2.226073] 100000 times kmalloc(256) -> 177 cycles kfree -> 92 cycles
      [    2.814747] 100000 times kmalloc(512) -> 286 cycles kfree -> 112 cycles
      [    3.532952] 100000 times kmalloc(1024) -> 344 cycles kfree -> 141 cycles
      [    4.608777] 100000 times kmalloc(2048) -> 519 cycles kfree -> 210 cycles
      [    6.350105] 100000 times kmalloc(4096) -> 789 cycles kfree -> 391 cycles
      
      In fact, I tested another idea: implementing OBJFREELIST_SLAB with an
      extendable linked array threaded through other freed objects.  It can
      remove the memory waste completely, but it causes more computational
      overhead in the critical lock path and it seems that the overhead
      outweighs the benefit.  So, this patch doesn't include it.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b03a017b
    • mm/slab: factor out debugging initialization in cache_init_objs() · 10b2e9e8
      Joonsoo Kim authored
      
      
      cache_init_objs() will be changed in a following patch and its current
      form doesn't fit well for that change.  So, before doing it, this patch
      separates out the debugging initialization.  This causes two loop
      iterations when debugging is enabled, but this overhead seems light
      compared to the debug feature itself, so the effect may not be visible.
      This patch will greatly simplify the changes to cache_init_objs() in the
      following patch.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      10b2e9e8
    • mm/slab: factor out slab list fixup code · d8410234
      Joonsoo Kim authored
      
      
      The slab list should be fixed up after an object is detached from the
      slab, and this happens at two places that do exactly the same thing.
      They will be changed in the following patch, so, to reduce code
      duplication, this patch factors them out into a common function.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d8410234
    • mm/slab: make criteria for off slab determination robust and simple · 3217fd9b
      Joonsoo Kim authored
      
      
      To become an off-slab cache, there are some constraints to avoid a
      bootstrapping problem and a recursive call.  These can be avoided
      differently by simply checking that the corresponding kmalloc cache is
      ready and that it is not itself an off-slab cache.  This is more robust
      because the static size checks can be affected by cache size changes or
      the architecture type, but the dynamic check isn't.
      
      One check, 'freelist_cache->size > cachep->size / 2', is added to verify
      the benefit of choosing off-slab, because there is now no size constraint
      which ensures enough advantage when selecting off-slab.
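      
      A sketch of the resulting criteria, assuming helpers that exist in
      mm/slab.c (conceptual only, not the exact patch):
      
        static bool freelist_fits_off_slab(struct kmem_cache *cachep,
                                           size_t freelist_size)
        {
                struct kmem_cache *freelist_cache = kmalloc_slab(freelist_size, 0u);
      
                if (ZERO_OR_NULL_PTR(freelist_cache))
                        return false;   /* kmalloc cache not ready (bootstrap) */
                if (OFF_SLAB(freelist_cache))
                        return false;   /* avoid recursing into another off-slab */
                if (freelist_cache->size > cachep->size / 2)
                        return false;   /* not enough benefit from off-slab */
                return true;
        }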
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3217fd9b
    • mm/slab: do not change cache size if debug pagealloc isn't possible · f3a3c320
      Joonsoo Kim authored
      
      
      We can fail to set up an off-slab cache in some conditions.  Even in
      this case, debug pagealloc increases the cache size to PAGE_SIZE in
      advance, which is wasteful because debug pagealloc cannot work for the
      cache if it isn't off-slab.  To improve this situation, this patch first
      checks whether the cache with the increased size is suitable for
      off-slab.  It only actually increases the cache size when it is suitable
      for off-slab, so the possible waste is removed.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f3a3c320
    • mm/slab: clean up cache type determination · 158e319b
      Joonsoo Kim authored
      
      
      The current cache type determination code is open-coded and hard to
      understand.  A following patch will introduce one more cache type, which
      would make the code even more complex.  So, before that happens, this
      patch abstracts this code.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      158e319b
    • mm/slab: align cache size first before determination of OFF_SLAB candidate · 832a15d2
      Joonsoo Kim authored
      
      
      Finding a suitable OFF_SLAB candidate depends on the aligned cache size
      rather than the original size.  The same reasoning applies to the debug
      pagealloc candidate.  So, this patch moves the alignment fixup up to the
      proper position.  From that point on, the size is aligned, so we can
      remove some later alignment fixups.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      832a15d2
    • mm/slab: put the freelist at the end of slab page · 2e6b3602
      Joonsoo Kim authored
      
      
      Currently, the freelist is at the front of slab page.  This requires
      extra space to meet object alignment requirement.  If we put the
      freelist at the end of a slab page, objects could start at page boundary
      and will be at correct alignment.  This is possible because freelist has
      no alignment constraint itself.
      
      This gives us two benefits: it removes the extra memory needed for
      freelist alignment and removes a complex calculation at the cache
      initialization step.  I can't think of a notable drawback here.
      
      I mentioned that this would reduce extra memory space, but, this benefit
      is rather theoretical because it can be applied to very few cases.
      Following is the example cache type that can get benefit from this
      change.
      
        size align num before after
          32    8  124  4100  4092
          64    8   63  4103  4095
          88    8   46  4102  4094
         272    8   15  4103  4095
         408    8   10  4098  4090
          32   16  124  4108  4092
          64   16   63  4111  4095
          32   32  124  4124  4092
          64   32   63  4127  4095
          96   32   42  4106  4074
      
      'before' means the total size of the objects plus the aligned freelist
      before applying the patch, and 'after' shows the result of this patch.
      
      Where 'before' is more than 4096, the number of objects would have to
      decrease and memory would be wasted.
      
      Anyway, this patch removes a complex calculation, so it looks beneficial
      to me.
      
      [akpm@linux-foundation.org: fix kerneldoc]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2e6b3602
    • mm/slab: remove object status buffer for DEBUG_SLAB_LEAK · 249247b6
      Joonsoo Kim authored
      
      
      Now, we don't use the object status buffer in any setup.  Remove it.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      249247b6
    • mm/slab: alternative implementation for DEBUG_SLAB_LEAK · d31676df
      Joonsoo Kim authored
      
      
      DEBUG_SLAB_LEAK is a debug option.  Its current implementation requires
      a status buffer, so we need more memory to use it.  And it makes the
      kmem_cache initialization step more complex.
      
      To remove this extra memory usage and to simplify the initialization
      step, this patch implements the feature in another way.
      
      When the user requests slab object owner information, we mark that
      collection has started.  Then all free objects in the caches are flushed
      to their corresponding slab pages.  Now we can distinguish all freed
      objects, and therefore all allocated objects, too.  After collecting the
      owner information for the allocated objects, the mark is checked to make
      sure no free happened during the processing.  If so, we can be sure the
      information is correct and it is returned to the user.
      
      Although this way is rather complex, it has the two important benefits
      mentioned above.  So, I think it is worth changing.
      
      There is one drawback: it takes more time to get the slab object owner
      information, but it is just a debug option so it doesn't matter at all.
      
      To help review, this patch implements only the new way.  The following
      patch will remove the now-useless code.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d31676df
    • mm/slab: clean up DEBUG_PAGEALLOC processing code · 40b44137
      Joonsoo Kim authored
      
      
      Currently, open-coded checks for a DEBUG_PAGEALLOC cache are spread
      across several sites.  This makes the code unreadable and hard to change.
      
      This patch cleans up this code.  The following patch will change the
      criteria for a DEBUG_PAGEALLOC cache, so this clean-up will help it, too.
      
      [akpm@linux-foundation.org: fix build with CONFIG_DEBUG_PAGEALLOC=n]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      40b44137
    • mm/slab: use more appropriate condition check for debug_pagealloc · 40323278
      Joonsoo Kim authored
      
      
      debug_pagealloc debugging is related to the SLAB_POISON flag rather than
      the FORCED_DEBUG option, although the FORCED_DEBUG option will enable
      SLAB_POISON.  Fix it.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      40323278
    • mm/slab: activate debug_pagealloc in SLAB when it is actually enabled · a307ebd4
      Joonsoo Kim authored
      
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a307ebd4
    • mm/slab: remove the checks for slab implementation bug · 260b61dd
      Joonsoo Kim authored
      
      
      Some of "#if DEBUG" are for reporting slab implementation bug rather
      than user usecase bug.  It's not really needed because slab is stable
      for a quite long time and it makes code too dirty.  This patch remove
      it.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      260b61dd
    • mm/slab: remove useless structure define · 6fb92430
      Joonsoo Kim authored
      
      
      It is obsolete so remove it.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6fb92430
    • mm/slab: fix stale code comment · 12c61fe9
      Joonsoo Kim authored
      
      
      This patchset implements a new way of managing freed objects, namely
      OBJFREELIST_SLAB.  Its purpose is to reduce memory overhead in SLAB.
      
      SLAB needs an array to manage freed objects in a slab.  If there is
      leftover space after objects are packed into a slab, we can use it as
      the management array, and in this case there is no memory waste.  But in
      the other cases, we need to allocate extra memory for the management
      array or use dedicated internal memory in the slab for it.  Both cases
      cause memory waste, so it's not good.
      
      With this patchset, a freed object itself can be used as the management
      array, so memory waste can be reduced.  The detailed idea and numbers are
      described in the last patch's commit description.  Please refer to it.
      
      In fact, I tested another idea: implementing OBJFREELIST_SLAB with an
      extendable linked array threaded through other freed objects.  It can
      remove the memory waste completely, but it causes more computational
      overhead in the critical lock path and it seems that the overhead
      outweighs the benefit.  So, this patchset doesn't include it.  I will
      attach the prototype just for reference.
      
      This patch (of 16):
      
      We use the freelist_idx_t type for free object management, whose size
      can be smaller than that of unsigned int.  Fix the stale comment
      accordingly.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      12c61fe9
    • mm: new API kfree_bulk() for SLAB+SLUB allocators · ca257195
      Jesper Dangaard Brouer authored
      
      
      This patch introduces a new API call, kfree_bulk(), for bulk freeing of
      memory objects not bound to a single kmem_cache.
      
      Christoph pointed out that it is possible to implement freeing of
      objects without knowing the kmem_cache pointer, as that information is
      available from the object's page->slab_cache, and proposed removing the
      kmem_cache argument from the bulk free API.
      
      Jesper demonstrated that these extra steps per object come at a
      performance cost.  It is only when CONFIG_MEMCG_KMEM is compiled in and
      activated at runtime that these steps are done anyhow.  The extra cost is
      most visible for the SLAB allocator, because the SLUB allocator does the
      page lookup (virt_to_head_page()) anyhow.
      
      Thus, the conclusion was to keep the kmem_cache bulk free API with a
      kmem_cache pointer, but we can still implement a kfree_bulk() API fairly
      easily, simply by handling the case where kmem_cache_free_bulk() is
      called with a NULL kmem_cache pointer.
      
      This does increase the code size a bit, but implementing a separate
      kfree_bulk() call would likely increase the code size even more.
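      
      A minimal sketch of the resulting API shape, assuming the wrapper simply
      forwards to kmem_cache_free_bulk() with a NULL cache as described above
      (illustrative, not necessarily the exact kernel code):
      
        static __always_inline void kfree_bulk(size_t size, void **p)
        {
                /* NULL cache: each object's cache is looked up via its page */
                kmem_cache_free_bulk(NULL, size, p);
        }
      
      A caller can then free a batch of kmalloc'd pointers in one call, e.g.
      kfree_bulk(nr, objects);.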
      
      Below benchmarks cost of alloc+free (obj size 256 bytes) on CPU i7-4790K
      @ 4.00GHz, no PREEMPT and CONFIG_MEMCG_KMEM=y.
      
      Code size increase for SLAB:
      
       add/remove: 0/0 grow/shrink: 1/0 up/down: 74/0 (74)
       function                                     old     new   delta
       kmem_cache_free_bulk                         660     734     +74
      
      SLAB fastpath: 87 cycles(tsc) 21.814
        sz - fallback             - kmem_cache_free_bulk - kfree_bulk
         1 - 103 cycles 25.878 ns -  41 cycles 10.498 ns - 81 cycles 20.312 ns
         2 -  94 cycles 23.673 ns -  26 cycles  6.682 ns - 42 cycles 10.649 ns
         3 -  92 cycles 23.181 ns -  21 cycles  5.325 ns - 39 cycles 9.950 ns
         4 -  90 cycles 22.727 ns -  18 cycles  4.673 ns - 26 cycles 6.693 ns
         8 -  89 cycles 22.270 ns -  14 cycles  3.664 ns - 23 cycles 5.835 ns
        16 -  88 cycles 22.038 ns -  14 cycles  3.503 ns - 22 cycles 5.543 ns
        30 -  89 cycles 22.284 ns -  13 cycles  3.310 ns - 20 cycles 5.197 ns
        32 -  88 cycles 22.249 ns -  13 cycles  3.420 ns - 20 cycles 5.166 ns
        34 -  88 cycles 22.224 ns -  14 cycles  3.643 ns - 20 cycles 5.170 ns
        48 -  88 cycles 22.088 ns -  14 cycles  3.507 ns - 20 cycles 5.203 ns
        64 -  88 cycles 22.063 ns -  13 cycles  3.428 ns - 20 cycles 5.152 ns
       128 -  89 cycles 22.483 ns -  15 cycles  3.891 ns - 23 cycles 5.885 ns
       158 -  89 cycles 22.381 ns -  15 cycles  3.779 ns - 22 cycles 5.548 ns
       250 -  91 cycles 22.798 ns -  16 cycles  4.152 ns - 23 cycles 5.967 ns
      
      SLAB when enabling MEMCG_KMEM runtime:
       - kmemcg fastpath: 130 cycles(tsc) 32.684 ns (step:0)
       1 - 148 cycles 37.220 ns -  66 cycles 16.622 ns - 66 cycles 16.583 ns
       2 - 141 cycles 35.510 ns -  51 cycles 12.820 ns - 58 cycles 14.625 ns
       3 - 140 cycles 35.017 ns -  37 cycles 9.326 ns - 33 cycles 8.474 ns
       4 - 137 cycles 34.507 ns -  31 cycles 7.888 ns - 33 cycles 8.300 ns
       8 - 140 cycles 35.069 ns -  25 cycles 6.461 ns - 25 cycles 6.436 ns
       16 - 138 cycles 34.542 ns -  23 cycles 5.945 ns - 22 cycles 5.670 ns
       30 - 136 cycles 34.227 ns -  22 cycles 5.502 ns - 22 cycles 5.587 ns
       32 - 136 cycles 34.253 ns -  21 cycles 5.475 ns - 21 cycles 5.324 ns
       34 - 136 cycles 34.254 ns -  21 cycles 5.448 ns - 20 cycles 5.194 ns
       48 - 136 cycles 34.075 ns -  21 cycles 5.458 ns - 21 cycles 5.367 ns
       64 - 135 cycles 33.994 ns -  21 cycles 5.350 ns - 21 cycles 5.259 ns
       128 - 137 cycles 34.446 ns -  23 cycles 5.816 ns - 22 cycles 5.688 ns
       158 - 137 cycles 34.379 ns -  22 cycles 5.727 ns - 22 cycles 5.602 ns
       250 - 138 cycles 34.755 ns -  24 cycles 6.093 ns - 23 cycles 5.986 ns
      
      Code size increase for SLUB:
       function                                     old     new   delta
       kmem_cache_free_bulk                         717     799     +82
      
      SLUB benchmark:
       SLUB fastpath: 46 cycles(tsc) 11.691 ns (step:0)
        sz - fallback             - kmem_cache_free_bulk - kfree_bulk
         1 -  61 cycles 15.486 ns -  53 cycles 13.364 ns - 57 cycles 14.464 ns
         2 -  54 cycles 13.703 ns -  32 cycles  8.110 ns - 33 cycles 8.482 ns
         3 -  53 cycles 13.272 ns -  25 cycles  6.362 ns - 27 cycles 6.947 ns
         4 -  51 cycles 12.994 ns -  24 cycles  6.087 ns - 24 cycles 6.078 ns
         8 -  50 cycles 12.576 ns -  21 cycles  5.354 ns - 22 cycles 5.513 ns
        16 -  49 cycles 12.368 ns -  20 cycles  5.054 ns - 20 cycles 5.042 ns
        30 -  49 cycles 12.273 ns -  18 cycles  4.748 ns - 19 cycles 4.758 ns
        32 -  49 cycles 12.401 ns -  19 cycles  4.821 ns - 19 cycles 4.810 ns
        34 -  98 cycles 24.519 ns -  24 cycles  6.154 ns - 24 cycles 6.157 ns
        48 -  83 cycles 20.833 ns -  21 cycles  5.446 ns - 21 cycles 5.429 ns
        64 -  75 cycles 18.891 ns -  20 cycles  5.247 ns - 20 cycles 5.238 ns
       128 -  93 cycles 23.271 ns -  27 cycles  6.856 ns - 27 cycles 6.823 ns
       158 - 102 cycles 25.581 ns -  30 cycles  7.714 ns - 30 cycles 7.695 ns
       250 - 107 cycles 26.917 ns -  38 cycles  9.514 ns - 38 cycles 9.506 ns
      
      SLUB when enabling MEMCG_KMEM runtime:
       - kmemcg fastpath: 71 cycles(tsc) 17.897 ns (step:0)
       1 - 85 cycles 21.484 ns -  78 cycles 19.569 ns - 75 cycles 18.938 ns
       2 - 81 cycles 20.363 ns -  45 cycles 11.258 ns - 44 cycles 11.076 ns
       3 - 78 cycles 19.709 ns -  33 cycles 8.354 ns - 32 cycles 8.044 ns
       4 - 77 cycles 19.430 ns -  28 cycles 7.216 ns - 28 cycles 7.003 ns
       8 - 101 cycles 25.288 ns -  23 cycles 5.849 ns - 23 cycles 5.787 ns
       16 - 76 cycles 19.148 ns -  20 cycles 5.162 ns - 20 cycles 5.081 ns
       30 - 76 cycles 19.067 ns -  19 cycles 4.868 ns - 19 cycles 4.821 ns
       32 - 76 cycles 19.052 ns -  19 cycles 4.857 ns - 19 cycles 4.815 ns
       34 - 121 cycles 30.291 ns -  25 cycles 6.333 ns - 25 cycles 6.268 ns
       48 - 108 cycles 27.111 ns -  21 cycles 5.498 ns - 21 cycles 5.458 ns
       64 - 100 cycles 25.164 ns -  20 cycles 5.242 ns - 20 cycles 5.229 ns
       128 - 155 cycles 38.976 ns -  27 cycles 6.886 ns - 27 cycles 6.892 ns
       158 - 132 cycles 33.034 ns -  30 cycles 7.711 ns - 30 cycles 7.728 ns
       250 - 130 cycles 32.612 ns -  38 cycles 9.560 ns - 38 cycles 9.549 ns
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ca257195
    • slab: implement bulk free in SLAB allocator · e6cdb58d
      Jesper Dangaard Brouer authored
      This patch implements the free side of the bulk API for the SLAB
      allocator, kmem_cache_free_bulk(), and concludes the implementation of
      the optimized bulk API for the SLAB allocator.
      
      Benchmarked[1] cost of alloc+free (obj size 256 bytes) on CPU i7-4790K @
      4.00GHz, with no debug options, no PREEMPT and CONFIG_MEMCG_KMEM=y but
      no active user of kmemcg.
      
      SLAB single alloc+free cost: 87 cycles(tsc) 21.814 ns with this
      optimized config.
      
      bulk- Current fallback          - optimized SLAB bulk
        1 - 102 cycles(tsc) 25.747 ns - 41 cycles(tsc) 10.490 ns - improved 59.8%
        2 -  94 cycles(tsc) 23.546 ns - 26 cycles(tsc)  6.567 ns - improved 72.3%
        3 -  92 cycles(tsc) 23.127 ns - 20 cycles(tsc)  5.244 ns - improved 78.3%
        4 -  90 cycles(tsc) 22.663 ns - 18 cycles(tsc)  4.588 ns - improved 80.0%
        8 -  88 cycles(tsc) 22.242 ns - 14 cycles(tsc)  3.656 ns - improved 84.1%
       16 -  88 cycles(tsc) 22.010 ns - 13 cycles(tsc)  3.480 ns - improved 85.2%
       30 -  89 cycles(tsc) 22.305 ns - 13 cycles(tsc)  3.303 ns - improved 85.4%
       32 -  89 cycles(tsc) 22.277 ns - 13 cycles(tsc)  3.309 ns - improved 85.4%
       34 -  88 cycles(tsc) 22.246 ns - 13 cycles(tsc)  3.294 ns - improved 85.2%
       48 -  88 cycles(tsc) 22.121 ns - 13 cycles(tsc)  3.492 ns - improved 85.2%
       64 -  88 cycles(tsc) 22.052 ns - 13 cycles(tsc)  3.411 ns - improved 85.2%
      128 -  89 cycles(tsc) 22.452 ns - 15 cycles(tsc)  3.841 ns - improved 83.1%
      158 -  89 cycles(tsc) 22.403 ns - 14 cycles(tsc)  3.746 ns - improved 84.3%
      250 -  91 cycles(tsc) 22.775 ns - 16 cycles(tsc)  4.111 ns - improved 82.4%
      
      Notice that it is not recommended to do very large bulk operations with
      this bulk API, because local IRQs are disabled for the whole period.
      
      [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c
      
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e6cdb58d
    • slab: avoid running debug SLAB code with IRQs disabled for alloc_bulk · 7b0501dd
      Jesper Dangaard Brouer authored
      
      
      Move the call to cache_alloc_debugcheck_after() outside the IRQ disabled
      section in kmem_cache_alloc_bulk().
      
      When CONFIG_DEBUG_SLAB is disabled the compiler should remove this code.
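      
      A conceptual sketch of the restructuring (helper names other than
      cache_alloc_debugcheck_after() are simplified for illustration):
      
        local_irq_disable();
        for (i = 0; i < size; i++)
                p[i] = do_cache_alloc(s, flags);          /* hot loop, IRQs off */
        local_irq_enable();
      
        /* Debug checks now run with IRQs enabled; with CONFIG_DEBUG_SLAB=n
         * this loop compiles away. */
        for (i = 0; i < size; i++)
                p[i] = cache_alloc_debugcheck_after(s, flags, p[i], _RET_IP_);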
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7b0501dd
    • slab: implement bulk alloc in SLAB allocator · 2a777eac
      Jesper Dangaard Brouer authored
      
      
      This patch implements the alloc side of the bulk API for the SLAB
      allocator.
      
      Further optimizations are still possible by changing the call to
      __do_cache_alloc() into something that can return multiple objects.
      That optimization is left for later, given that the end results already
      show speedups in the area of 80%.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2a777eac
    • slab: use slab_post_alloc_hook in SLAB allocator shared with SLUB · d5e3ed66
      Jesper Dangaard Brouer authored
      
      
      Reviewers notice that the order in slab_post_alloc_hook() of
      kmemcheck_slab_alloc() and kmemleak_alloc_recursive() gets swapped
      compared to slab.c / SLAB allocator.
      
      Also notice memset now occurs before calling kmemcheck_slab_alloc() and
      kmemleak_alloc_recursive().
      
      I assume this reordering of kmemcheck, kmemleak and memset is okay
      because this is the order they are used by the SLUB allocator.
      
      This patch completes the sharing of alloc_hook's between SLUB and SLAB.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d5e3ed66
    • slab: use slab_pre_alloc_hook in SLAB allocator shared with SLUB · 011eceaf
      Jesper Dangaard Brouer authored
      
      
      Deduplicate code in SLAB allocator functions slab_alloc() and
      slab_alloc_node() by using the slab_pre_alloc_hook() call, which is now
      shared between SLUB and SLAB.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      011eceaf
    • Jesper Dangaard Brouer's avatar
      mm: fault-inject take over bootstrap kmem_cache check · fab9963a
      Jesper Dangaard Brouer authored
      
      
      Remove the SLAB-specific function slab_should_failslab() by moving the
      fault-injection check for the bootstrap slab into the shared function
      should_failslab() (used by both SLAB and SLUB).
      
      This is a step towards sharing alloc_hooks between SLUB and SLAB.
      
      The bootstrap slab "kmem_cache" is the cache the allocator itself uses
      for allocating struct kmem_cache objects.
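      
      With the check moved, the shared helper ends up roughly like this
      (sketch only; the real mm/failslab.c also honours __GFP_NOFAIL and the
      failslab filter knobs):
      
        bool should_failslab(struct kmem_cache *s, gfp_t gfpflags)
        {
                /* Never fault-inject the bootstrap cache that backs
                 * struct kmem_cache allocations themselves.
                 */
                if (unlikely(s == kmem_cache))
                        return false;
        
                return should_fail(&failslab.attr, s->object_size);
        }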
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fab9963a
  4. 19 Feb, 2016 1 commit
    • Dmitry Safonov's avatar
      mm: slab: free kmem_cache_node after destroy sysfs file · 52b4b950
      Dmitry Safonov authored
      When slub_debug's alloc_calls_show is enabled, we try to track the
      location and user of each slab object on every online node, so the
      kmem_cache_node structures and cpu_cache/cpu_slab shouldn't be freed
      until the last reference to the sysfs file is dropped.
      
      This fixes the following panic:
      
         BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
         IP:  list_locations+0x169/0x4e0
         PGD 257304067 PUD 438456067 PMD 0
         Oops: 0000 [#1] SMP
         CPU: 3 PID: 973074 Comm: cat ve: 0 Not tainted 3.10.0-229.7.2.ovz.9.30-00007-japdoll-dirty #2 9.30
         Hardware name: DEPO Computers To Be Filled By O.E.M./H67DE3, BIOS L1.60c 07/14/2011
         task: ffff88042a5dc5b0 ti: ffff88037f8d8000 task.ti: ffff88037f8d8000
         RIP: list_locations+0x169/0x4e0
         Call Trace:
           alloc_calls_show+0x1d/0x30
           slab_attr_show+0x1b/0x30
           sysfs_read_file+0x9a/0x1a0
           vfs_read+0x9c/0x170
           SyS_read+0x58/0xb0
           system_call_fastpath+0x16/0x1b
         Code: 5e 07 12 00 b9 00 04 00 00 3d 00 04 00 00 0f 4f c1 3d 00 04 00 00 89 45 b0 0f 84 c3 00 00 00 48 63 45 b0 49 8b 9c c4 f8 00 00 00 <48> 8b 43 20 48 85 c0 74 b6 48 89 df e8 46 37 44 00 48 8b 53 10
         CR2: 0000000000000020
      
      Separate __kmem_cache_release from __kmem_cache_shutdown; the former is
      now called from slab_kmem_cache_release (after the last reference to the
      sysfs file object has been dropped).
      
      Reintroduce locking in free_partial, as the sysfs file might access the
      cache's partial list after shutdown - a partial revert of commit
      69cb8e6b ("slub: free slabs without holding locks").  Zap
      __remove_partial and use remove_partial (without underscores), as
      free_partial now takes list_lock, which is a partial revert of commit
      1e4dd946 ("slub: do not assert not having lock in removing freed
      partial").
      Signed-off-by: default avatarDmitry Safonov <dsafonov@virtuozzo.com>
      Suggested-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      52b4b950
  5. 15 Jan, 2016 3 commits
  6. 22 Nov, 2015 1 commit
    • Jesper Dangaard Brouer's avatar
      slab/slub: adjust kmem_cache_alloc_bulk API · 865762a8
      Jesper Dangaard Brouer authored
      
      
      Adjust kmem_cache_alloc_bulk API before we have any real users.
      
      Adjust the API to return type 'int' instead of the previous type 'bool'.
      This is done to allow future extension of the bulk alloc API.
      
      A future extension could be to allow SLUB to stop at a page boundary, when
      specified by a flag, and then return the number of objects.
      
      The advantage of this approach is that it would make it easier to run
      bulk alloc without local IRQs disabled, using cmpxchg to "steal" the
      entire c->freelist or page->freelist.  To avoid overshooting we would
      stop processing at a slab-page boundary; otherwise we would always end
      up returning some objects at the cost of another cmpxchg.
      
      To stay compatible with future users of this API that link against an
      older kernel while using the new flag, we need to return the number of
      allocated objects as part of this API change.
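      
      The user-visible change is just the prototype and the return convention
      (sketch):
      
        /*
         * Returns the number of objects placed in p[]: currently either
         * 'size' (everything allocated) or 0 (complete failure, nothing
         * to free).  The return type used to be bool.
         */
        int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
                                  size_t size, void **p);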
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      865762a8
  7. 07 Nov, 2015 2 commits
    • Kirill A. Shutemov's avatar
      slab, slub: use page->rcu_head instead of page->lru plus cast · bc4f610d
      Kirill A. Shutemov authored
      
      
      We now have a properly typed page->rcu_head, so there is no need to cast
      page->lru.
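      
      That is, roughly this kind of change (free_slab_rcu() is a made-up name
      for illustration; kmem_rcu_free() stands for the existing RCU callback):
      
        static void free_slab_rcu(struct page *page)
        {
                /* Before: head = (void *)&page->lru;
                 *         call_rcu(head, kmem_rcu_free);
                 * After: use the properly typed union member directly.
                 */
                call_rcu(&page->rcu_head, kmem_rcu_free);
        }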
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bc4f610d
    • Mel Gorman's avatar
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Mel Gorman authored
      
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access to one of two watermarks lower than "min", which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimistic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM identifies
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT, as was done historically, can now trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and were depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
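      
      For reference, the new helper boils down to a single flag test (sketch
      of its definition, not the verbatim gfp.h text), which is why open-coded
      __GFP_WAIT checks can now mislead:
      
        /* Blocking is allowed iff the caller may enter direct reclaim. */
        static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
        {
                return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
        }
      
      Callers that used to test for __GFP_WAIT should switch to this helper
      (or, as noted above, to __GFP_DIRECT_RECLAIM where flag manipulation
      makes the intent clearer).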
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d0164adc
  8. 06 Nov, 2015 1 commit
    • Vladimir Davydov's avatar
      memcg: unify slab and other kmem pages charging · f3ccb2c4
      Vladimir Davydov authored
      
      
      We have memcg_kmem_charge and memcg_kmem_uncharge methods for charging and
      uncharging kmem pages to memcg, but currently they are not used for
      charging slab pages (i.e.  they are only used for charging pages allocated
      with alloc_kmem_pages).  The only reason why the slab subsystem uses
      special helpers, memcg_charge_slab and memcg_uncharge_slab, is that it
      needs to charge to the memcg of the kmem cache, while memcg_charge_kmem charges
      to the memcg that the current task belongs to.
      
      To remove this diversity, this patch adds an extra argument to
      __memcg_kmem_charge that can be a pointer to a memcg or NULL.  If it is
      not NULL, the function tries to charge to the memcg it points to,
      otherwise it charges to the current context.  Next, it makes the slab
      subsystem use this function to charge slab pages.
      
      Since memcg_charge_kmem and memcg_uncharge_kmem helpers are now used only
      in __memcg_kmem_charge and __memcg_kmem_uncharge, they are inlined.  Since
      __memcg_kmem_charge stores a pointer to the memcg in the page struct, we
      don't need memcg_uncharge_slab anymore and can use free_kmem_pages.
      Besides, one can now detect which memcg a slab page belongs to by reading
      /proc/kpagecgroup.
      
      Note, this patch switches slab to a charge-after-alloc design.  Since this
      design is already used for all other memcg charges, it should not make any
      difference.
      
      [hannes@cmpxchg.org: better to have an outer function than a magic parameter for the memcg lookup]
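      
      Following the bracketed note above, a sketch of the resulting shape: an
      inner helper charges an explicit memcg (used by the slab code), and an
      outer function resolves the current task's memcg.  This is simplified;
      the function names are approximate and the real code also checks
      whether kmem accounting is active for the memcg:
      
        int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
                                      struct mem_cgroup *memcg);
        
        /* Charge to the memcg of the current task. */
        int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
        {
                struct mem_cgroup *memcg;
                int ret;
        
                memcg = get_mem_cgroup_from_mm(current->mm);
                ret = __memcg_kmem_charge_memcg(page, gfp, order, memcg);
                css_put(&memcg->css);
                return ret;
        }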
      Signed-off-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f3ccb2c4