    mm: thp: set THP defrag by default to madvise and add a stall-free defrag option (commit 444eb2a4)
    Authored by Mel Gorman
    
    
    THP defrag is currently enabled by default: in the event of a THP
    allocation failure, it will direct reclaim/compact but not wake
    kswapd.  The problem is that THP allocation requests may enter
    reclaim/compaction, which can incur a severe stall that is not
    guaranteed to be offset by reduced TLB misses.  While there has been
    considerable effort to reduce the impact of reclaim/compaction, it is
    still a high cost and workloads that should fit in memory fail to do
    so.  Specifically, a simple anon/file streaming workload will enter
    direct reclaim, on NUMA machines at least, even though the working
    set size is 80% of RAM.  It's been years and it's time to throw in
    the towel.
    
    First, this patch defines THP defrag as follows:
    
     madvise: A failed allocation will direct reclaim/compact if the application requests it
     never:   Neither reclaim/compact nor wake kswapd
     defer:   A failed allocation will wake kswapd/kcompactd
     always:  A failed allocation will direct reclaim/compact (historical behaviour)
              khugepaged defrag will enter direct reclaim/compact but not wake kswapd.
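
    At fault time, the selected mode is translated into GFP reclaim flags
    for the THP allocation.  The following is a minimal sketch of that
    translation; thp_defrag_mode, THP_DEFRAG_* and thp_fault_gfpmask are
    illustrative names, not the exact kernel identifiers (the kernel
    tracks the mode in transparent_hugepage_flags bits):

         /* Sketch only: pick reclaim flags for a THP fault according
          * to the defrag mode. */
         static gfp_t thp_fault_gfpmask(bool vma_madvised)
         {
                 gfp_t flags = GFP_TRANSHUGE & ~__GFP_RECLAIM;

                 switch (thp_defrag_mode) {
                 case THP_DEFRAG_ALWAYS:         /* historical behaviour */
                         return flags | __GFP_DIRECT_RECLAIM;
                 case THP_DEFRAG_DEFER:          /* wake kswapd/kcompactd */
                         return flags | __GFP_KSWAPD_RECLAIM;
                 case THP_DEFRAG_MADVISE:        /* only if the app asked */
                         return vma_madvised ? flags | __GFP_DIRECT_RECLAIM
                                             : flags;
                 case THP_DEFRAG_NEVER:
                 default:
                         return flags;
                 }
         }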
    
    Next, it sets the default defrag option to "madvise" so that direct
    reclaim/compaction is entered only for applications that specifically
    requested it.
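
    For reference, the per-application opt-in is the existing madvise(2)
    interface; a minimal user-space example (the mapping size is
    arbitrary):

         #include <sys/mman.h>

         int main(void)
         {
                 size_t len = 64UL << 20;    /* 64MB anonymous mapping */
                 void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                 if (buf == MAP_FAILED)
                         return 1;
                 /* Opt in to THP; with defrag=madvise, a failed THP
                  * allocation for this range may direct reclaim/compact. */
                 madvise(buf, len, MADV_HUGEPAGE);
                 return 0;
         }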
    
    Lastly, it removes a check from the page allocator slowpath that is
    related to __GFP_THISNODE to allow "defer" to work.  The callers that
    really care are slub/slab and they are updated accordingly.  The slab
    change may be surprising because it also corrects a comment; kswapd
    was never woken up by that path.
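
    To illustrate the idea with a sketch (this is not the actual
    slub/slab diff; page, node, gfp and order stand in for the caller's
    context): an opportunistic node-local attempt is now expressed
    entirely through the caller's own GFP flags, with the caller handling
    the fallback, instead of relying on a __GFP_THISNODE special case in
    the slowpath:

         /* Try a node-local, high-order allocation without stalling;
          * fall back to a smaller request if it fails. */
         page = alloc_pages_node(node,
                                 (gfp | __GFP_THISNODE | __GFP_NOWARN |
                                  __GFP_NORETRY) & ~__GFP_DIRECT_RECLAIM,
                                 order);
         if (!page)
                 page = alloc_pages_node(node, gfp, 0);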
    
    This means that, by default, a THP fault will no longer stall for
    most applications, which is the ideal for most users: they get THP if
    it is immediately available.  There are still options for users that
    prefer a stall at startup of a new application: either restore the
    historical behaviour with "always" or pick a half-way point with
    "defer", where kswapd does some of the work in the background and
    wakes kcompactd if necessary.  THP defrag for khugepaged remains
    enabled; it will enter direct reclaim/compaction but not wake kswapd
    or kcompactd.
    
    After this patch, a THP allocation failure will quickly fall back and
    rely on khugepaged to recover the situation at some time in the
    future.  In some cases this will reduce THP usage, but the benefit of
    THP is hard to measure and not a universal win, whereas a stall to
    reclaim/compaction is definitely measurable and can be painful.
    
    The first test for this is using "usemem" to read a large file and write
    a large anonymous mapping (to avoid the zero page) multiple times.  The
    total size of the mappings is 80% of RAM and the benchmark simply
    measures how long it takes to complete.  It uses multiple threads to see
    if that is a factor.  On UMA, the performance is almost identical so
    it is not reported, but on NUMA there is a clear difference.
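
    For context, the core of such a streaming workload is just repeated
    writes through a large mapping; a minimal sketch of the anonymous
    half (not the actual usemem source):

         #include <string.h>
         #include <sys/mman.h>

         /* Stream writes through a mapping sized to a large fraction of
          * RAM; writing (rather than reading) avoids the zero page. */
         static void stream_anon(size_t len, int passes)
         {
                 char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                 if (buf == MAP_FAILED)
                         return;
                 for (int i = 0; i < passes; i++)
                         memset(buf, i + 1, len);
                 munmap(buf, len);
         }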
    
    usemem
                                       4.4.0                 4.4.0
                              kcompactd-v1r1         nodefrag-v1r3
    Amean    System-1       102.86 (  0.00%)       46.81 ( 54.50%)
    Amean    System-4        37.85 (  0.00%)       34.02 ( 10.12%)
    Amean    System-7        48.12 (  0.00%)       46.89 (  2.56%)
    Amean    System-12       51.98 (  0.00%)       56.96 ( -9.57%)
    Amean    System-21       80.16 (  0.00%)       79.05 (  1.39%)
    Amean    System-30      110.71 (  0.00%)      107.17 (  3.20%)
    Amean    System-48      127.98 (  0.00%)      124.83 (  2.46%)
    Amean    Elapsd-1       185.84 (  0.00%)      105.51 ( 43.23%)
    Amean    Elapsd-4        26.19 (  0.00%)       25.58 (  2.33%)
    Amean    Elapsd-7        21.65 (  0.00%)       21.62 (  0.16%)
    Amean    Elapsd-12       18.58 (  0.00%)       17.94 (  3.43%)
    Amean    Elapsd-21       17.53 (  0.00%)       16.60 (  5.33%)
    Amean    Elapsd-30       17.45 (  0.00%)       17.13 (  1.84%)
    Amean    Elapsd-48       15.40 (  0.00%)       15.27 (  0.82%)
    
    For a single thread, the benchmark completes 43.23% faster with this
    patch applied, with smaller benefits as the thread count increases.
    Similarly, notice the large reduction in system CPU usage in most
    cases.  The overall CPU time is:
    
                   4.4.0       4.4.0
            kcompactd-v1r1 nodefrag-v1r3
    User        10357.65    10438.33
    System       3988.88     3543.94
    Elapsed      2203.01     1634.41
    
    The difference is substantial.  Now, the reclaim figures:
    
                                     4.4.0       4.4.0
                          kcompactd-v1r1 nodefrag-v1r3
    Minor Faults                 128458477   278352931
    Major Faults                   2174976         225
    Swap Ins                      16904701           0
    Swap Outs                     17359627           0
    Allocation stalls                43611           0
    DMA allocs                           0           0
    DMA32 allocs                  19832646    19448017
    Normal allocs                614488453   580941839
    Movable allocs                       0           0
    Direct pages scanned          24163800           0
    Kswapd pages scanned                 0           0
    Kswapd pages reclaimed               0           0
    Direct pages reclaimed        20691346           0
    Compaction stalls                42263           0
    Compaction success                 938           0
    Compaction failures              41325           0
    
    This patch eliminates almost all swapping and direct reclaim
    activity.  There is still overhead, but it comes from NUMA balancing,
    which does not recognise that it is pointless to do anything with
    this workload.
    
    I also tried the thpscale benchmark, which forces a corner case where
    compaction is used heavily, and measures the fault latency for both
    base and huge pages.
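
    As a rough illustration of what is being measured (a sketch, not the
    actual thpscale code), the benchmark times the first-touch fault of
    each naturally aligned 2MB chunk; classifying each fault as base or
    huge (e.g. via /proc/self/smaps) is omitted here:

         #include <stdio.h>
         #include <time.h>
         #include <sys/mman.h>

         #define CHUNK (2UL << 20)          /* THP size on x86-64 */

         int main(void)
         {
                 size_t len = 1UL << 30;    /* 1GB region */
                 char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                 struct timespec t0, t1;

                 if (buf == MAP_FAILED)
                         return 1;
                 madvise(buf, len, MADV_HUGEPAGE);
                 for (size_t off = 0; off < len; off += CHUNK) {
                         clock_gettime(CLOCK_MONOTONIC, &t0);
                         buf[off] = 1;      /* first touch faults the chunk */
                         clock_gettime(CLOCK_MONOTONIC, &t1);
                         printf("%ld ns\n",
                                (t1.tv_sec - t0.tv_sec) * 1000000000L +
                                (t1.tv_nsec - t0.tv_nsec));
                 }
                 return 0;
         }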
    
    thpscale Fault Latencies
                                           4.4.0                 4.4.0
                                  kcompactd-v1r1         nodefrag-v1r3
    Amean    fault-base-1      5288.84 (  0.00%)     2817.12 ( 46.73%)
    Amean    fault-base-3      6365.53 (  0.00%)     3499.11 ( 45.03%)
    Amean    fault-base-5      6526.19 (  0.00%)     4363.06 ( 33.15%)
    Amean    fault-base-7      7142.25 (  0.00%)     4858.08 ( 31.98%)
    Amean    fault-base-12    13827.64 (  0.00%)    10292.11 ( 25.57%)
    Amean    fault-base-18    18235.07 (  0.00%)    13788.84 ( 24.38%)
    Amean    fault-base-24    21597.80 (  0.00%)    24388.03 (-12.92%)
    Amean    fault-base-30    26754.15 (  0.00%)    19700.55 ( 26.36%)
    Amean    fault-base-32    26784.94 (  0.00%)    19513.57 ( 27.15%)
    Amean    fault-huge-1      4223.96 (  0.00%)     2178.57 ( 48.42%)
    Amean    fault-huge-3      2194.77 (  0.00%)     2149.74 (  2.05%)
    Amean    fault-huge-5      2569.60 (  0.00%)     2346.95 (  8.66%)
    Amean    fault-huge-7      3612.69 (  0.00%)     2997.70 ( 17.02%)
    Amean    fault-huge-12     3301.75 (  0.00%)     6727.02 (-103.74%)
    Amean    fault-huge-18     6696.47 (  0.00%)     6685.72 (  0.16%)
    Amean    fault-huge-24     8000.72 (  0.00%)     9311.43 (-16.38%)
    Amean    fault-huge-30    13305.55 (  0.00%)     9750.45 ( 26.72%)
    Amean    fault-huge-32     9981.71 (  0.00%)    10316.06 ( -3.35%)
    
    The average time to fault pages is substantially reduced in the
    majority of cases, but with the obvious caveat that fewer THPs are
    actually used in this adverse workload:
    
                                       4.4.0                 4.4.0
                              kcompactd-v1r1         nodefrag-v1r3
    Percentage huge-1         0.71 (  0.00%)       14.04 (1865.22%)
    Percentage huge-3        10.77 (  0.00%)       33.05 (206.85%)
    Percentage huge-5        60.39 (  0.00%)       38.51 (-36.23%)
    Percentage huge-7        45.97 (  0.00%)       34.57 (-24.79%)
    Percentage huge-12       68.12 (  0.00%)       40.07 (-41.17%)
    Percentage huge-18       64.93 (  0.00%)       47.82 (-26.35%)
    Percentage huge-24       62.69 (  0.00%)       44.23 (-29.44%)
    Percentage huge-30       43.49 (  0.00%)       55.38 ( 27.34%)
    Percentage huge-32       50.72 (  0.00%)       51.90 (  2.35%)
    
                                     4.4.0       4.4.0
                          kcompactd-v1r1 nodefrag-v1r3
    Minor Faults                  37429143    47564000
    Major Faults                      1916        1558
    Swap Ins                          1466        1079
    Swap Outs                      2936863      149626
    Allocation stalls                62510           3
    DMA allocs                           0           0
    DMA32 allocs                   6566458     6401314
    Normal allocs                216361697   216538171
    Movable allocs                       0           0
    Direct pages scanned          25977580       17998
    Kswapd pages scanned                 0     3638931
    Kswapd pages reclaimed               0      207236
    Direct pages reclaimed         8833714          88
    Compaction stalls               103349           5
    Compaction success                 270           4
    Compaction failures             103079           1
    
    Note again that while this workload does swap, as it is an aggressive
    one, the direct reclaim activity and allocation stalls are
    substantially reduced.  There is some kswapd activity, but ftrace
    showed that it was due to normal wakeups from 4K page allocations.
    Compaction-related stalls and activity are almost eliminated.
    
    I also tried the stutter benchmark.  For this, I do not have figures
    for NUMA, but it is a scenario that does impact UMA, so I'll report
    what is available:
    
    stutter
                                     4.4.0                 4.4.0
                            kcompactd-v1r1         nodefrag-v1r3
    Min         mmap      7.3571 (  0.00%)      7.3438 (  0.18%)
    1st-qrtle   mmap      7.5278 (  0.00%)     17.9200 (-138.05%)
    2nd-qrtle   mmap      7.6818 (  0.00%)     21.6055 (-181.25%)
    3rd-qrtle   mmap     11.0889 (  0.00%)     21.8881 (-97.39%)
    Max-90%     mmap     27.8978 (  0.00%)     22.1632 ( 20.56%)
    Max-93%     mmap     28.3202 (  0.00%)     22.3044 ( 21.24%)
    Max-95%     mmap     28.5600 (  0.00%)     22.4580 ( 21.37%)
    Max-99%     mmap     29.6032 (  0.00%)     25.5216 ( 13.79%)
    Max         mmap   4109.7289 (  0.00%)   4813.9832 (-17.14%)
    Mean        mmap     12.4474 (  0.00%)     19.3027 (-55.07%)
    
    This benchmark tries to fault an anonymous mapping while there is a
    heavy IO load -- a scenario that desktop users used to complain about
    frequently.  The results are mixed because the ideal case of mapping
    with THP is not hit as often.  However, note that 99% of the mappings
    complete 13.79% faster.
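
    The scenario can be sketched as follows (this is not the actual
    stutter benchmark; the file path is arbitrary and a real harness
    would bound and clean up the IO child):

         #include <fcntl.h>
         #include <string.h>
         #include <unistd.h>
         #include <sys/mman.h>

         int main(void)
         {
                 if (fork() == 0) {         /* child: endless IO load */
                         static char block[1 << 20];
                         int fd = open("/tmp/ioload.dat",
                                       O_WRONLY | O_CREAT | O_TRUNC, 0600);

                         memset(block, 1, sizeof(block));
                         for (;;)
                                 if (write(fd, block, sizeof(block)) < 0)
                                         return 1;
                 }

                 for (int i = 0; i < 100; i++) {    /* parent: timed faults */
                         size_t len = 256UL << 20;
                         char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                         if (buf == MAP_FAILED)
                                 break;
                         /* time this fault loop, e.g. with clock_gettime() */
                         for (size_t off = 0; off < len; off += 4096)
                                 buf[off] = 1;
                         munmap(buf, len);
                 }
                 return 0;
         }

    The CPU usage here is particularly interesting: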
    
                   4.4.0       4.4.0
            kcompactd-v1r1 nodefrag-v1r3
    User           67.50        0.99
    System       1327.88       91.30
    Elapsed      2079.00     2128.98
    
    And once again, we look at the reclaim figures:
    
                                     4.4.0       4.4.0
                          kcompactd-v1r1 nodefrag-v1r3
    Minor Faults                 335241922  1314582827
    Major Faults                       715         819
    Swap Ins                             0           0
    Swap Outs                            0           0
    Allocation stalls               532723           0
    DMA allocs                           0           0
    DMA32 allocs                1822364341  1177950222
    Normal allocs               1815640808  1517844854
    Movable allocs                       0           0
    Direct pages scanned          21892772           0
    Kswapd pages scanned          20015890    41879484
    Kswapd pages reclaimed        19961986    41822072
    Direct pages reclaimed        21892741           0
    Compaction stalls              1065755           0
    Compaction success                 514           0
    Compaction failures            1065241           0
    
    Allocation stalls and all direct reclaim activity are eliminated, as
    are compaction-related stalls.
    
    THP gives impressive gains in some cases, but only if THPs are
    quickly available.  We're not going to reach the point where they are
    completely free, so let's finally take the costs out of the fast
    paths and defer them to kswapd, kcompactd and khugepaged, where they
    belong.
    
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Rik van Riel <riel@redhat.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>