• Huang Ying's avatar
    mm, THP, swap: delay splitting THP during swap out · 38d8b4e6
    Huang Ying authored
    Patch series "THP swap: Delay splitting THP during swapping out", v11.
    This patchset is to optimize the performance of Transparent Huge Page
    (THP) swap.
    Recently, the performance of the storage devices improved so fast that
    we cannot saturate the disk bandwidth with single logical CPU when do
    page swap out even on a high-end server machine.  Because the
    performance of the storage device improved faster than that of single
    logical CPU.  And it seems that the trend will not change in the near
    future.  On the other hand, the THP becomes more and more popular
    because of increased memory size.  So it becomes necessary to optimize
    THP swap performance.
    The advantages of the THP swap support include:
     - Batch the swap operations for the THP to reduce lock
       acquiring/releasing, including allocating/freeing the swap space,
       adding/deleting to/from the swap cache, and writing/reading the swap
       space, etc. This will help improve the performance of the THP swap.
     - The THP swap space read/write will be 2M sequential IO. It is
       particularly helpful for the swap read, which are usually 4k random
       IO. This will improve the performance of the THP swap too.
     - It will help the memory fragmentation, especially when the THP is
       heavily used by the applications. The 2M continuous pages will be
       free up after THP swapping out.
     - It will improve the THP utilization on the system with the swap
       turned on. Because the speed for khugepaged to collapse the normal
       pages into the THP is quite slow. After the THP is split during the
       swapping out, it will take quite long time for the normal pages to
       collapse back into the THP after being swapped in. The high THP
       utilization helps the efficiency of the page based memory management
    There are some concerns regarding THP swap in, mainly because possible
    enlarged read/write IO size (for swap in/out) may put more overhead on
    the storage device.  To deal with that, the THP swap in should be turned
    on only when necessary.  For example, it can be selected via
    "always/never/madvise" logic, to be turned on globally, turned off
    globally, or turned on only for VMA with MADV_HUGEPAGE, etc.
    This patchset is the first step for the THP swap support.  The plan is
    to delay splitting THP step by step, finally avoid splitting THP during
    the THP swapping out and swap out/in the THP as a whole.
    As the first step, in this patchset, the splitting huge page is delayed
    from almost the first step of swapping out to after allocating the swap
    space for the THP and adding the THP into the swap cache.  This will
    reduce lock acquiring/releasing for the locks used for the swap cache
    With the patchset, the swap out throughput improves 15.5% (from about
    3.73GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
    with 8 processes.  The test is done on a Xeon E5 v3 system.  The swap
    device used is a RAM simulated PMEM (persistent memory) device.  To test
    the sequential swapping out, the test case creates 8 processes, which
    sequentially allocate and write to the anonymous pages until the RAM and
    part of the swap device is used up.
    This patch (of 5):
    In this patch, splitting huge page is delayed from almost the first step
    of swapping out to after allocating the swap space for the THP
    (Transparent Huge Page) and adding the THP into the swap cache.  This
    will batch the corresponding operation, thus improve THP swap out
    This is the first step for the THP swap optimization.  The plan is to
    delay splitting the THP step by step and avoid splitting the THP
    In this patch, one swap cluster is used to hold the contents of each THP
    swapped out.  So, the size of the swap cluster is changed to that of the
    THP (Transparent Huge Page) on x86_64 architecture (512).  For other
    architectures which want such THP swap optimization,
    ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file for
    the architecture.  In effect, this will enlarge swap cluster size by 2
    times on x86_64.  Which may make it harder to find a free cluster when
    the swap space becomes fragmented.  So that, this may reduce the
    continuous swap space allocation and sequential write in theory.  The
    performance test in 0day shows no regressions caused by this.
    In the future of THP swap optimization, some information of the swapped
    out THP (such as compound map count) will be recorded in the
    swap_cluster_info data structure.
    The mem cgroup swap accounting functions are enhanced to support charge
    or uncharge a swap cluster backing a THP as a whole.
    The swap cluster allocate/free functions are added to allocate/free a
    swap cluster for a THP.  A fair simple algorithm is used for swap
    cluster allocation, that is, only the first swap device in priority list
    will be tried to allocate the swap cluster.  The function will fail if
    the trying is not successful, and the caller will fallback to allocate a
    single swap slot instead.  This works good enough for normal cases.  If
    the difference of the number of the free swap clusters among multiple
    swap devices is significant, it is possible that some THPs are split
    earlier than necessary.  For example, this could be caused by big size
    difference among multiple swap devices.
    The swap cache functions is enhanced to support add/delete THP to/from
    the swap cache as a set of (HPAGE_PMD_NR) sub-pages.  This may be
    enhanced in the future with multi-order radix tree.  But because we will
    split the THP soon during swapping out, that optimization doesn't make
    much sense for this first step.
    The THP splitting functions are enhanced to support to split THP in swap
    cache during swapping out.  The page lock will be held during allocating
    the swap cluster, adding the THP into the swap cache and splitting the
    THP.  So in the code path other than swapping out, if the THP need to be
    split, the PageSwapCache(THP) will be always false.
    The swap cluster is only available for SSD, so the THP swap optimization
    in this patchset has no effect for HDD.
    [ying.huang@intel.com: fix two issues in THP optimize patch]
      Link: http://lkml.kernel.org/r/87k25ed8zo.fsf@yhuang-dev.intel.com
    [hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
    Link: http://lkml.kernel.org/r/20170515112522.32457-2-ying.huang@intel.com
    Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
    Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
    Suggested-by: Andrew Morton <akpm@linux-foundation.org> [for config option]
    Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> [for changes in huge_memory.c and huge_mm.h]
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Shaohua Li <shli@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>