1. 14 Jun, 2018 1 commit
  2. 08 Jun, 2018 8 commits
  3. 11 Apr, 2018 1 commit
  4. 06 Apr, 2018 1 commit
  5. 23 Mar, 2018 1 commit
  6. 01 Feb, 2018 3 commits
  7. 27 Nov, 2017 1 commit
    • Linus Torvalds's avatar
      Rename superblock flags (MS_xyz -> SB_xyz) · 1751e8a6
      Linus Torvalds authored
      This is a pure automated search-and-replace of the internal kernel
      superblock flags.
      The s_flags are now called SB_*, with the names and the values for the
      moment mirroring the MS_* flags that they're equivalent to.
      Note how the MS_xyz flags are the ones passed to the mount system call,
      while the SB_xyz flags are what we then use in sb->s_flags.
      The script to do this was:
          # places to look in; re security/*: it generally should *not* be
          # touched (that stuff parses mount(2) arguments directly), but
          # there are two places where we really deal with superblock flags.
          FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
                  include/linux/fs.h include/uapi/linux/bfs_fs.h \
                  security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
          # the list of MS_... constants
                ACTIVE NOUSER"
          for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done
          # we want files that contain at least one of MS_...,
          # with fs/namespace.c and fs/pnode.c excluded.
          L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')
          for f in $L; do sed -i $f $SED_PROG; done
      Requested-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  8. 18 Nov, 2017 1 commit
  9. 16 Nov, 2017 4 commits
  10. 07 Oct, 2017 1 commit
  11. 14 Sep, 2017 1 commit
    • Michal Hocko's avatar
      mm: treewide: remove GFP_TEMPORARY allocation flag · 0ee931c4
      Michal Hocko authored
      GFP_TEMPORARY was introduced by commit e12ba74d ("Group short-lived
      and reclaimable kernel allocations") along with __GFP_RECLAIMABLE.  It's
      primary motivation was to allow users to tell that an allocation is
      short lived and so the allocator can try to place such allocations close
      together and prevent long term fragmentation.  As much as this sounds
      like a reasonable semantic it becomes much less clear when to use the
      highlevel GFP_TEMPORARY allocation flag.  How long is temporary? Can the
      context holding that memory sleep? Can it take locks? It seems there is
      no good answer for those questions.
      The current implementation of GFP_TEMPORARY is basically GFP_KERNEL |
      __GFP_RECLAIMABLE which in itself is tricky because basically none of
      the existing caller provide a way to reclaim the allocated memory.  So
      this is rather misleading and hard to evaluate for any benefits.
      I have checked some random users and none of them has added the flag
      with a specific...
  12. 07 Sep, 2017 5 commits
    • Huang Ying's avatar
      mm, swap: VMA based swap readahead · ec560175
      Huang Ying authored
      The swap readahead is an important mechanism to reduce the swap in
      latency.  Although pure sequential memory access pattern isn't very
      popular for anonymous memory, the space locality is still considered
      In the original swap readahead implementation, the consecutive blocks in
      swap device are readahead based on the global space locality estimation.
      But the consecutive blocks in swap device just reflect the order of page
      reclaiming, don't necessarily reflect the access pattern in virtual
      memory.  And the different tasks in the system may have different access
      patterns, which makes the global space locality estimation incorrect.
      In this patch, when page fault occurs, the virtual pages near the fault
      address will be readahead instead of the swap slots near the fault swap
      slot in swap device.  This avoid to readahead the unrelated swap slots.
      At the same time, the swap readahead is changed to work on per-VMA from
      globally.  So that the different access patterns of the different VMAs
      could be distinguished, and the different readahead policy could be
      applied accordingly.  The original core readahead detection and scaling
      algorithm is reused, because it is an effect algorithm to detect the
      space locality.
      The test and result is as follow,
      Common test condition
      Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM) Swap device:
      NVMe disk
      Micro-benchmark with combined access pattern
      vm-scalability, sequential swap test case, 4 processes to eat 50G
      virtual memory space, repeat the sequential memory writing until 300
      seconds.  The first round writing will trigger swap out, the following
      rounds will trigger sequential swap in and out.
      At the same time, run vm-scalability random swap test case in
      background, 8 processes to eat 30G virtual memory space, repeat the
      random memory write until 300 seconds.  This will trigger random swap-in
      in the background.
      This is a combined workload with sequential and random memory accessing
      at the same time.  The result (for sequential workload) is as follow,
      			Base		Optimized
      			----		---------
      throughput		345413 KB/s	414029 KB/s (+19.9%)
      latency.average		97.14 us	61.06 us (-37.1%)
      latency.50th		2 us		1 us
      latency.60th		2 us		1 us
      latency.70th		98 us		2 us
      latency.80th		160 us		2 us
      latency.90th		260 us		217 us
      latency.95th		346 us		369 us
      latency.99th		1.34 ms		1.09 ms
      ra_hit%			52.69%		99.98%
      The original swap readahead algorithm is confused by the background
      random access workload, so readahead hit rate is lower.  The VMA-base
      readahead algorithm works much better.
      The test memory size is bigger than RAM to trigger swapping.
      			Base		Optimized
      			----		---------
      elapsed_time		393.49 s	329.88 s (-16.2%)
      ra_hit%			86.21%		98.82%
      The score of base and optimized kernel hasn't visible changes.  But the
      elapsed time reduced and readahead hit rate improved, so the optimized
      kernel runs better for startup and tear down stages.  And the absolute
      value of readahead hit rate is high, shows that the space locality is
      still valid in some practical workloads.
      Link: http://lkml.kernel.org/r/20170807054038.1843-4-ying.huang@intel.com
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Mike Kravetz's avatar
      mm/shmem: add hugetlbfs support to memfd_create() · 749df87b
      Mike Kravetz authored
      This patch came out of discussions in this e-mail thread:
      The Oracle JVM team is developing a new garbage collection model.  This
      new model requires multiple mappings of the same anonymous memory.  One
      straight forward way to accomplish this is with memfd_create.  They can
      use the returned fd to create multiple mappings of the same memory.
      The JVM today has an option to use (static hugetlb) huge pages.  If this
      option is specified, they would like to use the same garbage collection
      model requiring multiple mappings to the same memory.  Using hugetlbfs,
      it is possible to explicitly mount a filesystem and specify file paths
      in order to get an fd that can be used for multiple mappings.  However,
      this introduces additional system admin work and coordination.
      Ideally they would like to get a hugetlbfs fd without requiring explicit
      mounting of a filesystem.  Today, mmap and shmget can make use of
      hugetlbfs without explicitly mounting a filesystem.  The patch adds this
      functionality to memfd_create.
      Add a new flag MFD_HUGETLB to memfd_create() that will specify the file
      to be created resides in the hugetlbfs filesystem.  This is the generic
      hugetlbfs filesystem not associated with any specific mount point.  As
      with other system calls that request hugetlbfs backed pages, there is
      the ability to encode huge page size in the flag arguments.
      hugetlbfs does not support sealing operations, therefore specifying
      MFD_ALLOW_SEALING with MFD_HUGETLB will result in EINVAL.
      Of course, the memfd_man page would need updating if this type of
      functionality moves forward.
      Link: http://lkml.kernel.org/r/1502149672-7759-2-git-send-email-mike.kravetz@oracle.com
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Mike Rapoport's avatar
      userfaultfd: shmem: add shmem_mfill_zeropage_pte for userfaultfd support · 8d103963
      Mike Rapoport authored
      shmem_mfill_zeropage_pte is the low level routine that implements the
      userfaultfd UFFDIO_ZEROPAGE command.  Since for shmem mappings zero
      pages are always allocated and accounted, the new method is a slight
      extension of the existing shmem_mcopy_atomic_pte.
      Link: http://lkml.kernel.org/r/1497939652-16528-4-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: default avatarMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Mike Rapoport's avatar
      shmem: introduce shmem_inode_acct_block · 0f079694
      Mike Rapoport authored
      The shmem_acct_block and the update of used_blocks are following one
      another in all the places they are used.  Combine these two into a
      helper function.
      Link: http://lkml.kernel.org/r/1497939652-16528-3-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: default avatarMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Mike Rapoport's avatar
      shmem: shmem_charge: verify max_block is not exceeded before inode update · b1cc94ab
      Mike Rapoport authored
      Patch series "userfaultfd: enable zeropage support for shmem".
      These patches enable support for UFFDIO_ZEROPAGE for shared memory.
      The first two patches are not strictly related to userfaultfd, they are
      just minor refactoring to reduce amount of code duplication.
      This patch (of 7):
      Currently we update inode and shmem_inode_info before verifying that
      used_blocks will not exceed max_blocks.  In case it will, we undo the
      update.  Let's switch the order and move the verification of the blocks
      count before the inode and shmem_inode_info update.
      Link: http://lkml.kernel.org/r/1497939652-16528-2-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: default avatarMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  13. 25 Aug, 2017 1 commit
  14. 10 Aug, 2017 1 commit
    • Cong Wang's avatar
      mm: fix list corruptions on shmem shrinklist · d041353d
      Cong Wang authored
      We saw many list corruption warnings on shmem shrinklist:
        WARNING: CPU: 18 PID: 177 at lib/list_debug.c:59 __list_del_entry+0x9e/0xc0
        list_del corruption. prev->next should be ffff9ae5694b82d8, but was ffff9ae5699ba960
        Modules linked in: intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel raid0 dcdbas shpchp wmi hed i2c_i801 ioatdma lpc_ich i2c_smbus acpi_cpufreq tcp_diag inet_diag sch_fq_codel ipmi_si ipmi_devintf ipmi_msghandler igb ptp crc32c_intel pps_core i2c_algo_bit i2c_core dca ipv6 crc_ccitt
        CPU: 18 PID: 177 Comm: kswapd1 Not tainted 4.9.34-t3.el7.twitter.x86_64 #1
        Hardware name: Dell Inc. PowerEdge C6220/0W6W6G, BIOS 2.2.3 11/07/2013
        Call Trace:
        WARNING: CPU: 23 PID: 639 at lib/list_debug.c:33 __list_add+0x89/0xb0
        list_add corruption. prev->next should be next (ffff9ae5699ba960), but was ffff9ae5694b82d8. (prev=ffff9ae5694b82d8).
        Modules linked in: intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel raid0 dcdbas shpchp wmi hed i2c_i801 ioatdma lpc_ich i2c_smbus acpi_cpufreq tcp_diag inet_diag sch_fq_codel ipmi_si ipmi_devintf ipmi_msghandler igb ptp crc32c_intel pps_core i2c_algo_bit i2c_core dca ipv6 crc_ccitt
        CPU: 23 PID: 639 Comm: systemd-udevd Tainted: G        W       4.9.34-t3.el7.twitter.x86_64 #1
        Hardware name: Dell Inc. PowerEdge C6220/0W6W6G, BIOS 2.2.3 11/07/2013
        Call Trace:
      The problem is that shmem_unused_huge_shrink() moves entries from the
      global sbinfo->shrinklist to its local lists and then releases the
      spinlock.  However, a parallel shmem_setattr() could access one of these
      entries directly and add it back to the global shrinklist if it is
      removed, with the spinlock held.
      The logic itself looks solid since an entry could be either in a local
      list or the global list, otherwise it is removed from one of them by
      list_del_init().  So probably the race condition is that, one CPU is in
      the middle of INIT_LIST_HEAD() but the other CPU calls list_empty()
      which returns true too early then the following list_add_tail() sees a
      corrupted entry.
      list_empty_careful() is designed to fix this situation.
      [akpm@linux-foundation.org: add comments]
      Link: http://lkml.kernel.org/r/20170803054630.18775-1-xiyou.wangcong@gmail.com
      Fixes: 779750d2
       ("shmem: split huge pages beyond i_size under memory pressure")
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  15. 10 Jul, 2017 1 commit
    • Michal Hocko's avatar
      mm: make PR_SET_THP_DISABLE immediately active · 18600332
      Michal Hocko authored
      PR_SET_THP_DISABLE has a rather subtle semantic.  It doesn't affect any
      existing mapping because it only updated mm->def_flags which is a
      template for new mappings.
      The mappings created after prctl(PR_SET_THP_DISABLE) have VM_NOHUGEPAGE
      flag set.  This can be quite surprising for all those applications which
      do not do prctl(); fork() & exec() and want to control their own THP
      Another usecase when the immediate semantic of the prctl might be useful
      is a combination of pre- and post-copy migration of containers with
      CRIU.  In this case CRIU populates a part of a memory region with data
      that was saved during the pre-copy stage.  Afterwards, the region is
      registered with userfaultfd and CRIU expects to get page faults for the
      parts of the region that were not yet populated.  However, khugepaged
      collapses the pages and the expected page faults do not occur.
      In more general case, the prctl(PR_SET_THP_DISABLE) could be used as a
      temporary mechanism for enabling/disabling THP process wide.
      Implementation wise, a new MMF_DISABLE_THP flag is added.  This flag is
      tested when decision whether to use huge pages is taken either during
      page fault of at the time of THP collapse.
      It should be noted, that the new implementation makes PR_SET_THP_DISABLE
      master override to any per-VMA setting, which was not the case
      Fixes: a0715cc2 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE")
      Link: http://lkml.kernel.org/r/1496415802-30944-1-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  16. 06 Jul, 2017 3 commits
    • Roman Gushchin's avatar
      mm: per-cgroup memory reclaim stats · 2262185c
      Roman Gushchin authored
      Track the following reclaim counters for every memory cgroup: PGREFILL,
      These values are exposed using the memory.stats interface of cgroup v2.
      The meaning of each value is the same as for global counters, available
      using /proc/vmstat.
      Also, for consistency, rename mem_cgroup_count_vm_event() to
      Link: http://lkml.kernel.org/r/1494530183-30808-1-git-send-email-guro@fb.com
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Suggested-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Minchan Kim's avatar
      mm, THP, swap: unify swap slot free functions to put_swap_page · 75f6d6d2
      Minchan Kim authored
      Now, get_swap_page takes struct page and allocates swap space according
      to page size(ie, normal or THP) so it would be more cleaner to introduce
      put_swap_page which is a counter function of get_swap_page.  Then, it
      calls right swap slot free function depending on page's size.
      [ying.huang@intel.com: minor cleanup and fix]
      Link: http://lkml.kernel.org/r/20170515112522.32457-3-ying.huang@intel.com
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Huang Ying's avatar
      mm, THP, swap: delay splitting THP during swap out · 38d8b4e6
      Huang Ying authored
      Patch series "THP swap: Delay splitting THP during swapping out", v11.
      This patchset is to optimize the performance of Transparent Huge Page
      (THP) swap.
      Recently, the performance of the storage devices improved so fast that
      we cannot saturate the disk bandwidth with single logical CPU when do
      page swap out even on a high-end server machine.  Because the
      performance of the storage device improved faster than that of single
      logical CPU.  And it seems that the trend will not change in the near
      future.  On the other hand, the THP becomes more and more popular
      because of increased memory size.  So it becomes necessary to optimize
      THP swap performance.
      The advantages of the THP swap support include:
       - Batch the swap operations for the THP to reduce lock
         acquiring/releasing, including allocating/freeing the swap space,
         adding/deleting to/from the swap cache, and writing/reading the swap
         space, etc. This will help improve the performance of the THP swap.
       - The THP swap space read/write will be 2M sequential IO. It is
         particularly helpful for the swap read, which are usually 4k random
         IO. This will improve the performance of the THP swap too.
       - It will help the memory fragmentation, especially when the THP is
         heavily used by the applications. The 2M continuous pages will be
         free up after THP swapping out.
       - It will improve the THP utilization on the system with the swap
         turned on. Because the speed for khugepaged to collapse the normal
         pages into the THP is quite slow. After the THP is split during the
         swapping out, it will take quite long time for the normal pages to
         collapse back into the THP after being swapped in. The high THP
         utilization helps the efficiency of the page based memory management
      There are some concerns regarding THP swap in, mainly because possible
      enlarged read/write IO size (for swap in/out) may put more overhead on
      the storage device.  To deal with that, the THP swap in should be turned
      on only when necessary.  For example, it can be selected via
      "always/never/madvise" logic, to be turned on globally, turned off
      globally, or turned on only for VMA with MADV_HUGEPAGE, etc.
      This patchset is the first step for the THP swap support.  The plan is
      to delay splitting THP step by step, finally avoid splitting THP during
      the THP swapping out and swap out/in the THP as a whole.
      As the first step, in this patchset, the splitting huge page is delayed
      from almost the first step of swapping out to after allocating the swap
      space for the THP and adding the THP into the swap cache.  This will
      reduce lock acquiring/releasing for the locks used for the swap cache
      With the patchset, the swap out throughput improves 15.5% (from about
      3.73GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
      with 8 processes.  The test is done on a Xeon E5 v3 system.  The swap
      device used is a RAM simulated PMEM (persistent memory) device.  To test
      the sequential swapping out, the test case creates 8 processes, which
      sequentially allocate and write to the anonymous pages until the RAM and
      part of the swap device is used up.
      This patch (of 5):
      In this patch, splitting huge page is delayed from almost the first step
      of swapping out to after allocating the swap space for the THP
      (Transparent Huge Page) and adding the THP into the swap cache.  This
      will batch the corresponding operation, thus improve THP swap out
      This is the first step for the THP swap optimization.  The plan is to
      delay splitting the THP step by step and avoid splitting the THP
      In this patch, one swap cluster is used to hold the contents of each THP
      swapped out.  So, the size of the swap cluster is changed to that of the
      THP (Transparent Huge Page) on x86_64 architecture (512).  For other
      architectures which want such THP swap optimization,
      ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file for
      the architecture.  In effect, this will enlarge swap cluster size by 2
      times on x86_64.  Which may make it harder to find a free cluster when
      the swap space becomes fragmented.  So that, this may reduce the
      continuous swap space allocation and sequential write in theory.  The
      performance test in 0day shows no regressions caused by this.
      In the future of THP swap optimization, some information of the swapped
      out THP (such as compound map count) will be recorded in the
      swap_cluster_info data structure.
      The mem cgroup swap accounting functions are enhanced to support charge
      or uncharge a swap cluster backing a THP as a whole.
      The swap cluster allocate/free functions are added to allocate/free a
      swap cluster for a THP.  A fair simple algorithm is used for swap
      cluster allocation, that is, only the first swap device in priority list
      will be tried to allocate the swap cluster.  The function will fail if
      the trying is not successful, and the caller will fallback to allocate a
      single swap slot instead.  This works good enough for normal cases.  If
      the difference of the number of the free swap clusters among multiple
      swap devices is significant, it is possible that some THPs are split
      earlier than necessary.  For example, this could be caused by big size
      difference among multiple swap devices.
      The swap cache functions is enhanced to support add/delete THP to/from
      the swap cache as a set of (HPAGE_PMD_NR) sub-pages.  This may be
      enhanced in the future with multi-order radix tree.  But because we will
      split the THP soon during swapping out, that optimization doesn't make
      much sense for this first step.
      The THP splitting functions are enhanced to support to split THP in swap
      cache during swapping out.  The page lock will be held during allocating
      the swap cluster, adding the THP into the swap cache and splitting the
      THP.  So in the code path other than swapping out, if the THP need to be
      split, the PageSwapCache(THP) will be always false.
      The swap cluster is only available for SSD, so the THP swap optimization
      in this patchset has no effect for HDD.
      [ying.huang@intel.com: fix two issues in THP optimize patch]
        Link: http://lkml.kernel.org/r/87k25ed8zo.fsf@yhuang-dev.intel.com
      [hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
      Link: http://lkml.kernel.org/r/20170515112522.32457-2-ying.huang@intel.com
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Suggested-by: Andrew Morton <akpm@linux-foundation.org> [for config option]
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> [for changes in huge_memory.c and huge_mm.h]
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  17. 20 Jun, 2017 2 commits
    • Ingo Molnar's avatar
      sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming · 2055da97
      Ingo Molnar authored
      So I've noticed a number of instances where it was not obvious from the
      code whether ->task_list was for a wait-queue head or a wait-queue entry.
      Furthermore, there's a number of wait-queue users where the lists are
      not for 'tasks' but other entities (poll tables, etc.), in which case
      the 'task_list' name is actively confusing.
      To clear this all up, name the wait-queue head and entry list structure
      fields unambiguously:
      	struct wait_queue_head::task_list	=> ::head
      	struct wait_queue_entry::task_list	=> ::entry
      For example, this code:
      	rqw->wait.task_list.next != &wait->task_list
      ... is was pretty unclear (to me) what it's doing, while now it's written this way:
      	rqw->wait.head.next != &wait->entry
      ... which makes it pretty clear that we are iterating a list until we see the head.
      Other examples are:
      	list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
      	list_for_each_entry(wq, &fence->wait.task_list, task_list) {
      ... where it's unclear (to me) what we are iterating, and during review it's
      hard to tell whether it's trying to walk a wait-queue entry (which would be
      a bug), while now it's written as:
      	list_for_each_entry_safe(pos, next, &x->head, entry) {
      	list_for_each_entry(wq, &fence->wait.head, entry) {
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
    • Ingo Molnar's avatar
      sched/wait: Rename wait_queue_t => wait_queue_entry_t · ac6424b9
      Ingo Molnar authored
      	wait_queue_t		=>	wait_queue_entry_t
      'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
      but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
      which had to carry the name.
      Start sorting this out by renaming it to 'wait_queue_entry_t'.
      This also allows the real structure name 'struct __wait_queue' to
      lose its double underscore and become 'struct wait_queue_entry',
      which is the more canonical nomenclature for such data types.
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
  18. 05 Jun, 2017 1 commit
  19. 03 Mar, 2017 1 commit
    • David Howells's avatar
      statx: Add a system call to make enhanced file info available · a528d35e
      David Howells authored
      Add a system call to make extended file information available, including
      file creation and some attribute flags where available through the
      underlying filesystem.
      The getattr inode operation is altered to take two additional arguments: a
      u32 request_mask and an unsigned int flags that indicate the
      synchronisation mode.  This change is propagated to the vfs_getattr*()
      Functions like vfs_stat() are now inline wrappers around new functions
      vfs_statx() and vfs_statx_fd() to reduce stack usage.
      The idea was initially proposed as a set of xattrs that could be retrieved
      with getxattr(), but the general preference proved to be for a new syscall
      with an extended stat structure.
      A number of requests were gathered for features to be included.  The
      following have been included:
       (1) Make the fields a consistent size on all arches and make them large.
       (2) Spare space, request flags and information flags are provided for
           future expansion.
       (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
       (4) Creation time: The SMB protocol carries the creation time, which could
           be exported by Samba, which will in turn help CIFS make use of
           FS-Cache as that can be used for coherency data (stx_btime).
           This is also specified in NFSv4 as a recommended attribute and could
           be exported by NFSD [Steve French].
       (5) Lightweight stat: Ask for just those details of interest, and allow a
           netfs (such as NFS) to approximate anything not of interest, possibly
           without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
           Dilger] (AT_STATX_DONT_SYNC).
       (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
           its cached attributes are up to date [Trond Myklebust]
      And the following have been left out for future extension:
       (7) Data version number: Could be used by userspace NFS servers [Aneesh
           Can also be used to modify fill_post_wcc() in NFSD which retrieves
           i_version directly, but has just called vfs_getattr().  It could get
           it from the kstat struct if it used vfs_xgetattr() instead.
           (There's disagreement on the exact semantics of a single field, since
           not all filesystems do this the same way).
       (8) BSD stat compatibility: Including more fields from the BSD stat such
           as creation time (st_btime) and inode generation number (st_gen)
           [Jeremy Allison, Bernd Schubert].
       (9) Inode generation number: Useful for FUSE and userspace NFS servers
           [Bernd Schubert].
           (This was asked for but later deemed unnecessary with the
           open-by-handle capability available and caused disagreement as to
           whether it's a security hole or not).
      (10) Extra coherency data may be useful in making backups [Andreas Dilger].
           (No particular data were offered, but things like last backup
           timestamp, the data version number and the DOS archive bit would come
           into this category).
      (11) Allow the filesystem to indicate what it can/cannot provide: A
           filesystem can now say it doesn't support a standard stat feature if
           that isn't available, so if, for instance, inode numbers or UIDs don't
           exist or are fabricated locally...
           (This requires a separate system call - I have an fsinfo() call idea
           for this).
      (12) Store a 16-byte volume ID in the superblock that can be returned in
           struct xstat [Steve French].
           (Deferred to fsinfo).
      (13) Include granularity fields in the time data to indicate the
           granularity of each of the times (NFSv4 time_delta) [Steve French].
           (Deferred to fsinfo).
      (14) FS_IOC_GETFLAGS value.  These could be translated to BSD's st_flags.
           Note that the Linux IOC flags are a mess and filesystems such as Ext4
           define flags that aren't in linux/fs.h, so translation in the kernel
           may be a necessity (or, possibly, we provide the filesystem type too).
           (Some attributes are made available in stx_attributes, but the general
           feeling was that the IOC flags were to ext[234]-specific and shouldn't
           be exposed through statx this way).
      (15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
           Michael Kerrisk].
           (Deferred, probably to fsinfo.  Finding out if there's an ACL or
           seclabal might require extra filesystem operations).
      (16) Femtosecond-resolution timestamps [Dave Chinner].
           (A __reserved field has been left in the statx_timestamp struct for
           this - if there proves to be a need).
      (17) A set multiple attributes syscall to go with this.
      The new system call is:
      	int ret = statx(int dfd,
      			const char *filename,
      			unsigned int flags,
      			unsigned int mask,
      			struct statx *buffer);
      The dfd, filename and flags parameters indicate the file to query, in a
      similar way to fstatat().  There is no equivalent of lstat() as that can be
      emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags.  There is
      also no equivalent of fstat() as that can be emulated by passing a NULL
      filename to statx() with the fd of interest in dfd.
      Whether or not statx() synchronises the attributes with the backing store
      can be controlled by OR'ing a value into the flags argument (this typically
      only affects network filesystems):
       (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
       (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
           its attributes with the server - which might require data writeback to
           occur to get the timestamps correct.
       (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
           network filesystem.  The resulting values should be considered
      mask is a bitmask indicating the fields in struct statx that are of
      interest to the caller.  The user should set this to STATX_BASIC_STATS to
      get the basic set returned by stat().  It should be noted that asking for
      more information may entail extra I/O operations.
      buffer points to the destination for the data.  This must be 256 bytes in
      The following structures are defined in which to return the main attribute
      	struct statx_timestamp {
      		__s64	tv_sec;
      		__s32	tv_nsec;
      		__s32	__reserved;
      	struct statx {
      		__u32	stx_mask;
      		__u32	stx_blksize;
      		__u64	stx_attributes;
      		__u32	stx_nlink;
      		__u32	stx_uid;
      		__u32	stx_gid;
      		__u16	stx_mode;
      		__u16	__spare0[1];
      		__u64	stx_ino;
      		__u64	stx_size;
      		__u64	stx_blocks;
      		__u64	__spare1[1];
      		struct statx_timestamp	stx_atime;
      		struct statx_timestamp	stx_btime;
      		struct statx_timestamp	stx_ctime;
      		struct statx_timestamp	stx_mtime;
      		__u32	stx_rdev_major;
      		__u32	stx_rdev_minor;
      		__u32	stx_dev_major;
      		__u32	stx_dev_minor;
      		__u64	__spare2[14];
      The defined bits in request_mask and stx_mask are:
      	STATX_TYPE		Want/got stx_mode & S_IFMT
      	STATX_MODE		Want/got stx_mode & ~S_IFMT
      	STATX_NLINK		Want/got stx_nlink
      	STATX_UID		Want/got stx_uid
      	STATX_GID		Want/got stx_gid
      	STATX_ATIME		Want/got stx_atime{,_ns}
      	STATX_MTIME		Want/got stx_mtime{,_ns}
      	STATX_CTIME		Want/got stx_ctime{,_ns}
      	STATX_INO		Want/got stx_ino
      	STATX_SIZE		Want/got stx_size
      	STATX_BLOCKS		Want/got stx_blocks
      	STATX_BASIC_STATS	[The stuff in the normal stat struct]
      	STATX_BTIME		Want/got stx_btime{,_ns}
      	STATX_ALL		[All currently available stuff]
      stx_btime is the file creation time, stx_mask is a bitmask indicating the
      data provided and __spares*[] are where as-yet undefined fields can be
      Time fields are structures with separate seconds and nanoseconds fields
      plus a reserved field in case we want to add even finer resolution.  Note
      that times will be negative if before 1970; in such a case, the nanosecond
      fields will also be negative if not zero.
      The bits defined in the stx_attributes field convey information about a
      file, how it is accessed, where it is and what it does.  The following
      attributes map to FS_*_FL flags and are the same numerical value:
      	STATX_ATTR_COMPRESSED		File is compressed by the fs
      	STATX_ATTR_IMMUTABLE		File is marked immutable
      	STATX_ATTR_APPEND		File is append-only
      	STATX_ATTR_NODUMP		File is not to be dumped
      	STATX_ATTR_ENCRYPTED		File requires key to decrypt in fs
      Within the kernel, the supported flags are listed by:
      [Are any other IOC flags of sufficient general interest to be exposed
      through this interface?]
      New flags include:
      	STATX_ATTR_AUTOMOUNT		Object is an automount trigger
      These are for the use of GUI tools that might want to mark files specially,
      depending on what they are.
      Fields in struct statx come in a number of classes:
       (0) stx_dev_*, stx_blksize.
           These are local system information and are always available.
       (1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino,
           stx_size, stx_blocks.
           These will be returned whether the caller asks for them or not.  The
           corresponding bits in stx_mask will be set to indicate whether they
           actually have valid values.
           If the caller didn't ask for them, then they may be approximated.  For
           example, NFS won't waste any time updating them from the server,
           unless as a byproduct of updating something requested.
           If the values don't actually exist for the underlying object (such as
           UID or GID on a DOS file), then the bit won't be set in the stx_mask,
           even if the caller asked for the value.  In such a case, the returned
           value will be a fabrication.
           Note that there are instances where the type might not be valid, for
           instance Windows reparse points.
       (2) stx_rdev_*.
           This will be set only if stx_mode indicates we're looking at a
           blockdev or a chardev, otherwise will be 0.
       (3) stx_btime.
           Similar to (1), except this will be set to 0 if it doesn't exist.
      The following test program can be used to test the statx system call:
      Just compile and run, passing it paths to the files you want to examine.
      The file is built automatically if CONFIG_SAMPLES is enabled.
      Here's some example output.  Firstly, an NFS directory that crosses to
      another FSID.  Note that the AUTOMOUNT attribute is set because transiting
      this directory will cause d_automount to be invoked by the VFS.
      	[root@andromeda ~]# /tmp/test-statx -A /warthog/data
      	statx(/warthog/data) = 0
      	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
      	Device: 00:26           Inode: 1703937     Links: 125
      	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
      	Access: 2016-11-24 09:02:12.219699527+0000
      	Modify: 2016-11-17 10:44:36.225653653+0000
      	Change: 2016-11-17 10:44:36.225653653+0000
      	Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)
      Secondly, the result of automounting on that directory.
      	[root@andromeda ~]# /tmp/test-statx /warthog/data
      	statx(/warthog/data) = 0
      	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
      	Device: 00:27           Inode: 2           Links: 125
      	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
      	Access: 2016-11-24 09:02:12.219699527+0000
      	Modify: 2016-11-17 10:44:36.225653653+0000
      	Change: 2016-11-17 10:44:36.225653653+0000
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  20. 02 Mar, 2017 1 commit
  21. 25 Feb, 2017 1 commit