1. 16 Nov, 2019 1 commit
    • mm,thp: recheck each page before collapsing file THP · 4655e5e5
      Song Liu authored
      In collapse_file(), for the !is_shmem case, the current check cannot
      guarantee that the locked page is up-to-date.  Specifically,
      xas_unlock_irq() should not be called before lock_page() and get_page(),
      and it is necessary to recheck PageUptodate() after locking the page.
      
      With this bug and CONFIG_READ_ONLY_THP_FOR_FS=y, madvise(HUGE)'ed .text
      may contain corrupted data.  This is because khugepaged mistakenly
      collapses sub-pages that are not up-to-date into a huge page, and then
      assumes the whole huge page is up-to-date.  This will NOT corrupt data
      on disk, because the page is read-only and never written back.  Fix this
      by properly checking PageUptodate() after locking the page.  This check
      replaces "VM_BUG_ON_PAGE(!PageUptodate(page), page);".
      
      Also, move the PageDirty() check to after locking the page.  khugepaged
      should not currently see dirty file THPs here, because collapsing is
      limited to read-only .text.  The only way to hit a dirty page here is if
      the page has been written to but not yet written back.  Bail out and
      retry when this happens.
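      
      A rough sketch (assuming the SCAN_* codes and labels used elsewhere in
      collapse_file(); not the literal diff) of the ordering and the recheck
      the fix describes:
      
        } else if (trylock_page(page)) {
                get_page(page);
                xas_unlock_irq(&xas);   /* drop the xas lock only once the page is locked */
        } else {
                result = SCAN_PAGE_LOCK;
                goto xa_locked;
        }
      
        /* ... later, with the page still locked: */
        if (!PageUptodate(page) || (!is_shmem && PageDirty(page))) {
                result = SCAN_FAIL;     /* bail out; khugepaged will retry later */
                goto out_unlock;
        }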
      
      syzbot reported a bug in a previous version of this patch.
      
      Link: http://lkml.kernel.org/r/20191106060930.2571389-2-songliubraving@fb.com
      Fixes: 99cb0dbd ("mm,thp: add read-only THP support for (non-shmem) FS")
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Reported-by: syzbot+efb9e48b9fbdc49bb34a@syzkaller.appspotmail.com
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 06 Nov, 2019 1 commit
  3. 24 Sep, 2019 5 commits
  4. 03 Sep, 2019 1 commit
    • sched/topology: Improve load balancing on AMD EPYC systems · a55c7454
      Matt Fleming authored
      SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
      for any sched domain with a NUMA distance greater than 2 hops
      (RECLAIM_DISTANCE).  The idea is that it's expensive to balance
      across domains that far apart.
      
      However, as is rather unfortunately explained in:
      
        commit 32e45ff4 ("mm: increase RECLAIM_DISTANCE to 30")
      
      the value for RECLAIM_DISTANCE is based on node distance tables from
      2011-era hardware.
      
      Current AMD EPYC machines have the following NUMA node distances:
      
       node distances:
       node   0   1   2   3   4   5   6   7
         0:  10  16  16  16  32  32  32  32
         1:  16  10  16  16  32  32  32  32
         2:  16  16  10  16  32  32  32  32
         3:  16  16  16  10  32  32  32  32
         4:  32  32  32  32  10  16  16  16
         5:  32  32  32  32  16  10  16  16
         6:  32  32  32  32  16  16  10  16
         7:  32  32  32  32  16  16  16  10
      
      where 2 hops is 32.
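      
      The stripping logic in sd_init() is roughly of this shape (paraphrased,
      not the literal kernel source), and 32 exceeds RECLAIM_DISTANCE (30):
      
        if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
                sd->flags &= ~(SD_BALANCE_EXEC |
                               SD_BALANCE_FORK |
                               SD_WAKE_AFFINE);
        }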
      
      The result is that the scheduler fails to load balance properly across
      NUMA nodes...
  5. 06 Jul, 2019 1 commit
    • Revert "mm: page cache: store only head pages in i_pages" · 69bf4b6b
      Linus Torvalds authored
      This reverts commit 5fd4ca2d.
      
      Mikhail Gavrilov reports that it causes the VM_BUG_ON_PAGE() in
      __delete_from_swap_cache() to trigger:
      
         page:ffffd6d34dff0000 refcount:1 mapcount:1 mapping:ffff97812323a689 index:0xfecec363
         anon
         flags: 0x17fffe00080034(uptodate|lru|active|swapbacked)
         raw: 0017fffe00080034 ffffd6d34c67c508 ffffd6d3504b8d48 ffff97812323a689
         raw: 00000000fecec363 0000000000000000 0000000100000000 ffff978433ace000
         page dumped because: VM_BUG_ON_PAGE(entry != page)
         page->mem_cgroup:ffff978433ace000
         ------------[ cut here ]------------
         kernel BUG at mm/swap_state.c:170!
         invalid opcode: 0000 [#1] SMP NOPTI
         CPU: 1 PID: 221 Comm: kswapd0 Not tainted 5.2.0-0.rc2.git0.1.fc31.x86_64 #1
         Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2202 04/11/2019
         RIP: 0010:__delete_from_swap_cache+0x20d/0x240
         Code: 30 65 48 33 04 25 28 00 00 00 75 4a 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c6 2f dc 0f 8a 48 89 c7 e8 93 1b fd ff <0f> 0b 48 c7 c6 a8 74 0f 8a e8 85 1b fd ff 0f 0b 48 c7 c6 a8 7d 0f
         RSP: 0018:ffffa982036e7980 EFLAGS: 00010046
         RAX: 0000000000000021 RBX: 0000000000000040 RCX: 0000000000000006
         RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff97843d657900
         RBP: 0000000000000001 R08: ffffa982036e7835 R09: 0000000000000535
         R10: ffff97845e21a46c R11: ffffa982036e7835 R12: ffff978426387120
         R13: 0000000000000000 R14: ffffd6d34dff0040 R15: ffffd6d34dff0000
         FS:  0000000000000000(0000) GS:ffff97843d640000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         CR2: 00002cba88ef5000 CR3: 000000078a97c000 CR4: 00000000003406e0
         Call Trace:
          delete_from_swap_cache+0x46/0xa0
          try_to_free_swap+0xbc/0x110
          swap_writepage+0x13/0x70
          pageout.isra.0+0x13c/0x350
          shrink_page_list+0xc14/0xdf0
          shrink_inactive_list+0x1e5/0x3c0
          shrink_node_memcg+0x202/0x760
          shrink_node+0xe0/0x470
          balance_pgdat+0x2d1/0x510
          kswapd+0x220/0x420
          kthread+0xfb/0x130
          ret_from_fork+0x22/0x40
      
      and it's not immediately obvious why it happens.  It's too late in the
      rc cycle to do anything but revert for now.
      
      Link: https://lore.kernel.org/lkml/CABXGCsN9mYmBD-4GaaeW_NrDu+FDXLzr_6x+XNxfmFV6QkYCDg@mail.gmail.com/
      
      Reported-and-bisected-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 14 Jun, 2019 1 commit
    • coredump: fix race condition between collapse_huge_page() and core dumping · 59ea6d06
      Andrea Arcangeli authored
      When fixing the race conditions between the coredump and the mmap_sem
      holders outside the context of the process, we focused on
      mmget_not_zero()/get_task_mm() callers in 04f5866e ("coredump: fix
      race condition between mmget_not_zero()/get_task_mm() and core
      dumping"), but those aren't the only cases where the mmap_sem can be
      taken outside of the context of the process as Michal Hocko noticed
      while backporting that commit to older -stable kernels.
      
      If mmgrab() is called in the context of the process, but then the
      mm_count reference is transferred outside the context of the process,
      that can also be a problem if the mmap_sem has to be taken for writing
      through that mm_count reference.
      
      khugepaged registration calls mmgrab() in the context of the process,
      but the mmap_sem for writing is taken later in the context of the
      khugepaged kernel thread.
      
      collapse_huge_page() after taking the mmap_sem for writing doesn't
      modify any vma, so it's not obvious that it could cause a problem to the
      coredump, but it happens to modify the pmd in a way that breaks an
      invariant that pmd_trans_huge_lock() relies upon.  collapse_huge_page()
      needs the mmap_sem for writing just to block concurrent page faults that
      call pmd_trans_huge_lock().
      
      Specifically, the invariant that a "!pmd_trans_huge()" pmd cannot become
      a "pmd_trans_huge()" pmd doesn't hold while collapse_huge_page() runs.
      
      The coredump will call __get_user_pages() without mmap_sem for reading,
      which eventually can invoke a lockless page fault which will need a
      functional pmd_trans_huge_lock().
      
      So collapse_huge_page() needs to use mmget_still_valid() to check it's
      not running concurrently with the coredump...  as long as the coredump
      can invoke page faults without holding the mmap_sem for reading.
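      
      A minimal sketch of the check the fix adds right after taking the
      mmap_sem for writing (assuming the SCAN_ANY_PROCESS code used elsewhere
      in khugepaged):
      
        down_write(&mm->mmap_sem);
        result = SCAN_ANY_PROCESS;
        if (!mmget_still_valid(mm))
                goto out;               /* a coredump is in flight; bail out */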
      
      This has "Fixes: khugepaged" to facilitate backporting, but in my view
      it's more a bug in the coredump code that will eventually have to be
      rewritten to stop invoking page faults without the mmap_sem for reading.
      So the long term plan is still to drop all mmget_still_valid().
      
      Link: http://lkml.kernel.org/r/20190607161558.32104-1-aarcange@redhat.com
      Fixes: ba76149f ("thp: khugepaged")
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 14 May, 2019 3 commits
  8. 06 Mar, 2019 1 commit
  9. 28 Dec, 2018 1 commit
  10. 30 Nov, 2018 7 commits
  11. 12 Nov, 2018 1 commit
    • mm: Replace spin_is_locked() with lockdep · 35f3aa39
      Lance Roy authored
      
      
      lockdep_assert_held() is better suited to checking locking requirements,
      since it checks that the current thread holds the lock, regardless of
      whether someone else does.  This is also a step towards possibly removing
      spin_is_locked().
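      
      As an illustration, a khugepaged-style assertion converts roughly like
      this (a hypothetical before/after, not the full diff):
      
        -       VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock));
        +       lockdep_assert_held(&khugepaged_mm_lock);
      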
      Signed-off-by: Lance Roy <ldr709@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <linux-mm@kvack.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
  12. 21 Oct, 2018 2 commits
  13. 30 Sep, 2018 1 commit
    • xarray: Replace exceptional entries · 3159f943
      Matthew Wilcox authored
      
      
      Introduce xarray value entries and tagged pointers to replace radix
      tree exceptional entries.  This is a slight change in encoding to allow
      the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
      value entry).  It is also a change in emphasis; exceptional entries are
      intimidating and different.  As the comment explains, you can choose
      to store values or pointers in the xarray and they are both first-class
      citizens.
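      
      A small usage sketch of the new encoding, using the xa_mk_value() family
      of helpers this change introduces:
      
        void *entry = xa_mk_value(1234);        /* store an integer, not a pointer */
        unsigned long v;
      
        if (xa_is_value(entry))                 /* true only for value entries */
                v = xa_to_value(entry);         /* decodes back to 1234 */
      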
      Signed-off-by: Matthew Wilcox <willy@infradead.org>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
  14. 24 Aug, 2018 1 commit
  15. 17 Aug, 2018 3 commits
  16. 11 Apr, 2018 3 commits
    • page cache: use xa_lock · b93b0163
      Matthew Wilcox authored
      Remove the address_space ->tree_lock and use the xa_lock newly added to
      the radix_tree_root.  Rename the address_space ->page_tree to ->i_pages,
      since we don't really care that it's a tree.
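      
      Call sites convert along these lines (an illustrative sketch, not a hunk
      from the patch):
      
        -       spin_lock_irq(&mapping->tree_lock);
        -       radix_tree_delete(&mapping->page_tree, page->index);
        -       spin_unlock_irq(&mapping->tree_lock);
        +       xa_lock_irq(&mapping->i_pages);
        +       radix_tree_delete(&mapping->i_pages, page->index);
        +       xa_unlock_irq(&mapping->i_pages);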
      
      [willy@infradead.org: fix nds32, fs/dax.c]
        Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
      Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
      
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Acked-by: Jeff Layton <jlayton@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/thp: don't count ZONE_MOVABLE as the target for freepage reserving · b7d349c7
      Joonsoo Kim authored
      There was a regression report for "mm/cma: manage the memory of the CMA
      area by using the ZONE_MOVABLE" [1], and I think it is related to this
      problem.  The CMA patchset makes the system use one more zone
      (ZONE_MOVABLE) and then increases min_free_kbytes.  That reduces usable
      memory and could cause a regression.
      
      ZONE_MOVABLE only has movable pages so we don't need to keep enough
      freepages to avoid or deal with fragmentation.  So, don't count it.
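      
      The skip in set_recommended_min_free_kbytes() looks roughly like this
      (an assumed sketch, not the literal hunk):
      
        for_each_populated_zone(zone) {
                /* ZONE_MOVABLE needs no extra reserve against fragmentation */
                if (zone_idx(zone) > gfp_zone(GFP_USER))
                        continue;
                nr_zones++;
        }
      
      In the numbers below, five populated zones are counted before the change
      and four after (the Movable zones are skipped), so the recommended
      min_free_kbytes scales accordingly: 112640 kB * 4/5 = 90112 kB.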
      
      This greatly changes min_free_kbytes, and thus the min watermark, when
      ZONE_MOVABLE is used.  It allows the user to use more memory.
      
      System:
        22GB ram, fakenuma, 2 nodes. 5 zones are used.
      
      Before:
        min_free_kbytes: 112640
      
        zone_info (min_watermark):
        Node 0, zone      DMA
                min      19
        Node 0, zone    DMA32
                min      3778
        Node 0, zone   Normal
                min      10191
        Node 0, zone  Movable
                min      0
        Node 0, zone   Device
                min      0
        Node 1, zone      DMA
                min      0
        Node 1, zone    DMA32
                min      0
        Node 1, zone   Normal
                min      14043
        Node 1, zone  Movable
                min      127
        Node 1, zone   Device
                min      0
      
      After:
        min_free_kbytes: 90112
      
        zone_info (min_watermark):
        Node 0, zone      DMA
                min      15
        Node 0, zone    DMA32
                min      3022
        Node 0, zone   Normal
                min      8152
        Node 0, zone  Movable
                min      0
        Node 0, zone   Device
                min      0
        Node 1, zone      DMA
                min      0
        Node 1, zone    DMA32
                min      0
        Node 1, zone   Normal
                min      11234
        Node 1, zone  Movable
                min      102
        Node 1, zone   Device
                min      0
      
      [1] lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop
      
      Link: http://lkml.kernel.org/r/1522913236-15776-1-git-send-email-iamjoonsoo.kim@lge.com
      
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg, thp: do not invoke oom killer on thp charges · 2a70f6a7
      Michal Hocko authored
      A THP memcg charge can trigger the oom killer since 25160354 ("mm,
      thp: remove __GFP_NORETRY from khugepaged and madvised allocations").
      We previously used an explicit __GFP_NORETRY, which ruled out the OOM
      killer automagically.
      
      The memcg charge path should be semantically consistent with the
      allocation path, which means that if we do not trigger the OOM killer
      for costly orders in the page allocator, we should not do so in the
      memcg charge path either.  Otherwise we force callers to distinguish the
      two and use different gfp masks, which is both non-intuitive and bug
      prone.  As soon as we get a costly high-order kmalloc user, we would not
      even have any means to pass a memcg-specific gfp mask to prevent OOM,
      because the charging happens deep within the guts of the slab allocator.
      
      The unexpected memcg OOM on THP has already been fixed upstream by
      9d3c3354 ("mm, thp: do not cause memcg oom for thp") but this is a
      one-off fix rather than a generic solution.  Teach mem_cgroup_oom to
      bail out on costly order requests to fix the THP issue as well as any
      other costly OOM eligible allocations to be added in future.
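      
      The resulting check is roughly of this shape (a sketch, assuming the
      memcg charge path's OOM entry point):
      
        static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
        {
                /* costly charges may fail instead of invoking the OOM killer */
                if (order > PAGE_ALLOC_COSTLY_ORDER)
                        return;
                ...
        }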
      
      Also revert 9d3c3354 because special gfp for THP is no longer
      needed.
      
      Link: http://lkml.kernel.org/r/20180403193129.22146-1-mhocko@kernel.org
      Fixes: 25160354 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 23 Mar, 2018 2 commits
  18. 01 Feb, 2018 2 commits
  19. 29 Nov, 2017 1 commit
  20. 27 Nov, 2017 1 commit
    • mm, thp: Do not make pmd/pud dirty without a reason · 152e93af
      Kirill A. Shutemov authored
      
      
      Currently we make page table entries dirty all the time regardless of
      access type and don't even consider if the mapping is write-protected.
      The reasoning is that we don't really need dirty tracking on THP and
      making the entry dirty upfront may save some time on first write to the
      page.
      
      Unfortunately, such an approach may result in a false-positive
      can_follow_write_pmd() for the huge zero page or a read-only shmem file.
      
      Let's make the page dirty only if we are about to write to the page
      anyway (as we do for small pages).
      
      I've restructured the code to make the entry dirty inside
      maybe_p[mu]d_mkwrite(), which also takes into account whether the vma is
      write-protected.
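      
      A sketch of the described restructuring for the pmd case (hypothetical
      signature and shape; the in-tree helper may differ):
      
        static pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma,
                                       bool dirty)
        {
                if (likely(vma->vm_flags & VM_WRITE)) {
                        if (dirty)
                                pmd = pmd_mkdirty(pmd); /* dirty only on an actual write */
                        pmd = pmd_mkwrite(pmd);
                }
                return pmd;
        }
      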
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  21. 16 Nov, 2017 1 commit