1. 16 Nov, 2017 6 commits
    • Pavel Tatashin's avatar
      mm: define memblock_virt_alloc_try_nid_raw · ea1f5f37
      Pavel Tatashin authored
      * A new variant of memblock_virt_alloc_* allocations:
      memblock_virt_alloc_try_nid_raw()
          - Does not zero the allocated memory
          - Does not panic if request cannot be satisfied
      
      * optimize early system hash allocations
      
      Clients can call alloc_large_system_hash() with flag: HASH_ZERO to
      specify that memory that was allocated for system hash needs to be
      zeroed, otherwise the memory does not need to be zeroed, and client will
      initialize it.
      
      If memory does not need to be zero'd, call the new
      memblock_virt_alloc_raw() interface, and thus improve the boot
      performance.
      
      * debug for raw alloctor
      
      When CONFIG_DEBUG_VM is enabled, this patch sets all the memory that is
      returned by memblock_virt_alloc_try_nid_raw() to ones to ensure that no
      places excpect zeroed memory.
      
      Link: http://lkml.kernel.org/r/20171013173214.27300-6-pasha.tatashin@oracle.com
      
      Signed-off-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: default avatarSteven Sistare <steven.sistare@oracle.com>
      Reviewed-by: default avatarDaniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: default avatarBob Picco <bob.picco@oracle.com>
      Tested-by: default avatarBob Picco <bob.picco@oracle.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ea1f5f37
    • Pavel Tatashin's avatar
      mm: deferred_init_memmap improvements · 2f47a91f
      Pavel Tatashin authored
      Patch series "complete deferred page initialization", v12.
      
      SMP machines can benefit from the DEFERRED_STRUCT_PAGE_INIT config
      option, which defers initializing struct pages until all cpus have been
      started so it can be done in parallel.
      
      However, this feature is sub-optimal, because the deferred page
      initialization code expects that the struct pages have already been
      zeroed, and the zeroing is done early in boot with a single thread only.
      Also, we access that memory and set flags before struct pages are
      initialized.  All of this is fixed in this patchset.
      
      In this work we do the following:
       - Never read access struct page until it was initialized
       - Never set any fields in struct pages before they are initialized
       - Zero struct page at the beginning of struct page initialization
      
      ==========================================================================
      Performance improvements on x86 machine with 8 nodes:
      Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz and 1T of memory:
                              TIME          SPEED UP
      base no deferred:       95.796233s
      fix no deferred:        79.978956s    19.77%
      
      base deferred:          77.254713s
      fix deferred:           55.050509s    40.34%
      ==========================================================================
      SPARC M6 3600 MHz with 15T of memory
                              TIME          SPEED UP
      base no deferred:       358.335727s
      fix no deferred:        302.320936s   18.52%
      
      base deferred:          237.534603s
      fix deferred:           182.103003s   30.44%
      ==========================================================================
      Raw dmesg output with timestamps:
      x86 base no deferred:    https://hastebin.com/ofunepurit.scala
      x86 base deferred:       https://hastebin.com/ifazegeyas.scala
      x86 fix no deferred:     https://hastebin.com/pegocohevo.scala
      x86 fix deferred:        https://hastebin.com/ofupevikuk.scala
      sparc base no deferred:  https://hastebin.com/ibobeteken.go
      sparc base deferred:     https://hastebin.com/fariqimiyu.go
      sparc fix no deferred:   https://hastebin.com/muhegoheyi.go
      sparc fix deferred:      https://hastebin.com/xadinobutu.go
      
      This patch (of 11):
      
      deferred_init_memmap() is called when struct pages are initialized later
      in boot by slave CPUs.  This patch simplifies and optimizes this
      function, and also fixes a couple issues (described below).
      
      The main change is that now we are iterating through free memblock areas
      instead of all configured memory.  Thus, we do not have to check if the
      struct page has already been initialized.
      
      =====
      In deferred_init_memmap() where all deferred struct pages are
      initialized we have a check like this:
      
        if (page->flags) {
      	VM_BUG_ON(page_zone(page) != zone);
      	goto free_range;
        }
      
      This way we are checking if the current deferred page has already been
      initialized.  It works, because memory for struct pages has been zeroed,
      and the only way flags are not zero if it went through
      __init_single_page() before.  But, once we change the current behavior
      and won't zero the memory in memblock allocator, we cannot trust
      anything inside "struct page"es until they are initialized.  This patch
      fixes this.
      
      The deferred_init_memmap() is re-written to loop through only free
      memory ranges provided by memblock.
      
      Note, this first issue is relevant only when the following change is
      merged:
      
      =====
      This patch fixes another existing issue on systems that have holes in
      zones i.e CONFIG_HOLES_IN_ZONE is defined.
      
      In for_each_mem_pfn_range() we have code like this:
      
        if (!pfn_valid_within(pfn)
      	goto free_range;
      
      Note: 'page' is not set to NULL and is not incremented but 'pfn'
      advances.  Thus means if deferred struct pages are enabled on systems
      with these kind of holes, linux would get memory corruptions.  I have
      fixed this issue by defining a new macro that performs all the necessary
      operations when we free the current set of pages.
      
      [pasha.tatashin@oracle.com: buddy page accessed before initialized]
        Link: http://lkml.kernel.org/r/20171102170221.7401-2-pasha.tatashin@oracle.com
      Link: http://lkml.kernel.org/r/20171013173214.27300-2-pasha.tatashin@oracle.com
      
      Signed-off-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: default avatarSteven Sistare <steven.sistare@oracle.com>
      Reviewed-by: default avatarDaniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: default avatarBob Picco <bob.picco@oracle.com>
      Tested-by: default avatarBob Picco <bob.picco@oracle.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2f47a91f
    • Levin, Alexander (Sasha Levin)'s avatar
      kmemcheck: remove annotations · 49502766
      Levin, Alexander (Sasha Levin) authored
      Patch series "kmemcheck: kill kmemcheck", v2.
      
      As discussed at LSF/MM, kill kmemcheck.
      
      KASan is a replacement that is able to work without the limitation of
      kmemcheck (single CPU, slow).  KASan is already upstream.
      
      We are also not aware of any users of kmemcheck (or users who don't
      consider KASan as a suitable replacement).
      
      The only objection was that since KASAN wasn't supported by all GCC
      versions provided by distros at that time we should hold off for 2
      years, and try again.
      
      Now that 2 years have passed, and all distros provide gcc that supports
      KASAN, kill kmemcheck again for the very same reasons.
      
      This patch (of 4):
      
      Remove kmemcheck annotations, and calls to kmemcheck from the kernel.
      
      [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
        Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
      Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
      
      Signed-off-by: default avatarSasha Levin <alexander.levin@verizon.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tim Hansen <devtimhansen@gmail.com>
      Cc: Vegard Nossum <vegardno@ifi.uio.no>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      49502766
    • Michal Hocko's avatar
      mm, page_alloc: fail has_unmovable_pages when seeing reserved pages · d7ab3672
      Michal Hocko authored
      Reserved pages should be completely ignored by the core mm because they
      have a special meaning for their owners.  has_unmovable_pages doesn't
      check those so we rely on other tests (reference count, or PageLRU) to
      fail on such pages.  Althought this happens to work it is safer to
      simply check for those explicitly and do not rely on the owner of the
      page to abuse those fields for special purposes.
      
      Please note that this is more of a further fortification of the code
      rahter than a fix of an existing issue.
      
      Link: http://lkml.kernel.org/r/20171013120756.jeopthigbmm3c7bl@dhcp22.suse.cz
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d7ab3672
    • Michal Hocko's avatar
      mm: distinguish CMA and MOVABLE isolation in has_unmovable_pages() · 4da2ce25
      Michal Hocko authored
      Joonsoo has noticed that "mm: drop migrate type checks from
      has_unmovable_pages" would break CMA allocator because it relies on
      has_unmovable_pages returning false even for CMA pageblocks which in
      fact don't have to be movable:
      
       alloc_contig_range
         start_isolate_page_range
           set_migratetype_isolate
             has_unmovable_pages
      
      This is a result of the code sharing between CMA and memory hotplug
      while each one has a different idea of what has_unmovable_pages should
      return.  This is unfortunate but fixing it properly would require a lot
      of code duplication.
      
      Fix the issue by introducing the requested migrate type argument and
      special case MIGRATE_CMA case where CMA page blocks are handled
      properly.  This will work for memory hotplug because it requires
      MIGRATE_MOVABLE.
      
      Link: http://lkml.kernel.org/r/20171019122118.y6cndierwl2vnguj@dhcp22.suse.cz
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Tested-by: default avatarStefan Wahren <stefan.wahren@i2se.com>
      Tested-by: default avatarRan Wang <ran.wang_1@nxp.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4da2ce25
    • Michal Hocko's avatar
      mm: drop migrate type checks from has_unmovable_pages · d7b236e1
      Michal Hocko authored
      Michael has noticed that the memory offline tries to migrate kernel code
      pages when doing
      
       echo 0 > /sys/devices/system/memory/memory0/online
      
      The current implementation will fail the operation after several failed
      page migration attempts but we shouldn't even attempt to migrate that
      memory and fail right away because this memory is clearly not
      migrateable.  This will become a real problem when we drop the retry
      loop counter resp.  timeout.
      
      The real problem is in has_unmovable_pages in fact.  We should fail if
      there are any non migrateable pages in the area.  In orther to guarantee
      that remove the migrate type checks because MIGRATE_MOVABLE is not
      guaranteed to contain only migrateable pages.  It is merely a heuristic.
      Similarly MIGRATE_CMA does guarantee that the page allocator doesn't
      allocate any non-migrateable pages from the block but CMA allocations
      themselves are unlikely to migrateable.  Therefore remove both checks.
      
      [akpm@linux-foundation.org: remove unused local `mt']
      Link: http://lkml.kernel.org/r/20171013120013.698-1-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Tested-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Tested-by: default avatarTony Lindgren <tony@atomide.com>
      Tested-by: default avatarRan Wang <ran.wang_1@nxp.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d7b236e1
  2. 07 Nov, 2017 1 commit
  3. 20 Oct, 2017 1 commit
  4. 04 Oct, 2017 2 commits
  5. 09 Sep, 2017 2 commits
    • Tetsuo Handa's avatar
      mm/page_alloc.c: apply gfp_allowed_mask before the first allocation attempt · f19360f0
      Tetsuo Handa authored
      We are by error initializing alloc_flags before gfp_allowed_mask is
      applied.  This could cause problems after pm_restrict_gfp_mask() is called
      during suspend operation.  Apply gfp_allowed_mask before initializing
      alloc_flags so that the first allocation attempt uses correct flags.
      
      Link: http://lkml.kernel.org/r/201709020016.ADJ21342.OFLJHOOSMFVtFQ@I-love.SAKURA.ne.jp
      Fixes: 83d4ca81
      
       ("mm, page_alloc: move __GFP_HARDWALL modifications out of the fastpath")
      Signed-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f19360f0
    • Kemi Wang's avatar
      mm: change the call sites of numa statistics items · 3a321d2a
      Kemi Wang authored
      Patch series "Separate NUMA statistics from zone statistics", v2.
      
      Each page allocation updates a set of per-zone statistics with a call to
      zone_statistics().  As discussed in 2017 MM summit, these are a
      substantial source of overhead in the page allocator and are very rarely
      consumed.  This significant overhead in cache bouncing caused by zone
      counters (NUMA associated counters) update in parallel in multi-threaded
      page allocation (pointed out by Dave Hansen).
      
      A link to the MM summit slides:
        http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf
      
      To mitigate this overhead, this patchset separates NUMA statistics from
      zone statistics framework, and update NUMA counter threshold to a fixed
      size of MAX_U16 - 2, as a small threshold greatly increases the update
      frequency of the global counter from local per cpu counter (suggested by
      Ying Huang).  The rationality is that these statistics counters don't
      need to be read often, unlike other VM counters, so it's not a problem
      to use a large threshold and make readers more expensive.
      
      With this patchset, we see 31.3% drop of CPU cycles(537-->369, see
      below) for per single page allocation and reclaim on Jesper's
      page_bench03 benchmark.  Meanwhile, this patchset keeps the same style
      of virtual memory statistics with little end-user-visible effects (only
      move the numa stats to show behind zone page stats, see the first patch
      for details).
      
      I did an experiment of single page allocation and reclaim concurrently
      using Jesper's page_bench03 benchmark on a 2-Socket Broadwell-based
      server (88 processors with 126G memory) with different size of threshold
      of pcp counter.
      
      Benchmark provided by Jesper D Brouer(increase loop times to 10000000):
        https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
      
         Threshold   CPU cycles    Throughput(88 threads)
            32        799         241760478
            64        640         301628829
            125       537         358906028 <==> system by default
            256       468         412397590
            512       428         450550704
            4096      399         482520943
            20000     394         489009617
            30000     395         488017817
            65533     369(-31.3%) 521661345(+45.3%) <==> with this patchset
            N/A       342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
      
      This patch (of 3):
      
      In this patch, NUMA statistics is separated from zone statistics
      framework, all the call sites of NUMA stats are changed to use
      numa-stats-specific functions, it does not have any functionality change
      except that the number of NUMA stats is shown behind zone page stats
      when users *read* the zone info.
      
      E.g. cat /proc/zoneinfo
          ***Base***                           ***With this patch***
      nr_free_pages 3976                         nr_free_pages 3976
      nr_zone_inactive_anon 0                    nr_zone_inactive_anon 0
      nr_zone_active_anon 0                      nr_zone_active_anon 0
      nr_zone_inactive_file 0                    nr_zone_inactive_file 0
      nr_zone_active_file 0                      nr_zone_active_file 0
      nr_zone_unevictable 0                      nr_zone_unevictable 0
      nr_zone_write_pending 0                    nr_zone_write_pending 0
      nr_mlock     0                             nr_mlock     0
      nr_page_table_pages 0                      nr_page_table_pages 0
      nr_kernel_stack 0                          nr_kernel_stack 0
      nr_bounce    0                             nr_bounce    0
      nr_zspages   0                             nr_zspages   0
      numa_hit 0                                *nr_free_cma  0*
      numa_miss 0                                numa_hit     0
      numa_foreign 0                             numa_miss    0
      numa_interleave 0                          numa_foreign 0
      numa_local   0                             numa_interleave 0
      numa_other   0                             numa_local   0
      *nr_free_cma 0*                            numa_other 0
          ...                                        ...
      vm stats threshold: 10                     vm stats threshold: 10
          ...                                        ...
      
      The next patch updates the numa stats counter size and threshold.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/1503568801-21305-2-git-send-email-kemi.wang@intel.com
      
      Signed-off-by: default avatarKemi Wang <kemi.wang@intel.com>
      Reported-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Ying Huang <ying.huang@intel.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3a321d2a
  6. 07 Sep, 2017 10 commits
    • Michal Hocko's avatar
      mm, oom: do not rely on TIF_MEMDIE for memory reserves access · cd04ae1e
      Michal Hocko authored
      For ages we have been relying on TIF_MEMDIE thread flag to mark OOM
      victims and then, among other things, to give these threads full access
      to memory reserves.  There are few shortcomings of this implementation,
      though.
      
      First of all and the most serious one is that the full access to memory
      reserves is quite dangerous because we leave no safety room for the
      system to operate and potentially do last emergency steps to move on.
      
      Secondly this flag is per task_struct while the OOM killer operates on
      mm_struct granularity so all processes sharing the given mm are killed.
      Giving the full access to all these task_structs could lead to a quick
      memory reserves depletion.  We have tried to reduce this risk by giving
      TIF_MEMDIE only to the main thread and the currently allocating task but
      that doesn't really solve this problem while it surely opens up a room
      for corner cases - e.g.  GFP_NO{FS,IO} requests might loop inside the
      allocator without access to memory reserves because a particular thread
      was not the group leader.
      
      Now that we have the oom reaper and that all oom victims are reapable
      after 1b51e65e ("oom, oom_reaper: allow to reap mm shared by the
      kthreads") we can be more conservative and grant only partial access to
      memory reserves because there are reasonable chances of the parallel
      memory freeing.  We still want some access to reserves because we do not
      want other consumers to eat up the victim's freed memory.  oom victims
      will still contend with __GFP_HIGH users but those shouldn't be so
      aggressive to starve oom victims completely.
      
      Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to
      the half of the reserves.  This makes the access to reserves independent
      on which task has passed through mark_oom_victim.  Also drop any usage
      of TIF_MEMDIE from the page allocator proper and replace it by
      tsk_is_oom_victim as well which will make page_alloc.c completely
      TIF_MEMDIE free finally.
      
      CONFIG_MMU=n doesn't have oom reaper so let's stick to the original
      ALLOC_NO_WATERMARKS approach.
      
      There is a demand to make the oom killer memcg aware which will imply
      many tasks killed at once.  This change will allow such a usecase
      without worrying about complete memory reserves depletion.
      
      Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cd04ae1e
    • Michal Hocko's avatar
      mm: rename global_page_state to global_zone_page_state · c41f012a
      Michal Hocko authored
      global_page_state is error prone as a recent bug report pointed out [1].
      It only returns proper values for zone based counters as the enum it
      gets suggests.  We already have global_node_page_state so let's rename
      global_page_state to global_zone_page_state to be more explicit here.
      All existing users seems to be correct:
      
      $ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
            2 NR_BOUNCE
            2 NR_FREE_CMA_PAGES
           11 NR_FREE_PAGES
            1 NR_KERNEL_STACK_KB
            1 NR_MLOCK
            2 NR_PAGETABLE
      
      This patch shouldn't introduce any functional change.
      
      [1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp
      
      Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c41f012a
    • Michal Hocko's avatar
      mm, memory_hotplug: get rid of zonelists_mutex · b93e0f32
      Michal Hocko authored
      zonelists_mutex was introduced by commit 4eaf3f64 ("mem-hotplug: fix
      potential race while building zonelist for new populated zone") to
      protect zonelist building from races.  This is no longer needed though
      because both memory online and offline are fully serialized.  New users
      have grown since then.
      
      Notably setup_per_zone_wmarks wants to prevent from races between memory
      hotplug, khugepaged setup and manual min_free_kbytes update via sysctl
      (see cfd3da1e ("mm: Serialize access to min_free_kbytes").  Let's
      add a private lock for that purpose.  This will not prevent from seeing
      halfway through memory hotplug operation but that shouldn't be a big
      deal becuse memory hotplug will update watermarks explicitly so we will
      eventually get a full picture.  The lock just makes sure we won't race
      when updating watermarks leading to weird results.
      
      Also __build_all_zonelists manipulates global data so add a private lock
      for it as well.  This doesn't seem to be necessary today but it is more
      robust to have a lock there.
      
      While we are at it make sure we document that memory online/offline
      depends on a full serialization either via mem_hotplug_begin() or
      device_lock.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-9-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Haicheng Li <haicheng.li@linux.intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b93e0f32
    • Michal Hocko's avatar
      mm, page_alloc: remove stop_machine from build_all_zonelists · 11cd8638
      Michal Hocko authored
      build_all_zonelists has been (ab)using stop_machine to make sure that
      zonelists do not change while somebody is looking at them.  This is is
      just a gross hack because a) it complicates the context from which we
      can call build_all_zonelists (see 3f906ba2 ("mm/memory-hotplug:
      switch locking to a percpu rwsem")) and b) is is not really necessary
      especially after "mm, page_alloc: simplify zonelist initialization" and
      c) it doesn't really provide the protection it claims (see below).
      
      Updates of the zonelists happen very seldom, basically only when a zone
      becomes populated during memory online or when it loses all the memory
      during offline.  A racing iteration over zonelists could either miss a
      zone or try to work on one zone twice.  Both of these are something we
      can live with occasionally because there will always be at least one
      zone visible so we are not likely to fail allocation too easily for
      example.
      
      Please note that the original stop_machine approach doesn't really
      provide a better exclusion because the iteration might be interrupted
      half way (unless the whole iteration is preempt disabled which is not
      the case in most cases) so the some zones could still be seen twice or a
      zone missed.
      
      I have run the pathological online/offline of the single memblock in the
      movable zone while stressing the same small node with some memory
      pressure.
      
      Node 1, zone      DMA
        pages free     0
              min      0
              low      0
              high     0
              spanned  0
              present  0
              managed  0
              protection: (0, 943, 943, 943)
      Node 1, zone    DMA32
        pages free     227310
              min      8294
              low      10367
              high     12440
              spanned  262112
              present  262112
              managed  241436
              protection: (0, 0, 0, 0)
      Node 1, zone   Normal
        pages free     0
              min      0
              low      0
              high     0
              spanned  0
              present  0
              managed  0
              protection: (0, 0, 0, 1024)
      Node 1, zone  Movable
        pages free     32722
              min      85
              low      117
              high     149
              spanned  32768
              present  32768
              managed  32768
              protection: (0, 0, 0, 0)
      
      root@test1:/sys/devices/system/node/node1# while true
      do
      	echo offline > memory34/state
      	echo online_movable > memory34/state
      done
      
      root@test1:/mnt/data/test/linux-3.7-rc5# numactl --preferred=1 make -j4
      
      and it survived without any unexpected behavior.  While this is not
      really a great testing coverage it should exercise the allocation path
      quite a lot.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-8-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      11cd8638
    • Michal Hocko's avatar
      mm, page_alloc: simplify zonelist initialization · 9d3be21b
      Michal Hocko authored
      build_zonelists gradually builds zonelists from the nearest to the most
      distant node.  As we do not know how many populated zones we will have
      in each node we rely on the _zoneref to terminate initialized part of
      the zonelist by a NULL zone.  While this is functionally correct it is
      quite suboptimal because we cannot allow updaters to race with zonelists
      users because they could see an empty zonelist and fail the allocation
      or hit the OOM killer in the worst case.
      
      We can do much better, though.  We can store the node ordering into an
      already existing node_order array and then give this array to
      build_zonelists_in_node_order and do the whole initialization at once.
      zonelists consumers still might see halfway initialized state but that
      should be much more tolerateable because the list will not be empty and
      they would either see some zone twice or skip over some zone(s) in the
      worst case which shouldn't lead to immediate failures.
      
      While at it let's simplify build_zonelists_node which is rather
      confusing now.  It gets an index into the zoneref array and returns the
      updated index for the next iteration.  Let's rename the function to
      build_zonerefs_node to better reflect its purpose and give it zoneref
      array to update.  The function doesn't the index anymore.  It just
      returns the number of added zones so that the caller can advance the
      zonered array start for the next update.
      
      This patch alone doesn't introduce any functional change yet, though, it
      is merely a preparatory work for later changes.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-7-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9d3be21b
    • Michal Hocko's avatar
      mm, memory_hotplug: drop zone from build_all_zonelists · 72675e13
      Michal Hocko authored
      build_all_zonelists gets a zone parameter to initialize zone's pagesets.
      There is only a single user which gives a non-NULL zone parameter and
      that one doesn't really need the rest of the build_all_zonelists (see
      commit 6dcd73d7 ("memory-hotplug: allocate zone's pcp before
      onlining pages")).
      
      Therefore remove setup_zone_pageset from build_all_zonelists and call it
      from its only user directly.  This will also remove a pointless zonlists
      rebuilding which is always good.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-5-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      72675e13
    • Michal Hocko's avatar
      mm, page_alloc: do not set_cpu_numa_mem on empty nodes initialization · d9c9a0b9
      Michal Hocko authored
      __build_all_zonelists reinitializes each online cpu local node for
      CONFIG_HAVE_MEMORYLESS_NODES.  This makes sense because previously
      memory less nodes could gain some memory during memory hotplug and so
      the local node should be changed for CPUs close to such a node.  It
      makes less sense to do that unconditionally for a newly creaded NUMA
      node which is still offline and without any memory.
      
      Let's also simplify the cpu loop and use for_each_online_cpu instead of
      an explicit cpu_online check for all possible cpus.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-4-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d9c9a0b9
    • Michal Hocko's avatar
      mm, page_alloc: remove boot pageset initialization from memory hotplug · afb6ebb3
      Michal Hocko authored
      boot_pageset is a boot time hack which gets superseded by normal
      pagesets later in the boot process.  It makes zero sense to reinitialize
      it again and again during memory hotplug.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-3-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      afb6ebb3
    • Michal Hocko's avatar
      mm, page_alloc: rip out ZONELIST_ORDER_ZONE · c9bff3ee
      Michal Hocko authored
      Patch series "cleanup zonelists initialization", v1.
      
      This is aimed at cleaning up the zonelists initialization code we have
      but the primary motivation was bug report [2] which got resolved but the
      usage of stop_machine is just too ugly to live.  Most patches are
      straightforward but 3 of them need a special consideration.
      
      Patch 1 removes zone ordered zonelists completely.  I am CCing linux-api
      because this is a user visible change.  As I argue in the patch
      description I do not think we have a strong usecase for it these days.
      I have kept sysctl in place and warn into the log if somebody tries to
      configure zone lists ordering.  If somebody has a real usecase for it we
      can revert this patch but I do not expect anybody will actually notice
      runtime differences.  This patch is not strictly needed for the rest but
      it made patch 6 easier to implement.
      
      Patch 7 removes stop_machine from build_all_zonelists without adding any
      special synchronization between iterators and updater which I _believe_
      is acceptable as explained in the changelog.  I hope I am not missing
      anything.
      
      Patch 8 then removes zonelists_mutex which is kind of ugly as well and
      not really needed AFAICS but a care should be taken when double checking
      my thinking.
      
      This patch (of 9):
      
      Supporting zone ordered zonelists costs us just a lot of code while the
      usefulness is arguable if existent at all.  Mel has already made node
      ordering default on 64b systems.  32b systems are still using
      ZONELIST_ORDER_ZONE because it is considered better to fallback to a
      different NUMA node rather than consume precious lowmem zones.
      
      This argument is, however, weaken by the fact that the memory reclaim
      has been reworked to be node rather than zone oriented.  This means that
      lowmem requests have to skip over all highmem pages on LRUs already and
      so zone ordering doesn't save the reclaim time much.  So the only
      advantage of the zone ordering is under a light memory pressure when
      highmem requests do not ever hit into lowmem zones and the lowmem
      pressure doesn't need to reclaim.
      
      Considering that 32b NUMA systems are rather suboptimal already and it
      is generally advisable to use 64b kernel on such a HW I believe we
      should rather care about the code maintainability and just get rid of
      ZONELIST_ORDER_ZONE altogether.  Keep systcl in place and warn if
      somebody tries to set zone ordering either from kernel command line or
      the sysctl.
      
      [mhocko@suse.com: reading vm.numa_zonelist_order will never terminate]
      Link: http://lkml.kernel.org/r/20170721143915.14161-2-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c9bff3ee
    • Wei Yang's avatar
      mm/memory_hotplug: just build zonelist for newly added node · c1152583
      Wei Yang authored
      Commit 9adb62a5 ("mm/hotplug: correctly setup fallback zonelists
      when creating new pgdat") tries to build the correct zonelist for a
      newly added node, while it is not necessary to rebuild it for already
      exist nodes.
      
      In build_zonelists(), it will iterate on nodes with memory.  For a newly
      added node, it will have memory until node_states_set_node() is called
      in online_pages().
      
      This patch avoids rebuilding the zonelists for already existing nodes.
      
      build_zonelists_node() uses managed_zone(zone) checks, so it should not
      include empty zones anyway.  So effectively we avoid some pointless work
      under stop_machine().
      
      [akpm@linux-foundation.org: tweak comment text]
      [akpm@linux-foundation.org: coding-style tweak, per Vlastimil]
      Link: http://lkml.kernel.org/r/20170626035822.50155-1-richard.weiyang@gmail.com
      
      Signed-off-by: default avatarWei Yang <richard.weiyang@gmail.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c1152583
  7. 31 Aug, 2017 1 commit
  8. 25 Aug, 2017 1 commit
    • Chen Yu's avatar
      PM/hibernate: touch NMI watchdog when creating snapshot · 556b969a
      Chen Yu authored
      There is a problem that when counting the pages for creating the
      hibernation snapshot will take significant amount of time, especially on
      system with large memory.  Since the counting job is performed with irq
      disabled, this might lead to NMI lockup.  The following warning were
      found on a system with 1.5TB DRAM:
      
        Freezing user space processes ... (elapsed 0.002 seconds) done.
        OOM killer disabled.
        PM: Preallocating image memory...
        NMI watchdog: Watchdog detected hard LOCKUP on cpu 27
        CPU: 27 PID: 3128 Comm: systemd-sleep Not tainted 4.13.0-0.rc2.git0.1.fc27.x86_64 #1
        task: ffff9f01971ac000 task.stack: ffffb1a3f325c000
        RIP: 0010:memory_bm_find_bit+0xf4/0x100
        Call Trace:
         swsusp_set_page_free+0x2b/0x30
         mark_free_pages+0x147/0x1c0
         count_data_pages+0x41/0xa0
         hibernate_preallocate_memory+0x80/0x450
         hibernation_snapshot+0x58/0x410
         hibernate+0x17c/0x310
         state_store+0xdf/0xf0
         kobj_attr_store+0xf/0x20
         sysfs_kf_write+0x37/0x40
         kernfs_fop_write+0x11c/0x1a0
         __vfs_write+0x37/0x170
         vfs_write+0xb1/0x1a0
         SyS_write+0x55/0xc0
         entry_SYSCALL_64_fastpath+0x1a/0xa5
        ...
        done (allocated 6590003 pages)
        PM: Allocated 26360012 kbytes in 19.89 seconds (1325.28 MB/s)
      
      It has taken nearly 20 seconds(2.10GHz CPU) thus the NMI lockup was
      triggered.  In case the timeout of the NMI watch dog has been set to 1
      second, a safe interval should be 6590003/20 = 320k pages in theory.
      However there might also be some platforms running at a lower frequency,
      so feed the watchdog every 100k pages.
      
      [yu.c.chen@intel.com: simplification]
        Link: http://lkml.kernel.org/r/1503460079-29721-1-git-send-email-yu.c.chen@intel.com
      [yu.c.chen@intel.com: use interval of 128k instead of 100k to avoid modulus]
      Link: http://lkml.kernel.org/r/1503328098-5120-1-git-send-email-yu.c.chen@intel.com
      
      Signed-off-by: default avatarChen Yu <yu.c.chen@intel.com>
      Reported-by: default avatarJan Filipcewicz <jan.filipcewicz@intel.com>
      Suggested-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      556b969a
  9. 18 Aug, 2017 1 commit
  10. 10 Aug, 2017 3 commits
    • Jonathan Toppins's avatar
      mm: ratelimit PFNs busy info message · 75dddef3
      Jonathan Toppins authored
      The RDMA subsystem can generate several thousand of these messages per
      second eventually leading to a kernel crash.  Ratelimit these messages
      to prevent this crash.
      
      Doug said:
       "I've been carrying a version of this for several kernel versions. I
        don't remember when they started, but we have one (and only one) class
        of machines: Dell PE R730xd, that generate these errors. When it
        happens, without a rate limit, we get rcu timeouts and kernel oopses.
        With the rate limit, we just get a lot of annoying kernel messages but
        the machine continues on, recovers, and eventually the memory
        operations all succeed"
      
      And:
       "> Well... why are all these EBUSY's occurring? It sounds inefficient
        > (at least) but if it is expected, normal and unavoidable then
        > perhaps we should just remove that message altogether?
      
        I don't have an answer to that question. To be honest, I haven't
        looked real hard. We never had this at all, then it started out of the
        blue, but only on our Dell 730xd machines (and it hits all of them),
        but no other classes or brands of machines. And we have our 730xd
        machines loaded up with different brands and models of cards (for
        instance one dedicated to mlx4 hardware, one for qib, one for mlx5, an
        ocrdma/cxgb4 combo, etc), so the fact that it hit all of the machines
        meant it wasn't tied to any particular brand/model of RDMA hardware.
        To me, it always smelled of a hardware oddity specific to maybe the
        CPUs or mainboard chipsets in these machines, so given that I'm not an
        mm expert anyway, I never chased it down.
      
        A few other relevant details: it showed up somewhere around 4.8/4.9 or
        thereabouts. It never happened before, but the prinkt has been there
        since the 3.18 days, so possibly the test to trigger this message was
        changed, or something else in the allocator changed such that the
        situation started happening on these machines?
      
        And, like I said, it is specific to our 730xd machines (but they are
        all identical, so that could mean it's something like their specific
        ram configuration is causing the allocator to hit this on these
        machine but not on other machines in the cluster, I don't want to say
        it's necessarily the model of chipset or CPU, there are other bits of
        identicalness between these machines)"
      
      Link: http://lkml.kernel.org/r/499c0f6cc10d6eb829a67f2a4d75b4228a9b356e.1501695897.git.jtoppins@redhat.com
      
      Signed-off-by: default avatarJonathan Toppins <jtoppins@redhat.com>
      Reviewed-by: default avatarDoug Ledford <dledford@redhat.com>
      Tested-by: default avatarDoug Ledford <dledford@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      75dddef3
    • Johannes Weiner's avatar
      mm: fix global NR_SLAB_.*CLAIMABLE counter reads · d507e2eb
      Johannes Weiner authored
      As Tetsuo points out:
       "Commit 385386cf ("mm: vmstat: move slab statistics from zone to
        node counters") broke "Slab:" field of /proc/meminfo . It shows nearly
        0kB"
      
      In addition to /proc/meminfo, this problem also affects the slab
      counters OOM/allocation failure info dumps, can cause early -ENOMEM from
      overcommit protection, and miscalculate image size requirements during
      suspend-to-disk.
      
      This is because the patch in question switched the slab counters from
      the zone level to the node level, but forgot to update the global
      accessor functions to read the aggregate node data instead of the
      aggregate zone data.
      
      Use global_node_page_state() to access the global slab counters.
      
      Fixes: 385386cf ("mm: vmstat: move slab statistics from zone to node counters")
      Link: http://lkml.kernel.org/r/20170801134256.5400-1-hannes@cmpxchg.org
      
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: default avatarTetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Stefan Agner <stefan@agner.ch>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d507e2eb
    • Peter Zijlstra's avatar
      locking/lockdep: Rework FS_RECLAIM annotation · d92a8cfc
      Peter Zijlstra authored
      
      
      A while ago someone, and I cannot find the email just now, asked if we
      could not implement the RECLAIM_FS inversion stuff with a 'fake' lock
      like we use for other things like workqueues etc. I think this should
      be possible which allows reducing the 'irq' states and will reduce the
      amount of __bfs() lookups we do.
      
      Removing the 1 IRQ state results in 4 less __bfs() walks per
      dependency, improving lockdep performance. And by moving this
      annotation out of the lockdep code it becomes easier for the mm people
      to extend.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: boqun.feng@gmail.com
      Cc: iamjoonsoo.kim@lge.com
      Cc: kernel-team@lge.com
      Cc: kirill@shutemov.name
      Cc: npiggin@gmail.com
      Cc: walken@google.com
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      d92a8cfc
  11. 03 Aug, 2017 1 commit
    • Heiko Carstens's avatar
      mm: take memory hotplug lock within numa_zonelist_order_handler() · 167d0f25
      Heiko Carstens authored
      Andre Wild reported the following warning:
      
        WARNING: CPU: 2 PID: 1205 at kernel/cpu.c:240 lockdep_assert_cpus_held+0x4c/0x60
        Modules linked in:
        CPU: 2 PID: 1205 Comm: bash Not tainted 4.13.0-rc2-00022-gfd2b2c57 #10
        Hardware name: IBM 2964 N96 702 (z/VM 6.4.0)
        task: 00000000701d8100 task.stack: 0000000073594000
        Krnl PSW : 0704f00180000000 0000000000145e24 (lockdep_assert_cpus_held+0x4c/0x60)
        ...
        Call Trace:
         lockdep_assert_cpus_held+0x42/0x60)
         stop_machine_cpuslocked+0x62/0xf0
         build_all_zonelists+0x92/0x150
         numa_zonelist_order_handler+0x102/0x150
         proc_sys_call_handler.isra.12+0xda/0x118
         proc_sys_write+0x34/0x48
         __vfs_write+0x3c/0x178
         vfs_write+0xbc/0x1a0
         SyS_write+0x66/0xc0
         system_call+0xc4/0x2b0
         locks held by bash/1205:
         #0:  (sb_writers#4){.+.+.+}, at: vfs_write+0xa6/0x1a0
         #1:  (zl_order_mutex){+.+...}, at: numa_zonelist_order_handler+0x44/0x150
         #2:  (zonelists_mutex){+.+...}, at: numa_zonelist_order_handler+0xf4/0x150
        Last Breaking-Event-Address:
          lockdep_assert_cpus_held+0x48/0x60
      
      This can be easily triggered with e.g.
      
          echo n > /proc/sys/vm/numa_zonelist_order
      
      In commit 3f906ba2 ("mm/memory-hotplug: switch locking to a percpu
      rwsem") memory hotplug locking was changed to fix a potential deadlock.
      
      This also switched the stop_machine() invocation within
      build_all_zonelists() to stop_machine_cpuslocked() which now expects
      that online cpus are locked when being called.
      
      This assumption is not true if build_all_zonelists() is being called
      from numa_zonelist_order_handler().
      
      In order to fix this simply add a mem_hotplug_begin()/mem_hotplug_done()
      pair to numa_zonelist_order_handler().
      
      Link: http://lkml.kernel.org/r/20170726111738.38768-1-heiko.carstens@de.ibm.com
      Fixes: 3f906ba2
      
       ("mm/memory-hotplug: switch locking to a percpu rwsem")
      Signed-off-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Reported-by: default avatarAndre Wild <wild@linux.vnet.ibm.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      167d0f25
  12. 12 Jul, 2017 1 commit
    • Michal Hocko's avatar
      mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic · dcda9b04
      Michal Hocko authored
      __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
      the page allocator.  This has been true but only for allocations
      requests larger than PAGE_ALLOC_COSTLY_ORDER.  It has been always
      ignored for smaller sizes.  This is a bit unfortunate because there is
      no way to express the same semantic for those requests and they are
      considered too important to fail so they might end up looping in the
      page allocator for ever, similarly to GFP_NOFAIL requests.
      
      Now that the whole tree has been cleaned up and accidental or misled
      usage of __GFP_REPEAT flag has been removed for !costly requests we can
      give the original flag a better name and more importantly a more useful
      semantic.  Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
      that the allocator would try really hard but there is no promise of a
      success.  This will work independent of the order and overrides the
      default allocator behavior.  Page allocator users have several levels of
      guarantee vs.  cost options (take GFP_KERNEL as an example)
      
       - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
         attempt to free memory at all. The most light weight mode which even
         doesn't kick the background reclaim. Should be used carefully because
         it might deplete the memory and the next user might hit the more
         aggressive reclaim
      
       - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
         allocation without any attempt to free memory from the current
         context but can wake kswapd to reclaim memory if the zone is below
         the low watermark. Can be used from either atomic contexts or when
         the request is a performance optimization and there is another
         fallback for a slow path.
      
       - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
         non sleeping allocation with an expensive fallback so it can access
         some portion of memory reserves. Usually used from interrupt/bh
         context with an expensive slow path fallback.
      
       - GFP_KERNEL - both background and direct reclaim are allowed and the
         _default_ page allocator behavior is used. That means that !costly
         allocation requests are basically nofail but there is no guarantee of
         that behavior so failures have to be checked properly by callers
         (e.g. OOM killer victim is allowed to fail currently).
      
       - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
         and all allocation requests fail early rather than cause disruptive
         reclaim (one round of reclaim in this implementation). The OOM killer
         is not invoked.
      
       - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
         behavior and all allocation requests try really hard. The request
         will fail if the reclaim cannot make any progress. The OOM killer
         won't be triggered.
      
       - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
         and all allocation requests will loop endlessly until they succeed.
         This might be really dangerous especially for larger orders.
      
      Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
      because they already had their semantic.  No new users are added.
      __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
      there is no progress and we have already passed the OOM point.
      
      This means that all the reclaim opportunities have been exhausted except
      the most disruptive one (the OOM killer) and a user defined fallback
      behavior is more sensible than keep retrying in the page allocator.
      
      [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
      [mhocko@suse.com: semantic fix]
        Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
      [mhocko@kernel.org: address other thing spotted by Vlastimil]
        Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Alex Belits <alex.belits@cavium.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dcda9b04
  13. 10 Jul, 2017 3 commits
    • Thomas Gleixner's avatar
      mm/memory-hotplug: switch locking to a percpu rwsem · 3f906ba2
      Thomas Gleixner authored
      Andrey reported a potential deadlock with the memory hotplug lock and
      the cpu hotplug lock.
      
      The reason is that memory hotplug takes the memory hotplug lock and then
      calls stop_machine() which calls get_online_cpus().  That's the reverse
      lock order to get_online_cpus(); get_online_mems(); in mm/slub_common.c
      
      The problem has been there forever.  The reason why this was never
      reported is that the cpu hotplug locking had this homebrewn recursive
      reader writer semaphore construct which due to the recursion evaded the
      full lock dep coverage.  The memory hotplug code copied that construct
      verbatim and therefor has similar issues.
      
      Three steps to fix this:
      
      1) Convert the memory hotplug locking to a per cpu rwsem so the
         potential issues get reported proper by lockdep.
      
      2) Lock the online cpus in mem_hotplug_begin() before taking the memory
         hotplug rwsem and use stop_machine_cpuslocked() in the page_alloc
         code to avoid recursive locking.
      
      3) The cpu hotpluck locking in #2 causes a recursive locking of the cpu
         hotplug lock via __offline_pages() -> lru_add_drain_all(). Solve this
         by invoking lru_add_drain_all_cpuslocked() instead.
      
      Link: http://lkml.kernel.org/r/20170704093421.506836322@linutronix.de
      
      Reported-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3f906ba2
    • Rasmus Villemoes's avatar
      mm/page_alloc.c: eliminate unsigned confusion in __rmqueue_fallback · b002529d
      Rasmus Villemoes authored
      Since current_order starts as MAX_ORDER-1 and is then only decremented,
      the second half of the loop condition seems superfluous.  However, if
      order is 0, we may decrement current_order past 0, making it UINT_MAX.
      This is obviously too subtle ([1], [2]).
      
      Since we need to add some comment anyway, change the two variables to
      signed, making the counting-down for loop look more familiar, and
      apparently also making gcc generate slightly smaller code.
      
      [1] https://lkml.org/lkml/2016/6/20/493
      [2] https://lkml.org/lkml/2017/6/19/345
      
      [akpm@linux-foundation.org: fix up reject fixupping]
      Link: http://lkml.kernel.org/r/20170621185529.2265-1-linux@rasmusvillemoes.dk
      
      Signed-off-by: default avatarRasmus Villemoes <linux@rasmusvillemoes.dk>
      Reported-by: default avatarHao Lee <haolee.swjtu@gmail.com>
      Acked-by: default avatarWei Yang <weiyang@gmail.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b002529d
    • Vlastimil Babka's avatar
      mm, page_alloc: fallback to smallest page when not stealing whole pageblock · 7a8f58f3
      Vlastimil Babka authored
      Since commit 3bc48f96 ("mm, page_alloc: split smallest stolen page
      in fallback") we pick the smallest (but sufficient) page of all that
      have been stolen from a pageblock of different migratetype.  However,
      there are cases when we decide not to steal the whole pageblock.
      
      Practically in the current implementation it means that we are trying to
      fallback for a MIGRATE_MOVABLE allocation of order X, go through the
      freelists from MAX_ORDER-1 down to X, and find free page of order Y.  If
      Y is less than pageblock_order / 2, we decide not to steal all pages
      from the pageblock.  When Y > X, it means we are potentially splitting a
      larger page than we need, as there might be other pages of order Z,
      where X <= Z < Y.  Since Y is already too small to steal whole
      pageblock, picking smallest available Z will result in the same decision
      and we avoid splitting a higher-order page in a MIGRATE_UNMOVABLE or
      MIGRATE_RECLAIMABLE pageblock.
      
      This patch therefore changes the fallback algorithm so that in the
      situation described above, we switch the fallback search strategy to go
      from order X upwards to find the smallest suitable fallback.  In theory
      there shouldn't be a downside of this change wrt fragmentation.
      
      This has been tested with mmtests' stress-highalloc performing
      GFP_KERNEL order-4 allocations, here is the relevant extfrag tracepoint
      statistics:
      
                                                              4.12.0-rc2      4.12.0-rc2
                                                               1-kernel4       2-kernel4
        Page alloc extfrag event                                  25640976    69680977
        Extfrag fragmenting                                       25621086    69661364
        Extfrag fragmenting for unmovable                            74409       73204
        Extfrag fragmenting unmovable placed with movable            69003       67684
        Extfrag fragmenting unmovable placed with reclaim.            5406        5520
        Extfrag fragmenting for reclaimable                           6398        8467
        Extfrag fragmenting reclaimable placed with movable            869         884
        Extfrag fragmenting reclaimable placed with unmov.            5529        7583
        Extfrag fragmenting for movable                           25540279    69579693
      
      Since we force movable allocations to steal the smallest available page
      (which we then practially always split), we steal less per fallback, so
      the number of fallbacks increases and steals potentially happen from
      different pageblocks.  This is however not an issue for movable pages
      that can be compacted.
      
      Importantly, the "unmovable placed with movable" statistics is lower,
      which is the result of less fragmentation in the unmovable pageblocks.
      The effect on reclaimable allocation is a bit unclear.
      
      Link: http://lkml.kernel.org/r/20170529093947.22618-1-vbabka@suse.cz
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7a8f58f3
  14. 06 Jul, 2017 7 commits
    • Michal Hocko's avatar
      mm, memory_hotplug: drop CONFIG_MOVABLE_NODE · f70029bb
      Michal Hocko authored
      Commit 20b2f52b ("numa: add CONFIG_MOVABLE_NODE for
      movable-dedicated node") has introduced CONFIG_MOVABLE_NODE without a
      good explanation on why it is actually useful.
      
      It makes a lot of sense to make movable node semantic opt in but we
      already have that because the feature has to be explicitly enabled on
      the kernel command line.  A config option on top only makes the
      configuration space larger without a good reason.  It also adds an
      additional ifdefery that pollutes the code.
      
      Just drop the config option and make it de-facto always enabled.  This
      shouldn't introduce any change to the semantic.
      
      Link: http://lkml.kernel.org/r/20170529114141.536-3-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarReza Arbab <arbab@linux.vnet.ibm.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Kani Toshimitsu <toshi.kani@hpe.com>
      Cc: Chen Yucong <slaoub@gmail.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f70029bb
    • Johannes Weiner's avatar
      mm: vmstat: move slab statistics from zone to node counters · 385386cf
      Johannes Weiner authored
      Patch series "mm: per-lruvec slab stats"
      
      Josef is working on a new approach to balancing slab caches and the page
      cache.  For this to work, he needs slab cache statistics on the lruvec
      level.  These patches implement that by adding infrastructure that
      allows updating and reading generic VM stat items per lruvec, then
      switches some existing VM accounting sites, including the slab
      accounting ones, to this new cgroup-aware API.
      
      I'll follow up with more patches on this, because there is actually
      substantial simplification that can be done to the memory controller
      when we replace private memcg accounting with making the existing VM
      accounting sites cgroup-aware.  But this is enough for Josef to base his
      slab reclaim work on, so here goes.
      
      This patch (of 5):
      
      To re-implement slab cache vs.  page cache balancing, we'll need the
      slab counters at the lruvec level, which, ever since lru reclaim was
      moved from the zone to the node, is the intersection of the node, not
      the zone, and the memcg.
      
      We could retain the per-zone counters for when the page allocator dumps
      its memory information on failures, and have counters on both levels -
      which on all but NUMA node 0 is usually redundant.  But let's keep it
      simple for now and just move them.  If anybody complains we can restore
      the per-zone counters.
      
      [hannes@cmpxchg.org: fix oops]
        Link: http://lkml.kernel.org/r/20170605183511.GA8915@cmpxchg.org
      Link: http://lkml.kernel.org/r/20170530181724.27197-3-hannes@cmpxchg.org
      
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      385386cf
    • Vlastimil Babka's avatar
      mm, page_alloc: pass preferred nid instead of zonelist to allocator · 04ec6264
      Vlastimil Babka authored
      The main allocator function __alloc_pages_nodemask() takes a zonelist
      pointer as one of its parameters.  All of its callers directly or
      indirectly obtain the zonelist via node_zonelist() using a preferred
      node id and gfp_mask.  We can make the code a bit simpler by doing the
      zonelist lookup in __alloc_pages_nodemask(), passing it a preferred node
      id instead (gfp_mask is already another parameter).
      
      There are some code size benefits thanks to removal of inlined
      node_zonelist():
      
        bloat-o-meter add/remove: 2/2 grow/shrink: 4/36 up/down: 399/-1351 (-952)
      
      This will also make things simpler if we proceed with converting cpusets
      to zonelists.
      
      Link: http://lkml.kernel.org/r/20170517081140.30654-4-vbabka@suse.cz
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarChristoph Lameter <cl@linux.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      04ec6264
    • Vlastimil Babka's avatar
      mm, page_alloc: fix more premature OOM due to race with cpuset update · 902b6281
      Vlastimil Babka authored
      I would like to stress that this patchset aims to fix issues and cleanup
      the code *within the existing documented semantics*, i.e.  patch 1
      ignores mempolicy restrictions if the set of allowed nodes has no
      intersection with set of nodes allowed by cpuset.  I believe discussing
      potential changes of the semantics can be better done once we have a
      baseline with no known bugs of the current semantics.
      
      I've recently summarized the cpuset/mempolicy issues in a LSF/MM
      proposal [1] and the discussion itself [2].  I've been trying to rewrite
      the handling as proposed, with the idea that changing semantics to make
      all mempolicies static wrt cpuset updates (and discarding the relative
      and default modes) can be tried on top, as there's a high risk of being
      rejected/reverted because somebody might still care about the removed
      modes.
      
      However I haven't yet figured out how to properly:
      
      1) make mempolicies swappable instead of rebinding in place. I thought
         mbind() already works that way and uses refcounting to avoid
         use-after-free of the old policy by a parallel allocation, but turns
         out true refcounting is only done for shared (shmem) mempolicies, and
         the actual protection for mbind() comes from mmap_sem. Extending the
         refcounting means more overhead in allocator hot path. Also swapping
         whole mempolicies means that we have to allocate the new ones, which
         can fail, and reverting of the partially done work also means
         allocating (note that mbind() doesn't care and will just leave part
         of the range updated and part not updated when returning -ENOMEM...).
      
      2) make cpuset's task->mems_allowed also swappable (after converting it
         from nodemask to zonelist, which is the easy part) for mostly the
         same reasons.
      
      The good news is that while trying to do the above, I've at least
      figured out how to hopefully close the remaining premature OOM's, and do
      a buch of cleanups on top, removing quite some of the code that was also
      supposed to prevent the cpuset update races, but doesn't work anymore
      nowadays.  This should fix the most pressing concerns with this topic
      and give us a better baseline before either proceeding with the original
      proposal, or pushing a change of semantics that removes the problem 1)
      above.  I'd be then fine with trying to change the semantic first and
      rewrite later.
      
      Patchset has been tested with the LTP cpuset01 stress test.
      
      [1] https://lkml.kernel.org/r/4c44a589-5fd8-08d0-892c-e893bb525b71@suse.cz
      [2] https://lwn.net/Articles/717797/
      [3] https://marc.info/?l=linux-mm&m=149191957922828&w=2
      
      This patch (of 6):
      
      Commit e47483bc ("mm, page_alloc: fix premature OOM when racing with
      cpuset mems update") has fixed known recent regressions found by LTP's
      cpuset01 testcase.  I have however found that by modifying the testcase
      to use per-vma mempolicies via bind(2) instead of per-task mempolicies
      via set_mempolicy(2), the premature OOM still happens and the issue is
      much older.
      
      The root of the problem is that the cpuset's mems_allowed and
      mempolicy's nodemask can temporarily have no intersection, thus
      get_page_from_freelist() cannot find any usable zone.  The current
      semantic for empty intersection is to ignore mempolicy's nodemask and
      honour cpuset restrictions.  This is checked in node_zonelist(), but the
      racy update can happen after we already passed the check.  Such races
      should be protected by the seqlock task->mems_allowed_seq, but it
      doesn't work here, because 1) mpol_rebind_mm() does not happen under
      seqlock for write, and doing so would lead to deadlock, as it takes
      mmap_sem for write, while the allocation can have mmap_sem for read when
      it's taking the seqlock for read.  And 2) the seqlock cookie of callers
      of node_zonelist() (alloc_pages_vma() and alloc_pages_current()) is
      different than the one of __alloc_pages_slowpath(), so there's still a
      potential race window.
      
      This patch fixes the issue by having __alloc_pages_slowpath() check for
      empty intersection of cpuset and ac->nodemask before OOM or allocation
      failure.  If it's indeed empty, the nodemask is ignored and allocation
      retried, which mimics node_zonelist().  This works fine, because almost
      all callers of __alloc_pages_nodemask are obtaining the nodemask via
      node_zonelist().  The only exception is new_node_page() from hotplug,
      where the potential violation of nodemask isn't an issue, as there's
      already a fallback allocation attempt without any nodemask.  If there's
      a future caller that needs to have its specific nodemask honoured over
      task's cpuset restrictions, we'll have to e.g.  add a gfp flag for that.
      
      Link: http://lkml.kernel.org/r/20170517081140.30654-2-vbabka@suse.cz
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      902b6281
    • Matthias Kaehlcke's avatar
      mm/page_alloc.c: mark bad_range() and meminit_pfn_in_nid() as __maybe_unused · d73d3c9f
      Matthias Kaehlcke authored
      The functions are not used in some configurations.  Adding the attribute
      fixes the following warnings when building with clang:
      
        mm/page_alloc.c:409:19: error: function 'bad_range' is not needed and
            will not be emitted [-Werror,-Wunneeded-internal-declaration]
      
        mm/page_alloc.c:1106:30: error: unused function 'meminit_pfn_in_nid'
            [-Werror,-Wunused-function]
      
      Link: http://lkml.kernel.org/r/20170518182030.165633-1-mka@chromium.org
      
      Signed-off-by: default avatarMatthias Kaehlcke <mka@chromium.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d73d3c9f
    • Pavel Tatashin's avatar
      mm: adaptive hash table scaling · 9017217b
      Pavel Tatashin authored
      Allow hash tables to scale with memory but at slower pace, when
      HASH_ADAPT is provided every time memory quadruples the sizes of hash
      tables will only double instead of quadrupling as well.  This algorithm
      starts working only when memory size reaches a certain point, currently
      set to 64G.
      
      This is example of dentry hash table size, before and after four various
      memory configurations:
      
      MEMORY	   SCALE	 HASH_SIZE
      	old	new	old	new
          8G	 13	 13      8M      8M
         16G	 13	 13     16M     16M
         32G	 13	 13     32M     32M
         64G	 13	 13     64M     64M
        128G	 13	 14    128M     64M
        256G	 13	 14    256M    128M
        512G	 13	 15    512M    128M
       1024G	 13	 15   1024M    256M
       2048G	 13	 16   2048M    256M
       4096G	 13	 16   4096M    512M
       8192G	 13	 17   8192M    512M
      16384G	 13	 17  16384M   1024M
      32768G	 13	 18  32768M   1024M
      65536G	 13	 18  65536M   2048M
      
      The effect of this change on runtime is undetectable as filesystem
      growth is not proportional to machine memory size as is currently
      assumed.  The change effects only large memory machine.  Additional
      tuning might be needed, but that can be done by the clients of the
      kmem_cache_create interface, not the generic cache allocator itself.
      
      The adaptive hashing is disabled on 32 bit systems to avoid confusion of
      whether base should be different for smaller systems, and to avoid
      overflows.
      
      [mhocko@suse.com: drop HASH_ADAPT]
        Link: http://lkml.kernel.org/r/20170509094607.GG6481@dhcp22.suse.cz
      [pasha.tatashin@oracle.com: UL -> ULL fix]
        Link: http://lkml.kernel.org/r/1495300013-653283-2-git-send-email-pasha.tatashin@oracle.com
      [pasha.tatashin@oracle.com: disable adaptive hash on 32 bit systems]
        Link: http://lkml.kernel.org/r/1495469329-755807-2-git-send-email-pasha.tatashin@oracle.com
      Link: http://lkml.kernel.org/r/1488432825-92126-5-git-send-email-pasha.tatashin@oracle.com
      
      Signed-off-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Babu Moger <babu.moger@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9017217b
    • Pavel Tatashin's avatar
      mm: zero hash tables in allocator · 3749a8f0
      Pavel Tatashin authored
      Add a new flag HASH_ZERO which when provided grantees that the hash
      table that is returned by alloc_large_system_hash() is zeroed.  In most
      cases that is what is needed by the caller.  Use page level allocator's
      __GFP_ZERO flags to zero the memory.  It is using memset() which is
      efficient method to zero memory and is optimized for most platforms.
      
      Link: http://lkml.kernel.org/r/1488432825-92126-3-git-send-email-pasha.tatashin@oracle.com
      
      Signed-off-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: default avatarBabu Moger <babu.moger@oracle.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3749a8f0