1. 26 May, 2018 1 commit
    • mm, memory_hotplug: make has_unmovable_pages more robust · 15c30bc0
      Michal Hocko authored
      Oscar has reported:
      : Due to an unfortunate setting with movablecore, memblocks containing bootmem
      : memory (pages marked by get_page_bootmem()) ended up marked in zone_movable.
      : So while trying to remove that memory, the system failed in do_migrate_range
      : and __offline_pages never returned.
      :
      : This can be reproduced by running
      : qemu-system-x86_64 -m 6G,slots=8,maxmem=8G -numa node,mem=4096M -numa node,mem=2048M
      : and movablecore=4G kernel command line
      :
      : linux kernel: BIOS-provided physical RAM map:
      : linux kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
      : linux kernel: BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
      : linux kernel: BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
      : linux kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
      : linux kernel: BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
      : linux kernel: BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
      : linux kernel: BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
      : linux kernel: BIOS-e820: [mem 0x0000000100000000-0x00000001bfffffff] usable
      : linux kernel: NX (Execute Disable) protection: active
      : linux kernel: SMBIOS 2.8 present.
      : linux kernel: DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org
      : linux kernel: Hypervisor detected: KVM
      : linux kernel: e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
      : linux kernel: e820: remove [mem 0x000a0000-0x000fffff] usable
      : linux kernel: last_pfn = 0x1c0000 max_arch_pfn = 0x400000000
      :
      : linux kernel: SRAT: PXM 0 -> APIC 0x00 -> Node 0
      : linux kernel: SRAT: PXM 1 -> APIC 0x01 -> Node 1
      : linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
      : linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
      : linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x13fffffff]
      : linux kernel: ACPI: SRAT: Node 1 PXM 1 [mem 0x140000000-0x1bfffffff]
      : linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x1c0000000-0x43fffffff] hotplug
      : linux kernel: NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0xbfffffff] -> [mem 0x0
      : linux kernel: NUMA: Node 0 [mem 0x00000000-0xbfffffff] + [mem 0x100000000-0x13fffffff] -> [mem 0
      : linux kernel: NODE_DATA(0) allocated [mem 0x13ffd6000-0x13fffffff]
      : linux kernel: NODE_DATA(1) allocated [mem 0x1bffd3000-0x1bfffcfff]
      :
      : zoneinfo shows that the zone movable is placed into both numa nodes:
      : Node 0, zone  Movable
      :   pages free     160140
      :         min      1823
      :         low      2278
      :         high     2733
      :         spanned  262144
      :         present  262144
      :         managed  245670
      : Node 1, zone  Movable
      :   pages free     448427
      :         min      3827
      :         low      4783
      :         high     5739
      :         spanned  524288
      :         present  524288
      :         managed  515766
      
      Note how only Node 0 has a hotpluggable memory region, which would rule
      it out from the early memblock allocations (most likely memmap).  Node 1
      will surely contain memmaps on the same node and those would prevent
      offlining from succeeding.  So this is arguably a configuration issue.
      One could argue that we should be more clever and rule out early
      allocations from the movable zone.  That would be correct but probably
      not worth the effort considering what a hack movablecore is.
      
      Anyway, we could do better for those cases.  We rely on
      start_isolate_page_range and has_unmovable_pages to do their jobs.
      The former isolates the whole range to be offlined so that we do not
      allocate from it anymore and the latter makes sure we are not stumbling
      over non-migratable pages.
      
      has_unmovable_pages is overly optimistic, however.  It doesn't check all
      the pages if we are within zone_movable, because we rely on those pages
      always being migratable.  As it turns out, we are still not perfect
      there.  While bootmem pages in zone_movable sound like a clear bug which
      should be fixed, let's remove the optimization for now and warn if we
      encounter unmovable pages in zone_movable in the meantime.  That should
      help for now at least.
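
      For illustration, a condensed sketch of that shape (not the verbatim
      upstream hunk; page_is_unmovable() is a hypothetical stand-in for the
      real per-page checks):

        /*
         * Sketch: the "trust ZONE_MOVABLE" shortcut is gone.  Every page
         * in the block is inspected, and finding an unmovable page inside
         * ZONE_MOVABLE now triggers a one-time warning.
         */
        static bool has_unmovable_pages_sketch(struct zone *zone, struct page *page)
        {
                unsigned long pfn = page_to_pfn(page), iter;

                for (iter = 0; iter < pageblock_nr_pages; iter++) {
                        struct page *check = pfn_to_page(pfn + iter);

                        /* hypothetical helper: reserved, pinned, non-LRU, ... */
                        if (page_is_unmovable(check)) {
                                /* unmovable pages have no business here */
                                WARN_ON_ONCE(zone_idx(zone) == ZONE_MOVABLE);
                                return true;
                        }
                }
                return false;
        }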
      
      Btw.  this wasn't a real problem until commit 72b39cfc ("mm,
      memory_hotplug: do not fail offlining too early") because we used to
      have a small number of retries and then failed.  This turned out to be
      too fragile though.
      
      Link: http://lkml.kernel.org/r/20180523125555.30039-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Oscar Salvador <osalvador@techadventures.net>
      Tested-by: Oscar Salvador <osalvador@techadventures.net>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      15c30bc0
  2. 24 May, 2018 1 commit
    • Revert "mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE" · d883c6cf
      Joonsoo Kim authored
      This reverts the following commits that changed the CMA design in MM.
      
       3d2054ad ("ARM: CMA: avoid double mapping to the CMA area if CONFIG_HIGHMEM=y")

       1d47a3ec ("mm/cma: remove ALLOC_CMA")

       bad8c6c0 ("mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE")
      
      Ville reported the following error on i386.
      
        Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
        microcode: microcode updated early to revision 0x4, date = 2013-06-28
        Initializing CPU#0
        Initializing HighMem for node 0 (000377fe:00118000)
        Initializing Movable for node 0 (00000001:00118000)
        BUG: Bad page state in process swapper  pfn:377fe
        page:f53effc0 count:0 mapcount:-127 mapping:00000000 index:0x0
        flags: 0x80000000()
        raw: 80000000 00000000 00000000 ffffff80 00000000 00000100 00000200 00000001
        page dumped because: nonzero mapcount
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper Not tainted 4.17.0-rc5-elk+ #145
        Hardware name: Dell Inc. Latitude E5410/03VXMC, BIOS A15 07/11/2013
        Call Trace:
         dump_stack+0x60/0x96
         bad_page+0x9a/0x100
         free_pages_check_bad+0x3f/0x60
         free_pcppages_bulk+0x29d/0x5b0
         free_unref_page_commit+0x84/0xb0
         free_unref_page+0x3e/0x70
         __free_pages+0x1d/0x20
         free_highmem_page+0x19/0x40
         add_highpages_with_active_regions+0xab/0xeb
         set_highmem_pages_init+0x66/0x73
         mem_init+0x1b/0x1d7
         start_kernel+0x17a/0x363
         i386_start_kernel+0x95/0x99
         startup_32_smp+0x164/0x168
      
      The reason for this error is that the span of ZONE_MOVABLE is extended
      to the whole node span for future CMA initialization, and normal memory
      is wrongly freed here.  I submitted a fix and it seemed to work, but
      another problem happened.

      It is too late in the cycle to fix that latter problem, so I decided to
      revert the series.
      
      Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Acked-by: Laura Abbott <labbott@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d883c6cf
  3. 11 Apr, 2018 5 commits
    • xen, mm: allow deferred page initialization for xen pv domains · 6f84f8d1
      Pavel Tatashin authored
      Juergen Gross noticed that commit f7f99100 ("mm: stop zeroing memory
      during allocation in vmemmap") broke XEN PV domains when deferred struct
      page initialization is enabled.
      
      This is because Xen's PagePinned flag is getting erased from
      struct pages when they are initialized later in boot.
      
      Juergen fixed this problem by disabling deferred pages on xen pv
      domains.  It is desirable, however, to have this feature available as it
      reduces boot time.  This fix re-enables the feature for pv domains, and
      fixes the problem the following way:
      
      The fix is to delay setting PagePinned flag until struct pages for all
      allocated memory are initialized, i.e.  until after free_all_bootmem().
      
      A new x86_init.hyper op init_after_bootmem() is called to let Xen know
      that the boot allocator is done, and hence struct pages for all the
      allocated memory are now initialized.  If deferred page initialization
      is enabled, the rest of the struct pages are going to be initialized
      later in boot once page_alloc_init_late() is called.

      xen_after_bootmem() walks the page table's pages and marks them pinned.
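
      A rough sketch of that plumbing, as described above (the wiring is
      condensed and approximate, not the exact upstream diff):

        /* new hook in the x86 hypervisor init ops, a no-op by default */
        struct x86_hyper_init {
                ...
                void (*init_after_bootmem)(void);
        };

        /* called once the boot allocator is done, so struct pages for all
         * allocated memory are initialized: */
        x86_init.hyper.init_after_bootmem();

        /* xen pv installs a routine that walks the page table's pages and
         * (re)marks them pinned: */
        static void __init xen_after_bootmem(void)
        {
                /* ... SetPagePinned() on every page table page ... */
        }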
      
      Link: http://lkml.kernel.org/r/20180226160112.24724-2-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Tested-by: Juergen Gross <jgross@suse.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Alok Kataria <akataria@vmware.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Mathias Krause <minipli@googlemail.com>
      Cc: Jinbum Park <jinb.park7@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Jia Zhang <zhang.jia@linux.alibaba.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6f84f8d1
    • mm/cma: remove ALLOC_CMA · 1d47a3ec
      Joonsoo Kim authored
      Now all reserved pages for the CMA region belong to ZONE_MOVABLE and
      they only serve requests with GFP_HIGHMEM && GFP_MOVABLE.

      Therefore, we don't need to maintain ALLOC_CMA at all.
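
      Concretely, the watermark special case sketched below can go away (an
      approximation of the removed hunk in the watermark check, for
      illustration):

        #ifdef CONFIG_CMA
                /* If allocation can't use CMA areas don't use free CMA pages */
                if (!(alloc_flags & ALLOC_CMA))
                        free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
        #endif

      With CMA pages only reachable through ZONE_MOVABLE, no caller needs to
      pass ALLOC_CMA anymore, so the flag itself can be deleted.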
      
      Link: http://lkml.kernel.org/r/1512114786-5085-3-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Tested-by: Tony Lindgren <tony@atomide.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1d47a3ec
    • mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE · bad8c6c0
      Joonsoo Kim authored
      Patch series "mm/cma: manage the memory of the CMA area by using the
      ZONE_MOVABLE", v2.
      
      0. History
      
      This patchset is the follow-up to the discussion about "Introduce
      ZONE_CMA (v7)" [1].  Please refer to it if more information is needed.
      
      1. What does this patch do?
      
      This patch changes how the memory of the CMA area is managed in the MM
      subsystem.  Currently the memory of the CMA area is managed by the zone
      its pfn belongs to.  However, this approach has some problems since the
      MM subsystem doesn't have enough logic to handle the situation where
      memories with different characteristics sit in a single zone.  To solve
      this issue, this patch tries to manage all the memory of the CMA area
      by using the MOVABLE zone.  From the MM subsystem's point of view, the
      characteristics of the memory in the MOVABLE zone and of the memory of
      the CMA area are the same.  So, managing the memory of the CMA area by
      using the MOVABLE zone will not cause any problem.
      
      2. Motivation
      
      There are some problems with the current approach; see the following.
      Although these problems are not inherent and could be fixed without
      this conceptual change, doing so would require adding many hooks in
      various code paths, which would be intrusive to core MM and really
      error-prone.  Therefore, I try to solve them with this new approach.
      Anyway, the following are the problems of the current implementation.
      
      o CMA memory utilization
      
      First, the following is the freepage calculation logic in MM.
      
       - For movable allocation: freepage = total freepage
       - For unmovable allocation: freepage = total freepage - CMA freepage
      
      Freepages in the CMA area are used only after the normal freepages in
      the zone the CMA memory belongs to are exhausted.  At that moment the
      number of normal freepages is zero, so

       - For movable allocation: freepage = total freepage = CMA freepage
       - For unmovable allocation: freepage = 0

      If an unmovable allocation comes in at this moment, the request fails
      the watermark check and reclaim is started.  After reclaim, normal
      freepages exist again, so the freepages in the CMA area still go
      unused.
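
      A self-contained toy model of this starvation (made-up numbers,
      simplified logic) shows the unmovable request failing while plenty of
      CMA memory sits free:

        #include <stdbool.h>
        #include <stdio.h>

        /* simplified zone counters: all remaining free pages are CMA pages */
        struct zone_counters {
                long free_pages;        /* total free pages, CMA included */
                long free_cma;          /* free pages inside the CMA area */
        };

        static bool watermark_ok(const struct zone_counters *z,
                                 long watermark, bool may_use_cma)
        {
                long usable = z->free_pages;

                if (!may_use_cma)       /* unmovable allocations skip CMA */
                        usable -= z->free_cma;
                return usable > watermark;
        }

        int main(void)
        {
                struct zone_counters z = { .free_pages = 4096, .free_cma = 4096 };

                printf("movable:   %d\n", watermark_ok(&z, 128, true));  /* 1 */
                printf("unmovable: %d\n", watermark_ok(&z, 128, false)); /* 0 -> reclaim */
                return 0;
        }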
      
      FYI, there is another attempt [2] trying to solve this problem on lkml.
      And, as far as I know, Qualcomm also has an out-of-tree solution for
      this problem.
      
      o Useless reclaim

      There is no logic to distinguish CMA pages in the reclaim path.  Hence,
      a CMA page may be reclaimed even if the system just needs a page that
      is usable for kernel allocations.
      
      o Atomic allocation failure

      This is also related to the fallback allocation policy for the memory
      of the CMA area.  Consider the situation where the number of normal
      freepages is *zero* because a bunch of movable allocation requests have
      come in.  Kswapd would not be woken up due to the following freepage
      calculation logic.
      
      - For movable allocation: freepage = total freepage = CMA freepage
      
      If an atomic unmovable allocation request comes in at this moment, it
      fails due to the following logic.
      
      - For unmovable allocation: freepage = total freepage - CMA freepage = 0
      
      It was reported by Aneesh [3].
      
      o Useless compaction

      The usual high-order allocation request is an unmovable one, and it
      cannot be served from the memory of the CMA area.  In compaction, the
      migration scanner nevertheless tries to migrate pages in the CMA area
      and build high-order pages there.  As mentioned above, those cannot be
      used for unmovable allocation requests, so it is just wasted work.
      
      3. Current approach and new approach
      
      The current approach is that the memory of the CMA area is managed by
      the zone its pfn belongs to.  However, this memory should be
      distinguishable since it has a strong limitation.  So, it is marked as
      MIGRATE_CMA in the pageblock flags and handled specially.  However, as
      mentioned in section 2, the MM subsystem doesn't have enough logic to
      deal with this special pageblock, so many problems have arisen.

      The new approach is that the memory of the CMA area is managed by the
      MOVABLE zone.  MM already has enough logic to deal with special zones
      like the HIGHMEM and MOVABLE zones.  So, managing the memory of the CMA
      area by the MOVABLE zone just naturally works well, because the
      constraint on the memory of the CMA area, that the memory should always
      be migratable, is the same as the constraint on the MOVABLE zone.
      
      There is one side-effect for the usability of the memory of the CMA
      area.  The use of the MOVABLE zone is only allowed for a request with
      GFP_HIGHMEM && GFP_MOVABLE, so now the memory of the CMA area is also
      only allowed for this gfp flag.  Before this patchset, a request with
      just GFP_MOVABLE could use it.  IMO, it would not be a big issue since
      most GFP_MOVABLE requests also carry the GFP_HIGHMEM flag, for example,
      file cache pages and anonymous pages.  However, file cache pages for
      blockdev files are an exception: requests for them have no GFP_HIGHMEM
      flag.  There are pros and cons to this exception.  In my experience,
      blockdev file cache pages are one of the top reasons that cause
      cma_alloc() to fail temporarily.  So, we can get more guarantee of
      cma_alloc() success by discarding this case.
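
      The gating is easiest to see in a simplified model of gfp_zone() (an
      illustration, not the kernel's real zone table):

        /* simplified: only __GFP_HIGHMEM and __GFP_MOVABLE together reach
         * ZONE_MOVABLE, and hence the CMA-backed memory */
        static inline enum zone_type gfp_zone_simplified(gfp_t flags)
        {
                if ((flags & __GFP_MOVABLE) && (flags & __GFP_HIGHMEM))
                        return ZONE_MOVABLE;    /* anon, regular file cache */
                if (flags & __GFP_HIGHMEM)
                        return ZONE_HIGHMEM;
                return ZONE_NORMAL;             /* blockdev cache, kernel */
        }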
      
      Note that there is no change from the admin's POV, since this patchset
      is just an internal implementation change in the MM subsystem.  The one
      minor difference for admins is that the memory stats for the CMA area
      will be printed under the MOVABLE zone.  That's all.
      
      4. Result
      
      Following is the experimental result related to utilization problem.
      
      8 CPUs, 1024 MB, VIRTUAL MACHINE
      make -j16
      
      <Before>
        CMA area:               0 MB            512 MB
        Elapsed-time:           92.4		186.5
        pswpin:                 82		18647
        pswpout:                160		69839
      
      <After>
        CMA        :            0 MB            512 MB
        Elapsed-time:           93.1		93.4
        pswpin:                 84		46
        pswpout:                183		92
      
      akpm: "kernel test robot" reported a 26% improvement in
      vm-scalability.throughput:
      http://lkml.kernel.org/r/20180330012721.GA3845@yexl-desktop
      
      [1]: lkml.kernel.org/r/1491880640-9944-1-git-send-email-iamjoonsoo.kim@lge.com
      [2]: https://lkml.org/lkml/2014/10/15/623
      [3]: http://www.spinics.net/lists/linux-mm/msg100562.html
      
      Link: http://lkml.kernel.org/r/1512114786-5085-2-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Tested-by: Tony Lindgren <tony@atomide.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bad8c6c0
    • mm/page_alloc: don't reserve ZONE_HIGHMEM for ZONE_MOVABLE request · d3cda233
      Joonsoo Kim authored
      Freepages in ZONE_HIGHMEM don't work for kernel memory, so reserving
      them isn't that important.  When ZONE_MOVABLE is used, this reserve
      would theoretically decrease the usable memory for GFP_HIGHUSER_MOVABLE
      allocation requests, which are mainly used for page cache and anon page
      allocation.  So, fix it by setting
      sysctl_lowmem_reserve_ratio[ZONE_HIGHMEM] to 0.
      
      Also, defining the sysctl_lowmem_reserve_ratio array with
      MAX_NR_ZONES - 1 entries makes the code complex.  For example, on a
      highmem system the following reserve ratio entry applies to the
      *NORMAL* zone, which could easily mislead people.
      
       #ifdef CONFIG_HIGHMEM
       32
       #endif
      
      This patch also fixes this situation by defining the
      sysctl_lowmem_reserve_ratio array with MAX_NR_ZONES entries and placing
      the "#ifdef" in the right place.
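
      The intended shape of the table after the change, sketched with
      designated initializers (the values are the traditional defaults; treat
      this as an approximation, not the exact hunk):

        int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES] = {
        #ifdef CONFIG_ZONE_DMA
                [ZONE_DMA]      = 256,
        #endif
        #ifdef CONFIG_ZONE_DMA32
                [ZONE_DMA32]    = 256,
        #endif
                [ZONE_NORMAL]   = 32,
        #ifdef CONFIG_HIGHMEM
                [ZONE_HIGHMEM]  = 0,    /* useless for kernel allocations */
        #endif
                [ZONE_MOVABLE]  = 0,
        };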
      
      Link: http://lkml.kernel.org/r/1504672525-17915-1-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Tony Lindgren <tony@atomide.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d3cda233
    • mm: treat indirectly reclaimable memory as available in MemAvailable · 034ebf65
      Roman Gushchin authored
      Adjust the /proc/meminfo MemAvailable calculation by adding the amount
      of indirectly reclaimable memory (rounded to PAGE_SIZE).
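
      The adjustment is essentially a one-liner in si_mem_available(); a
      sketch from the description (the counter is maintained in bytes, hence
      the shift down to pages):

        /* part of the kernel memory, which can be released under memory
         * pressure */
        available += global_node_page_state(NR_INDIRECTLY_RECLAIMABLE_BYTES) >>
                        PAGE_SHIFT;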
      
      Link: http://lkml.kernel.org/r/20180305133743.12746-4-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      034ebf65
  4. 06 Apr, 2018 11 commits
  5. 23 Mar, 2018 2 commits
    • Revert "mm: page_alloc: skip over regions of invalid pfns where possible" · f59f1caf
      Daniel Vacek authored
      This reverts commit b92df1de ("mm: page_alloc: skip over regions of
      invalid pfns where possible").  The commit was meant as a boot-time
      init speedup, skipping the loop in memmap_init_zone() for invalid pfns.
      
      But given some specific memory mappings on x86_64 (or more generally,
      theoretically anywhere but on arm with CONFIG_HAVE_ARCH_PFN_VALID), the
      implementation also skips valid pfns, which is plain wrong and causes
      'kernel BUG at mm/page_alloc.c:1389!'
      
        crash> log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1
        kernel BUG at mm/page_alloc.c:1389!
        invalid opcode: 0000 [#1] SMP
        --
        RIP: 0010: move_freepages+0x15e/0x160
        --
        Call Trace:
          move_freepages_block+0x73/0x80
          __rmqueue+0x263/0x460
          get_page_from_freelist+0x7e1/0x9e0
          __alloc_pages_nodemask+0x176/0x420
        --
      
        crash> page_init_bug -v | grep RAM
        <struct resource 0xffff88067fffd2f8>          1000 -        9bfff       System RAM (620.00 KiB)
        <struct resource 0xffff88067fffd3a0>        100000 -     430bffff       System RAM (  1.05 GiB = 1071.75 MiB = 1097472.00 KiB)
        <struct resource 0xffff88067fffd410>      4b0c8000 -     4bf9cfff       System RAM ( 14.83 MiB = 15188.00 KiB)
        <struct resource 0xffff88067fffd480>      4bfac000 -     646b1fff       System RAM (391.02 MiB = 400408.00 KiB)
        <struct resource 0xffff88067fffd560>      7b788000 -     7b7fffff       System RAM (480.00 KiB)
        <struct resource 0xffff88067fffd640>     100000000 -    67fffffff       System RAM ( 22.00 GiB)
      
        crash> page_init_bug | head -6
        <struct resource 0xffff88067fffd560>      7b788000 -     7b7fffff       System RAM (480.00 KiB)
        <struct page 0xffffea0001ede200>   1fffff00000000  0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32          4096    1048575
        <struct page 0xffffea0001ede200>       505736 505344 <struct page 0xffffea0001ed8000> 505855 <struct page 0xffffea0001edffc0>
        <struct page 0xffffea0001ed8000>                0  0 <struct pglist_data 0xffff88047ffd9000> 0 <struct zone 0xffff88047ffd9000> DMA               1       4095
        <struct page 0xffffea0001edffc0>   1fffff00000400  0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32          4096    1048575
        BUG, zones differ!
      
        crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000
              PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
        ffffea0001e00000  78000000                0        0  0 0
        ffffea0001ed7fc0  7b5ff000                0        0  0 0
        ffffea0001ed8000  7b600000                0        0  0 0       <<<<
        ffffea0001ede1c0  7b787000                0        0  0 0
        ffffea0001ede200  7b788000                0        0  1 1fffff00000000
      
      Link: http://lkml.kernel.org/r/20180316143855.29838-1-neelx@redhat.com
      Fixes: b92df1de ("mm: page_alloc: skip over regions of invalid pfns where possible")
      Signed-off-by: Daniel Vacek <neelx@redhat.com>
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f59f1caf
    • lockdep: fix fs_reclaim warning · 2e517d68
      Tetsuo Handa authored
      Dave Jones reported fs_reclaim lockdep warnings.
      
        ============================================
        WARNING: possible recursive locking detected
        4.15.0-rc9-backup-debug+ #1 Not tainted
        --------------------------------------------
        sshd/24800 is trying to acquire lock:
         (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
      
        but task is already holding lock:
         (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(fs_reclaim);
          lock(fs_reclaim);
      
         *** DEADLOCK ***
      
         May be due to missing lock nesting notation
      
        2 locks held by sshd/24800:
         #0:  (sk_lock-AF_INET6){+.+.}, at: [<000000001a069652>] tcp_sendmsg+0x19/0x40
         #1:  (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
      
        stack backtrace:
        CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
        Call Trace:
         dump_stack+0xbc/0x13f
         __lock_acquire+0xa09/0x2040
         lock_acquire+0x12e/0x350
         fs_reclaim_acquire.part.102+0x29/0x30
         kmem_cache_alloc+0x3d/0x2c0
         alloc_extent_state+0xa7/0x410
         __clear_extent_bit+0x3ea/0x570
         try_release_extent_mapping+0x21a/0x260
         __btrfs_releasepage+0xb0/0x1c0
         btrfs_releasepage+0x161/0x170
         try_to_release_page+0x162/0x1c0
         shrink_page_list+0x1d5a/0x2fb0
         shrink_inactive_list+0x451/0x940
         shrink_node_memcg.constprop.88+0x4c9/0x5e0
         shrink_node+0x12d/0x260
         try_to_free_pages+0x418/0xaf0
         __alloc_pages_slowpath+0x976/0x1790
         __alloc_pages_nodemask+0x52c/0x5c0
         new_slab+0x374/0x3f0
         ___slab_alloc.constprop.81+0x47e/0x5a0
         __slab_alloc.constprop.80+0x32/0x60
         __kmalloc_track_caller+0x267/0x310
         __kmalloc_reserve.isra.40+0x29/0x80
         __alloc_skb+0xee/0x390
         sk_stream_alloc_skb+0xb8/0x340
         tcp_sendmsg_locked+0x8e6/0x1d30
         tcp_sendmsg+0x27/0x40
         inet_sendmsg+0xd0/0x310
         sock_write_iter+0x17a/0x240
         __vfs_write+0x2ab/0x380
         vfs_write+0xfb/0x260
         SyS_write+0xb6/0x140
         do_syscall_64+0x1e5/0xc05
         entry_SYSCALL64_slow_path+0x25/0x25
      
      This warning is caused by commit d92a8cfc ("locking/lockdep:
      Rework FS_RECLAIM annotation"), which replaced the use of
      lockdep_{set,clear}_current_reclaim_state() in __perform_reclaim()
      and lockdep_trace_alloc() in slab_pre_alloc_hook() with
      fs_reclaim_acquire()/fs_reclaim_release().
      
      Since __kmalloc_reserve() from __alloc_skb() adds __GFP_NOMEMALLOC |
      __GFP_NOWARN to the gfp_mask, and all reclaim paths simply propagate
      __GFP_NOMEMALLOC, fs_reclaim_acquire() in slab_pre_alloc_hook() tries
      to grab the 'fake' lock again when __perform_reclaim() has already
      grabbed the 'fake' lock.
      
      The
      
        /* this guy won't enter reclaim */
        if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
                return false;
      
      test which causes slab_pre_alloc_hook() to try to grab the 'fake' lock
      was added by commit cf40bd16 ("lockdep: annotate reclaim context
      (__GFP_NOFS)").  But that test is outdated, because a PF_MEMALLOC
      thread won't enter reclaim regardless of __GFP_NOMEMALLOC after commit
      341ce06f ("page allocator: calculate the alloc_flags for allocation
      only once") added the PF_MEMALLOC safeguard (
      
        /* Avoid recursion of direct reclaim */
        if (p->flags & PF_MEMALLOC)
                goto nopage;
      
      in __alloc_pages_slowpath()).
      
      Thus, let's fix the outdated test by removing the __GFP_NOMEMALLOC part
      and allow __need_fs_reclaim() to return false.
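
      The fixed check then becomes (condensed; the surrounding function is
      unchanged):

        /* this guy won't enter reclaim */
        if (current->flags & PF_MEMALLOC)  /* __GFP_NOMEMALLOC test dropped */
                return false;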
      
      Link: http://lkml.kernel.org/r/201802280650.FJC73911.FOSOMLJVFFQtHO@I-love.SAKURA.ne.jp
      Fixes: d92a8cfc ("locking/lockdep: Rework FS_RECLAIM annotation")
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Tested-by: Dave Jones <davej@codemonkey.org.uk>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.14+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2e517d68
  6. 16 Mar, 2018 1 commit
  7. 14 Mar, 2018 1 commit
    • Revert "mm/page_alloc: fix memmap_init_zone pageblock alignment" · 3e04040d
      Ard Biesheuvel authored
      This reverts commit 864b75f9.
      
      Commit 864b75f9 ("mm/page_alloc: fix memmap_init_zone pageblock
      alignment") modified the logic in memmap_init_zone() to initialize
      struct pages associated with invalid PFNs, to appease a VM_BUG_ON()
      in move_freepages(), which is redundant by its own admission, and
      dereferences struct page fields to obtain the zone without checking
      whether the struct pages in question are valid to begin with.
      
      Commit 864b75f9 only makes it worse, since the rounding it does
      may cause pfn to assume the same value it had in a prior iteration of
      the loop, resulting in an infinite loop and a hang very early in the
      boot.  Also, since it doesn't perform the same rounding on start_pfn
      itself but only on intermediate values following an invalid PFN, we
      may still hit the same VM_BUG_ON() as before.
      
      So instead, let's fix this at the core, and ensure that the BUG
      check doesn't dereference struct page fields of invalid pages.
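
      A sketch of the hardened assertion in move_freepages(), per the
      description (approximate form):

        /* only compare zones when both struct pages are valid */
        VM_BUG_ON(pfn_valid(page_to_pfn(start_page)) &&
                  pfn_valid(page_to_pfn(end_page)) &&
                  page_zone(start_page) != page_zone(end_page));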
      
      Fixes: 864b75f9 ("mm/page_alloc: fix memmap_init_zone pageblock alignment")
      Tested-by: Jan Glauber <jglauber@cavium.com>
      Tested-by: Shanker Donthineni <shankerd@codeaurora.org>
      Cc: Daniel Vacek <neelx@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3e04040d
  8. 10 Mar, 2018 1 commit
    • mm/page_alloc: fix memmap_init_zone pageblock alignment · 864b75f9
      Daniel Vacek authored
      Commit b92df1de ("mm: page_alloc: skip over regions of invalid pfns
      where possible") introduced a bug where move_freepages() triggers a
      VM_BUG_ON() on uninitialized page structure due to pageblock alignment.
      To fix this, simply align the skipped pfns in memmap_init_zone() the
      same way as in move_freepages_block().
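
      The fix boils down to rounding the skip target in memmap_init_zone()
      (approximate form of the hunk):

        /* Skip to the pfn preceding the next valid one (or end_pfn), but
         * keep the jump pageblock aligned so that move_freepages_block()
         * never sees uninitialized struct pages at a rounded-down start. */
        pfn = (memblock_next_valid_pfn(pfn, end_pfn) &
                        ~(pageblock_nr_pages - 1)) - 1;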
      
      Seen in one of the RHEL reports:
      
        crash> log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1
        kernel BUG at mm/page_alloc.c:1389!
        invalid opcode: 0000 [#1] SMP
        --
        RIP: 0010:[<ffffffff8118833e>]  [<ffffffff8118833e>] move_freepages+0x15e/0x160
        RSP: 0018:ffff88054d727688  EFLAGS: 00010087
        --
        Call Trace:
         [<ffffffff811883b3>] move_freepages_block+0x73/0x80
         [<ffffffff81189e63>] __rmqueue+0x263/0x460
         [<ffffffff8118c781>] get_page_from_freelist+0x7e1/0x9e0
         [<ffffffff8118caf6>] __alloc_pages_nodemask+0x176/0x420
        --
        RIP  [<ffffffff8118833e>] move_freepages+0x15e/0x160
         RSP <ffff88054d727688>
      
        crash> page_init_bug -v | grep RAM
        <struct resource 0xffff88067fffd2f8>          1000 -        9bfff	System RAM (620.00 KiB)
        <struct resource 0xffff88067fffd3a0>        100000 -     430bffff	System RAM (  1.05 GiB = 1071.75 MiB = 1097472.00 KiB)
        <struct resource 0xffff88067fffd410>      4b0c8000 -     4bf9cfff	System RAM ( 14.83 MiB = 15188.00 KiB)
        <struct resource 0xffff88067fffd480>      4bfac000 -     646b1fff	System RAM (391.02 MiB = 400408.00 KiB)
        <struct resource 0xffff88067fffd560>      7b788000 -     7b7fffff	System RAM (480.00 KiB)
        <struct resource 0xffff88067fffd640>     100000000 -    67fffffff	System RAM ( 22.00 GiB)
      
        crash> page_init_bug | head -6
        <struct resource 0xffff88067fffd560>      7b788000 -     7b7fffff	System RAM (480.00 KiB)
        <struct page 0xffffea0001ede200>   1fffff00000000  0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32          4096    1048575
        <struct page 0xffffea0001ede200> 505736 505344 <struct page 0xffffea0001ed8000> 505855 <struct page 0xffffea0001edffc0>
        <struct page 0xffffea0001ed8000>                0  0 <struct pglist_data 0xffff88047ffd9000> 0 <struct zone 0xffff88047ffd9000> DMA               1       4095
        <struct page 0xffffea0001edffc0>   1fffff00000400  0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32          4096    1048575
        BUG, zones differ!
      
      Note that this range follows two unpopulated sections
      68000000-77ffffff in this zone.  7b788000-7b7fffff is the first one
      after a gap.  This makes memmap_init_zone() skip all the pfns up to the
      beginning of this range.  But this range is not pageblock (2M) aligned.
      In fact no range has to be.
      
        crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000
              PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
        ffffea0001e00000  78000000                0        0  0 0
        ffffea0001ed7fc0  7b5ff000                0        0  0 0
        ffffea0001ed8000  7b600000                0        0  0 0	<<<<
        ffffea0001ede1c0  7b787000                0        0  0 0
        ffffea0001ede200  7b788000                0        0  1 1fffff00000000
      
      Top part of page flags should contain nodeid and zonenr, which is not
      the case for page ffffea0001ed8000 here (<<<<).
      
        crash> log | grep -o fffea0001ed[^\ ]* | sort -u
        fffea0001ed8000
        fffea0001eded20
        fffea0001edffc0
      
        crash> bt -r | grep -o fffea0001ed[^\ ]* | sort -u
        fffea0001ed8000
        fffea0001eded00
        fffea0001eded20
        fffea0001edffc0
      
      Initialization of the whole beginning of the section is skipped up to
      the start of the range due to commit b92df1de.  Now any code
      calling move_freepages_block() (like reusing the page from a freelist
      as in this example) with a page from the beginning of the range will
      get the page rounded down to start_page ffffea0001ed8000 and passed to
      move_freepages(), which crashes on the assertion as it gets the wrong
      zonenr.
      
        >         VM_BUG_ON(page_zone(start_page) != page_zone(end_page));
      
      Note, page_zone() derives the zone from page flags here.
      
      From similar machine before commit b92df1de:
      
        crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b7fe000 7b7ff000
              PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
        fffff73941e00000  78000000                0        0  1 1fffff00000000
        fffff73941ed7fc0  7b5ff000                0        0  1 1fffff00000000
        fffff73941ed8000  7b600000                0        0  1 1fffff00000000
        fffff73941edff80  7b7fe000                0        0  1 1fffff00000000
        fffff73941edffc0  7b7ff000 ffff8e67e04d3ae0     ad84  1 1fffff00020068 uptodate,lru,active,mappedtodisk
      
      All the pages since the beginning of the section are initialized, so
      move_freepages() is not going to blow up.
      
      The same machine with this fix applied:
      
        crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b7fe000 7b7ff000
              PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
        ffffea0001e00000  78000000                0        0  0 0
        ffffea0001e00000  7b5ff000                0        0  0 0
        ffffea0001ed8000  7b600000                0        0  1 1fffff00000000
        ffffea0001edff80  7b7fe000                0        0  1 1fffff00000000
        ffffea0001edffc0  7b7ff000 ffff88017fb13720        8  2 1fffff00020068 uptodate,lru,active,mappedtodisk
      
      At least the bare minimum of pages is initialized, preventing the crash
      as well.
      
      Customers started to report this as soon as 7.4 (where b92df1de was
      merged in RHEL) was released.  I remember reports from around
      September/October.  It's not easily reproduced and happens only on a
      handful of machines.  I guess that's why.  But that does not make it
      less serious, I think.
      
      Though there actually is a report here:
        https://bugzilla.kernel.org/show_bug.cgi?id=196443
      
      And there are reports for Fedora from July:
        https://bugzilla.redhat.com/show_bug.cgi?id=1473242
      and CentOS:
        https://bugs.centos.org/view.php?id=13964
      and we internally track several dozen reports for the RHEL bug
        https://bugzilla.redhat.com/show_bug.cgi?id=1525121
      
      Link: http://lkml.kernel.org/r/0485727b2e82da7efbce5f6ba42524b429d0391a.1520011945.git.neelx@redhat.com
      Fixes: b92df1de ("mm: page_alloc: skip over regions of invalid pfns where possible")
      Signed-off-by: Daniel Vacek <neelx@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      864b75f9
  9. 21 Feb, 2018 1 commit
    • mm: don't defer struct page initialization for Xen pv guests · 895f7b8e
      Juergen Gross authored
      Commit f7f99100 ("mm: stop zeroing memory during allocation in
      vmemmap") broke Xen pv domains in some configurations, as the "Pinned"
      information in struct page of early page tables could get lost.
      
      This will lead to the kernel trying to write directly into the page
      tables instead of asking the hypervisor to do so.  The result is a crash
      like the following:
      
        BUG: unable to handle kernel paging request at ffff8801ead19008
        IP: xen_set_pud+0x4e/0xd0
        PGD 1c0a067 P4D 1c0a067 PUD 23a0067 PMD 1e9de0067 PTE 80100001ead19065
        Oops: 0003 [#1] PREEMPT SMP
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-default+ #271
        Hardware name: Dell Inc. Latitude E6440/0159N7, BIOS A07 06/26/2014
        task: ffffffff81c10480 task.stack: ffffffff81c00000
        RIP: e030:xen_set_pud+0x4e/0xd0
        Call Trace:
         __pmd_alloc+0x128/0x140
         ioremap_page_range+0x3f4/0x410
         __ioremap_caller+0x1c3/0x2e0
         acpi_os_map_iomem+0x175/0x1b0
         acpi_tb_acquire_table+0x39/0x66
         acpi_tb_validate_table+0x44/0x7c
         acpi_tb_verify_temp_table+0x45/0x304
         acpi_reallocate_root_table+0x12d/0x141
         acpi_early_init+0x4d/0x10a
         start_kernel+0x3eb/0x4a1
         xen_start_kernel+0x528/0x532
        Code: 48 01 e8 48 0f 42 15 a2 fd be 00 48 01 d0 48 ba 00 00 00 00 00 ea ff ff 48 c1 e8 0c 48 c1 e0 06 48 01 d0 48 8b 00 f6 c4 02 75 5d <4c> 89 65 00 5b 5d 41 5c c3 65 8b 05 52 9f fe 7e 89 c0 48 0f a3
        RIP: xen_set_pud+0x4e/0xd0 RSP: ffffffff81c03cd8
        CR2: ffff8801ead19008
        ---[ end trace 38eca2e56f1b642e ]---
      
      Avoid this problem by not deferring struct page initialization when
      running as a Xen pv guest.
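
      A condensed sketch of the guard, assuming it sits in the helper that
      decides whether a pfn's initialization may be deferred (approximate;
      note the explicit xen.h include mentioned below):

        #include <xen/xen.h>

        static inline bool update_defer_init(pg_data_t *pgdat, ...)
        {
                /* Xen PV domains need struct pages (and their Pinned flag)
                 * early, so never defer their initialization */
                if (xen_pv_domain())
                        return true;    /* initialize now */
                ...
        }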
      
      Pavel said:
      
      : This is unique to Xen, so this particular issue won't affect other
      : configurations.  I am going to investigate if there is a way to
      : re-enable deferred page initialization on xen guests.
      
      [akpm@linux-foundation.org: explicitly include xen.h]
      Link: http://lkml.kernel.org/r/20180216154101.22865-1-jgross@suse.com
      Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap")
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Bob Picco <bob.picco@oracle.com>
      Cc: <stable@vger.kernel.org>	[4.15.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      895f7b8e
  10. 01 Feb, 2018 4 commits
  11. 08 Jan, 2018 1 commit
  12. 05 Jan, 2018 1 commit
    • mm: check pfn_valid first in zero_resv_unavail · e8c24773
      Dave Young authored
      With the latest kernel I get the below bug while testing kdump:
      
        BUG: unable to handle kernel paging request at ffffea00034b1040
        IP: zero_resv_unavail+0xbd/0x126
        PGD 37b98067 P4D 37b98067 PUD 37b97067 PMD 0
        Oops: 0002 [#1] SMP
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper Not tainted 4.15.0-rc1+ #316
        Hardware name: LENOVO 20ARS1BJ02/20ARS1BJ02, BIOS GJET92WW (2.42 ) 03/03/2017
        task: ffffffff81a0e4c0 task.stack: ffffffff81a00000
        RIP: 0010:zero_resv_unavail+0xbd/0x126
        RSP: 0000:ffffffff81a03d88 EFLAGS: 00010006
        RAX: 0000000000000000 RBX: ffffea00034b1040 RCX: 0000000000000010
        RDX: 0000000000000000 RSI: 0000000000000092 RDI: ffffea00034b1040
        RBP: 00000000000d2c41 R08: 00000000000000c0 R09: 0000000000000a0d
        R10: 0000000000000002 R11: 0000000000007f01 R12: ffffffff81a03d90
        R13: ffffea0000000000 R14: 0000000000000063 R15: 0000000000000062
        FS:  0000000000000000(0000) GS:ffffffff81c73000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: ffffea00034b1040 CR3: 0000000037609000 CR4: 00000000000606b0
        Call Trace:
         ? free_area_init_nodes+0x640/0x664
         ? zone_sizes_init+0x58/0x72
         ? setup_arch+0xb50/0xc6c
         ? start_kernel+0x64/0x43d
         ? secondary_startup_64+0xa5/0xb0
        Code: c1 e8 0c 48 39 d8 76 27 48 89 de 48 c1 e3 06 48 c7 c7 7a 87 79 81 e8 b0 c0 3e ff 4c 01 eb b9 10 00 00 00 31 c0 48 89 df 49 ff c6 <f3> ab eb bc 6a 00 49 c7 c0 f0 93 d1 81 31 d2 83 ce ff 41 54 49
        RIP: zero_resv_unavail+0xbd/0x126 RSP: ffffffff81a03d88
        CR2: ffffea00034b1040
        ---[ end trace f5ba9e8f73c7ee26 ]---
      
      This is introduced by commit a4a3ede2 ("mm: zero reserved and
      unavailable struct pages").
      
      The reason is that some efi reserved boot ranges are not reported in
      the E820 RAM map.  In my case it is a bgrt buffer:
      
        efi: mem00: [Boot Data          |RUN|  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x00000000d2c41000-0x00000000d2c85fff] (0MB)
      
      Using "add_efi_memmap" can work around the problem together with
      another fix:
      
        http://lkml.kernel.org/r/20171130052327.GA3500@dhcp-128-65.nay.redhat.com
      
      In zero_resv_unavail() it would be better to check pfn_valid first
      before zeroing the page struct.  This fixes the problem and potential
      other similar problems.  Also, as Pavel Tatashin suggested, check
      pfn_valid at the beginning of the section.
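
      A condensed sketch of the guard, following the description above (the
      exact alignment granularity in the upstream hunk may differ; this
      checks once per section, which matches pfn_valid()'s resolution):

        for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
                /* the memmap may simply not exist for this range */
                if (!pfn_valid(ALIGN_DOWN(pfn, PAGES_PER_SECTION)))
                        continue;
                mm_zero_struct_page(pfn_to_page(pfn));
        }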
      
      The range is backed by real memory.  The memory range is efi "Boot
      Service Data", which means that after ExitBootServices() these ranges
      can be used as system ram.  But some of them need to be reserved; for
      example, the bgrt image address is in an acpi table, and if the image
      memory is freed then a kexec reboot will fail because kexec inherits
      the same acpi table to initialize the driver.
      
      Link: http://lkml.kernel.org/r/20171201095048.GA3084@dhcp-128-65.nay.redhat.com
      Fixes: a4a3ede2 ("mm: zero reserved and unavailable struct pages")
      Signed-off-by: Dave Young <dyoung@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e8c24773
  13. 15 Dec, 2017 1 commit
  14. 30 Nov, 2017 2 commits
    • mm/cma: fix alloc_contig_range ret code/potential leak · 63cd4489
      Mike Kravetz authored
      If the call to __alloc_contig_migrate_range() in alloc_contig_range()
      returns -EBUSY, processing continues so that test_pages_isolated() is
      called, where there is a tracepoint to identify the busy pages.
      However, it is possible for busy pages to become available between the
      calls to these two routines.  In this case, the range of pages may be
      allocated.  Unfortunately, the original return code (ret == -EBUSY) is
      still set and returned to the caller.  Therefore, the caller believes
      the pages were not allocated and they are leaked.
      
      Update the comment to indicate that allocation is still possible even
      if __alloc_contig_migrate_range() returns -EBUSY.  Also, clear the
      return code in this case so that it is not accidentally used or
      returned to the caller.
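
      A condensed sketch of the resulting flow in alloc_contig_range():

        ret = __alloc_contig_migrate_range(&cc, start, end);
        if (ret && ret != -EBUSY)
                goto done;
        /* -EBUSY is not fatal: the busy pages may have become free in the
         * meantime, so clear ret and let test_pages_isolated() decide */
        ret = 0;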
      
      Link: http://lkml.kernel.org/r/20171122185214.25285-1-mike.kravetz@oracle.com
      Fixes: 8ef5849f ("mm/cma: always check which page caused allocation failure")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      63cd4489
    • mm, memory_hotplug: do not back off draining pcp free pages from kworker context · 4b81cb2f
      Michal Hocko authored
      drain_all_pages backs off when called from a kworker context since
      commit 0ccce3b9 ("mm, page_alloc: drain per-cpu pages from workqueue
      context"), because the original IPI based pcp draining has been
      replaced by a WQ based one and the check wanted to prevent recursion
      and inter-worker dependencies.  This made some sense at the time
      because the system WQ was used and one worker holding the lock could be
      blocked while waiting for new workers to emerge, which can be a problem
      under OOM conditions.
      
      Since then, commit ce612879 ("mm: move pcp and lru-pcp draining into
      single wq") has moved draining to a dedicated (mm_percpu_wq) WQ with a
      rescuer, so we shouldn't depend on any other WQ activity to make
      forward progress, and calling drain_all_pages from a worker context is
      safe as long as this doesn't happen from mm_percpu_wq itself, which is
      not the case because all its workers are required to _not_ depend on
      any MM locks.
      
      Why is this a problem in the first place?  ACPI driven memory
      hot-remove (acpi_device_hotplug) is executed from the worker context.
      We end up calling __offline_pages to free all the pages, and that
      requires both lru_add_drain_all_cpuslocked and drain_all_pages to do
      their job; otherwise we can have dangling pages on pcp lists and fail
      the offline operation (__test_page_isolated_in_pageblock would see a
      page with 0 ref count but without PageBuddy set).
      
      Fix the issue by removing the worker check in drain_all_pages.
      lru_add_drain_all_cpuslocked doesn't have this restriction so it works
      as expected.
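
      For reference, a sketch of the removed bail-out (approximate wording;
      the rest of drain_all_pages() is unchanged):

        /* removed from drain_all_pages(): with a dedicated mm_percpu_wq
         * plus rescuer, draining from a kworker is safe, so don't back
         * off anymore */
        if (current->flags & PF_WQ_WORKER)
                return;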
      
      Link: http://lkml.kernel.org/r/20170828093341.26341-1-mhocko@kernel.org
      Fixes: 0ccce3b9 ("mm, page_alloc: drain per-cpu pages from workqueue context")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.11+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4b81cb2f
  15. 18 Nov, 2017 1 commit
  16. 16 Nov, 2017 6 commits
    • Oscar Salvador's avatar
      mm: make alloc_node_mem_map a void call if we don't have CONFIG_FLAT_NODE_MEM_MAP · 0cd842f9
      Oscar Salvador authored
      free_area_init_node() calls alloc_node_mem_map(), but this function does
      nothing unless we have CONFIG_FLAT_NODE_MEM_MAP.
      
      As a cleanup, we can move the "#ifdef CONFIG_FLAT_NODE_MEM_MAP" from
      within alloc_node_mem_map() out of the function, and define an empty
      alloc_node_mem_map() stub when CONFIG_FLAT_NODE_MEM_MAP is not present.
      
      This also moves the printk that lies within the "#ifdef
      CONFIG_FLAT_NODE_MEM_MAP" block from free_area_init_node() to
      alloc_node_mem_map(), getting rid of the "#ifdef
      CONFIG_FLAT_NODE_MEM_MAP" in free_area_init_node().
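
      The resulting pattern, sketched approximately:

        #ifdef CONFIG_FLAT_NODE_MEM_MAP
        static void __ref alloc_node_mem_map(struct pglist_data *pgdat)
        {
                /* ... the actual mem_map allocation, plus the moved printk ... */
        }
        #else
        static inline void alloc_node_mem_map(struct pglist_data *pgdat) { }
        #endif

      free_area_init_node() can then call alloc_node_mem_map()
      unconditionally.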
      
      [akpm@linux-foundation.org: clean up the printk while we're there]
      Link: http://lkml.kernel.org/r/20171114111935.GA11758@techadventures.net
      Signed-off-by: Oscar Salvador <osalvador@techadventures.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0cd842f9
    • mm: simplify nodemask printing · 0205f755
      Michal Hocko authored
      warn_alloc() and dump_header() have to explicitly handle a NULL
      nodemask, which forces both paths to use pr_cont.  We can do better.
      printk already handles NULL pointers properly, so all we need is to
      teach nodemask_pr_args to handle a NULL nodemask carefully.  This
      allows simplification of both warn_alloc() and dump_header() and gets
      rid of pr_cont altogether.
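
      A sketch of the taught macros (approximate form):

        #define nodemask_pr_args(maskp)         __nodemask_pr_numnodes(maskp), \
                                                __nodemask_pr_bits(maskp)
        #define __nodemask_pr_numnodes(m)       ((m) ? MAX_NUMNODES : 0)
        #define __nodemask_pr_bits(m)           ((m) ? (m)->bits : NULL)

      A caller can then format "nodemask=%*pbl" with
      nodemask_pr_args(nodemask) whether nodemask is NULL or not, with no
      special casing.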
      
      This patch has been motivated by patch from Joe Perches
      
        http://lkml.kernel.org/r/b31236dfe3fc924054fd7842bde678e71d193638.1509991345.git.joe@perches.com
      
      [akpm@linux-foundation.org: fix tile warning, per Arnd]
      Link: http://lkml.kernel.org/r/20171109100531.3cn2hcqnuj7mjaju@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Joe Perches <joe@perches.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0205f755
    • mm/page_alloc.c: broken deferred calculation · d135e575
      Pavel Tatashin authored
      In reset_deferred_meminit() we determine the number of pages that must
      not be deferred.  We initialize pages for at least 2G of memory, but
      also pages for the reserved memory in this node.

      The reserved memory is determined in this function:
      memblock_reserved_memory_within(), which operates over physical
      addresses and returns a size in bytes.  However,
      reset_deferred_meminit() assumes that this function operates with pfns
      and returns a page count.
      
      The result is that in the best case the machine boots more slowly
      than expected, because more pages than needed are initialized in a
      single thread, and in the worst case it panics because fewer pages
      than needed are initialized early.
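
      A hedged sketch of the unit fix (the variable names here are
      illustrative, not the verbatim patch): convert the pfn bounds to
      physical addresses for the call, then convert the returned byte
      count back into pages:

        /* memblock_reserved_memory_within() takes physical addresses
         * and returns bytes, so shift to and from page units. */
        phys_addr_t start = (phys_addr_t)pgdat->node_start_pfn << PAGE_SHIFT;
        phys_addr_t end = (phys_addr_t)end_pfn << PAGE_SHIFT;
        unsigned long nr_reserved =
                memblock_reserved_memory_within(start, end) >> PAGE_SHIFT;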
      
      Link: http://lkml.kernel.org/r/20171021011707.15191-1-pasha.tatashin@oracle.com
      Fixes: 864b9a39 ("mm: consider memblock reservations for deferred memory initialization sizing")
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: don't warn about allocations which stall for too long · 400e2249
      Tetsuo Handa authored
      Commit 63f53dea ("mm: warn about allocations which stall for too
      long") was a great step toward reducing the possibility of silent
      hang-ups caused by memory allocation stalls.  But this commit
      reverts it: with the current implementation of printk(), an OOM
      lockup and/or soft lockups can be triggered when many threads
      concurrently call warn_alloc() (in order to warn about memory
      allocation stalls), and the synchronous warning approach makes it
      difficult to obtain useful information.
      
      The current printk() implementation flushes all pending logs from
      the context of the thread that called console_unlock().  printk()
      should be able to flush all pending logs eventually, unless
      somebody keeps appending to the printk() buffer.
      
      Since warn_alloc() started appending to the printk() buffer while
      waiting for oom_kill_process() to make forward progress, and
      oom_kill_process() may itself be processing pending logs, it became
      possible for warn_alloc() to force oom_kill_process() to loop
      inside printk().  As a result, warn_alloc() significantly increased
      the likelihood of preventing oom_kill_process() from making
      forward progress.
      
      ---------- Pseudo code start ----------
      Before warn_alloc() was introduced:
      
        retry:
          if (mutex_trylock(&oom_lock)) {
            while (atomic_read(&printk_pending_logs) > 0) {
              atomic_dec(&printk_pending_logs);
              print_one_log();
            }
            // Send SIGKILL here.
            mutex_unlock(&oom_lock);
          }
          goto retry;
      
      After warn_alloc() was introduced:
      
        retry:
          if (mutex_trylock(&oom_lock)) {
            while (atomic_read(&printk_pending_logs) > 0) {
              atomic_dec(&printk_pending_logs);
              print_one_log();
            }
            // Send SIGKILL here.
            mutex_unlock(&oom_lock);
          } else if (waited_for_10seconds()) {
            atomic_inc(&printk_pending_logs);
          }
          goto retry;
      ---------- Pseudo code end ----------
      
      Although waited_for_10seconds() becomes true once per 10 seconds,
      an unbounded number of threads can call waited_for_10seconds() at
      the same time.  Also, since the threads in waited_for_10seconds()
      keep running an almost-busy loop, the thread doing print_one_log()
      gets little CPU time.  Therefore, this situation can be simplified
      to
      
      ---------- Pseudo code start ----------
        retry:
          if (mutex_trylock(&oom_lock)) {
            while (atomic_read(&printk_pending_logs) > 0) {
              atomic_dec(&printk_pending_logs);
              print_one_log();
            }
            // Send SIGKILL here.
            mutex_unlock(&oom_lock);
          } else {
            atomic_inc(&printk_pending_logs);
          }
          goto retry;
      ---------- Pseudo code end ----------
      
      whenever printk() is called faster than print_one_log() can
      process the logs.
      
      One possible mitigation would be to introduce a new lock in order
      to make sure that no other series of printk() calls (from either
      oom_kill_process() or warn_alloc()) can append to the printk()
      buffer while one such series is already in progress.
      
      Such serialization would also help obtain the kernel messages in a
      readable form.
      
      ---------- Pseudo code start ----------
        retry:
          if (mutex_trylock(&oom_lock)) {
            mutex_lock(&oom_printk_lock);
            while (atomic_read(&printk_pending_logs) > 0) {
              atomic_dec(&printk_pending_logs);
              print_one_log();
            }
            // Send SIGKILL here.
            mutex_unlock(&oom_printk_lock);
            mutex_unlock(&oom_lock);
          } else {
            if (mutex_trylock(&oom_printk_lock)) {
              atomic_inc(&printk_pending_logs);
              mutex_unlock(&oom_printk_lock);
            }
          }
          goto retry;
      ---------- Pseudo code end ----------
      
      But this commit does not go in that direction, for we don't want
      to introduce a new lock dependency, and we are unlikely to be able
      to obtain useful information even if we serialized
      oom_kill_process() and warn_alloc().
      
      The synchronous approach is prone to unexpected results (e.g.  too
      late [1], too frequent [2], overlooked [3]).  As far as I know,
      warn_alloc() never helped with providing information other than
      "something is going wrong".  I want to consider an asynchronous
      approach which can obtain information during stalls from possibly
      relevant threads (e.g.  the owner of oom_lock and kswapd-like
      threads) and serve as a trigger for actions (e.g.  turning
      tracepoints on/off, or asking the libvirt daemon to take a memory
      dump of a stalling KVM guest for diagnostic purposes).
      
      With this commit we temporarily lose the ability to report e.g.
      an OOM lockup caused by being unable to invoke the OOM killer for
      a !__GFP_FS allocation request.  But an asynchronous approach will
      be able to detect such a situation and emit a warning.  Thus,
      let's remove warn_alloc().
      
      [1] https://bugzilla.kernel.org/show_bug.cgi?id=192981
      [2] http://lkml.kernel.org/r/CAM_iQpWuPVGc2ky8M-9yukECtS+zKjiDasNymX7rMcBjBFyM_A@mail.gmail.com
      [3] commit db73ee0d ("mm, vmscan: do not loop on too_many_isolated for ever"))
      
      Link: http://lkml.kernel.org/r/1509017339-4802-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
      Reported-by: yuwang.yuwang <yuwang.yuwang@alibaba-inc.com>
      Reported-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: fix potential false positive in __zone_watermark_ok · b050e376
      Vlastimil Babka authored
      Since commit 97a16fc8 ("mm, page_alloc: only enforce watermarks
      for order-0 allocations"), the __zone_watermark_ok() check for
      high-order allocations will shortcut the per-migratetype free list
      checks for ALLOC_HARDER allocations, and return true as long as
      there's a free page of any migratetype.  The intention is that
      ALLOC_HARDER can allocate from MIGRATE_HIGHATOMIC free lists,
      while normal allocations can't.
      
      However, as a side effect, the watermark check will then also return
      true when there are pages only on the MIGRATE_ISOLATE list, or (prior to
      CMA conversion to ZONE_MOVABLE) on the MIGRATE_CMA list.  Since the
      allocation cannot actually obtain isolated pages, and might not be able
      to obtain CMA pages, this can result in a false positive.
      
      The condition should be rare and perhaps the outcome is not a fatal one.
      Still, it's better if the watermark check is correct.  There also
      shouldn't be a performance tradeoff here.
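
      A hedged sketch of the fix's direction (not the verbatim patch):
      rather than returning true for ALLOC_HARDER as soon as any free
      page exists, check the MIGRATE_HIGHATOMIC free list explicitly, so
      pages sitting only on MIGRATE_ISOLATE (or unusable MIGRATE_CMA)
      lists no longer satisfy the check:

        /* Inside the per-order loop of __zone_watermark_ok(): */
        if (alloc_harder &&
            !list_empty(&area->free_list[MIGRATE_HIGHATOMIC]))
                return true;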
      
      Link: http://lkml.kernel.org/r/20171102125001.23708-1-vbabka@suse.cz
      Fixes: 97a16fc8 ("mm, page_alloc: only enforce watermarks for order-0 allocations")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, sysctl: make NUMA stats configurable · 4518085e
      Kemi Wang authored
      This is the second step: it introduces a tunable interface that
      makes the NUMA stats configurable, in order to optimize
      zone_statistics(), as suggested by Dave Hansen and Ying Huang.
      
      =========================================================================
      
      When page allocation performance becomes a bottleneck and you can
      tolerate some possible tool breakage and decreased numa counter
      precision, you can do:
      
      	echo 0 > /proc/sys/vm/numa_stat
      
      In this case, NUMA counter updates are ignored.  We see about a
      *4.8%* (185->176) drop in CPU cycles per single page allocation
      and reclaim on Jesper's page_bench01 (single thread), and an
      *8.1%* (343->315) drop in CPU cycles per single page allocation
      and reclaim on Jesper's page_bench03 (88 threads), running on a
      2-socket Broadwell-based server (88 threads, 126G memory).
      
      Benchmark link provided by Jesper D Brouer (increase the loop
      count to 10000000):
      
        https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
      
      =========================================================================
      
      When page allocation performance is not a bottleneck and you want all
      tooling to work, you can do:
      
      	echo 1 > /proc/sys/vm/numa_stat
      
      This is the system default setting.
      
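      A hedged sketch of one plausible implementation (the identifiers
      are assumptions, not necessarily the patch's): a static branch
      gates the NUMA counter updates, and the sysctl handler flips it
      under a global static mutex, consistent with the fixup note below:

        DEFINE_STATIC_KEY_TRUE(vm_numa_stat_key);
        static DEFINE_MUTEX(vm_numa_stat_lock);

        static void set_vm_numa_stat(bool enabled)
        {
                mutex_lock(&vm_numa_stat_lock);
                if (enabled)
                        static_branch_enable(&vm_numa_stat_key);
                else
                        static_branch_disable(&vm_numa_stat_key);
                mutex_unlock(&vm_numa_stat_lock);
        }

        /* Fast path, e.g. at the top of zone_statistics(): */
        if (!static_branch_likely(&vm_numa_stat_key))
                return;  /* NUMA stats disabled: skip counter updates */
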
      Many thanks to Michal Hocko, Dave Hansen, Ying Huang and Vlastimil
      Babka for comments that helped improve the original patch.
      
      [keescook@chromium.org: make sure mutex is a global static]
        Link: http://lkml.kernel.org/r/20171107213809.GA4314@beast
      Link: http://lkml.kernel.org/r/1508290927-8518-1-git-send-email-kemi.wang@intel.com
      Signed-off-by: Kemi Wang <kemi.wang@intel.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Suggested-by: Dave Hansen <dave.hansen@intel.com>
      Suggested-by: Ying Huang <ying.huang@intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>