Commit edf445ad authored by Linus Torvalds's avatar Linus Torvalds
Browse files

Merge branch 'hugepage-fallbacks' (hugepatch patches from David Rientjes)

Merge hugepage allocation updates from David Rientjes:
 "We (mostly Linus, Andrea, and myself) have been discussing offlist how
  to implement a sane default allocation strategy for hugepages on NUMA

  With these reverts in place, the page allocator will happily allocate
  a remote hugepage immediately rather than try to make a local hugepage
  available. This incurs a substantial performance degradation when
  memory compaction would have otherwise made a local hugepage

  This series reverts those reverts and attempts to propose a more sane
  default allocation strategy specifically for hugepages. Andrea
  acknowledges this is likely to fix the swap storms that he originally
  reported that resulted in the patches that removed __GFP_THISNODE from
  hugepage allocations.

  The immediate goal is to return 5.3 to the behavior the kernel has
  implemented over the past several years so that remote hugepages are
  not immediately allocated when local hugepages could have been made
  available because the increased access latency is untenable.

  The next goal is to introduce a sane default allocation strategy for
  hugepages allocations in general regardless of the configuration of
  the system so that we prevent thrashing of local memory when
  compaction is unlikely to succeed and can prefer remote hugepages over
  remote native pages when the local node is low on memory."

Note on timing: this reverts the hugepage VM behavior changes that got
introduced fairly late in the 5.3 cycle, and that fixed a huge
performance regression for certain loads that had been around since

Andrea had this note:

 "The regression of 4.18 was that it was taking hours to start a VM
  where 3.10 was only taking a few seconds, I reported all the details
  on lkml when it was finally tracked down in August 2018.

  __GFP_THISNODE in MADV_HUGEPAGE made the above enterprise vfio
  workload degrade like in the "current upstream" above. And it still
  would have been that bad as above until 5.3-rc5"

where the bad behavior ends up happening as you fill up a local node,
and without that change, you'd get into the nasty swap storm behavior
due to compaction working overtime to make room for more memory on the

As a result 5.3 got the two performance fix reverts in rc5.

However, David Rientjes then noted that those performance fixes in turn
regressed performance for other loads - although not quite to the same
degree.  He suggested reverting the reverts and instead replacing them
with two small changes to how hugepage allocations are done (patch
descriptions rephrased by me):

 - "avoid expensive reclaim when compaction may not succeed": just admit
   that the allocation failed when you're trying to allocate a huge-page
   and compaction wasn't successful.

 - "allow hugepage fallback to remote nodes when madvised": when that
   node-local huge-page allocation failed, retry without forcing the
   local node.

but by then I judged it too late to replace the fixes for a 5.3 release.
So 5.3 was released with behavior that harked back to the pre-4.18 logic.

But now we're in the merge window for 5.4, and we can see if this
alternate model fixes not just the horrendous swap storm behavior, but
also restores the performance regression that the late reverts caused.

Fingers crossed.

* emailed patches from David Rientjes <>:
  mm, page_alloc: allow hugepage fallback to remote nodes when madvised
  mm, page_alloc: avoid expensive reclaim when compaction may not succeed
  Revert "Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask""
  Revert "Revert "mm, thp: restore node-local hugepage allocations""
parents a2953204 76e654cc
......@@ -510,18 +510,22 @@ alloc_pages(gfp_t gfp_mask, unsigned int order)
extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
struct vm_area_struct *vma, unsigned long addr,
int node);
int node, bool hugepage);
#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)
#define alloc_pages(gfp_mask, order) \
alloc_pages_node(numa_node_id(), gfp_mask, order)
#define alloc_pages_vma(gfp_mask, order, vma, addr, node)\
#define alloc_pages_vma(gfp_mask, order, vma, addr, node, false)\
alloc_pages(gfp_mask, order)
#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
alloc_pages(gfp_mask, order)
#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
#define alloc_page_vma(gfp_mask, vma, addr) \
alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id())
alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)
#define alloc_page_vma_node(gfp_mask, vma, addr, node) \
alloc_pages_vma(gfp_mask, 0, vma, addr, node)
alloc_pages_vma(gfp_mask, 0, vma, addr, node, false)
extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
extern unsigned long get_zeroed_page(gfp_t gfp_mask);
......@@ -139,8 +139,6 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
struct mempolicy *get_task_policy(struct task_struct *p);
struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
unsigned long addr);
struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
unsigned long addr);
bool vma_policy_mof(struct vm_area_struct *vma);
extern void numa_default_policy(void);
......@@ -659,40 +659,30 @@ release:
* available
* never: never stall for any thp allocation
static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
gfp_t this_node = 0;
struct mempolicy *pol;
* __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
* specified, to express a general desire to stay on the current
* node for optimistic allocation attempts. If the defrag mode
* and/or madvise hint requires the direct reclaim then we prefer
* to fallback to other node rather than node reclaim because that
* can lead to excessive reclaim even though there is free memory
* on other nodes. We expect that NUMA preferences are specified
* by memory policies.
pol = get_vma_policy(vma, addr);
if (pol->mode != MPOL_BIND)
this_node = __GFP_THISNODE;
/* Always do synchronous compaction */
if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
/* Kick kcompactd and fail quickly */
if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
/* Synchronous compaction if madvised, otherwise kick kcompactd */
if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
__GFP_KSWAPD_RECLAIM | this_node);
(vma_madvised ? __GFP_DIRECT_RECLAIM :
/* Only do synchronous compaction if madvised */
if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
return GFP_TRANSHUGE_LIGHT | this_node;
(vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
/* Caller must hold page table lock. */
......@@ -764,8 +754,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
pte_free(vma->vm_mm, pgtable);
return ret;
gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
page = alloc_pages_vma(gfp, HPAGE_PMD_ORDER, vma, haddr, numa_node_id());
gfp = alloc_hugepage_direct_gfpmask(vma);
page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
if (unlikely(!page)) {
......@@ -1372,9 +1362,8 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
if (__transparent_hugepage_enabled(vma) &&
!transparent_hugepage_debug_cow()) {
huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
new_page = alloc_pages_vma(huge_gfp, HPAGE_PMD_ORDER, vma,
haddr, numa_node_id());
huge_gfp = alloc_hugepage_direct_gfpmask(vma);
new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
} else
new_page = NULL;
......@@ -1179,8 +1179,8 @@ static struct page *new_page(struct page *page, unsigned long start)
} else if (PageTransHuge(page)) {
struct page *thp;
thp = alloc_pages_vma(GFP_TRANSHUGE, HPAGE_PMD_ORDER, vma,
address, numa_node_id());
thp = alloc_hugepage_vma(GFP_TRANSHUGE, vma, address,
if (!thp)
return NULL;
......@@ -1732,7 +1732,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
* freeing by another task. It is the caller's responsibility to free the
* extra reference for shared policies.
struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
unsigned long addr)
struct mempolicy *pol = __get_vma_policy(vma, addr);
......@@ -2081,6 +2081,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
* @vma: Pointer to VMA or NULL if not available.
* @addr: Virtual Address of the allocation. Must be inside the VMA.
* @node: Which node to prefer for allocation (modulo policy).
* @hugepage: for hugepages try only the preferred node if possible
* This function allocates a page from the kernel page pool and applies
* a NUMA policy associated with the VMA or the current process.
......@@ -2091,7 +2092,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
struct page *
alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
unsigned long addr, int node)
unsigned long addr, int node, bool hugepage)
struct mempolicy *pol;
struct page *page;
......@@ -2109,6 +2110,42 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
goto out;
if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
int hpage_node = node;
* For hugepage allocation and non-interleave policy which
* allows the current node (or other explicitly preferred
* node) we only try to allocate from the current/preferred
* node and don't fall back to other nodes, as the cost of
* remote accesses would likely offset THP benefits.
* If the policy is interleave, or does not allow the current
* node in its nodemask, we allocate the standard way.
if (pol->mode == MPOL_PREFERRED && !(pol->flags & MPOL_F_LOCAL))
hpage_node = pol->v.preferred_node;
nmask = policy_nodemask(gfp, pol);
if (!nmask || node_isset(hpage_node, *nmask)) {
page = __alloc_pages_node(hpage_node,
gfp | __GFP_THISNODE, order);
* If hugepage allocations are configured to always
* synchronous compact or the vma has been madvised
* to prefer hugepage backing, retry allowing remote
* memory as well.
if (!page && (gfp & __GFP_DIRECT_RECLAIM))
page = __alloc_pages_node(hpage_node,
gfp | __GFP_NORETRY, order);
goto out;
nmask = policy_nodemask(gfp, pol);
preferred_nid = policy_node(gfp, pol, node);
page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
......@@ -4467,6 +4467,28 @@ retry_cpuset:
if (page)
goto got_pg;
if (order >= pageblock_order && (gfp_mask & __GFP_IO)) {
* If allocating entire pageblock(s) and compaction
* failed because all zones are below low watermarks
* or is prohibited because it recently failed at this
* order, fail immediately.
* Reclaim is
* - potentially very expensive because zones are far
* below their low watermarks or this is part of very
* bursty high order allocations,
* - not guaranteed to help because isolate_freepages()
* may not iterate over freed pages as part of its
* linear scan, and
* - unlikely to make entire pageblocks free on its
* own.
if (compact_result == COMPACT_SKIPPED ||
compact_result == COMPACT_DEFERRED)
goto nopage;
* Checks for costly allocations with __GFP_NORETRY, which
* includes THP page fault allocations
......@@ -1481,7 +1481,7 @@ static struct page *shmem_alloc_hugepage(gfp_t gfp,
shmem_pseudo_vma_init(&pvma, info, hindex);
page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN,
HPAGE_PMD_ORDER, &pvma, 0, numa_node_id());
HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), true);
if (page)
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment