    mm, page_alloc: fix more premature OOM due to race with cpuset update · 902b6281
    Vlastimil Babka authored
    I would like to stress that this patchset aims to fix issues and clean up
    the code *within the existing documented semantics*, i.e.  patch 1
    ignores mempolicy restrictions if the set of allowed nodes has no
    intersection with the set of nodes allowed by the cpuset.  I believe
    discussing potential changes of the semantics can be better done once we
    have a baseline with no known bugs of the current semantics.
    
    I've recently summarized the cpuset/mempolicy issues in an LSF/MM
    proposal [1] and the discussion itself [2].  I've been trying to rewrite
    the handling as proposed, with the idea that changing the semantics to
    make all mempolicies static wrt cpuset updates (and discarding the
    relative and default modes) can be tried on top, as there's a high risk
    of that being rejected/reverted because somebody might still care about
    the removed modes.
    
    However, I haven't yet figured out how to properly:
    
    1) make mempolicies swappable instead of rebinding them in place (see
       the sketch after this list). I thought mbind() already works that way
       and uses refcounting to avoid use-after-free of the old policy by a
       parallel allocation, but it turns out true refcounting is only done
       for shared (shmem) mempolicies, and the actual protection for
       mbind() comes from mmap_sem. Extending the refcounting means more
       overhead in the allocator hot path. Also, swapping whole mempolicies
       means that we have to allocate the new ones, which can fail, and
       reverting the partially done work also means allocating (note that
       mbind() doesn't care and will just leave part of the range updated
       and part not updated when returning -ENOMEM...).
    
    2) make cpuset's task->mems_allowed also swappable (after converting it
       from nodemask to zonelist, which is the easy part) for mostly the
       same reasons.
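
    For illustration of what 1) would involve, here is a hypothetical
    sketch of the swap-instead-of-rebind scheme.  The helper names and the
    RCU-plus-refcount pattern are invented for this sketch (they are not
    code from this patchset), and it assumes vma->vm_policy were converted
    to an RCU-protected pointer; the mpol_get()/mpol_put() pair is exactly
    the allocator hot-path overhead mentioned above:

        #include <linux/mempolicy.h>
        #include <linux/mm_types.h>
        #include <linux/rcupdate.h>

        /* Hypothetical sketch, not from this patchset. */
        static struct mempolicy *get_vma_policy_sketch(struct vm_area_struct *vma)
        {
                struct mempolicy *pol;

                rcu_read_lock();
                pol = rcu_dereference(vma->vm_policy);
                if (pol)
                        mpol_get(pol);  /* the extra hot-path atomic */
                rcu_read_unlock();
                return pol;             /* caller does mpol_put() when done */
        }

        /* Caller holds mmap_sem for write and has preallocated 'new'. */
        static void swap_vma_policy_sketch(struct vm_area_struct *vma,
                                           struct mempolicy *new)
        {
                struct mempolicy *old = vma->vm_policy;

                rcu_assign_pointer(vma->vm_policy, new);
                synchronize_rcu();      /* readers now hold a ref or see 'new' */
                mpol_put(old);          /* freed once the last reader ref drops */
        }

    Note that 'new' has to be allocated up front, which is where the
    -ENOMEM reverting problem above comes from.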
    
    The good news is that while trying to do the above, I've at least
    figured out how to hopefully close the remaining premature OOMs, and do
    a bunch of cleanups on top, removing quite a bit of the code that was
    also supposed to prevent the cpuset update races but no longer works
    nowadays.  This should fix the most pressing concerns with this topic
    and give us a better baseline before either proceeding with the original
    proposal, or pushing a change of semantics that removes problem 1)
    above.  I'd then be fine with trying to change the semantics first and
    rewrite later.
    
    Patchset has been tested with the LTP cpuset01 stress test.
    
    [1] https://lkml.kernel.org/r/4c44a589-5fd8-08d0-892c-e893bb525b71@suse.cz
    [2] https://lwn.net/Articles/717797/
    [3] https://marc.info/?l=linux-mm&m=149191957922828&w=2
    
    This patch (of 6):
    
    Commit e47483bc ("mm, page_alloc: fix premature OOM when racing with
    cpuset mems update") has fixed known recent regressions found by LTP's
    cpuset01 testcase.  I have however found that by modifying the testcase
    to use per-vma mempolicies via mbind(2) instead of per-task mempolicies
    via set_mempolicy(2), the premature OOM still happens, and the issue is
    much older.
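
    For illustration, a minimal userspace sketch (not the LTP testcase
    itself; the mapping size and node number are arbitrary) of the
    difference: mbind(2) applies a policy to one VMA range, while
    set_mempolicy(2) applies to the whole task:

        /* Build with: cc -o bindtest bindtest.c -lnuma */
        #include <numaif.h>
        #include <sys/mman.h>
        #include <stdio.h>

        int main(void)
        {
                size_t len = 16 * 4096;
                unsigned long nodemask = 1UL << 0;      /* node 0 only */
                void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (p == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }

                /* Per-vma policy: only this range is bound to node 0. */
                if (mbind(p, len, MPOL_BIND, &nodemask,
                          sizeof(nodemask) * 8, 0))
                        perror("mbind");

                /* Per-task alternative, restricting all allocations:
                 * set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8);
                 */
                return 0;
        }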
    
    The root of the problem is that the cpuset's mems_allowed and
    mempolicy's nodemask can temporarily have no intersection, thus
    get_page_from_freelist() cannot find any usable zone.  The current
    semantics for an empty intersection are to ignore mempolicy's nodemask
    and honour cpuset restrictions.  This is checked in node_zonelist(),
    but the racy update can happen after we already passed the check.  Such
    races should be protected by the seqlock task->mems_allowed_seq, but it
    doesn't work here, because 1) mpol_rebind_mm() does not happen under
    the seqlock for write, and doing so would lead to deadlock, as it takes
    mmap_sem for write, while the allocation can have mmap_sem for read
    when it's taking the seqlock for read.  And 2) the seqlock cookie of
    callers of node_zonelist() (alloc_pages_vma() and
    alloc_pages_current()) is different from the one of
    __alloc_pages_slowpath(), so there's still a potential race window.
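
    To make point 2) concrete, here is a simplified sketch of the two
    independent seqlock read sections (not the verbatim kernel code; the
    fastpath/reclaim/OOM helpers are placeholders standing in for the real
    internals):

        #include <linux/cpuset.h>
        #include <linux/gfp.h>

        /* Placeholders for the real fast/slow path internals: */
        struct page *fastpath(gfp_t gfp);
        struct page *reclaim_and_compact(gfp_t gfp);
        struct page *oom_kill_then_retry(gfp_t gfp);

        /* Roughly __alloc_pages_slowpath(): takes its own cookie B. */
        static struct page *slowpath_sketch(gfp_t gfp)
        {
                unsigned int cookie_b;
                struct page *page;

        retry:
                cookie_b = read_mems_allowed_begin();   /* cookie B */
                page = reclaim_and_compact(gfp);
                if (page)
                        return page;

                /*
                 * This catches cpuset updates made after cookie B was
                 * taken, but an update landing between cookie A and
                 * cookie B is invisible here, so we go on to OOM against
                 * a stale nodemask.
                 */
                if (read_mems_allowed_retry(cookie_b))
                        goto retry;
                return oom_kill_then_retry(gfp);        /* premature if raced */
        }

        /* Roughly alloc_pages_current()/alloc_pages_vma(): cookie A. */
        struct page *alloc_sketch(gfp_t gfp)
        {
                unsigned int cookie_a;
                struct page *page;

        retry_cpuset:
                cookie_a = read_mems_allowed_begin();   /* cookie A */
                /* zonelist + nodemask derived from mempolicy/cpuset here */

                page = fastpath(gfp);
                if (!page)
                        page = slowpath_sketch(gfp);

                /* Retries on a raced update, but cannot undo an OOM kill. */
                if (unlikely(!page && read_mems_allowed_retry(cookie_a)))
                        goto retry_cpuset;
                return page;
        }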
    
    This patch fixes the issue by having __alloc_pages_slowpath() check for
    an empty intersection of cpuset and ac->nodemask before OOM or
    allocation failure.  If it's indeed empty, the nodemask is ignored and
    the allocation retried, which mimics node_zonelist().  This works fine,
    because almost all callers of __alloc_pages_nodemask obtain the
    nodemask via node_zonelist().  The only exception is new_node_page()
    from hotplug, where the potential violation of the nodemask isn't an
    issue, as there's already a fallback allocation attempt without any
    nodemask.  If there's a future caller that needs to have its specific
    nodemask honoured over the task's cpuset restrictions, we'll have to
    e.g.  add a gfp flag for that.
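
    In code terms, the added check looks roughly like this condensed sketch
    (the helper name is mine and details differ from the actual diff;
    struct alloc_context lives in mm/internal.h).  __alloc_pages_slowpath()
    would call it right before the OOM and failure paths and retry the
    allocation when it returns true:

        #include <linux/cpuset.h>
        #include <linux/gfp.h>
        #include "internal.h"           /* struct alloc_context */

        static bool check_retry_cpuset_sketch(unsigned int cpuset_mems_cookie,
                                              struct alloc_context *ac)
        {
                /*
                 * A racy cpuset update can leave ac->nodemask with no
                 * intersection with mems_allowed.  Drop the nodemask and
                 * retry, mimicking node_zonelist()'s empty-intersection
                 * behaviour.
                 */
                if (cpusets_enabled() && ac->nodemask &&
                    !cpuset_nodemask_valid_mems_allowed(ac->nodemask)) {
                        ac->nodemask = NULL;    /* honour cpuset over mempolicy */
                        return true;
                }

                /* Also retry if the cpuset changed during the allocation. */
                if (read_mems_allowed_retry(cpuset_mems_cookie))
                        return true;

                return false;
        }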
    
    Link: http://lkml.kernel.org/r/20170517081140.30654-2-vbabka@suse.cz
    
    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Li Zefan <lizefan@huawei.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Dimitri Sivanich <sivanich@sgi.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>