1. 06 Apr, 2018 1 commit
    • Souptick Joarder's avatar
      mm: change return type to vm_fault_t · 1c8f4220
      Souptick Joarder authored
      The plan for these patches is to introduce the typedef, initially just
      as documentation ("These functions should return a VM_FAULT_ status").
      We'll trickle the patches to individual drivers/filesystems in through
      the maintainers, as far as possible.  Then we'll change the typedef to
      an unsigned int and break the compilation of any unconverted
      vmf_insert_page(), vmf_insert_mixed() and vmf_insert_pfn() are three
      newly added functions.  The various drivers/filesystems where return
      value of fault(), huge_fault(), page_mkwrite() and pfn_mkwrite() get
      converted, will need them.  These functions will return correct
      VM_FAULT_ code based on err value.
      We've had bugs before where drivers returned -EFOO.  And we have this
      silly inefficiency where vm_insert_xxx() return an errno which (afaict)
      every driver then converts into a VM_FAULT code.  In many cases drivers
      failed to return correct VM_FAULT code value despite of vm_insert_xxx()
      fails.  We have indentified and clean up all those existing bugs and
      silly inefficiencies in driver/filesystems by adding these three new
      inline wrappers.  As mentioned above, we will trickle those patches to
      individual drivers/filesystems in through maintainers after these three
      wrapper functions are merged.
      Eventually we can convert vm_insert_xxx() into vmf_insert_xxx() and
      remove these inline wrappers, but these are a good intermediate step.
      Link: http://lkml.kernel.org/r/20180310162351.GA7422@jordon-HP-15-Notebook-PC
      Signed-off-by: default avatarSouptick Joarder <jrdr.linux@gmail.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  2. 01 Feb, 2018 8 commits
  3. 16 Nov, 2017 5 commits
  4. 02 Nov, 2017 1 commit
    • Greg Kroah-Hartman's avatar
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman authored
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      How this work was done:
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
      All documentation files were explicitly excluded.
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
         For non */uapi/* files that summary was:
         SPDX license identifier                            # files
         GPL-2.0                                              11139
         and resulted in the first patch in this series.
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
         SPDX license identifier                            # files
         GPL-2.0 WITH Linux-syscall-note                        930
         and resulted in the second patch in this series.
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
         SPDX license identifier                            # files
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
         and that resulted in the third patch in this series.
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: default avatarKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: default avatarPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  5. 20 Oct, 2017 1 commit
    • Mathieu Desnoyers's avatar
      membarrier: Provide register expedited private command · a961e409
      Mathieu Desnoyers authored
      This introduces a "register private expedited" membarrier command which
      allows eventual removal of important memory barrier constraints on the
      scheduler fast-paths. It changes how the "private expedited" membarrier
      command (new to 4.14) is used from user-space.
      This new command allows processes to register their intent to use the
      private expedited command.  This affects how the expedited private
      command introduced in 4.14-rc is meant to be used, and should be merged
      before 4.14 final.
      Processes are now required to register before using
      MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.
      This fixes a problem that arose when designing requested extensions to
      sys_membarrier() to allow JITs to efficiently flush old code from
      instruction caches.  Several potential algorithms are much less painful
      if the user register intent to use this functionality early on, for
      example, before the process spawns the second thread.  Registering at
      this time removes the need to interrupt each and every thread in that
      process at the first expedited sys_membarrier() system call.
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  6. 09 Sep, 2017 1 commit
  7. 07 Sep, 2017 1 commit
    • Huang Ying's avatar
      mm, swap: VMA based swap readahead · ec560175
      Huang Ying authored
      The swap readahead is an important mechanism to reduce the swap in
      latency.  Although pure sequential memory access pattern isn't very
      popular for anonymous memory, the space locality is still considered
      In the original swap readahead implementation, the consecutive blocks in
      swap device are readahead based on the global space locality estimation.
      But the consecutive blocks in swap device just reflect the order of page
      reclaiming, don't necessarily reflect the access pattern in virtual
      memory.  And the different tasks in the system may have different access
      patterns, which makes the global space locality estimation incorrect.
      In this patch, when page fault occurs, the virtual pages near the fault
      address will be readahead instead of the swap slots near the fault swap
      slot in swap device.  This avoid to readahead the unrelated swap slots.
      At the same time, the swap readahead is changed to work on per-VMA from
      globally.  So that the different access patterns of the different VMAs
      could be distinguished, and the different readahead policy could be
      applied accordingly.  The original core readahead detection and scaling
      algorithm is reused, because it is an effect algorithm to detect the
      space locality.
      The test and result is as follow,
      Common test condition
      Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM) Swap device:
      NVMe disk
      Micro-benchmark with combined access pattern
      vm-scalability, sequential swap test case, 4 processes to eat 50G
      virtual memory space, repeat the sequential memory writing until 300
      seconds.  The first round writing will trigger swap out, the following
      rounds will trigger sequential swap in and out.
      At the same time, run vm-scalability random swap test case in
      background, 8 processes to eat 30G virtual memory space, repeat the
      random memory write until 300 seconds.  This will trigger random swap-in
      in the background.
      This is a combined workload with sequential and random memory accessing
      at the same time.  The result (for sequential workload) is as follow,
      			Base		Optimized
      			----		---------
      throughput		345413 KB/s	414029 KB/s (+19.9%)
      latency.average		97.14 us	61.06 us (-37.1%)
      latency.50th		2 us		1 us
      latency.60th		2 us		1 us
      latency.70th		98 us		2 us
      latency.80th		160 us		2 us
      latency.90th		260 us		217 us
      latency.95th		346 us		369 us
      latency.99th		1.34 ms		1.09 ms
      ra_hit%			52.69%		99.98%
      The original swap readahead algorithm is confused by the background
      random access workload, so readahead hit rate is lower.  The VMA-base
      readahead algorithm works much better.
      The test memory size is bigger than RAM to trigger swapping.
      			Base		Optimized
      			----		---------
      elapsed_time		393.49 s	329.88 s (-16.2%)
      ra_hit%			86.21%		98.82%
      The score of base and optimized kernel hasn't visible changes.  But the
      elapsed time reduced and readahead hit rate improved, so the optimized
      kernel runs better for startup and tear down stages.  And the absolute
      value of readahead hit rate is high, shows that the space locality is
      still valid in some practical workloads.
      Link: http://lkml.kernel.org/r/20170807054038.1843-4-ying.huang@intel.com
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  8. 25 Aug, 2017 1 commit
  9. 10 Aug, 2017 6 commits
    • Minchan Kim's avatar
      mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem · 99baac21
      Minchan Kim authored
      Nadav reported parallel MADV_DONTNEED on same range has a stale TLB
      problem and Mel fixed it[1] and found same problem on MADV_FREE[2].
      Quote from Mel Gorman:
       "The race in question is CPU 0 running madv_free and updating some PTEs
        while CPU 1 is also running madv_free and looking at the same PTEs.
        CPU 1 may have writable TLB entries for a page but fail the pte_dirty
        check (because CPU 0 has updated it already) and potentially fail to
        Hence, when madv_free on CPU 1 returns, there are still potentially
        writable TLB entries and the underlying PTE is still present so that a
        subsequent write does not necessarily propagate the dirty bit to the
        underlying PTE any more. Reclaim at some unknown time at the future
        may then see that the PTE is still clean and discard the page even
        though a write has happened in the meantime. I think this is possible
        but I could have missed some protection in madv_free that prevents it
      This patch aims for solving both problems all at once and is ready for
      other problem with KSM, MADV_FREE and soft-dirty story[3].
      TLB batch API(tlb_[gather|finish]_mmu] uses [inc|dec]_tlb_flush_pending
      and mmu_tlb_flush_pending so that when tlb_finish_mmu is called, we can
      catch there are parallel threads going on.  In that case, forcefully,
      flush TLB to prevent for user to access memory via stale TLB entry
      although it fail to gather page table entry.
      I confirmed this patch works with [4] test program Nadav gave so this
      patch supersedes "mm: Always flush VMA ranges affected by zap_page_range
      v2" in current mmotm.
      This patch modifies arch-specific TLB gathering interface(x86, ia64,
      s390, sh, um).  It seems most of architecture are straightforward but
      s390 need to be careful because tlb_flush_mmu works only if
      mm->context.flush_mm is set to non-zero which happens only a pte entry
      really is cleared by ptep_get_and_clear and friends.  However, this
      problem never changes the pte entries but need to flush to prevent
      memory access from stale tlb.
      [1] http://lkml.kernel.org/r/20170725101230.5v7gvnjmcnkzzql3@techsingularity.net
      [2] http://lkml.kernel.org/r/20170725100722.2dxnmgypmwnrfawp@suse.de
      [3] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com
      [4] https://patchwork.kernel.org/patch/9861621/
      [minchan@kernel.org: decrease tlb flush pending count in tlb_finish_mmu]
        Link: http://lkml.kernel.org/r/20170808080821.GA31730@bbox
      Link: http://lkml.kernel.org/r/20170802000818.4760-7-namit@vmware.com
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarNadav Amit <namit@vmware.com>
      Reported-by: default avatarNadav Amit <namit@vmware.com>
      Reported-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Minchan Kim's avatar
      mm: make tlb_flush_pending global · 0a2dd266
      Minchan Kim authored
      Currently, tlb_flush_pending is used only for CONFIG_[NUMA_BALANCING|
      COMPACTION] but upcoming patches to solve subtle TLB flush batching
      problem will use it regardless of compaction/NUMA so this patch doesn't
      remove the dependency.
      [akpm@linux-foundation.org: remove more ifdefs from world's ugliest printk statement]
      Link: http://lkml.kernel.org/r/20170802000818.4760-6-namit@vmware.com
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarNadav Amit <namit@vmware.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Minchan Kim's avatar
      mm: refactor TLB gathering API · 56236a59
      Minchan Kim authored
      This patch is a preparatory patch for solving race problems caused by
      TLB batch.  For that, we will increase/decrease TLB flush pending count
      of mm_struct whenever tlb_[gather|finish]_mmu is called.
      Before making it simple, this patch separates architecture specific part
      and rename it to arch_tlb_[gather|finish]_mmu and generic part just
      calls it.
      It shouldn't change any behavior.
      Link: http://lkml.kernel.org/r/20170802000818.4760-5-namit@vmware.com
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarNadav Amit <namit@vmware.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Nadav Amit's avatar
      mm: migrate: fix barriers around tlb_flush_pending · 0a2c4048
      Nadav Amit authored
      Reading tlb_flush_pending while the page-table lock is taken does not
      require a barrier, since the lock/unlock already acts as a barrier.
      Removing the barrier in mm_tlb_flush_pending() to address this issue.
      However, migrate_misplaced_transhuge_page() calls mm_tlb_flush_pending()
      while the page-table lock is already released, which may present a
      problem on architectures with weak memory model (PPC).  To deal with
      this case, a new parameter is added to mm_tlb_flush_pending() to
      indicate if it is read without the page-table lock taken, and calling
      smp_mb__after_unlock_lock() in this case.
      Link: http://lkml.kernel.org/r/20170802000818.4760-3-namit@vmware.com
      Signed-off-by: default avatarNadav Amit <namit@vmware.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Nadav Amit's avatar
      mm: migrate: prevent racy access to tlb_flush_pending · 16af97dc
      Nadav Amit authored
      Patch series "fixes of TLB batching races", v6.
      It turns out that Linux TLB batching mechanism suffers from various
      races.  Races that are caused due to batching during reclamation were
      recently handled by Mel and this patch-set deals with others.  The more
      fundamental issue is that concurrent updates of the page-tables allow
      for TLB flushes to be batched on one core, while another core changes
      the page-tables.  This other core may assume a PTE change does not
      require a flush based on the updated PTE value, while it is unaware that
      TLB flushes are still pending.
      This behavior affects KSM (which may result in memory corruption) and
      MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior).  A
      proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED.
      Memory corruption in KSM is harder to produce in practice, but was
      observed by hacking the kernel and adding a delay before flushing and
      replacing the KSM page.
      Finally, there is also one memory barrier missing, which may affect
      architectures with weak memory model.
      This patch (of 7):
      Setting and clearing mm->tlb_flush_pending can be performed by multiple
      threads, since mmap_sem may only be acquired for read in
      task_numa_work().  If this happens, tlb_flush_pending might be cleared
      while one of the threads still changes PTEs and batches TLB flushes.
      This can lead to the same race between migration and
      change_protection_range() that led to the introduction of
      tlb_flush_pending.  The result of this race was data corruption, which
      means that this patch also addresses a theoretically possible data
      An actual data corruption was not observed, yet the race was was
      confirmed by adding assertion to check tlb_flush_pending is not set by
      two threads, adding artificial latency in change_protection_range() and
      using sysctl to reduce kernel.numa_balancing_scan_delay_ms.
      Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com
      Fixes: 20841405
       ("mm: fix TLB flush race between migration, and
      Signed-off-by: default avatarNadav Amit <namit@vmware.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Peter Zijlstra's avatar
      mm, locking: Rework {set,clear,mm}_tlb_flush_pending() · 8b1b436d
      Peter Zijlstra authored
       ("mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates")
      added smp_mb__before_spinlock() to set_tlb_flush_pending(). I think we
      can solve the same problem without this barrier.
      If instead we mandate that mm_tlb_flush_pending() is used while
      holding the PTL we're guaranteed to observe prior
      set_tlb_flush_pending() instances.
      For this to work we need to rework migrate_misplaced_transhuge_page()
      a little and move the test up into do_huge_pmd_numa_page().
      NOTE: this relies on flush_tlb_range() to guarantee:
         (1) it ensures that prior page table updates are visible to the
             page table walker and
         (2) it ensures that subsequent memory accesses are only made
             visible after the invalidation has completed
      This is required for architectures that implement TRANSPARENT_HUGEPAGE
      (arc, arm, arm64, mips, powerpc, s390, sparc, x86) or otherwise use
      mm_tlb_flush_pending() in their page-table operations (arm, arm64,
      This appears true for:
       - arm (DSB ISB before and after),
       - arm64 (DSB ISHST before, and DSB ISH after),
       - powerpc (PTESYNC before and after),
       - s390 and x86 TLB invalidate are serializing instructions
      But I failed to understand the situation for:
       - arc, mips, sparc
      Now SPARC64 is a wee bit special in that flush_tlb_range() is a no-op
      and it flushes the TLBs using arch_{enter,leave}_lazy_mmu_mode()
      inside the PTL. It still needs to guarantee the PTL unlock happens
      _after_ the invalidate completes.
      Vineet, Ralf and Dave could you guys please have a look?
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarWill Deacon <will.deacon@arm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
  10. 02 Aug, 2017 1 commit
    • Mel Gorman's avatar
      mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries · 3ea27719
      Mel Gorman authored
      Nadav Amit identified a theoritical race between page reclaim and
      mprotect due to TLB flushes being batched outside of the PTL being held.
      He described the race as follows:
              CPU0                            CPU1
              ----                            ----
                                              user accesses memory using RW PTE
                                              [PTE now cached in TLB]
              ==> ptep_get_and_clear()
              ==> set_tlb_ubc_flush_pending()
                                              mprotect(addr, PROT_READ)
                                              ==> change_pte_range()
                                              ==> [ PTE non-present - no flush ]
                                              user writes using cached RW PTE
      The same type of race exists for reads when protecting for PROT_NONE and
      also exists for operations that can leave an old TLB entry behind such
      as munmap, mremap and madvise.
      For some operations like mprotect, it's not necessarily a data integrity
      issue but it is a correctness issue as there is a window where an
      mprotect that limits access still allows access.  For munmap, it's
      potentially a data integrity issue although the race is massive as an
      munmap, mmap and return to userspace must all complete between the
      window when reclaim drops the PTL and flushes the TLB.  However, it's
      theoritically possible so handle this issue by flushing the mm if
      reclaim is potentially currently batching TLB flushes.
      Other instances where a flush is required for a present pte should be ok
      as either the page lock is held preventing parallel reclaim or a page
      reference count is elevated preventing a parallel free leading to
      corruption.  In the case of page_mkclean there isn't an obvious path
      that userspace could take advantage of without using the operations that
      are guarded by this patch.  Other users such as gup as a race with
      reclaim looks just at PTEs.  huge page variants should be ok as they
      don't race with reclaim.  mincore only looks at PTEs.  userfault also
      should be ok as if a parallel reclaim takes place, it will either fault
      the page back in or read some of the data before the flush occurs
      triggering a fault.
      Note that a variant of this patch was acked by Andy Lutomirski but this
      was for the x86 parts on top of his PCID work which didn't make the 4.13
      merge window as expected.  His ack is dropped from this version and
      there will be a follow-on patch on top of PCID that will include his
      [akpm@linux-foundation.org: tweak comments]
      [akpm@linux-foundation.org: fix spello]
      Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de
      Reported-by: default avatarNadav Amit <nadav.amit@gmail.com>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: <stable@vger.kernel.org>	[v4.4+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  11. 30 Jun, 2017 1 commit
    • Kees Cook's avatar
      randstruct: Mark various structs for randomization · 3859a271
      Kees Cook authored
      This marks many critical kernel structures for randomization. These are
      structures that have been targeted in the past in security exploits, or
      contain functions pointers, pointers to function pointer tables, lists,
      workqueues, ref-counters, credentials, permissions, or are otherwise
      sensitive. This initial list was extracted from Brad Spengler/PaX Team's
      code in the last public patch of grsecurity/PaX based on my understanding
      of the code. Changes or omissions from the original code are mine and
      don't reflect the original grsecurity/PaX code.
      Left out of this list is task_struct, which requires special handling
      and will be covered in a subsequent patch.
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
  12. 13 Mar, 2017 1 commit
    • Dmitry Safonov's avatar
      x86/mm: Introduce mmap_compat_base() for 32-bit mmap() · 1b028f78
      Dmitry Safonov authored
      mmap() uses a base address, from which it starts to look for a free space
      for allocation.
      The base address is stored in mm->mmap_base, which is calculated during
      exec(). The address depends on task's size, set rlimit for stack, ASLR
      randomization. The base depends on the task size and the number of random
      bits which are different for 64-bit and 32bit applications.
      Due to the fact, that the base address is fixed, its mmap() from a compat
      (32bit) syscall issued by a 64bit task will return a address which is based
      on the 64bit base address and does not fit into the 32bit address space
      (4GB). The returned pointer is truncated to 32bit, which results in an
      invalid address.
      To solve store a seperate compat address base plus a compat legacy address
      base in mm_struct. These bases are calculated at exec() time and can be
      used later to address the 32bit compat mmap() issued by 64 bit
      As a consequence of this change 32-bit applications issuing a 64-bit
      syscall (after doing a long jump) will get a 64-bit mapping now. Before
      this change 32-bit applications always got a 32bit mapping.
      [ tglx: Massaged changelog and added a comment ]
      Signed-off-by: default avatarDmitry Safonov <dsafonov@virtuozzo.com>
      Cc: 0x7f454c46@gmail.com
      Cc: linux-mm@kvack.org
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Link: http://lkml.kernel.org/r/20170306141721.9188-4-dsafonov@virtuozzo.com
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
  13. 03 Mar, 2017 2 commits
  14. 02 Mar, 2017 2 commits
  15. 28 Feb, 2017 1 commit
  16. 17 Dec, 2016 1 commit
  17. 22 Nov, 2016 1 commit
    • Eric W. Biederman's avatar
      mm: Add a user_ns owner to mm_struct and fix ptrace permission checks · bfedb589
      Eric W. Biederman authored
      During exec dumpable is cleared if the file that is being executed is
      not readable by the user executing the file.  A bug in
      ptrace_may_access allows reading the file if the executable happens to
      enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
      unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER).
      This problem is fixed with only necessary userspace breakage by adding
      a user namespace owner to mm_struct, captured at the time of exec, so
      it is clear in which user namespace CAP_SYS_PTRACE must be present in
      to be able to safely give read permission to the executable.
      The function ptrace_may_access is modified to verify that the ptracer
      has CAP_SYS_ADMIN in task->mm->user_ns instead of task->cred->user_ns.
      This ensures that if the task changes it's cred into a subordinate
      user namespace it does not become ptraceable.
      The function ptrace_attach is modified to only set PT_PTRACE_CAP when
      CAP_SYS_PTRACE is held over task->mm->user_ns.  The intent of
      PT_PTRACE_CAP is to be a flag to note that whatever permission changes
      the task might go through the tracer has sufficient permissions for
      it not to be an issue.  task->cred->user_ns is always the same
      as or descendent of mm->user_ns.  Which guarantees that having
      CAP_SYS_PTRACE over mm->user_ns is the worst case for the tasks
      To prevent regressions mm->dumpable and mm->user_ns are not considered
      when a task has no mm.  As simply failing ptrace_may_attach causes
      regressions in privileged applications attempting to read things
      such as /proc/<pid>/stat
      Cc: stable@vger.kernel.org
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Tested-by: default avatarCyrill Gorcunov <gorcunov@openvz.org>
      Fixes: 8409cca7
       ("userns: allow ptrace from non-init user namespaces")
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
  18. 08 Oct, 2016 1 commit
    • Michal Hocko's avatar
      kernel, oom: fix potential pgd_lock deadlock from __mmdrop · 7283094e
      Michal Hocko authored
      Lockdep complains that __mmdrop is not safe from the softirq context:
        [ INFO: inconsistent lock state ]
        4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949 Tainted: G        W
        inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
        swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
         (pgd_lock){+.?...}, at: pgd_free+0x19/0x6b
        {SOFTIRQ-ON-W} state was registered at:
        irq event stamp: 105916
        hardirqs last  enabled at (105916): free_hot_cold_page+0x37e/0x390
        hardirqs last disabled at (105915): free_hot_cold_page+0x2c1/0x390
        softirqs last  enabled at (105878): _local_bh_enable+0x42/0x44
        softirqs last disabled at (105879): irq_exit+0x6f/0xd1
        other info that might help us debug this:
         Possible unsafe locking scenario:
         *** DEADLOCK ***
        1 lock held by swapper/1/0:
         #0:  (rcu_callback){......}, at: rcu_process_callbacks+0x390/0x800
        stack backtrace:
        CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W       4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
        Call Trace:
      More over commit a79e53d8 ("x86/mm: Fix pgd_lock deadlock") was
      explicit about pgd_lock not to be called from the irq context.  This
      means that __mmdrop called from free_signal_struct has to be postponed
      to a user context.  We already have a similar mechanism for mmput_async
      so we can use it here as well.  This is safe because mm_count is pinned
      by mm_users.
      This fixes bug introduced by "oom: keep mm of the killed task available"
      Link: http://lkml.kernel.org/r/1472119394-11342-5-git-send-email-mhocko@kernel.org
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  19. 28 Jul, 2016 1 commit
  20. 26 Jul, 2016 2 commits
  21. 08 Jul, 2016 1 commit
    • Dmitry Safonov's avatar
      x86/vdso: Add mremap hook to vm_special_mapping · b059a453
      Dmitry Safonov authored
      Add possibility for 32-bit user-space applications to move
      the vDSO mapping.
      Previously, when a user-space app called mremap() for the vDSO
      address, in the syscall return path it would land on the previous
      address of the vDSOpage, resulting in segmentation violation.
      Now it lands fine and returns to userspace with a remapped vDSO.
      This will also fix the context.vdso pointer for 64-bit, which does
      not affect the user of vDSO after mremap() currently, but this
      may change in the future.
      As suggested by Andy, return -EINVAL for mremap() that would
      split the vDSO image: that operation cannot possibly result in
      a working system so reject it.
      Renamed and moved the text_mapping structure declaration inside
      map_vdso(), as it used only there and now it complements the
      vvar_mapping variable.
      There is still a problem for remapping the vDSO in glibc
      applications: the linker relocates addresses for syscalls
      on the vDSO page, so you need to relink with the new
      Without that the next syscall through glibc may fail:
        Program received signal SIGSEGV, Segmentation fault.
        #0  0xf7fd9b80 in __kernel_vsyscall ()
        #1  0xf7ec8238 in _exit () from /usr/lib32/libc.so.6
      Signed-off-by: default avatarDmitry Safonov <dsafonov@virtuozzo.com>
      Acked-by: default avatarAndy Lutomirski <luto@kernel.org>
      Cc: 0x7f454c46@gmail.com
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160628113539.13606-2-dsafonov@virtuozzo.com
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>