1. 06 Jun, 2017 2 commits
    • compiler, clang: suppress warning for unused static inline functions · abb2ea7d
      David Rientjes authored
      GCC explicitly does not warn for unused static inline functions for
      -Wunused-function.  The manual states:
      	Warn whenever a static function is declared but not defined or
      	a non-inline static function is unused.
      Clang does warn for static inline functions that are unused.
      It turns out that suppressing the warnings avoids potentially complex
      #ifdef directives, which also reduces LOC.
      Suppress the warning for clang.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
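      The effect described in this commit can be sketched in user space. This is
      only an illustration, not the kernel header change itself: the macro name
      MAYBE_UNUSED and both helper functions are hypothetical, and the sketch
      assumes a GCC-compatible compiler that accepts __attribute__((unused)).

```c
/* Sketch only: user-space illustration, not the kernel's actual header change. */
#if defined(__GNUC__) || defined(__clang__)
#define MAYBE_UNUSED __attribute__((unused))	/* hypothetical macro name */
#else
#define MAYBE_UNUSED
#endif

/*
 * A helper that nothing in this file calls. Per the commit message, clang's
 * -Wunused-function reports such a function unless it is marked unused,
 * while GCC stays silent because the function is static inline.
 */
static inline MAYBE_UNUSED int debug_only_helper(int x)
{
	return x * 2;
}

/* An ordinary helper, shown for contrast; it carries no attribute. */
static inline int clamp_to_byte(int v)
{
	if (v < 0)
		return 0;
	return v > 255 ? 255 : v;
}
```

      Marking the helper once at its definition avoids wrapping it in the same
      #ifdef conditions as its callers, which is the LOC saving the message
      refers to.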
    • Merge tag 'media/v4.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media · 84c6c303
      Linus Torvalds authored
      Pull media fixes from Mauro Carvalho Chehab:
       "Some bug fixes:
         - Don't fail build if atomisp has warnings
         - Some CEC Kconfig changes to allow it to be used by DRM without
           media dependencies
         - A race fix at RC initialization code
         - A driver fix at rainshadow-cec
        IMHO, the one that affects most people in this series is a build fix:
        if you try to build the Kernel with W=1 or using gcc7 and
        all[yes|mod]config, build will fail due to -Werror at atomisp
      * tag 'media/v4.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
        [media] rc-core: race condition during ir_raw_event_register()
        [media] cec: drop MEDIA_CEC_DEBUG
        [media] cec: rename MEDIA_CEC_NOTIFIER to CEC_NOTIFIER
        [media] cec: select CEC_CORE instead of depend on it
        [media] rainshadow-cec: ensure exit_loop is intialized
        [media] atomisp: don't treat warnings as errors
  2. 05 Jun, 2017 5 commits
    • Merge branch 'for-4.12-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · ba7b2387
      Linus Torvalds authored
      Pull cgroup fixes from Tejun Heo:
       "Two cgroup fixes. One to address RCU delay of cpuset removal affecting
        userland visible behaviors. The other fixes a race condition between
        controller disable and cgroup removal"
      * 'for-4.12-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
        cpuset: consider dying css as offline
        cgroup: Prevent kill_css() from being called more than once
    • Merge branch 'for-4.12-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata · e543c8a9
      Linus Torvalds authored
      Pull libata fixes from Tejun Heo:
       - Revert of sata_mv devm_ioremap_resource() conversion. It made init
         fail if there are overlapping resources which led to detection
         failures on some setups.
       - A workaround for an Acer laptop which sometimes reports corrupt port
       - Other non-critical fixes.
      * 'for-4.12-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
        libata: fix error checking in in ata_parse_force_one()
        Revert "ata: sata_mv: Convert to devm_ioremap_resource()"
        ata: libahci: properly propagate return value of platform_get_irq()
        ata: sata_rcar: Handle return value of clk_prepare_enable
        ahci: Acer SA5-271 SSD Not Detected Fix
    • Merge branch 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm · 112eb072
      Linus Torvalds authored
      Pull ARM fixes from Russell King:
       "Three fixes this time around:
         - Two fixes for noMMU, fixing the decompressor header layout, and
           preventing a build error with some configurations.
         - Fixing the hyp-stub updates that went in during the merge window
           for platforms that use MCPM"
      * 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm:
        ARM: 8677/1: boot/compressed: fix decompressor header layout for v7-M
        ARM: 8676/1: NOMMU: provide pgprot_device() macro
        ARM: 8675/1: MCPM: ensure not to enter __hyp_soft_restart from loopback and cpu_power_down
    • ARM: 8677/1: boot/compressed: fix decompressor header layout for v7-M · 06a4b6d0
      Ard Biesheuvel authored
      As reported by Patrice, the header layout of the decompressor is
      incorrect when building for v7-M. In this case, the __nop macro
      resolves to 'mov r0, r0', which is emitted as a narrow encoding,
      resulting in the header data fields to end up at lower offsets than
      Given the variety of targets we need to support with the same code,
      the startup sequence is a bit of a jumble, and uses instructions
      and macros whose encoding widths cannot be specified (badr), or only
      exist in a narrow encoding (bx)
      So force the use of a wide encoding in __nop, and replace the start
      sequence with a simple jump to the label marking the start of code,
      preceded by a Thumb2 mode switch if required (using explicit wide
      encodings where appropriate). The label itself can be moved to the
      start of code [where it belongs] due to the larger range of branch
      instructions as compared to adr instructions.
      Reported-by: Patrice CHOTARD <patrice.chotard@st.com>
      Acked-by: Nicolas Pitre <nico@linaro.org>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
    • ARM: 8676/1: NOMMU: provide pgprot_device() macro · 7ef4783e
      Vladimir Murzin authored
      NOMMU build leads to the following error:
        CC      drivers/pci/mmap.o
      drivers/pci/mmap.c: In function 'pci_mmap_resource_range':
      drivers/pci/mmap.c:60:3: error: implicit declaration of function 'pgprot_device' [-Werror=implicit-function-declaration]
         vma->vm_page_prot = pgprot_device(vma->vm_page_prot);
      cc1: some warnings being treated as errors
      scripts/Makefile.build:302: recipe for target 'drivers/pci/mmap.o' failed
      make[2]: *** [drivers/pci/mmap.o] Error 1
      scripts/Makefile.build:561: recipe for target 'drivers/pci' failed
      make[1]: *** [drivers/pci] Error 2
      Makefile:1016: recipe for target 'drivers' failed
      make: *** [drivers] Error 2
      Fix it with support of pgprot_device() macro for NOMMU.
      Fixes: 00d2904f ("ARM/PCI: Use generic pci_mmap_resource_range()")
      Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
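      The kind of fallback this commit adds can be sketched as follows. The
      patch itself is not quoted on this page, so everything here is an
      assumption made for illustration: the mock_ names are hypothetical, and
      the choice of an empty protection value for NOMMU is a guess based on
      there being no page attributes to adjust without an MMU.

```c
/* Sketch only: mock types standing in for the kernel's pgprot_t machinery. */
typedef struct { unsigned long bits; } mock_pgprot_t;	/* hypothetical */

#define MOCK_PGPROT(x)	((mock_pgprot_t){ (x) })

/*
 * Without an MMU there are no page-protection attributes to adjust, so the
 * device variant can ignore its argument and yield an empty protection value.
 * (Assumed behaviour; the real macro body is not shown in the commit text.)
 */
#ifndef mock_pgprot_device
#define mock_pgprot_device(prot)	MOCK_PGPROT(0)
#endif

/* Generic caller, analogous to the assignment quoted in the build error. */
static mock_pgprot_t device_mapping_prot(mock_pgprot_t current)
{
	(void)current;	/* the fallback above does not use the old value */
	return mock_pgprot_device(current);
}
```

      With the macro defined for every configuration, generic code such as the
      quoted line in drivers/pci/mmap.c compiles without a per-architecture
      #ifdef.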
  3. 04 Jun, 2017 15 commits
  4. 03 Jun, 2017 8 commits
  5. 02 Jun, 2017 10 commits
    • Merge tag 'acpi-4.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 104c08ba
      Linus Torvalds authored
      Pull ACPI fixes from Rafael Wysocki:
       "These revert one more problematic commit related to the ACPI-based
        handling of laptop lids and make some useless error messages coming
        from ACPICA go away.
         - Revert one more commit related to the ACPI-based handling of laptop
           lids that changed the default behavior on laptops that booted with
           closed lids and introduced a regression there (Benjamin Tissoires).
         - Add a missing acpi_put_table() to the code implementing the
           /sys/firmware/acpi/tables interface to prevent a counter in the
           ACPICA core from overflowing (Dan Williams).
         - Drop error messages printed by ACPICA on acpi_get_table() reference
           counting mismatches as they need not indicate real errors at this
           point (Lv Zheng)"
      * tag 'acpi-4.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        ACPICA: Tables: Fix regression introduced by a too early mechanism enabling
        Revert "ACPI / button: Change default behavior to lid_init_state=open"
        ACPI / sysfs: fix acpi_get_table() leak / acpi-sysfs denial of service
    • Merge tag 'pm-4.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 89af529a
      Linus Torvalds authored
      Pull power management fixes from Rafael Wysocki:
       "These fix two bugs in error code paths in the cpufreq core and in the
        kirkwood-cpufreq driver.
         - Make cpufreq_register_driver() return an error if the ->init()
           calls fail for all CPUs to prevent non-functional drivers from
           hanging around for no reason (David Arcari).
         - Make kirkwood-cpufreq check the return value of
           clk_prepare_enable() (which may fail) as appropriate (Arvind
      * tag 'pm-4.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        cpufreq: kirkwood-cpufreq:- Handle return value of clk_prepare_enable()
        cpufreq: cpufreq_register_driver() should return -ENODEV if init fails
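      The cpufreq_register_driver() behaviour described in the pull message
      can be sketched in user space. This is a minimal model, not the cpufreq
      core: struct mock_driver, mock_register_driver() and the sample
      callbacks are hypothetical, and the per-CPU loop is an assumption about
      how "all ->init() calls fail" might be detected.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical driver type: init() returns 0 on success, negative on failure. */
struct mock_driver {
	int (*init)(unsigned int cpu);
};

/*
 * Try to initialise every CPU. Succeed if at least one ->init() call works;
 * otherwise report -ENODEV so a non-functional driver is not left registered.
 */
static int mock_register_driver(const struct mock_driver *drv, unsigned int nr_cpus)
{
	unsigned int cpu, usable = 0;

	if (drv == NULL || drv->init == NULL)
		return -EINVAL;

	for (cpu = 0; cpu < nr_cpus; cpu++)
		if (drv->init(cpu) == 0)
			usable++;

	return usable ? 0 : -ENODEV;
}

/* Two sample init callbacks for exercising the function. */
static int init_always_fails(unsigned int cpu) { (void)cpu; return -EIO; }
static int init_only_cpu0(unsigned int cpu) { return cpu == 0 ? 0 : -EIO; }

static const struct mock_driver failing_driver = { .init = init_always_fails };
static const struct mock_driver partial_driver = { .init = init_only_cpu0 };
```

      Registering failing_driver yields -ENODEV, while partial_driver
      registers successfully because one CPU initialised.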
    • Merge tag 'random_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random · 5a4829b5
      Linus Torvalds authored
      Pull /dev/random bug fix from Ted Ts'o:
       "Fix a race on architectures with prioritized interrupts (such as m68k)
        which can cause crashes in drivers/char/random.c:get_reg()"
      * tag 'random_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
        fix race in drivers/char/random.c:get_reg()
    • Merge branch 'akpm' (patches from Andrew) · f2197649
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "15 fixes"
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        scripts/gdb: make lx-dmesg command work (reliably)
        mm: consider memblock reservations for deferred memory initialization sizing
        mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified
        mlock: fix mlock count can not decrease in race condition
        mm/migrate: fix refcount handling when !hugepage_migration_supported()
        dax: fix race between colliding PMD & PTE entries
        mm: avoid spurious 'bad pmd' warning messages
        mm/page_alloc.c: make sure OOM victim can try allocations with no watermarks once
        pcmcia: remove left-over %Z format
        slub/memcg: cure the brainless abuse of sysfs attributes
        initramfs: fix disabling of initramfs (and its compression)
        mm: clarify why we want kmalloc before falling backto vmallock
        frv: declare jiffies to be located in the .data section
        include/linux/gfp.h: fix ___GFP_NOLOCKDEP value
        ksm: prevent crash after write_protect_page fails
    • scripts/gdb: make lx-dmesg command work (reliably) · d6c97087
      André Draszik authored
      lx-dmesg needs access to the log_buf symbol from printk.c.
      Unfortunately, the symbol log_buf also exists in BPF's verifier.c and
      hence gdb can pick one or the other.  If it happens to pick BPF's
      log_buf, lx-dmesg doesn't work:
        (gdb) lx-dmesg
        Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x0:
        Error occurred in Python command: Cannot access memory at address 0x0
        (gdb) p log_buf
        $15 = 0x0
      Luckily, GDB has a way to deal with this, see
        (gdb) info variables ^log_buf$
        All variables matching regular expression "^log_buf$":
        File <linux.git>/kernel/bpf/verifier.c:
        static char *log_buf;
        File <linux.git>/kernel/printk/printk.c:
        static char *log_buf;
        (gdb) p 'verifier.c'::log_buf
        $1 = 0x0
        (gdb) p 'printk.c'::log_buf
        $2 = 0x811a6aa0 <__log_buf> ""
        (gdb) p &log_buf
        $3 = (char **) 0x8120fe40 <log_buf>
        (gdb) p &'verifier.c'::log_buf
        $4 = (char **) 0x8120fe40 <log_buf>
        (gdb) p &'printk.c'::log_buf
        $5 = (char **) 0x8048b7d0 <log_buf>
      By being explicit about the location of the symbol, we can make lx-dmesg
      work again.  While at it, do the same for the other symbols we need from
      Link: http://lkml.kernel.org/r/20170526112222.3414-1-git@andred.net
      Signed-off-by: André Draszik <git@andred.net>
      Tested-by: Kieran Bingham <kieran@bingham.xyz>
      Acked-by: Jan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: consider memblock reservations for deferred memory initialization sizing · 864b9a39
      Michal Hocko authored
      We have seen an early OOM killer invocation on ppc64 systems with
      	kthreadd invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=7, order=0, oom_score_adj=0
      	kthreadd cpuset=/ mems_allowed=7
      	CPU: 0 PID: 2 Comm: kthreadd Not tainted 4.4.68-1.gd7fe927-default #1
      	Call Trace:
      	  dump_stack+0xb0/0xf0 (unreliable)
      	active_anon:0 inactive_anon:0 isolated_anon:0
      	 active_file:0 inactive_file:0 isolated_file:0
      	 unevictable:0 dirty:0 writeback:0 unstable:0
      	 slab_reclaimable:5 slab_unreclaimable:73
      	 mapped:0 shmem:0 pagetables:0 bounce:0
      	 free:0 free_pcp:0 free_cma:0
      	Node 7 DMA free:0kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:52428800kB managed:110016kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:320kB slab_unreclaimable:4672kB kernel_stack:1152kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
      	lowmem_reserve[]: 0 0 0 0
      	Node 7 DMA: 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 0kB
      	0 total pagecache pages
      	0 pages in swap cache
      	Swap cache stats: add 0, delete 0, find 0/0
      	Free swap  = 0kB
      	Total swap = 0kB
      	819200 pages RAM
      	0 pages HighMem/MovableOnly
      	817481 pages reserved
      	0 pages cma reserved
      	0 pages hwpoisoned
      the reason is that the managed memory is too low (only 110MB) while the
      rest of the 50GB is still waiting for the deferred initialization to
      be done.  update_defer_init estimates the initial memory to initialize
      to 2GB at least but it doesn't consider any memory allocated in that
      range.  In this particular case we've had
      	Reserving 4096MB of memory at 128MB for crashkernel (System RAM: 51200MB)
      so the low 2GB is mostly depleted.
      Fix this by considering memblock allocations in the initial static
      initialization estimation.  Move the max_initialise to
      reset_deferred_meminit and implement a simple memblock_reserved_memory
      helper which iterates all reserved blocks and sums the size of all that
      start below the given address.  The cumulative size is then added on top
      of the initial estimation.  This is still not ideal because
      reset_deferred_meminit doesn't consider holes and so reservation might
      be above the initial estimation which we ignore but let's make the
      logic simpler until we really need to handle more complicated cases.
      Fixes: 3a80a7fa ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
      Link: http://lkml.kernel.org/r/20170531104010.GI27783@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>	[4.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
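      The helper described in this commit (sum the sizes of all reserved
      blocks that start below a given address, then add that sum to the
      initial estimate) can be sketched in plain C. The struct, the function
      names and the second sample region are hypothetical; only the 4096MB
      reservation at 128MB and the 2GB estimate come from the message.

```c
#include <stddef.h>
#include <stdint.h>

#define MB (1024ULL * 1024ULL)

/* Hypothetical stand-in for one reserved memblock region. */
struct mock_region {
	uint64_t base;
	uint64_t size;
};

/*
 * Sum the sizes of all reserved regions that start below 'limit'.
 * As in the commit message, a region counts in full if its base is below
 * the limit, even when it extends past it.
 */
static uint64_t reserved_below(const struct mock_region *regions, size_t count,
			       uint64_t limit)
{
	uint64_t total = 0;

	for (size_t i = 0; i < count; i++)
		if (regions[i].base < limit)
			total += regions[i].size;
	return total;
}

/* Initial estimate plus whatever is already reserved below it. */
static uint64_t initial_init_size(const struct mock_region *regions, size_t count,
				  uint64_t estimate)
{
	return estimate + reserved_below(regions, count, estimate);
}

static const struct mock_region sample_regions[] = {
	{ 128 * MB, 4096 * MB },	/* the crashkernel reservation from the log */
	{ 8192 * MB, 64 * MB },		/* made-up region above the estimate */
};
```

      For the sample data, a 2048MB estimate grows to 6144MB, because the
      crashkernel region starts below 2GB and is counted in full.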
    • mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified · 9a291a7c
      James Morse authored
      KVM uses get_user_pages() to resolve its stage2 faults.  KVM sets the
      FOLL_HWPOISON flag causing faultin_page() to return -EHWPOISON when it
      finds a VM_FAULT_HWPOISON.  KVM handles these hwpoison pages as a
      special case.  (check_user_page_hwpoison())
      When huge pages are involved, this doesn't work so well.
      get_user_pages() calls follow_hugetlb_page(), which stops early if it
      receives VM_FAULT_HWPOISON from hugetlb_fault(), eventually returning
      -EFAULT to the caller.  The step to map this to -EHWPOISON based on the
      FOLL_ flags is missing.  The hwpoison special case is skipped, and
      -EFAULT is returned to user-space, causing Qemu or kvmtool to exit.
      Instead, move this VM_FAULT_ to errno mapping code into a header file
      and use it from faultin_page() and follow_hugetlb_page().
      With this, KVM works as expected.
      This isn't a problem for arm64 today as we haven't enabled
      MEMORY_FAILURE, but I can't see any reason this doesn't happen on x86
      too, so I think this should be a fix.  This doesn't apply earlier than
      stable's v4.11.1 due to all sorts of cleanup.
      [james.morse@arm.com: add vm_fault_to_errno() call to faultin_page()]
        Link: http://lkml.kernel.org/r/20170525171035.16359-1-james.morse@arm.com
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20170524160900.28786-1-james.morse@arm.com
      Signed-off-by: James Morse <james.morse@arm.com>
      Acked-by: Punit Agrawal <punit.agrawal@arm.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>	[4.11.1+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
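      The shared mapping step this commit describes, turning a fault result
      into an errno while honouring the FOLL_HWPOISON flag, might look roughly
      like the sketch below. All MOCK_ constants are placeholder values, not
      the kernel's real bit assignments, and the function is a user-space
      approximation rather than the helper added by the patch.

```c
#include <errno.h>

/* Mock flag values: placeholders, not the kernel's real definitions. */
#define MOCK_VM_FAULT_OOM	0x1
#define MOCK_VM_FAULT_SIGBUS	0x2
#define MOCK_VM_FAULT_HWPOISON	0x4
#define MOCK_FOLL_HWPOISON	0x100
#define MOCK_EHWPOISON		133	/* placeholder errno value */

/*
 * Map a fault result to a negative errno. A hwpoison fault becomes
 * -EHWPOISON only when the caller asked for that with FOLL_HWPOISON;
 * otherwise it degrades to -EFAULT.
 */
static int mock_vm_fault_to_errno(int vm_fault, int foll_flags)
{
	if (vm_fault & MOCK_VM_FAULT_OOM)
		return -ENOMEM;
	if (vm_fault & MOCK_VM_FAULT_HWPOISON)
		return (foll_flags & MOCK_FOLL_HWPOISON) ? -MOCK_EHWPOISON : -EFAULT;
	if (vm_fault & MOCK_VM_FAULT_SIGBUS)
		return -EFAULT;
	return 0;
}
```

      Calling one helper from both the regular and the hugetlb path keeps the
      two in agreement, which is the gap the message identifies in
      follow_hugetlb_page().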
    • mlock: fix mlock count can not decrease in race condition · 70feee0e
      Yisheng Xie authored
      Kefeng reported that when running the following test, the mlock count in
      meminfo will increase permanently:
       [1] testcase
       linux:~ # cat test_mlockal
       grep Mlocked /proc/meminfo
        for j in `seq 0 10`
        do
       	for i in `seq 4 15`
       	do
       		./p_mlockall >> log &
       	done
       	sleep 0.2
        done
       # wait some time to let mlock counter decrease and 5s may not be enough
       sleep 5
       grep Mlocked /proc/meminfo
       linux:~ # cat p_mlockall.c
       #include <sys/mman.h>
       #include <stdlib.h>
       #include <stdio.h>
       #define SPACE_LEN	4096
       int main(int argc, char ** argv)
       {
      	 	int ret;
      	 	void *adr = malloc(SPACE_LEN);
      	 	if (!adr)
      	 		return -1;
      	 	ret = mlockall(MCL_CURRENT | MCL_FUTURE);
      	 	printf("mlcokall ret = %d\n", ret);
      	 	ret = munlockall();
      	 	printf("munlcokall ret = %d\n", ret);
      	 	return 0;
       }
      In __munlock_pagevec() we should decrement NR_MLOCK for each page where
      we clear the PageMlocked flag.  Commit 1ebb7cc6 ("mm: munlock: batch
      NR_MLOCK zone state updates") has introduced a bug where we don't
      decrement NR_MLOCK for pages where we clear the flag, but fail to
      isolate them from the lru list (e.g.  when the pages are on some other
      cpu's percpu pagevec).  Since PageMlocked stays cleared, the NR_MLOCK
      accounting gets permanently disrupted by this.
      Fix it by counting the number of pages whose PageMlocked flag is cleared.
      Fixes: 1ebb7cc6 ("mm: munlock: batch NR_MLOCK zone state updates")
      Link: http://lkml.kernel.org/r/1495678405-54569-1-git-send-email-xieyisheng1@huawei.com
      Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
      Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joern Engel <joern@logfs.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: zhongjiang <zhongjiang@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
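      The accounting rule from this commit (decrement by the number of pages
      whose flag was cleared, not by the number successfully isolated) can be
      modelled in user space. struct mock_page and munlock_batch() are
      hypothetical; the on_lru field is a simplification of "isolation from
      the lru list succeeded".

```c
#include <stdbool.h>
#include <stddef.h>

struct mock_page {
	bool mlocked;	/* stands in for the PageMlocked flag */
	bool on_lru;	/* false models a page parked in another CPU's pagevec */
};

/*
 * Clear the mlocked flag on each page and return how many flags were
 * actually cleared. That count, not the number of pages isolated from the
 * LRU, is what the mlock counter must drop by.
 */
static long munlock_batch(struct mock_page *pages, size_t nr, size_t *isolated)
{
	long cleared = 0;

	*isolated = 0;
	for (size_t i = 0; i < nr; i++) {
		if (!pages[i].mlocked)
			continue;	/* flag already clear: nothing to account */
		pages[i].mlocked = false;
		cleared++;
		if (pages[i].on_lru)
			(*isolated)++;	/* may fail; accounting must not depend on it */
	}
	return cleared;
}
```

      In a batch where one mlocked page is off the LRU, the return value is
      still the full number of cleared flags, so the counter cannot drift
      upward the way the test case above shows.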
    • mm/migrate: fix refcount handling when !hugepage_migration_supported() · 30809f55
      Punit Agrawal authored
      On failing to migrate a page, soft_offline_huge_page() performs the
      necessary update to the hugepage ref-count.
      But when !hugepage_migration_supported(), unmap_and_move_hugepage()
      also decrements the page ref-count for the hugepage.  The combined
      behaviour leaves the ref-count in an inconsistent state.
      This leads to soft lockups when running the overcommitted hugepage test
      from mce-tests suite.
        Soft offlining pfn 0x83ed600 at process virtual address 0x400000000000
        soft offline: 0x83ed600: migration failed 1, type 1fffc00000008008 (uptodate|head)
        INFO: rcu_preempt detected stalls on CPUs/tasks:
         Tasks blocked on level-0 rcu_node (CPUs 0-7): P2715
          (detected by 7, t=5254 jiffies, g=963, c=962, q=321)
          thugetlb_overco R  running task        0  2715   2685 0x00000008
          Call trace:
      Address this by changing the putback_active_hugepage() in
      soft_offline_huge_page() to putback_movable_pages().
      This only triggers on systems that enable memory failure handling
      (ARCH_SUPPORTS_MEMORY_FAILURE) but not hugepage migration
      I imagine this wasn't triggered as there aren't many systems running
      this configuration.
      [akpm@linux-foundation.org: remove dead comment, per Naoya]
      Link: http://lkml.kernel.org/r/20170525135146.32011-1-punit.agrawal@arm.com
      Reported-by: Manoj Iyer <manoj.iyer@canonical.com>
      Tested-by: Manoj Iyer <manoj.iyer@canonical.com>
      Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>	[3.14+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax: fix race between colliding PMD & PTE entries · e2093926
      Ross Zwisler authored
      We currently have two related PMD vs PTE races in the DAX code.  These
      can both be easily triggered by having two threads reading and writing
      simultaneously to the same private mapping, with the key being that
      private mapping reads can be handled with PMDs but private mapping
      writes are always handled with PTEs so that we can COW.
      Here is the first race:
        CPU 0					CPU 1
        (private mapping write)
          create_huge_pmd() - FALLBACK
            passes check for pmd_devmap()
      					(private mapping read)
      					    dax_iomap_pmd_fault() inserts PMD
            dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD
            			  installed in our page tables at this spot.
      Here's the second race:
        CPU 0					CPU 1
        (private mapping read)
          passes check for pmd_none()
            dax_iomap_pmd_fault() inserts PMD
        (private mapping write)
          create_huge_pmd() - FALLBACK
      					(private mapping read)
      					  passes check for pmd_none()
            dax_iomap_pte_fault() inserts PTE
      					    dax_iomap_pmd_fault() inserts PMD,
      					       but we already have a PTE at
      					       this spot.
      The core of the issue is that while there is isolation between faults to
      the same range in the DAX fault handlers via our DAX entry locking,
      there is no isolation between faults in the code in mm/memory.c.  This
      means for instance that this code in __handle_mm_fault() can run:
      	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
      		ret = create_huge_pmd(&vmf);
      But by the time we actually get to run the fault handler called by
      create_huge_pmd(), the PMD is no longer pmd_none() because a racing PTE
      fault has installed a normal PMD here as a parent.  This is the cause of
      the 2nd race.  The first race is similar - there is the following check
      in handle_pte_fault():
      	} else {
      		/* See comment in pte_alloc_one_map() */
      		if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
      			return 0;
      So if a pmd_devmap() PMD (a DAX PMD) has been installed at vmf->pmd, we
      will bail and retry the fault.  This is correct, but there is nothing
      preventing the PMD from being installed after this check but before we
      actually get to the DAX PTE fault handlers.
      In my testing these races result in the following types of errors:
        BUG: Bad rss-counter state mm:ffff8800a817d280 idx:1 val:1
        BUG: non-zero nr_ptes on freeing mm: 15
      Fix this issue by having the DAX fault handlers verify that it is safe
      to continue their fault after they have taken an entry lock to block
      other racing faults.
      [ross.zwisler@linux.intel.com: improve fix for colliding PMD & PTE entries]
        Link: http://lkml.kernel.org/r/20170526195932.32178-1-ross.zwisler@linux.intel.com
      Link: http://lkml.kernel.org/r/20170522215749.23516-2-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: Pawel Lebioda <pawel.lebioda@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Pawel Lebioda <pawel.lebioda@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Xiong Zhou <xzhou@redhat.com>
      Cc: Eryu Guan <eguan@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
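      The fix strategy in this commit, re-validating the PMD state after
      taking the entry lock, can be sketched with a single-threaded mock. The
      types, state names and result codes are all hypothetical, and the
      boolean entry_locked only stands in for the DAX entry lock; what the
      real handlers return in each case is not stated in the message.

```c
#include <stdbool.h>

enum mock_pmd_state { MOCK_PMD_NONE, MOCK_PMD_PTE_TABLE, MOCK_PMD_HUGE_DAX };
enum mock_fault_result { MOCK_FAULT_DONE, MOCK_FAULT_RETRY, MOCK_FAULT_FALLBACK };

struct mock_range {
	enum mock_pmd_state pmd;
	bool entry_locked;	/* stands in for the DAX entry lock */
};

static void lock_entry(struct mock_range *r)   { r->entry_locked = true; }
static void unlock_entry(struct mock_range *r) { r->entry_locked = false; }

/* PTE fault: after locking, back off if a huge DAX PMD appeared meanwhile. */
static enum mock_fault_result mock_pte_fault(struct mock_range *r)
{
	enum mock_fault_result ret = MOCK_FAULT_DONE;

	lock_entry(r);
	if (r->pmd == MOCK_PMD_HUGE_DAX)
		ret = MOCK_FAULT_RETRY;		/* retry against the existing PMD */
	else
		r->pmd = MOCK_PMD_PTE_TABLE;	/* safe to install a PTE here */
	unlock_entry(r);
	return ret;
}

/* PMD fault: after locking, fall back if a PTE table was installed meanwhile. */
static enum mock_fault_result mock_pmd_fault(struct mock_range *r)
{
	enum mock_fault_result ret = MOCK_FAULT_DONE;

	lock_entry(r);
	if (r->pmd == MOCK_PMD_PTE_TABLE)
		ret = MOCK_FAULT_FALLBACK;	/* do not overwrite the PTE table */
	else
		r->pmd = MOCK_PMD_HUGE_DAX;
	unlock_entry(r);
	return ret;
}
```

      The earlier checks in mm/memory.c stay in place; the point of the sketch
      is that the decisive check is the one made while the lock is held, since
      only that one cannot be invalidated by a racing fault.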