28 Sep, 2019
      Merge branch 'hugepage-fallbacks' (hugepage patches from David Rientjes)
      Linus Torvalds authored
      Merge hugepage allocation updates from David Rientjes:
       "We (mostly Linus, Andrea, and myself) have been discussing offlist how
        to implement a sane default allocation strategy for hugepages on NUMA
        With these reverts in place, the page allocator will happily allocate
        a remote hugepage immediately rather than try to make a local hugepage
        available. This incurs a substantial performance degradation when
        memory compaction would have otherwise made a local hugepage
        available.
        This series reverts those reverts and attempts to propose a more sane
        default allocation strategy specifically for hugepages. Andrea
        acknowledges this is likely to fix the swap storms that he originally
        reported that resulted in the patches that removed __GFP_THISNODE from
        hugepage allocations.
        The immediate goal is to return 5.3 to the behavior the kernel has
        implemented over the past several years so that remote hugepages are
        not immediately allocated when local hugepages could have been made
        available because the increased access latency is untenable.
        The next goal is to introduce a sane default allocation strategy for
        hugepages allocations in general regardless of the configuration of
        the system so that we prevent thrashing of local memory when
        compaction is unlikely to succeed and can prefer remote hugepages over
        remote native pages when the local node is low on memory."
      Note on timing: this reverts the hugepage VM behavior changes that got
      introduced fairly late in the 5.3 cycle, and that fixed a huge
      performance regression for certain loads that had been around since
      Andrea had this note:
       "The regression of 4.18 was that it was taking hours to start a VM
        where 3.10 was only taking a few seconds, I reported all the details
        on lkml when it was finally tracked down in August 2018.
        __GFP_THISNODE in MADV_HUGEPAGE made the above enterprise vfio
        workload degrade like in the "current upstream" above. And it still
        would have been that bad as above until 5.3-rc5"
      where the bad behavior ends up happening as you fill up a local node,
      and without that change, you'd get into the nasty swap storm behavior
      due to compaction working overtime to make room for more memory on the
      As a result 5.3 got the two performance fix reverts in rc5.
      However, David Rientjes then noted that those performance fixes in turn
      regressed performance for other loads - although not quite to the same
      degree.  He suggested reverting the reverts and instead replacing them
      with two small changes to how hugepage allocations are done (patch
      descriptions rephrased by me):
       - "avoid expensive reclaim when compaction may not succeed": just admit
         that the allocation failed when you're trying to allocate a huge-page
         and compaction wasn't successful.
       - "allow hugepage fallback to remote nodes when madvised": when that
         node-local huge-page allocation failed, retry without forcing the
         local node.
      but by then I judged it too late to replace the fixes for a 5.3 release.
      So 5.3 was released with behavior that harked back to the pre-4.18 logic.
      But now we're in the merge window for 5.4, and we can see if this
      alternate model fixes not just the horrendous swap storm behavior, but
      also fixes the performance regression that the late reverts caused.
      Fingers crossed.
      * emailed patches from David Rientjes <rientjes@google.com>:
        mm, page_alloc: allow hugepage fallback to remote nodes when madvised
        mm, page_alloc: avoid expensive reclaim when compaction may not succeed
        Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask""
        Revert "Revert "mm, thp: restore node-local hugepage allocations""
      mm, page_alloc: allow hugepage fallback to remote nodes when madvised · 76e654cc
      For systems configured to always try hard to allocate transparent
      hugepages (thp defrag setting of "always") or for memory that has been
      explicitly madvised to MADV_HUGEPAGE, it is often better to fall back to
      remote memory to allocate the hugepage if the local allocation fails
      The point is to allow the initial call to __alloc_pages_node() to attempt
      to defragment local memory to make a hugepage available, if possible,
      rather than immediately falling back to remote memory.  Local hugepages will
      always have a better access latency than remote (huge)pages, so an attempt
      to make a hugepage available locally is always preferred.
      If memory compaction cannot be successful locally, however, it is likely
      better to fall back to remote memory.  This could take on two forms: either
      allow immediate fallback to remote memory or do per-zone watermark checks.
      It would be possible to fall back only when per-zone watermarks fail for
      order-0 memory, since that would require local reclaim for all subsequent
      faults so remote huge allocation is likely better than thrashing the local
      zone for large workloads.
      In this case, because the system is configured to try hard to allocate
      hugepages, or the vma has been explicitly advised to try hard for
      hugepages, it is assumed that remote allocation is better when local
      allocation and memory compaction have both failed.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
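The fallback order described in the commit above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the kernel's code: `Node`, `alloc_local`, `alloc_hugepage`, and the `try_hard` flag (standing in for "thp defrag is always, or the range was madvised") are hypothetical names, and compaction is reduced to a single boolean.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Toy NUMA node (hypothetical model, not a kernel structure)."""
    free_hugepages: int
    compactable: bool  # assumption: True if compaction could assemble a hugepage


def alloc_local(node: Node) -> bool:
    """Try the local node only, compacting if needed (models a node-local attempt)."""
    if node.free_hugepages > 0:
        node.free_hugepages -= 1
        return True
    # In this toy model a successful compaction yields one hugepage that is
    # consumed immediately, so the free count does not change.
    return node.compactable


def alloc_hugepage(nodes: List[Node], local: int, try_hard: bool) -> Optional[int]:
    """Return the id of the node that satisfied the request, or None on failure."""
    # Step 1: always prefer a local hugepage, even if that needs compaction.
    if alloc_local(nodes[local]):
        return local
    # Step 2: only callers that asked to try hard retry without the
    # local-node restriction; everyone else fails and uses native pages.
    if not try_hard:
        return None
    for nid, node in enumerate(nodes):
        if nid != local and node.free_hugepages > 0:
            node.free_hugepages -= 1
            return nid
    return None
```

With a fragmented local node and a remote node that has free hugepages, `alloc_hugepage(nodes, 0, try_hard=True)` returns the remote node id, while `try_hard=False` returns `None`. The real allocator also considers compaction on remote nodes, which this sketch leaves out.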
      mm, page_alloc: avoid expensive reclaim when compaction may not succeed · b39d0ee2
      Memory compaction has several significant drawbacks as the allocation
      order increases, specifically:
       - isolate_freepages() is responsible for finding free pages to use as
         migration targets and is implemented as a linear scan of memory
         starting at the end of a zone,
       - failing order-0 watermark checks in memory compaction does not account
         for how far below the watermarks the zone actually is: to enable
         migration, there must be *some* free memory available.  Per the above,
         watermarks are not always sufficient if isolate_freepages() cannot
         find the free memory but it could require hundreds of MBs of reclaim to
         even reach this threshold (read: potentially very expensive reclaim with
         no indication compaction can be successful), and
       - if compaction at this order has failed recently so that it does not even
         run as a result of deferred compaction, looping through reclaim can often
         be pointless.
      For hugepage allocations, these are quite substantial drawbacks because
      these are very high order allocations (order-9 on x86) and falling back to
      doing reclaim can potentially be *very* expensive without any indication
      that compaction would even be successful.
      Reclaim itself is unlikely to free entire pageblocks and certainly no
      reliance should be put on it to do so in isolation (recall lumpy reclaim).
      This means we should avoid reclaim and simply fail hugepage allocation if
      compaction is deferred.
      It is also not helpful to thrash a zone by doing excessive reclaim if
      compaction may not be able to access that memory.  If order-0 watermarks
      fail and the allocation order is sufficiently large, it is likely better
      to fail the allocation rather than thrashing the zone.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
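The decision rule argued for in the commit above can be written down as a small predicate. This is a minimal sketch, not the kernel implementation: the enum and function names are hypothetical, and the "sufficiently large" threshold is assumed here to be the pageblock order (order-9, matching the x86 hugepage order mentioned in the message).

```python
from enum import Enum, auto


class CompactResult(Enum):
    """Simplified compaction outcomes (hypothetical names for this sketch)."""
    SUCCESS = auto()
    FAILED = auto()    # compaction ran but could not build the page
    SKIPPED = auto()   # order-0 watermarks too low for compaction to run
    DEFERRED = auto()  # compaction failed recently and was not retried


# Assumption: "sufficiently large" means at least one pageblock (order-9 on x86).
PAGEBLOCK_ORDER = 9


def should_reclaim(order: int, result: CompactResult) -> bool:
    """Decide whether falling back to direct reclaim is worth attempting."""
    if result is CompactResult.SUCCESS:
        return False  # the page is already available
    if order >= PAGEBLOCK_ORDER and result in (
        CompactResult.SKIPPED,
        CompactResult.DEFERRED,
    ):
        # Reclaim could be very expensive with no sign that compaction
        # would succeed afterwards, so fail the allocation instead.
        return False
    return True
```

For example, `should_reclaim(9, CompactResult.DEFERRED)` is `False`, while a small allocation such as `should_reclaim(3, CompactResult.DEFERRED)` still proceeds to reclaim.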
      Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"" · 19deb769
      This reverts commit 92717d42.
      Since commit a8282608 ("Revert "mm, thp: restore node-local hugepage
      allocations"") is reverted in this series, it is better to restore the
      previous 5.2 behavior between the thp allocation and the page allocator
      rather than to attempt any consolidation or cleanup for a policy that is
      now reverted.  It's less risky during an rc cycle and subsequent patches
      in this series further modify the same policy that the pre-5.3 behavior
      Consolidation and cleanup can be done subsequent to a sane default page
      allocation strategy, so this patch reverts a cleanup done on a strategy
      that is now reverted and thus is the least risky option.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Revert "Revert "mm, thp: restore node-local hugepage allocations"" · ac79f78d
      This reverts commit a8282608.
      The commit references the original intended semantic for MADV_HUGEPAGE
      which has subsequently taken on three unique purposes:
       - enables or disables thp for a range of memory depending on the system's
         config (is thp "enabled" set to "always" or "madvise"),
       - determines the synchronous compaction behavior for thp allocations at
         fault (is thp "defrag" set to "always", "defer+madvise", or "madvise"),
       - reverts a previous MADV_NOHUGEPAGE (there is no madvise mode to only
         clear previous hugepage advice).
      These are the three purposes that currently exist in 5.2 and over the
      past several years that userspace has been written around.  Adding a
      NUMA locality preference adds a fourth dimension to an already conflated
      advice mode.
      Based on the semantic that MADV_HUGEPAGE has provided over the past
      several years, there exist workloads that use the tunable based on these
      principles: specifically that the allocation should attempt to
      defragment a local node before falling back.  It is agreed that remote
      hugepages typically (but not always) have a better access latency than
      remote native pages, although on Naples this is at parity for
      The revert commit that this patch reverts allows hugepage allocation to
      immediately allocate remotely when local memory is fragmented.  This is
      contrary to the semantic of MADV_HUGEPAGE over the past several years:
      that is, memory compaction should be attempted locally before falling
      The performance degradation of remote hugepages over local hugepages on
      Rome, for example, is 53.5% increased access latency.  For this reason,
      the goal is to revert back to the 5.2 and previous behavior that would
      attempt local defragmentation before falling back.  With the patch that
      is reverted by this patch, we see performance degradations at the tail
      because the allocator happily allocates the remote hugepage rather than
      even attempting to make a local hugepage available.
      zone_reclaim_mode is not a solution to this problem since it does not
      only impact hugepage allocations but rather changes the memory
      allocation strategy for *all* page allocations.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
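For readers unfamiliar with how userspace requests the MADV_HUGEPAGE behavior discussed above, here is a minimal sketch using Python's standard `mmap` module on Linux (Python 3.8 or later). The helper name `map_with_hugepage_advice` and the 2 MiB hugepage size are assumptions for illustration. Whether the kernel actually backs the range with hugepages depends on the thp "enabled" and "defrag" settings described in the commit message.

```python
import mmap

# Assumption: 2 MiB transparent hugepages, as on x86-64.
HUGEPAGE_SIZE = 2 * 1024 * 1024


def map_with_hugepage_advice(length: int = 4 * HUGEPAGE_SIZE):
    """Map anonymous private memory and advise the kernel to use hugepages.

    Returns (mapping, advised). advised is False when the advice is not
    available on this platform or the kernel rejects it.
    """
    # Unix-only flags; fileno -1 requests an anonymous mapping.
    buf = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    advice = getattr(mmap, "MADV_HUGEPAGE", None)  # defined only on Linux builds
    if advice is None or not hasattr(buf, "madvise"):
        return buf, False
    try:
        buf.madvise(advice)  # applies to the whole mapping by default
    except OSError:
        # For example, a kernel built without transparent hugepage support.
        return buf, False
    return buf, True
```

The advice is only a hint: the call succeeding does not mean a hugepage was allocated, and under the policy in this series the first attempt is still made on the local node.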
      Merge tag 'powerpc-5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · a2953204
      Pull powerpc fixes from Michael Ellerman:
       "An assortment of fixes that were either missed by me, or didn't arrive
        quite in time for the first v5.4 pull.
         - Most notable is a fix for an issue with tlbie (broadcast TLB
           invalidation) on Power9, when using the Radix MMU. The tlbie can
           race with an mtpid (move to PID register, essentially MMU context
           switch) on another thread of the core, which can cause stores to
           continue to go to a page after it's unmapped.
         - A fix in our KVM code to add a missing barrier, the lack of which
           has been observed to cause missed IPIs and subsequently stuck CPUs
           in the host.
         - A change to the way we initialise PCR (Processor Compatibility
           Register) to make it forward compatible with future CPUs.
         - On some older PowerVM systems our H_BLOCK_REMOVE support could
            oops, fix it to detect such systems and fall back to the old
           invalidation method.
         - A fix for an oops seen on some machines when using KASAN on 32-bit.
         - A handful of other minor fixes, and two new selftests.
        Thanks to: Alistair Popple, Aneesh Kumar K.V, Christophe Leroy,
        Gustavo Romero, Joel Stanley, Jordan Niethe, Laurent Dufour, Michael
        Roth, Oliver O'Halloran"
      Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · f19e00ee
      Pull x86 fix from Ingo Molnar:
       "A kexec fix for the case when GCC_PLUGIN_STACKLEAK=y is enabled"
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/purgatory: Disable the stackleak GCC plugin for the purgatory
      Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 9c5efe9a
      Pull scheduler fixes from Ingo Molnar:
       - Apply a number of membarrier related fixes and cleanups, which fixes
         a use-after-free race in the membarrier code
       - Introduce proper RCU protection for tasks on the runqueue - to get
         rid of the subtle task_rcu_dereference() interface that was easy to
         get wrong
       - Misc fixes, but also an EAS speedup
      * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/fair: Avoid redundant EAS calculation
        sched/core: Remove double update_max_interval() call on CPU startup
        sched/core: Fix preempt_schedule() interrupt return comment
        sched/fair: Fix -Wunused-but-set-variable warnings
        sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr()
        sched/membarrier: Return -ENOMEM to userspace on memory allocation failure
        sched/membarrier: Skip IPIs when mm->mm_users == 1
        selftests, sched/membarrier: Add multi-threaded test
        sched/membarrier: Fix p->mm->membarrier_state racy load
        sched/membarrier: Call sync_core only before usermode for same mm
        sched/membarrier: Remove redundant check
        sched/membarrier: Fix private expedited registration check
        tasks, sched/core: RCUify the assignment of rq->curr
        tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code
        tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue
        tasks: Add a count of task RCU users
        sched/core: Convert vcpu_is_preempted() from macro to an inline function
        sched/fair: Remove unused cfs_rq_clock_task() function
      Merge branch 'next-lockdown' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security · aefcf2f4
      Pull kernel lockdown mode from James Morris:
       "This is the latest iteration of the kernel lockdown patchset, from
        Matthew Garrett, David Howells and others.
        From the original description:
          This patchset introduces an optional kernel lockdown feature,
          intended to strengthen the boundary between UID 0 and the kernel.
          When enabled, various pieces of kernel functionality are restricted.
          Applications that rely on low-level access to either hardware or the
          kernel may cease working as a result - therefore this should not be
          enabled without appropriate evaluation beforehand.
          The majority of mainstream distributions have been carrying variants
          of this patchset for many years now, so there's value in providing a
          doesn't meet every distribution requirement, but gets us much closer
          to not requiring external patches.
        There are two major changes since this was last proposed for mainline:
         - Separating lockdown from EFI secure boot. Background discussion is
           covered here: https://lwn.net/Articles/751061/
         - Implementation as an LSM, with a default stackable lockdown LSM
           module. This allows the lockdown feature to be policy-driven,
           rather than encoding an implicit policy within the mechanism.
        The new locked_down LSM hook is provided to allow LSMs to make a
        policy decision around whether kernel functionality that would allow
        tampering with or examining the runtime state of the kernel should be
        The included lockdown LSM provides an implementation with a simple
        policy intended for general purpose use. This policy provides a coarse
        level of granularity, controllable via the kernel command line:
        Enable the kernel lockdown feature. If set to integrity, kernel features
        that allow userland to modify the running kernel are disabled. If set to
        confidentiality, kernel features that allow userland to extract
        confidential information from the kernel are also disabled.
        This may also be controlled via /sys/kernel/security/lockdown and
        overridden by kernel configuration.
        New or existing LSMs may implement finer-grained controls of the
        lockdown features. Refer to the lockdown_reason documentation in
        include/linux/security.h for details.
        The lockdown feature has had significant design feedback and review
        across many subsystems. This code has been in linux-next for some
        weeks, with a few fixes applied along the way.
        Stephen Rothwell noted that commit 9d1f8be5 ("bpf: Restrict bpf
        when kernel lockdown is in confidentiality mode") is missing a
        Signed-off-by from its author. Matthew responded that he is providing
        this under category (c) of the DCO"
      * 'next-lockdown' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (31 commits)
        kexec: Fix file verification on S390
        security: constify some arrays in lockdown LSM
        lockdown: Print current->comm in restriction messages
        efi: Restrict efivar_ssdt_load when the kernel is locked down
        tracefs: Restrict tracefs when the kernel is locked down
        debugfs: Restrict debugfs when the kernel is locked down
        kexec: Allow kexec_file() with appropriate IMA policy when locked down
        lockdown: Lock down perf when in confidentiality mode
        bpf: Restrict bpf when kernel lockdown is in confidentiality mode
        lockdown: Lock down tracing and perf kprobes when in confidentiality mode
        lockdown: Lock down /proc/kcore
        x86/mmiotrace: Lock down the testmmiotrace module
        lockdown: Lock down module params that specify hardware parameters (eg. ioport)
        lockdown: Lock down TIOCSSERIAL
        lockdown: Prohibit PCMCIA CIS storage when the kernel is locked down
        acpi: Disable ACPI table override if the kernel is locked down
        acpi: Ignore acpi_rsdp kernel param when the kernel has been locked down
        ACPI: Limit access to custom_method when the kernel is locked down
        x86/msr: Restrict MSR access when the kernel is locked down
        x86: Lock down IO port access when the kernel is locked down
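The lockdown pull above mentions that the mode can be inspected through /sys/kernel/security/lockdown. The sketch below reads that file from userspace. It assumes the file lists the available modes with the active one in square brackets (for example `none [integrity] confidentiality`); that format and the helper names `parse_lockdown` and `current_lockdown` are assumptions for illustration, not something stated in the pull message.

```python
from typing import Optional

LOCKDOWN_PATH = "/sys/kernel/security/lockdown"


def parse_lockdown(text: str) -> str:
    """Return the active mode from text such as 'none [integrity] confidentiality'."""
    for word in text.split():
        # Assumed format: the active mode is the one wrapped in brackets.
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError(f"no active mode marked in {text!r}")


def current_lockdown(path: str = LOCKDOWN_PATH) -> Optional[str]:
    """Return the active lockdown mode, or None if it cannot be determined."""
    try:
        with open(path) as f:
            return parse_lockdown(f.read())
    except (OSError, ValueError):
        # The file is missing (lockdown LSM not enabled, securityfs not
        # mounted), unreadable, or in an unexpected format.
        return None
```

On a kernel without the lockdown LSM, `current_lockdown()` simply returns `None` rather than raising.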
      Merge branch 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity · f1f2f614
      Pull integrity updates from Mimi Zohar:
       "The major feature this time is IMA support for measuring and
        appraising appended file signatures. In addition, there are a couple
        of bug fixes and a code cleanup to use struct_size().
        In addition to the PE/COFF and IMA xattr signatures, the kexec kernel
        image may be signed with an appended signature, using the same
        scripts/sign-file tool that is used to sign kernel modules.
        Similarly, the initramfs may contain an appended signature.
        This contained a lot of refactoring of the existing appended signature
        verification code, so that IMA could retain the existing framework of
        calculating the file hash once, storing it in the IMA measurement list
        and extending the TPM, verifying the file's integrity based on a file
        hash or signature (eg. xattrs), and adding an audit record containing
        the file hash, all based on policy. (The IMA support for appended
      Merge tag 'nfsd-5.4' of git://linux-nfs.org/~bfields/linux · 298fb76a
      Linus Torvalds authored
      Pull nfsd updates from Bruce Fields:
         - Add a new knfsd file cache, so that we don't have to open and close
           on each (NFSv2/v3) READ or WRITE. This can speed up read and write
           in some cases. It also replaces our readahead cache.
         - Prevent silent data loss on write errors, by treating write errors
           like server reboots for the purposes of write caching, thus forcing
           clients to resend their writes.
         - Tweak the code that allocates sessions to be more forgiving, so
           that NFSv4.1 mounts are less likely to hang when a server already
           has a lot of clients.
         - Eliminate an arbitrary limit on NFSv4 ACL sizes; they should now be
           limited only by the backend filesystem and the maximum RPC size.
         - Allow the server to enforce use of the correct kerberos credentials
           when a client reclaims state after a reboot.
        And some miscellaneous smaller bugfixes and cleanup"
  27 Sep, 2019
      Merge tag 'virtio-fs-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
      Linus Torvalds authored
      Pull fuse virtio-fs support from Miklos Szeredi:
       "Virtio-fs allows exporting directory trees on the host and mounting
        them in guest(s).
        This isn't actually a new filesystem, but a glue layer between the
        fuse filesystem and a virtio based back-end.
        It's similar in functionality to the existing virtio-9p solution, but
        significantly faster in benchmarks and has better POSIX compliance.
        Further performance improvements can be achieved by sharing the page
        cache between host and guest, allowing for faster I/O and reduced
        memory use.
        Kata Containers have been including the out-of-tree virtio-fs (with
        the shared page cache patches as well) since version 1.7 as an
        experimental feature. They have been active in development and plan to
        switch from virtio-9p to virtio-fs as their default solution. There
        has been interest from other sources as well.
        The userspace infrastructure is slated to be merged into qemu once the
        kernel part hits mainline.
        This was developed by Vivek Goyal, Dave Gilbert and Stefan Hajnoczi"
      Merge tag '9p-for-5.4' of git://github.com/martinetd/linux
      Linus Torvalds authored
      Pull 9p updates from Dominique Martinet:
       "Some of the usual small fixes and cleanup.
        Small fixes all around:
         - avoid overlayfs copy-up for PRIVATE mmaps
         - KMSAN uninitialized warning for transport error
         - one syzbot memory leak fix in 9p cache
         - internal API cleanup for v9fs_fill_super"
      Merge tag 'riscv/for-v5.4-rc1-b' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
      Linus Torvalds authored
      Pull more RISC-V updates from Paul Walmsley:
       "Some additional RISC-V updates.
        This includes one significant fix:
         - Prevent interrupts from being unconditionally re-enabled during
           exception handling if they were disabled in the context in which
           the exception occurred
        Also a few other fixes:
         - Fix a build error when sparse memory support is manually enabled
         - Prevent CPUs beyond CONFIG_NR_CPUS from being enabled in early boot
        And a few minor improvements:
         - DT improvements: in the FU540 SoC DT files, improve U-Boot
           compatibility by adding an "ethernet0" alias, drop an unnecessary
           property from the DT files, and add support for the PWM device
         - KVM preparation: add a KVM-related macro for future RISC-V KVM
           support, and export some symbols required to build KVM support as
         - defconfig additions: build more drivers by default for QEMU
      Merge tag 'nios2-v5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2
      Linus Torvalds authored
      Pull nios2 fix from Ley Foon Tan:
       "Make sure the command line buffer is NUL-terminated"
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
      Linus Torvalds authored
      Pull more KVM updates from Paolo Bonzini:
       "x86 KVM changes:
         - The usual accuracy improvements for nested virtualization
         - The usual round of code cleanups from Sean
         - Added back optimizations that were prematurely removed in 5.2 (the
           bare minimum needed to fix the regression was in 5.3-rc8, here
           comes the rest)
         - Support for UMWAIT/UMONITOR/TPAUSE
         - Direct L2->L0 TLB flushing when L0 is Hyper-V and L1 is KVM
         - Tell Windows guests if SMT is disabled on the host
         - More accurate detection of vmexit cost
         - Revert a pvqspinlock pessimization"
      Merge tag 'pwm/for-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm · e37e3bc7
      Linus Torvalds authored
      Pull pwm updates from Thierry Reding:
       "Besides one new driver being added for the PWM controller found in
        various Spreadtrum SoCs, this series of changes brings a slew of,
        mostly minor, fixes and cleanups for existing drivers, as well as some
        enhancements to the core code.
        Lastly, Uwe is added to the PWM subsystem entry of the MAINTAINERS
        file, making official his role as a reviewer"
      Merge tag 'for-5.4/io_uring-2019-09-27' of git://git.kernel.dk/linux-block
      Linus Torvalds authored
      Pull more io_uring updates from Jens Axboe:
       "Just two things in here:
         - Improvement to the io_uring CQ ring wakeup for batched IO (me)
         - Fix wrong comparison in poll handling (yangerkun)
        I realize the first one is a little late in the game, but it felt
        pointless to hold it off until the next release. Went through various
        testing and reviews with Pavel and peterz"
      Merge tag 'for-linus-2019-09-27' of git://git.kernel.dk/linux-block
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
       "A few fixes/changes to round off this merge window. This contains:
         - Small series making some functional tweaks to blk-iocost (Tejun)
         - Elevator switch locking fix (Ming)
         - Kill redundant call in blk-wbt (Yufen)
         - Fix flush timeout handling (Yufen)"
      Linus Torvalds authored
      Pull thermal management updates from Zhang Rui:
       - Add Amit Kucheria as thermal subsystem Reviewer (Amit Kucheria)
       - Fix a use after free bug when unregistering thermal zone devices (Ido
       - Fix thermal core framework to use put_device() when device_register()
         fails (Yue Hu)
       - Enable intel_pch_thermal and MMIO RAPL support for Intel Icelake
         platform (Srinivas Pandruvada)
       - Add clock operations in qorip thermal driver, for some platforms with
         clock control like i.MX8MQ (Anson Huang)
       - A couple of trivial fixes and cleanups for thermal core and different
         soc thermal drivers (Amit Kucheria, Christophe JAILLET, Chuhong Yuan,
         Fuqian Huang, Kelsey Skunberg, Nathan Huckleberry, Rishi Gupta,
         Srinivas Kandagatla)
      Merge tag 'linux-watchdog-5.4-rc1' of git://www.linux-watchdog.org/linux-watchdog
      Linus Torvalds authored
      Pull watchdog updates from Wim Van Sebroeck:
       - addition of AST2600, i.MX7ULP and F81803 watchdog support
       - removal of the w90x900 and ks8695 drivers
       - ziirave_wdt improvements
       - small fixes and improvements
      Merge tag 'drm-next-2019-09-27' of git://anongit.freedesktop.org/drm/drm
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "Fixes built up over the past 1.5 weeks or so, it's two weeks of
        amdgpu, some core cleanups and some panfrost fixes. I also finally
        figured out why my desktop was slow to do a bunch of stuff (someone
        gave it an IPv6 address which can't reach anything!).
         - Some cleanups and fixes in the self-refresh helpers
         - Some cleanups and fixes in the atomic helpers
         - Fix a 64 bit divide
         - Prevent a memory leak in a failure case in dc
         - Load proper gfx firmware on navi14 variants
         - Add more navi12 and navi14 PCI ids
         - Misc fixes for renoir
         - Fix bandwidth issues with multiple displays on vega20
         - Support for Dali
         - Fix a possible oops with KFD on hawaii
         - Fix for backlight level after resume on some APUs
         - Other misc fixes
         - Multiple panfrost fixes for regulator support and page fault
      Merge tag 'ntb-5.4' of git://github.com/jonmason/ntb
      Linus Torvalds authored
      Pull NTB updates from Jon Mason:
       "A few bugfixes and support for new AMD NTB hardware"
      keys: Add Jarkko Sakkinen as co-maintainer · ea1e2bbe
      Jarkko Sakkinen authored
      To address a major procedural concern on Linus's part, keyrings needs
      a co-maintainer.
      Suggested-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      block: fix null pointer dereference in blk_mq_rq_timed_out() · 8d699663
      Yufen Yu authored
      We got a NULL pointer dereference BUG in blk_mq_rq_timed_out(),
      as follows:
      [  108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
      [  108.827059] PGD 0 P4D 0
      [  108.827313] Oops: 0000 [#1] SMP PTI
      [  108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
      [  108.829503] Workqueue: kblockd blk_mq_timeout_work
      [  108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
      [  108.838191] Call Trace:
      [  108.838406]  bt_iter+0x74/0x80
      [  108.838665]  blk_mq_queue_tag_busy_iter+0x204/0x450
      [  108.839074]  ? __switch_to_asm+0x34/0x70
      [  108.839405]  ? blk_mq_stop_hw_queue+0x40/0x40
      [  108.839823]  ? blk_mq_stop_hw_queue+0x40/0x40
      [  108.840273]  ? syscall_return_via_sysret+0xf/0x7f
      [  108.840732]  blk_mq_timeout_work+0x74/0x200
      [  108.841151]  process_one_work+0x297/0x680
      [  108.841550]  worker_thread+0x29c/0x6f0
      [  108.841926]  ? rescuer_thread+0x580/0x580
      [  108.842344]  kthread+0x16a/0x1a0
      [  108.842666]  ? kthread_flush_work+0x170/0x170
      [  108.843100]  ret_from_fork+0x35/0x40
      The bug is caused by a race between the timeout handler and the
      completion of a flush request.
      When the timeout handler blk_mq_rq_timed_out() tries to read
      'req->q->mq_ops', 'req' may already have completed and been
      reinitialized by the next flush request, which calls blk_rq_init()
      to zero 'req'.
      After commit 12f5b931 ("blk-mq: Remove generation seqeunce"),
      the lifetime of normal requests is protected by a refcount. A request
      can only be freed once 'rq->ref' drops to zero, so these requests
      cannot be reused before the timeout handler finishes.
      However, a flush request defines .end_io, and rq->end_io() is still
      called even if 'rq->ref' has not dropped to zero. After that, 'flush_rq'
      can be reused by the next flush request, resulting in the NULL
      pointer dereference.
      We fix this problem by covering the flush request with 'rq->ref'.
      If the refcount is not zero, flush_end_io() returns and waits for the
      last holder to call it again. To record the request status, we add a
      new field, 'rq_status', which is used in flush_end_io().
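      The deferral described above can be illustrated with a minimal
      userspace sketch. It is not the kernel patch: every sketch_* name is
      hypothetical, the plain int counter stands in for the atomic 'rq->ref',
      and the locking that protects 'fq->rq_status' in the real code is
      left out.

```c
/* Hypothetical stand-in for the flush request; not the kernel struct. */
struct sketch_request {
    int ref;           /* models 'rq->ref' (atomic in the kernel) */
    int completions;   /* how many times completion really ran */
    int final_status;  /* status reported by the real completion */
};

/* Hypothetical stand-in for struct blk_flush_queue. */
struct sketch_flush_queue {
    int rq_status;     /* models 'fq->rq_status'; 0 means "no error saved" */
};

/*
 * Models flush_end_io(): each caller drops one reference. While another
 * holder remains, only the status is recorded and the function returns.
 * The last holder completes the request with the recorded status.
 */
void sketch_flush_end_io(struct sketch_flush_queue *fq,
                         struct sketch_request *rq, int error)
{
    rq->ref--;
    if (rq->ref != 0) {
        /* Someone (e.g. the timeout handler) still uses the request. */
        fq->rq_status = error;
        return;
    }

    /* Last holder: prefer a status saved by an earlier, deferred call. */
    if (fq->rq_status != 0) {
        error = fq->rq_status;
        fq->rq_status = 0;
    }

    rq->completions++;
    rq->final_status = error;
}
```

      With two references held (the issuer and the timeout handler), the
      first call only records the status. The request is completed, and can
      be reused, only after the second call drops the last reference.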
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: stable@vger.kernel.org # v4.18+
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
       - move rq_status from struct request to struct blk_flush_queue
       - remove unnecessary '{}' pair.
       - let spinlock to protect 'fq->rq_status'
       - move rq_status after flush_running_idx member of struct blk_flush_queue
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      rq-qos: get rid of redundant wbt_update_limits() · 2af2783f
      Yufen Yu authored
      We have updated limits after calling wbt_set_min_lat(). No need to
      update again.
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
      Linus Torvalds authored
      Pull timer fix from Ingo Molnar:
       "Fix a timer expiry bug that would cause spurious delay of timers"
      * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        timer: Read jiffies once when forwarding base clk
      Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
      Linus Torvalds authored
      Pull more perf updates from Ingo Molnar:
       "The only kernel change is comment typo fixes.
        The rest is mostly tooling fixes, but also new vendor event additions
        and updates, a bigger libperf/libtraceevent library and a header files
        reorganization that came in a bit late"
      Merge tag 'trace-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
      Linus Torvalds authored
      Pull tracing fix from Steven Rostedt:
       "Srikar Dronamraju fixed a bug in the newmulti probe code"
      perf unwind: Fix libunwind build failure on i386 systems · 26acf400
      Arnaldo Carvalho de Melo authored
      Naresh Kamboju reported that on the i386 build pr_err()
      doesn't get defined properly due to header ordering:
        perf-in.o: In function `libunwind__x86_reg_id':
        undefined reference to `pr_err'
      Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Merge tag 'usercopy-v5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
      Linus Torvalds authored
      Pull usercopy fix from Kees Cook:
       "Fix hardened usercopy under CONFIG_DEBUG_VIRTUAL"
      * tag 'usercopy-v5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
        usercopy: Avoid HIGHMEM pfn warning
      Merge tag 'linux-kselftest-5.4-rc1.1' of... · 797a3242
      Linus Torvalds authored
      Merge tag 'linux-kselftest-5.4-rc1.1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      Pull Kselftest updates from Shuah Khan:
       "Fixes to existing tests"
      Merge tag 'nfs-for-5.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
      Linus Torvalds authored
      Pull NFS client updates from Anna Schumaker:
       "Stable bugfixes:
         - Dequeue the request from the receive queue while we're re-encoding
           # v4.20+
         - Fix buffer handling of GSS MIC without slack # 5.1
         - Increase xprtrdma maximum transport header and slot table sizes
         - Add support for nfs4_call_sync() calls using a custom
         - Optimize the default readahead size
         - Enable pNFS filelayout LAYOUTGET on OPEN
        Other bugfixes and cleanups:
         - Fix possible null-pointer dereferences and memory leaks
         - Various NFS over RDMA cleanups
         - Various NFS over RDMA comment updates
         - Don't receive TCP data into a reset request buffer
         - Don't try to parse incomplete RPC messages
         - Fix congestion window race with disconnect
         - Clean up pNFS return-on-close error handling
         - Fixes for NFS4ERR_OLD_STATEID handling"
      binfmt_elf: Do not move brk for INTERP-less ET_EXEC · 7be3cb01
      Kees Cook authored
      When brk was moved for binaries without an interpreter, it should have
      been limited to ET_DYN only. In other words, the special case was an
      ET_DYN that lacks an INTERP, not just an executable that lacks INTERP.
      The bug manifested for giant static executables, where the brk would end
      up in the middle of the text area on 32-bit architectures.
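      The condition change described above can be sketched as a pair of
      predicates. This is an illustration only: the sketch_* names are
      hypothetical, the real check lives in the kernel's ELF loader and
      may involve further conditions, and only the numeric values of
      ET_EXEC (2) and ET_DYN (3) are taken from the ELF format.

```c
#include <stdbool.h>

/* ELF e_type values; the SKETCH_ prefix marks them as local stand-ins. */
enum sketch_elf_type {
    SKETCH_ET_EXEC = 2,  /* fixed-address executable */
    SKETCH_ET_DYN  = 3   /* position-independent object */
};

/* Behaviour described as buggy: any binary lacking INTERP had brk moved. */
bool sketch_move_brk_before(bool has_interp)
{
    return !has_interp;
}

/* Behaviour described as intended: only an ET_DYN that lacks INTERP. */
bool sketch_move_brk_after(enum sketch_elf_type type, bool has_interp)
{
    return type == SKETCH_ET_DYN && !has_interp;
}
```

      A static ET_EXEC binary has no INTERP, so the first predicate moves
      its brk while the second leaves it alone, which is the case the
      report was about.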
      Reported-and-tested-by: Richard Kojedzinszky <richard@kojedz.in>
      Fixes: bbdc6076 ("binfmt_elf: move brk out of mmap when doing direct loader exec")
      Cc: stable@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Merge tag 'xfs-5.4-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
      Linus Torvalds authored
      Pull xfs fixes from Darrick Wong:
       "There are a couple of bug fixes and some small code cleanups that came
        in recently:
         - Minor code cleanups
         - Fix a superblock logging error
         - Ensure that collapse range converts the data fork to extents format
           when necessary
         - Revert the ALLOC_USERDATA cleanup because it caused subtle behavior
      Merge branch 'work.mount3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
      Linus Torvalds authored
      Pull jffs2 fix from Al Viro:
       "braino fix for mount API conversion for jffs2"
      * 'work.mount3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        jffs2: Fix mounting under new mount API
      Merge tag 's390-5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
      Linus Torvalds authored
      Pull more s390 updates from Vasily Gorbik:
       - Fix three kasan findings
       - Add PERF_EVENT_IOC_PERIOD ioctl support
       - Add Crypto Express7S support and extend sysfs attributes for pkey
       - Minor common I/O layer documentation corrections
      Merge tag 'for-linus-5.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
      Merge branch 'akpm' (patches from Andrew)
      Linus Torvalds authored
      Merge more updates from Andrew Morton:
       - almost all of the rest of -mm
       - various other subsystems
      Subsystems affected by this patch series:
        memcg, misc, core-kernel, lib, checkpatch, reiserfs, fat, fork,
        cpumask, kexec, uaccess, kconfig, kgdb, bug, ipc, lzo, kasan, madvise,
        cleanups, pagemap
