1. 08 Sep, 2019 1 commit
  2. 07 Sep, 2019 1 commit
    • ipc: fix sparc64 ipc() wrapper · fb377eb8
      Arnd Bergmann authored

      Matt bisected a sparc64-specific issue with semctl, shmctl and msgctl
      to a commit from my y2038 series in linux-5.1, as I missed the custom
      sys_ipc() wrapper that sparc64 uses in place of the generic version that
      I patched.
      
      The problem is that the sys_{sem,shm,msg}ctl() functions in the kernel
      no longer allow being called with the IPC_64 flag, resulting in a
      -EINVAL error when they don't recognize the command.
      
      Instead, the correct way to do this now is to call the internal
      ksys_old_{sem,shm,msg}ctl() functions to select the API version.
      
      As we generally move towards these functions anyway, change all of
      sparc_ipc() to consistently use those in place of the sys_*() versions,
      and move the required ksys_*() declarations into linux/syscalls.h
      
      The IS_ENABLED(CONFIG_SYSVIPC) check is required to avoid link
      errors when ipc is disabled.
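
      As a hedged sketch only (variable names follow the generic sys_ipc()
      signature, not the literal sparc diff), one leg of sparc_ipc() now
      looks conceptually like this:

      	if (!IS_ENABLED(CONFIG_SYSVIPC))
      		return -ENOSYS;
      	switch (call) {
      	case SEMCTL:
      		/* ksys_old_semctl() selects the IPC_64 vs. old API
      		 * version internally, unlike sys_semctl() */
      		err = ksys_old_semctl(first, second, third,
      				      (unsigned long)ptr);
      		break;
      	/* ... */
      	}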
      
      Reported-by: Matt Turner <mattst88@gmail.com>
      Fixes: 275f2214 ("ipc: rename old-style shmctl/semctl/msgctl syscalls")
      Cc: stable@vger.kernel.org
      Tested-by: Matt Turner <mattst88@gmail.com>
      Tested-by: Anatoly Pugachev <matorola@gmail.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
  3. 06 Sep, 2019 1 commit
  4. 05 Sep, 2019 1 commit
  5. 03 Sep, 2019 4 commits
  6. 02 Sep, 2019 1 commit
  7. 31 Aug, 2019 2 commits
  8. 28 Aug, 2019 2 commits
  9. 26 Aug, 2019 1 commit
  10. 23 Aug, 2019 1 commit
  11. 22 Aug, 2019 1 commit
  12. 20 Aug, 2019 1 commit
  13. 19 Aug, 2019 3 commits
    • keys: Fix description size · 555df336
      David Howells authored
      The maximum key description size is 4095.  Commit f771fde8 ("keys:
      Simplify key description management") inadvertently reduced that to 255,
      made sizes between 256 and 4095 behave erratically, and caused any size
      for which size & 255 == 0 to trip an assertion in __key_link_begin() at
      the following line:
      
      	BUG_ON(index_key->desc_len == 0);
      
      This can be fixed by simply increasing the size of desc_len in struct
      keyring_index_key to a u16.
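
      A sketch of the essential change (surrounding members elided; see
      include/linux/key.h for the real layout):

      	struct keyring_index_key {
      		/* ... */
      		u16	desc_len;	/* was an 8-bit field, max 255 */
      		/* ... */
      	};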
      
      Note that the argument length test in keyutils only checked empty
      descriptions and descriptions with sizes around the limit (i.e.
      4095), not all the values in between, so it missed this.  This has
      been addressed, and
      
      	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/commit/?id=066bf56807c26cd3045a25f355b34c1d8a20a5aa
      
      now exhaustively tests all possible lengths of type, description and
      payload and then some.
      
      The assertion failure looks something like:
      
       kernel BUG at security/keys/keyring.c:1245!
       ...
       RIP: 0010:__key_link_begin+0x88/0xa0
       ...
       Call Trace:
        key_create_or_update+0x211/0x4b0
        __x64_sys_add_key+0x101/0x200
        do_syscall_64+0x5b/0x1e0
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      It can be triggered by:
      
      	keyctl add user "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" a @s
      
      Fixes: f771fde8 ("keys: Simplify key description management")
      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • netfilter: add include guard to nf_conntrack_h323_types.h · 38a429c8
      Masahiro Yamada authored

      Add a header include guard just in case.
      
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • signal: Allow cifs and drbd to receive their terminating signals · 33da8e7c
      Eric W. Biederman authored

      My recent change to use force_sig only for synchronous events
      wound up breaking signal reception in cifs and drbd.  I had
      overlooked the fact that by default kthreads start out with all
      signals set to SIG_IGN.  So a change I thought was safe turned out
      to have made it impossible for those kernel threads to catch their
      signals.

      Reverting the work on force_sig is a bad idea, because what the code
      was doing was very much a misuse of force_sig: the way force_sig
      ultimately allowed the signal to happen was to change the signal
      handler to SIG_DFL, which after the first signal would allow
      userspace to send signals to these kernel threads.  At least for
      wake_ack_receiver in drbd that does not appear actively wrong.
      
      So correct this problem by adding allow_kernel_signal that will allow
      signals whose siginfo reports they were sent by the kernel through,
      but will not allow userspace generated signals, and update cifs and
      drbd to call allow_kernel_signal in an appropriate place so that their
      thread can receive this signal.
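
      A hedged usage sketch of the new helper in a kthread; the thread
      function and its work loop are illustrative, not from the patch:

      	static int example_thread(void *unused)
      	{
      		/* accept kernel-generated SIGKILL, keep ignoring
      		 * anything sent from userspace */
      		allow_kernel_signal(SIGKILL);

      		while (!kthread_should_stop()) {
      			if (signal_pending(current))
      				break;	/* terminating signal arrived */
      			/* ... do work ... */
      		}
      		return 0;
      	}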
      
      Fixing things this way ensures that userspace won't be able to send
      signals and cause problems, that it is clear which signals the
      threads are expecting to receive, and it guarantees that nothing
      else in the system will be affected.
      
      This change was partly inspired by similar cifs and drbd patches that
      added allow_signal.
      
      Reported-by: ronnie sahlberg <ronniesahlberg@gmail.com>
      Reported-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
      Tested-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
      Cc: Steve French <smfrench@gmail.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Fixes: 247bc947 ("cifs: fix rmmod regression in cifs.ko caused by force_sig changes")
      Fixes: 72abe3bc ("signal/cifs: Fix cifs_put_tcp_session to call send_sig instead of force_sig")
      Fixes: fee10990 ("signal/drbd: Use send_sig not force_sig")
      Fixes: 3cf5d076 ("signal: Remove task parameter from force_sig")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  14. 15 Aug, 2019 2 commits
  15. 14 Aug, 2019 1 commit
  16. 13 Aug, 2019 5 commits
    • Revert "mm, thp: restore node-local hugepage allocations" · a8282608
      Andrea Arcangeli authored
      This reverts commit 2f0799a0 ("mm, thp: restore node-local
      hugepage allocations").
      
      Commit 2f0799a0 was rightfully applied to avoid the risk of a
      severe regression that was reported by the kernel test robot at the
      end of the merge window.  We now understand that the regression was a
      false positive, caused by a significant increase in fairness during a
      swap thrashing benchmark, so it's safe to re-apply the fix and
      continue improving the code from there.  The benchmark that reported
      the regression is very useful, but it provides a meaningful result
      only when there is no significant alteration in fairness during the
      workload.  The removal of __GFP_THISNODE increased fairness.
      
      __GFP_THISNODE cannot be used in the generic page faults path for new
      memory allocations under the MPOL_DEFAULT mempolicy, or the allocation
      behavior significantly deviates from what the MPOL_DEFAULT semantics are
      supposed to be for THP and 4k allocations alike.
      
      Setting THP defrag to "always" or using MADV_HUGEPAGE (with THP
      defrag set to "madvise") was never meant to provide an implicit
      MPOL_BIND on the "current" node the task is running on, causing swap
      storms and providing a much more aggressive behavior than even
      zone_reclaim_mode = 3.
      
      Any workload that could have benefited from __GFP_THISNODE now has to
      enable zone_reclaim_mode=1||2||3.  __GFP_THISNODE implicitly provided
      the zone_reclaim_mode behavior, but it only did so if THP was enabled:
      if THP was disabled, there would have been no chance to get any 4k page
      from the current node if the current node was full of pagecache, which
      further shows how this __GFP_THISNODE was misplaced in MADV_HUGEPAGE.
      MADV_HUGEPAGE has never been intended to provide any zone_reclaim_mode
      semantics, in fact the two are orthogonal, zone_reclaim_mode = 1|2|3
      must work exactly the same with MADV_HUGEPAGE set or not.
      
      The performance characteristic of memory depends on the hardware
      details.  The numbers below are obtained on Naples/EPYC architecture and
      the N/A projection extends them to show what we should aim for in the
      future as a good THP NUMA locality default.  The benchmark used
      exercises random memory seeks (note: the cost of the page faults is not
      part of the measurement).
      
        D0 THP | D0 4k | D1 THP | D1 4k | D2 THP | D2 4k | D3 THP | D3 4k | ...
        0%     | +43%  | +45%   | +106% | +131%  | +224% | N/A    | N/A
      
      D0 means distance zero (i.e.  local memory), D1 means distance one (i.e.
      intra socket memory), D2 means distance two (i.e.  inter socket memory),
      etc...
      
      For the guest physical memory allocated by qemu and for the guest
      mode kernel, the performance characteristic of RAM is more complex
      and an ideal default could be:
      
        D0 THP | D1 THP | D0 4k | D2 THP | D1 4k | D3 THP | D2 4k | D3 4k | ...
        0%     | +58%   | +101% | N/A    | +222% | N/A    | N/A   | N/A
      
      NOTE: the N/A are projections and haven't been measured yet, the
      measurement in this case is done on a 1950x with only two NUMA nodes.
      The THP case here means THP was used both in the host and in the guest.
      
      After applying this commit the THP NUMA locality order that we'll get
      out of MADV_HUGEPAGE is this:
      
        D0 THP | D1 THP | D2 THP | D3 THP | ... | D0 4k | D1 4k | D2 4k | D3 4k | ...
      
      Before this commit it was:
      
        D0 THP | D0 4k | D1 4k | D2 4k | D3 4k | ...
      
      Even if we ignore the breakage of large workloads that can't fit in a
      single node that the __GFP_THISNODE implicit "current node" mbind
      caused, the THP NUMA locality order provided by __GFP_THISNODE was still
      not the one we shall aim for in the long term (i.e.  the first one at
      the top).
      
      After this commit is applied, we can introduce a new multi-order
      allocator API and replace the two alloc_pages_vma() calls in the
      page fault path with a single multi-order call:
      
              /* Sketch of the proposed API: on input "order" is a bitmask
               * of acceptable orders; on return only the order actually
               * allocated remains set. */
              unsigned int order = (1 << HPAGE_PMD_ORDER) | (1 << 0);
              page = alloc_pages_multi_order(..., &order);
              if (!page)
              	goto out;
              if (!(order & (1 << 0))) {
              	VM_WARN_ON(order != 1 << HPAGE_PMD_ORDER);
              	/* THP fault */
              } else {
              	VM_WARN_ON(order != 1 << 0);
              	/* 4k fallback */
              }
      
      The page allocator logic has to be altered so that when it fails on
      any zone with order 9, it tries again with an order-0 allocation in
      the same zone before falling back to the next zone in the zonelist.
      
      After that we need to do more measurements and evaluate if adding an
      opt-in feature for guest mode is worth it, to swap "DN 4k | DN+1 THP"
      with "DN+1 THP | DN 4k" at every NUMA distance crossing.
      
      Link: http://lkml.kernel.org/r/20190503223146.2312-3-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"" · 92717d42
      Andrea Arcangeli authored
      Patch series "reapply: relax __GFP_THISNODE for MADV_HUGEPAGE mappings".
      
      The fixes for what was originally reported as "pathological THP
      behavior" were rightfully reverted, to be sure not to introduce
      regressions at the end of a merge window after a severe regression
      report from the kernel test robot.  We can safely re-apply them now
      that we have had time to analyze the problem.
      
      The mm process worked fine, because the good fixes were eventually
      committed upstream without excessive delay.
      
      The regression reported by the kernel bot however forced us to revert
      the good fixes to be sure not to introduce regressions and to give us
      the time to analyze the issue further.  The silver lining is that this
      extra time allowed us to think more about this issue and also to
      plan for a future direction to improve things further in terms of
      THP NUMA locality.
      
      This patch (of 2):
      
      This reverts commit 356ff8a9 ("Revert "mm, thp: consolidate THP
      gfp handling into alloc_hugepage_direct_gfpmask"").  So it reapplies
      89c83fb5 ("mm, thp: consolidate THP gfp handling into
      alloc_hugepage_direct_gfpmask").
      
      Consolidating the THP allocation flags in one place was meant as a
      cleanup to more easily handle otherwise scattered code that imposes
      a maintenance burden.  There were no real problems observed with the
      gfp mask consolidation, but the reversion was rushed through without
      a larger consensus regardless.
      
      This patch brings the consolidation back because it should make
      long-term maintainability easier and future changes less error
      prone.
      
      [mhocko@kernel.org: changelog additions]
      Link: http://lkml.kernel.org/r/20190503223146.2312-2-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: workingset: fix vmstat counters for shadow nodes · ec9f0238
      Roman Gushchin authored
      Memcg counters for shadow nodes are broken because the memcg pointer
      is obtained in the wrong way.  The following approach is used:
              virt_to_page(xa_node)->mem_cgroup
      
      Since commit 4d96ba35 ("mm: memcg/slab: stop setting
      page->mem_cgroup pointer for slab pages") page->mem_cgroup pointer isn't
      set for slab pages, so memcg_from_slab_page() should be used instead.
      
      Also I doubt that it ever worked correctly: virt_to_head_page() should
      be used instead of virt_to_page().  Otherwise objects residing on tail
      pages are not accounted, because only the head page contains a valid
      mem_cgroup pointer.  That has been the case since the introduction of
      these counters in commit 68d48e6a ("mm: workingset: add vmstat
      counter for shadow nodes").
      
      Link: http://lkml.kernel.org/r/20190801233532.138743-1-guro@fb.com
      Fixes: 4d96ba35 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: document zone device struct page field usage · 76470ccd
      Ralph Campbell authored
      Patch series "mm/hmm: fixes for device private page migration", v3.
      
      Testing the latest linux git tree turned up a few bugs with page
      migration to and from ZONE_DEVICE private and anonymous pages.  This
      series fixes them and hopefully clarifies how ZONE_DEVICE private
      struct pages use the same mapping and index fields as the source
      anonymous page mapping.
      
      This patch (of 3):
      
      Struct page for ZONE_DEVICE private pages uses the page->mapping and
      page->index fields while the source anonymous pages are migrated to
      device private memory.  This is so rmap_walk() can find the page when
      migrating the ZONE_DEVICE private page back to system memory.
      ZONE_DEVICE pmem backed fsdax pages also use the page->mapping and
      page->index fields when files are mapped into a process address space.
      
      Add comments to struct page and remove the unused "_zd_pad_1" field to
      make this more clear.
      
      Link: http://lkml.kernel.org/r/20190724232700.23327-2-rcampbell@nvidia.com
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib: logic_pio: Add logic_pio_unregister_range() · b884e2de
      John Garry authored

      Add a function to unregister a logical PIO range.
      
      Logical PIO space can still be leaked when unregistering certain
      LOGIC_PIO_CPU_MMIO regions, but this is acceptable for now since
      there are no callers that unregister LOGIC_PIO_CPU_MMIO regions, and
      the logical PIO region allocation scheme would need significant work
      to improve this.
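
      The new interface amounts to a single unregister call, sketched here
      from the changelog (see lib/logic_pio.c for the implementation):

      	void logic_pio_unregister_range(struct logic_pio_hwaddr *range);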
      
      Cc: stable@vger.kernel.org
      Signed-off-by: John Garry <john.garry@huawei.com>
      Signed-off-by: Wei Xu <xuwei5@hisilicon.com>
  17. 12 Aug, 2019 1 commit
  18. 11 Aug, 2019 1 commit
  19. 10 Aug, 2019 1 commit
    • dma-mapping: fix page attributes for dma_mmap_* · 33dcb37c
      Christoph Hellwig authored
      All the way back to the introduction of dma_common_mmap we've
      defaulted to marking the pages as uncached.  But this is wrong for
      DMA coherent devices.  Later on, DMA_ATTR_WRITE_COMBINE also got
      incorrect treatment, as that flag is only treated specially on the
      alloc side for non-coherent devices.
      
      Introduce a new dma_pgprot helper that deals with the check for coherent
      devices so that only the remapping cases ever reach arch_dma_mmap_pgprot
      and we thus ensure no aliasing of page attributes happens, which makes
      the powerpc version of arch_dma_mmap_pgprot obsolete and simplifies the
      remaining ones.
      
      Note that this means arch_dma_mmap_pgprot is a bit misnamed now, but
      we'll phase it out soon.
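
      A simplified sketch of the new helper's logic (condensed; the real
      version in kernel/dma/mapping.c handles a few more cases):

      	pgprot_t dma_pgprot(struct device *dev, pgprot_t prot,
      			    unsigned long attrs)
      	{
      		if (dev_is_dma_coherent(dev))
      			return prot;	/* keep cacheable attributes */
      		if (IS_ENABLED(CONFIG_ARCH_HAS_DMA_MMAP_PGPROT))
      			return arch_dma_mmap_pgprot(dev, prot, attrs);
      		return pgprot_noncached(prot);
      	}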
      
      Fixes: 64ccc9c0 ("common: dma-mapping: add support for generic dma_mmap_* calls")
      Reported-by: Shawn Anastasio <shawn@anastas.io>
      Reported-by: Gavin Li <git@thegavinli.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com> # arm64
  20. 09 Aug, 2019 1 commit
    • net/tls: prevent skb_orphan() from leaking TLS plain text with offload · 41477662
      Jakub Kicinski authored
      sk_validate_xmit_skb() and drivers depend on the sk member of
      struct sk_buff to identify segments requiring encryption.
      Any operation which removes or does not preserve the original TLS
      socket, such as skb_orphan() or skb_clone(), will cause clear text
      leaks.
      
      Make the TCP socket underlying an offloaded TLS connection
      mark all skbs as decrypted if TLS TX is in offload mode.
      Then in sk_validate_xmit_skb() catch skbs which have no socket
      (or a socket with no validation) and the decrypted flag set.
      
      Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
      sk->sk_validate_xmit_skb are slightly interchangeable right now,
      they all imply TLS offload. The new checks are guarded by
      CONFIG_TLS_DEVICE because that's the option guarding the
      sk_buff->decrypted member.
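
      A simplified sketch of the added guard in sk_validate_xmit_skb(),
      taken conceptually rather than verbatim from the patch:

      	#ifdef CONFIG_TLS_DEVICE
      		/* skb still marked decrypted but no TLS socket left
      		 * to re-encrypt it: drop rather than leak plain text */
      		if (unlikely(skb->decrypted)) {
      			kfree_skb(skb);
      			return NULL;
      		}
      	#endif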
      
      Second, smaller issue with orphaning is that it breaks
      the guarantee that packets will be delivered to device
      queues in-order. All TLS offload drivers depend on that
      scheduling property. This means skb_orphan_partial()'s
      trick of preserving partial socket references will cause
      issues in the drivers. We need a full orphan, and as a
      result netem delay/throttling will cause all TLS offload
      skbs to be dropped.
      
      Reusing the sk_buff->decrypted flag also protects from
      leaking clear text when incoming, decrypted skb is redirected
      (e.g. by TC).
      
      See commit 0608c69c ("bpf: sk_msg, sock{map|hash} redirect
      through ULP") for justification why the internal flag is safe.
      The only location through which the flag could leak in is
      tcp_bpf_sendmsg(), which is taken care of by clearing the
      previously unused bit.
      
      v2:
       - remove superfluous decrypted mark copy (Willem);
       - remove the stale doc entry (Boris);
       - rely entirely on EOR marking to prevent coalescing (Boris);
       - use an internal sendpages flag instead of marking the socket
         (Boris).
      v3 (Willem):
       - reorganize the can_skb_orphan_partial() condition;
       - fix the flag leak-in through tcp_bpf_sendmsg.
      
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 08 Aug, 2019 2 commits
  22. 05 Aug, 2019 3 commits
    • KVM: no need to check return value of debugfs_create functions · 3e7093d0
      Greg KH authored

      When calling debugfs functions, there is no need to ever check the
      return value.  The function can work or not, but the code logic should
      never do something different based on this.
      
      Also, when doing this, change kvm_arch_create_vcpu_debugfs() to
      return void instead of an integer, as we should not care at all
      about whether this function actually does anything or not.
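
      The resulting pattern, sketched with illustrative names for the
      file, parent dentry and fops:

      	/* create it and move on; never branch on the return value */
      	debugfs_create_file("vcpu-state", 0444, parent, vcpu,
      			    &vcpu_state_fops);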
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: <x86@kernel.org>
      Cc: <kvm@vger.kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: remove kvm_arch_has_vcpu_debugfs() · 741cbbae
      Paolo Bonzini authored

      There is no need for this function, as all arches have to implement
      kvm_arch_create_vcpu_debugfs() no matter what.  A #define symbol
      lets us actually simplify the code.
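
      The mechanism, sketched under the assumption that the guard is the
      arch-provided __KVM_HAVE_ARCH_VCPU_DEBUGFS define:

      	#ifdef __KVM_HAVE_ARCH_VCPU_DEBUGFS
      	void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu);
      	#else
      	static inline void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu)
      	{
      	}
      	#endif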
      
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Fix leak vCPU's VMCS value into other pCPU · 17e433b5
      Wanpeng Li authored
      After commit d73eb57b (KVM: Boost vCPUs that are delivering
      interrupts), a five-year-old bug is exposed.  Running the ebizzy
      benchmark in three 80-vCPU VMs on one 80-pCPU Skylake server produced
      a lot of rcu_sched stall warnings splatting in the VMs after stress
      testing:
      
       INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
       Call Trace:
         flush_tlb_mm_range+0x68/0x140
         tlb_flush_mmu.part.75+0x37/0xe0
         tlb_finish_mmu+0x55/0x60
         zap_page_range+0x142/0x190
         SyS_madvise+0x3cd/0x9c0
         system_call_fastpath+0x1c/0x21
      
      swait_active() remains true until finish_swait() is called in
      kvm_vcpu_block(), so voluntarily preempted vCPUs are taken into
      account by the kvm_vcpu_on_spin() loop.  This greatly increases the
      probability that the condition kvm_arch_vcpu_runnable(vcpu) is
      checked and found true.  When APICv is enabled, the yield-candidate
      vCPU's VMCS RVI field then leaks (via vmx_sync_pir_to_irr()) into
      the current VMCS of the vCPU spinning on a taken lock.

      This patch fixes it by conservatively checking only a subset of
      events.
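
      A condensed sketch of the conservative x86 check; the actual patch
      also consults APICv pending-interrupt state through a new callback:

      	bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu)
      	{
      		/* only events that can be tested without touching
      		 * another pCPU's loaded VMCS */
      		if (READ_ONCE(vcpu->arch.pv.pv_unhalted))
      			return true;
      		if (kvm_test_request(KVM_REQ_NMI, vcpu) ||
      		    kvm_test_request(KVM_REQ_EVENT, vcpu))
      			return true;
      		return false;
      	}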
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Marc Zyngier <Marc.Zyngier@arm.com>
      Cc: stable@vger.kernel.org
      Fixes: 98f4a146 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  23. 03 Aug, 2019 1 commit
  24. 02 Aug, 2019 1 commit
  25. 01 Aug, 2019 1 commit
    • pidfd: add P_PIDFD to waitid() · 3695eae5
      Christian Brauner authored

      This adds the P_PIDFD type to waitid().
      One of the last remaining bits for the pidfd api is to make it possible
      to wait on pidfds. With P_PIDFD added to waitid() the parts of userspace
      that want to use the pidfd api to exclusively manage processes can do so
      now.
      
      One of the things this will unblock in the future is the ability to
      retrieve the exit status via waitid(P_PIDFD) for non-parent
      processes if handed a _suitable_ pidfd that has this feature set.
      This is similar to what you can do on FreeBSD with kqueue().  It
      might even end up being possible to wait on a process as a non-parent if
      an appropriate property is enabled on the pidfd.
      
      With P_PIDFD no scoping of the process identified by the pidfd is
      possible, i.e. it explicitly blocks things such as wait4(-1), wait4(0),
      waitid(P_ALL), waitid(P_PGID) etc. It only allows for semantics
      equivalent to wait4(pid), waitid(P_PID). Users that need scoping should
      rely on pid-based wait*() syscalls for now.
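
      A hedged userspace sketch (child_pid and handle_exit() are
      illustrative; on older libcs P_PIDFD may need a manual define of 3
      from the kernel UAPI, and pidfd_open() a raw syscall):

      	#include <sys/syscall.h>
      	#include <sys/wait.h>
      	#include <unistd.h>

      	int pidfd = syscall(SYS_pidfd_open, child_pid, 0);
      	siginfo_t info = { 0 };

      	if (waitid(P_PIDFD, pidfd, &info, WEXITED) == 0)
      		handle_exit(info.si_status);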
      
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Andy Lutomirsky <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Link: https://lore.kernel.org/r/20190727222229.6516-2-christian@brauner.io