1. 27 May, 2021 2 commits
  2. 05 May, 2021 1 commit
  3. 21 Apr, 2021 2 commits
    • Wanpeng Li's avatar
      KVM: Boost vCPU candidate in user mode which is delivering interrupt · 52acd22f
      Wanpeng Li authored
      Both lock holder vCPU and IPI receiver that has halted are condidate for
      boost. However, the PLE handler was originally designed to deal with the
      lock holder preemption problem. The Intel PLE occurs when the spinlock
      waiter is in kernel mode. This assumption doesn't hold for IPI receiver,
      they can be in either kernel or user mode. the vCPU candidate in user mode
      will not be boosted even if they should respond to IPIs. Some benchmarks
      like pbzip2, swaptions etc do the TLB shootdown in kernel mode and most
      of the time they are running in user mode. It can lead to a large number
      of continuous PLE events because the IPI sender causes PLE events
      repeatedly until the receiver is scheduled while the receiver is not
      candidate for a boost.
      This patch boosts the vCPU candidiate in user mode which is delivery
      interrupt. We can observe the speed of pbzip2 improves 10% in 96 vCPUs
      VM in over-subscribe scenario (The host machine is 2 socket, 48 cores,
      96 HTs Intel CLX box). There is no performance regression for other
      benchmarks like Unixbench spawn (most of the time contend read/write
      lock in kernel mode), ebizzy (most of the time contend read/write sem
      and TLB shoodtdown in kernel mode).
      Signed-off-by: default avatarWanpeng Li <wanpengli@tencent.com>
      Message-Id: <1618542490-14756-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
    • Nathan Tempelman's avatar
      KVM: x86: Support KVM VMs sharing SEV context · 54526d1f
      Nathan Tempelman authored
      Add a capability for userspace to mirror SEV encryption context from
      one vm to another. On our side, this is intended to support a
      Migration Helper vCPU, but it can also be used generically to support
      other in-guest workloads scheduled by the host. The intention is for
      the primary guest and the mirror to have nearly identical memslots.
      The primary benefits of this are that:
      1) The VMs do not share KVM contexts (think APIC/MSRs/etc), so they
      can't accidentally clobber each other.
      2) The VMs can have different memory-views, which is necessary for post-copy
      migration (the migration vCPUs on the target need to read and write to
      pages, when the primary guest would VMEXIT).
      This does not change the threat model for AMD SEV. Any memory involved
      is still owned by the primary guest and its initial state is still
      attested to through the normal SEV_LAUNCH_* flows. If userspace wanted
      to circumvent SEV, they could achieve the same effect by simply attaching
      a vCPU to the primary VM.
      This patch deliberately leaves userspace in charge of the memslots for the
      mirror, as it already has the power to mess with them in the primary guest.
      This patch does not support SEV-ES (much less SNP), as it does not
      handle handing off attested VMSAs to the mirror.
      For additional context, we need a Migration Helper because SEV PSP
      migration is far too slow for our live migration on its own. Using
      an in-guest migrator lets us speed this up significantly.
      Signed-off-by: default avatarNathan Tempelman <natet@google.com>
      Message-Id: <20210408223214.2582277-1-natet@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
  4. 20 Apr, 2021 1 commit
    • Sean Christopherson's avatar
      KVM: Stop looking for coalesced MMIO zones if the bus is destroyed · 5d3c4c79
      Sean Christopherson authored
      Abort the walk of coalesced MMIO zones if kvm_io_bus_unregister_dev()
      fails to allocate memory for the new instance of the bus.  If it can't
      instantiate a new bus, unregister_dev() destroys all devices _except_ the
      target device.   But, it doesn't tell the caller that it obliterated the
      bus and invoked the destructor for all devices that were on the bus.  In
      the coalesced MMIO case, this can result in a deleted list entry
      dereference due to attempting to continue iterating on coalesced_zones
      after future entries (in the walk) have been deleted.
      Opportunistically add curly braces to the for-loop, which encompasses
      many lines but sneaks by without braces due to the guts being a single
      if statement.
      Fixes: f6588660
       ("KVM: fix memory leak in kvm_io_bus_unregister_dev()")
      Cc: stable@vger.kernel.org
      Reported-by: default avatarHao Sun <sunhao.th@gmail.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20210412222050.876100-3-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
  5. 19 Apr, 2021 1 commit
  6. 17 Apr, 2021 4 commits
  7. 22 Feb, 2021 1 commit
    • David Stevens's avatar
      KVM: x86/mmu: Consider the hva in mmu_notifier retry · 4a42d848
      David Stevens authored
      Track the range being invalidated by mmu_notifier and skip page fault
      retries if the fault address is not affected by the in-progress
      invalidation. Handle concurrent invalidations by finding the minimal
      range which includes all ranges being invalidated. Although the combined
      range may include unrelated addresses and cannot be shrunk as individual
      invalidation operations complete, it is unlikely the marginal gains of
      proper range tracking are worth the additional complexity.
      The primary benefit of this change is the reduction in the likelihood of
      extreme latency when handing a page fault due to another thread having
      been preempted while modifying host virtual addresses.
      Signed-off-by: default avatarDavid Stevens <stevensd@chromium.org>
      Message-Id: <20210222024522.1751719-3-stevensd@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
  8. 09 Feb, 2021 1 commit
    • Vitaly Kuznetsov's avatar
      KVM: Raise the maximum number of user memslots · 4fc096a9
      Vitaly Kuznetsov authored
      Current KVM_USER_MEM_SLOTS limits are arch specific (512 on Power, 509 on x86,
      32 on s390, 16 on MIPS) but they don't really need to be. Memory slots are
      allocated dynamically in KVM when added so the only real limitation is
      'id_to_index' array which is 'short'. We don't have any other
      KVM_MEM_SLOTS_NUM/KVM_USER_MEM_SLOTS-sized statically defined structures.
      Low KVM_USER_MEM_SLOTS can be a limiting factor for some configurations.
      In particular, when QEMU tries to start a Windows guest with Hyper-V SynIC
      enabled and e.g. 256 vCPUs the limit is hit as SynIC requires two pages per
      vCPU and the guest is free to pick any GFN for each of them, this fragments
      memslots as QEMU wants to have a separate memslot for each of these pages
      (which are supposed to act as 'overlay' pages).
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210127175731.2020089-3-vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
  9. 04 Feb, 2021 1 commit
  10. 15 Nov, 2020 4 commits
    • Peter Xu's avatar
      KVM: Don't allocate dirty bitmap if dirty ring is enabled · 044c59c4
      Peter Xu authored
      Because kvm dirty rings and kvm dirty log is used in an exclusive way,
      Let's avoid creating the dirty_bitmap when kvm dirty ring is enabled.
      At the meantime, since the dirty_bitmap will be conditionally created
      now, we can't use it as a sign of "whether this memory slot enabled
      dirty tracking".  Change users like that to check against the kvm
      memory slot flags.
      Note that there still can be chances where the kvm memory slot got its
      dirty_bitmap allocated, _if_ the memory slots are created before
      enabling of the dirty rings and at the same time with the dirty
      tracking capability enabled, they'll still with the dirty_bitmap.
      However it should not hurt much (e.g., the bitmaps will always be
      freed if they are there), and the real users normally won't trigger
      this because dirty bit tracking flag should in most cases only be
      applied to kvm slots only before migration starts, that should be far
      latter than kvm initializes (VM starts).
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Message-Id: <20201001012226.5868-1-peterx@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
    • Peter Xu's avatar
      KVM: X86: Implement ring-based dirty memory tracking · fb04a1ed
      Peter Xu authored
      This patch is heavily based on previous work from Lei Cao
      <lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
      KVM currently uses large bitmaps to track dirty memory.  These bitmaps
      are copied to userspace when userspace queries KVM for its dirty page
      information.  The use of bitmaps is mostly sufficient for live
      migration, as large parts of memory are be dirtied from one log-dirty
      pass to another.  However, in a checkpointing system, the number of
      dirty pages is small and in fact it is often bounded---the VM is
      paused when it has dirtied a pre-defined number of pages. Traversing a
      large, sparsely populated bitmap to find set bits is time-consuming,
      as is copying the bitmap to user-space.
      A similar issue will be there for live migration when the guest memory
      is huge while the page dirty procedure is trivial.  In that case for
      each dirty sync we need to pull the whole dirty bitmap to userspace
      and analyse every bit even if it's mostly zeros.
      The preferred data structure for above scenarios is a dense list of
      guest frame numbers (GFN).  This patch series stores the dirty list in
      kernel memory that can be memory mapped into userspace to allow speedy
      This patch enables dirty ring for X86 only.  However it should be
      easily extended to other archs as well.
      [1] https://patchwork.kernel.org/patch/10471409/
      Signed-off-by: default avatarLei Cao <lei.cao@stratus.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Message-Id: <20201001012222.5767-1-peterx@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
    • Peter Xu's avatar
      KVM: Pass in kvm pointer into mark_page_dirty_in_slot() · 28bd726a
      Peter Xu authored
      The context will be needed to implement the kvm dirty ring.
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Message-Id: <20201001012044.5151-5-peterx@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
    • Paolo Bonzini's avatar
      KVM: remove kvm_clear_guest_page · 2f541442
      Paolo Bonzini authored
      kvm_clear_guest_page is not used anymore after "KVM: X86: Don't track dirty
      for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]", except from kvm_clear_guest.
      We can just inline it in its sole user.
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
  11. 23 Oct, 2020 1 commit
  12. 21 Oct, 2020 1 commit
  13. 21 Aug, 2020 2 commits
  14. 24 Jul, 2020 1 commit
    • Thomas Gleixner's avatar
      entry: Provide infrastructure for work before transitioning to guest mode · 935ace2f
      Thomas Gleixner authored
      Entering a guest is similar to exiting to user space. Pending work like
      handling signals, rescheduling, task work etc. needs to be handled before
      Provide generic infrastructure to avoid duplication of the same handling
      code all over the place.
      The transfer to guest mode handling is different from the exit to usermode
      handling, e.g. vs. rseq and live patching, so a separate function is used.
      The initial list of work items handled is:
      Architecture specific TIF flags can be added via defines in the
      architecture specific include files.
      The calling convention is also different from the syscall/interrupt entry
      functions as KVM invokes this from the outer vcpu_run() loop with
      interrupts and preemption enabled. To prevent missing a pending work item
      it invokes a check for pending TIF work from interrupt disabled code right
      before transitioning to guest mode. The lockdep, RCU and tracing state
      handling is also done directly around the switch to and from guest mode.
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20200722220519.833296398@linutronix.de
  15. 09 Jul, 2020 2 commits
  16. 08 Jul, 2020 1 commit
  17. 16 Jun, 2020 1 commit
  18. 11 Jun, 2020 1 commit
  19. 08 Jun, 2020 1 commit
    • Eiichi Tsukata's avatar
      KVM: x86: Fix APIC page invalidation race · e649b3f0
      Eiichi Tsukata authored
      Commit b1394e74 ("KVM: x86: fix APIC page invalidation") tried
      to fix inappropriate APIC page invalidation by re-introducing arch
      specific kvm_arch_mmu_notifier_invalidate_range() and calling it from
      kvm_mmu_notifier_invalidate_range_start. However, the patch left a
      possible race where the VMCS APIC address cache is updated *before*
      it is unmapped:
        (Invalidator) kvm_mmu_notifier_invalidate_range_start()
        (Invalidator) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD)
        (KVM VCPU) vcpu_enter_guest()
        (KVM VCPU) kvm_vcpu_reload_apic_access_page()
        (Invalidator) actually unmap page
      Because of the above race, there can be a mismatch between the
      host physical address stored in the APIC_ACCESS_PAGE VMCS field and
      the host physical address stored in the EPT entry for the APIC GPA
      (0xfee0000).  When this happens, the processor will not trap APIC
      accesses, and will instead show the raw contents of the APIC-access page.
      Because Windows OS periodically checks for unexpected modifications to
      the LAPIC register, this will show up as a BSOD crash with BugCheck
      CRITICAL_STRUCTURE_CORRUPTION (109) we are currently seeing in
      The root cause of the issue is that kvm_arch_mmu_notifier_invalidate_range()
      cannot guarantee that no additional references are taken to the pages in
      the range before kvm_mmu_notifier_invalidate_range_end().  Fortunately,
      this case is supported by the MMU notifier API, as documented in
      	 * If the subsystem
               * can't guarantee that no additional references are taken to
               * the pages in the range, it has to implement the
               * invalidate_range() notifier to remove any references taken
               * after invalidate_range_start().
      The fix therefore is to reload the APIC-access page field in the VMCS
      from kvm_mmu_notifier_invalidate_range() instead of ..._range_start().
      Cc: stable@vger.kernel.org
      Fixes: b1394e74 ("KVM: x86: fix APIC page invalidation")
      Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=197951
      Signed-off-by: default avatarEiichi Tsukata <eiichi.tsukata@nutanix.com>
      Message-Id: <20200606042627.61070-1-eiichi.tsukata@nutanix.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
  20. 04 Jun, 2020 1 commit
    • Paolo Bonzini's avatar
      KVM: let kvm_destroy_vm_debugfs clean up vCPU debugfs directories · d56f5136
      Paolo Bonzini authored
      After commit 63d04348
       ("KVM: x86: move kvm_create_vcpu_debugfs after
      last failure point") we are creating the pre-vCPU debugfs files
      after the creation of the vCPU file descriptor.  This makes it
      possible for userspace to reach kvm_vcpu_release before
      kvm_create_vcpu_debugfs has finished.  The vcpu->debugfs_dentry
      then does not have any associated inode anymore, and this causes
      a NULL-pointer dereference in debugfs_create_file.
      The solution is simply to avoid removing the files; they are
      cleaned up when the VM file descriptor is closed (and that must be
      after KVM_CREATE_VCPU returns).  We can stop storing the dentry
      in struct kvm_vcpu too, because it is not needed anywhere after
      kvm_create_vcpu_debugfs returns.
      Reported-by: default avatar <syzbot+705f4401d5a93a59b87d@syzkaller.appspotmail.com>
      Fixes: 63d04348
       ("KVM: x86: move kvm_create_vcpu_debugfs after last failure point")
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
  21. 01 Jun, 2020 1 commit
  22. 13 May, 2020 1 commit
    • Davidlohr Bueso's avatar
      kvm: Replace vcpu->swait with rcuwait · da4ad88c
      Davidlohr Bueso authored
      The use of any sort of waitqueue (simple or regular) for
      wait/waking vcpus has always been an overkill and semantically
      wrong. Because this is per-vcpu (which is blocked) there is
      only ever a single waiting vcpu, thus no need for any sort of
      As such, make use of the rcuwait primitive, with the following
        - rcuwait already provides the proper barriers that serialize
        concurrent waiter and waker.
        - Task wakeup is done in rcu read critical region, with a
        stable task pointer.
        - Because there is no concurrency among waiters, we need
        not worry about rcuwait_wait_event() calls corrupting
        the wait->task. As a consequence, this saves the locking
        done in swait when modifying the queue. This also applies
        to per-vcore wait for powerpc kvm-hv.
      The x86 tscdeadline_latency test mentioned in 8577370f
      ("KVM: Use simple waitqueue for vcpu->wq") shows that, on avg,
      latency is reduced by around 15-20% with this change.
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: kvmarm@lists.cs.columbia.edu
      Cc: linux-mips@vger.kernel.org
      Reviewed-by: default avatarMarc Zyngier <maz@kernel.org>
      Signed-off-by: default avatarDavidlohr Bueso <dbueso@suse.de>
      Message-Id: <20200424054837.5138-6-dave@stgolabs.net>
      [Avoid extra logic changes. - Paolo]
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
  23. 08 May, 2020 1 commit
  24. 24 Apr, 2020 1 commit
  25. 21 Apr, 2020 3 commits
  26. 14 Apr, 2020 1 commit
    • Sean Christopherson's avatar
      KVM: Check validity of resolved slot when searching memslots · b6467ab1
      Sean Christopherson authored
      Check that the resolved slot (somewhat confusingly named 'start') is a
      valid/allocated slot before doing the final comparison to see if the
      specified gfn resides in the associated slot.  The resolved slot can be
      invalid if the binary search loop terminated because the search index
      was incremented beyond the number of used slots.
      This bug has existed since the binary search algorithm was introduced,
      but went unnoticed because KVM statically allocated memory for the max
      number of slots, i.e. the access would only be truly out-of-bounds if
      all possible slots were allocated and the specified gfn was less than
      the base of the lowest memslot.  Commit 36947254 ("KVM: Dynamically
      size memslot array based on number of used slots") eliminated the "all
      possible slots allocated" condition and made the bug embarrasingly easy
      to hit.
      Fixes: 9c1a5d38
       ("kvm: optimize GFN to memslot lookup with large slots amount")
      Reported-by: default avatar <syzbot+d889b59b2bb87d4047a2@syzkaller.appspotmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200408064059.8957-2-sean.j.christopherson@intel.com>
      Reviewed-by: default avatarCornelia Huck <cohuck@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
  27. 31 Mar, 2020 1 commit
  28. 26 Mar, 2020 1 commit