1. 28 May, 2021 3 commits
  2. 27 May, 2021 8 commits
      KVM: x86/mmu: Fix comment mentioning skip_4k · bedd9195
      David Matlack authored
      
      
      This comment was left over from a previous version of the patch that
      introduced wrprot_gfn_range, when skip_4k was passed in instead of
      min_level.
      
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210526163227.3113557-1-dmatlack@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: VMX: update vcpu posted-interrupt descriptor when assigning device · a2486020
      Marcelo Tosatti authored
      
      
      For VMX, when a vcpu enters HLT emulation, pi_post_block will:
      
1) Add the vcpu to the per-CPU list of blocked vcpus.

2) Program the posted-interrupt descriptor "notification vector"
to POSTED_INTR_WAKEUP_VECTOR.
      
With interrupt remapping, an interrupt will set the PIR bit for the
vector programmed for the device on the CPU, test-and-set the ON bit
of the posted-interrupt descriptor, and, if the ON bit was clear,
generate an interrupt for the notification vector.
      
      This way, the target CPU wakes upon a device interrupt and wakes up
      the target vcpu.
      
The problem is that pi_post_block only programs the notification vector
if kvm_arch_has_assigned_device() is true. It's possible for the
following to happen:
      
      1) vcpu V HLTs on pcpu P, kvm_arch_has_assigned_device is false,
      notification vector is not programmed
      2) device is assigned to VM
      3) device interrupts vcpu V, sets ON bit
      (notification vector not programmed, so pcpu P remains in idle)
      4) vcpu 0 IPIs vcpu V (in guest), but since pi descriptor ON bit is set,
      kvm_vcpu_kick is skipped
      5) vcpu 0 busy spins on vcpu V's response for several seconds, until
      RCU watchdog NMIs all vCPUs.
      
      To fix this, use the start_assignment kvm_x86_ops callback to kick
      vcpus out of the halt loop, so the notification vector is
      properly reprogrammed to the wakeup vector.
      
Reported-by: Pei Zhang <pezhang@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Message-Id: <20210526172014.GA29007@fuller.cnet>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK · 084071d5
      Marcelo Tosatti authored
      
      
KVM_REQ_UNBLOCK will be used to exit a vcpu from its inner vcpu halt
emulation loop.

Rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK, and switch PowerPC to
an arch-specific request bit.
      
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Message-Id: <20210525134321.303768132@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: x86: add start_assignment hook to kvm_x86_ops · 57ab8794
      Marcelo Tosatti authored
      
      
      Add a start_assignment hook to kvm_x86_ops, which is called when
      kvm_arch_start_assignment is done.
      
      The hook is required to update the wakeup vector of a sleeping vCPU
      when a device is assigned to the guest.
      
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Message-Id: <20210525134321.254128742@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: LAPIC: Narrow the timer latency between wait_lapic_expire and world switch · 9805cf03
      Wanpeng Li authored
      
      
Treat the lapic_timer_advance_ns automatic tuning logic as hypervisor
overhead and move it before wait_lapic_expire, instead of between
wait_lapic_expire and the world switch; the wait duration should be
calculated from the up-to-date guest_tsc, after the overhead of the
automatic tuning logic. This patch reduces latency by ~30+ cycles for
kvm-unit-tests/tscdeadline-latency when testing busy waits.
      
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1621339235-11131-5-git-send-email-wanpengli@tencent.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: X86: hyper-v: Take srcu lock when accessing kvm_memslots() · da6d63a0
      Wanpeng Li authored
         WARNING: suspicious RCU usage
         5.13.0-rc1 #4 Not tainted
         -----------------------------
         ./include/linux/kvm_host.h:710 suspicious rcu_dereference_check() usage!
      
        other info that might help us debug this:
      
        rcu_scheduler_active = 2, debug_locks = 1
         1 lock held by hyperv_clock/8318:
          #0: ffffb6b8cb05a7d8 (&hv->hv_lock){+.+.}-{3:3}, at: kvm_hv_invalidate_tsc_page+0x3e/0xa0 [kvm]
      
        stack backtrace:
        CPU: 3 PID: 8318 Comm: hyperv_clock Not tainted 5.13.0-rc1 #4
        Call Trace:
         dump_stack+0x87/0xb7
         lockdep_rcu_suspicious+0xce/0xf0
         kvm_write_guest_page+0x1c1/0x1d0 [kvm]
         kvm_write_guest+0x50/0x90 [kvm]
         kvm_hv_invalidate_tsc_page+0x79/0xa0 [kvm]
         kvm_gen_update_masterclock+0x1d/0x110 [kvm]
         kvm_arch_vm_ioctl+0x2a7/0xc50 [kvm]
         kvm_vm_ioctl+0x123/0x11d0 [kvm]
         __x64_sys_ioctl+0x3ed/0x9d0
         do_syscall_64+0x3d/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      kvm_memslots() will be called by kvm_write_guest(), so we should take the srcu lock.
      
Fixes: e880c6ea ("KVM: x86: hyper-v: Prevent using not-yet-updated TSC page by secondary CPUs")
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1621339235-11131-4-git-send-email-wanpengli@tencent.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: X86: Fix vCPU preempted state from guest's point of view · 1eff0ada
      Wanpeng Li authored
Commit 66570e96 ("kvm: x86: only provide PV features if enabled in guest's
CPUID") avoids accessing the PV TLB shootdown host-side logic when this PV
feature is not exposed to the guest. However, kvm_steal_time.preempted is
not only leveraged by the PV TLB shootdown logic, it also mitigates the
lock holder preemption issue. From the guest's point of view, the vCPU
always appears preempted, since the reset of kvm_steal_time.preempted
before vmentry is lost when the PV TLB shootdown feature is not exposed.
This patch fixes it by clearing kvm_steal_time.preempted before vmentry.
      
Fixes: 66570e96 ("kvm: x86: only provide PV features if enabled in guest's CPUID")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1621339235-11131-3-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: X86: Bail out of direct yield in case of under-committed scenarios · 72b268a8
      Wanpeng Li authored
      
      
In under-committed scenarios, vCPUs can be scheduled easily;
kvm_vcpu_yield_to() adds extra overhead, and it is also common to see
vcpu->ready set to true while the yield later fails because p->state is
TASK_RUNNING.

Bail out in such scenarios by checking the length of the current CPU's
runqueue, which can be treated as a hint of under-commitment rather than
a guarantee of accuracy. 30%+ of directed-yield attempts can now avoid
the expensive lookups in kvm_sched_yield() in an under-committed scenario.
      
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1621339235-11131-2-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 24 May, 2021 3 commits
  4. 21 May, 2021 1 commit
  5. 19 May, 2021 4 commits
  6. 18 May, 2021 6 commits
  7. 17 May, 2021 1 commit
      quota: Disable quotactl_path syscall · 5b9fedb3
      Jan Kara authored
In commit fa8b9007 ("quota: wire up quotactl_path") we wired up the new
quotactl_path syscall. However, some people in the LWN discussion
objected that the path-based syscall is missing the dirfd and flags
arguments that are mostly standard for contemporary path-based syscalls.
They have a point, and after a discussion with Christian Brauner and
Sascha Hauer I've decided to disable the syscall for now and update its
API. Since there is no userspace currently using the syscall and it
hasn't shipped in any major release, we should be fine.
      
      CC: Christian Brauner <christian.brauner@ubuntu.com>
      CC: Sascha Hauer <s.hauer@pengutronix.de>
      Link: https://lore.kernel.org/lkml/20210512153621.n5u43jsytbik4yze@wittgenstein
      
      
Signed-off-by: Jan Kara <jack@suse.cz>
  8. 14 May, 2021 1 commit
  9. 13 May, 2021 1 commit
  10. 12 May, 2021 1 commit
  11. 10 May, 2021 3 commits
  12. 07 May, 2021 8 commits
      KVM: SVM: Move GHCB unmapping to fix RCU warning · ce7ea0cf
      Tom Lendacky authored
      When an SEV-ES guest is running, the GHCB is unmapped as part of the
      vCPU run support. However, kvm_vcpu_unmap() triggers an RCU dereference
      warning with CONFIG_PROVE_LOCKING=y because the SRCU lock is released
      before invoking the vCPU run support.
      
      Move the GHCB unmapping into the prepare_guest_switch callback, which is
      invoked while still holding the SRCU lock, eliminating the RCU dereference
      warning.
      
Fixes: 291bd20d ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT")
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <b2f9b79d15166f2c3e4375c0d9bc3268b7696455.1620332081.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: SVM: Invert user pointer casting in SEV {en,de}crypt helpers · 368340a3
      Sean Christopherson authored
      
      
      Invert the user pointer params for SEV's helpers for encrypting and
      decrypting guest memory so that they take a pointer and cast to an
      unsigned long as necessary, as opposed to doing the opposite.  Tagging a
      non-pointer as __user is confusing and weird since a cast of some form
      needs to occur to actually access the user data.  This also fixes Sparse
      warnings triggered by directly consuming the unsigned longs, which are
      "noderef" due to the __user tag.
      
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210506231542.2331138-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: x86: Prevent deadlock against tk_core.seq · 3f804f6d
      Thomas Gleixner authored
      syzbot reported a possible deadlock in pvclock_gtod_notify():
      
CPU 0                                        CPU 1
write_seqcount_begin(&tk_core.seq);
  pvclock_gtod_notify()                      spin_lock(&pool->lock);
    queue_work(..., &pvclock_gtod_work)      ktime_get()
      spin_lock(&pool->lock);                  do {
                                                 seq = read_seqcount_begin(&tk_core.seq);
                                                 ...
                                               } while (read_seqcount_retry(&tk_core.seq, seq));
      
      While this is unlikely to happen, it's possible.
      
Delegate queue_work() to irq_work(), which postpones it until the
tk_core.seq write-held region is left and interrupts are reenabled.
      
Fixes: 16e8d74d ("KVM: x86: notifier for clocksource changes")
Reported-by: syzbot+6beae4000559d41d80f8@syzkaller.appspotmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Message-Id: <87h7jgm1zy.ffs@nanos.tec.linutronix.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: x86: Cancel pvclock_gtod_work on module removal · 594b27e6
      Thomas Gleixner authored
      Nothing prevents the following:
      
        pvclock_gtod_notify()
          queue_work(system_long_wq, &pvclock_gtod_work);
        ...
        remove_module(kvm);
        ...
        work_queue_run()
          pvclock_gtod_work()	<- UAF
      
      Ditto for any other operation on that workqueue list head which touches
      pvclock_gtod_work after module removal.
      
      Cancel the work in kvm_arch_exit() to prevent that.
      
Fixes: 16e8d74d ("KVM: x86: notifier for clocksource changes")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Message-Id: <87czu4onry.ffs@nanos.tec.linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: x86: Prevent KVM SVM from loading on kernels with 5-level paging · 03ca4589
      Sean Christopherson authored
      
      
Disallow loading KVM SVM if 5-level paging is supported. In theory, NPT
for L1 should simply work, but there are unknowns with respect to how the
guest's MAXPHYADDR will be handled by hardware.

Nested NPT is more problematic, as running an L1 VMM that is using
2-level page tables requires stacking single-entry PDP and PML4 tables in
KVM's NPT for L2, as there are no equivalent entries in L1's NPT to
shadow. Barring hardware magic, for 5-level paging KVM would need to
stack another layer to handle PML5.
      
      Opportunistically rename the lm_root pointer, which is used for the
      aforementioned stacking when shadowing 2-level L1 NPT, to pml4_root to
      call out that it's specifically for PML4.
      
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210505204221.1934471-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: X86: Expose bus lock debug exception to guest · 76ea438b
      Paolo Bonzini authored
      
      
The bus lock debug exception is a mechanism to notify the kernel via a
#DB trap after an instruction acquires a bus lock and is executed with
CPL>0. This allows the kernel to enforce user application throttling or
mitigations.

Existence of the bus lock debug exception is enumerated via
CPUID.(EAX=7,ECX=0).ECX[24]. Software can enable these exceptions by
setting bit 2 of MSR_IA32_DEBUGCTL. Expose the CPUID bit to the guest
and emulate the MSR handling when the guest enables it.
      
      Support for this feature was originally developed by Xiaoyao Li and
      Chenyi Qiang, but code has since changed enough that this patch has
      nothing in common with theirs, except for this commit message.
      
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20210202090433.13441-4-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: X86: Add support for the emulation of DR6_BUS_LOCK bit · e8ea85fb
      Chenyi Qiang authored
      
      
The bus lock debug exception introduces a new bit, DR6_BUS_LOCK (bit 11
of DR6), to indicate that a bus lock #DB exception was generated. The
set/clear behavior of DR6_BUS_LOCK is similar to that of DR6_RTM: the
processor clears DR6_BUS_LOCK when the exception is generated, and sets
the bit to 1 for all other #DB exceptions. The software #DB handler
should set this bit before returning to the interrupted task.

In the VMM, to avoid breaking CPUs without bus lock #DB exception
support, activate DR6_BUS_LOCK conditionally in the DR6_FIXED_1 bits.
When intercepting a #DB exception caused by a bus lock, bit 11 of the
exit qualification is set to identify it. The VMM should emulate the
exception by clearing bit 11 of the guest's DR6.
      
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20210202090433.13441-3-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: x86: Hide RDTSCP and RDPID if MSR_TSC_AUX probing failed · 78bba966
      Sean Christopherson authored
      
      
      If probing MSR_TSC_AUX failed, hide RDTSCP and RDPID, and WARN if either
      feature was reported as supported.  In theory, such a scenario should
      never happen as both Intel and AMD state that MSR_TSC_AUX is available if
      RDTSCP or RDPID is supported.  But, KVM injects #GP on MSR_TSC_AUX
      accesses if probing failed, faults on WRMSR(MSR_TSC_AUX) may be fatal to
      the guest (because they happen during early CPU bringup), and KVM itself
      has effectively misreported RDPID support in the past.
      
      Note, this also has the happy side effect of omitting MSR_TSC_AUX from
      the list of MSRs that are exposed to userspace if probing the MSR fails.
      
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-16-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>