1. 14 Sep, 2019 1 commit
  2. 22 Jul, 2019 2 commits
  3. 20 Jul, 2019 1 commit
    • KVM: nVMX: do not use dangling shadow VMCS after guest reset · 88dddc11
      Paolo Bonzini authored
      If a KVM guest is reset while running a nested guest, free_nested will
      disable the shadow VMCS execution control in the vmcs01.  However,
      on the next KVM_RUN vmx_vcpu_run would nevertheless try to sync
      the VMCS12 to the shadow VMCS which has since been freed.
      This causes a vmptrld of a NULL pointer on my machine, but Jan reports
      that the host hangs altogether.  Let's see how much this trivial patch fixes.
      Reported-by: Jan Kiszka <jan.kiszka@siemens.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. 15 Jul, 2019 1 commit
  5. 05 Jul, 2019 2 commits
    • KVM nVMX: Check Host Segment Registers and Descriptor Tables on vmentry of nested guests · 1ef23e1f
      Krish Sadhukhan authored
      According to section "Checks on Host Segment and Descriptor-Table
      Registers" in Intel SDM vol 3C, the following checks are performed on
      vmentry of nested guests:
         - In the selector field for each of CS, SS, DS, ES, FS, GS and TR, the
           RPL (bits 1:0) and the TI flag (bit 2) must be 0.
         - The selector fields for CS and TR cannot be 0000H.
         - The selector field for SS cannot be 0000H if the "host address-space
           size" VM-exit control is 0.
         - On processors that support Intel 64 architecture, the base-address
           fields for FS, GS and TR must contain canonical addresses.
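The four checks listed above can be sketched as a single predicate. This is a minimal, standalone illustration and not the code from the patch: `struct host_seg_state`, `is_canonical48()` and `host_segments_valid()` are hypothetical names, and the canonical check assumes 48-bit virtual addresses, whereas real hardware may use a different width.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Selector layout: bits 1:0 hold the RPL, bit 2 holds the TI flag. */
#define SEL_RPL_MASK 0x3u
#define SEL_TI_MASK  0x4u

/* Hypothetical container for the host-state fields being validated. */
struct host_seg_state {
    uint16_t cs, ss, ds, es, fs, gs, tr;  /* selector fields */
    uint64_t fs_base, gs_base, tr_base;   /* base-address fields */
    bool host_addr_space_size;            /* "host address-space size" VM-exit control */
};

/* Assumption: 48-bit virtual addresses, so bits 63:47 must all be equal. */
static bool is_canonical48(uint64_t addr)
{
    uint64_t upper = addr >> 47;          /* the 17 uppermost bits */

    return upper == 0 || upper == 0x1ffff;
}

static bool host_segments_valid(const struct host_seg_state *h)
{
    const uint16_t sels[] = { h->cs, h->ss, h->ds, h->es, h->fs, h->gs, h->tr };
    size_t i;

    /* RPL and TI must be 0 in every selector field. */
    for (i = 0; i < sizeof(sels) / sizeof(sels[0]); i++)
        if (sels[i] & (SEL_RPL_MASK | SEL_TI_MASK))
            return false;

    /* CS and TR selectors cannot be 0000H. */
    if (h->cs == 0 || h->tr == 0)
        return false;

    /* SS cannot be 0000H when "host address-space size" is 0. */
    if (!h->host_addr_space_size && h->ss == 0)
        return false;

    /* FS, GS and TR bases must be canonical. */
    return is_canonical48(h->fs_base) &&
           is_canonical48(h->gs_base) &&
           is_canonical48(h->tr_base);
}
```

A caller would fill in the structure from the host-state area of the VMCS being checked and fail the nested VM-entry when the predicate returns false.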
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Stash L1's CR3 in vmcs01.GUEST_CR3 on nested entry w/o EPT · f087a029
      Sean Christopherson authored
      KVM does not have 100% coverage of VMX consistency checks, i.e. some
      checks that cause VM-Fail may only be detected by hardware during a
      nested VM-Entry.  In such a case, KVM must restore L1's state to the
      pre-VM-Enter state as L2's state has already been loaded into KVM's
      software model.
      L1's CR3 and PDPTRs in particular are loaded from vmcs01.GUEST_*.  But
      when EPT is disabled, the associated fields hold KVM's shadow values,
      not L1's "real" values.  Fortunately, when EPT is disabled the PDPTRs
      come from memory, i.e. are not cached in the VMCS.  Which leaves CR3
      as the sole anomaly.
      A previously applied workaround to handle CR3 was to force nested early
      checks if EPT is disabled:
        commit 2b27924b ("KVM: nVMX: always use early vmcs check when EPT
                         is disabled")
      Forcing nested early checks is undesirable as doing so adds hundreds of
      cycles to every nested VM-Entry.  Rather than take this performance hit,
      handle CR3 by overwriting vmcs01.GUEST_CR3 with L1's CR3 during nested
      VM-Entry when EPT is disabled *and* nested early checks are disabled.
      By stuffing vmcs01.GUEST_CR3, nested_vmx_restore_host_state() will
      naturally restore the correct vcpu->arch.cr3 from vmcs01.GUEST_CR3.
      These shenanigans work because nested_vmx_restore_host_state() does a
      full kvm_mmu_reset_context(), i.e. unloads the current MMU, which
      guarantees vmcs01.GUEST_CR3 will be rewritten with a new shadow CR3
      prior to re-entering L1.
      vcpu->arch.root_mmu.root_hpa is set to INVALID_PAGE via:
          nested_vmx_restore_host_state() ->
              kvm_mmu_reset_context() ->
                  kvm_mmu_unload() ->
      kvm_mmu_unload() has WARN_ON(root_hpa != INVALID_PAGE), i.e. we can bank
      on 'root_hpa == INVALID_PAGE' unless the implementation of
      kvm_mmu_reset_context() is changed.
      On the way into L1, VMCS.GUEST_CR3 is guaranteed to be written (on a
      successful entry) via:
          vcpu_enter_guest() ->
              kvm_mmu_reload() ->
                  kvm_mmu_load() ->
                      kvm_mmu_load_cr3() ->
      Stuff vmcs01.GUEST_CR3 if and only if nested early checks are disabled
      as a "late" VM-Fail should never happen in that case (KVM WARNs), and
      the conditional write avoids the need to restore the correct GUEST_CR3
      when nested_vmx_check_vmentry_hw() fails.
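The condition described in this message can be modelled in a few lines. This is a toy model, not the kernel code: `struct vcpu_model` and `stash_l1_cr3_on_nested_entry()` are hypothetical names, and the two plain fields stand in for vcpu->arch.cr3 and vmcs01.GUEST_CR3.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the state the commit message talks about. */
struct vcpu_model {
    uint64_t l1_cr3;           /* L1's real CR3 (vcpu->arch.cr3 in the text) */
    uint64_t vmcs01_guest_cr3; /* models vmcs01.GUEST_CR3; holds a shadow value without EPT */
};

/*
 * On nested VM-Entry, overwrite the shadow value with L1's real CR3, but
 * only when EPT is disabled *and* nested early checks are disabled.
 */
static void stash_l1_cr3_on_nested_entry(struct vcpu_model *v,
                                         bool enable_ept,
                                         bool nested_early_check)
{
    if (!enable_ept && !nested_early_check)
        v->vmcs01_guest_cr3 = v->l1_cr3;
}
```

In the other three combinations the field is left alone, which matches the message's point that the conditional write avoids having to restore GUEST_CR3 after a failed hardware early check.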
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20190607185534.24368-1-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. 02 Jul, 2019 7 commits
    • KVM: nVMX: Change KVM_STATE_NESTED_EVMCS to signal vmcs12 is copied from eVMCS · 323d73a8
      Liran Alon authored
      Currently KVM_STATE_NESTED_EVMCS is used to signal that the eVMCS
      capability is enabled on the vCPU, as indicated by
      vmx->nested.enlightened_vmcs_enabled.
      This is quite bizarre, as the userspace VMM should make sure to expose
      the same vCPU with the same CPUID values on both source and destination.
      If the vCPU is exposed with eVMCS support in CPUID, the VMM is also
      expected to enable the KVM_CAP_HYPERV_ENLIGHTENED_VMCS capability.
      Therefore, KVM_STATE_NESTED_EVMCS is redundant.
      KVM_STATE_NESTED_EVMCS is currently used on restore path
      (vmx_set_nested_state()) only to enable eVMCS capability in KVM
      and to signal need_vmcs12_sync such that on next VMEntry to guest
      nested_sync_from_vmcs12() will be called to sync vmcs12 content
      into eVMCS in guest memory.
      However, because restore nested-state is rare enough, we could
      have just modified vmx_set_nested_state() to always signal
      From all the above, it seems that we could have just removed
      the usage of KVM_STATE_NESTED_EVMCS. However, in order to preserve
      backwards migration compatibility, we cannot do that.
      (vmx_get_nested_state() needs to signal flag when migrating from
      new kernel to old kernel).
      Returning KVM_STATE_NESTED_EVMCS whenever the vCPU merely has eVMCS
      enabled has the bad side effect that the userspace VMM must send
      nested-state from source to destination as part of the migration
      stream, even if the guest has never used eVMCS because it does not
      run a nested hypervisor workload. This requires the destination
      userspace VMM and KVM to support setting nested-state, which makes
      it more difficult to migrate from a new host to an older host.
      To avoid this, change KVM_STATE_NESTED_EVMCS to signal that eVMCS is
      not only enabled but also active, i.e. the guest has made some
      eVMCS active via an enlightened VMEntry. In that case vmcs12 is copied
      from the eVMCS and therefore should be restored into the eVMCS resident
      in memory (by copy_vmcs12_to_enlightened()).
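The change in meaning of the flag can be illustrated with a small before/after model. The names here are hypothetical: `DEMO_STATE_NESTED_EVMCS` only plays the role of KVM_STATE_NESTED_EVMCS, and `struct nested_model` reduces the vCPU's eVMCS state to two booleans.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag bit; it only mirrors the role of KVM_STATE_NESTED_EVMCS. */
#define DEMO_STATE_NESTED_EVMCS 0x4u

/* Hypothetical reduction of the vCPU's eVMCS state. */
struct nested_model {
    bool evmcs_enabled; /* the eVMCS capability is enabled on the vCPU */
    bool evmcs_active;  /* the guest made an eVMCS active via an enlightened VMEntry */
};

/* Old meaning: the flag is reported as soon as the capability is enabled. */
static uint32_t nested_flags_before(const struct nested_model *n)
{
    return n->evmcs_enabled ? DEMO_STATE_NESTED_EVMCS : 0;
}

/* New meaning: the flag is reported only when an eVMCS is actually in use. */
static uint32_t nested_flags_after(const struct nested_model *n)
{
    return (n->evmcs_enabled && n->evmcs_active) ? DEMO_STATE_NESTED_EVMCS : 0;
}
```

For a guest that has the capability enabled but never runs a nested hypervisor, the first function reports the flag and the second does not, which is the case the message wants to keep out of the migration stream.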
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Maran Wilson <maran.wilson@oracle.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Allow restore nested-state to enable eVMCS when vCPU in SMM · 65b712f1
      Liran Alon authored
      As the comment in the code specifies, SMM temporarily disables VMX, so we
      cannot be in guest mode, nor can VMLAUNCH/VMRESUME be pending.
      However, the code currently assumes that these are the only flags that can
      be set in kvm_state->flags. This is not true, as KVM_STATE_NESTED_EVMCS
      can also be set in this field to signal that eVMCS should be enabled.
      Therefore, fix code to check for guest-mode and pending VMLAUNCH/VMRESUME
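The difference between rejecting any flag and rejecting only the two flags that are impossible in SMM can be sketched as follows. The `DEMO_*` constants and both functions are hypothetical stand-ins for the KVM_STATE_NESTED_* flags and for the check in vmx_set_nested_state(); the real function does considerably more.

```c
#include <errno.h>
#include <stdint.h>

/* Illustrative flag bits standing in for the KVM_STATE_NESTED_* flags. */
#define DEMO_GUEST_MODE  0x1u
#define DEMO_RUN_PENDING 0x2u
#define DEMO_EVMCS       0x4u

/* Before: while the vCPU is in SMM, any set flag is treated as invalid. */
static int check_flags_in_smm_before(uint32_t flags)
{
    return flags ? -EINVAL : 0;
}

/*
 * After: only guest mode and a pending VMLAUNCH/VMRESUME are rejected,
 * so the eVMCS flag alone is accepted.
 */
static int check_flags_in_smm_after(uint32_t flags)
{
    return (flags & (DEMO_GUEST_MODE | DEMO_RUN_PENDING)) ? -EINVAL : 0;
}
```

With only the eVMCS flag set, the first variant returns -EINVAL and the second returns 0.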
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: nVMX: Remove unnecessary sync_roots from handle_invept · b1190198
      Jim Mattson authored
      When L0 is executing handle_invept(), the TDP MMU is active. Emulating
      an L1 INVEPT does require synchronizing the appropriate shadow EPT
      root(s), but a call to kvm_mmu_sync_roots in this context won't do
      that. Similarly, the hardware TLB and paging-structure-cache entries
      associated with the appropriate shadow EPT root(s) must be flushed,
      but requesting a TLB_FLUSH from this context won't do that either.
      How did this ever work? KVM always does a sync_roots and TLB flush (in
      the correct context) when transitioning from L1 to L2. That isn't the
      best choice for nested VM performance, but it effectively papers over
      the mistakes here.
      Remove the unnecessary operations and leave a comment to try to do
      better in the future.
      Reported-by: Junaid Shahid <junaids@google.com>
      Fixes: bfd0a56b ("nEPT: Nested INVEPT")
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Nadav Har'El <nyh@il.ibm.com>
      Cc: Jun Nakajima <jun.nakajima@intel.com>
      Cc: Xinhao Xu <xinhao.xu@intel.com>
      Cc: Yang Zhang <yang.z.zhang@Intel.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Peter Shier <pshier@google.com>
      Reviewed-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/kvm/nVMX: fix VMCLEAR when Enlightened VMCS is in use · 11e34914
      Vitaly Kuznetsov authored
      When Enlightened VMCS is in use, it is valid to do VMCLEAR and,
      according to TLFS, this should "transition an enlightened VMCS from the
      active to the non-active state". It is, however, wrong to assume that
      it is only valid to do VMCLEAR for the eVMCS which is currently active
      on the vCPU performing VMCLEAR.
      Currently, the logic in handle_vmclear() is broken: if there is no
      active eVMCS on the vCPU doing VMCLEAR, we treat the argument as a 'normal'
      VMCS, and the kvm_vcpu_write_guest() to the 'launch_state' field irreversibly
      corrupts the memory area.
      So, in case the VMCLEAR argument is not the current active eVMCS on the
      vCPU, how can we know if the area it is pointing to is a normal or an
      enlightened VMCS?
      Because of the bug in Hyper-V (see commit 72aeb60c ("KVM: nVMX: Verify
      eVMCS revision id match supported eVMCS version on eVMCS VMPTRLD")), we
      cannot: the revision can't be used to distinguish between them. So let's
      assume it is always enlightened in case enlightened vmentry is enabled in
      the assist page. Also, check vmx->nested.enlightened_vmcs_enabled to
      minimize the impact for 'unenlightened' workloads.
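The decision rule described above reduces to two conditions. The sketch below is only a model of that rule: `classify_vmclear_target()` and its enum are hypothetical names, and the two booleans stand in for vmx->nested.enlightened_vmcs_enabled and the enlightened-vmentry setting read from the assist page.

```c
#include <stdbool.h>

/* How the VMCLEAR operand is treated; hypothetical names for illustration. */
enum vmclear_target {
    VMCLEAR_NORMAL_VMCS,      /* 'launch_state' may be written to guest memory */
    VMCLEAR_ENLIGHTENED_VMCS  /* must not be written as if it were a normal VMCS */
};

/*
 * The revision id cannot tell the two layouts apart, so assume the operand
 * is enlightened whenever eVMCS is enabled on the vCPU and the assist page
 * has enlightened vmentry turned on.
 */
static enum vmclear_target classify_vmclear_target(bool evmcs_enabled,
                                                   bool assist_page_enlighten_vmentry)
{
    if (evmcs_enabled && assist_page_enlighten_vmentry)
        return VMCLEAR_ENLIGHTENED_VMCS;

    return VMCLEAR_NORMAL_VMCS;
}
```

Checking the per-vCPU enable first means workloads that never enabled eVMCS do not need to consult the assist page at all.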
      Fixes: b8bbab92 ("KVM: nVMX: implement enlightened VMPTRLD and VMCLEAR")
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/KVM/nVMX: don't use clean fields data on enlightened VMLAUNCH · a21a39c2
      Vitaly Kuznetsov authored
      Apparently, Windows doesn't maintain clean fields data after it does
      VMCLEAR for an enlightened VMCS so we can only use it on VMRESUME.
      The issue went unnoticed because currently we do nested_release_evmcs()
      in handle_vmclear() and the consecutive enlightened VMPTRLD invalidates
      clean fields when a new eVMCS is mapped but we're going to change the
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: allow setting the VMFUNC controls MSR · e8a70bd4
      Paolo Bonzini authored
      Allow userspace to set a custom value for the VMFUNC controls MSR, as long
      as the capabilities it advertises do not exceed those of the host.
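The "do not exceed those of the host" rule is a subset test on capability bits. A minimal sketch, assuming a hypothetical helper `set_vmfunc_controls()` that is not the function used in the patch:

```c
#include <errno.h>
#include <stdint.h>

/*
 * Accept a userspace-provided value only if every bit it sets is also set
 * in the host-supported mask; otherwise leave the current value untouched.
 * All names here are hypothetical.
 */
static int set_vmfunc_controls(uint64_t *current, uint64_t host_supported,
                               uint64_t requested)
{
    if (requested & ~host_supported)
        return -EINVAL;   /* advertises something the host cannot do */

    *current = requested;
    return 0;
}
```

Clearing bits is always allowed under this rule, so userspace can advertise fewer capabilities than the host has, including none.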
      Fixes: 27c42a1b ("KVM: nVMX: Enable VMFUNC for the L1 hypervisor", 2017-08-03)
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: include conditional controls in /dev/kvm KVM_GET_MSRS · 6defc591
      Paolo Bonzini authored
      Some secondary controls are automatically enabled/disabled based on the CPUID
      values that are set for the guest.  However, they are still available at a
      global level and therefore should be present when KVM_GET_MSRS is sent to
      Fixes: 1389309c ("KVM: nVMX: expose VMX capabilities for nested hypervisors to userspace", 2018-02-26)
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  7. 20 Jun, 2019 1 commit
    • KVM: nVMX: reorganize initial steps of vmx_set_nested_state · 9fd58877
      Paolo Bonzini authored
      Commit 332d0797 ("KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS
      state before setting new state", 2019-05-02) broke evmcs_test because the
      eVMCS setup must be performed even if there is no VMXON region defined,
      as long as the eVMCS bit is set in the assist page.
      While the simplest possible fix would be to add a check on
      kvm_state->flags & KVM_STATE_NESTED_EVMCS in the initial "if" that
      covers kvm_state->hdr.vmx.vmxon_pa == -1ull, that is quite ugly.
      Instead, this patch moves checks earlier in the function and
      conditionalizes them on kvm_state->hdr.vmx.vmxon_pa, so that
      vmx_set_nested_state always goes through vmx_leave_nested
      and nested_enable_evmcs.
      Fixes: 332d0797 ("KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state")
      Cc: Aaron Lewis <aaronlewis@google.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  8. 19 Jun, 2019 1 commit
    • KVM: x86: Modify struct kvm_nested_state to have explicit fields for data · 6ca00dfa
      Liran Alon authored
      Improve the KVM_{GET,SET}_NESTED_STATE structs by detailing the format
      of VMX nested state data in a struct.
      In order to avoid changing the ioctl values of
      KVM_{GET,SET}_NESTED_STATE, there is a need to preserve
      sizeof(struct kvm_nested_state). This is done by defining the data
      struct as "data.vmx[0]". It was the most elegant way I found to
      preserve struct size while still keeping struct readable and easy to
      maintain. It does have an unfortunate side-effect that now it has to be
      accessed as "data.vmx[0]" rather than just "data.vmx".
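The size-preserving trick can be demonstrated outside the kernel. This is a sketch with made-up `demo_*` types and sizes, not the real UAPI layout; it relies on zero-length arrays, a GNU C extension accepted by GCC and Clang.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical payload; the field sizes are made up for the demo. */
struct demo_vmx_data {
    uint8_t vmcs12[4096];
    uint8_t shadow_vmcs12[4096];
};

/* Hypothetical fixed-size header followed by a typed, zero-sized data area. */
struct demo_nested_state {
    uint16_t flags;
    uint16_t format;
    uint32_t size;      /* total bytes: header plus payload */
    uint8_t  hdr[120];
    union {
        /* Zero-length array: names the payload type without adding size. */
        struct demo_vmx_data vmx[0];
    } data;
};

/* Allocate a zeroed header with room for one payload right after it. */
static struct demo_nested_state *demo_state_alloc(void)
{
    size_t total = sizeof(struct demo_nested_state) + sizeof(struct demo_vmx_data);
    struct demo_nested_state *st = calloc(1, total);

    if (st)
        st->size = (uint32_t)total;
    return st;
}
```

Here sizeof(struct demo_nested_state) stays at the 128 bytes of the header fields, while a caller reaches the payload through `st->data.vmx[0]`, which is the access pattern the message describes as the cost of this layout.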
      Because we are already modifying these structs, I also modified the
      * Define the "format" field values as macros.
      * Rename vmcs_pa to vmcs12_pa for better readability.
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      [Remove SVM stubs, add KVM_STATE_NESTED_VMX_VMCS12_SIZE. - Paolo]
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  9. 18 Jun, 2019 24 commits