1. 05 Apr, 2012 1 commit
  2. 08 Mar, 2012 6 commits
  3. 21 Feb, 2012 1 commit
    • Linus Torvalds's avatar
      i387: Split up <asm/i387.h> into exported and internal interfaces · 1361b83a
      Linus Torvalds authored
      While various modules include <asm/i387.h> to get access to things we
      actually *intend* for them to use, most of that header file was really
      pretty low-level internal stuff that we really don't want to expose to
      So split the header file into two: the small exported interfaces remain
      in <asm/i387.h>, while the internal definitions that are only used by
      core architecture code are now in <asm/fpu-internal.h>.
      The guiding principle for this was to expose functions that we export to
      modules, and leave them in <asm/i387.h>, while stuff that is used by
      task switching or was marked GPL-only is in <asm/fpu-internal.h>.
      The fpu-internal.h file could be further split up too, especially since
      arch/x86/kvm/ uses some of the remaining stuff for its module.  But that
      kvm usage should probably be abstracted out a bit, and at least now the
      internal FPU accessor functions are much more contained.  Even if it
      isn't perhaps as contained as it _could_ be.
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1202211340330.5354@i5.linux-foundation.org
      Signed-off-by: default avatarH. Peter Anvin <hpa@linux.intel.com>
  4. 18 Feb, 2012 1 commit
    • Linus Torvalds's avatar
      i387: move TS_USEDFPU flag from thread_info to task_struct · f94edacf
      Linus Torvalds authored
      This moves the bit that indicates whether a thread has ownership of the
      FPU from the TS_USEDFPU bit in thread_info->status to a word of its own
      (called 'has_fpu') in task_struct->thread.has_fpu.
      This fixes two independent bugs at the same time:
       - changing 'thread_info->status' from the scheduler causes nasty
         problems for the other users of that variable, since it is defined to
         be thread-synchronous (that's what the "TS_" part of the naming was
         supposed to indicate).
         So perfectly valid code could (and did) do
      	ti->status |= TS_RESTORE_SIGMASK;
         and the compiler was free to do that as separate load, or and store
         instructions.  Which can cause problems with preemption, since a task
         switch could happen in between, and change the TS_USEDFPU bit. The
         change to TS_USEDFPU would be overwritten by the final store.
         In practice, this seldom happened, though, because the 'status' field
         was seldom used more than once, so gcc would generally tend to
         generate code that used a read-modify-write instruction and thus
         happened to avoid this problem - RMW instructions are naturally low
         fat and preemption-safe.
       - On x86-32, the current_thread_info() pointer would, during interrupts
         and softirqs, point to a *copy* of the real thread_info, because
         x86-32 uses %esp to calculate the thread_info address, and thus the
         separate irq (and softirq) stacks would cause these kinds of odd
         thread_info copy aliases.
         This is normally not a problem, since interrupts aren't supposed to
         look at thread information anyway (what thread is running at
         interrupt time really isn't very well-defined), but it confused the
         heck out of irq_fpu_usable() and the code that tried to squirrel
         away the FPU state.
         (It also caused untold confusion for us poor kernel developers).
      It also turns out that using 'task_struct' is actually much more natural
      for most of the call sites that care about the FPU state, since they
      tend to work with the task struct for other reasons anyway (ie
      scheduling).  And the FPU data that we are going to save/restore is
      found there too.
      Thanks to Arjan Van De Ven <arjan@linux.intel.com> for pointing us to
      the %esp issue.
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Reported-and-tested-by: default avatarRaphael Prevost <raphael@buro.asia>
      Acked-and-tested-by: default avatarSuresh Siddha <suresh.b.siddha@intel.com>
      Tested-by: default avatarPeter Anvin <hpa@zytor.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  5. 16 Feb, 2012 1 commit
    • Linus Torvalds's avatar
      i387: don't ever touch TS_USEDFPU directly, use helper functions · 6d59d7a9
      Linus Torvalds authored
      This creates three helper functions that do the TS_USEDFPU accesses, and
      makes everybody that used to do it by hand use those helpers instead.
      In addition, there's a couple of helper functions for the "change both
      CR0.TS and TS_USEDFPU at the same time" case, and the places that do
      that together have been changed to use those.  That means that we have
      fewer random places that open-code this situation.
      The intent is partly to clarify the code without actually changing any
      semantics yet (since we clearly still have some hard to reproduce bug in
      this area), but also to make it much easier to use another approach
      entirely to caching the CR0.TS bit for software accesses.
      Right now we use a bit in the thread-info 'status' variable (this patch
      does not change that), but we might want to make it a full field of its
      own or even make it a per-cpu variable.
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  6. 12 Jan, 2012 1 commit
  7. 27 Dec, 2011 6 commits
    • Avi Kivity's avatar
      KVM: VMX: Intercept RDPMC · fee84b07
      Avi Kivity authored
      Intercept RDPMC and forward it to the PMU emulation code.
      Signed-off-by: default avatarAvi Kivity <avi@redhat.com>
      Signed-off-by: default avatarGleb Natapov <gleb@redhat.com>
      Signed-off-by: default avatarAvi Kivity <avi@redhat.com>
    • Avi Kivity's avatar
      KVM: Move cpuid code to new file · 00b27a3e
      Avi Kivity authored
      The cpuid code has grown; put it into a separate file.
      Signed-off-by: default avatarAvi Kivity <avi@redhat.com>
    • Xiao Guangrong's avatar
      KVM: introduce id_to_memslot function · 28a37544
      Xiao Guangrong authored
      Introduce id_to_memslot to get memslot by slot id
      Signed-off-by: default avatarXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Signed-off-by: default avatarAvi Kivity <avi@redhat.com>
    • Gleb Natapov's avatar
      KVM: VMX: remove unneeded vmx_load_host_state() calls. · 46199f33
      Gleb Natapov authored
      vmx_load_host_state() does not handle msrs switching (except
      MSR_KERNEL_GS_BASE) since commit 26bb0981
      . Remove call to it
      where it is no longer make sense.
      Signed-off-by: default avatarGleb Natapov <gleb@redhat.com>
      Signed-off-by: default avatarAvi Kivity <avi@redhat.com>
    • Nadav Har'El's avatar
      KVM: nVMX: Fix warning-causing idt-vectoring-info behavior · 51cfe38e
      Nadav Har'El authored
      When L0 wishes to inject an interrupt while L2 is running, it emulates an exit
      to L1 with EXIT_REASON_EXTERNAL_INTERRUPT. This was explained in the original
      nVMX patch 23, titled "Correct handling of interrupt injection".
      Unfortunately, it is possible (though rare) that at this point there is valid
      idt_vectoring_info in vmcs02. For example, L1 injected some interrupt to L2,
      and when L2 tried to run this interrupt's handler, it got a page fault - so
      it returns the original interrupt vector in idt_vectoring_info. The problem
      is that if this is the case, we cannot exit to L1 with EXTERNAL_INTERRUPT
      like we wished to, because the VMX spec guarantees that idt_vectoring_info
      and exit_reason_external_interrupt can never happen together. This is not
      just specified in the spec - a KVM L1 actually prints a kernel warning
      "unexpected, valid vectoring info" if we violate this guarantee, and some
      users noticed these warnings in L1's logs.
      In order to better emulate a processor, which would never return the external
      interrupt and the idt-vectoring-info together, we need to separate the two
      injection steps: First, complete L1's injection into L2 (i.e., enter L2,
      injecting to it the idt-vectoring-info); Second, after entry into L2 succeeds
      and it exits back to L0, exit to L1 with the EXIT_REASON_EXTERNAL_INTERRUPT.
      Most of this is already in the code - the only change we need is to remain
      in L2 (and not exit to L1) in this case.
      Note that the previous patch ensures (by using KVM_REQ_IMMEDIATE_EXIT) that
      although we do enter L2 first, it will exit immediately after processing its
      injection, allowing us to promptly inject to L1.
      Note how we test vmcs12->idt_vectoring_info_field; This isn't really the
      vmcs12 value (we haven't exited to L1 yet, so vmcs12 hasn't been updated),
      but rather the place we save, at the end of vmx_vcpu_run, the vmcs02 value
      of this field. This was explained in patch 25 ("Correct handling of idt
      vectoring info") of the original nVMX patch series.
      Thanks to Dave Allan and to Federico Simoncelli for reporting this bug,
      to Abel Gordon for helping me figure out the solution, and to Avi Kivity
      for helping to improve it.
      Signed-off-by: default avatarNadav Har'El <nyh@il.ibm.com>
      Signed-off-by: default avatarAvi Kivity <avi@redhat.com>
    • Nadav Har'El's avatar
      KVM: nVMX: Add KVM_REQ_IMMEDIATE_EXIT · d6185f20
      Nadav Har'El authored
      This patch adds a new vcpu->requests bit, KVM_REQ_IMMEDIATE_EXIT.
      This bit requests that when next entering the guest, we should run it only
      for as little as possible, and exit again.
      We use this new option in nested VMX: When L1 launches L2, but L0 wishes L1
      to continue running so it can inject an event to it, we unfortunately cannot
      just pretend to have run L2 for a little while - We must really launch L2,
      otherwise certain one-off vmcs12 parameters (namely, L1 injection into L2)
      will be lost. So the existing code runs L2 in this case.
      But L2 could potentially run for a long time until it exits, and the
      injection into L1 will be delayed. The new KVM_REQ_IMMEDIATE_EXIT allows us
      to request that L2 will be entered, as necessary, but will exit as soon as
      possible after entry.
      Our implementation of this request uses smp_send_reschedule() to send a
      self-IPI, with interrupts disabled. The interrupts remain disabled until the
      guest is entered, and then, after the entry is complete (often including
      processing an injection and jumping to the relevant handler), the physical
      interrupt is noticed and causes an exit.
      On recent Intel processors, we could have achieved the same goal by using
      MTF instead of a self-IPI. Another technique worth considering in the future
      is to use VM_EXIT_ACK_INTR_ON_EXIT and a highest-priority vector IPI - to
      slightly improve performance by avoiding the useless interrupt handler
      which ends up being called when smp_send_reschedule() is used.
      Signed-off-by: default avatarNadav Har'El <nyh@il.ibm.com>
      Signed-off-by: default avatarAvi Kivity <avi@redhat.com>
  8. 17 Nov, 2011 3 commits
  9. 25 Sep, 2011 7 commits
    • Jan Kiszka's avatar
      KVM: Clean up and extend rate-limited output · bd80158a
      Jan Kiszka authored
      The use of printk_ratelimit is discouraged, replace it with
      pr*_ratelimited or __ratelimit. While at it, convert remaining
      guest-triggerable printks to rate-limited variants.
      Signed-off-by: default avatarJan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
    • Jan Kiszka's avatar
      KVM: x86: Move kvm_trace_exit into atomic vmexit section · 1e2b1dd7
      Jan Kiszka authored
      This avoids that events causing the vmexit are recorded before the
      actual exit reason.
      Signed-off-by: default avatarJan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
    • Kevin Tian's avatar
      KVM: APIC: avoid instruction emulation for EOI writes · 58fbbf26
      Kevin Tian authored
      Instruction emulation for EOI writes can be skipped, since sane
      guest simply uses MOV instead of string operations. This is a nice
      improvement when guest doesn't support x2apic or hyper-V EOI
      a single VM bandwidth is observed with ~8% bandwidth improvement
      (7.4Gbps->8Gbps), by saving ~5% cycles from EOI emulation.
      Signed-off-by: default avatarKevin Tian <kevin.tian@intel.com>
      <Based on earlier work from>:
      Signed-off-by: default avatarEddie Dong <eddie.dong@intel.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: default avatarAvi Kivity <avi@redhat.com>
    • Nadav Har'El's avatar
      KVM: nVMX: Fix nested VMX TSC emulation · 27fc51b2
      Nadav Har'El authored
      This patch fixes two corner cases in nested (L2) handling of TSC-related
      1. Somewhat suprisingly, according to the Intel spec, if L1 allows WRMSR to
      the TSC MSR without an exit, then this should set L1's TSC value itself - not
      offset by vmcs12.TSC_OFFSET (like was wrongly done in the previous code).
      2. Allow L1 to disable the TSC_OFFSETING control, and then correctly ignore
      the vmcs12.TSC_OFFSET.
      Signed-off-by: default avatarNadav Har'El <nyh@il.ibm.com>
      Signed-off-by: default avatarAvi Kivity <avi@redhat.com>
    • Nadav Har'El's avatar
      KVM: L1 TSC handling · d5c1785d
      Nadav Har'El authored
      KVM assumed in several places that reading the TSC MSR returns the value for
      L1. This is incorrect, because when L2 is running, the correct TSC read exit
      emulation is to return L2's value.
      We therefore add a new x86_ops function, read_l1_tsc, to use in places that
      specifically need to read the L1 TSC, NOT the TSC of the current level of
      Note that one change, of one line in kvm_arch_vcpu_load, is made redundant
      by a different patch sent by Zachary Amsden (and not yet applied):
      kvm_arch_vcpu_load() should not read the guest TSC, and if it didn't, of
      course we didn't have to change the call of kvm_get_msr() to read_l1_tsc().
      [avi: moved callback to kvm_x86_ops tsc block]
      Signed-off-by: default avatarNadav Har'El <nyh@il.ibm.com>
      Acked-by: default avatarZachary Amsdem <zamsden@gmail.com>
      Signed-off-by: default avatarAvi Kivity <avi@redhat.com>
    • Julia Lawall's avatar
      KVM: VMX: trivial: use BUG_ON · cf3ace79
      Julia Lawall authored
      Use BUG_ON(x) rather than if(x) BUG();
      The semantic patch that fixes this problem is as follows:
      // <smpl>
      @@ identifier x; @@
      -if (x) BUG();
      @@ identifier x; @@
      -if (!x) BUG();
      // </smpl>
      Signed-off-by: default avatarJulia Lawall <julia@diku.dk>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
    • Stefan Hajnoczi's avatar
      KVM: Use __print_symbolic() for vmexit tracepoints · 0d460ffc
      Stefan Hajnoczi authored
      The vmexit tracepoints format the exit_reason to make it human-readable.
      Since the exit_reason depends on the instruction set (vmx or svm),
      formatting is handled with ftrace_print_symbols_seq() by referring to
      the appropriate exit reason table.
      However, the ftrace_print_symbols_seq() function is not meant to be used
      directly in tracepoints since it does not export the formatting table
      which userspace tools like trace-cmd and perf use to format traces.
      In practice perf dies when formatting vmexit-related events and
      trace-cmd falls back to printing the numeric value (with extra
      formatting code in the kvm plugin to paper over this limitation).  Other
      userspace consumers of vmexit-related tracepoints would be in similar
      To avoid significant changes to the kvm_exit tracepoint, this patch
      moves the vmx and svm exit reason tables into arch/x86/kvm/trace.h and
      selects the right table with __print_symbolic() depending on the
      instruction set.  Note that __print_symbolic() is designed for exporting
      the formatting table to userspace and allows trace-cmd and perf to work.
      Signed-off-by: default avatarStefan Hajnoczi <stefanha@linux.vnet.ibm.com>
      Signed-off-by: default avatarAvi Kivity <avi@redhat.com>
  10. 24 Jul, 2011 2 commits
  11. 12 Jul, 2011 11 commits
    • Nadav Har'El's avatar
      KVM: nVMX: Fix bug preventing more than two levels of nesting · 509c75ea
      Nadav Har'El authored
      The nested VMX feature is supposed to fully emulate VMX for the guest. This
      (theoretically) not only allows it to run its own guests, but also also
      to further emulate VMX for its own guests, and allow arbitrarily deep nesting.
      This patch fixes a bug (discovered by Kevin Tian) in handling a VMLAUNCH
      by L2, which prevented deeper nesting.
      Deeper nesting now works (I only actually tested L3), but is currently
      *absurdly* slow, to the point of being unusable.
      Signed-off-by: default avatarNadav Har'El <nyh@il.ibm.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
    • Jan Kiszka's avatar
      KVM: VMX: Silence warning on 32-bit hosts · 2e4ce7f5
      Jan Kiszka authored
      a is unused now on CONFIG_X86_32.
      Signed-off-by: default avatarJan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
    • Nadav Har'El's avatar
      KVM: nVMX: Miscellenous small corrections · 2844d849
      Nadav Har'El authored
      Small corrections of KVM (spelling, etc.) not directly related to nested VMX.
      Signed-off-by: default avatarNadav Har'El <nyh@il.ibm.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
    • Nadav Har'El's avatar
      KVM: nVMX: Add VMX to list of supported cpuid features · 7b8050f5
      Nadav Har'El authored
      If the "nested" module option is enabled, add the "VMX" CPU feature to the
      list of CPU features KVM advertises with the KVM_GET_SUPPORTED_CPUID ioctl.
      Qemu uses this ioctl, and intersects KVM's list with its own list of desired
      cpu features (depending on the -cpu option given to qemu) to determine the
      final list of features presented to the guest.
      Signed-off-by: default avatarNadav Har'El <nyh@il.ibm.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
    • Nadav Har'El's avatar
      KVM: nVMX: Additional TSC-offset handling · 7991825b
      Nadav Har'El authored
      In the unlikely case that L1 does not capture MSR_IA32_TSC, L0 needs to
      emulate this MSR write by L2 by modifying vmcs02.tsc_offset. We also need to
      set vmcs12.tsc_offset, for this change to survive the next nested entry (see
      Additionally, we also need to modify vmx_adjust_tsc_offset: The semantics
      of this function is that the TSC of all guests on this vcpu, L1 and possibly
      several L2s, need to be adjusted. To do this, we need to adjust vmcs01's
      tsc_offset (this offset will also apply to each L2s we enter). We can't set
      vmcs01 now, so we have to remember this adjustment and apply it when we
      later exit to L1.
      Signed-off-by: default avatarNadav Har'El <nyh@il.ibm.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
    • Nadav Har'El's avatar
      KVM: nVMX: Further fixes for lazy FPU loading · 36cf24e0
      Nadav Har'El authored
      KVM's "Lazy FPU loading" means that sometimes L0 needs to set CR0.TS, even
      if a guest didn't set it. Moreover, L0 must also trap CR0.TS changes and
      NM exceptions, even if we have a guest hypervisor (L1) who didn't want these
      traps. And of course, conversely: If L1 wanted to trap these events, we
      must let it, even if L0 is not interested in them.
      This patch fixes some existing KVM code (in update_exception_bitmap(),
      vmx_fpu_activate(), vmx_fpu_deactivate()) to do the correct merging of L0's
      and L1's needs. Note that handle_cr() was already fixed in the above patch,
      and that new code in introduced in previous patches already handles CR0
      correctly (see prepare_vmcs02(), prepare_vmcs12(), and nested_vmx_vmexit()).
      Signed-off-by: default avatarNadav Har'El <nyh@il.ibm.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
    • Nadav Har'El's avatar
      KVM: nVMX: Handling of CR0 and CR4 modifying instructions · eeadf9e7
      Nadav Har'El authored
      When L2 tries to modify CR0 or CR4 (with mov or clts), and modifies a bit
      which L1 asked to shadow (via CR[04]_GUEST_HOST_MASK), we already do the right
      thing: we let L1 handle the trap (see nested_vmx_exit_handled_cr() in a
      previous patch).
      When L2 modifies bits that L1 doesn't care about, we let it think (via
      CR[04]_READ_SHADOW) that it did these modifications, while only changing
      (in GUEST_CR[04]) the bits that L0 doesn't shadow.
      This is needed for corect handling of CR0.TS for lazy FPU loading: L0 may
      want to leave TS on, while pretending to allow the guest to change it.
      Signed-off-by: default avatarNadav Har'El <nyh@il.ibm.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
    • Nadav Har'El's avatar
      KVM: nVMX: Correct handling of idt vectoring info · 66c78ae4
      Nadav Har'El authored
      This patch adds correct handling of IDT_VECTORING_INFO_FIELD for the nested
      When a guest exits while delivering an interrupt or exception, we get this
      information in IDT_VECTORING_INFO_FIELD in the VMCS. When L2 exits to L1,
      there's nothing we need to do, because L1 will see this field in vmcs12, and
      handle it itself. However, when L2 exits and L0 handles the exit itself and
      plans to return to L2, L0 must inject this event to L2.
      In the normal non-nested case, the idt_vectoring_info case is discovered after
      the exit, and the decision to inject (though not the injection itself) is made
      at that point. However, in the nested case a decision of whether to return
      to L2 or L1 also happens during the injection phase (see the previous
      patches), so in the nested case we can only decide what to do about the
      idt_vectoring_info right after the injection, i.e., in the beginning of
      vmx_vcpu_run, which is the first time we know for sure if we're staying in
      Therefore, when we exit L2 (is_guest_mode(vcpu)), we disable the regular
      vmx_complete_interrupts() code which queues the idt_vectoring_info for
      injection on next entry - because such injection would not be appropriate
      if we will decide to exit to L1. Rather, we just save the idt_vectoring_info
      and related fields in vmcs12 (which is a convenient place to save these
      fields). On the next entry in vmx_vcpu_run (*after* the injection phase,
      potentially exiting to L1 to inject an event requested by user space), if
      we find ourselves in L1 we don't need to do anything with those values
      we saved (as explained above). But if we find that we're in L2, or rather
      *still* at L2 (it's not nested_run_pending, meaning that this is the first
      round of L2 running after L1 having just launched it), we need to inject
      the event saved in those fields - by writing the appropriate VMCS fields.
      Signed-off-by: default avatarNadav Har'El <nyh@il.ibm.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
    • Nadav Har'El's avatar
      KVM: nVMX: Correct handling of exception injection · 0b6ac343
      Nadav Har'El authored
      Similar to the previous patch, but concerning injection of exceptions rather
      than external interrupts.
      Signed-off-by: default avatarNadav Har'El <nyh@il.ibm.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
    • Nadav Har'El's avatar
      KVM: nVMX: Correct handling of interrupt injection · b6f1250e
      Nadav Har'El authored
      The code in this patch correctly emulates external-interrupt injection
      while a nested guest L2 is running.
      Because of this code's relative un-obviousness, I include here a longer-than-
      usual justification for what it does - much longer than the code itself ;-)
      To understand how to correctly emulate interrupt injection while L2 is
      running, let's look first at what we need to emulate: How would things look
      like if the extra L0 hypervisor layer is removed, and instead of L0 injecting
      an interrupt, we had hardware delivering an interrupt?
      Now we have L1 running on bare metal with a guest L2, and the hardware
      generates an interrupt. Assuming that L1 set PIN_BASED_EXT_INTR_MASK to 1, and
      VM_EXIT_ACK_INTR_ON_EXIT to 0 (we'll revisit these assumptions below), what
      happens now is this: The processor exits from L2 to L1, with an external-
      interrupt exit reason but without an interrupt vector. L1 runs, with
      interrupts disabled, and it doesn't yet know what the interrupt was. Soon
      after, it enables interrupts and only at that moment, it gets the interrupt
      from the processor. when L1 is KVM, Linux handles this interrupt.
      Now we need exactly the same thing to happen when that L1->L2 system runs
      on top of L0, instead of real hardware. This is how we do this:
      When L0 wants to inject an interrupt, it needs to exit from L2 to L1, with
      external-interrupt exit reason (with an invalid interrupt vector), and run L1.
      Just like in the bare metal case, it likely can't deliver the interrupt to
      L1 now because L1 is running with interrupts disabled, in which case it turns
      on the interrupt window when running L1 after the exit. L1 will soon enable
      interrupts, and at that point L0 will gain control again and inject the
      interrupt to L1.
      Finally, there is an extra complication in the code: when nested_run_pending,
      we cannot return to L1 now, and must launch L2. We need to remember the
      interrupt we wanted to inject (and not clear it now), and do it on the
      next exit.
      The above explanation shows that the relative strangeness of the nested
      interrupt injection code in this patch, and the extra interrupt-window
      exit incurred, are in fact necessary for accurate emulation, and are not
      just an unoptimized implementation.
      Let's revisit now the two assumptions made above:
      If L1 turns off PIN_BASED_EXT_INTR_MASK (no hypervisor that I know
      does, by the way), things are simple: L0 may inject the interrupt directly
      to the L2 guest - using the normal code path that injects to any guest.
      We support this case in the code below.
      If L1 turns on VM_EXIT_ACK_INTR_ON_EXIT, things look very different from the
      description above: L1 expects to see an exit from L2 with the interrupt vector
      already filled in the exit information, and does not expect to be interrupted
      again with this interrupt. The current code does not (yet) support this case,
      so we do not allow the VM_EXIT_ACK_INTR_ON_EXIT exit-control to be turned on
      by L1.
      Signed-off-by: default avatarNadav Har'El <nyh@il.ibm.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
    • Nadav Har'El's avatar
      KVM: nVMX: Deciding if L0 or L1 should handle an L2 exit · 644d711a
      Nadav Har'El authored
      This patch contains the logic of whether an L2 exit should be handled by L0
      and then L2 should be resumed, or whether L1 should be run to handle this
      exit (using the nested_vmx_vmexit() function of the previous patch).
      The basic idea is to let L1 handle the exit only if it actually asked to
      trap this sort of event. For example, when L2 exits on a change to CR0,
      we check L1's CR0_GUEST_HOST_MASK to see if L1 expressed interest in any
      bit which changed; If it did, we exit to L1. But if it didn't it means that
      it is we (L0) that wished to trap this event, so we handle it ourselves.
      The next two patches add additional logic of what to do when an interrupt or
      exception is injected: Does L0 need to do it, should we exit to L1 to do it,
      or should we resume L2 and keep the exception to be injected later.
      We keep a new flag, "nested_run_pending", which can override the decision of
      which should run next, L1 or L2. nested_run_pending=1 means that we *must* run
      L2 next, not L1. This is necessary in particular when L1 did a VMLAUNCH of L2
      and therefore expects L2 to be run (and perhaps be injected with an event it
      specified, etc.). Nested_run_pending is especially intended to avoid switching
      to L1 in the injection decision-point described above.
      Signed-off-by: default avatarNadav Har'El <nyh@il.ibm.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>