1. 22 Jan, 2019 15 commits
  2. 02 Nov, 2018 3 commits
    • kvm-cpu: Pause vCPU in signal handler · fdd26ecb
      Julien Thierry authored
      
      
      Currently, handling a pause signal only sets a state that will be
      checked at the beginning of the CPU run loop. At that checkpoint the vCPU
      sends the notification that it is actually paused, allowing the pause
      requester to confirm that all vCPUs are paused.
      
      Receiving the pause signal during a KVM_RUN ioctl will make KVM exit to
      userspace. However, there is a small window between that check on
      cpu->paused and the execution of KVM_RUN where the signal has been received
      but the vCPU does not go back through the notification and starts KVM_RUN.
      Since there is no guarantee the vCPU will come back to userspace, the
      pause requester might deadlock.
      
      Perform the pause directly from the signal handler. This relies on a vCPU
      thread never receiving a pause signal while already paused, but such a
      scenario would have caused a deadlock for the pause requester anyway.
      Signed-off-by: Julien Thierry <julien.thierry@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • kvm: Do not pause already paused vcpus · 29f4ec31
      Julien Thierry authored
      
      
      With the following sequence:
      	kvm__pause();
      	kvm__continue();
      	kvm__pause();
      
      There is a chance that not all paused threads have been resumed, and the
      second kvm__pause will attempt to pause them again. Since the paused thread
      is waiting to own the pause_lock, it won't write its second pause
      notification. kvm__pause will be waiting for that notification while owning
      pause_lock, so... deadlock.
      
      A simple solution is not to try to pause threads that have not had the
      chance to resume.
      Signed-off-by: Julien Thierry <julien.thierry@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
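The fix described above can be sketched as a pause loop that skips vCPUs still marked as paused. This is a single-threaded model only: `vcpu_sketch`, `pause_all_sketch` and the `signals` counter are hypothetical names for illustration, and the counter stands in for actually sending the pause signal to a thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-vCPU state; kvmtool's real structure differs. */
struct vcpu_sketch {
	bool paused;   /* still paused from a previous pause request */
	int signals;   /* number of pause signals sent to this vCPU */
};

/*
 * Signal only the vCPUs that are running and return how many pause
 * notifications the requester should wait for.
 */
static size_t pause_all_sketch(struct vcpu_sketch *cpus, size_t n)
{
	size_t expected = 0;

	assert(cpus || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (cpus[i].paused)
			continue;       /* has not resumed yet: it will not notify again */
		cpus[i].signals++;      /* stands in for sending the pause signal */
		expected++;
	}
	return expected;
}
```

A requester that waits for exactly the returned number of notifications no longer blocks on a thread that never got the chance to resume.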
    • virtio: Fix ordering of virt_queue__available() · 66ba0bae
      Jean-Philippe Brucker authored
      
      
      After adding buffers to the virtio queue, the guest increments the avail
      index. It then reads the event index to check if it needs to notify the
      host. If the event index corresponds to the previous avail value, then
      the guest notifies the host. Otherwise it means that the host is still
      processing the queue and hasn't had a chance to increment the event
      index yet. Once it gets there, the host will see the new avail index and
      process the descriptors, so there is no need for a notification.
      
      This is only guaranteed to work if both threads write and read the
      indices in the right order. Currently a barrier is missing from
      virt_queue__available(), and the host may not see an up-to-date value of
      event index after writing avail.
      
                   HOST            |           GUEST
                                   |
                                   |    write avail = 1
                                   |    mb()
                                   |    read event -> 0
              write event = 0      |      == prev_avail -> notify
              read avail -> 1      |
                                   |
              write event = 1      |
              read avail -> 1      |
              wait()               |    write avail = 2
                                   |    mb()
                                   |    read event -> 0
                                   |      != prev_avail -> no notification
      
      By adding a memory barrier on the host side, we ensure that it doesn't
      miss any notification.
      Reviewed-by: Steven Price <steven.price@arm.com>
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
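The ordering requirement on both sides can be sketched with C11 atomics. This is a simplified model, not kvmtool's code: `ring_sketch`, `host_queue_available` and `guest_needs_notify` are hypothetical names, and the sequentially consistent fence plays the role of the memory barrier discussed above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified ring state shared by host and guest. */
struct ring_sketch {
	_Atomic uint16_t avail_idx;   /* written by the guest, read by the host */
	_Atomic uint16_t event_idx;   /* written by the host, read by the guest */
};

/* Host side: publish the event index, then check for new buffers. */
static bool host_queue_available(struct ring_sketch *r, uint16_t last_seen)
{
	assert(r);
	atomic_store_explicit(&r->event_idx, last_seen, memory_order_relaxed);
	/* Full barrier: the event write must be ordered before the avail read. */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&r->avail_idx, memory_order_relaxed) != last_seen;
}

/* Guest side: publish the new avail index, then decide whether to notify. */
static bool guest_needs_notify(struct ring_sketch *r, uint16_t prev_avail)
{
	assert(r);
	atomic_store_explicit(&r->avail_idx, (uint16_t)(prev_avail + 1),
			      memory_order_relaxed);
	/* The guest's mb(): the avail write is ordered before the event read. */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&r->event_idx, memory_order_relaxed) == prev_avail;
}
```

With a full fence on both sides, at least one party observes the other's write: either the host sees the new avail index, or the guest sees the updated event index and sends a notification.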
  3. 16 Aug, 2018 1 commit
  4. 13 Jul, 2018 2 commits
  5. 06 Jul, 2018 1 commit
    • Fix subfolder dependency generation · 665f1b72
      Jean-Philippe Brucker authored
      
      
      When building an object "foo.o", kvmtool also creates a ".foo.o.d" file,
      using the dependency generation feature of CPP. This file describes in
      Makefile format all headers included by foo.c. When one header is
      modified, make rebuilds all objects that include it.
      
      Dependency files in subfolders are currently ignored by make, because
      the target doesn't contain the right prefix. For example virtio/.blk.o.d
      has target "blk.o" instead of "virtio/blk.o". As a result, rebuilding
      kvmtool without first issuing a make clean can introduce sneaky bugs,
      where different objects use mismatched headers. To write the right
      targets in dependency files, add a -MT argument to CPP.
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
  6. 19 Jun, 2018 13 commits
    • vfio: check reserved regions before mapping DMA · 41d773e2
      Jean-Philippe Brucker authored
      
      
      Use the new reserved_regions API to ensure that RAM doesn't overlap any
      reserved region. This prevents, for instance, mapping an MSI doorbell
      into the guest IPA space. For the moment we reject any overlap. In the
      future, we might carve reserved regions out of the guest physical
      space.
      Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • Introduce reserved memory regions · fa1076ab
      Jean-Philippe Brucker authored
      
      
      When passing devices to the guest, there might be address ranges
      unavailable to the device. For instance, if address 0x10000000 corresponds
      to an MSI doorbell, any transaction from a device to that address will be
      directed to the MSI controller and might not even reach the IOMMU. In that
      case 0x10000000 is reserved by the physical IOMMU in the guest's physical
      space.
      
      This patch introduces a simple API to register reserved ranges of
      addresses that should not or cannot be provided to the guest. For the
      moment it only checks that a reserved range does not overlap any user
      memory (we don't consider MMIO) and aborts otherwise.
      
      It should be possible instead to poke holes in the guest-physical memory
      map and report them via the architecture's preferred route:
      * ARM and PowerPC can add reserved-memory nodes to the DT they provide to
        the guest.
      * x86 could poke holes in the memory map reported with e820. This requires
        postponing creation of the memory map until at least VFIO is initialized.
      * MIPS could describe the reserved ranges with the "memmap=mm$ss" kernel
        parameter.
      
      This would also require calling KVM_SET_USER_MEMORY_REGION for all memory
      regions at the end of kvmtool initialisation. Extra care should be taken
      to ensure we don't break any architecture, since they currently rely on
      having a linear address space with at most two memory blocks.
      
      This patch doesn't implement any address space carving. If an abort is
      encountered, the user can try to rebuild kvmtool with different addresses
      or change its IOMMU resv regions if possible.
      Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
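The overlap check at the heart of such an API might look like the following sketch. The names `range_sketch`, `ranges_overlap` and `find_reserved_conflict` are hypothetical and do not reflect kvmtool's actual interface; a caller would abort when a conflict is reported.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical half-open address range [start, start + size). */
struct range_sketch {
	uint64_t start;
	uint64_t size;
};

static bool ranges_overlap(const struct range_sketch *a, const struct range_sketch *b)
{
	/* Empty ranges overlap nothing. */
	if (a->size == 0 || b->size == 0)
		return false;

	/* Reject ranges that wrap past the top of the address space. */
	assert(a->size - 1 <= UINT64_MAX - a->start);
	assert(b->size - 1 <= UINT64_MAX - b->start);

	/* Compare against the last byte so that the end never overflows. */
	uint64_t a_last = a->start + (a->size - 1);
	uint64_t b_last = b->start + (b->size - 1);

	return a->start <= b_last && b->start <= a_last;
}

/* Return the index of the first RAM bank hit by the reserved range, or -1. */
static int find_reserved_conflict(const struct range_sketch *resv,
				  const struct range_sketch *ram, size_t nr_ram)
{
	for (size_t i = 0; i < nr_ram; i++)
		if (ranges_overlap(resv, &ram[i]))
			return (int)i;
	return -1;
}
```

A reserved range that merely touches the end of a RAM bank does not count as a conflict, since the ranges are half-open.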
    • vfio: Support non-mmappable regions · 82caa882
      Jean-Philippe Brucker authored
      
      
      In some cases device regions don't support mmap. They can still be made
      available to the guest by trapping all accesses and forwarding reads or
      writes to VFIO. Such regions may be:
      
      * PCI I/O port BARs.
      * Sub-page regions, for example a 4kB region on a host with 64k pages.
      * Similarly, sparse mmap regions, for example when VFIO allows mmapping
        fragments of a PCI BAR but forbids accessing things like MSI-X tables.
        We don't support the sparse capability at the moment, so trap these
        regions instead (if VFIO rejects the mmap).
      Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
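Forwarding a trapped access comes down to a pread() or pwrite() on the VFIO device file descriptor at the region's offset. The sketch below assumes that approach; `trapped_region_sketch` and `forward_access` are hypothetical names, not kvmtool's.

```c
#define _POSIX_C_SOURCE 200809L   /* for pread() and pwrite() */

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical description of one trapped device region. */
struct trapped_region_sketch {
	int fd;            /* VFIO device file descriptor */
	off_t fd_offset;   /* offset of the region within that descriptor */
	uint64_t size;     /* region size in bytes */
};

/* Forward one trapped guest access; returns true if fully transferred. */
static bool forward_access(const struct trapped_region_sketch *r, uint64_t offset,
			   void *data, size_t len, bool is_write)
{
	ssize_t n;

	assert(r && data);
	if (offset > r->size || len > r->size - offset)
		return false;   /* access falls outside the region */

	if (is_write)
		n = pwrite(r->fd, data, len, r->fd_offset + (off_t)offset);
	else
		n = pread(r->fd, data, len, r->fd_offset + (off_t)offset);

	return n == (ssize_t)len;
}
```

An MMIO or I/O port trap handler would call this with the guest's access size and the offset of the access within the region.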
    • vfio-pci: add MSI support · 8dd28afe
      Jean-Philippe Brucker authored
      
      
      Allow guests to use the MSI capability in devices that support it. Emulate
      the MSI capability, which is simpler than MSI-X as it doesn't rely on
      external tables. Reuse most of the MSI-X code. Guests may choose between
      MSI and MSI-X at runtime since we present both capabilities, but they
      cannot enable MSI and MSI-X at the same time (forbidden by PCI).
      Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • vfio-pci: add MSI-X support · c9888d95
      Jean-Philippe Brucker authored
      
      
      Add virtual MSI-X tables for PCI devices, and create IRQFD routes to let
      the kernel inject MSIs from a physical device directly into the guest.
      
      It would be tempting to create the MSI routes at init time before starting
      vCPUs, when we can afford to exit gracefully. But some of it must be
      initialized when the guest requests it.
      
      * On the KVM side, MSIs must be enabled after devices allocate their IRQ
        lines and irqchips are operational, which can happen until late_init.
      
      * On the VFIO side, hardware state of devices may be updated when setting
        up MSIs. For example, when passing a virtio-pci-legacy device to the
        guest:
      
        (1) The device-specific configuration layout (in BAR0) depends on
            whether MSIs are enabled or not in the device. If they are enabled,
            the device-specific configuration starts at offset 24, otherwise it
            starts at offset 20.
        (2) The Linux guest assumes that MSIs are initially disabled (it
            doesn't actually check the capability), so it reads the device
            config at offset 20.
        (3) Had we enabled MSIs early, the host would have enabled the MSI-X
            capability and the device would return the config at offset 24.
        (4) The guest would read junk and explode.
      
      Therefore we have to create MSI-X routes when the guest requests MSIs, and
      enable/disable them in VFIO when the guest pokes the MSI-X capability. We
      have to follow both physical and virtual state of the capability, which
      makes the state machine a bit complex, but I think it works.
      
      An important missing feature is the handling of pending MSIs. When
      a vector or the function is masked, we should rewire the IRQFD to a
      special thread that keeps note of pending interrupts (or just poll the
      IRQFD before recreating the route?). And when the vector is unmasked, one
      MSI should be injected if it was pending. At the moment no MSI is
      injected: we simply disconnect the IRQFD and all messages are lost.
      Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
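The virtio-pci-legacy example can be made concrete with a small sketch. The function names are hypothetical; the two offsets are the ones quoted in the message above. It shows how a guest that believes MSI-X is disabled reads the wrong bytes once the device has MSI-X enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Start of the device-specific config in BAR0 of a legacy virtio-pci device. */
static size_t virtio_legacy_config_offset(bool msix_enabled)
{
	return msix_enabled ? 24 : 20;
}

/*
 * Read byte `field` of the device-specific config the way a guest would,
 * given what the guest believes about the MSI-X enable state.
 */
static uint8_t guest_read_config(const uint8_t *bar0, size_t bar_len,
				 bool guest_thinks_msix, size_t field)
{
	size_t off = virtio_legacy_config_offset(guest_thinks_msix) + field;

	assert(bar0 && off < bar_len);
	return bar0[off];
}
```

If the device's layout starts at offset 24 while the guest reads at offset 20, the guest gets whatever occupies bytes 20 to 23 instead of the first config field.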
    • Add PCI device passthrough using VFIO · 6078a454
      Jean-Philippe Brucker authored
      
      
      Assigning devices using VFIO allows the guest to have direct access to the
      device, whilst filtering accesses to sensitive areas by trapping config
      space accesses and mapping DMA with an IOMMU.
      
      This patch adds a new option to lkvm run: --vfio-pci=<BDF>. Before
      assigning a device to a VM, some preparation is required. As described in
      Linux Documentation/vfio.txt, the device driver needs to be changed to
      vfio-pci:
      
        $ dev=0000:00:00.0
      
        $ echo $dev > /sys/bus/pci/devices/$dev/driver/unbind
        $ echo vfio-pci > /sys/bus/pci/devices/$dev/driver_override
        $ echo $dev > /sys/bus/pci/drivers_probe
      
      Adding --vfio-pci=$dev to lkvm-run will pass the device to the guest.
      Multiple devices can be passed to the guest by adding more --vfio-pci
      parameters.
      
      This patch only implements PCI with INTx. MSI-X routing will be added in a
      subsequent patch, and at some point we might add support for passing
      platform devices to guests.
      Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Robin Murphy <robin.murphy@arm.com>
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • Add fls_long and roundup_pow_of_two helpers · ac70b5aa
      Jean-Philippe Brucker authored
      
      
      It's always nice to have a log2 handy, and the vfio-pci code will need to
      perform power of two allocation from an arbitrary size. Add fls_long and
      roundup_pow_of_two, based on the GCC builtin.
      Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
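One plausible shape for these helpers, built on the GCC builtin `__builtin_clzl`, is sketched below. The `_sketch` suffix marks the names as illustrative; the actual kvmtool definitions may differ in detail.

```c
#include <assert.h>
#include <limits.h>

/* Position of the most significant set bit, counting from 1; 0 if x is 0. */
static inline int fls_long_sketch(unsigned long x)
{
	/* __builtin_clzl() is undefined for 0, so handle that case explicitly. */
	return x ? (int)(sizeof(x) * CHAR_BIT) - __builtin_clzl(x) : 0;
}

/* Smallest power of two >= x, valid for 1 <= x <= 2^(bits per long - 1). */
static inline unsigned long roundup_pow_of_two_sketch(unsigned long x)
{
	assert(x != 0 && x <= 1UL << (sizeof(x) * CHAR_BIT - 1));
	return 1UL << fls_long_sketch(x - 1);
}
```

For a non-zero `x`, `fls_long_sketch(x) - 1` is the integer part of log2(x), which is the "log2 handy" use mentioned above.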
    • Import VFIO headers · b70d1b9f
      Jean-Philippe Brucker authored
      
      
      To ensure consistency between kvmtool and the kernel, import the UAPI
      headers of the VFIO version we implement. This is from Linux v4.12.
      Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • pci: add capability helpers · 1a51c93d
      Jean-Philippe Brucker authored
      
      
      Add a way to iterate over all capabilities in a config space. Add a search
      function for getting a specific capability.
      Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
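A capability search over a raw 256-byte config space could be sketched as follows. The register offsets follow the standard PCI layout (status register at 0x06, capabilities pointer at 0x34); `PCI_CFG_SIZE` and `pci_find_cap_sketch` are hypothetical names and not the helpers this patch adds.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_STATUS           0x06   /* status register, low byte */
#define PCI_STATUS_CAP_LIST  0x10   /* device implements a capability list */
#define PCI_CAPABILITY_LIST  0x34   /* offset of the first capability */
#define PCI_CFG_SIZE         256    /* hypothetical name for the space size */

/* Return the config-space offset of capability `id`, or 0 if absent. */
static uint8_t pci_find_cap_sketch(const uint8_t cfg[PCI_CFG_SIZE], uint8_t id)
{
	int ttl = 48;   /* bound the walk in case the list loops */
	uint8_t pos;

	assert(cfg);
	if (!(cfg[PCI_STATUS] & PCI_STATUS_CAP_LIST))
		return 0;

	pos = cfg[PCI_CAPABILITY_LIST] & ~3;
	while (pos >= 0x40 && ttl--) {
		if (cfg[pos] == id)        /* byte 0: capability ID */
			return pos;
		pos = cfg[pos + 1] & ~3;   /* byte 1: pointer to the next one */
	}
	return 0;
}
```

Iterating over all capabilities uses the same walk without the ID comparison.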
    • Extend memory bank API with memory types · 8f46c736
      Jean-Philippe Brucker authored
      
      
      Introduce memory types RAM and DEVICE, along with a way for subsystems to
      query the global memory banks. This is required by VFIO, which will need
      to pin and map guest RAM so that assigned devices can safely do DMA to it.
      Depending on the architecture, the physical map is made of either one or
      two RAM regions. In addition, this new memory types API paves the way to
      reserved memory regions introduced in a subsequent patch.
      
      For the moment we put vesa and ivshmem memory into the DEVICE category, so
      they don't have to be pinned. This means that physical devices assigned
      with VFIO won't be able to DMA to the vesa frame buffer or ivshmem. In
      order to do that, simply changing the type to "RAM" would work. But to
      keep the types consistent, it would be better to introduce flags such as
      KVM_MEM_TYPE_DMA that would complement both RAM and DEVICE type. We could
      then reuse the API for generating firmware information (that is, for x86
      bios; DT supports reserved-memory description).
      Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • irq: add irqfd helpers · e59679d2
      Jean-Philippe Brucker authored
      
      
      Add helpers to add and remove IRQFD routing for both irqchips and MSIs.
      We have to make a special case of IRQ lines on ARM where the
      initialisation order goes like this:
      
       (1) Devices reserve their IRQ lines
       (2) VGIC is set up with VGIC_CTRL_INIT (in a late_init call)
       (3) MSIs are reserved lazily, when the guest needs them
      
      Since we cannot set up IRQFD before (2), store the IRQFD routing for IRQ
      lines temporarily until we're ready to submit them.
      Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • pci: allow to specify IRQ type for PCI devices · ff01b5db
      Jean-Philippe Brucker authored
      
      
      Currently all our virtual device interrupts are edge-triggered. But we're
      going to need level-triggered interrupts when passing physical devices.
      Let the device configure its interrupt kind. Keep edge as default, to
      avoid changing existing users.
      Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • pci: add config operations callbacks on the PCI header · 023fdaae
      Jean-Philippe Brucker authored
      
      
      When implementing PCI device passthrough, we will need to forward config
      accesses from a guest to the VFIO driver. Add a private cfg_ops structure
      to the PCI header, and use it in the PCI config access functions.
      
      A read from the guest first calls into the device's cfg_ops.read, to let
      the backend update the local header before filling the guest register.
      The same happens for a write: we let the backend perform the write and
      replace the guest-provided register with whatever sticks, before updating
      the local header.
      
      Try to untangle the PCI config access logic while we're at it.
      Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      [JPB: moved to a separate patch]
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
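The read and write flow described above can be sketched as follows. All names (`cfg_ops_sketch`, `pci_hdr_sketch`, `cfg_read_sketch`, `cfg_write_sketch`, `demo_backend_write`) are hypothetical, and the demo backend only models a register whose low four bits are hard-wired to zero; it does not talk to VFIO.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct pci_hdr_sketch;

/* Hypothetical per-device callbacks, invoked around guest config accesses. */
struct cfg_ops_sketch {
	void (*read)(struct pci_hdr_sketch *hdr, uint8_t offset, void *data, int sz);
	void (*write)(struct pci_hdr_sketch *hdr, uint8_t offset, void *data, int sz);
};

struct pci_hdr_sketch {
	uint8_t cfg[256];            /* local copy of the config space */
	struct cfg_ops_sketch ops;
};

static void cfg_read_sketch(struct pci_hdr_sketch *hdr, uint8_t offset, void *data, int sz)
{
	assert(sz > 0 && offset + sz <= (int)sizeof(hdr->cfg));
	if (hdr->ops.read)           /* let the backend refresh the local header */
		hdr->ops.read(hdr, offset, data, sz);
	memcpy(data, hdr->cfg + offset, (size_t)sz);
}

static void cfg_write_sketch(struct pci_hdr_sketch *hdr, uint8_t offset, void *data, int sz)
{
	assert(sz > 0 && offset + sz <= (int)sizeof(hdr->cfg));
	if (hdr->ops.write)          /* backend replaces *data with what sticks */
		hdr->ops.write(hdr, offset, data, sz);
	memcpy(hdr->cfg + offset, data, (size_t)sz);
}

/* Demo backend: a 32-bit register whose low four bits always read as zero. */
static void demo_backend_write(struct pci_hdr_sketch *hdr, uint8_t offset, void *data, int sz)
{
	uint32_t v;

	(void)hdr;
	(void)offset;
	if (sz != 4)
		return;
	memcpy(&v, data, sizeof(v));
	v &= ~0xfU;
	memcpy(data, &v, sizeof(v));
}
```

With the demo backend installed, a guest write of all ones is stored in the local header with the low bits cleared, which is what a later read returns.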
  7. 23 May, 2018 3 commits
  8. 06 Apr, 2018 2 commits