1. 22 Jan, 2019 5 commits
  2. 02 Nov, 2018 1 commit
    • virtio: Fix ordering of virt_queue__available() · 66ba0bae
      Jean-Philippe Brucker authored
      
      
      After adding buffers to the virtio queue, the guest increments the avail
      index. It then reads the event index to check if it needs to notify the
      host. If the event index corresponds to the previous avail value, then
      the guest notifies the host. Otherwise it means that the host is still
      processing the queue and hasn't had a chance to increment the event
      index yet. Once it gets there, the host will see the new avail index and
      process the descriptors, so there is no need for a notification.
      
      This is only guaranteed to work if both threads write and read the
      indices in the right order. Currently a barrier is missing from
      virt_queue__available(), and the host may not see an up-to-date value of
      event index after writing avail.
      
                   HOST            |           GUEST
                                   |
                                   |    write avail = 1
                                   |    mb()
                                   |    read event -> 0
              write event = 0      |      == prev_avail -> notify
              read avail -> 1      |
                                   |
              write event = 1      |
              read avail -> 1      |
              wait()               |    write avail = 2
                                   |    mb()
                                   |    read event -> 0
                                   |      != prev_avail -> no notification
      
      By adding a memory barrier on the host side, we ensure that it doesn't
      miss any notification.
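      The host-side ordering described above can be sketched as follows. This is an
      illustrative reduction, not kvmtool's actual virt_queue__available() code: the
      function and variable names are hypothetical, and __sync_synchronize() stands in
      for the real mb() macro. The key point is that the write of the event index must
      be ordered before re-reading the avail index, pairing with the guest's
      write(avail); mb(); read(event) sequence.

```c
#include <stdint.h>
#include <assert.h>

/* Stand-in for kvmtool's mb(): a full memory barrier. */
static inline void mem_barrier(void)
{
    __sync_synchronize();
}

/*
 * Returns non-zero if new buffers are available. 'event_idx' is the
 * location the guest reads to decide whether to notify; 'last_seen' is
 * the last avail value the host processed.
 */
static int virt_queue_has_avail(volatile uint16_t *avail_idx,
                                volatile uint16_t *event_idx,
                                uint16_t last_seen)
{
    *event_idx = last_seen;  /* tell the guest what we have seen so far */
    mem_barrier();           /* order event write before re-reading avail */
    return *avail_idx != last_seen;
}
```

      Without the barrier, the host could observe a stale avail index after writing
      the event index, and go to sleep while the guest decides not to notify.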
      
      Reviewed-by: Steven Price <steven.price@arm.com>
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
  3. 19 Jun, 2018 12 commits
    • Introduce reserved memory regions · fa1076ab
      Jean-Philippe Brucker authored
      
      
      When passing devices to the guest, there might be address ranges
      unavailable to the device. For instance, if address 0x10000000 corresponds
      to an MSI doorbell, any transaction from a device to that address will be
      directed to the MSI controller and might not even reach the IOMMU. In that
      case 0x10000000 is reserved by the physical IOMMU in the guest's physical
      space.
      
      This patch introduces a simple API to register reserved ranges of
      addresses that should not or cannot be provided to the guest. For the
      moment it only checks that a reserved range does not overlap any user
      memory (we don't consider MMIO) and aborts otherwise.
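      The overlap check described above amounts to a simple interval test. The
      following is a hypothetical sketch, not kvmtool's actual API (struct and
      function names are invented for illustration):

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Illustrative stand-in for a registered user memory bank. */
struct mem_range {
    uint64_t start;
    uint64_t size;
};

/*
 * Returns 1 if [start, start + size) overlaps any existing user memory
 * bank, in which case registering the reserved range should abort.
 */
static int range_overlaps(const struct mem_range *banks, size_t nr,
                          uint64_t start, uint64_t size)
{
    for (size_t i = 0; i < nr; i++) {
        uint64_t bs = banks[i].start;
        uint64_t be = bs + banks[i].size;

        /* Two half-open intervals overlap iff each starts before the
         * other ends. */
        if (start < be && bs < start + size)
            return 1;
    }
    return 0;
}
```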
      
      It should be possible instead to poke holes in the guest-physical memory
      map and report them via the architecture's preferred route:
      * ARM and PowerPC can add reserved-memory nodes to the DT they provide to
        the guest.
      * x86 could poke holes in the memory map reported with e820. This requires
        postponing creation of the memory map until at least VFIO is
        initialized.
      * MIPS could describe the reserved ranges with the "memmap=mm$ss" kernel
        parameter.
      
      This would also require calling KVM_SET_USER_MEMORY_REGION for all memory
      regions at the end of kvmtool initialisation. Extra care should be taken
      to ensure we don't break any architecture, since they currently rely on
      having a linear address space with at most two memory blocks.
      
      This patch doesn't implement any address space carving. If an abort is
      encountered, the user can try rebuilding kvmtool with different addresses
      or changing its IOMMU resv regions if possible.
      
      Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • vfio: Support non-mmappable regions · 82caa882
      Jean-Philippe Brucker authored
      
      
      In some cases device regions don't support mmap. They can still be made
      available to the guest by trapping all accesses and forwarding reads or
      writes to VFIO. Such regions may be:
      
      * PCI I/O port BARs.
      * Sub-page regions, for example a 4kB region on a host with 64k pages.
      * Similarly, sparse mmap regions, for example when VFIO allows mmapping
        fragments of a PCI BAR but forbids accessing things like MSI-X tables.
        We don't support the sparse capability at the moment, so we trap these
        regions instead (if VFIO rejects the mmap).
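      Forwarding a trapped access boils down to a pread/pwrite on the VFIO device
      fd at the region's file offset. A minimal sketch, with hypothetical names
      ('fd' and 'region_offset' stand in for the VFIO device fd and the offset
      reported by VFIO_DEVICE_GET_REGION_INFO):

```c
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>

/*
 * Forward a trapped guest access to the device through the VFIO device
 * fd, as done for regions that cannot be mmapped. Returns the number of
 * bytes transferred, or -1 on error.
 */
static ssize_t trap_access(int fd, uint64_t region_offset, uint64_t offset,
                           void *data, size_t len, int is_write)
{
    if (is_write)
        return pwrite(fd, data, len, region_offset + offset);
    return pread(fd, data, len, region_offset + offset);
}
```

      The usage below exercises the helper against a temporary file as a stand-in
      for the VFIO fd; real device regions would of course be backed by vfio-pci.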
      
      Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • vfio-pci: add MSI support · 8dd28afe
      Jean-Philippe Brucker authored
      
      
      Allow guests to use the MSI capability in devices that support it. Emulate
      the MSI capability, which is simpler than MSI-X as it doesn't rely on
      external tables. Reuse most of the MSI-X code. Guests may choose between
      MSI and MSI-X at runtime since we present both capabilities, but they
      cannot enable MSI and MSI-X at the same time (forbidden by PCI).
      
      Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • vfio-pci: add MSI-X support · c9888d95
      Jean-Philippe Brucker authored
      
      
      Add virtual MSI-X tables for PCI devices, and create IRQFD routes to let
      the kernel inject MSIs from a physical device directly into the guest.
      
      It would be tempting to create the MSI routes at init time before starting
      vCPUs, when we can afford to exit gracefully. But some of it must be
      initialized when the guest requests it.
      
      * On the KVM side, MSIs must be enabled after devices allocate their IRQ
        lines and irqchips are operational, which may not happen until late_init.
      
      * On the VFIO side, hardware state of devices may be updated when setting
        up MSIs. For example, when passing a virtio-pci-legacy device to the
        guest:
      
        (1) The device-specific configuration layout (in BAR0) depends on
            whether MSIs are enabled or not in the device. If they are enabled,
            the device-specific configuration starts at offset 24, otherwise it
            starts at offset 20.
        (2) The Linux guest assumes that MSIs are initially disabled (it
            doesn't actually check the capability), so it reads the device
            config at offset 20.
        (3) Had we enabled MSIs early, the host would have enabled the MSI-X
            capability and the device would return the config at offset 24.
        (4) The guest would read junk and explode.
      
      Therefore we have to create MSI-X routes when the guest requests MSIs, and
      enable/disable them in VFIO when the guest pokes the MSI-X capability. We
      have to follow both physical and virtual state of the capability, which
      makes the state machine a bit complex, but I think it works.
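      The essence of the dual-state tracking can be reduced to a tiny sketch.
      This is not the actual state machine from the patch (which also tracks
      per-vector masking); the names and structure are illustrative only:

```c
#include <stdbool.h>
#include <assert.h>

/*
 * Reduced sketch: the virtual enable bit follows what the guest writes
 * to the MSI-X capability, while the physical state only changes when
 * we actually commit it to VFIO.
 */
struct msix_state {
    bool virt_enabled;  /* what the guest thinks */
    bool phys_enabled;  /* what the device has */
};

/*
 * Guest poked the MSI-X capability: record the virtual state, and
 * return whether the physical state must be updated (in real code, by
 * issuing VFIO_DEVICE_SET_IRQS).
 */
static bool msix_update(struct msix_state *s, bool enable)
{
    s->virt_enabled = enable;
    if (s->virt_enabled != s->phys_enabled) {
        s->phys_enabled = enable;  /* commit to "hardware" */
        return true;
    }
    return false;
}
```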
      
      An important missing feature is pending MSI handling. When
      a vector or the function is masked, we should rewire the IRQFD to a
      special thread that keeps note of pending interrupts (or just poll the
      IRQFD before recreating the route?). And when the vector is unmasked, one
      MSI should be injected if it was pending. At the moment no MSI is
      injected, we simply disconnect the IRQFD and all messages are lost.
      
      Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • Add PCI device passthrough using VFIO · 6078a454
      Jean-Philippe Brucker authored
      
      
      Assigning devices using VFIO allows the guest to have direct access to the
      device, whilst filtering accesses to sensitive areas by trapping config
      space accesses and mapping DMA with an IOMMU.
      
      This patch adds a new option to lkvm run: --vfio-pci=<BDF>. Before
      assigning a device to a VM, some preparation is required. As described in
      Linux Documentation/vfio.txt, the device driver needs to be changed to
      vfio-pci:
      
        $ dev=0000:00:00.0
      
        $ echo $dev > /sys/bus/pci/devices/$dev/driver/unbind
        $ echo vfio-pci > /sys/bus/pci/devices/$dev/driver_override
        $ echo $dev > /sys/bus/pci/drivers_probe
      
      Adding --vfio-pci=$dev to lkvm run will pass the device to the guest.
      Multiple devices can be passed to the guest by adding more --vfio-pci
      parameters.
      
      This patch only implements PCI with INTx. MSI-X routing will be added in a
      subsequent patch, and at some point we might add support for passing
      platform devices to guests.
      
      Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Robin Murphy <robin.murphy@arm.com>
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • Add fls_long and roundup_pow_of_two helpers · ac70b5aa
      Jean-Philippe Brucker authored
      
      
      It's always nice to have a log2 handy, and the vfio-pci code will need to
      perform power of two allocation from an arbitrary size. Add fls_long and
      roundup_pow_of_two, based on the GCC builtin.
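      The helpers described here can be sketched in a few lines using
      __builtin_clzl, as the commit says. The exact kvmtool definitions may
      differ slightly, but the shape is:

```c
#include <limits.h>
#include <assert.h>

/*
 * fls_long(): index (1-based) of the most significant set bit, or 0 for
 * x == 0. Built on the GCC count-leading-zeros builtin, whose result is
 * undefined for 0, hence the explicit check.
 */
static inline int fls_long(unsigned long x)
{
    return x ? (int)(sizeof(x) * CHAR_BIT) - __builtin_clzl(x) : 0;
}

/* Round up to the next power of two (x must be >= 1). */
static inline unsigned long roundup_pow_of_two(unsigned long x)
{
    return x == 1 ? 1 : 1UL << fls_long(x - 1);
}
```

      This is what the vfio-pci code needs to size power-of-two allocations
      (for example, rounding a BAR size up before carving out MMIO space).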
      
      Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • Import VFIO headers · b70d1b9f
      Jean-Philippe Brucker authored
      
      
      To ensure consistency between kvmtool and the kernel, import the UAPI
      headers of the VFIO version we implement. This is from Linux v4.12.
      
      Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • pci: add capability helpers · 1a51c93d
      Jean-Philippe Brucker authored
      
      
      Add a way to iterate over all capabilities in a config space. Add a search
      function for getting a specific capability.
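      PCI capabilities form a linked list inside the 256-byte config space:
      the byte at offset 0x34 points to the first capability, and each
      capability is a [id, next] pair at its offset. A minimal search sketch
      (kvmtool's actual helpers may be structured differently):

```c
#include <stdint.h>
#include <assert.h>

#define PCI_CAPABILITY_LIST 0x34  /* offset of the first-capability pointer */

/*
 * Walk the capability chain in a raw config space and return the offset
 * of the capability with the given id, or 0 if not found.
 */
static uint8_t pci_find_cap(const uint8_t *cfg, uint8_t cap_id)
{
    uint8_t pos = cfg[PCI_CAPABILITY_LIST];

    while (pos) {
        if (cfg[pos] == cap_id)      /* byte 0 of a capability: its id */
            return pos;
        pos = cfg[pos + 1];          /* byte 1: offset of the next cap */
    }
    return 0;
}
```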
      
      Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • Extend memory bank API with memory types · 8f46c736
      Jean-Philippe Brucker authored
      
      
      Introduce memory types RAM and DEVICE, along with a way for subsystems to
      query the global memory banks. This is required by VFIO, which will need
      to pin and map guest RAM so that assigned devices can safely do DMA to it.
      Depending on the architecture, the physical map is made of either one or
      two RAM regions. In addition, this new memory types API paves the way to
      reserved memory regions introduced in a subsequent patch.
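      A hypothetical rendering of the idea (types and names invented for
      illustration, not kvmtool's actual definitions): banks are tagged RAM or
      DEVICE, and a subsystem such as VFIO walks only the RAM banks when
      deciding what to pin for DMA.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

enum kvm_mem_type { KVM_MEM_TYPE_RAM, KVM_MEM_TYPE_DEVICE };

struct mem_bank {
    uint64_t guest_phys;
    uint64_t size;
    enum kvm_mem_type type;
};

/* Total guest RAM that would need pinning for assigned-device DMA. */
static uint64_t total_ram(const struct mem_bank *banks, size_t nr)
{
    uint64_t total = 0;

    for (size_t i = 0; i < nr; i++)
        if (banks[i].type == KVM_MEM_TYPE_RAM)
            total += banks[i].size;
    return total;
}
```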
      
      For the moment we put vesa and ivshmem memory into the DEVICE category, so
      they don't have to be pinned. This means that physical devices assigned
      with VFIO won't be able to DMA to the vesa frame buffer or ivshmem. In
      order to do that, simply changing the type to "RAM" would work. But to
      keep the types consistent, it would be better to introduce flags such as
      KVM_MEM_TYPE_DMA that would complement both RAM and DEVICE type. We could
      then reuse the API for generating firmware information (that is, for x86
      bios; DT supports reserved-memory description).
      
      Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • irq: add irqfd helpers · e59679d2
      Jean-Philippe Brucker authored
      
      
      Add helpers to add and remove IRQFD routing for both irqchips and MSIs.
      We have to make a special case of IRQ lines on ARM where the
      initialisation order goes like this:
      
       (1) Devices reserve their IRQ lines
       (2) VGIC is setup with VGIC_CTRL_INIT (in a late_init call)
       (3) MSIs are reserved lazily, when the guest needs them
      
      Since we cannot setup IRQFD before (2), store the IRQFD routing for IRQ
      lines temporarily until we're ready to submit them.
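      The deferral logic can be sketched as a parking table: routes registered
      before the VGIC is initialised are stored, then submitted in one flush
      once the irqchip is ready. All names below are illustrative, and the KVM
      ioctls are only indicated in comments:

```c
#include <stddef.h>
#include <assert.h>

#define MAX_PENDING 16

struct irqfd_route { int gsi; int fd; };

static struct irqfd_route pending[MAX_PENDING];
static size_t nr_pending;
static int irqchip_ready;

/* Register an IRQFD route, parking it if the irqchip isn't up yet. */
static int irqfd_add(int gsi, int fd)
{
    if (!irqchip_ready) {
        if (nr_pending == MAX_PENDING)
            return -1;
        pending[nr_pending++] = (struct irqfd_route){ gsi, fd };
        return 0;
    }
    /* would issue the KVM_IRQFD ioctl here */
    return 0;
}

/*
 * Called once the VGIC is initialised (step (2) above): submit all the
 * parked routes. Returns how many were flushed.
 */
static size_t irqfd_flush(void)
{
    size_t n = nr_pending;

    irqchip_ready = 1;
    nr_pending = 0;
    /* would issue KVM_IRQFD for each parked route here */
    return n;
}
```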
      
      Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • pci: allow to specify IRQ type for PCI devices · ff01b5db
      Jean-Philippe Brucker authored
      
      
      Currently all our virtual device interrupts are edge-triggered. But we're
      going to need level-triggered interrupts when passing physical devices.
      Let the device configure its interrupt kind. Keep edge as default, to
      avoid changing existing users.
      
      Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • pci: add config operations callbacks on the PCI header · 023fdaae
      Jean-Philippe Brucker authored
      
      
      When implementing PCI device passthrough, we will need to forward config
      accesses from a guest to the VFIO driver. Add a private cfg_ops structure
      to the PCI header, and use it in the PCI config access functions.
      
      A read from the guest first calls into the device's cfg_ops.read, to let
      the backend update the local header before filling the guest register.
      The same happens for a write: we let the backend perform the write and
      replace the guest-provided register with whatever sticks, before updating
      the local header.
      
      Try to untangle the PCI config access logic while we're at it.
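      The callback hook for the read path can be sketched as below. Field and
      type names are illustrative (kvmtool's real pci_device_header differs);
      the point is that the backend gets a chance to refresh the local header
      copy before the guest register is filled:

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

struct pci_header;

/* Per-device hooks invoked around config space accesses. */
struct pci_config_operations {
    void (*read)(struct pci_header *hdr, uint16_t offset, void *data, int sz);
    void (*write)(struct pci_header *hdr, uint16_t offset, void *data, int sz);
};

struct pci_header {
    uint8_t data[256];                   /* local copy of config space */
    struct pci_config_operations cfg_ops;
};

/* Guest config read: let the backend refresh, then copy out. */
static void pci_config_read(struct pci_header *hdr, uint16_t offset,
                            void *data, int sz)
{
    if (hdr->cfg_ops.read)
        hdr->cfg_ops.read(hdr, offset, data, sz);
    memcpy(data, hdr->data + offset, sz);
}

/* Example backend: pretend the device just updated a status byte. */
static void demo_read(struct pci_header *hdr, uint16_t offset,
                      void *data, int sz)
{
    (void)offset; (void)data; (void)sz;
    hdr->data[6] = 0x10;  /* refresh from "hardware" */
}
```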
      
      Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      [JPB: moved to a separate patch]
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
  4. 06 Apr, 2018 1 commit
  5. 19 Mar, 2018 1 commit
    • virtio: Fix ordering of avail index and descriptor read · 15c4e1ef
      Jean-Philippe Brucker authored
      
      
      One barrier seems to be missing from kvmtool's virtio implementation,
      between virt_queue__available() and virt_queue__pop(). In the following
      scenario "avail" represents the shared "available" structure in the virtio
      queue:
      
                     Guest               |               Host
                                         |
          avail.ring[shadow] = desc_idx  | while (avail.idx != shadow)
          smp_wmb()                      |     /* missing smp_rmb() */
          avail.idx = ++shadow           |     desc_idx = avail.ring[shadow++]
      
      If the host observes the avail.idx write before the avail.ring update,
      then it will fetch the wrong desc_idx. Add the missing barrier.
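      The host side of the fix can be sketched as follows. This is an
      illustrative reduction with invented names, not kvmtool's actual
      virt_queue__pop(); __sync_synchronize() stands in for rmb(). The barrier
      sits between the avail.idx load and the avail.ring[] load it guards,
      pairing with the guest's smp_wmb():

```c
#include <stdint.h>
#include <assert.h>

/* Stand-in for kvmtool's rmb(). */
static inline void read_barrier(void)
{
    __sync_synchronize();
}

struct avail_ring {
    volatile uint16_t idx;   /* written by the guest */
    uint16_t ring[256];      /* descriptor indices, written by the guest */
};

/* Returns the next descriptor index, or -1 if the ring is empty. */
static int virt_queue_pop(struct avail_ring *avail, uint16_t *shadow)
{
    if (avail->idx == *shadow)
        return -1;
    read_barrier();  /* don't read ring[] until idx is confirmed newer */
    return avail->ring[(*shadow)++ % 256];
}
```

      Without the barrier, the ring[] load may be speculated before the idx
      load, yielding exactly the stale desc_idx described above.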
      
      This seems to fix the horrible bug I'm often seeing when running netperf
      in a guest (virtio-net + tap) on AMD Seattle. The TX thread reads the
      wrong descriptor index and either faults when accessing the TX buffer, or
      pushes the wrong index to the used ring. In that case the guest complains
      that "id %u is not a head!" and stops the queue.
      
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
  6. 29 Jan, 2018 2 commits
  7. 03 Nov, 2017 2 commits
  8. 09 Jun, 2017 9 commits
  9. 17 Feb, 2017 1 commit
    • kvmtool: virtio-net: fix VIRTIO_NET_F_MRG_RXBUF usage in rx thread · 3fea89a9
      Will Deacon authored
      
      
      When merging virtio-net buffers using the VIRTIO_NET_F_MRG_RXBUF feature,
      the first buffer added to the used ring should indicate the total number
      of buffers used to hold the packet. Unfortunately, kvmtool has a number
      of issues when constructing these merged buffers:
      
        - Commit 5131332e3f1a ("kvmtool: convert net backend to support
          bi-endianness") introduced a strange loop counter, which resulted in
          hdr->num_buffers being set redundantly the first time round
      
        - When adding the buffers to the ring, we actually add them one-by-one,
          allowing the guest to see the header before we've inserted the rest
          of the data buffers...
      
        - ... which is made worse because we non-atomically increment the
          num_buffers count in the header each time we insert a new data buffer
      
      Consequently, the guest quickly becomes confused in its net rx code and
      the whole thing grinds to a halt. This is easily exemplified by trying
      to boot a root filesystem over NFS, which seldom succeeds.
      
      This patch resolves the issues by allowing us to insert items into the
      used ring without updating the index. Once the full payload has been
      added and num_buffers corresponds to the total size, we *then* publish
      the buffers to the guest.
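      The stage-then-publish scheme can be sketched as below. Names are
      illustrative, not kvmtool's actual used-ring code: elements are written
      into slots beyond the visible index, and only the final idx update
      (preceded by a write barrier) makes the whole batch visible to the guest.

```c
#include <stdint.h>
#include <assert.h>

struct used_elem { uint32_t id; uint32_t len; };

struct used_ring {
    volatile uint16_t idx;        /* the only field the guest polls */
    struct used_elem ring[256];
};

/* Stage one element at slot (idx + count) without making it visible. */
static void used_ring_stage(struct used_ring *u, uint16_t count,
                            uint32_t id, uint32_t len)
{
    u->ring[(uint16_t)(u->idx + count) % 256] =
        (struct used_elem){ id, len };
}

/*
 * Publish 'count' staged elements at once. The barrier orders the ring
 * writes before the idx update so the guest never sees a partial batch
 * (or a header whose num_buffers is still being incremented).
 */
static void used_ring_publish(struct used_ring *u, uint16_t count)
{
    __sync_synchronize();
    u->idx += count;
}
```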
      
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
  10. 17 May, 2016 1 commit
  11. 11 Apr, 2016 1 commit
  12. 02 Mar, 2016 2 commits
  13. 18 Nov, 2015 2 commits
    • provide generic read_file() implementation · 649f9515
      Andre Przywara authored
      
      
      In various parts of kvmtool we simply try to read files into memory,
      but fail to do so in a safe way. The read(2) syscall can return early
      having only parts of the file read, or it may return -1 due to being
      interrupted by a signal (in which case we should simply retry).
      The ARM code seems to provide the only safe implementation, so take
      that as an inspiration to provide a generic read_file() function
      usable by every part of kvmtool.
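      The safe read loop described here follows a standard POSIX pattern:
      retry on EINTR, continue on short reads, stop at EOF. A sketch (the
      actual kvmtool read_file() may differ in signature):

```c
#include <unistd.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <assert.h>

/*
 * Read up to 'max' bytes from fd into buf, retrying on signals and
 * short reads. Returns the number of bytes read, or -1 on error.
 */
static ssize_t read_file(int fd, char *buf, size_t max)
{
    size_t total = 0;

    while (total < max) {
        ssize_t n = read(fd, buf + total, max - total);

        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal: just retry */
            return -1;
        }
        if (n == 0)
            break;          /* EOF */
        total += n;
    }
    return (ssize_t)total;
}
```

      A plain read(2) into a single buffer, by contrast, may silently return
      fewer bytes than requested, which is exactly the bug described above.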
      
      Signed-off-by: Andre Przywara <andre.przywara@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • Refactor kernel image loading · 004f7684
      Andre Przywara authored
      
      
      Let's face it: Kernel loading is quite architecture specific. Don't
      claim otherwise and move the loading routines into each
      architecture's responsibility.
      This introduces kvm__arch_load_kernel(), which each architecture can
      implement accordingly.
      Provide bzImage loading for x86 and ELF loading for MIPS as special
      cases for those architectures (removing the arch specific code from
      the generic kvm.c file on the way) and rename the existing "flat binary"
      loader functions for the other architectures to the new name.
      
      Signed-off-by: Andre Przywara <andre.przywara@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>