- 11 Dec, 2019 16 commits
-
-
When a guest can reassign BARs, kvmtool needs to maintain the vfio_region consistent with their corresponding BARs. Take the new updated addresses from the PCI header read back from the vfio driver. Also, to modify the BARs, it is expected that guests will disable IO/Memory response in the PCI command. Support this by mapping/unmapping regions when the corresponding response gets enabled/disabled. Signed-off-by:
Julien Thierry <julien.thierry@arm.com> [Fixed BAR selection] Signed-off-by:
Alexandru Elisei <alexandru.elisei@arm.com>
-
Alexandru Elisei authored
Not all devices have the bottom 32 bits of a 64 bit BAR in an even numbered BAR. For example, on an NVIDIA Quadro P400, BARs 1 and 3 are 64bit. Remove this assumption.
-
Alexandru Elisei authored
kvmtool assumes that the BAR that holds the address for the MSIX table and PBA structure has a size which is equal to the total size of the table and the PBA structure and it allocates memory from MMIO space accordingly. However, when initializing the BARs, the BAR size is set according to the region size reported by VFIO. When the physical BAR size is greater than what kvmtool allocated, we can have a situation where the BAR overlaps with another BAR, in which case kvmtool will fail to map the memory. This was found when trying to do PCI passthrough on a PCIe Realtek r8168 NIC. Let's fix this by allocating an amount of MMIO memory equal to table + PBA size or BAR size, whichever is greater.
-
Alexandru Elisei authored
A device's Memory or I/O space address can be written by software in a Base Address Register (BAR). Allow the BARs to be programable by registering the mmio or ioport emulation when access is enabled for that region, not when the virtual machine is created.
-
Alexandru Elisei authored
kvmtool uses brlock for protecting accesses to the ioport and mmio red-black trees. brlock allows concurrent reads, but only one writer, which is assumed not to be a VCPU thread. This is done by issuing a compiler barrier on read and pausing the entire virtual machine on writes. When KVM_BRLOCK_DEBUG is defined, brlock uses instead a pthread read/write lock. When we will implement reassignable BARs, the mmio or ioport mapping will be done as a result of a VCPU mmio access. When brlock is a read/write lock, it means that we will try to acquire a write lock with the read lock already held by the same VCPU and we will deadlock. When it's not, a VCPU will have to call kvm__pause, which means the virtual machine will stay paused forever. Let's avoid all this by using separate pthread_rwlock_t locks for the mmio and the ioport red-black trees and carefully choosing our read critical region such that modification as a result of a guest mmio access doesn't deadlock. In theory, this leaves us with a small window of opportunity for a VCPU to modify a node used by another VCPU. Inserting in the trees is done by the main thread before starting the virtual machine, and deleting is done after the virtual machine has been paused to be destroyed, so in practice this can only happen if the guest is bugged.
-
Alexandru Elisei authored
A device's response to memory or I/O accesses is disabled when Memory Space, respectively I/O Space, is set to 0 in the Command register. According to the PCI Local Bus Specification Revision 3.0, those two bits reset to 0. Let's respect the specifiction, so set Command and I/O Space to 0 on reset, and ignore accesses to a device's respective regions when they are disabled.
-
Currently, callbacks for memory BAR 1 call the IO port emulation. This means that the memory BAR needs I/O Space to be enabled whenever Memory Space is enabled. Refactor the code so the two type of BARs are independent. Also, unify ioport/mmio callback arguments so that they all receive a virtio_device. Signed-off-by:
Julien Thierry <julien.thierry@arm.com>
-
Linux has this convention that the lower 0x1000 bytes of the IO space should not be used. (cf PCIBIOS_MIN_IO). Just allocate those bytes to prevent future allocation assigning it to devices. Signed-off-by:
Julien Thierry <julien.thierry@arm.com>
-
Current PCI IO region that is exposed through the DT contains ports that are reserved by non-PCI devices. Use the proper PCI IO start so that the region exposed through DT can actually be used to reassign device BARs. Signed-off-by:
Julien Thierry <julien.thierry@arm.com>
-
The PCI Local Bus Specification, Rev. 3.0, Section 6.2.5.1. "Address Maps" states: "Devices that map control functions into I/O Space must not consume more than 256 bytes per I/O Base Address register." Yet all the PCI devices allocate IO ports of IOPORT_SIZE (= 1024 bytes). Fix this by having PCI devices use 256 bytes ports for IO BARs. There is no hard requirement on the size of the memory region described by memory BARs. However, the region must be big enough to hold the virtio common interface described in [1], which is 20 bytes, and other MSI-X and/or device specific configuration. To be consistent, let's also limit the memory region described by BAR1 to 256. This is the same size used by BAR2 for each of the two MSI-X vectors. [1] VIRTIO Version 1.0 Committee Specification 04, section 4.4.8. Signed-off-by:
Julien Thierry <julien.thierry@arm.com> [Added rationale for changing BAR1 size to PCI_IO_SIZE] Signed-off-by:
Alexandru Elisei <alexandru.elisei@arm.com>
-
The dynamic ioport allocation with IOPORT_EMPTY is currently only used by PCI devices. Other devices use fixed ports for which they request registration to the ioport API. PCI ports need to be in the PCI IO space and there is no reason ioport API should know a PCI port is being allocated and needs to be placed in PCI IO space. This currently just happens to be the case. Move the responsability of dynamic allocation of ioports from the ioport API to PCI. In the future, if other types of devices also need dynamic ioport allocation, they'll have to figure out the range of ports they are allowed to use. Signed-off-by:
Julien Thierry <julien.thierry@arm.com> [Renamed functions for clarity] Signed-off-by:
Alexandru Elisei <alexandru.elisei@arm.com>
-
Alexandru Elisei authored
The "bus-range" property encodes the first and last bus number. kvmtool uses bus 0 for PCI and bus 1 for MMIO. Advertise only the PCI bus in the PCI DT node by setting "bus-range" to <0, 0>.
-
Alexandru Elisei authored
According to the PCI local bus specification [1], a device's memory size must be a power of two. This is also implicit in the mechanism that a CPU uses to get the memory size requirement for a PCI device. The vesa device requests a memory size that isn't a power of two. According to the same spec [1], a device is allowed to consume more memory than it actually requires. As a result, the amount of memory that the vesa device now reserves has been increased. To prevent slip-ups in the future, a few BUILD_BUG_ON statements were added in places where the memory size is known at compile time. [1] PCI Local Bus Specification Revision 3.0, section 6.2.5.1
-
Alexandru Elisei authored
The pci-shmem emulated device ("ivshmem") was created by QEMU for cross-VM data sharing. The only Linux driver that uses this device is the Android Virtual System on a Chip staging driver, which also mentions a character device driver implemented on top of shmem, which was removed from Linux. On the kvmtool side, the only commits touching the pci-shmem device since it was introduced in 2012 were made when refactoring various kvmtool subsystems. Let's remove the maintenance burden on the kvmtool maintainers and remove this unused device.
-
According to the 'PCI Local Bus Specification, Revision 3.0, February 3, 2004, Section 6.2.5.1, Implementation Notes, page 227' "Software saves the original value of the Base Address register, writes 0 FFFF FFFFh to the register, then reads it back. Size calculation can be done from the 32-bit value read by first clearing encoding information bits (bit 0 for I/O, bits 0-3 for memory), inverting all 32 bits (logical NOT), then incrementing by 1. The resultant 32-bit value is the memory/I/O range size decoded by the register. Note that the upper 16 bits of the result is ignored if the Base Address register is for I/O and bits 16-31 returned zero upon read." kvmtool was returning the actual BAR resource size which would be incorrect as the software software drivers would invert all 32 bits (logical NOT), then incrementing by 1. This ends up with a very large resource size (in some cases more than 4GB) due to which drivers assert/fail to work. e.g if the BAR resource size was 0x1000, kvmtool would return 0x1000 instead of 0xFFFFF00x. Fixed pci__config_wr() to return the size of the BAR in accordance with the PCI Local Bus specification, Implementation Notes. Signed-off-by:
Sami Mujawar <sami.mujawar@arm.com> Signed-off-by:
Julien Thierry <julien.thierry@arm.com> [Reworked algorithm, removed power-of-two check] Signed-off-by:
Alexandru Elisei <alexandru.elisei@arm.com>
-
Alexandru Elisei authored
Use the compiler toolchain version of objcopy instead of the native one when cross-compiling for the x86_64 architecture.
-
- 02 Aug, 2019 1 commit
-
-
Julien Thierry authored
My @arm.com address is gonna stop working. Update README information with an address people can use to actually reach me. Signed-off-by:
Julien Thierry <julien.thierry.kdev@gmail.com> Signed-off-by:
Will Deacon <will@kernel.org>
-
- 03 Jul, 2019 4 commits
-
-
Dave Martin authored
The SVE KVM support for arm64 includes the additional backend header <asm/sve_context.h> from <asm/kvm.h>. So update this header if it is available. To avoid creating a sudden dependency on a specific minimum kernel version, ignore such optional headers if the source kernel tree doesn't have them. Signed-off-by:
Dave Martin <Dave.Martin@arm.com> Signed-off-by:
Will Deacon <will.deacon@arm.com>
-
Dave Martin authored
If in intermediate step fails, update_headers.sh blindly continues and may return success status. To avoid errors going unnoticed when driving this script, exit and report failure status as soon as something goes wrong. For good measure, also fail on expansion of undefined shell variables to aid future maintainers. Signed-off-by:
Dave Martin <Dave.Martin@arm.com> Reviewed-by:
Andre Przywara <andre.przywara@arm.com> Signed-off-by:
Will Deacon <will.deacon@arm.com>
-
Dave Martin authored
update_headers.sh can break if the current working directory has a funny name or if something odd is passed for LINUX_ROOT. In the interest of cleanliness, quote where appropriate. Signed-off-by:
Dave Martin <Dave.Martin@arm.com> Reviewed-by:
Andre Przywara <andre.przywara@arm.com> Signed-off-by:
Will Deacon <will.deacon@arm.com>
-
Will Deacon authored
Julien has kindly offered to help maintain kvmtool, but it occurred to me that we don't actually provide any maintainer contact details in the repository as it stands. Add a brief "Maintainers" section to the README, immediately after the "Contributing" section so that people know who to nag about merging and reviewing patches. Acked-by:
Marc Zyngier <marc.zyngier@arm.com> Acked-by:
Julien Thierry <julien.thierry@arm.com> Signed-off-by:
Will Deacon <will@kernel.org>
-
- 10 Jun, 2019 2 commits
-
-
Andre Przywara authored
Kvmtool creates a (debug) UNIX socket file for each VM, using its (possibly auto-generated) name as the filename. There is a check using access(), which bails out with an error message if a socket with that name already exists. Aside from this check being unnecessary, as the bind() call later would complain as well, this is also racy. But more annoyingly the bail out is not needed most of the time: an existing socket inode is most likely just an orphaned leftover from a previous kvmtool run, which just failed to remove that file, because of a crash, for instance. Upon finding such a collision, let's first try to connect to that socket, to detect if there is still a kvmtool instance listening on the other end. If that fails, this socket will never come back to life, so we can safely clean it up and reuse the name for the new guest. However if the connect() succeeds, there is an actual live kvmtool instance using this name, so not proceeding is the only op...
-
Andre Przywara authored
When kvmtool (or the host kernel) crashes or gets killed, we cannot automatically remove the socket file we created for that VM. A later call of "lkvm list" iterates over all those files and complains about those "ghost socket files", as there is no one listening on the other side. Also sometimes the automatic guest name generation happens to generate the same name again, so an unrelated "lkvm run" later complains and stops, which is bad for automation. As the only code doing a listen() on this socket is kvmtool upon VM *creation*, such an orphaned socket file will never come back to life, so we can as well unlink() those sockets in the code. This spares the user from doing it herself. We keep the message in the code to notify the user of this. Signed-off-by:
Andre Przywara <andre.przywara@arm.com> Signed-off-by:
Will Deacon <will.deacon@arm.com>
-
- 29 May, 2019 4 commits
-
-
Andre Przywara authored
clang and GCC9 refuse to compile virtio/blk.c with the following message: virtio/blk.c:161:37: error: taking address of packed member 'geometry' of class or structure 'virtio_blk_config' may result in an unaligned pointer value [-Werror,-Waddress-of-packed-member] struct virtio_blk_geometry *geo = &conf->geometry; Since struct virtio_blk_geometry is in a kernel header, we can't do much about the packed attribute, but as Peter pointed out, the solution is rather simple: just get rid of the convenience variable and use the original struct member directly. Reviewed-by:
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Suggested-by:
Peter Maydell <peter.maydell@linaro.org> Signed-off-by:
Andre Przywara <andre.przywara@arm.com> Signed-off-by:
Will Deacon <will.deacon@arm.com>
-
Andre Przywara authored
struct vfio_irq_set from the kernel headers contains a variable sized array to hold a payload. The vfio_irq_eventfd struct puts the "fd" member right after this, hoping it to automatically fit in the payload slot. But having a variable sized type not at the end of a struct is a GNU C extension, so clang will refuse to compile this. Solve this by somewhat doing the compiler's job and place the payload manually at the end of the structure. Reviewed-by:
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Signed-off-by:
Andre Przywara <andre.przywara@arm.com> Signed-off-by:
Will Deacon <will.deacon@arm.com>
-
Andre Przywara authored
clang complained that the comparison of an u8 variable against 256 is somewhat pointless. Just remove the check, as the condition will never hit. Signed-off-by:
Andre Przywara <andre.przywara@arm.com> Reviewed-by:
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Signed-off-by:
Will Deacon <will.deacon@arm.com>
-
Andre Przywara authored
As clang rightfully pointed out, the ampersand in front of this member looks wrong. Remove it so we actually really compare against the count being 0. Signed-off-by:
Andre Przywara <andre.przywara@arm.com> Reviewed-by:
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Signed-off-by:
Will Deacon <will.deacon@arm.com>
-
- 26 Apr, 2019 13 commits
-
-
Jean-Philippe Brucker authored
Ensure that all requests are complete when resetting a virtqueue, by draining the AIO queue after stopping the submission thread. Signed-off-by:
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Signed-off-by:
Will Deacon <will.deacon@arm.com>
-
Jean-Philippe Brucker authored
Add a call into the disk layer to synchronize the AIO queue. Wait for all pending requests to complete. This will be necessary when resetting a virtqueue. The wait() operation isn't the same as flush(). A VIRTIO_BLK_T_FLUSH request ensures that any write request *that completed before the FLUSH is sent* is committed to permanent storage (e.g. written back from a write cache). But it doesn't do anything for requests that are still pending when the FLUSH is sent. Avoid introducing a mutex on the io_submit() and io_getevents() paths, because it can lead to 30% throughput drop on heavy FIO jobs. Instead manage an inflight counter using compare-and-swap operations, which is simple enough as the caller doesn't submit new requests while it waits for the AIO queue to drain. The __sync_fetch_and_* operations are a bit rough since they use full barriers, but that didn't seem to introduce a performance regression. Signed-off-by:
Jean-Philippe Brucker <jean-philippe.brucker@arm....>
-
Jean-Philippe Brucker authored
If the AIO thread is still calling io_getevents() while the exit path calls io_destroy(), it will segfault. Wait for the thread to finish before destroying the context. Reviewed-by:
Andre Przywara <andre.przywara@arm.com> Signed-off-by:
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Signed-off-by:
Will Deacon <will.deacon@arm.com>
-
Jean-Philippe Brucker authored
Currently when the kernel completes a batch of AIO requests and signals it via eventfd, we retrieve at most AIO_MAX events (256), and ignore the rest. Call io_getevents() again in case more events are pending. Reviewed-by:
Andre Przywara <andre.przywara@arm.com> Signed-off-by:
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Signed-off-by:
Will Deacon <will.deacon@arm.com>
-
Jean-Philippe Brucker authored
Add an 'async' attribute to disk_image_operations, that describes if they can submit async I/O or not. disk_image->async is now set iff CONFIG_HAS_AIO and the ops do use AIO. This fixes qcow1, which used to set async = 1 even though the qcow operations don't use AIO. The disk core would perform the read/write operation without pushing the completion onto the virtio queue, and the guest would be stuck waiting. Reviewed-by:
Andre Przywara <andre.przywara@arm.com> Signed-off-by:
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Signed-off-by:
Will Deacon <will.deacon@arm.com>
-
Jean-Philippe Brucker authored
Move all AIO code to a separate file, disk/aio.c, to remove as much #ifdefs as possible. Split the raw read/write disk ops into async and sync, and choose which ones to use depending on CONFIG_HAS_AIO. Note that we fix raw_image__close() which incorrectly checked CONFIG_HAS_VIRTIO instead of CONFIG_HAS_AIO, and closed an unitialized disk->evt. A subsequent commit will complete this refactoring by fixing use of the 'async' disk attribute. Reviewed-by:
Andre Przywara <andre.przywara@arm.com> Signed-off-by:
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Signed-off-by:
Will Deacon <will.deacon@arm.com>
-
Jean-Philippe Brucker authored
sync() should be called before reboot(RB_AUTOBOOT), otherwise data written to disks might be lost. Signed-off-by:
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Signed-off-by:
Will Deacon <will.deacon@arm.com>
-
Jean-Philippe Brucker authored
Since we don't currently tell the guest when the disk backend is read-only, it will report any inconsistent read after write as an error. An image may be read-only either because user requested it on the command-line, or because write support isn't implemented. Pass the read-only attribute using the VIRTIO_BLK_F_RO feature. Reviewed-by:
Andre Przywara <andre.przywara@arm.com> Signed-off-by:
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Signed-off-by:
Will Deacon <will.deacon@arm.com>
-
Jean-Philippe Brucker authored
Even though qcow1 doesn't use the refcount table, the cleanup path still attempts to iterate over its LRU list. Initialize the list to avoid a segfault on exit. Reviewed-by:
Andre Przywara <andre.przywara@arm.com> Signed-off-by:
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Signed-off-by:
Will Deacon <will.deacon@arm.com>
-
Julien Thierry authored
Build breaks when using KVM_BRLOCK_DEBUG because the header was seamingly conceived to be included in a single .c file... Fix this by moving the definition of the read/write lock into the kvm struct. Reviewed-by:
Andre Przywara <andre.przywara@arm.com> Signed-off-by:
Julien Thierry <julien.thierry@arm.com> Signed-off-by:
Will Deacon <will.deacon@arm.com>
-
Julien Thierry authored
The kvm argument is not passed to br_read_lock/unlock, this works for the barrier implementation because the argument is not used. This ever breaks if another lock implementation is used. Reviewed-by:
Andre Przywara <andre.przywara@arm.com> Signed-off-by:
Julien Thierry <julien.thierry@arm.com> Signed-off-by:
Will Deacon <will.deacon@arm.com>
-
Julien Thierry authored
The vesa framebuffer is only used by architectures that explicitly require it (i.e. x86). Compile it out for architectures not using it, as its current implementation might not work for them. Reviewed-by:
Andre Przywara <andre.przywara@arm.com> Signed-off-by:
Julien Thierry <julien.thierry@arm.com> Signed-off-by:
Will Deacon <will.deacon@arm.com>
-
Leo Yan authored
Since PCI forbids enabling INTx, MSI or MSIX at the same time, it's by default to disable INTx mode when enable MSI/MSIX mode; but this logic is easily broken if the guest PCI driver detects the MSI/MSIX cannot work as expected and tries to rollback to use INTx mode. In this case, the INTx mode has been disabled and has no chance to re-enable it, thus both INTx mode and MSI/MSIX mode cannot work in vfio. Below shows the detailed flow for introducing this issue: vfio_pci_configure_dev_irqs() `-> vfio_pci_enable_intx() vfio_pci_enable_msis() `-> vfio_pci_disable_intx() vfio_pci_disable_msis() => Guest PCI driver disables MSI To fix this issue, when disable MSI/MSIX we need to check if INTx mode is available for this device or not; if the device can support INTx then re-enable it so that the device can fallback to use it. Since vfio_pci_disable_intx() / vfio_pci_enable_intx() pair functions may be called for multiple times, this patch uses 'intx_fd == -1' to denote the INTx is disabled, the pair functions can directly bail out when detect INTx has been disabled and enabled respectively. Suggested-by:
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Signed-off-by:
Leo Yan <leo.yan@linaro.org> Reviewed-by:
Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Signed-off-by:
Will Deacon <will.deacon@arm.com>
-