Commit fb04a1ed authored by Peter Xu, committed by Paolo Bonzini

KVM: X86: Implement ring-based dirty memory tracking

This patch is heavily based on previous work from Lei Cao
<> and Paolo Bonzini <>. [1]

KVM currently uses large bitmaps to track dirty memory.  These bitmaps
are copied to userspace when userspace queries KVM for its dirty page
information.  The use of bitmaps is mostly sufficient for live
migration, as large parts of memory are dirtied from one log-dirty
pass to another.  However, in a checkpointing system, the number of
dirty pages is small and in fact it is often bounded---the VM is
paused when it has dirtied a pre-defined number of pages. Traversing a
large, sparsely populated bitmap to find set bits is time-consuming,
as is copying the bitmap to user-space.

A similar issue arises for live migration when the guest memory is
huge but only a few pages are dirtied.  In that case, each dirty sync
has to pull the whole dirty bitmap to userspace and analyse every bit,
even if it's mostly zeros.

The preferred data structure for above scenarios is a dense list of
guest frame numbers (GFN).  This patch series stores the dirty list in
kernel memory that can be memory mapped into userspace to allow speedy
harvesting.

This patch enables dirty ring for X86 only.  However it should be
easily extended to other archs as well.


Signed-off-by: Lei Cao <>
Signed-off-by: Paolo Bonzini <>
Signed-off-by: Peter Xu <>
Message-Id: <>
Signed-off-by: Paolo Bonzini <>
parent 28bd726a
......@@ -262,6 +262,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
memory region. This ioctl returns the size of that region. See the
KVM_RUN documentation for details.
Besides the size of the KVM_RUN communication region, other areas of
the VCPU file descriptor can be mmap-ed, including:
- if KVM_CAP_COALESCED_MMIO is available, a page at
this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
KVM_CAP_COALESCED_MMIO is not documented yet.
- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. For more information on
KVM_CAP_DIRTY_LOG_RING, see section 8.3.
......@@ -6396,3 +6408,84 @@ When enabled, KVM will disable paravirtual features provided to the
guest according to the bits in the KVM_CPUID_FEATURES CPUID leaf
(0x40000001). Otherwise, a guest may use the paravirtual features
regardless of what has actually been exposed through the CPUID leaf.
:Architectures: x86
:Parameters: args[0] - size of the dirty log ring
KVM is capable of tracking dirty memory using ring buffers that are
mmaped into userspace; there is one dirty ring per vcpu.
The dirty ring is available to userspace as an array of
``struct kvm_dirty_gfn``. Each dirty entry is defined as::
struct kvm_dirty_gfn {
	__u32 flags;
	__u32 slot; /* as_id | slot_id */
	__u64 offset;
};
The following values are defined for the flags field to define the
current state of the entry::
#define KVM_DIRTY_GFN_F_MASK 0x3
Userspace should call KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM
ioctl to enable this capability for the new guest and set the size of
the rings. Enabling the capability is only allowed before creating any
vCPU, and the size of the ring must be a power of two. The larger the
ring buffer, the less likely the ring is full and the VM is forced to
exit to userspace. The optimal size depends on the workload, but it is
recommended that it be at least 64 KiB (4096 entries).
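As a rough illustration of the sizing rule above, the following sketch converts a ring size in bytes into an entry count and checks the power-of-two requirement. The struct is a local mirror of ``struct kvm_dirty_gfn`` as documented here (16 bytes per entry); the helper names ``ring_size_is_valid`` and ``ring_entries`` are hypothetical and are not part of the KVM API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local mirror of the documented entry layout: 4 + 4 + 8 = 16 bytes. */
struct dirty_gfn {
	uint32_t flags;
	uint32_t slot;   /* as_id | slot_id */
	uint64_t offset;
};

/* Hypothetical helper: a usable ring size is a non-zero power of two. */
static bool ring_size_is_valid(uint32_t bytes)
{
	return bytes != 0 && (bytes & (bytes - 1)) == 0;
}

/* Hypothetical helper: number of entries that fit in a ring of `bytes`. */
static uint32_t ring_entries(uint32_t bytes)
{
	return bytes / (uint32_t)sizeof(struct dirty_gfn);
}
```

With 16-byte entries, a 64 KiB ring holds 65536 / 16 = 4096 entries, which matches the recommendation above.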
Just like for dirty page bitmaps, the buffer tracks writes to
all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
set in KVM_SET_USER_MEMORY_REGION. Once a memory region is registered
with the flag set, userspace can start harvesting dirty pages from the
ring buffer.
An entry in the ring buffer can be unused (flag bits ``00``),
dirty (flag bits ``01``) or harvested (flag bits ``1X``). The
state machine for the entry is as follows::
          dirtied         harvested        reset
     00 -----------> 01 -------------> 1X -------+
      ^                                          |
      |                                          |
      +------------------------------------------+
To harvest the dirty pages, userspace accesses the mmaped ring buffer
to read the dirty GFNs. If the flags have the DIRTY bit set (at this stage
the RESET bit must be cleared), then it means this GFN is a dirty GFN.
The userspace should harvest this GFN and mark the flags from state
``01b`` to ``1Xb`` (bit 0 will be ignored by KVM, but bit 1 must be set
to show that this GFN is harvested and waiting for a reset), and move
on to the next GFN. The userspace should continue to do this until the
flags of a GFN have the DIRTY bit cleared, meaning that it has harvested
all the dirty GFNs that were available.
It's not necessary for userspace to harvest all the dirty GFNs at once.
However it must collect the dirty GFNs in sequence, i.e., the userspace
program cannot skip one dirty GFN to collect the one next to it.
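The harvesting procedure described above can be sketched as a userspace loop over the mmaped array. This is a minimal sketch under stated assumptions, not the reference implementation: the flag names ``GFN_F_DIRTY`` and ``GFN_F_RESET`` are local stand-ins for bit 0 and bit 1 of the flags field, ``harvest`` and ``fetch_index`` are hypothetical names, the ring length is assumed to be a power of two, and the acquire/release accesses use the GCC/Clang ``__atomic`` builtins as one possible way to order the flag accesses against the entry contents.

```c
#include <stdint.h>

#define GFN_F_DIRTY 0x1u  /* bit 0: entry holds a dirty GFN */
#define GFN_F_RESET 0x2u  /* bit 1: entry harvested, waiting for reset */

/* Local mirror of the documented entry layout. */
struct dirty_gfn {
	uint32_t flags;
	uint32_t slot;
	uint64_t offset;
};

/*
 * Harvest entries in sequence, starting at *fetch_index (a free-running
 * counter owned by userspace). Stops at the first entry that is not in
 * state 01, so no dirty entry is ever skipped. Copies the page offsets
 * to out[] and returns the number of entries harvested.
 */
static uint32_t harvest(struct dirty_gfn *ring, uint32_t nentries,
			uint32_t *fetch_index, uint64_t *out,
			uint32_t max_out)
{
	uint32_t count = 0;

	while (count < max_out) {
		struct dirty_gfn *e = &ring[*fetch_index & (nentries - 1)];
		uint32_t flags = __atomic_load_n(&e->flags, __ATOMIC_ACQUIRE);

		/* Only state 01 (DIRTY set, RESET clear) is harvestable. */
		if ((flags & (GFN_F_DIRTY | GFN_F_RESET)) != GFN_F_DIRTY)
			break;

		out[count++] = e->offset;
		/* 01 -> 1X: mark the entry as harvested. */
		__atomic_store_n(&e->flags, GFN_F_RESET, __ATOMIC_RELEASE);
		(*fetch_index)++;
	}
	return count;
}
```

After a call like this returns a non-zero count, userspace would issue the KVM_RESET_DIRTY_RINGS ioctl on the VM file descriptor so the kernel can recycle the harvested entries.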
After processing one or more entries in the ring buffer, userspace
calls the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about
it, so that the kernel will reprotect those collected GFNs.
Therefore, the ioctl must be called *before* reading the content of
the dirty pages.
The dirty ring can get full. When it happens, the KVM_RUN of the
vcpu will return with exit reason KVM_EXIT_DIRTY_RING_FULL.
The dirty ring interface differs from the KVM_GET_DIRTY_LOG interface
in one major respect: when userspace reads the dirty ring, the kernel
may not yet have flushed the processor's dirty page buffers into the
kernel buffer (with dirty bitmaps, the flushing is done by the
KVM_GET_DIRTY_LOG ioctl). To make sure that all dirty pages are visible
in the ring, one needs to kick the vcpu out of KVM_RUN using a signal.
The resulting vmexit ensures that all dirty GFNs are flushed to the
dirty rings.
......@@ -1232,6 +1232,7 @@ struct kvm_x86_ops {
void (*enable_log_dirty_pt_masked)(struct kvm *kvm,
struct kvm_memory_slot *slot,
gfn_t offset, unsigned long mask);
int (*cpu_dirty_log_size)(void);
/* pmu operations of sub-arch */
const struct kvm_pmu_ops *pmu_ops;
......@@ -1744,4 +1745,6 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
#define GET_SMSTATE(type, buf, offset) \
(*(type *)((buf) + (offset) - 0x7e00))
int kvm_cpu_dirty_log_size(void);
#endif /* _ASM_X86_KVM_HOST_H */
......@@ -12,6 +12,7 @@
#define DE_VECTOR 0
#define DB_VECTOR 1
......@@ -10,7 +10,8 @@ endif
KVM := ../../../virt/kvm
kvm-y += $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
kvm-y += x86.o emulate.o i8259.o irq.o lapic.o \
......@@ -1289,6 +1289,14 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
int kvm_cpu_dirty_log_size(void)
if (kvm_x86_ops.cpu_dirty_log_size)
return kvm_x86_ops.cpu_dirty_log_size();
return 0;
bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
struct kvm_memory_slot *slot, u64 gfn)
......@@ -185,7 +185,7 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
if ((!is_writable_pte(old_spte) || pfn_changed) &&
is_writable_pte(new_spte)) {
slot = __gfn_to_memslot(__kvm_memslots(kvm, as_id), gfn);
mark_page_dirty_in_slot(slot, gfn);
mark_page_dirty_in_slot(kvm, slot, gfn);
......@@ -7583,6 +7583,11 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit)
return supported & BIT(bit);
static int vmx_cpu_dirty_log_size(void)
return enable_pml ? PML_ENTITY_NUM : 0;
static struct kvm_x86_ops vmx_x86_ops __initdata = {
.hardware_unsetup = hardware_unsetup,
......@@ -7712,6 +7717,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.migrate_timers = vmx_migrate_timers,
.msr_filter_changed = vmx_msr_filter_changed,
.cpu_dirty_log_size = vmx_cpu_dirty_log_size,
static __init int hardware_setup(void)
......@@ -7829,6 +7835,7 @@ static __init int hardware_setup(void)
vmx_x86_ops.slot_disable_log_dirty = NULL;
vmx_x86_ops.flush_log_dirty = NULL;
vmx_x86_ops.enable_log_dirty_pt_masked = NULL;
vmx_x86_ops.cpu_dirty_log_size = NULL;
if (!cpu_has_vmx_preemption_timer())
......@@ -8754,6 +8754,15 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
bool req_immediate_exit = false;
/* Forbid vmenter if vcpu dirty ring is soft-full */
if (unlikely(vcpu->kvm->dirty_ring_size &&
kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
r = 0;
goto out;
if (kvm_request_pending(vcpu)) {
if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
#include <linux/kvm.h>
* kvm_dirty_ring: KVM internal dirty ring structure
* @dirty_index: free running counter that points to the next slot in
* dirty_ring->dirty_gfns, where a new dirty page should go
* @reset_index: free running counter that points to the next dirty page
* in dirty_ring->dirty_gfns for which dirty trap needs to
* be reenabled
* @size: size of the compact list, dirty_ring->dirty_gfns
* @soft_limit: when the number of dirty pages in the list reaches this
* limit, vcpu that owns this ring should exit to userspace
* to allow userspace to harvest all the dirty pages
* @dirty_gfns: the array to keep the dirty gfns
* @index: index of this dirty ring
struct kvm_dirty_ring {
u32 dirty_index;
u32 reset_index;
u32 size;
u32 soft_limit;
struct kvm_dirty_gfn *dirty_gfns;
int index;
* If KVM_DIRTY_LOG_PAGE_OFFSET not defined, kvm_dirty_ring.o should
* not be included as well, so define these nop functions for the arch.
static inline u32 kvm_dirty_ring_get_rsvd_entries(void)
return 0;
static inline int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring,
int index, u32 size)
return 0;
static inline struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm)
return NULL;
static inline int kvm_dirty_ring_reset(struct kvm *kvm,
struct kvm_dirty_ring *ring)
return 0;
static inline void kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
u32 slot, u64 offset)
static inline struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring,
u32 offset)
return NULL;
static inline void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
static inline bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring)
return true;
#else /* KVM_DIRTY_LOG_PAGE_OFFSET == 0 */
u32 kvm_dirty_ring_get_rsvd_entries(void);
int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size);
struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm);
* called with kvm->slots_lock held, returns the number of
* processed pages.
int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring);
* returns =0: successfully pushed
* <0: unable to push, need to wait
void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset);
/* for use in vm_operations_struct */
struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset);
void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring);
#endif /* KVM_DIRTY_LOG_PAGE_OFFSET == 0 */
#endif /* KVM_DIRTY_RING_H */
......@@ -34,6 +34,7 @@
#include <linux/kvm_types.h>
#include <asm/kvm_host.h>
#include <linux/kvm_dirty_ring.h>
......@@ -319,6 +320,7 @@ struct kvm_vcpu {
bool preempted;
bool ready;
struct kvm_vcpu_arch arch;
struct kvm_dirty_ring dirty_ring;
static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
......@@ -505,6 +507,7 @@ struct kvm {
struct srcu_struct irq_srcu;
pid_t userspace_pid;
unsigned int max_halt_poll_ns;
u32 dirty_ring_size;
#define kvm_err(fmt, ...) \
......@@ -1477,4 +1480,14 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
* This defines how many reserved entries we want to keep before we
* kick the vcpu to the userspace to avoid dirty ring full. This
* value can be tuned to higher if e.g. PML is enabled on the host.
/* Max number of entries allowed for each kvm dirty ring */
......@@ -399,6 +399,69 @@ TRACE_EVENT(kvm_halt_poll_ns,
#define trace_kvm_halt_poll_ns_shrink(vcpu_id, new, old) \
trace_kvm_halt_poll_ns(false, vcpu_id, new, old)
TP_PROTO(struct kvm_dirty_ring *ring, u32 slot, u64 offset),
TP_ARGS(ring, slot, offset),
__field(int, index)
__field(u32, dirty_index)
__field(u32, reset_index)
__field(u32, slot)
__field(u64, offset)
__entry->index = ring->index;
__entry->dirty_index = ring->dirty_index;
__entry->reset_index = ring->reset_index;
__entry->slot = slot;
__entry->offset = offset;
TP_printk("ring %d: dirty 0x%x reset 0x%x "
"slot %u offset 0x%llx (used %u)",
__entry->index, __entry->dirty_index,
__entry->reset_index, __entry->slot, __entry->offset,
__entry->dirty_index - __entry->reset_index)
TP_PROTO(struct kvm_dirty_ring *ring),
__field(int, index)
__field(u32, dirty_index)
__field(u32, reset_index)
__entry->index = ring->index;
__entry->dirty_index = ring->dirty_index;
__entry->reset_index = ring->reset_index;
TP_printk("ring %d: dirty 0x%x reset 0x%x (used %u)",
__entry->index, __entry->dirty_index, __entry->reset_index,
__entry->dirty_index - __entry->reset_index)
TP_PROTO(struct kvm_vcpu *vcpu),
__field(int, vcpu_id)
__entry->vcpu_id = vcpu->vcpu_id;
TP_printk("vcpu %d", __entry->vcpu_id)
#endif /* _TRACE_KVM_MAIN_H */
/* This part must be outside protection */
......@@ -250,6 +250,7 @@ struct kvm_hyperv_exit {
#define KVM_EXIT_ARM_NISV 28
#define KVM_EXIT_X86_RDMSR 29
#define KVM_EXIT_X86_WRMSR 30
/* Emulate instruction failed. */
......@@ -1054,6 +1055,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_X86_MSR_FILTER 189
......@@ -1558,6 +1560,9 @@ struct kvm_pv_cmd {
/* Available with KVM_CAP_X86_MSR_FILTER */
#define KVM_X86_SET_MSR_FILTER _IOW(KVMIO, 0xc6, struct kvm_msr_filter)
/* Available with KVM_CAP_DIRTY_LOG_RING */
/* Secure Encrypted Virtualization command */
enum sev_cmd_id {
/* Guest initialization commands */
......@@ -1711,4 +1716,52 @@ struct kvm_hyperv_eventfd {
* Arch needs to define the macro after implementing the dirty ring
* feature. KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
* starting page offset of the dirty ring structures.
* KVM dirty GFN flags, defined as:
* |---------------+---------------+--------------|
* | bit 1 (reset) | bit 0 (dirty) | Status |
* |---------------+---------------+--------------|
* | 0 | 0 | Invalid GFN |
* | 0 | 1 | Dirty GFN |
* | 1 | X | GFN to reset |
* |---------------+---------------+--------------|
* Lifecycle of a dirty GFN goes like:
* dirtied harvested reset
* 00 -----------> 01 -------------> 1X -------+
* ^ |
* | |
* +------------------------------------------+
* The userspace program is only responsible for the 01->1X state
* conversion after harvesting an entry. Also, it must not skip any
* dirty bits, so that dirty bits are always harvested in sequence.
#define KVM_DIRTY_GFN_F_MASK 0x3
* KVM dirty rings should be mapped at KVM_DIRTY_LOG_PAGE_OFFSET of
* per-vcpu mmaped regions as an array of struct kvm_dirty_gfn. The
* size of the gfn buffer is decided by the first argument when
struct kvm_dirty_gfn {
__u32 flags;
__u32 slot;
__u64 offset;
#endif /* __LINUX_KVM_H */
/* SPDX-License-Identifier: GPL-2.0-only */
* KVM dirty ring implementation
* Copyright 2019 Red Hat, Inc.
#include <linux/kvm_host.h>
#include <linux/kvm.h>
#include <linux/vmalloc.h>
#include <linux/kvm_dirty_ring.h>
#include <trace/events/kvm.h>
int __weak kvm_cpu_dirty_log_size(void)
return 0;
u32 kvm_dirty_ring_get_rsvd_entries(void)
return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
static u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
return READ_ONCE(ring->dirty_index) - READ_ONCE(ring->reset_index);
bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring)
return kvm_dirty_ring_used(ring) >= ring->soft_limit;
static bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
return kvm_dirty_ring_used(ring) >= ring->size;
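The occupancy helpers above rely on ``dirty_index`` and ``reset_index`` being free-running ``u32`` counters: their difference is taken modulo 2^32, so it stays correct after ``dirty_index`` wraps around. The following standalone sketch shows that arithmetic; ``struct ring_counters`` and the ``ring_*`` functions are hypothetical userspace mirrors written for illustration, not the kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the counter fields of struct kvm_dirty_ring. */
struct ring_counters {
	uint32_t dirty_index;  /* next slot to fill */
	uint32_t reset_index;  /* next slot to reset */
	uint32_t size;         /* number of entries in the ring */
	uint32_t soft_limit;   /* size minus reserved entries */
};

/* Unsigned subtraction wraps modulo 2^32, so this survives overflow. */
static uint32_t ring_used(const struct ring_counters *r)
{
	return r->dirty_index - r->reset_index;
}

static bool ring_soft_full(const struct ring_counters *r)
{
	return ring_used(r) >= r->soft_limit;
}

static bool ring_full(const struct ring_counters *r)
{
	return ring_used(r) >= r->size;
}
```

For example, with ``reset_index = 0xFFFFFFFB`` and ``dirty_index = 5`` (after wrapping), the difference is 10 entries in use. The gap between the soft limit and the ring size leaves room for the entries that can still arrive before the vcpu actually exits to userspace.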
struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm)
struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
WARN_ON_ONCE(vcpu->kvm != kvm);
return &vcpu->dirty_ring;
static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
struct kvm_memory_slot *memslot;
int as_id, id;
as_id = slot >> 16;
id = (u16)slot;
memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
if (!memslot || (offset + __fls(mask)) >= memslot->npages)
kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
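The decoding in ``kvm_reset_dirty_gfn`` treats the 32-bit ``slot`` field as an address space ID in the upper 16 bits and a memslot ID in the lower 16 bits. A small sketch of that packing follows; the decode side matches the two assignments above, while the encode side is an assumption inferred from the ``as_id | slot_id`` comment on the field. All three function names are hypothetical.

```c
#include <stdint.h>

/* Assumed packing: address space ID in bits 31..16, slot ID in bits 15..0. */
static uint32_t slot_encode(uint16_t as_id, uint16_t slot_id)
{
	return ((uint32_t)as_id << 16) | slot_id;
}

/* Mirrors `as_id = slot >> 16`. */
static uint16_t slot_as_id(uint32_t slot)
{
	return (uint16_t)(slot >> 16);
}

/* Mirrors `id = (u16)slot`. */
static uint16_t slot_id(uint32_t slot)
{
	return (uint16_t)slot;
}
```

Userspace that harvests entries can apply the same split to find which memory region a dirty ``offset`` belongs to.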
int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size)
ring->dirty_gfns = vmalloc(size);
if (!ring->dirty_gfns)
return -ENOMEM;
memset(ring->dirty_gfns, 0, size);
ring->size = size / sizeof(struct kvm_dirty_gfn);
ring->soft_limit = ring->size - kvm_dirty_ring_get_rsvd_entries();
ring->dirty_index = 0;
ring->reset_index = 0;
ring->index = index;
return 0;
static inline void kvm_dirty_gfn_set_invalid(struct kvm_dirty_gfn *gfn)
gfn->flags = 0;
static inline void kvm_dirty_gfn_set_dirtied(struct kvm_dirty_gfn *gfn)
gfn->flags = KVM_DIRTY_GFN_F_DIRTY;
static inline bool kvm_dirty_gfn_invalid(struct kvm_dirty_gfn *gfn)
return gfn->flags == 0;
static inline bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn)
return gfn->flags & KVM_DIRTY_GFN_F_RESET;
int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
u32 cur_slot, next_slot;
u64 cur_offset, next_offset;
unsigned long mask;
int count = 0;
struct kvm_dirty_gfn *entry;
bool first_round = true;
/* This is only needed to make compilers happy */
cur_slot = cur_offset = mask = 0;
while (true) {
entry = &ring->