Skip to content
  • Christoph Lameter's avatar
    Lockless (and preemptless) fastpaths for slub · 8a5ec0ba
    Christoph Lameter authored
    
    
    Use the this_cpu_cmpxchg_double functionality to implement a lockless
    allocation algorithm on arches that support fast this_cpu_ops.
    
    Each of the per cpu pointers is paired with a transaction id that ensures
    that updates of the per cpu information can only occur in sequence on
    a certain cpu.
    
    A transaction id is a "long" integer that is comprised of an event number
    and the cpu number. The event number is incremented for every change to the
    per cpu state. This means that the cmpxchg instruction can verify for an
    update that nothing interfered and that we are updating the percpu structure
    for the processor where we picked up the information and that we are also
    currently on that processor when we update the information.
    
    This results in a significant decrease of the overhead in the fastpaths. It
    also makes it easy to adopt the fast path for realtime kernels since this
    is lockless and does not require the use of the current per cpu area
    over the critical section. It is only important that the per cpu area is
    current at the beginning of the critical section and at the end.
    
    So there is no need even to disable preemption.
    
    Test results show that the fastpath cycle count is reduced by up to ~ 40%
    (alloc/free test goes from ~140 cycles down to ~80). The slowpath for kfree
    adds a few cycles.
    
    Sadly this does nothing for the slowpath which is where the main issues with
    performance in slub are but the best case performance rises significantly.
    (For that see the more complex slub patches that require cmpxchg_double)
    
    Kmalloc: alloc/free test
    
    Before:
    
    10000 times kmalloc(8)/kfree -> 134 cycles
    10000 times kmalloc(16)/kfree -> 152 cycles
    10000 times kmalloc(32)/kfree -> 144 cycles
    10000 times kmalloc(64)/kfree -> 142 cycles
    10000 times kmalloc(128)/kfree -> 142 cycles
    10000 times kmalloc(256)/kfree -> 132 cycles
    10000 times kmalloc(512)/kfree -> 132 cycles
    10000 times kmalloc(1024)/kfree -> 135 cycles
    10000 times kmalloc(2048)/kfree -> 135 cycles
    10000 times kmalloc(4096)/kfree -> 135 cycles
    10000 times kmalloc(8192)/kfree -> 144 cycles
    10000 times kmalloc(16384)/kfree -> 754 cycles
    
    After:
    
    10000 times kmalloc(8)/kfree -> 78 cycles
    10000 times kmalloc(16)/kfree -> 78 cycles
    10000 times kmalloc(32)/kfree -> 82 cycles
    10000 times kmalloc(64)/kfree -> 88 cycles
    10000 times kmalloc(128)/kfree -> 79 cycles
    10000 times kmalloc(256)/kfree -> 79 cycles
    10000 times kmalloc(512)/kfree -> 85 cycles
    10000 times kmalloc(1024)/kfree -> 82 cycles
    10000 times kmalloc(2048)/kfree -> 82 cycles
    10000 times kmalloc(4096)/kfree -> 85 cycles
    10000 times kmalloc(8192)/kfree -> 82 cycles
    10000 times kmalloc(16384)/kfree -> 706 cycles
    
    Kmalloc: Repeatedly allocate then free test
    
    Before:
    
    10000 times kmalloc(8) -> 211 cycles kfree -> 113 cycles
    10000 times kmalloc(16) -> 174 cycles kfree -> 115 cycles
    10000 times kmalloc(32) -> 235 cycles kfree -> 129 cycles
    10000 times kmalloc(64) -> 222 cycles kfree -> 120 cycles
    10000 times kmalloc(128) -> 343 cycles kfree -> 139 cycles
    10000 times kmalloc(256) -> 827 cycles kfree -> 147 cycles
    10000 times kmalloc(512) -> 1048 cycles kfree -> 272 cycles
    10000 times kmalloc(1024) -> 2043 cycles kfree -> 528 cycles
    10000 times kmalloc(2048) -> 4002 cycles kfree -> 571 cycles
    10000 times kmalloc(4096) -> 7740 cycles kfree -> 628 cycles
    10000 times kmalloc(8192) -> 8062 cycles kfree -> 850 cycles
    10000 times kmalloc(16384) -> 8895 cycles kfree -> 1249 cycles
    
    After:
    
    10000 times kmalloc(8) -> 190 cycles kfree -> 129 cycles
    10000 times kmalloc(16) -> 76 cycles kfree -> 123 cycles
    10000 times kmalloc(32) -> 126 cycles kfree -> 124 cycles
    10000 times kmalloc(64) -> 181 cycles kfree -> 128 cycles
    10000 times kmalloc(128) -> 310 cycles kfree -> 140 cycles
    10000 times kmalloc(256) -> 809 cycles kfree -> 165 cycles
    10000 times kmalloc(512) -> 1005 cycles kfree -> 269 cycles
    10000 times kmalloc(1024) -> 1999 cycles kfree -> 527 cycles
    10000 times kmalloc(2048) -> 3967 cycles kfree -> 570 cycles
    10000 times kmalloc(4096) -> 7658 cycles kfree -> 637 cycles
    10000 times kmalloc(8192) -> 8111 cycles kfree -> 859 cycles
    10000 times kmalloc(16384) -> 8791 cycles kfree -> 1173 cycles
    
    Signed-off-by: default avatarChristoph Lameter <cl@linux.com>
    Signed-off-by: default avatarPekka Enberg <penberg@kernel.org>
    8a5ec0ba