Commit 81819f0f authored by Christoph Lameter, committed by Linus Torvalds

SLUB core

This is a new slab allocator which was motivated by the complexity of the
existing code in mm/slab.c. It attempts to address a variety of concerns
with the existing implementation.

A. Management of object queues

   A particular concern was the complex management of the numerous object
   queues in SLAB. SLUB has no such queues. Instead we dedicate a slab for
   each allocating CPU and use objects from a slab directly instead of
   queueing them up.
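
   A minimal sketch of the idea, using the fields introduced later in this
   patch (cpu_slab, freelist, inuse, offset); the function name and shape
   are illustrative, not the actual mm/slub.c fast path, and locking and
   preemption handling are omitted:

	/* Each CPU owns one slab page; an allocation takes the first
	 * object off that page's freelist instead of refilling and
	 * draining per-CPU object queues. */
	static void *cpu_slab_alloc_sketch(struct kmem_cache *s)
	{
		struct page *page = s->cpu_slab[smp_processor_id()];
		void *object;

		if (!page || !page->freelist)
			return NULL;	/* caller falls back to a slow path */

		object = page->freelist;
		/* The pointer to the next free object is stored inside
		 * the object itself, at offset s->offset. */
		page->freelist = *(void **)((char *)object + s->offset);
		page->inuse++;
		return object;
	}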

B. Storage overhead of object queues

   SLAB object queues exist per node and per CPU. The alien cache queue
   even has a queue array that contains a queue for each processor on
   each node. For very large systems the number of queues, and the number
   of objects that may be caught in those queues, grows quadratically with
   system size. On our systems with 1k nodes / processors we have several
   gigabytes tied up just for storing references to objects in those
   queues. This does not include the objects that could be sitting on
   those queues. One fears that the whole memory of the machine could one
   day be consumed by those queues.
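
   As a rough back-of-envelope illustration using the numbers above (an
   estimate, not a measurement from this patch): with 1k nodes and 1k
   processors, an alien array holding a queue for each processor on each
   node means on the order of a million queue heads per slab cache;
   multiplied by the roughly one hundred caches in a typical kernel, and
   with each queue carrying an array of dozens of object pointers, the
   bookkeeping alone reaches gigabytes before a single object is queued.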

C. SLAB meta data overhead

   SLAB has overhead at the beginning of each slab. This means that data
   cannot be naturally aligned at the beginning of a slab block. SLUB keeps
   all metadata in the corresponding struct page. Objects can be naturally
   aligned in the slab. E.g. a 128 byte object will be aligned at 128 byte
   boundaries and can fit tightly into a 4k page with no bytes left over.
   SLAB cannot do this.

D. SLAB has a complex cache reaper

   SLUB does not need a cache reaper for UP systems. On SMP systems
   the per CPU slab may be pushed back onto the partial list but that
   operation is simple and does not require an iteration over a list
   of objects. SLAB expires per CPU, shared and alien object queues
   during cache reaping, which may cause strange holdoffs.

E. SLAB has complex NUMA policy layer support

   SLUB pushes NUMA policy handling into the page allocator. This means
   that allocation is coarser (SLUB does interleave on a page level) but
   that situation was also present before 2.6.13. SLAB's application of
   policies to individual slab objects is certainly a performance concern
   due to the frequent references to memory policies, which may lead a
   sequence of objects to come from one node after another. SLUB will get
   a slab full of objects from one node and then will switch to the next.

F. Reduction of the size of partial slab lists

   SLAB has per node partial lists. This means that over time a large
   number of partial slabs may accumulate on those lists. These can
   only be reused if allocations occur on specific nodes. SLUB has a
   global pool of partial slabs and will consume slabs from that pool
   to decrease fragmentation.

G. Tunables

   SLAB has sophisticated tuning abilities for each slab cache. One can
   manipulate the queue sizes in detail. However, filling the queues still
   requires the use of a spin lock to check out slabs. SLUB has a single
   global tuning parameter (slub_min_order). Increasing the minimum slab
   order can decrease the locking overhead. The bigger the slab order, the
   fewer movements of pages between the per-CPU and partial lists occur,
   and the better SLUB scales.
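
   For example (an illustration, not a benchmark from this patch): booting
   with slub_min_order=2 makes every slab at least four pages, so a per-CPU
   slab hands out roughly four times as many objects before the partial
   list has to be locked again.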

H. Slab merging

   We often have slab caches with similar parameters. SLUB detects those
   on boot up and merges them into the corresponding general caches. This
   leads to more effective memory use. About 50% of all caches can
   be eliminated through slab merging. This will also decrease
   slab fragmentation because partially allocated slabs can be filled
   up again. Slab merging can be switched off by specifying
   slub_nomerge on boot up.

   Note that merging can expose heretofore unknown bugs in the kernel
   because corrupted objects may now be placed differently and corrupt
   different neighboring objects. Enable sanity checks to find those.
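
   A sketch of the kind of compatibility test that merging implies; the
   function and the MERGE_FLAGS_MASK constant are illustrative, not the
   actual mm/slub.c code:

	/* Two caches may share slabs only if their objects are
	 * interchangeable: no constructors or destructors, each cache's
	 * objects fit the other's slots, and the relevant flags agree. */
	static int caches_mergeable_sketch(struct kmem_cache *a,
					   struct kmem_cache *b)
	{
		if (a->ctor || b->ctor || a->dtor || b->dtor)
			return 0;
		if (a->objsize > b->size || b->objsize > a->size)
			return 0;
		if (a->align != b->align)
			return 0;
		/* Hypothetical mask of the flags that must match
		 * (debugging, DMA, RCU freeing, ...). */
		return (a->flags & MERGE_FLAGS_MASK) ==
		       (b->flags & MERGE_FLAGS_MASK);
	}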

I. Diagnostics

   The current slab diagnostics are difficult to use and require a
   recompilation of the kernel. SLUB contains debugging code that
   is always available (but is kept out of the hot code paths).
   SLUB diagnostics can be enabled via the "slub_debug" option.
   Parameters can be specified to select a single slab cache or a
   group of slab caches for diagnostics. This means that the system
   runs with its usual performance and it is much more likely that
   race conditions can be reproduced.
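
   For example (the cache name is only an illustration), booting with

	slub_debug=FU,dentry

   would enable sanity checking and alloc/free tracking for the dentry
   cache alone while every other cache keeps running at full speed.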

J. Resiliency

   If basic sanity checks are on then SLUB is capable of detecting
   common error conditions and recovering as well as possible to allow
   the system to continue.

K. Tracing

   Tracing can be enabled via the slub_debug=T,<slabcache> option
   during boot. SLUB will then log all actions on that slab cache
   and dump the object contents on free.

L. On demand DMA cache creation

   Generally DMA caches are not needed. If a kmalloc is performed with
   __GFP_DMA then only the single DMA slab cache that is actually needed
   is created. For systems that have no ZONE_DMA requirement the support
   is completely eliminated.
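
   A sketch of the idea with hypothetical names (locking and the odd
   kmalloc sizes such as 96 and 192 bytes are ignored; this is not the
   actual mm/slub.c code):

	/* The DMA variant of a kmalloc cache is created the first time
	 * a __GFP_DMA allocation of that size is seen, not at boot. */
	static struct kmem_cache *kmalloc_dma_sketch;

	static void *kmalloc_dma_alloc_sketch(size_t size, gfp_t flags)
	{
		if (!kmalloc_dma_sketch)
			kmalloc_dma_sketch = kmem_cache_create("kmalloc_dma",
					size, 0, SLAB_CACHE_DMA, NULL, NULL);
		return kmem_cache_alloc(kmalloc_dma_sketch, flags);
	}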

M. Performance increase

   Some benchmarks have shown speed improvements on kernbench in the
   range of 5-10%. The locking overhead of SLUB depends on the
   underlying base allocation size. If we can reliably allocate
   larger order pages then it is possible to increase SLUB
   performance much further. The anti-fragmentation patches may
   enable further performance increases.

Tested on:
i386 UP + SMP, x86_64 UP + SMP + NUMA emulation, IA64 NUMA + Simulator

SLUB Boot options

slub_nomerge		Disable merging of slabs
slub_min_order=x	Require a minimum order for slab caches. This
			increases the managed chunk size and therefore
			reduces metadata and locking overhead.
slub_min_objects=x	Minimum objects per slab. Default is 8.
slub_max_order=x	Avoid generating slabs larger than the specified order.
slub_debug		Enable all diagnostics for all caches
slub_debug=<options>	Enable selective options for all caches
slub_debug=<o>,<cache>	Enable selective options for a certain set of
			caches

Available Debug options
F		Double Free checking, sanity and resiliency
R		Red zoning
P		Object / padding poisoning
U		Track last free / alloc
T		Trace all allocs / frees (only use for individual slabs).
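
A hypothetical command line combining the options above (the cache name
is illustrative):

	slub_min_order=1 slub_min_objects=16 slub_debug=T,kmalloc-96

This asks for slabs of at least two pages holding at least 16 objects each
and enables tracing for a single kmalloc cache only.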

To use SLUB: Apply this patch and then select SLUB as the default slab
allocator.

[ fix an oops-causing locking error]
[ various stupid cleanups and small fixes]
Signed-off-by: Christoph Lameter <>
Signed-off-by: Hugh Dickins <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus Torvalds <>
parent 543691a6
@@ -53,6 +53,10 @@ config ARCH_HAS_ILOG2_U64
default y
default y
mainmenu "Fujitsu FR-V Kernel Configuration"
source "init/Kconfig"
@@ -79,6 +79,10 @@ config ARCH_MAY_HAVE_PC_FDC
default y
default y
config DMI
default y
@@ -19,10 +19,16 @@ struct page {
unsigned long flags; /* Atomic flags, some possibly
* updated asynchronously */
atomic_t _count; /* Usage count, see below. */
atomic_t _mapcount; /* Count of ptes mapped in mms,
union {
atomic_t _mapcount; /* Count of ptes mapped in mms,
* to show when page is mapped
* & limit reverse map searches.
struct { /* SLUB uses */
short unsigned int inuse;
short unsigned int offset;
union {
struct {
unsigned long private; /* Mapping-private opaque data:
@@ -43,8 +49,15 @@ struct page {
spinlock_t ptl;
struct { /* SLUB uses */
struct page *first_page; /* Compound pages */
struct kmem_cache *slab; /* Pointer to slab */
union {
pgoff_t index; /* Our offset within mapping. */
void *freelist; /* SLUB: pointer to free object */
pgoff_t index; /* Our offset within mapping. */
struct list_head lru; /* Pageout list, eg. active_list
* protected by zone->lru_lock !
@@ -18,6 +18,9 @@
#define RED_INACTIVE 0x5A2CF071UL /* when obj is inactive */
#define RED_ACTIVE 0x170FC2A5UL /* when obj is active */
#define SLUB_RED_INACTIVE 0xbb
#define SLUB_RED_ACTIVE 0xcc
/* ...and for poisoning */
#define POISON_INUSE 0x5a /* for use-uninitialised poisoning */
#define POISON_FREE 0x6b /* for use-after-free poisoning */
@@ -32,6 +32,7 @@ typedef struct kmem_cache kmem_cache_t __deprecated;
#define SLAB_PANIC 0x00040000UL /* Panic if kmem_cache_create() fails */
#define SLAB_DESTROY_BY_RCU 0x00080000UL /* Defer freeing slabs to RCU */
#define SLAB_MEM_SPREAD 0x00100000UL /* Spread some memory over cpuset */
#define SLAB_TRACE 0x00200000UL /* Trace allocations and frees */
/* Flags passed to a constructor functions */
#define SLAB_CTOR_CONSTRUCTOR 0x001UL /* If not set, then deconstructor */
@@ -42,7 +43,7 @@ typedef struct kmem_cache kmem_cache_t __deprecated;
* struct kmem_cache related prototypes
void __init kmem_cache_init(void);
extern int slab_is_available(void);
int slab_is_available(void);
struct kmem_cache *kmem_cache_create(const char *, size_t, size_t,
unsigned long,
@@ -95,9 +96,14 @@ static inline void *kcalloc(size_t n, size_t size, gfp_t flags)
* the appropriate general cache at compile time.
#if defined(CONFIG_SLAB) || defined(CONFIG_SLUB)
#include <linux/slub_def.h>
#include <linux/slab_def.h>
#endif /* !CONFIG_SLUB */
* Fallback definitions for an allocator not wanting to provide
* its own optimized kmalloc definitions (like SLOB).
@@ -184,7 +190,7 @@ static inline void *__kmalloc_node(size_t size, gfp_t flags, int node)
* allocator where we care about the real place the memory allocation
* request comes from.
#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB)
extern void *__kmalloc_track_caller(size_t, gfp_t, void*);
#define kmalloc_track_caller(size, flags) \
__kmalloc_track_caller(size, flags, __builtin_return_address(0))
@@ -202,7 +208,7 @@ extern void *__kmalloc_track_caller(size_t, gfp_t, void*);
* standard allocator where we care about the real place the memory
* allocation request comes from.
#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB)
extern void *__kmalloc_node_track_caller(size_t, gfp_t, int, void *);
#define kmalloc_node_track_caller(size, flags, node) \
__kmalloc_node_track_caller(size, flags, node, \
* SLUB : A Slab allocator without object queues.
* (C) 2007 SGI, Christoph Lameter <>
#include <linux/types.h>
#include <linux/gfp.h>
#include <linux/workqueue.h>
#include <linux/kobject.h>
struct kmem_cache_node {
spinlock_t list_lock; /* Protect partial list and nr_partial */
unsigned long nr_partial;
atomic_long_t nr_slabs;
struct list_head partial;
* Slab cache management.
struct kmem_cache {
/* Used for retrieving partial slabs etc */
unsigned long flags;
int size; /* The size of an object including meta data */
int objsize; /* The size of an object without meta data */
int offset; /* Free pointer offset. */
unsigned int order;
* Avoid an extra cache line for UP, SMP and for the node local to
* struct kmem_cache.
struct kmem_cache_node local_node;
/* Allocation and freeing of slabs */
int objects; /* Number of objects in slab */
int refcount; /* Refcount for slab cache destroy */
void (*ctor)(void *, struct kmem_cache *, unsigned long);
void (*dtor)(void *, struct kmem_cache *, unsigned long);
int inuse; /* Offset to metadata */
int align; /* Alignment */
const char *name; /* Name (only for display!) */
struct list_head list; /* List of slab caches */
struct kobject kobj; /* For sysfs */
int defrag_ratio;
struct kmem_cache_node *node[MAX_NUMNODES];
struct page *cpu_slab[NR_CPUS];
* Kmalloc subsystem.
#if !defined(CONFIG_MMU) || NR_CPUS > 512 || MAX_NUMNODES > 256
* We keep the general caches in an array of slab caches that are used for
* 2^x bytes of allocations.
extern struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_HIGH + 1];
* Sorry that the following has to be that ugly but some versions of GCC
* have trouble with constant propagation and loops.
static inline int kmalloc_index(int size)
if (size == 0)
return 0;
if (size > 64 && size <= 96)
return 1;
if (size > 128 && size <= 192)
return 2;
if (size <= 8) return 3;
if (size <= 16) return 4;
if (size <= 32) return 5;
if (size <= 64) return 6;
if (size <= 128) return 7;
if (size <= 256) return 8;
if (size <= 512) return 9;
if (size <= 1024) return 10;
if (size <= 2 * 1024) return 11;
if (size <= 4 * 1024) return 12;
if (size <= 8 * 1024) return 13;
if (size <= 16 * 1024) return 14;
if (size <= 32 * 1024) return 15;
if (size <= 64 * 1024) return 16;
if (size <= 128 * 1024) return 17;
if (size <= 256 * 1024) return 18;
if (size <= 512 * 1024) return 19;
if (size <= 1024 * 1024) return 20;
if (size <= 2 * 1024 * 1024) return 21;
if (size <= 4 * 1024 * 1024) return 22;
if (size <= 8 * 1024 * 1024) return 23;
if (size <= 16 * 1024 * 1024) return 24;
if (size <= 32 * 1024 * 1024) return 25;
return -1;
* What we really wanted to do and cannot do because of compiler issues is:
* int i;
* if (size <= (1 << i))
* return i;
* Find the slab cache for a given combination of allocation flags and size.
* This ought to end up with a global pointer to the right cache
* in kmalloc_caches.
static inline struct kmem_cache *kmalloc_slab(size_t size)
int index = kmalloc_index(size);
if (index == 0)
return NULL;
if (index < 0) {
* Generate a link failure. Would be great if we could
* do something to stop the compile here.
extern void __kmalloc_size_too_large(void);
return &kmalloc_caches[index];
#define SLUB_DMA __GFP_DMA
/* Disable DMA functionality */
#define SLUB_DMA 0
static inline void *kmalloc(size_t size, gfp_t flags)
if (__builtin_constant_p(size) && !(flags & SLUB_DMA)) {
struct kmem_cache *s = kmalloc_slab(size);
if (!s)
return NULL;
return kmem_cache_alloc(s, flags);
} else
return __kmalloc(size, flags);
static inline void *kzalloc(size_t size, gfp_t flags)
if (__builtin_constant_p(size) && !(flags & SLUB_DMA)) {
struct kmem_cache *s = kmalloc_slab(size);
if (!s)
return NULL;
return kmem_cache_zalloc(s, flags);
} else
return __kzalloc(size, flags);
extern void *__kmalloc_node(size_t size, gfp_t flags, int node);
static inline void *kmalloc_node(size_t size, gfp_t flags, int node)
if (__builtin_constant_p(size) && !(flags & SLUB_DMA)) {
struct kmem_cache *s = kmalloc_slab(size);
if (!s)
return NULL;
return kmem_cache_alloc_node(s, flags, node);
} else
return __kmalloc_node(size, flags, node);
#endif /* _LINUX_SLUB_DEF_H */
@@ -478,15 +478,6 @@ config SHMEM
option replaces shmem and tmpfs with the much simpler ramfs code,
which may be appropriate on small systems without swap.
config SLAB
default y
bool "Use full SLAB allocator" if (EMBEDDED && !SMP && !SPARSEMEM)
Disabling this replaces the advanced SLAB allocator and
kmalloc support with the drastically simpler SLOB allocator.
SLOB is more space efficient but does not scale well and is
more susceptible to fragmentation.
default y
bool "Enable VM event counters for /proc/vmstat" if EMBEDDED
@@ -496,6 +487,46 @@ config VM_EVENT_COUNTERS
on EMBEDDED systems. /proc/vmstat will only show page counts
if VM event counters are disabled.
prompt "Choose SLAB allocator"
default SLAB
This option allows you to select a slab allocator.
config SLAB
bool "SLAB"
The regular slab allocator that is established and known to work
well in all environments. It organizes cache hot objects in
per cpu and per node queues. SLAB is the default choice for
slab allocator.
config SLUB
bool "SLUB (Unqueued Allocator)"
SLUB is a slab allocator that minimizes cache line usage
instead of managing queues of cached objects (SLAB approach).
Per cpu caching is realized using slabs of objects instead
of queues of objects. SLUB can use memory efficiently
and has enhanced diagnostics.
config SLOB
# SLOB cannot support SMP because SLAB_DESTROY_BY_RCU does not work
# properly.
depends on EMBEDDED && !SMP && !SPARSEMEM
bool "SLOB (Simple Allocator)"
SLOB replaces the SLAB allocator with a drastically simpler
allocator. SLOB is more space efficient than SLAB but does not
scale well (single lock for all operations) and is more susceptible
to fragmentation. SLOB is a great choice to reduce
memory usage and code size for embedded systems.
endmenu # General setup
@@ -511,10 +542,6 @@ config BASE_SMALL
default 0 if BASE_FULL
default 1 if !BASE_FULL
config SLOB
default !SLAB
menu "Loadable module support"
config MODULES
@@ -25,6 +25,7 @@ obj-$(CONFIG_TMPFS_POSIX_ACL) += shmem_acl.o
obj-$(CONFIG_TINY_SHMEM) += tiny-shmem.o
obj-$(CONFIG_SLOB) += slob.o
obj-$(CONFIG_SLAB) += slab.o
obj-$(CONFIG_SLUB) += slub.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o