Commit d9927d46 authored by Stephen Rothwell

Merge branch 'akpm-current/current'

parents 6dd13c29 ab573a2b
What:		/sys/kernel/reboot
Date:		November 2020
KernelVersion:	5.11
Contact:	Matteo Croce <mcroce@microsoft.com>
Description:	Interface to set the kernel reboot behavior, similarly to
		what can be done via the reboot= cmdline option.
		(see Documentation/admin-guide/kernel-parameters.txt)

What:		/sys/kernel/reboot/mode
Date:		November 2020
KernelVersion:	5.11
Contact:	Matteo Croce <mcroce@microsoft.com>
Description:	Reboot mode. Valid values are: cold warm hard soft gpio

What:		/sys/kernel/reboot/type
Date:		November 2020
KernelVersion:	5.11
Contact:	Matteo Croce <mcroce@microsoft.com>
Description:	Reboot type. Valid values are: bios acpi kbd triple efi pci

What:		/sys/kernel/reboot/cpu
Date:		November 2020
KernelVersion:	5.11
Contact:	Matteo Croce <mcroce@microsoft.com>
Description:	CPU number to use to reboot.

What:		/sys/kernel/reboot/force
Date:		November 2020
KernelVersion:	5.11
Contact:	Matteo Croce <mcroce@microsoft.com>
Description:	Don't wait for any other CPUs on reboot and
		avoid anything that could hang.
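For illustration, a shell session exercising this interface might look like
the following (a sketch: it assumes a kernel exposing the files above, and
the values shown are examples only)::

    # cat /sys/kernel/reboot/mode
    cold
    # echo warm > /sys/kernel/reboot/mode
    # echo acpi > /sys/kernel/reboot/type
    # echo 1 > /sys/kernel/reboot/force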
......@@ -334,6 +334,11 @@ Admin can request writeback of those idle pages at right timing via::
With the command, zram writes back idle pages from memory to the storage.
If an admin wants to write a specific page of a zram device to the backing
device, they can write the page index into the interface::

    echo "page_index=1251" > /sys/block/zramX/writeback
If there is a lot of write I/O to a flash device, flash wearout becomes a
potential problem, so an admin needs to impose a write limit to preserve
storage health over the entire product life.
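A sketch of such a limit using zram's writeback_limit knobs (assuming the
device is zram0 and the kernel provides these attributes; the limit is
specified in units of 4K pages)::

    echo 1 > /sys/block/zram0/writeback_limit_enable
    echo 4096 > /sys/block/zram0/writeback_limit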
......
......@@ -133,18 +133,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
8. LRU
======
Each memcg has its own private LRU. Now, its handling is under global
VM's control (meaning that it is handled under the global pgdat->lru_lock).
Almost all routines around the memcg's LRU are called by the global LRU's
list management functions under pgdat->lru_lock.

A special function is mem_cgroup_isolate_pages(). This scans the
memcg's private LRU and calls __isolate_lru_page() to extract a page
from the LRU.
(By __isolate_lru_page(), the page is removed from both the global and
the private LRU.)
Each memcg has its own vector of LRUs (inactive anon, active anon,
inactive file, active file, unevictable) of pages from each node,
each LRU handled under a single lru_lock for that memcg and node.
9. Typical Tests.
=================
......@@ -219,13 +210,11 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
This is an easy way to test page migration, too.
9.5 mkdir/rmdir
---------------
9.5 nested cgroups
------------------
When using hierarchy, the mkdir/rmdir test should be done.
Use tests like the following::
Use tests like the following for testing nested cgroups::
echo 1 >/opt/cgroup/01/memory/use_hierarchy
mkdir /opt/cgroup/01/child_a
mkdir /opt/cgroup/01/child_b
......
......@@ -77,6 +77,8 @@ Brief summary of control files.
 memory.soft_limit_in_bytes	     set/show soft limit of memory usage
 memory.stat			     show various statistics
 memory.use_hierarchy		     set/show hierarchical account enabled
				     This knob is deprecated and shouldn't be
				     used.
 memory.force_empty		     trigger forced page reclaim
 memory.pressure_level		     set memory pressure notifications
 memory.swappiness		     set/show swappiness parameter of vmscan
......@@ -285,20 +287,17 @@ When oom event notifier is registered, event will be delivered.
2.6 Locking
-----------
lock_page_cgroup()/unlock_page_cgroup() should not be called under
the i_pages lock. The other lock order is the following:

	PG_locked.
	mm->page_table_lock
	pgdat->lru_lock
	lock_page_cgroup.
Lock order is as follows:
In many cases, just lock_page_cgroup() is called.

	Page lock (PG_locked bit of page->flags)
	mm->page_table_lock or split pte_lock
	lock_page_memcg (memcg->move_lock)
	mapping->i_pages lock
	lruvec->lru_lock.
per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
pgdat->lru_lock; it has no lock of its own.
Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by
lruvec->lru_lock; PG_lru bit of page->flags is cleared before
isolating a page from its LRU under lruvec->lru_lock.
2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
-----------------------------------------------
......@@ -495,16 +494,13 @@ cgroup might have some charge associated with it, even though all
tasks have migrated away from it. (because we charge against pages, not
against tasks.)
We move the stats to root (if use_hierarchy==0) or parent (if
use_hierarchy==1), and there is no change to the charge except uncharging
We move the stats to the parent, and there is no change to the charge except uncharging
from the child.
Charges recorded in swap information are not updated at removal of a cgroup.
Recorded information is discarded and a cgroup which uses swap (swapcache)
will be charged as a new owner of it.
About use_hierarchy, see Section 6.
5. Misc. interfaces
===================
......@@ -527,8 +523,6 @@ About use_hierarchy, see Section 6.
write will still return success. In this case, it is expected that
memory.kmem.usage_in_bytes == memory.usage_in_bytes.
About use_hierarchy, see Section 6.
5.2 stat file
-------------
......@@ -675,31 +669,20 @@ hierarchy::
d e
In the diagram above, with hierarchical accounting enabled, all memory
usage of e is accounted to its ancestors up until the root (i.e., c and root)
that have memory.use_hierarchy enabled. If one of the ancestors goes over its
limit, the reclaim algorithm reclaims from the tasks in the ancestor and the
children of the ancestor.
6.1 Enabling hierarchical accounting and reclaim
------------------------------------------------
usage of e is accounted to its ancestors up until the root (i.e., c and root).
If one of the ancestors goes over its limit, the reclaim algorithm reclaims
from the tasks in the ancestor and the children of the ancestor.
A memory cgroup by default disables the hierarchy feature. Support
can be enabled by writing 1 to the memory.use_hierarchy file of the root cgroup::
6.1 Hierarchical accounting and reclaim
---------------------------------------
# echo 1 > memory.use_hierarchy
The feature can be disabled by::
# echo 0 > memory.use_hierarchy
Hierarchical accounting is enabled by default. Disabling the hierarchical
accounting is deprecated. An attempt to do it will result in a failure
and a warning printed to dmesg.
NOTE1:
Enabling/disabling will fail if either the cgroup already has other
cgroups created below it, or if the parent cgroup has use_hierarchy
enabled.
For compatibility reasons, writing 1 to memory.use_hierarchy will always pass::
NOTE2:
When panic_on_oom is set to "2", the whole system will panic in
case of an OOM event in any cgroup.
# echo 1 > memory.use_hierarchy
7. Soft limits
==============
......
......@@ -1300,6 +1300,14 @@ PAGE_SIZE multiple when read back.
Amount of memory used in anonymous mappings backed by
transparent hugepages
	  file_thp
		Amount of cached filesystem data backed by transparent
		hugepages

	  shmem_thp
		Amount of shm, tmpfs, shared anonymous mmap()s backed by
		transparent hugepages
inactive_anon, active_anon, inactive_file, active_file, unevictable
Amount of memory, swap-backed and filesystem-backed,
on the internal memory management lists used by the
......
......@@ -39,6 +39,12 @@ call.
User-space tools can get the kernel name, host name, kernel release
number, kernel version, architecture name and OS type from it.
(uts_namespace, name)
---------------------
Offset of the name member. The crash utility and makedumpfile get
the start address of init_uts_ns.name from this.
node_online_map
---------------
......
......@@ -401,21 +401,6 @@ compact_fail
is incremented if the system tries to compact memory
but fails.
compact_pages_moved
is incremented each time a page is moved. If
this value is increasing rapidly, it implies that the system
is copying a lot of data to satisfy the huge page allocation.
It is possible that the cost of copying exceeds any savings
from reduced TLB misses.
compact_pagemigrate_failed
is incremented when the underlying mechanism
for moving a page fails.
compact_blocks_moved
is incremented each time memory compaction examines
a huge page aligned range of pages.
It is possible to establish how long the stalls were using the function
tracer to record how long was spent in __alloc_pages_nodemask and
using the mm_page_alloc tracepoint to identify which allocations were
......
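A sketch of the function-tracer setup described in the tracing paragraph
above (assuming tracefs is mounted at /sys/kernel/debug/tracing)::

    echo __alloc_pages_nodemask > /sys/kernel/debug/tracing/set_graph_function
    echo function_graph > /sys/kernel/debug/tracing/current_tracer
    echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_page_alloc/enable
    cat /sys/kernel/debug/tracing/trace_pipe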
......@@ -824,8 +824,8 @@ e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo
As a side-effect, it also checks for negative totals (elsewhere reported
as 0) and "fails" with EINVAL if any are found, with a warning in dmesg.
(At time of writing, a few stats are known sometimes to be found negative,
with no ill effects: errors and warnings on these stats are suppressed.)
(On an SMP machine some stats can temporarily become negative, with no ill
effects: errors and warnings on these stats are suppressed.)
numa_stat
......@@ -873,12 +873,17 @@ file-backed pages is less than the high watermark in a zone.
unprivileged_userfaultfd
========================
This flag controls whether unprivileged users can use the userfaultfd
system calls. Set this to 1 to allow unprivileged users to use the
userfaultfd system calls, or set this to 0 to restrict userfaultfd to only
privileged users (with the CAP_SYS_PTRACE capability).
This flag controls the mode in which unprivileged users can use the
userfaultfd system calls. Set this to 0 to restrict unprivileged users
to handle page faults in user mode only. In this case, users without
CAP_SYS_PTRACE must pass UFFD_USER_MODE_ONLY in order for userfaultfd to
succeed. Prohibiting use of userfaultfd for handling faults from kernel
mode may make certain vulnerabilities more difficult to exploit.
The default value is 1.
Set this to 1 to allow unprivileged users to use the userfaultfd system
calls without any restrictions.
The default value is 0.
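For example, to inspect and then relax the restriction at runtime (a sketch;
requires root)::

    # sysctl vm.unprivileged_userfaultfd
    vm.unprivileged_userfaultfd = 0
    # sysctl -w vm.unprivileged_userfaultfd=1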
user_reserve_kbytes
......
......@@ -147,6 +147,10 @@ The address of a chunk allocated with `kmalloc` is aligned to at least
ARCH_KMALLOC_MINALIGN bytes. For sizes which are a power of two, the
alignment is also guaranteed to be at least the respective size.
Chunks allocated with kmalloc() can be resized with krealloc(). Similarly
to kmalloc_array(), a helper for resizing arrays is provided in the form of
krealloc_array().
For large allocations you can use vmalloc() and vzalloc(), or directly
request pages from the page allocator. The memory allocated by `vmalloc`
and related functions is not physically contiguous.
......
......@@ -221,12 +221,12 @@ Unit testing
============
This file::
tools/testing/selftests/vm/gup_benchmark.c
tools/testing/selftests/vm/gup_test.c
has the following new calls to exercise the new pin*() wrapper functions:
* PIN_FAST_BENCHMARK (./gup_benchmark -a)
* PIN_BENCHMARK (./gup_benchmark -b)
* PIN_FAST_BENCHMARK (./gup_test -a)
* PIN_BASIC_TEST (./gup_test -b)
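As a run sketch, assuming the kernel was configured with ``CONFIG_GUP_TEST``
and the selftests are built from the kernel tree::

    $ make -C tools/testing/selftests/vm
    $ sudo ./tools/testing/selftests/vm/gup_test -a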
You can monitor how many total dma-pinned pages have been acquired and released
since the system was booted, via two new /proc/vmstat entries::
......
......@@ -22,6 +22,7 @@ whole; patches welcome!
ubsan
kmemleak
kcsan
kfence
gdb-kernel-debugging
kgdb
kselftest
......
.. SPDX-License-Identifier: GPL-2.0
Kernel Electric-Fence (KFENCE)
==============================
Kernel Electric-Fence (KFENCE) is a low-overhead sampling-based memory safety
error detector. KFENCE detects heap out-of-bounds access, use-after-free, and
invalid-free errors.
KFENCE is designed to be enabled in production kernels, and has near zero
performance overhead. Compared to KASAN, KFENCE trades performance for
precision. The main motivation behind KFENCE's design is that with enough
total uptime KFENCE will detect bugs in code paths not typically exercised by
non-production test workloads. One way to quickly achieve a large enough total
uptime is when the tool is deployed across a large fleet of machines.
Usage
-----
To enable KFENCE, configure the kernel with::
CONFIG_KFENCE=y
To build a kernel with KFENCE support, but disabled by default (to enable, set
``kfence.sample_interval`` to a non-zero value), configure the kernel with::
CONFIG_KFENCE=y
CONFIG_KFENCE_SAMPLE_INTERVAL=0
KFENCE provides several other configuration options to customize behaviour (see
the respective help text in ``lib/Kconfig.kfence`` for more info).
Tuning performance
~~~~~~~~~~~~~~~~~~
The most important parameter is KFENCE's sample interval, which can be set via
the kernel boot parameter ``kfence.sample_interval`` in milliseconds. The
sample interval determines the frequency with which heap allocations will be
guarded by KFENCE. The default is configurable via the Kconfig option
``CONFIG_KFENCE_SAMPLE_INTERVAL``. Setting ``kfence.sample_interval=0``
disables KFENCE.
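For instance, booting with ``kfence.sample_interval=100`` samples roughly
every 100 milliseconds. A sketch of adjusting the interval at runtime,
assuming the module parameter is exposed and writable on the running build::

    echo 100 > /sys/module/kfence/parameters/sample_interval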
The KFENCE memory pool is of fixed size, and if the pool is exhausted, no
further KFENCE allocations occur. With ``CONFIG_KFENCE_NUM_OBJECTS`` (default
255), the number of available guarded objects can be controlled. Each object
requires 2 pages, one for the object itself and the other one used as a guard
page; object pages are interleaved with guard pages, and every object page is
therefore surrounded by two guard pages.
The total memory dedicated to the KFENCE memory pool can be computed as::
( #objects + 1 ) * 2 * PAGE_SIZE
Using the default config, and assuming a page size of 4 KiB, results in
dedicating 2 MiB to the KFENCE memory pool.
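The arithmetic for the default config can be checked in a shell (the result
is in bytes, i.e. 2 MiB)::

    $ echo $(( (255 + 1) * 2 * 4096 ))
    2097152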
Note: On architectures that support huge pages, KFENCE will ensure that the
pool is using pages of size ``PAGE_SIZE``. This will result in additional page
tables being allocated.
Error reports
~~~~~~~~~~~~~
A typical out-of-bounds access looks like this::
==================================================================
BUG: KFENCE: out-of-bounds in test_out_of_bounds_read+0xa3/0x22b
Out-of-bounds access at 0xffffffffb672efff (1B left of kfence-#17):
test_out_of_bounds_read+0xa3/0x22b
kunit_try_run_case+0x51/0x85
kunit_generic_run_threadfn_adapter+0x16/0x30
kthread+0x137/0x160
ret_from_fork+0x22/0x30
kfence-#17 [0xffffffffb672f000-0xffffffffb672f01f, size=32, cache=kmalloc-32] allocated by task 507:
test_alloc+0xf3/0x25b
test_out_of_bounds_read+0x98/0x22b
kunit_try_run_case+0x51/0x85
kunit_generic_run_threadfn_adapter+0x16/0x30
kthread+0x137/0x160
ret_from_fork+0x22/0x30
CPU: 4 PID: 107 Comm: kunit_try_catch Not tainted 5.8.0-rc6+ #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
==================================================================
The header of the report provides a short summary of the function involved in
the access. It is followed by more detailed information about the access and
its origin. Note that real kernel addresses are only shown for
``CONFIG_DEBUG_KERNEL=y`` builds.
Use-after-free accesses are reported as::
==================================================================
BUG: KFENCE: use-after-free in test_use_after_free_read+0xb3/0x143
Use-after-free access at 0xffffffffb673dfe0 (in kfence-#24):
test_use_after_free_read+0xb3/0x143
kunit_try_run_case+0x51/0x85
kunit_generic_run_threadfn_adapter+0x16/0x30
kthread+0x137/0x160
ret_from_fork+0x22/0x30
kfence-#24 [0xffffffffb673dfe0-0xffffffffb673dfff, size=32, cache=kmalloc-32] allocated by task 507:
test_alloc+0xf3/0x25b
test_use_after_free_read+0x76/0x143
kunit_try_run_case+0x51/0x85
kunit_generic_run_threadfn_adapter+0x16/0x30
kthread+0x137/0x160
ret_from_fork+0x22/0x30
freed by task 507:
test_use_after_free_read+0xa8/0x143
kunit_try_run_case+0x51/0x85
kunit_generic_run_threadfn_adapter+0x16/0x30
kthread+0x137/0x160
ret_from_fork+0x22/0x30
CPU: 4 PID: 109 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
==================================================================
KFENCE also reports on invalid frees, such as double-frees::
==================================================================
BUG: KFENCE: invalid free in test_double_free+0xdc/0x171
Invalid free of 0xffffffffb6741000:
test_double_free+0xdc/0x171
kunit_try_run_case+0x51/0x85
kunit_generic_run_threadfn_adapter+0x16/0x30
kthread+0x137/0x160
ret_from_fork+0x22/0x30
kfence-#26 [0xffffffffb6741000-0xffffffffb674101f, size=32, cache=kmalloc-32] allocated by task 507:
test_alloc+0xf3/0x25b
test_double_free+0x76/0x171
kunit_try_run_case+0x51/0x85
kunit_generic_run_threadfn_adapter+0x16/0x30
kthread+0x137/0x160
ret_from_fork+0x22/0x30
freed by task 507:
test_double_free+0xa8/0x171
kunit_try_run_case+0x51/0x85
kunit_generic_run_threadfn_adapter+0x16/0x30
kthread+0x137/0x160
ret_from_fork+0x22/0x30
CPU: 4 PID: 111 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
==================================================================
KFENCE also uses pattern-based redzones on the other side of an object's guard
page, to detect out-of-bounds writes on the unprotected side of the object.
These are reported on frees::
==================================================================
BUG: KFENCE: memory corruption in test_kmalloc_aligned_oob_write+0xef/0x184
Corrupted memory at 0xffffffffb6797ff9 [ 0xac . . . . . . ] (in kfence-#69):
test_kmalloc_aligned_oob_write+0xef/0x184
kunit_try_run_case+0x51/0x85
kunit_generic_run_threadfn_adapter+0x16/0x30
kthread+0x137/0x160
ret_from_fork+0x22/0x30
kfence-#69 [0xffffffffb6797fb0-0xffffffffb6797ff8, size=73, cache=kmalloc-96] allocated by task 507:
test_alloc+0xf3/0x25b
test_kmalloc_aligned_oob_write+0x57/0x184
kunit_try_run_case+0x51/0x85
kunit_generic_run_threadfn_adapter+0x16/0x30
kthread+0x137/0x160
ret_from_fork+0x22/0x30
CPU: 4 PID: 120 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
==================================================================
For such errors, the address where the corruption occurred as well as the
invalidly written bytes (offset from the address) are shown; in this
representation, '.' denotes an untouched byte. In the example above, ``0xac`` is
the value written to the invalid address at offset 0, and the remaining '.'
characters denote that no following bytes have been touched. Note that real
values are only shown for ``CONFIG_DEBUG_KERNEL=y`` builds; to avoid information
disclosure for non-debug builds, '!' is used instead to denote invalidly
written bytes.
And finally, KFENCE may also report on invalid accesses to any protected page
where it was not possible to determine an associated object, e.g. if adjacent
object pages had not yet been allocated::
==================================================================
BUG: KFENCE: invalid access in test_invalid_access+0x26/0xe0
Invalid access at 0xffffffffb670b00a:
test_invalid_access+0x26/0xe0
kunit_try_run_case+0x51/0x85
kunit_generic_run_threadfn_adapter+0x16/0x30
kthread+0x137/0x160
ret_from_fork+0x22/0x30
CPU: 4 PID: 124 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
==================================================================
DebugFS interface
~~~~~~~~~~~~~~~~~
Some debugging information is exposed via debugfs:
* The file ``/sys/kernel/debug/kfence/stats`` provides runtime statistics.
* The file ``/sys/kernel/debug/kfence/objects`` provides a list of objects
allocated via KFENCE, including those already freed but protected.
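For instance (a sketch; the exact output format may differ between kernel
versions)::

    # cat /sys/kernel/debug/kfence/stats
    # head /sys/kernel/debug/kfence/objects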
Implementation Details
----------------------
Guarded allocations are set up based on the sample interval. After expiration
of the sample interval, the next allocation through the main allocator (SLAB or
SLUB) returns a guarded allocation from the KFENCE object pool (allocation
sizes up to PAGE_SIZE are supported). At this point, the timer is reset, and
the next allocation is set up after the expiration of the interval. To "gate" a
KFENCE allocation through the main allocator's fast-path without overhead,
KFENCE relies on static branches via the static keys infrastructure. The static
branch is toggled to redirect the allocation to KFENCE.
KFENCE objects each reside on a dedicated page, at either the left or right
page boundaries selected at random. The pages to the left and right of the
object page are "guard pages", whose attributes are changed to a protected
state, and cause page faults on any attempted access. Such page faults are then
intercepted by KFENCE, which handles the fault gracefully by reporting an
out-of-bounds access, and marking the page as accessible so that the faulting
code can (wrongly) continue executing (set ``panic_on_warn`` to panic instead).
To detect out-of-bounds writes to memory within the object's page itself,
KFENCE also uses pattern-based redzones. For each object page, a redzone is set
up for all non-object memory. For typical alignments, the redzone is only
required on the unguarded side of an object. Because KFENCE must honor the
cache's requested alignment, special alignments may result in unprotected gaps
on either side of an object, all of which are redzoned.
The following figure illustrates the page layout::
---+-----------+-----------+-----------+-----------+-----------+---
| xxxxxxxxx | O : | xxxxxxxxx | : O | xxxxxxxxx |
| xxxxxxxxx | B : | xxxxxxxxx | : B | xxxxxxxxx |
| x GUARD x | J : RED- | x GUARD x | RED- : J | x GUARD x |
| xxxxxxxxx | E : ZONE | xxxxxxxxx | ZONE : E | xxxxxxxxx |
| xxxxxxxxx | C : | xxxxxxxxx | : C | xxxxxxxxx |
| xxxxxxxxx | T : | xxxxxxxxx | : T | xxxxxxxxx |
---+-----------+-----------+-----------+-----------+-----------+---
Upon deallocation of a KFENCE object, the object's page is again protected and
the object is marked as freed. Any further access to the object causes a fault
and KFENCE reports a use-after-free access. Freed objects are inserted at the
tail of KFENCE's freelist, so that the least recently freed objects are reused
first, and the chances of detecting use-after-frees of recently freed objects
are increased.
Interface
---------
The following describes the functions which are used by allocators as well as
page handling code to set up and deal with KFENCE allocations.
.. kernel-doc:: include/linux/kfence.h
:functions: is_kfence_address
kfence_shutdown_cache
kfence_alloc kfence_free __kfence_free
kfence_ksize kfence_object_start
kfence_handle_page_fault
Related Tools
-------------
In userspace, a similar approach is taken by `GWP-ASan
<http://llvm.org/docs/GwpAsan.html>`_. GWP-ASan also relies on guard pages and
a sampling strategy to detect memory unsafety bugs at scale. KFENCE's design is
directly influenced by GWP-ASan, and can be seen as its kernel sibling. Another
similar but non-sampling approach, that also inspired the name "KFENCE", can be
found in the userspace `Electric Fence Malloc Debugger
<https://linux.die.net/man/3/efence>`_.
In the kernel, several tools exist to debug memory access errors, and in
particular KASAN can detect all bug classes that KFENCE can detect. While KASAN
is more precise, relying on compiler instrumentation, this comes at a
performance cost.
It is worth highlighting that KASAN and KFENCE are complementary, with
different target environments. For instance, KASAN is the better debugging
aid where test cases or reproducers exist: due to the lower chance to detect
the error, it would require more effort to debug using KFENCE. Deployments at scale
that cannot afford to enable KASAN, however, would benefit from using KFENCE to
discover bugs due to code paths not exercised by test cases or fuzzers.
......@@ -210,6 +210,7 @@ read the file /proc/PID/status::
NoNewPrivs: 0
Seccomp: 0
Speculation_Store_Bypass: thread vulnerable
SpeculationIndirectBranch: conditional enabled
voluntary_ctxt_switches: 0
nonvoluntary_ctxt_switches: 1
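For example, the speculation fields can be read directly (a sketch; the
reported values depend on the CPU and configured mitigations)::

    $ grep '^Speculation' /proc/self/status
    Speculation_Store_Bypass:       thread vulnerable
    SpeculationIndirectBranch:      conditional enabled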
......@@ -292,6 +293,7 @@ It's slow but very precise.
NoNewPrivs no_new_privs, like prctl(PR_GET_NO_NEW_PRIV, ...)
Seccomp seccomp mode, like prctl(PR_GET_SECCOMP, ...)
Speculation_Store_Bypass speculative store bypass mitigation status
SpeculationIndirectBranch indirect branch speculation mode
Cpus_allowed mask of CPUs on which this process may run
Cpus_allowed_list Same as previous, but in "list format"
Mems_allowed mask of memory nodes allowed</