  24 Sep, 2019 (4 commits)
    • memcg, kmem: deprecate kmem.limit_in_bytes · 0158115f
      Michal Hocko authored
      The cgroup v1 memcg controller has exposed a dedicated kmem limit to
      users, which turned out to be a really bad idea because there are paths
      which cannot shrink the kernel memory usage enough to get below the limit
      (e.g.  because the accounted memory is not reclaimable).  There are cases
      when the failure is not even allowed (e.g.  __GFP_NOFAIL).  This means
      that the kmem limit exists on top of the hard limit without any way to
      shrink below it and is thus completely useless.  The OOM killer cannot be
      invoked to handle the situation because that would lead to premature OOM
      killing.
      
      As a result, many places might see ENOMEM returned from kmalloc and run
      into unexpected errors.  E.g.  a global OOM kill when there is a lot of
      free memory, because ENOMEM is translated into VM_FAULT_OOM in the #PF
      path and therefore pagefault_out_of_memory would invoke the OOM killer.
      
      Please note that kernel memory is still accounted against the overall
      limit along with user memory, so removing the kmem-specific limit should
      still allow containing kernel memory consumption.  Unlike the kmem limit,
      though, the overall limit invokes memory reclaim and targeted memcg OOM
      killing if necessary.
      
      Start the deprecation process by crying to the kernel log.  Let's see
      whether there are relevant usecases and simply return EINVAL in the
      second stage if nobody complains within a few releases.
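
      For context, a minimal sketch of the cgroup v1 interface being deprecated
      here; the "example" cgroup name and the /sys/fs/cgroup/memory mount point
      are assumptions, not taken from this patch:

       # cgroup v1 only; the kmem-specific knob this patch starts deprecating:
       echo 512M > /sys/fs/cgroup/memory/example/memory.kmem.limit_in_bytes
       # kernel memory remains accounted against the overall hard limit:
       echo 1G > /sys/fs/cgroup/memory/example/memory.limit_in_bytes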
      
      [akpm@linux-foundation.org: tweak documentation text]
      Link: http://lkml.kernel.org/r/20190911151612.GI4023@dhcp22.suse.cz

      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Thomas Lindroth <thomas.lindroth@gmail.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0158115f
    • mm, page_owner, debug_pagealloc: save and dump freeing stack trace · 8974558f
      Vlastimil Babka authored
      The debug_pagealloc functionality is useful to catch buggy page allocator
      users that cause e.g.  use after free or double free.  When a page
      inconsistency is detected, debugging is often simpler when the call stack
      of the process that last allocated and freed the page is known.  When
      page_owner is also enabled, we record the allocation stack trace, but not
      the freeing one.
      
      This patch therefore adds recording of the freeing call stack to the page
      owner info, if both page_owner and debug_pagealloc are configured and
      enabled.  With only page_owner enabled, this extra info is not useful for
      the memory leak debugging use case.  dump_page() is adjusted to print the
      info.  An example result of calling __free_pages() twice may look like
      this (note the page last free stack trace):
      
      BUG: Bad page state in process bash  pfn:13d8f8
      page:ffffc31984f63e00 refcount:-1 mapcount:0 mapping:0000000000000000 index:0x0
      flags: 0x1affff800000000()
      raw: 01affff800000000 dead000000000100 dead000000000122 0000000000000000
      raw: 0000000000000000 0000000000000000 ffffffffffffffff 0000000000000000
      page dumped because: nonzero _refcount
      page_owner tracks the page as freed
      page last allocated via order 0, migratetype Unmovable, gfp_mask 0xcc0(GFP_KERNEL)
       prep_new_page+0x143/0x150
       get_page_from_freelist+0x289/0x380
       __alloc_pages_nodemask+0x13c/0x2d0
       khugepaged+0x6e/0xc10
       kthread+0xf9/0x130
       ret_from_fork+0x3a/0x50
      page last free stack trace:
       free_pcp_prepare+0x134/0x1e0
       free_unref_page+0x18/0x90
       khugepaged+0x7b/0xc10
       kthread+0xf9/0x130
       ret_from_fork+0x3a/0x50
      Modules linked in:
      CPU: 3 PID: 271 Comm: bash Not tainted 5.3.0-rc4-2.g07a1a73-default+ #57
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
      Call Trace:
       dump_stack+0x85/0xc0
       bad_page.cold+0xba/0xbf
       rmqueue_pcplist.isra.0+0x6c5/0x6d0
       rmqueue+0x2d/0x810
       get_page_from_freelist+0x191/0x380
       __alloc_pages_nodemask+0x13c/0x2d0
       __get_free_pages+0xd/0x30
       __pud_alloc+0x2c/0x110
       copy_page_range+0x4f9/0x630
       dup_mmap+0x362/0x480
       dup_mm+0x68/0x110
       copy_process+0x19e1/0x1b40
       _do_fork+0x73/0x310
       __x64_sys_clone+0x75/0x80
       do_syscall_64+0x6e/0x1e0
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x7f10af854a10
      ...
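
      For reference (not part of this patch), both features are typically
      enabled at boot, assuming the kernel is built with CONFIG_PAGE_OWNER and
      CONFIG_DEBUG_PAGEALLOC, e.g. on the kernel command line:

       page_owner=on debug_pagealloc=on

      With debugfs mounted, the accumulated page_owner records can then be read
      out (the output file name below is arbitrary):

       cat /sys/kernel/debug/page_owner > page_owner_dump.txt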
      
      Link: http://lkml.kernel.org/r/20190820131828.22684-5-vbabka@suse.cz

      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8974558f
    • mm, slab: extend slab/shrink to shrink all memcg caches · 04f768a3
      Waiman Long authored
      Currently, a value of '1' is written to the /sys/kernel/slab/<slab>/shrink
      file to shrink the slab by flushing out all the per-cpu slabs and free
      slabs in partial lists.  This can be useful to squeeze out a bit more
      memory under extreme conditions, as well as to make the active object
      counts in /proc/slabinfo more accurate.
      
      This usually applies only to the root caches, as the SLUB_MEMCG_SYSFS_ON
      option is usually not enabled and "slub_memcg_sysfs=1" is not set.  Even
      if memcg sysfs is turned on, it is too cumbersome and impractical to
      manage all those per-memcg sysfs files in a real production system.
      
      So there is no practical way to shrink memcg caches.  Fix this by making
      a write to the shrink sysfs file of the root cache also scan all the
      available memcg caches and shrink them as well.  For a non-root memcg
      cache (when SLUB_MEMCG_SYSFS_ON or slub_memcg_sysfs is on), only that
      cache will be shrunk when its shrink file is written to.
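
      For illustration, a single write to a root cache's shrink file now also
      shrinks its memcg caches (the cache name below is just an example):

       # echo 1 > /sys/kernel/slab/task_struct/shrink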
      
      On a 2-socket 64-core 256-thread arm64 system with 64k pages, after a
      parallel kernel build the amount of memory occupied by slabs before
      shrinking was:
      
       # grep task_struct /proc/slabinfo
        task_struct        53137  53192   4288   61    4 : tunables    0    0    0 : slabdata    872    872      0
       # grep "^S[lRU]" /proc/meminfo
       Slab:            3936832 kB
       SReclaimable:     399104 kB
       SUnreclaim:      3537728 kB
      
      After shrinking slabs (by echoing "1" to all shrink files):
      
       # grep "^S[lRU]" /proc/meminfo
       Slab:            1356288 kB
       SReclaimable:     263296 kB
       SUnreclaim:      1092992 kB
       # grep task_struct /proc/slabinfo
        task_struct         2764   6832   4288   61    4 : tunables    0    0    0 : slabdata    112    112      0
      
      Link: http://lkml.kernel.org/r/20190723151445.7385-1-longman@redhat.com

      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      04f768a3
    • KVM/Hyper-V: Add new KVM capability KVM_CAP_HYPERV_DIRECT_TLBFLUSH · 344c6c80
      Tianyu Lan authored

      The Hyper-V direct TLB flush function should only be enabled for guests
      that use Hyper-V hypercalls exclusively.  A user space hypervisor (e.g.
      QEMU) can disable the KVM identification in CPUID and expose just the
      Hyper-V identification to make sure this precondition holds.  Add a new
      KVM capability, KVM_CAP_HYPERV_DIRECT_TLBFLUSH, for user space to enable
      the Hyper-V direct TLB flush function; it is disabled in KVM by default.
      
      Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      344c6c80