1. 18 Dec, 2019 1 commit
  2. 01 Dec, 2019 6 commits
    • Daniel Axtens's avatar
      kasan: support backing vmalloc space with real shadow memory · 3c5c3cfb
      Daniel Axtens authored
      Patch series "kasan: support backing vmalloc space with real shadow
      memory", v11.
      
      Currently, vmalloc space is backed by the early shadow page.  This means
      that kasan is incompatible with VMAP_STACK.
      
      This series provides a mechanism to back vmalloc space with real,
      dynamically allocated memory.  I have only wired up x86, because that's
      the only currently supported arch I can work with easily, but it's very
      easy to wire up other architectures, and it appears that there is some
      work-in-progress code to do this on arm64 and s390.
      
      This has been discussed before in the context of VMAP_STACK:
       - https://bugzilla.kernel.org/show_bug.cgi?id=202009
       - https://lkml.org/lkml/2018/7/22/198
       - https://lkml.org/lkml/2019/7/19/822
      
      In terms of implementation details:
      
      Most mappings in vmalloc space are small, requiring less than a full
      page of shadow space.  Allocating a full shadow page per mapping would
      therefore be wasteful.  Furthermore, to ensure that different mappings
      use different shadow pages, mappings would have to be aligned to
      KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE.
      
      Instead, share backing space across multiple mappings.  Allocate a
      backing page when a mapping in vmalloc space uses a particular page of
      the shadow region.  This page can be shared by other vmalloc mappings
      later on.
      
      We hook in to the vmap infrastructure to lazily clean up unused shadow
      memory.
      
      Testing with test_vmalloc.sh on an x86 VM with 2 vCPUs shows that:
      
       - Turning on KASAN, inline instrumentation, without vmalloc, introuduces
         a 4.1x-4.2x slowdown in vmalloc operations.
      
       - Turning this on introduces the following slowdowns over KASAN:
           * ~1.76x slower single-threaded (test_vmalloc.sh performance)
           * ~2.18x slower when both cpus are performing operations
             simultaneously (test_vmalloc.sh sequential_test_order=1)
      
      This is unfortunate but given that this is a debug feature only, not the
      end of the world.  The benchmarks are also a stress-test for the vmalloc
      subsystem: they're not indicative of an overall 2x slowdown!
      
      This patch (of 4):
      
      Hook into vmalloc and vmap, and dynamically allocate real shadow memory
      to back the mappings.
      
      Most mappings in vmalloc space are small, requiring less than a full
      page of shadow space.  Allocating a full shadow page per mapping would
      therefore be wasteful.  Furthermore, to ensure that different mappings
      use different shadow pages, mappings would have to be aligned to
      KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE.
      
      Instead, share backing space across multiple mappings.  Allocate a
      backing page when a mapping in vmalloc space uses a particular page of
      the shadow region.  This page can be shared by other vmalloc mappings
      later on.
      
      We hook in to the vmap infrastructure to lazily clean up unused shadow
      memory.
      
      To avoid the difficulties around swapping mappings around, this code
      expects that the part of the shadow region that covers the vmalloc space
      will not be covered by the early shadow page, but will be left unmapped.
      This will require changes in arch-specific code.
      
      This allows KASAN with VMAP_STACK, and may be helpful for architectures
      that do not have a separate module space (e.g.  powerpc64, which I am
      currently working on).  It also allows relaxing the module alignment
      back to PAGE_SIZE.
      
      Testing with test_vmalloc.sh on an x86 VM with 2 vCPUs shows that:
      
       - Turning on KASAN, inline instrumentation, without vmalloc, introuduces
         a 4.1x-4.2x slowdown in vmalloc operations.
      
       - Turning this on introduces the following slowdowns over KASAN:
           * ~1.76x slower single-threaded (test_vmalloc.sh performance)
           * ~2.18x slower when both cpus are performing operations
             simultaneously (test_vmalloc.sh sequential_test_order=3D1)
      
      This is unfortunate but given that this is a debug feature only, not the
      end of the world.
      
      The full benchmark results are:
      
      Performance
      
                                    No KASAN      KASAN original x baseline  KASAN vmalloc x baseline    x KASAN
      
      fix_size_alloc_test             662004            11404956      17.23       19144610      28.92       1.68
      full_fit_alloc_test             710950            12029752      16.92       13184651      18.55       1.10
      long_busy_list_alloc_test      9431875            43990172       4.66       82970178       8.80       1.89
      random_size_alloc_test         5033626            23061762       4.58       47158834       9.37       2.04
      fix_align_alloc_test           1252514            15276910      12.20       31266116      24.96       2.05
      random_size_align_alloc_te     1648501            14578321       8.84       25560052      15.51       1.75
      align_shift_alloc_test             147                 830       5.65           5692      38.72       6.86
      pcpu_alloc_test                  80732              125520       1.55         140864       1.74       1.12
      Total Cycles              119240774314        763211341128       6.40  1390338696894      11.66       1.82
      
      Sequential, 2 cpus
      
                                    No KASAN      KASAN original x baseline  KASAN vmalloc x baseline    x KASAN
      
      fix_size_alloc_test            1423150            14276550      10.03       27733022      19.49       1.94
      full_fit_alloc_test            1754219            14722640       8.39       15030786       8.57       1.02
      long_busy_list_alloc_test     11451858            52154973       4.55      107016027       9.34       2.05
      random_size_alloc_test         5989020            26735276       4.46       68885923      11.50       2.58
      fix_align_alloc_test           2050976            20166900       9.83       50491675      24.62       2.50
      random_size_align_alloc_te     2858229            17971700       6.29       38730225      13.55       2.16
      align_shift_alloc_test             405                6428      15.87          26253      64.82       4.08
      pcpu_alloc_test                 127183              151464       1.19         216263       1.70       1.43
      Total Cycles               54181269392        308723699764       5.70   650772566394      12.01       2.11
      fix_size_alloc_test            1420404            14289308      10.06       27790035      19.56       1.94
      full_fit_alloc_test            1736145            14806234       8.53       15274301       8.80       1.03
      long_busy_list_alloc_test     11404638            52270785       4.58      107550254       9.43       2.06
      random_size_alloc_test         6017006            26650625       4.43       68696127      11.42       2.58
      fix_align_alloc_test           2045504            20280985       9.91       50414862      24.65       2.49
      random_size_align_alloc_te     2845338            17931018       6.30       38510276      13.53       2.15
      align_shift_alloc_test             472                3760       7.97           9656      20.46       2.57
      pcpu_alloc_test                 118643              132732       1.12         146504       1.23       1.10
      Total Cycles               54040011688        309102805492       5.72   651325675652      12.05       2.11
      
      [dja@axtens.net: fixups]
        Link: http://lkml.kernel.org/r/20191120052719.7201-1-dja@axtens.net
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=3D202009
      Link: http://lkml.kernel.org/r/20191031093909.9228-2-dja@axtens.net
      
      
      Signed-off-by: Mark Rutland <mark.rutland@arm.com> [shadow rework]
      Signed-off-by: default avatarDaniel Axtens <dja@axtens.net>
      Co-developed-by: Mark Rutland's avatarMark Rutland <mark.rutland@arm.com>
      Acked-by: default avatarVasily Gorbik <gor@linux.ibm.com>
      Reviewed-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3c5c3cfb
    • Uladzislau Rezki (Sony)'s avatar
      mm/vmalloc: rework vmap_area_lock · e36176be
      Uladzislau Rezki (Sony) authored
      With the new allocation approach introduced in the 5.2 kernel, it
      becomes possible to get rid of one global spinlock.  By doing that we
      can further improve the KVA from the performance point of view.
      
      Basically we can have two independent locks, one for allocation part and
      another one for deallocation, because of two different entities: "free
      data structures" and "busy data structures".
      
      As a result, allocation/deallocation operations can still interfere
      between each other in case of running simultaneously on different CPUs,
      it means there is still dependency, but with two locks it becomes lower.
      
      Summarizing:
        - it reduces the high lock contention
        - it allows to perform operations on "free" and "busy"
          trees in parallel on different CPUs. Please note it
          does not solve scalability issue.
      
      Test results:
      
      In order to evaluate this patch, we can run "vmalloc test driver" to see
      how many CPU cycles it takes to complete all test cases running
      sequentially.  All online CPUs run it so it will cause a high lock
      contention.
      
      HiKey 960, ARM64, 8xCPUs, big.LITTLE:
      
      <snip>
          sudo ./test_vmalloc.sh sequential_test_order=1
      <snip>
      
      <default>
      [  390.950557] All test took CPU0=457126382 cycles
      [  391.046690] All test took CPU1=454763452 cycles
      [  391.128586] All test took CPU2=454539334 cycles
      [  391.222669] All test took CPU3=455649517 cycles
      [  391.313946] All test took CPU4=388272196 cycles
      [  391.410425] All test took CPU5=384036264 cycles
      [  391.492219] All test took CPU6=387432964 cycles
      [  391.578433] All test took CPU7=387201996 cycles
      <default>
      
      <patched>
      [  304.721224] All test took CPU0=391521310 cycles
      [  304.821219] All test took CPU1=393533002 cycles
      [  304.917120] All test took CPU2=392243032 cycles
      [  305.008986] All test took CPU3=392353853 cycles
      [  305.108944] All test took CPU4=297630721 cycles
      [  305.196406] All test took CPU5=297548736 cycles
      [  305.288602] All test took CPU6=297092392 cycles
      [  305.381088] All test took CPU7=297293597 cycles
      <patched>
      
      ~14%-23% patched variant is better.
      
      Link: http://lkml.kernel.org/r/20191022155800.20468-1-urezki@gmail.com
      
      Signed-off-by: default avatarUladzislau Rezki (Sony) <urezki@gmail.com>
      Acked-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e36176be
    • Uladzislau Rezki (Sony)'s avatar
      mm/vmalloc: add more comments to the adjust_va_to_fit_type() · 060650a2
      Uladzislau Rezki (Sony) authored
      When fit type is NE_FIT_TYPE there is a need in one extra object.
      Usually the "ne_fit_preload_node" per-CPU variable has it and there is
      no need in GFP_NOWAIT allocation, but there are exceptions.
      
      This commit just adds more explanations, as a result giving answers on
      questions like when it can occur, how often, under which conditions and
      what happens if GFP_NOWAIT gets failed.
      
      Link: http://lkml.kernel.org/r/20191016095438.12391-3-urezki@gmail.com
      
      Signed-off-by: default avatarUladzislau Rezki (Sony) <urezki@gmail.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Daniel Wagner <dwagner@suse.de>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      060650a2
    • Uladzislau Rezki (Sony)'s avatar
      mm/vmalloc: respect passed gfp_mask when doing preloading · f07116d7
      Uladzislau Rezki (Sony) authored
      Allocation functions should comply with the given gfp_mask as much as
      possible.  The preallocation code in alloc_vmap_area doesn't follow that
      pattern and it is using a hardcoded GFP_KERNEL.  Although this doesn't
      really make much difference because vmalloc is not GFP_NOWAIT compliant
      in general (e.g.  page table allocations are GFP_KERNEL) there is no
      reason to spread that bad habit and it is good to fix the antipattern.
      
      [mhocko@suse.com: rewrite changelog]
      Link: http://lkml.kernel.org/r/20191016095438.12391-2-urezki@gmail.com
      
      Signed-off-by: default avatarUladzislau Rezki (Sony) <urezki@gmail.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Daniel Wagner <dwagner@suse.de>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f07116d7
    • Uladzislau Rezki (Sony)'s avatar
      mm/vmalloc: remove preempt_disable/enable when doing preloading · 81f1ba58
      Uladzislau Rezki (Sony) authored
      Some background.  The preemption was disabled before to guarantee that a
      preloaded object is available for a CPU, it was stored for.  That was
      achieved by combining the disabling the preemption and taking the spin
      lock while the ne_fit_preload_node is checked.
      
      The aim was to not allocate in atomic context when spinlock is taken
      later, for regular vmap allocations.  But that approach conflicts with
      CONFIG_PREEMPT_RT philosophy.  It means that calling spin_lock() with
      disabled preemption is forbidden in the CONFIG_PREEMPT_RT kernel.
      
      Therefore, get rid of preempt_disable() and preempt_enable() when the
      preload is done for splitting purpose.  As a result we do not guarantee
      now that a CPU is preloaded, instead we minimize the case when it is
      not, with this change, by populating the per cpu preload pointer under
      the vmap_area_lock.
      
      This implies that at least each caller that has done the preallocation
      will not fallback to an atomic allocation later.  It is possible that
      the preallocation would be pointless or that no preallocation is done
      because of the race but the data shows that this is really rare.
      
      For example i run the special test case that follows the preload pattern
      and path.  20 "unbind" threads run it and each does 1000000 allocations.
      Only 3.5 times among 1000000 a CPU was not preloaded.  So it can happen
      but the number is negligible.
      
      [mhocko@suse.com: changelog additions]
      Link: http://lkml.kernel.org/r/20191016095438.12391-1-urezki@gmail.com
      Fixes: 82dd23e8
      
       ("mm/vmalloc.c: preload a CPU with one object for split purpose")
      Signed-off-by: default avatarUladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Acked-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Acked-by: default avatarDaniel Wagner <dwagner@suse.de>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      81f1ba58
    • Liu Xiang's avatar
      dcf61ff0
  3. 18 Nov, 2019 1 commit
    • Andrii Nakryiko's avatar
      bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY · fc970227
      Andrii Nakryiko authored
      
      
      Add ability to memory-map contents of BPF array map. This is extremely useful
      for working with BPF global data from userspace programs. It allows to avoid
      typical bpf_map_{lookup,update}_elem operations, improving both performance
      and usability.
      
      There had to be special considerations for map freezing, to avoid having
      writable memory view into a frozen map. To solve this issue, map freezing and
      mmap-ing is happening under mutex now:
        - if map is already frozen, no writable mapping is allowed;
        - if map has writable memory mappings active (accounted in map->writecnt),
          map freezing will keep failing with -EBUSY;
        - once number of writable memory mappings drops to zero, map freezing can be
          performed again.
      
      Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
      can't be memory mapped either.
      
      For BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc()
      to be mmap()'able. We also need to make sure that array data memory is
      page-sized and page-aligned, so we over-allocate memory in such a way that
      struct bpf_array is at the end of a single page of memory with array->value
      being aligned with the start of the second page. On deallocation we need to
      accomodate this memory arrangement to free vmalloc()'ed memory correctly.
      
      One important consideration regarding how memory-mapping subsystem functions.
      Memory-mapping subsystem provides few optional callbacks, among them open()
      and close().  close() is called for each memory region that is unmapped, so
      that users can decrease their reference counters and free up resources, if
      necessary. open() is *almost* symmetrical: it's called for each memory region
      that is being mapped, **except** the very first one. So bpf_map_mmap does
      initial refcnt bump, while open() will do any extra ones after that. Thus
      number of close() calls is equal to number of open() calls plus one more.
      Signed-off-by: default avatarAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Acked-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Link: https://lore.kernel.org/bpf/20191117172806.2195367-4-andriin@fb.com
      fc970227
  4. 26 Sep, 2019 1 commit
  5. 24 Sep, 2019 3 commits
  6. 04 Sep, 2019 1 commit
    • Christoph Hellwig's avatar
      vmalloc: lift the arm flag for coherent mappings to common code · fe9041c2
      Christoph Hellwig authored
      
      
      The arm architecture had a VM_ARM_DMA_CONSISTENT flag to mark DMA
      coherent remapping for a while.  Lift this flag to common code so
      that we can use it generically.  We also check it in the only place
      VM_USERMAP is directly check so that we can entirely replace that
      flag as well (although I'm not even sure why we'd want to allow
      remapping DMA appings, but I'd rather not change behavior).
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      fe9041c2
  7. 13 Aug, 2019 1 commit
    • Kuppuswamy Sathyanarayanan's avatar
      mm/vmalloc.c: fix percpu free VM area search criteria · 5336e52c
      Kuppuswamy Sathyanarayanan authored
      Recent changes to the vmalloc code by commit 68ad4a33
      ("mm/vmalloc.c: keep track of free blocks for vmap allocation") can
      cause spurious percpu allocation failures.  These, in turn, can result
      in panic()s in the slub code.  One such possible panic was reported by
      Dave Hansen in following link https://lkml.org/lkml/2019/6/19/939.
      Another related panic observed is,
      
       RIP: 0033:0x7f46f7441b9b
       Call Trace:
        dump_stack+0x61/0x80
        pcpu_alloc.cold.30+0x22/0x4f
        mem_cgroup_css_alloc+0x110/0x650
        cgroup_apply_control_enable+0x133/0x330
        cgroup_mkdir+0x41b/0x500
        kernfs_iop_mkdir+0x5a/0x90
        vfs_mkdir+0x102/0x1b0
        do_mkdirat+0x7d/0xf0
        do_syscall_64+0x5b/0x180
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      VMALLOC memory manager divides the entire VMALLOC space (VMALLOC_START
      to VMALLOC_END) into multiple VM areas (struct vm_areas), and it mainly
      uses two lists (vmap_area_list & free_vmap_area_list) to track the used
      and free VM areas in VMALLOC space.  And pcpu_get_vm_areas(offsets[],
      sizes[], nr_vms, align) function is used for allocating congruent VM
      areas for percpu memory allocator.  In order to not conflict with
      VMALLOC users, pcpu_get_vm_areas allocates VM areas near the end of the
      VMALLOC space.  So the search for free vm_area for the given requirement
      starts near VMALLOC_END and moves upwards towards VMALLOC_START.
      
      Prior to commit 68ad4a33, the search for free vm_area in
      pcpu_get_vm_areas() involves following two main steps.
      
      Step 1:
          Find a aligned "base" adress near VMALLOC_END.
          va = free vm area near VMALLOC_END
      Step 2:
          Loop through number of requested vm_areas and check,
              Step 2.1:
                 if (base < VMALLOC_START)
                    1. fail with error
              Step 2.2:
                 // end is offsets[area] + sizes[area]
                 if (base + end > va->vm_end)
                     1. Move the base downwards and repeat Step 2
              Step 2.3:
                 if (base + start < va->vm_start)
                    1. Move to previous free vm_area node, find aligned
                       base address and repeat Step 2
      
      But Commit 68ad4a33 removed Step 2.2 and modified Step 2.3 as below:
      
              Step 2.3:
                 if (base + start < va->vm_start || base + end > va->vm_end)
                    1. Move to previous free vm_area node, find aligned
                       base address and repeat Step 2
      
      Above change is the root cause of spurious percpu memory allocation
      failures.  For example, consider a case where a relatively large vm_area
      (~ 30 TB) was ignored in free vm_area search because it did not pass the
      base + end < vm->vm_end boundary check.  Ignoring such large free
      vm_area's would lead to not finding free vm_area within boundary of
      VMALLOC_start to VMALLOC_END which in turn leads to allocation failures.
      
      So modify the search algorithm to include Step 2.2.
      
      Link: http://lkml.kernel.org/r/20190729232139.91131-1-sathyanarayanan.kuppuswamy@linux.intel.com
      Fixes: 68ad4a33
      
       ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
      Signed-off-by: default avatarKuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
      Reported-by: default avatarDave Hansen <dave.hansen@intel.com>
      Acked-by: default avatarDennis Zhou <dennis@kernel.org>
      Reviewed-by: default avatarUladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: sathyanarayanan kuppuswamy <sathyanarayanan.kuppuswamy@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5336e52c
  8. 22 Jul, 2019 1 commit
  9. 12 Jul, 2019 7 commits
  10. 29 Jun, 2019 1 commit
  11. 24 Jun, 2019 1 commit
  12. 03 Jun, 2019 2 commits
    • Rick Edgecombe's avatar
      mm/vmalloc: Avoid rare case of flushing TLB with weird arguments · 31e67340
      Rick Edgecombe authored
      
      
      In a rare case, flush_tlb_kernel_range() could be called with a start
      higher than the end.
      
      In vm_remove_mappings(), in case page_address() returns 0 for all pages
      (for example they were all in highmem), _vm_unmap_aliases() will be
      called with start = ULONG_MAX, end = 0 and flush = 1.
      
      If at the same time, the vmalloc purge operation is triggered by something
      else while the current operation is between remove_vm_area() and
      _vm_unmap_aliases(), then the vm mapping just removed will be already
      purged. In this case the call of vm_unmap_aliases() may not find any other
      mappings to flush and so end up flushing start = ULONG_MAX, end = 0. So
      only set flush = true if we find something in the direct mapping that we
      need to flush, and this way this can't happen.
      Signed-off-by: default avatarRick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 868b104d ("mm/vmalloc: Add flag for freeing of special permsissions")
      Link: https://lkml.kernel.org/r/20190527211058.2729-3-rick.p.edgecombe@intel.com
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      31e67340
    • Rick Edgecombe's avatar
      mm/vmalloc: Fix calculation of direct map addr range · 8e41f872
      Rick Edgecombe authored
      
      
      The calculation of the direct map address range to flush was wrong.
      This could cause the RO direct map alias to not get flushed. Today
      this shouldn't be a problem because this flush is only needed on x86
      right now and the spurious fault handler will fix cached RO->RW
      translations. In the future though, it could cause the permissions
      to remain RO in the TLB for the direct map alias, and then the page
      would return from the page allocator to some other component as RO
      and cause a crash.
      
      So fix fix the address range calculation so the flush will include the
      direct map range.
      Signed-off-by: default avatarRick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 868b104d ("mm/vmalloc: Add flag for freeing of special permsissions")
      Link: https://lkml.kernel.org/r/20190527211058.2729-2-rick.p.edgecombe@intel.com
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      8e41f872
  13. 01 Jun, 2019 1 commit
  14. 21 May, 2019 1 commit
  15. 18 May, 2019 3 commits
    • Uladzislau Rezki (Sony)'s avatar
      mm/vmap: add DEBUG_AUGMENT_LOWEST_MATCH_CHECK macro · a6cf4e0f
      Uladzislau Rezki (Sony) authored
      This macro adds some debug code to check that vmap allocations are
      happened in ascending order.
      
      By default this option is set to 0 and not active.  It requires
      recompilation of the kernel to activate it.  Set to 1, compile the
      kernel.
      
      [urezki@gmail.com: v4]
        Link: http://lkml.kernel.org/r/20190406183508.25273-4-urezki@gmail.com
      Link: http://lkml.kernel.org/r/20190402162531.10888-4-urezki@gmail.com
      
      Signed-off-by: default avatarUladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a6cf4e0f
    • Uladzislau Rezki (Sony)'s avatar
      mm/vmap: add DEBUG_AUGMENT_PROPAGATE_CHECK macro · bb850f4d
      Uladzislau Rezki (Sony) authored
      This macro adds some debug code to check that the augment tree is
      maintained correctly, meaning that every node contains valid
      subtree_max_size value.
      
      By default this option is set to 0 and not active.  It requires
      recompilation of the kernel to activate it.  Set to 1, compile the
      kernel.
      
      [urezki@gmail.com: v4]
        Link: http://lkml.kernel.org/r/20190406183508.25273-3-urezki@gmail.com
      Link: http://lkml.kernel.org/r/20190402162531.10888-3-urezki@gmail.com
      
      Signed-off-by: default avatarUladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bb850f4d
    • Uladzislau Rezki (Sony)'s avatar
      mm/vmalloc.c: keep track of free blocks for vmap allocation · 68ad4a33
      Uladzislau Rezki (Sony) authored
      Patch series "improve vmap allocation", v3.
      
      Objective
      ---------
      
      Please have a look for the description at:
      
        https://lkml.org/lkml/2018/10/19/786
      
      but let me also summarize it a bit here as well.
      
      The current implementation has O(N) complexity. Requests with different
      permissive parameters can lead to long allocation time. When i say
      "long" i mean milliseconds.
      
      Description
      -----------
      
      This approach organizes the KVA memory layout into free areas of the
      1-ULONG_MAX range, i.e.  an allocation is done over free areas lookups,
      instead of finding a hole between two busy blocks.  It allows to have
      lower number of objects which represent the free space, therefore to have
      less fragmented memory allocator.  Because free blocks are always as large
      as possible.
      
      It uses the augment tree where all free areas are sorted in ascending
      order of va->va_start address in pair with linked list that provides
      O(1) access to prev/next elements.
      
      Since the tree is augment, we also maintain the "subtree_max_size" of VA
      that reflects a maximum available free block in its left or right
      sub-tree.  Knowing that, we can easily traversal toward the lowest (left
      most path) free area.
      
      Allocation: ~O(log(N)) complexity.  It is sequential allocation method
      therefore tends to maximize locality.  The search is done until a first
      suitable block is large enough to encompass the requested parameters.
      Bigger areas are split.
      
      I copy paste here the description of how the area is split, since i
      described it in https://lkml.org/lkml/2018/10/19/786
      
      <snip>
      
      A free block can be split by three different ways.  Their names are
      FL_FIT_TYPE, LE_FIT_TYPE/RE_FIT_TYPE and NE_FIT_TYPE, i.e.  they
      correspond to how requested size and alignment fit to a free block.
      
      FL_FIT_TYPE - in this case a free block is just removed from the free
      list/tree because it fully fits.  Comparing with current design there is
      an extra work with rb-tree updating.
      
      LE_FIT_TYPE/RE_FIT_TYPE - left/right edges fit.  In this case what we do
      is just cutting a free block.  It is as fast as a current design.  Most of
      the vmalloc allocations just end up with this case, because the edge is
      always aligned to 1.
      
      NE_FIT_TYPE - Is much less common case.  Basically it happens when
      requested size and alignment does not fit left nor right edges, i.e.  it
      is between them.  In this case during splitting we have to build a
      remaining left free area and place it back to the free list/tree.
      
      Comparing with current design there are two extra steps.  First one is we
      have to allocate a new vmap_area structure.  Second one we have to insert
      that remaining free block to the address sorted list/tree.
      
      In order to optimize a first case there is a cache with free_vmap objects.
      Instead of allocating from slab we just take an object from the cache and
      reuse it.
      
      Second one is pretty optimized.  Since we know a start point in the tree
      we do not do a search from the top.  Instead a traversal begins from a
      rb-tree node we split.
      <snip>
      
      De-allocation.  ~O(log(N)) complexity.  An area is not inserted straight
      away to the tree/list, instead we identify the spot first, checking if it
      can be merged around neighbors.  The list provides O(1) access to
      prev/next, so it is pretty fast to check it.  Summarizing.  If merged then
      large coalesced areas are created, if not the area is just linked making
      more fragments.
      
      There is one more thing that i should mention here.  After modification of
      VA node, its subtree_max_size is updated if it was/is the biggest area in
      its left or right sub-tree.  Apart of that it can also be populated back
      to upper levels to fix the tree.  For more details please have a look at
      the __augment_tree_propagate_from() function and the description.
      
      Tests and stressing
      -------------------
      
      I use the "test_vmalloc.sh" test driver available under
      "tools/testing/selftests/vm/" since 5.1-rc1 kernel.  Just trigger "sudo
      ./test_vmalloc.sh" to find out how to deal with it.
      
      Tested on different platforms including x86_64/i686/ARM64/x86_64_NUMA.
      Regarding last one, i do not have any physical access to NUMA system,
      therefore i emulated it.  The time of stressing is days.
      
      If you run the test driver in "stress mode", you also need the patch that
      is in Andrew's tree but not in Linux 5.1-rc1.  So, please apply it:
      
      http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/commit/?id=e0cf7749bade6da318e98e934a24d8b62fab512c
      
      After massive testing, i have not identified any problems like memory
      leaks, crashes or kernel panics.  I find it stable, but more testing would
      be good.
      
      Performance analysis
      --------------------
      
      I have used two systems to test.  One is i5-3320M CPU @ 2.60GHz and
      another is HiKey960(arm64) board.  i5-3320M runs on 4.20 kernel, whereas
      Hikey960 uses 4.15 kernel.  I have both system which could run on 5.1-rc1
      as well, but the results have not been ready by time i an writing this.
      
      Currently it consist of 8 tests.  There are three of them which correspond
      to different types of splitting(to compare with default).  We have 3
      ones(see above).  Another 5 do allocations in different conditions.
      
      a) sudo ./test_vmalloc.sh performance
      
      When the test driver is run in "performance" mode, it runs all available
      tests pinned to first online CPU with sequential execution test order.  We
      do it in order to get stable and repeatable results.  Take a look at time
      difference in "long_busy_list_alloc_test".  It is not surprising because
      the worst case is O(N).
      
      # i5-3320M
      How many cycles all tests took:
      CPU0=646919905370(default) cycles vs CPU0=193290498550(patched) cycles
      
      # See detailed table with results here:
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_default.txt
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_patched.txt
      
      # Hikey960 8x CPUs
      How many cycles all tests took:
      CPU0=3478683207 cycles vs CPU0=463767978 cycles
      
      # See detailed table with results here:
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_default.txt
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_patched.txt
      
      b) time sudo ./test_vmalloc.sh test_repeat_count=1
      
      With this configuration, all tests are run on all available online CPUs.
      Before running each CPU shuffles its tests execution order.  It gives
      random allocation behaviour.  So it is rough comparison, but it puts in
      the picture for sure.
      
      # i5-3320M
      <default>            vs            <patched>
      real    101m22.813s                real    0m56.805s
      user    0m0.011s                   user    0m0.015s
      sys     0m5.076s                   sys     0m0.023s
      
      # See detailed table with results here:
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_default.txt
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_patched.txt
      
      # Hikey960 8x CPUs
      <default>            vs            <patched>
      real    unknown                    real    4m25.214s
      user    unknown                    user    0m0.011s
      sys     unknown                    sys     0m0.670s
      
      I did not manage to complete this test on "default Hikey960" kernel
      version.  After 24 hours it was still running, therefore i had to cancel
      it.  That is why real/user/sys are "unknown".
      
      This patch (of 3):
      
      Currently an allocation of the new vmap area is done over busy list
      iteration(complexity O(n)) until a suitable hole is found between two busy
      areas.  Therefore each new allocation causes the list being grown.  Due to
      over fragmented list and different permissive parameters an allocation can
      take a long time.  For example on embedded devices it is milliseconds.
      
      This patch organizes the KVA memory layout into free areas of the
      1-ULONG_MAX range.  It uses an augment red-black tree that keeps blocks
      sorted by their offsets in pair with linked list keeping the free space in
      order of increasing addresses.
      
      Nodes are augmented with the size of the maximum available free block in
      its left or right sub-tree.  Thus, that allows to take a decision and
      traversal toward the block that will fit and will have the lowest start
      address, i.e.  it is sequential allocation.
      
      Allocation: to allocate a new block a search is done over the tree until a
      suitable lowest(left most) block is large enough to encompass: the
      requested size, alignment and vstart point.  If the block is bigger than
      requested size - it is split.
      
      De-allocation: when a busy vmap area is freed it can either be merged or
      inserted to the tree.  Red-black tree allows efficiently find a spot
      whereas a linked list provides a constant-time access to previous and next
      blocks to check if merging can be done.  In case of merging of
      de-allocated memory chunk a large coalesced area is created.
      
      Complexity: ~O(log(N))
      
      [urezki@gmail.com: v3]
        Link: http://lkml.kernel.org/r/20190402162531.10888-2-urezki@gmail.com
      [urezki@gmail.com: v4]
        Link: http://lkml.kernel.org/r/20190406183508.25273-2-urezki@gmail.com
      Link: http://lkml.kernel.org/r/20190321190327.11813-2-urezki@gmail.com
      
      Signed-off-by: default avatarUladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      68ad4a33
  16. 15 May, 2019 2 commits
  17. 30 Apr, 2019 1 commit
    • Rick Edgecombe's avatar
      mm/vmalloc: Add flag for freeing of special permsissions · 868b104d
      Rick Edgecombe authored
      
      
      Add a new flag VM_FLUSH_RESET_PERMS, for enabling vfree operations to
      immediately clear executable TLB entries before freeing pages, and handle
      resetting permissions on the directmap. This flag is useful for any kind
      of memory with elevated permissions, or where there can be related
      permissions changes on the directmap. Today this is RO+X and RO memory.
      
      Although this enables directly vfreeing non-writeable memory now,
      non-writable memory cannot be freed in an interrupt because the allocation
      itself is used as a node on deferred free list. So when RO memory needs to
      be freed in an interrupt the code doing the vfree needs to have its own
      work queue, as was the case before the deferred vfree list was added to
      vmalloc.
      
      For architectures with set_direct_map_ implementations this whole operation
      can be done with one TLB flush when centralized like this. For others with
      directmap permissions, currently only arm64, a backup method using
      set_memory functions is used to reset the directmap. When arm64 adds
      set_direct_map_ functions, this backup can be removed.
      
      When the TLB is flushed to both remove TLB entries for the vmalloc range
      mapping and the direct map permissions, the lazy purge operation could be
      done to try to save a TLB flush later. However today vm_unmap_aliases
      could flush a TLB range that does not include the directmap. So a helper
      is added with extra parameters that can allow both the vmalloc address and
      the direct mapping to be flushed during this operation. The behavior of the
      normal vm_unmap_aliases function is unchanged.
      Suggested-by: default avatarDave Hansen <dave.hansen@intel.com>
      Suggested-by: default avatarAndy Lutomirski <luto@kernel.org>
      Suggested-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarRick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <akpm@linux-foundation.org>
      Cc: <ard.biesheuvel@linaro.org>
      Cc: <deneen.t.dock@intel.com>
      Cc: <kernel-hardening@lists.openwall.com>
      Cc: <kristen@linux.intel.com>
      Cc: <linux_dti@icloud.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190426001143.4983-17-namit@vmware.com
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      868b104d
  18. 06 Mar, 2019 6 commits