1. 28 Dec, 2018 8 commits
    • Andrey Konovalov's avatar
      kasan, mm, arm64: tag non slab memory allocated via pagealloc · 2813b9c0
      Andrey Konovalov authored
      Tag-based KASAN doesn't check memory accesses through pointers tagged with
      0xff.  When page_address is used to get pointer to memory that corresponds
      to some page, the tag of the resulting pointer gets set to 0xff, even
      though the allocated memory might have been tagged differently.
      
      For slab pages it's impossible to recover the correct tag to return from
      page_address, since the page might contain multiple slab objects tagged
      with different values, and we can't know in advance which one of them is
      going to get accessed.  For non slab pages however, we can recover the tag
      in page_address, since the whole page was marked with the same tag.
      
      This patch adds tagging to non slab memory allocated with pagealloc.  To
      set the tag of the pointer returned from page_address, the tag gets stored
      to page->flags when the memory gets allocated.
      
      Link: http://lkml.kernel.org/r/d758ddcef46a5abc9970182b9137e2fbee202a2c.1544099024.git.andreyknvl@google.com
      
      Signed-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Reviewed-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Acked-by: default avatarWill Deacon <will.deacon@arm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2813b9c0
    • Andrey Konovalov's avatar
      kasan, arm64: add brk handler for inline instrumentation · 41eea9cd
      Andrey Konovalov authored
      Tag-based KASAN inline instrumentation mode (which embeds checks of shadow
      memory into the generated code, instead of inserting a callback) generates
      a brk instruction when a tag mismatch is detected.
      
      This commit adds a tag-based KASAN specific brk handler, that decodes the
      immediate value passed to the brk instructions (to extract information
      about the memory access that triggered the mismatch), reads the register
      values (x0 contains the guilty address) and reports the bug.
      
      Link: http://lkml.kernel.org/r/c91fe7684070e34dc34b419e6b69498f4dcacc2d.1544099024.git.andreyknvl@google.com
      
      Signed-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Reviewed-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Acked-by: default avatarWill Deacon <will.deacon@arm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      41eea9cd
    • Andrey Konovalov's avatar
      mm: move obj_to_index to include/linux/slab_def.h · 5b7c4148
      Andrey Konovalov authored
      While with SLUB we can actually preassign tags for caches with contructors
      and store them in pointers in the freelist, SLAB doesn't allow that since
      the freelist is stored as an array of indexes, so there are no pointers to
      store the tags.
      
      Instead we compute the tag twice, once when a slab is created before
      calling the constructor and then again each time when an object is
      allocated with kmalloc.  Tag is computed simply by taking the lowest byte
      of the index that corresponds to the object.  However in kasan_kmalloc we
      only have access to the objects pointer, so we need a way to find out
      which index this object corresponds to.
      
      This patch moves obj_to_index from slab.c to include/linux/slab_def.h to
      be reused by KASAN.
      
      Link: http://lkml.kernel.org/r/c02cd9e574cfd93858e43ac94b05e38f891fef64.1544099024.git.andreyknvl@google.com
      
      Signed-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Reviewed-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5b7c4148
    • Andrey Konovalov's avatar
      kasan: add tag related helper functions · 3c9e3aa1
      Andrey Konovalov authored
      This commit adds a few helper functions, that are meant to be used to work
      with tags embedded in the top byte of kernel pointers: to set, to get or
      to reset the top byte.
      
      Link: http://lkml.kernel.org/r/f6c6437bb8e143bc44f42c3c259c62e734be7935.1544099024.git.andreyknvl@google.com
      
      Signed-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3c9e3aa1
    • Andrey Konovalov's avatar
      kasan: initialize shadow to 0xff for tag-based mode · 080eb83f
      Andrey Konovalov authored
      A tag-based KASAN shadow memory cell contains a memory tag, that
      corresponds to the tag in the top byte of the pointer, that points to that
      memory.  The native top byte value of kernel pointers is 0xff, so with
      tag-based KASAN we need to initialize shadow memory to 0xff.
      
      [cai@lca.pw: arm64: skip kmemleak for KASAN again\
        Link: http://lkml.kernel.org/r/20181226020550.63712-1-cai@lca.pw
      Link: http://lkml.kernel.org/r/5cc1b789aad7c99cf4f3ec5b328b147ad53edb40.1544099024.git.andreyknvl@google.com
      
      Signed-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Reviewed-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      080eb83f
    • Andrey Konovalov's avatar
      kasan: rename kasan_zero_page to kasan_early_shadow_page · 9577dd74
      Andrey Konovalov authored
      With tag based KASAN mode the early shadow value is 0xff and not 0x00, so
      this patch renames kasan_zero_(page|pte|pmd|pud|p4d) to
      kasan_early_shadow_(page|pte|pmd|pud|p4d) to avoid confusion.
      
      Link: http://lkml.kernel.org/r/3fed313280ebf4f88645f5b89ccbc066d320e177.1544099024.git.andreyknvl@google.com
      
      Signed-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Suggested-by: Mark Rutland's avatarMark Rutland <mark.rutland@arm.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9577dd74
    • Andrey Konovalov's avatar
      kasan: add CONFIG_KASAN_GENERIC and CONFIG_KASAN_SW_TAGS · 2bd926b4
      Andrey Konovalov authored
      This commit splits the current CONFIG_KASAN config option into two:
      1. CONFIG_KASAN_GENERIC, that enables the generic KASAN mode (the one
         that exists now);
      2. CONFIG_KASAN_SW_TAGS, that enables the software tag-based KASAN mode.
      
      The name CONFIG_KASAN_SW_TAGS is chosen as in the future we will have
      another hardware tag-based KASAN mode, that will rely on hardware memory
      tagging support in arm64.
      
      With CONFIG_KASAN_SW_TAGS enabled, compiler options are changed to
      instrument kernel files with -fsantize=kernel-hwaddress (except the ones
      for which KASAN_SANITIZE := n is set).
      
      Both CONFIG_KASAN_GENERIC and CONFIG_KASAN_SW_TAGS support both
      CONFIG_KASAN_INLINE and CONFIG_KASAN_OUTLINE instrumentation modes.
      
      This commit also adds empty placeholder (for now) implementation of
      tag-based KASAN specific hooks inserted by the compiler and adjusts
      common hooks implementation.
      
      While this commit adds the CONFIG_KASAN_SW_TAGS config option, this option
      is not selectable, as it depends on HAVE_ARCH_KASAN_SW_TAGS, which we will
      enable once all the infrastracture code has been added.
      
      Link: http://lkml.kernel.org/r/b2550106eb8a68b10fefbabce820910b115aa853.1544099024.git.andreyknvl@google.com
      
      Signed-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Reviewed-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2bd926b4
    • Andrey Konovalov's avatar
      kasan, mm: change hooks signatures · 0116523c
      Andrey Konovalov authored
      Patch series "kasan: add software tag-based mode for arm64", v13.
      
      This patchset adds a new software tag-based mode to KASAN [1].  (Initially
      this mode was called KHWASAN, but it got renamed, see the naming rationale
      at the end of this section).
      
      The plan is to implement HWASan [2] for the kernel with the incentive,
      that it's going to have comparable to KASAN performance, but in the same
      time consume much less memory, trading that off for somewhat imprecise bug
      detection and being supported only for arm64.
      
      The underlying ideas of the approach used by software tag-based KASAN are:
      
      1. By using the Top Byte Ignore (TBI) arm64 CPU feature, we can store
         pointer tags in the top byte of each kernel pointer.
      
      2. Using shadow memory, we can store memory tags for each chunk of kernel
         memory.
      
      3. On each memory allocation, we can generate a random tag, embed it into
         the returned pointer and set the memory tags that correspond to this
         chunk of memory to the same value.
      
      4. By using compiler instrumentation, before each memory access we can add
         a check that the pointer tag matches the tag of the memory that is being
         accessed.
      
      5. On a tag mismatch we report an error.
      
      With this patchset the existing KASAN mode gets renamed to generic KASAN,
      with the word "generic" meaning that the implementation can be supported
      by any architecture as it is purely software.
      
      The new mode this patchset adds is called software tag-based KASAN.  The
      word "tag-based" refers to the fact that this mode uses tags embedded into
      the top byte of kernel pointers and the TBI arm64 CPU feature that allows
      to dereference such pointers.  The word "software" here means that shadow
      memory manipulation and tag checking on pointer dereference is done in
      software.  As it is the only tag-based implementation right now, "software
      tag-based" KASAN is sometimes referred to as simply "tag-based" in this
      patchset.
      
      A potential expansion of this mode is a hardware tag-based mode, which
      would use hardware memory tagging support (announced by Arm [3]) instead
      of compiler instrumentation and manual shadow memory manipulation.
      
      Same as generic KASAN, software tag-based KASAN is strictly a debugging
      feature.
      
      [1] https://www.kernel.org/doc/html/latest/dev-tools/kasan.html
      
      [2] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
      
      [3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a
      
      ====== Rationale
      
      On mobile devices generic KASAN's memory usage is significant problem.
      One of the main reasons to have tag-based KASAN is to be able to perform a
      similar set of checks as the generic one does, but with lower memory
      requirements.
      
      Comment from Vishwath Mohan <vishwath@google.com>:
      
      I don't have data on-hand, but anecdotally both ASAN and KASAN have proven
      problematic to enable for environments that don't tolerate the increased
      memory pressure well.  This includes
      
      (a) Low-memory form factors - Wear, TV, Things, lower-tier phones like Go,
      (c) Connected components like Pixel's visual core [1].
      
      These are both places I'd love to have a low(er) memory footprint option at
      my disposal.
      
      Comment from Evgenii Stepanov <eugenis@google.com>:
      
      Looking at a live Android device under load, slab (according to
      /proc/meminfo) + kernel stack take 8-10% available RAM (~350MB).  KASAN's
      overhead of 2x - 3x on top of it is not insignificant.
      
      Not having this overhead enables near-production use - ex.  running
      KASAN/KHWASAN kernel on a personal, daily-use device to catch bugs that do
      not reproduce in test configuration.  These are the ones that often cost
      the most engineering time to track down.
      
      CPU overhead is bad, but generally tolerable.  RAM is critical, in our
      experience.  Once it gets low enough, OOM-killer makes your life
      miserable.
      
      [1] https://www.blog.google/products/pixel/pixel-visual-core-image-processing-and-machine-learning-pixel-2/
      
      ====== Technical details
      
      Software tag-based KASAN mode is implemented in a very similar way to the
      generic one. This patchset essentially does the following:
      
      1. TCR_TBI1 is set to enable Top Byte Ignore.
      
      2. Shadow memory is used (with a different scale, 1:16, so each shadow
         byte corresponds to 16 bytes of kernel memory) to store memory tags.
      
      3. All slab objects are aligned to shadow scale, which is 16 bytes.
      
      4. All pointers returned from the slab allocator are tagged with a random
         tag and the corresponding shadow memory is poisoned with the same value.
      
      5. Compiler instrumentation is used to insert tag checks. Either by
         calling callbacks or by inlining them (CONFIG_KASAN_OUTLINE and
         CONFIG_KASAN_INLINE flags are reused).
      
      6. When a tag mismatch is detected in callback instrumentation mode
         KASAN simply prints a bug report. In case of inline instrumentation,
         clang inserts a brk instruction, and KASAN has it's own brk handler,
         which reports the bug.
      
      7. The memory in between slab objects is marked with a reserved tag, and
         acts as a redzone.
      
      8. When a slab object is freed it's marked with a reserved tag.
      
      Bug detection is imprecise for two reasons:
      
      1. We won't catch some small out-of-bounds accesses, that fall into the
         same shadow cell, as the last byte of a slab object.
      
      2. We only have 1 byte to store tags, which means we have a 1/256
         probability of a tag match for an incorrect access (actually even
         slightly less due to reserved tag values).
      
      Despite that there's a particular type of bugs that tag-based KASAN can
      detect compared to generic KASAN: use-after-free after the object has been
      allocated by someone else.
      
      ====== Testing
      
      Some kernel developers voiced a concern that changing the top byte of
      kernel pointers may lead to subtle bugs that are difficult to discover.
      To address this concern deliberate testing has been performed.
      
      It doesn't seem feasible to do some kind of static checking to find
      potential issues with pointer tagging, so a dynamic approach was taken.
      All pointer comparisons/subtractions have been instrumented in an LLVM
      compiler pass and a kernel module that would print a bug report whenever
      two pointers with different tags are being compared/subtracted (ignoring
      comparisons with NULL pointers and with pointers obtained by casting an
      error code to a pointer type) has been used.  Then the kernel has been
      booted in QEMU and on an Odroid C2 board and syzkaller has been run.
      
      This yielded the following results.
      
      The two places that look interesting are:
      
      is_vmalloc_addr in include/linux/mm.h
      is_kernel_rodata in mm/util.c
      
      Here we compare a pointer with some fixed untagged values to make sure
      that the pointer lies in a particular part of the kernel address space.
      Since tag-based KASAN doesn't add tags to pointers that belong to rodata
      or vmalloc regions, this should work as is.  To make sure debug checks to
      those two functions that check that the result doesn't change whether we
      operate on pointers with or without untagging has been added.
      
      A few other cases that don't look that interesting:
      
      Comparing pointers to achieve unique sorting order of pointee objects
      (e.g. sorting locks addresses before performing a double lock):
      
      tty_ldisc_lock_pair_timeout in drivers/tty/tty_ldisc.c
      pipe_double_lock in fs/pipe.c
      unix_state_double_lock in net/unix/af_unix.c
      lock_two_nondirectories in fs/inode.c
      mutex_lock_double in kernel/events/core.c
      
      ep_cmp_ffd in fs/eventpoll.c
      fsnotify_compare_groups fs/notify/mark.c
      
      Nothing needs to be done here, since the tags embedded into pointers
      don't change, so the sorting order would still be unique.
      
      Checks that a pointer belongs to some particular allocation:
      
      is_sibling_entry in lib/radix-tree.c
      object_is_on_stack in include/linux/sched/task_stack.h
      
      Nothing needs to be done here either, since two pointers can only belong
      to the same allocation if they have the same tag.
      
      Overall, since the kernel boots and works, there are no critical bugs.
      As for the rest, the traditional kernel testing way (use until fails) is
      the only one that looks feasible.
      
      Another point here is that tag-based KASAN is available under a separate
      config option that needs to be deliberately enabled. Even though it might
      be used in a "near-production" environment to find bugs that are not found
      during fuzzing or running tests, it is still a debug tool.
      
      ====== Benchmarks
      
      The following numbers were collected on Odroid C2 board. Both generic and
      tag-based KASAN were used in inline instrumentation mode.
      
      Boot time [1]:
      * ~1.7 sec for clean kernel
      * ~5.0 sec for generic KASAN
      * ~5.0 sec for tag-based KASAN
      
      Network performance [2]:
      * 8.33 Gbits/sec for clean kernel
      * 3.17 Gbits/sec for generic KASAN
      * 2.85 Gbits/sec for tag-based KASAN
      
      Slab memory usage after boot [3]:
      * ~40 kb for clean kernel
      * ~105 kb (~260% overhead) for generic KASAN
      * ~47 kb (~20% overhead) for tag-based KASAN
      
      KASAN memory overhead consists of three main parts:
      1. Increased slab memory usage due to redzones.
      2. Shadow memory (the whole reserved once during boot).
      3. Quaratine (grows gradually until some preset limit; the more the limit,
         the more the chance to detect a use-after-free).
      
      Comparing tag-based vs generic KASAN for each of these points:
      1. 20% vs 260% overhead.
      2. 1/16th vs 1/8th of physical memory.
      3. Tag-based KASAN doesn't require quarantine.
      
      [1] Time before the ext4 driver is initialized.
      [2] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`.
      [3] Measured as `cat /proc/meminfo | grep Slab`.
      
      ====== Some notes
      
      A few notes:
      
      1. The patchset can be found here:
         https://github.com/xairy/kasan-prototype/tree/khwasan
      
      2. Building requires a recent Clang version (7.0.0 or later).
      
      3. Stack instrumentation is not supported yet and will be added later.
      
      This patch (of 25):
      
      Tag-based KASAN changes the value of the top byte of pointers returned
      from the kernel allocation functions (such as kmalloc).  This patch
      updates KASAN hooks signatures and their usage in SLAB and SLUB code to
      reflect that.
      
      Link: http://lkml.kernel.org/r/aec2b5e3973781ff8a6bb6760f8543643202c451.1544099024.git.andreyknvl@google.com
      
      Signed-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Reviewed-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0116523c
  2. 24 Dec, 2018 1 commit
    • Peter Oskolkov's avatar
      net: dccp: fix kernel crash on module load · c92c81df
      Peter Oskolkov authored
      Patch eedbbb0d "net: dccp: initialize (addr,port) ..."
      added calling to inet_hashinfo2_init() from dccp_init().
      
      However, inet_hashinfo2_init() is marked as __init(), and
      thus the kernel panics when dccp is loaded as module. Removing
      __init() tag from inet_hashinfo2_init() is not feasible because
      it calls into __init functions in mm.
      
      This patch adds inet_hashinfo2_init_mod() function that can
      be called after the init phase is done; changes dccp_init() to
      call the new function; un-marks inet_hashinfo2_init() as
      exported.
      
      Fixes: eedbbb0d
      
       ("net: dccp: initialize (addr,port) ...")
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Signed-off-by: default avatarPeter Oskolkov <posk@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c92c81df
  3. 23 Dec, 2018 3 commits
  4. 22 Dec, 2018 1 commit
  5. 21 Dec, 2018 7 commits
    • Paolo Abeni's avatar
      net: drop the unused helper skb_ext_get() · d312d0a6
      Paolo Abeni authored
      
      
      Such helper is currently unused, and skb extension users are
      better off using skb_ext_add()/skb_ext_del(). So let's drop
      it.
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Acked-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d312d0a6
    • Sean Christopherson's avatar
      Revert "compiler-gcc: disable -ftracer for __noclone functions" · 2bcbd406
      Sean Christopherson authored
      The -ftracer optimization was disabled in __noclone as a workaround to
      GCC duplicating a blob of inline assembly that happened to define a
      global variable.  It has been pointed out that no amount of workarounds
      can guarantee the compiler won't duplicate inline assembly[1], and that
      disabling the -ftracer optimization has several unintended and nasty
      side effects[2][3].
      
      Now that the offending KVM code which required the workaround has
      been properly fixed and no longer uses __noclone, remove the -ftracer
      optimization tweak from __noclone.
      
      [1] https://lore.kernel.org/lkml/ri6y38lo23g.fsf@suse.cz/T/#u
      [2] https://lore.kernel.org/lkml/20181218140105.ajuiglkpvstt3qxs@treble/T/#u
      [3] https://patchwork.kernel.org/patch/8707981/#21817015
      
      This reverts commit 95272c29
      
      .
      Suggested-by: default avatarAndi Kleen <ak@linux.intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Martin Jambor <mjambor@suse.cz>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: default avatarAndi Kleen <ak@linux.intel.com>
      Reviewed-by: default avatarMiguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      2bcbd406
    • Jim Mattson's avatar
      kvm: Change offset in kvm_write_guest_offset_cached to unsigned · 7a86dab8
      Jim Mattson authored
      Since the offset is added directly to the hva from the
      gfn_to_hva_cache, a negative offset could result in an out of bounds
      write. The existing BUG_ON only checks for addresses beyond the end of
      the gfn_to_hva_cache, not for addresses before the start of the
      gfn_to_hva_cache.
      
      Note that all current call sites have non-negative offsets.
      
      Fixes: 4ec6e863
      
       ("kvm: Introduce kvm_write_guest_offset_cached()")
      Reported-by: default avatarCfir Cohen <cfir@google.com>
      Signed-off-by: default avatarJim Mattson <jmattson@google.com>
      Reviewed-by: default avatarCfir Cohen <cfir@google.com>
      Reviewed-by: default avatarPeter Shier <pshier@google.com>
      Reviewed-by: default avatarKrish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      7a86dab8
    • Tariq Toukan's avatar
      net/mlx5e: XDP, Support Enhanced Multi-Packet TX WQE · 5e0d2eef
      Tariq Toukan authored
      
      
      Add support for the HW feature of multi-packet WQE in XDP
      xmit flow.
      
      The conventional TX descriptor (WQE, Work Queue Element) serves
      a single packet. Our HW has support for multi-packet WQE (MPWQE)
      in which a single descriptor serves multiple TX packets.
      
      This reduces both the PCI overhead and the CPU cycles wasted on
      writing them.
      
      In this patch we add support for the HW feature, which is supported
      starting from ConnectX-5.
      
      Performance:
      Tested packet rate for UDP 64Byte multi-stream over ConnectX-5 NICs.
      CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      
      XDP_TX:
      We see a huge gain on single port ConnectX-5, and reach the 100 Mpps
      milestone.
      * Single-port HCA:
      	Before:   70 Mpps
      	After:   100 Mpps (+42.8%)
      
      * Dual-port HCA:
      	Before: 51.7 Mpps
      	After:  57.3 Mpps (+10.8%)
      
      * In both cases we tested traffic on one port and for now On Dual-port HCAs
        we see only small gain, we are working to overcome this bottleneck, but
        for the moment only with experimental firmware on dual port HCAs we can
        reach the wanted numbers as seen on Single-port HCAs.
      
      XDP_REDIRECT:
      Redirect from (A) ConnectX-5 to (B) ConnectX-5.
      Due to a setup limitation, (A) and (B) are on different NUMA nodes,
      so absolute performance numbers are not optimal.
      Note:
        Below is the transmit rate of (B), not the redirect rate of (A)
        which is in some cases higher.
      
      * (B) is single-port:
      	Before:   77 Mpps
      	After:    90 Mpps (+16.8%)
      
      * (B) is dual-port:
      	Before:  61 Mpps
      	After:   72 Mpps (+18%)
      Signed-off-by: default avatarTariq Toukan <tariqt@mellanox.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      5e0d2eef
    • Alexey Kardashevskiy's avatar
      vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver · 7f928917
      Alexey Kardashevskiy authored
      
      
      POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
      pluggable PCIe devices but still have PCIe links which are used
      for config space and MMIO. In addition to that the GPUs have 6 NVLinks
      which are connected to other GPUs and the POWER9 CPU. POWER9 chips
      have a special unit on a die called an NPU which is an NVLink2 host bus
      adapter with p2p connections to 2 to 3 GPUs, 3 or 2 NVLinks to each.
      These systems also support ATS (address translation services) which is
      a part of the NVLink2 protocol. Such GPUs also share on-board RAM
      (16GB or 32GB) to the system via the same NVLink2 so a CPU has
      cache-coherent access to a GPU RAM.
      
      This exports GPU RAM to the userspace as a new VFIO device region. This
      preregisters the new memory as device memory as it might be used for DMA.
      This inserts pfns from the fault handler as the GPU memory is not onlined
      until the vendor driver is loaded and trained the NVLinks so doing this
      earlier causes low level errors which we fence in the firmware so
      it does not hurt the host system but still better be avoided; for the same
      reason this does not map GPU RAM into the host kernel (usual thing for
      emulated access otherwise).
      
      This exports an ATSD (Address Translation Shootdown) register of NPU which
      allows TLB invalidations inside GPU for an operating system. The register
      conveniently occupies a single 64k page. It is also presented to
      the userspace as a new VFIO device region. One NPU has 8 ATSD registers,
      each of them can be used for TLB invalidation in a GPU linked to this NPU.
      This allocates one ATSD register per an NVLink bridge allowing passing
      up to 6 registers. Due to the host firmware bug (just recently fixed),
      only 1 ATSD register per NPU was actually advertised to the host system
      so this passes that alone register via the first NVLink bridge device in
      the group which is still enough as QEMU collects them all back and
      presents to the guest via vPHB to mimic the emulated NPU PHB on the host.
      
      In order to provide the userspace with the information about GPU-to-NVLink
      connections, this exports an additional capability called "tgt"
      (which is an abbreviated host system bus address). The "tgt" property
      tells the GPU its own system address and allows the guest driver to
      conglomerate the routing information so each GPU knows how to get directly
      to the other GPUs.
      
      For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
      know LPID (a logical partition ID or a KVM guest hardware ID in other
      words) and PID (a memory context ID of a userspace process, not to be
      confused with a linux pid). This assigns a GPU to LPID in the NPU and
      this is why this adds a listener for KVM on an IOMMU group. A PID comes
      via NVLink from a GPU and NPU uses a PID wildcard to pass it through.
      
      This requires coherent memory and ATSD to be available on the host as
      the GPU vendor only supports configurations with both features enabled
      and other configurations are known not to work. Because of this and
      because of the ways the features are advertised to the host system
      (which is a device tree with very platform specific properties),
      this requires enabled POWERNV platform.
      
      The V100 GPUs do not advertise any of these capabilities via the config
      space and there are more than just one device ID so this relies on
      the platform to tell whether these GPUs have special abilities such as
      NVLinks.
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: default avatarAlex Williamson <alex.williamson@redhat.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      7f928917
    • Peter Oskolkov's avatar
      net: seg6.h: remove an unused #include · a6ae520d
      Peter Oskolkov authored
      
      
      A minor code cleanup.
      Signed-off-by: default avatarPeter Oskolkov <posk@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a6ae520d
    • Stephen Hemminger's avatar
      linux/netlink.h: drop unnecessary extern prefix · aa9d6e0f
      Stephen Hemminger authored
      
      
      Don't need extern prefix before function prototypes.
      Checkpatch has complained about this for a couple of years.
      Signed-off-by: default avatarStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      aa9d6e0f
  6. 20 Dec, 2018 10 commits
    • Florian Westphal's avatar
      netfilter: netns: shrink netns_ct struct · 8527f9df
      Florian Westphal authored
      
      
      remove the obsolete sysctl anchors and move auto_assign_helper_warned
      to avoid/cover a hole.  Reduces size by 40 bytes on 64 bit.
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      8527f9df
    • Florian Westphal's avatar
      netfilter: conntrack: remove empty pernet fini stubs · fc3893fd
      Florian Westphal authored
      
      
      after moving sysctl handling into single place, the init functions
      can't fail anymore and some of the fini functions are empty.
      
      Remove them and change return type to void.
      This also simplifies error unwinding in conntrack module init path.
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      fc3893fd
    • Florian Westphal's avatar
      netfilter: conntrack: un-export seq_print_acct · 4b216e21
      Florian Westphal authored
      
      
      Only one caller, just place it where its needed.
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      4b216e21
    • Florian Westphal's avatar
      netfilter: conntrack: udp: only extend timeout to stream mode after 2s · d535c8a6
      Florian Westphal authored
      
      
      Currently DNS resolvers that send both A and AAAA queries from same source port
      can trigger stream mode prematurely, which results in non-early-evictable conntrack entry
      for three minutes, even though DNS requests are done in a few milliseconds.
      
      Add a two second grace period where we continue to use the ordinary
      30-second default timeout.  Its enough for DNS request/response traffic,
      even if two request/reply packets are involved.
      
      ASSURED is still set, else conntrack (and thus a possible
      NAT mapping ...) gets zapped too in case conntrack table runs full.
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      d535c8a6
    • John Fastabend's avatar
      bpf: sk_msg, sock{map|hash} redirect through ULP · 0608c69c
      John Fastabend authored
      A sockmap program that redirects through a kTLS ULP enabled socket
      will not work correctly because the ULP layer is skipped. This
      fixes the behavior to call through the ULP layer on redirect to
      ensure any operations required on the data stream at the ULP layer
      continue to be applied.
      
      To do this we add an internal flag MSG_SENDPAGE_NOPOLICY to avoid
      calling the BPF layer on a redirected message. This is
      required to avoid calling the BPF layer multiple times (possibly
      recursively) which is not the current/expected behavior without
      ULPs. In the future we may add a redirect flag if users _do_
      want the policy applied again but this would need to work for both
      ULP and non-ULP sockets and be opt-in to avoid breaking existing
      programs.
      
      Also to avoid polluting the flag space with an internal flag we
      reuse the flag space overlapping MSG_SENDPAGE_NOPOLICY with
      MSG_WAITFORONE. Here WAITFORONE is specific to recv path and
      SENDPAGE_NOPOLICY is only used for sendpage hooks. The last thing
      to verify is user space API is masked correctly to ensure the flag
      can not be set by user. (Note this needs to be true regardless
      because we have internal flags already in-use that user space
      should not be able to set). But for completeness we have two UAPI
      paths into sendpage, sendfile and splice.
      
      In the sendfile case the function do_sendfile() zero's flags,
      
      ./fs/read_write.c:
       static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
      		   	    size_t count, loff_t max)
       {
         ...
         fl = 0;
      #if 0
         /*
          * We need to debate whether we can enable this or not. The
          * man page documents EAGAIN return for the output at least,
          * and the application is arguably buggy if it doesn't expect
          * EAGAIN on a non-blocking file descriptor.
          */
          if (in.file->f_flags & O_NONBLOCK)
      	fl = SPLICE_F_NONBLOCK;
      #endif
          file_start_write(out.file);
          retval = do_splice_direct(in.file, &pos, out.file, &out_pos, count, fl);
       }
      
      In the splice case the pipe_to_sendpage "actor" is used which
      masks flags with SPLICE_F_MORE.
      
      ./fs/splice.c:
       static int pipe_to_sendpage(struct pipe_inode_info *pipe,
      			    struct pipe_buffer *buf, struct splice_desc *sd)
       {
         ...
         more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
         ...
       }
      
      Confirming what we expect that internal flags  are in fact internal
      to socket side.
      
      Fixes: d3b18ad3
      
       ("tls: add bpf support to sk_msg handling")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      0608c69c
    • John Fastabend's avatar
      bpf: sk_msg, fix socket data_ready events · 552de910
      John Fastabend authored
      When a skb verdict program is in-use and either another BPF program
      redirects to that socket or the new SK_PASS support is used the
      data_ready callback does not wake up application. Instead because
      the stream parser/verdict is using the sk data_ready callback we wake
      up the stream parser/verdict block.
      
      Fix this by adding a helper to check if the stream parser block is
      enabled on the sk and if so call the saved pointer which is the
      upper layers wake up function.
      
      This fixes application stalls observed when an application is waiting
      for data in a blocking read().
      
      Fixes: d829e9c4
      
       ("tls: convert to generic sk_msg interface")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      552de910
    • John Fastabend's avatar
      bpf: skmsg, replace comments with BUILD bug · 7a69c0f2
      John Fastabend authored
      
      
      Enforce comment on structure layout dependency with a BUILD_BUG_ON
      to ensure the condition is maintained.
      Suggested-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      7a69c0f2
    • Christoph Hellwig's avatar
      powerpc: use mm zones more sensibly · 25078dc1
      Christoph Hellwig authored
      
      
      Powerpc has somewhat odd usage where ZONE_DMA is used for all memory on
      common 64-bit configfs, and ZONE_DMA32 is used for 31-bit schemes.
      
      Move to a scheme closer to what other architectures use (and I dare to
      say the intent of the system):
      
       - ZONE_DMA: optionally for memory < 31-bit (64-bit embedded only)
       - ZONE_NORMAL: everything addressable by the kernel
       - ZONE_HIGHMEM: memory > 32-bit for 32-bit kernels
      
      Also provide information on how ZONE_DMA is used by defining
      ARCH_ZONE_DMA_BITS.
      
      Contains various fixes from Benjamin Herrenschmidt.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      25078dc1
    • Sinan Kaya's avatar
      PCI/ACPI: Allow ACPI to be built without CONFIG_PCI set · 5d32a665
      Sinan Kaya authored
      
      
      We are compiling PCI code today for systems with ACPI and no PCI
      device present. Remove the useless code and reduce the tight
      dependency.
      Signed-off-by: default avatarSinan Kaya <okaya@kernel.org>
      Acked-by: Bjorn Helgaas <bhelgaas@google.com> # PCI parts
      Acked-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      5d32a665
    • Sinan Kaya's avatar
      ACPICA: Remove PCI bits from ACPICA when CONFIG_PCI is unset · bd23fac3
      Sinan Kaya authored
      
      
      Allow ACPI to be built without PCI support in place.
      Signed-off-by: default avatarSinan Kaya <okaya@kernel.org>
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      bd23fac3
  7. 19 Dec, 2018 10 commits
    • wenxu's avatar
      iptunnel: make TUNNEL_FLAGS available in uapi · 1875a9ab
      wenxu authored
      
      
      ip l add dev tun type gretap external
      ip r a 10.0.0.1 encap ip dst 192.168.152.171 id 1000 dev gretap
      
      For gretap Key example when the command set the id but don't set the
      TUNNEL_KEY flags. There is no key field in the send packet
      
      In the lwtunnel situation, some TUNNEL_FLAGS should can be set by
      userspace
      Signed-off-by: default avatarwenxu <wenxu@ucloud.cn>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1875a9ab
    • Roopa Prabhu's avatar
      neighbour: register rtnl doit handler · 82cbb5c6
      Roopa Prabhu authored
      
      
      this patch registers neigh doit handler. The doit handler
      returns a neigh entry given dst and dev. This is similar
      to route and fdb doit (get) handlers. Also moves nda_policy
      declaration from rtnetlink.c to neighbour.c
      Signed-off-by: default avatarRoopa Prabhu <roopa@cumulusnetworks.com>
      Reviewed-by: default avatarDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      82cbb5c6
    • Florian Westphal's avatar
      net: switch secpath to use skb extension infrastructure · 4165079b
      Florian Westphal authored
      
      
      Remove skb->sp and allocate secpath storage via extension
      infrastructure.  This also reduces sk_buff by 8 bytes on x86_64.
      
      Total size of allyesconfig kernel is reduced slightly, as there is
      less inlined code (one conditional atomic op instead of two on
      skb_clone).
      
      No differences in throughput in following ipsec performance tests:
      - transport mode with aes on 10GB link
      - tunnel mode between two network namespaces with aes and null cipher
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4165079b
    • Florian Westphal's avatar
      xfrm: use secpath_exist where applicable · 26912e37
      Florian Westphal authored
      
      
      Will reduce noise when skb->sp is removed later in this series.
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      26912e37
    • Florian Westphal's avatar
      net: use skb_sec_path helper in more places · 2294be0f
      Florian Westphal authored
      
      
      skb_sec_path gains 'const' qualifier to avoid
      xt_policy.c: 'skb_sec_path' discards 'const' qualifier from pointer target type
      
      same reasoning as previous conversions: Won't need to touch these
      spots anymore when skb->sp is removed.
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2294be0f
    • Florian Westphal's avatar
      net: move secpath_exist helper to sk_buff.h · 7af8f4ca
      Florian Westphal authored
      
      
      Future patch will remove skb->sp pointer.
      To reduce noise in those patches, move existing helper to
      sk_buff and use it in more places to ease skb->sp replacement later.
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7af8f4ca
    • Florian Westphal's avatar
      xfrm: change secpath_set to return secpath struct, not error value · 0ca64da1
      Florian Westphal authored
      
      
      It can only return 0 (success) or -ENOMEM.
      Change return value to a pointer to secpath struct.
      
      This avoids direct access to skb->sp:
      
      err = secpath_set(skb);
      if (!err) ..
      skb->sp-> ...
      
      Becomes:
      sp = secpath_set(skb)
      if (!sp) ..
      sp-> ..
      
      This reduces noise in followup patch which is going to remove skb->sp.
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0ca64da1
    • Florian Westphal's avatar
      net: convert bridge_nf to use skb extension infrastructure · de8bda1d
      Florian Westphal authored
      
      
      This converts the bridge netfilter (calling iptables hooks from bridge)
      facility to use the extension infrastructure.
      
      The bridge_nf specific hooks in skb clone and free paths are removed, they
      have been replaced by the skb_ext hooks that do the same as the bridge nf
      allocations hooks did.
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      de8bda1d
    • Florian Westphal's avatar
      sk_buff: add skb extension infrastructure · df5042f4
      Florian Westphal authored
      
      
      This adds an optional extension infrastructure, with ispec (xfrm) and
      bridge netfilter as first users.
      objdiff shows no changes if kernel is built without xfrm and br_netfilter
      support.
      
      The third (planned future) user is Multipath TCP which is still
      out-of-tree.
      MPTCP needs to map logical mptcp sequence numbers to the tcp sequence
      numbers used by individual subflows.
      
      This DSS mapping is read/written from tcp option space on receive and
      written to tcp option space on transmitted tcp packets that are part of
      and MPTCP connection.
      
      Extending skb_shared_info or adding a private data field to skb fclones
      doesn't work for incoming skb, so a different DSS propagation method would
      be required for the receive side.
      
      mptcp has same requirements as secpath/bridge netfilter:
      
      1. extension memory is released when the sk_buff is free'd.
      2. data is shared after cloning an skb (clone inherits extension)
      3. adding extension to an skb will COW the extension buffer if needed.
      
      The "MPTCP upstreaming" effort adds SKB_EXT_MPTCP extension to store the
      mapping for tx and rx processing.
      
      Two new members are added to sk_buff:
      1. 'active_extensions' byte (filling a hole), telling which extensions
         are available for this skb.
         This has two purposes.
         a) avoids the need to initialize the pointer.
         b) allows to "delete" an extension by clearing its bit
         value in ->active_extensions.
      
         While it would be possible to store the active_extensions byte
         in the extension struct instead of sk_buff, there is one problem
         with this:
          When an extension has to be disabled, we can always clear the
          bit in skb->active_extensions.  But in case it would be stored in the
          extension buffer itself, we might have to COW it first, if
          we are dealing with a cloned skb.  On kmalloc failure we would
          be unable to turn an extension off.
      
      2. extension pointer, located at the end of the sk_buff.
         If the active_extensions byte is 0, the pointer is undefined,
         it is not initialized on skb allocation.
      
      This adds extra code to skb clone and free paths (to deal with
      refcount/free of extension area) but this replaces similar code that
      manages skb->nf_bridge and skb->sp structs in the followup patches of
      the series.
      
      It is possible to add support for extensions that are not preseved on
      clones/copies.
      
      To do this, it would be needed to define a bitmask of all extensions that
      need copy/cow semantics, and change __skb_ext_copy() to check
      ->active_extensions & SKB_EXT_PRESERVE_ON_CLONE, then just set
      ->active_extensions to 0 on the new clone.
      
      This isn't done here because all extensions that get added here
      need the copy/cow semantics.
      
      v2:
      Allocate entire extension space using kmem_cache.
      Upside is that this allows better tracking of used memory,
      downside is that we will allocate more space than strictly needed in
      most cases (its unlikely that all extensions are active/needed at same
      time for same skb).
      The allocated memory (except the small extension header) is not cleared,
      so no additonal overhead aside from memory usage.
      
      Avoid atomic_dec_and_test operation on skb_ext_put()
      by using similar trick as kfree_skbmem() does with fclone_ref:
      If recount is 1, there is no concurrent user and we can free right away.
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      df5042f4
    • Florian Westphal's avatar
      netfilter: avoid using skb->nf_bridge directly · c4b0e771
      Florian Westphal authored
      
      
      This pointer is going to be removed soon, so use the existing helpers in
      more places to avoid noise when the removal happens.
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c4b0e771