1. 28 Dec, 2018 8 commits
    • kasan, mm, arm64: tag non slab memory allocated via pagealloc · 2813b9c0
      Andrey Konovalov authored
      Tag-based KASAN doesn't check memory accesses through pointers tagged with
      0xff.  When page_address is used to get a pointer to memory that corresponds
      to some page, the tag of the resulting pointer gets set to 0xff, even
      though the allocated memory might have been tagged differently.
      
      For slab pages it's impossible to recover the correct tag to return from
      page_address, since the page might contain multiple slab objects tagged
      with different values, and we can't know in advance which one of them is
      going to get accessed.  For non slab pages however, we can recover the tag
      in page_address, since the whole page was marked with the same tag.
      
      This patch adds tagging to non slab memory allocated with pagealloc.  To
      set the tag of the pointer returned from page_address, the tag gets stored
      to page->flags when the memory gets allocated.
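
      As an illustration only, here is a minimal userspace sketch of the idea: an
      8-bit tag kept in the high byte of a per-page flags word is re-applied to
      the pointer returned by a page_address()-like helper.  The helper names echo
      the kernel ones, but the field layout and code are assumptions, not the
      actual implementation.

      #include <stdint.h>
      #include <stdio.h>

      #define TAG_SHIFT 56
      #define TAG_MASK  (0xffULL << TAG_SHIFT)

      struct fake_page {
              uint64_t flags;   /* tag stored in bits 63:56, as an example */
              void *virt;       /* untagged virtual address of the page */
      };

      static void page_kasan_tag_set(struct fake_page *p, uint8_t tag)
      {
              p->flags = (p->flags & ~TAG_MASK) | ((uint64_t)tag << TAG_SHIFT);
      }

      static uint8_t page_kasan_tag(const struct fake_page *p)
      {
              return p->flags >> TAG_SHIFT;
      }

      static void *page_address(const struct fake_page *p)
      {
              /* re-insert the stored tag into the top byte of the pointer */
              uintptr_t addr = (uintptr_t)p->virt & ~TAG_MASK;

              return (void *)(addr | ((uintptr_t)page_kasan_tag(p) << TAG_SHIFT));
      }

      int main(void)
      {
              static char backing[64];
              struct fake_page p = { .virt = backing };

              page_kasan_tag_set(&p, 0xab);
              printf("tagged address: %p\n", page_address(&p));
              return 0;
      }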
      
      Link: http://lkml.kernel.org/r/d758ddcef46a5abc9970182b9137e2fbee202a2c.1544099024.git.andreyknvl@google.com
      
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2813b9c0
    • kasan, arm64: add brk handler for inline instrumentation · 41eea9cd
      Andrey Konovalov authored
      Tag-based KASAN inline instrumentation mode (which embeds checks of shadow
      memory into the generated code, instead of inserting a callback) generates
      a brk instruction when a tag mismatch is detected.
      
      This commit adds a tag-based KASAN specific brk handler that decodes the
      immediate value passed to the brk instruction (to extract information
      about the memory access that triggered the mismatch), reads the register
      values (x0 contains the guilty address) and reports the bug.
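
      A sketch of the decode step only, not the arm64 handler itself; the bit
      layout below (a recover bit, a write bit and a log2 access size in the low
      nibble of the immediate) is an assumption used for illustration.

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      #define ESR_RECOVER   0x20   /* assumed layout of the brk immediate */
      #define ESR_WRITE     0x10
      #define ESR_SIZE_MASK 0x0f

      static void kasan_decode_brk(uint32_t imm, uint64_t addr_in_x0)
      {
              bool recover = imm & ESR_RECOVER;
              bool write = imm & ESR_WRITE;
              unsigned long size = 1UL << (imm & ESR_SIZE_MASK);

              printf("KASAN: %s of size %lu at 0x%llx (%srecoverable)\n",
                     write ? "write" : "read", size,
                     (unsigned long long)addr_in_x0, recover ? "" : "not ");
      }

      int main(void)
      {
              /* e.g. an immediate reporting a recoverable 8-byte write */
              kasan_decode_brk(ESR_RECOVER | ESR_WRITE | 3, 0xffff000012345678ULL);
              return 0;
      }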
      
      Link: http://lkml.kernel.org/r/c91fe7684070e34dc34b419e6b69498f4dcacc2d.1544099024.git.andreyknvl@google.com
      
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      41eea9cd
    • mm: move obj_to_index to include/linux/slab_def.h · 5b7c4148
      Andrey Konovalov authored
      While with SLUB we can actually preassign tags for caches with constructors
      and store them in pointers in the freelist, SLAB doesn't allow that since
      the freelist is stored as an array of indexes, so there are no pointers to
      store the tags.
      
      Instead we compute the tag twice, once when a slab is created before
      calling the constructor and then again each time an object is allocated
      with kmalloc.  The tag is computed simply by taking the lowest byte of the
      index that corresponds to the object.  However, in kasan_kmalloc we only
      have access to the object's pointer, so we need a way to find out which
      index this object corresponds to.
      
      This patch moves obj_to_index from slab.c to include/linux/slab_def.h to
      be reused by KASAN.
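
      A userspace sketch of the arithmetic: the object's index within its slab
      page is recovered from its address, and the tag is simply the low byte of
      that index.  The plain division stands in for whatever optimized divide the
      real obj_to_index uses; names and layout here are illustrative.

      #include <stddef.h>
      #include <stdint.h>
      #include <stdio.h>

      static unsigned int obj_to_index(uintptr_t page_base, size_t obj_size,
                                       uintptr_t obj)
      {
              return (obj - page_base) / obj_size;
      }

      static uint8_t tag_for_obj(uintptr_t page_base, size_t obj_size,
                                 uintptr_t obj)
      {
              /* lowest byte of the index doubles as the tag */
              return obj_to_index(page_base, obj_size, obj) & 0xff;
      }

      int main(void)
      {
              uintptr_t base = 0x1000, obj = base + 5 * 64;

              printf("index %u, tag %#x\n",
                     obj_to_index(base, 64, obj),
                     tag_for_obj(base, 64, obj));
              return 0;
      }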
      
      Link: http://lkml.kernel.org/r/c02cd9e574cfd93858e43ac94b05e38f891fef64.1544099024.git.andreyknvl@google.com
      
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b7c4148
    • kasan: add tag related helper functions · 3c9e3aa1
      Andrey Konovalov authored
      This commit adds a few helper functions that are meant to be used to work
      with tags embedded in the top byte of kernel pointers: to set, to get or
      to reset the top byte.
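
      A userspace sketch of the three helpers, assuming a 64-bit pointer with the
      tag kept in bits 63:56; the names mirror the kernel helpers but the bodies
      below are illustrative only.

      #include <stdint.h>
      #include <stdio.h>

      #define KASAN_TAG_SHIFT  56
      #define KASAN_TAG_KERNEL 0xffu   /* native top byte of kernel pointers */

      static void *set_tag(const void *addr, uint8_t tag)
      {
              uintptr_t untagged =
                      (uintptr_t)addr & ~((uintptr_t)0xff << KASAN_TAG_SHIFT);

              return (void *)(untagged | ((uintptr_t)tag << KASAN_TAG_SHIFT));
      }

      static uint8_t get_tag(const void *addr)
      {
              return (uintptr_t)addr >> KASAN_TAG_SHIFT;
      }

      static void *reset_tag(const void *addr)
      {
              return set_tag(addr, KASAN_TAG_KERNEL);
      }

      int main(void)
      {
              void *p = set_tag((void *)0x12345678UL, 0x2a);

              printf("tag %#x, reset %p\n", get_tag(p), reset_tag(p));
              return 0;
      }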
      
      Link: http://lkml.kernel.org/r/f6c6437bb8e143bc44f42c3c259c62e734be7935.1544099024.git.andreyknvl@google.com
      
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c9e3aa1
    • kasan: initialize shadow to 0xff for tag-based mode · 080eb83f
      Andrey Konovalov authored
      A tag-based KASAN shadow memory cell contains a memory tag that corresponds
      to the tag in the top byte of the pointer that points to that memory.  The
      native top byte value of kernel pointers is 0xff, so with tag-based KASAN
      we need to initialize shadow memory to 0xff.
      
      [cai@lca.pw: arm64: skip kmemleak for KASAN again]
        Link: http://lkml.kernel.org/r/20181226020550.63712-1-cai@lca.pw
      Link: http://lkml.kernel.org/r/5cc1b789aad7c99cf4f3ec5b328b147ad53edb40.1544099024.git.andreyknvl@google.com
      
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      080eb83f
    • kasan: rename kasan_zero_page to kasan_early_shadow_page · 9577dd74
      Andrey Konovalov authored
      With tag-based KASAN mode the early shadow value is 0xff and not 0x00, so
      this patch renames kasan_zero_(page|pte|pmd|pud|p4d) to
      kasan_early_shadow_(page|pte|pmd|pud|p4d) to avoid confusion.
      
      Link: http://lkml.kernel.org/r/3fed313280ebf4f88645f5b89ccbc066d320e177.1544099024.git.andreyknvl@google.com
      
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Suggested-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9577dd74
    • kasan: add CONFIG_KASAN_GENERIC and CONFIG_KASAN_SW_TAGS · 2bd926b4
      Andrey Konovalov authored
      This commit splits the current CONFIG_KASAN config option into two:
      1. CONFIG_KASAN_GENERIC, that enables the generic KASAN mode (the one
         that exists now);
      2. CONFIG_KASAN_SW_TAGS, that enables the software tag-based KASAN mode.
      
      The name CONFIG_KASAN_SW_TAGS is chosen as in the future we will have
      another hardware tag-based KASAN mode, that will rely on hardware memory
      tagging support in arm64.
      
      With CONFIG_KASAN_SW_TAGS enabled, compiler options are changed to
      instrument kernel files with -fsanitize=kernel-hwaddress (except the ones
      for which KASAN_SANITIZE := n is set).
      
      Both CONFIG_KASAN_GENERIC and CONFIG_KASAN_SW_TAGS support both
      CONFIG_KASAN_INLINE and CONFIG_KASAN_OUTLINE instrumentation modes.
      
      This commit also adds empty placeholder (for now) implementation of
      tag-based KASAN specific hooks inserted by the compiler and adjusts
      common hooks implementation.
      
      While this commit adds the CONFIG_KASAN_SW_TAGS config option, this option
      is not selectable, as it depends on HAVE_ARCH_KASAN_SW_TAGS, which we will
      enable once all the infrastructure code has been added.
      
      Link: http://lkml.kernel.org/r/b2550106eb8a68b10fefbabce820910b115aa853.1544099024.git.andreyknvl@google.com
      
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2bd926b4
    • kasan, mm: change hooks signatures · 0116523c
      Andrey Konovalov authored
      Patch series "kasan: add software tag-based mode for arm64", v13.
      
      This patchset adds a new software tag-based mode to KASAN [1].  (Initially
      this mode was called KHWASAN, but it got renamed, see the naming rationale
      at the end of this section).
      
      The plan is to implement HWASan [2] for the kernel, with the incentive that
      it will have performance comparable to KASAN while consuming much less
      memory, trading that off for somewhat imprecise bug detection and for being
      supported only on arm64.
      
      The underlying ideas of the approach used by software tag-based KASAN are
      as follows (a compact sketch in code appears after the list):
      
      1. By using the Top Byte Ignore (TBI) arm64 CPU feature, we can store
         pointer tags in the top byte of each kernel pointer.
      
      2. Using shadow memory, we can store memory tags for each chunk of kernel
         memory.
      
      3. On each memory allocation, we can generate a random tag, embed it into
         the returned pointer and set the memory tags that correspond to this
         chunk of memory to the same value.
      
      4. By using compiler instrumentation, before each memory access we can add
         a check that the pointer tag matches the tag of the memory that is being
         accessed.
      
      5. On a tag mismatch we report an error.
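
      A compact userspace model of ideas 1-5 above, assuming a 64-bit address
      space; the shadow granularity, helper names and random source are
      simplified assumptions, not the kernel implementation.

      #include <stdint.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      #define TAG_SHIFT 56
      #define GRANULE   16            /* one shadow byte per 16 bytes of memory */

      static uint8_t shadow[4096];                    /* toy shadow region */
      static char heap[sizeof(shadow) * GRANULE];     /* toy heap it covers */

      static void *tag_alloc(size_t off, size_t size)
      {
              uint8_t tag = rand() & 0xff;                     /* idea 3 */
              memset(&shadow[off / GRANULE], tag,
                     (size + GRANULE - 1) / GRANULE);          /* idea 2 */
              return (void *)((uintptr_t)&heap[off] |
                              ((uintptr_t)tag << TAG_SHIFT));  /* idea 1 */
      }

      static void check_access(const void *p, size_t size)
      {
              uintptr_t untagged = (uintptr_t)p & ~((uintptr_t)0xff << TAG_SHIFT);
              size_t off = untagged - (uintptr_t)heap;
              uint8_t ptr_tag = (uintptr_t)p >> TAG_SHIFT;
              size_t i;

              for (i = 0; i < size; i += GRANULE)              /* idea 4 */
                      if (shadow[(off + i) / GRANULE] != ptr_tag)
                              fprintf(stderr, "tag mismatch at offset %zu\n",
                                      off + i);                /* idea 5 */
      }

      int main(void)
      {
              void *p = tag_alloc(0, 32);

              check_access(p, 32);  /* in bounds: tags match, nothing reported */
              check_access(p, 64);  /* granules 2-3 were never tagged: reported,
                                       unless the random tag happens to be 0 */
              return 0;
      }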
      
      With this patchset the existing KASAN mode gets renamed to generic KASAN,
      with the word "generic" meaning that the implementation can be supported
      by any architecture as it is purely software.
      
      The new mode this patchset adds is called software tag-based KASAN.  The
      word "tag-based" refers to the fact that this mode uses tags embedded into
      the top byte of kernel pointers and the TBI arm64 CPU feature, which allows
      such pointers to be dereferenced.  The word "software" here means that shadow
      memory manipulation and tag checking on pointer dereference is done in
      software.  As it is the only tag-based implementation right now, "software
      tag-based" KASAN is sometimes referred to as simply "tag-based" in this
      patchset.
      
      A potential expansion of this mode is a hardware tag-based mode, which
      would use hardware memory tagging support (announced by Arm [3]) instead
      of compiler instrumentation and manual shadow memory manipulation.
      
      Same as generic KASAN, software tag-based KASAN is strictly a debugging
      feature.
      
      [1] https://www.kernel.org/doc/html/latest/dev-tools/kasan.html
      
      [2] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
      
      [3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a
      
      ====== Rationale
      
      On mobile devices generic KASAN's memory usage is a significant problem.
      One of the main reasons to have tag-based KASAN is to be able to perform a
      similar set of checks as the generic one does, but with lower memory
      requirements.
      
      Comment from Vishwath Mohan <vishwath@google.com>:
      
      I don't have data on-hand, but anecdotally both ASAN and KASAN have proven
      problematic to enable for environments that don't tolerate the increased
      memory pressure well.  This includes
      
      (a) Low-memory form factors - Wear, TV, Things, lower-tier phones like Go,
      (b) Connected components like Pixel's visual core [1].
      
      These are both places I'd love to have a low(er) memory footprint option at
      my disposal.
      
      Comment from Evgenii Stepanov <eugenis@google.com>:
      
      Looking at a live Android device under load, slab (according to
      /proc/meminfo) + kernel stack take 8-10% available RAM (~350MB).  KASAN's
      overhead of 2x - 3x on top of it is not insignificant.
      
      Not having this overhead enables near-production use - ex.  running
      KASAN/KHWASAN kernel on a personal, daily-use device to catch bugs that do
      not reproduce in test configuration.  These are the ones that often cost
      the most engineering time to track down.
      
      CPU overhead is bad, but generally tolerable.  RAM is critical, in our
      experience.  Once it gets low enough, OOM-killer makes your life
      miserable.
      
      [1] https://www.blog.google/products/pixel/pixel-visual-core-image-processing-and-machine-learning-pixel-2/
      
      ====== Technical details
      
      Software tag-based KASAN mode is implemented in a very similar way to the
      generic one. This patchset essentially does the following:
      
      1. TCR_TBI1 is set to enable Top Byte Ignore.
      
       2. Shadow memory is used (with a different scale, 1:16, so each shadow
          byte corresponds to 16 bytes of kernel memory) to store memory tags;
          a sketch of this mapping follows the list.
      
      3. All slab objects are aligned to shadow scale, which is 16 bytes.
      
      4. All pointers returned from the slab allocator are tagged with a random
         tag and the corresponding shadow memory is poisoned with the same value.
      
      5. Compiler instrumentation is used to insert tag checks. Either by
         calling callbacks or by inlining them (CONFIG_KASAN_OUTLINE and
         CONFIG_KASAN_INLINE flags are reused).
      
       6. When a tag mismatch is detected in callback instrumentation mode
          KASAN simply prints a bug report. In case of inline instrumentation,
          clang inserts a brk instruction, and KASAN has its own brk handler,
          which reports the bug.
      
      7. The memory in between slab objects is marked with a reserved tag, and
         acts as a redzone.
      
      8. When a slab object is freed it's marked with a reserved tag.
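
      The address-to-shadow mapping implied by point 2 above, as a standalone
      sketch: the shift of 4 encodes the 1:16 scale (generic KASAN's 1:8 scale
      would be a shift of 3).  The shadow offset below is a made-up constant.

      #include <stdint.h>
      #include <stdio.h>

      #define SHADOW_SCALE_SHIFT 4              /* 1 shadow byte per 16 bytes */
      #define SHADOW_OFFSET      0x100000000UL  /* illustrative placement only */

      static uintptr_t mem_to_shadow(uintptr_t addr)
      {
              return (addr >> SHADOW_SCALE_SHIFT) + SHADOW_OFFSET;
      }

      int main(void)
      {
              uintptr_t a = 0x4000;

              /* 16 consecutive bytes map to the same shadow byte */
              printf("%#lx -> %#lx\n", (unsigned long)a,
                     (unsigned long)mem_to_shadow(a));
              printf("%#lx -> %#lx\n", (unsigned long)(a + 15),
                     (unsigned long)mem_to_shadow(a + 15));
              return 0;
      }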
      
      Bug detection is imprecise for two reasons:
      
       1. We won't catch some small out-of-bounds accesses that fall into the
          same shadow cell as the last byte of a slab object.
      
      2. We only have 1 byte to store tags, which means we have a 1/256
         probability of a tag match for an incorrect access (actually even
         slightly less due to reserved tag values).
      
      Despite that, there's a particular type of bug that tag-based KASAN can
      detect but generic KASAN cannot: a use-after-free after the object has been
      allocated by someone else.
      
      ====== Testing
      
      Some kernel developers voiced a concern that changing the top byte of
      kernel pointers may lead to subtle bugs that are difficult to discover.
      To address this concern deliberate testing has been performed.
      
      It doesn't seem feasible to do some kind of static checking to find
      potential issues with pointer tagging, so a dynamic approach was taken.
      All pointer comparisons/subtractions were instrumented in an LLVM compiler
      pass, and a kernel module was used to print a bug report whenever two
      pointers with different tags were compared or subtracted (ignoring
      comparisons with NULL pointers and with pointers obtained by casting an
      error code to a pointer type).  Then the kernel was booted in QEMU and on
      an Odroid C2 board, and syzkaller was run.
      
      This yielded the following results.
      
      The two places that look interesting are:
      
      is_vmalloc_addr in include/linux/mm.h
      is_kernel_rodata in mm/util.c
      
      Here we compare a pointer with some fixed untagged values to make sure
      that the pointer lies in a particular part of the kernel address space.
      Since tag-based KASAN doesn't add tags to pointers that belong to rodata
      or vmalloc regions, this should work as is.  To make sure, debug checks
      have been added to those two functions to verify that the result doesn't
      change whether we operate on pointers with or without untagging.
      
      A few other cases that don't look that interesting:
      
      Comparing pointers to achieve a unique sorting order of pointee objects
      (e.g. sorting lock addresses before performing a double lock):
      
      tty_ldisc_lock_pair_timeout in drivers/tty/tty_ldisc.c
      pipe_double_lock in fs/pipe.c
      unix_state_double_lock in net/unix/af_unix.c
      lock_two_nondirectories in fs/inode.c
      mutex_lock_double in kernel/events/core.c
      
      ep_cmp_ffd in fs/eventpoll.c
      fsnotify_compare_groups fs/notify/mark.c
      
      Nothing needs to be done here, since the tags embedded into pointers
      don't change, so the sorting order would still be unique.
      
      Checks that a pointer belongs to some particular allocation:
      
      is_sibling_entry in lib/radix-tree.c
      object_is_on_stack in include/linux/sched/task_stack.h
      
      Nothing needs to be done here either, since two pointers can only belong
      to the same allocation if they have the same tag.
      
      Overall, since the kernel boots and works, there are no critical bugs.
      As for the rest, the traditional kernel testing approach (use until it
      fails) is the only one that looks feasible.
      
      Another point here is that tag-based KASAN is available under a separate
      config option that needs to be deliberately enabled. Even though it might
      be used in a "near-production" environment to find bugs that are not found
      during fuzzing or running tests, it is still a debug tool.
      
      ====== Benchmarks
      
      The following numbers were collected on Odroid C2 board. Both generic and
      tag-based KASAN were used in inline instrumentation mode.
      
      Boot time [1]:
      * ~1.7 sec for clean kernel
      * ~5.0 sec for generic KASAN
      * ~5.0 sec for tag-based KASAN
      
      Network performance [2]:
      * 8.33 Gbits/sec for clean kernel
      * 3.17 Gbits/sec for generic KASAN
      * 2.85 Gbits/sec for tag-based KASAN
      
      Slab memory usage after boot [3]:
      * ~40 kb for clean kernel
      * ~105 kb (~260% overhead) for generic KASAN
      * ~47 kb (~20% overhead) for tag-based KASAN
      
      KASAN memory overhead consists of three main parts:
      1. Increased slab memory usage due to redzones.
      2. Shadow memory (the whole of it is reserved once during boot).
      3. Quarantine (grows gradually up to some preset limit; the higher the
         limit, the higher the chance to detect a use-after-free).
      
      Comparing tag-based vs generic KASAN for each of these points:
      1. 20% vs 260% overhead.
      2. 1/16th vs 1/8th of physical memory.
      3. Tag-based KASAN doesn't require quarantine.
      
      [1] Time before the ext4 driver is initialized.
      [2] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`.
      [3] Measured as `cat /proc/meminfo | grep Slab`.
      
      ====== Some notes
      
      A few notes:
      
      1. The patchset can be found here:
         https://github.com/xairy/kasan-prototype/tree/khwasan
      
      2. Building requires a recent Clang version (7.0.0 or later).
      
      3. Stack instrumentation is not supported yet and will be added later.
      
      This patch (of 25):
      
      Tag-based KASAN changes the value of the top byte of pointers returned
      from the kernel allocation functions (such as kmalloc).  This patch
      updates the KASAN hook signatures and their usage in SLAB and SLUB code
      to reflect that.
      
      Link: http://lkml.kernel.org/r/aec2b5e3973781ff8a6bb6760f8543643202c451.1544099024.git.andreyknvl@google.com
      
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0116523c
  2. 23 Dec, 2018 2 commits
  3. 22 Dec, 2018 1 commit
  4. 21 Dec, 2018 5 commits
  5. 20 Dec, 2018 5 commits
    • bpf: sk_msg, sock{map|hash} redirect through ULP · 0608c69c
      John Fastabend authored
      A sockmap program that redirects through a kTLS ULP enabled socket
      will not work correctly because the ULP layer is skipped. This
      fixes the behavior to call through the ULP layer on redirect to
      ensure any operations required on the data stream at the ULP layer
      continue to be applied.
      
      To do this we add an internal flag MSG_SENDPAGE_NOPOLICY to avoid
      calling the BPF layer on a redirected message. This is
      required to avoid calling the BPF layer multiple times (possibly
      recursively) which is not the current/expected behavior without
      ULPs. In the future we may add a redirect flag if users _do_
      want the policy applied again but this would need to work for both
      ULP and non-ULP sockets and be opt-in to avoid breaking existing
      programs.
      
      Also to avoid polluting the flag space with an internal flag we
      reuse the flag space overlapping MSG_SENDPAGE_NOPOLICY with
      MSG_WAITFORONE. Here WAITFORONE is specific to recv path and
      SENDPAGE_NOPOLICY is only used for sendpage hooks. The last thing
      to verify is that the user space API is masked correctly to ensure the
      flag can not be set by user space. (Note this needs to be true regardless
      because we have internal flags already in use that user space should not
      be able to set.) But for completeness, we have two UAPI paths into
      sendpage: sendfile and splice.
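
      A sketch of the flag reuse described above.  MSG_WAITFORONE's value is the
      one from the UAPI header; giving the internal flag the same value and
      masking it out of user-supplied flags mirrors the description, but the mask
      name below is invented for illustration.

      #include <stdio.h>

      #define MSG_WAITFORONE        0x10000  /* recvmmsg(): return after first msg */
      #define MSG_SENDPAGE_NOPOLICY 0x10000  /* internal: skip BPF policy on redirect */

      /* invented name: mask applied to user-supplied sendpage flags before use */
      #define SENDPAGE_USER_MASK    (~MSG_SENDPAGE_NOPOLICY)

      int main(void)
      {
              int user_flags = MSG_SENDPAGE_NOPOLICY | 0x40; /* user smuggles the bit in */

              printf("flags after masking: %#x\n", user_flags & SENDPAGE_USER_MASK);
              return 0;
      }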
      
      In the sendfile case the function do_sendfile() zeroes the flags:
      
      ./fs/read_write.c:
       static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
      		   	    size_t count, loff_t max)
       {
         ...
         fl = 0;
      #if 0
         /*
          * We need to debate whether we can enable this or not. The
          * man page documents EAGAIN return for the output at least,
          * and the application is arguably buggy if it doesn't expect
          * EAGAIN on a non-blocking file descriptor.
          */
          if (in.file->f_flags & O_NONBLOCK)
      	fl = SPLICE_F_NONBLOCK;
      #endif
          file_start_write(out.file);
          retval = do_splice_direct(in.file, &pos, out.file, &out_pos, count, fl);
       }
      
      In the splice case the pipe_to_sendpage "actor" is used which
      masks flags with SPLICE_F_MORE.
      
      ./fs/splice.c:
       static int pipe_to_sendpage(struct pipe_inode_info *pipe,
      			    struct pipe_buffer *buf, struct splice_desc *sd)
       {
         ...
         more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
         ...
       }
      
      This confirms what we expect: the internal flags are in fact internal to
      the socket side.
      
      Fixes: d3b18ad3 ("tls: add bpf support to sk_msg handling")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      0608c69c
    • bpf: sk_msg, fix socket data_ready events · 552de910
      John Fastabend authored
      When an skb verdict program is in use and either another BPF program
      redirects to that socket or the new SK_PASS support is used, the
      data_ready callback does not wake up the application. Instead, because
      the stream parser/verdict is using the sk data_ready callback, we wake
      up the stream parser/verdict block.
      
      Fix this by adding a helper to check if the stream parser block is
      enabled on the sk and, if so, call the saved pointer, which is the
      upper layer's wake-up function.
      
      This fixes application stalls observed when an application is waiting
      for data in a blocking read().
      
      Fixes: d829e9c4 ("tls: convert to generic sk_msg interface")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      552de910
    • bpf: skmsg, replace comments with BUILD bug · 7a69c0f2
      John Fastabend authored
      
      
      Enforce comment on structure layout dependency with a BUILD_BUG_ON
      to ensure the condition is maintained.
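
      A generic illustration of the technique (not the skmsg code itself): a
      comment saying that two structures must keep a common layout is turned into
      a compile-time check.

      #include <stddef.h>

      #define BUILD_BUG_ON(cond) ((void)sizeof(char[1 - 2 * !!(cond)]))

      struct toy_msg_a { void *data; int len; };
      struct toy_msg_b { void *data; int len; int extra; };

      static void layout_check(void)
      {
              /* fails to compile if the shared prefix ever drifts apart */
              BUILD_BUG_ON(offsetof(struct toy_msg_a, len) !=
                           offsetof(struct toy_msg_b, len));
      }

      int main(void)
      {
              layout_check();
              return 0;
      }
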
      Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      7a69c0f2
    • powerpc: use mm zones more sensibly · 25078dc1
      Christoph Hellwig authored
      
      
      Powerpc has somewhat odd usage where ZONE_DMA is used for all memory on
      common 64-bit configs, and ZONE_DMA32 is used for 31-bit schemes.
      
      Move to a scheme closer to what other architectures use (and I dare to
      say the intent of the system):
      
       - ZONE_DMA: optionally for memory < 31-bit (64-bit embedded only)
       - ZONE_NORMAL: everything addressable by the kernel
       - ZONE_HIGHMEM: memory > 32-bit for 32-bit kernels
      
      Also provide information on how ZONE_DMA is used by defining
      ARCH_ZONE_DMA_BITS.
      
      Contains various fixes from Benjamin Herrenschmidt.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      25078dc1
    • PCI/ACPI: Allow ACPI to be built without CONFIG_PCI set · 5d32a665
      Sinan Kaya authored
      
      
      We are compiling PCI code today for systems with ACPI and no PCI
      device present. Remove the useless code and reduce the tight
      dependency.
      Signed-off-by: Sinan Kaya <okaya@kernel.org>
      Acked-by: Bjorn Helgaas <bhelgaas@google.com> # PCI parts
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      5d32a665
  6. 19 Dec, 2018 12 commits
    • net: switch secpath to use skb extension infrastructure · 4165079b
      Florian Westphal authored
      
      
      Remove skb->sp and allocate secpath storage via extension
      infrastructure.  This also reduces sk_buff by 8 bytes on x86_64.
      
      Total size of allyesconfig kernel is reduced slightly, as there is
      less inlined code (one conditional atomic op instead of two on
      skb_clone).
      
      No differences in throughput in the following ipsec performance tests:
      - transport mode with aes on a 10GB link
      - tunnel mode between two network namespaces with aes and null cipher
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4165079b
    • net: use skb_sec_path helper in more places · 2294be0f
      Florian Westphal authored
      
      
      skb_sec_path gains a 'const' qualifier to avoid:
      xt_policy.c: 'skb_sec_path' discards 'const' qualifier from pointer target type
      
      Same reasoning as the previous conversions: we won't need to touch these
      spots anymore when skb->sp is removed.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2294be0f
    • net: move secpath_exist helper to sk_buff.h · 7af8f4ca
      Florian Westphal authored
      
      
      A future patch will remove the skb->sp pointer.
      To reduce noise in those patches, move the existing helper to
      sk_buff.h and use it in more places to ease the skb->sp replacement later.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7af8f4ca
    • net: convert bridge_nf to use skb extension infrastructure · de8bda1d
      Florian Westphal authored
      
      
      This converts the bridge netfilter (calling iptables hooks from bridge)
      facility to use the extension infrastructure.
      
      The bridge_nf specific hooks in the skb clone and free paths are removed;
      they have been replaced by the skb_ext hooks, which do the same as the
      bridge nf allocation hooks did.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      de8bda1d
    • sk_buff: add skb extension infrastructure · df5042f4
      Florian Westphal authored
      
      
      This adds an optional extension infrastructure, with ipsec (xfrm) and
      bridge netfilter as first users.
      objdiff shows no changes if the kernel is built without xfrm and
      br_netfilter support.
      
      The third (planned future) user is Multipath TCP which is still
      out-of-tree.
      MPTCP needs to map logical mptcp sequence numbers to the tcp sequence
      numbers used by individual subflows.
      
      This DSS mapping is read from tcp option space on receive and
      written to tcp option space on transmitted tcp packets that are part of
      an MPTCP connection.
      
      Extending skb_shared_info or adding a private data field to skb fclones
      doesn't work for incoming skb, so a different DSS propagation method would
      be required for the receive side.
      
      MPTCP has the same requirements as secpath/bridge netfilter:
      
      1. extension memory is released when the sk_buff is free'd.
      2. data is shared after cloning an skb (clone inherits extension)
      3. adding extension to an skb will COW the extension buffer if needed.
      
      The "MPTCP upstreaming" effort adds SKB_EXT_MPTCP extension to store the
      mapping for tx and rx processing.
      
      Two new members are added to sk_buff:
      1. 'active_extensions' byte (filling a hole), telling which extensions
         are available for this skb.
         This has two purposes.
         a) avoids the need to initialize the pointer.
         b) allows an extension to be "deleted" by clearing its bit
         value in ->active_extensions.
      
         While it would be possible to store the active_extensions byte
         in the extension struct instead of sk_buff, there is one problem
         with this:
          When an extension has to be disabled, we can always clear the
          bit in skb->active_extensions.  But in case it would be stored in the
          extension buffer itself, we might have to COW it first, if
          we are dealing with a cloned skb.  On kmalloc failure we would
          be unable to turn an extension off.
      
      2. extension pointer, located at the end of the sk_buff.
         If the active_extensions byte is 0, the pointer is undefined,
         it is not initialized on skb allocation.
      
      This adds extra code to skb clone and free paths (to deal with
      refcount/free of extension area) but this replaces similar code that
      manages skb->nf_bridge and skb->sp structs in the followup patches of
      the series.
      
      It is possible to add support for extensions that are not preserved on
      clones/copies.
      
      To do this, it would be needed to define a bitmask of all extensions that
      need copy/cow semantics, and change __skb_ext_copy() to check
      ->active_extensions & SKB_EXT_PRESERVE_ON_CLONE, then just set
      ->active_extensions to 0 on the new clone.
      
      This isn't done here because all extensions that get added here
      need the copy/cow semantics.
      
      v2:
      Allocate entire extension space using kmem_cache.
      The upside is that this allows better tracking of used memory; the
      downside is that we will allocate more space than strictly needed in
      most cases (it's unlikely that all extensions are active/needed at the
      same time for the same skb).
      The allocated memory (except the small extension header) is not cleared,
      so there is no additional overhead aside from memory usage.
      
      Avoid the atomic_dec_and_test operation on skb_ext_put()
      by using a similar trick as kfree_skbmem() does with fclone_ref:
      if the refcount is 1, there is no concurrent user and we can free right away.
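
      A toy model of the two sk_buff additions described above: a per-skb bitmask
      saying which extensions are present, plus a refcounted extension area that
      is freed directly when the caller holds the last reference.  Types and
      names are simplified stand-ins, not the kernel structures.

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <stdlib.h>

      enum toy_ext_id { TOY_EXT_SEC_PATH, TOY_EXT_BRIDGE_NF, TOY_EXT_NUM };

      struct toy_ext_area {
              int refcnt;                 /* shared between skb clones */
              void *slot[TOY_EXT_NUM];
      };

      struct toy_skb {
              uint8_t active_extensions;  /* bit per extension, 0 => ext undefined */
              struct toy_ext_area *ext;   /* only valid if active_extensions != 0 */
      };

      static bool toy_ext_exist(const struct toy_skb *skb, enum toy_ext_id id)
      {
              return skb->active_extensions & (1u << id);
      }

      static void toy_ext_del(struct toy_skb *skb, enum toy_ext_id id)
      {
              /* "deleting" only clears the bit -- no COW of the area is needed */
              skb->active_extensions &= ~(1u << id);
      }

      static void toy_ext_put(struct toy_skb *skb)
      {
              if (!skb->active_extensions)
                      return;
              /* refcount == 1: no concurrent user, free without an atomic dec */
              if (skb->ext->refcnt == 1 || --skb->ext->refcnt == 0)
                      free(skb->ext);
              skb->active_extensions = 0;
      }

      int main(void)
      {
              struct toy_skb skb = {
                      .active_extensions = 1u << TOY_EXT_SEC_PATH,
                      .ext = calloc(1, sizeof(struct toy_ext_area)),
              };

              skb.ext->refcnt = 1;
              toy_ext_del(&skb, TOY_EXT_BRIDGE_NF);   /* clearing an absent ext: just a bit op */
              if (toy_ext_exist(&skb, TOY_EXT_SEC_PATH))
                      puts("secpath extension present");
              toy_ext_put(&skb);                      /* last reference: freed directly */
              return 0;
      }
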
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      df5042f4
    • netfilter: avoid using skb->nf_bridge directly · c4b0e771
      Florian Westphal authored
      
      
      This pointer is going to be removed soon, so use the existing helpers in
      more places to avoid noise when the removal happens.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c4b0e771
    • regmap: irq: add an option to clear status registers on unmask · c82ea33e
      Bartosz Golaszewski authored and Mark Brown committed
      
      
      Some interrupt controllers whose interrupts are acked on read will set
      the status bits for masked interrupts without changing the state of
      the IRQ line.
      
      Some chips have an additional "feature" where if those set bits are
      not cleared before unmasking their respective interrupts, the IRQ
      line will change the state and we'll interpret this as an interrupt
      although it actually fired when it was masked.
      
      Add a new field to the irq chip struct that tells the regmap irq chip
      code to always clear the status registers before actually changing the
      irq mask values.
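
      An order-of-operations sketch of the behaviour described above; the register
      layout, the accessors and the option name are invented for illustration and
      are not the regmap API.

      #include <stdint.h>
      #include <stdio.h>

      struct toy_chip {
              uint8_t status;         /* read-to-clear status register */
              uint8_t mask;           /* 1 = interrupt masked */
              int clear_on_unmask;    /* new option: wipe stale status first */
      };

      static uint8_t read_status(struct toy_chip *c)
      {
              uint8_t v = c->status;

              c->status = 0;          /* reading acks (clears) the bits */
              return v;
      }

      static void unmask_irq(struct toy_chip *c, uint8_t bit)
      {
              if (c->clear_on_unmask)
                      (void)read_status(c);   /* drop events latched while masked */
              c->mask &= ~bit;
      }

      int main(void)
      {
              struct toy_chip c = { .status = 0x04, .mask = 0xff, .clear_on_unmask = 1 };

              unmask_irq(&c, 0x04);
              printf("status after unmask: %#x\n", c.status);  /* 0: no spurious IRQ */
              return 0;
      }
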
      Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
      Signed-off-by: Mark Brown <broonie@kernel.org>
      c82ea33e
    • regmap: regmap-irq/gpio-max77620: add level-irq support · 1c2928e3
      Matti Vaittinen authored and Mark Brown committed
      
      
      Add level-active IRQ support to the regmap-irq irqchip. The change breaks
      the existing regmap-irq type setting, so convert the existing driver which
      uses regmap-irq with trigger type setting (gpio-max77620) to work
      with this new approach. So we do not magically support level-active
      IRQs on gpio-max77620 - but we add support to regmap-irq for chips
      which do support them =)
      
      We do not support distinguishing the situation where HW supports rising
      or falling edge detection but not both. Separating these would require
      inventing yet another set of flags for IRQ types.
      Signed-off-by: Matti Vaittinen <matti.vaittinen@fi.rohmeurope.com>
      Signed-off-by: Mark Brown <broonie@kernel.org>
      1c2928e3
    • Revert "x86/objtool: Use asm macros to work around GCC inlining bugs" · 96af6cd0
      Ingo Molnar authored
      This reverts commit c06c4d80.
      
      See this commit for details about the revert:
      
        e769742d ("Revert "x86/jump-labels: Macrofy inline assembly code to work around GCC inlining bugs"")
      Reported-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Reviewed-by: Borislav Petkov <bp@alien8.de>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Richard Biener <rguenther@suse.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Segher Boessenkool <segher@kernel.crashing.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      96af6cd0
    • genirq/affinity: Add is_managed to struct irq_affinity_desc · c410abbb
      Dou Liyang authored
      
      
      Devices which use managed interrupts usually have two classes of
      interrupts:
      
        - Interrupts for multiple device queues
        - Interrupts for general device management
      
      Currently both classes are treated the same way, i.e. as managed
      interrupts. The general interrupts get the default affinity mask assigned
      while the device queue interrupts are spread out over the possible CPUs.
      
      Treating the general interrupts as managed is both a limitation and under
      certain circumstances a bug. Assume the following situation:
      
       default_irq_affinity = 4..7
      
      So if CPUs 4-7 are offlined, then the core code will shut down the device
      management interrupts because the last CPU in their affinity mask went
      offline.
      
      It's also a limitation because it's desired to allow manual placement of
      the general device interrupts for various reasons. If they are marked
      managed then the interrupt affinity setting from both user and kernel space
      is disabled. That limitation was reported by Kashyap and Sumit.
      
      Expand struct irq_affinity_desc with a new bit 'is_managed' which is set
      for truly managed interrupts (queue interrupts) and cleared for the general
      device interrupts.
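
      The shape of the change, as a standalone sketch: the descriptor carries the
      affinity mask plus a bit saying whether the interrupt is truly managed.
      struct cpumask is reduced to a plain bitmask here, so this is illustrative
      rather than the kernel definition.

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      struct toy_irq_affinity_desc {
              uint64_t mask;                  /* stand-in for struct cpumask */
              unsigned int is_managed : 1;    /* queue irq: yes, mgmt irq: no */
      };

      static bool affinity_is_managed(const struct toy_irq_affinity_desc *d)
      {
              return d->is_managed;
      }

      int main(void)
      {
              struct toy_irq_affinity_desc queue = { .mask = 0xf0, .is_managed = 1 };
              struct toy_irq_affinity_desc mgmt  = { .mask = 0xf0, .is_managed = 0 };

              printf("queue managed: %d, mgmt managed: %d\n",
                     affinity_is_managed(&queue), affinity_is_managed(&mgmt));
              return 0;
      }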
      
      [ tglx: Simplify code and massage changelog ]
      Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
      Reported-by: Sumit Saxena <sumit.saxena@broadcom.com>
      Signed-off-by: Dou Liyang <douliyangs@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-pci@vger.kernel.org
      Cc: shivasharan.srikanteshwara@broadcom.com
      Cc: ming.lei@redhat.com
      Cc: hch@lst.de
      Cc: bhelgaas@google.com
      Cc: douliyang1@huawei.com
      Link: https://lkml.kernel.org/r/20181204155122.6327-3-douliyangs@gmail.com
      c410abbb
    • genirq/core: Introduce struct irq_affinity_desc · bec04037
      Dou Liyang authored
      
      
      The interrupt affinity management uses straight cpumask pointers to convey
      the automatically assigned affinity masks for managed interrupts. The core
      interrupt descriptor allocation also decides based on the pointer being non
      NULL whether an interrupt is managed or not.
      
      Devices which use managed interrupts usually have two classes of
      interrupts:
      
        - Interrupts for multiple device queues
        - Interrupts for general device management
      
      Currently both classes are treated the same way, i.e. as managed
      interrupts. The general interrupts get the default affinity mask assigned
      while the device queue interrupts are spread out over the possible CPUs.
      
      Treating the general interrupts as managed is both a limitation and under
      certain circumstances a bug. Assume the following situation:
      
       default_irq_affinity = 4..7
      
      So if CPUs 4-7 are offlined, then the core code will shut down the device
      management interrupts because the last CPU in their affinity mask went
      offline.
      
      It's also a limitation because it's desired to allow manual placement of
      the general device interrupts for various reasons. If they are marked
      managed then the interrupt affinity setting from both user and kernel space
      is disabled.
      
      To remedy that situation it's required to convey more information than the
      cpumasks through various interfaces related to interrupt descriptor
      allocation.
      
      Instead of adding yet another argument, create a new data structure
      'irq_affinity_desc' which for now just contains the cpumask. This struct
      can be expanded to convey auxiliary information in the next step.
      
      No functional change, just preparatory work.
      
      [ tglx: Simplified logic and clarified changelog ]
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Suggested-by: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: Dou Liyang <douliyangs@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-pci@vger.kernel.org
      Cc: kashyap.desai@broadcom.com
      Cc: shivasharan.srikanteshwara@broadcom.com
      Cc: sumit.saxena@broadcom.com
      Cc: ming.lei@redhat.com
      Cc: hch@lst.de
      Cc: douliyang1@huawei.com
      Link: https://lkml.kernel.org/r/20181204155122.6327-2-douliyangs@gmail.com
      bec04037
    • PM-runtime: Switch autosuspend over to using hrtimers · 8234f673
      Vincent Guittot authored
      PM-runtime uses the timer infrastructure for autosuspend. This implies
      that the minimum time before autosuspending a device is in the range of
      1 tick (included) to 2 ticks (excluded):
       - On arm64 this means between 4 ms and 8 ms with the default jiffies
         configuration
       - And on arm, it is between 10 ms and 20 ms
      
      These values are quite high for embedded systems which sometimes want
      the duration to be in the range of 1 ms.
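
      The quick arithmetic behind the ranges quoted above: with tick-based timers
      a sub-tick delay rounds up to whole jiffies, so the real wait lands between
      one and two ticks (the HZ values are the common defaults implied by the
      text).

      #include <stdio.h>

      static void show(const char *arch, int hz, int delay_ms)
      {
              int tick_ms = 1000 / hz;

              printf("%s (HZ=%d): %d ms requested -> wait between %d ms and %d ms\n",
                     arch, hz, delay_ms, tick_ms, 2 * tick_ms);
      }

      int main(void)
      {
              show("arm64", 250, 1);   /* 4 ms .. 8 ms  */
              show("arm",   100, 1);   /* 10 ms .. 20 ms */
              return 0;
      }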
      
      It is possible to switch autosuspend over to using hrtimers to get
      finer granularity for short durations and take advantage of slack to
      retain some margins and get long timeouts with minimum wakeups.
      
      On an arm64 platform that uses a 1 ms autosuspend timeout for its
      GPU, idle power is reduced by 10% with hrtimers.
      
      The latency impact on the arm64 HiKey octa-core board is:
       - mark_last_busy: from 1.11 us to 1.25 us
       - rpm_suspend: from 15.54 us to 15.38 us
      [Only the code path of rpm_suspend() that starts hrtimer has been
      measured.]
      
      The arm64 image (arm64 default defconfig) decreases by around 3 KB,
      with the following details:
      
      $ size vmlinux-timer
         text	   data	    bss	    dec	    hex	filename
      12034646	6869268	 386840	19290754	1265a82	vmlinux
      
      $ size vmlinux-hrtimer
         text	   data	    bss	    dec	    hex	filename
      12030550	6870164	 387032	19287746	1264ec2	vmlinux
      
      The latency impact on the 32-bit arm Snowball dual core is:
       - mark_last_busy: from 0.31 us to 0.77 us
       - rpm_suspend: from 6.83 us to 6.67 us
      
      The increase of the image for the Snowball platform, which I used for
      testing the performance impact, is negligible (244 B).
      
      $ size vmlinux-timer
         text	   data	    bss	    dec	    hex	filename
      7157961	2119580	 264120	9541661	 91981d	build-ux500/vmlinux
      
      $ size vmlinux-hrtimer
         text	   data	    bss	    dec	    hex	filename
      7157773	2119884	 264248	9541905	 919911	vmlinux-hrtimer
      
      And the 32-bit arm image (multi_v7_defconfig) increases by around 1.7 KB,
      with the following details:
      
      $ size vmlinux-timer
         text	   data	    bss	    dec	    hex	filename
      13304443	6803420	 402768	20510631	138f7a7	vmlinux
      
      $ size vmlinux-hrtimer
         text	   data	    bss	    dec	    hex	filename
      13304299	6805276	 402768	20512343	138fe57	vmlinux
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      8234f673
  7. 18 Dec, 2018 7 commits
    • bpf: sockmap, metadata support for reporting size of msg · 3bdbd022
      John Fastabend authored
      
      
      This adds metadata to sk_msg_md for BPF programs to read the sk_msg
      size.
      
      When the SK_MSG program is running under an application that is using
      sendfile the data is not copied into sk_msg buffers by default. Rather
      the BPF program uses sk_msg_pull_data to read the bytes in. This
      avoids doing the costly memcopy instructions when they are not in
      fact needed. However, if we don't know the size of the sk_msg we
      have to guess if needed bytes are available by doing a pull request
      which may fail. By including the size of the sk_msg BPF programs can
      check the size before issuing sk_msg_pull_data requests.
      
      Additionally, the same applies for sendmsg calls when the application
      provides multiple iovs. Here the BPF program needs to pull in data
      to update data pointers, but it's not clear where the data ends without
      a size parameter. In many cases "guessing" is not easy to do
      and results in multiple calls to pull, and without bounded loops
      everything gets fairly tricky.
      
      Clean this up by including a u32 size field. Note, all writes into
      sk_msg_md are rejected already from sk_msg_is_valid_access so nothing
      additional is needed there.
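
      A hedged example of how an SK_MSG program could use the new metadata,
      assuming the field is exposed as msg->size and building against libbpf's
      bpf_helpers.h; this is a sketch, not code from the patch.

      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>

      SEC("sk_msg")
      int check_before_pull(struct sk_msg_md *msg)
      {
              /* only ask for a header pull when enough bytes actually exist */
              if (msg->size >= 16)
                      bpf_msg_pull_data(msg, 0, 16, 0);

              return SK_PASS;
      }

      char _license[] SEC("license") = "GPL";
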
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      3bdbd022
    • net: phy: improve phy state checking · 2b3e88ea
      Heiner Kallweit authored
      
      
      Add helpers phy_is_started() and __phy_is_started() to avoid open-coded
      checks of whether the PHY has been started. To make the check easier, move
      PHY_HALTED before PHY_UP in enum phy_state (a sketch of the resulting check
      follows the list of changes below). Further improvements:
      
      phy_start_aneg():
      Return -EBUSY and print a warning if the function is called from a
      non-started state (DOWN, READY, HALTED). This is a better check because
      the function is exported and drivers may use it incorrectly.
      
      phy_interrupt():
      Return IRQ_NONE also if state is DOWN or READY. We should never receive
      an interrupt in one of these states, but better play safe.
      
      phy_stop():
      Just return and print a warning if PHY is in a non-started state.
      This warning should help to identify drivers with unbalanced calls to
      phy_start() / phy_stop().
      
      phy_state_machine():
      Schedule state machine run only if PHY is in a started state.
      E.g. if state is READY we don't need the state machine, it will be
      started by phy_start().
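
      A minimal sketch of the started-state check, assuming the enum is ordered
      so that every started state sorts after PHY_UP once PHY_HALTED has been
      moved in front of it; the enum values and helper body are illustrative.

      #include <stdbool.h>
      #include <stdio.h>

      enum phy_state { PHY_DOWN, PHY_READY, PHY_HALTED,
                       PHY_UP, PHY_RUNNING, PHY_NOLINK };

      struct toy_phy_device { enum phy_state state; };

      static bool phy_is_started(const struct toy_phy_device *phydev)
      {
              return phydev->state >= PHY_UP;
      }

      int main(void)
      {
              struct toy_phy_device phy = { .state = PHY_HALTED };

              printf("started: %d\n", phy_is_started(&phy));   /* 0 */
              phy.state = PHY_RUNNING;
              printf("started: %d\n", phy_is_started(&phy));   /* 1 */
              return 0;
      }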
      
      v2:
      - don't use __func__ within phy_warn_state
      v3:
      - use WARN() instead of printing error message to facilitate debugging
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2b3e88ea
    • bpf: support raw tracepoints in modules · a38d1107
      Matt Mullins authored
      
      
      Distributions build drivers as modules, including network and filesystem
      drivers which export numerous tracepoints.  This enables
      bpf(BPF_RAW_TRACEPOINT_OPEN) to attach to those tracepoints.
      Signed-off-by: Matt Mullins <mmullins@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      a38d1107
    • irqchip: Add driver for Cirrus Logic Madera codecs · da0abe1a
      Richard Fitzgerald authored
      
      
      The Cirrus Logic Madera codecs (Cirrus Logic CS47L35/85/90/91 and WM1840)
      are highly complex devices containing up to 7 programmable DSPs and many
      other internal sources of interrupts plus a number of GPIOs that can be
      used as interrupt inputs. The large number (>150) of internal interrupt
      sources are managed by an on-board interrupt controller.
      
      This driver provides the handling for the interrupt controller. As the
      codec is accessed via regmap, we can make use of the generic IRQ
      functionality from regmap to do most of the work. Only around half of
      the possible interrupt sources are currently of interest to the driver,
      so only this subset is defined. Others can be added in future if needed.
      
      The Kconfig options are not user-configurable because this driver is
      mandatory; it is automatically included when the parent MFD driver is
      selected.
      Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com>
      Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      da0abe1a
    • genirq: Fix various typos in comments · c5f48c0a
      Ingo Molnar authored
      
      
      Go over the IRQ subsystem source code (including irqchip drivers) and
      fix common typos in comments.
      
      No change in functionality intended.
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jason Cooper <jason@lakedaemon.net>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-kernel@vger.kernel.org
      c5f48c0a
    • mac80211: update HE operation fields to D3.0 · daa5b835
      Shaul Triebitz authored
      
      
      The HE Operation element has changed in 11ax D3.0.  Update the fields
      accordingly.
      Signed-off-by: Shaul Triebitz <shaul.triebitz@intel.com>
      Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      daa5b835
    • ieee80211: add bits for TWT in Extended Capabilities IE · fdb313e3
      Emmanuel Grumbach authored
      
      
      These bits are defined in ieee802.11ax to advertise support
      for TWT in addition to the bits in the HE IE.
      Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
      Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      fdb313e3