1. 05 Dec, 2018 1 commit
    • tcp: reduce POLLOUT events caused by TCP_NOTSENT_LOWAT · a74f0fa0
      Eric Dumazet authored
      
      
      TCP_NOTSENT_LOWAT socket option or sysctl was added in linux-3.12
      as a step to enable bigger tcp sndbuf limits.
      
      It works reasonably well, but the following happens:
      
      Once the limit is reached, the TCP stack generates
      an [E]POLLOUT event for every incoming ACK packet.
      
      This causes a high number of context switches.
      
      This patch implements the strategy David Miller added
      in sock_def_write_space():
      
       - If TCP socket has a notsent_lowat constraint of X bytes,
         allow sendmsg() to fill up to X bytes, but send [E]POLLOUT
         only if number of notsent bytes is below X/2
      
      This considerably reduces TCP_NOTSENT_LOWAT overhead,
      while still keeping the pipe full.
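      A minimal C sketch of that hysteresis, with illustrative helper names
      rather than the exact kernel code:

      /* sendmsg() may keep queueing data while fewer than lowat bytes
       * are still unsent ... */
      static bool can_queue_more(unsigned int notsent_bytes,
                                 unsigned int notsent_lowat)
      {
              return notsent_bytes < notsent_lowat;
      }

      /* ... but writers are woken with [E]POLLOUT only once the unsent
       * backlog drains below half the limit, instead of on every ACK. */
      static bool should_wake_writer(unsigned int notsent_bytes,
                                     unsigned int notsent_lowat)
      {
              return notsent_bytes < (notsent_lowat >> 1);
      }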
      
      Tested:
       100 ms RTT netem testbed between A and B, 100 concurrent TCP_STREAM
      
      A:/# cat /proc/sys/net/ipv4/tcp_wmem
      4096	262144	64000000
      A:/# super_netperf 100 -H B -l 1000 -- -K bbr &
      
      A:/# grep TCP /proc/net/sockstat
      TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 1364904 # This is about 54 MB of memory per flow :/
      
      A:/# vmstat 5 5
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       0  0      0 256220672  13532 694976    0    0    10     0   28   14  0  1 99  0  0
       2  0      0 256320016  13532 698480    0    0   512     0 715901 5927  0 10 90  0  0
       0  0      0 256197232  13532 700992    0    0   735    13 771161 5849  0 11 89  0  0
       1  0      0 256233824  13532 703320    0    0   512    23 719650 6635  0 11 89  0  0
       2  0      0 256226880  13532 705780    0    0   642     4 775650 6009  0 12 88  0  0
      
      A:/# echo 2097152 >/proc/sys/net/ipv4/tcp_notsent_lowat
      
      A:/# grep TCP /proc/net/sockstat
      TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 86411 # 3.5 MB per flow
      
      A:/# vmstat 5 5  # check that context switches have not inflated too much.
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       2  0      0 260386512  13592 662148    0    0    10     0   17   14  0  1 99  0  0
       0  0      0 260519680  13592 604184    0    0   512    13 726843 12424  0 10 90  0  0
       1  1      0 260435424  13592 598360    0    0   512    25 764645 12925  0 10 90  0  0
       1  0      0 260855392  13592 578380    0    0   512     7 722943 13624  0 11 88  0  0
       1  0      0 260445008  13592 601176    0    0   614    34 772288 14317  0 10 90  0  0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 04 Dec, 2018 1 commit
    • skbuff: Rename 'offload_mr_fwd_mark' to 'offload_l3_fwd_mark' · 875e8939
      Ido Schimmel authored
      Commit abf4bb6b ("skbuff: Add the offload_mr_fwd_mark field") added
      the 'offload_mr_fwd_mark' field to indicate that a packet has already
      undergone L3 multicast routing by a capable device. The field is used to
      prevent the kernel from forwarding a packet through a netdev through
      which the device has already forwarded the packet.
      
      Currently, no unicast packet is routed by both the device and the
      kernel, but this is about to change in subsequent patches and we need to
      be able to mark such packets, so that they will not be forwarded twice.
      
      Instead of adding yet another field to 'struct sk_buff', we can just
      rename 'offload_mr_fwd_mark' to 'offload_l3_fwd_mark', as a packet
      either has a multicast or a unicast destination IP.
      
      While at it, add a comment about both 'offload_fwd_mark' and
      'offload_l3_fwd_mark'.
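      A hedged sketch of the two bits as described above (the surrounding
      sk_buff layout and exact bitfield placement are assumptions):

      struct fwd_marks_sketch {
              /* Already forwarded at L2 by the offloading device; don't
               * let the bridge forward it again out of that device. */
              unsigned char offload_fwd_mark:1;
              /* Already forwarded at L3 (multicast or, after this series,
               * unicast) by the device; don't route it again in the kernel. */
              unsigned char offload_l3_fwd_mark:1;
      };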
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 03 Dec, 2018 8 commits
  4. 01 Dec, 2018 2 commits
    • tun: implement carrier change · 26d31925
      Nicolas Dichtel authored
      
      
      Userspace may need to control the carrier state.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: Didier Pallard <didier.pallard@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: reorder flowi_common fields to avoid holes · bf1c3ab8
      Paolo Abeni authored
      
      
      The flowi* structures are used and memset by several functions
      in the critical path. Currently flowi_common has a couple of holes that
      we can eliminate by reordering the struct fields. As a side effect,
      both flowi4 and flowi6 shrink by 8 bytes.
      
      Before:
      pahole -EC flowi_common
      struct flowi_common {
      // ...
      	/* size: 40, cachelines: 1, members: 10 */
      	/* sum members: 32, holes: 1, sum holes: 4 */
      	/* padding: 4 */
      	/* last cacheline: 40 bytes */
      };
      pahole -EC flowi6
      struct flowi6 {
      // ...
              /* size: 88, cachelines: 2, members: 6 */
              /* padding: 4 */
              /* last cacheline: 24 bytes */
      };
      pahole -EC flowi4
      struct flowi4 {
      // ...
              /* size: 56, cachelines: 1, members: 4 */
              /* padding: 4 */
              /* last cacheline: 56 bytes */
      };
      
      After:
      struct flowi_common {
      // ...
      	/* size: 32, cachelines: 1, members: 10 */
      	/* last cacheline: 32 bytes */
      };
      struct flowi6 {
      // ...
              /* size: 80, cachelines: 2, members: 6 */
              /* padding: 4 */
              /* last cacheline: 16 bytes */
      };
      struct flowi4 {
      // ...
              /* size: 48, cachelines: 1, members: 4 */
              /* padding: 4 */
              /* last cacheline: 48 bytes */
      };
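      A generic illustration (not the actual flowi_common layout) of how
      reordering fields removes compiler-inserted holes:

      struct before_sketch {
              unsigned char  a;  /* 1 byte, then a 3-byte alignment hole */
              unsigned int   b;  /* 4 bytes */
              unsigned char  c;  /* 1 byte, then 3 bytes of tail padding */
      };                         /* sizeof() == 12 */

      struct after_sketch {
              unsigned int   b;  /* 4 bytes */
              unsigned char  a;  /* 1 byte */
              unsigned char  c;  /* 1 byte, then 2 bytes of tail padding */
      };                         /* sizeof() == 8 */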
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 30 Nov, 2018 8 commits
  6. 29 Nov, 2018 1 commit
  7. 28 Nov, 2018 3 commits
  8. 27 Nov, 2018 5 commits
    • net: bridge: add no_linklocal_learn bool option · 70e4272b
      Nikolay Aleksandrov authored
      
      
      Use the new boolopt API to add an option which disables learning from
      link-local packets. The default is kept as before and learning is
      enabled. This is a simple map from a boolopt bit to a bridge private
      flag that is tested before learning.
      
      v2: pass NULL for extack via sysfs
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: add support for user-controlled bool options · a428afe8
      Nikolay Aleksandrov authored
      We have been adding many new bridge options, a large number of which are
      boolean but still take up netlink attribute ids and waste space in the skb.
      Recently we discussed learning from link-local packets[1] and decided that
      yet another new boolean option will be needed, thus introducing this API
      to save some bridge netlink space.
      
      The API supports changing the value of multiple boolean options at once
      via the br_boolopt_multi struct, which has an optmask (which options to
      set, one bit per option) and optval (the options' new values). Future
      boolean options only need to be added to the br_boolopt_id enum and then
      handled in br_boolopt_toggle/get. The API automatically adds the ability
      to change and export them via netlink; sysfs can use the single boolopt
      function versions to do the same. The behaviour on failure/success is the
      same as with normal netlink option changing.
      
      If an option requires mapping to an internal kernel flag or needs special
      configuration to be enabled, then it should be handled in
      br_boolopt_toggle. br_boolopt_get should likewise be able to report the
      option's current state.
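      A hedged sketch of the multi-option interface described above (field
      names follow the commit text; the exact uapi layout may differ, and the
      option id below is hypothetical):

      enum br_boolopt_id_sketch {
              BR_BOOLOPT_EXAMPLE_SKETCH = 0,  /* hypothetical option, bit 0 */
      };

      struct br_boolopt_multi_sketch {
              unsigned int optval;   /* new values, one bit per option */
              unsigned int optmask;  /* which options to change, one bit per option */
      };

      /* Usage: enable only the example option, leaving the others untouched. */
      static const struct br_boolopt_multi_sketch enable_example = {
              .optval  = 1u << BR_BOOLOPT_EXAMPLE_SKETCH,
              .optmask = 1u << BR_BOOLOPT_EXAMPLE_SKETCH,
      };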
      
      v2: WARN_ON() on an unsupported option, as that shouldn't be possible and
          will also help catch people who add new options without handling
          them for both set and get. Pass down extack so that an option can
          set it on error and be more user-friendly.
      
      [1] https://www.spinics.net/lists/netdev/msg532698.html
      
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • virtio: add packed ring types and macros · 89a9157e
      Tiwei Bie authored
      
      
      Add types and macros for packed ring.
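      For context, a sketch of the packed ring descriptor as laid out in the
      VIRTIO 1.1 specification (the exact type and macro names added by this
      patch may differ):

      struct packed_desc_sketch {
              unsigned long long addr;   /* buffer guest-physical address */
              unsigned int       len;    /* buffer length in bytes */
              unsigned short     id;     /* buffer id, echoed back by the device */
              unsigned short     flags;  /* AVAIL/USED wrap flags, NEXT, WRITE, ... */
      };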
      Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: btf: support proper non-jit func info · ba64e7d8
      Yonghong Song authored
      Commit 838e9690 ("bpf: Introduce bpf_func_info")
      added bpf func info support. Userspace is able
      to get better ksyms for jited bpf programs and
      is able to print out func prototypes.
      
      For a program containing func-to-func calls, the existing
      implementation returns the user-specified number of function
      calls and BTF types if the jit is enabled. If the jit is not
      enabled, it only returns the type for the main function.
      
      This is undesirable. The interpreter may still be used,
      and the feature should remain identical regardless of
      whether the jit is enabled or not.
      This patch fixes that discrepancy.
      
      Fixes: 838e9690 ("bpf: Introduce bpf_func_info")
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, ppc64: generalize fetching subprog into bpf_jit_get_func_addr · e2c95a61
      Daniel Borkmann authored
      
      
      Make fetching of the BPF call address from the ppc64 JIT generic. ppc64
      was using a slightly different variant, not going through the insns'
      imm field encoding, as the target address would not fit into that space.
      Therefore, the target subprog number was encoded into the insns' offset
      and fetched through fp->aux->func[off]->bpf_func instead. Given there
      are other JITs with this issue and the mechanism of fetching the address
      is JIT-generic, move it into the core as a helper instead. On the JIT
      side, we get information on whether the retrieved address is a fixed
      one, that is, one that does not change through JIT passes, or a dynamic
      one. For the former, JITs can optimize their imm emission because this
      does not change jump offsets throughout the JIT process.
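      A simplified C sketch of that generic lookup (not the verbatim helper
      added by this patch; prog->aux->func[] and __bpf_call_base are existing
      kernel symbols):

      static int get_call_addr_sketch(const struct bpf_prog *prog,
                                      const struct bpf_insn *insn,
                                      u64 *addr, bool *fixed)
      {
              /* Helper calls resolve to a fixed address via imm; calls to
               * other BPF subprogs do not. */
              *fixed = insn->src_reg != BPF_PSEUDO_CALL;
              if (*fixed) {
                      *addr = (unsigned long)__bpf_call_base + insn->imm;
                      return 0;
              }

              /* Subprog call: the target index was stashed in the off field
               * because the final address does not fit into imm, and it is
               * only known once all subprogs have been JITed. */
              if (!prog->aux->func || insn->off < 0 ||
                  insn->off >= prog->aux->func_cnt)
                      return -EINVAL;

              *addr = (unsigned long)prog->aux->func[insn->off]->bpf_func;
              return 0;
      }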
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Sandipan Das <sandipan@linux.ibm.com>
      Tested-by: Sandipan Das <sandipan@linux.ibm.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  9. 26 Nov, 2018 2 commits
  10. 25 Nov, 2018 3 commits
  11. 24 Nov, 2018 5 commits
    • switchdev: Replace port obj add/del SDO with a notification · d17d9f5e
      Petr Machata authored
      
      
      Drop switchdev_ops.switchdev_port_obj_add and _del. Drop the uses of
      these fields from all clients, which were migrated to use the switchdev
      notification in the previous patches.
      
      Add a new function switchdev_port_obj_notify() that sends the switchdev
      notifications SWITCHDEV_PORT_OBJ_ADD and _DEL.
      
      Update switchdev_port_obj_del_now() to dispatch to this new function.
      Drop __switchdev_port_obj_add() and update switchdev_port_obj_add()
      likewise.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • switchdev: Add helpers to aid traversal through lower devices · f30f0601
      Petr Machata authored
      
      
      After the transition from switchdev operations to notifier chain (which
      will take place in following patches), the onus is on the driver to find
      its own devices below possible layer of LAG or other uppers.
      
      The logic to do so is fairly repetitive: each driver is looking for its
      own devices among the lowers of the notified device. For those that it
      finds, it calls a handler. To indicate that the event was handled,
      struct switchdev_notifier_port_obj_info.handled is set. The differences
      lie only in what constitutes an "own" device and what handler to call.
      
      Therefore abstract this logic into two helpers,
      switchdev_handle_port_obj_add() and switchdev_handle_port_obj_del(). If
      a driver only supports physical ports under a bridge device, it will
      simply avoid this layer of indirection.
      
      One area where this helper diverges from the current switchdev behavior
      is the case of mixed lowers, some of which are switchdev ports and some
      of which are not. Previously, such scenario would fail with -EOPNOTSUPP.
      The helper could do that for lowers for which the passed-in predicate
      doesn't hold. That would however break the case that switchdev ports
      from several different drivers are stashed under one master, a scenario
      that switchdev currently happily supports. Therefore tolerate any and
      all unknown netdevices, whether they are backed by a switchdev driver
      or not.
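      A hedged sketch of the traversal pattern (helper and callback names are
      illustrative, not the exact prototypes; netdev_for_each_lower_dev() is
      the existing kernel iterator):

      static int handle_port_obj_sketch(struct net_device *dev,
                                        struct switchdev_notifier_port_obj_info *info,
                                        bool (*check_cb)(const struct net_device *dev),
                                        int (*add_cb)(struct net_device *dev,
                                                      struct switchdev_notifier_port_obj_info *info))
      {
              struct net_device *lower;
              struct list_head *iter;
              int err;

              if (check_cb(dev)) {
                      /* One of the driver's own ports: handle it and record
                       * that an interested party saw the event. */
                      err = add_cb(dev, info);
                      if (!err)
                              info->handled = true;
                      return err;
              }

              /* Otherwise recurse into the lowers (e.g. LAG members); unknown
               * netdevices are tolerated rather than failing with -EOPNOTSUPP. */
              netdev_for_each_lower_dev(dev, lower, iter) {
                      err = handle_port_obj_sketch(lower, info, check_cb, add_cb);
                      if (err && err != -EOPNOTSUPP)
                              return err;
              }
              return 0;
      }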
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • switchdev: Add SWITCHDEV_PORT_OBJ_ADD, SWITCHDEV_PORT_OBJ_DEL · aa4efe21
      Petr Machata authored
      
      
      An offloading driver may need to have access to switchdev events on
      ports that aren't directly under its control. An example is a VXLAN port
      attached to a bridge offloaded by a driver. The driver needs to know
      about VLANs configured on the VXLAN device. However the VXLAN device
      isn't stashed between the bridge and a front-panel-port device (such as
      is the case e.g. for LAG devices), so the usual switchdev ops don't
      reach the driver.
      
      VXLAN is likely not the only device type like this: in theory any L2
      tunnel device that needs offloading will prompt a requirement of this
      sort. This falsifies the assumption that only the lower devices of a
      front panel port need to be notified to achieve flawless offloading.
      
      A way to fix this is to give up the notion of port object addition /
      deletion as a switchdev operation, which assumes somewhat tight coupling
      between the message producer and consumer, and instead send the message
      over a notifier chain.
      
      To that end, introduce two new switchdev notifier types,
      SWITCHDEV_PORT_OBJ_ADD and SWITCHDEV_PORT_OBJ_DEL. These notifier types
      communicate the same event as the corresponding switchdev op, except in
      a form of a notification. struct switchdev_notifier_port_obj_info was
      added to carry the fields that the switchdev op carries. An additional
      field, handled, will be used to communicate back to switchdev that the
      event has reached an interested party, which will be important for the
      two-phase commit.
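      A sketch of that notifier payload, based on the fields named in this
      description (the exact types are assumptions):

      struct port_obj_info_sketch {
              struct switchdev_notifier_info info;  /* common notifier header */
              const struct switchdev_obj *obj;      /* object being added or deleted */
              struct switchdev_trans *trans;        /* prepare/commit phase state */
              bool handled;                         /* set once a driver consumed the event */
      };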
      
      The two switchdev operations themselves are kept in place. Following
      patches first convert individual clients to the notifier protocol, and
      only then are the operations removed.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • switchdev: Add a blocking notifier chain · a93e3b17
      Petr Machata authored
      
      
      In general one can't assume that a switchdev notifier is called in a
      non-atomic context, and correspondingly, the switchdev notifier chain is
      an atomic one.
      
      However, port object addition and deletion messages are delivered from a
      process context. Even the MDB addition messages, whose delivery is
      scheduled from atomic context, are queued and the delivery itself takes
      place in blocking context. For VLAN messages in particular, keeping the
      blocking nature is important for error reporting.
      
      Therefore introduce a blocking notifier chain and related service
      functions to distribute the notifications for which a blocking context
      can be assumed.
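      A minimal sketch of such a blocking chain and its service functions
      (names here are illustrative; BLOCKING_NOTIFIER_HEAD() and the
      blocking_notifier_* calls are the existing kernel primitives):

      static BLOCKING_NOTIFIER_HEAD(switchdev_blocking_chain_sketch);

      int register_blocking_sketch(struct notifier_block *nb)
      {
              return blocking_notifier_chain_register(&switchdev_blocking_chain_sketch, nb);
      }

      int call_blocking_sketch(unsigned long val, struct net_device *dev,
                               struct switchdev_notifier_info *info)
      {
              info->dev = dev;
              /* May sleep: callers are guaranteed to run in process context. */
              return blocking_notifier_call_chain(&switchdev_blocking_chain_sketch, val, info);
      }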
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • switchdev: SWITCHDEV_OBJ_PORT_{VLAN, MDB}(): Sanitize · ec394af5
      Petr Machata authored
      
      
      The two macros SWITCHDEV_OBJ_PORT_VLAN() and SWITCHDEV_OBJ_PORT_MDB()
      expand to a container_of() call, yielding an appropriate container of
      their sole argument. However, due to a name collision, the first
      argument, i.e. the contained object pointer, is not the only one to get
      expanded. The third argument, which is a structure member name, and
      should be kept literal, gets expanded as well. The only safe way to use
      these two macros is therefore to name the local variable passed to them
      "obj".
      
      To fix this, rename the sole argument of the two macros from
      "obj" (which collides with the member name) to "OBJ". Additionally,
      instead of passing "OBJ" to container_of() verbatim, parenthesize it, so
      that a comma in the passed-in expression doesn't pollute the
      container_of() invocation.
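      A simplified illustration of the collision and the fix (not the verbatim
      kernel macros):

      /* Before: unless the caller's variable is literally named "obj", the
       * member name in the third argument gets expanded along with it. */
      #define SWITCHDEV_OBJ_PORT_VLAN_OLD(obj) \
              container_of(obj, struct switchdev_obj_port_vlan, obj)

      /* After: the argument is renamed to OBJ and parenthesized, so the
       * literal member name "obj" survives and a comma in the passed-in
       * expression can't break the expansion. */
      #define SWITCHDEV_OBJ_PORT_VLAN_NEW(OBJ) \
              container_of((OBJ), struct switchdev_obj_port_vlan, obj)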
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 23 Nov, 2018 1 commit