1. 05 Dec, 2018 1 commit
    • tcp: reduce POLLOUT events caused by TCP_NOTSENT_LOWAT · a74f0fa0
      Eric Dumazet authored
      The TCP_NOTSENT_LOWAT socket option (and sysctl) was added in linux-3.12
      as a step to enable bigger tcp sndbuf limits.
      It works reasonably well, but has one drawback:
      once the limit is reached, the TCP stack generates
      an [E]POLLOUT event for every incoming ACK packet.
      This causes a high number of context switches.
      This patch implements the strategy David Miller added
      in sock_def_write_space():
       - If a TCP socket has a notsent_lowat constraint of X bytes,
         allow sendmsg() to fill up to X bytes, but send [E]POLLOUT
         only if the number of notsent bytes is below X/2
      This considerably reduces TCP_NOTSENT_LOWAT overhead,
      while still keeping the pipe full.
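      The X versus X/2 rule can be sketched in userspace C as follows. This is
      only an illustration: struct toy_sock and toy_stream_memory_free() are
      hypothetical names, not the kernel's own code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-socket state the rule needs. */
struct toy_sock {
	uint32_t notsent_bytes;  /* bytes queued but not yet sent */
	uint32_t notsent_lowat;  /* the limit X */
};

/*
 * wake == false: sendmsg() path, the socket may be filled up to X bytes.
 * wake == true:  wakeup path, [E]POLLOUT is raised only below X/2.
 */
static bool toy_stream_memory_free(const struct toy_sock *sk, bool wake)
{
	uint32_t limit;

	assert(sk != NULL);
	limit = wake ? sk->notsent_lowat / 2 : sk->notsent_lowat;
	return sk->notsent_bytes < limit;
}
```

      With X = 2097152 and 1500000 unsent bytes, sendmsg() may still queue
      data, but no wakeup is generated until the backlog drops below 1048576.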
       100 ms RTT netem testbed between A and B, 100 concurrent TCP_STREAM
      A:/# cat /proc/sys/net/ipv4/tcp_wmem
      4096	262144	64000000
      A:/# super_netperf 100 -H B -l 1000 -- -K bbr &
      A:/# grep TCP /proc/net/sockstat
      TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 1364904 # This is about 54 MB of memory per flow :/
      A:/# vmstat 5 5
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       0  0      0 256220672  13532 694976    0    0    10     0   28   14  0  1 99  0  0
       2  0      0 256320016  13532 698480    0    0   512     0 715901 5927  0 10 90  0  0
       0  0      0 256197232  13532 700992    0    0   735    13 771161 5849  0 11 89  0  0
       1  0      0 256233824  13532 703320    0    0   512    23 719650 6635  0 11 89  0  0
       2  0      0 256226880  13532 705780    0    0   642     4 775650 6009  0 12 88  0  0
      A:/# echo 2097152 >/proc/sys/net/ipv4/tcp_notsent_lowat
      A:/# grep TCP /proc/net/sockstat
      TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 86411 # 3.5 MB per flow
      A:/# vmstat 5 5  # check that context switches have not inflated too much.
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       2  0      0 260386512  13592 662148    0    0    10     0   17   14  0  1 99  0  0
       0  0      0 260519680  13592 604184    0    0   512    13 726843 12424  0 10 90  0  0
       1  1      0 260435424  13592 598360    0    0   512    25 764645 12925  0 10 90  0  0
       1  0      0 260855392  13592 578380    0    0   512     7 722943 13624  0 11 88  0  0
       1  0      0 260445008  13592 601176    0    0   614    34 772288 14317  0 10 90  0  0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 03 Dec, 2018 3 commits
  3. 01 Dec, 2018 1 commit
    • net: reorder flowi_common fields to avoid holes · bf1c3ab8
      Paolo Abeni authored
      The flowi* structures are used and memset by several functions
      in the critical path. Currently flowi_common has a couple of holes that
      we can eliminate by reordering the struct fields. As a side effect,
      both flowi4 and flowi6 shrink by 8 bytes.
      Before:
      pahole -EC flowi_common
      struct flowi_common {
      // ...
      	/* size: 40, cachelines: 1, members: 10 */
      	/* sum members: 32, holes: 1, sum holes: 4 */
      	/* padding: 4 */
      	/* last cacheline: 40 bytes */
      pahole -EC flowi6
      struct flowi6 {
      // ...
              /* size: 88, cachelines: 2, members: 6 */
              /* padding: 4 */
              /* last cacheline: 24 bytes */
      pahole -EC flowi4
      struct flowi4 {
      // ...
              /* size: 56, cachelines: 1, members: 4 */
              /* padding: 4 */
              /* last cacheline: 56 bytes */
      After:
      struct flowi_common {
      // ...
      	/* size: 32, cachelines: 1, members: 10 */
      	/* last cacheline: 32 bytes */
      struct flowi6 {
      // ...
              /* size: 80, cachelines: 2, members: 6 */
              /* padding: 4 */
              /* last cacheline: 16 bytes */
      struct flowi4 {
      // ...
              /* size: 48, cachelines: 1, members: 4 */
              /* padding: 4 */
              /* last cacheline: 48 bytes */
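      The effect of field ordering on struct size can be reproduced with a
      small standalone example. The two structs below are hypothetical and
      are not the flowi_common layout; the hole and padding sizes in the
      comments assume a common 64-bit ABI where uint64_t is 8-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout with a hole: 4 bytes are inserted after 'a' so
 * that 'b' is aligned, and 4 bytes of tail padding follow 'c'. */
struct holey {
	uint32_t a;
	uint64_t b;
	uint32_t c;
};

/* Same members, largest first: the two 32-bit fields share one slot. */
struct reordered {
	uint64_t b;
	uint32_t a;
	uint32_t c;
};

/* Reordering must never make the struct larger. */
static_assert(sizeof(struct reordered) <= sizeof(struct holey),
	      "reordering should not grow the struct");

/* Bytes saved by the reordering on the current ABI. */
static size_t toy_bytes_saved(void)
{
	return sizeof(struct holey) - sizeof(struct reordered);
}
```

      On x86-64 this typically reports 8 bytes saved (24 versus 16), the same
      kind of gain the pahole output above shows for flowi_common.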
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 30 Nov, 2018 4 commits
  5. 26 Nov, 2018 1 commit
  6. 24 Nov, 2018 5 commits
    • switchdev: Replace port obj add/del SDO with a notification · d17d9f5e
      Petr Machata authored
      Drop switchdev_ops.switchdev_port_obj_add and _del. Drop the uses of
      this field from all clients, which were migrated to use switchdev
      notification in the previous patches.
      Add a new function switchdev_port_obj_notify() that sends the switchdev
      notifications SWITCHDEV_PORT_OBJ_ADD and _DEL.
      Update switchdev_port_obj_del_now() to dispatch to this new function.
      Drop __switchdev_port_obj_add() and update switchdev_port_obj_add()
      likewise.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • switchdev: Add helpers to aid traversal through lower devices · f30f0601
      Petr Machata authored
      After the transition from switchdev operations to a notifier chain (which
      will take place in the following patches), the onus is on the driver to
      find its own devices below a possible layer of LAG or other uppers.
      The logic to do so is fairly repetitive: each driver is looking for its
      own devices among the lowers of the notified device. For those that it
      finds, it calls a handler. To indicate that the event was handled,
      struct switchdev_notifier_port_obj_info.handled is set. The differences
      lie only in what constitutes an "own" device and what handler to call.
      Therefore abstract this logic into two helpers,
      switchdev_handle_port_obj_add() and switchdev_handle_port_obj_del(). If
      a driver only supports physical ports under a bridge device, it will
      simply avoid this layer of indirection.
      One area where this helper diverges from the current switchdev behavior
      is the case of mixed lowers, some of which are switchdev ports and some
      of which are not. Previously, such a scenario would fail with -EOPNOTSUPP.
      The helper could do that for lowers for which the passed-in predicate
      doesn't hold. That would however break the case that switchdev ports
      from several different drivers are stashed under one master, a scenario
      that switchdev currently happily supports. Therefore tolerate any and
      all unknown netdevices, whether they are backed by a switchdev driver
      or not.
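      The traversal described above can be sketched as a small recursive
      walk. Everything here is a toy model: struct toy_dev, its is_mine flag
      (standing in for the driver's predicate) and toy_handle_port_obj_add()
      are hypothetical and only mirror the shape of the logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_LOWERS 4

struct toy_dev {
	bool is_mine;                            /* result of the "own device" predicate */
	int programmed;                          /* how often the handler ran */
	struct toy_dev *lowers[TOY_MAX_LOWERS];  /* NULL-terminated lower devices */
};

/*
 * Visit 'dev' and, if it is not an own device, its lowers.  The handler
 * runs on every own device found and *handled records that the event
 * reached an interested party.  Foreign devices are skipped, not rejected.
 */
static int toy_handle_port_obj_add(struct toy_dev *dev, bool *handled)
{
	assert(dev != NULL && handled != NULL);

	if (dev->is_mine) {
		dev->programmed++;   /* driver-specific handler goes here */
		*handled = true;
		return 0;
	}

	for (size_t i = 0; i < TOY_MAX_LOWERS && dev->lowers[i]; i++) {
		int err = toy_handle_port_obj_add(dev->lowers[i], handled);

		if (err)
			return err;
	}
	return 0;
}
```

      A bridge over a LAG that mixes one own port and one foreign port is
      handled without error: the own port is programmed and the foreign one
      is left alone.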
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • switchdev: Add SWITCHDEV_PORT_OBJ_ADD, SWITCHDEV_PORT_OBJ_DEL · aa4efe21
      Petr Machata authored
      An offloading driver may need to have access to switchdev events on
      ports that aren't directly under its control. An example is a VXLAN port
      attached to a bridge offloaded by a driver. The driver needs to know
      about VLANs configured on the VXLAN device. However the VXLAN device
      isn't stashed between the bridge and a front-panel-port device (such as
      is the case e.g. for LAG devices), so the usual switchdev ops don't
      reach the driver.
      VXLAN is likely not the only device type like this: in theory any L2
      tunnel device that needs offloading will prompt a requirement of this
      sort. This falsifies the assumption that only the lower devices of a
      front panel port need to be notified to achieve flawless offloading.
      A way to fix this is to give up the notion of port object addition /
      deletion as a switchdev operation, which assumes somewhat tight coupling
      between the message producer and consumer, and instead send the message
      over a notifier chain.
      To that end, introduce two new switchdev notifier types,
      SWITCHDEV_PORT_OBJ_ADD and SWITCHDEV_PORT_OBJ_DEL. These notifier types
      communicate the same event as the corresponding switchdev op, except in
      the form of a notification. struct switchdev_notifier_port_obj_info was
      added to carry the fields that the switchdev op carries. An additional
      field, handled, will be used to communicate back to switchdev that the
      event has reached an interested party, which will be important for the
      two-phase commit.
      The two switchdev operations themselves are kept in place. Following
      patches first convert individual clients to the notifier protocol, and
      only then are the operations removed.
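      The role of the handled field can be shown with a toy notifier chain.
      This is a userspace sketch under stated assumptions: the toy_* types
      and functions are hypothetical, and returning -EOPNOTSUPP when nobody
      handled the event is an assumption made for the illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_event { TOY_PORT_OBJ_ADD, TOY_PORT_OBJ_DEL };

struct toy_port_obj_info {
	int port_id;   /* device the event is about */
	int vid;       /* payload the old switchdev op would have carried */
	bool handled;  /* set by any listener that acted on the event */
};

typedef int (*toy_notifier_fn)(enum toy_event ev,
			       struct toy_port_obj_info *info);

/* Deliver the event to every listener; stop at the first error. */
static int toy_port_obj_notify(const toy_notifier_fn *chain, size_t n,
			       enum toy_event ev,
			       struct toy_port_obj_info *info)
{
	assert(info != NULL);

	info->handled = false;
	for (size_t i = 0; i < n; i++) {
		int err = chain[i](ev, info);

		if (err)
			return err;
	}
	return info->handled ? 0 : -EOPNOTSUPP;
}

/* Example listener: a driver that only cares about port 7. */
static int toy_driver_cb(enum toy_event ev, struct toy_port_obj_info *info)
{
	(void)ev;
	if (info->port_id != 7)
		return 0;   /* not ours: leave 'handled' untouched */
	info->handled = true;
	return 0;
}
```

      Because the producer no longer calls a specific consumer, the handled
      flag is the only way for it to learn whether anyone was interested.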
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • switchdev: Add a blocking notifier chain · a93e3b17
      Petr Machata authored
      In general one can't assume that a switchdev notifier is called in a
      non-atomic context, and correspondingly, the switchdev notifier chain is
      an atomic one.
      However, port object addition and deletion messages are delivered from a
      process context. Even the MDB addition messages, whose delivery is
      scheduled from atomic context, are queued and the delivery itself takes
      place in blocking context. For VLAN messages in particular, keeping the
      blocking nature is important for error reporting.
      Therefore introduce a blocking notifier chain and related service
      functions to distribute the notifications for which a blocking context
      can be assumed.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • switchdev: SWITCHDEV_OBJ_PORT_{VLAN, MDB}(): Sanitize · ec394af5
      Petr Machata authored
      SWITCHDEV_OBJ_PORT_VLAN() and SWITCHDEV_OBJ_PORT_MDB() expand to a
      container_of() call, yielding an appropriate container of
      their sole argument. However, due to a name collision, the first
      argument, i.e. the contained object pointer, is not the only one to get
      expanded. The third argument, which is a structure member name, and
      should be kept literal, gets expanded as well. The only safe way to use
      these two macros is therefore to name the local variable passed to them
      "obj".
      To fix this, rename the sole argument of the two macros from
      "obj" (which collides with the member name) to "OBJ". Additionally,
      instead of passing "OBJ" to container_of() verbatim, parenthesize it, so
      that a comma in the passed-in expression doesn't pollute the
      container_of() invocation.
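      The collision and the fix can be reproduced outside the kernel. In the
      sketch below, container_of() is a simplified userspace stand-in, and
      struct toy_obj, struct toy_obj_vlan and TOY_OBJ_VLAN() are hypothetical
      names chosen for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_obj {
	int id;
};

struct toy_obj_vlan {
	struct toy_obj obj;   /* the member is named "obj" */
	int vid_begin;
};

/*
 * Fragile form, shown only as a comment:
 *   #define TOY_OBJ_VLAN(obj) container_of(obj, struct toy_obj_vlan, obj)
 * With a caller variable 'p', the member name is substituted too, giving
 *   container_of(p, struct toy_obj_vlan, p)
 * which only compiles when the variable happens to be called "obj".
 */

/* Robust form: distinct parameter name, parenthesized where it is used. */
#define TOY_OBJ_VLAN(OBJ) container_of((OBJ), struct toy_obj_vlan, obj)

static int toy_vid_begin(struct toy_obj *p)
{
	assert(p != NULL);
	return TOY_OBJ_VLAN(p)->vid_begin;   /* works for any variable name */
}
```

      With the renamed parameter, callers are free to pass any expression,
      including ones such as &v.obj that are not plain identifiers.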
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 22 Nov, 2018 2 commits
    • vxlan: Add hardware FDB learning · 5728ae0d
      Petr Machata authored
      In order to allow devices to signal learning events to VXLAN, introduce
      two new switchdev messages: SWITCHDEV_VXLAN_FDB_ADD_TO_BRIDGE and
      SWITCHDEV_VXLAN_FDB_DEL_TO_BRIDGE.
      Listen to these notifications in the vxlan driver. The FDB entries
      learned this way have an NTF_EXT_LEARNED flag, and only entries marked
      as such can be unlearned by the _DEL_ event. They are also immediately
      marked as offloaded. This is the same behavior that the bridge driver
      observes.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan: Mark user-added FDB entries · 45598c1c
      Petr Machata authored
      The VXLAN driver needs to differentiate between FDB entries learned by
      the VXLAN driver, and those added by the user. The latter ones shouldn't
      be taken over by external learning events. This is in accordance with
      bridge behavior.
      Therefore, extend the flags bitfield to 16 bits and add a new private
      NTF flag to mark the user-added entries.
      This seems preferable to adding a dedicated boolean, because passing the
      flag, unlike passing e.g. a true, makes it clear what the meaning of the
      bit is.
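      A minimal sketch of the idea, assuming a 16-bit flags field and a
      private flag that does not fit in the low 8 bits. The flag names and
      values, struct toy_fdb and toy_fdb_ext_learn() are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NTF_EXT_LEARNED 0x0010u  /* hypothetical value */
#define TOY_NTF_USER_ADDED  0x0100u  /* hypothetical private flag, needs > 8 bits */

struct toy_fdb {
	uint16_t flags;   /* widened so the private flag fits */
	int remote;       /* stand-in for the entry's destination */
};

/*
 * Apply an external learning event.  Entries the user added are left
 * alone; other entries are updated and marked as externally learned.
 * Returns true if the entry was taken over.
 */
static bool toy_fdb_ext_learn(struct toy_fdb *f, int remote)
{
	assert(f != NULL);

	if (f->flags & TOY_NTF_USER_ADDED)
		return false;

	f->remote = remote;
	f->flags |= TOY_NTF_EXT_LEARNED;
	return true;
}
```

      Passing a named flag at the call site, rather than a bare true, keeps
      the intent visible to the reader, which is the argument made above.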
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 20 Nov, 2018 3 commits
  9. 19 Nov, 2018 4 commits
  10. 18 Nov, 2018 1 commit
  11. 17 Nov, 2018 1 commit
  12. 15 Nov, 2018 2 commits
    • net: get rid of __tcp_checksum_complete() · 6ab6dfa6
      Cong Wang authored
      __tcp_checksum_complete() is identical to __skb_checksum_complete(),
      and it has no caller other than tcp_checksum_complete().
      So, just use __skb_checksum_complete() there.
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rxrpc: Fix life check · 7150ceaa
      David Howells authored
      The life-checking function, which is used by kAFS to make sure that a call
      is still live in the event of a pending signal, only samples the received
      packet serial number counter; it doesn't actually provoke a change in the
      counter, rather relying on the server to happen to give us a packet in the
      time window.
      Fix this by adding a function to force a ping to be transmitted.
      kAFS then keeps track of whether there's been a stall, and if so, uses the
      new function to ping the server, resetting the timeout to allow the reply
      to come back.
      If there's a stall, a ping and the call is *still* stalled in the same
      place after another period, then the call will be aborted.
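      The resulting check, ping, abort sequence can be modelled as a small
      state machine. This is a sketch only: the toy_* names are hypothetical,
      and the caller is assumed to invoke the check once per timeout period.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum toy_verdict { TOY_ALIVE, TOY_PINGED, TOY_ABORT };

struct toy_call_watch {
	uint32_t last_serial;  /* counter value seen at the previous check */
	bool stalled;          /* a ping has already been sent for this stall */
};

/* Called once per timeout period with the current received-packet counter. */
static enum toy_verdict toy_check_life(struct toy_call_watch *w,
				       uint32_t serial)
{
	assert(w != NULL);

	if (serial != w->last_serial) {
		/* Progress was made: remember it and clear any stall. */
		w->last_serial = serial;
		w->stalled = false;
		return TOY_ALIVE;
	}
	if (!w->stalled) {
		/* First stall: the caller pings and re-arms the timeout. */
		w->stalled = true;
		return TOY_PINGED;
	}
	/* Still stalled in the same place after a ping: give up. */
	return TOY_ABORT;
}
```

      The ping matters because sampling alone never provokes the counter to
      change; a reply to the ping advances it if the server is still there.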
      Fixes: bc5e3a54 ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
      Fixes: f4d15fb6 ("rxrpc: Provide functions for allowing cleaner handling of signals")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 14 Nov, 2018 4 commits
  14. 12 Nov, 2018 2 commits
  15. 11 Nov, 2018 1 commit
    • net: sched: register callbacks for indirect tc block binds · 7f76fa36
      John Hurley authored
      Currently drivers can register to receive TC block bind/unbind callbacks
      by implementing the setup_tc ndo in any of their given netdevs. However,
      drivers may also be interested in binds to higher level devices (e.g.
      tunnel drivers) to potentially offload filters applied to them.
      Introduce indirect block devs, which allow drivers to register callbacks
      for block binds on other devices. The callback is triggered when the
      device is bound to a block, allowing the driver to register for rules
      applied to that block using already available functions.
      Freeing an indirect block callback will trigger an unbind event (if
      necessary) to direct the driver to remove any offloaded rules and unreg
      any block rule callbacks. It is the responsibility of the implementing
      driver to clean any registered indirect block callbacks before exiting,
      if the block is still active at such a time.
      Allow registering an indirect block dev callback for a device that is
      already bound to a block. In this case (if it is an ingress block),
      register and also trigger the callback meaning that any already installed
      rules can be replayed to the calling driver.
      Signed-off-by: John Hurley <john.hurley@netronome.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 09 Nov, 2018 5 commits
    • net: sched: red: inform offloads about harddrop setting · 190852a5
      Jakub Kicinski authored
      To mirror software behaviour on offload more precisely, inform
      the drivers about the state of the harddrop flag.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: John Hurley <john.hurley@netronome.com>
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: Support for error handlers of tunnels with arbitrary destination port · e7cc0824
      Stefano Brivio authored
      ICMP error handling is currently not possible for UDP tunnels not
      employing a receiving socket with local destination port matching the
      remote one, because we have no way to look them up.
      Add an err_handler tunnel encapsulation operation that can be exported by
      tunnels in order to pass the error to the protocol implementing the
      encapsulation. We can't easily use a lookup function as we did for VXLAN
      and GENEVE, as protocol error handlers, which would be in turn called by
      implementations of this new operation, handle the errors themselves,
      together with the tunnel lookup.
      Without a socket, we can't be sure which encapsulation error handler is
      the appropriate one: encapsulation handlers (the ones for FoU and GUE
      introduced in the next patch, e.g.) will need to check the new error codes
      returned by protocol handlers to figure out if errors match the given
      encapsulation, and, in turn, report this error back, so that we can try
      all of them in __udp{4,6}_lib_err_encap_no_sk() until we have a match.
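      The "try every handler until one matches" loop can be sketched as
      follows. The handler table, the toy_* names and the integer
      encapsulation tag are hypothetical; only the control flow is the
      point, and -ENOENT as the "no match" code is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* A handler returns 0 if the error belongs to its encapsulation. */
typedef int (*toy_err_handler)(int encap_tag);

static int toy_fou_err(int encap_tag)
{
	return encap_tag == 1 ? 0 : -ENOENT;
}

static int toy_gue_err(int encap_tag)
{
	return encap_tag == 2 ? 0 : -ENOENT;
}

/*
 * Without a socket there is no way to pick the right handler up front,
 * so each registered handler is tried in turn until one reports a match.
 */
static int toy_err_encap_no_sk(const toy_err_handler *handlers, size_t n,
			       int encap_tag)
{
	assert(handlers != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (handlers[i] && handlers[i](encap_tag) == 0)
			return 0;
	}
	return -ENOENT;
}
```

      This only works because the handlers now return a status; with void
      handlers the loop could not tell a match from a miss.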
      - Name all arguments in err_handler prototypes (David Miller)
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Convert protocol error handlers from void to int · 32bbd879
      Stefano Brivio authored
      We'll need this to handle ICMP errors for tunnels without a sending socket
      (i.e. FoU and GUE). There, we might have to look up different types of IP
      tunnels, registered as network protocols, before we get a match, so we
      want this for the error handlers of IPPROTO_IPIP and IPPROTO_IPV6 in both
      inet_protos and inet6_protos. These error codes will be used in the next
      patch.
      For consistency, return sensible error codes in protocol error handlers
      whenever handlers can't handle errors because, even if valid, they don't
      match a protocol or any of its states.
      This has no effect on existing error handling paths.
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan: Allow configuration of DF behaviour · b4d30697
      Stefano Brivio authored
      Allow users to set the IPv4 DF bit in outgoing packets, or to inherit its
      value from the IPv4 inner header. If the encapsulated protocol is IPv6 and
      DF is configured to be inherited, always set it.
      For IPv4, inheriting DF from the inner header was probably intended from
      the very beginning judging by the comment to vxlan_xmit(), but it wasn't
      actually implemented -- also because it would have done more harm than
      good, without handling for ICMP Fragmentation Needed messages.
      According to RFC 7348, "Path MTU discovery MAY be used". An expired RFC
      draft, draft-saum-nvo3-pmtud-over-vxlan-05, whose purpose was to describe
      PMTUD implementation, says that "is a MUST that Vxlan gateways [...]
      SHOULD set the DF-bit [...]", whatever that means.
      Given this background, the only sane option is probably to let the user
      decide, and keep the current behaviour as default.
      This only applies to non-lwt tunnels: if an external control plane is
      used, tunnel key will still control the DF flag.
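      The decision described above can be written as a small pure function.
      The enum and function names are hypothetical, and treating "unset" as
      the default that leaves DF clear is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_df_mode { TOY_DF_UNSET, TOY_DF_SET, TOY_DF_INHERIT };
enum toy_inner { TOY_INNER_IPV4, TOY_INNER_IPV6 };

/*
 * Decide the DF bit of the outer IPv4 header.
 * inner_df is only meaningful when the inner protocol is IPv4.
 */
static bool toy_outer_df(enum toy_df_mode mode, enum toy_inner inner,
			 bool inner_df)
{
	switch (mode) {
	case TOY_DF_SET:
		return true;
	case TOY_DF_INHERIT:
		/* IPv6 has no DF bit to copy, so always set it. */
		return inner == TOY_INNER_IPV6 ? true : inner_df;
	case TOY_DF_UNSET:
	default:
		assert(mode == TOY_DF_UNSET);
		return false;
	}
}
```

      Keeping the default at "unset" means existing configurations behave
      exactly as before unless the user opts in to one of the other modes.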
      - DF behaviour configuration only applies for non-lwt tunnels, move DF
        setting to if (!info) block in vxlan_xmit_one() (Stephen Hemminger)
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: Handle ICMP errors for tunnels with same destination port on both endpoints · a36e185e
      Stefano Brivio authored
      For both IPv4 and IPv6, if we can't match errors to a socket, try
      tunnels before ignoring them. Look up a socket with the original source
      and destination ports as found in the UDP packet inside the ICMP payload;
      this will work for tunnels that force the same destination port for both
      endpoints, i.e. VXLAN and GENEVE.
      Actually, lwtunnels could break this assumption if they are configured by
      an external control plane to have different destination ports on the
      endpoints: in this case, we won't be able to trace ICMP messages back to
      them.
      For IPv6 redirect messages, call ip6_redirect() directly with the output
      interface argument set to the interface we received the packet from (as
      it's the very interface we should build the exception on), otherwise the
      new nexthop will be rejected. There's no such need for IPv4.
      Tunnels can now export an encap_err_lookup() operation that indicates a
      match. Pass the packet to the lookup function, and if the tunnel driver
      reports a matching association, continue with regular ICMP error handling.
      - Added newline between network and transport header sets in
        __udp{4,6}_lib_err_encap() (David Miller)
      - Removed redundant skb_reset_network_header(skb); in
      - Removed redundant reassignment of iph in __udp4_lib_err_encap()
        (Sabrina Dubroca)
      - Edited comment to __udp{4,6}_lib_err_encap() to reflect the fact this
        won't work with lwtunnels configured to use asymmetric ports. By the way,
        it's VXLAN, not VxLAN (Jiri Benc)
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>