1. 05 Dec, 2018 1 commit
    • tcp: reduce POLLOUT events caused by TCP_NOTSENT_LOWAT · a74f0fa0
      Eric Dumazet authored

      TCP_NOTSENT_LOWAT socket option or sysctl was added in linux-3.12
      as a step to enable bigger tcp sndbuf limits.
      
      It works reasonably well, but the following happens:
      
      Once the limit is reached, TCP stack generates
      an [E]POLLOUT event for every incoming ACK packet.
      
      This causes a high number of context switches.
      
      This patch implements the strategy David Miller added
      in sock_def_write_space():
      
       - If a TCP socket has a notsent_lowat constraint of X bytes,
         allow sendmsg() to fill up to X bytes, but send [E]POLLOUT
         only if the number of notsent bytes is below X/2
      
      This considerably reduces TCP_NOTSENT_LOWAT overhead,
      while still keeping the pipe full.
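
      For context, a minimal userspace sketch (not part of the patch;
      the socket setup, the 128 kB limit and the epoll loop are
      illustrative) of how an application opts into TCP_NOTSENT_LOWAT
      and waits for writability:

          #include <netinet/in.h>
          #include <netinet/tcp.h>
          #include <sys/epoll.h>
          #include <sys/socket.h>

          int main(void)
          {
              int fd = socket(AF_INET, SOCK_STREAM, 0);
              int ep = epoll_create1(0);
              int lowat = 128 * 1024;  /* arbitrary 128 kB limit */
              struct epoll_event ev = { .events = EPOLLOUT };

              /* Cap unsent data; sendmsg() may queue up to this much. */
              setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
                         &lowat, sizeof(lowat));
              ev.data.fd = fd;
              epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);
              /* ... connect(), then loop: epoll_wait() reports EPOLLOUT
               * once unsent bytes drop below the limit (below half of
               * it after this patch), not on every incoming ACK. */
              return 0;
          }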
      
      Tested:
       100 ms RTT netem testbed between A and B, 100 concurrent TCP_STREAM
      
      A:/# cat /proc/sys/net/ipv4/tcp_wmem
      4096	262144	64000000
      A:/# super_netperf 100 -H B -l 1000 -- -K bbr &
      
      A:/# grep TCP /proc/net/sockstat
      TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 1364904 # This is about 54 MB of memory per flow :/
      
      A:/# vmstat 5 5
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       0  0      0 256220672  13532 694976    0    0    10     0   28   14  0  1 99  0  0
       2  0      0 256320016  13532 698480    0    0   512     0 715901 5927  0 10 90  0  0
       0  0      0 256197232  13532 700992    0    0   735    13 771161 5849  0 11 89  0  0
       1  0      0 256233824  13532 703320    0    0   512    23 719650 6635  0 11 89  0  0
       2  0      0 256226880  13532 705780    0    0   642     4 775650 6009  0 12 88  0  0
      
      A:/# echo 2097152 >/proc/sys/net/ipv4/tcp_notsent_lowat
      
      A:/# grep TCP /proc/net/sockstat
      TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 86411 # 3.5 MB per flow
      
      A:/# vmstat 5 5  # check that context switches have not inflated too much.
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       2  0      0 260386512  13592 662148    0    0    10     0   17   14  0  1 99  0  0
       0  0      0 260519680  13592 604184    0    0   512    13 726843 12424  0 10 90  0  0
       1  1      0 260435424  13592 598360    0    0   512    25 764645 12925  0 10 90  0  0
       1  0      0 260855392  13592 578380    0    0   512     7 722943 13624  0 11 88  0  0
       1  0      0 260445008  13592 601176    0    0   614    34 772288 14317  0 10 90  0  0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 04 Dec, 2018 1 commit
    • skbuff: Rename 'offload_mr_fwd_mark' to 'offload_l3_fwd_mark' · 875e8939
      Ido Schimmel authored
      Commit abf4bb6b ("skbuff: Add the offload_mr_fwd_mark field") added
      the 'offload_mr_fwd_mark' field to indicate that a packet has already
      undergone L3 multicast routing by a capable device. The field is used to
      prevent the kernel from forwarding a packet through a netdev through
      which the device has already forwarded the packet.
      
      Currently, no unicast packet is routed by both the device and the
      kernel, but this is about to change in subsequent patches and we need
      to be able to mark such packets, so that they will not be forwarded
      twice.
      
      Instead of adding yet another field to 'struct sk_buff', we can just
      rename 'offload_mr_fwd_mark' to 'offload_l3_fwd_mark', as a packet
      either has a multicast or a unicast destination IP.
      
      While at it, add a comment about both 'offload_fwd_mark' and
      'offload_l3_fwd_mark'.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 03 Dec, 2018 3 commits
    • udp: elide zerocopy operation in hot path · 52900d22
      Willem de Bruijn authored

      With MSG_ZEROCOPY, each skb holds a reference to a struct ubuf_info.
      Release of its last reference triggers a completion notification.
      
      The TCP stack in tcp_sendmsg_locked holds an extra ref independent of
      the skbs, because it can build, send and free skbs within its loop,
      possibly reaching refcount zero and freeing the ubuf_info too soon.
      
      The UDP stack currently also takes this extra ref, but does not need
      it as all skbs are sent after return from __ip(6)_append_data.
      
      Avoid the extra refcount_inc and refcount_dec_and_test, and generally
      the sock_zerocopy_put in the common path, by passing the initial
      reference to the first skb.
      
      This approach is taken instead of initializing the refcount to 0, as
      that would generate error "refcount_t: increment on 0" on the
      next skb_zcopy_set.
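
      As a toy model of the hand-off in plain userspace C (not the
      kernel code; 'ubuf' and 'attach' stand in for struct ubuf_info
      and skb_zcopy_set): the initial reference is passed to the first
      skb instead of being taken and dropped separately:

          #include <stdatomic.h>
          #include <stdbool.h>
          #include <stdio.h>
          #include <stdlib.h>

          struct ubuf { atomic_int refcnt; };

          static void attach(struct ubuf *u, bool first)
          {
              /* The first skb inherits the initial reference;
               * later skbs take their own. */
              if (!first)
                  atomic_fetch_add(&u->refcnt, 1);
          }

          static void release(struct ubuf *u)
          {
              if (atomic_fetch_sub(&u->refcnt, 1) == 1) {
                  printf("completion notification\n");
                  free(u);
              }
          }

          int main(void)
          {
              struct ubuf *u = malloc(sizeof(*u));

              atomic_init(&u->refcnt, 1);  /* initial reference */
              attach(u, true);             /* skb 1: no extra inc */
              attach(u, false);            /* skb 2 */
              release(u);                  /* skb 1 freed */
              release(u);                  /* skb 2 freed: notify */
              return 0;
          }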
      
      Changes
        v3 -> v4
          - Move skb_zcopy_set below the only kfree_skb that might cause
            a premature uarg destroy before skb_zerocopy_put_abort
          - Move the entire skb_shinfo assignment block, to keep that
            cacheline access in one place
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: msg_zerocopy · b5947e5d
      Willem de Bruijn authored
      Extend zerocopy to udp sockets. Allow setting sockopt SO_ZEROCOPY and
      interpret flag MSG_ZEROCOPY.
      
      This patch was previously part of the zerocopy RFC patchsets. Zerocopy
      is not effective at small MTU. With segmentation offload building
      larger datagrams, the benefit of page flipping outweighs the cost of
      generating a completion notification.
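
      A minimal userspace sketch of the new UDP path (assuming a libc
      that exposes SO_ZEROCOPY and MSG_ZEROCOPY; destination and sizes
      are arbitrary); completions are reaped from the error queue, as
      with TCP zerocopy:

          #include <arpa/inet.h>
          #include <netinet/in.h>
          #include <sys/socket.h>

          int main(void)
          {
              int fd = socket(AF_INET, SOCK_DGRAM, 0);
              int one = 1;
              static char buf[16000];  /* large datagrams benefit most */
              struct sockaddr_in dst = {
                  .sin_family = AF_INET,
                  .sin_port = htons(9000),  /* arbitrary */
                  .sin_addr.s_addr = htonl(INADDR_LOOPBACK),
              };

              setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
              sendto(fd, buf, sizeof(buf), MSG_ZEROCOPY,
                     (struct sockaddr *)&dst, sizeof(dst));
              /* buf may only be reused after the completion arrives:
               * recvmsg(fd, ..., MSG_ERRQUEUE) -> SO_EE_ORIGIN_ZEROCOPY */
              return 0;
          }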
      
      tools/testing/selftests/net/msg_zerocopy.sh after applying follow-on
      test patch and making skb_orphan_frags_rx same as skb_orphan_frags:
      
          ipv4 udp -t 1
          tx=191312 (11938 MB) txc=0 zc=n
          rx=191312 (11938 MB)
          ipv4 udp -z -t 1
          tx=304507 (19002 MB) txc=304507 zc=y
          rx=304507 (19002 MB)
          ok
          ipv6 udp -t 1
          tx=174485 (10888 MB) txc=0 zc=n
          rx=174485 (10888 MB)
          ipv6 udp -z -t 1
          tx=294801 (18396 MB) txc=294801 zc=y
          rx=294801 (18396 MB)
          ok
      
      Changes
        v1 -> v2
          - Fixup reverse christmas tree violation
        v2 -> v3
          - Split refcount avoidance optimization into separate patch
          - Fix refcount leak on error in fragmented case
            (thanks to Paolo Abeni for pointing this one out!)
          - Fix refcount inc on zero
          - Test sock_flag SOCK_ZEROCOPY directly in __ip_append_data.
            This is needed since commit 5cf4a853 ("tcp: really ignore
            MSG_ZEROCOPY if no SO_ZEROCOPY") did the same for tcp.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • devlink: Add 'fw_load_policy' generic parameter · 846e980a
      Shalom Toledo authored

      Many drivers load the device's firmware image during the initialization
      flow, either from flash or from disk. Currently this choice is not
      controlled by the user; the driver decides where to load the firmware
      image from.
      
      'fw_load_policy' gives the ability to control this option which allows the
      user to choose between different loading policies supported by the driver.
      
      This parameter can be useful while testing and/or debugging the device. For
      example, testing a firmware bug fix.
      Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 30 Nov, 2018 3 commits
    • rtnetlink: avoid frame size warning in rtnl_newlink() · a2939745
      Jakub Kicinski authored

      Standard kernel compilation produces the following warning:
      
      net/core/rtnetlink.c: In function ‘rtnl_newlink’:
      net/core/rtnetlink.c:3232:1: warning: the frame size of 1288 bytes is larger than 1024 bytes [-Wframe-larger-than=]
       }
        ^
      
      This should not really be an issue, as rtnl_newlink() stack is
      generally quite shallow.
      
      Fix the warning by allocating attributes with kmalloc() in a wrapper
      and passing it down to rtnl_newlink(), avoiding complexities on error
      paths.
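
      The same pattern as a minimal userspace sketch (names and sizes
      illustrative): the large array moves to the heap in a thin
      wrapper, so the worker function and its early returns stay
      untouched:

          #include <errno.h>
          #include <stdlib.h>

          #define MAX_ATTRS 128  /* stand-in for the attribute count */

          struct attr { const void *data; int len; };

          /* Original body, unchanged: free to return early on error. */
          static int do_newlink(struct attr *tb)
          {
              (void)tb;  /* ... parse attributes, create the link ... */
              return 0;
          }

          /* Thin wrapper: one allocation, one place that frees it. */
          static int newlink(void)
          {
              struct attr *tb = calloc(MAX_ATTRS, sizeof(*tb));
              int ret;

              if (!tb)
                  return -ENOMEM;
              ret = do_newlink(tb);
              free(tb);
              return ret;
          }

          int main(void) { return newlink(); }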
      
      Alternatively we could kmalloc() some structure within rtnl_newlink();
      slave attributes look like a good candidate.  In practice that would
      add to the already rather high complexity and length of the function.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rtnetlink: remove a level of indentation in rtnl_newlink() · 420d0318
      Jakub Kicinski authored
      rtnl_newlink() used to create VLAs based on link kind.  Since
      commit ccf8dbcd ("rtnetlink: Remove VLA usage") a statically
      sized array is created on the stack, so there is no more use
      for a separate code block that used to be the VLA's live range.
      
      While at it, christmas-tree the variables.  Note that there is
      a goto-based retry, so to be on the safe side the variables can
      no longer be initialized in place.  It doesn't seem to matter
      logically, but why make the code harder to read?
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add trace events for all receive exit points · b0e3f1bd
      Geneviève Bastien authored

      Trace events are already present for the receive entry points, to indicate
      how the reception entered the stack.
      
      This patch adds the corresponding exit trace events that will bound the
      reception such that all events occurring between the entry and the exit
      can be considered as part of the reception context. This greatly helps
      for dependency and root cause analyses.
      
      Without this, it is not possible with tracepoint instrumentation to
      determine whether a sched_wakeup event following a netif_receive_skb
      event is the result of the packet reception or a simple coincidence after
      further processing by the thread. It is possible using other mechanisms
      like kretprobes, but considering the "entry" points are already present,
      it would be good to add the matching exit events.
      
      In addition to linking packets with wakeups, the entry/exit event pair
      can also be used to perform network stack latency analyses.
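
      A hedged sketch of consuming such an entry/exit pair from
      userspace via tracefs (the exit event name mirrors the existing
      entry event, as this patch proposes; the tracefs mount path
      varies by system):

          #include <fcntl.h>
          #include <stdio.h>
          #include <unistd.h>

          static void enable_event(const char *ev)
          {
              char path[256];
              int fd;

              snprintf(path, sizeof(path),
                       "/sys/kernel/debug/tracing/events/net/%s/enable",
                       ev);
              fd = open(path, O_WRONLY);
              if (fd >= 0) {
                  (void)write(fd, "1", 1);
                  close(fd);
              }
          }

          int main(void)
          {
              enable_event("netif_receive_skb_entry");
              enable_event("netif_receive_skb_exit");  /* new */
              return 0;
          }
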
      Signed-off-by: Geneviève Bastien <gbastien@versatic.net>
      CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      CC: Steven Rostedt <rostedt@goodmis.org>
      CC: Ingo Molnar <mingo@redhat.com>
      CC: David S. Miller <davem@davemloft.net>
      Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> (tracing side)
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 29 Nov, 2018 1 commit
  6. 28 Nov, 2018 6 commits
  7. 27 Nov, 2018 1 commit
    • net: bridge: add support for user-controlled bool options · a428afe8
      Nikolay Aleksandrov authored
      We have been adding many new bridge options, a big number of which are
      boolean but still take up netlink attribute ids and waste space in the skb.
      Recently we discussed learning from link-local packets[1] and decided
      yet another new boolean option will be needed, thus introducing this API
      to save some bridge nl space.
      The API supports changing the value of multiple boolean options at once
      via the br_boolopt_multi struct which has an optmask (which options to
      set, bit per opt) and optval (options' new values). Future boolean
      options will only be added to the br_boolopt_id enum and then will have
      to be handled in br_boolopt_toggle/get. The API will automatically
      add the ability to change and export them via netlink; sysfs can use
      the single boolopt function versions to do the same. The behaviour on
      failure/success is the same as with normal netlink option changing.
      
      If an option requires mapping to an internal kernel flag or needs
      special configuration to be enabled, then it should be handled in
      br_boolopt_toggle. An option's current state is retrieved via
      br_boolopt_get.
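
      A sketch of the multi-option shape described above (field layout
      as proposed; the option id is purely illustrative):

          #include <stdint.h>
          #include <stdio.h>

          struct br_boolopt_multi {
              uint32_t optval;   /* new values, one bit per option */
              uint32_t optmask;  /* options to set, one bit per option */
          };

          enum { BR_BOOLOPT_EXAMPLE = 0 };  /* illustrative id */

          int main(void)
          {
              struct br_boolopt_multi bm = { 0 };

              bm.optmask = 1u << BR_BOOLOPT_EXAMPLE;  /* touch this one */
              bm.optval |= 1u << BR_BOOLOPT_EXAMPLE;  /* ... enable it */
              printf("optmask=%#x optval=%#x\n", bm.optmask, bm.optval);
              return 0;
          }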
      
      v2: WARN_ON() on unsupported option as that shouldn't be possible and
          also will help catch people who add new options without handling
          them for both set and get. Pass down extack so if an option desires
          it could set it on error and be more user-friendly.
      
      [1] https://www.spinics.net/lists/netdev/msg532698.html
      
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 26 Nov, 2018 1 commit
  9. 25 Nov, 2018 1 commit
    • net: remove unsafe skb_insert() · 4bffc669
      Eric Dumazet authored

      I do not see how one can effectively use skb_insert() without holding
      some kind of lock. Otherwise other cpus could have changed the list
      right before we have a chance of acquiring list->lock.
      
      The only existing user is in drivers/infiniband/hw/nes/nes_mgt.c, and
      it probably meant to use __skb_insert(), since it appears nesqp->pau_list
      is protected by nesqp->pau_lock. It looks like nesqp->pau_lock
      could be removed, since nesqp->pau_list.lock could be used instead.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Faisal Latif <faisal.latif@intel.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: linux-rdma <linux-rdma@vger.kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 23 Nov, 2018 2 commits
  11. 22 Nov, 2018 1 commit
  12. 21 Nov, 2018 1 commit
    • net: skb_scrub_packet(): Scrub offload_fwd_mark · b5dd186d
      Petr Machata authored
      When a packet is trapped and the corresponding SKB marked as
      already-forwarded, it retains this marking even after it is forwarded
      across veth links into another bridge. There, since it ingresses the
      bridge over veth, which doesn't have offload_fwd_mark, it triggers a
      warning in nbp_switchdev_frame_mark().
      
      Then nbp_switchdev_allowed_egress() decides not to allow egress from
      this bridge through another veth, because the SKB is already marked, and
      the mark (of 0) of course matches. Thus the packet is incorrectly
      blocked.
      
      Solve by resetting offload_fwd_mark in skb_scrub_packet(). That
      function is called from tunnels and also from veth, and thus catches the
      cases where traffic is forwarded between bridges and transformed in a
      way that invalidates the marking.
      
      Fixes: 6bc506b4 ("bridge: switchdev: Add forward mark support for stacked devices")
      Fixes: abf4bb6b ("skbuff: Add the offload_mr_fwd_mark field")
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Suggested-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 20 Nov, 2018 1 commit
    • net: skb_scrub_packet(): Scrub offload_fwd_mark · 6f9a5069
      Petr Machata authored

      When a packet is trapped and the corresponding SKB marked as
      already-forwarded, it retains this marking even after it is forwarded
      across veth links into another bridge. There, since it ingresses the
      bridge over veth, which doesn't have offload_fwd_mark, it triggers a
      warning in nbp_switchdev_frame_mark().
      
      Then nbp_switchdev_allowed_egress() decides not to allow egress from
      this bridge through another veth, because the SKB is already marked, and
      the mark (of 0) of course matches. Thus the packet is incorrectly
      blocked.
      
      Solve by resetting offload_fwd_mark in skb_scrub_packet(). That
      function is called from tunnels and also from veth, and thus catches the
      cases where traffic is forwarded between bridges and transformed in a
      way that invalidates the marking.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Suggested-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 18 Nov, 2018 1 commit
    • net-gro: reset skb->pkt_type in napi_reuse_skb() · 33d9a2c7
      Eric Dumazet authored
      eth_type_trans() assumes initial value for skb->pkt_type
      is PACKET_HOST.
      
      This is indeed the value right after a fresh skb allocation.
      
      However, it is possible that GRO merged a packet with a different
      value (like PACKET_OTHERHOST in case macvlan is used), so
      we need to make sure napi->skb will have pkt_type set back to
      PACKET_HOST.
      
      Otherwise, valid packets might be dropped by the stack because
      their pkt_type is not PACKET_HOST.
      
      napi_reuse_skb() was added in commit 96e93eab ("gro: Add
      internal interfaces for VLAN"), but this bug has always
      been there.
      
      Fixes: 96e93eab ("gro: Add internal interfaces for VLAN")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 17 Nov, 2018 5 commits
  16. 15 Nov, 2018 2 commits
  17. 12 Nov, 2018 1 commit
  18. 10 Nov, 2018 1 commit
    • flow_dissector: do not dissect l4 ports for fragments · 62230715
      배석진 authored
      Only the first fragment has the sport/dport information,
      not the following ones.
      
      If we want consistent hash for all fragments, we need to
      ignore ports even for first fragment.
      
      This bug is visible for IPv6 traffic, if incoming fragments
      do not have a flow label, since skb_get_hash() will give
      different results for the first fragment and the following ones.
      
      It is also visible if any routing rule wants dissection
      and sport or dport.
      
      See commit 5e5d6fed ("ipv6: route: dissect flow
      in input path if fib rules need it") for details.
      
      [edumazet] rewrote the changelog completely.
      
      Fixes: 06635a35 ("flow_dissect: use programable dissector in skb_flow_dissect and friends")
      Signed-off-by: 배석진 <soukjin.bae@samsung.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 09 Nov, 2018 7 commits
    • bpf: Extend the sk_lookup() helper to XDP hookpoint. · c8123ead
      Nitin Hande authored

      This patch proposes to extend the sk_lookup() BPF API to the
      XDP hookpoint. The sk_lookup() helper supports a lookup on an
      incoming packet to find the corresponding socket that will
      receive this packet. Current support for this BPF API is at
      the tc hookpoint. This patch extends the API to the XDP
      hookpoint. An XDP program can map the incoming packet to a
      5-tuple parameter and invoke the API to find the corresponding
      socket structure.
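
      A hedged sketch of such an XDP program (header parsing elided;
      in real use the tuple must be filled from the packet headers
      before the call):

          #include <linux/bpf.h>
          #include <bpf/bpf_helpers.h>

          SEC("xdp")
          int xdp_lookup_sk(struct xdp_md *ctx)
          {
              struct bpf_sock_tuple tuple = {};  /* fill from headers */
              struct bpf_sock *sk;

              sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof(tuple.ipv4),
                                     BPF_F_CURRENT_NETNS, 0);
              if (sk)
                  bpf_sk_release(sk);  /* drop the acquired reference */
              return XDP_PASS;
          }

          char _license[] SEC("license") = "GPL";
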
      Signed-off-by: Nitin Hande <Nitin.Hande@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: add perf event notification support for sock_ops · a5a3a828
      Sowmini Varadhan authored

      This patch allows eBPF programs that use sock_ops to send perf-based
      event notifications using bpf_perf_event_output(). Our main
      use case for this is the following:
      
        We would like to monitor some subset of TCP sockets in user-space,
        (the monitoring application would define 4-tuples it wants to monitor)
        using TCP_INFO stats to analyze reported problems. The idea is to
        use those stats to see where the bottlenecks are likely to be ("is
        it application-limited?" or "is there evidence of BufferBloat in
        the path?" etc).
      
        Today we can do this by periodically polling for tcp_info, but this
        could be made more efficient if the kernel would asynchronously
        notify the application via tcp_info when some "interesting"
        thresholds (e.g., "RTT variance > X", or "total_retrans > Y" etc)
        are reached. And to make this effective, it is better if
        we could apply the threshold check *before* constructing the
        tcp_info netlink notification, so that we don't waste resources
        constructing notifications that will be discarded by the filter.
      
      This work solves the problem by adding perf event based notification
      support for sock_ops. The eBPF program can thus be designed to apply
      any desired filters to the bpf_sock_ops and trigger a perf event
      notification based on the evaluation from the filter. The user space
      component can use these perf event notifications to either read any
      state managed by the eBPF program, or issue a TCP_INFO netlink call
      if desired.
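
      A hedged sketch of such a filter (the map layout and the RTT
      threshold are illustrative; srtt_us is one of the tcp_info-style
      fields bpf_sock_ops exposes on full sockets):

          #include <linux/bpf.h>
          #include <bpf/bpf_helpers.h>

          struct {
              __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
              __uint(key_size, sizeof(int));
              __uint(value_size, sizeof(int));
          } events SEC(".maps");

          SEC("sockops")
          int watch_rtt(struct bpf_sock_ops *skops)
          {
              __u32 srtt = skops->srtt_us;  /* usec << 3 */

              /* Notify only when the threshold check passes, so no
               * notification is constructed for uninteresting sockets. */
              if (srtt > (25000 << 3))  /* arbitrary: 25 ms */
                  bpf_perf_event_output(skops, &events,
                                        BPF_F_CURRENT_CPU,
                                        &srtt, sizeof(srtt));
              return 1;
          }

          char _license[] SEC("license") = "GPL";
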
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: Fix IPv6 dport byte order in bpf_sk_lookup_udp · b13b8787
      Andrey Ignatov authored
      Lookup functions in sk_lookup have different expectations about byte
      order of provided arguments.
      
      Specifically __inet_lookup, __udp4_lib_lookup and __udp6_lib_lookup
      expect dport to be in network byte order and do ntohs(dport) internally.
      
      At the same time, __inet6_lookup expects dport to be in host byte
      order and correspondingly names the argument hnum.
      
      sk_lookup works correctly with __inet_lookup, __udp4_lib_lookup and
      __inet6_lookup with regard to dport. But in the __udp6_lib_lookup case
      it uses host instead of the expected network byte order. This makes
      the result returned by bpf_sk_lookup_udp for IPv6 incorrect.
      
      The patch fixes byte order of dport passed to __udp6_lib_lookup.
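
      The convention in two userspace lines (illustrative only): a port
      is converted exactly once, on the side the callee expects:

          #include <arpa/inet.h>
          #include <assert.h>

          int main(void)
          {
              unsigned short dport_h = 53;              /* host order */
              unsigned short dport_n = htons(dport_h);  /* network order */

              /* A callee that does ntohs() internally must be handed
               * dport_n; passing dport_h instead is the bug here. */
              assert(ntohs(dport_n) == dport_h);
              return 0;
          }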
      
      Originally sk_lookup properly handled UDPv6, but not TCPv6. Commit
      5ef0ae84 fixed TCPv6 but broke UDPv6.
      
      Fixes: 5ef0ae84 ("bpf: Fix IPv6 dport byte-order in bpf_sk_lookup")
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Joe Stringer <joe@wand.net.nz>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • net/core: use __vlan_hwaccel helpers · b1817524
      Michał Mirosław authored

      This removes assumptions about VLAN_TAG_PRESENT bit.
      Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: move __skb_checksum_complete*() to skbuff.c · 49f8e832
      Cong Wang authored

      __skb_checksum_complete_head() and __skb_checksum_complete()
      are both declared in skbuff.h; they fit better in skbuff.c
      than in datagram.c.
      
      Cc: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: core: dev_addr_lists: add auxiliary func to handle reference address updates · e7946760
      Ivan Khoronzhuk authored

      To avoid a full table update when a single address is added or
      removed, the auxiliary function __hw_addr_sync_dev() exists. It
      lets the end driver do nothing when nothing has changed, and
      add/remove an entry only when an address is first added or last
      removed. But it doesn't cover cases where an address of the real
      device or a vlan is reused by other vlans or vlan/macvlan devices.

      To handle the events where an address is reused/unreused, this
      patch adds a new auxiliary routine, __hw_addr_ref_sync_dev(). It
      does nothing when nothing has changed and updates only the
      address being added/reused/deleted/unreused. Thus, clone address
      changes for vlans can be mirrored in the table. The function is
      mutually exclusive with __hw_addr_sync_dev(). It is the end
      driver's responsibility to identify the address's vlan device,
      if it needs to.
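
      A toy model of the reference-tracked sync in plain userspace C
      (not the kernel API): the table is written only on first add and
      last remove, while reuse/unreuse transitions stay visible to the
      driver:

          #include <stdio.h>

          static int refcnt;  /* references on one hardware address */

          static void addr_ref(const char *who)
          {
              refcnt++;
              printf("%s: ref=%d (%s)\n", who, refcnt,
                     refcnt == 1 ? "program to table" : "reused");
          }

          static void addr_unref(const char *who)
          {
              refcnt--;
              printf("%s: ref=%d (%s)\n", who, refcnt,
                     refcnt == 0 ? "remove from table" : "still in use");
          }

          int main(void)
          {
              addr_ref("real dev");    /* first add: program hardware */
              addr_ref("vlan 10");     /* reuse: no table write */
              addr_unref("vlan 10");   /* unreuse: keep entry */
              addr_unref("real dev");  /* last remove: clear entry */
              return 0;
          }
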
      Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sock: Reset dst when changing sk_mark via setsockopt · 50254256
      David Barmann authored

      When setting the SO_MARK socket option, if the mark changes, the dst
      needs to be reset so that a new route lookup is performed.
      
      This fixes the case where an application wants to change routing by
      setting a new sk_mark. If this is done after some packets have already
      been sent, the dst is cached and the new mark has no effect.
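
      A minimal userspace sketch (the mark value is arbitrary; setting
      SO_MARK requires CAP_NET_ADMIN):

          #include <sys/socket.h>

          /* After this patch the socket's cached route is dropped here,
           * so the next packet is routed under the new mark. */
          static int set_mark(int fd, unsigned int mark)
          {
              return setsockopt(fd, SOL_SOCKET, SO_MARK,
                                &mark, sizeof(mark));
          }

          int main(void)
          {
              int fd = socket(AF_INET, SOCK_STREAM, 0);

              return set_mark(fd, 42);
          }
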
      Signed-off-by: David Barmann <david.barmann@stackpath.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>