1. 18 Jan, 2019 1 commit
  2. 06 Dec, 2018 1 commit
    • net: use skb_list_del_init() to remove from RX sublists · 22f6bbb7
      Edward Cree authored
      list_del() leaves the skb->next pointer poisoned, which can then lead to
       a crash in e.g. OVS forwarding.  For example, setting up an OVS VXLAN
       forwarding bridge on sfc as per:
      $ ovs-vsctl show
          Bridge "br0"
              Port "br0"
                  Interface "br0"
                      type: internal
              Port "enp6s0f0"
                  Interface "enp6s0f0"
              Port "vxlan0"
                  Interface "vxlan0"
                      type: vxlan
                      options: {key="1", local_ip="", remote_ip=""}
          ovs_version: "2.5.0"
      (where the local_ip value is an address on enp6s0f1)
      and sending traffic across it will lead to the following panic:
      general protection fault: 0000 [#1] SMP PTI
      CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.20.0-rc3-ehc+ #701
      Hardware name: Dell Inc. PowerEdge R710/0M233H, BIOS 6.4.0 07/23/2013
      RIP: 0010:dev_hard_start_xmit+0x38/0x200
      Code: 53 48 89 fb 48 83 ec 20 48 85 ff 48 89 54 24 08 48 89 4c 24 18 0f 84 ab 01 00 00 48 8d 86 90 00 00 00 48 89 f5 48 89 44 24 10 <4c> 8b 33 48 c7 03 00 00 00 00 48 8b 05 c7 d1 b3 00 4d 85 f6 0f 95
      RSP: 0018:ffff888627b437e0 EFLAGS: 00010202
      RAX: 0000000000000000 RBX: dead000000000100 RCX: ffff88862279c000
      RDX: ffff888614a342c0 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffff888618a88000 R08: 0000000000000001 R09: 00000000000003e8
      R10: 0000000000000000 R11: ffff888614a34140 R12: 0000000000000000
      R13: 0000000000000062 R14: dead000000000100 R15: ffff888616430000
      FS:  0000000000000000(0000) GS:ffff888627b40000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f6d2bc6d000 CR3: 000000000200a000 CR4: 00000000000006e0
      Call Trace:
       ? masked_flow_lookup+0xf7/0x220 [openvswitch]
       ? ep_poll_callback+0x101/0x310
       do_execute_actions+0xaba/0xaf0 [openvswitch]
       ? __wake_up_common+0x8a/0x150
       ? __wake_up_common_lock+0x87/0xc0
       ? queue_userspace_packet+0x31c/0x5b0 [openvswitch]
       ovs_execute_actions+0x47/0x120 [openvswitch]
       ovs_dp_process_packet+0x7d/0x110 [openvswitch]
       ovs_vport_receive+0x6e/0xd0 [openvswitch]
       ? dst_alloc+0x64/0x90
       ? rt_dst_alloc+0x50/0xd0
       ? ip_route_input_slow+0x19a/0x9a0
       ? __udp_enqueue_schedule_skb+0x198/0x1b0
       ? __udp4_lib_rcv+0x856/0xa30
       ? __udp4_lib_rcv+0x856/0xa30
       ? cpumask_next_and+0x19/0x20
       ? find_busiest_group+0x12d/0xcd0
       netdev_frame_hook+0xce/0x150 [openvswitch]
       ? __efx_rx_packet+0x335/0x5e0 [sfc]
       efx_poll+0x182/0x320 [sfc]
      So, in all listified-receive handling, instead pull skbs off the lists with
       skb_list_del_init().
      Fixes: 9af86f93 ("net: core: fix use-after-free in __netif_receive_skb_list_core")
      Fixes: 7da517a3 ("net: core: Another step of skb receive list processing")
      Fixes: a4ca8b7d ("net: ipv4: fix drop handling in ip_list_rcv() and ip_list_rcv_finish()")
      Fixes: d8269e2c ("net: ipv6: listify ipv6_rcv() and ip6_rcv_finish()")
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
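The difference between the two removal styles can be sketched in userspace C. This is only a model under stated assumptions: `model_list_del()`, `model_del_init()`, `chain_length()` and the small poison constants are hypothetical stand-ins, not the kernel's definitions. It shows why a poisoned `next` pointer is dangerous for later code that walks `next` as a NULL-terminated chain, as the transmit path in the trace above does.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace model of a kernel-style circular doubly linked list. */
struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical stand-ins for the kernel's list poison values. */
#define POISON_NEXT ((struct list_head *)(uintptr_t)0x100)
#define POISON_PREV ((struct list_head *)(uintptr_t)0x200)

static void list_init(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

static void list_add_tail(struct list_head *n, struct list_head *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

static void unlink_entry(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
}

/* Models list_del(): unlink, then poison the entry's own pointers. */
static void model_list_del(struct list_head *e)
{
    unlink_entry(e);
    e->next = POISON_NEXT;
    e->prev = POISON_PREV;
}

/* Models the skb_list_del_init() idea: unlink and leave next as NULL.
 * Clearing prev as well is a tidiness choice of this model. */
static void model_del_init(struct list_head *e)
{
    unlink_entry(e);
    e->next = NULL;
    e->prev = NULL;
}

/* Follows next until NULL, as code walking a NULL-terminated chain would.
 * Returns -1 instead of dereferencing a poisoned pointer. */
static int chain_length(const struct list_head *e)
{
    int n = 0;

    while (e) {
        if (e == POISON_NEXT)
            return -1;
        n++;
        e = e->next;
    }
    return n;
}
```

In this model an entry removed with `model_list_del()` reports -1 from `chain_length()`, while one removed with `model_del_init()` reports a chain of length 1.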
  3. 08 Nov, 2018 1 commit
  4. 10 Sep, 2018 2 commits
  5. 12 Jul, 2018 1 commit
    • net: ipv4: fix listify ip_rcv_finish in case of forwarding · 0761680d
      Jesper Dangaard Brouer authored
      In commit 5fa12739 ("net: ipv4: listify ip_rcv_finish") calling
      dst_input(skb) was split-out.  The ip_sublist_rcv_finish() just calls
      dst_input(skb) in a loop.
      The problem is that ip_sublist_rcv_finish() forgot to remove the SKB
      from the list before invoking dst_input().  Furthermore, we need to
      clear skb->next as other parts of the network stack use another kind
      of SKB lists for xmit_more (see dev_hard_start_xmit).
      A crash occurs if e.g. dst_input() invokes ip_forward(), which calls
      dst_output()/ip_output() that eventually calls __dev_queue_xmit() +
      sch_direct_xmit(), and a crash occurs in validate_xmit_skb_list().
      This patch only fixes the crash, but there is a huge potential for
      a performance boost if we can pass an SKB-list through to ip_forward.
      Fixes: 5fa12739 ("net: ipv4: listify ip_rcv_finish")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 06 Jul, 2018 1 commit
    • net: ipv4: fix list processing on L3 slave devices · efe6aaca
      Edward Cree authored
      If we have an L3 master device, l3mdev_ip_rcv() will steal the skb, but
       we were returning NET_RX_SUCCESS from ip_rcv_finish_core() which meant
       that ip_list_rcv_finish() would keep it on the list.  Instead let's
       move the l3mdev_ip_rcv() call into the caller, so that our response to
       a steal can be different in the single packet path (return
       NET_RX_SUCCESS) and the list path (forget this packet and continue).
      Fixes: 5fa12739 ("net: ipv4: listify ip_rcv_finish")
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
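A minimal userspace sketch of the two responses to a steal, assuming a hypothetical `model_l3mdev_rcv()` that returns NULL when the master device consumes the packet. The struct and function names are illustrative and do not reproduce the kernel code.

```c
#include <stddef.h>

#define NET_RX_SUCCESS 0

struct pkt {
    int id;
    int stolen_by_master;   /* model flag: the L3 master consumes it */
};

/* Hypothetical stand-in for l3mdev_ip_rcv(): NULL means "stolen". */
static struct pkt *model_l3mdev_rcv(struct pkt *p)
{
    return p->stolen_by_master ? NULL : p;
}

/* Single-packet path: a steal simply ends processing with success. */
static int model_rcv_finish(struct pkt *p)
{
    p = model_l3mdev_rcv(p);
    if (!p)
        return NET_RX_SUCCESS;
    /* core processing and delivery would follow here */
    return NET_RX_SUCCESS;
}

/* List path: a stolen packet is forgotten and the loop continues.
 * Surviving packets are written to out; the count is returned. */
static size_t model_list_rcv_finish(struct pkt **in, size_t n,
                                    struct pkt **out)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++) {
        struct pkt *p = model_l3mdev_rcv(in[i]);

        if (!p)
            continue;
        out[kept++] = p;
    }
    return kept;
}
```

Because the hook is called by the caller rather than inside the shared core routine, each path can decide for itself what a steal means.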
  7. 05 Jul, 2018 1 commit
  8. 04 Jul, 2018 2 commits
    • net: ipv4: listify ip_rcv_finish · 5fa12739
      Edward Cree authored
      ip_rcv_finish_core(), if it does not drop, sets skb->dst by either early
       demux or route lookup.  The last step, calling dst_input(skb), is left to
       the caller; in the listified case, we split to form sublists with a common
       dst, but then ip_sublist_rcv_finish() just calls dst_input(skb) in a loop.
      The next step in listification would thus be to add a list_input() method
       to struct dst_entry.
      Early demux is an indirect call based on iph->protocol; this is another
       opportunity for listification which is not taken here (it would require
       slicing up ip_rcv_finish_core() to allow splitting on protocol changes).
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
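The sublist splitting described above can be sketched as follows. This is a userspace model that uses an array instead of an skb list; `for_each_dst_run()` and `count_packets()` are hypothetical names. Each maximal run of consecutive packets sharing a dst is handed to a callback as one unit.

```c
#include <stddef.h>

struct pkt {
    int id;
    const void *dst;   /* model of skb->dst: only identity matters here */
};

typedef void (*sublist_fn)(struct pkt *const *run, size_t len, void *ctx);

/* Example callback: adds the run length to a size_t counter in ctx. */
static void count_packets(struct pkt *const *run, size_t len, void *ctx)
{
    (void)run;
    *(size_t *)ctx += len;
}

/* Calls fn once per run of consecutive packets with the same dst.
 * Returns the number of runs (sublists) dispatched. */
static size_t for_each_dst_run(struct pkt *const *pkts, size_t n,
                               sublist_fn fn, void *ctx)
{
    size_t runs = 0, start = 0;

    for (size_t i = 1; i <= n; i++) {
        if (i == n || pkts[i]->dst != pkts[start]->dst) {
            fn(&pkts[start], i - start, ctx);
            runs++;
            start = i;
        }
    }
    return runs;
}
```

Only adjacent packets are grouped, so a dst that reappears later in the list starts a new run; this keeps packets in arrival order.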
    • net: ipv4: listified version of ip_rcv · 17266ee9
      Edward Cree authored
      Also involved adding a way to run a netfilter hook over a list of packets.
       Rather than attempting to make netfilter know about lists (which would be
       a major project in itself) we just let it call the regular okfn (in this
       case ip_rcv_finish()) for any packets it steals, and have it give us back
       a list of packets it's synchronously accepted (which normally NF_HOOK
       would automatically call okfn() on, but we want to be able to potentially
       pass the list to a listified version of okfn().)
      The netfilter hooks themselves are indirect calls that still happen per-
       packet (see nf_hook_entry_hookfn()), but again, changing that can be left
       for future work.
      There is potential for out-of-order receives if the netfilter hook ends up
       synchronously stealing packets, as they will be processed before any
       accepts earlier in the list.  However, it was already possible for an
       asynchronous accept to cause out-of-order receives, so presumably this is
       considered OK.
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
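A rough userspace sketch of running a per-packet hook over a list and handing back only the synchronously accepted packets. The verdict enum, `hook_from_field()` and `hook_list()` are hypothetical; in this model a stolen or dropped packet is simply left out of the returned list, on the assumption that whoever took it is responsible for it from then on.

```c
#include <stddef.h>

enum verdict { V_DROP, V_ACCEPT, V_STOLEN };

struct pkt {
    int id;
    enum verdict v;   /* model: the verdict the hook will return */
};

typedef enum verdict (*hook_fn)(struct pkt *p);

/* Example hook: returns the verdict stored in the packet. */
static enum verdict hook_from_field(struct pkt *p)
{
    return p->v;
}

/* Runs the hook on each packet in order (still one call per packet)
 * and compacts the accepted ones to the front of the array.
 * Returns how many packets were accepted. */
static size_t hook_list(struct pkt **pkts, size_t n, hook_fn hook)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++) {
        if (hook(pkts[i]) == V_ACCEPT)
            pkts[kept++] = pkts[i];
    }
    return kept;
}
```

The caller can then pass the accepted prefix to a list-aware continuation in one call instead of invoking it once per packet.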
  9. 22 Mar, 2018 1 commit
  10. 01 Oct, 2017 1 commit
  11. 24 Mar, 2017 1 commit
    • net: Add sysctl to toggle early demux for tcp and udp · dddb64bc
      subashab@codeaurora.org authored
      Certain systems process a significant unconnected UDP workload.
      It would be preferable to disable UDP early demux for those systems
      and enable it for TCP only.
      By disabling UDP demux, we see these slight gains on an ARM64 system:
      782 -> 788Mbps unconnected single stream UDPv4
      633 -> 654Mbps unconnected UDPv4 different sources
      The performance impact can change based on CPU architecture and cache
      sizes. Not much difference will be seen if the entire UDP hash table
      is in cache.
      Both sysctls are enabled by default to preserve existing behavior.
      v1->v2: Change function pointer instead of adding conditional as
      suggested by Stephen.
      v2->v3: Read once in callers to avoid issues due to compiler
      optimizations. Also update commit message with the tests.
      v3->v4: Store and use read once result instead of querying pointer
      again incorrectly.
      v4->v5: Refactor to avoid errors due to compilation with IPV6={m,n}
      Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
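The approach named in the changelog (swap a function pointer rather than add a conditional, and read the pointer once in the caller) can be sketched in userspace C. All names here are hypothetical, and a C11 relaxed atomic load stands in for the kernel's read-once helper.

```c
#include <stdatomic.h>
#include <stddef.h>

struct pkt {
    int demuxed;
};

typedef void (*demux_fn)(struct pkt *);

struct proto_ops {
    demux_fn early_demux_handler;    /* the real handler, never changes */
    _Atomic(demux_fn) early_demux;   /* NULL while the toggle is off */
};

/* Example handler for the model. */
static void udp_demux_model(struct pkt *p)
{
    p->demuxed = 1;
}

/* Models the sysctl write: point early_demux at the handler or at NULL. */
static void set_early_demux(struct proto_ops *ops, int enabled)
{
    atomic_store_explicit(&ops->early_demux,
                          enabled ? ops->early_demux_handler : NULL,
                          memory_order_relaxed);
}

/* Caller: load the pointer once into a local, then test and call the
 * local copy so a concurrent toggle cannot change it in between. */
static void rcv_finish_model(struct proto_ops *ops, struct pkt *p)
{
    demux_fn edemux = atomic_load_explicit(&ops->early_demux,
                                           memory_order_relaxed);

    if (edemux)
        edemux(p);
}
```

Testing and calling the same local copy is the point of the v3 and v4 notes: re-reading the shared pointer after the NULL check could observe a different value.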
  12. 16 Sep, 2016 1 commit
    • net: VRF: Pass original iif to ip_route_input() · d6f64d72
      Mark Tomlinson authored
      The function ip_rcv_finish() calls l3mdev_ip_rcv(). On any VRF except
      the global VRF, this replaces skb->dev with the VRF master interface.
      When calling ip_route_input_noref() from here, the checks for forwarding
      look at this master device instead of the initial ingress interface.
      This will allow packets to be routed which normally would be dropped.
      For example, an interface that is not assigned an IP address should
      drop packets, but because the checking is against the master device, the
      packet will be forwarded.
      The fix here is to still call l3mdev_ip_rcv(), but remember the initial
      net_device. This is passed to the other functions within ip_rcv_finish,
      so they still see the original interface.
      Signed-off-by: Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz>
      Acked-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 11 May, 2016 2 commits
    • net: original ingress device index in PKTINFO · 0b922b7a
      David Ahern authored
      Applications such as OSPF and BFD need the original ingress device not
      the VRF device; the latter can be derived from the former. To that end
      add the skb_iif to inet_skb_parm and set it in ipv4 code after clearing
      the skb control buffer similar to IPv6. From there the pktinfo can just
      pull it from cb with the PKTINFO_SKB_CB cast.
      The previous patch moving the skb->dev change to L3 means nothing else
      is needed for IPv6; it just works.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: l3mdev: Add hook in ip and ipv6 · 74b20582
      David Ahern authored
      Currently the VRF driver uses the rx_handler to switch the skb device
      to the VRF device. Switching the dev prior to the ip / ipv6 layer
      means the VRF driver has to duplicate IP/IPv6 processing which adds
      overhead and makes features such as retaining the ingress device index
      more complicated than necessary.
      This patch moves the hook to the L3 layer just after the first NF_HOOK
      for PRE_ROUTING. This location makes exposing the original ingress device
      trivial (next patch) and allows adding other NF_HOOKs to the VRF driver
      in the future.
      dev_queue_xmit_nit is exported so that the VRF driver can cycle the skb
      with the switched device through the packet taps to maintain current
      behavior (tcpdump can be used on either the vrf device or the enslaved
      devices).
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 28 Apr, 2016 4 commits
  15. 17 Feb, 2016 1 commit
  16. 11 Feb, 2016 1 commit
  17. 29 Jan, 2016 1 commit
  18. 13 Oct, 2015 2 commits
  19. 18 Sep, 2015 4 commits
    • netfilter: Pass net into okfn · 0c4b51f0
      Eric W. Biederman authored
      This is immediately motivated by the bridge code that chains functions that
      call into netfilter.  Without passing net into the okfns the bridge code would
      need to guess about the best expression for the network namespace to process
      packets in.
      As net is frequently one of the first things computed in continuation functions
      after netfilter has done its job, passing in the desired network namespace is in
      many cases a code simplification.
      To support this change the function dst_output_okfn is introduced to
      simplify passing dst_output as an okfn.  For the moment dst_output_okfn
      just silently drops the struct net.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netfilter: Pass struct net into the netfilter hooks · 29a26a56
      Eric W. Biederman authored
      Pass a network namespace parameter into the netfilter hooks.  At the
      call site of the netfilter hooks the path a packet is taking through
      the network stack is well known which allows the network namespace to
      be determined easily and reliably.
      This allows the replacement of magic code like
      "dev_net(state->in?:state->out)" that appears at the start of most
      netfilter hooks with "state->net".
      In almost all cases the network namespace passed in is derived
      from the first network device passed in, guaranteeing those
      paths will not see any changes in practice.
      The exceptions are:
      xfrm/xfrm_output.c:xfrm_output_resume()         xs_net(skb_dst(skb)->xfrm)
      ipvs/ip_vs_xmit.c:ip_vs_nat_send_or_cont()      ip_vs_conn_net(cp)
      ipvs/ip_vs_xmit.c:ip_vs_send_or_cont()          ip_vs_conn_net(cp)
      ipv4/raw.c:raw_send_hdrinc()                    sock_net(sk)
      ipv6/ip6_output.c:ip6_xmit()                    sock_net(sk)
      ipv6/ndisc.c:ndisc_send_skb()                   dev_net(skb->dev) not dev_net(dst->dev)
      ipv6/raw.c:raw6_send_hdrinc()                   sock_net(sk)
      br_netfilter_hooks.c:br_nf_pre_routing_finish() dev_net(skb->dev) before skb->dev is set to nf_bridge->physindev
      In all cases these exceptions seem to be a better expression for the
      network namespace the packet is being processed in than the historic
      "dev_net(in?in:out)".  I am documenting them in case something odd
      pops up and someone starts trying to track down what happened.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 21 Jul, 2015 1 commit
    • dst: Metadata destinations · f38a9eb1
      Thomas Graf authored
      Introduces a new dst_metadata which makes it possible to carry per packet metadata
      between forwarding and processing elements via the skb->dst pointer.
      The structure is set up to be a union. Thus, each separate type of
      metadata requires its own dst instance. If demand arises to carry
      multiple types of metadata concurrently, metadata dst entries can be
      made stackable.
      The metadata dst entry is refcnt'ed as expected for now but a non
      reference counted use is possible if the reference is forced before
      queueing the skb.
      In order to allow allocating dsts with variable length, the existing
      dst_alloc() is split into a dst_alloc() and dst_init() function. The
      existing dst_init() function to initialize the subsystem is being
      renamed to dst_subsys_init() to make it clear what is what.
      The check before ip_route_input() is changed to ignore metadata dsts
      and drop the dst inside the routing function, which allows metadata to
      be interpreted in a later commit.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 07 Apr, 2015 1 commit
    • netfilter: Pass socket pointer down through okfn(). · 7026b1dd
      David Miller authored
      On the output paths in particular, we have to sometimes deal with two
      socket contexts.  First, and usually skb->sk, is the local socket that
      generated the frame.
      And second, is potentially the socket used to control a tunneling
      socket, such as one that encapsulates using UDP.
      We do not want to disassociate skb->sk when encapsulating in order
      to fix this, because that would break socket memory accounting.
      The most extreme case where this can cause huge problems is an
      AF_PACKET socket transmitting over a vxlan device.  We hit code
      paths doing checks that assume they are dealing with an ipv4
      socket, but are actually operating upon the AF_PACKET one.
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 03 Apr, 2015 2 commits
  23. 28 Jan, 2014 1 commit
    • net: Fix memory leak if TPROXY used with TCP early demux · a452ce34
      Holger Eitzenberger authored
      I see a memory leak when using a transparent HTTP proxy using TPROXY
      together with TCP early demux and Kernel v3.8.13.15 (Ubuntu stable):
      unreferenced object 0xffff88008cba4a40 (size 1696):
        comm "softirq", pid 0, jiffies 4294944115 (age 8907.520s)
        hex dump (first 32 bytes):
          0a e0 20 6a 40 04 1b 37 92 be 32 e2 e8 b4 00 00  .. j@..7..2.....
          02 00 07 01 00 00 00 00 00 00 00 00 00 00 00 00  ................
          [<ffffffff810b710a>] kmem_cache_alloc+0xad/0xb9
          [<ffffffff81270185>] sk_prot_alloc+0x29/0xc5
          [<ffffffff812702cf>] sk_clone_lock+0x14/0x283
          [<ffffffff812aaf3a>] inet_csk_clone_lock+0xf/0x7b
          [<ffffffff8129a893>] netlink_broadcast+0x14/0x16
          [<ffffffff812c1573>] tcp_create_openreq_child+0x1b/0x4c3
          [<ffffffff812c033e>] tcp_v4_syn_recv_sock+0x38/0x25d
          [<ffffffff812c13e4>] tcp_check_req+0x25c/0x3d0
          [<ffffffff812bf87a>] tcp_v4_do_rcv+0x287/0x40e
          [<ffffffff812a08a7>] ip_route_input_noref+0x843/0xa55
          [<ffffffff812bfeca>] tcp_v4_rcv+0x4c9/0x725
          [<ffffffff812a26f4>] ip_local_deliver_finish+0xe9/0x154
          [<ffffffff8127a927>] __netif_receive_skb+0x4b2/0x514
          [<ffffffff8127aa77>] process_backlog+0xee/0x1c5
          [<ffffffff8127c949>] net_rx_action+0xa7/0x200
          [<ffffffff81209d86>] add_interrupt_randomness+0x39/0x157
      But there are many more, resulting in the machine going OOM after some
      From looking at the TPROXY code, and with help from Florian, I see
      that the memory leak is introduced in tcp_v4_early_demux():
        void tcp_v4_early_demux(struct sk_buff *skb)
        {
          /* ... */
          iph = ip_hdr(skb);
          th = tcp_hdr(skb);
          if (th->doff < sizeof(struct tcphdr) / 4)
              return;
          sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
                             iph->saddr, th->source,
                             iph->daddr, ntohs(th->dest),
                             skb->skb_iif);
          if (sk) {
              skb->sk = sk;
      where the socket is assigned unconditionally to skb->sk, also bumping
      the refcnt on it.  This is problematic, because in our case the skb
      already has a socket assigned in the TPROXY target.  This then results
      in the leak I see.
      The very same issue seems to exist with IPv6, but I haven't tested it.
      Reviewed-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
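The leak mechanism can be modelled with a plain reference count. The message above does not show the fix itself, so the guard in `rcv_finish_model()` (skip early demux when the packet already carries a socket) is an assumption of this sketch, and all names are hypothetical.

```c
#include <stddef.h>

struct sock_model {
    int refcnt;
};

struct skb_model {
    struct sock_model *sk;
};

/* Models the unconditional assignment: takes a reference on the socket
 * it found and overwrites whatever skb->sk held before. */
static void early_demux_model(struct skb_model *skb, struct sock_model *found)
{
    if (found) {
        found->refcnt++;
        skb->sk = found;
    }
}

/* Assumed guard: only run early demux when no socket is attached yet. */
static void rcv_finish_model(struct skb_model *skb, struct sock_model *found)
{
    if (skb->sk == NULL)
        early_demux_model(skb, found);
}

/* Freeing the packet releases the one reference that skb->sk points at. */
static void skb_free_model(struct skb_model *skb)
{
    if (skb->sk) {
        skb->sk->refcnt--;
        skb->sk = NULL;
    }
}
```

Without the guard, the reference taken earlier on the first socket is never released because the pointer to it has been overwritten; with the guard, the count returns to its starting value once the packet is freed.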
  24. 09 Aug, 2013 1 commit
    • net: add SNMP counters tracking incoming ECN bits · 1f07d03e
      Eric Dumazet authored
      With GRO/LRO processing, there is a problem because Ip[6]InReceives SNMP
      counters do not count the number of frames, but number of aggregated
      segments.
      It's probably too late to change this now.
      This patch adds four new counters, tracking number of frames, regardless
      of LRO/GRO, and on a per ECN status basis, for IPv4 and IPv6.
      Ip[6]NoECTPkts : Number of packets received with NOECT
      Ip[6]ECT1Pkts  : Number of packets received with ECT(1)
      Ip[6]ECT0Pkts  : Number of packets received with ECT(0)
      Ip[6]CEPkts    : Number of packets received with Congestion Experienced
      lph37:~# nstat | egrep "Pkts|InReceive"
      IpInReceives                    1634137            0.0
      Ip6InReceives                   3714107            0.0
      Ip6InNoECTPkts                  19205              0.0
      Ip6InECT0Pkts                   52651828           0.0
      IpExtInNoECTPkts                33630              0.0
      IpExtInECT0Pkts                 15581379           0.0
      IpExtInCEPkts                   6                  0.0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
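Per-codepoint counting of this kind can be sketched as below. The ECN field is the low two bits of the IPv4 TOS byte (or IPv6 traffic class), with 00 = Not-ECT, 01 = ECT(1), 10 = ECT(0) and 11 = CE. The struct and function names are hypothetical, and `segs` is an assumed per-packet count of the frames an aggregated packet represents, with 0 meaning it was not aggregated.

```c
#include <stdint.h>

enum {
    ECN_NOT_ECT = 0,   /* 00 */
    ECN_ECT1    = 1,   /* 01 */
    ECN_ECT0    = 2,   /* 10 */
    ECN_CE      = 3,   /* 11 */
    ECN_MASK    = 3
};

struct ecn_counters {
    unsigned long pkts[4];   /* indexed by ECN codepoint */
};

/* Adds one count per original frame to the counter selected by the
 * packet's ECN bits. A non-aggregated packet (segs == 0) counts once. */
static void count_ecn(struct ecn_counters *c, uint8_t tos, unsigned int segs)
{
    c->pkts[tos & ECN_MASK] += segs ? segs : 1;
}
```

Counting by `segs` is what lets these counters track frames regardless of LRO/GRO.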
  25. 16 Jul, 2013 1 commit
    • ipv4: set transport header earlier · 21d1196a
      Eric Dumazet authored
      commit 45f00f99 ("ipv4: tcp: clean up tcp_v4_early_demux()") added a
      performance regression for non GRO traffic, basically disabling
      IP early demux.
      IPv6 stack resets transport header in ip6_rcv() before calling
      IP early demux in ip6_rcv_finish(), while IPv4 does this only in
      ip_local_deliver_finish(), _after_ IP early demux.
      GRO traffic happened to enable IP early demux because transport header
      is also set in inet_gro_receive().
      Instead of reverting the faulty commit, we can make IPv4/IPv6 behave the
      same: transport_header should be set in ip_rcv() instead of
      ip_local_deliver_finish().
      ip_local_deliver_finish() can also use skb_network_header_len() which is
      faster than ip_hdrlen().
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  26. 29 Apr, 2013 1 commit
  27. 01 Mar, 2013 1 commit
  28. 05 Feb, 2013 1 commit
  29. 30 Jul, 2012 1 commit
    • net: TCP early demux cleanup · cca32e4b
      Eric Dumazet authored
      early_demux() handlers should be called in RCU context, and as we
      use skb_dst_set_noref(skb, dst), caller must not exit from RCU context
      before dst use (skb_dst(skb)) or release (skb_drop(dst))
      Therefore, rcu_read_lock()/rcu_read_unlock() pairs around
      ->early_demux() are confusing and not needed :
      Protocol handlers are already in an RCU read lock section.
      (__netif_receive_skb() does the rcu_read_lock() )
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>