1. 18 Sep, 2015 2 commits
    • Eric W. Biederman's avatar
      netfilter: Pass struct net into the netfilter hooks · 29a26a56
      Eric W. Biederman authored
      
      
      Pass a network namespace parameter into the netfilter hooks.  At the
      call site of the netfilter hooks the path a packet is taking through
      the network stack is well known which allows the network namespace to
      be easily and reliabily.
      
      This allows the replacement of magic code like
      "dev_net(state->in?:state->out)" that appears at the start of most
      netfilter hooks with "state->net".
      
      In almost all cases the network namespace passed in is derived
      from the first network device passed in, guaranteeing those
      paths will not see any changes in practice.
      
      The exceptions are:
      xfrm/xfrm_output.c:xfrm_output_resume()         xs_net(skb_dst(skb)->xfrm)
      ipvs/ip_vs_xmit.c:ip_vs_nat_send_or_cont()      ip_vs_conn_net(cp)
      ipvs/ip_vs_xmit.c:ip_vs_send_or_cont()          ip_vs_conn_net(cp)
      ipv4/raw.c:raw_send_hdrinc()                    sock_net(sk)
      ipv6/ip6_output.c:ip6_xmit()			sock_net(sk)
      ipv6/ndisc.c:ndisc_send_skb()                   dev_net(skb->dev) not dev_net(dst->dev)
      ipv6/raw.c:raw6_send_hdrinc()                   sock_net(sk)
      br_netfilter_hooks.c:br_nf_pre_routing_finish() dev_net(skb->dev) before skb->dev is set to nf_bridge->physindev
      
      In all cases these exceptions seem to be a better expression for the
      network namespace the packet is being processed in then the historic
      "dev_net(in?in:out)".  I am documenting them in case something odd
      pops up and someone starts trying to track down what happened.
      
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      29a26a56
    • Eric W. Biederman's avatar
      net: Merge dst_output and dst_output_sk · 5a70649e
      Eric W. Biederman authored
      
      
      Add a sock paramter to dst_output making dst_output_sk superfluous.
      Add a skb->sk parameter to all of the callers of dst_output
      Have the callers of dst_output_sk call dst_output.
      
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5a70649e
  2. 07 Apr, 2015 1 commit
    • David Miller's avatar
      netfilter: Pass socket pointer down through okfn(). · 7026b1dd
      David Miller authored
      
      
      On the output paths in particular, we have to sometimes deal with two
      socket contexts.  First, and usually skb->sk, is the local socket that
      generated the frame.
      
      And second, is potentially the socket used to control a tunneling
      socket, such as one the encapsulates using UDP.
      
      We do not want to disassociate skb->sk when encapsulating in order
      to fix this, because that would break socket memory accounting.
      
      The most extreme case where this can cause huge problems is an
      AF_PACKET socket transmitting over a vxlan device.  We hit code
      paths doing checks that assume they are dealing with an ipv4
      socket, but are actually operating upon the AF_PACKET one.
      
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7026b1dd
  3. 03 Apr, 2015 2 commits
  4. 26 Mar, 2015 1 commit
  5. 25 Mar, 2015 1 commit
  6. 02 Mar, 2015 1 commit
  7. 04 Feb, 2015 2 commits
  8. 09 Dec, 2014 3 commits
  9. 10 Nov, 2014 2 commits
  10. 05 Nov, 2014 1 commit
    • David S. Miller's avatar
      net: Add and use skb_copy_datagram_msg() helper. · 51f3d02b
      David S. Miller authored
      
      
      This encapsulates all of the skb_copy_datagram_iovec() callers
      with call argument signature "skb, offset, msghdr->msg_iov, length".
      
      When we move to iov_iters in the networking, the iov_iter object will
      sit in the msghdr.
      
      Having a helper like this means there will be less places to touch
      during that transformation.
      
      Based upon descriptions and patch from Al Viro.
      
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      51f3d02b
  11. 23 Jul, 2014 1 commit
    • Quentin Armitage's avatar
      ipv4: Make IP_MULTICAST_ALL and IP_MSFILTER work on raw sockets · f5220d63
      Quentin Armitage authored
      
      
      Currently, although IP_MULTICAST_ALL and IP_MSFILTER ioctl calls succeed on
      raw sockets, there is no code to implement the functionality on received
      packets; it is only implemented for UDP sockets. The raw(7) man page states:
      "In addition, all ip(7) IPPROTO_IP socket options valid for datagram sockets
      are supported", which implies these ioctls should work on raw sockets.
      
      To fix this, add a call to ip_mc_sf_allow on raw sockets.
      
      This should not break any existing code, since the current position of
      not calling ip_mc_sf_filter makes it behave as if neither the IP_MULTICAST_ALL
      nor the IP_MSFILTER ioctl had been called. Adding the call to ip_mc_sf_allow
      will therefore maintain the current behaviour so long as IP_MULTICAST_ALL and
      IP_MSFILTER ioctls are not called. Any code that currently is calling
      IP_MULTICAST_ALL or IP_MSFILTER ioctls on raw sockets presumably is wanting
      the filter to be applied, although no filtering will currently be occurring.
      
      Signed-off-by: default avatarQuentin Armitage <quentin@armitage.org.uk>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f5220d63
  12. 15 Jul, 2014 1 commit
  13. 02 Jun, 2014 1 commit
    • Eric Dumazet's avatar
      inetpeer: get rid of ip_id_count · 73f156a6
      Eric Dumazet authored
      
      
      Ideally, we would need to generate IP ID using a per destination IP
      generator.
      
      linux kernels used inet_peer cache for this purpose, but this had a huge
      cost on servers disabling MTU discovery.
      
      1) each inet_peer struct consumes 192 bytes
      
      2) inetpeer cache uses a binary tree of inet_peer structs,
         with a nominal size of ~66000 elements under load.
      
      3) lookups in this tree are hitting a lot of cache lines, as tree depth
         is about 20.
      
      4) If server deals with many tcp flows, we have a high probability of
         not finding the inet_peer, allocating a fresh one, inserting it in
         the tree with same initial ip_id_count, (cf secure_ip_id())
      
      5) We garbage collect inet_peer aggressively.
      
      IP ID generation do not have to be 'perfect'
      
      Goal is trying to avoid duplicates in a short period of time,
      so that reassembly units have a chance to complete reassembly of
      fragments belonging to one message before receiving other fragments
      with a recycled ID.
      
      We simply use an array of generators, and a Jenkin hash using the dst IP
      as a key.
      
      ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it
      belongs (it is only used from this file)
      
      secure_ip_id() and secure_ipv6_id() no longer are needed.
      
      Rename ip_select_ident_more() to ip_select_ident_segs() to avoid
      unnecessary decrement/increment of the number of segments.
      
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      73f156a6
  14. 19 Feb, 2014 1 commit
  15. 19 Jan, 2014 1 commit
  16. 06 Dec, 2013 1 commit
  17. 23 Nov, 2013 1 commit
  18. 18 Nov, 2013 1 commit
  19. 08 Oct, 2013 1 commit
    • Shawn Bohrer's avatar
      net: ipv4 only populate IP_PKTINFO when needed · fbf8866d
      Shawn Bohrer authored
      
      
      The since the removal of the routing cache computing
      fib_compute_spec_dst() does a fib_table lookup for each UDP multicast
      packet received.  This has introduced a performance regression for some
      UDP workloads.
      
      This change skips populating the packet info for sockets that do not have
      IP_PKTINFO set.
      
      Benchmark results from a netperf UDP_RR test:
      Before 89789.68 transactions/s
      After  90587.62 transactions/s
      
      Benchmark results from a fio 1 byte UDP multicast pingpong test
      (Multicast one way unicast response):
      Before 12.63us RTT
      After  12.48us RTT
      
      Signed-off-by: default avatarShawn Bohrer <sbohrer@rgmadvisors.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fbf8866d
  20. 28 Sep, 2013 1 commit
    • Francesco Fusco's avatar
      ipv4: processing ancillary IP_TOS or IP_TTL · aa661581
      Francesco Fusco authored
      
      
      If IP_TOS or IP_TTL are specified as ancillary data, then sendmsg() sends out
      packets with the specified TTL or TOS overriding the socket values specified
      with the traditional setsockopt().
      
      The struct inet_cork stores the values of TOS, TTL and priority that are
      passed through the struct ipcm_cookie. If there are user-specified TOS
      (tos != -1) or TTL (ttl != 0) in the struct ipcm_cookie, these values are
      used to override the per-socket values. In case of TOS also the priority
      is changed accordingly.
      
      Two helper functions get_rttos and get_rtconn_flags are defined to take
      into account the presence of a user specified TOS value when computing
      RT_TOS and RT_CONN_FLAGS.
      
      Signed-off-by: default avatarFrancesco Fusco <ffusco@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      aa661581
  21. 24 Sep, 2013 1 commit
  22. 19 Sep, 2013 1 commit
    • Ansis Atteka's avatar
      ip: generate unique IP identificator if local fragmentation is allowed · 703133de
      Ansis Atteka authored
      
      
      If local fragmentation is allowed, then ip_select_ident() and
      ip_select_ident_more() need to generate unique IDs to ensure
      correct defragmentation on the peer.
      
      For example, if IPsec (tunnel mode) has to encrypt large skbs
      that have local_df bit set, then all IP fragments that belonged
      to different ESP datagrams would have used the same identificator.
      If one of these IP fragments would get lost or reordered, then
      peer could possibly stitch together wrong IP fragments that did
      not belong to the same datagram. This would lead to a packet loss
      or data corruption.
      
      Signed-off-by: default avatarAnsis Atteka <aatteka@nicira.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      703133de
  23. 29 Aug, 2013 1 commit
  24. 15 Aug, 2013 1 commit
  25. 28 Feb, 2013 1 commit
    • Sasha Levin's avatar
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin authored
      
      
      I'm not sure why, but the hlist for each entry iterators were conceived
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      they don't really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foudnation.org: redo intrusive kvm changes]
      Tested-by: default avatarPeter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b67bfe0d
  26. 18 Feb, 2013 2 commits
  27. 21 Jan, 2013 1 commit
  28. 25 Sep, 2012 1 commit
  29. 24 Sep, 2012 1 commit
    • Eric Dumazet's avatar
      net: use a per task frag allocator · 5640f768
      Eric Dumazet authored
      
      
      We currently use a per socket order-0 page cache for tcp_sendmsg()
      operations.
      
      This page is used to build fragments for skbs.
      
      Its done to increase probability of coalescing small write() into
      single segments in skbs still in write queue (not yet sent)
      
      But it wastes a lot of memory for applications handling many mostly
      idle sockets, since each socket holds one page in sk->sk_sndmsg_page
      
      Its also quite inefficient to build TSO 64KB packets, because we need
      about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
      page allocator more than wanted.
      
      This patch adds a per task frag allocator and uses bigger pages,
      if available. An automatic fallback is done in case of memory pressure.
      
      (up to 32768 bytes per frag, thats order-3 pages on x86)
      
      This increases TCP stream performance by 20% on loopback device,
      but also benefits on other network devices, since 8x less frags are
      mapped on transmit and unmapped on tx completion. Alexander Duyck
      mentioned a probable performance win on systems with IOMMU enabled.
      
      Its possible some SG enabled hardware cant cope with bigger fragments,
      but their ndo_start_xmit() should already handle this, splitting a
      fragment in sub fragments, since some arches have PAGE_SIZE=65536
      
      Successfully tested on various ethernet devices.
      (ixgbe, igb, bnx2x, tg3, mellanox mlx4)
      
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: default avatarVijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5640f768
  30. 22 Sep, 2012 1 commit
  31. 15 Aug, 2012 1 commit
  32. 12 Jul, 2012 1 commit
  33. 15 Jun, 2012 1 commit
    • David S. Miller's avatar
      ipv4: Handle PMTU in all ICMP error handlers. · 36393395
      David S. Miller authored
      
      
      With ip_rt_frag_needed() removed, we have to explicitly update PMTU
      information in every ICMP error handler.
      
      Create two helper functions to facilitate this.
      
      1) ipv4_sk_update_pmtu()
      
         This updates the PMTU when we have a socket context to
         work with.
      
      2) ipv4_update_pmtu()
      
         Raw version, used when no socket context is available.  For this
         interface, we essentially just pass in explicit arguments for
         the flow identity information we would have extracted from the
         socket.
      
         And you'll notice that ipv4_sk_update_pmtu() is simply implemented
         in terms of ipv4_update_pmtu()
      
      Note that __ip_route_output_key() is used, rather than something like
      ip_route_output_flow() or ip_route_output_key().  This is because we
      absolutely do not want to end up with a route that does IPSEC
      encapsulation and the like.  Instead, we only want the route that
      would get us to the node described by the outermost IP header.
      
      Reported-by: default avatarSteffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      36393395