1. 21 Feb, 2018 1 commit
    • Eric Dumazet's avatar
      tcp: switch to GSO being always on · 0a6b2a1d
      Eric Dumazet authored
      Oleksandr Natalenko reported performance issues with BBR without FQ
      packet scheduler that were root caused to lack of SG and GSO/TSO on
      his configuration.
      In this mode, TCP internal pacing has to setup a high resolution timer
      for each MSS sent.
      We could implement in TCP a strategy similar to the one adopted
      in commit fefa569a
       ("net_sched: sch_fq: account for schedule/timers drifts")
      or decide to finally switch TCP stack to a GSO only mode.
      This has many benefits :
      1) Most TCP developments are done with TSO in mind.
      2) Less high-resolution timers needs to be armed for TCP-pacing
      3) GSO can benefit of xmit_more hint
      4) Receiver GRO is more effective (as if TSO was used for real on sender)
         -> Lower ACK traffic
      5) Write queues have less overhead (one skb holds about 64KB of payload)
      6) SACK coalescing just works.
      7) rtx rb-tree contains less packets, SACK is cheaper.
      This patch implements the minimum patch, but we can remove some legacy
      code as follow ups.
      On 40Gbit link, one netperf -t TCP_STREAM
      sg on:  26 Gbits/sec
      sg off: 15.7 Gbits/sec   (was 2.3 Gbit before patch)
      sg on:  24.2 Gbits/sec
      sg off: 14.9 Gbits/sec  (was 0.66 Gbit before patch !!! )
      sg on:  24.4 Gbits/sec
      sg off: 15 Gbits/sec  (was 0.66 Gbit before patch !!! )
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  2. 12 Feb, 2018 1 commit
    • Denys Vlasenko's avatar
      net: make getname() functions return length rather than use int* parameter · 9b2c45d4
      Denys Vlasenko authored
      Changes since v1:
      Added changes in these files:
      All these functions either return a negative error indicator,
      or store length of sockaddr into "int *socklen" parameter
      and return zero on success.
      "int *socklen" parameter is awkward. For example, if caller does not
      care, it still needs to provide on-stack storage for the value
      it does not need.
      None of the many FOO_getname() functions of various protocols
      ever used old value of *socklen. They always just overwrite it.
      This change drops this parameter, and makes all these functions, on success,
      return length of sockaddr. It's always >= 0 and can be differentiated
      from an error.
      Tests in callers are changed from "if (err)" to "if (err < 0)", where needed.
      rpc_sockname() lost "int buflen" parameter, since its only use was
      to be passed to kernel_getsockname() as &buflen and subsequently
      not used in any way.
      Userspace API is not changed.
          text    data     bss      dec     hex filename
      30108430 2633624  873672 33615726 200ef6e vmlinux.before.o
      30108109 2633612  873672 33615393 200ee21 vmlinux.o
      Signed-off-by: default avatarDenys Vlasenko <dvlasenk@redhat.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: linux-kernel@vger.kernel.org
      CC: netdev@vger.kernel.org
      CC: linux-bluetooth@vger.kernel.org
      CC: linux-decnet-user@lists.sourceforge.net
      CC: linux-wireless@vger.kernel.org
      CC: linux-rdma@vger.kernel.org
      CC: linux-sctp@vger.kernel.org
      CC: linux-nfs@vger.kernel.org
      CC: linux-x25@vger.kernel.org
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  3. 18 Jan, 2018 1 commit
  4. 15 Jan, 2018 1 commit
    • David Windsor's avatar
      net: Define usercopy region in struct proto slab cache · 30c2c9f1
      David Windsor authored
      In support of usercopy hardening, this patch defines a region in the
      struct proto slab cache in which userspace copy operations are allowed.
      Some protocols need to copy objects to/from userspace, and they can
      declare the region via their proto structure with the new usersize and
      useroffset fields. Initially, if no region is specified (usersize ==
      0), the entire field is marked as whitelisted. This allows protocols
      to be whitelisted in subsequent patches. Once all protocols have been
      annotated, the full-whitelist default can be removed.
      This region is known as the slab cache's usercopy region. Slab caches
      can now check that each dynamically sized copy operation involving
      cache-managed memory falls entirely within the slab's usercopy region.
      This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
      whitelisting code in the last public patch of grsecurity/PaX based on my
      understanding of the code. Changes or omissions from the original code are
      mine and don't reflect the original grsecurity/PaX code.
      Signed-off-by: default avatarDavid Windsor <dave@nullcore.net>
      [kees: adjust commit log, split off per-proto patches]
      [kees: add logic for by-default full-whitelist]
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: netdev@vger.kernel.org
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
  5. 08 Jan, 2018 1 commit
    • David Ahern's avatar
      net: ipv6: Allow connect to linklocal address from socket bound to vrf · 54dc3e33
      David Ahern authored
      Allow a process bound to a VRF to connect to a linklocal address.
      Currently, this fails because of a mismatch between the scope of the
      linklocal address and the sk_bound_dev_if inherited by the VRF binding:
          $ ssh -6 fe80::70b8:cff:fedd:ead8%eth1
          ssh: connect to host fe80::70b8:cff:fedd:ead8%eth1 port 22: Invalid argument
      Relax the scope check to allow the socket to be bound to the same L3
      device as the scope id.
      This makes ipv6 linklocal consistent with other relaxed checks enabled
      by commits 1ff23bee ("net: l3mdev: Allow send on enslaved interface")
      and 7bb387c5
       ("net: Allow IP_MULTICAST_IF to set index to L3 slave").
      Signed-off-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  6. 28 Dec, 2017 1 commit
  7. 20 Dec, 2017 1 commit
  8. 19 Dec, 2017 1 commit
    • Tonghao Zhang's avatar
      sock: Move the socket inuse to namespace. · 648845ab
      Tonghao Zhang authored
      In some case, we want to know how many sockets are in use in
      different _net_ namespaces. It's a key resource metric.
      This patch add a member in struct netns_core. This is a counter
      for socket-inuse in the _net_ namespace. The patch will add/sub
      counter in the sk_alloc, sk_clone_lock and __sk_free.
      This patch will not counter the socket created in kernel.
      It's not very useful for userspace to know how many kernel
      sockets we created.
      The main reasons for doing this are that:
      1. When linux calls the 'do_exit' for process to exit, the functions
      'exit_task_namespaces' and 'exit_task_work' will be called sequentially.
      'exit_task_namespaces' may have destroyed the _net_ namespace, but
      'sock_release' called in 'exit_task_work' may use the _net_ namespace
      if we counter the socket-inuse in sock_release.
      2. socket and sock are in pair. More important, sock holds the _net_
      namespace. We counter the socket-inuse in sock, for avoiding holding
      _net_ namespace again in socket. It's a easy way to maintain the code.
      Signed-off-by: default avatarMartin Zhang <zhangjunweimartin@didichuxing.com>
      Signed-off-by: default avatarTonghao Zhang <zhangtonghao@didichuxing.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  9. 13 Dec, 2017 1 commit
  10. 05 Dec, 2017 1 commit
    • Eric Dumazet's avatar
      net: remove hlist_nulls_add_tail_rcu() · d7efc6c1
      Eric Dumazet authored
      Alexander Potapenko reported use of uninitialized memory [1]
      This happens when inserting a request socket into TCP ehash,
      in __sk_nulls_add_node_rcu(), since sk_reuseport is not initialized.
      Bug was added by commit d894ba18 ("soreuseport: fix ordering for
      mixed v4/v6 sockets")
      Note that d296ba60 ("soreuseport: Resolve merge conflict for v4/v6
      ordering fix") missed the opportunity to get rid of
      hlist_nulls_add_tail_rcu() :
      Both UDP sockets and TCP/DCCP listeners no longer use
      __sk_nulls_add_node_rcu() for their hash insertion.
      Since all other sockets have unique 4-tuple, the reuseport status
      has no special meaning, so we can always use hlist_nulls_add_head_rcu()
      for them and save few cycles/instructions.
      BUG: KMSAN: use of uninitialized memory in inet_ehash_insert+0xd40/0x1050
      CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0+ #3288
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:16
       dump_stack+0x185/0x1d0 lib/dump_stack.c:52
       kmsan_report+0x13f/0x1c0 mm/kmsan/kmsan.c:1016
       __msan_warning_32+0x69/0xb0 mm/kmsan/kmsan_instr.c:766
       __sk_nulls_add_node_rcu ./include/net/sock.h:684
       inet_ehash_insert+0xd40/0x1050 net/ipv4/inet_hashtables.c:413
       reqsk_queue_hash_req net/ipv4/inet_connection_sock.c:754
       inet_csk_reqsk_queue_hash_add+0x1cc/0x300 net/ipv4/inet_connection_sock.c:765
       tcp_conn_request+0x31e7/0x36f0 net/ipv4/tcp_input.c:6414
       tcp_v4_conn_request+0x16d/0x220 net/ipv4/tcp_ipv4.c:1314
       tcp_rcv_state_process+0x42a/0x7210 net/ipv4/tcp_input.c:5917
       tcp_v4_do_rcv+0xa6a/0xcd0 net/ipv4/tcp_ipv4.c:1483
       tcp_v4_rcv+0x3de0/0x4ab0 net/ipv4/tcp_ipv4.c:1763
       ip_local_deliver_finish+0x6bb/0xcb0 net/ipv4/ip_input.c:216
       NF_HOOK ./include/linux/netfilter.h:248
       ip_local_deliver+0x3fa/0x480 net/ipv4/ip_input.c:257
       dst_input ./include/net/dst.h:477
       ip_rcv_finish+0x6fb/0x1540 net/ipv4/ip_input.c:397
       NF_HOOK ./include/linux/netfilter.h:248
       ip_rcv+0x10f6/0x15c0 net/ipv4/ip_input.c:488
       __netif_receive_skb_core+0x36f6/0x3f60 net/core/dev.c:4298
       __netif_receive_skb net/core/dev.c:4336
       netif_receive_skb_internal+0x63c/0x19c0 net/core/dev.c:4497
       napi_skb_finish net/core/dev.c:4858
       napi_gro_receive+0x629/0xa50 net/core/dev.c:4889
       e1000_receive_skb drivers/net/ethernet/intel/e1000/e1000_main.c:4018
       e1000_clean+0x43aa/0x5970 drivers/net/ethernet/intel/e1000/e1000_main.c:3819
       napi_poll net/core/dev.c:5500
       net_rx_action+0x73c/0x1820 net/core/dev.c:5566
       __do_softirq+0x4b4/0x8dd kernel/softirq.c:284
       invoke_softirq kernel/softirq.c:364
       irq_exit+0x203/0x240 kernel/softirq.c:405
       exiting_irq+0xe/0x10 ./arch/x86/include/asm/apic.h:638
       do_IRQ+0x15e/0x1a0 arch/x86/kernel/irq.c:263
      Fixes: d894ba18 ("soreuseport: fix ordering for mixed v4/v6 sockets")
      Fixes: d296ba60
       ("soreuseport: Resolve merge conflict for v4/v6 ordering fix")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarAlexander Potapenko <glider@google.com>
      Acked-by: default avatarCraig Gallek <kraig@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  11. 27 Nov, 2017 1 commit
  12. 16 Nov, 2017 2 commits
  13. 14 Nov, 2017 1 commit
    • Eric Dumazet's avatar
      tcp: allow drivers to tweak TSQ logic · 3a9b76fd
      Eric Dumazet authored
      I had many reports that TSQ logic breaks wifi aggregation.
      Current logic is to allow up to 1 ms of bytes to be queued into qdisc
      and drivers queues.
      But Wifi aggregation needs a bigger budget to allow bigger rates to
      be discovered by various TCP Congestion Controls algorithms.
      This patch adds an extra socket field, allowing wifi drivers to select
      another log scale to derive TCP Small Queue credit from current pacing
      Initial value is 10, meaning that this patch does not change current
      We expect wifi drivers to set this field to smaller values (tests have
      been done with values from 6 to 9)
      They would have to use following template :
      if (skb->sk && skb->sk->sk_pacing_shift != MY_PACING_SHIFT)
           skb->sk->sk_pacing_shift = MY_PACING_SHIFT;
      Ref: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1670041
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Johannes Berg <johannes.berg@intel.com>
      Cc: Toke Høiland-Jørgensen <toke@toke.dk>
      Cc: Kir Kolyshkin <kir@openvz.org>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  14. 10 Nov, 2017 1 commit
  15. 24 Oct, 2017 1 commit
  16. 06 Oct, 2017 1 commit
    • Eric Dumazet's avatar
      tcp: implement rb-tree based retransmit queue · 75c119af
      Eric Dumazet authored
      Using a linear list to store all skbs in write queue has been okay
      for quite a while : O(N) is not too bad when N < 500.
      Things get messy when N is the order of 100,000 : Modern TCP stacks
      want 10Gbit+ of throughput even with 200 ms RTT flows.
      40 ns per cache line miss means a full scan can use 4 ms,
      blowing away CPU caches.
      SACK processing often can use various hints to avoid parsing
      whole retransmit queue. But with high packet losses and/or high
      reordering, hints no longer work.
      Sender has to process thousands of unfriendly SACK, accumulating
      a huge socket backlog, burning a cpu and massively dropping packets.
      Using an rb-tree for retransmit queue has been avoided for years
      because it added complexity and overhead, but now is the time
      to be more resistant and say no to quadratic behavior.
      1) RTX queue is no longer part of the write queue : already sent skbs
      are stored in one rb-tree.
      2) Since reaching the head of write queue no longer needs
      sk->sk_send_head, we added an union of sk_send_head and tcp_rtx_queue
       On receiver :
       netem on ingress : delay 150ms 200us loss 1
       GRO disabled to force stress and SACK storms.
      for f in `seq 1 10`
       ./netperf -H lpaa6 -l30 -- -K bbr -o THROUGHPUT|tail -1
      done | awk '{print $0} {sum += $0} END {printf "%7u\n",sum}'
      Before patch :
      After patch:
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  17. 22 Sep, 2017 1 commit
    • Eric Dumazet's avatar
      net: prevent dst uses after free · 222d7dbd
      Eric Dumazet authored
      In linux-4.13, Wei worked hard to convert dst to a traditional
      refcounted model, removing GC.
      We now want to make sure a dst refcount can not transition from 0 back
      to 1.
      The problem here is that input path attached a not refcounted dst to an
      skb. Then later, because packet is forwarded and hits skb_dst_force()
      before exiting RCU section, we might try to take a refcount on one dst
      that is about to be freed, if another cpu saw 1 -> 0 transition in
      dst_release() and queued the dst for freeing after one RCU grace period.
      Lets unify skb_dst_force() and skb_dst_force_safe(), since we should
      always perform the complete check against dst refcount, and not assume
      it is not zero.
      Bugzilla : https://bugzilla.kernel.org/show_bug.cgi?id=197005
      [  989.919496]  skb_dst_force+0x32/0x34
      [  989.919498]  __dev_queue_xmit+0x1ad/0x482
      [  989.919501]  ? eth_header+0x28/0xc6
      [  989.919502]  dev_queue_xmit+0xb/0xd
      [  989.919504]  neigh_connected_output+0x9b/0xb4
      [  989.919507]  ip_finish_output2+0x234/0x294
      [  989.919509]  ? ipt_do_table+0x369/0x388
      [  989.919510]  ip_finish_output+0x12c/0x13f
      [  989.919512]  ip_output+0x53/0x87
      [  989.919513]  ip_forward_finish+0x53/0x5a
      [  989.919515]  ip_forward+0x2cb/0x3e6
      [  989.919516]  ? pskb_trim_rcsum.part.9+0x4b/0x4b
      [  989.919518]  ip_rcv_finish+0x2e2/0x321
      [  989.919519]  ip_rcv+0x26f/0x2eb
      [  989.919522]  ? vlan_do_receive+0x4f/0x289
      [  989.919523]  __netif_receive_skb_core+0x467/0x50b
      [  989.919526]  ? tcp_gro_receive+0x239/0x239
      [  989.919529]  ? inet_gro_receive+0x226/0x238
      [  989.919530]  __netif_receive_skb+0x4d/0x5f
      [  989.919532]  netif_receive_skb_internal+0x5c/0xaf
      [  989.919533]  napi_gro_receive+0x45/0x81
      [  989.919536]  ixgbe_poll+0xc8a/0xf09
      [  989.919539]  ? kmem_cache_free_bulk+0x1b6/0x1f7
      [  989.919540]  net_rx_action+0xf4/0x266
      [  989.919543]  __do_softirq+0xa8/0x19d
      [  989.919545]  irq_exit+0x5d/0x6b
      [  989.919546]  do_IRQ+0x9c/0xb5
      [  989.919548]  common_interrupt+0x93/0x93
      [  989.919548]  </IRQ>
      Similarly dst_clone() can use dst_hold() helper to have additional
      debugging, as a follow up to commit 44ebe791 ("net: add debug
      atomic_inc_not_zero() in dst_hold()")
      In net-next we will convert dst atomic_t to refcount_t for peace of
      Fixes: a4c2fd7f
       ("net: remove DST_NOCACHE flag")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Reported-by: default avatarPaweł Staszewski <pstaszewski@itcare.pl>
      Bisected-by: default avatarPaweł Staszewski <pstaszewski@itcare.pl>
      Acked-by: default avatarWei Wang <weiwan@google.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  18. 29 Aug, 2017 1 commit
  19. 18 Aug, 2017 1 commit
    • Matthew Dawson's avatar
      datagram: When peeking datagrams with offset < 0 don't skip empty skbs · a0917e0b
      Matthew Dawson authored
      Due to commit e6afc8ac
       ("udp: remove
      headers from UDP packets before queueing"), when udp packets are being
      peeked the requested extra offset is always 0 as there is no need to skip
      the udp header.  However, when the offset is 0 and the next skb is
      of length 0, it is only returned once.  The behaviour can be seen with
      the following python script:
      from socket import *;
      f=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
      g=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
      f.bind(('::', 0));
      addr=('::1', f.getsockname()[1]);
      g.sendto(b'', addr)
      g.sendto(b'b', addr)
      print(f.recvfrom(10, MSG_PEEK));
      print(f.recvfrom(10, MSG_PEEK));
      Where the expected output should be the empty string twice.
      Instead, make sk_peek_offset return negative values, and pass those values
      to __skb_try_recv_datagram/__skb_try_recv_from_queue.  If the passed offset
      to __skb_try_recv_from_queue is negative, the checked skb is never skipped.
      __skb_try_recv_from_queue will then ensure the offset is reset back to 0
      if a peek is requested without an offset, unless no packets are found.
      Also simplify the if condition in __skb_try_recv_from_queue.  If _off is
      greater then 0, and off is greater then or equal to skb->len, then
      (_off || skb->len) must always be true assuming skb->len >= 0 is always
      Also remove a redundant check around a call to sk_peek_offset in af_unix.c,
      as it double checked if MSG_PEEK was set in the flags.
       - Moved the negative fixup into __skb_try_recv_from_queue, and remove now
      redundant checks
       - Fix peeking in udp{,v6}_recvmsg to report the right value when the
      offset is 0
       - Marked new branch in __skb_try_recv_from_queue as unlikely.
      Signed-off-by: default avatarMatthew Dawson <matthew@mjdsystems.ca>
      Acked-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  20. 04 Aug, 2017 2 commits
    • Willem de Bruijn's avatar
      sock: add MSG_ZEROCOPY · 52267790
      Willem de Bruijn authored
      The kernel supports zerocopy sendmsg in virtio and tap. Expand the
      infrastructure to support other socket types. Introduce a completion
      notification channel over the socket error queue. Notifications are
      returned with ee_origin SO_EE_ORIGIN_ZEROCOPY. ee_errno is 0 to avoid
      blocking the send/recv path on receiving notifications.
      Add reference counting, to support the skb split, merge, resize and
      clone operations possible with SOCK_STREAM and other socket types.
      The patch does not yet modify any datapaths.
      Signed-off-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Willem de Bruijn's avatar
      sock: allocate skbs from optmem · 98ba0bd5
      Willem de Bruijn authored
      Add sock_omalloc and sock_ofree to be able to allocate control skbs,
      for instance for looping errors onto sk_error_queue.
      The transmit budget (sk_wmem_alloc) is involved in transmit skb
      shaping, most notably in TCP Small Queues. Using this budget for
      control packets would impact transmission.
      Signed-off-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  21. 01 Aug, 2017 1 commit
  22. 12 Jul, 2017 1 commit
  23. 08 Jul, 2017 1 commit
  24. 01 Jul, 2017 2 commits
  25. 30 Jun, 2017 1 commit
    • Kees Cook's avatar
      randstruct: Mark various structs for randomization · 3859a271
      Kees Cook authored
      This marks many critical kernel structures for randomization. These are
      structures that have been targeted in the past in security exploits, or
      contain functions pointers, pointers to function pointer tables, lists,
      workqueues, ref-counters, credentials, permissions, or are otherwise
      sensitive. This initial list was extracted from Brad Spengler/PaX Team's
      code in the last public patch of grsecurity/PaX based on my understanding
      of the code. Changes or omissions from the original code are mine and
      don't reflect the original grsecurity/PaX code.
      Left out of this list is task_struct, which requires special handling
      and will be covered in a subsequent patch.
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
  26. 21 Jun, 2017 1 commit
    • Paolo Abeni's avatar
      sock: avoid dirtying incoming_cpu if not needed · 34cfb542
      Paolo Abeni authored
      for connected socket, the incoming_cpu field in the sock struct
      is not going to change frequently, but we are setting it
      unconditionally for each packet.
      Since sk_incoming_cpu and sk_flags share the same cacheline,
      and the latter is access by udp_recvmsg(), this cause a cache
      miss for each packet for UDP connected socket.
      With this patch, we set the incoming cpu field only when the
      ingress cpu really changes.
      This gives a small but measurable performance improvement for
      connected UDP socket.
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  27. 08 Jun, 2017 1 commit
    • Eric Dumazet's avatar
      tcp: add TCPMemoryPressuresChrono counter · 06044751
      Eric Dumazet authored
      DRAM supply shortage and poor memory pressure tracking in TCP
      stack makes any change in SO_SNDBUF/SO_RCVBUF (or equivalent autotuning
      limits) and tcp_mem[] quite hazardous.
      TCPMemoryPressures SNMP counter is an indication of tcp_mem sysctl
      limits being hit, but only tracking number of transitions.
      If TCP stack behavior under stress was perfect :
      1) It would maintain memory usage close to the limit.
      2) Memory pressure state would be entered for short times.
      We certainly prefer 100 events lasting 10ms compared to one event
      lasting 200 seconds.
      This patch adds a new SNMP counter tracking cumulative duration of
      memory pressure events, given in ms units.
      $ cat /proc/sys/net/ipv4/tcp_mem
      3088    4117    6176
      $ grep TCP /proc/net/sockstat
      TCP: inuse 180 orphan 0 tw 2 alloc 234 mem 4140
      $ nstat -n ; sleep 10 ; nstat |grep Pressure
      TcpExtTCPMemoryPressures        1700
      TcpExtTCPMemoryPressuresChrono  5209
      v2: Used EXPORT_SYMBOL_GPL() instead of EXPORT_SYMBOL() as David
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  28. 16 May, 2017 3 commits
    • Eric Dumazet's avatar
      tcp: internal implementation for pacing · 218af599
      Eric Dumazet authored
      BBR congestion control depends on pacing, and pacing is
      currently handled by sch_fq packet scheduler for performance reasons,
      and also because implemening pacing with FQ was convenient to truly
      avoid bursts.
      However there are many cases where this packet scheduler constraint
      is not practical.
      - Many linux hosts are not focusing on handling thousands of TCP
        flows in the most efficient way.
      - Some routers use fq_codel or other AQM, but still would like
        to use BBR for the few TCP flows they initiate/terminate.
      This patch implements an automatic fallback to internal pacing.
      Pacing is requested either by BBR or use of SO_MAX_PACING_RATE option.
      If sch_fq happens to be in the egress path, pacing is delegated to
      the qdisc, otherwise pacing is done by TCP itself.
      One advantage of pacing from TCP stack is to get more precise rtt
      estimations, and less work done from TX completion, since TCP Small
      queue limits are not generally hit. Setups with single TX queue but
      many cpus might even benefit from this.
      Note that unlike sch_fq, we do not take into account header sizes.
      Taking care of these headers would add additional complexity for
      no practical differences in behavior.
      Some performance numbers using 800 TCP_STREAM flows rate limited to
      ~48 Mbit per second on 40Gbit NIC.
      If MQ+pfifo_fast is used on the NIC :
      $ sar -n DEV 1 5 | grep eth
      14:48:44         eth0 725743.00 2932134.00  46776.76 4335184.68      0.00      0.00      1.00
      14:48:45         eth0 725349.00 2932112.00  46751.86 4335158.90      0.00      0.00      0.00
      14:48:46         eth0 725101.00 2931153.00  46735.07 4333748.63      0.00      0.00      0.00
      14:48:47         eth0 725099.00 2931161.00  46735.11 4333760.44      0.00      0.00      1.00
      14:48:48         eth0 725160.00 2931731.00  46738.88 4334606.07      0.00      0.00      0.00
      Average:         eth0 725290.40 2931658.20  46747.54 4334491.74      0.00      0.00      0.40
      $ vmstat 1 5
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       4  0      0 259825920  45644 2708324    0    0    21     2  247   98  0  0 100  0  0
       4  0      0 259823744  45644 2708356    0    0     0     0 2400825 159843  0 19 81  0  0
       0  0      0 259824208  45644 2708072    0    0     0     0 2407351 159929  0 19 81  0  0
       1  0      0 259824592  45644 2708128    0    0     0     0 2405183 160386  0 19 80  0  0
       1  0      0 259824272  45644 2707868    0    0     0    32 2396361 158037  0 19 81  0  0
      Now use MQ+FQ :
      lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
      lpaa23:~# tc qdisc replace dev eth0 root mq
      $ sar -n DEV 1 5 | grep eth
      14:49:57         eth0 678614.00 2727930.00  43739.13 4033279.14      0.00      0.00      0.00
      14:49:58         eth0 677620.00 2723971.00  43674.69 4027429.62      0.00      0.00      1.00
      14:49:59         eth0 676396.00 2719050.00  43596.83 4020125.02      0.00      0.00      0.00
      14:50:00         eth0 675197.00 2714173.00  43518.62 4012938.90      0.00      0.00      1.00
      14:50:01         eth0 676388.00 2719063.00  43595.47 4020171.64      0.00      0.00      0.00
      Average:         eth0 676843.00 2720837.40  43624.95 4022788.86      0.00      0.00      0.40
      $ vmstat 1 5
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       2  0      0 259832240  46008 2710912    0    0    21     2  223  192  0  1 99  0  0
       1  0      0 259832896  46008 2710744    0    0     0     0 1702206 198078  0 17 82  0  0
       0  0      0 259830272  46008 2710596    0    0     0     0 1696340 197756  1 17 83  0  0
       4  0      0 259829168  46024 2710584    0    0    16     0 1688472 197158  1 17 82  0  0
       3  0      0 259830224  46024 2710408    0    0     0     0 1692450 197212  0 18 82  0  0
      As expected, number of interrupts per second is very different.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Cc: Jerry Chu <hkchu@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Paolo Abeni's avatar
      net/sock: factor out dequeue/peek with offset code · 65101aec
      Paolo Abeni authored
      And update __sk_queue_drop_skb() to work on the specified queue.
      This will help the udp protocol to use an additional private
      rx queue in a later patch.
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Mauro Carvalho Chehab's avatar
      net: fix some identation issues at kernel-doc markups · d651983d
      Mauro Carvalho Chehab authored
      Sphinx is very pedantic with regards to identation and
      escape sequences:
        ./include/net/sock.h:1967: ERROR: Unexpected indentation.
        ./include/net/sock.h:1969: ERROR: Unexpected indentation.
        ./include/net/sock.h:1970: WARNING: Block quote ends without a blank line; unexpected unindent.
        ./include/net/sock.h:1971: WARNING: Block quote ends without a blank line; unexpected unindent.
        ./include/net/sock.h:2268: WARNING: Inline emphasis start-string without end-string.
        ./net/core/sock.c:2686: ERROR: Unexpected indentation.
        ./net/core/sock.c:2687: WARNING: Block quote ends without a blank line; unexpected unindent.
        ./net/core/datagram.c:182: WARNING: Inline emphasis start-string without end-string.
        ./include/linux/netdevice.h:1444: ERROR: Unexpected indentation.
        ./drivers/net/phy/phy.c:381: ERROR: Unexpected indentation.
        ./drivers/net/phy/phy.c:382: WARNING: Block quote ends without a blank line; unexpected unindent.
      - Fix spacing where needed;
      - Properly escape constants;
      - Use a literal block for a race description.
      No functional changes.
      Signed-off-by: default avatarMauro Carvalho Chehab <mchehab@s-opensource.com>
  29. 18 Apr, 2017 1 commit
    • Paul E. McKenney's avatar
      mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU · 5f0d5a3a
      Paul E. McKenney authored
      A group of Linux kernel hackers reported chasing a bug that resulted
      from their assumption that SLAB_DESTROY_BY_RCU provided an existence
      guarantee, that is, that no block from such a slab would be reallocated
      during an RCU read-side critical section.  Of course, that is not the
      case.  Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
      slab of blocks.
      However, there is a phrase for this, namely "type safety".  This commit
      therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
      to avoid future instances of this sort of confusion.
      Signed-off-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: <linux-mm@kvack.org>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      [ paulmck: Add comments mentioning the old name, as requested by Eric
        Dumazet, in order to help people familiar with the old name find
        the new one. ]
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
  30. 03 Apr, 2017 1 commit
  31. 31 Mar, 2017 1 commit
    • Paolo Abeni's avatar
      sock: avoid dirtying sk_stamp, if possible · 6c7c98ba
      Paolo Abeni authored
      sock_recv_ts_and_drops() unconditionally set sk->sk_stamp for
      every packet, even if the SOCK_TIMESTAMP flag is not set in the
      related socket.
      If selinux is enabled, this cause a cache miss for every packet
      since sk->sk_stamp and sk->sk_security share the same cacheline.
      With this change sk_stamp is set only if the SOCK_TIMESTAMP
      flag is set, and is cleared for the first packet, so that the user
      perceived behavior is unchanged.
      This gives up to 5% speed-up under udp-flood with small packets.
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  32. 22 Mar, 2017 1 commit
  33. 10 Mar, 2017 1 commit
    • David Howells's avatar
      net: Work around lockdep limitation in sockets that use sockets · cdfbabfb
      David Howells authored
      Lockdep issues a circular dependency warning when AFS issues an operation
      through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.
      The theory lockdep comes up with is as follows:
       (1) If the pagefault handler decides it needs to read pages from AFS, it
           calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
           creating a call requires the socket lock:
      	mmap_sem must be taken before sk_lock-AF_RXRPC
       (2) afs_open_socket() opens an AF_RXRPC socket and binds it.  rxrpc_bind()
           binds the underlying UDP socket whilst holding its socket lock.
           inet_bind() takes its own socket lock:
      	sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET
       (3) Reading from a TCP socket into a userspace buffer might cause a fault
           and thus cause the kernel to take the mmap_sem, but the TCP socket is
           locked whilst doing this:
      	sk_lock-AF_INET must be taken before mmap_sem
      However, lockdep's theory is wrong in this instance because it deals only
      with lock classes and not individual locks.  The AF_INET lock in (2) isn't
      really equivalent to the AF_INET lock in (3) as the former deals with a
      socket entirely internal to the kernel that never sees userspace.  This is
      a limitation in the design of lockdep.
      Fix the general case by:
       (1) Double up all the locking keys used in sockets so that one set are
           used if the socket is created by userspace and the other set is used
           if the socket is created by the kernel.
       (2) Store the kern parameter passed to sk_alloc() in a variable in the
           sock struct (sk_kern_sock).  This informs sock_lock_init(),
           sock_init_data() and sk_clone_lock() as to the lock keys to be used.
           Note that the child created by sk_clone_lock() inherits the parent's
           kern setting.
       (3) Add a 'kern' parameter to ->accept() that is analogous to the one
           passed in to ->create() that distinguishes whether kernel_accept() or
           sys_accept4() was the caller and can be passed to sk_alloc().
           Note that a lot of accept functions merely dequeue an already
           allocated socket.  I haven't touched these as the new socket already
           exists before we get the parameter.
           Note also that there are a couple of places where I've made the accepted
           socket unconditionally kernel-based:
           because they follow a sock_create_kern() and accept off of that.
      Whilst creating this, I noticed that lustre and ocfs don't create sockets
      through sock_create_kern() and thus they aren't marked as for-kernel,
      though they appear to be internal.  I wonder if these should do that so
      that they use the new set of lock keys.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  34. 09 Mar, 2017 1 commit
  35. 02 Mar, 2017 1 commit