1. 05 Dec, 2018 1 commit
    • Eric Dumazet's avatar
      tcp: reduce POLLOUT events caused by TCP_NOTSENT_LOWAT · a74f0fa0
      Eric Dumazet authored
      
      
      TCP_NOTSENT_LOWAT socket option or sysctl was added in linux-3.12
      as a step to enable bigger tcp sndbuf limits.
      
      It works reasonably well, but the following happens :
      
      Once the limit is reached, TCP stack generates
      an [E]POLLOUT event for every incoming ACK packet.
      
      This causes a high number of context switches.
      
      This patch implements the strategy David Miller added
      in sock_def_write_space() :
      
       - If TCP socket has a notsent_lowat constraint of X bytes,
         allow sendmsg() to fill up to X bytes, but send [E]POLLOUT
         only if number of notsent bytes is below X/2
      
      This considerably reduces TCP_NOTSENT_LOWAT overhead,
      while allowing to keep the pipe full.
      
      Tested:
       100 ms RTT netem testbed between A and B, 100 concurrent TCP_STREAM
      
      A:/# cat /proc/sys/net/ipv4/tcp_wmem
      4096	262144	64000000
      A:/# super_netperf 100 -H B -l 1000 -- -K bbr &
      
      A:/# grep TCP /proc/net/sockstat
      TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 1364904 # This is about 54 MB of memory per flow :/
      
      A:/# vmstat 5 5
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       0  0      0 256220672  13532 694976    0    0    10     0   28   14  0  1 99  0  0
       2  0      0 256320016  13532 698480    0    0   512     0 715901 5927  0 10 90  0  0
       0  0      0 256197232  13532 700992    0    0   735    13 771161 5849  0 11 89  0  0
       1  0      0 256233824  13532 703320    0    0   512    23 719650 6635  0 11 89  0  0
       2  0      0 256226880  13532 705780    0    0   642     4 775650 6009  0 12 88  0  0
      
      A:/# echo 2097152 >/proc/sys/net/ipv4/tcp_notsent_lowat
      
      A:/# grep TCP /proc/net/sockstat
      TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 86411 # 3.5 MB per flow
      
      A:/# vmstat 5 5  # check that context switches have not inflated too much.
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       2  0      0 260386512  13592 662148    0    0    10     0   17   14  0  1 99  0  0
       0  0      0 260519680  13592 604184    0    0   512    13 726843 12424  0 10 90  0  0
       1  1      0 260435424  13592 598360    0    0   512    25 764645 12925  0 10 90  0  0
       1  0      0 260855392  13592 578380    0    0   512     7 722943 13624  0 11 88  0  0
       1  0      0 260445008  13592 601176    0    0   614    34 772288 14317  0 10 90  0  0
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a74f0fa0
  2. 23 Oct, 2018 1 commit
    • Karsten Graul's avatar
      Revert "net: simplify sock_poll_wait" · 89ab066d
      Karsten Graul authored
      This reverts commit dd979b4d
      
      .
      
      This broke tcp_poll for SMC fallback: An AF_SMC socket establishes an
      internal TCP socket for the initial handshake with the remote peer.
      Whenever the SMC connection can not be established this TCP socket is
      used as a fallback. All socket operations on the SMC socket are then
      forwarded to the TCP socket. In case of poll, the file->private_data
      pointer references the SMC socket because the TCP socket has no file
      assigned. This causes tcp_poll to wait on the wrong socket.
      Signed-off-by: default avatarKarsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      89ab066d
  3. 16 Oct, 2018 1 commit
    • Eric Dumazet's avatar
      net: extend sk_pacing_rate to unsigned long · 76a9ebe8
      Eric Dumazet authored
      
      
      sk_pacing_rate has beed introduced as a u32 field in 2013,
      effectively limiting per flow pacing to 34Gbit.
      
      We believe it is time to allow TCP to pace high speed flows
      on 64bit hosts, as we now can reach 100Gbit on one TCP flow.
      
      This patch adds no cost for 32bit kernels.
      
      The tcpi_pacing_rate and tcpi_max_pacing_rate were already
      exported as 64bit, so iproute2/ss command require no changes.
      
      Unfortunately the SO_MAX_PACING_RATE socket option will stay
      32bit and we will need to add a new option to let applications
      control high pacing rates.
      
      State      Recv-Q Send-Q Local Address:Port             Peer Address:Port
      ESTAB      0      1787144  10.246.9.76:49992             10.246.9.77:36741
                       timer:(on,003ms,0) ino:91863 sk:2 <->
       skmem:(r0,rb540000,t66440,tb2363904,f605944,w1822984,o0,bl0,d0)
       ts sack bbr wscale:8,8 rto:201 rtt:0.057/0.006 mss:1448
       rcvmss:536 advmss:1448
       cwnd:138 ssthresh:178 bytes_acked:256699822585 segs_out:177279177
       segs_in:3916318 data_segs_out:177279175
       bbr:(bw:31276.8Mbps,mrtt:0,pacing_gain:1.25,cwnd_gain:2)
       send 28045.5Mbps lastrcv:73333
       pacing_rate 38705.0Mbps delivery_rate 22997.6Mbps
       busy:73333ms unacked:135 retrans:0/157 rcv_space:14480
       notsent:2085120 minrtt:0.013
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      76a9ebe8
  4. 15 Oct, 2018 1 commit
    • Daniel Borkmann's avatar
      tls: convert to generic sk_msg interface · d829e9c4
      Daniel Borkmann authored
      Convert kTLS over to make use of sk_msg interface for plaintext and
      encrypted scattergather data, so it reuses all the sk_msg helpers
      and data structure which later on in a second step enables to glue
      this to BPF.
      
      This also allows to remove quite a bit of open coded helpers which
      are covered by the sk_msg API. Recent changes in kTLs 80ece6a0
      ("tls: Remove redundant vars from tls record structure") and
      4e6d4720
      
       ("tls: Add support for inplace records encryption")
      changed the data path handling a bit; while we've kept the latter
      optimization intact, we had to undo the former change to better
      fit the sk_msg model, hence the sg_aead_in and sg_aead_out have
      been brought back and are linked into the sk_msg sgs. Now the kTLS
      record contains a msg_plaintext and msg_encrypted sk_msg each.
      
      In the original code, the zerocopy_from_iter() has been used out
      of TX but also RX path. For the strparser skb-based RX path,
      we've left the zerocopy_from_iter() in decrypt_internal() mostly
      untouched, meaning it has been moved into tls_setup_from_iter()
      with charging logic removed (as not used from RX). Given RX path
      is not based on sk_msg objects, we haven't pursued setting up a
      dummy sk_msg to call into sk_msg_zerocopy_from_iter(), but it
      could be an option to prusue in a later step.
      
      Joint work with John.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      d829e9c4
  5. 03 Oct, 2018 1 commit
  6. 13 Sep, 2018 1 commit
  7. 31 Jul, 2018 1 commit
  8. 30 Jul, 2018 2 commits
  9. 07 Jul, 2018 1 commit
  10. 04 Jul, 2018 2 commits
    • Jesus Sanchez-Palencia's avatar
      net/sched: Make etf report drops on error_queue · 4b15c707
      Jesus Sanchez-Palencia authored
      
      
      Use the socket error queue for reporting dropped packets if the
      socket has enabled that feature through the SO_TXTIME API.
      
      Packets are dropped either on enqueue() if they aren't accepted by the
      qdisc or on dequeue() if the system misses their deadline. Those are
      reported as different errors so applications can react accordingly.
      
      Userspace can retrieve the errors through the socket error queue and the
      corresponding cmsg interfaces. A struct sock_extended_err* is used for
      returning the error data, and the packet's timestamp can be retrieved by
      adding both ee_data and ee_info fields as e.g.:
      
          ((__u64) serr->ee_data << 32) + serr->ee_info
      
      This feature is disabled by default and must be explicitly enabled by
      applications. Enabling it can bring some overhead for the Tx cycles
      of the application.
      Signed-off-by: default avatarJesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4b15c707
    • Richard Cochran's avatar
      net: Add a new socket option for a future transmit time. · 80b14dee
      Richard Cochran authored
      
      
      This patch introduces SO_TXTIME. User space enables this option in
      order to pass a desired future transmit time in a CMSG when calling
      sendmsg(2). The argument to this socket option is a 8-bytes long struct
      provided by the uapi header net_tstamp.h defined as:
      
      struct sock_txtime {
      	clockid_t 	clockid;
      	u32		flags;
      };
      
      Note that new fields were added to struct sock by filling a 2-bytes
      hole found in the struct. For that reason, neither the struct size or
      number of cachelines were altered.
      Signed-off-by: default avatarRichard Cochran <rcochran@linutronix.de>
      Signed-off-by: default avatarJesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      80b14dee
  11. 02 Jul, 2018 3 commits
  12. 26 May, 2018 1 commit
  13. 10 May, 2018 1 commit
  14. 01 May, 2018 1 commit
  15. 06 Apr, 2018 1 commit
  16. 31 Mar, 2018 1 commit
    • Andrey Ignatov's avatar
      bpf: Hooks for sys_connect · d74bad4e
      Andrey Ignatov authored
      
      
      == The problem ==
      
      See description of the problem in the initial patch of this patch set.
      
      == The solution ==
      
      The patch provides much more reliable in-kernel solution for the 2nd
      part of the problem: making outgoing connecttion from desired IP.
      
      It adds new attach types `BPF_CGROUP_INET4_CONNECT` and
      `BPF_CGROUP_INET6_CONNECT` for program type
      `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both
      source and destination of a connection at connect(2) time.
      
      Local end of connection can be bound to desired IP using newly
      introduced BPF-helper `bpf_bind()`. It allows to bind to only IP though,
      and doesn't support binding to port, i.e. leverages
      `IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this:
      * looking for a free port is expensive and can affect performance
        significantly;
      * there is no use-case for port.
      
      As for remote end (`struct sockaddr *` passed by user), both parts of it
      can be overridden, remote IP and remote port. It's useful if an
      application inside cgroup wants to connect to another application inside
      same cgroup or to itself, but knows nothing about IP assigned to the
      cgroup.
      
      Support is added for IPv4 and IPv6, for TCP and UDP.
      
      IPv4 and IPv6 have separate attach types for same reason as sys_bind
      hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields
      when user passes sockaddr_in since it'd be out-of-bound.
      
      == Implementation notes ==
      
      The patch introduces new field in `struct proto`: `pre_connect` that is
      a pointer to a function with same signature as `connect` but is called
      before it. The reason is in some cases BPF hooks should be called way
      before control is passed to `sk->sk_prot->connect`. Specifically
      `inet_dgram_connect` autobinds socket before calling
      `sk->sk_prot->connect` and there is no way to call `bpf_bind()` from
      hooks from e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since
      it'd cause double-bind. On the other hand `proto.pre_connect` provides a
      flexible way to add BPF hooks for connect only for necessary `proto` and
      call them at desired time before `connect`. Since `bpf_bind()` is
      allowed to bind only to IP and autobind in `inet_dgram_connect` binds
      only port there is no chance of double-bind.
      
      bpf_bind() sets `force_bind_address_no_port` to bind to only IP despite
      of value of `bind_address_no_port` socket field.
      
      bpf_bind() sets `with_lock` to `false` when calling to __inet_bind()
      and __inet6_bind() since all call-sites, where bpf_bind() is called,
      already hold socket lock.
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      d74bad4e
  17. 29 Mar, 2018 1 commit
    • John Fastabend's avatar
      bpf: sockmap redirect ingress support · 8934ce2f
      John Fastabend authored
      
      
      Add support for the BPF_F_INGRESS flag in sk_msg redirect helper.
      To do this add a scatterlist ring for receiving socks to check
      before calling into regular recvmsg call path. Additionally, because
      the poll wakeup logic only checked the skb recv queue we need to
      add a hook in TCP stack (similar to write side) so that we have
      a way to wake up polling socks when a scatterlist is redirected
      to that sock.
      
      After this all that is needed is for the redirect helper to
      push the scatterlist into the psock receive queue.
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      8934ce2f
  18. 19 Mar, 2018 2 commits
  19. 12 Mar, 2018 1 commit
    • Xin Long's avatar
      sock_diag: request _diag module only when the family or proto has been registered · bf2ae2e4
      Xin Long authored
      
      
      Now when using 'ss' in iproute, kernel would try to load all _diag
      modules, which also causes corresponding family and proto modules
      to be loaded as well due to module dependencies.
      
      Like after running 'ss', sctp, dccp, af_packet (if it works as a module)
      would be loaded.
      
      For example:
      
        $ lsmod|grep sctp
        $ ss
        $ lsmod|grep sctp
        sctp_diag              16384  0
        sctp                  323584  5 sctp_diag
        inet_diag              24576  4 raw_diag,tcp_diag,sctp_diag,udp_diag
        libcrc32c              16384  3 nf_conntrack,nf_nat,sctp
      
      As these family and proto modules are loaded unintentionally, it
      could cause some problems, like:
      
      - Some debug tools use 'ss' to collect the socket info, which loads all
        those diag and family and protocol modules. It's noisy for identifying
        issues.
      
      - Users usually expect to drop sctp init packet silently when they
        have no sense of sctp protocol instead of sending abort back.
      
      - It wastes resources (especially with multiple netns), and SCTP module
        can't be unloaded once it's loaded.
      
      ...
      
      In short, it's really inappropriate to have these family and proto
      modules loaded unexpectedly when just doing debugging with inet_diag.
      
      This patch is to introduce sock_load_diag_module() where it loads
      the _diag module only when it's corresponding family or proto has
      been already registered.
      
      Note that we can't just load _diag module without the family or
      proto loaded, as some symbols used in _diag module are from the
      family or proto module.
      
      v1->v2:
        - move inet proto check to inet_diag to avoid a compiling err.
      v2->v3:
        - define sock_load_diag_module in sock.c and export one symbol
          only.
        - improve the changelog.
      Reported-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Acked-by: default avatarMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: default avatarPhil Sutter <phil@nwl.cc>
      Acked-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bf2ae2e4
  20. 21 Feb, 2018 2 commits
    • Eric Dumazet's avatar
      tcp: remove sk_check_csum_caps() · dead7cdb
      Eric Dumazet authored
      
      
      Since TCP relies on GSO, we do not need this helper anymore.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      dead7cdb
    • Eric Dumazet's avatar
      tcp: switch to GSO being always on · 0a6b2a1d
      Eric Dumazet authored
      Oleksandr Natalenko reported performance issues with BBR without FQ
      packet scheduler that were root caused to lack of SG and GSO/TSO on
      his configuration.
      
      In this mode, TCP internal pacing has to setup a high resolution timer
      for each MSS sent.
      
      We could implement in TCP a strategy similar to the one adopted
      in commit fefa569a
      
       ("net_sched: sch_fq: account for schedule/timers drifts")
      or decide to finally switch TCP stack to a GSO only mode.
      
      This has many benefits :
      
      1) Most TCP developments are done with TSO in mind.
      2) Less high-resolution timers needs to be armed for TCP-pacing
      3) GSO can benefit of xmit_more hint
      4) Receiver GRO is more effective (as if TSO was used for real on sender)
         -> Lower ACK traffic
      5) Write queues have less overhead (one skb holds about 64KB of payload)
      6) SACK coalescing just works.
      7) rtx rb-tree contains less packets, SACK is cheaper.
      
      This patch implements the minimum patch, but we can remove some legacy
      code as follow ups.
      
      Tested:
      
      On 40Gbit link, one netperf -t TCP_STREAM
      
      BBR+fq:
      sg on:  26 Gbits/sec
      sg off: 15.7 Gbits/sec   (was 2.3 Gbit before patch)
      
      BBR+pfifo_fast:
      sg on:  24.2 Gbits/sec
      sg off: 14.9 Gbits/sec  (was 0.66 Gbit before patch !!! )
      
      BBR+fq_codel:
      sg on:  24.4 Gbits/sec
      sg off: 15 Gbits/sec  (was 0.66 Gbit before patch !!! )
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0a6b2a1d
  21. 12 Feb, 2018 1 commit
    • Denys Vlasenko's avatar
      net: make getname() functions return length rather than use int* parameter · 9b2c45d4
      Denys Vlasenko authored
      
      
      Changes since v1:
      Added changes in these files:
          drivers/infiniband/hw/usnic/usnic_transport.c
          drivers/staging/lustre/lnet/lnet/lib-socket.c
          drivers/target/iscsi/iscsi_target_login.c
          drivers/vhost/net.c
          fs/dlm/lowcomms.c
          fs/ocfs2/cluster/tcp.c
          security/tomoyo/network.c
      
      Before:
      All these functions either return a negative error indicator,
      or store length of sockaddr into "int *socklen" parameter
      and return zero on success.
      
      "int *socklen" parameter is awkward. For example, if caller does not
      care, it still needs to provide on-stack storage for the value
      it does not need.
      
      None of the many FOO_getname() functions of various protocols
      ever used old value of *socklen. They always just overwrite it.
      
      This change drops this parameter, and makes all these functions, on success,
      return length of sockaddr. It's always >= 0 and can be differentiated
      from an error.
      
      Tests in callers are changed from "if (err)" to "if (err < 0)", where needed.
      
      rpc_sockname() lost "int buflen" parameter, since its only use was
      to be passed to kernel_getsockname() as &buflen and subsequently
      not used in any way.
      
      Userspace API is not changed.
      
          text    data     bss      dec     hex filename
      30108430 2633624  873672 33615726 200ef6e vmlinux.before.o
      30108109 2633612  873672 33615393 200ee21 vmlinux.o
      Signed-off-by: default avatarDenys Vlasenko <dvlasenk@redhat.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: linux-kernel@vger.kernel.org
      CC: netdev@vger.kernel.org
      CC: linux-bluetooth@vger.kernel.org
      CC: linux-decnet-user@lists.sourceforge.net
      CC: linux-wireless@vger.kernel.org
      CC: linux-rdma@vger.kernel.org
      CC: linux-sctp@vger.kernel.org
      CC: linux-nfs@vger.kernel.org
      CC: linux-x25@vger.kernel.org
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9b2c45d4
  22. 18 Jan, 2018 1 commit
  23. 15 Jan, 2018 1 commit
    • David Windsor's avatar
      net: Define usercopy region in struct proto slab cache · 30c2c9f1
      David Windsor authored
      
      
      In support of usercopy hardening, this patch defines a region in the
      struct proto slab cache in which userspace copy operations are allowed.
      Some protocols need to copy objects to/from userspace, and they can
      declare the region via their proto structure with the new usersize and
      useroffset fields. Initially, if no region is specified (usersize ==
      0), the entire field is marked as whitelisted. This allows protocols
      to be whitelisted in subsequent patches. Once all protocols have been
      annotated, the full-whitelist default can be removed.
      
      This region is known as the slab cache's usercopy region. Slab caches
      can now check that each dynamically sized copy operation involving
      cache-managed memory falls entirely within the slab's usercopy region.
      
      This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
      whitelisting code in the last public patch of grsecurity/PaX based on my
      understanding of the code. Changes or omissions from the original code are
      mine and don't reflect the original grsecurity/PaX code.
      Signed-off-by: default avatarDavid Windsor <dave@nullcore.net>
      [kees: adjust commit log, split off per-proto patches]
      [kees: add logic for by-default full-whitelist]
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: netdev@vger.kernel.org
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      30c2c9f1
  24. 08 Jan, 2018 1 commit
    • David Ahern's avatar
      net: ipv6: Allow connect to linklocal address from socket bound to vrf · 54dc3e33
      David Ahern authored
      Allow a process bound to a VRF to connect to a linklocal address.
      Currently, this fails because of a mismatch between the scope of the
      linklocal address and the sk_bound_dev_if inherited by the VRF binding:
          $ ssh -6 fe80::70b8:cff:fedd:ead8%eth1
          ssh: connect to host fe80::70b8:cff:fedd:ead8%eth1 port 22: Invalid argument
      
      Relax the scope check to allow the socket to be bound to the same L3
      device as the scope id.
      
      This makes ipv6 linklocal consistent with other relaxed checks enabled
      by commits 1ff23bee ("net: l3mdev: Allow send on enslaved interface")
      and 7bb387c5
      
       ("net: Allow IP_MULTICAST_IF to set index to L3 slave").
      Signed-off-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      54dc3e33
  25. 28 Dec, 2017 1 commit
  26. 20 Dec, 2017 1 commit
  27. 19 Dec, 2017 1 commit
    • Tonghao Zhang's avatar
      sock: Move the socket inuse to namespace. · 648845ab
      Tonghao Zhang authored
      
      
      In some case, we want to know how many sockets are in use in
      different _net_ namespaces. It's a key resource metric.
      
      This patch add a member in struct netns_core. This is a counter
      for socket-inuse in the _net_ namespace. The patch will add/sub
      counter in the sk_alloc, sk_clone_lock and __sk_free.
      
      This patch will not counter the socket created in kernel.
      It's not very useful for userspace to know how many kernel
      sockets we created.
      
      The main reasons for doing this are that:
      
      1. When linux calls the 'do_exit' for process to exit, the functions
      'exit_task_namespaces' and 'exit_task_work' will be called sequentially.
      'exit_task_namespaces' may have destroyed the _net_ namespace, but
      'sock_release' called in 'exit_task_work' may use the _net_ namespace
      if we counter the socket-inuse in sock_release.
      
      2. socket and sock are in pair. More important, sock holds the _net_
      namespace. We counter the socket-inuse in sock, for avoiding holding
      _net_ namespace again in socket. It's a easy way to maintain the code.
      Signed-off-by: default avatarMartin Zhang <zhangjunweimartin@didichuxing.com>
      Signed-off-by: default avatarTonghao Zhang <zhangtonghao@didichuxing.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      648845ab
  28. 13 Dec, 2017 1 commit
  29. 05 Dec, 2017 1 commit
    • Eric Dumazet's avatar
      net: remove hlist_nulls_add_tail_rcu() · d7efc6c1
      Eric Dumazet authored
      Alexander Potapenko reported use of uninitialized memory [1]
      
      This happens when inserting a request socket into TCP ehash,
      in __sk_nulls_add_node_rcu(), since sk_reuseport is not initialized.
      
      Bug was added by commit d894ba18 ("soreuseport: fix ordering for
      mixed v4/v6 sockets")
      
      Note that d296ba60 ("soreuseport: Resolve merge conflict for v4/v6
      ordering fix") missed the opportunity to get rid of
      hlist_nulls_add_tail_rcu() :
      
      Both UDP sockets and TCP/DCCP listeners no longer use
      __sk_nulls_add_node_rcu() for their hash insertion.
      
      Since all other sockets have unique 4-tuple, the reuseport status
      has no special meaning, so we can always use hlist_nulls_add_head_rcu()
      for them and save few cycles/instructions.
      
      [1]
      
      ==================================================================
      BUG: KMSAN: use of uninitialized memory in inet_ehash_insert+0xd40/0x1050
      CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0+ #3288
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:16
       dump_stack+0x185/0x1d0 lib/dump_stack.c:52
       kmsan_report+0x13f/0x1c0 mm/kmsan/kmsan.c:1016
       __msan_warning_32+0x69/0xb0 mm/kmsan/kmsan_instr.c:766
       __sk_nulls_add_node_rcu ./include/net/sock.h:684
       inet_ehash_insert+0xd40/0x1050 net/ipv4/inet_hashtables.c:413
       reqsk_queue_hash_req net/ipv4/inet_connection_sock.c:754
       inet_csk_reqsk_queue_hash_add+0x1cc/0x300 net/ipv4/inet_connection_sock.c:765
       tcp_conn_request+0x31e7/0x36f0 net/ipv4/tcp_input.c:6414
       tcp_v4_conn_request+0x16d/0x220 net/ipv4/tcp_ipv4.c:1314
       tcp_rcv_state_process+0x42a/0x7210 net/ipv4/tcp_input.c:5917
       tcp_v4_do_rcv+0xa6a/0xcd0 net/ipv4/tcp_ipv4.c:1483
       tcp_v4_rcv+0x3de0/0x4ab0 net/ipv4/tcp_ipv4.c:1763
       ip_local_deliver_finish+0x6bb/0xcb0 net/ipv4/ip_input.c:216
       NF_HOOK ./include/linux/netfilter.h:248
       ip_local_deliver+0x3fa/0x480 net/ipv4/ip_input.c:257
       dst_input ./include/net/dst.h:477
       ip_rcv_finish+0x6fb/0x1540 net/ipv4/ip_input.c:397
       NF_HOOK ./include/linux/netfilter.h:248
       ip_rcv+0x10f6/0x15c0 net/ipv4/ip_input.c:488
       __netif_receive_skb_core+0x36f6/0x3f60 net/core/dev.c:4298
       __netif_receive_skb net/core/dev.c:4336
       netif_receive_skb_internal+0x63c/0x19c0 net/core/dev.c:4497
       napi_skb_finish net/core/dev.c:4858
       napi_gro_receive+0x629/0xa50 net/core/dev.c:4889
       e1000_receive_skb drivers/net/ethernet/intel/e1000/e1000_main.c:4018
       e1000_clean_rx_irq+0x1492/0x1d30
      drivers/net/ethernet/intel/e1000/e1000_main.c:4474
       e1000_clean+0x43aa/0x5970 drivers/net/ethernet/intel/e1000/e1000_main.c:3819
       napi_poll net/core/dev.c:5500
       net_rx_action+0x73c/0x1820 net/core/dev.c:5566
       __do_softirq+0x4b4/0x8dd kernel/softirq.c:284
       invoke_softirq kernel/softirq.c:364
       irq_exit+0x203/0x240 kernel/softirq.c:405
       exiting_irq+0xe/0x10 ./arch/x86/include/asm/apic.h:638
       do_IRQ+0x15e/0x1a0 arch/x86/kernel/irq.c:263
       common_interrupt+0x86/0x86
      
      Fixes: d894ba18 ("soreuseport: fix ordering for mixed v4/v6 sockets")
      Fixes: d296ba60
      
       ("soreuseport: Resolve merge conflict for v4/v6 ordering fix")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarAlexander Potapenko <glider@google.com>
      Acked-by: default avatarCraig Gallek <kraig@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d7efc6c1
  30. 27 Nov, 2017 1 commit
  31. 16 Nov, 2017 2 commits
  32. 14 Nov, 2017 1 commit
    • Eric Dumazet's avatar
      tcp: allow drivers to tweak TSQ logic · 3a9b76fd
      Eric Dumazet authored
      I had many reports that TSQ logic breaks wifi aggregation.
      
      Current logic is to allow up to 1 ms of bytes to be queued into qdisc
      and drivers queues.
      
      But Wifi aggregation needs a bigger budget to allow bigger rates to
      be discovered by various TCP Congestion Controls algorithms.
      
      This patch adds an extra socket field, allowing wifi drivers to select
      another log scale to derive TCP Small Queue credit from current pacing
      rate.
      
      Initial value is 10, meaning that this patch does not change current
      behavior.
      
      We expect wifi drivers to set this field to smaller values (tests have
      been done with values from 6 to 9)
      
      They would have to use following template :
      
      if (skb->sk && skb->sk->sk_pacing_shift != MY_PACING_SHIFT)
           skb->sk->sk_pacing_shift = MY_PACING_SHIFT;
      
      Ref: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1670041
      
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Johannes Berg <johannes.berg@intel.com>
      Cc: Toke Høiland-Jørgensen <toke@toke.dk>
      Cc: Kir Kolyshkin <kir@openvz.org>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3a9b76fd
  33. 10 Nov, 2017 1 commit