1. 26 Mar, 2014 1 commit
  2. 11 Mar, 2014 1 commit
    • tcp: tcp_release_cb() should release socket ownership · c3f9b018
      Eric Dumazet authored
      Lars Persson reported the following deadlock:
      
      -000 |M:0x0:0x802B6AF8(asm) <-- arch_spin_lock
      -001 |tcp_v4_rcv(skb = 0x8BD527A0) <-- sk = 0x8BE6B2A0
      -002 |ip_local_deliver_finish(skb = 0x8BD527A0)
      -003 |__netif_receive_skb_core(skb = 0x8BD527A0, ?)
      -004 |netif_receive_skb(skb = 0x8BD527A0)
      -005 |elk_poll(napi = 0x8C770500, budget = 64)
      -006 |net_rx_action(?)
      -007 |__do_softirq()
      -008 |do_softirq()
      -009 |local_bh_enable()
      -010 |tcp_rcv_established(sk = 0x8BE6B2A0, skb = 0x87D3A9E0, th = 0x814EBE14, ?)
      -011 |tcp_v4_do_rcv(sk = 0x8BE6B2A0, skb = 0x87D3A9E0)
      -012 |tcp_delack_timer_handler(sk = 0x8BE6B2A0)
      -013 |tcp_release_cb(sk = 0x8BE6B2A0)
      -014 |release_sock(sk = 0x8BE6B2A0)
      -015 |tcp_sendmsg(?, sk = 0x8BE6B2A0, ?, ?)
      -016 |sock_sendmsg(sock = 0x8518C4C0, msg = 0x87D8DAA8, size = 4096)
      -017 |kernel_sendmsg(?, ?, ?, ?, size = 4096)
      -018 |smb_send_kvec()
      -019 |smb_send_rqst(server = 0x87C4D400, rqst = 0x87D8DBA0)
      -020 |cifs_call_async()
      -021 |cifs_async_writev(wdata = 0x87FD6580)
      -022 |cifs_writepages(mapping = 0x852096E4, wbc = 0x87D8DC88)
      -023 |__writeback_single_inode(inode = 0x852095D0, wbc = 0x87D8DC88)
      -024 |writeback_sb_inodes(sb = 0x87D6D800, wb = 0x87E4A9C0, work = 0x87D8DD88)
      -025 |__writeback_inodes_wb(wb = 0x87E4A9C0, work = 0x87D8DD88)
      -026 |wb_writeback(wb = 0x87E4A9C0, work = 0x87D8DD88)
      -027 |wb_do_writeback(wb = 0x87E4A9C0, force_wait = 0)
      -028 |bdi_writeback_workfn(work = 0x87E4A9CC)
      -029 |process_one_work(worker = 0x8B045880, work = 0x87E4A9CC)
      -030 |worker_thread(__worker = 0x8B045880)
      -031 |kthread(_create = 0x87CADD90)
      -032 |ret_from_kernel_thread(asm)
      
      Bug occurs because __tcp_checksum_complete_user() enables BH, assuming
      it is running from softirq context.
      
      Lars' trace involved a NIC without RX checksum support, but other points
      are problematic as well, like the prequeue stuff.
      
      The problem is triggered by a timer that found the socket owned by the user.
      
      tcp_release_cb() should call tcp_write_timer_handler() or
      tcp_delack_timer_handler() in the appropriate context :
      
      BH disabled and socket lock held, but 'owned' field cleared,
      as if they were running from timer handlers.
      
      Fixes: 6f458dfb ("tcp: improve latencies of timer triggered events")
      Reported-by: Lars Persson <lars.persson@axis.com>
      Tested-by: Lars Persson <lars.persson@axis.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 10 Mar, 2014 1 commit
  4. 06 Mar, 2014 1 commit
  5. 04 Jan, 2014 1 commit
  6. 03 Jan, 2014 1 commit
  7. 31 Dec, 2013 2 commits
  8. 06 Dec, 2013 1 commit
    • tcp_memcontrol: Cleanup/fix cg_proto->memory_pressure handling. · 7f2cbdc2
      Eric W. Biederman authored
      
      
      Kill memcg_tcp_enter_memory_pressure.  The only function of
      memcg_tcp_enter_memory_pressure was to deal with the
      unnecessary abstraction that was tcp_memcontrol.  Now that struct
      tcp_memcontrol is gone, remove this unnecessary function, the
      unnecessary function pointer, and modify sk_enter_memory_pressure to
      set this field directly, just as sk_leave_memory_pressure clears this
      field directly.
      
      This fixes a small bug I introduced when killing struct tcp_memcontrol
      that caused memcg_tcp_enter_memory_pressure to never be called and
      thus failed to ever set cg_proto->memory_pressure.
      
      Remove the cg_proto enter_memory_pressure function as it now serves
      no useful purpose.
      
      Don't test cg_proto->memory_pressure in sk_leave_memory_pressure before
      clearing it.  The test was originally there to ensure that the pointer
      was non-NULL.  Now that cg_proto is not a pointer, the test is no longer
      needed.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 23 Oct, 2013 1 commit
  10. 22 Oct, 2013 1 commit
  11. 21 Oct, 2013 1 commit
  12. 09 Oct, 2013 2 commits
    • ipv6: make lookups simpler and faster · efe4208f
      Eric Dumazet authored
      
      
      TCP listener refactoring, part 4 :
      
      To speed up inet lookups, we moved IPv4 addresses from inet to struct
      sock_common.
      
      Now it is time to do the same for IPv6, because it permits us to have fast
      lookups for all kinds of sockets, including upcoming SYN_RECV.
      
      Getting IPv6 addresses in TCP lookups currently requires two extra cache
      lines, plus a dereference (and memory stall).
      
      inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6
      
      This patch is way bigger than its IPv4 counterpart, because for IPv4,
      we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
      it's not easily doable.
      
      inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
      inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr
      
      And timewait sockets also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
      at the same offset.
      
      We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
      macro.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp/dccp: remove twchain · 05dbc7b5
      Eric Dumazet authored
      
      
      TCP listener refactoring, part 3 :
      
      Our goal is to hash SYN_RECV sockets into main ehash for fast lookup,
      and parallel SYN processing.
      
      Current inet_ehash_bucket contains two chains, one for ESTABLISH (and
      friend states) sockets, another for TIME_WAIT sockets only.
      
      As the hash table is sized to get at most one socket per bucket, it
      makes little sense to have separate twchain, as it makes the lookup
      slightly more complicated, and doubles hash table memory usage.
      
      If we make sure all socket types have the lookup keys at the same
      offsets, we can use a generic and faster lookup. It turns out TIME_WAIT
      and ESTABLISHED sockets already have common lookup fields for IPv4.
      
      [ INET_TW_MATCH() is no longer needed ]
      
      I'll provide a follow-up to factorize IPv6 lookup as well, to remove
      INET6_TW_MATCH()
      
      This way, SYN_RECV pseudo sockets will be supported the same.
      
      A new sock_gen_put() helper is added, doing either a sock_put() or
      inet_twsk_put() [ and will support SYN_RECV later ].
      
      Note this helper should only be called in real slow path, when rcu
      lookup found a socket that was moved to another identity (freed/reused
      immediately), but could eventually be used in other contexts, like
      sock_edemux()
      
      Before patch :
      
      dmesg | grep "TCP established"
      
      TCP established hash table entries: 524288 (order: 11, 8388608 bytes)
      
      After patch :
      
      TCP established hash table entries: 524288 (order: 10, 4194304 bytes)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
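
The halving of the established hash table reported by dmesg in the twchain commit can be checked with a little arithmetic. The sketch below is a userspace model, not kernel code. It assumes a 64-bit machine where each chain head is a single 8-byte pointer and where pages are 4096 bytes; `ehash_bytes` and `alloc_order` are hypothetical helper names introduced only for this illustration.

```c
#include <stdint.h>

/* Bytes used by a hash table with `chains` list heads per bucket.
 * Assumption: every list head is one pointer of `ptr_size` bytes. */
static uint64_t ehash_bytes(uint64_t buckets, unsigned chains, unsigned ptr_size)
{
    return buckets * chains * ptr_size;
}

/* Smallest page order such that 2^order pages cover `bytes`. */
static unsigned alloc_order(uint64_t bytes, uint64_t page_size)
{
    unsigned order = 0;
    uint64_t span = page_size;

    while (span < bytes) {
        span <<= 1;
        order++;
    }
    return order;
}
```

With 524288 buckets, two chains per bucket give 8388608 bytes (order 11) and one chain gives 4194304 bytes (order 10), which matches the two dmesg lines quoted in the commit message.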
  13. 08 Oct, 2013 1 commit
    • udp: ipv4: Add udp early demux · 421b3885
      Shawn Bohrer authored
      
      
      The removal of the routing cache introduced a performance regression for
      some UDP workloads since a dst lookup must be done for each packet.
      This change caches the dst per socket in a similar manner to what we do
      for TCP by implementing early_demux.
      
      For UDP multicast we can only cache the dst if there is only one
      receiving socket on the host.  Since caching only works when there is
      one receiving socket we do the multicast socket lookup using RCU.
      
      For UDP unicast we only demux sockets with an exact match in order to
      not break forwarding setups.  Additionally since the hash chains may be
      long we only check the first socket to see if it is a match and not
      waste extra time searching the whole chain when we might not find an
      exact match.
      
      Benchmark results from a netperf UDP_RR test:
      Before 87961.22 transactions/s
      After  89789.68 transactions/s
      
      Benchmark results from a fio 1 byte UDP multicast pingpong test
      (Multicast one way unicast response):
      Before 12.97us RTT
      After  12.63us RTT
      Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 07 Oct, 2013 1 commit
    • net: fix unsafe set_memory_rw from softirq · d45ed4a4
      Alexei Starovoitov authored
      
      
      on x86 system with net.core.bpf_jit_enable = 1
      
      sudo tcpdump -i eth1 'tcp port 22'
      
      causes the warning:
      [   56.766097]  Possible unsafe locking scenario:
      [   56.766097]
      [   56.780146]        CPU0
      [   56.786807]        ----
      [   56.793188]   lock(&(&vb->lock)->rlock);
      [   56.799593]   <Interrupt>
      [   56.805889]     lock(&(&vb->lock)->rlock);
      [   56.812266]
      [   56.812266]  *** DEADLOCK ***
      [   56.812266]
      [   56.830670] 1 lock held by ksoftirqd/1/13:
      [   56.836838]  #0:  (rcu_read_lock){.+.+..}, at: [<ffffffff8118f44c>] vm_unmap_aliases+0x8c/0x380
      [   56.849757]
      [   56.849757] stack backtrace:
      [   56.862194] CPU: 1 PID: 13 Comm: ksoftirqd/1 Not tainted 3.12.0-rc3+ #45
      [   56.868721] Hardware name: System manufacturer System Product Name/P8Z77 WS, BIOS 3007 07/26/2012
      [   56.882004]  ffffffff821944c0 ffff88080bbdb8c8 ffffffff8175a145 0000000000000007
      [   56.895630]  ffff88080bbd5f40 ffff88080bbdb928 ffffffff81755b14 0000000000000001
      [   56.909313]  ffff880800000001 ffff880800000000 ffffffff8101178f 0000000000000001
      [   56.923006] Call Trace:
      [   56.929532]  [<ffffffff8175a145>] dump_stack+0x55/0x76
      [   56.936067]  [<ffffffff81755b14>] print_usage_bug+0x1f7/0x208
      [   56.942445]  [<ffffffff8101178f>] ? save_stack_trace+0x2f/0x50
      [   56.948932]  [<ffffffff810cc0a0>] ? check_usage_backwards+0x150/0x150
      [   56.955470]  [<ffffffff810ccb52>] mark_lock+0x282/0x2c0
      [   56.961945]  [<ffffffff810ccfed>] __lock_acquire+0x45d/0x1d50
      [   56.968474]  [<ffffffff810cce6e>] ? __lock_acquire+0x2de/0x1d50
      [   56.975140]  [<ffffffff81393bf5>] ? cpumask_next_and+0x55/0x90
      [   56.981942]  [<ffffffff810cef72>] lock_acquire+0x92/0x1d0
      [   56.988745]  [<ffffffff8118f52a>] ? vm_unmap_aliases+0x16a/0x380
      [   56.995619]  [<ffffffff817628f1>] _raw_spin_lock+0x41/0x50
      [   57.002493]  [<ffffffff8118f52a>] ? vm_unmap_aliases+0x16a/0x380
      [   57.009447]  [<ffffffff8118f52a>] vm_unmap_aliases+0x16a/0x380
      [   57.016477]  [<ffffffff8118f44c>] ? vm_unmap_aliases+0x8c/0x380
      [   57.023607]  [<ffffffff810436b0>] change_page_attr_set_clr+0xc0/0x460
      [   57.030818]  [<ffffffff810cfb8d>] ? trace_hardirqs_on+0xd/0x10
      [   57.037896]  [<ffffffff811a8330>] ? kmem_cache_free+0xb0/0x2b0
      [   57.044789]  [<ffffffff811b59c3>] ? free_object_rcu+0x93/0xa0
      [   57.051720]  [<ffffffff81043d9f>] set_memory_rw+0x2f/0x40
      [   57.058727]  [<ffffffff8104e17c>] bpf_jit_free+0x2c/0x40
      [   57.065577]  [<ffffffff81642cba>] sk_filter_release_rcu+0x1a/0x30
      [   57.072338]  [<ffffffff811108e2>] rcu_process_callbacks+0x202/0x7c0
      [   57.078962]  [<ffffffff81057f17>] __do_softirq+0xf7/0x3f0
      [   57.085373]  [<ffffffff81058245>] run_ksoftirqd+0x35/0x70
      
      cannot reuse jited filter memory, since it's readonly,
      so use original bpf insns memory to hold work_struct
      
      defer kfree of sk_filter until jit completed freeing
      
      tested on x86_64 and i386
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 03 Oct, 2013 1 commit
    • inet: consolidate INET_TW_MATCH · 50805466
      Eric Dumazet authored
      
      
      TCP listener refactoring, part 2 :
      
      We can use a generic lookup, sockets being in whatever state, if
      we are sure all relevant fields are at the same place in all socket
      types (ESTABLISH, TIME_WAIT, SYN_RECV)
      
      This patch removes these macros :
      
       inet_addrpair, inet_portpair, tw_addrpair, tw_portpair
      
      And adds :
      
       sk_portpair, sk_addrpair, sk_daddr, sk_rcv_saddr
      
      Then, INET_TW_MATCH() is really the same as INET_MATCH()
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 01 Oct, 2013 1 commit
  17. 30 Sep, 2013 1 commit
  18. 28 Sep, 2013 1 commit
    • net: introduce SO_MAX_PACING_RATE · 62748f32
      Eric Dumazet authored
      As mentioned in commit afe4fd06 ("pkt_sched: fq: Fair Queue packet
      scheduler"), this patch adds a new socket option.
      
      SO_MAX_PACING_RATE offers the application the ability to cap the
      rate computed by transport layer. Value is in bytes per second.
      
      u32 val = 1000000;
      setsockopt(sockfd, SOL_SOCKET, SO_MAX_PACING_RATE, &val, sizeof(val));
      
      To be effectively paced, a flow must use FQ packet scheduler.
      
      Note that a packet scheduler takes into account the headers for its
      computations. The effective payload rate depends on MSS and retransmits
      if any.
      
      I chose to make this pacing rate a SOL_SOCKET option instead of a
      TCP one because this can be used by other protocols.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Steinar H. Gunderson <sesse@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
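
The two-line setsockopt() snippet in the SO_MAX_PACING_RATE commit can be wrapped into a small helper with error handling. This is a minimal sketch: the name `set_max_pacing_rate` is hypothetical, and the `#ifdef` fallback is an assumption for build environments whose headers do not define the option.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/socket.h>

/* Cap the pacing rate of socket `fd` at `bytes_per_sec`.
 * Returns 0 on success, -1 with errno set on failure. */
static int set_max_pacing_rate(int fd, uint32_t bytes_per_sec)
{
#ifdef SO_MAX_PACING_RATE
    return setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE,
                      &bytes_per_sec, sizeof(bytes_per_sec));
#else
    /* Assumption: headers without the option mean it is unsupported here. */
    (void)fd;
    (void)bytes_per_sec;
    errno = ENOPROTOOPT;
    return -1;
#endif
}
```

A caller would typically invoke `set_max_pacing_rate(sockfd, 1000000)` right after creating the socket. As the commit message notes, the cap only takes effect for flows that go through the FQ packet scheduler.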
  19. 23 Sep, 2013 1 commit
  20. 29 Aug, 2013 1 commit
    • tcp: TSO packets automatic sizing · 95bd09eb
      Eric Dumazet authored
      
      
      After hearing many people over past years complaining against TSO being
      bursty or even buggy, we are proud to present automatic sizing of TSO
      packets.
      
      One part of the problem is that tcp_tso_should_defer() uses a heuristic
      relying on upcoming ACKS instead of a timer, but more generally, having
      big TSO packets makes little sense for low rates, as it tends to create
      micro bursts on the network, and general consensus is to reduce the
      buffering amount.
      
      This patch introduces a per socket sk_pacing_rate, that approximates
      the current sending rate, and allows us to size the TSO packets so
      that we try to send one packet every ms.
      
      This field could be set by other transports.
      
      Patch has no impact for high speed flows, where having large TSO packets
      makes sense to reach line rate.
      
      For other flows, this helps better packet scheduling and ACK clocking.
      
      This patch increases performance of TCP flows in lossy environments.
      
      A new sysctl (tcp_min_tso_segs) is added, to specify the
      minimal size of a TSO packet (default being 2).
      
      A follow-up patch will provide a new packet scheduler (FQ), using
      sk_pacing_rate as an input to perform optional per flow pacing.
      
      This explains why we chose to set sk_pacing_rate to twice the current
      rate, allowing 'slow start' ramp up.
      
      sk_pacing_rate = 2 * cwnd * mss / srtt
      
      v2: Neal Cardwell reported a suspect deferring of last two segments on
      initial write of 10 MSS, I had to change tcp_tso_should_defer() to take
      into account tp->xmit_size_goal_segs
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
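
The formula `sk_pacing_rate = 2 * cwnd * mss / srtt` and the "one packet every ms" sizing goal from the TSO autosizing commit can be worked through in userspace. The sketch below is a simplified model, not the kernel implementation: the function names are hypothetical, srtt is assumed to be given in microseconds, and any further limits the real stack applies to TSO packet size are not modeled.

```c
#include <stdint.h>

/* sk_pacing_rate = 2 * cwnd * mss / srtt, expressed in bytes per second.
 * srtt_us is the smoothed RTT in microseconds and must be non-zero. */
static uint64_t pacing_rate_bytes_per_sec(uint32_t cwnd, uint32_t mss,
                                          uint32_t srtt_us)
{
    return 2ULL * cwnd * mss * 1000000ULL / srtt_us;
}

/* Number of MSS-sized segments per TSO packet so that roughly one packet
 * is sent per millisecond, never fewer than min_tso_segs. */
static uint32_t tso_segs_goal(uint64_t rate, uint32_t mss,
                              uint32_t min_tso_segs)
{
    uint64_t bytes_per_ms = rate / 1000;
    uint64_t segs = bytes_per_ms / mss;

    return segs < min_tso_segs ? min_tso_segs : (uint32_t)segs;
}
```

For a slow flow (cwnd 10, MSS 1448, srtt 100 ms) the rate is 289600 bytes/s, which is less than one MSS per millisecond, so the goal falls back to the minimum of 2 segments. For a fast flow (cwnd 1000, MSS 1448, srtt 1 ms) the same arithmetic yields 2000 segments per millisecond, which illustrates why large TSO packets remain appropriate at high rates.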
  21. 10 Aug, 2013 1 commit
    • net: attempt high order allocations in sock_alloc_send_pskb() · 28d64271
      Eric Dumazet authored
      
      
      Adding paged frags skbs to af_unix sockets introduced a performance
      regression on large sends because of additional page allocations, even
      if each skb could carry at least 100% more payload than before.
      
      We can instruct sock_alloc_send_pskb() to attempt high order
      allocations.
      
      Most of the time, it does a single page allocation instead of 8.
      
      I added an additional parameter to sock_alloc_send_pskb() to
      let other users opt in to this new feature in follow-up patches.
      
      Tested:
      
      Before patch :
      
      $ netperf -t STREAM_STREAM
      STREAM STREAM TEST
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       2304  212992  212992    10.00    46861.15
      
      After patch :
      
      $ netperf -t STREAM_STREAM
      STREAM STREAM TEST
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       2304  212992  212992    10.00    57981.11
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 01 Aug, 2013 1 commit
  23. 31 Jul, 2013 1 commit
    • netem: Introduce skb_orphan_partial() helper · f2f872f9
      Eric Dumazet authored
      Commit 547669d4 ("tcp: xps: fix reordering issues") added
      unexpected reorders when netem is used in an MQ setup for a high
      performance test bed.
      
      ETH=eth0
      tc qd del dev $ETH root 2>/dev/null
      tc qd add dev $ETH root handle 1: mq
      for i in `seq 1 32`
      do
       tc qd add dev $ETH parent 1:$i netem delay 100ms
      done
      
      As all tcp packets are orphaned by netem, TCP stack believes it can
      set skb->ooo_okay on all packets.
      
      In order to allow producers to send more packets, we want to
      keep sk_wmem_alloc from reaching sk_sndbuf limit.
      
      We can do that by accounting one byte per skb in netem queues,
      so that TCP stack is not fooled too much.
      
      Tested:
      
      With above MQ/netem setup, scaling number of concurrent flows gives
      linear results and no reorders/retransmits
      
      lpq83:~# for n in 1 10 20 30 40 50 60 70 80 90 100
       do echo -n "n:$n " ; ./super_netperf $n -H 10.7.7.84; done
      n:1 198.46
      n:10 2002.69
      n:20 4000.98
      n:30 6006.35
      n:40 8020.93
      n:50 10032.3
      n:60 12081.9
      n:70 13971.3
      n:80 16009.7
      n:90 17117.3
      n:100 17425.5
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 25 Jul, 2013 2 commits
    • tcp: TCP_NOTSENT_LOWAT socket option · c9bee3b7
      Eric Dumazet authored
      
      
      Idea of this patch is to add optional limitation of number of
      unsent bytes in TCP sockets, to reduce usage of kernel memory.
      
      TCP receiver might announce a big window, and TCP sender autotuning
      might allow a large amount of bytes in write queue, but this has little
      performance impact if a large part of this buffering is wasted :
      
      Write queue needs to be large only to deal with large BDP, not
      necessarily to cope with scheduling delays (incoming ACKS make room
      for the application to queue more bytes)
      
      For most workloads, using a value of 128 KB or less is OK to give
      applications enough time to react to POLLOUT events
      (or to be awakened in a blocking sendmsg())
      
      This patch adds two ways to set the limit :
      
      1) Per socket option TCP_NOTSENT_LOWAT
      
      2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
      not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
      Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.
      
      This changes poll()/select()/epoll() to report POLLOUT
      only if the number of unsent bytes is below tp->notsent_lowat
      
      Note this might increase number of sendmsg()/sendfile() calls
      when using non blocking sockets,
      and increase number of context switches for blocking sockets.
      
      Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
      defined as :
       Specify the minimum number of bytes in the buffer until
       the socket layer will pass the data to the protocol)
      
      Tested:
      
      netperf sessions, and watching /proc/net/protocols "memory" column for TCP
      
      With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
      used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)
      
      lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
      TCPv6     1880      2   45458   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      TCP       1696    508   45458   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      
      lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
      TCPv6     1880      2   20567   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      TCP       1696    508   20567   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      
      Using 128KB has no bad effect on the throughput or cpu usage
      of a single flow, although there is an increase of context switches.
      
      A bonus is that we hold socket lock for a shorter amount
      of time and should improve latencies of ACK processing.
      
      lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
      OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
      Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
      Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
      Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
      Final       Final                                             %     Method %      Method
      1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB
      
       Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
      
                 412,514 context-switches
      
           200.034645535 seconds time elapsed
      
      lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
      OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
      Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
      Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
      Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
      Final       Final                                             %     Method %      Method
      1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB
      
       Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
      
               2,675,818 context-switches
      
           200.029651391 seconds time elapsed
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-By: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
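
Of the two ways to set the limit described in the TCP_NOTSENT_LOWAT commit, the per-socket option can be sketched as below. The helper name `set_notsent_lowat` is hypothetical, and the `#ifdef` fallback is an assumption for headers that do not define the option. The 131072 value in the usage note simply mirrors the 128 KB figure used in the commit's tests.

```c
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Limit the number of unsent bytes queued on TCP socket `fd`.
 * Returns 0 on success, -1 with errno set on failure. */
static int set_notsent_lowat(int fd, unsigned int bytes)
{
#ifdef TCP_NOTSENT_LOWAT
    return setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
                      &bytes, sizeof(bytes));
#else
    /* Assumption: headers without the option mean it is unsupported here. */
    (void)fd;
    (void)bytes;
    errno = ENOPROTOOPT;
    return -1;
#endif
}
```

An application that polls for POLLOUT would call `set_notsent_lowat(fd, 131072)` once after connecting. Sockets that do not set the option (or set it to zero) fall back to the sysctl value, as described in the commit message.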
    • net: add sk_stream_is_writeable() helper · 64dc6130
      Eric Dumazet authored
      
      
      Several call sites hardcode the following condition:
      
      sk_stream_wspace(sk) >= sk_stream_min_wspace(sk)
      
      Let's use a helper because TCP_NOTSENT_LOWAT support will change this
      condition for TCP sockets.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
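
The condition factored out by the sk_stream_is_writeable() commit can be modeled in userspace to see when a stream socket counts as writeable. The sketch below is an illustration under stated assumptions, not the kernel source: `struct sock_model` is a hypothetical stand-in for the two relevant `struct sock` fields, and the bodies of the two space helpers (free space is `sndbuf - wmem_queued`, minimum space is half of `wmem_queued`) are written from memory and should be checked against include/net/sock.h.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the struct sock fields involved. */
struct sock_model {
    int sndbuf;      /* send buffer limit, in bytes */
    int wmem_queued; /* bytes currently queued for transmit */
};

/* Assumed definition: free space left in the send buffer. */
static int stream_wspace(const struct sock_model *sk)
{
    return sk->sndbuf - sk->wmem_queued;
}

/* Assumed definition: space required before reporting writeable. */
static int stream_min_wspace(const struct sock_model *sk)
{
    return sk->wmem_queued >> 1;
}

/* The condition the commit wraps in a helper. */
static bool stream_is_writeable(const struct sock_model *sk)
{
    return stream_wspace(sk) >= stream_min_wspace(sk);
}
```

Under these assumptions a socket with a 3000-byte send buffer stays writeable up to 2000 queued bytes and stops being writeable at 2001, that is, once more than about two thirds of the buffer is in use.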
  25. 22 Jul, 2013 1 commit
  26. 03 Jul, 2013 1 commit
  27. 20 Jun, 2013 1 commit
    • net: sock: adapt SOCK_MIN_RCVBUF and SOCK_MIN_SNDBUF · eea86af6
      Daniel Borkmann authored
      The current situation is that SOCK_MIN_RCVBUF is (2048 + sizeof(struct sk_buff))
      while SOCK_MIN_SNDBUF is 2048. Since in both cases, skb->truesize is used for
      sk_{r,w}mem_alloc accounting, we should have both sizes adjusted via defining a
      TCP_SKB_MIN_TRUESIZE.
      
      Further, as Eric Dumazet points out, the minimal skb truesize in transmit path is
      SKB_TRUESIZE(2048) after commit f07d960d ("tcp: avoid frag allocation for
      small frames"), and tcp_sendmsg() tries to limit skb size to half the congestion
      window, meaning we try to build two skbs at minimum. Thus, having SOCK_MIN_SNDBUF
      as 2048 can hit a small regression for some applications setting too low
      SO_SNDBUF / SO_RCVBUF. Note that we define a TCP_SKB_MIN_TRUESIZE, because
      SKB_TRUESIZE(2048) adds SKB_DATA_ALIGN(sizeof(struct skb_shared_info)), but in
      case of TCP skbs, the skb_shared_info is part of the 2048 bytes allocation for
      skb->head.
      
      The minor adaptation in sk_stream_moderate_sndbuf() is to silence a warning
      that would otherwise appear, by using a typed max macro, as similarly done
      in SOCK_MIN_RCVBUF occurrences.
      Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  28. 17 Jun, 2013 1 commit
  29. 11 Jun, 2013 1 commit
  30. 11 May, 2013 1 commit
    • ipv6: do not clear pinet6 field · f77d6021
      Eric Dumazet authored
      We have seen multiple NULL dereferences in __inet6_lookup_established()
      
      After analysis, I found that inet6_sk() could be NULL while the
      check for sk_family == AF_INET6 was true.
      
      Bug was added in linux-2.6.29 when RCU lookups were introduced in UDP
      and TCP stacks.
      
      Once an IPv6 socket using SLAB_DESTROY_BY_RCU is inserted in a hash
      table, we can no longer clear the pinet6 field.
      
      This patch extends logic used in commit fcbdf09d
      ("net: fix nulls list corruptions in sk_prot_alloc")
      
      TCP/UDP/UDPLite IPv6 protocols provide their own .clear_sk() method
      to make sure we do not clear pinet6 field.
      
      At socket clone phase, we do not really care, as cloning the parent (non
      NULL) pinet6 is not adding a fatal race.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  31. 14 Apr, 2013 1 commit
  32. 31 Mar, 2013 1 commit
    • net: add option to enable error queue packets waking select · 7d4c04fc
      Keller, Jacob E authored
      
      
      Currently, when a socket receives something on the error queue it only wakes up
      the socket on select if it is in the "read" list, that is the socket has
      something to read. It is useful also to wake the socket if it is in the error
      list, which would enable software to wait on error queue packets without waking
      up for regular data on the socket. The main use case is for receiving
      timestamped transmit packets which return the timestamp to the socket via the
      error queue. This enables an application to select on the socket for the error
      queue only instead of for the regular traffic.
      
      -v2-
      * Added the SO_SELECT_ERR_QUEUE socket option to every architecture-specific file
      * Modified every socket poll function that checks error queue
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Cc: Jeffrey Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Matthew Vick <matthew.vick@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  33. 28 Feb, 2013 1 commit
    • hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin authored
      I'm not sure why, but the hlist for each entry iterators were conceived
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      they don't really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist...
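
To illustrate the three-argument iterator form that the hlist commit moves to, here is a self-contained userspace sketch. It is not the code from linux/list.h: the node type is reduced to a single `next` pointer, the macros are simplified re-creations that evaluate their pointer argument more than once, `__typeof__` is a GCC/Clang extension, and `struct item` and `sum_items` are hypothetical names used only for the demonstration.

```c
#include <stddef.h>

/* Simplified hash-list types (assumption: only a forward pointer). */
struct hlist_node { struct hlist_node *next; };
struct hlist_head { struct hlist_node *first; };

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* NULL-safe variant so that the end of a chain maps to a NULL entry. */
#define hlist_entry_safe(ptr, type, member) \
    ((ptr) ? container_of(ptr, type, member) : NULL)

/* Three-argument form: no separate node cursor is needed. */
#define hlist_for_each_entry(pos, head, member)                               \
    for (pos = hlist_entry_safe((head)->first, __typeof__(*(pos)), member);   \
         pos;                                                                 \
         pos = hlist_entry_safe((pos)->member.next, __typeof__(*(pos)), member))

struct item {
    int value;
    struct hlist_node node;
};

static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
    n->next = h->first;
    h->first = n;
}

/* Walk the list using only the entry pointer as the cursor. */
static int sum_items(struct hlist_head *head)
{
    struct item *it;
    int sum = 0;

    hlist_for_each_entry(it, head, node)
        sum += it->value;
    return sum;
}
```

The loop body in `sum_items` now has the same shape as a `list_for_each_entry` loop, which is the consistency the commit message asks for.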
  34. 18 Feb, 2013 1 commit
    • net: fix a compile error when SOCK_REFCNT_DEBUG is enabled · dec34fb0
      Ying Xue authored
      
      
      When SOCK_REFCNT_DEBUG is enabled, the following build error occurs:
      
      kernel/sysctl_binary.o: In function `sk_refcnt_debug_release':
      include/net/sock.h:1025: multiple definition of `sk_refcnt_debug_release'
      kernel/sysctl.o:include/net/sock.h:1025: first defined here
      kernel/audit.o: In function `sk_refcnt_debug_release':
      include/net/sock.h:1025: multiple definition of `sk_refcnt_debug_release'
      kernel/sysctl.o:include/net/sock.h:1025: first defined here
      make[1]: *** [kernel/built-in.o] Error 1
      make: *** [kernel] Error 2
      
      So we decide to make sk_refcnt_debug_release static to eliminate
      the error.
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  35. 28 Jan, 2013 1 commit
  36. 23 Jan, 2013 1 commit
  37. 17 Jan, 2013 1 commit
    • sk-filter: Add ability to lock a socket filter program · d59577b6
      Vincent Bernat authored
      
      
      While a privileged program can open a raw socket, attach some
      restrictive filter and drop its privileges (or send the socket to an
      unprivileged program through some Unix socket), the filter can still
      be removed or modified by the unprivileged program. This commit adds a
      socket option to lock the filter (SO_LOCK_FILTER) preventing any
      modification of a socket filter program.
      
      This is similar to the OpenBSD BIOCLOCK ioctl on bpf sockets, except even
      root is not allowed to change/drop the filter.
      
      The state of the lock can be read with getsockopt(). No error is
      triggered if the state is not changed. -EPERM is returned when a user
      tries to remove the lock or to change/remove the filter while the lock
      is active. The check is done directly in sk_attach_filter() and
      sk_detach_filter(), so it is not limited to the setsockopt() syscall.
      Signed-off-by: Vincent Bernat <bernat@luffy.cx>
      Signed-off-by: David S. Miller <davem@davemloft.net>
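
The SO_LOCK_FILTER behaviour described in the commit above (set the lock with setsockopt(), read it back with getsockopt()) can be sketched with two small helpers. The names `lock_socket_filter` and `socket_filter_is_locked` are hypothetical, and the `#ifdef` fallback is an assumption for headers that do not define the option.

```c
#include <errno.h>
#include <sys/socket.h>

/* Lock the filter currently attached to `fd` against further changes.
 * Returns 0 on success, -1 with errno set on failure. */
static int lock_socket_filter(int fd)
{
#ifdef SO_LOCK_FILTER
    int on = 1;

    return setsockopt(fd, SOL_SOCKET, SO_LOCK_FILTER, &on, sizeof(on));
#else
    /* Assumption: headers without the option mean it is unsupported here. */
    (void)fd;
    errno = ENOPROTOOPT;
    return -1;
#endif
}

/* Returns 1 if the filter is locked, 0 if not, -1 on error. */
static int socket_filter_is_locked(int fd)
{
#ifdef SO_LOCK_FILTER
    int locked = 0;
    socklen_t len = sizeof(locked);

    if (getsockopt(fd, SOL_SOCKET, SO_LOCK_FILTER, &locked, &len) < 0)
        return -1;
    return locked != 0;
#else
    (void)fd;
    errno = ENOPROTOOPT;
    return -1;
#endif
}
```

A privileged program would attach its filter, call `lock_socket_filter(fd)`, and only then drop privileges or pass the socket on. According to the commit message, later attempts to clear the lock or to change the filter then fail with EPERM.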