1. 14 Nov, 2016 1 commit
  2. 29 Sep, 2016 1 commit
  3. 01 Dec, 2015 2 commits
    • Eric Dumazet's avatar
      net: fix sock_wake_async() rcu protection · ceb5d58b
      Eric Dumazet authored
      Dmitry provided a syzkaller (http://github.com/google/syzkaller
      
      )
      triggering a fault in sock_wake_async() when async IO is requested.
      
      Said program stressed af_unix sockets, but the issue is generic
      and should be addressed in core networking stack.
      
      The problem is that by the time sock_wake_async() is called,
      we should not access the @flags field of 'struct socket',
      as the inode containing this socket might be freed without
      further notice, and without RCU grace period.
      
      We already maintain an RCU protected structure, "struct socket_wq"
      so moving SOCKWQ_ASYNC_NOSPACE & SOCKWQ_ASYNC_WAITDATA into it
      is the safe route.
      
      It also reduces number of cache lines needing dirtying, so might
      provide a performance improvement anyway.
      
      In followup patches, we might move remaining flags (SOCK_NOSPACE,
      SOCK_PASSCRED, SOCK_PASSSEC) to save 8 bytes and let 'struct socket'
      being mostly read and let it being shared between cpus.
      
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ceb5d58b
    • Eric Dumazet's avatar
      net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA · 9cd3e072
      Eric Dumazet authored
      
      
      This patch is a cleanup to make following patch easier to
      review.
      
      Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
      from (struct socket)->flags to a (struct socket_wq)->flags
      to benefit from RCU protection in sock_wake_async()
      
      To ease backports, we rename both constants.
      
      Two new helpers, sk_set_bit(int nr, struct sock *sk)
      and sk_clear_bit(int net, struct sock *sk) are added so that
      following patch can change their implementation.
      
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9cd3e072
  4. 30 Nov, 2015 1 commit
  5. 09 May, 2015 1 commit
    • Jason Baron's avatar
      tcp: set SOCK_NOSPACE under memory pressure · 790ba456
      Jason Baron authored
      
      
      Under tcp memory pressure, calling epoll_wait() in edge triggered
      mode after -EAGAIN, can result in an indefinite hang in epoll_wait(),
      even when there is sufficient memory available to continue making
      progress. The problem is that when __sk_mem_schedule() returns 0
      under memory pressure, we do not set the SOCK_NOSPACE flag in the
      tcp write paths (tcp_sendmsg() or do_tcp_sendpages()). Then, since
      SOCK_NOSPACE is used to trigger wakeups when incoming acks create
      sufficient new space in the write queue, all outstanding packets
      are acked, but we never wake up with the the EPOLLOUT that we are
      expecting from epoll_wait().
      
      This issue is currently limited to epoll() when used in edge trigger
      mode, since 'tcp_poll()', does in fact currently set SOCK_NOSPACE.
      This is sufficient for poll()/select() and epoll() in level trigger
      mode. However, in edge trigger mode, epoll() is relying on the write
      path to set SOCK_NOSPACE. EPOLL(7) says that in edge-trigger mode we
      can only call epoll_wait() after read/write return -EAGAIN. Thus, in
      the case of the socket write, we are relying on the fact that
      tcp_sendmsg()/network write paths are going to issue a wakeup for
      us at some point in the future when we get -EAGAIN.
      
      Normally, epoll() edge trigger works fine when we've exceeded the
      sk->sndbuf because in that case we do set SOCK_NOSPACE. However, when
      we return -EAGAIN from the write path b/c we are over the tcp memory
      limits and not b/c we are over the sndbuf, we are never going to get
      another wakeup.
      
      I can reproduce this issue, using SO_SNDBUF, since __sk_mem_schedule()
      will return 0, or failure more readily with SO_SNDBUF:
      
      1) create socket and set SO_SNDBUF to N
      2) add socket as edge trigger
      3) write to socket and block in epoll on -EAGAIN
      4) cause tcp mem pressure via: echo "<small val>" > net.ipv4.tcp_mem
      
      The fix here is simply to set SOCK_NOSPACE in sk_stream_wait_memory()
      when the socket is non-blocking. Note that SOCK_NOSPACE, in addition
      to waking up outstanding waiters is also used to expand the size of
      the sk->sndbuf. However, we will not expand it by setting it in this
      case because tcp_should_expand_sndbuf(), ensures that no expansion
      occurs when we are under tcp memory pressure.
      
      Note that we could still hang if sk->sk_wmem_queue is 0, when we get
      the -EAGAIN. In this case the SOCK_NOSPACE bit will not help, since we
      are waiting for and event that will never happen. I believe
      that this case is harder to hit (and did not hit in my testing),
      in that over the tcp 'soft' memory limits, we continue to guarantee a
      minimum write buffer size. Perhaps, we could return -ENOSPC in this
      case, or maybe we simply issue a wakeup in this case, such that we
      keep retrying the write. Note that this case is not specific to
      epoll() ET, but rather would affect blocking sockets as well. So I
      view this patch as bringing epoll() edge-trigger into sync with the
      current poll()/select()/epoll() level trigger and blocking sockets
      behavior.
      
      Signed-off-by: default avatarJason Baron <jbaron@akamai.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      790ba456
  6. 14 Jan, 2014 1 commit
  7. 25 Jul, 2013 1 commit
  8. 04 Oct, 2010 1 commit
    • Nagendra Tomar's avatar
      net: Fix the condition passed to sk_wait_event() · 482964e5
      Nagendra Tomar authored
      This patch fixes the condition (3rd arg) passed to sk_wait_event() in
      sk_stream_wait_memory(). The incorrect check in sk_stream_wait_memory()
      causes the following soft lockup in tcp_sendmsg() when the global tcp
      memory pool has exhausted.
      
      >>> snip <<<
      
      localhost kernel: BUG: soft lockup - CPU#3 stuck for 11s! [sshd:6429]
      localhost kernel: CPU 3:
      localhost kernel: RIP: 0010:[sk_stream_wait_memory+0xcd/0x200]  [sk_stream_wait_memory+0xcd/0x200] sk_stream_wait_memory+0xcd/0x200
      localhost kernel:
      localhost kernel: Call Trace:
      localhost kernel:  [sk_stream_wait_memory+0x1b1/0x200] sk_stream_wait_memory+0x1b1/0x200
      localhost kernel:  [<ffffffff802557c0>] autoremove_wake_function+0x0/0x40
      localhost kernel:  [ipv6:tcp_sendmsg+0x6e6/0xe90] tcp_sendmsg+0x6e6/0xce0
      localhost kernel:  [sock_aio_write+0x126/0x140] sock_aio_write+0x126/0x140
      localhost kernel:  [xfs:do_sync_write+0xf1/0x130] do_sync_write+0xf1/0x130
      localhost kernel:  [<ffffffff802557c0>] autoremove_wake_function+0x0/0x40
      localhost kernel:  [hrtimer_start+0xe3/0x170] hrtimer_start+0xe3/0x170
      localhost kernel:  [vfs_write+0x185/0x190] vfs_write+0x185/0x190
      localhost kernel:  [sys_write+0x50/0x90] sys_write+0x50/0x90
      localhost kernel:  [system_call+0x7e/0x83] system_call+0x7e/0x83
      
      >>> snip <<<
      
      What is happening is, that the sk_wait_event() condition passed from
      sk_stream_wait_memory() evaluates to true for the case of tcp global memory
      exhaustion. This is because both sk_stream_memory_free() and vm_wait are true
      which causes sk_wait_event() to *not* call schedule_timeout().
      Hence sk_stream_wait_memory() returns immediately to the caller w/o sleeping.
      This causes the caller to again try allocation, which again fails and again
      calls sk_stream_wait_memory(), and so on.
      
      [ Bug introduced by commit c1cbe4b7
      
      
        ("[NET]: Avoid atomic xchg() for non-error case") -DaveM ]
      
      Signed-off-by: default avatarNagendra Singh Tomar <tomer_iisc@yahoo.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      482964e5
  9. 12 Jul, 2010 1 commit
  10. 01 May, 2010 1 commit
    • Eric Dumazet's avatar
      net: sock_def_readable() and friends RCU conversion · 43815482
      Eric Dumazet authored
      
      
      sk_callback_lock rwlock actually protects sk->sk_sleep pointer, so we
      need two atomic operations (and associated dirtying) per incoming
      packet.
      
      RCU conversion is pretty much needed :
      
      1) Add a new structure, called "struct socket_wq" to hold all fields
      that will need rcu_read_lock() protection (currently: a
      wait_queue_head_t and a struct fasync_struct pointer).
      
      [Future patch will add a list anchor for wakeup coalescing]
      
      2) Attach one of such structure to each "struct socket" created in
      sock_alloc_inode().
      
      3) Respect RCU grace period when freeing a "struct socket_wq"
      
      4) Change sk_sleep pointer in "struct sock" by sk_wq, pointer to "struct
      socket_wq"
      
      5) Change sk_sleep() function to use new sk->sk_wq instead of
      sk->sk_sleep
      
      6) Change sk_has_sleeper() to wq_has_sleeper() that must be used inside
      a rcu_read_lock() section.
      
      7) Change all sk_has_sleeper() callers to :
        - Use rcu_read_lock() instead of read_lock(&sk->sk_callback_lock)
        - Use wq_has_sleeper() to eventually wakeup tasks.
        - Use rcu_read_unlock() instead of read_unlock(&sk->sk_callback_lock)
      
      8) sock_wake_async() is modified to use rcu protection as well.
      
      9) Exceptions :
        macvtap, drivers/net/tun.c, af_unix use integrated "struct socket_wq"
      instead of dynamically allocated ones. They dont need rcu freeing.
      
      Some cleanups or followups are probably needed, (possible
      sk_callback_lock conversion to a spinlock for example...).
      
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      43815482
  11. 20 Apr, 2010 1 commit
  12. 18 May, 2009 1 commit
  13. 14 Oct, 2008 1 commit
  14. 26 Jul, 2008 1 commit
  15. 28 Jan, 2008 4 commits
    • Hideo Aoki's avatar
      [NET] CORE: Introducing new memory accounting interface. · 3ab224be
      Hideo Aoki authored
      
      
      This patch introduces new memory accounting functions for each network
      protocol. Most of them are renamed from memory accounting functions
      for stream protocols. At the same time, some stream memory accounting
      functions are removed since other functions do same thing.
      
      Renaming:
      	sk_stream_free_skb()		->	sk_wmem_free_skb()
      	__sk_stream_mem_reclaim()	->	__sk_mem_reclaim()
      	sk_stream_mem_reclaim()		->	sk_mem_reclaim()
      	sk_stream_mem_schedule 		->    	__sk_mem_schedule()
      	sk_stream_pages()      		->	sk_mem_pages()
      	sk_stream_rmem_schedule()	->	sk_rmem_schedule()
      	sk_stream_wmem_schedule()	->	sk_wmem_schedule()
      	sk_charge_skb()			->	sk_mem_charge()
      
      Removeing
      	sk_stream_rfree():	consolidates into sock_rfree()
      	sk_stream_set_owner_r(): consolidates into skb_set_owner_r()
      	sk_stream_mem_schedule()
      
      The following functions are added.
          	sk_has_account(): check if the protocol supports accounting
      	sk_mem_uncharge(): do the opposite of sk_mem_charge()
      
      In addition, to achieve consolidation, updating sk_wmem_queued is
      removed from sk_mem_charge().
      
      Next, to consolidate memory accounting functions, this patch adds
      memory accounting calls to network core functions. Moreover, present
      memory accounting call is renamed to new accounting call.
      
      Finally we replace present memory accounting calls with new interface
      in TCP and SCTP.
      
      Signed-off-by: default avatarTakahiro Yasui <tyasui@redhat.com>
      Signed-off-by: default avatarHideo Aoki <haoki@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3ab224be
    • Eric Dumazet's avatar
      [SOCK] Avoid divides in sk_stream_pages() and __sk_stream_mem_reclaim() · 21371f76
      Eric Dumazet authored
      
      
      sk_forward_alloc being signed, we should take care of divides by
      SK_STREAM_MEM_QUANTUM we do in sk_stream_pages() and
      __sk_stream_mem_reclaim()
      
      This patchs introduces SK_STREAM_MEM_QUANTUM_SHIFT, defined
      as ilog2(SK_STREAM_MEM_QUANTUM), to be able to use right
      shifts instead of plain divides.
      
      This should help compiler to choose right shifts instead of
      expensive divides (as seen with CONFIG_CC_OPTIMIZE_FOR_SIZE=y on x86)
      
      Signed-off-by: default avatarEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      21371f76
    • Pavel Emelyanov's avatar
      [NET]: Name magic constants in sock_wake_async() · 8d8ad9d7
      Pavel Emelyanov authored
      
      
      The sock_wake_async() performs a bit different actions
      depending on "how" argument. Unfortunately this argument
      ony has numerical magic values.
      
      I propose to give names to their constants to help people
      reading this function callers understand what's going on
      without looking into this function all the time.
      
      I suppose this is 2.6.25 material, but if it's not (or the
      naming seems poor/bad/awful), I can rework it against the
      current net-2.6 tree.
      
      Signed-off-by: default avatarPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8d8ad9d7
    • Pavel Emelyanov's avatar
      [NET]: Compact sk_stream_mem_schedule() code · 9859a790
      Pavel Emelyanov authored
      
      
      This function references sk->sk_prot->xxx for many times.
      It turned out, that there's so many code in it, that gcc
      cannot always optimize access to sk->sk_prot's fields.
      
      After saving the sk->sk_prot on the stack and comparing
      disassembled code, it turned out that the function became
      ~10 bytes shorter and made less dereferences (on i386 and
      x86_64). Stack consumption didn't grow.
      
      Besides, this patch drives most of this function into the
      80 columns limit.
      
      Signed-off-by: default avatarPavel Emelyanov <xemul@openvz.org>
      Acked-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9859a790
  16. 11 Feb, 2007 1 commit
  17. 13 Jul, 2006 1 commit
  18. 20 Apr, 2006 1 commit
    • David S. Miller's avatar
      [NET]: Add skb->truesize assertion checking. · dc6de336
      David S. Miller authored
      
      
      Add some sanity checking.  truesize should be at least sizeof(struct
      sk_buff) plus the current packet length.  If not, then truesize is
      seriously mangled and deserves a kernel log message.
      
      Currently we'll do the check for release of stream socket buffers.
      
      But we can add checks to more spots over time.
      
      Incorporating ideas from Herbert Xu.
      
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      dc6de336
  19. 03 Jan, 2006 1 commit
  20. 05 Nov, 2005 1 commit
    • Herbert Xu's avatar
      [NET]: Fix race condition in sk_stream_wait_connect · 6151b31c
      Herbert Xu authored
      
      
      When sk_stream_wait_connect detects a state transition to ESTABLISHED
      or CLOSE_WAIT prior to it going to sleep, it will return without
      calling finish_wait and decrementing sk_write_pending.
      
      This may result in crashes and other unintended behaviour.
      
      The fix is to always call finish_wait and update sk_write_pending since
      it is safe to do so even if the wait entry is no longer on the queue.
      
      This bug was tracked down with the help of Alex Sidorenko and the
      fix is also based on his suggestion.
      
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarArnaldo Carvalho de Melo <acme@mandriva.com>
      6151b31c
  21. 01 May, 2005 1 commit
    • Pavel Pisa's avatar
      [PATCH] DocBook: changes and extensions to the kernel documentation · 4dc3b16b
      Pavel Pisa authored
      I have recompiled Linux kernel 2.6.11.5 documentation for me and our
      university students again.  The documentation could be extended for more
      sources which are equipped by structured comments for recent 2.6 kernels.  I
      have tried to proceed with that task.  I have done that more times from 2.6.0
      time and it gets boring to do same changes again and again.  Linux kernel
      compiles after changes for i386 and ARM targets.  I have added references to
      some more files into kernel-api book, I have added some section names as well.
       So please, check that changes do not break something and that categories are
      not too much skewed.
      
      I have changed kernel-doc to accept "fastcall" and "asmlinkage" words reserved
      by kernel convention.  Most of the other changes are modifications in the
      comments to make kernel-doc happy, accept some parameters description and do
      not bail out on errors.  Changed <pid> to @pid in the description, moved some
      #ifdef before comments to correct function to comments bindings, etc.
      
      You can see result of the modified documentation build at
        http://cmp.felk.cvut.cz/~pisa/linux/lkdb-2.6.11.tar.gz
      
      
      
      Some more sources are ready to be included into kernel-doc generated
      documentation.  Sources has been added into kernel-api for now.  Some more
      section names added and probably some more chaos introduced as result of quick
      cleanup work.
      
      Signed-off-by: default avatarPavel Pisa <pisa@cmp.felk.cvut.cz>
      Signed-off-by: default avatarMartin Waitz <tali@admingilde.org>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      4dc3b16b
  22. 16 Apr, 2005 1 commit
    • Linus Torvalds's avatar
      Linux-2.6.12-rc2 · 1da177e4
      Linus Torvalds authored
      Initial git repository build. I'm not bothering with the full history,
      even though we have it. We can create a separate "historical" git
      archive of that later if we want to, and in the meantime it's about
      3.2GB when imported into git - space that would just make the early
      git days unnecessarily complicated, when we don't have a lot of good
      infrastructure for it.
      
      Let it rip!
      1da177e4