    tcp: set SOCK_NOSPACE under memory pressure · 790ba456
    Jason Baron authored
    
    
    Under tcp memory pressure, calling epoll_wait() in edge triggered
    mode after -EAGAIN can result in an indefinite hang in epoll_wait(),
    even when there is sufficient memory available to continue making
    progress. The problem is that when __sk_mem_schedule() returns 0
    under memory pressure, we do not set the SOCK_NOSPACE flag in the
    tcp write paths (tcp_sendmsg() or do_tcp_sendpages()). Then, since
    SOCK_NOSPACE is used to trigger wakeups when incoming acks create
    sufficient new space in the write queue, all outstanding packets
    are acked, but we never wake up with the EPOLLOUT that we are
    expecting from epoll_wait().
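
    As an illustration only, the missed wakeup can be modelled in user space.
    This is a toy sketch, not kernel code: struct toy_sock, toy_send() and
    toy_ack() are hypothetical names, "nospace" stands in for SOCK_NOSPACE,
    and "mem_pressure" stands in for __sk_mem_schedule() returning 0. The
    real ack path also checks how much space was freed, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy user-space model, NOT kernel code. All names are hypothetical. */
struct toy_sock {
    size_t queued;                /* bytes sitting in the write queue */
    size_t sndbuf;                /* per-socket send buffer limit */
    bool mem_pressure;            /* models __sk_mem_schedule() returning 0 */
    bool nospace;                 /* models the SOCK_NOSPACE flag */
    bool set_nospace_on_pressure; /* false models the bug, true the fix */
    unsigned wakeups;             /* EPOLLOUT edges delivered to the waiter */
};

/* Returns the number of bytes accepted, or -1 to model -EAGAIN. */
long toy_send(struct toy_sock *sk, size_t len)
{
    size_t room, n;

    assert(len > 0);
    if (sk->queued >= sk->sndbuf) {
        sk->nospace = true;       /* over sndbuf: the flag is always set */
        return -1;
    }
    if (sk->mem_pressure) {
        if (sk->set_nospace_on_pressure)
            sk->nospace = true;   /* only with the modelled fix */
        return -1;
    }
    room = sk->sndbuf - sk->queued;
    n = len < room ? len : room;
    sk->queued += n;
    return (long)n;
}

/* An incoming ack frees space; the waiter is woken only if the flag is set. */
void toy_ack(struct toy_sock *sk, size_t acked)
{
    assert(acked <= sk->queued);
    sk->queued -= acked;
    if (sk->nospace) {
        sk->nospace = false;
        sk->wakeups++;
    }
}
```

    With set_nospace_on_pressure left false, a send that fails under
    pressure followed by an ack of everything queued delivers no wakeup,
    which corresponds to the hang described above. With it set to true, the
    same sequence delivers one.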
    
    This issue is currently limited to epoll() when used in edge trigger
    mode, since tcp_poll() does in fact currently set SOCK_NOSPACE.
    This is sufficient for poll()/select() and epoll() in level trigger
    mode. However, in edge trigger mode, epoll() is relying on the write
    path to set SOCK_NOSPACE. EPOLL(7) says that in edge-trigger mode we
    can only call epoll_wait() after read/write return -EAGAIN. Thus, in
    the case of the socket write, we are relying on the fact that
    tcp_sendmsg()/network write paths are going to issue a wakeup for
    us at some point in the future when we get -EAGAIN.
    
    Normally, epoll() edge trigger works fine when we've exceeded the
    sk->sndbuf because in that case we do set SOCK_NOSPACE. However, when
    we return -EAGAIN from the write path b/c we are over the tcp memory
    limits and not b/c we are over the sndbuf, we are never going to get
    another wakeup.
    
    I can reproduce this issue using SO_SNDBUF, since __sk_mem_schedule()
    will return 0, i.e. failure, more readily with SO_SNDBUF set:
    
    1) create socket and set SO_SNDBUF to N
    2) add socket as edge trigger
    3) write to socket and block in epoll on -EAGAIN
    4) cause tcp mem pressure via: echo "<small val>" > net.ipv4.tcp_mem
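
    Steps 1-3 might look roughly like the following user-space sketch. It
    is not the test program used for this patch: the helper names
    (connect_loopback(), add_edge_triggered(), write_until_eagain()) are
    hypothetical, and the use of a throwaway loopback listener as the peer
    is an assumption made to keep the example self-contained.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Step 1: create a TCP socket, set SO_SNDBUF, and connect it to a
 * throwaway loopback listener. Returns the non-blocking client fd and
 * stores the accepted peer fd in *peer_out; returns -1 on failure. */
int connect_loopback(int sndbuf, int *peer_out)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int lfd, cfd;

    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                      /* let the kernel pick a port */
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &len) < 0) {
        close(lfd);
        return -1;
    }
    cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (cfd < 0 ||
        setsockopt(cfd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0 ||
        connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        fcntl(cfd, F_SETFL, fcntl(cfd, F_GETFL) | O_NONBLOCK) < 0) {
        if (cfd >= 0)
            close(cfd);
        close(lfd);
        return -1;
    }
    *peer_out = accept(lfd, NULL, NULL);    /* peer never reads */
    close(lfd);
    if (*peer_out < 0) {
        close(cfd);
        return -1;
    }
    return cfd;
}

/* Step 2: register fd for edge-triggered EPOLLOUT; returns the epoll fd. */
int add_edge_triggered(int fd)
{
    struct epoll_event ev;
    int epfd = epoll_create1(0);

    if (epfd < 0)
        return -1;
    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLOUT | EPOLLET;
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        close(epfd);
        return -1;
    }
    return epfd;
}

/* Step 3 (first half): write until -EAGAIN. Returns the number of bytes
 * queued, or -1 on any other error. */
long write_until_eagain(int fd)
{
    char buf[4096];
    long total = 0;

    memset(buf, 'x', sizeof(buf));
    for (;;) {
        ssize_t n = send(fd, buf, sizeof(buf), MSG_NOSIGNAL);
        if (n > 0) {
            total += n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return total;
        return -1;
    }
}
```

    After write_until_eagain() returns, the caller would block in
    epoll_wait() on the returned epoll fd, and step 4 would then be done
    separately as root by lowering the net.ipv4.tcp_mem sysctl. For the
    wakeup to be expected at all, the peer also has to start reading so
    that acks arrive.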
    
    The fix here is simply to set SOCK_NOSPACE in sk_stream_wait_memory()
    when the socket is non-blocking. Note that SOCK_NOSPACE, in addition
    to waking up outstanding waiters, is also used to expand the size of
    the sk->sndbuf. However, setting it in this case will not expand the
    sndbuf, because tcp_should_expand_sndbuf() ensures that no expansion
    occurs when we are under tcp memory pressure.
    
    Note that we could still hang if sk->sk_wmem_queued is 0, when we get
    the -EAGAIN. In this case the SOCK_NOSPACE bit will not help, since we
    are waiting for an event that will never happen. I believe
    that this case is harder to hit (and did not hit in my testing),
    in that over the tcp 'soft' memory limits, we continue to guarantee a
    minimum write buffer size. Perhaps, we could return -ENOSPC in this
    case, or maybe we simply issue a wakeup in this case, such that we
    keep retrying the write. Note that this case is not specific to
    epoll() ET, but rather would affect blocking sockets as well. So I
    view this patch as bringing epoll() edge-trigger into sync with the
    current poll()/select()/epoll() level trigger and blocking sockets
    behavior.
    
    Signed-off-by: Jason Baron <jbaron@akamai.com>
    Acked-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>