1. 24 Jan, 2017 1 commit
    • Krister Johansen's avatar
      Introduce a sysctl that modifies the value of PROT_SOCK. · 4548b683
      Krister Johansen authored
      
      
      Add net.ipv4.ip_unprivileged_port_start, which is a per namespace sysctl
      that denotes the first unprivileged inet port in the namespace.  To
      disable all privileged ports set this to zero.  It also checks for
      overlap with the local port range.  The privileged and local range may
      not overlap.
      
      The use case for this change is to allow containerized processes to bind
      to priviliged ports, but prevent them from ever being allowed to modify
      their container's network configuration.  The latter is accomplished by
      ensuring that the network namespace is not a child of the user
      namespace.  This modification was needed to allow the container manager
      to disable a namespace's priviliged port restrictions without exposing
      control of the network namespace to processes in the user namespace.
      
      Signed-off-by: default avatarKrister Johansen <kjlx@templeofstupid.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4548b683
  2. 22 Jan, 2017 2 commits
  3. 20 Jan, 2017 12 commits
  4. 19 Jan, 2017 2 commits
  5. 18 Jan, 2017 17 commits
  6. 17 Jan, 2017 6 commits
    • Jason Baron's avatar
      tcp: accept RST for rcv_nxt - 1 after receiving a FIN · 0e40f4c9
      Jason Baron authored
      
      
      Using a Mac OSX box as a client connecting to a Linux server, we have found
      that when certain applications (such as 'ab'), are abruptly terminated
      (via ^C), a FIN is sent followed by a RST packet on tcp connections. The
      FIN is accepted by the Linux stack but the RST is sent with the same
      sequence number as the FIN, and Linux responds with a challenge ACK per
      RFC 5961. The OSX client then sometimes (they are rate-limited) does not
      reply with any RST as would be expected on a closed socket.
      
      This results in sockets accumulating on the Linux server left mostly in
      the CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible.
      This sequence of events can tie up a lot of resources on the Linux server
      since there may be a lot of data in write buffers at the time of the RST.
      Accepting a RST equal to rcv_nxt - 1, after we have already successfully
      processed a FIN, has made a significant difference for us in practice, by
      freeing up unneeded resources in a more expedient fashion.
      
      A packetdrill test demonstrating the behavior:
      
      // testing mac osx rst behavior
      
      // Establish a connection
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      0.000 bind(3, ..., ...) = 0
      0.000 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32768 <mss 1460,nop,wscale 10>
      0.100 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 5>
      0.200 < . 1:1(0) ack 1 win 32768
      0.200 accept(3, ..., ...) = 4
      
      // Client closes the connection
      0.300 < F. 1:1(0) ack 1 win 32768
      
      // now send rst with same sequence
      0.300 < R. 1:1(0) ack 1 win 32768
      
      // make sure we are in TCP_CLOSE
      0.400 %{
      assert tcpi_state == 7
      }%
      
      Signed-off-by: default avatarJason Baron <jbaron@akamai.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0e40f4c9
    • Gao Feng's avatar
      net: ping: Use right format specifier to avoid type casting · a7ef6715
      Gao Feng authored
      
      
      The inet_num is u16, so use %hu instead of casting it to int. And
      the sk_bound_dev_if is int actually, so it needn't cast to int.
      
      Signed-off-by: default avatarGao Feng <fgao@ikuai8.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a7ef6715
    • Lance Richardson's avatar
      bridge: sparse fixes in br_ip6_multicast_alloc_query() · 53631a5f
      Lance Richardson authored
      
      
      Changed type of csum field in struct igmpv3_query from __be16 to
      __sum16 to eliminate type warning, made same change in struct
      igmpv3_report for consistency.
      
      Fixed up an ntohs() where htons() should have been used instead.
      
      Signed-off-by: default avatarLance Richardson <lrichard@redhat.com>
      Acked-by: default avatarStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      53631a5f
    • Robert Shearman's avatar
      mpls: Packet stats · 27d69105
      Robert Shearman authored
      
      
      Having MPLS packet stats is useful for observing network operation and
      for diagnosing network problems. In the absence of anything better,
      RFC2863 and RFC3813 are used for guidance for which stats to expose
      and the semantics of them. In particular rx_noroutes maps to in
      unknown protos in RFC2863. The stats are exposed to userspace via
      AF_MPLS attributes embedded in the IFLA_STATS_AF_SPEC attribute of
      RTM_GETSTATS messages.
      
      All the introduced fields are 64-bit, even error ones, to ensure no
      overflow with long uptimes. Per-CPU counters are used to avoid
      cache-line contention on the commonly used fields. The other fields
      have also been made per-CPU for code to avoid performance problems in
      error conditions on the assumption that on some platforms the cost of
      atomic operations could be more expensive than sending the packet
      (which is what would be done in the success case). If that's not the
      case, we could instead not use per-CPU counters for these fields.
      
      Only unicast and non-fragment are exposed at the moment, but other
      counters can be exposed in the future either by adding to the end of
      struct mpls_link_stats or by additional netlink attributes in the
      AF_MPLS IFLA_STATS_AF_SPEC nested attribute.
      
      Signed-off-by: default avatarRobert Shearman <rshearma@brocade.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      27d69105
    • Robert Shearman's avatar
      net: AF-specific RTM_GETSTATS attributes · aefb4d4a
      Robert Shearman authored
      
      
      Add the functionality for including address-family-specific per-link
      stats in RTM_GETSTATS messages. This is done through adding a new
      IFLA_STATS_AF_SPEC attribute under which address family attributes are
      nested and then the AF-specific attributes can be further nested. This
      follows the model of IFLA_AF_SPEC on RTM_*LINK messages and it has the
      advantage of presenting an easily extended hierarchy. The rtnl_af_ops
      structure is extended to provide AFs with the opportunity to fill and
      provide the size of their stats attributes.
      
      One alternative would have been to provide AFs with the ability to add
      attributes directly into the RTM_GETSTATS message without a nested
      hierarchy. I discounted this approach as it increases the rate at
      which the 32 attribute number space is used up and it makes
      implementation a little more tricky for stats dump resuming (at the
      moment the order in which attributes are added to the message has to
      match the numeric order of the attributes).
      
      Another alternative would have been to register per-AF RTM_GETSTATS
      handlers. I discounted this approach as I perceived a common use-case
      to be getting all the stats for an interface and this approach would
      necessitate multiple requests/dumps to retrieve them all.
      
      Signed-off-by: default avatarRobert Shearman <rshearma@brocade.com>
      Acked-by: default avatarRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      aefb4d4a
    • Jamal Hadi Salim's avatar
      net sched actions: fix refcnt when GETing of action after bind · 0faa9cb5
      Jamal Hadi Salim authored
      Demonstrating the issue:
      
      .. add a drop action
      $sudo $TC actions add action drop index 10
      
      .. retrieve it
      $ sudo $TC -s actions get action gact index 10
      
      	action order 1: gact action drop
      	 random type none pass val 0
      	 index 10 ref 2 bind 0 installed 29 sec used 29 sec
       	Action statistics:
      	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
      	backlog 0b 0p requeues 0
      
      ... bug 1 above: reference is two.
          Reference is actually 1 but we forget to subtract 1.
      
      ... do a GET again and we see the same issue
          try a few times and nothing changes
      ~$ sudo $TC -s actions get action gact index 10
      
      	action order 1: gact action drop
      	 random type none pass val 0
      	 index 10 ref 2 bind 0 installed 31 sec used 31 sec
       	Action statistics:
      	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
      	backlog 0b 0p requeues 0
      
      ... lets try to bind the action to a filter..
      $ sudo $TC qdisc add dev lo ingress
      $ sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \
        u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 10
      
      ... and now a few GETs:
      $ sudo $TC -s actions get action gact index 10
      
      	action order 1: gact action drop
      	 random type none pass val 0
      	 index 10 ref 3 bind 1 installed 204 sec used 204 sec
       	Action statistics:
      	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
      	backlog 0b 0p requeues 0
      
      $ sudo $TC -s actions get action gact index 10
      
      	action order 1: gact action drop
      	 random type none pass val 0
      	 index 10 ref 4 bind 1 installed 206 sec used 206 sec
       	Action statistics:
      	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
      	backlog 0b 0p requeues 0
      
      $ sudo $TC -s actions get action gact index 10
      
      	action order 1: gact action drop
      	 random type none pass val 0
      	 index 10 ref 5 bind 1 installed 235 sec used 235 sec
       	Action statistics:
      	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
      	backlog 0b 0p requeues 0
      
      .... as can be observed the reference count keeps going up.
      
      After the fix
      
      $ sudo $TC actions add action drop index 10
      $ sudo $TC -s actions get action gact index 10
      
      	action order 1: gact action drop
      	 random type none pass val 0
      	 index 10 ref 1 bind 0 installed 4 sec used 4 sec
       	Action statistics:
      	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
      	backlog 0b 0p requeues 0
      
      $ sudo $TC -s actions get action gact index 10
      
      	action order 1: gact action drop
      	 random type none pass val 0
      	 index 10 ref 1 bind 0 installed 6 sec used 6 sec
       	Action statistics:
      	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
      	backlog 0b 0p requeues 0
      
      $ sudo $TC qdisc add dev lo ingress
      $ sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \
        u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 10
      
      $ sudo $TC -s actions get action gact index 10
      
      	action order 1: gact action drop
      	 random type none pass val 0
      	 index 10 ref 2 bind 1 installed 32 sec used 32 sec
       	Action statistics:
      	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
      	backlog 0b 0p requeues 0
      
      $ sudo $TC -s actions get action gact index 10
      
      	action order 1: gact action drop
      	 random type none pass val 0
      	 index 10 ref 2 bind 1 installed 33 sec used 33 sec
       	Action statistics:
      	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
      	backlog 0b 0p requeues 0
      
      Fixes: aecc5cef
      
       ("net sched actions: fix GETing actions")
      Signed-off-by: default avatarJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0faa9cb5