1. 24 Oct, 2015 8 commits
    • Jon Paul Maloy's avatar
      tipc: introduce capability bit for broadcast synchronization · fd556f20
      Jon Paul Maloy authored
      
      
      Until now, we have tried to support both the newer, dedicated broadcast
      synchronization mechanism along with the older, less safe, RESET_MSG/
      ACTIVATE_MSG based one. The latter method has turned out to be a hazard
      in a highly dynamic cluster, so we find it safer to disable it completely
      when we find that the former mechanism is supported by the peer node.
      
      For this purpose, we now introduce a new capabability bit,
      TIPC_BCAST_SYNCH, to inform any peer nodes that dedicated broadcast
      syncronization is supported by the present node. The new bit is conveyed
      between peers in the 'capabilities' field of neighbor discovery messages.
      
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Ying Xue's avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fd556f20
    • Jon Paul Maloy's avatar
      tipc: let broadcast transmission use new link transmit function · 2f566124
      Jon Paul Maloy authored
      
      
      This commit simplifies the broadcast link transmission function, by
      leveraging previous changes to the link transmission function and the
      broadcast transmission link life cycle.
      
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Ying Xue's avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2f566124
    • Jon Paul Maloy's avatar
      tipc: make struct tipc_link generic to support broadcast · c1ab3f1d
      Jon Paul Maloy authored
      
      
      Realizing that unicast is just a special case of broadcast, we also see
      that we can go in the other direction, i.e., that modest changes to the
      current unicast link can make it generic enough to support broadcast.
      
      The following changes are introduced here:
      
      - A new counter ("ackers") in struct tipc_link, to indicate how many
        peers need to ack a packet before it can be released.
      - A corresponding counter in the skb user area, to keep track of how
        many peers a are left to ack before a buffer can be released.
      - A new counter ("acked"), to keep persistent track of how far a peer
        has acked at the moment, i.e., where in the transmission queue to
        start updating buffers when the next ack arrives. This is to avoid
        double acknowledgements from a peer, with inadvertent relase of
        packets as a result.
      - A more generic tipc_link_retrans() function, where retransmit starts
        from a given sequence number, instead of the first packet in the
        transmision queue. This is to minimize the number of retransmitted
        packets on the broadcast media.
      
      When the new functionality is taken into use in the next commits,
      we expect it to have minimal effect on unicast mode performance.
      
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Ying Xue's avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c1ab3f1d
    • Jon Paul Maloy's avatar
      tipc: use explicit allocation of broadcast send link · 32301906
      Jon Paul Maloy authored
      
      
      The broadcast link instance (struct tipc_link) used for sending is
      currently aggregated into struct tipc_bclink. This means that we cannot
      use the regular tipc_link_create() function for initiating the link, but
      do instead have to initiate numerous fields directly from the
      bcast_init() function.
      
      We want to reduce dependencies between the broadcast functionality
      and the inner workings of tipc_link. In this commit, we introduce
      a new function tipc_bclink_create() to link.c, and allocate the
      instance of the link separately using this function.
      
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Ying Xue's avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      32301906
    • Jon Paul Maloy's avatar
      tipc: make link implementation independent from struct tipc_bearer · 0e05498e
      Jon Paul Maloy authored
      
      
      In reality, the link implementation is already independent from
      struct tipc_bearer, in that it doesn't store any reference to it.
      However, we still pass on a pointer to a bearer instance in the
      function tipc_link_create(), just to have it extract some
      initialization information from it.
      
      I later commits, we need to create instances of tipc_link without
      having any associated struct tipc_bearer. To facilitate this, we
      want to extract the initialization data already in the creator
      function in node.c, before calling tipc_link_create(), and pass
      this info on as individual parameters in the call.
      
      This commit introduces this change.
      
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Ying Xue's avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0e05498e
    • Jon Paul Maloy's avatar
      tipc: create broadcast transmission link at namespace init · 5fd9fd63
      Jon Paul Maloy authored
      
      
      The broadcast transmission link is currently instantiated when the
      network subsystem is started, i.e., on order from user space via netlink.
      
      This forces the broadcast transmission code to do unnecessary tests for
      the existence of the transmission link, as well in single mode node as
      in network mode.
      
      In this commit, we do instead create the link during initialization of
      the name space, and remove it when it is stopped. The fact that the
      transmission link now has a guaranteed longer life cycle than any of its
      potential clients paves the way for further code simplifcations
      and optimizations.
      
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Ying Xue's avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5fd9fd63
    • Jon Paul Maloy's avatar
      tipc: move broadcast link lock to struct tipc_net · 0043550b
      Jon Paul Maloy authored
      
      
      The broadcast lock will need to be acquired outside bcast.c in a later
      commit. For this reason, we move the lock to struct tipc_net. Consistent
      with the changes in the previous commit, we also introducee two new
      functions tipc_bcast_lock() and tipc_bcast_unlock(). The code that is
      currently using tipc_bclink_lock()/unlock() will be phased out during
      the coming commits in this series.
      
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Ying Xue's avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0043550b
    • Jon Paul Maloy's avatar
      tipc: move bcast definitions to bcast.c · 6beb19a6
      Jon Paul Maloy authored
      
      
      Currently, a number of structure and function definitions related
      to the broadcast functionality are unnecessarily exposed in the file
      bcast.h. This obscures the fact that the external interface towards
      the broadcast link in fact is very narrow, and causes unnecessary
      recompilations of other files when anything changes in those
      definitions.
      
      In this commit, we move as many of those definitions as is currently
      possible to the file bcast.c.
      
      We also rename the structure 'tipc_bclink' to 'tipc_bc_base', both
      since the name does not correctly describe the contents of this
      struct, and will do so even less in the future, and because we want
      to use the term 'link' more appropriately in the functionality
      introduced later in this series.
      
      Finally, we rename a couple of functions, such as tipc_bclink_xmit()
      and others that will be kept in the future, to include the term 'bcast'
      instead.
      
      There are no functional changes in this commit.
      
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Ying Xue's avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6beb19a6
  2. 23 Oct, 2015 12 commits
    • Robert Shearman's avatar
      mpls: flow-based multipath selection · 1c78efa8
      Robert Shearman authored
      
      
      Change the selection of a multipath route to use a flow-based
      hash. This more suitable for traffic sensitive to reordering within a
      flow (e.g. TCP, L2VPN) and whilst still allowing a good distribution
      of traffic given enough flows.
      
      Selection of the path for a multipath route is done using a hash of:
      1. Label stack up to MAX_MP_SELECT_LABELS labels or up to and
         including entropy label, whichever is first.
      2. 3-tuple of (L3 src, L3 dst, proto) from IPv4/IPv6 header in MPLS
         payload, if present.
      
      Naturally, a 5-tuple hash using L4 information in addition would be
      possible and be better in some scenarios, but there is a tradeoff
      between looking deeper into the packet to achieve good distribution,
      and packet forwarding performance, and I have erred on the side of the
      latter as the default.
      
      Signed-off-by: default avatarRobert Shearman <rshearma@brocade.com>
      Signed-off-by: default avatarRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1c78efa8
    • Roopa Prabhu's avatar
      mpls: multipath route support · f8efb73c
      Roopa Prabhu authored
      
      
      This patch adds support for MPLS multipath routes.
      
      Includes following changes to support multipath:
      - splits struct mpls_route into 'struct mpls_route + struct mpls_nh'
      
      - 'struct mpls_nh' represents a mpls nexthop label forwarding entry
      
      - moves mpls route and nexthop structures into internal.h
      
      - A mpls_route can point to multiple mpls_nh structs
      
      - the nexthops are maintained as a array (similar to ipv4 fib)
      
      - In the process of restructuring, this patch also consistently changes
        all labels to u8
      
      - Adds support to parse/fill RTA_MULTIPATH netlink attribute for
      multipath routes similar to ipv4/v6 fib
      
      - In this patch, the multipath route nexthop selection algorithm
      simply returns the first nexthop. It is replaced by a
      hash based algorithm from Robert Shearman in the next patch
      
      - mpls_route_update cleanup: remove 'dev' handling in mpls_route_update.
      mpls_route_update though implemented to update based on dev, it was
      never used that way. And the dev handling gets tricky with multiple
      nexthops. Cannot match against any single nexthops dev. So, this patch
      removes the unused 'dev' handling in mpls_route_update.
      
      - dead route/path handling will be implemented in a subsequent patch
      
      Example:
      
      $ip -f mpls route add 100 nexthop as 200 via inet 10.1.1.2 dev swp1 \
                      nexthop as 700 via inet 10.1.1.6 dev swp2 \
                      nexthop as 800 via inet 40.1.1.2 dev swp3
      
      $ip  -f mpls route show
      100
              nexthop as to 200 via inet 10.1.1.2  dev swp1
              nexthop as to 700 via inet 10.1.1.6  dev swp2
              nexthop as to 800 via inet 40.1.1.2  dev swp3
      
      Signed-off-by: default avatarRoopa Prabhu <roopa@cumulusnetworks.com>
      Acked-by: default avatarRobert Shearman <rshearma@brocade.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f8efb73c
    • Li RongQing's avatar
      net: sysctl: fix a kmemleak warning · ce9d9b8e
      Li RongQing authored
      
      
      the returned buffer of register_sysctl() is stored into net_header
      variable, but net_header is not used after, and compiler maybe
      optimise the variable out, and lead kmemleak reported the below warning
      
      	comm "swapper/0", pid 1, jiffies 4294937448 (age 267.270s)
      	hex dump (first 32 bytes):
      	90 38 8b 01 c0 ff ff ff 00 00 00 00 01 00 00 00 .8..............
      	01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
      	backtrace:
      	[<ffffffc00020f134>] create_object+0x10c/0x2a0
      	[<ffffffc00070ff44>] kmemleak_alloc+0x54/0xa0
      	[<ffffffc0001fe378>] __kmalloc+0x1f8/0x4f8
      	[<ffffffc00028e984>] __register_sysctl_table+0x64/0x5a0
      	[<ffffffc00028eef0>] register_sysctl+0x30/0x40
      	[<ffffffc00099c304>] net_sysctl_init+0x20/0x58
      	[<ffffffc000994dd8>] sock_init+0x10/0xb0
      	[<ffffffc0000842e0>] do_one_initcall+0x90/0x1b8
      	[<ffffffc000966bac>] kernel_init_freeable+0x218/0x2f0
      	[<ffffffc00070ed6c>] kernel_init+0x1c/0xe8
      	[<ffffffc000083bfc>] ret_from_fork+0xc/0x50
      	[<ffffffffffffffff>] 0xffffffffffffffff <<end check kmemleak>>
      
      Before fix, the objdump result on ARM64:
      0000000000000000 <net_sysctl_init>:
         0:   a9be7bfd        stp     x29, x30, [sp,#-32]!
         4:   90000001        adrp    x1, 0 <net_sysctl_init>
         8:   90000000        adrp    x0, 0 <net_sysctl_init>
         c:   910003fd        mov     x29, sp
        10:   91000021        add     x1, x1, #0x0
        14:   91000000        add     x0, x0, #0x0
        18:   a90153f3        stp     x19, x20, [sp,#16]
        1c:   12800174        mov     w20, #0xfffffff4                // #-12
        20:   94000000        bl      0 <register_sysctl>
        24:   b4000120        cbz     x0, 48 <net_sysctl_init+0x48>
        28:   90000013        adrp    x19, 0 <net_sysctl_init>
        2c:   91000273        add     x19, x19, #0x0
        30:   9101a260        add     x0, x19, #0x68
        34:   94000000        bl      0 <register_pernet_subsys>
        38:   2a0003f4        mov     w20, w0
        3c:   35000060        cbnz    w0, 48 <net_sysctl_init+0x48>
        40:   aa1303e0        mov     x0, x19
        44:   94000000        bl      0 <register_sysctl_root>
        48:   2a1403e0        mov     w0, w20
        4c:   a94153f3        ldp     x19, x20, [sp,#16]
        50:   a8c27bfd        ldp     x29, x30, [sp],#32
        54:   d65f03c0        ret
      After:
      0000000000000000 <net_sysctl_init>:
         0:   a9bd7bfd        stp     x29, x30, [sp,#-48]!
         4:   90000000        adrp    x0, 0 <net_sysctl_init>
         8:   910003fd        mov     x29, sp
         c:   a90153f3        stp     x19, x20, [sp,#16]
        10:   90000013        adrp    x19, 0 <net_sysctl_init>
        14:   91000000        add     x0, x0, #0x0
        18:   91000273        add     x19, x19, #0x0
        1c:   f90013f5        str     x21, [sp,#32]
        20:   aa1303e1        mov     x1, x19
        24:   12800175        mov     w21, #0xfffffff4                // #-12
        28:   94000000        bl      0 <register_sysctl>
        2c:   f9002260        str     x0, [x19,#64]
        30:   b40001a0        cbz     x0, 64 <net_sysctl_init+0x64>
        34:   90000014        adrp    x20, 0 <net_sysctl_init>
        38:   91000294        add     x20, x20, #0x0
        3c:   9101a280        add     x0, x20, #0x68
        40:   94000000        bl      0 <register_pernet_subsys>
        44:   2a0003f5        mov     w21, w0
        48:   35000080        cbnz    w0, 58 <net_sysctl_init+0x58>
        4c:   aa1403e0        mov     x0, x20
        50:   94000000        bl      0 <register_sysctl_root>
        54:   14000004        b       64 <net_sysctl_init+0x64>
        58:   f9402260        ldr     x0, [x19,#64]
        5c:   94000000        bl      0 <unregister_sysctl_table>
        60:   f900227f        str     xzr, [x19,#64]
        64:   2a1503e0        mov     w0, w21
        68:   f94013f5        ldr     x21, [sp,#32]
        6c:   a94153f3        ldp     x19, x20, [sp,#16]
        70:   a8c37bfd        ldp     x29, x30, [sp],#48
        74:   d65f03c0        ret
      
      Add the possible error handle to free the net_header to remove the
      kmemleak warning
      
      Signed-off-by: default avatarLi RongQing <roy.qing.li@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ce9d9b8e
    • Hiroshi Shimamoto's avatar
      if_link: Add control trust VF · dd461d6a
      Hiroshi Shimamoto authored
      
      
      Add netlink directives and ndo entry to trust VF user.
      
      This controls the special permission of VF user.
      The administrator will dedicatedly trust VF user to use some features
      which impacts security and/or performance.
      
      The administrator never turn it on unless VF user is fully trusted.
      
      CC: Sy Jong Choi <sy.jong.choi@intel.com>
      Signed-off-by: default avatarHiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
      Acked-by: default avatarGreg Rose <gregory.v.rose@intel.com>
      Tested-by: default avatarKrishneil Singh <Krishneil.k.singh@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      dd461d6a
    • Eric Dumazet's avatar
      tcp/dccp: fix hashdance race for passive sessions · 5e0724d0
      Eric Dumazet authored
      Multiple cpus can process duplicates of incoming ACK messages
      matching a SYN_RECV request socket. This is a rare event under
      normal operations, but definitely can happen.
      
      Only one must win the race, otherwise corruption would occur.
      
      To fix this without adding new atomic ops, we use logic in
      inet_ehash_nolisten() to detect the request was present in the same
      ehash bucket where we try to insert the new child.
      
      If request socket was not found, we have to undo the child creation.
      
      This actually removes a spin_lock()/spin_unlock() pair in
      reqsk_queue_unlink() for the fast path.
      
      Fixes: e994b2f0 ("tcp: do not lock listener to process SYN packets")
      Fixes: 079096f1
      
       ("tcp/dccp: install syn_recv requests into ehash table")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5e0724d0
    • Li RongQing's avatar
      af_key: fix two typos · f6b8dec9
      Li RongQing authored
      
      
      Signed-off-by: default avatarLi RongQing <roy.qing.li@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f6b8dec9
    • Paolo Abeni's avatar
      ipv4: implement support for NOPREFIXROUTE ifa flag for ipv4 address · 7b131180
      Paolo Abeni authored
      
      
      Currently adding a new ipv4 address always cause the creation of the
      related network route, with default metric. When a host has multiple
      interfaces on the same network, multiple routes with the same metric
      are created.
      
      If the userspace wants to set specific metric on each routes, i.e.
      giving better metric to ethernet links in respect to Wi-Fi ones,
      the network routes must be deleted and recreated, which is error-prone.
      
      This patch implements the support for IFA_F_NOPREFIXROUTE for ipv4
      address. When an address is added with such flag set, no associated
      network route is created, no network route is deleted when
      said IP is gone and it's up to the user space manage such route.
      
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7b131180
    • Hannes Frederic Sowa's avatar
      ipv6: protect mtu calculation of wrap-around and infinite loop by rounding issues · b72a2b01
      Hannes Frederic Sowa authored
      
      
      Raw sockets with hdrincl enabled can insert ipv6 extension headers
      right into the data stream. In case we need to fragment those packets,
      we reparse the options header to find the place where we can insert
      the fragment header. If the extension headers exceed the link's MTU we
      actually cannot make progress in such a case.
      
      Instead of ending up in broken arithmetic or rounding towards 0 and
      entering an endless loop in ip6_fragment, just prevent those cases by
      aborting early and signal -EMSGSIZE to user space.
      
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b72a2b01
    • Andrew Shewmaker's avatar
      tcp: allow dctcp alpha to drop to zero · c80dbe04
      Andrew Shewmaker authored
      
      
      If alpha is strictly reduced by alpha >> dctcp_shift_g and if alpha is less
      than 1 << dctcp_shift_g, then alpha may never reach zero. For example,
      given shift_g=4 and alpha=15, alpha >> dctcp_shift_g yields 0 and alpha
      remains 15. The effect isn't noticeable in this case below cwnd=137, but
      could gradually drive uncongested flows with leftover alpha down to
      cwnd=137. A larger dctcp_shift_g would have a greater effect.
      
      This change causes alpha=15 to drop to 0 instead of being decrementing by 1
      as it would when alpha=16. However, it requires one less conditional to
      implement since it doesn't have to guard against subtracting 1 from 0U. A
      decay of 15 is not unreasonable since an equal or greater amount occurs at
      alpha >= 240.
      
      Signed-off-by: default avatarAndrew G. Shewmaker <agshew@gmail.com>
      Acked-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c80dbe04
    • lucien's avatar
      ipv6: fix the incorrect return value of throw route · ab997ad4
      lucien authored
      
      
      The error condition -EAGAIN, which is signaled by throw routes, tells
      the rules framework to walk on searching for next matches. If the walk
      ends and we stop walking the rules with the result of a throw route we
      have to translate the error conditions to -ENETUNREACH.
      
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ab997ad4
    • Pravin B Shelar's avatar
      openvswitch: Fix egress tunnel info. · fc4099f1
      Pravin B Shelar authored
      While transitioning to netdev based vport we broke OVS
      feature which allows user to retrieve tunnel packet egress
      information for lwtunnel devices.  Following patch fixes it
      by introducing ndo operation to get the tunnel egress info.
      Same ndo operation can be used for lwtunnel devices and compat
      ovs-tnl-vport devices. So after adding such device operation
      we can remove similar operation from ovs-vport.
      
      Fixes: 614732ea
      
       ("openvswitch: Use regular VXLAN net_device device").
      Signed-off-by: default avatarPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fc4099f1
    • Jorgen Hansen's avatar
      VSOCK: Fix lockdep issue. · 8566b86a
      Jorgen Hansen authored
      
      
      The recent fix for the vsock sock_put issue used the wrong
      initializer for the transport spin_lock causing an issue when
      running with lockdep checking.
      
      Testing: Verified fix on kernel with lockdep enabled.
      
      Reviewed-by: default avatarThomas Hellstrom <thellstrom@vmware.com>
      Signed-off-by: default avatarJorgen Hansen <jhansen@vmware.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8566b86a
  3. 22 Oct, 2015 20 commits
    • Vivien Didelot's avatar
      net: dsa: remove port_fdb_getnext · 1a49a2fb
      Vivien Didelot authored
      
      
      No driver implements port_fdb_getnext anymore, and port_fdb_dump is
      preferred anyway, so remove this function from DSA.
      
      Signed-off-by: default avatarVivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1a49a2fb
    • Vivien Didelot's avatar
      net: dsa: add port_fdb_dump function · ea70ba98
      Vivien Didelot authored
      
      
      Not all switch chips support a Get Next operation to iterate on its FDB.
      So add a more simple port_fdb_dump function for them.
      
      Signed-off-by: default avatarVivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ea70ba98
    • David Ahern's avatar
      net: ipv6: Dont add RT6_LOOKUP_F_IFACE flag if saddr set · d46a9d67
      David Ahern authored
      741a11d9 ("net: ipv6: Add RT6_LOOKUP_F_IFACE flag if oif is set")
      adds the RT6_LOOKUP_F_IFACE flag to make device index mismatch fatal if
      oif is given. Hajime reported that this change breaks the Mobile IPv6
      use case that wants to force the message through one interface yet use
      the source address from another interface. Handle this case by only
      adding the flag if oif is set and saddr is not set.
      
      Fixes: 741a11d9
      
       ("net: ipv6: Add RT6_LOOKUP_F_IFACE flag if oif is set")
      Cc: Hajime Tazaki <thehajime@gmail.com>
      Signed-off-by: default avatarDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d46a9d67
    • Jorgen Hansen's avatar
      VSOCK: sock_put wasn't safe to call in interrupt context · 4ef7ea91
      Jorgen Hansen authored
      
      
      In the vsock vmci_transport driver, sock_put wasn't safe to call
      in interrupt context, since that may call the vsock destructor
      which in turn calls several functions that should only be called
      from process context. This change defers the callling of these
      functions  to a worker thread. All these functions were
      deallocation of resources related to the transport itself.
      
      Furthermore, an unused callback was removed to simplify the
      cleanup.
      
      Multiple customers have been hitting this issue when using
      VMware tools on vSphere 2015.
      
      Also added a version to the vmci transport module (starting from
      1.0.2.0-k since up until now it appears that this module was
      sharing version with vsock that is currently at 1.0.1.0-k).
      
      Reviewed-by: default avatarAditya Asarwade <asarwade@vmware.com>
      Reviewed-by: default avatarThomas Hellstrom <thellstrom@vmware.com>
      Signed-off-by: default avatarJorgen Hansen <jhansen@vmware.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4ef7ea91
    • David Herrmann's avatar
      netlink: fix locking around NETLINK_LIST_MEMBERSHIPS · 47191d65
      David Herrmann authored
      
      
      Currently, NETLINK_LIST_MEMBERSHIPS grabs the netlink table while copying
      the membership state to user-space. However, grabing the netlink table is
      effectively a write_lock_irq(), and as such we should not be triggering
      page-faults in the critical section.
      
      This can be easily reproduced by the following snippet:
          int s = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
          void *p = mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
          int r = getsockopt(s, 0x10e, 9, p, (void*)((char*)p + 4092));
      
      This should work just fine, but currently triggers EFAULT and a possible
      WARN_ON below handle_mm_fault().
      
      Fix this by reducing locking of NETLINK_LIST_MEMBERSHIPS to a read-side
      lock. The write-lock was overkill in the first place, and the read-lock
      allows page-faults just fine.
      
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarDavid Herrmann <dh.herrmann@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      47191d65
    • Pravin B Shelar's avatar
      openvswitch: Use dev_queue_xmit for vport send. · aec15924
      Pravin B Shelar authored
      
      
      With use of lwtunnel, we can directly call dev_queue_xmit()
      rather than calling netdev vport send operation.
      Following change make tunnel vport code bit cleaner.
      
      Signed-off-by: default avatarPravin B Shelar <pshelar@nicira.com>
      Acked-by: default avatarThomas Graf <tgraf@suug.ch>
      Acked-by: default avatarJiri Benc <jbenc@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      aec15924
    • Pravin B Shelar's avatar
      openvswitch: Fix incorrect type use. · 99e28f18
      Pravin B Shelar authored
      Patch fixes following sparse warning.
      net/openvswitch/flow_netlink.c:583:30: warning: incorrect type in assignment (different base types)
      net/openvswitch/flow_netlink.c:583:30:    expected restricted __be16 [usertype] ipv4
      net/openvswitch/flow_netlink.c:583:30:    got int
      
      Fixes: 6b26ba3a
      
       ("openvswitch: netlink attributes for IPv6 tunneling")
      Signed-off-by: default avatarPravin B Shelar <pshelar@nicira.com>
      Acked-by: default avatarThomas Graf <tgraf@suug.ch>
      Acked-by: default avatarJiri Benc <jbenc@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      99e28f18
    • Marcel Holtmann's avatar
      Bluetooth: Increase minor version of core module · 13972adc
      Marcel Holtmann authored
      
      
      With the addition of support for diagnostic feature, it makes sense to
      increase the minor version of the Bluetooth core module.
      
      The module version is not used anywhere, but it gives a nice extra
      hint for debugging purposes.
      
      Signed-off-by: default avatarMarcel Holtmann <marcel@holtmann.org>
      Signed-off-by: default avatarJohan Hedberg <johan.hedberg@intel.com>
      13972adc
    • Alexander Aring's avatar
      ieee802154: 6lowpan: fix memory leak · aeedebff
      Alexander Aring authored
      
      
      Looking at current situation of memory management in 6lowpan receive
      function I detected some invalid handling. After calling
      lowpan_invoke_rx_handlers we will do a kfree_skb and then NET_RX_DROP on
      error handling. We don't do this before, also on
      skb_share_check/skb_unshare which might manipulate the reference
      counters.
      
      After running some 'grep -r "dev_add_pack" net/' to look how others
      packet-layer receive callbacks works I detected that every subsystem do
      a kfree_skb, then NET_RX_DROP without calling skb functions which
      might manipulate the skb reference counters. This is the reason why we
      should do the same here like all others subsystems. I didn't find any
      documentation how the packet-layer receive callbacks handle NET_RX_DROP
      return values either.
      
      This patch will add a kfree_skb, then NET_RX_DROP handling for the
      "trivial checks", in case of skb_share_check/skb_unshare the kfree_skb
      call will be done inside these functions.
      
      Signed-off-by: default avatarAlexander Aring <alex.aring@gmail.com>
      Signed-off-by: default avatarMarcel Holtmann <marcel@holtmann.org>
      aeedebff
    • Johan Hedberg's avatar
      Bluetooth: Make hci_disconnect() behave correctly for all states · 88d07feb
      Johan Hedberg authored
      
      
      There are a few places that don't explicitly check the connection
      state before calling hci_disconnect(). To make this API do the right
      thing take advantage of the new hci_abort_conn() API and also make
      sure to only read the clock offset if we're really connected.
      
      Signed-off-by: default avatarJohan Hedberg <johan.hedberg@intel.com>
      Signed-off-by: default avatarMarcel Holtmann <marcel@holtmann.org>
      88d07feb
    • Johan Hedberg's avatar
      Bluetooth: Take advantage of connection abort helpers · 89e0ccc8
      Johan Hedberg authored
      
      
      Convert the various places mapping connection state to
      disconnect/cancel HCI command to use the new hci_abort_conn helper
      API.
      
      Signed-off-by: default avatarJohan Hedberg <johan.hedberg@intel.com>
      Signed-off-by: default avatarMarcel Holtmann <marcel@holtmann.org>
      89e0ccc8
    • Johan Hedberg's avatar
      Bluetooth: Introduce hci_req helper to abort a connection · dcc0f0d9
      Johan Hedberg authored
      
      
      There are several different places needing to make sure that a
      connection gets disconnected or canceled. The exact action needed
      depends on the connection state, so centralizing this logic can save
      quite a lot of code duplication.
      
      Signed-off-by: default avatarJohan Hedberg <johan.hedberg@intel.com>
      Signed-off-by: default avatarMarcel Holtmann <marcel@holtmann.org>
      dcc0f0d9
    • Johan Hedberg's avatar
      Bluetooth: Fix crash in SMP when unpairing · c81d555a
      Johan Hedberg authored
      
      
      When unpairing the keys stored in hci_dev are removed. If SMP is
      ongoing the SMP context will also have references to these keys, so
      removing them from the hci_dev lists will make the pointers invalid.
      This can result in the following type of crashes:
      
       BUG: unable to handle kernel paging request at 6b6b6b6b
       IP: [<c11f26be>] __list_del_entry+0x44/0x71
       *pde = 00000000
       Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
       Modules linked in: hci_uart btqca btusb btintel btbcm btrtl hci_vhci rfcomm bluetooth_6lowpan bluetooth
       CPU: 0 PID: 723 Comm: kworker/u5:0 Not tainted 4.3.0-rc3+ #1379
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
       Workqueue: hci0 hci_rx_work [bluetooth]
       task: f19da940 ti: f1a94000 task.ti: f1a94000
       EIP: 0060:[<c11f26be>] EFLAGS: 00010202 CPU: 0
       EIP is at __list_del_entry+0x44/0x71
       EAX: c0088d20 EBX: f30fcac0 ECX: 6b6b6b6b EDX: 6b6b6b6b
       ESI: f4b60000 EDI: c0088d20 EBP: f1a95d90 ESP: f1a95d8c
        DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
       CR0: 8005003b CR2: 6b6b6b6b CR3: 319e5000 CR4: 00000690
       Stack:
        f30fcac0 f1a95db0 f82dc3e1 f1bfc000 00000000 c106524f f1bfc000 f30fd020
        f1a95dc0 f1a95dd0 f82dcbdb f1a95de0 f82dcbdb 00000067 f1bfc000 f30fd020
        f1a95de0 f1a95df0 f82d1126 00000067 f82d1126 00000006 f30fd020 f1bfc000
       Call Trace:
        [<f82dc3e1>] smp_chan_destroy+0x192/0x240 [bluetooth]
        [<c106524f>] ? trace_hardirqs_on_caller+0x14e/0x169
        [<f82dcbdb>] smp_teardown_cb+0x47/0x64 [bluetooth]
        [<f82dcbdb>] ? smp_teardown_cb+0x47/0x64 [bluetooth]
        [<f82d1126>] l2cap_chan_del+0x5d/0x14d [bluetooth]
        [<f82d1126>] ? l2cap_chan_del+0x5d/0x14d [bluetooth]
        [<f82d40ef>] l2cap_conn_del+0x109/0x17b [bluetooth]
        [<f82d40ef>] ? l2cap_conn_del+0x109/0x17b [bluetooth]
        [<f82c0205>] ? hci_event_packet+0x5b1/0x2092 [bluetooth]
        [<f82d41aa>] l2cap_disconn_cfm+0x49/0x50 [bluetooth]
        [<f82d41aa>] ? l2cap_disconn_cfm+0x49/0x50 [bluetooth]
        [<f82c0228>] hci_event_packet+0x5d4/0x2092 [bluetooth]
        [<c1332c16>] ? skb_release_data+0x6a/0x95
        [<f82ce5d4>] ? hci_send_to_monitor+0xe7/0xf4 [bluetooth]
        [<c1409708>] ? _raw_spin_unlock_irqrestore+0x44/0x57
        [<f82b3bb0>] hci_rx_work+0xf1/0x28b [bluetooth]
        [<f82b3bb0>] ? hci_rx_work+0xf1/0x28b [bluetooth]
        [<c10635a0>] ? __lock_is_held+0x2e/0x44
        [<c104772e>] process_one_work+0x232/0x432
        [<c1071ddc>] ? rcu_read_lock_sched_held+0x50/0x5a
        [<c104772e>] ? process_one_work+0x232/0x432
        [<c1047d48>] worker_thread+0x1b8/0x255
        [<c1047b90>] ? rescuer_thread+0x23c/0x23c
        [<c104bb71>] kthread+0x91/0x96
        [<c14096a7>] ? _raw_spin_unlock_irq+0x27/0x44
        [<c1409d61>] ret_from_kernel_thread+0x21/0x30
        [<c104bae0>] ? kthread_parkme+0x1e/0x1e
      
      To solve the issue, introduce a new smp_cancel_pairing() API that can
      be used to clean up the SMP state before touching the hci_dev lists.
      
      Signed-off-by: default avatarJohan Hedberg <johan.hedberg@intel.com>
      Signed-off-by: default avatarMarcel Holtmann <marcel@holtmann.org>
      c81d555a
    • Johan Hedberg's avatar
      Bluetooth: Disable auto-connection parameters when unpairing · fc64361a
      Johan Hedberg authored
      
      
      For connection parameters that are left around until a disconnection
      we should at least clear any auto-connection properties. This way a
      new Add Device call is required to re-set them after calling Unpair
      Device.
      
      Signed-off-by: default avatarJohan Hedberg <johan.hedberg@intel.com>
      Signed-off-by: default avatarMarcel Holtmann <marcel@holtmann.org>
      fc64361a
    • Eric Dumazet's avatar
      ipv6: gro: support sit protocol · feec0cb3
      Eric Dumazet authored
      Tom Herbert added SIT support to GRO with commit
      19424e05
      
       ("sit: Add gro callbacks to sit_offload"),
      later reverted by Herbert Xu.
      
      The problem came because Tom patch was building GRO
      packets without proper meta data : If packets were locally
      delivered, we would not care.
      
      But if packets needed to be forwarded, GSO engine was not
      able to segment individual segments.
      
      With the following patch, we correctly set skb->encapsulation
      and inner network header. We also update gso_type.
      
      Tested:
      
      Server :
      netserver
      modprobe dummy
      ifconfig dummy0 8.0.0.1 netmask 255.255.255.0 up
      arp -s 8.0.0.100 4e:32:51:04:47:e5
      iptables -I INPUT -s 10.246.7.151 -j TEE --gateway 8.0.0.100
      ifconfig sixtofour0
      sixtofour0 Link encap:IPv6-in-IPv4
                inet6 addr: 2002:af6:798::1/128 Scope:Global
                inet6 addr: 2002:af6:798::/128 Scope:Global
                UP RUNNING NOARP  MTU:1480  Metric:1
                RX packets:411169 errors:0 dropped:0 overruns:0 frame:0
                TX packets:409414 errors:0 dropped:0 overruns:0 carrier:0
                collisions:0 txqueuelen:0
                RX bytes:20319631739 (20.3 GB)  TX bytes:29529556 (29.5 MB)
      
      Client :
      netperf -H 2002:af6:798::1 -l 1000 &
      
      Checked on server traffic copied on dummy0 and verify segments were
      properly rebuilt, with proper IP headers, TCP checksums...
      
      tcpdump on eth0 shows proper GRO aggregation takes place.
      
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarTom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      feec0cb3
    • Joe Stringer's avatar
      openvswitch: Serialize nested ct actions if provided · e754ec69
      Joe Stringer authored
      
      
      If userspace provides a ct action with no nested mark or label, then the
      storage for these fields is zeroed. Later when actions are requested,
      such zeroed fields are serialized even though userspace didn't
      originally specify them. Fix the behaviour by ensuring that no action is
      serialized in this case, and reject actions where userspace attempts to
      set these fields with mask=0. This should make netlink marshalling
      consistent across deserialization/reserialization.
      
      Reported-by: default avatarJarno Rajahalme <jrajahalme@nicira.com>
      Signed-off-by: default avatarJoe Stringer <joestringer@nicira.com>
      Acked-by: default avatarPravin B Shelar <pshelar@nicira.com>
      Acked-by: default avatarThomas Graf <tgraf@suug.ch>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e754ec69
    • Joe Stringer's avatar
      openvswitch: Mark connections new when not confirmed. · 4f0909ee
      Joe Stringer authored
      
      
      New, related connections are marked as such as part of ovs_ct_lookup(),
      but they are not marked as "new" if the commit flag is used. Make this
      consistent by setting the "new" flag whenever !nf_ct_is_confirmed(ct).
      
      Reported-by: default avatarJarno Rajahalme <jrajahalme@nicira.com>
      Signed-off-by: default avatarJoe Stringer <joestringer@nicira.com>
      Acked-by: default avatarThomas Graf <tgraf@suug.ch>
      Acked-by: default avatarPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4f0909ee
    • Joe Stringer's avatar
      openvswitch: Reject ct_state masks for unknown bits · 9e384715
      Joe Stringer authored
      
      
      Currently, 0-bits are generated in ct_state where the bit position is
      undefined, and matches are accepted on these bit-positions. If userspace
      requests to match the 0-value for this bit then it may expect only a
      subset of traffic to match this value, whereas currently all packets
      will have this bit set to 0. Fix this by rejecting such masks.
      
      Signed-off-by: default avatarJoe Stringer <joestringer@nicira.com>
      Acked-by: default avatarPravin B Shelar <pshelar@nicira.com>
      Acked-by: default avatarThomas Graf <tgraf@suug.ch>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9e384715
    • Renato Westphal's avatar
      tcp: remove improper preemption check in tcp_xmit_probe_skb() · e2e8009f
      Renato Westphal authored
      Commit e520af48 introduced the following bug when setting the
      TCP_REPAIR sockoption:
      
      [ 2860.657036] BUG: using __this_cpu_add() in preemptible [00000000] code: daemon/12164
      [ 2860.657045] caller is __this_cpu_preempt_check+0x13/0x20
      [ 2860.657049] CPU: 1 PID: 12164 Comm: daemon Not tainted 4.2.3 #1
      [ 2860.657051] Hardware name: Dell Inc. PowerEdge R210 II/0JP7TR, BIOS 2.0.5 03/13/2012
      [ 2860.657054]  ffffffff81c7f071 ffff880231e9fdf8 ffffffff8185d765 0000000000000002
      [ 2860.657058]  0000000000000001 ffff880231e9fe28 ffffffff8146ed91 ffff880231e9fe18
      [ 2860.657062]  ffffffff81cd1a5d ffff88023534f200 ffff8800b9811000 ffff880231e9fe38
      [ 2860.657065] Call Trace:
      [ 2860.657072]  [<ffffffff8185d765>] dump_stack+0x4f/0x7b
      [ 2860.657075]  [<ffffffff8146ed91>] check_preemption_disabled+0xe1/0xf0
      [ 2860.657078]  [<ffffffff8146edd3>] __this_cpu_preempt_check+0x13/0x20
      [ 2860.657082]  [<ffffffff817e0bc7>] tcp_xmit_probe_skb+0xc7/0x100
      [ 2860.657085]  [<ffffffff817e1e2d>] tcp_send_window_probe+0x2d/0x30
      [ 2860.657089]  [<ffffffff817d1d8c>] do_tcp_setsockopt.isra.29+0x74c/0x830
      [ 2860.657093]  [<ffffffff817d1e9c>] tcp_setsockopt+0x2c/0x30
      [ 2860.657097]  [<ffffffff81767b74>] sock_common_setsockopt+0x14/0x20
      [ 2860.657100]  [<ffffffff817669e1>] SyS_setsockopt+0x71/0xc0
      [ 2860.657104]  [<ffffffff81865172>] entry_SYSCALL_64_fastpath+0x16/0x75
      
      Since tcp_xmit_probe_skb() can be called from process context, use
      NET_INC_STATS() instead of NET_INC_STATS_BH().
      
      Fixes: e520af48
      
       ("tcp: add TCPWinProbe and TCPKeepAlive SNMP counters")
      Signed-off-by: default avatarRenato Westphal <renatow@taghos.com.br>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e2e8009f
    • Arad, Ronen's avatar
      netlink: Rightsize IFLA_AF_SPEC size calculation · b1974ed0
      Arad, Ronen authored
      
      
      if_nlmsg_size() overestimates the minimum allocation size of netlink
      dump request (when called from rtnl_calcit()) or the size of the
      message (when called from rtnl_getlink()). This is because
      ext_filter_mask is not supported by rtnl_link_get_af_size() and
      rtnl_link_get_size().
      
      The over-estimation is significant when at least one netdev has many
      VLANs configured (8 bytes for each configured VLAN).
      
      This patch-set "rightsizes" the protocol specific attribute size
      calculation by propagating ext_filter_mask to rtnl_link_get_af_size()
      and adding this a argument to get_link_af_size op in rtnl_af_ops.
      
      Bridge module already used filtering aware sizing for notifications.
      br_get_link_af_size_filtered() is consistent with the modified
      get_link_af_size op so it replaces br_get_link_af_size() in br_af_ops.
      br_get_link_af_size() becomes unused and thus removed.
      
      Signed-off-by: default avatarRonen Arad <ronen.arad@intel.com>
      Acked-by: default avatarSridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b1974ed0