- 05 Dec, 2018 1 commit
-
-
Eric Dumazet authored
TCP_NOTSENT_LOWAT socket option or sysctl was added in linux-3.12 as a step to enable bigger tcp sndbuf limits. It works reasonably well, but the following happens: once the limit is reached, the TCP stack generates an [E]POLLOUT event for every incoming ACK packet. This causes a high number of context switches.

This patch implements the strategy David Miller added in sock_def_write_space():

- If a TCP socket has a notsent_lowat constraint of X bytes, allow sendmsg() to fill up to X bytes, but send [E]POLLOUT only if the number of notsent bytes is below X/2.

This considerably reduces TCP_NOTSENT_LOWAT overhead, while allowing to keep the pipe full.

Tested: 100 ms RTT netem testbed between A and B, 100 concurrent TCP_STREAM

A:/# cat /proc/sys/net/ipv4/tcp_wmem
4096 262144 64000000
A:/# super_netperf 100 -H B -l 1000 -- -K bbr &
A:/# grep TCP /proc/net/sockstat
TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 1364904 # This is about 54 MB of memory per flow :/
A:/# vmstat 5 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd      free  buff  cache   si   so    bi    bo     in     cs us sy id wa st
 0  0      0 256220672 13532 694976    0    0    10     0     28     14  0  1 99  0  0
 2  0      0 256320016 13532 698480    0    0   512     0 715901   5927  0 10 90  0  0
 0  0      0 256197232 13532 700992    0    0   735    13 771161   5849  0 11 89  0  0
 1  0      0 256233824 13532 703320    0    0   512    23 719650   6635  0 11 89  0  0
 2  0      0 256226880 13532 705780    0    0   642     4 775650   6009  0 12 88  0  0

A:/# echo 2097152 >/proc/sys/net/ipv4/tcp_notsent_lowat
A:/# grep TCP /proc/net/sockstat
TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 86411 # 3.5 MB per flow
A:/# vmstat 5 5  # check that context switches have not inflated too much
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd      free  buff  cache   si   so    bi    bo     in     cs us sy id wa st
 2  0      0 260386512 13592 662148    0    0    10     0     17     14  0  1 99  0  0
 0  0      0 260519680 13592 604184    0    0   512    13 726843  12424  0 10 90  0  0
 1  1      0 260435424 13592 598360    0    0   512    25 764645  12925  0 10 90  0  0
 1  0      0 260855392 13592 578380    0    0   512     7 722943  13624  0 11 88  0  0
 1  0      0 260445008 13592 601176    0    0   614    34 772288  14317  0 10 90  0  0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
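For readers unfamiliar with the option, here is a minimal userspace sketch (not from the patch) of the usual TCP_NOTSENT_LOWAT + poll() pattern; the 128 KB limit and the function name are arbitrary examples:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <poll.h>
#include <sys/socket.h>

#ifndef TCP_NOTSENT_LOWAT
#define TCP_NOTSENT_LOWAT 25
#endif

/* Cap unsent data at 128 KB. With this patch, once the limit is hit,
 * [E]POLLOUT fires only after notsent bytes fall below half the limit,
 * rather than on every incoming ACK. */
static int wait_for_write_space(int fd)
{
    int lowat = 128 * 1024;    /* arbitrary example value */
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };

    if (setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
                   &lowat, sizeof(lowat)) < 0)
        return -1;
    return poll(&pfd, 1, -1);  /* now wakes far less often */
}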
-
- 04 Dec, 2018 1 commit
-
-
Ido Schimmel authored
Commit abf4bb6b ("skbuff: Add the offload_mr_fwd_mark field") added the 'offload_mr_fwd_mark' field to indicate that a packet has already undergone L3 multicast routing by a capable device. The field is used to prevent the kernel from forwarding a packet through a netdev through which the device has already forwarded the packet.

Currently, no unicast packet is routed by both the device and the kernel, but this is about to change by subsequent patches and we need to be able to mark such packets, so that they will not be forwarded twice.

Instead of adding yet another field to 'struct sk_buff', we can just rename 'offload_mr_fwd_mark' to 'offload_l3_fwd_mark', as a packet either has a multicast or a unicast destination IP. While at it, add a comment about both 'offload_fwd_mark' and 'offload_l3_fwd_mark'.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 03 Dec, 2018 8 commits
-
-
Willem de Bruijn authored
With MSG_ZEROCOPY, each skb holds a reference to a struct ubuf_info. Release of its last reference triggers a completion notification.

The TCP stack in tcp_sendmsg_locked holds an extra ref independent of the skbs, because it can build, send and free skbs within its loop, possibly reaching refcount zero and freeing the ubuf_info too soon.

The UDP stack currently also takes this extra ref, but does not need it as all skbs are sent after return from __ip(6)_append_data.

Avoid the extra refcount_inc and refcount_dec_and_test, and generally the sock_zerocopy_put in the common path, by passing the initial reference to the first skb. This approach is taken instead of initializing the refcount to 0, as that would generate error "refcount_t: increment on 0" on the next skb_zcopy_set.

Changes
  v3 -> v4
    - Move skb_zcopy_set below the only kfree_skb that might cause a premature uarg destroy before skb_zerocopy_put_abort
    - Move the entire skb_shinfo assignment block, to keep that cacheline access in one place

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Willem de Bruijn authored
Extend zerocopy to udp sockets. Allow setting sockopt SO_ZEROCOPY and interpret flag MSG_ZEROCOPY. This patch was previously part of the zerocopy RFC patchsets. Zerocopy is not effective at small MTU. With segmentation offload building larger datagrams, the benefit of page flipping outweighs the cost of generating a completion notification.

tools/testing/selftests/net/msg_zerocopy.sh after applying follow-on test patch and making skb_orphan_frags_rx same as skb_orphan_frags:

    ipv4 udp -t 1
    tx=191312 (11938 MB) txc=0 zc=n
    rx=191312 (11938 MB)
    ipv4 udp -z -t 1
    tx=304507 (19002 MB) txc=304507 zc=y
    rx=304507 (19002 MB)
    ok
    ipv6 udp -t 1
    tx=174485 (10888 MB) txc=0 zc=n
    rx=174485 (10888 MB)
    ipv6 udp -z -t 1
    tx=294801 (18396 MB) txc=294801 zc=y
    rx=294801 (18396 MB)
    ok

Changes
  v1 -> v2
    - Fixup reverse christmas tree violation
  v2 -> v3
    - Split refcount avoidance optimization into separate patch
    - Fix refcount leak on error in fragmented case (thanks to Paolo Abeni for pointing this one out!)
    - Fix refcount inc on zero
    - Test sock_flag SOCK_ZEROCOPY directly in __ip_append_data. This is needed since commit 5cf4a853 ("tcp: really ignore MSG_ZEROCOPY if no SO_ZEROCOPY") did the same for tcp.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
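A minimal userspace sketch of the send side, assuming a connected UDP socket (the helper name is illustrative); completion notifications must still be read from the socket error queue with MSG_ERRQUEUE, which is omitted here:

#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Opt in once per socket, then flag individual sends. The buffer must
 * not be reused until the completion for this send has been read from
 * the error queue with recvmsg(fd, ..., MSG_ERRQUEUE). */
static ssize_t send_zerocopy(int fd, const void *buf, size_t len)
{
    int one = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
        return -1;
    return send(fd, buf, len, MSG_ZEROCOPY);
}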
-
Bartosz Golaszewski authored
We've switched all users to nvmem_get_mac_address(). Remove the now dead code.

Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Bartosz Golaszewski authored
We already have of_get_nvmem_mac_address() but some non-DT systems want to read the MAC address from NVMEM too. Implement a generalized routine that takes struct device as argument.

Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
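A sketch of how a non-DT driver's probe path might use the new helper; the function name and the fallback to a random address are illustrative, not part of this patch:

#include <linux/etherdevice.h>
#include <linux/netdevice.h>

/* Try the device's NVMEM "mac-address" cell; if nothing is
 * provisioned there, fall back to a random MAC (illustrative). */
static void example_init_mac(struct device *dev, struct net_device *ndev)
{
    if (nvmem_get_mac_address(dev, ndev->dev_addr))
        eth_hw_addr_random(ndev);
}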
-
NeilBrown authored
Some users of rhashtables might need to move an object from one table to another - this appears to be the reason for the incomplete usage of NULLS markers.

To support these, we store a unique NULLS_MARKER at the end of each chain, and when a search fails to find a match, we check if the NULLS marker found was the expected one. If not, the search may not have examined all objects in the target bucket, so it is repeated.

The unique NULLS_MARKER is derived from the address of the head of the chain. As this cannot be derived at load-time, the static rhnull in rht_bucket_nested() needs to be initialised at run time.

Any caller of a lookup function must still be prepared for the possibility that the object returned is in a different table - it might have been there for some time.

Note that this does NOT provide support for other uses of NULLS_MARKERs such as allocating with SLAB_TYPESAFE_BY_RCU or changing the key of an object and re-inserting it in the same table. These could only be done safely if new objects were inserted at the *start* of a hash chain, and that is not currently the case.

Signed-off-by: NeilBrown <neilb@suse.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
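The retry idiom involved, sketched on a plain hlist_nulls list rather than rhashtable's internal code (struct my_obj and its fields are illustrative; the caller holds rcu_read_lock()):

#include <linux/rculist_nulls.h>
#include <linux/types.h>

struct my_obj {
    struct hlist_nulls_node node;
    u32 key;
};

/* RCU lookup with a nulls check: if the chain ends in an unexpected
 * NULLS marker, the walk may have wandered onto another chain due to
 * a concurrent move/resize, so restart it. */
static struct my_obj *my_lookup(struct hlist_nulls_head *head,
                                unsigned long expected_nulls, u32 key)
{
    struct hlist_nulls_node *pos;
    struct my_obj *obj;

restart:
    hlist_nulls_for_each_entry_rcu(obj, pos, head, node) {
        if (obj->key == key)
            return obj;
    }
    if (get_nulls_value(pos) != expected_nulls)
        goto restart;
    return NULL;
}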
-
Alexis Bauvin authored
Existing functions to retrieve the l3mdev of a device did not walk the master chain to find the upper master. This patch adds a function to find the l3mdev, even indirectly through e.g. a bridge:

    +----------+
    |          |
    | vrf-blue |
    |          |
    +----+-----+
         |
         |
    +----+-----+
    |          |
    | br-blue  |
    |          |
    +----+-----+
         |
         |
    +----+-----+
    |          |
    |   eth0   |
    |          |
    +----------+

This will properly resolve the l3mdev of eth0 to vrf-blue.

Signed-off-by: Alexis Bauvin <abauvin@scaleway.com>
Reviewed-by: Amine Kherbouche <akherbouche@scaleway.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: Amine Kherbouche <akherbouche@scaleway.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
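An illustrative equivalent of the new behaviour, not the patch's exact helper (the function name is hypothetical; the caller must hold the RCU read lock):

#include <linux/netdevice.h>

/* Climb the master chain until an L3 master device such as a VRF is
 * found, so eth0 -> br-blue -> vrf-blue resolves to vrf-blue. */
static struct net_device *find_l3_master_rcu(struct net_device *dev)
{
    while (dev && !netif_is_l3_master(dev))
        dev = netdev_master_upper_dev_get_rcu(dev);
    return dev;
}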
-
Alexis Bauvin authored
UDP tunnel sockets are always opened unbound to a specific device. This patch allows the socket to be bound to a custom device, which incidentally makes UDP tunnels VRF-aware if binding to an l3mdev.

Signed-off-by: Alexis Bauvin <abauvin@scaleway.com>
Reviewed-by: Amine Kherbouche <akherbouche@scaleway.com>
Tested-by: Amine Kherbouche <akherbouche@scaleway.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
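A kernel-side sketch of opening a tunnel socket bound to a device; the bind_ifindex field name is my reading of what this patch adds to struct udp_port_cfg (treat it as an assumption), and port 4789 (VXLAN) is just an example:

#include <net/udp_tunnel.h>

/* Open an IPv4 tunnel socket bound to the given interface,
 * e.g. an l3mdev/VRF, making the tunnel VRF-aware. */
static int open_bound_tunnel_sock(struct net *net, int ifindex,
                                  struct socket **sockp)
{
    struct udp_port_cfg cfg = {
        .family         = AF_INET,
        .local_udp_port = htons(4789),  /* VXLAN port, example */
        .bind_ifindex   = ifindex,      /* field added by this patch */
    };

    return udp_sock_create(net, &cfg, sockp);
}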
-
Shalom Toledo authored
Many drivers load the device's firmware image during the initialization flow, either from the flash or from the disk. Currently this option is not controlled by the user and the driver decides from where to load the firmware image.

'fw_load_policy' gives the ability to control this option, which allows the user to choose between different loading policies supported by the driver. This parameter can be useful while testing and/or debugging the device. For example, testing a firmware bug fix.

Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 01 Dec, 2018 2 commits
-
-
Nicolas Dichtel authored
Userspace may need to control the carrier state.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Didier Pallard <didier.pallard@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Paolo Abeni authored
The flowi* structures are used and memset by several functions in the critical path. Currently flowi_common has a couple of holes that we can eliminate by reordering the struct fields. As a side effect, both flowi4 and flowi6 shrink by 8 bytes.

Before:
pahole -EC flowi_common
struct flowi_common {
    // ...
    /* size: 40, cachelines: 1, members: 10 */
    /* sum members: 32, holes: 1, sum holes: 4 */
    /* padding: 4 */
    /* last cacheline: 40 bytes */
};
pahole -EC flowi6
struct flowi6 {
    // ...
    /* size: 88, cachelines: 2, members: 6 */
    /* padding: 4 */
    /* last cacheline: 24 bytes */
};
pahole -EC flowi4
struct flowi4 {
    // ...
    /* size: 56, cachelines: 1, members: 4 */
    /* padding: 4 */
    /* last cacheline: 56 bytes */
};

After:
struct flowi_common {
    // ...
    /* size: 32, cachelines: 1, members: 10 */
    /* last cacheline: 32 bytes */
};
struct flowi6 {
    // ...
    /* size: 80, cachelines: 2, members: 6 */
    /* padding: 4 */
    /* last cacheline: 16 bytes */
};
struct flowi4 {
    // ...
    /* size: 48, cachelines: 1, members: 4 */
    /* padding: 4 */
    /* last cacheline: 48 bytes */
};

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 30 Nov, 2018 8 commits
-
-
Ariel Elior authored
Most of the doorbelling entities are outside of the core module. L2 queues, RoCE queues, iSCSI and FCoE all need to register. Make the APIs available for these drivers.

Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: Tomer Tayar <Tomer.Tayar@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ariel Elior authored
Add the database used to register doorbelling entities, APIs for adding and deleting entries, and logic for traversing the database and doorbelling once on behalf of all entities.

Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: Tomer Tayar <Tomer.Tayar@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Most linux hosts never set up TCP MD5 keys. We can avoid a cache line miss (accessing tp->md5sig_info) on RX and TX using a jump label.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
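The jump-label pattern used here, sketched generically (all example_* names are hypothetical stand-ins; the real code guards the tp->md5sig_info lookup):

#include <linux/jump_label.h>

/* The fast path costs only a patched-out branch (no memory load)
 * until the feature is first enabled, at which point the branch is
 * rewritten at run time. */
DEFINE_STATIC_KEY_FALSE(example_md5_needed);

static bool example_slow_path_lookup(void)
{
    return false;  /* stand-in for the real MD5 key lookup */
}

static inline bool example_md5_check(void)
{
    if (static_branch_unlikely(&example_md5_needed))
        return example_slow_path_lookup();
    return false;
}

static void example_first_key_added(void)
{
    static_branch_enable(&example_md5_needed);  /* flip once, at setup */
}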
-
Eric Dumazet authored
In case GRO is not as efficient as it should be or is disabled, we might have a user thread trapped in __release_sock() while the softirq handler floods packets up to the point we have to drop.

This patch balances work done from user thread and softirq, to give more chances to __release_sock() to complete its work before new packets are added to the backlog. This also helps if we receive many ACK packets, since GRO does not aggregate them.

This patch brings ~60% throughput increase on a receiver without GRO, but the spectacular gain is really the 1000x release_sock() latency reduction I have measured.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Jean-Louis Dupond reported poor iscsi TCP receive performance that we tracked to backlog drops. Apparently we fail to send window updates reflecting the fact that we are under stress.

Note that we might lack a proper window increase when backlog is fully processed, since __release_sock() clears sk->sk_backlog.len _after_ all skbs have been processed. This should not matter in practice. If we had a significant load through socket backlog, we are in a dangerous situation.

Reported-by: Jean-Louis Dupond <jean-louis@dupond.be>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Tested-by: Jean-Louis Dupond <jean-louis@dupond.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Tell the compiler that most TCP flows are using SACK these days. There is no need to add the unlikely() clause in tcp_is_reno(), the compiler is able to infer it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Geneviève Bastien authored
Trace events are already present for the receive entry points, to indicate how the reception entered the stack.

This patch adds the corresponding exit trace events that will bound the reception such that all events occurring between the entry and the exit can be considered as part of the reception context. This greatly helps for dependency and root cause analyses.

Without this, it is not possible with tracepoint instrumentation to determine whether a sched_wakeup event following a netif_receive_skb event is the result of the packet reception or a simple coincidence after further processing by the thread. It is possible using other mechanisms like kretprobes, but considering the "entry" points are already present, it would be good to add the matching exit events.

In addition to linking packets with wakeups, the entry/exit event pair can also be used to perform network stack latency analyses.

Signed-off-by: Geneviève Bastien <gbastien@versatic.net>
CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: David S. Miller <davem@davemloft.net>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> (tracing side)
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Edward Cree authored
There are no such structs flow_dissector_key_flow_vlan or flow_dissector_key_flow_tags; the actual structs used are struct flow_dissector_key_vlan and struct flow_dissector_key_tags. So correct the comments against FLOW_DISSECTOR_KEY_VLAN, FLOW_DISSECTOR_KEY_FLOW_LABEL and FLOW_DISSECTOR_KEY_CVLAN to refer to those.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 29 Nov, 2018 1 commit
-
-
Priit Laes authored
Now that these macros are in a header file, we can eventually clean up the duplicate macros present in the drivers that utilize the same cordic algorithm implementation. Also add a CORDIC_ prefix to the previously unprefixed macros.

Reviewed-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Priit Laes <plaes@plaes.org>
Acked-by: Larry Finger <Larry.Finger@lwfinger.net>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
-
- 28 Nov, 2018 3 commits
-
-
John Fastabend authored
This adds a BPF SK_MSG program helper so that we can pop data from a msg. We use this to pop metadata from a previous push data call.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
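An illustrative SK_MSG program using the new helper, assuming libbpf's helper headers; the 8-byte metadata length, section name and program name are assumptions for the example:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define METADATA_LEN 8  /* example size pushed by an earlier program */

/* Strip METADATA_LEN bytes that an earlier bpf_msg_push_data() call
 * prepended at offset 0 of the msg. */
SEC("sk_msg")
int pop_metadata(struct sk_msg_md *msg)
{
    if (bpf_msg_pop_data(msg, 0, METADATA_LEN, 0))
        return SK_DROP;
    return SK_PASS;
}

char _license[] SEC("license") = "GPL";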
-
Nicolas Dichtel authored
Like the previous patch, the goal is to make it easier to convert nsids from one netns to another. A new attribute (NETNSA_CURRENT_NSID) is added to the kernel answer when NETNSA_TARGET_NSID is provided, so the user can easily convert nsids.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Nicolas Dichtel authored
As was done for link and address, add the ability to perform get/dump in another netns by specifying a target nsid attribute.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 27 Nov, 2018 5 commits
-
-
Nikolay Aleksandrov authored
Use the new boolopt API to add an option which disables learning from link-local packets. The default is kept as before: learning is enabled. This is a simple map from a boolopt bit to a bridge private flag that is tested before learning.

v2: pass NULL for extack via sysfs

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Nikolay Aleksandrov authored
We have been adding many new bridge options, a big number of which are boolean but still take up netlink attribute ids and waste space in the skb. Recently we discussed learning from link-local packets[1] and decided yet another new boolean option will be needed, thus introducing this API to save some bridge nl space.

The API supports changing the value of multiple boolean options at once via the br_boolopt_multi struct which has an optmask (which options to set, bit per opt) and optval (options' new values). Future boolean options will only be added to the br_boolopt_id enum and then will have to be handled in br_boolopt_toggle/get. The API will automatically add the ability to change and export them via netlink; sysfs can use the single boolopt function versions to do the same. The behaviour with failing/succeeding is the same as with normal netlink option changing.

If an option requires mapping to an internal kernel flag or needs special configuration to be enabled then it should be handled in br_boolopt_toggle. It should also be able to retrieve an option's current state via br_boolopt_get.

v2: WARN_ON() on unsupported option as that shouldn't be possible and also will help catch people who add new options without handling them for both set and get. Pass down extack so if an option desires it could set it on error and be more user-friendly.

[1] https://www.spinics.net/lists/netdev/msg532698.html

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
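A userspace sketch tying the two patches above together: requesting the link-local learning option via the multi-option struct (the helper name is illustrative, the netlink message plumbing that carries the struct is omitted):

#include <linux/if_bridge.h>

/* Ask for exactly one boolean option to be set: mark it in optmask
 * (which options to change) and optval (their new values). */
static void request_no_ll_learn(struct br_boolopt_multi *bm)
{
    bm->optmask = 1u << BR_BOOLOPT_NO_LL_LEARN;
    bm->optval  = 1u << BR_BOOLOPT_NO_LL_LEARN;
}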
-
Tiwei Bie authored
Add types and macros for packed ring.

Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
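A kernel-side sketch of the availability test on a packed-ring descriptor: a descriptor is available when its AVAIL bit equals the driver's wrap counter and its USED bit differs. The struct and macro names follow the uapi additions as I read them, so verify against linux/virtio_ring.h:

#include <linux/virtio_ring.h>

static bool packed_desc_is_avail(const struct vring_packed_desc *desc,
                                 bool wrap_counter)
{
    u16 flags = le16_to_cpu(desc->flags);
    bool avail = !!(flags & (1 << VRING_PACKED_DESC_F_AVAIL));
    bool used = !!(flags & (1 << VRING_PACKED_DESC_F_USED));

    /* both bits flip meaning each time the ring wraps */
    return avail == wrap_counter && used != wrap_counter;
}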
-
Yonghong Song authored
Commit 838e9690 ("bpf: Introduce bpf_func_info") added bpf func info support. The userspace is able to get better ksym's for bpf programs with jit, and is able to print out func prototypes.

For a program containing func-to-func calls, the existing implementation returns the user-specified number of function calls and BTF types if jit is enabled. If the jit is not enabled, it only returns the type for the main function. This is undesirable. The interpreter may still be used, and we should keep the feature identical regardless of whether jit is enabled or not. This patch fixes this discrepancy.

Fixes: 838e9690 ("bpf: Introduce bpf_func_info")
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Daniel Borkmann authored
Make fetching of the BPF call address from ppc64 JIT generic. ppc64 was using a slightly different variant than going through the insns' imm field encoding, as the target address would not fit into that space. Therefore, the target subprog number was encoded into the insns' offset and fetched through fp->aux->func[off]->bpf_func instead.

Given there are other JITs with this issue and the mechanism of fetching the address is JIT-generic, move it into the core as a helper instead. On the JIT side, we get information on whether the retrieved address is a fixed one, that is, not changing through JIT passes, or a dynamic one. For the former, JITs can optimize their imm emission because this doesn't change jump offsets throughout the JIT process.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Sandipan Das <sandipan@linux.ibm.com>
Tested-by: Sandipan Das <sandipan@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
- 26 Nov, 2018 2 commits
-
-
Taehee Yoo authored
register_{netdevice/inetaddr/inet6addr}_notifier may return an error value; this patch adds the code to handle these error paths.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-
Florian Westphal authored
syzbot was able to trigger the WARN in cttimeout_default_get() by passing UDPLITE as the l4protocol. Alias UDPLITE to UDP, as both use the same timeout values.

Furthermore, also fetch GRE timeouts. GRE is a bit more complicated, as it can still be a module and its netns_proto_gre struct layout isn't visible outside of the gre module. We can't move the timeouts around, as conntrack sysctl unregister appears to assume net_generic() returns nf_proto_net, so we would get a crash. Expose the layout of netns_proto_gre instead. A followup nf-next patch could make the gre tracker be built-in as well if needed; it's not that large.

Last, make the WARN() mention the missing protocol value in case anything else is missing.

Reported-by: syzbot+2fae8fa157dd92618cae@syzkaller.appspotmail.com
Fixes: 8866df92 ("netfilter: nfnetlink_cttimeout: pass default timeout policy to obj_to_nlattr")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-
- 25 Nov, 2018 3 commits
-
-
Eric Dumazet authored
I do not see how one can effectively use skb_insert() without holding some kind of lock. Otherwise other cpus could have changed the list right before we have a chance of acquiring list->lock.

Only existing user is in drivers/infiniband/hw/nes/nes_mgt.c and this one probably meant to use __skb_insert() since it appears nesqp->pau_list is protected by nesqp->pau_lock. This looks like nesqp->pau_lock could be removed, since nesqp->pau_list.lock could be used instead.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Faisal Latif <faisal.latif@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: linux-rdma <linux-rdma@vger.kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Heiner Kallweit authored
Similar to netdev_sent_queue, add helper __netdev_sent_queue as a variant of __netdev_tx_sent_queue.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
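A sketch of how a single-queue driver might use the helper in its xmit path; example_xmit, example_ring_doorbell and the elided hardware bits are hypothetical, and skb->xmit_more is the per-skb batching hint that existed at the time:

#include <linux/netdevice.h>

static void example_ring_doorbell(struct net_device *dev)
{
    /* hardware-specific doorbell write would go here */
}

/* BQL accounting plus a hint for when to kick the hardware:
 * the helper returns true when the doorbell should be rung. */
static netdev_tx_t example_xmit(struct sk_buff *skb, struct net_device *dev)
{
    /* ... post skb to the hardware TX ring here ... */

    if (__netdev_sent_queue(dev, skb->len, skb->xmit_more))
        example_ring_doorbell(dev);  /* last skb of the batch */

    return NETDEV_TX_OK;
}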
-
Alexey Dobriyan authored
Return code should be formally "netdev_tx_t".

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 24 Nov, 2018 5 commits
-
-
Petr Machata authored
Drop switchdev_ops.switchdev_port_obj_add and _del. Drop the uses of this field from all clients, which were migrated to use switchdev notification in the previous patches.

Add a new function switchdev_port_obj_notify() that sends the switchdev notifications SWITCHDEV_PORT_OBJ_ADD and _DEL.

Update switchdev_port_obj_del_now() to dispatch to this new function. Drop __switchdev_port_obj_add() and update switchdev_port_obj_add() likewise.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Petr Machata authored
After the transition from switchdev operations to notifier chain (which will take place in following patches), the onus is on the driver to find its own devices below possible layer of LAG or other uppers.

The logic to do so is fairly repetitive: each driver is looking for its own devices among the lowers of the notified device. For those that it finds, it calls a handler. To indicate that the event was handled, struct switchdev_notifier_port_obj_info.handled is set. The differences lie only in what constitutes an "own" device and what handler to call.

Therefore abstract this logic into two helpers, switchdev_handle_port_obj_add() and switchdev_handle_port_obj_del(). If a driver only supports physical ports under a bridge device, it will simply avoid this layer of indirection.

One area where this helper diverges from the current switchdev behavior is the case of mixed lowers, some of which are switchdev ports and some of which are not. Previously, such scenario would fail with -EOPNOTSUPP. The helper could do that for lowers for which the passed-in predicate doesn't hold. That would however break the case that switchdev ports from several different drivers are stashed under one master, a scenario that switchdev currently happily supports. Therefore tolerate any and all unknown netdevices, whether they are backed by a switchdev driver or not.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Petr Machata authored
An offloading driver may need to have access to switchdev events on ports that aren't directly under its control. An example is a VXLAN port attached to a bridge offloaded by a driver. The driver needs to know about VLANs configured on the VXLAN device. However the VXLAN device isn't stashed between the bridge and a front-panel-port device (such as is the case e.g. for LAG devices), so the usual switchdev ops don't reach the driver.

VXLAN is likely not the only device type like this: in theory any L2 tunnel device that needs offloading will prompt requirement of this sort. This falsifies the assumption that only the lower devices of a front panel port need to be notified to achieve flawless offloading.

A way to fix this is to give up the notion of port object addition / deletion as a switchdev operation, which assumes somewhat tight coupling between the message producer and consumer. And instead send the message over a notifier chain.

To that end, introduce two new switchdev notifier types, SWITCHDEV_PORT_OBJ_ADD and SWITCHDEV_PORT_OBJ_DEL. These notifier types communicate the same event as the corresponding switchdev op, except in a form of a notification. struct switchdev_notifier_port_obj_info was added to carry the fields that the switchdev op carries. An additional field, handled, will be used to communicate back to switchdev that the event has reached an interested party, which will be important for the two-phase commit.

The two switchdev operations themselves are kept in place. Following patches first convert individual clients to the notifier protocol, and only then are the operations removed.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Petr Machata authored
In general one can't assume that a switchdev notifier is called in a non-atomic context, and correspondingly, the switchdev notifier chain is an atomic one.

However, port object addition and deletion messages are delivered from a process context. Even the MDB addition messages, whose delivery is scheduled from atomic context, are queued, and the delivery itself takes place in blocking context. For VLAN messages in particular, keeping the blocking nature is important for error reporting.

Therefore introduce a blocking notifier chain and related service functions to distribute the notifications for which a blocking context can be assumed.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Petr Machata authored
The two macros SWITCHDEV_OBJ_PORT_VLAN() and SWITCHDEV_OBJ_PORT_MDB() expand to a container_of() call, yielding an appropriate container of their sole argument. However, due to a name collision, the first argument, i.e. the contained object pointer, is not the only one to get expanded. The third argument, which is a structure member name and should be kept literal, gets expanded as well. The only safe way to use these two macros is therefore to name the local variable passed to them "obj".

To fix this, rename the sole argument of the two macros from "obj" (which collides with the member name) to "OBJ". Additionally, instead of passing "OBJ" to container_of() verbatim, parenthesize it, so that a comma in the passed-in expression doesn't pollute the container_of() invocation.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
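The shape of the fix as described above, with the parameter renamed to OBJ so the literal member name "obj" in container_of() is no longer accidentally expanded, and the argument parenthesized to survive commas in the passed-in expression:

#define SWITCHDEV_OBJ_PORT_VLAN(OBJ) \
    container_of((OBJ), struct switchdev_obj_port_vlan, obj)
#define SWITCHDEV_OBJ_PORT_MDB(OBJ) \
    container_of((OBJ), struct switchdev_obj_port_mdb, obj)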
-
- 23 Nov, 2018 1 commit
-
-
Madalin Bucur authored
Check that the values received by the portal interrupt coalesce change APIs are in range.

Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: Roy Pledge <roy.pledge@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-