Commit 2ad7bf36 authored by Mahesh Bandewar, committed by David S. Miller

ipvlan: Initial check-in of the IPVLAN driver.



This driver is very similar to the macvlan driver except that it
uses L3 on the frame to determine the logical interface while
functioning as packet dispatcher. It inherits L2 of the master
device hence the packets on wire will have the same L2 for all
the packets originating from all virtual devices off of the same
master device.

This driver was developed keeping the namespace use-case in
mind. Hence most of the examples given here take that as the
base setup where main-device belongs to the default-ns and
virtual devices are assigned to the additional namespaces.

The device operates in two different modes and the difference
between these two modes is primarily on the TX side.

(a) L2 mode : In this mode, the device behaves as an L2 device.
TX processing up to L2 happens on the stack of the namespace that
the virtual device is associated with. Packets are switched after
that into the main device (default-ns) and queued for xmit.

RX processing is simple and all multicast, broadcast (if
applicable), and unicast belonging to the address(es) are
delivered to the virtual devices.

(b) L3 mode : In this mode, the device behaves like an L3 device.
TX processing up to L3 happens on the stack of the namespace that
the virtual device is associated with. Packets are switched to the
main-device (default-ns) for the L2 processing. Hence the routing
table of the default-ns will be used in this mode.

RX processing is somewhat similar to the L2 mode except that in
this mode only unicast packets are delivered to the virtual device
while main-dev will handle all other packets.

The devices can be added using the "ip" command from the iproute2
package -

	ip link add link <master> <virtual> type ipvlan mode [ l2 | l3 ]
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Laurent Chavey <chavey@google.com>
Cc: Tim Hockin <thockin@google.com>
Cc: Brandon Philips <brandon.philips@coreos.com>
Cc: Pavel Emelianov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
parent 2bbea0a8
IPVLAN Driver HOWTO
Initial Release:
Mahesh Bandewar <maheshb AT google.com>
1. Introduction:
This is conceptually very similar to the macvlan driver with one major
exception of using L3 for mux-ing / demux-ing among slaves. This property makes
the master device share the L2 with its slave devices. I have developed this
driver in conjunction with network namespaces and am not sure if there is a use
case outside of it.
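To make the L3 mux / demux idea concrete, here is a minimal userspace sketch. It is
not the driver's code: the names demux_table, demux_add and demux_lookup are
hypothetical, addresses are plain 32-bit values, and a linear table stands in
for whatever lookup structure a real dispatcher would use. It only illustrates
that, with one shared L2 address, the destination IP alone selects the slave.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMUX_MAX_ADDRS 8

/* One entry per IPv4 address assigned to a slave (hypothetical layout). */
struct demux_entry {
	uint32_t ip4;	/* IPv4 address as a plain 32-bit value */
	int slave_id;	/* index of the slave device owning the address */
};

struct demux_table {
	struct demux_entry entries[DEMUX_MAX_ADDRS];
	size_t count;
};

/* Find which slave owns dst_ip4; -1 means "no slave, leave it to the master". */
int demux_lookup(const struct demux_table *t, uint32_t dst_ip4)
{
	size_t i;

	for (i = 0; i < t->count; i++)
		if (t->entries[i].ip4 == dst_ip4)
			return t->entries[i].slave_id;
	return -1;
}

/*
 * Register an address for a slave. Returns 0 on success, or -1 if the
 * address is already in use on this master or the table is full.
 */
int demux_add(struct demux_table *t, uint32_t ip4, int slave_id)
{
	if (demux_lookup(t, ip4) != -1 || t->count == DEMUX_MAX_ADDRS)
		return -1;
	t->entries[t->count].ip4 = ip4;
	t->entries[t->count].slave_id = slave_id;
	t->count++;
	return 0;
}
```

Because every slave shares the master's L2 address, an address can belong to
only one slave per master in this sketch; a second registration of the same
address is refused.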
2. Building and Installation:
In order to build the driver, please select the config item CONFIG_IPVLAN.
The driver can be built into the kernel (CONFIG_IPVLAN=y) or as a module
(CONFIG_IPVLAN=m).
3. Configuration:
There are no module parameters for this driver and it can be configured
using the "ip" utility from iproute2.
    ip link add link <master-dev> <slave-dev> type ipvlan mode { l2 | l3 }
    e.g. ip link add link eth0 ipvl0 type ipvlan mode l2
4. Operating modes:
IPvlan has two modes of operation - L2 and L3. For a given master device,
you can select one of these two modes and all slaves on that master will
operate in the same (selected) mode. The RX mode is almost identical except
that in L3 mode the slaves won't receive any multicast / broadcast traffic.
L3 mode is more restrictive since routing is controlled from the other
(usually the default) namespace.
4.1 L2 mode:
In this mode TX processing happens on the stack instance attached to the
slave device and packets are switched and queued to the master device to send
out. In this mode the slaves will RX/TX multicast and broadcast (if applicable)
as well.
4.2 L3 mode:
In this mode TX processing up to L3 happens on the stack instance attached
to the slave device and packets are switched to the stack instance of the
master device for the L2 processing, and routing from that instance will be
used before packets are queued on the outbound device. In this mode the slaves
can neither receive nor send multicast / broadcast traffic.
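The difference between the two modes can be summarised as a small decision
table. The following is an illustrative sketch only, written from the
description above rather than from the driver source; the enum and function
names are hypothetical.

```c
#include <stdbool.h>

enum sketch_mode { SKETCH_MODE_L2, SKETCH_MODE_L3 };
enum sketch_pkt { SKETCH_UNICAST, SKETCH_MULTICAST, SKETCH_BROADCAST };

/*
 * May a slave send or receive this kind of traffic? Unicast is handled
 * by the slaves in both modes; multicast and broadcast only in L2 mode.
 */
bool slave_handles(enum sketch_mode mode, enum sketch_pkt kind)
{
	if (kind == SKETCH_UNICAST)
		return true;
	return mode == SKETCH_MODE_L2;
}

/*
 * Is the routing decision taken in the master's stack instance?
 * Per the text, only L3 mode hands packets to the master's stack
 * before they are queued on the outbound device.
 */
bool routed_by_master_stack(enum sketch_mode mode)
{
	return mode == SKETCH_MODE_L3;
}
```

Read this way, L3 mode trades multicast / broadcast visibility on the slaves
for routing control that stays with the master's namespace.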
5. What to choose (macvlan vs. ipvlan)?
These two devices are very similar in many regards and the specific use
case could very well define which device to choose. If one of the following
situations defines your use case then you can choose to use ipvlan -
(a) The Linux host that is connected to the external switch / router has
a policy configured that allows only one MAC per port.
(b) The number of virtual devices created on a master exceeds the MAC capacity
and puts the NIC in promiscuous mode, and degraded performance is a concern.
(c) The slave device is to be put into a hostile / untrusted network
namespace where L2 on the slave could be changed / misused.
6. Example configuration:
+=============================================================+
|  Host: host1                                                |
|                                                             |
|   +----------------------+      +----------------------+    |
|   |   NS:ns0             |      |   NS:ns1             |    |
|   |                      |      |                      |    |
|   |                      |      |                      |    |
|   |        ipvl0         |      |         ipvl1        |    |
|   +----------#-----------+      +-----------#----------+    |
|              #                              #               |
|              ################################               |
|                              # eth0                         |
+==============================#==============================+
(a) Create two network namespaces - ns0, ns1
    ip netns add ns0
    ip netns add ns1
(b) Create two ipvlan slaves on eth0 (master device)
    ip link add link eth0 ipvl0 type ipvlan mode l2
    ip link add link eth0 ipvl1 type ipvlan mode l2
(c) Assign slaves to the respective network namespaces
    ip link set dev ipvl0 netns ns0
    ip link set dev ipvl1 netns ns1
(d) Now switch to the namespace (ns0 or ns1) to configure the slave devices
    - For ns0
      (1) ip netns exec ns0 bash
      (2) ip link set dev ipvl0 up
      (3) ip link set dev lo up
      (4) ip -4 addr add 127.0.0.1 dev lo
      (5) ip -4 addr add $IPADDR dev ipvl0
      (6) ip -4 route add default via $ROUTER dev ipvl0
    - For ns1
      (1) ip netns exec ns1 bash
      (2) ip link set dev ipvl1 up
      (3) ip link set dev lo up
      (4) ip -4 addr add 127.0.0.1 dev lo
      (5) ip -4 addr add $IPADDR dev ipvl1
      (6) ip -4 route add default via $ROUTER dev ipvl1
......@@ -145,6 +145,24 @@ config MACVTAP
To compile this driver as a module, choose M here: the module
will be called macvtap.
config IPVLAN
tristate "IP-VLAN support"
---help---
This allows one to create virtual devices off of a main interface
and packets will be delivered based on the dest L3 (IPv6/IPv4 addr)
on packets. All interfaces (including the main interface) share L2
making it transparent to the connected L2 switch.
Ipvlan devices can be added using the "ip" command from the
iproute2 package starting with the iproute2-X.Y.ZZ release:
"ip link add link <main-dev> [ NAME ] type ipvlan"
To compile this driver as a module, choose M here: the module
will be called ipvlan.
config VXLAN
tristate "Virtual eXtensible Local Area Network (VXLAN)"
depends on INET
......
......@@ -6,6 +6,7 @@
# Networking Core Drivers
#
obj-$(CONFIG_BONDING) += bonding/
obj-$(CONFIG_IPVLAN) += ipvlan/
obj-$(CONFIG_DUMMY) += dummy.o
obj-$(CONFIG_EQUALIZER) += eql.o
obj-$(CONFIG_IFB) += ifb.o
......
#
# Makefile for the Ethernet Ipvlan driver
#
obj-$(CONFIG_IPVLAN) += ipvlan.o
ipvlan-objs := ipvlan_core.o ipvlan_main.o
/*
* Copyright (c) 2014 Mahesh Bandewar <maheshb@google.com>
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License as
* published by the Free Software Foundation; either version 2 of
* the License, or (at your option) any later version.
*
*/
#ifndef __IPVLAN_H
#define __IPVLAN_H
#include <linux/kernel.h>
#include <linux/types.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/rculist.h>
#include <linux/notifier.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/if_arp.h>
#include <linux/if_link.h>
#include <linux/if_vlan.h>
#include <linux/ip.h>
#include <linux/inetdevice.h>
#include <net/rtnetlink.h>
#include <net/gre.h>
#include <net/route.h>
#include <net/addrconf.h>
#define IPVLAN_DRV "ipvlan"
#define IPV_DRV_VER "0.1"
#define IPVLAN_HASH_SIZE (1 << BITS_PER_BYTE)
#define IPVLAN_HASH_MASK (IPVLAN_HASH_SIZE - 1)
#define IPVLAN_MAC_FILTER_BITS 8
#define IPVLAN_MAC_FILTER_SIZE (1 << IPVLAN_MAC_FILTER_BITS)
#define IPVLAN_MAC_FILTER_MASK (IPVLAN_MAC_FILTER_SIZE - 1)
typedef enum {
IPVL_IPV6 = 0,
IPVL_ICMPV6,
IPVL_IPV4,
IPVL_ARP,
} ipvl_hdr_type;
struct ipvl_pcpu_stats {
u64 rx_pkts;
u64 rx_bytes;
u64 rx_mcast;
u64 tx_pkts;
u64 tx_bytes;
struct u64_stats_sync syncp;
u32 rx_errs;
u32 tx_drps;
};
struct ipvl_port;
struct ipvl_dev {
struct net_device *dev;
struct list_head pnode;
struct ipvl_port *port;
struct net_device *phy_dev;
struct list_head addrs;
int ipv4cnt;
int ipv6cnt;
struct ipvl_pcpu_stats *pcpu_stats;
DECLARE_BITMAP(mac_filters, IPVLAN_MAC_FILTER_SIZE);
netdev_features_t sfeatures;
u32 msg_enable;
u16 mtu_adj;
};
struct ipvl_addr {
struct ipvl_dev *master; /* Back pointer to master */
union {
struct in6_addr ip6; /* IPv6 address on logical interface */
struct in_addr ip4; /* IPv4 address on logical interface */
} ipu;
#define ip6addr ipu.ip6
#define ip4addr ipu.ip4
struct hlist_node hlnode; /* Hash-table linkage */
struct list_head anode; /* logical-interface linkage */
struct rcu_head rcu;
ipvl_hdr_type atype;
};
struct ipvl_port {
struct net_device *dev;
struct hlist_head hlhead[IPVLAN_HASH_SIZE];
struct list_head ipvlans;
struct rcu_head rcu;
int count;
u16 mode;
};
static inline struct ipvl_port *ipvlan_port_get_rcu(const struct net_device *d)
{
return rcu_dereference(d->rx_handler_data);
}
static inline struct ipvl_port *ipvlan_port_get_rtnl(const struct net_device *d)
{
return rtnl_dereference(d->rx_handler_data);
}
static inline bool ipvlan_dev_master(struct net_device *d)
{
return d->priv_flags & IFF_IPVLAN_MASTER;
}
static inline bool ipvlan_dev_slave(struct net_device *d)
{
return d->priv_flags & IFF_IPVLAN_SLAVE;
}
void ipvlan_adjust_mtu(struct ipvl_dev *ipvlan, struct net_device *dev);
void ipvlan_set_port_mode(struct ipvl_port *port, u32 nval);
void ipvlan_init_secret(void);
unsigned int ipvlan_mac_hash(const unsigned char *addr);
rx_handler_result_t ipvlan_handle_frame(struct sk_buff **pskb);
int ipvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev);
void ipvlan_ht_addr_add(struct ipvl_dev *ipvlan, struct ipvl_addr *addr);
bool ipvlan_addr_busy(struct ipvl_dev *ipvlan, void *iaddr, bool is_v6);
struct ipvl_addr *ipvlan_ht_addr_lookup(const struct ipvl_port *port,
const void *iaddr, bool is_v6);
void ipvlan_ht_addr_del(struct ipvl_addr *addr, bool sync);
#endif /* __IPVLAN_H */
/* Copyright (c) 2014 Mahesh Bandewar <maheshb@google.com>
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License as
* published by the Free Software Foundation; either version 2 of
* the License, or (at your option) any later version.
*
*/
#include "ipvlan.h"
static u32 ipvlan_jhash_secret;
void ipvlan_init_secret(void)
{
net_get_random_once(&ipvlan_jhash_secret, sizeof(ipvlan_jhash_secret));
}
static void ipvlan_count_rx(const struct ipvl_dev *ipvlan,
unsigned int len, bool success, bool mcast)
{
if (!ipvlan)
return;
if (likely(success)) {
struct ipvl_pcpu_stats *pcptr;
pcptr = this_cpu_ptr(ipvlan->pcpu_stats);
u64_stats_update_begin(&pcptr->syncp);
pcptr->rx_pkts++;
pcptr->rx_bytes += len;
if (mcast)
pcptr->rx_mcast++;
u64_stats_update_end(&pcptr->syncp);
} else {
this_cpu_inc(ipvlan->pcpu_stats->rx_errs);
}
}
static u8 ipvlan_get_v6_hash(const void *iaddr)
{
const struct in6_addr *ip6_addr = iaddr;
return __ipv6_addr_jhash(ip6_addr, ipvlan_jhash_secret) &
IPVLAN_HASH_MASK;
}
static u8 ipvlan_get_v4_hash(const void *iaddr)
{
const struct in_addr *ip4_addr = iaddr;
return jhash_1word(ip4_addr->s_addr, ipvlan_jhash_secret) &
IPVLAN_HASH_MASK;
}
struct ipvl_addr *ipvlan_ht_addr_lookup(const struct ipvl_port *port,
const void *iaddr, bool is_v6)
{
struct ipvl_addr *addr;
u8 hash;
hash = is_v6 ? ipvlan_get_v6_hash(iaddr) :
ipvlan_get_v4_hash(iaddr);
hlist_for_each_entry_rcu(addr, &port->hlhead[hash], hlnode) {
if (is_v6 && addr->atype == IPVL_IPV6 &&
ipv6_addr_equal(&addr->ip6addr, iaddr))
return addr;
else if (!is_v6 && addr->atype == IPVL_IPV4 &&
addr->ip4addr.s_addr ==
((struct in_addr *)iaddr)->s_addr)
return addr;
}
return NULL;
}
void ipvlan_ht_addr_add(struct ipvl_dev *ipvlan, struct ipvl_addr *addr)
{
struct ipvl_port *port = ipvlan->port;
u8 hash;
hash = (addr->atype == IPVL_IPV6) ?
ipvlan_get_v6_hash(&addr->ip6addr) :
ipvlan_get_v4_hash(&addr->ip4addr);
hlist_add_head_rcu(&addr->hlnode, &port->hlhead[hash]);
}
void ipvlan_ht_addr_del(struct ipvl_addr *addr, bool sync)
{
hlist_del_rcu(&addr->hlnode);
if (sync)
synchronize_rcu();
}
bool ipvlan_addr_busy(struct ipvl_dev *ipvlan, void *iaddr, bool is_v6)
{
struct ipvl_port *port = ipvlan->port;
struct ipvl_addr *addr;
list_for_each_entry(addr, &ipvlan->addrs, anode) {
if ((is_v6 && addr->atype == IPVL_IPV6 &&
ipv6_addr_equal(&addr->ip6addr, iaddr)) ||
(!is_v6 && addr->atype == IPVL_IPV4 &&
addr->ip4addr.s_addr == ((struct in_addr *)iaddr)->s_addr))
return true;
}
if (ipvlan_ht_addr_lookup(port, iaddr, is_v6))
return true;
return false;
}
static void *ipvlan_get_L3_hdr(struct sk_buff *skb, int *type)
{
void *lyr3h = NULL;
switch (skb->protocol) {
case htons(ETH_P_ARP): {
struct arphdr *arph;
if (unlikely(!pskb_may_pull(skb, sizeof(*arph))))
return NULL;
arph = arp_hdr(skb);
*type = IPVL_ARP;
lyr3h = arph;
break;
}
case htons(ETH_P_IP): {
u32 pktlen;
struct iphdr *ip4h;
if (unlikely(!pskb_may_pull(skb, sizeof(*ip4h))))
return NULL;
ip4h = ip_hdr(skb);
pktlen = ntohs(ip4h->tot_len);
if (ip4h->ihl < 5 || ip4h->version != 4)
return NULL;
if (skb->len < pktlen || pktlen < (ip4h->ihl * 4))
return NULL;
*type = IPVL_IPV4;
lyr3h = ip4h;
break;
}
case htons(ETH_P_IPV6): {
struct ipv6hdr *ip6h;
if (unlikely(!pskb_may_pull(skb, sizeof(*ip6h))))
return NULL;
ip6h = ipv6_hdr(skb);
if (ip6h->version != 6)
return NULL;
*type = IPVL_IPV6;
lyr3h = ip6h;
/* Only Neighbour Solicitation pkts need different treatment */
if (ipv6_addr_any(&ip6h->saddr) &&
ip6h->nexthdr == NEXTHDR_ICMP) {
*type = IPVL_ICMPV6;
lyr3h = ip6h + 1;
}
break;
}
default:
return NULL;
}
return lyr3h;
}
unsigned int ipvlan_mac_hash(const unsigned char *addr)
{
u32 hash = jhash_1word(__get_unaligned_cpu32(addr+2),
ipvlan_jhash_secret);
return hash & IPVLAN_MAC_FILTER_MASK;
}
static void ipvlan_multicast_frame(struct ipvl_port *port, struct sk_buff *skb,
const struct ipvl_dev *in_dev, bool local)
{
struct ethhdr *eth = eth_hdr(skb);
struct ipvl_dev *ipvlan;
struct sk_buff *nskb;
unsigned int len;
unsigned int mac_hash;
int ret;
if (skb->protocol == htons(ETH_P_PAUSE))
return;
list_for_each_entry(ipvlan, &port->ipvlans, pnode) {
if (local && (ipvlan == in_dev))
continue;
mac_hash = ipvlan_mac_hash(eth->h_dest);
if (!test_bit(mac_hash, ipvlan->mac_filters))
continue;
ret = NET_RX_DROP;
len = skb->len + ETH_HLEN;
nskb = skb_clone(skb, GFP_ATOMIC);
if (!nskb)
goto mcast_acct;
if (ether_addr_equal(eth->h_dest, ipvlan->phy_dev->broadcast))
nskb->pkt_type = PACKET_BROADCAST;
else
nskb->pkt_type = PACKET_MULTICAST;
nskb->dev = ipvlan->dev;
if (local)
ret = dev_forward_skb(ipvlan->dev, nskb);
else
ret = netif_rx(nskb);
mcast_acct:
ipvlan_count_rx(ipvlan, len, ret == NET_RX_SUCCESS, true);
}
/* Locally generated? ...Forward a copy to the main-device as
* well. On the RX side we'll ignore it (won't give it to any
* of the virtual devices).
*/
if (local) {
nskb = skb_clone(skb, GFP_ATOMIC);
if (nskb) {
if (ether_addr_equal(eth->h_dest, port->dev->broadcast))
nskb->pkt_type = PACKET_BROADCAST;
else
nskb->pkt_type = PACKET_MULTICAST;
dev_forward_skb(port->dev, nskb);
}
}
}
static int ipvlan_rcv_frame(struct ipvl_addr *addr, struct sk_buff *skb,
bool local)
{
struct ipvl_dev *ipvlan = addr->master;
struct net_device *dev = ipvlan->dev;
unsigned int len;
rx_handler_result_t ret = RX_HANDLER_CONSUMED;
bool success = false;
len = skb->len + ETH_HLEN;
if (unlikely(!(dev->flags & IFF_UP))) {
kfree_skb(skb);
goto out;
}
skb = skb_share_check(skb, GFP_ATOMIC);
if (!skb)
goto out;
skb->dev = dev;
skb->pkt_type = PACKET_HOST;
if (local) {
if (dev_forward_skb(ipvlan->dev, skb) == NET_RX_SUCCESS)
success = true;
} else {
ret = RX_HANDLER_ANOTHER;
success = true;
}
out:
ipvlan_count_rx(ipvlan, len, success, false);
return ret;
}
static struct ipvl_addr *ipvlan_addr_lookup(struct ipvl_port *port,
void *lyr3h, int addr_type,
bool use_dest)
{
struct ipvl_addr *addr = NULL;
if (addr_type == IPVL_IPV6) {
struct ipv6hdr *ip6h;
struct in6_addr *i6addr;
ip6h = (struct ipv6hdr *)lyr3h;
i6addr = use_dest ? &ip6h->daddr : &ip6h->saddr;
addr = ipvlan_ht_addr_lookup(port, i6addr, true);
} else if (addr_type == IPVL_ICMPV6) {
struct nd_msg *ndmh;
struct in6_addr *i6addr;
/* Make sure that the NeighborSolicitation ICMPv6 packets
* are handled to avoid DAD issue.
*/
ndmh = (struct nd_msg *)lyr3h;
if (ndmh->icmph.icmp6_type == NDISC_NEIGHBOUR_SOLICITATION) {
i6addr = &ndmh->target;
addr = ipvlan_ht_addr_lookup(port, i6addr, true);
}
} else if (addr_type == IPVL_IPV4) {
struct iphdr *ip4h;
__be32 *i4addr;
ip4h = (struct iphdr *)lyr3h;
i4addr = use_dest ? &ip4h->daddr : &ip4h->saddr;
addr = ipvlan_ht_addr_lookup(port, i4addr, false);
} else if (addr_type == IPVL_ARP) {
struct arphdr *arph;
unsigned char *arp_ptr;
__be32 dip;
arph = (struct arphdr *)lyr3h;
arp_ptr = (unsigned char *)(arph + 1);
if (use_dest)
arp_ptr += (2 * port->dev->addr_len) + 4;
else
arp_ptr += port->dev->addr_len;
memcpy(&dip, arp_ptr, 4);
addr = ipvlan_ht_addr_lookup(port, &dip, false);
}
return addr;
}
static int ipvlan_process_v4_outbound(struct sk_buff *skb)
{
const struct iphdr *ip4h = ip_hdr(skb);
struct net_device *dev = skb->dev;
struct rtable *rt;
int err, ret = NET_XMIT_DROP;
struct flowi4 fl4 = {
.flowi4_oif = dev->iflink,
.flowi4_tos = RT_TOS(ip4h->tos),