1. 01 Jan, 2019 1 commit
    • ip: validate header length on virtual device xmit · cb9f1b78
      Willem de Bruijn authored
      KMSAN detected read beyond end of buffer in vti and sit devices when
      passing truncated packets with PF_PACKET. The issue affects additional
      ip tunnel devices.
      
      Extend commit 76c0ddd8 ("ip6_tunnel: be careful when accessing the
      inner header") and commit ccfec9e5 ("ip_tunnel: be careful when
      accessing the inner header").
      
      Move the check to a separate helper and call at the start of each
      ndo_start_xmit function in net/ipv4 and net/ipv6.
      
      Minor changes:
      - convert dev_kfree_skb to kfree_skb on error path,
        as dev_kfree_skb calls consume_skb which is not for error paths.
      - use pskb_network_may_pull even though that is pedantic here,
        as it is the same as pskb_may_pull for devices without llheaders.
      - do not cache ipv6 hdrs if used only once
        (unsafe across pskb_may_pull, was more relevant to earlier patch)
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
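The length check described in the commit above can be sketched in userspace as follows. This is a minimal model, assuming a flat buffer in place of an skb; `pkt_buf`, `pkt_may_pull` and `tnl_xmit_hdr_ok` are hypothetical names used for illustration, not kernel helpers. The kernel code uses pskb_network_may_pull, as the commit message states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_P_IP   0x0800   /* IPv4 ethertype */
#define ETH_P_IPV6 0x86DD   /* IPv6 ethertype */

#define IPV4_HDR_MIN_LEN 20u /* IPv4 header without options */
#define IPV6_HDR_LEN     40u /* fixed IPv6 header */

/* Hypothetical stand-in for an skb: linear data plus the claimed protocol. */
struct pkt_buf {
	const uint8_t *data;
	size_t len;
	uint16_t protocol; /* host byte order, to keep the sketch simple */
};

/* True if at least 'need' bytes are available to read. */
static bool pkt_may_pull(const struct pkt_buf *p, size_t need)
{
	return p->len >= need;
}

/* Check to run at the start of a transmit handler, before any inner
 * header field is read. Truncated packets are rejected. */
static bool tnl_xmit_hdr_ok(const struct pkt_buf *p)
{
	switch (p->protocol) {
	case ETH_P_IP:
		return pkt_may_pull(p, IPV4_HDR_MIN_LEN);
	case ETH_P_IPV6:
		return pkt_may_pull(p, IPV6_HDR_LEN);
	default:
		return true; /* assumption: no inner IP header is read */
	}
}
```

A caller would drop the packet on the error path when `tnl_xmit_hdr_ok()` returns false, instead of reading past the end of the buffer.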
  2. 21 Dec, 2018 1 commit
    • ipv6: tunnels: fix two use-after-free · cbb49697
      Eric Dumazet authored
      xfrm6_policy_check() might have re-allocated skb->head, we need
      to reload ipv6 header pointer.
      
      syzbot reported:
      
      BUG: KASAN: use-after-free in __ipv6_addr_type+0x302/0x32f net/ipv6/addrconf_core.c:40
      Read of size 4 at addr ffff888191b8cb70 by task syz-executor2/1304
      
      CPU: 0 PID: 1304 Comm: syz-executor2 Not tainted 4.20.0-rc7+ #356
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x244/0x39d lib/dump_stack.c:113
       print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256
       kasan_report_error mm/kasan/report.c:354 [inline]
       kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412
       __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:432
       __ipv6_addr_type+0x302/0x32f net/ipv6/addrconf_core.c:40
       ipv6_addr_type include/net/ipv6.h:403 [inline]
       ip6_tnl_get_cap+0x27/0x190 net/ipv6/ip6_tunnel.c:727
       ip6_tnl_rcv_ctl+0xdb/0x2a0 net/ipv6/ip6_tunnel.c:757
       vti6_rcv+0x336/0x8f3 net/ipv6/ip6_vti.c:321
       xfrm6_ipcomp_rcv+0x1a5/0x3a0 net/ipv6/xfrm6_protocol.c:132
       ip6_protocol_deliver_rcu+0x372/0x1940 net/ipv6/ip6_input.c:394
       ip6_input_finish+0x84/0x170 net/ipv6/ip6_input.c:434
       NF_HOOK include/linux/netfilter.h:289 [inline]
       ip6_input+0xe9/0x600 net/ipv6/ip6_input.c:443
      IPVS: ftp: loaded support on port[0] = 21
       ip6_mc_input+0x514/0x11c0 net/ipv6/ip6_input.c:537
       dst_input include/net/dst.h:450 [inline]
       ip6_rcv_finish+0x17a/0x330 net/ipv6/ip6_input.c:76
       NF_HOOK include/linux/netfilter.h:289 [inline]
       ipv6_rcv+0x115/0x640 net/ipv6/ip6_input.c:272
       __netif_receive_skb_one_core+0x14d/0x200 net/core/dev.c:4973
       __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:5083
       process_backlog+0x24e/0x7a0 net/core/dev.c:5923
       napi_poll net/core/dev.c:6346 [inline]
       net_rx_action+0x7fa/0x19b0 net/core/dev.c:6412
       __do_softirq+0x308/0xb7e kernel/softirq.c:292
       do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1027
       </IRQ>
       do_softirq.part.14+0x126/0x160 kernel/softirq.c:337
       do_softirq+0x19/0x20 kernel/softirq.c:340
       netif_rx_ni+0x521/0x860 net/core/dev.c:4569
       dev_loopback_xmit+0x287/0x8c0 net/core/dev.c:3576
       NF_HOOK include/linux/netfilter.h:289 [inline]
       ip6_finish_output2+0x193a/0x2930 net/ipv6/ip6_output.c:84
       ip6_fragment+0x2b06/0x3850 net/ipv6/ip6_output.c:727
       ip6_finish_output+0x6b7/0xc50 net/ipv6/ip6_output.c:152
       NF_HOOK_COND include/linux/netfilter.h:278 [inline]
       ip6_output+0x232/0x9d0 net/ipv6/ip6_output.c:171
       dst_output include/net/dst.h:444 [inline]
       ip6_local_out+0xc5/0x1b0 net/ipv6/output_core.c:176
       ip6_send_skb+0xbc/0x340 net/ipv6/ip6_output.c:1727
       ip6_push_pending_frames+0xc5/0xf0 net/ipv6/ip6_output.c:1747
       rawv6_push_pending_frames net/ipv6/raw.c:615 [inline]
       rawv6_sendmsg+0x3a3e/0x4b40 net/ipv6/raw.c:945
      kobject: 'queues' (0000000089e6eea2): kobject_add_internal: parent: 'tunl0', set: '<NULL>'
      kobject: 'queues' (0000000089e6eea2): kobject_uevent_env
       inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798
      kobject: 'queues' (0000000089e6eea2): kobject_uevent_env: filter function caused the event to drop!
       sock_sendmsg_nosec net/socket.c:621 [inline]
       sock_sendmsg+0xd5/0x120 net/socket.c:631
       sock_write_iter+0x35e/0x5c0 net/socket.c:900
       call_write_iter include/linux/fs.h:1857 [inline]
       new_sync_write fs/read_write.c:474 [inline]
       __vfs_write+0x6b8/0x9f0 fs/read_write.c:487
      kobject: 'rx-0' (00000000e2d902d9): kobject_add_internal: parent: 'queues', set: 'queues'
      kobject: 'rx-0' (00000000e2d902d9): kobject_uevent_env
       vfs_write+0x1fc/0x560 fs/read_write.c:549
       ksys_write+0x101/0x260 fs/read_write.c:598
      kobject: 'rx-0' (00000000e2d902d9): fill_kobj_path: path = '/devices/virtual/net/tunl0/queues/rx-0'
       __do_sys_write fs/read_write.c:610 [inline]
       __se_sys_write fs/read_write.c:607 [inline]
       __x64_sys_write+0x73/0xb0 fs/read_write.c:607
       do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
      kobject: 'tx-0' (00000000443b70ac): kobject_add_internal: parent: 'queues', set: 'queues'
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x457669
      Code: fd b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f9bd200bc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000457669
      RDX: 000000000000058f RSI: 00000000200033c0 RDI: 0000000000000003
      kobject: 'tx-0' (00000000443b70ac): kobject_uevent_env
      RBP: 000000000072bf00 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f9bd200c6d4
      R13: 00000000004c2dcc R14: 00000000004da398 R15: 00000000ffffffff
      
      Allocated by task 1304:
       save_stack+0x43/0xd0 mm/kasan/kasan.c:448
       set_track mm/kasan/kasan.c:460 [inline]
       kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
       __do_kmalloc_node mm/slab.c:3684 [inline]
       __kmalloc_node_track_caller+0x50/0x70 mm/slab.c:3698
       __kmalloc_reserve.isra.41+0x41/0xe0 net/core/skbuff.c:140
       __alloc_skb+0x155/0x760 net/core/skbuff.c:208
      kobject: 'tx-0' (00000000443b70ac): fill_kobj_path: path = '/devices/virtual/net/tunl0/queues/tx-0'
       alloc_skb include/linux/skbuff.h:1011 [inline]
       __ip6_append_data.isra.49+0x2f1a/0x3f50 net/ipv6/ip6_output.c:1450
       ip6_append_data+0x1bc/0x2d0 net/ipv6/ip6_output.c:1619
       rawv6_sendmsg+0x15ab/0x4b40 net/ipv6/raw.c:938
       inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798
       sock_sendmsg_nosec net/socket.c:621 [inline]
       sock_sendmsg+0xd5/0x120 net/socket.c:631
       ___sys_sendmsg+0x7fd/0x930 net/socket.c:2116
       __sys_sendmsg+0x11d/0x280 net/socket.c:2154
       __do_sys_sendmsg net/socket.c:2163 [inline]
       __se_sys_sendmsg net/socket.c:2161 [inline]
       __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2161
       do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      kobject: 'gre0' (00000000cb1b2d7b): kobject_add_internal: parent: 'net', set: 'devices'
      
      Freed by task 1304:
       save_stack+0x43/0xd0 mm/kasan/kasan.c:448
       set_track mm/kasan/kasan.c:460 [inline]
       __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
       kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
       __cache_free mm/slab.c:3498 [inline]
       kfree+0xcf/0x230 mm/slab.c:3817
       skb_free_head+0x93/0xb0 net/core/skbuff.c:553
       pskb_expand_head+0x3b2/0x10d0 net/core/skbuff.c:1498
       __pskb_pull_tail+0x156/0x18a0 net/core/skbuff.c:1896
       pskb_may_pull include/linux/skbuff.h:2188 [inline]
       _decode_session6+0xd11/0x14d0 net/ipv6/xfrm6_policy.c:150
       __xfrm_decode_session+0x71/0x140 net/xfrm/xfrm_policy.c:3272
      kobject: 'gre0' (00000000cb1b2d7b): kobject_uevent_env
       __xfrm_policy_check+0x380/0x2c40 net/xfrm/xfrm_policy.c:3322
       __xfrm_policy_check2 include/net/xfrm.h:1170 [inline]
       xfrm_policy_check include/net/xfrm.h:1175 [inline]
       xfrm6_policy_check include/net/xfrm.h:1185 [inline]
       vti6_rcv+0x4bd/0x8f3 net/ipv6/ip6_vti.c:316
       xfrm6_ipcomp_rcv+0x1a5/0x3a0 net/ipv6/xfrm6_protocol.c:132
       ip6_protocol_deliver_rcu+0x372/0x1940 net/ipv6/ip6_input.c:394
       ip6_input_finish+0x84/0x170 net/ipv6/ip6_input.c:434
       NF_HOOK include/linux/netfilter.h:289 [inline]
       ip6_input+0xe9/0x600 net/ipv6/ip6_input.c:443
       ip6_mc_input+0x514/0x11c0 net/ipv6/ip6_input.c:537
       dst_input include/net/dst.h:450 [inline]
       ip6_rcv_finish+0x17a/0x330 net/ipv6/ip6_input.c:76
       NF_HOOK include/linux/netfilter.h:289 [inline]
       ipv6_rcv+0x115/0x640 net/ipv6/ip6_input.c:272
       __netif_receive_skb_one_core+0x14d/0x200 net/core/dev.c:4973
       __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:5083
       process_backlog+0x24e/0x7a0 net/core/dev.c:5923
      kobject: 'gre0' (00000000cb1b2d7b): fill_kobj_path: path = '/devices/virtual/net/gre0'
       napi_poll net/core/dev.c:6346 [inline]
       net_rx_action+0x7fa/0x19b0 net/core/dev.c:6412
       __do_softirq+0x308/0xb7e kernel/softirq.c:292
      
      The buggy address belongs to the object at ffff888191b8cac0
       which belongs to the cache kmalloc-512 of size 512
      The buggy address is located 176 bytes inside of
       512-byte region [ffff888191b8cac0, ffff888191b8ccc0)
      The buggy address belongs to the page:
      page:ffffea000646e300 count:1 mapcount:0 mapping:ffff8881da800940 index:0x0
      flags: 0x2fffc0000000200(slab)
      raw: 02fffc0000000200 ffffea0006eaaa48 ffffea00065356c8 ffff8881da800940
      raw: 0000000000000000 ffff888191b8c0c0 0000000100000006 0000000000000000
      page dumped because: kasan: bad access detected
      kobject: 'queues' (000000005fd6226e): kobject_add_internal: parent: 'gre0', set: '<NULL>'
      
      Memory state around the buggy address:
       ffff888191b8ca00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       ffff888191b8ca80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
      >ffff888191b8cb00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                                   ^
       ffff888191b8cb80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff888191b8cc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Fixes: 0d3c703a ("ipv6: Cleanup IPv6 tunnel receive path")
      Fixes: ed1efb2a ("ipv6: Add support for IPsec virtual tunnel interfaces")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
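The hazard fixed in the commit above is a header pointer cached before a call that may reallocate the buffer. The following is a userspace sketch of the pattern, not the kernel code; `struct buf`, `buf_expand` and `first_byte_after_expand` are hypothetical names, and `buf_expand` always moves the data so the effect is deterministic.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an skb whose head may be reallocated. */
struct buf {
	uint8_t *head;
	size_t len;
};

/* Models a call that may move the data. Here it always allocates a new
 * head and frees the old one. */
static int buf_expand(struct buf *b, size_t new_len)
{
	uint8_t *n = malloc(new_len);

	if (!n)
		return -1;
	memcpy(n, b->head, b->len < new_len ? b->len : new_len);
	free(b->head); /* any pointer into the old head is now dangling */
	b->head = n;
	b->len = new_len;
	return 0;
}

/* Reads the first header byte after a possibly reallocating call. */
static int first_byte_after_expand(struct buf *b, size_t new_len)
{
	const uint8_t *hdr = b->head; /* cached header pointer */

	if (buf_expand(b, new_len) < 0)
		return -1;
	hdr = b->head; /* reload: the cached value is stale */
	return hdr[0];
}
```

Without the reload line, `hdr[0]` would read freed memory, which is the class of access KASAN reported in the trace above.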
  3. 18 Oct, 2018 1 commit
    • ip6_tunnel: Fix encapsulation layout · d4d576f5
      Stefano Brivio authored
      Commit 058214a4 ("ip6_tun: Add infrastructure for doing
      encapsulation") added the ip6_tnl_encap() call in ip6_tnl_xmit(), before
      the call to ipv6_push_frag_opts() to append the IPv6 Tunnel Encapsulation
      Limit option (option 4, RFC 2473, par. 5.1) to the outer IPv6 header.
      
      As long as the option didn't actually end up in generated packets, this
      wasn't an issue. Then commit 89a23c8b ("ip6_tunnel: Fix missing tunnel
      encapsulation limit option") fixed sending of this option, and the
      resulting layout, e.g. for FoU, is:
      
      .-------------------.------------.----------.-------------------.----- - -
      | Outer IPv6 Header | UDP header | Option 4 | Inner IPv6 Header | Payload
      '-------------------'------------'----------'-------------------'----- - -
      
      Needless to say, FoU and GUE (at least) won't work over IPv6. The option
      is appended by default, and I couldn't find a way to disable it with the
      current iproute2.
      
      Turn this into a more reasonable:
      
      .-------------------.----------.------------.-------------------.----- - -
      | Outer IPv6 Header | Option 4 | UDP header | Inner IPv6 Header | Payload
      '-------------------'----------'------------'-------------------'----- - -
      
      With this, and with 84dad559 ("udp6: fix encap return code for
      resubmitting"), FoU and GUE work again over IPv6.
      
      Fixes: 058214a4 ("ip6_tun: Add infrastructure for doing encapsulation")
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
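The two layouts drawn in the commit above differ only in header order. The sketch below computes the byte offsets for each order. The header sizes are assumptions for illustration (40-byte IPv6 header, 8-byte UDP header, 8-byte Destination Options header carrying option 4), and `fou6_layout`, `fou6_layout_fixed` and `fou6_layout_broken` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define IPV6_HDR_LEN 40u /* fixed IPv6 header */
#define UDP_HDR_LEN   8u /* UDP header used by FoU */
#define TEL_OPT_LEN   8u /* assumed size of the header carrying option 4 */

/* Offsets of each layer from the start of the outer IPv6 header. */
struct fou6_layout {
	size_t tel_opt_off;
	size_t udp_off;
	size_t inner_off;
};

/* Order after the fix: outer IPv6, option 4, UDP, inner IPv6. */
static struct fou6_layout fou6_layout_fixed(void)
{
	struct fou6_layout l;

	l.tel_opt_off = IPV6_HDR_LEN;
	l.udp_off = l.tel_opt_off + TEL_OPT_LEN;
	l.inner_off = l.udp_off + UDP_HDR_LEN;
	return l;
}

/* Order before the fix: outer IPv6, UDP, option 4, inner IPv6. */
static struct fou6_layout fou6_layout_broken(void)
{
	struct fou6_layout l;

	l.udp_off = IPV6_HDR_LEN;
	l.tel_opt_off = l.udp_off + UDP_HDR_LEN;
	l.inner_off = l.tel_opt_off + TEL_OPT_LEN;
	return l;
}
```

In the broken order the UDP payload begins with the option instead of the inner header, which is consistent with the commit's statement that FoU and GUE do not work in that layout.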
  4. 20 Sep, 2018 1 commit
    • ip6_tunnel: be careful when accessing the inner header · 76c0ddd8
      Paolo Abeni authored
      the ip6 tunnel xmit ndo assumes that the processed skb always
      contains an ip[v6] header, but syzbot has found a way to send
      frames that fall short of this assumption, leading to the following splat:
      
      BUG: KMSAN: uninit-value in ip6ip6_tnl_xmit net/ipv6/ip6_tunnel.c:1307
      [inline]
      BUG: KMSAN: uninit-value in ip6_tnl_start_xmit+0x7d2/0x1ef0
      net/ipv6/ip6_tunnel.c:1390
      CPU: 0 PID: 4504 Comm: syz-executor558 Not tainted 4.16.0+ #87
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      Call Trace:
        __dump_stack lib/dump_stack.c:17 [inline]
        dump_stack+0x185/0x1d0 lib/dump_stack.c:53
        kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
        __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:683
        ip6ip6_tnl_xmit net/ipv6/ip6_tunnel.c:1307 [inline]
        ip6_tnl_start_xmit+0x7d2/0x1ef0 net/ipv6/ip6_tunnel.c:1390
        __netdev_start_xmit include/linux/netdevice.h:4066 [inline]
        netdev_start_xmit include/linux/netdevice.h:4075 [inline]
        xmit_one net/core/dev.c:3026 [inline]
        dev_hard_start_xmit+0x5f1/0xc70 net/core/dev.c:3042
        __dev_queue_xmit+0x27ee/0x3520 net/core/dev.c:3557
        dev_queue_xmit+0x4b/0x60 net/core/dev.c:3590
        packet_snd net/packet/af_packet.c:2944 [inline]
        packet_sendmsg+0x7c70/0x8a30 net/packet/af_packet.c:2969
        sock_sendmsg_nosec net/socket.c:630 [inline]
        sock_sendmsg net/socket.c:640 [inline]
        ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
        __sys_sendmmsg+0x42d/0x800 net/socket.c:2136
        SYSC_sendmmsg+0xc4/0x110 net/socket.c:2167
        SyS_sendmmsg+0x63/0x90 net/socket.c:2162
        do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
        entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      RIP: 0033:0x441819
      RSP: 002b:00007ffe58ee8268 EFLAGS: 00000213 ORIG_RAX: 0000000000000133
      RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000441819
      RDX: 0000000000000002 RSI: 0000000020000100 RDI: 0000000000000003
      RBP: 00000000006cd018 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000213 R12: 0000000000402510
      R13: 00000000004025a0 R14: 0000000000000000 R15: 0000000000000000
      
      Uninit was created at:
        kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
        kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
        kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
        kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
        slab_post_alloc_hook mm/slab.h:445 [inline]
        slab_alloc_node mm/slub.c:2737 [inline]
        __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
        __kmalloc_reserve net/core/skbuff.c:138 [inline]
        __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
        alloc_skb include/linux/skbuff.h:984 [inline]
        alloc_skb_with_frags+0x1d4/0xb20 net/core/skbuff.c:5234
        sock_alloc_send_pskb+0xb56/0x1190 net/core/sock.c:2085
        packet_alloc_skb net/packet/af_packet.c:2803 [inline]
        packet_snd net/packet/af_packet.c:2894 [inline]
        packet_sendmsg+0x6454/0x8a30 net/packet/af_packet.c:2969
        sock_sendmsg_nosec net/socket.c:630 [inline]
        sock_sendmsg net/socket.c:640 [inline]
        ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
        __sys_sendmmsg+0x42d/0x800 net/socket.c:2136
        SYSC_sendmmsg+0xc4/0x110 net/socket.c:2167
        SyS_sendmmsg+0x63/0x90 net/socket.c:2162
        do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
        entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      This change addresses the issue by adding the needed check before
      accessing the inner header.
      
      The ipv4 side of the issue is apparently there since the ipv4 over ipv6
      initial support, and the ipv6 side predates git history.
      
      Fixes: c4d3efaf ("[IPV6] IP6TUNNEL: Add support to IPv4 over IPv6 tunnel.")
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-by: syzbot+3fde91d4d394747d6db4@syzkaller.appspotmail.com
      Tested-by: Alexander Potapenko <glider@google.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 04 Sep, 2018 1 commit
  6. 07 Aug, 2018 1 commit
  7. 06 Aug, 2018 1 commit
  8. 01 Jun, 2018 1 commit
  9. 05 Apr, 2018 1 commit
  10. 27 Mar, 2018 1 commit
  11. 16 Mar, 2018 1 commit
    • net/ipv6: Change address check to always take a device argument · 232378e8
      David Ahern authored
      
      
      ipv6_chk_addr_and_flags determines if an address is a local address and
      optionally if it is an address on a specific device. For example, it is
      called by ip6_route_info_create to determine if a given gateway address
      is a local address. The address check currently does not consider L3
      domains and as a result does not allow a route to be added in one VRF
      if the nexthop points to an address in a second VRF. e.g.,
      
          $ ip route add 2001:db8:1::/64 vrf r2 via 2001:db8:102::23
          Error: Invalid gateway address.
      
      where 2001:db8:102::23 is an address on an interface in vrf r1.
      
      ipv6_chk_addr_and_flags needs to allow callers to always pass in a device
      with a separate argument to not limit the address to the specific device.
      The device is used to determine the L3 domain of interest.
      
      To that end add an argument to skip the device check and update callers
      to always pass a device where possible and use the new argument to mean
      any address in the domain.
      
      Update a handful of users of ipv6_chk_addr with a NULL dev argument. This
      patch handles the change to these callers without adding the domain check.
      
      ip6_validate_gw needs to handle 2 cases - one where the device is given
      as part of the nexthop spec and the other where the device is resolved.
      There is at least 1 VRF case where deferring the check to only after
      the route lookup has resolved the device fails with an unintuitive error
      "RTNETLINK answers: No route to host" as opposed to the preferred
      "Error: Gateway can not be a local address." The 'no route to host'
      error is because of the fallback to a full lookup. The check is done
      twice to avoid this error.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 09 Mar, 2018 1 commit
    • net: do not create fallback tunnels for non-default namespaces · 79134e6c
      Eric Dumazet authored
      
      
      fallback tunnels (like tunl0, gre0, gretap0, erspan0, sit0,
      ip6tnl0, ip6gre0) are automatically created when the corresponding
      module is loaded.
      
      These tunnels are also automatically created when a new network
      namespace is created, at a great cost.
      
      In many cases, netns are used for isolation purposes, and these
      extra network devices are a waste of resources. We are using
      thousands of netns per host, and hit the netns creation/delete
      bottleneck a lot. (Many thanks to Kirill for recent work on this)
      
      Add a new sysctl so that we can opt-out from this automatic creation.
      
      Note that these tunnels are still created for the initial namespace,
      to be the least intrusive for typical setups.
      
      Tested:
      lpk43:~# cat add_del_unshare.sh
      for i in `seq 1 40`
      do
       (for j in `seq 1 100` ; do  unshare -n /bin/true >/dev/null ; done) &
      done
      wait
      
      lpk43:~# echo 0 >/proc/sys/net/core/fb_tunnels_only_for_init_net
      lpk43:~# time ./add_del_unshare.sh
      
      real	0m37.521s
      user	0m0.886s
      sys	7m7.084s
      lpk43:~# echo 1 >/proc/sys/net/core/fb_tunnels_only_for_init_net
      lpk43:~# time ./add_del_unshare.sh
      
      real	0m4.761s
      user	0m0.851s
      sys	1m8.343s
      lpk43:~#
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
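The opt-out rule described in the commit above reduces to a small predicate. This is a sketch under stated assumptions: `struct netns` and `wants_fallback_tunnels` are hypothetical names, and the integer argument stands in for the value of the fb_tunnels_only_for_init_net sysctl.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a network namespace. */
struct netns {
	bool is_init; /* true for the initial namespace */
};

/* Fallback tunnels are always created in the initial namespace. In any
 * other namespace they are created only when the sysctl is 0. */
static bool wants_fallback_tunnels(const struct netns *ns,
				   int only_for_init_net)
{
	return ns->is_init || !only_for_init_net;
}
```

With the sysctl set to 1, a namespace created by `unshare -n` would skip the fallback devices, which is the case timed in the test above.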
  13. 04 Mar, 2018 1 commit
  14. 27 Feb, 2018 2 commits
  15. 25 Jan, 2018 1 commit
  16. 02 Jan, 2018 2 commits
  17. 19 Dec, 2017 1 commit
    • ip6_tunnel: get the min mtu properly in ip6_tnl_xmit · c9fefa08
      Xin Long authored
      
      
      Now it's using IPV6_MIN_MTU as the min mtu in ip6_tnl_xmit, but
      IPV6_MIN_MTU actually only works when the inner packet is ipv6.
      
      With IPV6_MIN_MTU for ipv4 packets, the new pmtu for inner dst
      couldn't be set less than 1280. It would cause tx_err and the
      packet to be dropped when the outer dst pmtu is close to 1280.
      
      Jianlin found it by running ipv4 traffic with the topo:
      
        (client) gre6 <---> eth1 (route) eth2 <---> gre6 (server)
      
      After changing eth2 mtu to 1300, the performance became very
      low, or the connection was even broken. The issue also affects
      ip4ip6 and ip6ip6 tunnels.
      
      So if the inner packet is ipv4, 576 should be considered as the
      min mtu.
      
      Note that for ip4ip6 and ip6ip6 tunnels, the inner packet can
      only be ipv4 or ipv6, but for gre6 tunnel, it may also be ARP.
      Using 576 as the min mtu for non-ipv6 packets, as this patch does,
      works for all those cases.
      Reported-by: Jianlin Shi <jishi@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
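The rule in the commit above (1280 for an inner IPv6 packet, 576 for anything else) can be sketched as a clamp. The function name `tnl_clamp_mtu` and the macro `NON_IPV6_MIN_MTU` are hypothetical; only the two numeric limits come from the commit message.

```c
#include <assert.h>
#include <stdint.h>

#define ETH_P_IP   0x0800 /* IPv4 ethertype */
#define ETH_P_ARP  0x0806 /* ARP ethertype */
#define ETH_P_IPV6 0x86DD /* IPv6 ethertype */

#define IPV6_MIN_MTU     1280u
#define NON_IPV6_MIN_MTU  576u /* hypothetical macro name */

/* Returns the path MTU to report for the inner packet, never below the
 * minimum that applies to the inner protocol. */
static unsigned int tnl_clamp_mtu(unsigned int mtu, uint16_t inner_proto)
{
	unsigned int min_mtu = inner_proto == ETH_P_IPV6 ? IPV6_MIN_MTU
							 : NON_IPV6_MIN_MTU;

	return mtu < min_mtu ? min_mtu : mtu;
}
```

For an inner IPv4 packet and a computed MTU of 1252, the old behavior would have reported 1280, while this rule keeps 1252.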
  18. 07 Dec, 2017 1 commit
  19. 04 Dec, 2017 1 commit
  20. 13 Nov, 2017 3 commits
  21. 25 Oct, 2017 2 commits
    • locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE() · 6aa7de05
      Mark Rutland authored
      
      Please do not apply this to mainline directly, instead please re-run the
      coccinelle script shown below and apply its output.
      
      For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
      preference to ACCESS_ONCE(), and new code is expected to use one of the
      former. So far, there's been no reason to change most existing uses of
      ACCESS_ONCE(), as these aren't harmful, and changing them results in
      churn.
      
      However, for some features, the read/write distinction is critical to
      correct operation. To distinguish these cases, separate read/write
      accessors must be used. This patch migrates (most) remaining
      ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
      coccinelle script:
      
      ----
      // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
      // WRITE_ONCE()
      
      // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
      
      virtual patch
      
      @ depends on patch @
      expression E1, E2;
      @@
      
      - ACCESS_ONCE(E1) = E2
      + WRITE_ONCE(E1, E2)
      
      @ depends on patch @
      expression E;
      @@
      
      - ACCESS_ONCE(E)
      + READ_ONCE(E)
      ----
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: davem@davemloft.net
      Cc: linux-arch@vger.kernel.org
      Cc: mpe@ellerman.id.au
      Cc: shuah@kernel.org
      Cc: snitzer@redhat.com
      Cc: thor.thayer@linux.intel.com
      Cc: tj@kernel.org
      Cc: viro@zeniv.linux.org.uk
      Cc: will.deacon@arm.com
      Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
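The effect of the two coccinelle rules in the commit above on C code can be illustrated as follows. The macros here are simplified userspace approximations that rely on the GNU `__typeof__` extension; the kernel's own definitions are more elaborate. `shared_flag`, `publish_flag` and `observe_flag` are hypothetical names.

```c
#include <assert.h>

/* Simplified approximations: force exactly one volatile access. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

static int shared_flag;

static void publish_flag(int v)
{
	/* before the conversion: ACCESS_ONCE(shared_flag) = v; */
	WRITE_ONCE(shared_flag, v);
}

static int observe_flag(void)
{
	/* before the conversion: return ACCESS_ONCE(shared_flag); */
	return READ_ONCE(shared_flag);
}
```

The first rule rewrites assignments through ACCESS_ONCE() into WRITE_ONCE(), and the second rewrites every remaining use into READ_ONCE(), which makes the read/write distinction visible at each call site.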
    • ip6_tunnel: Allow rcv/xmit even if remote address is a local address · 908d140a
      Shmulik Ladkani authored
      
      
      Currently, ip6_tnl_xmit_ctl drops tunneled packets if the remote
      address (outer v6 destination) is one of host's locally configured
      addresses.
      Same applies to ip6_tnl_rcv_ctl: it drops packets if the remote address
      (outer v6 source) is a local address.
      
      This prevents using ipxip6 (and ip6_gre) tunnels whose local/remote
      endpoints are on same host; OTOH v4 tunnels (ipip or gre) allow such
      configurations.
      
      An example where this proves useful is a system where entities are
      identified by their unique v6 addresses, and use tunnels to encapsulate
      traffic between them. The limitation prevents placing several entities
      on same host.
      
      Introduce IP6_TNL_F_ALLOW_LOCAL_REMOTE which allows to bypass this
      restriction.
      Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
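The bypass introduced by the commit above can be sketched as a single condition. This models only the local-remote restriction and none of the other checks that ip6_tnl_xmit_ctl and ip6_tnl_rcv_ctl perform. The flag value 0x40 and the function name `tnl_remote_allowed` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IP6_TNL_F_ALLOW_LOCAL_REMOTE 0x40 /* value assumed for this sketch */

/* Returns true if the tunnel may send to, or receive from, the remote
 * address. A remote that is also a local address is rejected unless the
 * flag is set on the tunnel. */
static bool tnl_remote_allowed(uint32_t flags, bool remote_is_local)
{
	if (flags & IP6_TNL_F_ALLOW_LOCAL_REMOTE)
		return true;
	return !remote_is_local;
}
```

Keeping the restriction as the default and making the bypass opt-in leaves existing tunnel configurations unchanged.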
  22. 18 Oct, 2017 1 commit
  23. 01 Oct, 2017 1 commit
  24. 19 Sep, 2017 1 commit
    • ipv6: speedup ipv6 tunnels dismantle · bb401cae
      Eric Dumazet authored
      
      
      Implement exit_batch() method to dismantle more devices
      per round.
      
      (rtnl_lock() ...
       unregister_netdevice_many() ...
       rtnl_unlock())
      
      Tested:
      $ cat add_del_unshare.sh
      for i in `seq 1 40`
      do
       (for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) &
      done
      wait ; grep net_namespace /proc/slabinfo
      
      Before patch :
      $ time ./add_del_unshare.sh
      net_namespace        110    267   5504    1    2 : tunables    8    4    0 : slabdata    110    267      0
      
      real    3m25.292s
      user    0m0.644s
      sys     0m40.153s
      
      After patch:
      
      $ time ./add_del_unshare.sh
      net_namespace        126    282   5504    1    2 : tunables    8    4    0 : slabdata    126    282      0
      
      real	1m38.965s
      user	0m0.688s
      sys	0m37.017s
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 18 Sep, 2017 1 commit
    • ip6_tunnel: do not allow loading ip6_tunnel if ipv6 is disabled in cmdline · 8c22dab0
      Xin Long authored
      
      
      If ipv6 has been disabled from cmdline since kernel started, it makes
      no sense to allow users to create any ip6 tunnel. Otherwise, it could
      cause potential problems.
      
      Jianlin found a kernel crash caused by this in ip6_gre when he set
      ipv6.disable=1 in grub:
      
      [  209.588865] Unable to handle kernel paging request for data at address 0x00000080
      [  209.588872] Faulting instruction address: 0xc000000000a3aa6c
      [  209.588879] Oops: Kernel access of bad area, sig: 11 [#1]
      [  209.589062] NIP [c000000000a3aa6c] fib_rules_lookup+0x4c/0x260
      [  209.589071] LR [c000000000b9ad90] fib6_rule_lookup+0x50/0xb0
      [  209.589076] Call Trace:
      [  209.589097] fib6_rule_lookup+0x50/0xb0
      [  209.589106] rt6_lookup+0xc4/0x110
      [  209.589116] ip6gre_tnl_link_config+0x214/0x2f0 [ip6_gre]
      [  209.589125] ip6gre_newlink+0x138/0x3a0 [ip6_gre]
      [  209.589134] rtnl_newlink+0x798/0xb80
      [  209.589142] rtnetlink_rcv_msg+0xec/0x390
      [  209.589151] netlink_rcv_skb+0x138/0x150
      [  209.589159] rtnetlink_rcv+0x48/0x70
      [  209.589169] netlink_unicast+0x538/0x640
      [  209.589175] netlink_sendmsg+0x40c/0x480
      [  209.589184] ___sys_sendmsg+0x384/0x4e0
      [  209.589194] SyS_sendmsg+0xd4/0x140
      [  209.589201] SyS_socketcall+0x3e0/0x4f0
      [  209.589209] system_call+0x38/0xe0
      
      This patch is to return -EOPNOTSUPP in ip6_tunnel_init if ipv6 has been
      disabled from cmdline.
      Reported-by: Jianlin Shi <jishi@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  26. 13 Sep, 2017 1 commit
  27. 09 Sep, 2017 1 commit
  28. 27 Jun, 2017 3 commits
  29. 19 Jun, 2017 1 commit
  30. 16 Jun, 2017 1 commit
  31. 07 Jun, 2017 1 commit
    • net: Fix inconsistent teardown and release of private netdev state. · cf124db5
      David S. Miller authored
      
      
      Network devices can allocate resources and private memory using
      netdev_ops->ndo_init().  However, the release of these resources
      can occur in one of two different places.
      
      Either netdev_ops->ndo_uninit() or netdev->destructor().
      
      The decision of which operation frees the resources depends upon
      whether it is necessary for all netdev refs to be released before it
      is safe to perform the freeing.
      
      netdev_ops->ndo_uninit() presumably can occur right after the
      NETDEV_UNREGISTER notifier completes and the unicast and multicast
      address lists are flushed.
      
      netdev->destructor(), on the other hand, does not run until the
      netdev references all go away.
      
      Further complicating the situation is that netdev->destructor()
      almost universally also does a free_netdev().
      
      This creates a problem for the logic in register_netdevice().
      Because all callers of register_netdevice() manage the freeing
      of the netdev, and invoke free_netdev(dev) if register_netdevice()
      fails.
      
      If netdev_ops->ndo_init() succeeds, but something else fails inside
      of register_netdevice(), it does call ndo_ops->ndo_uninit().  But
      it is not able to invoke netdev->destructor().
      
      This is because netdev->destructor() will do a free_netdev() and
      then the caller of register_netdevice() will do the same.
      
      However, this means that the resources that would normally be released
      by netdev->destructor() will not be.
      
      Over the years drivers have added local hacks to deal with this, by
      invoking their destructor parts by hand when register_netdevice()
      fails.
      
      Many drivers do not try to deal with this, and instead we have leaks.
      
      Let's close this hole by formalizing the distinction between what
      private things need to be freed up by netdev->destructor() and whether
      the driver needs unregister_netdevice() to perform the free_netdev().
      
      netdev->priv_destructor() performs all actions to free up the private
      resources that used to be freed by netdev->destructor(), except for
      free_netdev().
      
      netdev->needs_free_netdev is a boolean that indicates whether
      free_netdev() should be done at the end of unregister_netdevice().
      
      Now, register_netdevice() can sanely release all resources after
      ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit()
      and netdev->priv_destructor().
      
      And at the end of unregister_netdevice(), we invoke
      netdev->priv_destructor() and optionally call free_netdev().
      Signed-off-by: David S. Miller <davem@davemloft.net>
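The split between priv_destructor and needs_free_netdev described in the commit above can be modeled in userspace. This is a heavily reduced sketch, not the kernel implementation: `netdev_model`, `register_model`, `unregister_model` and the `demo_*` callbacks are hypothetical, and only the steps named in the commit message are modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, heavily reduced model of a net_device. */
struct netdev_model {
	int  (*ndo_init)(struct netdev_model *dev);
	void (*ndo_uninit)(struct netdev_model *dev);
	void (*priv_destructor)(struct netdev_model *dev);
	bool needs_free_netdev;
	void *priv;         /* private resource allocated by ndo_init */
	bool uninit_called;
};

static int demo_init(struct netdev_model *dev)
{
	dev->priv = malloc(16);
	return dev->priv ? 0 : -1;
}

static void demo_uninit(struct netdev_model *dev)
{
	dev->uninit_called = true;
}

static void demo_priv_destructor(struct netdev_model *dev)
{
	free(dev->priv);
	dev->priv = NULL;
}

/* Models register_netdevice(): if a later step fails after ndo_init()
 * succeeded, undo with ndo_uninit() and priv_destructor(). Freeing the
 * device itself stays with the caller. */
static int register_model(struct netdev_model *dev, bool later_step_fails)
{
	int err = dev->ndo_init ? dev->ndo_init(dev) : 0;

	if (err)
		return err;
	if (later_step_fails) {
		if (dev->ndo_uninit)
			dev->ndo_uninit(dev);
		if (dev->priv_destructor)
			dev->priv_destructor(dev);
		return -1;
	}
	return 0;
}

/* Models the end of unregister_netdevice(): release private state, then
 * free the device only if the driver asked for it. Returns true if the
 * device was freed here. */
static bool unregister_model(struct netdev_model *dev)
{
	bool freed = dev->needs_free_netdev;

	if (dev->priv_destructor)
		dev->priv_destructor(dev);
	if (freed)
		free(dev);
	return freed;
}
```

Because the private cleanup no longer frees the device, the failure path in `register_model()` can release everything without a double free when the caller frees the device afterwards.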
  32. 04 Jun, 2017 1 commit
  33. 26 May, 2017 1 commit
    • ip6_tunnel, ip6_gre: fix setting of DSCP on encapsulated packets · 0e9a7095
      Peter Dawson authored
      This fix addresses two problems in the way the DSCP field is formulated
       on the encapsulating header of IPv6 tunnels.
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=195661
      
      1) The IPv6 tunneling code was manipulating the DSCP field of the
       encapsulating packet using the 32b flowlabel. Since the flowlabel is
       only the lower 20b it was incorrect to assume that the upper 12b
       containing the DSCP and ECN fields would remain intact when formulating
       the encapsulating header. This fix handles the 'inherit' and
       'fixed-value' DSCP cases explicitly using the extant dsfield u8 variable.
      
      2) The use of INET_ECN_encapsulate(0, dsfield) in ip6_tnl_xmit was
       incorrect and resulted in the DSCP value always being set to 0.
      
      Commit 90427ef5 ("ipv6: fix flow labels when the traffic class
       is non-0") caused the regression by masking out the flowlabel
       which exposed the incorrect handling of the DSCP portion of the
       flowlabel in ip6_tunnel and ip6_gre.
      
      Fixes: 90427ef5 ("ipv6: fix flow labels when the traffic class is non-0")
      Signed-off-by: Peter Dawson <peter.a.dawson@boeing.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
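Both problems in the commit above come down to bit layout. The sketch below works in host byte order and uses hypothetical helper names (`ip6_first_word`, `ip6_word_tclass`, `ecn_encapsulate`). The ECN rule is modeled on my reading of INET_ECN_encapsulate() and should be treated as an assumption, not a copy of the kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define ECN_MASK       0x03u        /* low two bits of the DS field */
#define ECN_CE         0x03u        /* congestion experienced */
#define ECN_ECT_0      0x02u
#define FLOWLABEL_MASK 0x000FFFFFu  /* low 20 bits of the first word */

/* First 32-bit word of an IPv6 header, host byte order: 4-bit version,
 * 8-bit traffic class (DSCP + ECN), 20-bit flow label. */
static uint32_t ip6_first_word(uint8_t tclass, uint32_t flowlabel)
{
	return 0x60000000u | ((uint32_t)tclass << 20) |
	       (flowlabel & FLOWLABEL_MASK);
}

/* Extracts the traffic class from the first word. */
static uint8_t ip6_word_tclass(uint32_t word)
{
	return (uint8_t)((word >> 20) & 0xFFu);
}

/* Assumed ECN encapsulation rule: keep the outer DSCP bits, take the ECN
 * bits from the inner field, mapping CE to ECT(0). */
static uint8_t ecn_encapsulate(uint8_t outer, uint8_t inner)
{
	outer &= (uint8_t)~ECN_MASK;
	outer |= (inner & ECN_MASK) != ECN_CE ? (inner & ECN_MASK)
					      : ECN_ECT_0;
	return outer;
}
```

With an inner DS field of 0xB8, `ecn_encapsulate(0, 0xB8)` yields 0, so the DSCP is lost as point 2 describes, while `ecn_encapsulate(0xB8, 0xB8)` keeps it. Masking the flow label to 20 bits in `ip6_first_word()` keeps the traffic class from being overwritten, which relates to point 1.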