1. 18 Jan, 2019 1 commit
    • vhost: log dirty page correctly · cc5e7107
      Jason Wang authored
      The vhost dirty page logging API is designed to sync through GPA, but we
      try to log GIOVA when device IOTLB is enabled. This is wrong and may
      lead to missing data after migration.
      
      To solve this issue, when logging with device IOTLB enabled, we will:
      
      1) reuse the device IOTLB GIOVA->HVA translation result to get the HVA:
         for a writable descriptor, get the HVA through the iovec; for a used
         ring update, translate its GIOVA to an HVA
      2) traverse the GPA->HVA mapping to get the possible GPAs and log
         through GPA. Note that this reverse mapping is not guaranteed to be
         unique, so we should log each possible GPA in this case.
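      As a rough illustration of step 2, here is a minimal user-space sketch (the
      region table and the log_write_gpa() logger are hypothetical stand-ins, not
      the actual vhost structures):
      
      #include <stdint.h>
      #include <stdio.h>
      
      struct gpa_region {              /* simplified stand-in for a GPA->HVA memory region */
              uint64_t gpa_start;
              uint64_t hva_start;
              uint64_t size;
      };
      
      static void log_write_gpa(uint64_t gpa, uint64_t len)   /* hypothetical dirty logger */
      {
              printf("dirty: gpa 0x%llx len %llu\n",
                     (unsigned long long)gpa, (unsigned long long)len);
      }
      
      /* Log every GPA aliasing the written HVA range; the reverse HVA->GPA
       * mapping may not be unique, so do not stop at the first match. */
      static void log_hva_write(const struct gpa_region *r, int n,
                                uint64_t hva, uint64_t len)
      {
              for (int i = 0; i < n; i++) {
                      uint64_t start = r[i].hva_start, end = start + r[i].size;
                      uint64_t lo = hva > start ? hva : start;
                      uint64_t hi = hva + len < end ? hva + len : end;
      
                      if (lo < hi)
                              log_write_gpa(r[i].gpa_start + (lo - start), hi - lo);
              }
      }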
      
      This fixes the failure of scp to the guest during migration. In -next, we
      will probably support passing GIOVA->GPA instead of GIOVA->HVA.
      
      Fixes: 6b1e6cc7 ("vhost: new device IOTLB API")
      Reported-by: Jintack Lim <jintack@cs.columbia.edu>
      Cc: Jintack Lim <jintack@cs.columbia.edu>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 13 Dec, 2018 1 commit
  3. 27 Nov, 2018 1 commit
  4. 17 Nov, 2018 1 commit
    • vhost_net: mitigate page reference counting during page frag refill · e4dab1e6
      Jason Wang authored
      
      
      We do a get_page() which involves an atomic operation. This patch tries
      to mitigate the per-packet atomic operation by maintaining a reference
      bias which is initially USHRT_MAX. Each time a page is taken, instead of
      calling get_page() we decrease the bias, and when we find it's time to
      use a new page we drop the accumulated bias at one time through
      __page_frag_cache_drain().
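      The trick can be shown with a small user-space analogue (illustrative only;
      frag_cache and the helpers below stand in for the real page refcount
      machinery):
      
      #include <limits.h>
      
      struct frag_cache {
              unsigned int page_refs;   /* references actually held on the current "page" */
              unsigned int bias;        /* references we may still hand out locally */
      };
      
      static void cache_refill(struct frag_cache *c)
      {
              /* take USHRT_MAX references up front instead of one per packet */
              c->page_refs = USHRT_MAX;
              c->bias = USHRT_MAX;
      }
      
      static void cache_get(struct frag_cache *c)
      {
              /* per packet: no atomic get_page(), just decrement the local bias */
              c->bias--;
      }
      
      static void cache_switch_page(struct frag_cache *c)
      {
              /* when moving to a new page, release the unused references in one
               * go (the role played by __page_frag_cache_drain() in the kernel) */
              c->page_refs -= c->bias;
              cache_refill(c);
      }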
      
      Testpmd (virtio_user + vhost_net) + XDP_DROP on TAP shows about a 1.6%
      improvement.
      
      Before: 4.63Mpps
      After:  4.71Mpps
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 08 Oct, 2018 1 commit
  6. 27 Sep, 2018 3 commits
  7. 21 Sep, 2018 1 commit
  8. 13 Sep, 2018 2 commits
    • vhost_net: batch submitting XDP buffers to underlayer sockets · 0a0be13b
      Jason Wang authored
      
      
      This patch implements XDP batching for vhost_net. The idea is to first
      try to do the userspace copy and build the XDP buff directly in vhost.
      Instead of submitting each packet immediately, vhost_net batches them in
      an array and submits every 64 (VHOST_NET_BATCH) packets to the underlying
      socket through the msg_control of sendmsg().
      
      When XDP is enabled on TUN/TAP, TUN/TAP can process XDP inside a loop
      without caring about GUP, so it can do batched map flushing. When XDP is
      not enabled or not supported, the underlying socket needs to build an skb
      and pass it to the network core. The batched packet submission allows us
      to do batching like netif_receive_skb_list() in the future.
      
      This saves lots of indirect calls for better cache utilization. For the
      cases where we can't do batching, e.g. when sndbuf is limited or the
      packet size is too large, we fall back to the usual one-packet-per-
      sendmsg() path.
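      The batch-and-flush shape looks roughly like the sketch below (simplified
      stand-ins; the real code hands the array to the socket through the
      msg_control of sendmsg()):
      
      #include <stddef.h>
      
      #define VHOST_NET_BATCH 64
      
      struct xdp_buff;                          /* opaque for this sketch */
      
      static void backend_submit(struct xdp_buff **bufs, size_t n)
      {
              /* stand-in for one sendmsg() carrying an array of XDP buffs */
              (void)bufs;
              (void)n;
      }
      
      struct tx_batch {
              struct xdp_buff *bufs[VHOST_NET_BATCH];
              size_t used;
      };
      
      static void batch_flush(struct tx_batch *t)
      {
              if (t->used) {
                      backend_submit(t->bufs, t->used);  /* one call covers up to 64 packets */
                      t->used = 0;
              }
      }
      
      static void batch_add(struct tx_batch *t, struct xdp_buff *b)
      {
              t->bufs[t->used++] = b;
              if (t->used == VHOST_NET_BATCH)
                      batch_flush(t);
      }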
      
      Doing testpmd on various setups gives us:
      
      Test                /+pps%
      XDP_DROP on TAP     /+44.8%
      XDP_REDIRECT on TAP /+29%
      macvtap (skb)       /+26%
      
      Netperf tests show obvious improvements for small packet transmission:
      
      size/session/+thu%/+normalize%
         64/     1/   +2%/    0%
         64/     2/   +3%/   +1%
         64/     4/   +7%/   +5%
         64/     8/   +8%/   +6%
        256/     1/   +3%/    0%
        256/     2/  +10%/   +7%
        256/     4/  +26%/  +22%
        256/     8/  +27%/  +23%
        512/     1/   +3%/   +2%
        512/     2/  +19%/  +14%
        512/     4/  +43%/  +40%
        512/     8/  +45%/  +41%
       1024/     1/   +4%/    0%
       1024/     2/  +27%/  +21%
       1024/     4/  +38%/  +73%
       1024/     8/  +15%/  +24%
       2048/     1/  +10%/   +7%
       2048/     2/  +16%/  +12%
       2048/     4/    0%/   +2%
       2048/     8/    0%/   +2%
       4096/     1/  +36%/  +60%
       4096/     2/  -11%/  -26%
       4096/     4/    0%/  +14%
       4096/     8/    0%/   +4%
      16384/     1/   -1%/   +5%
      16384/     2/    0%/   +2%
      16384/     4/    0%/   -3%
      16384/     8/    0%/   +4%
      65535/     1/    0%/  +10%
      65535/     2/    0%/   +8%
      65535/     4/    0%/   +1%
      65535/     8/    0%/   +3%
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tun: switch to new type of msg_control · fe8dd45b
      Jason Wang authored
      
      
      This patch introduces a new tun/tap-specific msg_control:
      
      #define TUN_MSG_UBUF 1
      #define TUN_MSG_PTR  2
      struct tun_msg_ctl {
             int type;
             void *ptr;
      };
      
      This allows us to pass different kinds of msg_control through
      sendmsg(). The first supported type is ubuf (TUN_MSG_UBUF), which will
      be used by the existing vhost_net zerocopy code. The second is an XDP
      buff, which allows vhost_net to pass XDP buffs to TUN. This will be
      used to implement accepting an array of XDP buffs from vhost_net in
      the following patches.
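      A minimal sketch of how a sendmsg() implementation could dispatch on this
      control structure (simplified; not the actual tun/tap code):
      
      #define TUN_MSG_UBUF 1
      #define TUN_MSG_PTR  2
      
      struct tun_msg_ctl {
              int type;
              void *ptr;
      };
      
      static void handle_msg_control(struct tun_msg_ctl *ctl)
      {
              if (!ctl)
                      return;                 /* plain skb path, no special control */
      
              switch (ctl->type) {
              case TUN_MSG_UBUF:
                      /* ctl->ptr carries the zerocopy ubuf state */
                      break;
              case TUN_MSG_PTR:
                      /* ctl->ptr carries a batch of XDP buffs from vhost_net */
                      break;
              }
      }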
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 06 Aug, 2018 1 commit
    • vhost: switch to use new message format · 429711ae
      Jason Wang authored
      We used to have a message format like:
      
      struct vhost_msg {
      	int type;
      	union {
      		struct vhost_iotlb_msg iotlb;
      		__u8 padding[64];
      	};
      };
      
      Unfortunately, there is a 32-bit hole on 64-bit machines because of the
      alignment. This leads to different formats between the 32-bit and 64-bit
      APIs and, what's more, it breaks 32-bit programs running on a 64-bit
      machine.
      
      Fix this by introducing a new message type with an explicit 32-bit
      reserved field after the type:
      
      struct vhost_msg_v2 {
      	__u32 type;
      	__u32 reserved;
      	union {
      		struct vhost_iotlb_msg iotlb;
      		__u8 padding[64];
      	};
      };
      
      We will have a consistent ABI after switching to this. To enable this
      capability, introduce a new ioctl (VHOST_SET_BACKEND_FEATURES) for
      userspace to enable this feature (VHOST_BACKEND_F_IOTLB_MSG_V2).
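      The hole is easy to see with offsetof(); the stand-alone illustration below
      uses local copies of the structs with simplified type names:
      
      #include <stdio.h>
      #include <stddef.h>
      
      struct iotlb_msg {                         /* shape of struct vhost_iotlb_msg */
              unsigned long long iova, size, uaddr;
              unsigned char perm, type;
      };
      
      struct msg_v1 {                            /* old layout */
              int type;
              union { struct iotlb_msg iotlb; unsigned char padding[64]; };
      };
      
      struct msg_v2 {                            /* new layout with explicit reserved word */
              unsigned int type;
              unsigned int reserved;
              union { struct iotlb_msg iotlb; unsigned char padding[64]; };
      };
      
      int main(void)
      {
              /* On x86-64 the v1 offset is 8 (a 4-byte hole after 'type'), while an
               * i386 build prints 4; v2 gives 8 on both, so the ABIs agree. */
              printf("v1 union offset: %zu\n", offsetof(struct msg_v1, iotlb));
              printf("v2 union offset: %zu\n", offsetof(struct msg_v2, iotlb));
              return 0;
      }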
      
      Fixes: 6b1e6cc7 ("vhost: new device IOTLB API")
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 22 Jul, 2018 9 commits
  11. 04 Jul, 2018 4 commits
  12. 23 Jun, 2018 1 commit
  13. 12 Jun, 2018 1 commit
    • treewide: kmalloc() -> kmalloc_array() · 6da2ec56
      Kees Cook authored
      
      
      The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
      patch replaces cases of:
      
              kmalloc(a * b, gfp)
      
      with:
              kmalloc_array(a, b, gfp)
      
      as well as handling cases of:
      
              kmalloc(a * b * c, gfp)
      
      with:
      
              kmalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kmalloc_array(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kmalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The tools/ directory was manually excluded, since it has its own
      implementation of kmalloc().
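      As a hedged illustration (not taken from the patch itself), the point of the
      conversion is that the multiplication moves behind an overflow check; 'n'
      and 'buf' below are placeholder names:
      
              /* before: a silent overflow in n * sizeof(*buf) can under-allocate */
              buf = kmalloc(n * sizeof(*buf), GFP_KERNEL);
      
              /* after: kmalloc_array() returns NULL if n * sizeof(*buf) overflows */
              buf = kmalloc_array(n, sizeof(*buf), GFP_KERNEL);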
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kmalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kmalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kmalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kmalloc
      + kmalloc_array
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kmalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(sizeof(THING) * C2, ...)
      |
        kmalloc(sizeof(TYPE) * C2, ...)
      |
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(C1 * C2, ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: Kees Cook <keescook@chromium.org>
  14. 30 May, 2018 1 commit
  15. 24 Apr, 2018 1 commit
    • vhost_net: use packet weight for rx handler, too · db688c24
      Paolo Abeni authored
      Similar to commit a2ac9990 ("vhost-net: set packet weight of
      tx polling to 2 * vq size"), we need a packet-based limit for
      handle_rx, too - otherwise, under an rx flood with small packets,
      tx can be delayed for a very long time, even without busypolling.
      
      The packet limit applied to handle_rx must be the same as the one
      applied by handle_tx, or we will get unfair scheduling between rx and
      tx. Tying such a limit to the queue length makes it less effective for
      large queue lengths and can introduce large process scheduler
      latencies, so a constant value is used, like the existing bytes limit.
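      Schematically, the handler loop then bounds itself on both bytes and packets
      (a sketch of the shape of the check, not the exact driver code; recv_pkts
      and total_len are illustrative names):
      
              /* inside the rx handler loop, after processing one batch of packets */
              total_len += len;
              if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
                  unlikely(++recv_pkts >= VHOST_NET_PKT_WEIGHT)) {
                      vhost_poll_queue(&vq->poll);   /* reschedule and yield the cpu */
                      break;
              }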
      
      The selected limit has been validated with PVP[1] performance
      test with different queue sizes:
      
      queue size		256	512	1024
      
      baseline		366	354	362
      weight 128		715	723	670
      weight 256		740	745	733
      weight 512		600	460	583
      weight 1024		423	427	418
      
      A packet weight of 256 gives peak performance in all the tested
      scenarios.
      
      No measurable regression in unidirectional performance tests has
      been detected.
      
      [1] https://developers.redhat.com/blog/2017/06/05/measuring-and-comparing-open-vswitch-performance/
      
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 17 Apr, 2018 1 commit
  17. 09 Apr, 2018 1 commit
    • vhost-net: set packet weight of tx polling to 2 * vq size · a2ac9990
      haibinzhang(张海斌) authored
      
      
      handle_tx will delay rx for tens or even hundreds of milliseconds when tx is
      busy polling udp packets with a small length (e.g. a 1-byte udp payload),
      because the VHOST_NET_WEIGHT limit takes into account only sent bytes, not
      the number of packets.
      
      The ping latencies shown below were measured between two virtual machines
      running netperf (UDP_STREAM, len=1), while another machine pinged the client:
      
      vq size=256
      Packet-Weight   Ping-Latencies(millisecond)
                         min      avg       max
      Origin           3.319   18.489    57.303
      64               1.643    2.021     2.552
      128              1.825    2.600     3.224
      256              1.997    2.710     4.295
      512              1.860    3.171     4.631
      1024             2.002    4.173     9.056
      2048             2.257    5.650     9.688
      4096             2.093    8.508    15.943
      
      vq size=512
      Packet-Weight   Ping-Latencies(millisecond)
                         min      avg       max
      Origin           6.537   29.177    66.245
      64               2.798    3.614     4.403
      128              2.861    3.820     4.775
      256              3.008    4.018     4.807
      512              3.254    4.523     5.824
      1024             3.079    5.335     7.747
      2048             3.944    8.201    12.762
      4096             4.158   11.057    19.985
      
      The results seem pretty consistent, with a small dip at the two VQ sizes.
      Ring size is a hint from the device about the burst size it can tolerate.
      Based on the benchmarks, set the weight to 2 * vq size.
      
      To evaluate this change, further tests were done using netperf (RR, TX)
      between two machines with Intel(R) Xeon(R) Gold 6133 CPUs @ 2.50GHz, with the
      vq size tweaked through qemu. The results shown below do not show obvious
      changes.
      
      vq size=256 TCP_RR                vq size=512 TCP_RR
      size/sessions/+thu%/+normalize%   size/sessions/+thu%/+normalize%
         1/       1/  -7%/        -2%      1/       1/   0%/        -2%
         1/       4/  +1%/         0%      1/       4/  +1%/         0%
         1/       8/  +1%/        -2%      1/       8/   0%/        +1%
        64/       1/  -6%/         0%     64/       1/  +7%/        +3%
        64/       4/   0%/        +2%     64/       4/  -1%/        +1%
        64/       8/   0%/         0%     64/       8/  -1%/        -2%
       256/       1/  -3%/        -4%    256/       1/  -4%/        -2%
       256/       4/  +3%/        +4%    256/       4/  +1%/        +2%
       256/       8/  +2%/         0%    256/       8/  +1%/        -1%
      
      vq size=256 UDP_RR                vq size=512 UDP_RR
      size/sessions/+thu%/+normalize%   size/sessions/+thu%/+normalize%
         1/       1/  -5%/        +1%      1/       1/  -3%/        -2%
         1/       4/  +4%/        +1%      1/       4/  -2%/        +2%
         1/       8/  -1%/        -1%      1/       8/  -1%/         0%
        64/       1/  -2%/        -3%     64/       1/  +1%/        +1%
        64/       4/  -5%/        -1%     64/       4/  +2%/         0%
        64/       8/   0%/        -1%     64/       8/  -2%/        +1%
       256/       1/  +7%/        +1%    256/       1/  -7%/         0%
       256/       4/  +1%/        +1%    256/       4/  -3%/        -4%
       256/       8/  +2%/        +2%    256/       8/  +1%/        +1%
      
      vq size=256 TCP_STREAM            vq size=512 TCP_STREAM
      size/sessions/+thu%/+normalize%   size/sessions/+thu%/+normalize%
        64/       1/   0%/        -3%     64/       1/   0%/         0%
        64/       4/  +3%/        -1%     64/       4/  -2%/        +4%
        64/       8/  +9%/        -4%     64/       8/  -1%/        +2%
       256/       1/  +1%/        -4%    256/       1/  +1%/        +1%
       256/       4/  -1%/        -1%    256/       4/  -3%/         0%
       256/       8/  +7%/        +5%    256/       8/  -3%/         0%
       512/       1/  +1%/         0%    512/       1/  -1%/        -1%
       512/       4/  +1%/        -1%    512/       4/   0%/         0%
       512/       8/  +7%/        -5%    512/       8/  +6%/        -1%
      1024/       1/   0%/        -1%   1024/       1/   0%/        +1%
      1024/       4/  +3%/         0%   1024/       4/  +1%/         0%
      1024/       8/  +8%/        +5%   1024/       8/  -1%/         0%
      2048/       1/  +2%/        +2%   2048/       1/  -1%/         0%
      2048/       4/  +1%/         0%   2048/       4/   0%/        -1%
      2048/       8/  -2%/         0%   2048/       8/   5%/        -1%
      4096/       1/  -2%/         0%   4096/       1/  -2%/         0%
      4096/       4/  +2%/         0%   4096/       4/   0%/         0%
      4096/       8/  +9%/        -2%   4096/       8/  -5%/        -1%
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Haibin Zhang <haibinzhang@tencent.com>
      Signed-off-by: Yunfang Tai <yunfangtai@tencent.com>
      Signed-off-by: Lidong Chen <lidongchen@tencent.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 26 Mar, 2018 1 commit
  19. 09 Mar, 2018 3 commits
    • vhost_net: examine pointer types during un-producing · 3a403076
      Jason Wang authored
      After commit fc72d1d5 ("tuntap: XDP transmission"), we can actually
      queue XDP pointers in the pointer ring, so we should examine the
      pointer type before freeing the pointer.
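      A minimal sketch of a low-bit pointer tag that makes such a type check
      possible (illustrative; tun has its own flag and helpers for this):
      
      #include <stdbool.h>
      #include <stdint.h>
      
      #define XDP_TAG 0x1UL               /* mark XDP entries with the low pointer bit */
      
      static inline void *xdp_to_ptr(void *xdp)
      {
              return (void *)((uintptr_t)xdp | XDP_TAG);
      }
      
      static inline bool ptr_is_xdp(void *ptr)
      {
              return (uintptr_t)ptr & XDP_TAG;
      }
      
      static inline void *ptr_to_xdp(void *ptr)
      {
              return (void *)((uintptr_t)ptr & ~XDP_TAG);
      }
      
      static void ring_entry_free(void *ptr)
      {
              /* when un-producing the ring, free according to the actual type */
              if (ptr_is_xdp(ptr)) {
                      void *xdp = ptr_to_xdp(ptr);
                      (void)xdp;              /* free the XDP frame behind xdp */
              } else {
                      /* free the skb */
              }
      }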
      
      Fixes: fc72d1d5 ("tuntap: XDP transmission")
      Reported-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vhost_net: keep private_data and rx_ring synced · 303fd71b
      Jason Wang authored
      We get the pointer ring from the exported sock; this means we should keep
      rx_ring and vq->private_data synced during both vq stop and backend set,
      otherwise we may see a stale rx_ring.
      
      Fixes: c67df11f ("vhost_net: try batch dequing from skb array")
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vhost_net: initialize rx_ring in vhost_net_open() · ab7e34b3
      Alexander Potapenko authored
      KMSAN reported a use of uninit memory in vhost_net_buf_unproduce()
      while trying to access n->vqs[VHOST_NET_VQ_TX].rx_ring:
      
      ==================================================================
      BUG: KMSAN: use of uninitialized memory in vhost_net_buf_unproduce+0x7bb/0x9a0 drivers/vhost/net.c:170
      CPU: 0 PID: 3021 Comm: syz-fuzzer Not tainted 4.16.0-rc4+ #3853
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
      Call Trace:
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x185/0x1d0 lib/dump_stack.c:53
       kmsan_report+0x142/0x1f0 mm/kmsan/kmsan.c:1093
       __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
       vhost_net_buf_unproduce+0x7bb/0x9a0 drivers/vhost/net.c:170
       vhost_net_stop_vq drivers/vhost/net.c:974 [inline]
       vhost_net_stop+0x146/0x380 drivers/vhost/net.c:982
       vhost_net_release+0xb1/0x4f0 drivers/vhost/net.c:1015
       __fput+0x49f/0xa00 fs/file_table.c:209
       ____fput+0x37/0x40 fs/file_table.c:243
       task_work_run+0x243/0x2c0 kernel/task_work.c:113
       tracehook_notify_resume include/linux/tracehook.h:191 [inline]
       exit_to_usermode_loop arch/x86/entry/common.c:166 [inline]
       prepare_exit_to_usermode+0x349/0x3b0 arch/x86/entry/common.c:196
       syscall_return_slowpath+0xf3/0x6d0 arch/x86/entry/common.c:265
       do_syscall_64+0x34d/0x450 arch/x86/entry/common.c:292
      ...
      origin:
       kmsan_save_stack_with_flags mm/kmsan/kmsan.c:303 [inline]
       kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:213
       kmsan_kmalloc_large+0x6f/0xd0 mm/kmsan/kmsan.c:392
       kmalloc_large_node_hook mm/slub.c:1366 [inline]
       kmalloc_large_node mm/slub.c:3808 [inline]
       __kmalloc_node+0x100e/0x1290 mm/slub.c:3818
       kmalloc_node include/linux/slab.h:554 [inline]
       kvmalloc_node+0x1a5/0x2e0 mm/util.c:419
       kvmalloc include/linux/mm.h:541 [inline]
       vhost_net_open+0x64/0x5f0 drivers/vhost/net.c:921
       misc_open+0x7b5/0x8b0 drivers/char/misc.c:154
       chrdev_open+0xc28/0xd90 fs/char_dev.c:417
       do_dentry_open+0xccb/0x1430 fs/open.c:752
       vfs_open+0x272/0x2e0 fs/open.c:866
       do_last fs/namei.c:3378 [inline]
       path_openat+0x49ad/0x6580 fs/namei.c:3519
       do_filp_open+0x267/0x640 fs/namei.c:3553
       do_sys_open+0x6ad/0x9c0 fs/open.c:1059
       SYSC_openat+0xc7/0xe0 fs/open.c:1086
       SyS_openat+0x63/0x90 fs/open.c:1080
       do_syscall_64+0x2f1/0x450 arch/x86/entry/common.c:287
      ==================================================================
      
      Fixes: c67df11f ("vhost_net: try batch dequing from skb array")
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 12 Feb, 2018 1 commit
    • net: make getname() functions return length rather than use int* parameter · 9b2c45d4
      Denys Vlasenko authored
      
      
      Changes since v1:
      Added changes in these files:
          drivers/infiniband/hw/usnic/usnic_transport.c
          drivers/staging/lustre/lnet/lnet/lib-socket.c
          drivers/target/iscsi/iscsi_target_login.c
          drivers/vhost/net.c
          fs/dlm/lowcomms.c
          fs/ocfs2/cluster/tcp.c
          security/tomoyo/network.c
      
      Before:
      All these functions either return a negative error indicator,
      or store the length of the sockaddr into the "int *socklen" parameter
      and return zero on success.
      
      The "int *socklen" parameter is awkward. For example, even if the caller
      does not care, it still needs to provide on-stack storage for a value
      it does not need.
      
      None of the many FOO_getname() functions of the various protocols
      ever used the old value of *socklen; they always just overwrote it.
      
      This change drops this parameter and makes all these functions, on success,
      return the length of the sockaddr. It's always >= 0 and can be
      differentiated from an error.
      
      Tests in callers are changed from "if (err)" to "if (err < 0)", where needed.
      
      rpc_sockname() lost "int buflen" parameter, since its only use was
      to be passed to kernel_getsockname() as &buflen and subsequently
      not used in any way.
      
      Userspace API is not changed.
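      From a caller's point of view, the new convention looks roughly like this
      (a sketch; error checks switch from "if (err)" to "if (len < 0)"):
      
              struct sockaddr_storage addr;
              int len;
      
              /* kernel_getsockname() now returns the sockaddr length (>= 0) on
               * success or a negative errno, with no int *socklen argument */
              len = kernel_getsockname(sock, (struct sockaddr *)&addr);
              if (len < 0)
                      return len;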
      
          text    data     bss      dec     hex filename
      30108430 2633624  873672 33615726 200ef6e vmlinux.before.o
      30108109 2633612  873672 33615393 200ee21 vmlinux.o
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: linux-kernel@vger.kernel.org
      CC: netdev@vger.kernel.org
      CC: linux-bluetooth@vger.kernel.org
      CC: linux-decnet-user@lists.sourceforge.net
      CC: linux-wireless@vger.kernel.org
      CC: linux-rdma@vger.kernel.org
      CC: linux-sctp@vger.kernel.org
      CC: linux-nfs@vger.kernel.org
      CC: linux-x25@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 11 Feb, 2018 1 commit
    • vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Linus Torvalds authored
      
      
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But the keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  22. 01 Feb, 2018 1 commit
  23. 29 Jan, 2018 1 commit
  24. 10 Jan, 2018 1 commit
    • vhost_net: batch used ring update in rx · e2b3b35e
      Jason Wang authored
      
      
      This patch batches used ring updates during RX. This fits well the case
      where the guest is much faster (e.g. a dpdk-based backend). In this
      case, the used ring is almost empty:
      
      - we may get serious cache line misses/contention on both the used
        ring and the used idx.
      - at most 1 packet can be dequeued at a time, so batching in the guest
        does not have much effect.
      
      Updating the used ring in a batch can help, since the guest won't
      access the used ring until the used idx has been advanced for several
      descriptors; because we advance the used ring every N packets, the
      guest only needs to access the used idx once every N packets, as it
      can cache the used idx. To get a better interaction between batch
      dequeuing and dpdk batching, VHOST_RX_BATCH is used as the maximum
      number of descriptors that can be batched.
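      The batching has roughly the following shape (a sketch with simplified
      bookkeeping names such as heads and done_idx; the real code lives in the
      vhost rx handler):
      
              /* record the completed descriptor instead of updating the used idx
               * once per packet */
              heads[done_idx].id = cpu_to_vhost32(vq, head);
              heads[done_idx].len = cpu_to_vhost32(vq, len);
              if (++done_idx >= VHOST_RX_BATCH) {
                      vhost_add_used_and_signal_n(&net->dev, vq, heads, done_idx);
                      done_idx = 0;
              }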
      
      Tests were done between two machines with 2.40GHz Intel(R) Xeon(R)
      E5-2630 CPUs connected back to back through ixgbe. Traffic was
      generated on the remote ixgbe through MoonGen, and RX pps was measured
      through testpmd in the guest while doing xdp_redirect_map from the
      local ixgbe to tap. RX pps increased from 3.05 Mpps to 4.00 Mpps
      (about a 31% improvement).
      
      One possible concern is the implication for TCP (especially
      latency-sensitive workloads). The results [1] do not show obvious
      changes for most of the netperf tests (RR, TX, and RX), and we do get
      some improvements for RX at some specific sizes.
      
      Guest RX:
      
      size/sessions/+thu%/+normalize%
         64/     1/   +2%/   +2%
         64/     2/   +2%/   -1%
         64/     4/   +1%/   +1%
         64/     8/    0%/    0%
        256/     1/   +6%/   -3%
        256/     2/   -3%/   +2%
        256/     4/  +11%/  +11%
        256/     8/    0%/    0%
        512/     1/   +4%/    0%
        512/     2/   +2%/   +2%
        512/     4/    0%/   -1%
        512/     8/   -8%/   -8%
       1024/     1/   -7%/  -17%
       1024/     2/   -8%/   -7%
       1024/     4/   +1%/    0%
       1024/     8/    0%/    0%
       2048/     1/  +30%/  +14%
       2048/     2/  +46%/  +40%
       2048/     4/    0%/    0%
       2048/     8/    0%/    0%
       4096/     1/  +23%/  +22%
       4096/     2/  +26%/  +23%
       4096/     4/    0%/   +1%
       4096/     8/    0%/    0%
      16384/     1/   -2%/   -3%
      16384/     2/   +1%/   -4%
      16384/     4/   -1%/   -3%
      16384/     8/    0%/   -1%
      65535/     1/  +15%/   +7%
      65535/     2/   +4%/   +7%
      65535/     4/    0%/   +1%
      65535/     8/    0%/    0%
      
      TCP_RR:
      
      size/sessions/+thu%/+normalize%
          1/     1/    0%/   +1%
          1/    25/   +2%/   +1%
          1/    50/   +4%/   +1%
         64/     1/    0%/   -4%
         64/    25/   +2%/   +1%
         64/    50/    0%/   -1%
        256/     1/    0%/    0%
        256/    25/    0%/    0%
        256/    50/   +4%/   +2%
      
      Guest TX:
      
      size/sessions/+thu%/+normalize%
         64/     1/   +4%/   -2%
         64/     2/   -6%/   -5%
         64/     4/   +3%/   +6%
         64/     8/    0%/   +3%
        256/     1/  +15%/  +16%
        256/     2/  +11%/  +12%
        256/     4/   +1%/    0%
        256/     8/   +5%/   +5%
        512/     1/   -1%/   -6%
        512/     2/    0%/   -8%
        512/     4/   -2%/   +4%
        512/     8/   +6%/   +9%
       1024/     1/   +3%/   +1%
       1024/     2/   +3%/   +9%
       1024/     4/    0%/   +7%
       1024/     8/    0%/   +7%
       2048/     1/   +8%/   +2%
       2048/     2/   +3%/   -1%
       2048/     4/   -1%/  +11%
       2048/     8/   +3%/   +9%
       4096/     1/   +8%/   +8%
       4096/     2/    0%/   -7%
       4096/     4/   +4%/   +4%
       4096/     8/   +2%/   +5%
      16384/     1/   -3%/   +1%
      16384/     2/   -1%/  -12%
      16384/     4/   -1%/   +5%
      16384/     8/    0%/   +1%
      65535/     1/    0%/   -3%
      65535/     2/   +5%/  +16%
      65535/     4/   +1%/   +2%
      65535/     8/   +1%/   -1%
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>