1. 18 Sep, 2019 1 commit
    • Stefan Hajnoczi's avatar
      virtio-fs: add virtiofs filesystem · a62a8ef9
      Stefan Hajnoczi authored
      Add a basic file system module for virtio-fs.  This does not yet contain
      shared data support between host and guest or metadata coherency speedups.
      However it is already significantly faster than virtio-9p.
      Design Overview
      With the goal of designing something with better performance and local file
      system semantics, a bunch of ideas were proposed.
       - Use fuse protocol (instead of 9p) for communication between guest and
         host.  Guest kernel will be fuse client and a fuse server will run on
         host to serve the requests.
       - For data access inside guest, mmap portion of file in QEMU address space
         and guest accesses this memory using dax.  That way guest page cache is
         bypassed and there is only one copy of data (on host).  This will also
         enable mmap(MAP_SHARED) between guests.
       - For metadata coherency, there is a shared memory region which contains
         version number associated with metadata and any guest changing metadata
         updates version number and other guests refresh metadata on next access.
         This is yet to be implemented.
      How virtio-fs differs from existing approaches
      The unique idea behind virtio-fs is to take advantage of the co-location of
      the virtual machine and hypervisor to avoid communication (vmexits).
      DAX allows file contents to be accessed without communication with the
      hypervisor.  The shared memory region for metadata avoids communication in
      the common case where metadata is unchanged.
      By replacing expensive communication with cheaper shared memory accesses,
      we expect to achieve better performance than approaches based on network
      file system protocols.  In addition, this also makes it easier to achieve
      local file system semantics (coherency).
      These techniques are not applicable to network file system protocols since
      the communications channel is bypassed by taking advantage of shared memory
      on a local machine.  This is why we decided to build virtio-fs rather than
      focus on 9P or NFS.
      Caching Modes
      Like virtio-9p, different caching modes are supported which determine the
      coherency level as well.  The “cache=FOO” and “writeback” options control
      the level of coherence between the guest and host filesystems.
       - cache=none
         metadata, data and pathname lookup are not cached in guest.  They are
         always fetched from host and any changes are immediately pushed to host.
       - cache=always
         metadata, data and pathname lookup are cached in guest and never expire.
       - cache=auto
         metadata and pathname lookup cache expires after a configured amount of
         time (default is 1 second).  Data is cached while the file is open
         (close to open consistency).
       - writeback/no_writeback
         These options control the writeback strategy.  If writeback is disabled,
         then normal writes will immediately be synchronized with the host fs.
         If writeback is enabled, then writes may be cached in the guest until
         the file is closed or an fsync(2) performed.  This option has no effect
         on mmap-ed writes or writes going through the DAX mechanism.
      Signed-off-by: default avatarStefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Acked-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
  2. 12 Sep, 2019 11 commits
  3. 10 Sep, 2019 20 commits
    • Miklos Szeredi's avatar
      fuse: stop copying pages to fuse_req · 05ea48cc
      Miklos Szeredi authored
      The page array pointers are also duplicated across fuse_args_pages and
      fuse_req.  Get rid of the fuse_req ones.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
    • Miklos Szeredi's avatar
      fuse: stop copying args to fuse_req · d4993774
      Miklos Szeredi authored
      No need to duplicate the argument arrays in fuse_req, so just dereference
      req->args instead of copying to the fuse_req internal ones.
      This allows further cleanup of the fuse_req structure.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
    • Miklos Szeredi's avatar
      fuse: clean up fuse_req · 145b673b
      Miklos Szeredi authored
      Get rid of request specific fields in fuse_req that are not used anymore.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
    • Miklos Szeredi's avatar
      fuse: simplify request allocation · 7213394c
      Miklos Szeredi authored
      Page arrays are not allocated together with the request anymore.  Get rid
      of the dead code
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
    • Miklos Szeredi's avatar
      fuse: unexport request ops · 66abc359
      Miklos Szeredi authored
      All requests are now sent with one of the fuse_simple_... helpers.  Get rid
      of the old api from the fuse internal header.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
    • Miklos Szeredi's avatar
      fuse: convert release to simple api · 4cb54866
      Miklos Szeredi authored
      Since we cannot reserve the request structure up-front, make sure that the
      request allocation doesn't fail using __GFP_NOFAIL.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
    • Miklos Szeredi's avatar
      fuse: convert writepages to simple api · 33826ebb
      Miklos Szeredi authored
      Derive fuse_writepage_args from fuse_io_args.
      Sending the request is tricky since it was done with fi->lock held, hence
      we must either use atomic allocation or release the lock.  Both are
      possible so try atomic first and if it fails, release the lock and do the
      regular allocation with GFP_NOFS and __GFP_NOFAIL.  Both flags are
      necessary for correct operation.
      Move the page realloc function from dev.c to file.c and convert to using
      The last caller of fuse_write_fill() is gone, so get rid of it.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
    • Miklos Szeredi's avatar
      fuse: convert readdir to simple api · 43f5098e
      Miklos Szeredi authored
      The old fuse_read_fill() helper can be deleted, now that the last user is
      The fuse_io_args struct is moved to fuse_i.h so it can be shared between
      readdir/read code.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
    • Miklos Szeredi's avatar
      fuse: add simple background helper · 12597287
      Miklos Szeredi authored
      Create a helper named fuse_simple_background() that is similar to
      fuse_simple_request().  Unlike the latter, it returns immediately and calls
      the supplied 'end' callback when the reply is received.
      The supplied 'args' pointer is stored in 'fuse_req' which allows the
      callback to interpret the output arguments decoded from the reply.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
    • Miklos Szeredi's avatar
      fuse: move page alloc · 4c4f03f7
      Miklos Szeredi authored
      fuse_req_pages_alloc() is moved to file.c, since its internal use by the
      device code will eventually be removed.
      Rename to fuse_pages_alloc() to signify that it's not only usable for
      fuse_req page array.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
    • Miklos Szeredi's avatar
      fuse: add pages to fuse_args · 68583165
      Miklos Szeredi authored
      Derive fuse_args_pages from fuse_args. This is used to handle requests
      which use pages for input or output.  The related flags are added to
      New FR_ALLOC_PAGES flags is added to indicate whether the page arrays in
      fuse_req need to be freed by fuse_put_request() or not.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
    • Miklos Szeredi's avatar
      fuse: convert destroy to simple api · 1ccd1ea2
      Miklos Szeredi authored
      We can use the "force" flag to make sure the DESTROY request is always sent
      to userspace.  So no need to keep it allocated during the lifetime of the
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
    • Miklos Szeredi's avatar
      fuse: add nocreds to fuse_args · e413754b
      Miklos Szeredi authored
      In some cases it makes no sense to set pid/uid/gid fields in the request
      header.  Allow fuse_simple_background() to omit these.  This is only
      required in the "force" case, so for now just WARN if set otherwise.
      Fold fuse_get_req_nofail_nopages() into its only caller.  Comment is
      obsolete anyway.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
    • Miklos Szeredi's avatar
      fuse: convert fuse_force_forget() to simple api · 3545fe21
      Miklos Szeredi authored
      Move this function to the readdir.c where its only caller resides.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
    • Miklos Szeredi's avatar
      fuse: add noreply to fuse_args · 454a7613
      Miklos Szeredi authored
      This will be used by fuse_force_forget().
      We can expand fuse_request_send() into fuse_simple_request().  The
      FR_WAITING bit has already been set, no need to check.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
    • Miklos Szeredi's avatar
      fuse: convert flush to simple api · c500ebaa
      Miklos Szeredi authored
      Add 'force' to fuse_args and use fuse_get_req_nofail_nopages() to allocate
      the request in that case.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
    • Miklos Szeredi's avatar
      fuse: simplify 'nofail' request · 40ac7ab2
      Miklos Szeredi authored
      Instead of complex games with a reserved request, just use __GFP_NOFAIL.
      Both calers (flush, readdir) guarantee that connection was already
      initialized, so no need to wait for fc->initialized.
      Also remove unneeded clearing of FR_BACKGROUND flag.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
    • Miklos Szeredi's avatar
      fuse: rearrange and resize fuse_args fields · 1f4e9d03
      Miklos Szeredi authored
      This makes the structure better packed.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
    • Miklos Szeredi's avatar
      fuse: flatten 'struct fuse_args' · d5b48543
      Miklos Szeredi authored
      ...to make future expansion simpler.  The hiearachical structure is a
      historical thing that does not serve any practical purpose.
      The generated code is excatly the same before and after the patch.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
    • Eric Biggers's avatar
      fuse: fix deadlock with aio poll and fuse_iqueue::waitq.lock · 76e43c8c
      Eric Biggers authored
      When IOCB_CMD_POLL is used on the FUSE device, aio_poll() disables IRQs
      and takes kioctx::ctx_lock, then fuse_iqueue::waitq.lock.
      This may have to wait for fuse_iqueue::waitq.lock to be released by one
      of many places that take it with IRQs enabled.  Since the IRQ handler
      may take kioctx::ctx_lock, lockdep reports that a deadlock is possible.
      Fix it by protecting the state of struct fuse_iqueue with a separate
      spinlock, and only accessing fuse_iqueue::waitq using the versions of
      the waitqueue functions which do IRQ-safe locking internally.
      	#include <fcntl.h>
      	#include <stdio.h>
      	#include <sys/mount.h>
      	#include <sys/stat.h>
      	#include <sys/syscall.h>
      	#include <unistd.h>
      	#include <linux/aio_abi.h>
      	int main()
      		char opts[128];
      		int fd = open("/dev/fuse", O_RDWR);
      		aio_context_t ctx = 0;
      		struct iocb cb = { .aio_lio_opcode = IOCB_CMD_POLL, .aio_fildes = fd };
      		struct iocb *cbp = &cb;
      		sprintf(opts, "fd=%d,rootmode=040000,user_id=0,group_id=0", fd);
      		mkdir("mnt", 0700);
      		mount("foo",  "mnt", "fuse", 0, opts);
      		syscall(__NR_io_setup, 1, &ctx);
      		syscall(__NR_io_submit, ctx, 1, &cbp);
      Beginning of lockdep output:
      	WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
      	5.3.0-rc5 #9 Not tainted
      	syz_fuse/135 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
      	000000003590ceda (&fiq->waitq){+.+.}, at: spin_lock include/linux/spinlock.h:338 [inline]
      	000000003590ceda (&fiq->waitq){+.+.}, at: aio_poll fs/aio.c:1751 [inline]
      	000000003590ceda (&fiq->waitq){+.+.}, at: __io_submit_one.constprop.0+0x203/0x5b0 fs/aio.c:1825
      	and this task is already holding:
      	0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:363 [inline]
      	0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1749 [inline]
      	0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one.constprop.0+0x1f4/0x5b0 fs/aio.c:1825
      	which would create a new lock dependency:
      	 (&(&ctx->ctx_lock)->rlock){..-.} -> (&fiq->waitq){+.+.}
      	but this new dependency connects a SOFTIRQ-irq-safe lock:
      Reported-by: default avatar <syzbot+af05535bb79520f95431@syzkaller.appspotmail.com>
      Reported-by: default avatar <syzbot+d86c4426a01f60feddc7@syzkaller.appspotmail.com>
      Fixes: bfe4037e
       ("aio: implement IOCB_CMD_POLL")
      Cc: <stable@vger.kernel.org> # v4.19+
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
  4. 24 Apr, 2019 2 commits
    • Kirill Smelkov's avatar
      fuse: allow filesystems to have precise control over data cache · ad2ba64d
      Kirill Smelkov authored
      On networked filesystems file data can be changed externally.  FUSE
      provides notification messages for filesystem to inform kernel that
      metadata or data region of a file needs to be invalidated in local page
      cache. That provides the basis for filesystem implementations to invalidate
      kernel cache explicitly based on observed filesystem-specific events.
      FUSE has also "automatic" invalidation mode(*) when the kernel
      automatically invalidates data cache of a file if it sees mtime change.  It
      also automatically invalidates whole data cache of a file if it sees file
      size being changed.
      The automatic mode has corresponding capability - FUSE_AUTO_INVAL_DATA.
      However, due to probably historical reason, that capability controls only
      whether mtime change should be resulting in automatic invalidation or
      not. A change in file size always results in invalidating whole data cache
      of a file irregardless of whether FUSE_AUTO_INVAL_DATA was negotiated(+).
      The filesystem I write[1] represents data arrays stored in networked
      database as local files suitable for mmap. It is read-only filesystem -
      changes to data are committed externally via database interfaces and the
      filesystem only glues data into contiguous file streams suitable for mmap
      and traditional array processing. The files are big - starting from
      hundreds gigabytes and more. The files change regularly, and frequently by
      data being appended to their end. The size of files thus changes
      If a file was accessed locally and some part of its data got into page
      cache, we want that data to stay cached unless there is memory pressure, or
      unless corresponding part of the file was actually changed. However current
      FUSE behaviour - when it sees file size change - is to invalidate the whole
      file. The data cache of the file is thus completely lost even on small size
      change, and despite that the filesystem server is careful to accurately
      translate database changes into FUSE invalidation messages to kernel.
      Let's fix it: if a filesystem, through new FUSE_EXPLICIT_INVAL_DATA
      capability, indicates to kernel that it is fully responsible for data cache
      invalidation, then the kernel won't invalidate files data cache on size
      change and only truncate that cache to new size in case the size decreased.
      (*) see 72d0d248 "fuse: add FUSE_AUTO_INVAL_DATA init flag",
      eed2179e "fuse: invalidate inode mapping if mtime changes"
      (+) in writeback mode the kernel does not invalidate data cache on file
      size change, but neither it allows the filesystem to set the size due to
      external event (see 8373200b "fuse: Trust kernel i_size only")
      [1] https://lab.nexedi.com/kirr/wendelin.core/blob/a50f1d9f/wcfs/wcfs.go#L20
      Signed-off-by: default avatarKirill Smelkov <kirr@nexedi.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
    • Kirill Smelkov's avatar
      fuse: convert printk -> pr_* · f2294482
      Kirill Smelkov authored
      Functions, like pr_err, are a more modern variant of printing compared to
      printk. They could be used to denoise sources by using needed level in
      the print function name, and by automatically inserting per-driver /
      function / ... print prefix as defined by pr_fmt macro. pr_* are also
      said to be used in Documentation/process/coding-style.rst and more
      recent code - for example overlayfs - uses them instead of printk.
      Convert CUSE and FUSE to use the new pr_* functions.
      CUSE output stays completely unchanged, while FUSE output is amended a
      bit for "trying to steal weird page" warning - the second line now comes
      also with "fuse:" prefix. I hope it is ok.
      Suggested-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: default avatarKirill Smelkov <kirr@nexedi.com>
      Reviewed-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
  5. 13 Feb, 2019 6 commits