1. 13 Dec, 2019 2 commits
    • Josef Bacik's avatar
      btrfs: skip log replay on orphaned roots · 9bc574de
      Josef Bacik authored
      
      
      My fsstress modifications coupled with generic/475 uncovered a failure
      to mount and replay the log if we hit a orphaned root.  We do not want
      to replay the log for an orphan root, but it's completely legitimate to
      have an orphaned root with a log attached.  Fix this by simply skipping
      replaying the log.  We still need to pin it's root node so that we do
      not overwrite it while replaying other logs, as we re-read the log root
      at every stage of the replay.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9bc574de
    • Filipe Manana's avatar
      Btrfs: fix missing data checksums after replaying a log tree · 40e046ac
      Filipe Manana authored
      
      
      When logging a file that has shared extents (reflinked with other files or
      with itself), we can end up logging multiple checksum items that cover
      overlapping ranges. This confuses the search for checksums at log replay
      time causing some checksums to never be added to the fs/subvolume tree.
      
      Consider the following example of a file that shares the same extent at
      offsets 0 and 256Kb:
      
         [ bytenr 13893632, offset 64Kb, len 64Kb  ]
         0                                         64Kb
      
         [ bytenr 13631488, offset 64Kb, len 192Kb ]
         64Kb                                      256Kb
      
         [ bytenr 13893632, offset 0, len 256Kb    ]
         256Kb                                     512Kb
      
      When logging the inode, at tree-log.c:copy_items(), when processing the
      file extent item at offset 0, we log a checksum item covering the range
      13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 +
      64Kb + 64Kb, respectively.
      
      Later when processing the extent item at offset 256K, we log the checksums
      for the range from 13893632 to 14155776 (which corresponds to 13893632 +
      256Kb). These checksums get merged with the checksum item for the range
      from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync.
      So after this we get the two following checksum items in the log tree:
      
         (...)
         item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512
                 range start 13631488 end 14155776 length 524288
         item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64
                 range start 13959168 end 14024704 length 65536
      
      The first one covers the range from the second one, they overlap.
      
      So far this does not cause a problem after replaying the log, because
      when replaying the file extent item for offset 256K, we copy all the
      checksums for the extent 13893632 from the log tree to the fs/subvolume
      tree, since searching for an checksum item for bytenr 13893632 leaves us
      at the first checksum item, which covers the whole range of the extent.
      
      However if we write 64Kb to file offset 256Kb for example, we will
      not be able to find and copy the checksums for the last 128Kb of the
      extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb.
      
      After writing 64Kb into file offset 256Kb we get the following extent
      layout for our file:
      
         [ bytenr 13893632, offset 64K, len 64Kb   ]
         0                                         64Kb
      
         [ bytenr 13631488, offset 64Kb, len 192Kb ]
         64Kb                                      256Kb
      
         [ bytenr 14155776, offset 0, len 64Kb     ]
         256Kb                                     320Kb
      
         [ bytenr 13893632, offset 64Kb, len 192Kb ]
         320Kb                                     512Kb
      
      After fsync'ing the file, if we have a power failure and then mount
      the filesystem to replay the log, the following happens:
      
      1) When replaying the file extent item for file offset 320Kb, we
         lookup for the checksums for the extent range from 13959168
         (13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call
         to btrfs_lookup_csums_range();
      
      2) btrfs_lookup_csums_range() finds the checksum item that starts
         precisely at offset 13959168 (item 7 in the log tree, shown before);
      
      3) However that checksum item only covers 64Kb of data, and not 192Kb
         of data;
      
      4) As a result only the checksums for the first 64Kb of data referenced
         by the file extent item are found and copied to the fs/subvolume tree.
         The remaining 128Kb of data, file range 384Kb to 512Kb, doesn't get
         the corresponding data checksums found and copied to the fs/subvolume
         tree.
      
      5) After replaying the log userspace will not be able to read the file
         range from 384Kb to 512Kb, because the checksums are missing and
         resulting in an -EIO error.
      
      The following steps reproduce this scenario:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt/sdc
      
        $ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar
        $ xfs_io -c "fsync" /mnt/sdc/foobar
        $ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar
      
        $ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar
        $ xfs_io -c "fsync" /mnt/sdc/foobar
      
        $ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar
        $ xfs_io -c "fsync" /mnt/sdc/foobar
      
        <power failure>
      
        $ mount /dev/sdc /mnt/sdc
        $ md5sum /mnt/sdc/foobar
        md5sum: /mnt/sdc/foobar: Input/output error
      
        $ dmesg | tail
        [165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408
        [165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504
        [165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600
        [165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696
        [165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792
        [165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888
        [165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984
        [165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080
        [165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
        [165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
      
      Fix this simply by deleting first any checksums, from the log tree, for the
      range of the extent we are logging at copy_items(). This ensures we do not
      get checksum items in the log tree that have overlapping ranges.
      
      This is a long time issue that has been present since we have the clone
      (and deduplication) ioctl, and can happen both when an extent is shared
      between different files and within the same file.
      
      A test case for fstests follows soon.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      40e046ac
  2. 18 Nov, 2019 5 commits
  3. 01 Oct, 2019 1 commit
    • Josef Bacik's avatar
      btrfs: fix incorrect updating of log root tree · 4203e968
      Josef Bacik authored
      
      
      We've historically had reports of being unable to mount file systems
      because the tree log root couldn't be read.  Usually this is the "parent
      transid failure", but could be any of the related errors, including
      "fsid mismatch" or "bad tree block", depending on which block got
      allocated.
      
      The modification of the individual log root items are serialized on the
      per-log root root_mutex.  This means that any modification to the
      per-subvol log root_item is completely protected.
      
      However we update the root item in the log root tree outside of the log
      root tree log_mutex.  We do this in order to allow multiple subvolumes
      to be updated in each log transaction.
      
      This is problematic however because when we are writing the log root
      tree out we update the super block with the _current_ log root node
      information.  Since these two operations happen independently of each
      other, you can end up updating the log root tree in between writing out
      the dirty blocks and setting the super block to point at the current
      root.
      
      This means we'll point at the new root node that hasn't been written
      out, instead of the one we should be pointing at.  Thus whatever garbage
      or old block we end up pointing at complains when we mount the file
      system later and try to replay the log.
      
      Fix this by copying the log's root item into a local root item copy.
      Then once we're safely under the log_root_tree->log_mutex we update the
      root item in the log_root_tree.  This way we do not modify the
      log_root_tree while we're committing it, fixing the problem.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarChris Mason <clm@fb.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4203e968
  4. 12 Sep, 2019 1 commit
    • Filipe Manana's avatar
      Btrfs: fix assertion failure during fsync and use of stale transaction · 410f954c
      Filipe Manana authored
      
      
      Sometimes when fsync'ing a file we need to log that other inodes exist and
      when we need to do that we acquire a reference on the inodes and then drop
      that reference using iput() after logging them.
      
      That generally is not a problem except if we end up doing the final iput()
      (dropping the last reference) on the inode and that inode has a link count
      of 0, which can happen in a very short time window if the logging path
      gets a reference on the inode while it's being unlinked.
      
      In that case we end up getting the eviction callback, btrfs_evict_inode(),
      invoked through the iput() call chain which needs to drop all of the
      inode's items from its subvolume btree, and in order to do that, it needs
      to join a transaction at the helper function evict_refill_and_join().
      However because the task previously started a transaction at the fsync
      handler, btrfs_sync_file(), it has current->journal_info already pointing
      to a transaction handle and therefore evict_refill_and_join() will get
      that transaction handle from btrfs_join_transaction(). From this point on,
      two different problems can happen:
      
      1) evict_refill_and_join() will often change the transaction handle's
         block reserve (->block_rsv) and set its ->bytes_reserved field to a
         value greater than 0. If evict_refill_and_join() never commits the
         transaction, the eviction handler ends up decreasing the reference
         count (->use_count) of the transaction handle through the call to
         btrfs_end_transaction(), and after that point we have a transaction
         handle with a NULL ->block_rsv (which is the value prior to the
         transaction join from evict_refill_and_join()) and a ->bytes_reserved
         value greater than 0. If after the eviction/iput completes the inode
         logging path hits an error or it decides that it must fallback to a
         transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
         non-zero value from btrfs_log_dentry_safe(), and because of that
         non-zero value it tries to commit the transaction using a handle with
         a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
         the transaction commit hit an assertion failure at
         btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
         the ->block_rsv is NULL. The produced stack trace for that is like the
         following:
      
         [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
         [192922.917553] ------------[ cut here ]------------
         [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
         [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
         [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
         [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
         [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
         (...)
         [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
         [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
         [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
         [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
         [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
         [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
         [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
         [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
         [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
         [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
         [192922.925105] Call Trace:
         [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
         [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
         [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
         [192922.926731]  do_fsync+0x38/0x60
         [192922.927138]  __x64_sys_fdatasync+0x13/0x20
         [192922.927543]  do_syscall_64+0x60/0x1c0
         [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
         (...)
         [192922.934077] ---[ end trace f00808b12068168f ]---
      
      2) If evict_refill_and_join() decides to commit the transaction, it will
         be able to do it, since the nested transaction join only increments the
         transaction handle's ->use_count reference counter and it does not
         prevent the transaction from getting committed. This means that after
         eviction completes, the fsync logging path will be using a transaction
         handle that refers to an already committed transaction. What happens
         when using such a stale transaction can be unpredictable, we are at
         least having a use-after-free on the transaction handle itself, since
         the transaction commit will call kmem_cache_free() against the handle
         regardless of its ->use_count value, or we can end up silently losing
         all the updates to the log tree after that iput() in the logging path,
         or using a transaction handle that in the meanwhile was allocated to
         another task for a new transaction, etc, pretty much unpredictable
         what can happen.
      
      In order to fix both of them, instead of using iput() during logging, use
      btrfs_add_delayed_iput(), so that the logging path of fsync never drops
      the last reference on an inode, that step is offloaded to a safe context
      (usually the cleaner kthread).
      
      The assertion failure issue was sporadically triggered by the test case
      generic/475 from fstests, which loads the dm error target while fsstress
      is running, which lead to fsync failing while logging inodes with -EIO
      errors and then trying later to commit the transaction, triggering the
      assertion failure.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      410f954c
  5. 09 Sep, 2019 7 commits
  6. 02 Jul, 2019 2 commits
    • Filipe Manana's avatar
      Btrfs: fix fsync not persisting dentry deletions due to inode evictions · 803f0f64
      Filipe Manana authored
      In order to avoid searches on a log tree when unlinking an inode, we check
      if the inode being unlinked was logged in the current transaction, as well
      as the inode of its parent directory. When any of the inodes are logged,
      we proceed to delete directory items and inode reference items from the
      log, to ensure that if a subsequent fsync of only the inode being unlinked
      or only of the parent directory when the other is not fsync'ed as well,
      does not result in the entry still existing after a power failure.
      
      That check however is not reliable when one of the inodes involved (the
      one being unlinked or its parent directory's inode) is evicted, since the
      logged_trans field is transient, that is, it is not stored on disk, so it
      is lost when the inode is evicted and loaded into memory again (which is
      set to zero on load). As a consequence the checks currently being done by
      btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log() always
      return true if the inode was evicted before, regardless of the inode
      having been logged or not before (and in the current transaction), this
      results in the dentry being unlinked still existing after a log replay
      if after the unlink operation only one of the inodes involved is fsync'ed.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/dir
        $ touch /mnt/dir/foo
        $ xfs_io -c fsync /mnt/dir/foo
      
        # Keep an open file descriptor on our directory while we evict inodes.
        # We just want to evict the file's inode, the directory's inode must not
        # be evicted.
        $ ( cd /mnt/dir; while true; do :; done ) &
        $ pid=$!
      
        # Wait a bit to give time to background process to chdir to our test
        # directory.
        $ sleep 0.5
      
        # Trigger eviction of the file's inode.
        $ echo 2 > /proc/sys/vm/drop_caches
      
        # Unlink our file and fsync the parent directory. After a power failure
        # we don't expect to see the file anymore, since we fsync'ed the parent
        # directory.
        $ rm -f $SCRATCH_MNT/dir/foo
        $ xfs_io -c fsync /mnt/dir
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ ls /mnt/dir
        foo
        $
         --> file still there, unlink not persisted despite explicit fsync on dir
      
      Fix this by checking if the inode has the full_sync bit set in its runtime
      flags as well, since that bit is set everytime an inode is loaded from
      disk, or for other less common cases such as after a shrinking truncate
      or failure to allocate extent maps for holes, and gets cleared after the
      first fsync. Also consider the inode as possibly logged only if it was
      last modified in the current transaction (besides having the full_fsync
      flag set).
      
      Fixes: 3a5f1d45
      
       ("Btrfs: Optimize btree walking while logging inodes")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      803f0f64
    • Filipe Manana's avatar
      Btrfs: fix data loss after inode eviction, renaming it, and fsync it · d1d832a0
      Filipe Manana authored
      When we log an inode, regardless of logging it completely or only that it
      exists, we always update it as logged (logged_trans and last_log_commit
      fields of the inode are updated). This is generally fine and avoids future
      attempts to log it from having to do repeated work that brings no value.
      
      However, if we write data to a file, then evict its inode after all the
      dealloc was flushed (and ordered extents completed), rename the file and
      fsync it, we end up not logging the new extents, since the rename may
      result in logging that the inode exists in case the parent directory was
      logged before. The following reproducer shows and explains how this can
      happen:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/dir
        $ touch /mnt/dir/foo
        $ touch /mnt/dir/bar
      
        # Do a direct IO write instead of a buffered write because with a
        # buffered write we would need to make sure dealloc gets flushed and
        # complete before we do the inode eviction later, and we can not do that
        # from user space with call to things such as sync(2) since that results
        # in a transaction commit as well.
        $ xfs_io -d -c "pwrite -S 0xd3 0 4K" /mnt/dir/bar
      
        # Keep the directory dir in use while we evict inodes. We want our file
        # bar's inode to be evicted but we don't want our directory's inode to
        # be evicted (if it were evicted too, we would not be able to reproduce
        # the issue since the first fsync below, of file foo, would result in a
        # transaction commit.
        $ ( cd /mnt/dir; while true; do :; done ) &
        $ pid=$!
      
        # Wait a bit to give time for the background process to chdir.
        $ sleep 0.1
      
        # Evict all inodes, except the inode for the directory dir because it is
        # currently in use by our background process.
        $ echo 2 > /proc/sys/vm/drop_caches
      
        # fsync file foo, which ends up persisting information about the parent
        # directory because it is a new inode.
        $ xfs_io -c fsync /mnt/dir/foo
      
        # Rename bar, this results in logging that this inode exists (inode item,
        # names, xattrs) because the parent directory is in the log.
        $ mv /mnt/dir/bar /mnt/dir/baz
      
        # Now fsync baz, which ends up doing absolutely nothing because of the
        # rename operation which logged that the inode exists only.
        $ xfs_io -c fsync /mnt/dir/baz
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ od -t x1 -A d /mnt/dir/baz
        0000000
      
          --> Empty file, data we wrote is missing.
      
      Fix this by not updating last_sub_trans of an inode when we are logging
      only that it exists and the inode was not yet logged since it was loaded
      from disk (full_sync bit set), this is enough to make btrfs_inode_in_log()
      return false for this scenario and make us log the inode. The logged_trans
      of the inode is still always setsince that alone is used to track if names
      need to be deleted as part of unlink operations.
      
      Fixes: 257c62e1
      
       ("Btrfs: avoid tree log commit when there are no changes")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d1d832a0
  7. 28 May, 2019 2 commits
    • Filipe Manana's avatar
      Btrfs: fix race updating log root item during fsync · 06989c79
      Filipe Manana authored
      When syncing the log, the final phase of a fsync operation, we need to
      either create a log root's item or update the existing item in the log
      tree of log roots, and that depends on the current value of the log
      root's log_transid - if it's 1 we need to create the log root item,
      otherwise it must exist already and we update it. Since there is no
      synchronization between updating the log_transid and checking it for
      deciding whether the log root's item needs to be created or updated, we
      end up with a tiny race window that results in attempts to update the
      item to fail because the item was not yet created:
      
                    CPU 1                                    CPU 2
      
        btrfs_sync_log()
      
          lock root->log_mutex
      
          set log root's log_transid to 1
      
          unlock root->log_mutex
      
                                                     btrfs_sync_log()
      
                                                       lock root->log_mutex
      
                                                       sets log root's
                                                       log_transid to 2
      
                                                       unlock root->log_mutex
      
          update_log_root()
      
            sees log root's log_transid
            with a value of 2
      
              calls btrfs_update_root(),
              which fails with -EUCLEAN
              and causes transaction abort
      
      Until recently the race lead to a BUG_ON at btrfs_update_root(), but after
      the recent commit 7ac1e464 ("btrfs: Don't panic when we can't find a
      root key") we just abort the current transaction.
      
      A sample trace of the BUG_ON() on a SLE12 kernel:
      
        ------------[ cut here ]------------
        kernel BUG at ../fs/btrfs/root-tree.c:157!
        Oops: Exception in kernel mode, sig: 5 [#1]
        SMP NR_CPUS=2048 NUMA pSeries
        (...)
        Supported: Yes, External
        CPU: 78 PID: 76303 Comm: rtas_errd Tainted: G                 X 4.4.156-94.57-default #1
        task: c00000ffa906d010 ti: c00000ff42b08000 task.ti: c00000ff42b08000
        NIP: d000000036ae5cdc LR: d000000036ae5cd8 CTR: 0000000000000000
        REGS: c00000ff42b0b860 TRAP: 0700   Tainted: G                 X  (4.4.156-94.57-default)
        MSR: 8000000002029033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 22444484  XER: 20000000
        CFAR: d000000036aba66c SOFTE: 1
        GPR00: d000000036ae5cd8 c00000ff42b0bae0 d000000036bda220 0000000000000054
        GPR04: 0000000000000001 0000000000000000 c00007ffff8d37c8 0000000000000000
        GPR08: c000000000e19c00 0000000000000000 0000000000000000 3736343438312079
        GPR12: 3930373337303434 c000000007a3a800 00000000007fffff 0000000000000023
        GPR16: c00000ffa9d26028 c00000ffa9d261f8 0000000000000010 c00000ffa9d2ab28
        GPR20: c00000ff42b0bc48 0000000000000001 c00000ff9f0d9888 0000000000000001
        GPR24: c00000ffa9d26000 c00000ffa9d261e8 c00000ffa9d2a800 c00000ff9f0d9888
        GPR28: c00000ffa9d26028 c00000ffa9d2aa98 0000000000000001 c00000ffa98f5b20
        NIP [d000000036ae5cdc] btrfs_update_root+0x25c/0x4e0 [btrfs]
        LR [d000000036ae5cd8] btrfs_update_root+0x258/0x4e0 [btrfs]
        Call Trace:
        [c00000ff42b0bae0] [d000000036ae5cd8] btrfs_update_root+0x258/0x4e0 [btrfs] (unreliable)
        [c00000ff42b0bba0] [d000000036b53610] btrfs_sync_log+0x2d0/0xc60 [btrfs]
        [c00000ff42b0bce0] [d000000036b1785c] btrfs_sync_file+0x44c/0x4e0 [btrfs]
        [c00000ff42b0bd80] [c00000000032e300] vfs_fsync_range+0x70/0x120
        [c00000ff42b0bdd0] [c00000000032e44c] do_fsync+0x5c/0xb0
        [c00000ff42b0be10] [c00000000032e8dc] SyS_fdatasync+0x2c/0x40
        [c00000ff42b0be30] [c000000000009488] system_call+0x3c/0x100
        Instruction dump:
        7f43d378 4bffebb9 60000000 88d90008 3d220000 e8b90000 3b390009 e87a01f0
        e8898e08 e8f90000 4bfd48e5 60000000 <0fe00000> e95b0060 39200004 394a0ea0
        ---[ end trace 8f2dc8f919cabab8 ]---
      
      So fix this by doing the check of log_transid and updating or creating the
      log root's item while holding the root's log_mutex.
      
      Fixes: 7237f183
      
       ("Btrfs: fix tree logs parallel sync")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      06989c79
    • Filipe Manana's avatar
      Btrfs: fix fsync not persisting changed attributes of a directory · 60d9f503
      Filipe Manana authored
      While logging an inode we follow its ancestors and for each one we mark
      it as logged in the current transaction, even if we have not logged it.
      As a consequence if we change an attribute of an ancestor, such as the
      UID or GID for example, and then explicitly fsync it, we end up not
      logging the inode at all despite returning success to user space, which
      results in the attribute being lost if a power failure happens after
      the fsync.
      
      Sample reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/dir
        $ chown 6007:6007 /mnt/dir
      
        $ sync
      
        $ chown 9003:9003 /mnt/dir
        $ touch /mnt/dir/file
        $ xfs_io -c fsync /mnt/dir/file
      
        # fsync our directory after fsync'ing the new file, should persist the
        # new values for the uid and gid.
        $ xfs_io -c fsync /mnt/dir
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ stat -c %u:%g /mnt/dir
        6007:6007
      
          --> should be 9003:9003, the uid and gid were not persisted, despite
              the explicit fsync on the directory prior to the power failure
      
      Fix this by not updating the logged_trans field of ancestor inodes when
      logging an inode, since we have not logged them. Let only future calls to
      btrfs_log_inode() to mark inodes as logged.
      
      This could be triggered by my recent fsync fuzz tester for fstests, for
      which an fstests patch exists titled "fstests: generic, fsync fuzz tester
      with fsstress".
      
      Fixes: 12fcfd22
      
       ("Btrfs: tree logging unlink/rename fixes")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      60d9f503
  8. 16 May, 2019 1 commit
    • Filipe Manana's avatar
      Btrfs: avoid fallback to transaction commit during fsync of files with holes · ebb92906
      Filipe Manana authored
      When we are doing a full fsync (bit BTRFS_INODE_NEEDS_FULL_SYNC set) of a
      file that has holes and has file extent items spanning two or more leafs,
      we can end up falling to back to a full transaction commit due to a logic
      bug that leads to failure to insert a duplicate file extent item that is
      meant to represent a hole between the last file extent item of a leaf and
      the first file extent item in the next leaf. The failure (EEXIST error)
      leads to a transaction commit (as most errors when logging an inode do).
      
      For example, we have the two following leafs:
      
      Leaf N:
      
        -----------------------------------------------
        | ..., ..., ..., (257, FILE_EXTENT_ITEM, 64K) |
        -----------------------------------------------
        The file extent item at the end of leaf N has a length of 4Kb,
        representing the file range from 64K to 68K - 1.
      
      Leaf N + 1:
      
        -----------------------------------------------
        | (257, FILE_EXTENT_ITEM, 72K), ..., ..., ... |
        -----------------------------------------------
        The file extent item at the first slot of leaf N + 1 has a length of
        4Kb too, representing the file range from 72K to 76K - 1.
      
      During the full fsync path, when we are at tree-log.c:copy_items() with
      leaf N as a parameter, after processing the last file extent item, that
      represents the extent at offset 64K, we take a look at the first file
      extent item at the next leaf (leaf N + 1), and notice there's a 4K hole
      between the two extents, and therefore we insert a file extent item
      representing that hole, starting at file offset 68K and ending at offset
      72K - 1. However we don't update the value of *last_extent, which is used
      to represent the end offset (plus 1, non-inclusive end) of the last file
      extent item inserted in the log, so it stays with a value of 68K and not
      with a value of 72K.
      
      Then, when copy_items() is called for leaf N + 1, because the value of
      *last_extent is smaller then the offset of the first extent item in the
      leaf (68K < 72K), we look at the last file extent item in the previous
      leaf (leaf N) and see it there's a 4K gap between it and our first file
      extent item (again, 68K < 72K), so we decide to insert a file extent item
      representing the hole, starting at file offset 68K and ending at offset
      72K - 1, this insertion will fail with -EEXIST being returned from
      btrfs_insert_file_extent() because we already inserted a file extent item
      representing a hole for this offset (68K) in the previous call to
      copy_items(), when processing leaf N.
      
      The -EEXIST error gets propagated to the fsync callback, btrfs_sync_file(),
      which falls back to a full transaction commit.
      
      Fix this by adjusting *last_extent after inserting a hole when we had to
      look at the next leaf.
      
      Fixes: 4ee3fad3
      
       ("Btrfs: fix fsync after hole punching when using no-holes feature")
      Cc: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ebb92906
  9. 29 Apr, 2019 9 commits
  10. 20 Mar, 2019 1 commit
    • Filipe Manana's avatar
      Btrfs: fix assertion failure on fsync with NO_HOLES enabled · 0ccc3876
      Filipe Manana authored
      Back in commit a89ca6f2 ("Btrfs: fix fsync after truncate when
      no_holes feature is enabled") I added an assertion that is triggered when
      an inline extent is found to assert that the length of the (uncompressed)
      data the extent represents is the same as the i_size of the inode, since
      that is true most of the time I couldn't find or didn't remembered about
      any exception at that time. Later on the assertion was expanded twice to
      deal with a case of a compressed inline extent representing a range that
      matches the sector size followed by an expanding truncate, and another
      case where fallocate can update the i_size of the inode without adding
      or updating existing extents (if the fallocate range falls entirely within
      the first block of the file). These two expansion/fixes of the assertion
      were done by commit 7ed586d0 ("Btrfs: fix assertion on fsync of
      regular file when using no-holes feature") and commit 6399fb5a
      
      
      ("Btrfs: fix assertion failure during fsync in no-holes mode").
      These however missed the case where an falloc expands the i_size of an
      inode to exactly the sector size and inline extent exists, for example:
      
       $ mkfs.btrfs -f -O no-holes /dev/sdc
       $ mount /dev/sdc /mnt
      
       $ xfs_io -f -c "pwrite -S 0xab 0 1096" /mnt/foobar
       wrote 1096/1096 bytes at offset 0
       1 KiB, 1 ops; 0.0002 sec (4.448 MiB/sec and 4255.3191 ops/sec)
      
       $ xfs_io -c "falloc 1096 3000" /mnt/foobar
       $ xfs_io -c "fsync" /mnt/foobar
       Segmentation fault
      
       $ dmesg
       [701253.602385] assertion failed: len == i_size || (len == fs_info->sectorsize && btrfs_file_extent_compression(leaf, extent) != BTRFS_COMPRESS_NONE) || (len < i_size && i_size < fs_info->sectorsize), file: fs/btrfs/tree-log.c, line: 4727
       [701253.602962] ------------[ cut here ]------------
       [701253.603224] kernel BUG at fs/btrfs/ctree.h:3533!
       [701253.603503] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
       [701253.603774] CPU: 2 PID: 7192 Comm: xfs_io Tainted: G        W         5.0.0-rc8-btrfs-next-45 #1
       [701253.604054] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
       [701253.604650] RIP: 0010:assfail.constprop.23+0x18/0x1a [btrfs]
       (...)
       [701253.605591] RSP: 0018:ffffbb48c186bc48 EFLAGS: 00010286
       [701253.605914] RAX: 00000000000000de RBX: ffff921d0a7afc08 RCX: 0000000000000000
       [701253.606244] RDX: 0000000000000000 RSI: ffff921d36b16868 RDI: ffff921d36b16868
       [701253.606580] RBP: ffffbb48c186bcf0 R08: 0000000000000000 R09: 0000000000000000
       [701253.606913] R10: 0000000000000003 R11: 0000000000000000 R12: ffff921d05d2de18
       [701253.607247] R13: ffff921d03b54000 R14: 0000000000000448 R15: ffff921d059ecf80
       [701253.607769] FS:  00007f14da906700(0000) GS:ffff921d36b00000(0000) knlGS:0000000000000000
       [701253.608163] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [701253.608516] CR2: 000056087ea9f278 CR3: 00000002268e8001 CR4: 00000000003606e0
       [701253.608880] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       [701253.609250] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       [701253.609608] Call Trace:
       [701253.609994]  btrfs_log_inode+0xdfb/0xe40 [btrfs]
       [701253.610383]  btrfs_log_inode_parent+0x2be/0xa60 [btrfs]
       [701253.610770]  ? do_raw_spin_unlock+0x49/0xc0
       [701253.611150]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
       [701253.611537]  btrfs_sync_file+0x3b2/0x440 [btrfs]
       [701253.612010]  ? do_sysinfo+0xb0/0xf0
       [701253.612552]  do_fsync+0x38/0x60
       [701253.612988]  __x64_sys_fsync+0x10/0x20
       [701253.613360]  do_syscall_64+0x60/0x1b0
       [701253.613733]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [701253.614103] RIP: 0033:0x7f14da4e66d0
       (...)
       [701253.615250] RSP: 002b:00007fffa670fdb8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
       [701253.615647] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f14da4e66d0
       [701253.616047] RDX: 000056087ea9c260 RSI: 000056087ea9c260 RDI: 0000000000000003
       [701253.616450] RBP: 0000000000000001 R08: 0000000000000020 R09: 0000000000000010
       [701253.616854] R10: 000000000000009b R11: 0000000000000246 R12: 000056087ea9c260
       [701253.617257] R13: 000056087ea9c240 R14: 0000000000000000 R15: 000056087ea9dd10
       (...)
       [701253.619941] ---[ end trace e088d74f132b6da5 ]---
      
      Updating the assertion again to allow for this particular case would result
      in a meaningless assertion, plus there is currently no risk of logging
      content that would result in any corruption after a log replay if the size
      of the data encoded in an inline extent is greater than the inode's i_size
      (which is not currently possibe either with or without compression),
      therefore just remove the assertion.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0ccc3876
  11. 13 Mar, 2019 2 commits
    • Josef Bacik's avatar
      btrfs: remove WARN_ON in log_dir_items · 2cc83342
      Josef Bacik authored
      When Filipe added the recursive directory logging stuff in
      2f2ff0ee
      
       ("Btrfs: fix metadata inconsistencies after directory
      fsync") he specifically didn't take the directory i_mutex for the
      children directories that we need to log because of lockdep.  This is
      generally fine, but can lead to this WARN_ON() tripping if we happen to
      run delayed deletion's in between our first search and our second search
      of dir_item/dir_indexes for this directory.  We expect this to happen,
      so the WARN_ON() isn't necessary.  Drop the WARN_ON() and add a comment
      so we know why this case can happen.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2cc83342
    • Filipe Manana's avatar
      Btrfs: fix incorrect file size after shrinking truncate and fsync · bf504110
      Filipe Manana authored
      If we do a shrinking truncate against an inode which is already present
      in the respective log tree and then rename it, as part of logging the new
      name we end up logging an inode item that reflects the old size of the
      file (the one which we previously logged) and not the new smaller size.
      The decision to preserve the size previously logged was added by commit
      1a4bcf47 ("Btrfs: fix fsync data loss after adding hard link to
      inode") in order to avoid data loss after replaying the log. However that
      decision is only needed for the case the logged inode size is smaller then
      the current size of the inode, as explained in that commit's change log.
      If the current size of the inode is smaller then the previously logged
      size, we know a shrinking truncate happened and therefore need to use
      that smaller size.
      
      Example to trigger the problem:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ xfs_io -f -c "pwrite -S 0xab 0 8000" /mnt/foo
        $ xfs_io -c "fsync" /mnt/foo
        $ xfs_io -c "truncate 3000" /mnt/foo
      
        $ mv /mnt/foo /mnt/bar
        $ xfs_io -c "fsync" /mnt/bar
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ od -t x1 -A d /mnt/bar
        0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
        *
        0008000
      
      Once we rename the file, we log its name (and inode item), and because
      the inode was already logged before in the current transaction, we log it
      with a size of 8000 bytes because that is the size we previously logged
      (with the first fsync). As part of the rename, besides logging the inode,
      we do also sync the log, which is done since commit d4682ba0
      ("Btrfs: sync log after logging new name"), so the next fsync against our
      inode is effectively a no-op, since no new changes happened since the
      rename operation. Even if did not sync the log during the rename
      operation, the same problem (fize size of 8000 bytes instead of 3000
      bytes) would be visible after replaying the log if the log ended up
      getting synced to disk through some other means, such as for example by
      fsyncing some other modified file. In the example above the fsync after
      the rename operation is there just because not every filesystem may
      guarantee logging/journalling the inode (and syncing the log/journal)
      during the rename operation, for example it is needed for f2fs, but not
      for ext4 and xfs.
      
      Fix this scenario by, when logging a new name (which is triggered by
      rename and link operations), using the current size of the inode instead
      of the previously logged inode size.
      
      A test case for fstests follows soon.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202695
      
      
      CC: stable@vger.kernel.org # 4.4+
      Reported-by: default avatarSeulbae Kim <seulbae@gatech.edu>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bf504110
  12. 25 Feb, 2019 4 commits
    • Filipe Manana's avatar
      Btrfs: add missing error handling after doing leaf/node binary search · cbca7d59
      Filipe Manana authored
      
      
      The function map_private_extent_buffer() can return an -EINVAL error, and
      it is called by generic_bin_search() which will return back the error. The
      btrfs_bin_search() function in turn calls generic_bin_search() and the
      key_search() function calls btrfs_bin_search(), so both can return the
      -EINVAL error coming from the map_private_extent_buffer() function. Some
      callers of these functions were ignoring that these functions can return
      an error, so fix them to deal with error return values.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cbca7d59
    • Filipe Manana's avatar
      Btrfs: fix fsync after succession of renames and unlink/rmdir · a3baaf0d
      Filipe Manana authored
      
      
      After a succession of renames operations of different files and unlinking
      one of them, if we fsync one of the renamed files we can end up with a
      log that will either fail to replay at mount time or result in a filesystem
      that is in an inconsistent state. One example scenario:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/testdir
        $ touch /mnt/testdir/fname1
        $ touch /mnt/testdir/fname2
      
        $ sync
      
        $ mv /mnt/testdir/fname1 /mnt/testdir/fname3
        $ rm -f /mnt/testdir/fname2
        $ ln /mnt/testdir/fname3 /mnt/testdir/fname2
      
        $ touch /mnt/testdir/fname1
        $ xfs_io -c "fsync" /mnt/testdir/fname1
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ umount /mnt
        $ btrfs check /dev/sdb
        [1/7] checking root items
        [2/7] checking extents
        [3/7] checking free space cache
        [4/7] checking fs roots
        root 5 inode 259 errors 2, no orphan item
        ERROR: errors found in fs roots
        Opening filesystem to check...
        Checking filesystem on /dev/sdc
        UUID: 20e4abb8-5a19-4492-8bb4-6084125c2d0d
        found 393216 bytes used, error(s) found
        total csum bytes: 0
        total tree bytes: 131072
        total fs tree bytes: 32768
        total extent tree bytes: 16384
        btree space waste bytes: 122986
        file data blocks allocated: 262144
         referenced 262144
      
      On a kernel without the first patch in this series, titled
      "[PATCH] Btrfs: fix fsync after succession of renames of different files",
      we get instead an error when mounting the filesystem due to failure of
      replaying the log:
      
        $ mount /dev/sdb /mnt
        mount: mount /dev/sdb on /mnt failed: File exists
      
      Fix this by logging the parent directory of an inode whenever we find an
      inode that no longer exists (was unlinked in the current transaction),
      during the procedure which finds inodes that have old names that collide
      with new names of other inodes.
      
      A test case for fstests follows soon.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a3baaf0d
    • Filipe Manana's avatar
      Btrfs: fix fsync after succession of renames of different files · 6b5fc433
      Filipe Manana authored
      
      
      After a succession of rename operations of different files and fsyncing
      one of them, such that each file gets a new name that corresponds to an
      old name of another file, we can end up with a log that will cause a
      failure when attempted to replay at mount time (an EEXIST error).
      We currently have correct behaviour when such succession of renames
      involves only two files, but if there are more files involved, we end up
      not logging all the inodes that are needed, therefore resulting in a
      failure when attempting to replay the log.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/testdir
        $ touch /mnt/testdir/fname1
        $ touch /mnt/testdir/fname2
      
        $ sync
      
        $ mv /mnt/testdir/fname1 /mnt/testdir/fname3
        $ mv /mnt/testdir/fname2 /mnt/testdir/fname4
        $ ln /mnt/testdir/fname3 /mnt/testdir/fname2
      
        $ touch /mnt/testdir/fname1
        $ xfs_io -c "fsync" /mnt/testdir/fname1
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        mount: mount /dev/sdb on /mnt failed: File exists
      
      So fix this by checking all inode dependencies when logging an inode. That
      is, if one logged inode A has a new name that matches the old name of some
      other inode B, check if inode B has a new name that matches the old name
      of some other inode C, and so on. This fix is implemented not by doing any
      recursive function calls but by using an iterative method using a linked
      list that is used in a first-in-first-out fashion.
      
      A test case for fstests follows soon.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6b5fc433
    • David Sterba's avatar
      btrfs: open code now trivial btrfs_set_lock_blocking · 8bead258
      David Sterba authored
      
      
      btrfs_set_lock_blocking is now only a simple wrapper around
      btrfs_set_lock_blocking_write. The name does not bring any semantic
      value that could not be inferred from the new function so there's no
      point keeping it.
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8bead258
  13. 17 Dec, 2018 3 commits
    • Andrea Gelmini's avatar
      btrfs: Fix typos in comments and strings · 52042d8e
      Andrea Gelmini authored
      
      
      The typos accumulate over time so once in a while time they get fixed in
      a large patch.
      Signed-off-by: default avatarAndrea Gelmini <andrea.gelmini@gelma.net>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      52042d8e
    • Filipe Manana's avatar
      Btrfs: fix fsync of files with multiple hard links in new directories · 41bd6067
      Filipe Manana authored
      The log tree has a long standing problem that when a file is fsync'ed we
      only check for new ancestors, created in the current transaction, by
      following only the hard link for which the fsync was issued. We follow the
      ancestors using the VFS' dget_parent() API. This means that if we create a
      new link for a file in a directory that is new (or in an any other new
      ancestor directory) and then fsync the file using an old hard link, we end
      up not logging the new ancestor, and on log replay that new hard link and
      ancestor do not exist. In some cases, involving renames, the file will not
      exist at all.
      
      Example:
      
        mkfs.btrfs -f /dev/sdb
        mount /dev/sdb /mnt
      
        mkdir /mnt/A
        touch /mnt/foo
        ln /mnt/foo /mnt/A/bar
        xfs_io -c fsync /mnt/foo
      
        <power failure>
      
      In this example after log replay only the hard link named 'foo' exists
      and directory A does not exist, which is unexpected. In other major linux
      filesystems, such as ext4, xfs and f2fs for example, both hard links exist
      and so does directory A after mounting again the filesystem.
      
      Checking if any new ancestors are new and need to be logged was added in
      2009 by commit 12fcfd22 ("Btrfs: tree logging unlink/rename fixes"),
      however only for the ancestors of the hard link (dentry) for which the
      fsync was issued, instead of checking for all ancestors for all of the
      inode's hard links.
      
      So fix this by tracking the id of the last transaction where a hard link
      was created for an inode and then on fsync fallback to a full transaction
      commit when an inode has more than one hard link and at least one new hard
      link was created in the current transaction. This is the simplest solution
      since this is not a common use case (adding frequently hard links for
      which there's an ancestor created in the current transaction and then
      fsync the file). In case it ever becomes a common use case, a solution
      that consists of iterating the fs/subvol btree for each hard link and
      check if any ancestor is new, could be implemented.
      
      This solves many unexpected scenarios reported by Jayashree Mohan and
      Vijay Chidambaram, and for which there is a new test case for fstests
      under review.
      
      Fixes: 12fcfd22
      
       ("Btrfs: tree logging unlink/rename fixes")
      CC: stable@vger.kernel.org # 4.4+
      Reported-by: default avatarVijay Chidambaram <vvijay03@gmail.com>
      Reported-by: default avatarJayashree Mohan <jayashree2912@gmail.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      41bd6067
    • Filipe Manana's avatar
      Btrfs: simpler and more efficient cleanup of a log tree's extent io tree · 59b0713a
      Filipe Manana authored
      
      
      We currently are in a loop finding each range (corresponding to a btree
      node/leaf) in a log root's extent io tree and then clean it up. This is a
      waste of time since we are traversing the extent io tree's rb_tree more
      times then needed (one for a range lookup and another for cleaning it up)
      without any good reason.
      
      We free the log trees when we are in the critical section of a transaction
      commit (the transaction state is set to TRANS_STATE_COMMIT_DOING), so it's
      of great convenience to do everything as fast as possible in order to
      reduce the time we block other tasks from starting a new transaction.
      
      So fix this by traversing the extent io tree once and cleaning up all its
      records in one go while traversing it.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      59b0713a