1. 13 Dec, 2019 1 commit
  2. 18 Nov, 2019 4 commits
  3. 17 Oct, 2019 1 commit
  4. 27 Sep, 2019 2 commits
    • btrfs: qgroup: Fix reserved data space leak if we have multiple reserve calls · d4e20494
      Qu Wenruo authored
      [BUG]
      The following script can cause btrfs qgroup data space leak:
      
        mkfs.btrfs -f $dev
        mount $dev -o nospace_cache $mnt
      
        btrfs subv create $mnt/subv
        btrfs quota en $mnt
        btrfs quota rescan -w $mnt
        btrfs qgroup limit 128m $mnt/subv
      
        for (( i = 0; i < 3; i++)); do
                # Create 3 64M holes for later fallocate to fail
                truncate -s 192m $mnt/subv/file
                xfs_io -c "pwrite 64m 4k" $mnt/subv/file > /dev/null
                xfs_io -c "pwrite 128m 4k" $mnt/subv/file > /dev/null
                sync
      
                # it's supposed to fail, and each failure will leak at least 64M
                # data space
                xfs_io -f -c "falloc 0 192m" $mnt/subv/file &> /dev/null
                rm $mnt/subv/file
                sync
        done
      
        # Shouldn't fail after we removed the file
        xfs_io -f -c "falloc 0 64m" $mnt/subv/file
      
      [CAUSE]
      Btrfs qgroup data reserve code allows multiple reservations to happen on
      a single extent_changeset:
      E.g:
      	btrfs_qgroup_reserve_data(inode, &data_reserved, 0, SZ_1M);
      	btrfs_qgroup_reserve_data(inode, &data_reserved, SZ_1M, SZ_2M);
      	btrfs_qgroup_reserve_data(inode, &data_reserved, 0, SZ_4M);
      
      Btrfs qgroup code has its internal tracking to make sure we don't
      double-reserve in the above example.
      
      The only pattern utilizing this feature is in the main while loop of
      btrfs_fallocate() function.
      
      However btrfs_qgroup_reserve_data()'s error handling has a bug in that
      on error it clears all ranges in the io_tree with EXTENT_QGROUP_RESERVED
      flag but doesn't free previously reserved bytes.
      
      This bug has a twofold effect:
      - Clearing EXTENT_QGROUP_RESERVED ranges
        This is the correct behavior, but it prevents
        btrfs_qgroup_check_reserved_leak() from catching the leak, as the
        detector is purely EXTENT_QGROUP_RESERVED flag based.

      - Leaking the previously reserved data bytes.
      
      The bug manifests when N calls to btrfs_qgroup_reserve_data are made and
      the last one fails, leaking space reserved in the previous ones.
      
      [FIX]
      Also free previously reserved data bytes when btrfs_qgroup_reserve_data
      fails.
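
The idea of the fix can be illustrated with a minimal userspace sketch. This is a toy model under stated assumptions, not the kernel code: the toy_* names and the byte-counter "changeset" are hypothetical stand-ins for the qgroup and extent_changeset structures.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical toy model of a qgroup with a data limit. */
struct toy_qgroup {
	uint64_t limit;		/* maximum bytes that may be reserved */
	uint64_t reserved;	/* bytes currently reserved */
};

/* Bytes reserved so far by one caller across several reserve calls. */
struct toy_changeset {
	uint64_t bytes_changed;
};

/*
 * Reserve @len bytes. On failure, also give back everything that earlier
 * successful calls recorded in @cs, so a failed call sequence leaks nothing.
 */
int toy_reserve_data(struct toy_qgroup *qg, struct toy_changeset *cs,
		     uint64_t len)
{
	if (qg->reserved + len > qg->limit) {
		qg->reserved -= cs->bytes_changed;
		cs->bytes_changed = 0;
		return -EDQUOT;
	}
	qg->reserved += len;
	cs->bytes_changed += len;
	return 0;
}
```

In this model, a first call that reserves 64 bytes followed by a failing second call leaves qg->reserved at 0 again. Without the rollback in the error branch, the 64 bytes from the first call would stay reserved forever.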
      
      Fixes: 52472553 ("btrfs: qgroup: Introduce btrfs_qgroup_reserve_data function")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: Fix the wrong target io_tree when freeing reserved data space · bab32fc0
      Qu Wenruo authored
      [BUG]
      In the following case with qgroup enabled, if an error happens after
      we have reserved delalloc space, the error handling path can leak
      qgroup data space:
      
      From btrfs_truncate_block() in inode.c:
      
      	ret = btrfs_delalloc_reserve_space(inode, &data_reserved,
      					   block_start, blocksize);
      	if (ret)
      		goto out;
      
       again:
      	page = find_or_create_page(mapping, index, mask);
      	if (!page) {
      		btrfs_delalloc_release_space(inode, data_reserved,
      					     block_start, blocksize, true);
      		btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, true);
      		ret = -ENOMEM;
      		goto out;
      	}
      
      [CAUSE]
      In the above case, btrfs_delalloc_reserve_space() will call
      btrfs_qgroup_reserve_data() and mark the io_tree range with
      EXTENT_QGROUP_RESERVED flag.
      
      In the error handling path, we have the following call stack:
      btrfs_delalloc_release_space()
      |- btrfs_free_reserved_data_space()
         |- btrfs_qgroup_free_data()
            |- __btrfs_qgroup_release_data(reserved=@reserved, free=1)
               |- qgroup_free_reserved_data(reserved=@reserved)
                  |- clear_record_extent_bits();
                  |- freed += changeset.bytes_changed;
      
      However due to a completion bug, qgroup_free_reserved_data() will clear
      the EXTENT_QGROUP_RESERVED flag in BTRFS_I(inode)->io_failure_tree,
      rather than in the correct BTRFS_I(inode)->io_tree.
      Since io_failure_tree is never marked with that flag,
      btrfs_qgroup_free_data() will not free any reserved data space at all,
      causing a leak.
      
      This type of error handling can only be triggered by errors outside of
      qgroup code. So EDQUOT error from qgroup can't trigger it.
      
      [FIX]
      Fix the wrong target io_tree.
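
Why clearing the flag in the wrong tree frees nothing can be shown with a small toy model. This is a sketch under stated assumptions, not the kernel code: each "tree" is reduced to a single flagged range, and all toy_* names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical one-range stand-in for an extent io tree. */
struct toy_tree {
	bool qgroup_reserved;	/* models the EXTENT_QGROUP_RESERVED flag */
	uint64_t len;		/* length of the single range in the tree */
};

struct toy_inode {
	struct toy_tree io_tree;		/* where reservations are marked */
	struct toy_tree io_failure_tree;	/* never carries the flag */
};

/* Clear the flag in @tree and return how many bytes had it set. */
uint64_t toy_clear_reserved(struct toy_tree *tree)
{
	uint64_t changed = tree->qgroup_reserved ? tree->len : 0;

	tree->qgroup_reserved = false;
	return changed;
}
```

Calling toy_clear_reserved() on io_failure_tree always returns 0 here, so nothing would be handed back to the qgroup, while the same call on io_tree returns the reserved length.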

      Reported-by: Josef Bacik <josef@toxicpanda.com>
      Fixes: bc42bda2 ("btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges")
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 24 Sep, 2019 1 commit
    • Btrfs: fix race setting up and completing qgroup rescan workers · 13fc1d27
      Filipe Manana authored
      There is a race between setting up a qgroup rescan worker and completing
      a qgroup rescan worker that can cause callers of the qgroup rescan wait
      ioctl to either not wait for the rescan worker to complete or to hang
      forever due to missing wake ups. The following diagram shows a sequence
      of steps that illustrates the race.
      
              CPU 1                                                         CPU 2                                  CPU 3
      
       btrfs_ioctl_quota_rescan()
        btrfs_qgroup_rescan()
         qgroup_rescan_init()
          mutex_lock(&fs_info->qgroup_rescan_lock)
          spin_lock(&fs_info->qgroup_lock)
      
          fs_info->qgroup_flags |=
            BTRFS_QGROUP_STATUS_FLAG_RESCAN
      
          init_completion(
            &fs_info->qgroup_rescan_completion)
      
          fs_info->qgroup_rescan_running = true
      
          mutex_unlock(&fs_info->qgroup_rescan_lock)
          spin_unlock(&fs_info->qgroup_lock)
      
          btrfs_init_work()
           --> starts the worker
      
                                                              btrfs_qgroup_rescan_worker()
                                                               mutex_lock(&fs_info->qgroup_rescan_lock)
      
                                                               fs_info->qgroup_flags &=
                                                                 ~BTRFS_QGROUP_STATUS_FLAG_RESCAN
      
                                                               mutex_unlock(&fs_info->qgroup_rescan_lock)
      
                                                               starts transaction, updates qgroup status
                                                               item, etc
      
                                                                                                                 btrfs_ioctl_quota_rescan()
                                                                                                                  btrfs_qgroup_rescan()
                                                                                                                   qgroup_rescan_init()
                                                                                                                    mutex_lock(&fs_info->qgroup_rescan_lock)
                                                                                                                    spin_lock(&fs_info->qgroup_lock)
      
                                                                                                                    fs_info->qgroup_flags |=
                                                                                                                      BTRFS_QGROUP_STATUS_FLAG_RESCAN
      
                                                                                                                    init_completion(
                                                                                                                      &fs_info->qgroup_rescan_completion)
      
                                                                                                                    fs_info->qgroup_rescan_running = true
      
                                                                                                                    mutex_unlock(&fs_info->qgroup_rescan_lock)
                                                                                                                    spin_unlock(&fs_info->qgroup_lock)
      
                                                                                                                    btrfs_init_work()
                                                                                                                     --> starts another worker
      
                                                               mutex_lock(&fs_info->qgroup_rescan_lock)
      
                                                               fs_info->qgroup_rescan_running = false
      
                                                               mutex_unlock(&fs_info->qgroup_rescan_lock)
      
                                                               complete_all(&fs_info->qgroup_rescan_completion)
      
      Before the rescan worker started by the task at CPU 3 completes, if
      another task calls btrfs_ioctl_quota_rescan(), it will get -EINPROGRESS
      because the flag BTRFS_QGROUP_STATUS_FLAG_RESCAN is set at
      fs_info->qgroup_flags, which is expected and correct behaviour.
      
      However, if another task calls btrfs_ioctl_quota_rescan_wait() before the
      rescan worker started by the task at CPU 3 completes, it will return
      immediately without waiting for the new rescan worker to complete,
      because fs_info->qgroup_rescan_running is set to false by CPU 2.
      
      This race makes test case btrfs/171 (from fstests) fail often:
      
        btrfs/171 9s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad)
            --- tests/btrfs/171.out     2018-09-16 21:30:48.505104287 +0100
            +++ /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad      2019-09-19 02:01:36.938486039 +0100
            @@ -1,2 +1,3 @@
             QA output created by 171
            +ERROR: quota rescan failed: Operation now in progress
             Silence is golden
            ...
            (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/171.out /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad'  to see the entire diff)
      
      That is because the test calls the btrfs-progs commands "quota rescan
      -w", "qgroup assign" and "qgroup remove" in a sequence that makes
      calls to the rescan start ioctl fail with -EINPROGRESS (note that the
      "btrfs" commands 'qgroup assign' and 'qgroup remove' often call the
      rescan start ioctl after calling the qgroup assign ioctl,
      btrfs_ioctl_qgroup_assign()), since previous waits didn't actually wait
      for a rescan worker to complete.
      
      Another problem the race can cause is missing wake ups for waiters,
      since the call to complete_all() happens outside a critical section and
      after clearing the flag BTRFS_QGROUP_STATUS_FLAG_RESCAN. In the sequence
      diagram above, if we have a waiter for the first rescan task (executed
      by CPU 2), then fs_info->qgroup_rescan_completion.wait is not empty. If,
      after the rescan worker clears BTRFS_QGROUP_STATUS_FLAG_RESCAN and
      before it calls complete_all() against
      fs_info->qgroup_rescan_completion, the task at CPU 3 calls
      init_completion() against fs_info->qgroup_rescan_completion, the wait
      queue is re-initialized to an empty queue. The rescan worker at CPU 2
      then calls complete_all() against an empty queue and never wakes up the
      task waiting for that rescan worker.
      
      Fix this by clearing BTRFS_QGROUP_STATUS_FLAG_RESCAN and setting
      fs_info->qgroup_rescan_running to false in the same critical section,
      delimited by the mutex fs_info->qgroup_rescan_lock, as well as doing the
      call to complete_all() in that same critical section. This gives the
      protection needed to avoid rescan wait ioctl callers not waiting for a
      running rescan worker and the lost wake ups problem, since setting that
      rescan flag and boolean as well as initializing the wait queue is done
      already in a critical section delimited by that mutex (at
      qgroup_rescan_init()).
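
The locking rule described above can be sketched in userspace with pthreads. This is only an illustration under stated assumptions: a pthread condition variable stands in for the kernel completion, and the toy_* names are hypothetical. The point is that the flag, the boolean and the wake up are all handled under one mutex.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the rescan state in fs_info. */
struct toy_rescan {
	pthread_mutex_t lock;	/* plays the role of qgroup_rescan_lock */
	pthread_cond_t done;	/* plays the role of qgroup_rescan_completion */
	bool flag_rescan;	/* plays the role of the RESCAN status flag */
	bool running;		/* plays the role of qgroup_rescan_running */
};

/* Start a rescan: flag and boolean are set in one critical section. */
int toy_rescan_init(struct toy_rescan *r)
{
	int ret = 0;

	pthread_mutex_lock(&r->lock);
	if (r->flag_rescan) {
		ret = -EINPROGRESS;
	} else {
		r->flag_rescan = true;
		r->running = true;
	}
	pthread_mutex_unlock(&r->lock);
	return ret;
}

/* Finish a rescan: clear both and wake waiters before dropping the lock. */
void toy_rescan_finish(struct toy_rescan *r)
{
	pthread_mutex_lock(&r->lock);
	r->flag_rescan = false;
	r->running = false;
	pthread_cond_broadcast(&r->done);
	pthread_mutex_unlock(&r->lock);
}

/* Wait until no rescan is running. */
void toy_rescan_wait(struct toy_rescan *r)
{
	pthread_mutex_lock(&r->lock);
	while (r->running)
		pthread_cond_wait(&r->done, &r->lock);
	pthread_mutex_unlock(&r->lock);
}
```

Because toy_rescan_finish() never drops the lock between clearing the flag and waking waiters, a new toy_rescan_init() cannot slip in between those steps, which is the window the diagram above shows.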
      
      Fixes: 57254b6e ("Btrfs: add ioctl to wait for qgroup rescan completion")
      Fixes: d2c609b8 ("btrfs: properly track when rescan worker is running")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 09 Sep, 2019 2 commits
    • btrfs: qgroup: Try our best to delete qgroup relations · 73798c46
      Qu Wenruo authored
      When we try to delete a qgroup relation, we're pretty cautious: we make
      sure both qgroups exist and there is a relationship between them, then
      try to delete the relation.

      This behavior is OK, but the problem is that we need to delete two
      relation items, and if the first item deletion fails, we error out,
      leaving the other relation item in the qgroup tree.
      
      Sometimes the error from del_qgroup_relation_item() could just be
      -ENOENT, thus we can ignore that error and continue without any problem.
      
      Furthermore, such cautious behavior makes qgroup relation deletion
      impossible for orphan relation items.
      
      This patch will enhance __del_qgroup_relation():
      - If both qgroups and their relation items exist
        Go through the regular deletion routine and update their accounting if needed.
      
      - If any qgroup or relation item doesn't exist
        Then we still try to delete the orphan items anyway, but don't trigger
        the accounting update.
      
      By this, we try our best to remove relation items, and can handle orphan
      relation items properly, while still keeping the existing behavior for a
      good qgroup tree.
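
The best-effort deletion can be sketched as follows. This is a simplified userspace model under stated assumptions, not the kernel function: the two relation items are reduced to booleans and the toy_* names are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model: a relation is stored as two items in the tree. */
struct toy_relation {
	bool src_to_dst;	/* item keyed (src, dst) exists */
	bool dst_to_src;	/* item keyed (dst, src) exists */
};

/* Delete one item; a missing item is reported as -ENOENT. */
int toy_del_item(bool *item)
{
	if (!*item)
		return -ENOENT;
	*item = false;
	return 0;
}

/* Try both deletions and tolerate -ENOENT from either of them. */
int toy_del_relation(struct toy_relation *rel)
{
	int ret, ret2;

	ret = toy_del_item(&rel->src_to_dst);
	if (ret < 0 && ret != -ENOENT)
		return ret;
	ret2 = toy_del_item(&rel->dst_to_src);
	if (ret2 < 0 && ret2 != -ENOENT)
		return ret2;

	/* At least one item existed and was removed: report success. */
	if (!ret || !ret2)
		return 0;
	return -ENOENT;
}
```

With an orphan relation (only one of the two items present), toy_del_relation() still removes the remaining item and returns 0 instead of stopping at the first -ENOENT.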

      Reported-by: Andrei Borzenkov <arvidjaar@gmail.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move basic block_group definitions to their own header · aac0023c
      Josef Bacik authored
      This is prep work for moving all of the block group cache code into its
      own file.

      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ minor comment updates ]
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 02 Jul, 2019 1 commit
    • btrfs: qgroup: Don't hold qgroup_ioctl_lock in btrfs_qgroup_inherit() · e88439de
      Qu Wenruo authored
      [BUG]
      Lockdep will report the following circular locking dependency:
      
        WARNING: possible circular locking dependency detected
        5.2.0-rc2-custom #24 Tainted: G           O
        ------------------------------------------------------
        btrfs/8631 is trying to acquire lock:
        000000002536438c (&fs_info->qgroup_ioctl_lock#2){+.+.}, at: btrfs_qgroup_inherit+0x40/0x620 [btrfs]
      
        but task is already holding lock:
        000000003d52cc23 (&fs_info->tree_log_mutex){+.+.}, at: create_pending_snapshot+0x8b6/0xe60 [btrfs]
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (&fs_info->tree_log_mutex){+.+.}:
               __mutex_lock+0x76/0x940
               mutex_lock_nested+0x1b/0x20
               btrfs_commit_transaction+0x475/0xa00 [btrfs]
               btrfs_commit_super+0x71/0x80 [btrfs]
               close_ctree+0x2bd/0x320 [btrfs]
               btrfs_put_super+0x15/0x20 [btrfs]
               generic_shutdown_super+0x72/0x110
               kill_anon_super+0x18/0x30
               btrfs_kill_super+0x16/0xa0 [btrfs]
               deactivate_locked_super+0x3a/0x80
               deactivate_super+0x51/0x60
               cleanup_mnt+0x3f/0x80
               __cleanup_mnt+0x12/0x20
               task_work_run+0x94/0xb0
               exit_to_usermode_loop+0xd8/0xe0
               do_syscall_64+0x210/0x240
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        -> #1 (&fs_info->reloc_mutex){+.+.}:
               __mutex_lock+0x76/0x940
               mutex_lock_nested+0x1b/0x20
               btrfs_commit_transaction+0x40d/0xa00 [btrfs]
               btrfs_quota_enable+0x2da/0x730 [btrfs]
               btrfs_ioctl+0x2691/0x2b40 [btrfs]
               do_vfs_ioctl+0xa9/0x6d0
               ksys_ioctl+0x67/0x90
               __x64_sys_ioctl+0x1a/0x20
               do_syscall_64+0x65/0x240
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        -> #0 (&fs_info->qgroup_ioctl_lock#2){+.+.}:
               lock_acquire+0xa7/0x190
               __mutex_lock+0x76/0x940
               mutex_lock_nested+0x1b/0x20
               btrfs_qgroup_inherit+0x40/0x620 [btrfs]
               create_pending_snapshot+0x9d7/0xe60 [btrfs]
               create_pending_snapshots+0x94/0xb0 [btrfs]
               btrfs_commit_transaction+0x415/0xa00 [btrfs]
               btrfs_mksubvol+0x496/0x4e0 [btrfs]
               btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
               btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
               btrfs_ioctl+0xa90/0x2b40 [btrfs]
               do_vfs_ioctl+0xa9/0x6d0
               ksys_ioctl+0x67/0x90
               __x64_sys_ioctl+0x1a/0x20
               do_syscall_64+0x65/0x240
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        other info that might help us debug this:
      
        Chain exists of:
          &fs_info->qgroup_ioctl_lock#2 --> &fs_info->reloc_mutex --> &fs_info->tree_log_mutex
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(&fs_info->tree_log_mutex);
                                       lock(&fs_info->reloc_mutex);
                                       lock(&fs_info->tree_log_mutex);
          lock(&fs_info->qgroup_ioctl_lock#2);
      
         *** DEADLOCK ***
      
        6 locks held by btrfs/8631:
         #0: 00000000ed8f23f6 (sb_writers#12){.+.+}, at: mnt_want_write_file+0x28/0x60
         #1: 000000009fb1597a (&type->i_mutex_dir_key#10/1){+.+.}, at: btrfs_mksubvol+0x70/0x4e0 [btrfs]
         #2: 0000000088c5ad88 (&fs_info->subvol_sem){++++}, at: btrfs_mksubvol+0x128/0x4e0 [btrfs]
         #3: 000000009606fc3e (sb_internal#2){.+.+}, at: start_transaction+0x37a/0x520 [btrfs]
         #4: 00000000f82bbdf5 (&fs_info->reloc_mutex){+.+.}, at: btrfs_commit_transaction+0x40d/0xa00 [btrfs]
         #5: 000000003d52cc23 (&fs_info->tree_log_mutex){+.+.}, at: create_pending_snapshot+0x8b6/0xe60 [btrfs]
      
      [CAUSE]
      Due to the delayed subvolume creation, we need to call
      btrfs_qgroup_inherit() inside commit transaction code, with a lot of
      other mutexes held.
      This long lock chain can lead to the above problem.
      
      [FIX]
      On the other hand, we don't really need to hold qgroup_ioctl_lock if
      we're in the context of create_pending_snapshot().
      In that context, we're the only one able to modify qgroups.
      
      All other qgroup functions which need qgroup_ioctl_lock are either
      holding a transaction handle, or will start a new transaction:
        Functions that start a new transaction:
        * btrfs_quota_enable()
        * btrfs_quota_disable()
        Functions that hold a transaction handle:
        * btrfs_add_qgroup_relation()
        * btrfs_del_qgroup_relation()
        * btrfs_create_qgroup()
        * btrfs_remove_qgroup()
        * btrfs_limit_qgroup()
        * btrfs_qgroup_inherit() call inside create_subvol()
      
      So we have a higher level protection provided by transaction, thus we
      don't need to always hold qgroup_ioctl_lock in btrfs_qgroup_inherit().
      
      Only the btrfs_qgroup_inherit() call in create_subvol() needs to hold
      qgroup_ioctl_lock, while the btrfs_qgroup_inherit() call in
      create_pending_snapshot() is already protected by transaction.
      
      So the fix is to detect the context by checking
      trans->transaction->state.
      If we're at TRANS_STATE_COMMIT_DOING, then we're in commit transaction
      context and there is no need to take the mutex.
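
The context check can be sketched as below. This is a userspace toy under stated assumptions: the lock is modeled by a boolean plus a counter, and the toy_* names and enum values are hypothetical, only mirroring the idea of checking the transaction state.

```c
#include <stdbool.h>

/* Hypothetical subset of transaction states. */
enum toy_trans_state {
	TOY_TRANS_STATE_RUNNING,
	TOY_TRANS_STATE_COMMIT_DOING,
};

struct toy_fs_info {
	bool ioctl_lock_held;	/* models qgroup_ioctl_lock */
	int lock_acquisitions;	/* how often the lock was taken */
};

/* Take the lock only when not called from the commit path. */
void toy_qgroup_inherit(struct toy_fs_info *fs, enum toy_trans_state state)
{
	bool need_lock = (state != TOY_TRANS_STATE_COMMIT_DOING);

	if (need_lock) {
		fs->ioctl_lock_held = true;
		fs->lock_acquisitions++;
	}

	/* ... the qgroup inheritance work would happen here ... */

	if (need_lock)
		fs->ioctl_lock_held = false;
}
```

A call with TOY_TRANS_STATE_COMMIT_DOING (the create_pending_snapshot() case) never touches the lock, while a call with TOY_TRANS_STATE_RUNNING (the create_subvol() case) takes and releases it.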

      Reported-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  8. 28 May, 2019 1 commit
    • btrfs: qgroup: Check bg while resuming relocation to avoid NULL pointer dereference · 57949d03
      Qu Wenruo authored
      [BUG]
      When mounting a fs that has a reloc tree and qgroup enabled, it can
      cause a NULL pointer dereference at mount time:
      
        BUG: kernel NULL pointer dereference, address: 00000000000000a8
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 0 P4D 0
        Oops: 0000 [#1] PREEMPT SMP NOPTI
        RIP: 0010:btrfs_qgroup_add_swapped_blocks+0x186/0x300 [btrfs]
        Call Trace:
         replace_path.isra.23+0x685/0x900 [btrfs]
         merge_reloc_root+0x26e/0x5f0 [btrfs]
         merge_reloc_roots+0x10a/0x1a0 [btrfs]
         btrfs_recover_relocation+0x3cd/0x420 [btrfs]
         open_ctree+0x1bc8/0x1ed0 [btrfs]
         btrfs_mount_root+0x544/0x680 [btrfs]
         legacy_get_tree+0x34/0x60
         vfs_get_tree+0x2d/0xf0
         fc_mount+0x12/0x40
         vfs_kern_mount.part.12+0x61/0xa0
         vfs_kern_mount+0x13/0x20
         btrfs_mount+0x16f/0x860 [btrfs]
         legacy_get_tree+0x34/0x60
         vfs_get_tree+0x2d/0xf0
         do_mount+0x81f/0xac0
         ksys_mount+0xbf/0xe0
         __x64_sys_mount+0x25/0x30
         do_syscall_64+0x65/0x240
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [CAUSE]
      In btrfs_recover_relocation(), we don't have enough info to determine
      which block group we're relocating, but only to merge existing reloc
      trees.
      
      Thus in btrfs_recover_relocation(), rc->block_group is NULL.
      btrfs_qgroup_add_swapped_blocks() hasn't taken this into consideration,
      and causes a NULL pointer dereference.
      
      The bug is introduced by commit 3d0174f7 ("btrfs: qgroup: Only trace
      data extents in leaves if we're relocating data block group"), and
      later qgroup refactoring still keeps this optimization.
      
      [FIX]
      Thankfully in the context of btrfs_recover_relocation(), no other
      process can modify tree blocks, thus those swapped tree block pairs
      will never affect qgroup numbers, no matter what we set for
      block->trace_leaf.
      
      So we only need to check if @bg is NULL before accessing @bg->flags.
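
A minimal sketch of such a guard, written as standalone C. The struct, the flag value and the function name are hypothetical stand-ins, chosen only to show the NULL check before the dereference.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit marking a data block group. */
#define TOY_BLOCK_GROUP_DATA (1ULL << 0)

struct toy_block_group {
	uint64_t flags;
};

/*
 * Decide whether leaves need tracing. @bg may be NULL, as it is when
 * relocation is being resumed at mount time, so check it before reading
 * bg->flags.
 */
bool toy_trace_leaf(const struct toy_block_group *bg)
{
	return bg && (bg->flags & TOY_BLOCK_GROUP_DATA);
}
```

Passing NULL simply yields false here instead of dereferencing a NULL pointer.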

      Reported-by: Juan Erbes <jerbes@gmail.com>
      Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1134806
      Fixes: 3d0174f7 ("btrfs: qgroup: Only trace data extents in leaves if we're relocating data block group")
      CC: stable@vger.kernel.org # 4.20+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  9. 29 Apr, 2019 2 commits
  10. 19 Mar, 2019 1 commit
  11. 27 Feb, 2019 1 commit
  12. 25 Feb, 2019 8 commits
    • btrfs: qgroup: Make qgroup async transaction commit more aggressive · f5fef459
      Qu Wenruo authored
      [BUG]
      Btrfs qgroup will still hit EDQUOT in the following case:
      
        $ dev=/dev/test/test
        $ mnt=/mnt/btrfs
        $ umount $mnt &> /dev/null
        $ umount $dev &> /dev/null
      
        $ mkfs.btrfs -f $dev
        $ mount $dev $mnt -o nospace_cache
      
        $ btrfs subv create $mnt/subv
        $ btrfs quota enable $mnt
        $ btrfs quota rescan -w $mnt
        $ btrfs qgroup limit -e 1G $mnt/subv
      
        $ fallocate -l 900M $mnt/subv/padding
        $ sync
      
        $ rm $mnt/subv/padding
      
        # Hit EDQUOT
        $ xfs_io -f -c "pwrite 0 512M" $mnt/subv/real_file
      
      [CAUSE]
      Since commit a514d638 ("btrfs: qgroup: Commit transaction in advance
      to reduce early EDQUOT"), btrfs is not forced to commit a transaction to
      reclaim more quota space.

      Instead, we just check the pertrans metadata reservation against some
      threshold and try to commit the transaction asynchronously.
      
      However in the above case, the pertrans metadata reservation is pretty
      small, thus it will never trigger an asynchronous transaction commit.
      
      [FIX]
      Instead of only accounting pertrans metadata reservation, we calculate
      how much free space we have, and if there isn't much free space left,
      commit transaction asynchronously to try to free some space.
      
      This may slow down the fs when we have less than 32M free qgroup space,
      but should reduce a lot of false EDQUOT, so the cost should be
      acceptable.
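
The free-space check can be sketched as follows. This is a simplification under stated assumptions: only the 32M figure comes from the text above, while the function name, its parameters and the exact comparison are hypothetical and may differ from the kernel's threshold logic.

```c
#include <stdbool.h>
#include <stdint.h>

/* 32M, the free-space figure mentioned in the commit message. */
#define TOY_FREE_SPACE_THRESHOLD (32ULL * 1024 * 1024)

/*
 * Return true when the qgroup is close enough to its limit that an
 * asynchronous transaction commit should be tried to free some space.
 */
bool toy_should_commit_async(uint64_t limit, uint64_t used,
			     uint64_t reserved)
{
	uint64_t total = used + reserved;

	if (total >= limit)
		return true;
	return limit - total < TOY_FREE_SPACE_THRESHOLD;
}
```

With a 1G limit, 900M used leaves 124M free and returns false, while 1000M used leaves only 24M free and returns true, regardless of how small the pertrans metadata reservation is.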

      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: Move reserved data accounting from btrfs_delayed_ref_head to btrfs_qgroup_extent_record · 1418bae1
      Qu Wenruo authored
      
      [BUG]
      btrfs/139 will fail with a high probability if the testing machine (VM)
      has only 2G RAM.

      As a result, the final write succeeds while it should fail with EDQUOT,
      and the fs ends up exceeding the quota limit by 16K.
      
      The simplified reproducer is: (needs a VM with 2G RAM)
      
        $ mkfs.btrfs -f $dev
        $ mount $dev $mnt
      
        $ btrfs subv create $mnt/subv
        $ btrfs quota enable $mnt
        $ btrfs quota rescan -w $mnt
        $ btrfs qgroup limit -e 1G $mnt/subv
      
        $ for i in $(seq -w  1 8); do
        	xfs_io -f -c "pwrite 0 128M" $mnt/subv/file_$i > /dev/null
        	echo "file $i written" > /dev/kmsg
          done
        $ sync
        $ btrfs qgroup show -pcre --raw $mnt
      
      The last pwrite will not trigger EDQUOT and final 'qgroup show' will
      show something like:
      
        qgroupid         rfer         excl     max_rfer     max_excl parent  child
        --------         ----         ----     --------     -------- ------  -----
        0/5             16384        16384         none         none ---     ---
        0/256      1073758208   1073758208         none   1073741824 ---     ---
      
      And 1073758208 is larger than 1073741824.
      
      [CAUSE]
      It's a bug in btrfs qgroup data reserved space management.
      
      For quota limit, we must ensure that:
        reserved (data + metadata) + rfer/excl <= limit
      
      Since rfer/excl is only updated at transaction commit time, reserved
      space needs special care.
      
      One important part of reserved space is data, and for a new data extent
      written to disk, we still need to take the reserved space until
      rfer/excl numbers get updated.
      
      Originally when an ordered extent finishes, we migrate the reserved
      qgroup data space from extent_io tree to delayed ref head of the data
      extent, expecting delayed ref will only be cleaned up at commit
      transaction time.
      
      However on a machine with little RAM, memory pressure can cause dirty
      pages to be flushed back to disk without committing a transaction.
      
      The related events will be something like:
      
        file 1 written
        btrfs_finish_ordered_io: ino=258 ordered offset=0 len=54947840
        btrfs_finish_ordered_io: ino=258 ordered offset=54947840 len=5636096
        btrfs_finish_ordered_io: ino=258 ordered offset=61153280 len=57344
        btrfs_finish_ordered_io: ino=258 ordered offset=61210624 len=8192
        btrfs_finish_ordered_io: ino=258 ordered offset=60583936 len=569344
        cleanup_ref_head: num_bytes=54947840
        cleanup_ref_head: num_bytes=5636096
        cleanup_ref_head: num_bytes=569344
        cleanup_ref_head: num_bytes=57344
        cleanup_ref_head: num_bytes=8192
        ^^^^^^^^^^^^^^^^ This will free qgroup data reserved space
        file 2 written
        ...
        file 8 written
        cleanup_ref_head: num_bytes=8192
        ...
        btrfs_commit_transaction  <<< the only transaction committed during
      				the test
      
      When file 2 is written, we have already freed 128M reserved qgroup data
      space for ino 258. Thus later write won't trigger EDQUOT.
      
      This allows us to write more data beyond qgroup limit.
      
      In my 2G RAM VM, it could reach about 1.2G before hitting EDQUOT.
      
      [FIX]
      By moving reserved qgroup data space from btrfs_delayed_ref_head to
      btrfs_qgroup_extent_record, we can ensure that reserved qgroup data
      space won't be freed half way before commit transaction, thus fix the
      problem.
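
The effect of freeing the reservation too early can be reproduced with a small toy simulation. It is a sketch under stated assumptions: the 16K initial usage is taken from the difference visible in the 'qgroup show' output above, no transaction commit happens between writes, and the toy_* names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_SZ_128M (128ULL << 20)
#define TOY_SZ_1G   (1ULL << 30)

/*
 * Count how many of @nr_writes 128M writes pass the limit check
 *   reserved + excl + new_write <= limit
 * when no transaction commit happens in between, so @excl never changes.
 * @hold_until_commit selects whether the reservation of a finished write
 * is kept (the fixed behavior) or dropped right away (the buggy behavior).
 */
int toy_count_accepted_writes(uint64_t limit, uint64_t excl, int nr_writes,
			      bool hold_until_commit)
{
	uint64_t reserved = 0;
	int accepted = 0;

	for (int i = 0; i < nr_writes; i++) {
		if (reserved + excl + TOY_SZ_128M > limit)
			break;	/* this write would get EDQUOT */
		accepted++;
		if (hold_until_commit)
			reserved += TOY_SZ_128M;
	}
	return accepted;
}
```

With a 1G limit and 16K already accounted, dropping the reservation early lets all 8 writes through, while holding it until commit rejects the 8th write, which matches the expectation of the reproducer.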
      
      Fixes: f64d5ca8 ("btrfs: delayed_ref: Add new function to record reserved space into delayed ref")
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: Remove duplicated trace points for qgroup_rsv_add/release · c8f72b98
      Qu Wenruo authored
      Inside qgroup_rsv_add/release(), we have trace events
      trace_qgroup_update_reserve() to catch reserved space update.
      
      However we still have two manual trace_qgroup_update_reserve() calls
      just outside these functions.  Remove these duplicated calls.
      
      Fixes: 64ee4e75 ("btrfs: qgroup: Update trace events to use new separate rsv types")
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: replace btrfs_set_lock_blocking_rw with appropriate helpers · 300aa896
      David Sterba authored
      We can use the right helper where the lock type is a fixed parameter.

      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: Cleanup old subtree swap code · 9627736b
      Qu Wenruo authored
      Since it's replaced by new delayed subtree swap code, remove the
      original code.
      
      The cleanup is small since most of its core function is still used by
      delayed subtree swap trace.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9627736b
    • Qu Wenruo's avatar
      btrfs: qgroup: Use delayed subtree rescan for balance · f616f5cd
      Qu Wenruo authored
      
      
      Before this patch, qgroup code traces the whole subtree of subvolume and
      reloc trees unconditionally.
      
      This makes qgroup numbers consistent, but it could cause tons of
      unnecessary extent tracing, which causes a lot of overhead.
      
      However for subtree swap of balance, just swap both subtrees because
      they contain the same contents and tree structure, so qgroup numbers
      won't change.
      
      It's the race window between subtree swap and transaction commit that
      could cause qgroup numbers to change.
      
      This patch will delay the qgroup subtree scan until COW happens for the
      subtree root.
      
      So if there are no other operations on the fs, balance won't cause extra
      qgroup overhead. (best case scenario)
      Depending on the workload, most of the subtree scan can still be
      avoided.
      
      Only for worst case scenario, it will fall back to old subtree swap
      overhead. (scan all swapped subtrees)
      
      [[Benchmark]]
      Hardware:
      	VM 4G vRAM, 8 vCPUs,
      	disk is using 'unsafe' cache mode,
      	backing device is SAMSUNG 850 evo SSD.
      	Host has 16G ram.
      
      Mkfs parameter:
      	--nodesize 4K (To bump up tree size)
      
      Initial subvolume contents:
      	4G data copied from /usr and /lib.
      	(With enough regular small files)
      
      Snapshots:
      	16 snapshots of the original subvolume.
      	each snapshot has 3 random files modified.
      
      balance parameter:
      	-m
      
      So the content should be pretty similar to a real world root fs layout.
      
      And after file system population, there is no other activity, so it
      should be the best case scenario.
      
                           | v4.20-rc1            | w/ patchset    | diff
      -----------------------------------------------------------------------
      relocated extents    | 22615                | 22457          | -0.1%
      qgroup dirty extents | 163457               | 121606         | -25.6%
      time (sys)           | 22.884s              | 18.842s        | -17.6%
      time (real)          | 27.724s              | 22.884s        | -17.5%
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f616f5cd
    • Qu Wenruo's avatar
      btrfs: qgroup: Introduce per-root swapped blocks infrastructure · 370a11b8
      Qu Wenruo authored
      
      
      To allow delayed subtree swap rescan, btrfs needs to record per-root
      information about which tree blocks get swapped.  This patch introduces
      the required infrastructure.
      
      The designed workflow will be:
      
      1) Record the subtree root block that gets swapped.
      
         During subtree swap:
         O = Old tree blocks
         N = New tree blocks
               reloc tree                         subvolume tree X
                  Root                               Root
                 /    \                             /    \
               NA     OB                          OA      OB
             /  |     |  \                      /  |      |  \
           NC  ND     OE  OF                   OC  OD     OE  OF
      
        In this case, NA and OA are going to be swapped, record (NA, OA) into
        subvolume tree X.
      
      2) After subtree swap.
               reloc tree                         subvolume tree X
                  Root                               Root
                 /    \                             /    \
               OA     OB                          NA      OB
             /  |     |  \                      /  |      |  \
           OC  OD     OE  OF                   NC  ND     OE  OF
      
      3a) COW happens for OB
          If we are going to COW tree block OB, we check OB's bytenr against
          tree X's swapped_blocks structure.
          If it doesn't fit any, nothing will happen.
      
      3b) COW happens for NA
          Check NA's bytenr against tree X's swapped_blocks, and get a hit.
          Then we do subtree scan on both subtrees OA and NA.
          Resulting 6 tree blocks to be scanned (OA, OC, OD, NA, NC, ND).
      
          Then no matter what we do to subvolume tree X, qgroup numbers will
          still be correct.
          Then NA's record gets removed from X's swapped_blocks.
      
      4)  Transaction commit
          Any record in X's swapped_blocks gets removed, since there is no
          modification to swapped subtrees, no need to trigger heavy qgroup
          subtree rescan for them.
      
      This will introduce 128 bytes of overhead for each btrfs_root even if
      qgroup is not enabled. This is to reduce memory allocations and potential
      failures.
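
      The workflow above can be sketched as a small user-space model. This is
      only an illustration under stated assumptions: the toy_* names, the
      fixed-size table and its capacity are hypothetical and are not the
      kernel's data structures or API. It shows recording a swapped pair
      (step 1), the miss and hit cases on COW (steps 3a and 3b), and dropping
      leftover records at commit (step 4).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_SWAPPED 16	/* arbitrary capacity chosen for this sketch */

/* Hypothetical record of one swapped pair of subtree roots. */
struct toy_swapped_block {
	uint64_t subvol_bytenr;	/* block now living in the subvolume tree (NA) */
	uint64_t reloc_bytenr;	/* block now living in the reloc tree (OA) */
	int in_use;
};

/* Hypothetical per-root table, standing in for swapped_blocks. */
struct toy_root {
	struct toy_swapped_block swapped[TOY_MAX_SWAPPED];
};

/* Step 1: record the pair of subtree roots that gets swapped. */
int toy_record_swap(struct toy_root *root, uint64_t subvol_bytenr,
		    uint64_t reloc_bytenr)
{
	assert(root != NULL);
	for (size_t i = 0; i < TOY_MAX_SWAPPED; i++) {
		struct toy_swapped_block *rec = &root->swapped[i];

		if (!rec->in_use) {
			rec->subvol_bytenr = subvol_bytenr;
			rec->reloc_bytenr = reloc_bytenr;
			rec->in_use = 1;
			return 0;
		}
	}
	return -1;	/* table full */
}

/*
 * Steps 3a/3b: check a block that is about to be COWed.
 * Miss: return 0, nothing to do.
 * Hit: store the partner block in *reloc_bytenr, drop the record and
 * return 1; the caller would then scan both subtrees.
 */
int toy_cow_check(struct toy_root *root, uint64_t bytenr,
		  uint64_t *reloc_bytenr)
{
	assert(root != NULL && reloc_bytenr != NULL);
	for (size_t i = 0; i < TOY_MAX_SWAPPED; i++) {
		struct toy_swapped_block *rec = &root->swapped[i];

		if (rec->in_use && rec->subvol_bytenr == bytenr) {
			*reloc_bytenr = rec->reloc_bytenr;
			rec->in_use = 0;
			return 1;
		}
	}
	return 0;
}

/* Step 4: at transaction commit, drop all remaining records. */
size_t toy_commit_clear(struct toy_root *root)
{
	size_t dropped = 0;

	assert(root != NULL);
	for (size_t i = 0; i < TOY_MAX_SWAPPED; i++) {
		if (root->swapped[i].in_use) {
			root->swapped[i].in_use = 0;
			dropped++;
		}
	}
	return dropped;
}
```

      In this model a record is consumed by the first COW that hits it, so
      the expensive scan of both subtrees happens at most once per swap, and
      records that are never hit cost nothing beyond the cleanup at commit.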
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      370a11b8
    • Qu Wenruo's avatar
      btrfs: qgroup: Refactor btrfs_qgroup_trace_subtree_swap · 5aea1a4f
      Qu Wenruo authored
      
      
      Refactor btrfs_qgroup_trace_subtree_swap() into
      qgroup_trace_subtree_swap(), which only needs two extent buffers and some
      other bools to control the behavior.
      
      This provides the basis for later delayed subtree scan work.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5aea1a4f
  13. 17 Dec, 2018 3 commits
    • Andrea Gelmini's avatar
      btrfs: Fix typos in comments and strings · 52042d8e
      Andrea Gelmini authored
      
      
      The typos accumulate over time so once in a while they get fixed in
      a large patch.
      Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      52042d8e
    • Filipe Manana's avatar
      Btrfs: fix deadlock when enabling quotas due to concurrent snapshot creation · 9a6f209e
      Filipe Manana authored
      If the quota enable and snapshot creation ioctls are called concurrently
      we can get into a deadlock where the task enabling quotas will deadlock
      on the fs_info->qgroup_ioctl_lock mutex because it attempts to lock it
      twice, or the task creating a snapshot tries to commit the transaction
      while the task enabling quota waits for the former task to commit the
      transaction while holding the mutex. The following time diagrams show how
      both cases happen.
      
      First scenario:
      
                 CPU 0                                    CPU 1
      
       btrfs_ioctl()
        btrfs_ioctl_quota_ctl()
         btrfs_quota_enable()
          mutex_lock(fs_info->qgroup_ioctl_lock)
          btrfs_start_transaction()
      
                                                   btrfs_ioctl()
                                                    btrfs_ioctl_snap_create_v2
                                                     create_snapshot()
                                                      --> adds snapshot to the
                                                          list pending_snapshots
                                                          of the current
                                                          transaction
      
          btrfs_commit_transaction()
           create_pending_snapshots()
             create_pending_snapshot()
              qgroup_account_snapshot()
               btrfs_qgroup_inherit()
      	   mutex_lock(fs_info->qgroup_ioctl_lock)
      	    --> deadlock, mutex already locked
      	        by this task at
      		btrfs_quota_enable()
      
      Second scenario:
      
                 CPU 0                                    CPU 1
      
       btrfs_ioctl()
        btrfs_ioctl_quota_ctl()
         btrfs_quota_enable()
          mutex_lock(fs_info->qgroup_ioctl_lock)
          btrfs_start_transaction()
      
                                                   btrfs_ioctl()
                                                    btrfs_ioctl_snap_create_v2
                                                     create_snapshot()
                                                      --> adds snapshot to the
                                                          list pending_snapshots
                                                          of the current
                                                          transaction
      
                                                      btrfs_commit_transaction()
                                                       --> waits for task at
                                                           CPU 0 to release
                                                           its transaction
                                                           handle
      
          btrfs_commit_transaction()
           --> sees another task started
               the transaction commit first
           --> releases its transaction
               handle
           --> waits for the transaction
               commit to be completed by
               the task at CPU 1
      
                                                       create_pending_snapshot()
                                                        qgroup_account_snapshot()
                                                         btrfs_qgroup_inherit()
                                                          mutex_lock(fs_info->qgroup_ioctl_lock)
                                                           --> deadlock, task at CPU 0
                                                               has the mutex locked but
                                                               it is waiting for us to
                                                               finish the transaction
                                                               commit
      
      So fix this by setting the quota enabled flag in fs_info after committing
      the transaction at btrfs_quota_enable(). This ends up serializing quota
      enable and snapshot creation as if the snapshot creation happened just
      before the quota enable request. The quota rescan task, scheduled after
      committing the transaction in btrfs_quota_enable(), will do the accounting.
      
      Fixes: 6426c7ad ("btrfs: qgroup: Fix qgroup accounting when creating snapshot")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9a6f209e
    • Nikolay Borisov's avatar
      btrfs: Remove unnecessary tree locking code in qgroup_rescan_leaf · df449714
      Nikolay Borisov authored
      
      
      In qgroup_rescan_leaf a copy is made of the target leaf by calling
      btrfs_clone_extent_buffer. The latter allocates a new buffer and
      attaches a new set of pages and copies the content of the source buffer.
      The new scratch buffer is only used to iterate its items, it's not
      published anywhere and cannot be accessed by a third party.
      
      Hence, it's not necessary to perform any locking on it whatsoever.
      Furthermore, remove the extra extent_buffer_get call since the new
      buffer is always allocated with a reference count of 1 which is
      sufficient here.  No functional changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      df449714
  14. 22 Nov, 2018 1 commit
    • Filipe Manana's avatar
      Btrfs: fix race between enabling quotas and subvolume creation · 552f0329
      Filipe Manana authored
      
      
      We have a race between enabling quotas and subvolume creation that causes
      subvolume creation to fail with -EINVAL, and the following diagram shows
      how it happens:
      
                    CPU 0                                          CPU 1
      
       btrfs_ioctl()
        btrfs_ioctl_quota_ctl()
         btrfs_quota_enable()
          mutex_lock(fs_info->qgroup_ioctl_lock)
      
                                                        btrfs_ioctl()
                                                         create_subvol()
                                                          btrfs_qgroup_inherit()
                                                           -> save fs_info->quota_root
                                                              into quota_root
                                                           -> stores a NULL value
                                                           -> tries to lock the mutex
                                                              qgroup_ioctl_lock
                                                              -> blocks waiting for
                                                                 the task at CPU0
      
         -> sets BTRFS_FS_QUOTA_ENABLED in fs_info
         -> sets quota_root in fs_info->quota_root
            (non-NULL value)
      
         mutex_unlock(fs_info->qgroup_ioctl_lock)
      
                                                           -> checks quota enabled
                                                              flag is set
                                                           -> returns -EINVAL because
                                                              fs_info->quota_root was
                                                              NULL before it acquired
                                                              the mutex
                                                              qgroup_ioctl_lock
                                                         -> ioctl returns -EINVAL
      
      Returning -EINVAL to user space will be confusing if all the arguments
      passed to the subvolume creation ioctl were valid.
      
      Fix it by grabbing the value from fs_info->quota_root after acquiring
      the mutex.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      552f0329
  15. 15 Oct, 2018 9 commits
    • Lu Fengqi's avatar
      btrfs: qgroup: move the qgroup->members check out from (!qgroup)'s else branch · b90e22ba
      Lu Fengqi authored
      
      
      There is no reason to put this check in (!qgroup)'s else branch because
      if qgroup is null, it will goto out directly. So move it out to reduce
      indentation level.  No functional change.
      Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b90e22ba
    • Qu Wenruo's avatar
      btrfs: qgroup: Avoid calling qgroup functions if qgroup is not enabled · 3628b4ca
      Qu Wenruo authored
      
      
      Some qgroup trace events like btrfs_qgroup_release_data() and
      btrfs_qgroup_free_delayed_ref() can still be triggered even if qgroup is
      not enabled.
      
      This is caused by the lack of qgroup status check before calling some
      qgroup functions.  Thankfully the functions can handle quota disabled
      case well and just do nothing for qgroup disabled case.
      
      This patch will do earlier check before triggering related trace events.
      
      And for enabled <-> disabled race case:
      
      1) For enabled->disabled case
         Disable will wipe out all qgroups data including reservation and
         excl/rfer. Even if we leak some reservation or numbers, it will
         still be cleared, so nothing will go wrong.
      
      2) For disabled -> enabled case
         Current btrfs_qgroup_release_data() will use extent_io tree to ensure
         we won't underflow reservation. And for delayed_ref we use
         head->qgroup_reserved to record the reserved space, so in that case
         head->qgroup_reserved should be 0 and we won't underflow.
      
      CC: stable@vger.kernel.org # 4.14+
      Reported-by: Chris Murphy <lists@colorremedies.com>
      Link: https://lore.kernel.org/linux-btrfs/CAJCQCtQau7DtuUUeycCkZ36qjbKuxNzsgqJ7+sJ6W0dK_NLE3w@mail.gmail.com/
      
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3628b4ca
    • Qu Wenruo's avatar
      btrfs: qgroup: Only trace data extents in leaves if we're relocating data block group · 3d0174f7
      Qu Wenruo authored
      
      
      For qgroup_trace_extent_swap(), if we find one leaf that needs to be
      traced, we will also iterate all file extents and trace them.
      
      This is OK if we're relocating data block groups, but if we're
      relocating metadata block groups, balance code itself has ensured that
      both subtrees of file tree and reloc tree contain the same contents.
      
      That's to say, if we're relocating metadata block groups, all file
      extents in reloc and file tree should match, thus no need to trace them.
      This should reduce the total number of dirty extents processed in metadata
      block group balance.
      
      [[Benchmark]] (with all previous enhancement)
      Hardware:
      	VM 4G vRAM, 8 vCPUs,
      	disk is using 'unsafe' cache mode,
      	backing device is SAMSUNG 850 evo SSD.
      	Host has 16G ram.
      
      Mkfs parameter:
      	--nodesize 4K (To bump up tree size)
      
      Initial subvolume contents:
      	4G data copied from /usr and /lib.
      	(With enough regular small files)
      
      Snapshots:
      	16 snapshots of the original subvolume.
      	each snapshot has 3 random files modified.
      
      balance parameter:
      	-m
      
      So the content should be pretty similar to a real world root fs layout.
      
                           | v4.19-rc1    | w/ patchset    | diff (*)
      ---------------------------------------------------------------
      relocated extents    | 22929        | 22851          | -0.3%
      qgroup dirty extents | 227757       | 140886         | -38.1%
      time (sys)           | 65.253s      | 37.464s        | -42.6%
      time (real)          | 74.032s      | 44.722s        | -39.6%
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3d0174f7
    • Qu Wenruo's avatar
      btrfs: qgroup: Use generation-aware subtree swap to mark dirty extents · 5f527822
      Qu Wenruo authored
      
      
      Before this patch, with quota enabled during balance, we need to mark
      the whole subtree dirty for quota.
      
      E.g.
      OO = Old tree blocks (from file tree)
      NN = New tree blocks (from reloc tree)
      
              File tree (src)		          Reloc tree (dst)
                  OO (a)                              NN (a)
                 /  \                                /  \
           (b) OO    OO (c)                    (b) NN    NN (c)
              /  \  /  \                          /  \  /  \
             OO  OO OO OO (d)                    OO  OO OO NN (d)
      
      For old balance + quota case, quota will mark the whole src and dst tree
      dirty, including all the 3 old tree blocks in reloc tree.
      
      It's doable for a small file tree, or when the new tree blocks are all
      located at lower levels.

      But for a large file tree, or when the new tree blocks are all located at
      higher levels, this will lead to marking the whole tree dirty, and be
      unbelievably slow.
      
      This patch will change how we handle such balance with quota enabled
      case.
      
      Now we will search from (b) and (c) for any new tree blocks whose
      generation is equal to @last_snapshot, and only mark them dirty.
      
      In above case, we only need to trace tree blocks NN(b), NN(c) and NN(d).
      (NN(a) will be traced when COW happens for nodeptr modification).  And
      also for tree blocks OO(b), OO(c), OO(d). (OO(a) will be traced when COW
      happens for nodeptr modification.)
      
      For above case, we could skip 3 tree blocks, but for larger tree, we can
      skip tons of unmodified tree blocks, and hugely speed up balance.
      
      This patch will introduce a new function,
      btrfs_qgroup_trace_subtree_swap(), which will do the following main
      work:
      
      1) Read out real root eb
         And setup basic dst_path for later calls
      2) Call qgroup_trace_new_subtree_blocks()
         To trace all new tree blocks in reloc tree and their counterparts
         in the file tree.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5f527822
    • Qu Wenruo's avatar
      btrfs: qgroup: Introduce function to find all new tree blocks of reloc tree · ea49f3e7
      Qu Wenruo authored
      
      
      Introduce new function, qgroup_trace_new_subtree_blocks(), to iterate
      all new tree blocks in a reloc tree.
      So that qgroup could skip unrelated tree blocks during balance, which
      should hugely speed up balance when quota is enabled.
      
      The function qgroup_trace_new_subtree_blocks() itself only cares about
      new tree blocks in reloc tree.
      
      Its main steps are:
      
      1) Read out tree blocks according to parent pointers
      
      2) Do recursive depth-first search
         Will call the same function on all its children tree blocks, with
         search level set to current level -1.
         And will also skip all children whose generation is smaller than
         @last_snapshot.
      
      3) Call qgroup_trace_extent_swap() to trace tree blocks
      
      So although we have parameter list related to source file tree, it's not
      used at all, but only passed to qgroup_trace_extent_swap().
      Thus despite the tree read code, the core should be pretty short and all
      about recursive depth-first search.
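
      The generation-based pruning in step 2 can be illustrated with a toy
      in-memory tree. This is a sketch under stated assumptions, not the
      kernel function: struct toy_block, toy_trace_new_blocks() and the
      fan-out of 2 are hypothetical, child pointers replace reading tree
      blocks from disk, and counting a block stands in for the call to
      qgroup_trace_extent_swap().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_FANOUT 2	/* arbitrary fan-out for this sketch */

/* Hypothetical in-memory tree block. */
struct toy_block {
	uint64_t generation;	/* transaction that last wrote this block */
	int level;		/* 0 = leaf */
	struct toy_block *child[TOY_FANOUT];
};

/*
 * Depth-first walk that only descends into children whose generation is
 * not smaller than last_snapshot.  The caller is expected to pass a block
 * that is itself new.  Returns the number of blocks "traced".
 */
size_t toy_trace_new_blocks(const struct toy_block *blk,
			    uint64_t last_snapshot)
{
	size_t traced = 0;

	if (blk == NULL)
		return 0;

	if (blk->level > 0) {
		for (size_t i = 0; i < TOY_FANOUT; i++) {
			const struct toy_block *child = blk->child[i];

			/* Old blocks are shared and unchanged: skip them. */
			if (child == NULL || child->generation < last_snapshot)
				continue;
			/* Search level drops by one at each step down. */
			assert(child->level == blk->level - 1);
			traced += toy_trace_new_blocks(child, last_snapshot);
		}
	}

	/* Stand-in for tracing this block and its counterpart. */
	return traced + 1;
}
```

      Skipping an old child also skips its entire subtree, which is where
      the saving comes from on large trees.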
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ea49f3e7
    • Qu Wenruo's avatar
      btrfs: qgroup: Introduce function to trace two swaped extents · 25982561
      Qu Wenruo authored
      
      
      Introduce a new function, qgroup_trace_extent_swap(), which will be used
      later for balance qgroup speedup.
      
      The basic idea of balance is swapping tree blocks between reloc tree and
      the real file tree.
      
      The swap will happen in highest tree block, but there may be a lot of
      tree blocks involved.
      
      For example:
       OO = Old tree blocks
       NN = New tree blocks allocated during balance
      
                File tree (257)                  Reloc tree for 257
      L2              OO                                NN
                    /    \                            /    \
      L1          OO      OO (a)                    OO      NN (a)
                 / \     / \                       / \     / \
      L0       OO   OO OO   OO                   OO   OO NN   NN
                       (b)  (c)                          (b)  (c)
      
      When calling qgroup_trace_extent_swap(), we will pass:
      @src_eb = OO(a)
      @dst_path = [ nodes[1] = NN(a), nodes[0] = NN(c) ]
      @dst_level = 0
      @root_level = 1
      
      In that case, qgroup_trace_extent_swap() will search from OO(a) to
      reach OO(c), then mark both OO(c) and NN(c) as qgroup dirty.
      
      The main work of qgroup_trace_extent_swap() can be split into 3 parts:
      
      1) Tree search from @src_eb
         It should act as a simplified btrfs_search_slot().
         The key for search can be extracted from @dst_path->nodes[dst_level]
         (first key).
      
      2) Mark the final tree blocks in @src_path and @dst_path qgroup dirty
         NOTE: In above case, OO(a) and NN(a) won't be marked qgroup dirty.
         They should be marked during previous (@dst_level = 1) iteration.
      
      3) Mark file extents in leaves dirty
         We don't have a good way to pick out new file extents only.
         So we still follow the old method by scanning all file extents in
         the leaf.
      
      This function can free us from keeping two paths, thus later we only need
      to care about how to iterate all new tree blocks in reloc tree.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      [ copy changelog to function comment ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      25982561
    • Qu Wenruo's avatar
      btrfs: qgroup: Introduce trace event to analyse the number of dirty extents accounted · c337e7b0
      Qu Wenruo authored
      
      
      Number of qgroup dirty extents is directly linked to the performance
      overhead, so add a new trace event, trace_qgroup_num_dirty_extents(), to
      record how many dirty extents are processed in
      btrfs_qgroup_account_extents().
      
      This will be pretty handy to analyze later balance performance
      improvement.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c337e7b0
    • Qu Wenruo's avatar
      btrfs: qgroup: Dirty all qgroups before rescan · 9c7b0c2e
      Qu Wenruo authored
      
      
      [BUG]
      In the following case, rescan won't zero out the number of qgroup 1/0:
      
        $ mkfs.btrfs -fq $DEV
        $ mount $DEV /mnt
      
        $ btrfs quota enable /mnt
        $ btrfs qgroup create 1/0 /mnt
        $ btrfs sub create /mnt/sub
        $ btrfs qgroup assign 0/257 1/0 /mnt
      
        $ dd if=/dev/urandom of=/mnt/sub/file bs=1k count=1000
        $ btrfs sub snap /mnt/sub /mnt/snap
        $ btrfs quota rescan -w /mnt
        $ btrfs qgroup show -pcre /mnt
        qgroupid         rfer         excl     max_rfer     max_excl parent  child
        --------         ----         ----     --------     -------- ------  -----
        0/5          16.00KiB     16.00KiB         none         none ---     ---
        0/257      1016.00KiB     16.00KiB         none         none 1/0     ---
        0/258      1016.00KiB     16.00KiB         none         none ---     ---
        1/0        1016.00KiB     16.00KiB         none         none ---     0/257
      
      So far so good, but:
      
        $ btrfs qgroup remove 0/257 1/0 /mnt
        WARNING: quotas may be inconsistent, rescan needed
        $ btrfs quota rescan -w /mnt
        $ btrfs qgroup show -pcre  /mnt
        qgroupid         rfer         excl     max_rfer     max_excl parent  child
        --------         ----         ----     --------     -------- ------  -----
        0/5          16.00KiB     16.00KiB         none         none ---     ---
        0/257      1016.00KiB     16.00KiB         none         none ---     ---
        0/258      1016.00KiB     16.00KiB         none         none ---     ---
        1/0        1016.00KiB     16.00KiB         none         none ---     ---
      	     ^^^^^^^^^^     ^^^^^^^^ not cleared
      
      [CAUSE]
      Before rescan we call qgroup_rescan_zero_tracking() to zero out all
      qgroups' accounting numbers.
      
      However we don't mark all qgroups dirty, but rely on rescan to do so.
      
      If we have any high level qgroup without children, it won't be marked
      dirty during rescan, since we cannot reach that qgroup.
      
      This will cause QGROUP_INFO items of childless qgroups to never get updated
      in the quota tree, thus their numbers will stay the same in "btrfs
      qgroup show" output.
      
      [FIX]
      Just mark all qgroups dirty in qgroup_rescan_zero_tracking(), so even if
      we have childless qgroups, their QGROUP_INFO items will still get
      updated during rescan.
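
      The effect of the fix can be modelled in a few lines of user-space C.
      This is a simplified sketch, not the kernel code: struct toy_qgroup,
      the disk_* fields (standing in for the QGROUP_INFO item) and the
      mark_dirty switch are hypothetical and exist only to contrast the old
      and new behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical qgroup with in-memory numbers and an "on-disk" copy. */
struct toy_qgroup {
	uint64_t rfer;		/* in-memory referenced bytes */
	uint64_t excl;		/* in-memory exclusive bytes */
	uint64_t disk_rfer;	/* stand-in for the QGROUP_INFO item */
	uint64_t disk_excl;
	int dirty;		/* needs to be written back */
};

/*
 * Zero the in-memory numbers before a rescan.
 * mark_dirty == 0 models the old behaviour, mark_dirty != 0 the fix.
 */
void toy_rescan_zero_tracking(struct toy_qgroup *qgroups, size_t count,
			      int mark_dirty)
{
	assert(qgroups != NULL || count == 0);
	for (size_t i = 0; i < count; i++) {
		qgroups[i].rfer = 0;
		qgroups[i].excl = 0;
		if (mark_dirty)
			qgroups[i].dirty = 1;
	}
}

/* Write back every dirty qgroup; returns how many were written. */
size_t toy_flush_dirty(struct toy_qgroup *qgroups, size_t count)
{
	size_t written = 0;

	assert(qgroups != NULL || count == 0);
	for (size_t i = 0; i < count; i++) {
		if (!qgroups[i].dirty)
			continue;
		qgroups[i].disk_rfer = qgroups[i].rfer;
		qgroups[i].disk_excl = qgroups[i].excl;
		qgroups[i].dirty = 0;
		written++;
	}
	return written;
}
```

      A childless qgroup is never touched by the rescan itself, so in this
      model its on-disk numbers only change if zeroing also marks it dirty.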
      Reported-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
      Tested-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9c7b0c2e
    • Misono Tomohiro's avatar
      btrfs: Remove 'objectid' member from struct btrfs_root · 4fd786e6
      Misono Tomohiro authored
      
      
      There are two members in struct btrfs_root which indicate root's
      objectid: objectid and root_key.objectid.
      
      They are both set to the same value in __setup_root():
      
        static void __setup_root(struct btrfs_root *root,
                                 struct btrfs_fs_info *fs_info,
                                 u64 objectid)
        {
          ...
          root->objectid = objectid;
          ...
          root->root_key.objectid = objectid;
          ...
        }
      
      and not changed to other value after initialization.
      
      grep in btrfs directory shows both are used in many places:
        $ grep -rI "root->root_key.objectid" | wc -l
        133
        $ grep -rI "root->objectid" | wc -l
        55
       (4.17, inc. some noise)
      
      It is confusing to have two similar variable names and it seems
      that there is no rule about which should be used in a certain case.
      
      Since ->root_key itself is needed for tree reloc tree, let's remove
      'objectid' member and unify code to use ->root_key.objectid in all places.
      Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4fd786e6
  16. 23 Aug, 2018 1 commit
  17. 06 Aug, 2018 1 commit