1. 13 Dec, 2019 1 commit
  2. 18 Nov, 2019 7 commits
    • btrfs: remove extent_map::bdev · a019e9e1
      David Sterba authored
      
      
      We can now remove the bdev from extent_map. Previous patches made sure
      that bio_set_dev is called correctly in all places and that we don't need to
      grab it from latest_bdev or pass it around inside the extent map.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: Don't check free space before marking a block group RO · b12de528
      Qu Wenruo authored
      
      
      [BUG]
      When running btrfs/072 with only one online CPU, it has a pretty high
      chance to fail:
      
        btrfs/072 12s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//btrfs/072.dmesg)
        - output mismatch (see xfstests-dev/results//btrfs/072.out.bad)
            --- tests/btrfs/072.out     2019-10-22 15:18:14.008965340 +0800
            +++ /xfstests-dev/results//btrfs/072.out.bad      2019-11-14 15:56:45.877152240 +0800
            @@ -1,2 +1,3 @@
             QA output created by 072
             Silence is golden
            +Scrub find errors in "-m dup -d single" test
            ...
      
      And with the following call trace:
      
        BTRFS info (device dm-5): scrub: started on devid 1
        ------------[ cut here ]------------
        BTRFS: Transaction aborted (error -27)
        WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
        CPU: 0 PID: 55087 Comm: btrfs Tainted: G        W  O      5.4.0-rc1-custom+ #13
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
        Call Trace:
         __btrfs_end_transaction+0xdb/0x310 [btrfs]
         btrfs_end_transaction+0x10/0x20 [btrfs]
         btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs]
         scrub_enumerate_chunks+0x264/0x940 [btrfs]
         btrfs_scrub_dev+0x45c/0x8f0 [btrfs]
         btrfs_ioctl+0x31a1/0x3fb0 [btrfs]
         do_vfs_ioctl+0x636/0xaa0
         ksys_ioctl+0x67/0x90
         __x64_sys_ioctl+0x43/0x50
         do_syscall_64+0x79/0xe0
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        ---[ end trace 166c865cec7688e7 ]---
      
      [CAUSE]
      The error number -27 is -EFBIG, returned from the following call chain:
      btrfs_end_transaction()
      |- __btrfs_end_transaction()
         |- btrfs_create_pending_block_groups()
            |- btrfs_finish_chunk_alloc()
               |- btrfs_add_system_chunk()
      
      This happens because we have used up all space of
      btrfs_super_block::sys_chunk_array.
      
      The root cause is, we have the following bad loop of creating tons of
      system chunks:
      
      1. The only SYSTEM chunk is being scrubbed
         It's very common to have only one SYSTEM chunk.
      2. New SYSTEM bg will be allocated
         As btrfs_inc_block_group_ro() will check if we have enough space
         after marking current bg RO. If not, then allocate a new chunk.
      3. New SYSTEM bg is still empty, will be reclaimed
         During the reclaim, we will mark it RO again.
      4. That newly allocated empty SYSTEM bg gets scrubbed
         We go back to step 2, as the bg is already marked RO but still not
         cleaned up yet.
      
      If the cleaner kthread doesn't get executed fast enough (e.g. only one
      CPU), then we will get more and more empty SYSTEM chunks, using up all
      the space of btrfs_super_block::sys_chunk_array.
      
      [FIX]
      Since scrub/dev-replace doesn't always need to allocate new extents,
      especially chunk tree extents, we don't really need to do chunk
      pre-allocation.
      
      To break the above spiral, we introduce a new parameter to
      btrfs_inc_block_group_ro(), @do_chunk_alloc, which indicates whether we
      need extra chunk pre-allocation.
      
      For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass
      @do_chunk_alloc=false.
      This should keep unnecessary empty chunks from popping up for scrub.
      
      Also, since there are two parameters for btrfs_inc_block_group_ro(),
      add more comment for it.
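
The loop described under [CAUSE] and the effect of @do_chunk_alloc can be sketched with a toy model. This is only an illustration under stated assumptions: toy_fs, toy_inc_block_group_ro(), toy_scrub() and the slot count are hypothetical names and values, not kernel code, and the cleaner kthread is assumed never to run.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the fixed-size btrfs_super_block::sys_chunk_array. */
#define TOY_SYS_ARRAY_SLOTS 8
#define TOY_EFBIG 27

struct toy_fs {
	int system_chunks;	/* SYSTEM chunks currently recorded */
};

/*
 * Toy model of marking one SYSTEM block group read-only.
 * With do_chunk_alloc set, a new SYSTEM chunk is pre-allocated each time,
 * which fails once the toy array is full.
 */
static int toy_inc_block_group_ro(struct toy_fs *fs, bool do_chunk_alloc)
{
	if (!do_chunk_alloc)
		return 0;
	if (fs->system_chunks >= TOY_SYS_ARRAY_SLOTS)
		return -TOY_EFBIG;
	fs->system_chunks++;
	return 0;
}

/*
 * Toy scrub: visit every SYSTEM chunk, including ones created while
 * scrubbing. Empty chunks are never reclaimed in this model.
 */
static int toy_scrub(struct toy_fs *fs, bool do_chunk_alloc)
{
	for (int scrubbed = 0; scrubbed < fs->system_chunks; scrubbed++) {
		int ret = toy_inc_block_group_ro(fs, do_chunk_alloc);

		if (ret)
			return ret;
	}
	return 0;
}
```

In this model, scrubbing with do_chunk_alloc=true keeps creating chunks until the array is full and the call fails, while do_chunk_alloc=false visits the single existing chunk and stops.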
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename btrfs_block_group_cache · 32da5386
      David Sterba authored
      
      
      The type name is misleading, a single entry is named 'cache' while this
      normally means a collection of objects. Rename that everywhere. Also the
      identifier was quite long, making function prototypes harder to format.
      Suggested-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add dedicated members for start and length of a block group · b3470b5d
      David Sterba authored
      
      
      The on-disk format of block group item makes use of the key that stores
      the offset and length. This is further used in the code, although this
      makes things harder to understand. The key is also packed so the
      offset/length is not properly aligned as u64.
      
      Add start (key.objectid) and length (key.offset) members to block group
      and remove the embedded key.  When the item is searched or written, a
      local variable for key is used.
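
A rough sketch of the layout change, assuming simplified toy types (toy_key, toy_block_group and toy_bg_key() are illustrative names, not the kernel definitions): the in-memory structure carries naturally aligned start and length members, and a packed key is only built in a local variable when an item has to be searched or written.

```c
#include <assert.h>
#include <stdint.h>

/* Toy value standing in for the block group item key type. */
#define TOY_BLOCK_GROUP_ITEM_KEY 192

/* Toy packed key: the u64 offset ends up at an unaligned position. */
struct toy_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
} __attribute__((packed));

/* In-memory block group with dedicated, naturally aligned members. */
struct toy_block_group {
	uint64_t start;		/* was key.objectid */
	uint64_t length;	/* was key.offset */
};

/* Build the key in a local variable only when it is needed. */
static struct toy_key toy_bg_key(const struct toy_block_group *bg)
{
	struct toy_key key = {
		.objectid = bg->start,
		.type = TOY_BLOCK_GROUP_ITEM_KEY,
		.offset = bg->length,
	};

	return key;
}
```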
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move block_group_item::used to block group · bf38be65
      David Sterba authored
      
      
      For unknown reasons, the member 'used' in the block group struct is
      stored in the b-tree item and accessed everywhere using the special
      accessor helper. Let's unify it and make it a regular member and only
      update the item before writing it to the tree.
      
      The item is still being used for flags and chunk_objectid, there's some
      duplication until the item is removed in following patches.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: opencode extent_buffer_get · 67439dad
      David Sterba authored
      
      
      The helper is trivial and we can understand what the atomic_inc on
      something named refs does.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: drop unused parameter is_new from btrfs_iget · 4c66e0d4
      David Sterba authored
      The parameter is now always set to NULL and could be dropped. The last
      user was get_default_root but that got reworked in 05dbe683 ("Btrfs:
      unify subvol= and subvolid= mounting") and the parameter became unused.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 15 Oct, 2019 1 commit
    • btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents() · 8702ba93
      Qu Wenruo authored
      [Background]
      Btrfs qgroup uses two types of reserved space for METADATA space,
      PERTRANS and PREALLOC.
      
      PERTRANS is metadata space reserved for each transaction started by
      btrfs_start_transaction().
      While PREALLOC is for delalloc, where we reserve space before joining a
      transaction, and finally it will be converted to PERTRANS after the
      writeback is done.
      
      [Inconsistency]
      However there is inconsistency in how we handle PREALLOC metadata space.
      
      The most obvious one is:
      In btrfs_buffered_write():
      	btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);
      
      We always free qgroup PREALLOC meta space.
      
      While in btrfs_truncate_block():
      	btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));
      
      We only free qgroup PREALLOC meta space when something went wrong.
      
      [The Correct Behavior]
      The correct behavior should be the one in btrfs_buffered_write(), we
      should always free PREALLOC metadata space.
      
      The reason is, the btrfs_delalloc_* mechanism works by:
      - Reserve metadata first, even if it's not necessary
        In btrfs_delalloc_reserve_metadata()
      
      - Free the unused metadata space
        Normally in:
        btrfs_delalloc_release_extents()
        |- btrfs_inode_rsv_release()
           Here we do calculation on whether we should release or not.
      
      E.g. for 64K buffered write, the metadata rsv works like:
      
      /* The first page */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=0
      total:		num_bytes=calc_inode_reservations()
      /* The first page caused one outstanding extent, thus needs metadata
         rsv */
      
      /* The 2nd page */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=calc_inode_reservations()
      total:		not changed
      /* The 2nd page doesn't cause new outstanding extent, needs no new meta
         rsv, so we free what we have reserved */
      
      /* The 3rd~16th pages */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=calc_inode_reservations()
      total:		not changed (still space for one outstanding extent)
      
      This means, if btrfs_delalloc_release_extents() decides to free some
      space, then that space should be freed NOW.
      So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() rather
      than btrfs_qgroup_convert_reserved_meta().
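
The 64K buffered write walk-through above can be replayed with a small accounting model. This is a sketch under assumptions: toy_rsv and toy_buffered_write() are hypothetical, TOY_RSV_UNIT stands in for calc_inode_reservations(), and only the first page is assumed to create an outstanding extent.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the value returned by calc_inode_reservations(). */
#define TOY_RSV_UNIT 4096u

struct toy_rsv {
	unsigned int outstanding_extents;
	uint64_t reserved;	/* metadata still held for the inode */
	uint64_t freed;		/* metadata handed back immediately */
};

/*
 * Toy model of a buffered write: every page reserves one unit up front,
 * then the release step frees whatever exceeds the need of the
 * outstanding extents.
 */
static void toy_buffered_write(struct toy_rsv *rsv, unsigned int pages)
{
	for (unsigned int i = 0; i < pages; i++) {
		uint64_t needed;
		uint64_t excess;

		/* Reserve first, even if it turns out to be unnecessary. */
		rsv->reserved += TOY_RSV_UNIT;

		/* Assumption: only the first page adds an outstanding extent. */
		if (i == 0)
			rsv->outstanding_extents = 1;

		/* Release step: the excess must be freed now, not converted. */
		needed = (uint64_t)rsv->outstanding_extents * TOY_RSV_UNIT;
		excess = rsv->reserved - needed;
		rsv->reserved -= excess;
		rsv->freed += excess;
	}
}
```

After 16 pages the model holds exactly one unit and has freed the other 15, matching the "total: not changed" lines in the walk-through.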
      
      The good news is:
      - The callers are not that hot
        The hottest caller is in btrfs_buffered_write(), which is already
        fixed by commit 336a8bb8 ("btrfs: Fix wrong
        btrfs_delalloc_release_extents parameter"). Thus it's not that
        easy to cause false EDQUOT.
      
      - The trans commit in advance for qgroup would hide the bug
        Since commit f5fef459 ("btrfs: qgroup: Make qgroup async transaction
        commit more aggressive"), when btrfs qgroup metadata free space is low,
        it will try to commit transaction and free the wrongly converted
        PERTRANS space, so it's not that easy to hit such bug.
      
      [FIX]
      So to fix the problem, remove the @qgroup_free parameter for
      btrfs_delalloc_release_extents(), and always pass true to
      btrfs_inode_rsv_release().
      Reported-by: Filipe Manana <fdmanana@suse.com>
      Fixes: 43b18595 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 11 Oct, 2019 1 commit
  5. 25 Sep, 2019 1 commit
    • btrfs: relocation: fix use-after-free on dead relocation roots · 1fac4a54
      Qu Wenruo authored
      
      
      [BUG]
      One user reported a reproducible KASAN report about use-after-free:
      
        BTRFS info (device sdi1): balance: start -dvrange=1256811659264..1256811659265
        BTRFS info (device sdi1): relocating block group 1256811659264 flags data|raid0
        ==================================================================
        BUG: KASAN: use-after-free in btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
        Write of size 8 at addr ffff88856f671710 by task kworker/u24:10/261579
      
        CPU: 2 PID: 261579 Comm: kworker/u24:10 Tainted: P           OE     5.2.11-arch1-1-kasan #4
        Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99 Extreme4, BIOS P3.80 04/06/2018
        Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
        Call Trace:
         dump_stack+0x7b/0xba
         print_address_description+0x6c/0x22e
         ? btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
         __kasan_report.cold+0x1b/0x3b
         ? btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
         kasan_report+0x12/0x17
         __asan_report_store8_noabort+0x17/0x20
         btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
         record_root_in_trans+0x2a0/0x370 [btrfs]
         btrfs_record_root_in_trans+0xf4/0x140 [btrfs]
         start_transaction+0x1ab/0xe90 [btrfs]
         btrfs_join_transaction+0x1d/0x20 [btrfs]
         btrfs_finish_ordered_io+0x7bf/0x18a0 [btrfs]
         ? lock_repin_lock+0x400/0x400
         ? __kmem_cache_shutdown.cold+0x140/0x1ad
         ? btrfs_unlink_subvol+0x9b0/0x9b0 [btrfs]
         finish_ordered_fn+0x15/0x20 [btrfs]
         normal_work_helper+0x1bd/0xca0 [btrfs]
         ? process_one_work+0x819/0x1720
         ? kasan_check_read+0x11/0x20
         btrfs_endio_write_helper+0x12/0x20 [btrfs]
         process_one_work+0x8c9/0x1720
         ? pwq_dec_nr_in_flight+0x2f0/0x2f0
         ? worker_thread+0x1d9/0x1030
         worker_thread+0x98/0x1030
         kthread+0x2bb/0x3b0
         ? process_one_work+0x1720/0x1720
         ? kthread_park+0x120/0x120
         ret_from_fork+0x35/0x40
      
        Allocated by task 369692:
         __kasan_kmalloc.part.0+0x44/0xc0
         __kasan_kmalloc.constprop.0+0xba/0xc0
         kasan_kmalloc+0x9/0x10
         kmem_cache_alloc_trace+0x138/0x260
         btrfs_read_tree_root+0x92/0x360 [btrfs]
         btrfs_read_fs_root+0x10/0xb0 [btrfs]
         create_reloc_root+0x47d/0xa10 [btrfs]
         btrfs_init_reloc_root+0x1e2/0x340 [btrfs]
         record_root_in_trans+0x2a0/0x370 [btrfs]
         btrfs_record_root_in_trans+0xf4/0x140 [btrfs]
         start_transaction+0x1ab/0xe90 [btrfs]
         btrfs_start_transaction+0x1e/0x20 [btrfs]
         __btrfs_prealloc_file_range+0x1c2/0xa00 [btrfs]
         btrfs_prealloc_file_range+0x13/0x20 [btrfs]
         prealloc_file_extent_cluster+0x29f/0x570 [btrfs]
         relocate_file_extent_cluster+0x193/0xc30 [btrfs]
         relocate_data_extent+0x1f8/0x490 [btrfs]
         relocate_block_group+0x600/0x1060 [btrfs]
         btrfs_relocate_block_group+0x3a0/0xa00 [btrfs]
         btrfs_relocate_chunk+0x9e/0x180 [btrfs]
         btrfs_balance+0x14e4/0x2fc0 [btrfs]
         btrfs_ioctl_balance+0x47f/0x640 [btrfs]
         btrfs_ioctl+0x119d/0x8380 [btrfs]
         do_vfs_ioctl+0x9f5/0x1060
         ksys_ioctl+0x67/0x90
         __x64_sys_ioctl+0x73/0xb0
         do_syscall_64+0xa5/0x370
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        Freed by task 369692:
         __kasan_slab_free+0x14f/0x210
         kasan_slab_free+0xe/0x10
         kfree+0xd8/0x270
         btrfs_drop_snapshot+0x154c/0x1eb0 [btrfs]
         clean_dirty_subvols+0x227/0x340 [btrfs]
         relocate_block_group+0x972/0x1060 [btrfs]
         btrfs_relocate_block_group+0x3a0/0xa00 [btrfs]
         btrfs_relocate_chunk+0x9e/0x180 [btrfs]
         btrfs_balance+0x14e4/0x2fc0 [btrfs]
         btrfs_ioctl_balance+0x47f/0x640 [btrfs]
         btrfs_ioctl+0x119d/0x8380 [btrfs]
         do_vfs_ioctl+0x9f5/0x1060
         ksys_ioctl+0x67/0x90
         __x64_sys_ioctl+0x73/0xb0
         do_syscall_64+0xa5/0x370
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        The buggy address belongs to the object at ffff88856f671100
         which belongs to the cache kmalloc-4k of size 4096
        The buggy address is located 1552 bytes inside of
         4096-byte region [ffff88856f671100, ffff88856f672100)
        The buggy address belongs to the page:
        page:ffffea0015bd9c00 refcount:1 mapcount:0 mapping:ffff88864400e600 index:0x0 compound_mapcount: 0
        flags: 0x2ffff0000010200(slab|head)
        raw: 02ffff0000010200 dead000000000100 dead000000000200 ffff88864400e600
        raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
        page dumped because: kasan: bad access detected
      
        Memory state around the buggy address:
         ffff88856f671600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff88856f671680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        >ffff88856f671700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                 ^
         ffff88856f671780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff88856f671800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ==================================================================
        BTRFS info (device sdi1): 1 enospc errors during balance
        BTRFS info (device sdi1): balance: ended with status: -28
      
      [CAUSE]
      The problem happens when finish_ordered_io() gets called with balance
      still running, while the reloc root of that subvolume is already dead.
      (Tree swap is already done, but the tree is not yet deleted for possible
      qgroup usage.)
      
      That means root->reloc_root still exists, but that reloc_root can be
      under btrfs_drop_snapshot(), thus we shouldn't access it.
      
      The following race could cause the use-after-free problem:
      
                      CPU1              |                CPU2
      --------------------------------------------------------------------------
                                        | relocate_block_group()
                                        | |- unset_reloc_control(rc)
                                        | |- btrfs_commit_transaction()
      btrfs_finish_ordered_io()         | |- clean_dirty_subvols()
      |- btrfs_join_transaction()       |    |
         |- record_root_in_trans()      |    |
            |- btrfs_init_reloc_root()  |    |
               |- if (root->reloc_root) |    |
               |                        |    |- root->reloc_root = NULL
               |                        |    |- btrfs_drop_snapshot(reloc_root);
               |- reloc_root->last_trans|
                       = trans->transid |
      	    ^^^^^^^^^^^^^^^^^^^^^^
                  Use after free
      
      [FIX]
      Fix it by the following modifications:
      
      - Test if the root has dead reloc tree before accessing root->reloc_root
        If the root has BTRFS_ROOT_DEAD_RELOC_TREE, then we don't need to
        create or update root->reloc_root
      
      - Don't clear the BTRFS_ROOT_DEAD_RELOC_TREE flag until we have fully
        dropped the reloc tree
        This cooperates with the above modification: as long as
        BTRFS_ROOT_DEAD_RELOC_TREE is still set, we won't try to re-create
        reloc tree at record_root_in_trans().
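
The first modification can be illustrated with a single-threaded toy. It only shows the shape of the check and does not model the race itself; toy_root, toy_reloc_root, TOY_ROOT_DEAD_RELOC_TREE and toy_init_reloc_root() are hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy flag bit standing in for BTRFS_ROOT_DEAD_RELOC_TREE. */
#define TOY_ROOT_DEAD_RELOC_TREE (1UL << 0)

struct toy_reloc_root {
	uint64_t last_trans;
};

struct toy_root {
	unsigned long state;
	struct toy_reloc_root *reloc_root;
};

/*
 * Toy version of the check: when the reloc tree is marked dead, the
 * pointer is not dereferenced at all, even if it is still non-NULL.
 * Returns true when the reloc root was updated.
 */
static bool toy_init_reloc_root(struct toy_root *root, uint64_t transid)
{
	if (root->state & TOY_ROOT_DEAD_RELOC_TREE)
		return false;

	if (root->reloc_root) {
		root->reloc_root->last_trans = transid;
		return true;
	}
	return false;
}
```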
      Reported-by: Cebtenzzre <cebtenzzre@gmail.com>
      Fixes: d2311e69 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
      CC: stable@vger.kernel.org # 5.1+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 09 Sep, 2019 2 commits
  7. 04 Jul, 2019 1 commit
  8. 28 May, 2019 1 commit
    • btrfs: reloc: Also queue orphan reloc tree for cleanup to avoid BUG_ON() · 30d40577
      Qu Wenruo authored
      [BUG]
      When a fs has orphan reloc tree along with unfinished balance:
        ...
              item 16 key (TREE_RELOC ROOT_ITEM FS_TREE) itemoff 12090 itemsize 439
                      generation 12 root_dirid 256 bytenr 300400640 level 1 refs 0 <<<
                      lastsnap 8 byte_limit 0 bytes_used 1359872 flags 0x0(none)
                      uuid 7c48d938-33a3-4aae-ab19-6e5c9d406e46
              item 17 key (BALANCE TEMPORARY_ITEM 0) itemoff 11642 itemsize 448
                      temporary item objectid BALANCE offset 0
                      balance status flags 14
      
      Then at mount time, we can hit the following kernel BUG_ON():
        BTRFS info (device dm-3): relocating block group 298844160 flags metadata|dup
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/relocation.c:1413!
        invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 1 PID: 897 Comm: btrfs-balance Tainted: G           O      5.2.0-rc1-custom #15
        RIP: 0010:create_reloc_root+0x1eb/0x200 [btrfs]
        Call Trace:
         btrfs_init_reloc_root+0x96/0xb0 [btrfs]
         record_root_in_trans+0xb2/0xe0 [btrfs]
         btrfs_record_root_in_trans+0x55/0x70 [btrfs]
         select_reloc_root+0x7e/0x230 [btrfs]
         do_relocation+0xc4/0x620 [btrfs]
         relocate_tree_blocks+0x592/0x6a0 [btrfs]
         relocate_block_group+0x47b/0x5d0 [btrfs]
         btrfs_relocate_block_group+0x183/0x2f0 [btrfs]
         btrfs_relocate_chunk+0x4e/0xe0 [btrfs]
         btrfs_balance+0x864/0xfa0 [btrfs]
         balance_kthread+0x3b/0x50 [btrfs]
         kthread+0x123/0x140
         ret_from_fork+0x27/0x50
      
      [CAUSE]
      In btrfs, reloc trees are used to record swapped tree blocks during
      balance.
      A reloc tree either gets merged (replacing old tree blocks of its parent
      subvolume) in the next transaction if its ref is 1 (fresh),
      or is already merged and will be cleaned up if its ref is 0 (orphan).
      
      After commit d2311e69 ("btrfs: relocation: Delay reloc tree deletion
      after merge_reloc_roots"), reloc tree cleanup is delayed until one block
      group is balanced.
      
      Since fresh reloc roots are recorded during merge, as long as there
      is no power loss, those orphan reloc roots converted from fresh ones are
      handled without problem.
      
      However when power loss happens, orphan reloc roots can be recorded
      on-disk, thus at next mount time, we will have orphan reloc roots from
      on-disk data directly, and ignored by clean_dirty_subvols() routine.
      
      Then when background balance starts to balance another block group and
      needs to create a new reloc root for the same root, btrfs_insert_item()
      returns -EEXIST and triggers that BUG_ON().
      
      [FIX]
      For orphan reloc roots, also queue them to rc->dirty_subvol_roots, so
      all reloc roots no matter orphan or not, can be cleaned up properly and
      avoid above BUG_ON().
      
      And to cooperate with the above change, clean_dirty_subvols() will check if
      the queued root is a reloc root or a subvol root.
      For a subvol root, do the old work, and for an orphan reloc root, clean it
      up.
      
      Fixes: d2311e69 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
      CC: stable@vger.kernel.org # 5.1
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  9. 29 Apr, 2019 8 commits
    • btrfs: extent-tree: Use btrfs_ref to refactor btrfs_free_extent() · ffd4bb2a
      Qu Wenruo authored
      
      
      Similar to btrfs_inc_extent_ref(), use btrfs_ref to replace the long
      parameter list and the confusing @owner parameter.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: extent-tree: Use btrfs_ref to refactor btrfs_inc_extent_ref() · 82fa113f
      Qu Wenruo authored
      
      
      Use the new btrfs_ref structure and replace parameter list to clean up
      the usage of owner and level to distinguish the extent types.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: get fs_info from block group in lookup_free_space_inode · 7949f339
      David Sterba authored
      
      
      We can read fs_info from the block group cache structure and can drop it
      from the parameters.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove redundant inode argument from btrfs_add_ordered_sum · f9756261
      Nikolay Borisov authored
      
      
      Ordered csums are keyed off of a btrfs_ordered_extent, which already has
      a reference to the inode. This implies that an explicit inode argument
      is redundant. So remove it.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix panic during relocation after ENOSPC before writeback happens · ff612ba7
      Josef Bacik authored
      
      
      We've been seeing the following sporadically throughout our fleet
      
      panic: kernel BUG at fs/btrfs/relocation.c:4584!
      netversion: 5.0-0
      Backtrace:
       #0 [ffffc90003adb880] machine_kexec at ffffffff81041da8
       #1 [ffffc90003adb8c8] __crash_kexec at ffffffff8110396c
       #2 [ffffc90003adb988] crash_kexec at ffffffff811048ad
       #3 [ffffc90003adb9a0] oops_end at ffffffff8101c19a
       #4 [ffffc90003adb9c0] do_trap at ffffffff81019114
       #5 [ffffc90003adba00] do_error_trap at ffffffff810195d0
       #6 [ffffc90003adbab0] invalid_op at ffffffff81a00a9b
          [exception RIP: btrfs_reloc_cow_block+692]
          RIP: ffffffff8143b614  RSP: ffffc90003adbb68  RFLAGS: 00010246
          RAX: fffffffffffffff7  RBX: ffff8806b9c32000  RCX: ffff8806aad00690
          RDX: ffff880850b295e0  RSI: ffff8806b9c32000  RDI: ffff88084f205bd0
          RBP: ffff880849415000   R8: ffffc90003adbbe0   R9: ffff88085ac90000
          R10: ffff8805f7369140  R11: 0000000000000000  R12: ffff880850b295e0
          R13: ffff88084f205bd0  R14: 0000000000000000  R15: 0000000000000000
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #7 [ffffc90003adbbb0] __btrfs_cow_block at ffffffff813bf1cd
       #8 [ffffc90003adbc28] btrfs_cow_block at ffffffff813bf4b3
       #9 [ffffc90003adbc78] btrfs_search_slot at ffffffff813c2e6c
      
      The way relocation moves data extents is by creating a reloc inode and
      preallocating extents in this inode and then copying the data into these
      preallocated extents.  Once we've done this for all of our extents,
      we'll write out these dirty pages, which marks the extent written, and
      goes into btrfs_reloc_cow_block().  From here we get our current
      reloc_control, which _should_ match the reloc_control for the current
      block group we're relocating.
      
      However if we get an ENOSPC in this path at some point we'll bail out,
      never initiating writeback on this inode.  Not a huge deal, unless we
      happen to be doing relocation on a different block group, and this block
      group is now rc->stage == UPDATE_DATA_PTRS.  This trips the BUG_ON() in
      btrfs_reloc_cow_block(), because we expect to be done modifying the data
      inode.  We are in fact done modifying the metadata for the data inode
      we're currently using, but not the one from the failed block group, and
      thus we BUG_ON().
      
      (This happens when writeback finishes for extents from the previous
      group, when we are at btrfs_finish_ordered_io() which updates the data
      reloc tree (inode item, drops/adds extent items, etc).)
      
      Fix this by writing out the reloc data inode always, and then breaking
      out of the loop after that point to keep from tripping this BUG_ON()
      later.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      [ add note from Filipe ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reloc: Fix NULL pointer dereference due to expanded reloc_root lifespan · 10995c04
      Qu Wenruo authored
      Commit d2311e69 ("btrfs: relocation: Delay reloc tree deletion after
      merge_reloc_roots()") expands the life span of root->reloc_root.
      
      This breaks certain checks of fs_info->reloc_ctl.  Before that commit, if
      we have a root with valid reloc_root, then it's ensured to have
      fs_info->reloc_ctl.
      
      But now since reloc_root doesn't always mean a valid fs_info->reloc_ctl,
      such check is unreliable and can cause the following NULL pointer
      dereference:
      
        BUG: unable to handle kernel NULL pointer dereference at 00000000000005c1
        IP: btrfs_reloc_pre_snapshot+0x20/0x50 [btrfs]
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP PTI
        CPU: 0 PID: 10379 Comm: snapperd Not tainted
        Call Trace:
         create_pending_snapshot+0xd7/0xfc0 [btrfs]
         create_pending_snapshots+0x8e/0xb0 [btrfs]
         btrfs_commit_transaction+0x2ac/0x8f0 [btrfs]
         btrfs_mksubvol+0x561/0x570 [btrfs]
         btrfs_ioctl_snap_create_transid+0x189/0x190 [btrfs]
         btrfs_ioctl_snap_create_v2+0x102/0x150 [btrfs]
         btrfs_ioctl+0x5c9/0x1e60 [btrfs]
         do_vfs_ioctl+0x90/0x5f0
         SyS_ioctl+0x74/0x80
         do_syscall_64+0x7b/0x150
         entry_SYSCALL_64_after_hwframe+0x3d/0xa2
        RIP: 0033:0x7fd7cdab8467
      
      Fix it by explicitly checking fs_info->reloc_ctl rather than relying on
      the implied root->reloc_root.
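
A minimal sketch of the difference, using hypothetical toy types (toy_fs_info, toy_reloc_control, toy_subvol_root and toy_pre_snapshot_needed() are illustrative, not the kernel code): the control structure pointer is tested directly before it is dereferenced, instead of inferring its validity from a non-NULL reloc_root.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_reloc_control {
	bool merge_reloc_tree;
};

struct toy_fs_info {
	struct toy_reloc_control *reloc_ctl;	/* NULL when no relocation runs */
};

struct toy_subvol_root {
	struct toy_fs_info *fs_info;
	void *reloc_root;	/* may outlive reloc_ctl after the lifespan change */
};

/*
 * Toy pre-snapshot check: test reloc_ctl explicitly, since a non-NULL
 * reloc_root no longer implies a valid reloc_ctl.
 */
static bool toy_pre_snapshot_needed(const struct toy_subvol_root *root)
{
	struct toy_reloc_control *rc = root->fs_info->reloc_ctl;

	if (!rc || !root->reloc_root)
		return false;

	return rc->merge_reloc_tree;
}
```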
      
      Fixes: d2311e69 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Introduce extent_io_tree::owner to distinguish different io_trees · 43eb5f29
      Qu Wenruo authored
      
      
      Btrfs has the following different extent_io_trees used:
      
      - fs_info::free_extents[2]
      - btrfs_inode::io_tree - for both normal inodes and the btree inode
      - btrfs_inode::io_failure_tree
      - btrfs_transaction::dirty_pages
      - btrfs_root::dirty_log_pages
      
      If we want to trace changes in those trees, it will be pretty hard to
      distinguish them.
      
      Instead of using hard-to-read pointer address, this patch will introduce
      a new member extent_io_tree::owner to track the owner.
      
      This modification needs all the callers of extent_io_tree_init() to
      accept a new parameter @owner.
      
      This patch provides the basis for later trace events.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Introduce fs_info to extent_io_tree · c258d6e3
      Qu Wenruo authored
      
      
      This patch will add a new member fs_info to extent_io_tree.
      
      This provides the basis for later trace events to distinguish the output
      between different btrfs filesystems. While this increases the size of
      the structure, we want to know the source of the trace events and
      passing the fs_info as an argument to all contexts is not possible.
      
      The selftests are now allowed to set it to NULL as they don't use the
      tracepoints.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c258d6e3
  10. 25 Feb, 2019 6 commits
    • Btrfs: add missing error handling after doing leaf/node binary search · cbca7d59
      Filipe Manana authored
      
      
      The function map_private_extent_buffer() can return an -EINVAL error, and
      it is called by generic_bin_search() which will return back the error. The
      btrfs_bin_search() function in turn calls generic_bin_search() and the
      key_search() function calls btrfs_bin_search(), so both can return the
      -EINVAL error coming from the map_private_extent_buffer() function. Some
      callers of these functions were ignoring that these functions can return
      an error, so fix them to deal with error return values.
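
      The pattern being fixed can be sketched as follows. This is an
      illustration only: bin_search_sketch() and find_slot_sketch() are
      hypothetical helpers over a plain int array, following the convention
      of returning 0 on an exact match, 1 when the key is absent, and a
      negative errno on failure.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical search helper: 0 = found, 1 = not found (slot is the
 * insertion point), < 0 = error before any comparison could be made.
 */
static int bin_search_sketch(const int *keys, int nr, int key, int *slot)
{
	int low = 0;
	int high = nr;

	if (!keys || !slot || nr < 0)
		return -EINVAL;

	while (low < high) {
		int mid = low + (high - low) / 2;

		if (keys[mid] < key) {
			low = mid + 1;
		} else if (keys[mid] > key) {
			high = mid;
		} else {
			*slot = mid;
			return 0;
		}
	}
	*slot = low;
	return 1;
}

/*
 * Caller that wants the item at or before the key. The "ret < 0" check
 * is the part that was missing in some callers: without it, an error
 * would be treated like "not found" and *slot would be used unset.
 */
static int find_slot_sketch(const int *keys, int nr, int key, int *slot)
{
	int ret = bin_search_sketch(keys, nr, key, slot);

	if (ret < 0)
		return ret;
	if (ret > 0 && *slot > 0)
		(*slot)--;
	return 0;
}
```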
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cbca7d59
    • btrfs: open code now trivial btrfs_set_lock_blocking · 8bead258
      David Sterba authored
      
      
      btrfs_set_lock_blocking is now only a simple wrapper around
      btrfs_set_lock_blocking_write. The name does not bring any semantic
      value that could not be inferred from the new function so there's no
      point keeping it.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8bead258
    • btrfs: qgroup: Use delayed subtree rescan for balance · f616f5cd
      Qu Wenruo authored
      
      
      Before this patch, qgroup code traces the whole subtree of subvolume and
      reloc trees unconditionally.
      
      This makes qgroup numbers consistent, but it could cause tons of
      unnecessary extent tracing, which causes a lot of overhead.
      
      However, for the subtree swap of balance, both subtrees contain the
      same contents and tree structure, so swapping them doesn't change
      qgroup numbers.

      Only the race window between subtree swap and transaction commit could
      cause qgroup numbers to change.
      
      This patch will delay the qgroup subtree scan until COW happens for the
      subtree root.
      
      So if there are no other operations on the fs, balance won't cause
      extra qgroup overhead (the best case scenario).
      Depending on the workload, most of the subtree scans can still be
      avoided.

      Only in the worst case scenario does it fall back to the old subtree
      swap overhead (scanning all swapped subtrees).
      
      [[Benchmark]]
      Hardware:
      	VM 4G vRAM, 8 vCPUs,
      	disk is using 'unsafe' cache mode,
      	backing device is SAMSUNG 850 evo SSD.
      	Host has 16G ram.
      
      Mkfs parameter:
      	--nodesize 4K (To bump up tree size)
      
      Initial subvolume contents:
      	4G data copied from /usr and /lib.
      	(With enough regular small files)
      
      Snapshots:
      	16 snapshots of the original subvolume.
      	each snapshot has 3 random files modified.
      
      balance parameter:
      	-m
      
      So the content should be pretty similar to a real world root fs layout.
      
      And after file system population, there is no other activity, so it
      should be the best case scenario.
      
                           | v4.20-rc1            | w/ patchset    | diff
      -----------------------------------------------------------------------
      relocated extents    | 22615                | 22457          | -0.7%
      qgroup dirty extents | 163457               | 121606         | -25.6%
      time (sys)           | 22.884s              | 18.842s        | -17.6%
      time (real)          | 27.724s              | 22.884s        | -17.5%
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f616f5cd
    • btrfs: qgroup: Introduce per-root swapped blocks infrastructure · 370a11b8
      Qu Wenruo authored
      
      
      To allow delayed subtree swap rescan, btrfs needs to record per-root
      information about which tree blocks get swapped.  This patch introduces
      the required infrastructure.
      
      The designed workflow will be:
      
      1) Record the subtree root block that gets swapped.
      
         During subtree swap:
         O = Old tree blocks
         N = New tree blocks
               reloc tree                         subvolume tree X
                  Root                               Root
                 /    \                             /    \
               NA     OB                          OA      OB
             /  |     |  \                      /  |      |  \
           NC  ND     OE  OF                   OC  OD     OE  OF
      
        In this case, NA and OA are going to be swapped, record (NA, OA) into
        subvolume tree X.
      
      2) After subtree swap.
               reloc tree                         subvolume tree X
                  Root                               Root
                 /    \                             /    \
               OA     OB                          NA      OB
             /  |     |  \                      /  |      |  \
           OC  OD     OE  OF                   NC  ND     OE  OF
      
      3a) COW happens for OB
          If we are going to COW tree block OB, we check OB's bytenr against
          tree X's swapped_blocks structure.
          If it doesn't match any record, nothing will happen.
      
      3b) COW happens for NA
          Check NA's bytenr against tree X's swapped_blocks, and get a hit.
          Then we do subtree scan on both subtrees OA and NA.
          Resulting in 6 tree blocks to be scanned (OA, OC, OD, NA, NC, ND).
      
          Then no matter what we do to subvolume tree X, qgroup numbers will
          still be correct.
          Then NA's record gets removed from X's swapped_blocks.
      
      4)  Transaction commit
          Any record in X's swapped_blocks gets removed, since there is no
          modification to swapped subtrees, no need to trigger heavy qgroup
          subtree rescan for them.
      
      This will introduce 128 bytes of overhead for each btrfs_root even if qgroup
      is not enabled. This is to reduce memory allocations and potential
      failures.
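
      The workflow above can be sketched roughly as below. This is not the
      kernel implementation: the fixed-size array, MAX_SWAPPED_SKETCH and all
      function names are assumptions made for the example, and the real code
      has to deal with locking and allocation, which are left out here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SWAPPED_SKETCH 16	/* arbitrary cap, for the sketch only */

/* One record per swapped subtree root, e.g. (NA, OA) in the diagrams. */
struct swapped_block_sketch {
	uint64_t subvol_bytenr;	/* block now in the subvolume tree (NA) */
	uint64_t reloc_bytenr;	/* block now in the reloc tree (OA) */
};

/* Per-root container, standing in for X's swapped_blocks. */
struct swapped_blocks_sketch {
	struct swapped_block_sketch rec[MAX_SWAPPED_SKETCH];
	size_t nr;
};

/* Step 1: remember the pair at swap time instead of scanning right away. */
static int record_swapped_sketch(struct swapped_blocks_sketch *sb,
				 uint64_t subvol_bytenr, uint64_t reloc_bytenr)
{
	if (sb->nr == MAX_SWAPPED_SKETCH)
		return -1;
	sb->rec[sb->nr].subvol_bytenr = subvol_bytenr;
	sb->rec[sb->nr].reloc_bytenr = reloc_bytenr;
	sb->nr++;
	return 0;
}

/*
 * Step 3: called when a block is about to be COWed. On a hit the record
 * is copied to @hit and removed, and the caller would then scan both
 * subtrees. On a miss nothing happens.
 */
static bool cow_check_swapped_sketch(struct swapped_blocks_sketch *sb,
				     uint64_t bytenr,
				     struct swapped_block_sketch *hit)
{
	for (size_t i = 0; i < sb->nr; i++) {
		if (sb->rec[i].subvol_bytenr != bytenr)
			continue;
		*hit = sb->rec[i];
		sb->nr--;
		sb->rec[i] = sb->rec[sb->nr];	/* fill the hole */
		return true;
	}
	return false;
}

/* Step 4: at transaction commit the remaining records are dropped. */
static void commit_clear_swapped_sketch(struct swapped_blocks_sketch *sb)
{
	sb->nr = 0;
}
```

      In the best case no COW ever hits a recorded block, so step 4 discards
      every record and no subtree scan is done at all.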
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      370a11b8
    • btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots · d2311e69
      Qu Wenruo authored
      
      
      Relocation code will drop btrfs_root::reloc_root as soon as
      merge_reloc_root() finishes.
      
      However later qgroup code will need to access btrfs_root::reloc_root
      after merge_reloc_root() for delayed subtree rescan.
      
      So alter the timing of resetting btrfs_root::reloc_root, making it
      happen after transaction commit.

      With this patch, we will introduce a new btrfs_root::state,
      BTRFS_ROOT_DEAD_RELOC_TREE, to inform some users of
      btrfs_root::reloc_root that although btrfs_root::reloc_root is still
      non-NULL, it's not used any more.
      
      The lifespan of btrfs_root::reloc_root will become:
                Old behavior            |              New
      ------------------------------------------------------------------------
      btrfs_init_reloc_root()      ---  | btrfs_init_reloc_root()      ---
        set reloc_root              |   |   set reloc_root              |
                                    |   |                               |
                                    |   |                               |
      merge_reloc_root()            |   | merge_reloc_root()            |
      |- btrfs_update_reloc_root() ---  | |- btrfs_update_reloc_root() -+-
           clear btrfs_root::reloc_root |      set ROOT_DEAD_RELOC_TREE |
                                        |      record root into dirty   |
                                        |      roots rbtree             |
                                        |                               |
                                        | reloc_block_group() Or        |
                                        | btrfs_recover_relocation()    |
                                        | | After transaction commit    |
                                        | |- clean_dirty_subvols()     ---
                                        |     clear btrfs_root::reloc_root
      
      While ROOT_DEAD_RELOC_TREE is set, the only user of
      btrfs_root::reloc_root should be qgroup.
      
      Since reloc root needs a longer life-span, this patch will also delay
      btrfs_drop_snapshot() call.
      Now btrfs_drop_snapshot() is called in clean_dirty_subvols().
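
      The new lifespan can be pictured with a small state sketch. The struct,
      the flag value and the helper names are hypothetical; in particular,
      clearing the flag together with the pointer in the cleanup step is an
      assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_ROOT_DEAD_RELOC_TREE (1UL << 0)	/* hypothetical bit value */

/* Stand-in for btrfs_root; only the two relevant members are modelled. */
struct root_sketch {
	unsigned long state;
	void *reloc_root;
};

/* After merging: keep the pointer for qgroup, but mark it as dead. */
static void update_reloc_root_sketch(struct root_sketch *root)
{
	root->state |= SKETCH_ROOT_DEAD_RELOC_TREE;
}

/* What non-qgroup users should check instead of "reloc_root != NULL". */
static bool reloc_root_in_use_sketch(const struct root_sketch *root)
{
	return root->reloc_root != NULL &&
	       !(root->state & SKETCH_ROOT_DEAD_RELOC_TREE);
}

/* After transaction commit: only now is the pointer really dropped. */
static void clean_dirty_subvol_sketch(struct root_sketch *root)
{
	root->reloc_root = NULL;
	root->state &= ~SKETCH_ROOT_DEAD_RELOC_TREE;
}
```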
      
      This patch will increase the size of btrfs_root by 16 bytes.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d2311e69
    • Btrfs: drop useless LIST_HEAD in merge_reloc_root · 9cf10cc1
      Julia Lawall authored
      Drop LIST_HEAD where the variable it declares is never used.
      
      The uses were removed in 3fd0a558 ("Btrfs: Metadata ENOSPC
      handling for balance"), but not the declaration.
      
      The semantic patch that fixes this problem is as follows:
      (http://coccinelle.lip6.fr/)
      
      // <smpl>
      @@
      identifier x;
      @@
      - LIST_HEAD(x);
        ... when != x
      // </smpl>
      Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9cf10cc1
  11. 17 Dec, 2018 3 commits
    • btrfs: Fix typos in comments and strings · 52042d8e
      Andrea Gelmini authored
      
      
      The typos accumulate over time so once in a while they get fixed in
      a large patch.
      Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      52042d8e
    • btrfs: add helper to describe block group flags · f89e09cf
      Anand Jain authored
      
      
      Factor out helper that describes block group flags from
      describe_relocation. The result will not be longer than the given size.
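
      A rough sketch of such a helper, under stated assumptions: the flag
      bits, the name table and describe_flags_sketch() are made up for the
      example and are not the kernel's definitions. The point shown is that
      the output never exceeds the caller's buffer and stays NUL-terminated.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical flag bits; the real block group flags differ. */
#define SKETCH_BG_DATA		(1ULL << 0)
#define SKETCH_BG_SYSTEM	(1ULL << 1)
#define SKETCH_BG_METADATA	(1ULL << 2)

static const struct {
	unsigned long long bit;
	const char *name;
} sketch_bg_names[] = {
	{ SKETCH_BG_DATA,	"data" },
	{ SKETCH_BG_SYSTEM,	"system" },
	{ SKETCH_BG_METADATA,	"metadata" },
};

/*
 * Write a '|'-separated description of @flags into @buf. At most @size
 * bytes are written, including the terminating NUL; on truncation the
 * helper simply stops early.
 */
static void describe_flags_sketch(unsigned long long flags, char *buf,
				  size_t size)
{
	size_t used = 0;
	size_t i;

	if (size == 0)
		return;
	buf[0] = '\0';

	for (i = 0; i < sizeof(sketch_bg_names) / sizeof(sketch_bg_names[0]); i++) {
		int n;

		if (!(flags & sketch_bg_names[i].bit))
			continue;
		n = snprintf(buf + used, size - used, "%s%s",
			     used ? "|" : "", sketch_bg_names[i].name);
		if (n < 0 || (size_t)n >= size - used)
			return;	/* truncated, buffer is still terminated */
		used += (size_t)n;
		flags &= ~sketch_bg_names[i].bit;
	}

	if (flags)	/* bits without a name are printed in hex */
		snprintf(buf + used, size - used, "%s0x%llx",
			 used ? "|" : "", flags);
	else if (!used)
		snprintf(buf, size, "NONE");
}
```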
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add comments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      f89e09cf
    • Btrfs: prevent ioctls from interfering with a swap file · eede2bf3
      Omar Sandoval authored
      
      
      A later patch will implement swap file support for Btrfs, but before we
      do that, we need to make sure that the various Btrfs ioctls cannot
      change a swap file.
      
      When a swap file is active, we must make sure that the extents of the
      file are not moved and that they don't become shared. That means that
      the following are not safe:
      
      - chattr +c (enable compression)
      - reflink
      - dedupe
      - snapshot
      - defrag
      
      Don't allow those to happen on an active swap file.
      
      Additionally, balance, resize, device remove, and device replace are
      also unsafe if they affect an active swapfile. Add a red-black tree of
      block groups and devices which contain an active swapfile. Relocation
      checks each block group against this tree and skips it or errors out for
      balance or resize, respectively. Device remove and device replace check
      the tree for the device they will operate on.
      
      Note that we don't have to worry about chattr -C (disable nocow), which
      we ignore for non-empty files, because an active swapfile must be
      non-empty and can't be truncated. We also don't have to worry about
      autodefrag because it's only done on COW files. Truncate and fallocate
      are already taken care of by the generic code. Device add doesn't do
      relocation so it's not an issue, either.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      eede2bf3
  12. 23 Nov, 2018 1 commit
  13. 15 Oct, 2018 7 commits
    • btrfs: relocation: Remove redundant tree level check · 06bbf672
      Qu Wenruo authored
      Commit 581c1760 ("btrfs: Validate child tree block's level and first
      key") has made tree block level check mandatory.
      
      So if tree block level doesn't match, we won't get a valid extent
      buffer.  The extra WARN_ON() check can be removed completely.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      06bbf672
    • btrfs: relocation: Cleanup while loop using rbtree_postorder_for_each_entry_safe · 98ff7b94
      Qu Wenruo authored
      
      
      And add one line comment explaining what we're doing for each loop.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      98ff7b94
    • btrfs: qgroup: Only trace data extents in leaves if we're relocating data block group · 3d0174f7
      Qu Wenruo authored
      
      
      For qgroup_trace_extent_swap(), if we find one leaf that needs to be
      traced, we will also iterate all file extents and trace them.
      
      This is OK if we're relocating data block groups, but if we're
      relocating metadata block groups, balance code itself has ensured that
      the subtrees of both the file tree and the reloc tree contain the same
      contents.
      
      That's to say, if we're relocating metadata block groups, all file
      extents in reloc and file tree should match, thus no need to trace them.
      This should reduce the total number of dirty extents processed in metadata
      block group balance.
      
      [[Benchmark]] (with all previous enhancement)
      Hardware:
      	VM 4G vRAM, 8 vCPUs,
      	disk is using 'unsafe' cache mode,
      	backing device is SAMSUNG 850 evo SSD.
      	Host has 16G ram.
      
      Mkfs parameter:
      	--nodesize 4K (To bump up tree size)
      
      Initial subvolume contents:
      	4G data copied from /usr and /lib.
      	(With enough regular small files)
      
      Snapshots:
      	16 snapshots of the original subvolume.
      	each snapshot has 3 random files modified.
      
      balance parameter:
      	-m
      
      So the content should be pretty similar to a real world root fs layout.
      
                           | v4.19-rc1    | w/ patchset    | diff (*)
      ---------------------------------------------------------------
      relocated extents    | 22929        | 22851          | -0.3%
      qgroup dirty extents | 227757       | 140886         | -38.1%
      time (sys)           | 65.253s      | 37.464s        | -42.6%
      time (real)          | 74.032s      | 44.722s        | -39.6%
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3d0174f7
    • btrfs: qgroup: Use generation-aware subtree swap to mark dirty extents · 5f527822
      Qu Wenruo authored
      
      
      Before this patch, with quota enabled during balance, we need to mark
      the whole subtree dirty for quota.
      
      E.g.
      OO = Old tree blocks (from file tree)
      NN = New tree blocks (from reloc tree)
      
              File tree (src)		          Reloc tree (dst)
                  OO (a)                              NN (a)
                 /  \                                /  \
           (b) OO    OO (c)                    (b) NN    NN (c)
              /  \  /  \                          /  \  /  \
             OO  OO OO OO (d)                    OO  OO OO NN (d)
      
      For old balance + quota case, quota will mark the whole src and dst tree
      dirty, including all the 3 old tree blocks in reloc tree.
      
      That's acceptable for a small file tree, or when the new tree blocks
      are all located at lower levels.

      But for a large file tree, or when the new tree blocks are all located
      at higher levels, this leads to marking the whole tree dirty, which is
      unbelievably slow.
      
      This patch will change how we handle such balance with quota enabled
      case.
      
      Now we will search from (b) and (c) for any new tree blocks whose
      generation is equal to @last_snapshot, and only mark them dirty.
      
      In above case, we only need to trace tree blocks NN(b), NN(c) and NN(d).
      (NN(a) will be traced when COW happens for nodeptr modification).  And
      also for tree blocks OO(b), OO(c), OO(d). (OO(a) will be traced when COW
      happens for nodeptr modification.)
      
      For above case, we could skip 3 tree blocks, but for larger tree, we can
      skip tons of unmodified tree blocks, and hugely speed up balance.
      
      This patch will introduce a new function,
      btrfs_qgroup_trace_subtree_swap(), which will do the following main
      work:
      
      1) Read out real root eb
         And setup basic dst_path for later calls
      2) Call qgroup_trace_new_subtree_blocks()
         To trace all new tree blocks in reloc tree and their counterparts
         in the file tree.
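
      The generation-based pruning can be illustrated with a toy tree walk.
      Everything here is a simplified assumption: block_sketch is a made-up
      two-child node, the function only counts blocks instead of tracing
      extents, and the test used is "generation >= last_snapshot" rather
      than strict equality.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy tree block: a generation and up to two children. */
struct block_sketch {
	uint64_t generation;
	const struct block_sketch *child[2];
};

/*
 * Count the blocks that would be traced below (and including) @b.
 * In a COW tree a child is never newer than its parent, so an old
 * block means its whole subtree is old and can be skipped.
 */
static size_t trace_new_blocks_sketch(const struct block_sketch *b,
				      uint64_t last_snapshot)
{
	size_t traced;
	size_t i;

	if (!b || b->generation < last_snapshot)
		return 0;

	traced = 1;	/* this block is new, trace it */
	for (i = 0; i < 2; i++)
		traced += trace_new_blocks_sketch(b->child[i], last_snapshot);
	return traced;
}
```

      Applied to the reloc tree in the diagram, starting from (b) and (c),
      only NN(b), NN(c) and NN(d) are visited as new; the three old leaves
      are skipped.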
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5f527822
    • btrfs: relocation: Add basic extent backref related comments for build_backref_tree · fa6ac715
      Qu Wenruo authored
      
      
      fs/btrfs/relocation.c:build_backref_tree() is code from the 2009 era.
      Although it works fine, it's not easy to understand, especially
      combined with the complex btrfs backref format.

      This patch adds some basic comments for the backref build part of the
      code, making it easier to read, at least for the backref searching part.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      fa6ac715
    • btrfs: Handle owner mismatch gracefully when walking up tree · 65c6e82b
      Qu Wenruo authored
      [BUG]
      When mounting a certain crafted image, btrfs will trigger a kernel
      BUG_ON() when trying to recover balance:
      
        kernel BUG at fs/btrfs/extent-tree.c:8956!
        invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10
        RIP: 0010:walk_up_proc+0x336/0x480 [btrfs]
        RSP: 0018:ffffb53540c9b890 EFLAGS: 00010202
        Call Trace:
         walk_up_tree+0x172/0x1f0 [btrfs]
         btrfs_drop_snapshot+0x3a4/0x830 [btrfs]
         merge_reloc_roots+0xe1/0x1d0 [btrfs]
         btrfs_recover_relocation+0x3ea/0x420 [btrfs]
         open_ctree+0x1af3/0x1dd0 [btrfs]
         btrfs_mount_root+0x66b/0x740 [btrfs]
         mount_fs+0x3b/0x16a
         vfs_kern_mount.part.9+0x54/0x140
         btrfs_mount+0x16d/0x890 [btrfs]
         mount_fs+0x3b/0x16a
         vfs_kern_mount.part.9+0x54/0x140
         do_mount+0x1fd/0xda0
         ksys_mount+0xba/0xd0
         __x64_sys_mount+0x21/0x30
         do_syscall_64+0x60/0x210
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [CAUSE]
      Extent tree corruption.  In this particular case, reloc tree root's
      owner is DATA_RELOC_TREE (should be TREE_RELOC), thus its backref is
      corrupted and we failed the owner check in walk_up_tree().
      
      [FIX]
      It's pretty hard to take care of every extent tree corruption, but at
      least we can remove such BUG_ON() and exit more gracefully.
      
      And since in this particular image, DATA_RELOC_TREE and TREE_RELOC share
      the same root (which is obviously invalid), we need to make
      __del_reloc_root() more robust to detect such invalid sharing to avoid
      possible NULL dereference as root->node can be NULL in this case.
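
      The shape of the fix can be sketched like this. It is an illustration
      with hypothetical names, and the choice of -EUCLEAN as the error code
      is an assumption of the sketch: the check that used to be a hard
      assertion becomes an error return that the caller can propagate.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

#ifndef EUCLEAN
#define EUCLEAN 117	/* fallback for platforms that lack it */
#endif

/*
 * Before: a mismatch would hit BUG_ON() and take the kernel down.
 * After: report the corruption and let the caller unwind.
 */
static int check_owner_sketch(uint64_t expected_owner, uint64_t found_owner)
{
	if (expected_owner != found_owner) {
		fprintf(stderr,
			"owner mismatch: expected %llu, found %llu\n",
			(unsigned long long)expected_owner,
			(unsigned long long)found_owner);
		return -EUCLEAN;
	}
	return 0;
}

/* Caller side: stop walking on the first error instead of crashing. */
static int walk_up_sketch(const uint64_t *owners, int levels,
			  uint64_t expected_owner)
{
	int i;

	for (i = 0; i < levels; i++) {
		int ret = check_owner_sketch(expected_owner, owners[i]);

		if (ret < 0)
			return ret;
	}
	return 0;
}
```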
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411
      
      Reported-by: Xu Wen <wen.xu@gatech.edu>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      65c6e82b
    • btrfs: Remove 'objectid' member from struct btrfs_root · 4fd786e6
      Misono Tomohiro authored
      
      
      There are two members in struct btrfs_root which indicate root's
      objectid: objectid and root_key.objectid.
      
      They are both set to the same value in __setup_root():
      
        static void __setup_root(struct btrfs_root *root,
                                 struct btrfs_fs_info *fs_info,
                                 u64 objectid)
        {
          ...
          root->objectid = objectid;
          ...
          root->root_key.objectid = objectid;
          ...
        }
      
      and not changed to other value after initialization.
      
      grep in btrfs directory shows both are used in many places:
        $ grep -rI "root->root_key.objectid" | wc -l
        133
        $ grep -rI "root->objectid" | wc -l
        55
       (4.17, inc. some noise)
      
      It is confusing to have two similar variable names and it seems
      that there is no rule about which should be used in a certain case.
      
      Since ->root_key itself is needed for tree reloc tree, let's remove
      'objectid' member and unify code to use ->root_key.objectid in all places.
      Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4fd786e6