1. 13 Dec, 2019 2 commits
    • Josef Bacik's avatar
      btrfs: don't double lock the subvol_sem for rename exchange · 943eb3bf
      Josef Bacik authored
      If we're rename exchanging two subvols we'll try to lock this lock
      twice, which is bad.  Just lock once if either of the ino's are subvols.
      Fixes: cdd1fedf
       ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    • Josef Bacik's avatar
      btrfs: do not call synchronize_srcu() in inode_tree_del · f72ff01d
      Josef Bacik authored
      Testing with the new fsstress uncovered a pretty nasty deadlock with
      lookup and snapshot deletion.
      Process A
       -> final iput
         -> inode_tree_del
           -> synchronize_srcu(subvol_srcu)
      Process B
      btrfs_lookup  <- srcu_read_lock() acquired here
        -> btrfs_iget
          -> find inode that has I_FREEING set
            -> __wait_on_freeing_inode()
      We're holding the srcu_read_lock() while doing the iget in order to make
      sure our fs root doesn't go away, and then we are waiting for the inode
      to finish freeing.  However because the free'ing process is doing a
      synchronize_srcu() we deadlock.
      Fix this by dropping the synchronize_srcu() in inode_tree_del().  We
      don't need people to stop accessing the fs root at this point, we're
      only adding our empty root to the dead roots list.
      A larger much more invasive fix is forthcoming to address how we deal
      with fs roots, but this fixes the immediate problem.
      Fixes: 76dda93c
       ("Btrfs: add snapshot/subvolume destroy ioctl")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
  2. 18 Nov, 2019 13 commits
    • David Sterba's avatar
      btrfs: remove extent_map::bdev · a019e9e1
      David Sterba authored
      We can now remove the bdev from extent_map. Previous patches made sure
      that bio_set_dev is correctly in all places and that we don't need to
      grab it from latest_bdev or pass it around inside the extent map.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    • Josef Bacik's avatar
      btrfs: record all roots for rename exchange on a subvol · 3e174099
      Josef Bacik authored
      Testing with the new fsstress support for subvolumes uncovered a pretty
      bad problem with rename exchange on subvolumes.  We're modifying two
      different subvolumes, but we only start the transaction on one of them,
      so the other one is not added to the dirty root list.  This is caught by
      btrfs_cow_block() with a warning because the root has not been updated,
      however if we do not modify this root again we'll end up pointing at an
      invalid root because the root item is never updated.
      Fix this by making sure we add the destination root to the trans list,
      the same as we do with normal renames.  This fixes the corruption.
      Fixes: cdd1fedf
       ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    • David Sterba's avatar
      btrfs: rename btrfs_block_group_cache · 32da5386
      David Sterba authored
      The type name is misleading, a single entry is named 'cache' while this
      normally means a collection of objects. Rename that everywhere. Also the
      identifier was quite long, making function prototypes harder to format.
      Suggested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    • David Sterba's avatar
      btrfs: sink write flags to cow_file_range_async · fac07d2b
      David Sterba authored
      In commit "Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios",
      cow_file_range_async gained wbc as a parameter and this makes passing
      write flags redundant. Set it inside the function and remove the
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    • Filipe Manana's avatar
      Btrfs: remove unnecessary delalloc mutex for inodes · 16ad3be1
      Filipe Manana authored
      The inode delalloc mutex was added a long time ago by commit f248679e
      ("Btrfs: add a delalloc mutex to inodes for delalloc reservations"), and
      the reason for its introduction is not very clear from the change log. It
      claims it solves bogus warnings from lockdep, however it lacks an example
      report/warning from lockdep, or any explanation.
      Since we have enough concurrentcy protection from the locks of the space
      info and block reserve objects, and such lockdep warnings don't seem to
      exist anymore (at least on a 5.3 kernel I couldn't get them with fstests,
      ltp, fs_mark, etc), remove it, simplifying things a bit and decreasing
      the size of the btrfs_inode structure. With some quick fio tests doing
      direct IO and mmap writes I couldn't observe any significant performance
      increase either (direct IO writes that don't increase the file's size
      don't hold the inode's lock for their entire duration and mmap writes
      don't hold the inode's lock at all), which are the only type of writes
      that could see any performance gain due to less serialization.
      Review feedback from Josef:
      The problem was taking the i_mutex in mmap, which is how I was
      protecting delalloc reservations originally.  The delalloc mutex didn't
      come with all of the other dependencies.  That's what the lockdep
      messages were about, removing the lock isn't going to make them appear
      We _had_ to lock around this because we used to do tricks to keep from
      over-reserving, and if we didn't serialize delalloc reservations we'd
      end up with ugly accounting problems when we tried to clean things up.
      However with my recentish changes this isn't the case anymore.  Every
      operation is responsible for reserving its space, and then adding it to
      the inode.  Then cleaning up is straightforward and can't be mucked up
      by other users.  So we no longer need the delalloc mutex to safe us from
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    • David Sterba's avatar
      btrfs: get bdev from latest_dev for dio bh_result · 8530c37a
      David Sterba authored
      To remove use of extent_map::bdev we need to find a replacement, and the
      latest_bdev is the only one we can use here, because inode::i_bdev and
      superblock::s_bdev are NULL.
      The DIO code uses bdev in two places:
      * to read blocksize to perform alignment checks in
        do_blockdev_direct_IO, but we do them in btrfs code before any call to
      * in the following call chain:
           sdio->get_block() <-- this is btrfs_get_blocks_direct
        subsequently the map_bh->b_dev member is used in clean_bdev_aliases
        and dio_new_bio to set the bio's bdev to that of the buffer_head.
        However, because we have provided a submit function dio_bio_submit
        calls our submission function and ignores the bdev.
      So it's safe to pass any valid bdev that's used within the filesystem.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    • Filipe Manana's avatar
      Btrfs: fix metadata space leak on fixup worker failure to set range as delalloc · 53687007
      Filipe Manana authored
      In the fixup worker, if we fail to mark the range as delalloc in the io
      tree, we must release the previously reserved metadata, as well as update
      the outstanding extents counter for the inode, otherwise we leak metadata
      In pratice we can't return an error from btrfs_set_extent_delalloc(),
      which is just a wrapper around __set_extent_bit(), as for most errors
      __set_extent_bit() does a BUG_ON() (or panics which hits a BUG_ON() as
      well) and returning an -EEXIST error doesn't happen in this case since
      the exclusive bits parameter always has a value of 0 through this code
      path. Nevertheless, just fix the error handling in the fixup worker,
      in case one day __set_extent_bit() can return an error to this code
      Fixes: f3038ee3
       ("btrfs: Handle btrfs_set_extent_delalloc failure in fixup worker")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    • Nikolay Borisov's avatar
      btrfs: Rename btrfs_join_transaction_nolock · 8d510121
      Nikolay Borisov authored
      This function is used only during the final phase of freespace cache
      writeout. This is necessary since using the plain btrfs_join_transaction
      api is deadlock prone. The deadlock looks like:
             btrfs_wait_ordered_range <-- Triggers ordered IO for freespace
                                          inode and blocks transaction commit
      				    until freespace cache writeout
      T2: <-- after T1 has triggered the writeout
          btrfs_join_transaction <--- this would block waiting for current
                                      transaction to commit, but since trans
      				commit is waiting for this writeout to
      The special purpose functions prevents it by simply skipping the "wait
      for writeout" since it's guaranteed the transaction won't proceed until
      we are done.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    • Chris Mason's avatar
      Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios · ec39f769
      Chris Mason authored
      Async CRCs and compression submit IO through helper threads, which means
      they have IO priority inversions when cgroup IO controllers are in use.
      This flags all of the writes submitted by btrfs helper threads as
      REQ_CGROUP_PUNT.  submit_bio() will punt these to dedicated per-blkcg
      work items to avoid the priority inversion.
      For the compression code, we take a reference on the wbc's blkg css and
      pass it down to the async workers.
      For the async CRCs, the bio already has the correct css, we just need to
      tell the block layer to use REQ_CGROUP_PUNT.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Modified-and-reviewed-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    • Chris Mason's avatar
      Btrfs: only associate the locked page with one async_chunk struct · 1d53c9e6
      Chris Mason authored
      The btrfs writepages function collects a large range of pages flagged
      for delayed allocation, and then sends them down through the COW code
      for processing.  When compression is on, we allocate one async_chunk
      structure for every 512K, and then run those pages through the
      compression code for IO submission.
      writepages starts all of this off with a single page, locked by the
      original call to extent_write_cache_pages(), and it's important to keep
      track of this page because it has already been through
      The btrfs async_chunk struct has a pointer to the locked_page, and when
      we're redirtying the page because compression had to fallback to
      uncompressed IO, we use page->index to decide if a given async_chunk
      struct really owns that page.
      But, this is racey.  If a given delalloc range is broken up into two
      async_chunks (chunkA and chunkB), we can end up with something like
       submit compressed bios(chunkA)
        submit_compressed_extents <--- falls back to buffered writeout
      The end result is that chunkA is completed and cleaned up before chunkB
      even starts processing.  This means we can free locked_page() and reuse
      it elsewhere.  If we get really lucky, it'll have the same page->index
      in its new home as it did before.
      While we're processing chunkB, we might decide we need to fall back to
      uncompressed IO, and so compress_file_range() will call
      __set_page_dirty_nobufers() on chunkB->locked_page.
      Without cgroups in use, this creates as a phantom dirty page, which
      isn't great but isn't the end of the world. What can happen, it can go
      through the fixup worker and the whole COW machinery again:
      in submit_compressed_extents():
        while (async extents) {
          if (!page_started ...)
          else if (...)
      This hasn't been observed in practice but is still possible.
      With cgroups in use, we might crash in the accounting code because
      page->mapping->i_wb isn't set.
        BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
        IP: percpu_counter_add_batch+0x11/0x70
        PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
        Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
        CPU: 16 PID: 2172 Comm: rm Not tainted
        RIP: 0010:percpu_counter_add_batch+0x11/0x70
        RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
        RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
        RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
        RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
        R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
        R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
        FS:  00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
      The fix here is to make asyc_chunk->locked_page NULL everywhere but the
      one async_chunk struct that's allowed to do things to the locked page.
      Link: https://lore.kernel.org/linux-btrfs/c2419d01-5c84-3fb4-189e-4db519d08796@suse.com/
      Fixes: 771ed689
       ("Btrfs: Optimize compressed writeback and reads")
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      [ update changelog from mail thread discussion ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    • Chris Mason's avatar
      Btrfs: stop using btrfs_schedule_bio() · 08635bae
      Chris Mason authored
      btrfs_schedule_bio() hands IO off to a helper thread to do the actual
      submit_bio() call.  This has been used to make sure async crc and
      compression helpers don't get stuck on IO submission.  To maintain good
      performance, over time the IO submission threads duplicated some IO
      scheduler characteristics such as high and low priority IOs and they
      also made some ugly assumptions about request allocation batch sizes.
      All of this cost at least one extra context switch during IO submission,
      and doesn't fit well with the modern blkmq IO stack.  So, this commit stops
      using btrfs_schedule_bio().  We may need to adjust the number of async
      helper threads for crcs and compression, but long term it's a better
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    • David Sterba's avatar
      btrfs: drop unused parameter is_new from btrfs_iget · 4c66e0d4
      David Sterba authored
      The parameter is now always set to NULL and could be dropped. The last
      user was get_default_root but that got reworked in 05dbe683
      unify subvol= and subvolid= mounting") and the parameter became unused.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    • Omar Sandoval's avatar
      btrfs: get rid of unique workqueue helper functions · a0cac0ec
      Omar Sandoval authored
      Commit 9e0af237
       ("Btrfs: fix task hang under heavy compressed
      write") worked around the issue that a recycled work item could get a
      false dependency on the original work item due to how the workqueue code
      guarantees non-reentrancy. It did so by giving different work functions
      to different types of work.
      However, the fixes in the previous few patches are more complete, as
      they prevent a work item from being recycled at all (except for a tiny
      window that the kernel workqueue code handles for us). This obsoletes
      the previous fix, so we don't need the unique helpers for correctness.
      The only other reason to keep them would be so they show up in stack
      traces, but they always seem to be optimized to a tail call, so they
      don't show up anyways. So, let's just get rid of the extra indirection.
      While we're here, rename normal_work_helper() to the more informative
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
  3. 11 Nov, 2019 1 commit
    • Filipe Manana's avatar
      Btrfs: fix log context list corruption after rename exchange operation · e6c61710
      Filipe Manana authored
      During rename exchange we might have successfully log the new name in the
      source root's log tree, in which case we leave our log context (allocated
      on stack) in the root's list of log contextes. However we might fail to
      log the new name in the destination root, in which case we fallback to
      a transaction commit later and never sync the log of the source root,
      which causes the source root log context to remain in the list of log
      contextes. This later causes invalid memory accesses because the context
      was allocated on stack and after rename exchange finishes the stack gets
      reused and overwritten for other purposes.
      The kernel's linked list corruption detector (CONFIG_DEBUG_LIST=y) can
      detect this and report something like the following:
        [  691.489929] ------------[ cut here ]------------
        [  691.489947] list_add corruption. prev->next should be next (ffff88819c944530), but was ffff8881c23f7be4. (prev=ffff8881c23f7a38).
        [  691.489967] WARNING: CPU: 2 PID: 28933 at lib/list_debug.c:28 __list_add_valid+0x95/0xe0
        [  691.489998] CPU: 2 PID: 28933 Comm: fsstress Not tainted 5.4.0-rc6-btrfs-next-62 #1
        [  691.490001] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
        [  691.490003] RIP: 0010:__list_add_valid+0x95/0xe0
        [  691.490007] RSP: 0018:ffff8881f0b3faf8 EFLAGS: 00010282
        [  691.490010] RAX: 0000000000000000 RBX: ffff88819c944530 RCX: 0000000000000000
        [  691.490011] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffffa2c497e0
        [  691.490013] RBP: ffff8881f0b3fe68 R08: ffffed103eaa4115 R09: ffffed103eaa4114
        [  691.490015] R10: ffff88819c944000 R11: ffffed103eaa4115 R12: 7fffffffffffffff
        [  691.490016] R13: ffff8881b4035610 R14: ffff8881e7b84728 R15: 1ffff1103e167f7b
        [  691.490019] FS:  00007f4b25ea2e80(0000) GS:ffff8881f5500000(0000) knlGS:0000000000000000
        [  691.490021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [  691.490022] CR2: 00007fffbb2d4eec CR3: 00000001f2a4a004 CR4: 00000000003606e0
        [  691.490025] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [  691.490027] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [  691.490029] Call Trace:
        [  691.490058]  btrfs_log_inode_parent+0x667/0x2730 [btrfs]
        [  691.490083]  ? join_transaction+0x24a/0xce0 [btrfs]
        [  691.490107]  ? btrfs_end_log_trans+0x80/0x80 [btrfs]
        [  691.490111]  ? dget_parent+0xb8/0x460
        [  691.490116]  ? lock_downgrade+0x6b0/0x6b0
        [  691.490121]  ? rwlock_bug.part.0+0x90/0x90
        [  691.490127]  ? do_raw_spin_unlock+0x142/0x220
        [  691.490151]  btrfs_log_dentry_safe+0x65/0x90 [btrfs]
        [  691.490172]  btrfs_sync_file+0x9f1/0xc00 [btrfs]
        [  691.490195]  ? btrfs_file_write_iter+0x1800/0x1800 [btrfs]
        [  691.490198]  ? rcu_read_lock_any_held.part.11+0x20/0x20
        [  691.490204]  ? __do_sys_newstat+0x88/0xd0
        [  691.490207]  ? cp_new_stat+0x5d0/0x5d0
        [  691.490218]  ? do_fsync+0x38/0x60
        [  691.490220]  do_fsync+0x38/0x60
        [  691.490224]  __x64_sys_fdatasync+0x32/0x40
        [  691.490228]  do_syscall_64+0x9f/0x540
        [  691.490233]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [  691.490235] RIP: 0033:0x7f4b253ad5f0
        [  691.490239] RSP: 002b:00007fffbb2d6078 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
        [  691.490242] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f4b253ad5f0
        [  691.490244] RDX: 00007fffbb2d5fe0 RSI: 00007fffbb2d5fe0 RDI: 0000000000000003
        [  691.490245] RBP: 000000000000000d R08: 0000000000000001 R09: 00007fffbb2d608c
        [  691.490247] R10: 00000000000002e8 R11: 0000000000000246 R12: 00000000000001f4
        [  691.490248] R13: 0000000051eb851f R14: 00007fffbb2d6120 R15: 00005635a498bda0
      This started happening recently when running some test cases from fstests
      like btrfs/004 for example, because support for rename exchange was added
      last week to fsstress from fstests.
      So fix this by deleting the log context for the source root from the list
      if we have logged the new name in the source root.
      Reported-by: default avatarSu Yue <Damenly_Su@gmx.com>
      Fixes: d4682ba0
       ("Btrfs: sync log after logging new name")
      CC: stable@vger.kernel.org # 4.19+
      Tested-by: default avatarSu Yue <Damenly_Su@gmx.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
  4. 04 Nov, 2019 1 commit
    • Josef Bacik's avatar
      btrfs: save i_size to avoid double evaluation of i_size_read in compress_file_range · d98da499
      Josef Bacik authored
      We hit a regression while rolling out 5.2 internally where we were
      hitting the following panic
        kernel BUG at mm/page-writeback.c:2659!
        RIP: 0010:clear_page_dirty_for_io+0xe6/0x1f0
        Call Trace:
         ? extent_clear_unlock_delalloc+0x43/0x70
         ? rescuer_thread+0x340/0x340
         ? kthread_create_on_node+0x60/0x60
      This is happening because the page is not locked when doing
      clear_page_dirty_for_io.  Looking at the core dump it was because our
      async_extent had a ram_size of 24576 but our async_chunk range only
      spanned 20480, so we had a whole extra page in our ram_size for our
      This happened because we try not to compress pages outside of our
      i_size, however a cleanup patch changed us to do
      actual_end = min_t(u64, i_size_read(inode), end + 1);
      which is problematic because i_size_read() can evaluate to different
      values in between checking and assigning.  So either an expanding
      truncate or a fallocate could increase our i_size while we're doing
      writeout and actual_end would end up being past the range we have
      I confirmed this was what was happening by installing a debug kernel
      that had
        actual_end = min_t(u64, i_size_read(inode), end + 1);
        if (actual_end > end + 1) {
      	  printk(KERN_ERR "KABOOM\n");
      	  actual_end = end + 1;
      and installing it onto 500 boxes of the tier that had been seeing the
      problem regularly.  Last night I got my debug message and no panic,
      confirming what I expected.
      [ dsterba: the assembly confirms a tiny race window:
          mov    0x20(%rsp),%rax
          cmp    %rax,0x48(%r15)           # read
          movl   $0x0,0x18(%rsp)
          mov    %rax,%r12
          mov    %r14,%rax
          cmovbe 0x48(%r15),%r12           # eval
        Where r15 is inode and 0x48 is offset of i_size.
        The original fix was to revert 62b37622 that would do an
        intermediate assignment and this would also avoid the doulble
        evaluation but is not future-proof, should the compiler merge the
        stores and call i_size_read anyway.
        There's a patch adding READ_ONCE to i_size_read but that's not being
        applied at the moment and we need to fix the bug. Instead, emulate
        READ_ONCE by two barrier()s that's what effectively happens. The
        assembly confirms single evaluation:
          mov    0x48(%rbp),%rax          # read once
          mov    0x20(%rsp),%rcx
          mov    $0x20,%edx
          cmp    %rax,%rcx
          cmovbe %rcx,%rax
          mov    %rax,(%rsp)
          mov    %rax,%rcx
          mov    %r14,%rax
        Where 0x48(%rbp) is inode->i_size stored to %eax.
      Fixes: 62b37622
       ("btrfs: Remove isize local variable in compress_file_range")
      CC: stable@vger.kernel.org # v5.1+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ changelog updated ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
  5. 15 Oct, 2019 1 commit
    • Qu Wenruo's avatar
      btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents() · 8702ba93
      Qu Wenruo authored
      Btrfs qgroup uses two types of reserved space for METADATA space,
      PERTRANS is metadata space reserved for each transaction started by
      While PREALLOC is for delalloc, where we reserve space before joining a
      transaction, and finally it will be converted to PERTRANS after the
      writeback is done.
      However there is inconsistency in how we handle PREALLOC metadata space.
      The most obvious one is:
      In btrfs_buffered_write():
      	btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);
      We always free qgroup PREALLOC meta space.
      While in btrfs_truncate_block():
      	btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));
      We only free qgroup PREALLOC meta space when something went wrong.
      [The Correct Behavior]
      The correct behavior should be the one in btrfs_buffered_write(), we
      should always free PREALLOC metadata space.
      The reason is, the btrfs_delalloc_* mechanism works by:
      - Reserve metadata first, even it's not necessary
        In btrfs_delalloc_reserve_metadata()
      - Free the unused metadata space
        Normally in:
        |- btrfs_inode_rsv_release()
           Here we do calculation on whether we should release or not.
      E.g. for 64K buffered write, the metadata rsv works like:
      /* The first page */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=0
      total:		num_bytes=calc_inode_reservations()
      /* The first page caused one outstanding extent, thus needs metadata
         rsv */
      /* The 2nd page */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=calc_inode_reservations()
      total:		not changed
      /* The 2nd page doesn't cause new outstanding extent, needs no new meta
         rsv, so we free what we have reserved */
      /* The 3rd~16th pages */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=calc_inode_reservations()
      total:		not changed (still space for one outstanding extent)
      This means, if btrfs_delalloc_release_extents() determines to free some
      space, then those space should be freed NOW.
      So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() other
      than btrfs_qgroup_convert_reserved_meta().
      The good news is:
      - The callers are not that hot
        The hottest caller is in btrfs_buffered_write(), which is already
        fixed by commit 336a8bb8 ("btrfs: Fix wrong
        btrfs_delalloc_release_extents parameter"). Thus it's not that
        easy to cause false EDQUOT.
      - The trans commit in advance for qgroup would hide the bug
        Since commit f5fef459
       ("btrfs: qgroup: Make qgroup async transaction
        commit more aggressive"), when btrfs qgroup metadata free space is slow,
        it will try to commit transaction and free the wrongly converted
        PERTRANS space, so it's not that easy to hit such bug.
      So to fix the problem, remove the @qgroup_free parameter for
      btrfs_delalloc_release_extents(), and always pass true to
      Reported-by: default avatarFilipe Manana <fdmanana@suse.com>
      Fixes: 43b18595
       ("btrfs: qgroup: Use separate meta reservation type for delalloc")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
  6. 01 Oct, 2019 1 commit
    • Josef Bacik's avatar
      btrfs: allocate new inode in NOFS context · 11a19a90
      Josef Bacik authored
      A user reported a lockdep splat
       WARNING: possible circular locking dependency detected
       5.2.11-gentoo #2 Not tainted
       kswapd0/711 is trying to acquire lock:
       000000007777a663 (sb_internal){.+.+}, at: start_transaction+0x3a8/0x500
      but task is already holding lock:
       000000000ba86300 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x30
      which lock already depends on the new lock.
      the existing dependency chain (in reverse order) is:
      -> #1 (fs_reclaim){+.+.}:
      -> #0 (sb_internal){.+.+}:
      other info that might help us debug this:
       Possible unsafe locking scenario:
       CPU0 CPU1
       ---- ----
      *** DEADLOCK ***
       3 locks held by kswapd0/711:
       #0: 000000000ba86300 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x30
       #1: 000000004a5100f8 (shrinker_rwsem){++++}, at: shrink_node+0x9a/0x380
       #2: 00000000f956fa46 (&type->s_umount_key#30){++++}, at: super_cache_scan+0x35/0x1d0
      stack backtrace:
       CPU: 7 PID: 711 Comm: kswapd0 Not tainted 5.2.11-gentoo #2
       Hardware name: Dell Inc. Precision Tower 3620/0MWYPT, BIOS 2.4.2 09/29/2017
       Call Trace:
       ? start_transaction+0x3a8/0x500
       ? start_transaction+0x3a8/0x500
       ? var_wake_function+0x20/0x20
       ? discard_new_inode+0xc0/0xc0
       ? discard_new_inode+0xc0/0xc0
       ? __wake_up_common_lock+0x90/0x90
       ? balance_pgdat+0x640/0x640
       ? __kthread_create_on_node+0x160/0x160
      This is because btrfs_new_inode() calls new_inode() under the
      transaction.  We could probably move the new_inode() outside of this but
      for now just wrap it in memalloc_nofs_save().
      Reported-by: default avatarZdenek Sojka <zsojka@seznam.cz>
      Fixes: 712e36c5
       ("btrfs: use GFP_KERNEL in btrfs_alloc_inode")
      CC: stable@vger.kernel.org # 4.16+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
  7. 09 Sep, 2019 18 commits
  8. 17 Jul, 2019 1 commit
    • Qu Wenruo's avatar
      btrfs: inode: Don't compress if NODATASUM or NODATACOW set · 42c16da6
      Qu Wenruo authored
      As btrfs(5) specified:
      	If nodatacow or nodatasum are enabled, compression is disabled.
      If NODATASUM or NODATACOW set, we should not compress the extent.
      Normally NODATACOW is detected properly in run_delalloc_range() so
      compression won't happen for NODATACOW.
      However for NODATASUM we don't have any check, and it can cause
      compressed extent without csum pretty easily, just by:
        mkfs.btrfs -f $dev
        mount $dev $mnt -o nodatasum
        touch $mnt/foobar
        mount -o remount,datasum,compress $mnt
        xfs_io -f -c "pwrite 0 128K" $mnt/foobar
      And in fact, we have a bug report about corrupted compressed extent
      without proper data checksum so even RAID1 can't recover the corruption.
      Running compression without proper checksum could cause more damage when
      corruption happens, as compressed data could make the whole extent
      unreadable, so there is no need to allow compression for
      The fix will refactor the inode compression check into two parts:
      - inode_can_compress()
        As the hard requirement, checked at btrfs_run_delalloc_range(), so no
        compression will happen for NODATASUM inode at all.
      - inode_need_compress()
        As the soft requirement, checked at btrfs_run_delalloc_range() and
      Reported-by: default avatarJames Harvey <jamespharvey20@gmail.com>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
  9. 04 Jul, 2019 1 commit
  10. 02 Jul, 2019 1 commit
    • Josef Bacik's avatar
      btrfs: run delayed iput at unlink time · 63611e73
      Josef Bacik authored
      We have been seeing issues in production where a cleaner script will end
      up unlinking a bunch of files that have pending iputs.  This means they
      will get their final iput's run at btrfs-cleaner time and thus are not
      throttled, which impacts the workload.
      Since we are unlinking these files we can just drop the delayed iput at
      unlink time.  We are already holding a reference to the inode so this
      will not be the final iput and thus is completely safe to do at this
      point.  Doing this means we are more likely to be doing the final iput
      at unlink time, and thus will get the IO charged to the caller and get
      throttled appropriately without affecting the main workload.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>