1. 13 Dec, 2019 1 commit
  2. 18 Nov, 2019 14 commits
    • btrfs: drop bdev argument from submit_extent_page · fa17ed06
      David Sterba authored
      
      
      After the previous patches removed the bdev that was being passed
      around only to be set on the bio, it has become unused in
      submit_extent_page. So it now has "only" 13 parameters.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove extent_map::bdev · a019e9e1
      David Sterba authored
      
      
      We can now remove the bdev from extent_map. Previous patches made sure
      that bio_set_dev is called correctly in all places and that we don't
      need to grab it from latest_bdev or pass it around inside the extent map.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: drop bio_set_dev where not needed · 1a418027
      David Sterba authored
      
      
      bio_set_dev sets a bdev on a bio, and it is not only setting a pointer
      but also changing some state bits if there was a different bdev set
      before. This is one thing that's not needed.
      
      Another thing is that setting a bdev at bio allocation time is too early
      and actually does not work with plain redundancy profiles, where each
      time we submit a bio to a device, the bdev is set correctly.
      
      In many places the bio bdev is set to latest_bdev that seems to serve as
      a stub pointer "just to put something to bio". But we don't have to do
      that.
      
      Where do we know which bdev to set:
      
      * for regular IO: submit_stripe_bio that's called by btrfs_map_bio
      
      * repair IO: repair_io_failure, read or write from specific device
      
      * super block write (using buffer_heads but uses raw bdev) and barriers
      
      * scrub: this does not use all regular IO paths as it needs to reach all
        copies, verify and fixup eventually, and for that all bdev management
        is independent
      
      * raid56: rbio_add_io_page, for the RMW write
      
      * integrity-checker: does its own low-level block tracking
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: get bdev directly from fs_devices in submit_extent_page · 429aebc0
      David Sterba authored
      
      
      This is a preparatory patch to remove the @bdev parameter from
      submit_extent_page. It can't be removed completely, because the cgroups
      need it for wbc when initializing the bio:
      
      wbc_init_bio
        bio_associate_blkg_from_css
          dereference bdev->bi_disk->queue
      
      The bdev pointer is the same as latest_bdev, thus no functional change.
      We can retrieve it from fs_devices that's reachable through several
      dereferences. The local variable shadows the parameter, but that's only
      temporary.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sink write_flags to __extent_writepage_io · 57e5ffeb
      David Sterba authored
      
      
      __extent_writepage reads write flags from wbc and passes both to
      __extent_writepage_io. This makes write_flags redundant and we can
      remove it.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Avoid getting stuck during cyclic writebacks · f7bddf1e
      Tejun Heo authored
      
      
      During a cyclic writeback, extent_write_cache_pages() uses done_index
      to update the writeback_index after the current run is over.  However,
      instead of the current index + 1, it gets set to the current index
      itself.
      
      Unfortunately, this, combined with returning on EOF instead of looping
      back, can lead to the following pathological behavior.
      
      1. There is a single file which has accumulated enough dirty pages to
         trigger balance_dirty_pages(), while a writer keeps appending to
         the file with a series of short writes.
      
      2. balance_dirty_pages kicks in, wakes up background writeback and sleeps.
      
      3. Writeback kicks in and the cursor is on the last page of the dirty
         file.  Writeback is started or skipped if already in progress.  As
         it's EOF, extent_write_cache_pages() returns and the cursor is set
         to done_index which is pointing to the last page.
      
      4. Writeback is done.  Nothing happens till balance_dirty_pages
         finishes, at which point we go back to #1.
      
      This can almost completely stall out writing back of the file and keep
      the system over dirty threshold for a long time which can mess up the
      whole system.  We encountered this issue in production with a package
      handling application which can reliably reproduce the issue when
      running under tight memory limits.
      
      Reading the comment in the error handling section, this seems to be to
      avoid accidentally skipping a page in case the write attempt on the
      page doesn't succeed.  However, this concern seems bogus.
      
      On each page, the code either:
      
      * Skips and moves onto the next page.
      
      * Fails issue and sets done_index to index + 1.
      
      * Successfully issues and continue to the next page if budget allows
        and not EOF.
      
      IOW, as long as it's not EOF and there's budget, the code never
      retries writing back the same page.  Only when a page happens to be
      the last page of a particular run, we end up retrying the page, which
      can't possibly guarantee anything data integrity related.  Besides,
      cyclic writes are only used for non-syncing writebacks meaning that
      there's no data integrity implication to begin with.
      
      Fix it by always setting done_index past the current page being
      processed.
      
      Note that this problem exists in other writepages too.
      
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: extent_write_locked_range() should attach inode->i_wb · dbb70bec
      Chris Mason authored
      
      
      extent_write_locked_range() is used when we're falling back to buffered
      IO from inside of compression.  It allocates its own wbc and should
      associate it with the inode's i_wb to make sure the IO goes down from
      the correct cgroup.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios · ec39f769
      Chris Mason authored
      
      
      Async CRCs and compression submit IO through helper threads, which means
      they have IO priority inversions when cgroup IO controllers are in use.
      
      This flags all of the writes submitted by btrfs helper threads as
      REQ_CGROUP_PUNT.  submit_bio() will punt these to dedicated per-blkcg
      work items to avoid the priority inversion.
      
      For the compression code, we take a reference on the wbc's blkg css and
      pass it down to the async workers.
      
      For the async CRCs, the bio already has the correct css, we just need to
      tell the block layer to use REQ_CGROUP_PUNT.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Modified-and-reviewed-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: only associate the locked page with one async_chunk struct · 1d53c9e6
      Chris Mason authored
      The btrfs writepages function collects a large range of pages flagged
      for delayed allocation, and then sends them down through the COW code
      for processing.  When compression is on, we allocate one async_chunk
      structure for every 512K, and then run those pages through the
      compression code for IO submission.
      
      writepages starts all of this off with a single page, locked by the
      original call to extent_write_cache_pages(), and it's important to keep
      track of this page because it has already been through
      clear_page_dirty_for_io().
      
      The btrfs async_chunk struct has a pointer to the locked_page, and when
      we're redirtying the page because compression had to fallback to
      uncompressed IO, we use page->index to decide if a given async_chunk
      struct really owns that page.
      
      But, this is racy.  If a given delalloc range is broken up into two
      async_chunks (chunkA and chunkB), we can end up with something like
      this:
      
       compress_file_range(chunkA)
       submit_compress_extents(chunkA)
       submit compressed bios(chunkA)
       put_page(locked_page)
      
      				 compress_file_range(chunkB)
      				 ...
      
      Or:
      
       async_cow_submit
        submit_compressed_extents <--- falls back to buffered writeout
         cow_file_range
          extent_clear_unlock_delalloc
           __process_pages_contig
             put_page(locked_pages)
      
      					    async_cow_submit
      
      The end result is that chunkA is completed and cleaned up before chunkB
      even starts processing.  This means we can free locked_page and reuse
      it elsewhere.  If we get really lucky, it'll have the same page->index
      in its new home as it did before.
      
      While we're processing chunkB, we might decide we need to fall back to
      uncompressed IO, and so compress_file_range() will call
      __set_page_dirty_nobuffers() on chunkB->locked_page.
      
      Without cgroups in use, this creates a phantom dirty page, which isn't
      great but isn't the end of the world.  Worse, it can then go through
      the fixup worker and the whole COW machinery again:
      
      in submit_compressed_extents():
        while (async extents) {
        ...
          cow_file_range
          if (!page_started ...)
            extent_write_locked_range
          else if (...)
            unlock_page
          continue;
      
      This hasn't been observed in practice but is still possible.
      
      With cgroups in use, we might crash in the accounting code because
      page->mapping->i_wb isn't set.
      
        BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
        IP: percpu_counter_add_batch+0x11/0x70
        PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
        Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
        CPU: 16 PID: 2172 Comm: rm Not tainted
        RIP: 0010:percpu_counter_add_batch+0x11/0x70
        RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
        RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
        RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
        RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
        R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
        R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
        FS:  00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         account_page_cleaned+0x15b/0x1f0
         __cancel_dirty_page+0x146/0x200
         truncate_cleanup_page+0x92/0xb0
         truncate_inode_pages_range+0x202/0x7d0
         btrfs_evict_inode+0x92/0x5a0
         evict+0xc1/0x190
         do_unlinkat+0x176/0x280
         do_syscall_64+0x63/0x1a0
         entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
      The fix here is to make async_chunk->locked_page NULL everywhere but the
      one async_chunk struct that's allowed to do things to the locked page.
      
      Link: https://lore.kernel.org/linux-btrfs/c2419d01-5c84-3fb4-189e-4db519d08796@suse.com/
      Fixes: 771ed689 ("Btrfs: Optimize compressed writeback and reads")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      [ update changelog from mail thread discussion ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move the failrec tree stuff into extent-io-tree.h · b3f167aa
      Josef Bacik authored
      
      
      This needs to be cleaned up in the future, but for now it belongs to the
      extent-io-tree stuff since it uses the internal tree search code.
      Needed to export get_state_failrec and set_state_failrec as well since
      we're not going to move the actual IO part of the failrec stuff out at
      this point.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: export find_delalloc_range · 083e75e7
      Josef Bacik authored
      
      
      This uses internals of the extent_io_tree, so we need to export it
      before we move it.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move extent_io_tree defs to their own header · 9c7d3a54
      Josef Bacik authored
      
      
      extent_io.c/h are huge, encompassing a bunch of different things.  The
      extent_io_tree code can live on its own, so separate this out.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: separate out the extent io init function · 6f0d04f8
      Josef Bacik authored
      
      
      We are moving extent_io_tree into its own file, so separate out the
      extent_state init stuff from extent_io_tree_init().
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: separate out the extent leak code · 33ca832f
      Josef Bacik authored
      
      
      We check both extent buffer and extent state leaks in the same function,
      separate these two functions out so we can move them around.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 24 Sep, 2019 2 commits
  4. 12 Sep, 2019 1 commit
    • Btrfs: fix unwritten extent buffers and hangs on future writeback attempts · 18dfa711
      Filipe Manana authored
      lock_extent_buffer_for_io() returns 1 to tell the caller that everything
      went fine and it needs to start writeback for the extent buffer (submit
      a bio, etc), 0 to tell the caller everything went fine but it does not
      need to start writeback for the extent buffer, and a negative value if
      some error happened.
      
      When it's about to return 1 it tries to lock all pages, and if a try lock
      on a page fails, and we didn't flush any existing bio in our "epd", it
      calls flush_write_bio(epd) and overwrites the return value of 1 to 0 or
      an error. The page might have been locked elsewhere, not with the goal
      of starting writeback of the extent buffer, and even by some code other
      than btrfs, like page migration for example, so it does not mean the
      writeback of the extent buffer was already started by some other task,
      so returning a 0 tells the caller (btree_write_cache_pages()) to not
      start writeback for the extent buffer. Note that epd might currently have
      either no bio, so flush_write_bio() returns 0 (success) or it might have
      a bio for another extent buffer with a lower index (logical address).
      
      Since we return 0 with the EXTENT_BUFFER_WRITEBACK bit set on the
      extent buffer and writeback is never started for the extent buffer,
      future attempts to writeback the extent buffer will hang forever waiting
      on that bit to be cleared, since it can only be cleared after writeback
      completes. Such hang is reported with a trace like the following:
      
        [49887.347053] INFO: task btrfs-transacti:1752 blocked for more than 122 seconds.
        [49887.347059]       Not tainted 5.2.13-gentoo #2
        [49887.347060] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [49887.347062] btrfs-transacti D    0  1752      2 0x80004000
        [49887.347064] Call Trace:
        [49887.347069]  ? __schedule+0x265/0x830
        [49887.347071]  ? bit_wait+0x50/0x50
        [49887.347072]  ? bit_wait+0x50/0x50
        [49887.347074]  schedule+0x24/0x90
        [49887.347075]  io_schedule+0x3c/0x60
        [49887.347077]  bit_wait_io+0x8/0x50
        [49887.347079]  __wait_on_bit+0x6c/0x80
        [49887.347081]  ? __lock_release.isra.29+0x155/0x2d0
        [49887.347083]  out_of_line_wait_on_bit+0x7b/0x80
        [49887.347084]  ? var_wake_function+0x20/0x20
        [49887.347087]  lock_extent_buffer_for_io+0x28c/0x390
        [49887.347089]  btree_write_cache_pages+0x18e/0x340
        [49887.347091]  do_writepages+0x29/0xb0
        [49887.347093]  ? kmem_cache_free+0x132/0x160
        [49887.347095]  ? convert_extent_bit+0x544/0x680
        [49887.347097]  filemap_fdatawrite_range+0x70/0x90
        [49887.347099]  btrfs_write_marked_extents+0x53/0x120
        [49887.347100]  btrfs_write_and_wait_transaction.isra.4+0x38/0xa0
        [49887.347102]  btrfs_commit_transaction+0x6bb/0x990
        [49887.347103]  ? start_transaction+0x33e/0x500
        [49887.347105]  transaction_kthread+0x139/0x15c
      
      So fix this by not overwriting the return value (ret) with the result
      from flush_write_bio(). We also need to clear the EXTENT_BUFFER_WRITEBACK
      bit in case flush_write_bio() returns an error, otherwise it will hang
      any future attempts to writeback the extent buffer, and undo all work
      done before (set back EXTENT_BUFFER_DIRTY, etc).
      
      This is a regression introduced in the 5.2 kernel.
      
      Fixes: 2e3c2513 ("btrfs: extent_io: add proper error handling to lock_extent_buffer_for_io()")
      Fixes: f4340622 ("btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up")
      Reported-by: Zdenek Sojka <zsojka@seznam.cz>
      Link: https://lore.kernel.org/linux-btrfs/GpO.2yos.3WGDOLpx6t%7D.1TUDYM@seznam.cz/T/#u
      Reported-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Link: https://lore.kernel.org/linux-btrfs/5c4688ac-10a7-fb07-70e8-c5d31a3fbb38@profihost.ag/T/#t
      Reported-by: Drazen Kacar <drazen.kacar@oradian.com>
      Link: https://lore.kernel.org/linux-btrfs/DB8PR03MB562876ECE2319B3E579590F799C80@DB8PR03MB5628.eurprd03.prod.outlook.com/
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204377
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 09 Sep, 2019 2 commits
  6. 10 Jul, 2019 1 commit
  7. 05 Jul, 2019 1 commit
  8. 04 Jul, 2019 1 commit
  9. 02 Jul, 2019 5 commits
  10. 01 Jul, 2019 5 commits
  11. 30 Apr, 2019 1 commit
  12. 29 Apr, 2019 6 commits