1. 13 Dec, 2019 1 commit
    • Btrfs: fix missing data checksums after replaying a log tree · 40e046ac
      Filipe Manana authored
      When logging a file that has shared extents (reflinked with other files or
      with itself), we can end up logging multiple checksum items that cover
      overlapping ranges. This confuses the search for checksums at log replay
      time causing some checksums to never be added to the fs/subvolume tree.
      Consider the following example of a file that shares the same extent at
      offsets 0 and 256Kb:
         [ bytenr 13893632, offset 64Kb, len 64Kb  ]
         0                                         64Kb
         [ bytenr 13631488, offset 64Kb, len 192Kb ]
         64Kb                                      256Kb
         [ bytenr 13893632, offset 0, len 256Kb    ]
         256Kb                                     512Kb
      When logging the inode, at tree-log.c:copy_items(), when processing the
      file extent item at offset 0, we log a checksum item covering the range
      13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 +
      64Kb + 64Kb, respectively.
      Later when processing the extent item at offset 256K, we log the checksums
      for the range from 13893632 to 14155776 (which corresponds to 13893632 +
      256Kb). These checksums get merged with the checksum item for the range
      from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync.
      So after this we get the two following checksum items in the log tree:
         item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512
                 range start 13631488 end 14155776 length 524288
         item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64
                 range start 13959168 end 14024704 length 65536
      The first item fully covers the range of the second one, so the two overlap.
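      The arithmetic above can be checked with a short sketch. The byte
      values are taken from the log tree dump; the helper names (overlaps,
      contains) are made up for illustration and are not kernel functions.

```python
KIB = 1024

# Checksum items from the log tree dump, as half-open (start, end) byte ranges.
item6 = (13631488, 14155776)
item7 = (13959168, 14024704)


def overlaps(a, b):
    """True if half-open byte ranges a and b share at least one byte."""
    return a[0] < b[1] and b[0] < a[1]


def contains(outer, inner):
    """True if outer fully covers inner."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


# Range logged for the file extent item at file offset 0:
# extent bytenr 13893632, extent offset 64Kb, length 64Kb.
first_logged = (13893632 + 64 * KIB, 13893632 + 64 * KIB + 64 * KIB)

print(first_logged == item7)    # the range logged for offset 0 is item 7
print(overlaps(item6, item7))   # the two items overlap
print(contains(item6, item7))   # item 6 fully covers item 7
```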
      So far this does not cause a problem after replaying the log, because
      when replaying the file extent item for offset 256K, we copy all the
      checksums for the extent 13893632 from the log tree to the fs/subvolume
      tree, since searching for a checksum item for bytenr 13893632 leaves us
      at the first checksum item, which covers the whole range of the extent.
      However if we write 64Kb to file offset 256Kb for example, we will
      not be able to find and copy the checksums for the last 128Kb of the
      extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb.
      After writing 64Kb into file offset 256Kb we get the following extent
      layout for our file:
         [ bytenr 13893632, offset 64Kb, len 64Kb  ]
         0                                         64Kb
         [ bytenr 13631488, offset 64Kb, len 192Kb ]
         64Kb                                      256Kb
         [ bytenr 14155776, offset 0, len 64Kb     ]
         256Kb                                     320Kb
         [ bytenr 13893632, offset 64Kb, len 192Kb ]
         320Kb                                     512Kb
      After fsync'ing the file, if we have a power failure and then mount
      the filesystem to replay the log, the following happens:
      1) When replaying the file extent item for file offset 320Kb, we
         look up the checksums for the extent range from 13959168
         (13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call
         to btrfs_lookup_csums_range();
      2) btrfs_lookup_csums_range() finds the checksum item that starts
         precisely at offset 13959168 (item 7 in the log tree, shown before);
      3) However that checksum item only covers 64Kb of data, and not 192Kb
         of data;
      4) As a result only the checksums for the first 64Kb of data referenced
         by the file extent item are found and copied to the fs/subvolume tree.
         The remaining 128Kb of data, file range 384Kb to 512Kb, doesn't get
         the corresponding data checksums found and copied to the fs/subvolume
         tree.
      5) After replaying the log, userspace will not be able to read the file
         range from 384Kb to 512Kb, because the checksums are missing,
         resulting in an -EIO error.
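      Steps 1) to 4) can be sketched with a deliberately simplified model of
      the range lookup. This is not the code of btrfs_lookup_csums_range();
      it only assumes that the search starts at the last item whose start is
      at or before the requested offset and then walks forward.

```python
import bisect

KIB = 1024

# Checksum items in the log tree, sorted by start: half-open (start, end).
log_csum_items = [(13631488, 14155776), (13959168, 14024704)]


def lookup_csums_range(items, start, end):
    """Simplified model: begin at the last item whose start is <= `start`,
    walk forward only, and return the covered sub-ranges."""
    starts = [s for s, _ in items]
    idx = max(bisect.bisect_right(starts, start) - 1, 0)
    found = []
    for s, e in items[idx:]:
        lo, hi = max(s, start), min(e, end)
        if lo < hi:
            found.append((lo, hi))
    return found


# Replaying the file extent item at file offset 320Kb:
# extent bytenr 13893632, extent offset 64Kb, length 192Kb.
start = 13893632 + 64 * KIB
end = 13893632 + 256 * KIB
found = lookup_csums_range(log_csum_items, start, end)
covered = sum(hi - lo for lo, hi in found)
missing = (end - start) - covered

print(found)            # only the range of item 7 is returned
print(missing // KIB)   # Kb of data left without checksums
```

      In this model the exact match on item 7 hides item 6, which is the
      only item that covers the rest of the requested range.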
      The following steps reproduce this scenario:
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt/sdc
        $ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar
        $ xfs_io -c "fsync" /mnt/sdc/foobar
        $ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar
        $ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar
        $ xfs_io -c "fsync" /mnt/sdc/foobar
        $ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar
        $ xfs_io -c "fsync" /mnt/sdc/foobar
        <power failure>
        $ mount /dev/sdc /mnt/sdc
        $ md5sum /mnt/sdc/foobar
        md5sum: /mnt/sdc/foobar: Input/output error
        $ dmesg | tail
        [165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408
        [165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504
        [165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600
        [165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696
        [165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792
        [165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888
        [165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984
        [165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080
        [165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
        [165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
      Fix this simply by deleting first any checksums, from the log tree, for the
      range of the extent we are logging at copy_items(). This ensures we do not
      get checksum items in the log tree that have overlapping ranges.
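      A minimal sketch of the idea, not of the kernel implementation: the
      helpers below (delete_csum_range, log_csums, has_overlap) are
      hypothetical, and the merging of adjacent checksum items that the real
      code performs is not modelled.

```python
def delete_csum_range(items, start, end):
    """Remove [start, end) from every item, trimming or splitting as needed."""
    kept = []
    for s, e in items:
        if e <= start or s >= end:
            kept.append((s, e))        # no intersection, keep as is
            continue
        if s < start:
            kept.append((s, start))    # keep the part before the range
        if e > end:
            kept.append((end, e))      # keep the part after the range
    return kept


def log_csums(items, start, end):
    """Hypothetical helper: drop stale checksums for the range, then insert."""
    items = delete_csum_range(items, start, end)
    items.append((start, end))
    return sorted(items)


def has_overlap(items):
    """True if any two half-open ranges in `items` intersect."""
    items = sorted(items)
    return any(a[1] > b[0] for a, b in zip(items, items[1:]))


items = []
items = log_csums(items, 13631488, 13893632)   # logged by the first fsync
items = log_csums(items, 13959168, 14024704)   # file extent item at offset 0
items = log_csums(items, 13893632, 14155776)   # file extent item at offset 256Kb

print(items)
print(has_overlap(items))
```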
      This is a long-standing issue that has been present since we have the clone
      (and deduplication) ioctl, and can happen both when an extent is shared
      between different files and within the same file.
      A test case for fstests follows soon.
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 18 Nov, 2019 1 commit
  3. 01 Jul, 2019 3 commits
  4. 29 Apr, 2019 6 commits
  5. 25 Apr, 2019 1 commit
    • btrfs: Switch memory allocations in async csum calculation path to kvmalloc · a3d46aea
      Nikolay Borisov authored
      Recent multi-page biovec rework allowed creation of bios that can span
      large regions - up to 128 megabytes in the case of btrfs. OTOH btrfs'
      submission path currently allocates a contiguous array to store the
      checksums for every bio submitted. This means we can request up to
      (128MB / BTRFS_SECTOR_SIZE) * 4 bytes + 32 bytes of memory from kmalloc.
      On busy systems with possibly fragmented memory said kmalloc can fail,
      which will trigger a BUG_ON due to improper error handling in the IO
      submission context in btrfs.
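      As a worked example of the formula above, assuming a 4096 byte sector
      size and 4096 byte pages (neither value is stated in this message):

```python
MIB = 1024 * 1024

MAX_BIO_BYTES = 128 * MIB   # upper bound on bio size, from the text
SECTOR_SIZE = 4096          # ASSUMPTION: sector size is not given in the text
CSUM_BYTES = 4              # bytes of checksum per sector, from the formula
FIXED_OVERHEAD = 32         # fixed extra bytes, from the formula
PAGE_SIZE = 4096            # ASSUMPTION: 4 KiB pages

sectors = MAX_BIO_BYTES // SECTOR_SIZE
alloc_bytes = sectors * CSUM_BYTES + FIXED_OVERHEAD
pages = -(-alloc_bytes // PAGE_SIZE)   # ceiling division

print(sectors)       # number of sectors in the largest bio
print(alloc_bytes)   # bytes requested in a single allocation
print(pages)         # pages needed to hold that allocation
```

      Under these assumptions the request is slightly over 128 KiB, which
      kmalloc has to satisfy from physically contiguous pages.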
      Until error handling is improved or bios in btrfs are limited to a more
      manageable size (e.g. 1m), let's use kvmalloc to fall back to vmalloc for
      such large allocations. There is no hard requirement that the memory
      allocated for checksums during IO submission has to be contiguous, but
      this is a simple fix that does not require several non-contiguous
      allocations.
      For small writes this is unlikely to have any visible effect since
      kmalloc will still satisfy allocation requests as usual. For larger
      requests the code will just fall back to vmalloc.
      We've performed evaluation on several workload types and there was no
      significant difference between kmalloc and kvmalloc.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 17 Dec, 2018 2 commits
  7. 06 Aug, 2018 2 commits
    • btrfs: simplify pointer chasing of local fs_info variables · 3ffbd68c
      David Sterba authored
      Functions that get btrfs inode can simply reach the fs_info by
      dereferencing the root and this looks a bit more straightforward
      compared to the btrfs_sb(...) indirection.
      If the transaction handle is available and not NULL it's used instead.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Get rid of the confusing btrfs_file_extent_inline_len · e41ca589
      Qu Wenruo authored
      We used to call btrfs_file_extent_inline_len() to get the uncompressed
      data size of an inlined extent.
      However this function is hiding evil: for a compressed extent, it has no
      choice but to directly read out ram_bytes from btrfs_file_extent_item,
      while for an uncompressed extent, it uses the item size to calculate the
      real data size, ignoring ram_bytes completely.
      In fact, for corrupted ram_bytes, due to the above behavior the kernel's
      btrfs_print_leaf() can't even print the correct ram_bytes to expose the bug.
      Since we have the tree-checker to verify all EXTENT_DATA, such a mismatch
      can be detected pretty easily, thus we can trust ram_bytes without the
      evil btrfs_file_extent_inline_len().
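      The behaviour described above can be modelled roughly as follows. The
      class and function names are hypothetical, the 21 byte header size is
      an assumption about the on-disk layout, and tree_checker_ok() is only
      a stand-in for the kind of check the tree-checker performs.

```python
from dataclasses import dataclass

INLINE_DATA_START = 21   # ASSUMPTION: header bytes preceding the inline data


@dataclass
class InlineExtent:
    item_size: int     # total size of the item, header included
    ram_bytes: int     # uncompressed data size recorded in the item
    compressed: bool


def old_inline_len(ext):
    """Model of the removed helper: two different sources of truth."""
    if ext.compressed:
        return ext.ram_bytes
    return ext.item_size - INLINE_DATA_START   # ram_bytes is ignored


def new_inline_len(ext):
    """After the change: always trust ram_bytes."""
    return ext.ram_bytes


def tree_checker_ok(ext):
    """Stand-in check: an uncompressed inline extent must have matching sizes."""
    return ext.compressed or ext.item_size == INLINE_DATA_START + ext.ram_bytes


# An uncompressed inline extent holding 100 bytes, with a corrupted ram_bytes.
corrupted = InlineExtent(item_size=INLINE_DATA_START + 100,
                         ram_bytes=4096, compressed=False)

print(old_inline_len(corrupted))    # derived from item size, hides the corruption
print(new_inline_len(corrupted))    # reports the stored, corrupted value
print(tree_checker_ok(corrupted))   # the mismatch is detectable
```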
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  8. 12 Apr, 2018 1 commit
  9. 19 Jun, 2017 1 commit
    • Btrfs: change how we iterate bios in endio · 17347cec
      Liu Bo authored
      Since dio submit has used bio_clone_fast, the submitted bio may not have a
      reliable bi_vcnt. For the bio vector iterations in checksum related
      functions, bio->bi_iter is not modified yet, so it's safe to use
      bio_for_each_segment. For the bio vector iterations in dio read's endio,
      we now save a copy of bvec_iter in struct btrfs_io_bio when cloning bios
      and use the helper __bio_for_each_segment with the saved bvec_iter to
      access each bvec.
      For dio reads which don't get split, we also need to save a copy of the
      bio iterator in btrfs_bio_clone to let __bio_for_each_segment access
      each bvec in dio read's endio.  Note that this doesn't affect other calls of
      btrfs_bio_clone() because they don't need to use this iterator.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  10. 09 Jun, 2017 1 commit
  11. 28 Feb, 2017 2 commits
  12. 24 Feb, 2017 1 commit
    • Btrfs: bulk delete checksum items in the same leaf · 6f546216
      Filipe Manana authored
      Very often we have the checksums for an extent spread across multiple items
      in the checksums tree, and currently the algorithm to delete them starts
      by looking for them one by one and then deleting them one by one. This
      is not optimal, since each deletion involves shifting all the other items
      in the leaf and, when the leaf reaches some low threshold, moving items
      off the leaf into its left and right neighbor leafs. Also, after each
      item deletion we release our search path and start a new search for other
      checksum items.
      So optimize this by deleting in bulk all the items in the same leaf that
      contain checksums for the extent being freed.
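      The cost difference can be illustrated with a toy model in which a leaf
      is a plain list and deleting an item moves every later item one slot
      to the left. The function names are hypothetical, and leaf rebalancing
      and repeated tree searches are not modelled.

```python
def delete_one_by_one(leaf, slot, count):
    """Delete `count` items starting at `slot`, one at a time.
    Returns the new leaf and the total number of item moves."""
    leaf = list(leaf)
    shifted = 0
    for _ in range(count):
        del leaf[slot]
        shifted += len(leaf) - slot   # every item after the slot moves
    return leaf, shifted


def delete_bulk(leaf, slot, count):
    """Delete `count` contiguous items in one operation."""
    leaf = list(leaf)
    del leaf[slot:slot + count]
    shifted = len(leaf) - slot        # the tail moves only once
    return leaf, shifted


leaf = list(range(100))
single_leaf, single_moves = delete_one_by_one(leaf, slot=10, count=20)
bulk_leaf, bulk_moves = delete_bulk(leaf, slot=10, count=20)

print(single_leaf == bulk_leaf)   # both approaches leave the same leaf
print(single_moves, bulk_moves)   # item moves for each approach
```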
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
  13. 14 Feb, 2017 2 commits
  14. 06 Dec, 2016 4 commits
  15. 30 Nov, 2016 3 commits
  16. 03 Aug, 2016 1 commit
  17. 26 Jul, 2016 3 commits
  18. 29 Apr, 2016 1 commit
  19. 04 Apr, 2016 1 commit
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov authored
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
      ago with the promise that one day it would be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      This promise never materialized.  And it is unlikely it ever will.
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE.  And it's a constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      Let's stop pretending that pages in page cache are special.  They are
      not.
      The changes are pretty straight-forward:
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
       - page_cache_get() -> get_page();
       - page_cache_release() -> put_page();
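      The first two rewrites are safe only while both shifts are equal. A
      tiny check, assuming PAGE_SHIFT is 12 (4 KiB pages) and that
      PAGE_CACHE_SHIFT has the same value:

```python
PAGE_SHIFT = 12                 # ASSUMPTION: 4 KiB pages
PAGE_CACHE_SHIFT = PAGE_SHIFT   # ASSUMPTION: both shifts are equal

delta = PAGE_CACHE_SHIFT - PAGE_SHIFT

# With a zero shift distance, both shift expressions leave the value as is.
for foo in (0, 1, 4095, 4096, 1 << 40):
    assert foo << delta == foo
    assert foo >> delta == foo

print(delta)
```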
      This patch contains automated changes generated with coccinelle using
      the script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      The only adjustment after coccinelle is a revert of the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      There are a few places in the code that coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation will
      also be addressed in a separate patch.
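      virtual patch
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)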
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 21 Mar, 2016 1 commit
    • btrfs: make sure we stay inside the bvec during __btrfs_lookup_bio_sums · 389f239c
      Chris Mason authored
      Commit c40a3d38 (Btrfs: Compute and look up csums based on sectorsized
      blocks) changed how we walk the bios while looking up crcs.  There's an
      inner loop that jumps to the next bvec based on sectors, and before it
      derefs the next bvec it needs to make sure we're still in the bio.
      In this case, the outer loop would have decided to stop moving forward
      too, and the bvec deref is never actually used for anything.  But
      CONFIG_DEBUG_PAGEALLOC catches it because we're outside our bio.
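      The shape of the check can be sketched as below. This is a model and
      not the kernel loop: bvec lengths live in a list, the function name
      and parameters are hypothetical, and reading past the end of the list
      plays the role of the out-of-bounds bvec dereference.

```python
def advance(bvec_lens, bio_index, bytes_left, count, sector_size):
    """Advance `count` sectors through the bvecs.
    Returns the resulting (bio_index, bytes_left)."""
    while count > 0:
        count -= 1
        bytes_left -= sector_size
        if bytes_left == 0:
            bio_index += 1
            # Make sure we are still inside the bio before touching the
            # next bvec; without this check the read below goes past the end.
            if bio_index >= len(bvec_lens):
                break
            bytes_left = bvec_lens[bio_index]
    return bio_index, bytes_left


# Two bvecs of 8192 and 4096 bytes hold exactly three 4096 byte sectors.
print(advance([8192, 4096], bio_index=0, bytes_left=8192,
              count=3, sector_size=4096))
```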
      Signed-off-by: Chris Mason <clm@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
  21. 11 Mar, 2016 1 commit
  22. 01 Feb, 2016 1 commit