1. 13 Dec, 2019 2 commits
    • Filipe Manana's avatar
      Btrfs: fix missing data checksums after replaying a log tree · 40e046ac
      Filipe Manana authored
      
      
      When logging a file that has shared extents (reflinked with other files or
      with itself), we can end up logging multiple checksum items that cover
      overlapping ranges. This confuses the search for checksums at log replay
      time causing some checksums to never be added to the fs/subvolume tree.
      
      Consider the following example of a file that shares the same extent at
      offsets 0 and 256Kb:
      
         [ bytenr 13893632, offset 64Kb, len 64Kb  ]
         0                                         64Kb
      
         [ bytenr 13631488, offset 64Kb, len 192Kb ]
         64Kb                                      256Kb
      
         [ bytenr 13893632, offset 0, len 256Kb    ]
         256Kb                                     512Kb
      
      When logging the inode, at tree-log.c:copy_items(), when processing the
      file extent item at offset 0, we log a checksum item covering the range
      13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 +
      64Kb + 64Kb, respectively.
      
      Later when processing the extent item at offset 256K, we log the checksums
      for the range from 13893632 to 14155776 (which corresponds to 13893632 +
      256Kb). These checksums get merged with the checksum item for the range
      from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync.
      So after this we get the two following checksum items in the log tree:
      
         (...)
         item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512
                 range start 13631488 end 14155776 length 524288
         item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64
                 range start 13959168 end 14024704 length 65536
      
      The first one covers the range from the second one, they overlap.
      
      So far this does not cause a problem after replaying the log, because
      when replaying the file extent item for offset 256K, we copy all the
      checksums for the extent 13893632 from the log tree to the fs/subvolume
      tree, since searching for an checksum item for bytenr 13893632 leaves us
      at the first checksum item, which covers the whole range of the extent.
      
      However if we write 64Kb to file offset 256Kb for example, we will
      not be able to find and copy the checksums for the last 128Kb of the
      extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb.
      
      After writing 64Kb into file offset 256Kb we get the following extent
      layout for our file:
      
         [ bytenr 13893632, offset 64K, len 64Kb   ]
         0                                         64Kb
      
         [ bytenr 13631488, offset 64Kb, len 192Kb ]
         64Kb                                      256Kb
      
         [ bytenr 14155776, offset 0, len 64Kb     ]
         256Kb                                     320Kb
      
         [ bytenr 13893632, offset 64Kb, len 192Kb ]
         320Kb                                     512Kb
      
      After fsync'ing the file, if we have a power failure and then mount
      the filesystem to replay the log, the following happens:
      
      1) When replaying the file extent item for file offset 320Kb, we
         lookup for the checksums for the extent range from 13959168
         (13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call
         to btrfs_lookup_csums_range();
      
      2) btrfs_lookup_csums_range() finds the checksum item that starts
         precisely at offset 13959168 (item 7 in the log tree, shown before);
      
      3) However that checksum item only covers 64Kb of data, and not 192Kb
         of data;
      
      4) As a result only the checksums for the first 64Kb of data referenced
         by the file extent item are found and copied to the fs/subvolume tree.
         The remaining 128Kb of data, file range 384Kb to 512Kb, doesn't get
         the corresponding data checksums found and copied to the fs/subvolume
         tree.
      
      5) After replaying the log userspace will not be able to read the file
         range from 384Kb to 512Kb, because the checksums are missing and
         resulting in an -EIO error.
      
      The following steps reproduce this scenario:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt/sdc
      
        $ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar
        $ xfs_io -c "fsync" /mnt/sdc/foobar
        $ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar
      
        $ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar
        $ xfs_io -c "fsync" /mnt/sdc/foobar
      
        $ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar
        $ xfs_io -c "fsync" /mnt/sdc/foobar
      
        <power failure>
      
        $ mount /dev/sdc /mnt/sdc
        $ md5sum /mnt/sdc/foobar
        md5sum: /mnt/sdc/foobar: Input/output error
      
        $ dmesg | tail
        [165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408
        [165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504
        [165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600
        [165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696
        [165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792
        [165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888
        [165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984
        [165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080
        [165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
        [165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
      
      Fix this simply by deleting first any checksums, from the log tree, for the
      range of the extent we are logging at copy_items(). This ensures we do not
      get checksum items in the log tree that have overlapping ranges.
      
      This is a long time issue that has been present since we have the clone
      (and deduplication) ioctl, and can happen both when an extent is shared
      between different files and within the same file.
      
      A test case for fstests follows soon.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      40e046ac
    • Josef Bacik's avatar
      btrfs: handle error in btrfs_cache_block_group · db8fe64f
      Josef Bacik authored
      
      
      We have a BUG_ON(ret < 0) in find_free_extent from
      btrfs_cache_block_group.  If we fail to allocate our ctl we'll just
      panic, which is not good.  Instead just go on to another block group.
      If we fail to find a block group we don't want to return ENOSPC, because
      really we got a ENOMEM and that's the root of the problem.  Save our
      return from btrfs_cache_block_group(), and then if we still fail to make
      our allocation return that ret so we get the right error back.
      
      Tested with inject-error.py from bcc.
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      db8fe64f
  2. 18 Nov, 2019 6 commits
  3. 09 Sep, 2019 32 commits