Skip to content
  • Qu Wenruo's avatar
    btrfs: qgroup: Move reserved data accounting from btrfs_delayed_ref_head to... · 1418bae1
    Qu Wenruo authored
    btrfs: qgroup: Move reserved data accounting from btrfs_delayed_ref_head to btrfs_qgroup_extent_record
    
    [BUG]
    Btrfs/139 will fail with a high probability if the testing machine (VM)
    has only 2G RAM.
    
    Resulting the final write success while it should fail due to EDQUOT,
    and the fs will have quota exceeding the limit by 16K.
    
    The simplified reproducer will be: (needs a 2G ram VM)
    
      $ mkfs.btrfs -f $dev
      $ mount $dev $mnt
    
      $ btrfs subv create $mnt/subv
      $ btrfs quota enable $mnt
      $ btrfs quota rescan -w $mnt
      $ btrfs qgroup limit -e 1G $mnt/subv
    
      $ for i in $(seq -w  1 8); do
      	xfs_io -f -c "pwrite 0 128M" $mnt/subv/file_$i > /dev/null
      	echo "file $i written" > /dev/kmsg
        done
      $ sync
      $ btrfs qgroup show -pcre --raw $mnt
    
    The last pwrite will not trigger EDQUOT and final 'qgroup show' will
    show something like:
    
      qgroupid         rfer         excl     max_rfer     max_excl parent  child
      --------         ----         ----     --------     -------- ------  -----
      0/5             16384        16384         none         none ---     ---
      0/256      1073758208   1073758208         none   1073741824 ---     ---
    
    And 1073758208 is larger than
      > 1073741824.
    
    [CAUSE]
    It's a bug in btrfs qgroup data reserved space management.
    
    For quota limit, we must ensure that:
      reserved (data + metadata) + rfer/excl <= limit
    
    Since rfer/excl is only updated at transaction commmit time, reserved
    space needs to be taken special care.
    
    One important part of reserved space is data, and for a new data extent
    written to disk, we still need to take the reserved space until
    rfer/excl numbers get updated.
    
    Originally when an ordered extent finishes, we migrate the reserved
    qgroup data space from extent_io tree to delayed ref head of the data
    extent, expecting delayed ref will only be cleaned up at commit
    transaction time.
    
    However for small RAM machine, due to memory pressure dirty pages can be
    flushed back to disk without committing a transaction.
    
    The related events will be something like:
    
      file 1 written
      btrfs_finish_ordered_io: ino=258 ordered offset=0 len=54947840
      btrfs_finish_ordered_io: ino=258 ordered offset=54947840 len=5636096
      btrfs_finish_ordered_io: ino=258 ordered offset=61153280 len=57344
      btrfs_finish_ordered_io: ino=258 ordered offset=61210624 len=8192
      btrfs_finish_ordered_io: ino=258 ordered offset=60583936 len=569344
      cleanup_ref_head: num_bytes=54947840
      cleanup_ref_head: num_bytes=5636096
      cleanup_ref_head: num_bytes=569344
      cleanup_ref_head: num_bytes=57344
      cleanup_ref_head: num_bytes=8192
      ^^^^^^^^^^^^^^^^ This will free qgroup data reserved space
      file 2 written
      ...
      file 8 written
      cleanup_ref_head: num_bytes=8192
      ...
      btrfs_commit_transaction  <<< the only transaction committed during
    				the test
    
    When file 2 is written, we have already freed 128M reserved qgroup data
    space for ino 258. Thus later write won't trigger EDQUOT.
    
    This allows us to write more data beyond qgroup limit.
    
    In my 2G ram VM, it could reach about 1.2G before hitting EDQUOT.
    
    [FIX]
    By moving reserved qgroup data space from btrfs_delayed_ref_head to
    btrfs_qgroup_extent_record, we can ensure that reserved qgroup data
    space won't be freed half way before commit transaction, thus fix the
    problem.
    
    Fixes: f64d5ca8
    
     ("btrfs: delayed_ref: Add new function to record reserved space into delayed ref")
    Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    1418bae1