Skip to content
  • Yang Shi's avatar
    mm: shmem: make stat.st_blksize return huge page size if THP is on · 89fdcd26
    Yang Shi authored
    Since tmpfs THP was supported in 4.8, hugetlbfs is not the only
    filesystem with huge page support anymore.  tmpfs can use huge page via
    THP when mounting by "huge=" mount option.
    
    When applications use huge page on hugetlbfs, it just need check the
    filesystem magic number, but it is not enough for tmpfs.  Make
    stat.st_blksize return huge page size if it is mounted by appropriate
    "huge=" option to give applications a hint to optimize the behavior with
    THP.
    
    Some applications may not do wisely with THP.  For example, QEMU may
    mmap file on non huge page aligned hint address with MAP_FIXED, which
    results in no pages are PMD mapped even though THP is used.  Some
    applications may mmap file with non huge page aligned offset.  Both
    behaviors make THP pointless.
    
    statfs.f_bsize still returns 4KB for tmpfs since THP could be split, and
    it also may fallback to 4KB page silently if there is not enough huge
    page.  Furthermore, different f_bsize makes max_blocks and free_blocks
    calculation harder but without too much benefit.  Returning huge page
    size via stat.st_blksize sounds good enough.
    
    Since PUD size huge page for THP has not been supported, now it just
    returns HPAGE_PMD_SIZE.
    
    Hugh said:
    
    : Sorry, I have no enthusiasm for this patch; but do I feel strongly
    : enough to override you and everyone else to NAK it?  No, I don't feel
    : that strongly, maybe st_blksize isn't worth arguing over.
    :
    : We did look at struct stat when designing huge tmpfs, to see if there
    : were any fields that should be adjusted for it; but concluded none.
    : Yes, it would sometimes be nice to have a quickly accessible indicator
    : for when tmpfs has been mounted huge (scanning /proc/mounts for options
    : can be tiresome, agreed); but since tmpfs tries to supply huge (or not)
    : pages transparently, no difference seemed right.
    :
    : So, because st_blksize is a not very useful field of struct stat, with
    : "size" in the name, we're going to put HPAGE_PMD_SIZE in there instead
    : of PAGE_SIZE, if the tmpfs was mounted with one of the huge "huge"
    : options (force or always, okay; within_size or advise, not so much).
    : Though HPAGE_PMD_SIZE is no more its "preferred I/O size" or "blocksize
    : for file system I/O" than PAGE_SIZE was.
    :
    : Which we can expect to speed up some applications and disadvantage
    : others, depending on how they interpret st_blksize: just like if we
    : changed it in the same way on non-huge tmpfs.  (Did I actually try
    : changing st_blksize early on, and find it broke something?  If so, I've
    : now forgotten what, and a search through commit messages didn't find
    : it; but I guess we'll find out soon enough.)
    :
    : If there were an mstat() syscall, returning a field "preferred
    : alignment", then we could certainly agree to put HPAGE_PMD_SIZE in
    : there; but in stat()'s st_blksize?  And what happens when (in future)
    : mm maps this or that hard-disk filesystem's blocks with a pmd mapping -
    : should that filesystem then advertise a bigger st_blksize, despite the
    : same disk layout as before?  What happens with DAX?
    :
    : And this change is not going to help the QEMU suboptimality that
    : brought you here (or does QEMU align mmaps according to st_blksize?).
    : QEMU ought to work well with kernels without this change, and kernels
    : with this change; and I hope it can easily deal with both by avoiding
    : that use of MAP_FIXED which prevented the kernel's intended alignment.
    
    [akpm@linux-foundation.org: remove unneeded `else']
    Link: http://lkml.kernel.org/r/1524665633-83806-1-git-send-email-yang.shi@linux.alibaba.com
    
    
    Signed-off-by: default avatarYang Shi <yang.shi@linux.alibaba.com>
    Suggested-by: default avatarChristoph Hellwig <hch@infradead.org>
    Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
    Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    89fdcd26