1. 24 Sep, 2019 1 commit
    • libnvdimm/dax: Pick the right alignment default when creating dax devices · f5376699
      Aneesh Kumar K.V authored
      Allow arch to provide the supported alignments and use hugepage alignment only
      if we support hugepage. Right now we depend on compile time configs whereas this
      patch switches this to runtime discovery.
      
      Architectures like ppc64 can have THP enabled in code, but then can have
      hugepage size disabled by the hypervisor. This allows us to create dax devices
      with PAGE_SIZE alignment in this case.
      
      Existing dax namespace with alignment larger than PAGE_SIZE will fail to
      initialize in this specific case. We still allow fsdax namespace initialization.
      
      With respect to identifying whether to enable hugepage fault for a dax device,
      if THP is enabled during compile, we default to taking hugepage fault and in dax
      fault handler if we find the fault size > alignment we retry with PAGE_SIZE
      fault size.
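
The selection and fallback described above can be sketched in userspace C. This is only an illustration under stated assumptions: `default_align()` and `clamp_fault_size()` are hypothetical names, and the "hugepage supported" decision is passed in as a plain boolean rather than discovered from the architecture.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: use the hugepage size as the default alignment
 * only when hugepages are usable at runtime, else fall back to the
 * base page size. */
unsigned long default_align(bool hugepage_supported,
                            unsigned long hugepage_size,
                            unsigned long page_size)
{
    assert(page_size != 0);
    return hugepage_supported ? hugepage_size : page_size;
}

/* Hypothetical helper: a fault larger than the device alignment is
 * retried with a page_size fault, as the commit message describes. */
unsigned long clamp_fault_size(unsigned long fault_size,
                               unsigned long align,
                               unsigned long page_size)
{
    return fault_size > align ? page_size : fault_size;
}
```

With a 64K base page and 16M hugepages disabled, `default_align(false, 16777216, 65536)` yields the 64K alignment.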
      
      This also addresses the below failure scenario on ppc64
      
      ndctl create-namespace --mode=devdax  | grep align
       "align":16777216,
       "align":16777216
      
      cat /sys/devices/ndbus0/region0/dax0.0/supported_alignments
       65536 16777216
      
      daxio.static-debug  -z -o /dev/dax0.0
        Bus error (core dumped)
      
        $ dmesg | tail
         lpar: Failed hash pte insert with error -4
         hash-mmu: mm: Hashing failure ! EA=0x7fff17000000 access=0x8000000000000006 current=daxio
         hash-mmu:     trap=0x300 vsid=0x22cb7a3a ssize=1 base psize=2 psize 10 pte=0xc000000501002b86
         daxio[3860]: bus error (7) at 7fff17000000 nip 7fff973c007c lr 7fff973bff34 code 2 in libpmem.so.1.0.0[7fff973b0000+20000]
         daxio[3860]: code: 792945e4 7d494b78 e95f0098 7d494b78 f93f00a0 4800012c e93f0088 f93f0120
         daxio[3860]: code: e93f00a0 f93f0128 e93f0120 e95f0128 <f9490000> e93f0088 39290008 f93f0110
      
      The failure was due to guest kernel using wrong page size.
      
      The namespaces created with 16M alignment will appear as below on a config with
      16M page size disabled.
      
      $ ndctl list -Ni
      [
        {
          "dev":"namespace0.1",
          "mode":"fsdax",
          "map":"dev",
          "size":5351931904,
          "uuid":"fc6e9667-461a-4718-82b4-69b24570bddb",
          "align":16777216,
          "blockdev":"pmem0.1",
          "supported_alignments":[
            65536
          ]
        },
        {
          "dev":"namespace0.0",
          "mode":"fsdax",    <==== devdax 16M alignment marked disabled.
          "map":"mem",
          "size":5368709120,
          "uuid":"a4bdf81a-f2ee-4bc6-91db-7b87eddd0484",
          "state":"disabled"
        }
      ]
      
      Cc: linux-mm@kvack.org
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Link: https://lore.kernel.org/r/20190905154603.10349-8-aneesh.kumar@linux.ibm.com
      
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  2. 05 Sep, 2019 1 commit
  3. 05 Jul, 2019 1 commit
  4. 05 Jun, 2019 1 commit
  5. 01 May, 2019 1 commit
    • libnvdimm/namespace: Fix label tracking error · c4703ce1
      Dan Williams authored
      Users have reported intermittent occurrences of DIMM initialization
      failures due to duplicate allocations of address capacity detected in
      the labels, or errors of the form below. Both have the same root cause.
      
          nd namespace1.4: failed to track label: 0
          WARNING: CPU: 17 PID: 1381 at drivers/nvdimm/label.c:863
      
          RIP: 0010:__pmem_label_update+0x56c/0x590 [libnvdimm]
          Call Trace:
           ? nd_pmem_namespace_label_update+0xd6/0x160 [libnvdimm]
           nd_pmem_namespace_label_update+0xd6/0x160 [libnvdimm]
           uuid_store+0x17e/0x190 [libnvdimm]
           kernfs_fop_write+0xf0/0x1a0
           vfs_write+0xb7/0x1b0
           ksys_write+0x57/0xd0
           do_syscall_64+0x60/0x210
      
      Unfortunately those reports were typically with a busy parallel
      namespace creation / destruction loop making it difficult to see the
      components of the bug. However, Jane provided a simple reproducer using
      the work-in-progress sub-section implementation.
      
      When ndctl is reconfiguring a namespace it may take an existing defunct
      / disabled namespace and reconfigure it with a new uuid and other
      parameters. Critically namespace_update_uuid() takes existing address
      resources and renames them for the new namespace to use / reconfigure as
      it sees fit. The bug is that this rename only happens in the resource
      tracking tree. Existing labels with the old uuid are not reaped leading
      to a scenario where multiple active labels reference the same span of
      address range.
      
      Teach namespace_update_uuid() to flag any references to the old uuid for
      reaping at the next label update attempt.
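
A minimal sketch of the "flag for reaping" idea, assuming a simplified in-memory label list. `struct label_ent`, its `reap` field, and `flag_stale_labels()` are hypothetical stand-ins for illustration and do not mirror the kernel's label structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define UUID_LEN 16

/* Hypothetical, simplified stand-in for a tracked namespace label. */
struct label_ent {
    unsigned char uuid[UUID_LEN];
    bool reap;      /* delete this label at the next label update */
};

/* Flag every label that still carries old_uuid so it is not left
 * active alongside the labels written for the new uuid.
 * Returns the number of labels flagged. */
size_t flag_stale_labels(struct label_ent *labels, size_t n,
                         const unsigned char *old_uuid)
{
    size_t flagged = 0;

    assert(labels != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (memcmp(labels[i].uuid, old_uuid, UUID_LEN) == 0) {
            labels[i].reap = true;
            flagged++;
        }
    }
    return flagged;
}
```

Deferring the actual deletion to the next label update keeps the rename itself cheap while still preventing two active labels from covering the same address span.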
      
      Cc: <stable@vger.kernel.org>
      Fixes: bf9bccc1 ("libnvdimm: pmem label sets and namespace instantiation")
      Link: https://github.com/pmem/ndctl/issues/91
      
      Reported-by: Jane Chu <jane.chu@oracle.com>
      Reported-by: Jeff Moyer <jmoyer@redhat.com>
      Reported-by: Erwin Tsaur <erwin.tsaur@oracle.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  6. 21 Jan, 2019 1 commit
    • libnvdimm/security: Require nvdimm_security_setup_events() to succeed · 1cd73865
      Dan Williams authored
      The following warning:
      
          ACPI0012:00: security event setup failed: -19
      
      ...is meant to capture exceptional failures of sysfs_get_dirent(),
      however it will also fail in the common case when security support is
      disabled. A few issues:
      
      1/ A dev_warn() report for a common case is too chatty
      2/ The setup of this notifier is generic, no need for it to be driven
         from the nfit driver, it can exist completely in the core.
      3/ If it fails for any reason besides security support being disabled,
         that's fatal and should abort DIMM activation. Userspace may hang if
         it never gets overwrite notifications.
      4/ The dirent needs to be released.
      
      Move the call to the core 'dimm' driver, make it conditional on security
      support being active, make it fatal for the exceptional case, add the
      missing sysfs_put() at device disable time.
      
      Fixes: 7d988097 ("...Add security DSM overwrite support")
      Reviewed-by: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  7. 07 Jan, 2019 1 commit
    • acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node · 8fc5c735
      Dan Williams authored
      Persistent memory, as described by the ACPI NFIT (NVDIMM Firmware
      Interface Table), is the first known instance of a memory range
      described by a unique "target" proximity domain. Here, "initiator" and
      "target" proximity domains are the approach that the ACPI HMAT
      (Heterogeneous Memory Attributes Table) uses to describe the unique
      performance properties of a memory range relative to a given initiator
      (e.g. CPU or DMA device).
      
      Currently the numa-node for a /dev/pmemX block-device or /dev/daxX.Y
      char-device follows the traditional notion of 'numa-node' where the
      attribute conveys the closest online numa-node. That numa-node attribute
      is useful for cpu-binding and memory-binding processes *near* the
      device. However, when the memory range backing a 'pmem', or 'dax' device
      is onlined (memory hot-add) the memory-only-numa-node representing that
      address needs to be differentiated from the set of online nodes. In
      other words, the numa-node association of the device depends on whether
      you can bind processes *near* the cpu-numa-node in the offline
      device-case, or bind processes *on* the memory-range directly after the
      backing address range is onlined.
      
      Allow for the case that platform firmware describes persistent memory
      with a unique proximity domain, i.e. when it is distinct from the
      proximity of DRAM and CPUs that are on the same socket. Plumb the Linux
      numa-node translation of that proximity through the libnvdimm region
      device to namespaces that are in device-dax mode. With this in place the
      proposed kmem driver [1] can optionally discover a unique numa-node
      number for the address range as it transitions the memory from an
      offline state managed by a device-driver to an online memory range
      managed by the core-mm.
      
      [1]: https://lore.kernel.org/lkml/20181022201317.8558C1D8@viggo.jf.intel.com
      
      Reported-by: Fan Du <fan.du@intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Oliver O'Halloran" <oohall@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  8. 14 Dec, 2018 1 commit
  9. 12 Oct, 2018 1 commit
    • nvdimm: Split label init out from the logic for getting config data · 2d657d17
      Alexander Duyck authored
      
      
      This patch splits the initialization of the label data into two functions.
      One for doing the init, and another for reading the actual configuration
      data. The idea behind this is that by doing this we create a symmetry
      between the getting and setting of config data in that we have a function
      for both. In addition it will make it easier for us to identify the bits
      that are related to init versus the pieces that are a wrapper for reading
      data from the ACPI interface.
      
      So for example by splitting things out like this it becomes much more
      obvious that we were performing checks that weren't necessarily related to
      the set/get operations such as relying on ndd->data being present when the
      set and get ops should not care about a locally cached copy of the label
      area.
      Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  10. 18 Jul, 2018 1 commit
    • block: Add and use op_stat_group() for indexing disk_stat fields. · ddcf35d3
      Michael Callahan authored
      
      
      Add and use a new op_stat_group() function for indexing partition stat
      fields rather than indexing them by rq_data_dir() or bio_data_dir().
      This function works similarly to op_is_sync() in that it takes the
      request::cmd_flags or bio::bi_opf flags and determines which stats
      should get updated.
      
      In addition, the second parameter to generic_start_io_acct() and
      generic_end_io_acct() is now a REQ_OP rather than simply a read or
      write bit and it uses op_stat_group() on the parameter to determine
      the stat group.
      
      Note that the partition in_flight counts are not part of the per-cpu
      statistics and as such are not indexed via this function.  They are now
      indexed by op_is_write().
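
The split between stat-group indexing and in_flight indexing can be sketched as below. Everything prefixed `demo_`/`DEMO_` is hypothetical: the operation encoding (low bit set for writes) and the two-group layout are assumptions of this sketch, not a copy of the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative operation codes; assumed encoding where operations that
 * send data to the device have the low bit set. */
enum demo_req_op {
    DEMO_OP_READ = 0,
    DEMO_OP_WRITE = 1,
    DEMO_OP_FLUSH = 2,
    DEMO_OP_DISCARD = 3
};

enum demo_stat_group { DEMO_STAT_READ = 0, DEMO_STAT_WRITE = 1, DEMO_NR_STAT_GROUPS };

struct demo_disk_stats {
    unsigned long ios[DEMO_NR_STAT_GROUPS]; /* indexed by stat group */
    unsigned long in_flight[2];             /* indexed by read/write */
};

int demo_op_is_write(unsigned int op)
{
    return (op & 1) != 0;
}

/* Map an operation onto the stat group it should be accounted in. */
int demo_op_stat_group(unsigned int op)
{
    return demo_op_is_write(op) ? DEMO_STAT_WRITE : DEMO_STAT_READ;
}

void demo_account_start(struct demo_disk_stats *s, unsigned int op)
{
    assert(s != NULL);
    s->ios[demo_op_stat_group(op)]++;
    s->in_flight[demo_op_is_write(op)]++;
}
```

Keeping the mapping in one function means a later change to the grouping (for example, a separate group for discards) would only touch `demo_op_stat_group()`, while in_flight accounting stays keyed on read versus write.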
      
      tj: Refreshed on top of v4.17.  Updated to pass around REQ_OP.
      Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Matias Bjorling <mb@lightnvm.io>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  11. 14 Jul, 2018 1 commit
  12. 03 Apr, 2018 1 commit
  13. 17 Mar, 2018 1 commit
    • block: Move SECTOR_SIZE and SECTOR_SHIFT definitions into <linux/blkdev.h> · 233bde21
      Bart Van Assche authored
      
      
      It happens often while I'm preparing a patch for a block driver that
      I'm wondering: is a definition of SECTOR_SIZE and/or SECTOR_SHIFT
      available for this driver? Do I have to introduce definitions of these
      constants before I can use these constants? To avoid this confusion,
      move the existing definitions of SECTOR_SIZE and SECTOR_SHIFT into the
      <linux/blkdev.h> header file such that these become available for all
      block drivers. Make the SECTOR_SIZE definition in the uapi msdos_fs.h
      header file conditional to avoid that including that header file after
      <linux/blkdev.h> causes the compiler to complain about a SECTOR_SIZE
      redefinition.
      
      Note: the SECTOR_SIZE / SECTOR_SHIFT / SECTOR_BITS definitions have
      not been removed from uapi header files nor from NAND drivers in
      which these constants are used for another purpose than converting
      block layer offsets and sizes into a number of sectors.
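
For illustration, a userspace sketch of guarded sector constants and the conversions they are meant for. The values shown are the customary 512-byte sector (shift of 9); the `#ifndef` guards mimic the conditional definition the message describes, and `bytes_to_sectors()` / `sectors_to_bytes()` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Guarded so that a second header defining the same constants does not
 * trigger a redefinition complaint from the compiler. */
#ifndef SECTOR_SHIFT
#define SECTOR_SHIFT 9
#endif
#ifndef SECTOR_SIZE
#define SECTOR_SIZE (1 << SECTOR_SHIFT)
#endif

/* Convert a byte count into a number of 512-byte sectors (rounds down). */
uint64_t bytes_to_sectors(uint64_t bytes)
{
    return bytes >> SECTOR_SHIFT;
}

/* Convert a sector count back into bytes. */
uint64_t sectors_to_bytes(uint64_t sectors)
{
    assert(sectors <= (UINT64_MAX >> SECTOR_SHIFT)); /* no overflow */
    return sectors << SECTOR_SHIFT;
}
```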
      
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  14. 08 Jan, 2018 1 commit
  15. 02 Nov, 2017 1 commit
  16. 28 Sep, 2017 1 commit
  17. 30 Aug, 2017 1 commit
    • libnvdimm, label: fix index block size calculation · 02881768
      Dan Williams authored
      
      
      The old calculation assumed that the label space was 128k and the label
      size is 128. With v1.2 labels where the label size is 256 this
      calculation will return zero. We are saved by the fact that the
      nsindex_size is always pre-initialized from a previous 128 byte
      assumption and we are lucky that the index sizes turn out the same.
      
      Fix this going forward in case we start encountering different
      geometries of label areas besides 128k.
      
      Since the label size can change from one call to the next, drop the
      caching of nsindex_size.
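
The arithmetic can be illustrated with a sketch reconstructed from the description above, not copied from the kernel. The 256-byte index granularity, the 72-byte index header, the "one byte of overhead per label" form of the old calculation, and all `demo_`/`DEMO_` names are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_INDEX_ALIGN 256u  /* assumed index block granularity */
#define DEMO_INDEX_HDR    72u  /* assumed fixed index header size */

/* Old-style estimate: one byte of index overhead per label, shared by
 * two index blocks. Integer division drops to zero for larger labels. */
size_t demo_index_size_old(size_t config_size, size_t label_size)
{
    assert(label_size != 0);
    size_t span = config_size / (label_size + 1);

    span /= DEMO_INDEX_ALIGN * 2;
    return span * DEMO_INDEX_ALIGN;
}

/* Size derived from the slot count: header plus one free-bitmap bit per
 * label slot, rounded up to the index granularity. */
size_t demo_index_size_new(size_t config_size, size_t label_size)
{
    assert(label_size != 0);
    size_t nslot = config_size / label_size;
    size_t raw = DEMO_INDEX_HDR + (nslot + 7) / 8;

    return (raw + DEMO_INDEX_ALIGN - 1) / DEMO_INDEX_ALIGN * DEMO_INDEX_ALIGN;
}
```

Under these assumptions a 128k label area gives 256 for 128-byte labels with either function, while the old-style estimate returns 0 for 256-byte labels and the slot-based one still returns 256.
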
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  18. 23 Aug, 2017 1 commit
    • block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Christoph Hellwig authored
      
      
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different life time rules from the gendisk and
      request_queue and is usually only available when the block device node
      is open.  Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  19. 12 Aug, 2017 1 commit
  20. 09 Aug, 2017 1 commit
  21. 05 Aug, 2017 1 commit
  22. 25 Jul, 2017 1 commit
    • libnvdimm: Stop using HPAGE_SIZE · 0dd69643
      Oliver O'Halloran authored
      
      
      Currently libnvdimm uses HPAGE_SIZE as the default alignment for DAX and
      PFN devices. HPAGE_SIZE is the default hugetlbfs page size and when
      hugetlbfs is disabled it defaults to PAGE_SIZE. Given DAX has more
      in common with THP than hugetlbfs we should probably be using
      HPAGE_PMD_SIZE, but this is undefined when THP is disabled so let's just
      give it a new name.
      
      The other usage of HPAGE_SIZE in libnvdimm is when determining how large
      the altmap should be. For the reasons mentioned above it doesn't really
      make sense to use HPAGE_SIZE here either. PMD_SIZE seems to be safe to
      use in generic code and it happens to match the vmemmap allocation block
      on x86 and Power. It's still a hack, but it's a slightly nicer hack.
      Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  23. 29 Jun, 2017 1 commit
  24. 15 Jun, 2017 5 commits
  25. 11 May, 2017 1 commit
    • libnvdimm: add an atomic vs process context flag to rw_bytes · 3ae3d67b
      Vishal Verma authored
      
      
      nsio_rw_bytes can clear media errors, but this cannot be done while we
      are in an atomic context due to locking within ACPI. From the BTT,
      ->rw_bytes may be called either from atomic or process context depending
      on whether the calls happen during initialization or during IO.
      
      During init, we want to ensure error clearing happens, and the flag
      marking process context allows nsio_rw_bytes to do that. When called
      during IO, we're in atomic context, and error clearing can be skipped.
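
A minimal sketch of a context flag gating error clearing. `DEMO_IO_ATOMIC`, `struct demo_media`, and `demo_rw_bytes()` are hypothetical names, and returning an error for a write that hits a bad range in atomic context is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_IO_ATOMIC 1UL  /* hypothetical flag: caller must not sleep */

/* Hypothetical media state: one known-bad range and a clear counter. */
struct demo_media {
    bool has_error;
    unsigned int clears;
};

/* Returns 0 on success, -1 when a known media error is hit and could
 * not be cleared. */
int demo_rw_bytes(struct demo_media *m, bool is_write, unsigned long flags)
{
    assert(m != NULL);
    if (m->has_error) {
        if (!is_write || (flags & DEMO_IO_ATOMIC))
            return -1;          /* reads fail; atomic writes skip clearing */
        m->has_error = false;   /* process context: clearing may sleep */
        m->clears++;
    }
    return 0;
}
```

A caller on the initialization path would pass `0` for `flags` so errors get cleared, while the IO path would pass `DEMO_IO_ATOMIC`.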
      
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  26. 04 May, 2017 1 commit
  27. 13 Apr, 2017 1 commit
  28. 01 Mar, 2017 1 commit
    • nfit, libnvdimm: fix interleave set cookie calculation · 86ef58a4
      Dan Williams authored
      The interleave-set cookie is a sum that sanity checks that the composition of
      an interleave set has not changed from when the namespace was initially
      created.  The checksum is calculated by sorting the DIMMs by their
      location in the interleave-set. The comparison for the sort must be
      64-bit wide, not byte-by-byte as performed by memcmp() in the broken
      case.
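
The difference between the two orderings can be sketched with a whole-integer comparator for `qsort()`. `cmp_u64()` and `sort_keys()` are illustrative names, and the flat array of 64-bit keys is a simplification of whatever per-DIMM data the real sort operates on.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Compare two 64-bit keys as whole integers rather than byte-by-byte. */
int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;

    return (x > y) - (x < y);
}

/* Sort keys in ascending numeric order. */
void sort_keys(uint64_t *keys, size_t n)
{
    assert(keys != NULL || n == 0);
    if (n > 1)
        qsort(keys, n, sizeof(*keys), cmp_u64);
}
```

On a little-endian machine a `memcmp()`-based comparator would place 0x0100 before 0x00ff, because it compares the least significant byte first; the comparator above orders them numerically regardless of byte order.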
      
      Fix the implementation to accept correct cookie values in addition to
      the Linux "memcmp" order cookies, but only allow correct cookies to be
      generated going forward. It does mean that namespaces created by
      third-party-tooling, or created by newer kernels with this fix, will not
      validate on older kernels. However, there are a couple mitigating
      conditions:
      
          1/ platforms with namespace-label capable NVDIMMs are not widely
             available.
      
          2/ interleave-sets with a single-dimm are by definition not affected
             (nothing to sort). This covers the QEMU-KVM NVDIMM emulation case.
      
      The cookie stored in the namespace label will be fixed by any write to the
      namespace label; the most straightforward way to achieve this is to
      write to the "alt_name" attribute of a namespace in sysfs.
      
      Cc: <stable@vger.kernel.org>
      Fixes: eaf96153 ("libnvdimm, nfit: add interleave-set state-tracking infrastructure")
      Reported-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
      Tested-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  29. 19 Oct, 2016 2 commits
    • libnvdimm: allow a platform to force enable label support · 42237e39
      Dan Williams authored
      
      
      Platforms like QEMU-KVM implement an NFIT table and label DSMs.
      However, since that environment does not define an aliased
      configuration, the labels are currently ignored and the kernel registers
      a single full-sized pmem-namespace per region. Now that the kernel
      supports sub-divisions of pmem regions the labels have a purpose.
      Arrange for the labels to be honored when we find an existing / valid
      namespace index block.
      
      Cc: <qemu-devel@nongnu.org>
      Cc: Haozhong Zhang <haozhong.zhang@intel.com>
      Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm: use generic iostat interfaces · 8d7c22ac
      Toshi Kani authored
      
      
      nd_iostat_start() and nd_iostat_end() implement the same functionality
      that generic_start_io_acct() and generic_end_io_acct() already provide.
      
      Change nd_iostat_start() and nd_iostat_end() to call the generic iostat
      interfaces.  There is no change in the nd interfaces.
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  30. 01 Oct, 2016 2 commits
  31. 24 Sep, 2016 1 commit
  32. 02 Sep, 2016 1 commit
  33. 08 Aug, 2016 1 commit
  34. 11 Jul, 2016 1 commit
    • libnvdimm: cycle flush hints · 0c27af60
      Dan Williams authored
      
      
      When the NFIT provides multiple flush hint addresses per-dimm it is
      expressing that the platform is capable of processing multiple flush
      requests in parallel.  There is some fixed cost per flush request; let
      the cost be shared in parallel on multiple cpus.
      
      Since there may not be enough flush hint addresses for each cpu to have
      one, keep a per-cpu index of the last used hint, hash it with current
      pid, and assume that access pattern and scheduler randomness will keep
      the flush-hint usage somewhat staggered across cpus.
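
A sketch of the staggering scheme, assuming the number of flush hints is a power of two (`1 << hint_shift`). The struct, the function names, and the multiplicative hash constant are all assumptions made for illustration; real per-cpu storage is replaced by a plain struct passed by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-cpu state: index of the last flush hint used. */
struct demo_cpu_state {
    uint32_t last_idx;
};

/* Simple multiplicative hash returning the top `bits` bits. */
uint32_t demo_hash32(uint32_t val, unsigned int bits)
{
    assert(bits > 0 && bits <= 32);
    return (val * 0x61C88647u) >> (32 - bits);
}

/* Advance this cpu's index by a pid-dependent step, then map it onto
 * one of the (1 << hint_shift) available flush hints. */
uint32_t demo_pick_hint(struct demo_cpu_state *cpu, uint32_t pid,
                        unsigned int hint_shift)
{
    assert(cpu != NULL && hint_shift < 32);
    cpu->last_idx += demo_hash32(pid + cpu->last_idx, 8);
    return cpu->last_idx & ((1u << hint_shift) - 1);
}
```

With `hint_shift` of 0 (a single hint) every caller gets index 0; with more hints, different pids and the running per-cpu index spread requests across them without any cross-cpu coordination.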
      
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>