1. 24 Sep, 2019 2 commits
    • Aneesh Kumar K.V's avatar
      libnvdimm/region: Enable MAP_SYNC for volatile regions · 4c806b89
      Aneesh Kumar K.V authored
      Some environments want to use a host tmpfs/ramdisk to back guest pmem.
      While the data is not persisted relative to the host it *is* persisted
      relative to guest crashes / reboots. The guest is free to use dax and
      MAP_SYNC to keep filesystem metadata consistent with dax accesses
      without requiring guest fsync(). The guest can also observe that the
      region is volatile and skip cache flushing as global visibility is
      enough to "persist" data relative to the host staying alive over guest
      reset events.
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: default avatarPankaj Gupta <pagupta@redhat.com>
      Link: https://lore.kernel.org/r/20190924114327.14700-1-aneesh.kumar@linux.ibm.com
      [djbw: reword the changelog]
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
    • Aneesh Kumar K.V's avatar
      libnvdimm/region: Initialize bad block for volatile namespaces · c42adf87
      Aneesh Kumar K.V authored
      We do check for a bad block during namespace init and that use
      region bad block list. We need to initialize the bad block
      for volatile regions for this to work. We also observe a lockdep
      warning as below because the lock is not initialized correctly
      since we skip bad block init for volatile regions.
       INFO: trying to register non-static key.
       the code is fine but needs lockdep annotation.
       turning off the locking correctness validator.
       CPU: 2 PID: 1 Comm: swapper/0 Not tainted 5.3.0-rc1-15699-g3dee241c937e #149
       Call Trace:
       [c0000000f95cb250] [c00000000147dd84] dump_stack+0xe8/0x164 (unreliable)
       [c0000000f95cb2a0] [c00000000022ccd8] register_lock_class+0x308/0xa60
       [c0000000f95cb3a0] [c000000000229cc0] __lock_acquire+0x170/0x1ff0
       [c0000000f95cb4c0] [c00000000022c740] lock_acquire+0x220/0x270
       [c0000000f95cb580] [c000000000a93230] badblocks_check+0xc0/0x290
       [c0000000f95cb5f0] [c000000000d97540] nd_pfn_validate+0x5c0/0x7f0
       [c0000000f95cb6d0] [c000000000d98300] nd_dax_probe+0xd0/0x1f0
       [c0000000f95cb760] [c000000000d9b66c] nd_pmem_probe+0x10c/0x160
       [c0000000f95cb790] [c000000000d7f5ec] nvdimm_bus_probe+0x10c/0x240
       [c0000000f95cb820] [c000000000d0f844] really_probe+0x254/0x4e0
       [c0000000f95cb8b0] [c000000000d0fdfc] driver_probe_device+0x16c/0x1e0
       [c0000000f95cb930] [c000000000d10238] device_driver_attach+0x68/0xa0
       [c0000000f95cb970] [c000000000d1040c] __driver_attach+0x19c/0x1c0
       [c0000000f95cb9f0] [c000000000d0c4c4] bus_for_each_dev+0x94/0x130
       [c0000000f95cba50] [c000000000d0f014] driver_attach+0x34/0x50
       [c0000000f95cba70] [c000000000d0e208] bus_add_driver+0x178/0x2f0
       [c0000000f95cbb00] [c000000000d117c8] driver_register+0x108/0x170
       [c0000000f95cbb70] [c000000000d7edb0] __nd_driver_register+0xe0/0x100
       [c0000000f95cbbd0] [c000000001a6baa4] nd_pmem_driver_init+0x34/0x48
       [c0000000f95cbbf0] [c0000000000106f4] do_one_initcall+0x1d4/0x4b0
       [c0000000f95cbcd0] [c0000000019f499c] kernel_init_freeable+0x544/0x65c
       [c0000000f95cbdb0] [c000000000010d6c] kernel_init+0x2c/0x180
       [c0000000f95cbe20] [c00000000000b954] ret_from_kernel_thread+0x5c/0x68
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Link: https://lore.kernel.org/r/20190919083355.26340-1-aneesh.kumar@linux.ibm.com
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
  2. 05 Sep, 2019 2 commits
  3. 29 Aug, 2019 1 commit
  4. 18 Jul, 2019 2 commits
    • Dan Williams's avatar
      driver-core, libnvdimm: Let device subsystems add local lockdep coverage · 87a30e1f
      Dan Williams authored
      For good reason, the standard device_lock() is marked
      lockdep_set_novalidate_class() because there is simply no sane way to
      describe the myriad ways the device_lock() ordered with other locks.
      However, that leaves subsystems that know their own local device_lock()
      ordering rules to find lock ordering mistakes manually. Instead,
      introduce an optional / additional lockdep-enabled lock that a subsystem
      can acquire in all the same paths that the device_lock() is acquired.
      A conversion of the NFIT driver and NVDIMM subsystem to a
      lockdep-validate device_lock() scheme is included. The
      debug_nvdimm_lock() implementation implements the correct lock-class and
      stacking order for the libnvdimm device topology hierarchy.
      Yes, this is a hack, but hopefully it is a useful hack for other
      subsystems device_lock() debug sessions. Quoting Greg:
          "Yeah, it feels a bit hacky but it's really up to a subsystem to mess up
           using it as much as anything else, so user beware :)
           I don't object to it if it makes things easier for you to debug."
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Acked-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: default avatarIra Weiny <ira.weiny@intel.com>
      Link: https://lore.kernel.org/r/156341210661.292348.7014034644265455704.stgit@dwillia2-desk3.amr.corp.intel.com
    • Dan Williams's avatar
      libnvdimm/bus: Fix wait_nvdimm_bus_probe_idle() ABBA deadlock · ca6bf264
      Dan Williams authored
      A multithreaded namespace creation/destruction stress test currently
      deadlocks with the following lockup signature:
          INFO: task ndctl:2924 blocked for more than 122 seconds.
                Tainted: G           OE     5.2.0-rc4+ #3382
          "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
          ndctl           D    0  2924   1176 0x00000000
          Call Trace:
           ? __schedule+0x27e/0x780
           wait_nvdimm_bus_probe_idle+0x8a/0xd0 [libnvdimm]
           ? finish_wait+0x80/0x80
           uuid_store+0xe6/0x2e0 [libnvdimm]
           INFO: task ndctl:2923 blocked for more than 122 seconds.
                 Tainted: G           OE     5.2.0-rc4+ #3382
           "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
           ndctl           D    0  2923   1175 0x00000000
           Call Trace:
            ? __schedule+0x27e/0x780
            ? __mutex_lock+0x489/0x910
            ? nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm]
            ? __lock_acquire+0x23f/0x1710
            ? nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm]
            nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm]
            __dax_pmem_probe+0x5e/0x210 [dax_pmem_core]
            ? nvdimm_bus_probe+0x1d0/0x2c0 [libnvdimm]
            dax_pmem_probe+0xc/0x20 [dax_pmem]
            nvdimm_bus_probe+0x90/0x2c0 [libnvdimm]
      In this sequence an 'nd_dax' device is being probed and trying to take
      the lock on its backing namespace to validate that the 'nd_dax' device
      indeed has exclusive access to the backing namespace. Meanwhile, another
      thread is trying to update the uuid property of that same backing
      namespace. So one thread is in the probe path trying to acquire the
      lock, and the other thread has acquired the lock and tries to flush the
      probe path.
      Fix this deadlock by not holding the namespace device_lock over the
      wait_nvdimm_bus_probe_idle() synchronization step. In turn this requires
      the device_lock to be held on entry to wait_nvdimm_bus_probe_idle() and
      subsequently dropped internally to wait_nvdimm_bus_probe_idle().
      Cc: <stable@vger.kernel.org>
      Fixes: bf9bccc1
       ("libnvdimm: pmem label sets and namespace instantiation")
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Tested-by: default avatarJane Chu <jane.chu@oracle.com>
      Link: https://lore.kernel.org/r/156341210094.292348.2384694131126767789.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
  5. 05 Jul, 2019 2 commits
  6. 05 Jun, 2019 1 commit
  7. 03 Feb, 2019 1 commit
  8. 07 Jan, 2019 1 commit
    • Dan Williams's avatar
      acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node · 8fc5c735
      Dan Williams authored
      Persistent memory, as described by the ACPI NFIT (NVDIMM Firmware
      Interface Table), is the first known instance of a memory range
      described by a unique "target" proximity domain. Where "initiator" and
      "target" proximity domains is an approach that the ACPI HMAT
      (Heterogeneous Memory Attributes Table) uses to described the unique
      performance properties of a memory range relative to a given initiator
      (e.g. CPU or DMA device).
      Currently the numa-node for a /dev/pmemX block-device or /dev/daxX.Y
      char-device follows the traditional notion of 'numa-node' where the
      attribute conveys the closest online numa-node. That numa-node attribute
      is useful for cpu-binding and memory-binding processes *near* the
      device. However, when the memory range backing a 'pmem', or 'dax' device
      is onlined (memory hot-add) the memory-only-numa-node representing that
      address needs to be differentiated from the set of online nodes. In
      other words, the numa-node association of the device depends on whether
      you can bind processes *near* the cpu-numa-node in the offline
      device-case, or bind process *on* the memory-range directly after the
      backing address range is onlined.
      Allow for the case that platform firmware describes persistent memory
      with a unique proximity domain, i.e. when it is distinct from the
      proximity of DRAM and CPUs that are on the same socket. Plumb the Linux
      numa-node translation of that proximity through the libnvdimm region
      device to namespaces that are in device-dax mode. With this in place the
      proposed kmem driver [1] can optionally discover a unique numa-node
      number for the address range as it transitions the memory from an
      offline state managed by a device-driver to an online memory range
      managed by the core-mm.
      [1]: https://lore.kernel.org/lkml/20181022201317.8558C1D8@viggo.jf.intel.com
      Reported-by: default avatarFan Du <fan.du@intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Oliver O'Halloran" <oohall@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: default avatarYang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
  9. 21 Dec, 2018 1 commit
    • Dave Jiang's avatar
      acpi/nfit, libnvdimm/security: Add security DSM overwrite support · 7d988097
      Dave Jiang authored
      Add support for the NVDIMM_FAMILY_INTEL "ovewrite" capability as
      described by the Intel DSM spec v1.7. This will allow triggering of
      overwrite on Intel NVDIMMs. The overwrite operation can take tens of
      minutes. When the overwrite DSM is issued successfully, the NVDIMMs will
      be unaccessible. The kernel will do backoff polling to detect when the
      overwrite process is completed. According to the DSM spec v1.7, the 128G
      NVDIMMs can take up to 15mins to perform overwrite and larger DIMMs will
      take longer.
      Given that overwrite puts the DIMM in an indeterminate state until it
      completes introduce the NDD_SECURITY_OVERWRITE flag to prevent other
      operations from executing when overwrite is happening. The
      NDD_WORK_PENDING flag is added to denote that there is a device reference
      on the nvdimm device for an async workqueue thread context.
      Signed-off-by: default avatarDave Jiang <dave.jiang@intel.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
  10. 05 Dec, 2018 1 commit
  11. 28 Sep, 2018 1 commit
    • Dan Williams's avatar
      libnvdimm, region: Fail badblocks listing for inactive regions · 5d394eee
      Dan Williams authored
      While experimenting with region driver loading the following backtrace
      was triggered:
       INFO: trying to register non-static key.
       the code is fine but needs lockdep annotation.
       turning off the locking correctness validator.
       Call Trace:
        ? __lock_acquire+0x2ba/0x1310
        ? kernfs_seq_start+0x2a/0x80
        ? dev_attr_show+0x1c/0x50
        ? __lock_acquire+0x2ba/0x1310
        ? kernfs_seq_start+0x2a/0x80
        ? lock_acquire+0x9e/0x1a0
        ? dev_attr_show+0x1c/0x50
        ? dev_attr_show+0x1c/0x50
      This results from a missing successful call to devm_init_badblocks()
      from nd_region_probe(). Block attempts to show badblocks while the
      region is not enabled.
      Fixes: 6a6bef90
       ("libnvdimm: add mechanism to publish badblocks...")
      Cc: <stable@vger.kernel.org>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarDave Jiang <dave.jiang@intel.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
  12. 25 Jul, 2018 2 commits
  13. 06 Jun, 2018 1 commit
    • Ross Zwisler's avatar
      libnvdimm, pmem: Do not flush power-fail protected CPU caches · 546eb031
      Ross Zwisler authored
      This commit:
      5fdf8e5b ("libnvdimm: re-enable deep flush for pmem devices via fsync()")
      intended to make sure that deep flush was always available even on
      platforms which support a power-fail protected CPU cache.  An unintended
      side effect of this change was that we also lost the ability to skip
      flushing CPU caches on those power-fail protected CPU cache.
      Fix this by skipping the low level cache flushing in dax_flush() if we have
      CPU caches which are power-fail protected.  The user can still override this
      behavior by manually setting the write_cache state of a namespace.  See
      libndctl's ndctl_namespace_write_cache_is_enabled(),
      ndctl_namespace_enable_write_cache() and
      ndctl_namespace_disable_write_cache() functions.
      Cc: <stable@vger.kernel.org>
      Fixes: 5fdf8e5b
       ("libnvdimm: re-enable deep flush for pmem devices via fsync()")
      Signed-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
  14. 07 Apr, 2018 1 commit
  15. 03 Apr, 2018 1 commit
  16. 21 Mar, 2018 2 commits
    • Dan Williams's avatar
      libnvdimm, nfit: fix persistence domain reporting · fe9a552e
      Dan Williams authored
      The persistence domain is a point in the platform where once writes
      reach that destination the platform claims it will make them persistent
      relative to power loss. In the ACPI NFIT this is currently communicated
      as 2 bits in the "NFIT - Platform Capabilities Structure". The bits
      comprise a hierarchy, i.e. bit0 "CPU Cache Flush to NVDIMM Durability on
      Power Loss Capable" implies bit1 "Memory Controller Flush to NVDIMM
      Durability on Power Loss Capable".
      Commit 96c3a239 "libnvdimm: expose platform persistence attr..."
      shows the persistence domain as flags, but it's really an enumerated
      Fix this newly introduced user ABI to show the closest available
      persistence domain before userspace develops dependencies on seeing, or
      needing to develop code to tolerate, the raw NFIT flags communicated
      through the libnvdimm-generic region attribute.
      Fixes: 96c3a239
       ("libnvdimm: expose platform persistence attr...")
      Reviewed-by: default avatarDave Jiang <dave.jiang@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
    • Dan Williams's avatar
      libnvdimm, region: hide persistence_domain when unknown · 896196dc
      Dan Williams authored
      Similar to other region attributes, do not emit the persistence_domain
      attribute if its contents are empty.
      Fixes: 96c3a239
       ("libnvdimm: expose platform persistence attr...")
      Cc: Dave Jiang <dave.jiang@intel.com>
      Reviewed-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
  17. 01 Feb, 2018 1 commit
  18. 28 Sep, 2017 1 commit
  19. 05 Aug, 2017 1 commit
  20. 29 Jun, 2017 2 commits
  21. 27 Jun, 2017 3 commits
    • Dan Williams's avatar
      libnvdimm, nfit: enable support for volatile ranges · c9e582aa
      Dan Williams authored
      Allow volatile nfit ranges to participate in all the same infrastructure
      provided for persistent memory regions. A resulting resulting namespace
      device will still be called "pmem", but the parent region type will be
      "nd_volatile". This is in preparation for disabling the dax ->flush()
      operation in the pmem driver when it is hosted on a volatile range.
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
    • Dan Williams's avatar
      libnvdimm, pmem: fix persistence warning · c00b396e
      Dan Williams authored
      The pmem driver assumes if platform firmware describes the memory
      devices associated with a persistent memory range and
      CONFIG_ARCH_HAS_PMEM_API=y that it has all the mechanism necessary to
      flush data to a power-fail safe zone. We warn if the firmware does not
      describe memory devices, but we also need to warn if the architecture
      does not claim pmem support.
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
    • Dan Williams's avatar
      x86, libnvdimm, pmem: remove global pmem api · ca6a4657
      Dan Williams authored
      Now that all callers of the pmem api have been converted to dax helpers that
      call back to the pmem driver, we can remove include/linux/pmem.h and
      Cc: <x86@kernel.org>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Oliver O'Halloran <oohall@gmail.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
  22. 15 Jun, 2017 1 commit
  23. 09 Jun, 2017 1 commit
    • Dan Williams's avatar
      x86, uaccess: introduce copy_from_iter_flushcache for pmem / cache-bypass operations · 0aed55af
      Dan Williams authored
      The pmem driver has a need to transfer data with a persistent memory
      destination and be able to rely on the fact that the destination writes are not
      cached. It is sufficient for the writes to be flushed to a cpu-store-buffer
      (non-temporal / "movnt" in x86 terms), as we expect userspace to call fsync()
      to ensure data-writes have reached a power-fail-safe zone in the platform. The
      fsync() triggers a REQ_FUA or REQ_FLUSH to the pmem driver which will turn
      around and fence previous writes with an "sfence".
      Implement a __copy_from_user_inatomic_flushcache, memcpy_page_flushcache, and
      memcpy_flushcache, that guarantee that the destination buffer is not dirty in
      the cpu cache on completion. The new copy_from_iter_flushcache and sub-routines
      will be used to replace the "pmem api" (include/linux/pmem.h +
      arch/x86/include/asm/pmem.h). The availability of copy_from_iter_flushcache()
      and memcpy_flushcache() are gated by the CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
      config symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
      This is meant to satisfy the concern from Linus that if a driver wants to do
      something beyond the normal nocache semantics it should be something private to
      that driver [1], and Al's concern that anything uaccess related belongs with
      the rest of the uaccess code [2].
      The first consumer of this interface is a new 'copy_from_iter' dax operation so
      that pmem can inject cache maintenance operations without imposing this
      overhead on other dax-capable drivers.
      [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
      [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
      Cc: <x86@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Reviewed-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
  24. 04 May, 2017 1 commit
  25. 29 Apr, 2017 1 commit
    • Dan Williams's avatar
      libnvdimm: rework region badblocks clearing · 23f49844
      Dan Williams authored
      Toshi noticed that the new support for a region-level badblocks missed
      the case where errors are cleared due to BTT I/O.
      An initial attempt to fix this ran into a "sleeping while atomic"
      warning due to taking the nvdimm_bus_lock() in the BTT I/O path to
      satisfy the locking requirements of __nvdimm_bus_badblocks_clear().
      However, that lock is not needed since we are not acting on any data that
      is subject to change under that lock. The badblocks instance has its own
      internal lock to handle mutations of the error list.
      So, in order to make it clear that we are just acting on region devices,
      rename __nvdimm_bus_badblocks_clear() to nvdimm_clear_badblocks_regions().
      Eliminate the lock and consolidate all support routines for the new
      nvdimm_account_cleared_poison() in drivers/nvdimm/bus.c. Finally, to the
      opportunity to cleanup to some unnecessary casts, make the calling
      convention of nvdimm_clear_badblocks_regions() clearer by replacing struct
      resource with the minimal struct clear_badblocks_context, and use the
      DEVICE_ATTR macro.
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Reported-by: default avatarToshi Kani <toshi.kani@hpe.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
  26. 28 Apr, 2017 1 commit
    • Dan Williams's avatar
      libnvdimm, region: sysfs trigger for nvdimm_flush() · ab630891
      Dan Williams authored
      The nvdimm_flush() mechanism helps to reduce the impact of an ADR
      (asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
      platform WPQ (write-pending-queue) buffers when power is removed. The
      nvdimm_flush() mechanism performs that same function on-demand.
      When a pmem namespace is associated with a block device, an
      nvdimm_flush() is triggered with every block-layer REQ_FUA, or REQ_FLUSH
      request. These requests are typically associated with filesystem
      metadata updates. However, when a namespace is in device-dax mode,
      userspace (think database metadata) needs another path to perform the
      same flushing. In other words this is not required to make data
      persistent, but in the case of metadata it allows for a smaller failure
      domain in the unlikely event of an ADR failure.
      The new 'deep_flush' attribute is visible when the individual DIMMs
      backing a given interleave-set are described by platform firmware. In
      ACPI terms this is "NVDIMM Region Mapping Structures" and associated
      "Flush Hint Address Structures". Reads return "1" if the region supports
      triggering WPQ flushes on all DIMMs. Reads return "0" the flush
      operation is a platform nop, and in that case the attribute is
      Why sysfs and not an ioctl? An ioctl requires establishing a new
      ioctl function number space for device-dax. Given that this would be
      called on a device-dax fd an application could be forgiven for
      accidentally calling this on a filesystem-dax fd. Placing this interface
      in libnvdimm sysfs removes that potential for collision with a
      filesystem ioctl, and it keeps ioctls out of the generic device-dax
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
  27. 24 Apr, 2017 1 commit
    • Dan Williams's avatar
      libnvdimm, region: fix flush hint detection crash · bc042fdf
      Dan Williams authored
      In the case where a dimm does not have any associated flush hints the
      ndrd->flush_wpq array may be uninitialized leading to crashes with the
      following signature:
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
       IP: region_visible+0x10f/0x160 [libnvdimm]
       Call Trace:
        nd_async_device_register+0x12/0x40 [libnvdimm]
        ? process_one_work+0x197/0x6c0
        ? process_one_work+0x6c0/0x6c0
        ? kthread_create_on_node+0x60/0x60
      Cc: <stable@vger.kernel.org>
      Reviewed-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Fixes: f284a4f2
       ("libnvdimm: introduce nvdimm_flush() and nvdimm_has_flush()")
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
  28. 13 Apr, 2017 2 commits
  29. 01 Mar, 2017 1 commit
    • Dan Williams's avatar
      nfit, libnvdimm: fix interleave set cookie calculation · 86ef58a4
      Dan Williams authored
      The interleave-set cookie is a sum that sanity checks the composition of
      an interleave set has not changed from when the namespace was initially
      created.  The checksum is calculated by sorting the DIMMs by their
      location in the interleave-set. The comparison for the sort must be
      64-bit wide, not byte-by-byte as performed by memcmp() in the broken
      Fix the implementation to accept correct cookie values in addition to
      the Linux "memcmp" order cookies, but only allow correct cookies to be
      generated going forward. It does mean that namespaces created by
      third-party-tooling, or created by newer kernels with this fix, will not
      validate on older kernels. However, there are a couple mitigating
          1/ platforms with namespace-label capable NVDIMMs are not widely
          2/ interleave-sets with a single-dimm are by definition not affected
             (nothing to sort). This covers the QEMU-KVM NVDIMM emulation case.
      The cookie stored in the namespace label will be fixed by any write the
      namespace label, the most straightforward way to achieve this is to
      write to the "alt_name" attribute of a namespace in sysfs.
      Cc: <stable@vger.kernel.org>
      Fixes: eaf96153
       ("libnvdimm, nfit: add interleave-set state-tracking infrastructure")
      Reported-by: default avatarNicholas Moulin <nicholas.w.moulin@linux.intel.com>
      Tested-by: default avatarNicholas Moulin <nicholas.w.moulin@linux.intel.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
  30. 16 Dec, 2016 1 commit