1. 31 Jan, 2018 1 commit
    • blk-mq: introduce BLK_STS_DEV_RESOURCE · 86ff7c2a
      Ming Lei authored
      This status is returned from a driver to the block layer if a
      device-related resource is unavailable, but the driver can guarantee
      that IO dispatch will be triggered in the future when the resource
      becomes available.
      Convert some drivers to return BLK_STS_DEV_RESOURCE.  Also, if a driver
      returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun the queue after
      a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls.  BLK_MQ_DELAY_QUEUE is
      3 ms because both scsi-mq and nvmefc are using that magic value.
      If a driver can make sure there is in-flight IO, it is safe to return
      BLK_STS_DEV_RESOURCE because:
      1) If all in-flight IOs complete before examining SCHED_RESTART in
      blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue
      is run immediately in this case by blk_mq_dispatch_rq_list();
      2) if there is any in-flight IO after/when examining SCHED_RESTART
      in blk_mq_dispatch_rq_list():
      - if SCHED_RESTART isn't set, queue is run immediately as handled in 1)
      - otherwise, this request will be dispatched after any in-flight IO is
        completed via blk_mq_sched_restart()
      3) if SCHED_RESTART is set concurrently in another context because of
      BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two
      cases and make sure an IO hang is avoided.
      One invariant is that queue will be rerun if SCHED_RESTART is set.
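
      The decision described above can be sketched as a small userspace
      model. This is not the kernel code: the enum names, the function
      model_after_dispatch() and the MODEL_DELAY_QUEUE_MS constant are
      hypothetical stand-ins, and the model ignores other conditions (such
      as tag starvation) that the real dispatch path may also check.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the status values named in the message. */
enum model_status {
    MODEL_STS_OK,
    MODEL_STS_RESOURCE,     /* generic shortage, nobody promises a rerun */
    MODEL_STS_DEV_RESOURCE  /* device shortage, in-flight IO will rerun us */
};

/* What the dispatcher does after the driver has returned. */
enum model_action {
    MODEL_NONE,             /* request accepted, nothing left to do */
    MODEL_RUN_NOW,          /* rerun the hardware queue immediately */
    MODEL_RUN_DELAYED,      /* rerun after MODEL_DELAY_QUEUE_MS */
    MODEL_WAIT_FOR_RESTART  /* a completion restarts the queue */
};

/* Assumed to mirror the 3 ms value quoted in the message. */
#define MODEL_DELAY_QUEUE_MS 3

/*
 * Simplified decision: if the restart flag is not set nobody else will
 * rerun the queue, so run it now. If it is set, only the generic
 * resource status needs the delayed rerun as a safety net.
 */
enum model_action model_after_dispatch(enum model_status ret,
                                       bool sched_restart_set)
{
    if (ret == MODEL_STS_OK)
        return MODEL_NONE;
    if (!sched_restart_set)
        return MODEL_RUN_NOW;
    if (ret == MODEL_STS_RESOURCE)
        return MODEL_RUN_DELAYED;
    return MODEL_WAIT_FOR_RESTART;
}
```

      In this model the delayed rerun is only needed for the generic
      status, because only there is no completion guaranteed to restart
      the queue.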
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Tested-by: Laurence Oberman <loberman@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 29 Jan, 2018 2 commits
  3. 17 Jan, 2018 1 commit
    • blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback · 396eaf21
      Ming Lei authored
      blk_insert_cloned_request() is called in the fast path of a dm-rq driver
      (e.g. blk-mq request-based DM mpath).  blk_insert_cloned_request() uses
      blk_mq_request_bypass_insert() to directly append the request to the
      blk-mq hctx->dispatch_list of the underlying queue.
      1) This way isn't efficient enough because the hctx spinlock is always
      2) With blk_insert_cloned_request(), we completely bypass underlying
      queue's elevator and depend on the upper-level dm-rq driver's elevator
      to schedule IO.  But dm-rq currently can't get the underlying queue's
      dispatch feedback at all.  Without knowing whether a request was issued
      or not (e.g. due to underlying queue being busy) the dm-rq elevator will
      not be able to provide effective IO merging (as a side-effect of dm-rq
      currently blindly destaging a request from its elevator only to requeue
      it after a delay, which kills any opportunity for merging).  This
      obviously causes very bad sequential IO performance.
      Fix this by updating blk_insert_cloned_request() to use
      blk_mq_request_direct_issue().  blk_mq_request_direct_issue() allows a
      request to be issued directly to the underlying queue and returns the
      dispatch feedback (blk_status_t).  If blk_mq_request_direct_issue()
      returns BLK_STS_RESOURCE the dm-rq driver will now use DM_MAPIO_REQUEUE
      to _not_ destage the request, thereby preserving the opportunity to
      merge IO.
      With this, request-based DM's blk-mq sequential IO performance is vastly
      improved (as much as 3X in mpath/virtio-scsi testing).
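
      The feedback loop can be illustrated with a toy userspace model.
      Every name here (struct lower_queue, direct_issue(),
      map_cloned_request(), the ISSUE_* and MAP_* values) is hypothetical
      and only mimics the roles of the underlying queue, the direct-issue
      call and the dm-rq mapping result described above.

```c
/* Hypothetical model of an underlying queue with limited capacity. */
struct lower_queue {
    int free_slots;  /* how many more requests the device accepts */
    int issued;      /* requests handed to the device so far */
};

enum issue_status { ISSUE_OK, ISSUE_BUSY };
enum map_result { MAP_REMAPPED, MAP_REQUEUE };

/* Try to hand one request straight to the device and report the result. */
static enum issue_status direct_issue(struct lower_queue *q)
{
    if (q->free_slots == 0)
        return ISSUE_BUSY;
    q->free_slots--;
    q->issued++;
    return ISSUE_OK;
}

/*
 * Upper-level mapping step: when the lower queue is busy, report a
 * requeue so the request stays in the upper elevator, where later
 * sequential IO can still be merged into it.
 */
enum map_result map_cloned_request(struct lower_queue *q)
{
    return direct_issue(q) == ISSUE_BUSY ? MAP_REQUEUE : MAP_REMAPPED;
}
```

      With a single free slot, the first call in this model reports
      MAP_REMAPPED and the second reports MAP_REQUEUE, which is the signal
      the upper layer was previously missing.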
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      [blk-mq.c changes heavily influenced by Ming Lei's initial solution, but
      they were refactored to make them less fragile and easier to read/review]
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 15 Jan, 2018 1 commit
    • dm: fix incomplete request_queue initialization · c100ec49
      Mike Snitzer authored
      DM is no longer prone to having its request_queue be improperly
      Summary of changes:
      - defer DM's blk_register_queue() from add_disk()-time until
        dm_setup_md_queue() by using add_disk_no_queue_reg() in alloc_dev().
      - dm_setup_md_queue() is updated to fully initialize DM's request_queue
        (_after_ all table loads have occurred and the request_queue's type,
        features and limits are known).
      A very welcome side-effect of these changes is DM no longer needs to:
      1) backfill the "mq" sysfs entry (because historically DM didn't
      initialize the request_queue to use blk-mq until _after_
      blk_register_queue() was called via add_disk()).
      2) call elv_register_queue() to get .request_fn request-based DM
      device's "iosched" exposed in sysfs.
      In addition, blk-mq debugfs support is now made available because
      request-based DM's blk-mq request_queue is now properly initialized
      before dm_setup_md_queue() calls blk_register_queue().
      These changes also stave off the need to introduce new DM-specific
      workarounds in block core, e.g. this proposal:
      In the end DM devices should be less unicorn in nature (relative to
      initialization and availability of block core infrastructure provided by
      the request_queue).
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Tested-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 05 Oct, 2017 1 commit
  6. 28 Aug, 2017 2 commits
  7. 18 Jun, 2017 1 commit
  8. 09 Jun, 2017 3 commits
    • block: switch bios to blk_status_t · 4e4cbee9
      Christoph Hellwig authored
      Replace bi_error with a new bi_status to allow for a clear conversion.
      Note that device mapper overloaded bi_error with a private value, which
      we'll have to keep around at least for now and thus propagate to a
      proper blk_status_t value.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: switch ->queue_rq return value to blk_status_t · fc17b653
      Christoph Hellwig authored
      Use the same values that are used for request completion errors as the
      return value from ->queue_rq.  BLK_STS_RESOURCE is special cased to cause
      a requeue, and all the others are completed as-is.
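
      Read literally, the message describes a three-way split on the
      driver's return value. The sketch below is a userspace model of that
      wording only, not the kernel's dispatch code; the enum names and
      handle_queue_rq_result() are hypothetical.

```c
/* Hypothetical stand-ins for a few blk_status_t values. */
enum rq_status { RQ_STS_OK, RQ_STS_RESOURCE, RQ_STS_IOERR };

/* What happens to the request after the driver's callback returns. */
enum rq_disposition {
    RQ_IN_FLIGHT,  /* accepted by the driver, completes later */
    RQ_REQUEUED,   /* put back so it can be dispatched again */
    RQ_ENDED       /* completed right away with an error status */
};

/*
 * Assumed mapping, following the message: success leaves the request
 * in flight, the resource status is the one special case that causes
 * a requeue, and any other status ends the request.
 */
enum rq_disposition handle_queue_rq_result(enum rq_status ret)
{
    switch (ret) {
    case RQ_STS_OK:
        return RQ_IN_FLIGHT;
    case RQ_STS_RESOURCE:
        return RQ_REQUEUED;
    default:
        return RQ_ENDED;
    }
}
```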
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: introduce new block status code type · 2a842aca
      Christoph Hellwig authored
      Currently we use normal Linux errno values in the block layer, and while
      we accept any error a few have overloaded magic meanings.  This patch
      instead introduces a new blk_status_t value that holds block layer specific
      status codes and explicitly explains their meaning.  Helpers to convert from
      and to the previous special meanings are provided for now, but I suspect
      we want to get rid of them in the long run - those drivers that have an
      errno input (e.g. networking) usually get errnos that don't know about
      the special block layer overloads, and similarly returning them to userspace
      will usually return something that strictly speaking isn't correct
      for file system operations, but that's left as an exercise for later.
      For now the set of errors is a very limited set that closely corresponds
      to the previous overloaded errno values, but there is some low hanging
      fruit to improve it.
      blk_status_t (ab)uses the sparse __bitwise annotations to allow for sparse
      typechecking, so that we can easily catch places passing the wrong values.
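
      The idea of a dedicated status type with errno conversion helpers
      can be sketched in plain C. This is an illustration, not the kernel
      definition: demo_status_t, the DEMO_* codes, the table contents and
      both helper names are hypothetical, and a one-member struct stands
      in for the sparse annotation, which an ordinary compiler does not
      check.

```c
#include <errno.h>
#include <stddef.h>

/* Wrapping the code in a struct makes mixing it up with an int an error. */
typedef struct { unsigned char code; } demo_status_t;

enum { DEMO_OK = 0, DEMO_NOSPC = 1, DEMO_IOERR = 2, DEMO_NR_CODES = 3 };

/* One row per status: the errno it maps to and what it means. */
static const struct {
    int err;
    const char *meaning;
} demo_table[DEMO_NR_CODES] = {
    [DEMO_OK]    = { 0,       "success" },
    [DEMO_NOSPC] = { -ENOSPC, "out of space" },
    [DEMO_IOERR] = { -EIO,    "generic I/O error" },
};

/* Status to errno; unknown codes degrade to -EIO. */
int demo_status_to_errno(demo_status_t s)
{
    if (s.code >= DEMO_NR_CODES)
        return -EIO;
    return demo_table[s.code].err;
}

/* Errno to status; errnos without a dedicated code become DEMO_IOERR. */
demo_status_t errno_to_demo_status(int err)
{
    demo_status_t s = { DEMO_IOERR };
    size_t i;

    for (i = 0; i < DEMO_NR_CODES; i++) {
        if (demo_table[i].err == err) {
            s.code = (unsigned char)i;
            break;
        }
    }
    return s;
}
```

      The round trip is lossy by design in this sketch: an errno with no
      row in the table comes back as -EIO, which mirrors the point above
      that arbitrary errnos do not carry block layer meaning.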
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  9. 15 May, 2017 1 commit
  10. 02 May, 2017 1 commit
  11. 01 May, 2017 2 commits
  12. 27 Apr, 2017 1 commit
  13. 24 Apr, 2017 1 commit
  14. 20 Apr, 2017 2 commits
  15. 08 Apr, 2017 1 commit
  16. 07 Apr, 2017 1 commit
    • dm rq: Avoid that request processing stalls sporadically · 6077c2d7
      Bart Van Assche authored
      While running the srp-test software I noticed that request
      processing stalls sporadically at the beginning of a test, namely
      when mkfs is run against a dm-mpath device. Every time when that
      happened the following command was sufficient to resume request
          echo run >/sys/kernel/debug/block/dm-0/state
      This patch prevents such request processing stalls. The
      test I ran is as follows:
          while srp-test/run_tests -d -r 30 -t 02-mq; do :; done
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Signed-off-by: Jens Axboe <axboe@fb.com>
  17. 31 Mar, 2017 1 commit
  18. 24 Feb, 2017 1 commit
  19. 03 Feb, 2017 1 commit
  20. 27 Jan, 2017 3 commits
  21. 08 Dec, 2016 1 commit
  22. 21 Nov, 2016 1 commit
  23. 14 Nov, 2016 1 commit
  24. 02 Nov, 2016 5 commits
    • dm: Fix a race condition related to stopping and starting queues · 7b17c2f7
      Bart Van Assche authored
      Ensure that all ongoing dm_mq_queue_rq() and dm_mq_requeue_request()
      calls have stopped before setting the "queue stopped" flag. This
      allows removing the "queue stopped" test from dm_mq_queue_rq() and
      dm_mq_requeue_request(). This patch fixes a race condition:
      dm_mq_queue_rq() is called without holding the queue lock, and hence
      BLK_MQ_S_STOPPED can be set at any time while dm_mq_queue_rq() is
      in progress. This patch prevents the following hang, which occurs
      sporadically when using dm-mq:
      INFO: task systemd-udevd:10111 blocked for more than 480 seconds.
      Call Trace:
       [<ffffffff8161f397>] schedule+0x37/0x90
       [<ffffffff816239ef>] schedule_timeout+0x27f/0x470
       [<ffffffff8161e76f>] io_schedule_timeout+0x9f/0x110
       [<ffffffff8161fb36>] bit_wait_io+0x16/0x60
       [<ffffffff8161f929>] __wait_on_bit_lock+0x49/0xa0
       [<ffffffff8114fe69>] __lock_page+0xb9/0xc0
       [<ffffffff81165d90>] truncate_inode_pages_range+0x3e0/0x760
       [<ffffffff81166120>] truncate_inode_pages+0x10/0x20
       [<ffffffff81212a20>] kill_bdev+0x30/0x40
       [<ffffffff81213d41>] __blkdev_put+0x71/0x360
       [<ffffffff81214079>] blkdev_put+0x49/0x170
       [<ffffffff812141c0>] blkdev_close+0x20/0x30
       [<ffffffff811d48e8>] __fput+0xe8/0x1f0
       [<ffffffff811d4a29>] ____fput+0x9/0x10
       [<ffffffff810842d3>] task_work_run+0x83/0xb0
       [<ffffffff8106606e>] do_exit+0x3ee/0xc40
       [<ffffffff8106694b>] do_group_exit+0x4b/0xc0
       [<ffffffff81073d9a>] get_signal+0x2ca/0x940
       [<ffffffff8101bf43>] do_signal+0x23/0x660
       [<ffffffff810022b3>] exit_to_usermode_loop+0x73/0xb0
       [<ffffffff81002cb0>] syscall_return_slowpath+0xb0/0xc0
       [<ffffffff81624e33>] entry_SYSCALL_64_fastpath+0xa6/0xa8
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • dm: Use BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED in blk-mq code · f0d33ab7
      Bart Van Assche authored
      Instead of manipulating both QUEUE_FLAG_STOPPED and BLK_MQ_S_STOPPED
      in the dm start and stop queue functions, only manipulate the latter
      flag. Change blk_queue_stopped() tests into blk_mq_queue_stopped().
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: Add a kick_requeue_list argument to blk_mq_requeue_request() · 2b053aca
      Bart Van Assche authored
      Most blk_mq_requeue_request() and blk_mq_add_to_requeue_list() calls
      are followed by kicking the requeue list. Hence add an argument to
      these two functions that allows kicking the requeue list. This was
      proposed by Christoph Hellwig.
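
      The shape of this API change can be shown with a toy userspace
      model. The struct and function names below are hypothetical and only
      mirror the pattern of folding the common follow-up call into a
      boolean argument.

```c
#include <stdbool.h>

/* Hypothetical queue that counts requeued requests and kicks. */
struct demo_queue {
    int on_requeue_list;  /* requests waiting on the requeue list */
    int kicks;            /* how often the list was kicked */
};

/* Schedule processing of the requeue list (modelled as a counter). */
static void demo_kick_requeue_list(struct demo_queue *q)
{
    q->kicks++;
}

/*
 * Requeue one request. Callers that previously issued a separate kick
 * afterwards can pass true; callers that batch several requeues pass
 * false and kick once at the end.
 */
void demo_requeue_request(struct demo_queue *q, bool kick_requeue_list)
{
    q->on_requeue_list++;
    if (kick_requeue_list)
        demo_kick_requeue_list(q);
}
```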
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: Remove blk_mq_cancel_requeue_work() · 9b7dd572
      Bart Van Assche authored
      Since blk_mq_requeue_work() no longer restarts stopped queues,
      canceling requeue work is no longer needed to prevent a
      stopped queue from being restarted. Hence remove this function.
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: Avoid that requeueing starts stopped queues · 52d7f1b5
      Bart Van Assche authored
      Since blk_mq_requeue_work() starts stopped queues and since
      execution of this function can be scheduled after a queue has
      been stopped it is not possible to stop queues without using
      an additional state variable to track whether or not the queue
      has been stopped. Hence modify blk_mq_requeue_work() such that it
      does not start stopped queues. My conclusion after a review of
      the blk_mq_stop_hw_queues() and blk_mq_{delay_,}kick_requeue_list()
      callers is as follows:
      * In the dm driver starting and stopping queues should only happen
        if __dm_suspend() or __dm_resume() is called and not if the
        requeue list is processed.
      * In the SCSI core queue stopping and starting should only be
        performed by the scsi_internal_device_block() and
        scsi_internal_device_unblock() functions but not by any other
        function. Although the blk_mq_stop_hw_queue() call in
        scsi_queue_rq() may help to reduce CPU load if a LLD queue is
        full, figuring out whether or not a queue should be restarted
        when requeueing a command would require introducing additional
        locking in scsi_mq_requeue_cmd() to avoid a race with
        scsi_internal_device_block(). Avoid this complexity by removing
        the blk_mq_stop_hw_queue() call from scsi_queue_rq().
      * In the NVMe core only the functions that call
        blk_mq_start_stopped_hw_queues() explicitly should start stopped
      * A blk_mq_start_stopped_hw_queues() call must be added in the
        xen-blkfront driver in its blkif_recover() function.
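
      The behavioural change can be modelled in userspace as follows. The
      names (struct demo_hctx, demo_requeue_work(),
      demo_start_stopped_hw_queue()) are hypothetical, and the model keeps
      only one property from the message: requeue processing runs queues
      but never clears the stopped state.

```c
#include <stdbool.h>

/* Hypothetical hardware queue context. */
struct demo_hctx {
    bool stopped;    /* set by whoever stopped the queue */
    int pending;     /* requests waiting to be dispatched */
    int dispatched;  /* requests handed to the driver */
};

/* Running a queue is a no-op while it is stopped. */
static void demo_run_hw_queue(struct demo_hctx *h)
{
    if (h->stopped)
        return;
    h->dispatched += h->pending;
    h->pending = 0;
}

/*
 * Requeue processing: add the requeued requests and run each queue,
 * leaving the stopped flag alone so a stopped queue stays stopped.
 */
void demo_requeue_work(struct demo_hctx *hctxs, int nr, const int *requeued)
{
    int i;

    for (i = 0; i < nr; i++) {
        hctxs[i].pending += requeued[i];
        demo_run_hw_queue(&hctxs[i]);
    }
}

/* Only an explicit start clears the flag and drains the backlog. */
void demo_start_stopped_hw_queue(struct demo_hctx *h)
{
    if (!h->stopped)
        return;
    h->stopped = false;
    demo_run_hw_queue(h);
}
```

      In this model the owner of the stopped state (for example a suspend
      and resume pair) is the only party that can make a stopped queue
      dispatch again, so no extra state variable is needed.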
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Roger Pau Monné <roger.pau@citrix.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: James Bottomley <jejb@linux.vnet.ibm.com>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  25. 28 Oct, 2016 1 commit
  26. 18 Oct, 2016 1 commit
  27. 11 Oct, 2016 1 commit
    • kthread: kthread worker API cleanup · 3989144f
      Petr Mladek authored
      A good practice is to prefix the names of functions with the name
      of the subsystem.
      The kthread worker API is a mix of classic kthreads and workqueues.  Each
      worker has a dedicated kthread.  It runs a generic function that processes
      queued work items.  It is implemented as part of the kthread subsystem.
      This patch renames the existing kthread worker API to use
      the corresponding name from the workqueues API prefixed by
      __init_kthread_worker()		-> __kthread_init_worker()
      init_kthread_worker()		-> kthread_init_worker()
      init_kthread_work()		-> kthread_init_work()
      insert_kthread_work()		-> kthread_insert_work()
      queue_kthread_work()		-> kthread_queue_work()
      flush_kthread_work()		-> kthread_flush_work()
      flush_kthread_worker()		-> kthread_flush_worker()
      Note that the names of DEFINE_KTHREAD_WORK*() macros stay
      as they are. It is common that the "DEFINE_" prefix has
      precedence over the subsystem names.
      Note that INIT() macros and init() functions use a different
      naming scheme. There is no good solution. There are several
      reasons for this solution:
        + "init" in the function names stands for the verb "initialize"
          aka "initialize worker". While "INIT" in the macro names
          stands for the noun "INITIALIZER" aka "worker initializer".
        + INIT() macros are used only in DEFINE() macros
        + init() functions are used close to the other kthread()
          functions. It looks much better if all the functions
          use the same scheme.
        + There will be also kthread_destroy_worker() that will
          be used close to kthread_cancel_work(). It is related
          to the init() function. Again it looks better if all
          functions use the same naming scheme.
        + there are several precedents for such init() function
          names, e.g. amd_iommu_init_device(), free_area_init_node(),
          jump_label_init_type(),  regmap_init_mmio_clk(),
        + It is not an argument but it was inconsistent even before.
      [arnd@arndb.de: fix linux-next merge conflict]
      Link: http://lkml.kernel.org/r/20160908135724.1311726-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/1470754545-17632-3-git-send-email-pmladek@suse.com
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  28. 21 Sep, 2016 1 commit