Commit 7ad67ca5 authored by Linus Torvalds's avatar Linus Torvalds
Browse files

Merge tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - Two NVMe pull requests:
     - ana log parse fix from Anton
     - nvme quirks support for Apple devices from Ben
     - fix missing bio completion tracing for multipath stack devices
       from Hannes and Mikhail
     - IP TOS settings for nvme rdma and tcp transports from Israel
     - rq_dma_dir cleanups from Israel
     - tracing for Get LBA Status command from Minwoo
     - Some nvme-tcp cleanups from Minwoo, Potnuri and Myself
     - Some consolidation between the fabrics transports for handling
       the CAP register
     - reset race with ns scanning fix for fabrics (move fabrics
       commands to a dedicated request queue with a different lifetime
       from the admin request queue)."
     - controller reset and namespace scan races fixes
     - nvme discovery log change uevent support
     - naming improvements from Keith
     - multiple discovery controllers reject fix from James
     - some regular cleanups from various people

 - Series fixing (and re-fixing) null_blk debug printing and nr_devices
   checks (André)

 - A few pull requests from Song, with fixes from Andy, Guoqing,
   Guilherme, Neil, Nigel, and Yufen.

 - REQ_OP_ZONE_RESET_ALL support (Chaitanya)

 - Bio merge handling unification (Christoph)

 - Pick default elevator correctly for devices with special needs
   (Damien)

 - Block stats fixes (Hou)

 - Timeout and support devices nbd fixes (Mike)

 - Series fixing races around elevator switching and device add/remove
   (Ming)

 - sed-opal cleanups (Revanth)

 - Per device weight support for BFQ (Fam)

 - Support for blk-iocost, a new model that can properly account cost of
   IO workloads. (Tejun)

 - blk-cgroup writeback fixes (Tejun)

 - paride queue init fixes (zhengbin)

 - blk_set_runtime_active() cleanup (Stanley)

 - Block segment mapping optimizations (Bart)

 - lightnvm fixes (Hans/Minwoo/YueHaibing)

 - Various little fixes and cleanups

* tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block: (186 commits)
  null_blk: format pr_* logs with pr_fmt
  null_blk: match the type of parameter nr_devices
  null_blk: do not fail the module load with zero devices
  block: also check RQF_STATS in blk_mq_need_time_stamp()
  block: make rq sector size accessible for block stats
  bfq: Fix bfq linkage error
  raid5: use bio_end_sector in r5_next_bio
  raid5: remove STRIPE_OPS_REQ_PENDING
  md: add feature flag MD_FEATURE_RAID0_LAYOUT
  md/raid0: avoid RAID0 data corruption due to layout confusion.
  raid5: don't set STRIPE_HANDLE to stripe which is in batch list
  raid5: don't increment read_errors on EILSEQ return
  nvmet: fix a wrong error status returned in error log page
  nvme: send discovery log page change events to userspace
  nvme: add uevent variables for controller devices
  nvme: enable aen regardless of the presence of I/O queues
  nvme-fabrics: allow discovery subsystems accept a kato
  nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery()
  nvme: Remove redundant assignment of cq vector
  nvme: Assign subsys instance from first ctrl
  ...
parents 5260c2b8 9c7eddf1
......@@ -1469,6 +1469,103 @@ IO Interface Files
8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021
io.cost.qos
A read-write nested-keyed file with exists only on the root
cgroup.
This file configures the Quality of Service of the IO cost
model based controller (CONFIG_BLK_CGROUP_IOCOST) which
currently implements "io.weight" proportional control. Lines
are keyed by $MAJ:$MIN device numbers and not ordered. The
line for a given device is populated on the first write for
the device on "io.cost.qos" or "io.cost.model". The following
nested keys are defined.
====== =====================================
enable Weight-based control enable
ctrl "auto" or "user"
rpct Read latency percentile [0, 100]
rlat Read latency threshold
wpct Write latency percentile [0, 100]
wlat Write latency threshold
min Minimum scaling percentage [1, 10000]
max Maximum scaling percentage [1, 10000]
====== =====================================
The controller is disabled by default and can be enabled by
setting "enable" to 1. "rpct" and "wpct" parameters default
to zero and the controller uses internal device saturation
state to adjust the overall IO rate between "min" and "max".
When a better control quality is needed, latency QoS
parameters can be configured. For example::
8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0
shows that on sdb, the controller is enabled, will consider
the device saturated if the 95th percentile of read completion
latencies is above 75ms or write 150ms, and adjust the overall
IO issue rate between 50% and 150% accordingly.
The lower the saturation point, the better the latency QoS at
the cost of aggregate bandwidth. The narrower the allowed
adjustment range between "min" and "max", the more conformant
to the cost model the IO behavior. Note that the IO issue
base rate may be far off from 100% and setting "min" and "max"
blindly can lead to a significant loss of device capacity or
control quality. "min" and "max" are useful for regulating
devices which show wide temporary behavior changes - e.g. a
ssd which accepts writes at the line speed for a while and
then completely stalls for multiple seconds.
When "ctrl" is "auto", the parameters are controlled by the
kernel and may change automatically. Setting "ctrl" to "user"
or setting any of the percentile and latency parameters puts
it into "user" mode and disables the automatic changes. The
automatic mode can be restored by setting "ctrl" to "auto".
io.cost.model
A read-write nested-keyed file with exists only on the root
cgroup.
This file configures the cost model of the IO cost model based
controller (CONFIG_BLK_CGROUP_IOCOST) which currently
implements "io.weight" proportional control. Lines are keyed
by $MAJ:$MIN device numbers and not ordered. The line for a
given device is populated on the first write for the device on
"io.cost.qos" or "io.cost.model". The following nested keys
are defined.
===== ================================
ctrl "auto" or "user"
model The cost model in use - "linear"
===== ================================
When "ctrl" is "auto", the kernel may change all parameters
dynamically. When "ctrl" is set to "user" or any other
parameters are written to, "ctrl" become "user" and the
automatic changes are disabled.
When "model" is "linear", the following model parameters are
defined.
============= ========================================
[r|w]bps The maximum sequential IO throughput
[r|w]seqiops The maximum 4k sequential IOs per second
[r|w]randiops The maximum 4k random IOs per second
============= ========================================
From the above, the builtin linear model determines the base
costs of a sequential and random IO and the cost coefficient
for the IO size. While simple, this model can cover most
common device classes acceptably.
The IO cost model isn't expected to be accurate in absolute
sense and is scaled to the device behavior dynamically.
If needed, tools/cgroup/iocost_coef_gen.py can be used to
generate device-specific coefficients.
io.weight
A read-write flat-keyed file which exists on non-root cgroups.
The default is "default 100".
......
......@@ -1201,12 +1201,6 @@
See comment before function elanfreq_setup() in
arch/x86/kernel/cpu/cpufreq/elanfreq.c.
elevator= [IOSCHED]
Format: { "mq-deadline" | "kyber" | "bfq" }
See Documentation/block/deadline-iosched.rst,
Documentation/block/kyber-iosched.rst and
Documentation/block/bfq-iosched.rst for details.
elfcorehdr=[size[KMG]@]offset[KMG] [IA64,PPC,SH,X86,S390]
Specifies physical address of start of kernel core
image elf header and optionally the size. Generally
......
......@@ -274,9 +274,7 @@ To reduce its OS jitter, do any of the following:
(based on an earlier one from Gilad Ben-Yossef) that
reduces or even eliminates vmstat overhead for some
workloads at https://lkml.org/lkml/2013/9/4/379.
e. Boot with "elevator=noop" to avoid workqueue use by
the block layer.
f. If running on high-end powerpc servers, build with
e. If running on high-end powerpc servers, build with
CONFIG_PPC_RTAS_DAEMON=n. This prevents the RTAS
daemon from running on each CPU every second or so.
(This will require editing Kconfig files and will defeat
......@@ -284,12 +282,12 @@ To reduce its OS jitter, do any of the following:
due to the rtas_event_scan() function.
WARNING: Please check your CPU specifications to
make sure that this is safe on your particular system.
g. If running on Cell Processor, build your kernel with
f. If running on Cell Processor, build your kernel with
CBE_CPUFREQ_SPU_GOVERNOR=n to avoid OS jitter from
spu_gov_work().
WARNING: Please check your CPU specifications to
make sure that this is safe on your particular system.
h. If running on PowerMAC, build your kernel with
g. If running on PowerMAC, build your kernel with
CONFIG_PMAC_RACKMETER=n to disable the CPU-meter,
avoiding OS jitter from rackmeter_do_timer().
......
.. SPDX-License-Identifier: GPL-2.0
========================
Null block device driver
========================
1. Overview
===========
Overview
========
The null block device (/dev/nullb*) is used for benchmarking the various
The null block device (``/dev/nullb*``) is used for benchmarking the various
block-layer implementations. It emulates a block device of X gigabytes in size.
The following instances are possible:
Single-queue block-layer
- Request-based.
- Single submission queue per device.
- Implements IO scheduling algorithms (CFQ, Deadline, noop).
It does not execute any read/write operation, just mark them as complete in
the request queue. The following instances are possible:
Multi-queue block-layer
......@@ -27,15 +24,15 @@ The following instances are possible:
All of them have a completion queue for each core in the system.
2. Module parameters applicable for all instances
=================================================
Module parameters
=================
queue_mode=[0-2]: Default: 2-Multi-queue
Selects which block-layer the module should instantiate with.
= ============
0 Bio-based
1 Single-queue
1 Single-queue (deprecated)
2 Multi-queue
= ============
......@@ -67,7 +64,7 @@ irqmode=[0-2]: Default: 1-Soft-irq
completion_nsec=[ns]: Default: 10,000ns
Combined with irqmode=2 (timer). The time each completion event must wait.
submit_queues=[1..nr_cpus]:
submit_queues=[1..nr_cpus]: Default: 1
The number of submission queues attached to the device driver. If unset, it
defaults to 1. For multi-queue, it is ignored when use_per_node_hctx module
parameter is 1.
......@@ -75,9 +72,11 @@ submit_queues=[1..nr_cpus]:
hw_queue_depth=[0..qdepth]: Default: 64
The hardware queue depth of the device.
III: Multi-queue specific parameters
Multi-queue specific parameters
-------------------------------
use_per_node_hctx=[0/1]: Default: 0
Number of hardware context queues.
= =====================================================================
0 The number of submit queues are set to the value of the submit_queues
......@@ -87,6 +86,7 @@ use_per_node_hctx=[0/1]: Default: 0
= =====================================================================
no_sched=[0/1]: Default: 0
Enable/disable the io scheduler.
= ======================================
0 nullb* use default blk-mq io scheduler
......@@ -94,6 +94,7 @@ no_sched=[0/1]: Default: 0
= ======================================
blocking=[0/1]: Default: 0
Blocking behavior of the request queue.
= ===============================================================
0 Register as a non-blocking blk-mq driver device.
......@@ -103,6 +104,7 @@ blocking=[0/1]: Default: 0
= ===============================================================
shared_tags=[0/1]: Default: 0
Sharing tags between devices.
= ================================================================
0 Tag set is not shared.
......@@ -111,6 +113,7 @@ shared_tags=[0/1]: Default: 0
= ================================================================
zoned=[0/1]: Default: 0
Device is a random-access or a zoned block device.
= ======================================================================
0 Block device is exposed as a random-access block device.
......
......@@ -2,10 +2,6 @@
Switching Scheduler
===================
To choose IO schedulers at boot time, use the argument 'elevator=deadline'.
'noop' and 'cfq' (the default) are also available. IO schedulers are assigned
globally at boot time only presently.
Each io queue has a set of io scheduler tunables associated with it. These
tunables control how the io scheduler works. You can find these entries
in::
......
......@@ -26,6 +26,9 @@ menuconfig BLOCK
if BLOCK
config BLK_RQ_ALLOC_TIME
bool
config BLK_SCSI_REQUEST
bool
......@@ -132,6 +135,16 @@ config BLK_CGROUP_IOLATENCY
Note, this is an experimental interface and could be changed someday.
config BLK_CGROUP_IOCOST
bool "Enable support for cost model based cgroup IO controller"
depends on BLK_CGROUP=y
select BLK_RQ_ALLOC_TIME
---help---
Enabling this option enables the .weight interface for cost
model based proportional IO control. The IO controller
distributes IO capacity between different groups based on
their share of the overall weight distribution.
config BLK_WBT_MQ
bool "Multiqueue writeback throttling"
default y
......
......@@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_BSGLIB) += bsg-lib.o
obj-$(CONFIG_BLK_CGROUP) += blk-cgroup.o
obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o
obj-$(CONFIG_BLK_CGROUP_IOLATENCY) += blk-iolatency.o
obj-$(CONFIG_BLK_CGROUP_IOCOST) += blk-iocost.o
obj-$(CONFIG_MQ_IOSCHED_DEADLINE) += mq-deadline.o
obj-$(CONFIG_MQ_IOSCHED_KYBER) += kyber-iosched.o
bfq-y := bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
......
......@@ -501,11 +501,12 @@ static void bfq_cpd_free(struct blkcg_policy_data *cpd)
kfree(cpd_to_bfqgd(cpd));
}
static struct blkg_policy_data *bfq_pd_alloc(gfp_t gfp, int node)
static struct blkg_policy_data *bfq_pd_alloc(gfp_t gfp, struct request_queue *q,
struct blkcg *blkcg)
{
struct bfq_group *bfqg;
bfqg = kzalloc_node(sizeof(*bfqg), gfp, node);
bfqg = kzalloc_node(sizeof(*bfqg), gfp, q->node);
if (!bfqg)
return NULL;
......@@ -904,7 +905,7 @@ void bfq_end_wr_async(struct bfq_data *bfqd)
bfq_end_wr_async_queues(bfqd, bfqd->root_group);
}
static int bfq_io_show_weight(struct seq_file *sf, void *v)
static int bfq_io_show_weight_legacy(struct seq_file *sf, void *v)
{
struct blkcg *blkcg = css_to_blkcg(seq_css(sf));
struct bfq_group_data *bfqgd = blkcg_to_bfqgd(blkcg);
......@@ -918,6 +919,60 @@ static int bfq_io_show_weight(struct seq_file *sf, void *v)
return 0;
}
static u64 bfqg_prfill_weight_device(struct seq_file *sf,
struct blkg_policy_data *pd, int off)
{
struct bfq_group *bfqg = pd_to_bfqg(pd);
if (!bfqg->entity.dev_weight)
return 0;
return __blkg_prfill_u64(sf, pd, bfqg->entity.dev_weight);
}
static int bfq_io_show_weight(struct seq_file *sf, void *v)
{
struct blkcg *blkcg = css_to_blkcg(seq_css(sf));
struct bfq_group_data *bfqgd = blkcg_to_bfqgd(blkcg);
seq_printf(sf, "default %u\n", bfqgd->weight);
blkcg_print_blkgs(sf, blkcg, bfqg_prfill_weight_device,
&blkcg_policy_bfq, 0, false);
return 0;
}
static void bfq_group_set_weight(struct bfq_group *bfqg, u64 weight, u64 dev_weight)
{
weight = dev_weight ?: weight;
bfqg->entity.dev_weight = dev_weight;
/*
* Setting the prio_changed flag of the entity
* to 1 with new_weight == weight would re-set
* the value of the weight to its ioprio mapping.
* Set the flag only if necessary.
*/
if ((unsigned short)weight != bfqg->entity.new_weight) {
bfqg->entity.new_weight = (unsigned short)weight;
/*
* Make sure that the above new value has been
* stored in bfqg->entity.new_weight before
* setting the prio_changed flag. In fact,
* this flag may be read asynchronously (in
* critical sections protected by a different
* lock than that held here), and finding this
* flag set may cause the execution of the code
* for updating parameters whose value may
* depend also on bfqg->entity.new_weight (in
* __bfq_entity_update_weight_prio).
* This barrier makes sure that the new value
* of bfqg->entity.new_weight is correctly
* seen in that code.
*/
smp_wmb();
bfqg->entity.prio_changed = 1;
}
}
static int bfq_io_set_weight_legacy(struct cgroup_subsys_state *css,
struct cftype *cftype,
u64 val)
......@@ -936,55 +991,72 @@ static int bfq_io_set_weight_legacy(struct cgroup_subsys_state *css,
hlist_for_each_entry(blkg, &blkcg->blkg_list, blkcg_node) {
struct bfq_group *bfqg = blkg_to_bfqg(blkg);
if (!bfqg)
continue;
/*
* Setting the prio_changed flag of the entity
* to 1 with new_weight == weight would re-set
* the value of the weight to its ioprio mapping.
* Set the flag only if necessary.
*/
if ((unsigned short)val != bfqg->entity.new_weight) {
bfqg->entity.new_weight = (unsigned short)val;
/*
* Make sure that the above new value has been
* stored in bfqg->entity.new_weight before
* setting the prio_changed flag. In fact,
* this flag may be read asynchronously (in
* critical sections protected by a different
* lock than that held here), and finding this
* flag set may cause the execution of the code
* for updating parameters whose value may
* depend also on bfqg->entity.new_weight (in
* __bfq_entity_update_weight_prio).
* This barrier makes sure that the new value
* of bfqg->entity.new_weight is correctly
* seen in that code.
*/
smp_wmb();
bfqg->entity.prio_changed = 1;
}
if (bfqg)
bfq_group_set_weight(bfqg, val, 0);
}
spin_unlock_irq(&blkcg->lock);
return ret;
}
static ssize_t bfq_io_set_weight(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
static ssize_t bfq_io_set_device_weight(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
{
u64 weight;
/* First unsigned long found in the file is used */
int ret = kstrtoull(strim(buf), 0, &weight);
int ret;
struct blkg_conf_ctx ctx;
struct blkcg *blkcg = css_to_blkcg(of_css(of));
struct bfq_group *bfqg;
u64 v;
ret = blkg_conf_prep(blkcg, &blkcg_policy_bfq, buf, &ctx);
if (ret)
return ret;
ret = bfq_io_set_weight_legacy(of_css(of), NULL, weight);
if (sscanf(ctx.body, "%llu", &v) == 1) {
/* require "default" on dfl */
ret = -ERANGE;
if (!v)
goto out;
} else if (!strcmp(strim(ctx.body), "default")) {
v = 0;
} else {
ret = -EINVAL;
goto out;
}
bfqg = blkg_to_bfqg(ctx.blkg);
ret = -ERANGE;
if (!v || (v >= BFQ_MIN_WEIGHT && v <= BFQ_MAX_WEIGHT)) {
bfq_group_set_weight(bfqg, bfqg->entity.weight, v);
ret = 0;
}
out:
blkg_conf_finish(&ctx);
return ret ?: nbytes;
}
static ssize_t bfq_io_set_weight(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
{
char *endp;
int ret;
u64 v;
buf = strim(buf);
/* "WEIGHT" or "default WEIGHT" sets the default weight */
v = simple_strtoull(buf, &endp, 0);
if (*endp == '\0' || sscanf(buf, "default %llu", &v) == 1) {
ret = bfq_io_set_weight_legacy(of_css(of), NULL, v);
return ret ?: nbytes;
}
return bfq_io_set_device_weight(of, buf, nbytes, off);
}
#ifdef CONFIG_BFQ_CGROUP_DEBUG
static int bfqg_print_stat(struct seq_file *sf, void *v)
{
......@@ -1141,9 +1213,15 @@ struct cftype bfq_blkcg_legacy_files[] = {
{
.name = "bfq.weight",
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = bfq_io_show_weight,
.seq_show = bfq_io_show_weight_legacy,
.write_u64 = bfq_io_set_weight_legacy,
},
{
.name = "bfq.weight_device",
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = bfq_io_show_weight,
.write = bfq_io_set_weight,
},
/* statistics, covers only the tasks in the bfqg */
{
......
......@@ -168,6 +168,9 @@ struct bfq_entity {
/* budget, used also to calculate F_i: F_i = S_i + @budget / @weight */
int budget;
/* device weight, if non-zero, it overrides the default weight of
* bfq_group_data */
int dev_weight;
/* weight of the queue */
int weight;
/* next weight if a change is in progress */
......
......@@ -744,6 +744,8 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
}
#endif
/* Matches the smp_wmb() in bfq_group_set_weight. */
smp_rmb();
old_st->wsum -= entity->weight;
if (entity->new_weight != entity->orig_weight) {
......
......@@ -646,25 +646,20 @@ static inline bool page_is_mergeable(const struct bio_vec *bv,
return true;
}
/*
* Check if the @page can be added to the current segment(@bv), and make
* sure to call it only if page_is_mergeable(@bv, @page) is true
*/
static bool can_add_page_to_seg(struct request_queue *q,
struct bio_vec *bv, struct page *page, unsigned len,
unsigned offset)
static bool bio_try_merge_pc_page(struct request_queue *q, struct bio *bio,
struct page *page, unsigned len, unsigned offset,
bool *same_page)
{
struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
unsigned long mask = queue_segment_boundary(q);
phys_addr_t addr1 = page_to_phys(bv->bv_page) + bv->bv_offset;
phys_addr_t addr2 = page_to_phys(page) + offset + len - 1;
if ((addr1 | mask) != (addr2 | mask))
return false;
if (bv->bv_len + len > queue_max_segment_size(q))
return false;
return true;
return __bio_try_merge_page(bio, page, len, offset, same_page);
}
/**
......@@ -674,7 +669,7 @@ static bool can_add_page_to_seg(struct request_queue *q,
* @page: page to add
* @len: vec entry length
* @offset: vec entry offset
* @put_same_page: put the page if it is same with last added page
* @same_page: return if the merge happen inside the same page
*
* Attempt to add a page to the bio_vec maplist. This can fail for a
* number of reasons, such as the bio being full or target block device
......@@ -685,10 +680,9 @@ static bool can_add_page_to_seg(struct request_queue *q,
*/
static int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
struct page *page, unsigned int len, unsigned int offset,
bool put_same_page)
bool *same_page)
{
struct bio_vec *bvec;
bool same_page = false;
/*
* cloned bio must not modify vec list
......@@ -700,28 +694,16 @@ static int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
return 0;
if (bio->bi_vcnt > 0) {
bvec = &bio->bi_io_vec[bio->bi_vcnt - 1];
if (page == bvec->bv_page &&
offset == bvec->bv_offset + bvec->bv_len) {
if (put_same_page)
put_page(page);
bvec->bv_len += len;
goto done;
}
if (bio_try_merge_pc_page(q, bio, page, len, offset, same_page))
return len;
/*
* If the queue doesn't support SG gaps and adding this
* offset would create a gap, disallow it.
* If the queue doesn't support SG gaps and adding this segment
* would create a gap, disallow it.
*/
bvec = &bio->bi_io_vec[bio->bi_vcnt - 1];
if (bvec_gap_to_prev(q, bvec, offset))
return 0;
if (page_is_mergeable(bvec, page, len, offset, &same_page) &&
can_add_page_to_seg(q, bvec, page, len, offset)) {
bvec->bv_len += len;
goto done;
}
}
if (bio_full(bio, len))
......@@ -735,7 +717,6 @@ static int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
bvec->bv_len = len;
bvec->bv_offset = offset;
bio->bi_vcnt++;
done:
bio->bi_iter.bi_size += len;
return len;
}
......@@ -743,7 +724,8 @@ static int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
int bio_add_pc_page(struct request_queue *q, struct bio *bio,