1. 13 Sep, 2019 2 commits
  2. 11 Sep, 2019 5 commits
    • spi: bcm2835: Speed up RX-only DMA transfers by zero-filling TX FIFO · 2b8279ae
      Lukas Wunner authored and Mark Brown committed
      
      
      The BCM2835 SPI driver currently sets the SPI_CONTROLLER_MUST_TX flag.
      When performing an RX-only transfer, this flag causes the SPI core to
      allocate and DMA-map a dummy buffer which is copied to the TX FIFO.
      The dummy buffer is necessary because the chip is not capable of
      automatically clocking out null bytes.
      
      Avoid the overhead induced by the dummy buffer by preallocating a
      reusable DMA transaction which fills the TX FIFO by cyclically copying
      from the zero page.  The transaction requires very little CPU time to
      submit and generates no interrupts while running.  Specifics are
      provided in kerneldoc comments.
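
A minimal userspace sketch of the idea of feeding the TX side from one reusable page of zeroes instead of a per-transfer dummy buffer. It is not the driver's code: `zero_page`, `ZERO_PAGE_SIZE` and `feed_zeros()` are hypothetical names, and a `memcpy()` loop stands in for the cyclic DMA transaction.

```c
#include <stddef.h>
#include <string.h>

#define ZERO_PAGE_SIZE 4096	/* assumed page size for this sketch */

/* One shared, always-zero page; nothing is allocated per transfer. */
static const unsigned char zero_page[ZERO_PAGE_SIZE];

/*
 * Write len zero bytes to tx_sink by copying from the zero page over
 * and over, wrapping around whenever a full page has been consumed.
 * Returns the number of bytes written.
 */
static size_t feed_zeros(unsigned char *tx_sink, size_t len)
{
	size_t done = 0;

	while (done < len) {
		size_t chunk = len - done;

		if (chunk > ZERO_PAGE_SIZE)
			chunk = ZERO_PAGE_SIZE;
		memcpy(tx_sink + done, zero_page, chunk);
		done += chunk;
	}
	return done;
}
```

The transfer length can exceed the page size because the same source is simply reused for every chunk.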
      
      [Nathan Chancellor contributed a DMA mapping fixup for an early version
      of this commit, hence his Signed-off-by.]
      
      Tested-by: Nuno Sá <nuno.sa@analog.com>
      Tested-by: Noralf Trønnes <noralf@tronnes.org>
      Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Acked-by: Stefan Wahren <wahrenst@gmx.net>
      Acked-by: Martin Sperl <kernel@martin.sperl.org>
      Cc: Robert Jarzmik <robert.jarzmik@free.fr>
      Link: https://lore.kernel.org/r/f45920af18dbf06e34129bbc406f53dc9c5d1075.1568187525.git.lukas@wunner.de
      Signed-off-by: Mark Brown <broonie@kernel.org>
    • spi: bcm2835: Speed up TX-only DMA transfers by clearing RX FIFO · 8259bf66
      Lukas Wunner authored and Mark Brown committed
      
      
      The BCM2835 SPI driver currently sets the SPI_CONTROLLER_MUST_RX flag.
      When performing a TX-only transfer, this flag causes the SPI core to
      allocate and DMA-map a dummy buffer into which the RX FIFO contents are
      copied.  The dummy buffer is necessary because the chip is not capable
      of disabling the receiver or automatically throwing away received data.
      Not reading the RX FIFO isn't an option either since transmission is
      halted once it's full.
      
      Avoid the overhead induced by the dummy buffer by preallocating a
      reusable DMA transaction which cyclically clears the RX FIFO.  The
      transaction requires very little CPU time to submit and generates no
      interrupts while running.  Specifics are provided in kerneldoc comments.
      
      With a ks8851 Ethernet chip attached to the SPI controller, I am seeing
      a 30 us reduction in ping time with this commit (1.819 ms vs. 1.849 ms,
      average of 100,000 packets) as well as a 2% reduction in CPU time
      (75:08 vs. 76:39 for transmission of 5 GByte over the SPI bus).
      
      The commit uses the TX DMA interrupt to signal completion of a transfer.
      This interrupt is raised once all bytes have been written to the
      TX FIFO and it is then necessary to busy-wait for the TX FIFO to become
      empty before the transfer can be finalized.  As an alternative approach,
      I have explored using the SPI controller's DONE interrupt to detect
      completion.  This interrupt is signaled when the TX FIFO becomes empty,
      avoiding the need to busy-wait.  However latency deteriorates compared
      to the present commit and surprisingly, CPU time is slightly higher as
      well:
      
      It turns out that in 45% of the cases, no busy-waiting is needed at all
      and in 76% of the cases, less than 10 busy-wait iterations are
      sufficient for the TX FIFO to drain.  This was measured on an RT kernel.
      On a vanilla kernel, wakeup latency is worse and thus fewer iterations
      are needed.  The measurements were made with an SPI clock of 20 MHz,
      they may differ slightly for slower or faster clock speeds.
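
The busy-wait described above can be sketched as a polling loop that counts its iterations. This is a userspace simulation under stated assumptions: `struct fake_spi` and `fake_read_cs()` are hypothetical stand-ins for the hardware, and the DONE bit position (bit 16) is assumed rather than taken from this log.

```c
#include <stdint.h>

#define CS_DONE (1u << 16)	/* assumed position of the DONE bit */

/* Hypothetical model of the controller: bytes still in the TX FIFO. */
struct fake_spi {
	unsigned int tx_fifo_bytes;
};

/* Simulated CS register read; one byte is clocked out per poll. */
static uint32_t fake_read_cs(struct fake_spi *spi)
{
	if (spi->tx_fifo_bytes > 0)
		spi->tx_fifo_bytes--;
	return spi->tx_fifo_bytes == 0 ? CS_DONE : 0;
}

/*
 * Poll until DONE signals an empty TX FIFO.  Returns how many polls
 * came back without DONE, i.e. the number of busy-wait iterations.
 */
static unsigned int wait_tx_fifo_empty(struct fake_spi *spi)
{
	unsigned int iterations = 0;

	while (!(fake_read_cs(spi) & CS_DONE))
		iterations++;
	return iterations;
}
```

If the FIFO has already drained by the time the callback runs, the loop body never executes, which corresponds to the "no busy-waiting is needed at all" case.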
      
      Previously we always used the RX DMA interrupt to signal completion of a
      transfer.  Using the TX DMA interrupt now introduces a race condition:
      TX DMA is always started before RX DMA so that bytes are already clocked
      out while RX DMA is still being set up.  But if a TX-only transfer is
      very short, then the TX DMA interrupt may occur before RX DMA is set up.
      If the interrupt happens to occur on the same CPU, setup of RX DMA may
      even be delayed until after the interrupt was handled.
      
      I've solved this by having the TX DMA callback clear the RX FIFO while
      busy-waiting for the TX FIFO to drain, thus avoiding a dependency on
      setup of RX DMA.  Additionally, I am using a lock-free mechanism with
      two flags, tx_dma_active and rx_dma_active plus memory barriers to
      terminate RX DMA either by the TX DMA callback or immediately after
      setting it up, whichever wins the race.  I've explored an alternative
      approach which temporarily disables the TX DMA callback until RX DMA
      has been set up (using tasklet_disable(), local_bh_disable() or
      local_irq_save()), but the performance was minimally worse.
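
A hedged sketch of the two-flag race resolution, written with C11 atomics instead of the kernel's primitives. Only the flag names `tx_dma_active` and `rx_dma_active` come from the text above; the struct, the function names and the termination counter are hypothetical, and sequentially consistent atomics stand in for the explicit memory barriers.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reduced driver state. */
struct spi_state {
	atomic_bool tx_dma_active;
	atomic_bool rx_dma_active;
	int rx_terminations;	/* how often RX DMA was terminated */
};

/* Whoever flips rx_dma_active from true to false terminates RX DMA. */
static void try_terminate_rx(struct spi_state *s)
{
	bool expected = true;

	if (atomic_compare_exchange_strong(&s->rx_dma_active, &expected, false))
		s->rx_terminations++;
}

/* TX DMA completion callback; may run before RX DMA was set up. */
static void tx_dma_done(struct spi_state *s)
{
	atomic_store(&s->tx_dma_active, false);
	try_terminate_rx(s);	/* no-op if RX DMA is not active yet */
}

/* Submission path, called after TX DMA has been started. */
static void rx_dma_setup(struct spi_state *s)
{
	atomic_store(&s->rx_dma_active, true);
	/* TX already finished: its callback missed us, so clean up here. */
	if (!atomic_load(&s->tx_dma_active))
		try_terminate_rx(s);
}
```

Whichever side runs second finds `rx_dma_active` set and wins the compare-and-exchange, so RX DMA is terminated exactly once in either ordering.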
      
      [Nathan Chancellor contributed a DMA mapping fixup for an early version
      of this commit, hence his Signed-off-by.]
      
      Tested-by: Nuno Sá <nuno.sa@analog.com>
      Tested-by: Noralf Trønnes <noralf@tronnes.org>
      Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Acked-by: Stefan Wahren <wahrenst@gmx.net>
      Acked-by: Martin Sperl <kernel@martin.sperl.org>
      Cc: Robert Jarzmik <robert.jarzmik@free.fr>
      Link: https://lore.kernel.org/r/874949385f28251e2dcaa9494e39a27b50e9f9e4.1568187525.git.lukas@wunner.de
      Signed-off-by: Mark Brown <broonie@kernel.org>
    • spi: bcm2835: Cache CS register value for ->prepare_message() · 571e31fa
      Lukas Wunner authored and Mark Brown committed

      The BCM2835 SPI driver needs to set up the clock polarity in its
      ->prepare_message() hook before spi_transfer_one_message() asserts chip
      select to avoid a gratuitous clock signal edge (cf. commit acace73d
      ("spi: bcm2835: set up spi-mode before asserting cs-gpio")).
      
      Precalculate the CS register value (which selects the clock polarity)
      once in ->setup() and use that cached value in ->prepare_message() and
      ->transfer_one().  This avoids one MMIO read per message and one per
      transfer, yielding a small latency improvement.  Additionally, a
      forthcoming commit will use the precalculated value to derive the
      register value for clearing the RX FIFO, which will eliminate the need
      for an RX dummy buffer when performing TX-only DMA transfers.
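
A minimal sketch of the caching idea: compute the register value once at setup time and later write it without reading the register back. The struct, the field name `prepare_cs`, the function names and the CS bit positions are assumptions for illustration; a plain variable stands in for the memory-mapped register.

```c
#include <stdint.h>

/* SPI mode flags as used by the Linux SPI core. */
#define SPI_CPHA 0x01
#define SPI_CPOL 0x02

/* Assumed CS register bit positions for this sketch. */
#define CS_CPHA (1u << 2)
#define CS_CPOL (1u << 3)

/* Hypothetical per-device state. */
struct spi_dev {
	uint32_t mode;		/* SPI_CPHA / SPI_CPOL flags */
	uint8_t chip_select;	/* 0..2, occupies the low CS bits */
	uint32_t prepare_cs;	/* value cached by setup() */
};

/* Called once per device: precalculate the CS register value. */
static void setup(struct spi_dev *dev)
{
	uint32_t cs = dev->chip_select;

	if (dev->mode & SPI_CPOL)
		cs |= CS_CPOL;
	if (dev->mode & SPI_CPHA)
		cs |= CS_CPHA;
	dev->prepare_cs = cs;
}

/* Called per message: a single write, no read-modify-write. */
static void prepare_message(const struct spi_dev *dev, volatile uint32_t *cs_reg)
{
	*cs_reg = dev->prepare_cs;
}
```

The per-message path no longer depends on the current register contents, which is where the saved MMIO read comes from.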
      
      Tested-by: Nuno Sá <nuno.sa@analog.com>
      Tested-by: Noralf Trønnes <noralf@tronnes.org>
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Acked-by: Stefan Wahren <wahrenst@gmx.net>
      Acked-by: Martin Sperl <kernel@martin.sperl.org>
      Link: https://lore.kernel.org/r/d17c1d7fcdc97fffa961b8737cfd80eeb14f9416.1568187525.git.lukas@wunner.de
      Signed-off-by: Mark Brown <broonie@kernel.org>
    • spi: Guarantee cacheline alignment of driver-private data · 229e6af1
      Lukas Wunner authored and Mark Brown committed
      
      
      __spi_alloc_controller() uses a single allocation to accommodate struct
      spi_controller and the driver-private data, but places the latter behind
      the former.  This order does not guarantee cacheline alignment of the
      driver-private data.  (It does guarantee cacheline alignment of struct
      spi_controller but the structure doesn't make any use of that property.)
      
      Round up struct spi_controller to cacheline size.  A forthcoming commit
      leverages this to grant DMA access to driver-private data of the BCM2835
      SPI master.
      
      An alternative, less economical approach would be to use two allocations.
      
      A third approach consists of reversing the order to conserve memory.
      But Mark Brown is concerned that it may result in a performance penalty
      on architectures that don't like unaligned accesses.
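
The rounding approach can be illustrated in userspace as follows. This is a sketch, not the SPI core's code: `struct controller` is a hypothetical stand-in for struct spi_controller, the 64-byte cacheline size is assumed, and `aligned_alloc()` replaces the kernel allocator.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHELINE 64	/* assumed cacheline size */
#define ALIGN_UP(x, a) (((x) + (a) - 1) / (a) * (a))

/* Hypothetical stand-in for struct spi_controller (44 bytes here). */
struct controller {
	char name[40];
	int bus_num;
};

/*
 * Single allocation holding the controller followed by driver-private
 * data.  The controller part is rounded up to a full cacheline so that
 * the private data behind it starts on a cacheline boundary.
 */
static void *alloc_controller(size_t priv_size, void **priv)
{
	size_t ctlr_size = ALIGN_UP(sizeof(struct controller), CACHELINE);
	/* aligned_alloc() wants a size that is a multiple of the alignment */
	size_t total = ALIGN_UP(ctlr_size + priv_size, CACHELINE);
	void *mem = aligned_alloc(CACHELINE, total);

	if (mem)
		*priv = (char *)mem + ctlr_size;
	return mem;
}
```

The cost is at most one cacheline minus one byte of padding per controller, which is what makes this cheaper than a second allocation.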
      
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Link: https://lore.kernel.org/r/01625b9b26b93417fb09d2c15ad02dfe9cdbbbe5.1568187525.git.lukas@wunner.de
      Signed-off-by: Mark Brown <broonie@kernel.org>
    • spi: bcm2835: Drop dma_pending flag · 1513ceee
      Lukas Wunner authored and Mark Brown committed

      The BCM2835 SPI driver uses a flag to keep track of whether a DMA
      transfer is in progress.
      
      The flag is used to avoid terminating DMA channels multiple times if a
      transfer finishes orderly while simultaneously the SPI core invokes the
      ->handle_err() callback because the transfer took too long.  However
      terminating DMA channels multiple times is perfectly fine, so the flag
      is unnecessary for this particular purpose.
      
      The flag is also used to avoid invoking bcm2835_spi_undo_prologue()
      multiple times under this race condition.  However multiple *concurrent*
      invocations can no longer happen since commit 2527704d ("spi:
      bcm2835: Synchronize with callback on DMA termination") because the
      ->handle_err() callback now uses the _sync() variant when terminating
      DMA channels.
      
      The only raison d'être of the flag is therefore that
      bcm2835_spi_undo_prologue() cannot cope with multiple *sequential*
      invocations.  Achieve that by setting tx_prologue to 0 at the end of
      the function.  Subsequent invocations thus become no-ops.
      
      With that, the dma_pending flag becomes unnecessary, so drop it.
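
A sketch of how zeroing `tx_prologue` makes repeated sequential invocations harmless. Only the name `tx_prologue` comes from the text above; the struct, its other fields and the exact bookkeeping being undone are simplified assumptions.

```c
#include <stddef.h>

/* Hypothetical reduced transfer state. */
struct xfer_state {
	size_t tx_prologue;	/* bytes sent by PIO ahead of the DMA part */
	size_t dma_addr;	/* start address of the first DMA entry */
	size_t dma_len;		/* length of the first DMA entry */
};

/*
 * Revert the adjustments made for the prologue.  Clearing tx_prologue
 * at the end turns every subsequent call into a no-op.
 */
static void undo_prologue(struct xfer_state *s)
{
	if (!s->tx_prologue)
		return;

	s->dma_addr -= s->tx_prologue;
	s->dma_len += s->tx_prologue;
	s->tx_prologue = 0;
}
```

Because the function now records its own completion, callers need no separate flag to know whether it has already run.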
      
      Tested-by: Nuno Sá <nuno.sa@analog.com>
      Tested-by: Noralf Trønnes <noralf@tronnes.org>
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Acked-by: Stefan Wahren <wahrenst@gmx.net>
      Acked-by: Martin Sperl <kernel@martin.sperl.org>
      Link: https://lore.kernel.org/r/062b03b7f86af77a13ce0ec3b22e0bdbfcfba10d.1568187525.git.lukas@wunner.de
      Signed-off-by: Mark Brown <broonie@kernel.org>
  3. 10 Sep, 2019 1 commit
    • spi: bcm2835: Work around DONE bit erratum · 4c524191
      Lukas Wunner authored and Mark Brown committed

      Commit 3bd7f658 ("spi: bcm2835: Overcome sglist entry length
      limitation") amended the BCM2835 SPI driver with support for DMA
      transfers whose buffers are not aligned to 4 bytes and require more than
      one sglist entry.
      
      When testing this feature with upcoming commits to speed up TX-only and
      RX-only transfers, I noticed that SPI transmission sometimes breaks.
      A function introduced by the commit, bcm2835_spi_transfer_prologue(),
      performs one or two PIO transmissions as a prologue to the actual DMA
      transmission.  It turns out that the breakage goes away if the DONE bit
      in the CS register is set when ending such a PIO transmission.
      
      The DONE bit signifies emptiness of the TX FIFO.  According to the spec,
      the bit is of type RO, so writing it should never have any effect.
      Perhaps the spec is wrong and the bit is actually of type RW1C.
      E.g. the I2C controller on the BCM2835 does have an RW1C DONE bit which
      needs to be cleared by the driver.  Another, possibly more likely
      explanation is that it's a hardware erratum since the issue does not
      occur consistently.
      
      Either way, amend bcm2835_spi_transfer_prologue() to always write the
      DONE bit.
      
      Usually a transmission is ended by bcm2835_spi_reset_hw().  If the
      transmission was successful, the TX FIFO is empty and thus the DONE bit
      is set when bcm2835_spi_reset_hw() reads the CS register.  The bit is
      then written back to the register, so we happen to do the right thing.
      
      However if DONE is not set, e.g. because transmission is aborted with
      a non-empty TX FIFO, the bit won't be written by bcm2835_spi_reset_hw()
      and it seems possible that transmission might subsequently break.  To be
      on the safe side, likewise amend bcm2835_spi_reset_hw() to always write
      the bit.
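
The difference between writing back whatever DONE value was read and always writing the bit can be sketched as a pure function on the register value. The bit positions and the set of bits cleared on reset are assumptions for illustration, and `reset_hw_value()` is a hypothetical name.

```c
#include <stdint.h>

/* Assumed CS register bit positions for this sketch. */
#define CS_CLEAR_TX (1u << 4)
#define CS_CLEAR_RX (1u << 5)
#define CS_TA       (1u << 7)
#define CS_DMAEN    (1u << 8)
#define CS_INTD     (1u << 9)
#define CS_INTR     (1u << 10)
#define CS_DONE     (1u << 16)

/*
 * Compute the value to write when ending a transmission.  DONE is set
 * unconditionally instead of relying on it having been read as set.
 */
static uint32_t reset_hw_value(uint32_t cs)
{
	/* stop the transfer and disable interrupts and DMA requests */
	cs &= ~(CS_INTR | CS_INTD | CS_DMAEN | CS_TA);
	/* always write DONE, even if the TX FIFO was not empty */
	cs |= CS_DONE;
	/* flush both FIFOs */
	cs |= CS_CLEAR_RX | CS_CLEAR_TX;
	return cs;
}
```

With a read value in which DONE is clear, as after an aborted transmission, the written value still carries the bit.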
      
      Tested-by: Nuno Sá <nuno.sa@analog.com>
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Acked-by: Stefan Wahren <wahrenst@gmx.net>
      Acked-by: Martin Sperl <kernel@martin.sperl.org>
      Link: https://lore.kernel.org/r/edb004dff4af6106f6bfcb89e1a96391e96eb857.1564825752.git.lukas@wunner.de
      Signed-off-by: Mark Brown <broonie@kernel.org>
  4. 09 Sep, 2019 1 commit
  5. 05 Sep, 2019 2 commits
  6. 04 Sep, 2019 29 commits