tinysys part four: peripherals and memory arbitration

Hardware devices

There are several physical devices on the tinysys carrier board. They require either one-way or two-way communication, each with its own bus type.

The audio device uses an I2S bus to receive audio data from the FPGA fabric, whereas the SD Card uses SPI.

Above all of this lives a software layer that makes life easier for the OS. It all starts with memory mapped I/O regions.

Memory mapped I/O

Instead of exposing individual pins to our CPU / OS and having to precisely time signals ourselves, we can map a certain region of memory address space so that all input becomes memory reads, and all output becomes memory writes.

We can then map each peripheral to its own memory address range and access them as if they were uncached memory.
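
From the software side, this usually comes down to reading and writing through volatile pointers. Here is a minimal sketch in C, assuming 32-bit wide device registers. The helper names mmio_read32 and mmio_write32 are made up for illustration and are not taken from the tinysys SDK.

```c
#include <stdint.h>

/* Hypothetical helpers, not part of the tinysys SDK.
   'volatile' tells the compiler that every access has a side effect,
   so it may not cache, merge, or drop the load or store. */
static inline uint32_t mmio_read32(volatile uint32_t *reg)
{
    return *reg;
}

static inline void mmio_write32(volatile uint32_t *reg, uint32_t value)
{
    *reg = value;
}
```

On real hardware the pointer would be formed from an address inside the peripheral range; on a host machine the same helpers can be pointed at an ordinary variable for testing.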

For tinysys the memory layout currently looks as follows:

Starting with the general purpose I/O pins, each device takes up 64 KBytes of address space (offsets 0x0000 through 0xFFFF). This much spacing is not ordinarily required for all devices, but this layout was chosen for ease of address decoding.

Address decoding

The address space between 0x00000000 and 0x7FFFFFFF is cached memory, with most of it accessible and mapping onto the SDRAM. The address space beyond this, between 0x80000000 and 0x800FFFFF, is mapped to external peripherals and some of the CSRs of the two RISC-V cores. Anything beyond that, between 0x80100000 and 0xFFFFFFFF, is considered wasteland and one should not wander there.

The first step in our decoding process is to determine if we’re addressing a cached device or not. To simplify matters, the control unit is tied to the data cache, which houses a cached and an uncached memory access device. If the highest bit of our address is set, we’re in uncached memory and all access is routed to the uncached memory controller. Otherwise, we’ll hit the cache, which will either return a value or defer to the SDRAM to fetch and populate itself before returning a value.
The second step is to check the 4-bit device code in bits [19:16], which tells us which device to pick. The device selector maps this code to a bit mask and routes the AXI4 bus only towards the selected device.
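
The two decoding steps can be modeled in a few lines of C. This is only a software sketch of what the hardware does, assuming one select bit per device. The type and function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Result of decoding one 32-bit address, following the two steps above. */
typedef struct {
    bool     uncached;     /* bit 31 set: route to the uncached controller  */
    uint32_t device;       /* 4-bit device code taken from bits [19:16]     */
    uint32_t select_mask;  /* one-hot mask with only that device's bit set  */
    uint32_t offset;       /* offset inside the device's 64 KByte window    */
} decoded_addr;

static decoded_addr decode_address(uint32_t addr)
{
    decoded_addr d;
    d.uncached    = (addr & 0x80000000u) != 0;
    d.device      = (addr >> 16) & 0xFu;
    /* Cached addresses select no peripheral at all in this model. */
    d.select_mask = d.uncached ? (1u << d.device) : 0u;
    d.offset      = addr & 0xFFFFu;
    return d;
}
```

For example, decode_address(0x80030010) reports an uncached access to device 3 at offset 0x10, with a select mask of 0x8.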

Arbitration

I chose to use an AXI4 bus in tinysys to tie everything together. Arbitrating access to this bus between the devices that need it requires a device called a bus arbiter.

The bus arbiter is a very simple device in tinysys: it merely does round-robin arbitration, handing out memory access while trying to be fair to all devices.

When the display scan-out is running (that is, if the VPU is enabled and there is video output) the biggest customer of the arbitrator is the VPU. It needs to populate a scanline cache without much delay; therefore, we can’t give the CPU all of our memory access time.

When we consider that the CPU is mostly running off caches (I$ and D$), it makes sense that the VPU is not held back and gets an equal chance to read from the memory bus.

However, when we think about memory access, we should consider the read and write access requests separately. By allowing reads and writes to be asynchronous, we can guarantee that there is at least some overlap at the higher level, so things look like they’re not stalling most of the time.

Here’s a small peek into how a round-robin arbiter for read access might work, for a two-to-one mapping (two customers and one device to share):

- For each arvalid request line, populate one bit in a req bit array:
  rreq[0] = axi_s[0].arvalid;
  rreq[1] = axi_s[1].arvalid;
- For 'complete' notification, use the shared device's ack signals:
  rreqcomplete = axi_m.rvalid && axi_m.rlast;
- Copy the ready/data/status signals from axi_m (shared device) to the currently selected axi_s[]
- Copy the selected axi_s[] driving signals such as address/rready etc to axi_m
- Set aside a state that keeps track of which axi_s[] was the owner of the bus before
- For each new request, select the 'other, next' device from axi_s[] as the current accessor, and go to a 'granted' state
- Only exit the granted state if the 'complete' signal from before was high (i.e. transaction complete)
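
The steps above can be tried out as a small cycle-based model in C. This is a sketch of the idea rather than the actual tinysys arbiter: the state names, the arb_step function, and the choice to consider client 0 first after reset are assumptions, and the signal copying between axi_m and axi_s[] is left out.

```c
#include <stdbool.h>

typedef enum { ARB_IDLE, ARB_GRANTED } arb_state;

typedef struct {
    arb_state state;
    int       owner;   /* client that owns (or last owned) the bus: 0 or 1 */
} rr_arbiter;

static void arb_init(rr_arbiter *a)
{
    a->state = ARB_IDLE;
    a->owner = 1;      /* assumption: client 0 is considered first after reset */
}

/* Advance one clock. rreq[i] mirrors axi_s[i].arvalid and rreqcomplete
   mirrors axi_m.rvalid && axi_m.rlast.
   Returns the client that holds the bus this cycle, or -1 if nobody does. */
static int arb_step(rr_arbiter *a, const bool rreq[2], bool rreqcomplete)
{
    if (a->state == ARB_GRANTED) {
        if (rreqcomplete)
            a->state = ARB_IDLE;   /* transaction done, release next cycle */
        return a->owner;
    }
    /* Idle: prefer the client that did not own the bus last time. */
    int next = a->owner ^ 1;
    if (!rreq[next])
        next = a->owner;           /* other client is quiet, fall back */
    if (rreq[next]) {
        a->owner = next;
        a->state = ARB_GRANTED;
        return next;
    }
    return -1;
}
```

With both clients requesting continuously, the grants alternate 0, 1, 0, 1 as each transaction completes. If only one client is requesting, it is granted again right away instead of waiting for the idle one.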

For the full source code see this file on the git repo for a 2->1 arbiter example. There is an 8->1 arbiter as well in this file, which uses essentially the same method but for a larger number of clients.

Peripherals

We mentioned above that the peripheral address space is uncached. That means if a value appears at the read port, and if we don’t read it in time, it’ll be lost to us. Of course, there is a way around this: we use a FIFO.

A first-in-first-out queue is fairly easy to implement in hardware, where a device on one clock domain can push some data for later retrieval by a device on another clock domain. This makes life easier when talking to devices such as the SD Card SPI interface, by taking care of the timing on the interface side. From its point of view, the user code on the CPU end simply writes to a memory address, but this address is intercepted and routed to a FIFO which stores the data at that time. Later on, when the device can read it and the timing is right, it’ll pop one value from the FIFO and process it either as a command or a data word.

Having a FIFO between modules is therefore one way to cope with asynchronous data flow, which looks like a memory access on the CPU side.
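
To make the push and pop behavior concrete, here is a small software model of such a queue in C. It is a single-threaded sketch only: a real hardware FIFO that crosses clock domains needs extra synchronization logic that is not modeled here, and the depth of 16 entries is an assumed value, not the one tinysys uses.

```c
#include <stdint.h>
#include <stdbool.h>

#define FIFO_DEPTH 16u   /* assumed depth, must be a power of two */

typedef struct {
    uint32_t data[FIFO_DEPTH];
    uint32_t head;   /* total number of words pushed so far */
    uint32_t tail;   /* total number of words popped so far */
} fifo;

static bool fifo_empty(const fifo *f) { return f->head == f->tail; }
static bool fifo_full(const fifo *f)  { return f->head - f->tail == FIFO_DEPTH; }

/* Producer side, e.g. the CPU's memory write. Returns false when full. */
static bool fifo_push(fifo *f, uint32_t value)
{
    if (fifo_full(f))
        return false;
    f->data[f->head % FIFO_DEPTH] = value;
    f->head++;
    return true;
}

/* Consumer side, e.g. the device once its timing allows.
   Returns false when there is nothing to pop. */
static bool fifo_pop(fifo *f, uint32_t *value)
{
    if (fifo_empty(f))
        return false;
    *value = f->data[f->tail % FIFO_DEPTH];
    f->tail++;
    return true;
}
```

Words come out in the order they went in, and the producer gets a clear signal to back off when the queue is full.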

There are no streaming devices on tinysys, but for a second let’s assume we had a camera that has to deliver a large amount of data for us, for which a FIFO would probably be impractical. In such a scenario, we would still push commands to the device over a FIFO, but we’d then attach a DMA module to the device which streams its large data directly into the SDRAM. That of course means this device now also becomes a customer of our memory arbiter as described above, probably with some special high priority access rather than round-robin to guarantee bandwidth. Luckily, for now, we do not have to care about this on tinysys.

Peripheral interrupts

The CSR modules for each CPU for tinysys are equipped with a mechanism to let the CPU fetch unit know if an interrupt is pending, coming from one of the devices. One such device is the SD Card reader, and we specifically want to know when the user inserts or removes the card by probing one of its pins.

As described above, this is done by using a FIFO. There’s a little circuit on the FPGA which constantly monitors whether the SD Card pin corresponding to card detect has changed state, and if so, pushes the new state to a FIFO. As long as there are any values in the FIFO, we hold an IRQ line high.

This causes the fetch unit to halt and jump to an ISR, in which we can read some bits from a special CSR to detect which device triggered the interrupt. After that it’s the ISR’s responsibility to repeatedly read from the address mapped to the FIFO where we pushed this hardware state change, and process every queued state change before it returns.

Once the FIFO is clear, the ISR returns, and code execution resumes from where it left off. This way, we can handle an SD Card insertion or a USB host interface interrupt with ease.
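
The drain-until-empty pattern can be sketched in C as follows. To keep the example runnable on a host machine, the card-detect FIFO is replaced by a small simulated structure. The source bit IRQ_SOURCE_SDCARD, the function names, and the encoding of 1 for inserted and 0 for removed are all assumptions for illustration, not the actual tinysys definitions.

```c
#include <stdint.h>
#include <stdbool.h>

/* Host-side stand-in for the card-detect FIFO. On hardware, reading an
   event would be a load from the address mapped to the FIFO. */
typedef struct {
    uint32_t events[8];
    uint32_t count;      /* events still queued; IRQ stays high while > 0 */
    uint32_t read_pos;
} sim_event_fifo;

#define IRQ_SOURCE_SDCARD 0x1u   /* hypothetical bit in the 'which device' CSR */

static bool sim_irq_line(const sim_event_fifo *f) { return f->count > 0; }

static uint32_t sim_fifo_read(sim_event_fifo *f)
{
    uint32_t v = f->events[f->read_pos++];
    f->count--;
    return v;
}

/* Process every queued state change before returning.
   Returns the most recent card state (unchanged if nothing was queued). */
static uint32_t handle_interrupt(uint32_t irq_sources, sim_event_fifo *f,
                                 uint32_t card_state)
{
    if (irq_sources & IRQ_SOURCE_SDCARD) {
        while (sim_irq_line(f))
            card_state = sim_fifo_read(f);  /* assumed: 1 = inserted, 0 = removed */
    }
    return card_state;
}
```

Because the handler only returns once the queue is empty, the IRQ line is low again by the time normal execution resumes.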

What’s next?

Next time let’s have a change of pace and look at the video scan-out logic, and this time read some SystemVerilog code for fun. Until then, you’re granted memory access.