RISC-V CPU & CUSTOM GPU ON AN FPGA PART 10 – MEMORY MAPPED DEVICES, CLOCK DOMAINS, DEVICE ROUTER AND THE UART
BEFORE WE BEGIN
In this part we’ll talk about crossing clock domains, our device router and the UART.
But before we go further: the files are getting too large to post here, so from now on we’ll grab them from a git repository at https://github.com/ecilasun/nekoichiarticle. Create and change to a folder of your preference with sufficient free disk space, then use the following command to pull the files for all parts of the series:
git clone https://github.com/ecilasun/nekoichiarticle.git
This folder is where we’ll be working on all parts of the series from now on.
NOTE: Starting with this part, the project gains a nicely arranged include file, cpuops.vh, which contains all the instruction groups and CPU states, so we can refer to them from the decoder and the CPU itself without having to pollute our code with inline defines.
Device clock domains
As we evolve our RISC-V+GPU SoC (system-on-chip), we’ll need to add a variety of devices which don’t necessarily tick at the same rate as our CPU.
Initially, we gave our CPU a 100 MHz clock. In this article, we’ll be adding a UART running at 115200 baud, which is incredibly slow compared to our CPU. Imagine other future devices, such as an SD card reader or our GPU, each running at very different speeds, and we have a little problem on our hands.
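To get a feel for how large that gap is, here is a rough back-of-the-envelope calculation in C. It uses the clock and baud figures from above, but the 10-bits-per-byte frame (1 start, 8 data, 1 stop) is an assumption on my part, since the frame format isn’t stated here, and the function names are made up for this sketch.

```c
#include <stdint.h>

/* Clock and baud figures quoted in the article. */
#define CPU_CLOCK_HZ   100000000u
#define UART_BAUD      115200u

/* ASSUMPTION: 8N1 framing (1 start + 8 data + 1 stop bit).
   The article does not state the frame format. */
#define BITS_PER_FRAME 10u

/* CPU clock cycles that pass while one byte is on the wire (rounded down). */
uint32_t cpu_cycles_per_byte(void)
{
    return (uint32_t)(((uint64_t)CPU_CLOCK_HZ * BITS_PER_FRAME) / UART_BAUD);
}

/* Best-case number of bytes the UART can move per second. */
uint32_t uart_bytes_per_second(void)
{
    return UART_BAUD / BITS_PER_FRAME;
}
```

Under that framing assumption, the CPU gets through several thousand clock cycles in the time it takes a single byte to leave the UART, which is why we don’t want it waiting around for every write.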
Let’s assume for a second that we wish to write a byte out to the UART Tx port. This operation, from the CPU’s perspective, should not stall execution unless it needs to. If the device we’re writing to is slow, we need to not only store this data somewhere until it’s consumed, but also make sure we don’t stomp on it before the target device reads it. Therefore, for our purposes, we need some form of a queue that can read and write at different clock speeds.
The easiest way is to use a FIFO (first-in-first-out) queue. Some devices carry a hardware FIFO internally, while others implement one using internal block RAM and a handful of logic units. But we don’t need just any FIFO; we need one with two ports, one to push data in and one to read from, each running at a different clock speed. In our case, the write port needs to run at the 100 MHz CPU clock, whereas the UART device reads using its own base clock running at 10 MHz.
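If the FIFO idea is new to you, the following is a minimal single-threaded C model of its behavior: bytes come out in the order they went in, pushes are refused when the queue is full, and pops are refused when it is empty. This is only a sketch of the queue semantics. It does not model the two independent clocks at all, which is the hard part that the Vivado IP takes care of for us. The type and function names are hypothetical, and the depth of 1024 matches the queue size used for the UART in this part.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH 1024u  /* same depth as the UART queue in this part */

typedef struct {
    uint8_t  data[FIFO_DEPTH];
    uint32_t head;   /* next slot to write (producer side) */
    uint32_t tail;   /* next slot to read (consumer side)  */
    uint32_t count;  /* number of bytes currently stored   */
} byte_fifo;

void fifo_init(byte_fifo *f)
{
    assert(f != NULL);
    f->head = 0;
    f->tail = 0;
    f->count = 0;
}

bool fifo_full(const byte_fifo *f)  { return f->count == FIFO_DEPTH; }
bool fifo_empty(const byte_fifo *f) { return f->count == 0; }

/* Returns false and stores nothing when the queue is full,
   so the writer knows it has to wait and retry. */
bool fifo_push(byte_fifo *f, uint8_t value)
{
    if (fifo_full(f))
        return false;
    f->data[f->head] = value;
    f->head = (f->head + 1u) % FIFO_DEPTH;
    f->count++;
    return true;
}

/* Returns false and leaves *out untouched when the queue is empty. */
bool fifo_pop(byte_fifo *f, uint8_t *out)
{
    if (fifo_empty(f))
        return false;
    *out = f->data[f->tail];
    f->tail = (f->tail + 1u) % FIFO_DEPTH;
    f->count--;
    return true;
}
```

In the hardware version, the full and empty flags play the role of the two false return values here: one tells the writer to hold off, the other tells the reader there is nothing to fetch yet.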
For NekoIchi, we’ve so far been using Vivado’s built-in IPs instead of writing them ourselves to save time, so we’ll do the same with the UART FIFO. The IP variant we’re interested in is called ‘independent clock block RAM’, and two instances, one for the Tx (transmit) and one for the Rx (receive) hardware, have already been added to the part10 project file.
The way it’s set up for Tx is as follows:
// Transmitter (CPU -> FIFO -> Tx)
wire [9:0] outfifodatacount;
wire [7:0] outfifoout;
wire uarttxbusy, outfifofull, outfifoempty, outfifovalid;
logic [7:0] datatotransmit = 8'h00;
logic [7:0] outfifoin; // This will create a latch since it keeps its value
logic transmitbyte = 1'b0;
logic txstate = 1'b0;
logic outuartfifowe = 1'b0;
logic outfifore = 1'b0;

async_transmitter UART_transmit(
    .clk(uartbase),
    .TxD_start(transmitbyte),
    .TxD_data(datatotransmit),
    .TxD(uart_rxd_out),
    .TxD_busy(uarttxbusy) );

// Output FIFO
uartfifo UART_out_fifo(
    // In
    .full(outfifofull),
    .din(outfifoin),        // Data latched from CPU
    .wr_en(outuartfifowe),  // CPU controls write, high for one clock
    // Out
    .empty(outfifoempty),   // Nothing to read
    .dout(outfifoout),      // To transmitter
    .rd_en(outfifore),      // Transmitter can send
    .wr_clk(cpuclock),      // CPU write clock
    .rd_clk(uartbase),      // Transmitter clock runs much slower
    .valid(outfifovalid),   // Read result valid
    // Ctl
    .rst(reset_p),
    .rd_data_count(outfifodatacount) );

// Fifo output serializer
always @(posedge(uartbase)) begin
    if (txstate == 1'b0) begin // IDLE_STATE
        if (~uarttxbusy & (transmitbyte == 1'b0)) begin // Safe to attempt send, UART not busy or triggered
            if (~outfifoempty) begin // Something in FIFO? Trigger read and go to transmit
                outfifore <= 1'b1;
                txstate <= 1'b1;
            end else begin
                outfifore <= 1'b0;
                txstate <= 1'b0; // Stay in idle state
            end
        end else begin // Transmit hardware busy or we kicked a transmit (should end next clock)
            outfifore <= 1'b0;
            txstate <= 1'b0; // Stay in idle state
        end
        transmitbyte <= 1'b0;
    end else begin // TRANSMIT_STATE
        outfifore <= 1'b0; // Stop read request
        if (outfifovalid) begin // Kick send and go to idle
            datatotransmit <= outfifoout;
            transmitbyte <= 1'b1;
            txstate <= 1'b0;
        end else begin
            txstate <= 1'b1; // Stay in transmit state and wait for valid fifo data
        end
    end
end
The main idea behind this setup is that we have an always block that ticks at the UART base clock rate (10 MHz). If the transmit state is idle, this block checks that the UART device is not busy and that we have not just triggered a transmit. If the queue to transmit from is also not empty, it starts a queue read and moves into the transmit state. In this state, it waits for the FIFO data to become available (the valid flag); once it is, it hands the byte to the transmit device, raises the transmit trigger, and switches back to the idle state.
When the always block loops back into the idle state, the transmitter is now busy, so no further reads from the queue are started until it finishes. There is also a second mechanism around this queue that stalls write attempts from the CPU side if the queue is full (it’s 1024 bytes long), which is infrequent but does happen. Increasing the queue size as far as device resources allow will let us queue up more bytes while the CPU is freed to do other tasks.
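If you’d like to step through the serializer by hand, here is a small C model of the always block above, where one call to the tick function stands for one rising edge of the UART base clock. The register and signal names are taken from the listing; the struct and function names are hypothetical. To mimic non-blocking assignments, every decision is made from the current register values and the results only show up in the returned copy.

```c
#include <stdbool.h>
#include <stdint.h>

/* Registers written by the always block. */
typedef struct {
    bool    txstate;        /* false = IDLE, true = TRANSMIT */
    bool    outfifore;      /* FIFO read enable              */
    bool    transmitbyte;   /* start pulse for the UART      */
    uint8_t datatotransmit; /* byte handed to the UART       */
} tx_regs;

/* Signals the block only reads. */
typedef struct {
    bool    uarttxbusy;
    bool    outfifoempty;
    bool    outfifovalid;
    uint8_t outfifoout;
} tx_inputs;

/* One clock edge: reads 'cur' and 'in', returns the next register values. */
tx_regs tx_tick(tx_regs cur, tx_inputs in)
{
    tx_regs next = cur;

    if (!cur.txstate) {
        /* IDLE: only start a FIFO read when the UART is free, we did not
           just trigger a send, and there is something queued up. */
        bool can_send   = !in.uarttxbusy && !cur.transmitbyte;
        bool start_read = can_send && !in.outfifoempty;
        next.outfifore    = start_read;
        next.txstate      = start_read;
        next.transmitbyte = false;
    } else {
        /* TRANSMIT: drop the read request and wait for valid data. */
        next.outfifore = false;
        if (in.outfifovalid) {
            next.datatotransmit = in.outfifoout;
            next.transmitbyte   = true;
            next.txstate        = false;
        }
    }
    return next;
}
```

Feeding it a non-empty FIFO and an idle UART, the first tick raises the read enable, a later tick with valid data raises the transmit trigger, and the tick after that clears the trigger again without starting another read.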
Memory mapped devices
A memory mapped device, as the name suggests, is any device apart from the main system RAM, that responds to reads and writes to a certain memory address. The address is reserved for this device only, and won’t have any effect on the actual memory storage.
Memory mapped devices are often used for convenience: nothing in the instruction set or the CPU itself has to change, which makes design easier (as long as the CPU has some way of knowing whether the device is busy). To the CPU, accesses to a memory mapped device look like reading from or writing to memory. This allows us to reach a variety of devices across many clock domains with a single-bit busy signal, letting the CPU access the bus for reads or writes depending on the device status.
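From the software side, this means talking to a device is just a store through a pointer. The sketch below shows the idea in C; the function name is made up for illustration, and taking the port as a parameter is only there so the same code can be tried on a PC by pointing it at an ordinary variable. On NekoIchi itself, the pointer would be the UART write port address from the memory map listed in this part.

```c
#include <stdint.h>

/* Writes each character of 'text' to a memory mapped byte port.
   'volatile' tells the compiler that every store has a side effect
   and must not be merged or optimized away.

   On the target this might be called as (address from the memory map):
       uart_write_string((volatile uint8_t *)0x8000000C, "hello");
   No polling loop is needed here, since the hardware stalls the CPU
   when the device cannot accept the write. */
void uart_write_string(volatile uint8_t *write_port, const char *text)
{
    while (*text != '\0') {
        *write_port = (uint8_t)*text;
        text++;
    }
}
```

Notice that nothing in this code is UART specific apart from the address we pass in, which is exactly the convenience of memory mapping.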
Current planned memory mapping for NekoIchi is listed below, for current and future devices:
0x00000000 - 0x0003FFFF System RAM (main physical memory)
0x00040000 - 0x7FFFFFFF Reserved
0x8000001C Reserved
0x80000018 Reserved
0x80000014 Reserved
0x80000010 Reserved
0x8000000C UART write port
0x80000008 UART read port
0x80000004 UART incoming queue byte count
0x80000000 Reserved
The addresses marked as Reserved will be occupied with more devices as we progress with this series.
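To make the map above a bit more concrete, here is a small C sketch of the kind of decision a router has to make for every access: given an address, which target does it belong to? The addresses come straight from the list above, while the enum, the function name and the choice to ignore the lowest two address bits for device registers are assumptions made for this example, not a description of the actual devicerouter.sv logic.

```c
#include <stdint.h>

/* Addresses taken from the memory map in the article. */
#define SYSRAM_END       0x0003FFFFu
#define UART_BYTE_COUNT  0x80000004u
#define UART_READ_PORT   0x80000008u
#define UART_WRITE_PORT  0x8000000Cu

typedef enum {
    TARGET_SYSRAM,
    TARGET_UART_COUNT,
    TARGET_UART_READ,
    TARGET_UART_WRITE,
    TARGET_RESERVED
} bus_target;

bus_target decode_address(uint32_t addr)
{
    if (addr <= SYSRAM_END)
        return TARGET_SYSRAM;

    /* ASSUMPTION: device registers are matched per 32-bit word,
       so the two lowest address bits are ignored. */
    switch (addr & ~(uint32_t)3u) {
    case UART_BYTE_COUNT: return TARGET_UART_COUNT;
    case UART_READ_PORT:  return TARGET_UART_READ;
    case UART_WRITE_PORT: return TARGET_UART_WRITE;
    default:              return TARGET_RESERVED;
    }
}
```

In hardware, the same decision boils down to a handful of comparators on the address lines, each producing an ‘this access is for me’ flag for its device.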
Device Router
NekoIchi has a strange memory bus, which is not a real bus, called the Device Router. It does very little to control data traffic, and also acts like the top layer for all devices (apart from the CPU) attached to it.
This was a deliberate design choice, for reasons that will be explained in later parts, but for now it suffices to say that keeping all child devices inside the device router allows for some shortcuts to be taken without spending too much effort on the bus busy signal logic. Seeing all devices at the same level lets us simply OR the busy signals together instead of having to route more wires out and back into devices, and also lets us snoop the states of other devices with the same ease. For now, since we only have one device, it should not be a problem to use either layout.
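The ‘OR the busy signals together’ idea is as simple as it sounds. As a tiny C illustration (the function name and the array-of-flags layout are hypothetical, in hardware this is a single OR expression over the device wires):

```c
#include <stdbool.h>
#include <stddef.h>

/* The bus counts as busy if at least one attached device reports busy. */
bool bus_busy(const bool *device_busy, size_t device_count)
{
    bool busy = false;
    for (size_t i = 0; i < device_count; i++)
        busy = busy || device_busy[i];
    return busy;
}
```

With a single device this collapses to just that device’s busy flag, and each new device only adds one more term.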
As you will see in the devicerouter.sv file, there are some commented out blocks of code. These are some memory mapped device detection flags, and the bus data output logic. The bus data output logic is using a very simple version at the moment versus the complex, commented out version that utilizes SDCard and GPU access flags.
The most interesting part here is the always_comb block. It is a combinational logic block, as the keyword suggests. Its sole job, just for now, is to ensure the correct byte selection/alignment for the UART writes. In the future, it will do similar work for the SPI interface (for SDCard access) and handle GPU and other device read/write access flags.
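To picture what byte selection means, consider a 32-bit bus carrying a write that is only one byte wide: something has to pick the right 8 bits out of the word before they go into the 8-bit FIFO input. The C sketch below models one possible way of doing that. It assumes the CPU presents a 4-bit write mask with one bit per byte lane, which is an assumption made for this example and not necessarily how devicerouter.sv does it; the function name is hypothetical too.

```c
#include <stdint.h>

/* Picks the byte of 'write_word' selected by a per-byte write mask.
   ASSUMPTION: bit 0 of the mask selects bits 7:0, bit 1 selects 15:8,
   bit 2 selects 23:16 and bit 3 selects 31:24. Any mask other than
   0x2, 0x4 or 0x8 falls back to the lowest byte. */
uint8_t select_uart_byte(uint32_t write_word, uint8_t write_mask)
{
    switch (write_mask & 0xFu) {
    case 0x8u: return (uint8_t)(write_word >> 24);
    case 0x4u: return (uint8_t)(write_word >> 16);
    case 0x2u: return (uint8_t)(write_word >> 8);
    default:   return (uint8_t)write_word;
    }
}
```

Because the result depends only on the current inputs and nothing is stored between calls, this is exactly the kind of logic that belongs in a combinational block.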
About resource utilization and development goals
At this point, we have a usable CPU with a UART for outside world access, which is where most tutorials stop and start explaining embedded software development practices.
For NekoIchi, we will do that briefly, but not for too long, so we can focus on moving on to our other devices, most importantly the GPU. Also, while programming NekoIchi, we will try to steer far away from linker scripts and similar magic for the ‘real’ ELF executables, and only use them where it matters, namely for the BIOS itself. When the articles start talking about software, we’ll be gradually building a library of useful routines for ourselves to aid in development.
One of the design goals of NekoIchi is to be efficient, therefore some sacrifices are going to be made during this series. So far we’re using up only around 2% to 3% of the Arty A7-100’s device logic, but we are already consuming half the device block RAM for our system memory. In the future, adding DDR3 memory will free up some of these resources, but it will use up quite a bit of others, so the current resource usage is not a real indication of the final device’s utilization (and, in parallel, power usage).
Another point here is that the CPU is currently set to run at 100 MHz, and we’ll have to reduce this speed later on due to some complexity in the design. That’s nothing to worry about, since our GPU will run at a higher clock speed than our CPU to bring those fast triangles to the screen.
Running the first NekoIchi version
We can now try and load our device onto the board and see what we get. We’ll follow the same method as before when we built our first blinky code.
First, plug in your Arty A7-100 board using a USB cable to your computer. Also make sure that you’ve installed the USB drivers as shown in part 1 of this series.
Then, open up the part10 project file if you don’t have it open already. Now, either click on Generate Bitstream and select Yes to go through all dependent compile steps automatically, or run the Run Synthesis, Run Implementation and finally Generate Bitstream steps one by one.
While that’s cooking, we’ll need to install terminal software that can use the USB/UART connection on our PC. I prefer to use PuTTY for this purpose. The settings you need are Serial as the connection type, the name of the serial port the Arty board is connected to (for example, /dev/ttyUSB5), and 115200 baud as the speed.
Click Open, which should display an empty terminal window, and drag it to an area of the screen where you can see the output.
(If you do not know which port your Arty board is connected to, type ‘dmesg | grep ttyUSB’ and note the ttyUSB device name to use in the serial line entry above. For me this is ttyUSB5, as unplugging/replugging the board seems to slowly creep up the USB port index one by one. If this annoys you, check this link for a possible fixed mapping solution.)
Once these steps are complete, in Vivado, unfold the Open Hardware Manager link, click on Open Target and select Auto Connect. This will show your current FPGA device on the Arty board (xc7a100t).
Right click on the xc7a100t and select Program Device, navigate to the part10 project folder, where the bitstream file lives, for instance …/part10/nekotutorial.runs/impl_1/nekotop.bit
Click Program, and keep an eye on the terminal window. If you see the following output, congratulations, you now have a RISC-V CPU attached to your serial port, communicating with your PC. Also I should note here that since the BIOS is from the future, it will show a mismatching feature set for our currently limited RISC-V device. For reference, what we have so far is a slightly OK subset of the rv32i variant of a RISC-V. The M and F extensions will come later, so will the GPU.
This concludes part 10. In the next part, we’ll look at riscvtool, RISC-V compiler toolchain, and compile/upload a few samples that we can interact with on our brand new computer.