RISC-V CPU & CUSTOM GPU ON AN FPGA PART 8 – ALU AND REGISTER FILE

Before we begin

In this part we’ll talk about the ALU and the register file, and about why we delay writes to the register file.

But before we go further: the files are getting too large to post here, so we need to grab them from a git repository. First create and change to a folder of your preference with sufficient free disk space, since this is where we’ll be working on all parts of the series from now on. Then head over to https://github.com/ecilasun/nekoichiarticle and use the following command to pull the files for all parts of the series:

git clone https://github.com/ecilasun/nekoichiarticle.git

NOTE: Starting in this part, the project gains a neatly arranged include file, cpuops.vh, which contains all the instruction groups and CPU states, so we can refer to them from the decoder and the CPU itself without having to pollute our code with inline defines.
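To give an idea of what such an include file can look like, here is a minimal sketch. It is not the actual contents of cpuops.vh in the repository: all macro names below are made up for illustration. The state values follow the one-hot numbers quoted later in this part (FETCH=1, DECODE=2, EXEC=4, RETIRE=8), and the opcode values are the standard RV32I base opcodes found in bits [6:0] of an instruction.

```systemverilog
// Hypothetical sketch of an include file in the spirit of cpuops.vh.
// Macro names are assumptions; only the numeric values are grounded.

// CPU states, one-hot encoded
`define CPUFETCH        4'b0001 // 1
`define CPUDECODE       4'b0010 // 2
`define CPUEXEC         4'b0100 // 4
`define CPURETIRE       4'b1000 // 8

// A few RV32I instruction groups, selected by instruction[6:0]
`define OPCODE_OP       7'b0110011 // register-register ALU ops
`define OPCODE_OP_IMM   7'b0010011 // register-immediate ALU ops
`define OPCODE_LOAD     7'b0000011 // memory loads
`define OPCODE_STORE    7'b0100011 // memory stores
`define OPCODE_BRANCH   7'b1100011 // conditional branches
```

Any module that needs these values can then pull them in with a single `include line at the top of the file, the same way ALU.sv includes aluops.vh.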

The Register File

As we have seen in part 3, the CPU requires proper reset/clock signals to start up and step properly, and a register file is the key to tracking the state of all programs and moving data around.

If you’ve cloned the latest GitHub repo linked at the top of this part, you’ll have a folder named ‘part8’ with the Vivado project inside. Please go ahead and open it to follow along.

The register file, registerfile.sv, has been added to this new project folder, alongside two other files: ALU.sv and aluops.vh.

The register file has the same combinatorial + clocked design we were talking about in part 3, but now we get to see why this is so.

Let’s take a look at the first instruction in our BIOS.coe file, and see what it’s doing. It starts from the first yellow vertical line on the left and ends with the red vertical arrow on the right, taking 5 clocks. (Ordinarily, all other instructions take 4 clocks; the first instruction is an exception.)

The area marked with the red box contains the states of our CPU during startup and execution of the first instruction. We initially go to the RETIRE(8) state, as you may recall, so that we can get the memory pointers set up from the current PC, which should not be offset initially. This is the cpustate 8 shown at the leftmost side of the red box.

After we set up the initial memory pointer for the first instruction in our RETIRE(8) state, we go to FETCH(1), which is used to wait for memory reads (due to the 1 cycle latency of block RAM), then go to DECODE(2).

During DECODE(2), we feed the newly read instruction into the decoder unit. The decoder unit is combinatorial, so it will pick up the changes in its input and generate the ALU operation, the source/destination registers, and any write flags required. The register indices feed into the ALU module, which is also combinatorial, while also triggering reads of the two source registers from the register file, using the always statements at the end. All of these operations are clockless up to this point, and results will be visible inside the same clock that triggered them.

The register file’s outputs are then fed back into the ALU, and together with the ALU op, the ALU knows what to do with them, such as add, subtract, shift, etc. We’re still in the combinatorial world, so the ALU output is available somewhere within the same clock that triggered all of these operations.
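The combinatorial read / clocked write arrangement described above can be sketched roughly as follows. This is an illustration only, not the registerfile.sv from the repository: the module name and all port names here are assumptions, and the only behavior taken from the text is that reads are clockless while writes happen on a clock edge under a write enable.

```systemverilog
// Hypothetical register file sketch; module and port names are assumptions.
module registerfile_sketch(
	input wire clock,
	input wire wren,             // write enable, already delayed by the CPU
	input wire [4:0] rs1,        // source register 1 index
	input wire [4:0] rs2,        // source register 2 index
	input wire [4:0] rd,         // destination register index
	input wire [31:0] datain,    // value to write into rd
	output logic [31:0] rval1,   // current contents of rs1
	output logic [31:0] rval2 ); // current contents of rs2

logic [31:0] registers[0:31];

// Start from a known state in simulation and on the FPGA
initial begin
	for (int i = 0; i < 32; i++)
		registers[i] = 32'd0;
end

// Clocked write: the new value only becomes visible after this clock edge.
// Writes to x0 are ignored, since RISC-V defines x0 as always zero.
always @(posedge clock) begin
	if (wren && rd != 5'd0)
		registers[rd] <= datain;
end

// Combinatorial reads: outputs follow rs1/rs2 within the same clock
always_comb begin
	rval1 = registers[rs1];
	rval2 = registers[rs2];
end

endmodule
```

Because the reads are combinatorial, changing rs1 or rs2 updates rval1 and rval2 without waiting for a clock, which is what lets the decoder, register file and ALU all settle inside a single cycle.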

Delaying the register writes

The red arrows pointing down in the above image show where we let the ALU write its output back into the register file. The reason for this delayed write is so that we don’t destroy the register contents while we are still reading from them during the combinatorial step that generates the ALU inputs.

The register write enable state is initially set up by the decoder, but is delayed using a temporary register (registerwriteenable), which is set to the value of wren from the decoder, and then immediately turned off on the next clock to stop any further accidental writes. We always drive the register file’s write enable control from this temporary register so we can turn it off independently of the operation of the ALU.

As you might recall, results of clocked circuits will be visible on the next clock, therefore the RETIRE(8) stage is where we turn off register writes after they’ve been triggered in the EXEC(4) state. The results are only available on or after the next FETCH(1) state, after the write has been disabled in RETIRE(8).

In short, if a register file does both reads and writes, we almost always want the reads to execute first and the writes to happen after the read values are available, which is what the delayed write enable provides.
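The delayed write enable described in this section can be sketched as a small state machine. This is a simplified illustration under stated assumptions, not the CPU code from the repository: only the names registerwriteenable and wren and the state numbers come from the text, while the module name, the reset behavior and everything else are made up for the example.

```systemverilog
// Hypothetical sketch of the delayed register write enable.
module writedelay_sketch(
	input wire clock,
	input wire reset,
	input wire wren,                    // write request from the decoder
	output logic registerwriteenable ); // drives the register file's write enable

// One-hot state values as quoted in the text
localparam FETCH  = 4'd1;
localparam DECODE = 4'd2;
localparam EXEC   = 4'd4;
localparam RETIRE = 4'd8;

logic [3:0] cpustate;

always @(posedge clock) begin
	if (reset) begin
		cpustate <= RETIRE;            // we start in RETIRE to set up the memory pointer
		registerwriteenable <= 1'b0;   // no writes while in reset
	end else begin
		case (cpustate)
			FETCH:  cpustate <= DECODE; // wait out the block RAM read latency
			DECODE: cpustate <= EXEC;   // decoder/register file/ALU settle combinatorially
			EXEC: begin
				// Latch the decoder's request; the register file sees it one clock later
				registerwriteenable <= wren;
				cpustate <= RETIRE;
			end
			RETIRE: begin
				// Turn the enable back off so no further writes can sneak in
				registerwriteenable <= 1'b0;
				cpustate <= FETCH;
			end
			default: cpustate <= RETIRE;
		endcase
	end
end

endmodule
```

With this arrangement the enable is high only while the CPU sits in RETIRE, so the register file captures the ALU result on the clock edge that ends RETIRE, and the new value can be read from the following FETCH onwards.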

The RTL Diagram

We can show this as a diagram using RTL (Register Transfer Level) ANALYSIS / Open Elaborated Design in the Flow Navigator panel in Vivado:

At the outer scope, our CPU is connected to the memory device via some form of data bus and read/write control, and drives one diagnostic LED, while being fed a clock and a reset line.

In a deeper scope, we can see that the CPU has a decoder (leftmost light blue box), receiving an instruction and feeding it into both the ALU and the register file (the rightmost blue boxes). Note that for now, no output from the ALU is used since we haven’t implemented our code execution path yet, but the interconnections between these units already do the core of the decoding, register file access and logic ops as they are.

The diagram comes in handy if you need to see whether a device has actually been instantiated and whether it has all of its inputs and outputs connected, so it serves as a diagnostic tool in our case.

The ALU

Our current ALU is pretty simple:

`timescale 1ns / 1ps

`include "aluops.vh"

module ALU(
	output logic [31:0] aluout,
	input wire [2:0] func3,
	input wire [31:0] val1,
	input wire [31:0] val2,
	input wire [4:0] aluop );

// Integer ALU
always_comb begin

	unique case (aluop)
		// Integer ops
		`ALU_ADD:  begin aluout = val1 + val2; end
		`ALU_SUB:  begin aluout = val1 + (~val2 + 32'd1); end
		`ALU_SLL:  begin aluout = val1 << val2[4:0]; end
		`ALU_SLT:  begin aluout = $signed(val1) < $signed(val2) ? 32'd1 : 32'd0; end
		`ALU_SLTU: begin aluout = val1 < val2 ? 32'd1 : 32'd0; end
		`ALU_XOR:  begin aluout = val1 ^ val2; end
		`ALU_SRL:  begin aluout = val1 >> val2[4:0]; end
		`ALU_SRA:  begin aluout = $signed(val1) >>> val2[4:0]; end
		`ALU_OR:   begin aluout = val1 | val2; end
		`ALU_AND:  begin aluout = val1 & val2; end
		default:   begin aluout = 32'd0; end
	endcase

end

endmodule

It is a single, combinatorial unit that will drive the aluout line with a result based on the aluop. Think of this as a giant multiplexer, as you can see from the diagram below. The ALU calculates everything at all times, but selects only the requested result as output. This way of thinking may be counterintuitive to software developers, but in hardware having excess units doing parallel operations is very common (at least in the FPGA world) if one doesn’t really care about deep power analysis.

Assume for a second we really did shut off the unused units that feed the wide MUX unit on the right hand side. This would not only add extra wires and logic to the design, but the resources freed from the idle units would not be re-usable in that clock cycle anyway. Therefore, instead of wasting logic, in this design I chose to keep the operations simultaneous, and use multiplexers to pick the desired output. If you truly wish to optimize these out, I suggest you wait until your design is complete; the synthesis tools contain power optimization passes, which might help more once you have more logic in your design.
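If you’d like to watch the ALU pick different results from the same inputs, a small simulation testbench along these lines can be added as a simulation source. This is a minimal sketch: the module name alu_tb_sketch and the test values are made up for illustration, and it assumes ALU.sv and aluops.vh from the part8 project are visible to the simulator. The func3 input is simply tied to zero, since the ALU code above does not use it.

```systemverilog
`timescale 1ns / 1ps

`include "aluops.vh"

// Hypothetical testbench sketch for the ALU module listed above
module alu_tb_sketch;

logic [31:0] val1;
logic [31:0] val2;
logic [4:0] aluop;
wire [31:0] aluout;

ALU dut(
	.aluout(aluout),
	.func3(3'b000),
	.val1(val1),
	.val2(val2),
	.aluop(aluop) );

initial begin
	// Same inputs, different op: only the selected result reaches aluout
	val1 = 32'd7;
	val2 = 32'd5;
	aluop = `ALU_ADD;  #1 $display("ADD  -> %0d", aluout); // 7 + 5 = 12
	aluop = `ALU_SUB;  #1 $display("SUB  -> %0d", aluout); // 7 - 5 = 2

	// 32'hFFFFFFFF is -1 when signed, and the largest value when unsigned
	val1 = 32'hFFFFFFFF;
	val2 = 32'd1;
	aluop = `ALU_SLT;  #1 $display("SLT  -> %0d", aluout); // -1 < 1, so 1
	aluop = `ALU_SLTU; #1 $display("SLTU -> %0d", aluout); // 4294967295 < 1 is false, so 0

	$finish;
end

endmodule
```

Note that there is no clock anywhere in this testbench. The small #1 delays only give the simulator a chance to re-evaluate the combinatorial logic before each value is printed.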

The $signed() / $unsigned() syntax

You may have noticed that I’ve used the $signed() system function in the ALU code. Ordinarily, a register declared as ‘logic [31:0] uvar;’ is unsigned and won’t use any signed math. To force signed math, signed comparisons and arithmetic shifts, we need the value to be interpreted as a two’s complement number, which is done by wrapping the unsigned register as $signed(uvar).

The opposite is also possible. If you have declared your register as ‘logic signed [31:0] svar;’ and wish to do some unsigned math with it, it is possible to wrap it with a $unsigned(svar) and the value will be treated as unsigned.
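Here is a small simulation-only sketch of the difference. The module name and the test values are made up for illustration. One thing worth calling out: in Verilog/SystemVerilog an expression is only evaluated as signed if all of its operands are signed, which is why the ALU wraps both val1 and val2 in $signed() for the SLT comparison.

```systemverilog
// Hypothetical demo module; names and values are for illustration only
module signed_demo_sketch;

logic [31:0] uvar;        // unsigned by default
logic signed [31:0] svar; // explicitly declared signed

initial begin
	uvar = 32'hFFFFFFF0;  // 4294967280 as unsigned, -16 as two's complement
	svar = -32'sd16;      // same bit pattern, but declared signed

	// Comparisons
	$display("%0d", uvar < 32'd1);                   // 0: unsigned compare
	$display("%0d", $signed(uvar) < $signed(32'd1)); // 1: -16 < 1
	$display("%0d", $signed(uvar) < 32'd1);          // 0: one unsigned operand makes it unsigned

	// Shifts: >>> only replicates the sign bit when the left operand is signed
	$display("%h", uvar >>> 2);                      // 3ffffffc
	$display("%h", $signed(uvar) >>> 2);             // fffffffc
	$display("%h", svar >>> 2);                      // fffffffc
	$display("%h", $unsigned(svar) >>> 2);           // 3ffffffc

	$finish;
end

endmodule
```

The third comparison is the classic trap: wrapping only one side in $signed() silently falls back to an unsigned comparison.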

Multiplication / Division

Currently, as we’re developing our ALU, we’re leaving out the integer multiplication/division units as they’re part of the M extension (please refer to the RISC-V ISA manual for details on extensions).

This is also because I don’t want to complicate the initial design with wait states and multi-cycle circuits, so for the time being we’ll have the ALU do only the simple math that can be completed in one clock cycle.

This concludes part 8. In the next part we’ll extend our BIOS with more instructions, implement some new instructions, and add a branch-ALU unit that can decide whether to take a branch or not. We will also start switching to an MMCM (Mixed-Mode Clock Manager) generated clock instead of the FPGA clock pin, so our reset might work better at startup.