2021-05-27

RISC-V CPU & custom GPU on an FPGA Part 3 – Getting started: Clocks/Reset and Register File

Now it’s time for us to start looking at how to implement our hardware in SystemVerilog. If you’re unfamiliar with hardware description languages (HDLs), I suggest a quick detour to skim through the Wikipedia entry on the subject. In summary, this is the language we’ll be using to describe our hardware, mapping our control flow to the internals of our FPGA board.

Hardware and Control Flow

In the hardware world, we do not have a sequential control flow unless we implement one manually, as there’s no central logic that comes out of the box to do that. This means every bit of hardware is switching on and off in parallel with the rest based on some criteria, usually thousands (if not more) of them at a time. The repeated switching action of the circuit can be controlled in two ways: combinatorial and clocked.

Clocked circuits will take a clock signal, and based on when this clock ‘ticks’ and depending on which edge is chosen, will do a short operation, set some state for the next clock, and won’t get triggered until the next clock ‘edge’. If we think of the clock as a square wave signal (which at very high speeds, isn’t exactly that) then we can visualize the two possible clock edges as:

For very high speeds, the clock rising will take visible time, therefore the signal will start to look more like the following, with a noticeable rise and fall time:

(The images above link to the Clocks, Signals and Delays article on Alan Clements’ web site, where you can read about this subject and more in great detail.)

Unlike clocked circuits, which depend on a clock to trigger their operation, combinatorial circuits are driven by their inputs changing and are evaluated continuously, regardless of clock edges. Therefore a value at the output of such a circuit can be read within the same clock in which its inputs change, which will come in handy in our register file and rasterizer implementation for the GPU later on.

Another thing to note about clocked regions is that the parallel (non-blocking) assignments inside a clocked region become visible on the next clock; only their previous (pre-assignment) values are readable on the current clock.

In summary, clocked circuits operate in quantized steps, whereas combinatorial circuits change continuously, as an AND gate’s output does in relation to its inputs.
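To make the distinction concrete, here is a minimal sketch that computes the same AND both ways. The module and signal names are made up for illustration and are not part of NekoIchi.

```systemverilog
// Illustrative sketch only: module and signal names are hypothetical.
module clocked_vs_combinatorial(
	input wire clock,
	input wire a,
	input wire b,
	output wire and_now,		// Combinatorial: follows a & b as soon as they change
	output logic and_clocked );	// Clocked: only updated on the rising clock edge

// Combinatorial path: re-evaluated whenever a or b changes
assign and_now = a & b;

// Clocked path: samples a & b on each rising edge and holds the result
// until the next rising edge
always_ff @(posedge clock) begin
	and_clocked <= a & b;
end

endmodule
```

If a or b changes in the middle of a clock period, and_now follows right away, while and_clocked keeps its old value until the next rising edge.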

This will start to matter when we implement our first piece of logic, the register file, as we’ll see below.

Clock and Reset

Oftentimes in hardware, we wish to revert the hardware state, or hold execution until clocks are stable enough or some external device is initialized. This is usually accomplished by a reset line that prevents a circuit from executing if it cannot continue without other circuitry or clocks being ready.

The reset line is usually controlled by an external switch. It often has quite a bit of jitter and will toggle rapidly between zero and one while the button is being pressed if the vendor hasn’t put a Schmitt trigger on it. To prevent most of this in our design, we’ll merge the reset line with the ‘clock locked’ signal that comes out of our clock module, so we get some form of jitter protection. Not 100% ideal, but it works in practice, at least for our purposes.

For clocking our design, we’re using the Clock Wizard IP, so its internal details and how it works are not covered here. We’ll get to it at a later point, but it suffices to say that the clock IP receives an external reference clock and will generate most of the clocks we will need in our design. It does so by using an internal device in the FPGA (an MMCM for the Xilinx part we’re using) that can do clock phase shifts, jitter correction, clock division and multiplication, and buffering operations according to our desired setup. In the NekoIchi design, all clocks are generated using a ‘custom’ input and are sourced from a Global Buffer, so that we can feed the same reference clock to many clock generators. NekoIchi uses many physical MMCM devices so that the final placement of hardware doesn’t suffer from long clock paths, since it can put the clock closer to the devices it’s driving. If none of this makes sense yet, don’t worry, explanations will come eventually.

Together, the reset line and the ‘clock locked’ signal will allow us to hold the rest of the machine until the reset key is let go and the clock has stabilized. Here’s what that piece of logic might look like:

wire sysclock60;
wire clockLocked;

SystemClockGen SysClockUnit(
	.clk_in1(CLK_I), // External 100MHz base clock from FPGA pin E3
	.resetn(~RST_I), // Reset button from FPGA pin D9 (expected to be inverted)
	.sysclock60(sysclock60), // New output CPU clock at 60MHz
	.locked(clockLocked) ); // Clock ready (when high, i.e. 1); name must match the wire declared above, identifiers are case sensitive

wire reset_p = RST_I | (~clockLocked); // Positive reset signal (generated)
wire reset_n = (~RST_I) & clockLocked; // Negative reset signal (generated)

You may have noticed that we’ve generated two reset signals: reset_p and reset_n. Most logic will accept a negative reset signal (i.e. 0 means reset) whereas other logic will accept a positive reset signal (1 means reset), and depending on hardware we’ll need to feed the proper logic level in the future.
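Note that reset_n is exactly the inverse of reset_p, since ~(RST_I | ~clockLocked) equals (~RST_I) & clockLocked. As a rough sketch of how each polarity might be consumed, here is the same counter written both ways. The module and its counters are hypothetical and only serve as an illustration.

```systemverilog
// Hypothetical example: the same 8 bit counter written for each reset polarity.
module reset_polarity_demo(
	input wire clock,
	input wire reset_p,			// Active high reset: 1 means reset
	input wire reset_n,			// Active low reset: 0 means reset
	output logic [7:0] count_p,
	output logic [7:0] count_n );

// Held at zero while reset_p is high
always_ff @(posedge clock) begin
	if (reset_p)
		count_p <= 8'd0;
	else
		count_p <= count_p + 8'd1;
end

// Held at zero while reset_n is low
always_ff @(posedge clock) begin
	if (~reset_n)
		count_n <= 8'd0;
	else
		count_n <= count_n + 8'd1;
end

endmodule
```

When driven by the two signals generated above, both counters stay at zero while the button is pressed or the clock is not yet locked, and start counting together afterwards.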

Register File

As mentioned before, we’ll be needing a set of 32 registers, each 32 bits wide. In the software programming world, we’d be representing these values as:

uint32_t reg[32]; // Register array

which would make life easy when we try to access them with an index. To read and write these values, I could then simply do:

reg[SP] = 0x3FFF0; // Set stack pointer (old value==0x20000)
reg[T0] = reg[SP]; // Copy of modified stack pointer (0x3FFF0)

In hardware, there are two ways to run the above assignments. This is the ‘blocking’ way, where after SP is modified, we can assign the modified value to T0:

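// 'reg' is a reserved word in SystemVerilog, so the array is named 'registers' here
registers[SP] = 32'h3FFF0; // Set stack pointer (old value==32'h20000)
registers[T0] = registers[SP]; // Copy of modified stack pointer (32'h3FFF0)
// Afterwards: SP==32'h3FFF0, T0==32'h3FFF0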

But if we were to use non-blocking assignments instead (meaning all right-hand sides are evaluated first, then all left-hand sides are updated together), we’d be getting this result:

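// 'reg' is a reserved word in SystemVerilog, so the array is named 'registers' here
registers[SP] <= 32'h3FFF0; // Set stack pointer (old value==32'h20000)
registers[T0] <= registers[SP]; // Copy of previous stack pointer (32'h20000)
// Afterwards: SP==32'h3FFF0, T0==32'h20000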

The parallel assignment (<=) is usually preferred with clocked circuits, and the blocking assignment is ordinarily part of combinatorial circuits. For our register file implementation, we’ll use both.
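If you’d like to see the difference in a simulator, the following small testbench is one way to try it. It is only a sketch: the module and signal names are invented for this example, and plain ‘always’ blocks are used so that both assignment styles can sit side by side.

```systemverilog
// Simulation-only sketch: names are hypothetical.
module assignment_demo;

logic clock = 1'b0;
logic [31:0] sp_b = 32'h20000, t0_b = 32'h0;	// Pair updated with blocking assignments
logic [31:0] sp_nb = 32'h20000, t0_nb = 32'h0;	// Pair updated with non-blocking assignments

// Blocking: the second statement sees the value written by the first
always @(posedge clock) begin
	sp_b = 32'h3FFF0;
	t0_b = sp_b;
end

// Non-blocking: both right-hand sides are sampled before either update lands
always @(posedge clock) begin
	sp_nb <= 32'h3FFF0;
	t0_nb <= sp_nb;
end

initial begin
	#1 clock = 1'b1;	// Single rising edge
	#1;					// Let all updates settle
	// Blocking pair: T0 matches the new SP (32'h3FFF0)
	$display("blocking:     SP=%h T0=%h", sp_b, t0_b);
	// Non-blocking pair: T0 holds the old SP (32'h20000)
	$display("non-blocking: SP=%h T0=%h", sp_nb, t0_nb);
	$finish;
end

endmodule
```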

Now let’s see what a register file might look like as a hardware implementation:

module registerfile(
	input wire reset,			// Internal state resets when high
	input wire clock,			// Writes are clocked, reads are not
	input wire [4:0] rs1,		// Source register 1
	input wire [4:0] rs2,		// Source register 2
	input wire [4:0] rd,		// Destination register
	input wire wren,			// Write enable bit for writing to register rd 
	input wire [31:0] datain,	// Data to write to register rd
	output wire [31:0] rval1,	// Register values for rs1 and rs2
	output wire [31:0] rval2 );

logic [31:0] registers[0:31]; 

// Writes are clocked, and since writes happen at the end of the clock
// the new values are available on the 'next' clock.
always @(posedge clock or posedge reset) begin
	if (reset) begin
		// Zero register, hardwired to zero
		registers[0] <= 32'h00000000;
		// Default hard-coded stack pointer in case code doesn't set it up
		registers[2] <= 32'h0003FFF0;
	end else begin
		// Do not write over zero register when write enable is on
		if (wren && rd != 5'd0)
			registers[rd] <= datain;
	end
end

// Outputs are continuously assigned,
// therefore their values are available 'this' clock.
assign rval1 = registers[rs1];
assign rval2 = registers[rs2];

endmodule

Each ‘module’ we’ll encounter is a single unit of hardware, which can be instantiated for use later on, either from another module or from our ‘top’ module (which is the root of the hardware device). We can think of a module as a class that we can instantiate as many times as needed; each instance creates a new hardware unit, and the instances run in parallel and independently of each other.
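As a sketch of what instantiation looks like, the hypothetical wrapper below creates two independent copies of the registerfile module shown above. The wrapper, its port names and the instance names are assumptions made for this example; NekoIchi itself does not necessarily do this.

```systemverilog
// Hypothetical wrapper: two independent instances of the registerfile module.
module two_regfiles_demo(
	input wire clock,
	input wire reset,
	input wire [4:0] rs1, rs2, rd,
	input wire wren_a, wren_b,		// Separate write enables, one per instance
	input wire [31:0] datain,
	output wire [31:0] a_rval1, a_rval2,
	output wire [31:0] b_rval1, b_rval2 );

// First hardware copy
registerfile RegsA(
	.reset(reset), .clock(clock),
	.rs1(rs1), .rs2(rs2), .rd(rd),
	.wren(wren_a), .datain(datain),
	.rval1(a_rval1), .rval2(a_rval2) );

// Second hardware copy, with its own storage
registerfile RegsB(
	.reset(reset), .clock(clock),
	.rs1(rs1), .rs2(rs2), .rd(rd),
	.wren(wren_b), .datain(datain),
	.rval1(b_rval1), .rval2(b_rval2) );

endmodule
```

Each instance gets its own 32 registers, so a write through wren_a never changes what RegsB returns.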

The above module is named ‘registerfile’, as that’s what it is. We’ll see that the names of all modules in NekoIchi reflect what the module does, though they don’t follow a single pattern.

In SystemVerilog, ‘logic’ is used here to declare a register (storage), and the above syntax might need some explanation.

logic [highbit:lowbit] registers[lowindex:highindex];

Here the first array operator shows the bit length and bit ordering of one register entry, which is 32 bits wide (31..0), and the second array operator on the right shows the range of indices of the array made up of these groups of bits (therefore we have indices from 0 to 31, a total of 32).
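A few more declarations in the same notation may help. This is a simulation-only sketch, and every name other than ‘registers’ is invented for the example.

```systemverilog
// Simulation-only sketch: all names except 'registers' are hypothetical.
module array_declaration_demo;

logic [31:0] registers[0:31];	// 32 entries, each 32 bits wide (bits 31..0)
logic [7:0] bytes[0:3];			// 4 entries, each 8 bits wide
logic [12:0] oddwidth[0:2];		// 3 entries, each 13 bits wide; widths need not be powers of two

initial begin
	registers[2] = 32'h0003FFF0;
	bytes[0] = registers[2][7:0];		// Select an entry first, then a range of its bits
	oddwidth[1] = registers[2][12:0];	// Low 13 bits of the same entry
	// $bits gives the width of one entry, $size the number of entries
	$display("entry width: %0d, entry count: %0d", $bits(registers[2]), $size(registers));
	$finish;
end

endmodule
```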

This notation will be quite common throughout the series, where we might encounter arrays of oddly sized entries and unusually wide bit patterns. This shows us that the underlying hardware doesn’t have a fixed type for a register, but allows us to group and divide bits into meaningful constructs as we see fit.

The above circuit operates in the following fashion:

Reset/clock lines control the device’s operation, as explained earlier.
rs1 and rs2 are indices used to select our ‘source’ registers. To fit the RV32I instruction format they’re 5 bits each (4..0), and the contents of the corresponding registers are driven onto the rval1 and rval2 wires. These values are visible on the current clock.
Likewise, rd is the index of the destination register, where writes from the datain line go when wren is high. The new value of the destination register is visible on the next clock.

Here it’s important to say that a ‘wire’, unlike the ‘logic’ array above, is not a storage unit; in this case it ties an external device’s storage or wires to this device. Doing it this way lets the external device own the storage, since it might want to redirect the output somewhere else later.

Remember we said a clocked write is visible at the next clock? This device makes use of that in the following way, so that reads happen before writes:

The last two ‘assign’ statements are unclocked, and will let us see the register values as soon as we have a new pair of rs1/rs2 values set.
The ‘clocked’ part of our logic will see the current values on the RHS of the assignments, and the LHS values will be assigned in such a way that they’re available in the next clock cycle, therefore not shadowing the values that rs1/rs2 read on the ‘current’ clock.
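The read-before-write behaviour can be checked with a small testbench around the registerfile module above. This is a sketch under assumptions: the testbench name, the 10 time unit clock period and the test value 32'h00001000 are all chosen just for this example.

```systemverilog
// Simulation-only sketch: exercises the registerfile module from this article.
module registerfile_tb;

logic clock = 1'b0, reset = 1'b1, wren = 1'b0;
logic [4:0] rs1 = 5'd0, rs2 = 5'd0, rd = 5'd0;
logic [31:0] datain = 32'd0;
wire [31:0] rval1, rval2;

registerfile DUT(
	.reset(reset), .clock(clock),
	.rs1(rs1), .rs2(rs2), .rd(rd),
	.wren(wren), .datain(datain),
	.rval1(rval1), .rval2(rval2) );

always #5 clock = ~clock;	// Rising edges at t=5, 15, 25, ...

initial begin
	#12 reset = 1'b0;		// The edge at t=5 has loaded x2 with 32'h0003FFF0
	rs1 = 5'd2;				// Read x2...
	rd = 5'd2;				// ...while also writing to x2
	wren = 1'b1;
	datain = 32'h00001000;
	// Unclocked read: still the old value, 32'h0003FFF0
	#1 $display("before edge: x2=%h", rval1);
	@(posedge clock);		// The write lands on this edge (t=15)
	// After the edge the new value, 32'h00001000, is visible
	#1 $display("after edge:  x2=%h", rval1);
	$finish;
end

endmodule
```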

Due to how the fetch/decode/execute logic that we’ll be seeing later works, the above order will make a lot of sense.

This concludes part 3. Next time we’ll be looking at setting up our board and running a blinky LED on it as a test!