RISC-V CPU & CUSTOM GPU ON AN FPGA PART 6 – BLOCK RAM, PROGRAM COUNTER AND INSTRUCTION FETCH

Before we begin

In this part we’ll take a look at RAM, program counter and instruction fetch concepts.

But before we go further: the files are getting too large to post here, so we’ll grab them from a git repository instead. Create a folder of your preference with sufficient free disk space and change into it, since this is where we’ll be working on all parts of the series from now on. Then head over to https://github.com/ecilasun/nekoichiarticle, or use the following command to pull the files for all parts of the series:

git clone https://github.com/ecilasun/nekoichiarticle.git

Block RAM

In this part, we’ll take a little detour and add a small amount of memory to our system from the previous part.

The FPGA device has internal, dedicated memory with very little latency that runs at quite high clock speeds: the Block RAM. This memory can be configured with pretty much any reasonable bit width, and with a depth up to the limit of your device. There is also the option to have two ports for reading/writing data simultaneously, each with its own independent clock, or to use just one port as an ordinary storage device.

We will be focusing on the single port, regular block memory for our first implementation, but note that this will change when we add our GPU to the design. The reason for a dual port architecture for central memory is to give both the CPU and the GPU simultaneous access to memory contents, such as when we’re reading instructions and at the same time copying a bitmap out to the display scan-out buffer. These concepts will make more sense in the GPU sections of this series.

Setting Up the Block RAM IP

To add a block RAM to our design, we can write some SystemVerilog code and expect the compile tools to infer an actual block memory device from it, or to avoid any conflicts due to possible coding errors, we can directly add a block RAM IP (Intellectual Property) to our design. The IP term is often used to show that there is some external (sometimes black-boxed) code out there which we can instantiate in our design, but not necessarily always see the source code of, depending on the design. The vendor supplied IPs often come with simulation models, which means we can expect the simulation to behave almost like the actual physical component in the FPGA.
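For comparison, the inference route could look roughly like the sketch below. The module name inferredmem is hypothetical, and the port names are only chosen to mirror the ones on the generated IP. Whether the tools actually map this onto a block RAM depends on the synthesis tool and its settings, so treat it as an illustration rather than a drop-in replacement for the IP.

```systemverilog
`timescale 1ns / 1ps

// Hypothetical single port memory, 1024 x 32 bits, with byte write enables.
// Written in the hope that synthesis infers block RAM from it.
module inferredmem(
	input wire clka,
	input wire ena,
	input wire [3:0] wea,		// One write enable bit per byte lane
	input wire [9:0] addra,		// DWORD address
	input wire [31:0] dina,
	output logic [31:0] douta = 32'd0 );

logic [31:0] mem [0:1023];

always @(posedge clka) begin
	if (ena) begin
		// Write only the byte lanes selected by wea
		for (int i = 0; i < 4; i++) begin
			if (wea[i])
				mem[addra][8*i +: 8] <= dina[8*i +: 8];
		end
		// Registered read: data shows up one clock after the address.
		// On a simultaneous write this returns the old contents.
		douta <= mem[addra];
	end
end

endmodule
```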

For our initial design, we will go with a very small amount of block RAM: 4096 bytes. Since we are building a 32 bit system, we will set the word size to 32 bits, which gives us 1024 addresses (a 10 bit wide address bus). This keeps the wiring to the minimum required to make sure our design works, and keeps compile times short enough for testing and simulation.
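The sizing arithmetic can be spelled out as a few constants. This is only a sketch; the module and parameter names are made up for illustration and do not appear in the project files.

```systemverilog
// Hypothetical helper that derives the memory depth and address width
module memsizes;

localparam int MEMBYTES  = 4096;			// Total memory size in bytes
localparam int WORDBYTES = 32 / 8;			// One 32 bit word is 4 bytes
localparam int MEMDEPTH  = MEMBYTES / WORDBYTES;	// Number of 32 bit words: 1024
localparam int ADDRBITS  = $clog2(MEMDEPTH);		// Address bits needed: 10

initial $display("depth=%0d addrbits=%0d", MEMDEPTH, ADDRBITS);

endmodule
```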

To add our memory IP to the design, click on the IP Catalog, type in ‘block’ in the search box, and double-click the Block Memory Generator in the list.

This will bring up the memory wizard.

On the first tab (Basic), make sure to select Native as the Interface Type and Single Port RAM as the Memory Type, enable Byte Write Enable with a Byte Size of 8, and use the name sysmem for the Component Name setting.

On the Port A Options tab, use 32 for Write Width and 1024 for Write Depth. This yields the 4096 byte memory mentioned above. Also do not forget to uncheck the Primitives Output Register option to make sure we get a 1 clock latency for reads instead of more. Other tabs do not need to change for now, so hit OK to close the wizard.

This will present us with the following dialog box.

For the time being, we will go with the default settings, and generate an ‘Out of Context’ synthesized module. This means that the module will get a standalone compile and be cached on disk for rapid compiles, but might not benefit from global optimizations it would have received otherwise.

One word of note here.

The generated memory has a single clock cycle of delay on read operations, as it’s not a ‘perfect’ memory. This delay means that when we set up the read address on one clock, the value is not visible until the next clock. This single clock of delay is why our current design needs a wait state during loads, and the most obvious example is the ‘FETCH’ state. The ‘RETIRE’ state assigns the address, but that assignment only becomes visible to the memory during the FETCH cycle, so the instruction word itself only shows up on the cycle after FETCH. This chain of delayed propagation is sometimes annoying, but most often quite useful in pipelined designs.
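To see the delay in isolation, here is a small simulation-only sketch. It does not use the sysmem IP; instead it assumes a simple behavioral stand-in (a plain array with a registered read), and the module name latencydemo is hypothetical.

```systemverilog
`timescale 1ns / 1ps

// Hypothetical demo of a one clock read latency using a behavioral memory
module latencydemo;

logic clock = 1'b0;
logic [9:0] memaddress = 10'd0;
logic [31:0] cpudatain = 32'd0;
logic [31:0] mem [0:1023];

// 10ns period clock
always #5 clock = ~clock;

// Registered read, standing in for the block RAM
always @(posedge clock) cpudatain <= mem[memaddress];

initial begin
	mem[0] = 32'h00000093;
	mem[1] = 32'h00008113;

	// Plays the role of RETIRE: assign a new address on a clock edge
	@(posedge clock);
	memaddress <= 10'd1;

	// Middle of the next cycle (FETCH): the address has changed,
	// but the data output still holds the word from address 0
	@(negedge clock);
	$display("address=%0d data=%h", memaddress, cpudatain);

	// One cycle later (DECODE): the word from address 1 is visible
	@(negedge clock);
	$display("address=%0d data=%h", memaddress, cpudatain);

	$finish;
end

endmodule
```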

After a short time, you should

Instantiating the block RAM

After we’ve added the IP and a background task has been kicked off to synthesize it, we can add an instance of this module to our design. Open up the nekotop.sv file and modify it to the following, which includes some new lines for our memory instance:

`timescale 1ns / 1ps

module nekotop(
	// Input clock
	input CLK_I,
	// Reset on lower panel, rightmost button
	input RST_I,
	// 4 monochrome LEDs
	output [3:0] led
);

wire [3:0] diagnosis;

// 10 bit DWORD address wire between CPU and RAM
wire [9:0] memaddress;
// Data wires from/to CPU to/from RAM
wire [31:0] cpudataout;
wire [31:0] cpudatain;

sysmem mymemory(
	.addra(memaddress),	// 10 bit DWORD aligned address
	.clka(CLK_I),		// Clock, same as CPU clock
	.dina(cpudataout),	// Data in from CPU to RAM address
	.douta(cpudatain),	// Data out from RAM address to CPU
	.ena(1'b1),		// Reads are always enabled for now
	.wea(4'b0000) );	// Byte select mask for writes, no writes when 0000

riscvcpu mycpu(
	.clock(CLK_I),
	.reset(RST_I),
	.diagnosis(diagnosis),
	.memaddress(memaddress),
	.cpudataout(cpudataout),
	.cpudatain(cpudatain) );

assign led = diagnosis;

endmodule

The ‘mymemory’ instance is a copy of the block RAM IP which we created, connected to the CPU with 32 bit data wires for both input and output data, and a 10 bit address wire to cover the 0..1023 address range. This means we can only access memory in 32 bit units, at 32 bit aligned addresses.

Internally, the CPU’s memory addresses are actually the full 12 bits. All address calculations produce a byte, word or DWORD address; we then keep only the top 10 bits and discard the lowest two. In turn, when writing, we need to create a wea mask from the low 2 address bits and the type of the write.

For instance, to write a WORD 0xFFFE at 12 bit address 0x222, we need to generate the following:

  • dwordaddress = 0x88 (0x222 == 0010 0010 0010, remove bottom 2 bits: 0010 0010 00, which is 0x88)
  • wea = 1100 because low bits are 10 making it the ‘odd’ word address
  • cpudataout = 0xFFFE0000 since we need to shift the word data to the right position to match wea

It’s not that complicated once you get used to it, but the short rule here is that we should have a good understanding of memory alignment, be able to look at the lowest two bits to create word and byte wea masks and shift the byte or the word into the correct position in the cpudataout DWORD so it gets written correctly.
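As a sketch of the rule above, the following combinational module derives the DWORD address, the wea mask and the shifted write data from a 12 bit byte address. The module name writealign and the 2 bit size encoding are assumptions made for this example and are not taken from the NekoIchi sources; the address is assumed to be naturally aligned for the access size.

```systemverilog
`timescale 1ns / 1ps

// Hypothetical write alignment helper
// size: 2'b00 = BYTE, 2'b01 = WORD (16 bit), anything else = DWORD (32 bit)
module writealign(
	input wire [11:0] byteaddress,
	input wire [1:0] size,
	input wire [31:0] data,
	output logic [9:0] dwordaddress,
	output logic [3:0] wea,
	output logic [31:0] cpudataout );

always_comb begin
	// Drop the lowest two bits to get the DWORD address
	dwordaddress = byteaddress[11:2];

	case (size)
		2'b00: begin
			// One byte lane, selected by the low two address bits
			wea = 4'b0001 << byteaddress[1:0];
			cpudataout = {24'd0, data[7:0]} << {byteaddress[1:0], 3'b000};
		end
		2'b01: begin
			// Lower or upper half, selected by address bit 1
			wea = byteaddress[1] ? 4'b1100 : 4'b0011;
			cpudataout = byteaddress[1] ? {data[15:0], 16'd0} : {16'd0, data[15:0]};
		end
		default: begin
			// Full DWORD write
			wea = 4'b1111;
			cpudataout = data;
		end
	endcase
end

endmodule
```

With byteaddress = 12'h222, size = 2'b01 and data = 32'h0000FFFE, this should reproduce the values from the worked example: dwordaddress 0x88, wea 1100 and cpudataout 0xFFFE0000.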

The operations for reading a byte or a word are almost the same but they reside inside the CPU module.
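A minimal sketch of that read side might look like the following. It picks the addressed byte or word out of the DWORD coming back from memory and optionally sign extends it. The module name readalign, the signextend input and the size encoding are assumptions for illustration only, not the author’s implementation.

```systemverilog
`timescale 1ns / 1ps

// Hypothetical read alignment helper
// size: 2'b00 = BYTE, 2'b01 = WORD (16 bit), anything else = DWORD (32 bit)
module readalign(
	input wire [1:0] lowaddress,	// Lowest two bits of the byte address
	input wire [1:0] size,
	input wire signextend,		// 1 to sign extend, 0 to zero extend
	input wire [31:0] cpudatain,
	output logic [31:0] value );

logic [7:0] bytelane;
logic [15:0] wordlane;

always_comb begin
	// Select the addressed byte and word from the DWORD
	bytelane = cpudatain[8*lowaddress +: 8];
	wordlane = lowaddress[1] ? cpudatain[31:16] : cpudatain[15:0];

	case (size)
		2'b00: value = {{24{signextend & bytelane[7]}}, bytelane};
		2'b01: value = {{16{signextend & wordlane[15]}}, wordlane};
		default: value = cpudatain;
	endcase
end

endmodule
```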

For now we’ll stick with DWORD access only, as that’s sufficient to implement our instruction fetcher.

Program Counter

This is probably the most important register that controls our program flow, since all instructions for execution come from this address. It’s also the one we modify to divert code to different branches, and the register we need to save/modify when we need to come back from a function call or an interrupt service routine.

For simplicity’s sake, we’ll start with a 32bit PC (program counter) to show us where we are, even though our memory address space is really short.

Typically, each architecture has an entry point that the PC is initialized to, which could be anywhere in memory. For NekoIchi, and to keep reset simple, we’ll have the PC point at address zero in memory at startup.

This actually makes a lot of sense for our design. Once we create a binary and place it as our initialization file into the block RAM, and point at address zero, our program should start spinning within that address space. Once we get to the later articles where we talk about the boot loader, this choice will allow us to load regular ELF binaries into addresses starting at 0x10000 which is the default physical address for the ELF files, while keeping the boot loader / BIOS in the 0x0 – 0xFFFF address space, which is plenty of room to work with.

So let’s go ahead and modify our CPU to contain a PC, and also let’s add some dummy wires and logic to hold our data in/out and the memory address. We’ll also add a nextPC to act as our next program counter that the control logic can fill in during execution to implement address increments or control flow.

As you can see below, we truncate the address to a DWORD boundary as discussed above, and make sure to start from the ‘RETIRE’ state so that it can set up the program counter and help us boot from address 0x0.

As you can also see, the RETIRE state copies nextPC into PC and sets up the instruction memaddress from it, while the DECODE state sets up nextPC to point at the next program location (PC + 4). This is going to change later, when nextPC can instead be routed to a branch target based on a logic decision, or to an interrupt service routine.

`timescale 1ns / 1ps

module riscvcpu(
	input clock,
	input reset,
	output logic [3:0] diagnosis = 4'b0000,
	output logic [9:0] memaddress,
	output logic [31:0] cpudataout = 32'd0,
	input wire [31:0] cpudatain  );

// Number of bits for the one-hot encoded CPU state
`define CPUSTAGECOUNT           4

// Bit indices for one-hot encoded CPU state
`define CPUFETCH		0
`define CPUDECODE		1
`define CPUEXEC			2
`define CPURETIREINSTRUCTION	3

// Start from RETIRE state so that we can
// set up instruction fetch address and read
// data which will be available on the next
// clock, in FETCH state.
logic [`CPUSTAGECOUNT-1:0] cpustate = 4'b1000;

logic [31:0] PC = 32'd0;
logic [31:0] nextPC = 32'd0;

always @(posedge clock) begin
	if (reset) begin
		//
	end else begin

		// Clear the state bits for next clock
		cpustate <= 4'b0000;

		// Selected state can now set the bit for the
		// next state for the next clock, which will
		// override the above zero-set.
		case (1'b1)
			cpustate[`CPUFETCH]: begin
				// Fetching from memory
				diagnosis[0] <= 1'b0;
				cpustate[`CPUDECODE] <= 1'b1;
			end
			cpustate[`CPUDECODE]: begin
				// cpudatain now contains our
				// first instruction to decode
				nextPC <= PC + 4;
				cpustate[`CPUEXEC] <= 1'b1;
			end
			cpustate[`CPUEXEC]: begin
				// TODO:
				cpustate[`CPURETIREINSTRUCTION] <= 1'b1;
			end
			cpustate[`CPURETIREINSTRUCTION]: begin
				// Set new PC
				PC <= nextPC;
				// Truncate the byte address to a DWORD address
				memaddress <= nextPC[11:2];
				diagnosis[0] <= 1'b1;
				cpustate[`CPUFETCH] <= 1'b1;
			end
		endcase
	end
end

endmodule

Simulating our no-op computer

If we now head over to the Simulation section in our Flow Navigator and click Run Simulation/Run Behavioral Simulation, drag mymemory and mycpu onto the simulation output, force RST_I to a constant zero and force CLK_I to a clock with a 10ns period, we should see this when we run the simulation:

The PC is now incrementing in steps of 4, and you can see it getting set up to do so in cpustate 2, which is DECODE (hint: parallel assignment, value changes on next clock, remember?)

This neatly puts us into a loop:

  • FETCH is where the one clock instruction read delay happens
  • DECODE is where we’re supposed to split our instruction into small pieces and decide what to do
  • EXEC is where we modify our machine state
  • RETIRE is where we set up the next program counter, instruction read address and loop back to FETCH
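As an alternative to forcing RST_I and CLK_I by hand in the simulator, the same stimulus could come from a small testbench. This is a sketch under the assumption that nekotop and the sysmem IP are already part of the project; the module name nekotop_tb and the 400ns run length are arbitrary choices for this example.

```systemverilog
`timescale 1ns / 1ps

// Hypothetical testbench for the top module
module nekotop_tb;

logic CLK_I = 1'b0;
logic RST_I = 1'b0;	// Held at zero, same as forcing it to a constant
wire [3:0] led;

nekotop dut(
	.CLK_I(CLK_I),
	.RST_I(RST_I),
	.led(led) );

// 10ns period clock: toggle every 5ns
always #5 CLK_I = ~CLK_I;

initial begin
	// Run for 40 clocks, which covers several passes through the state loop
	#400;
	$finish;
end

endmodule
```

If added as a simulation source and set as the simulation top, the CPU and memory signals can still be dragged onto the waveform view as before.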

However, as you may have noticed from the above, we have absolutely nothing in our block RAM to work with, so we don’t know if it’s reading anything. To fix this, close the Simulation, head over to the Sources panel, double click the sysmem IP (the one instantiated as mymemory), and go to the Other Options tab.

Here, if we click the Edit button, we’ll be presented with a choice to create a memory initialization file. Go ahead and select Yes, and navigate to the sources_1/new folder and give it the name ‘BIIOS.coe’ and save.

For now, keep it set to an initialization radix of 16 (hexadecimal), type in these four hex values, then Save and Close it. Then close the memory wizard using OK, which will ask you to re-generate the IP, which you should confirm by using Generate.

00000093 00008113 00008193 00008213

This will take a minute to re-generate the memory IP using the above four 32bit values as initialization. If we were to re-run the simulation we should now see this:

Now we should be able to notice a few things here:

  • RETIRE(8) is the first state being run after reset, which is used to load the first instruction
  • After that, at every FETCH(1) state we see the new memory address
  • On each DECODE(2) state we have a new instruction word to work with
  • The word sticks around during EXEC(4) state as well as the RETIRE(8) state after first instruction
  • nextPC is set up in DECODE stage but visible on EXEC and afterwards

This is the very basic mechanism of our CPU’s instruction flow, which will of course get augmented with many more features and steps.

This concludes part 6. In the next part, we can start looking at instruction decoding and try to figure out what those 4 DWORDs are meant to do.