2021-05-29 / 2021-05-31

RISC-V CPU & CUSTOM GPU ON AN FPGA PART 7 – DECODING RISC-V INSTRUCTIONS

Before we begin

In this part we’ll take a brief look at instruction decoding for the base instruction set, namely rv32i.

But before we go further: since the files are getting too large to post here, we need to grab them from a git repository. Please head over to https://github.com/ecilasun/nekoichiarticle, then create and change to a folder of your preference with sufficient free disk space, and use the following command to pull the files for all parts of the series:

git clone https://github.com/ecilasun/nekoichiarticle.git

This folder is where we’ll be working on all parts of the series from now on.

NOTE: Starting with this part, a small change to the project gives us a nicely arranged include file, cpuops.vh, which contains all the instruction groups and CPU states. This lets us refer to them from the decoder and the CPU itself without having to pollute our code with inline defines.
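
To give an idea of what such an include file holds, here is a minimal sketch. It is not the actual cpuops.vh from the repository: the opcode values come from the RV32I base opcode map, but the state bit indices and mask values are assumptions picked for illustration, and the macro names OPCODE_OP, OPCODE_STORE and OPCODE_BRANCH are hypothetical.

```systemverilog
// Hypothetical sketch of a cpuops.vh-style include file.
// Opcode values follow the RV32I base opcode map; the state
// indices and masks are assumptions, not the repository's values.
`ifndef CPUOPS_SKETCH_VH
`define CPUOPS_SKETCH_VH

// Instruction groups, matched against instruction[6:0]
`define OPCODE_OP_IMM  7'b0010011
`define OPCODE_LOAD    7'b0000011
// Hypothetical names for three more groups
`define OPCODE_OP      7'b0110011
`define OPCODE_STORE   7'b0100011
`define OPCODE_BRANCH  7'b1100011

// Bit index of each state inside the one-hot cpustate register
`define CPUFETCH              0
`define CPUDECODE             1
`define CPUEXEC               2
`define CPURETIREINSTRUCTION  3
`define CPUSTAGECOUNT         4

// Whole-register masks: no state selected, and the reset state
`define CPUSTAGEMASK_NONE               4'b0000
`define CPUSTAGEMASK_RETIREINSTRUCTION  4'b1000

`endif
```

Keeping the bit indices and the masks side by side makes it easy to check that each mask is simply a one shifted left by its state index.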

Instructions

At this point, we’ll have to start referring to our RISC-V instruction set manual, the same one we saw in the first article.

If we refer to our manual, in Chapter 19 we will find the Instruction Set Listings, which shows the binary pattern of every RISC-V instruction we’d care to decode. Scanning through this, we can see that the four DWORDs we have used in our BIOS file in part 6 decode to the following instructions:

00000093          	li	ra,0
00008113          	mv	sp,ra
00008193          	mv	gp,ra
00008213          	mv	tp,ra

Therefore, this is a series of instructions that loads the return address register with a zero immediate, then copies that register to the stack pointer, global pointer, and thread pointer registers, effectively zeroing out four registers as it goes along. This is part of the boot-up sequence, where we initialize all 32 integer and float registers in software before going ahead with further processing. Doing it this way saves hardware space, since we don’t need logic in the FPGA device itself to do it.

Further dissecting these instructions, we see a common pattern: the ‘source register’, ‘destination register’ and opcode fields sit at the same bit positions in every instruction. This is deliberate, and makes our decoder quite simple, since we can now simply read the same bit ranges regardless of which instruction we are looking at.

But as you recall, some instructions do not exist in this list. One example is the LI (load immediate) instruction we just saw, which is none other than the ADDI (add immediate) instruction with rs1 (source register) set to 0, the hardwired zero register. In our case rd (destination register) is set to 1, which is the ra (return address) register.
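
To make this concrete, the following simulation-only sketch splits the four BIOS words into their fields using the standard RV32I I-type bit ranges. The module name decode_bios_words is made up for this example and is not part of the project.

```systemverilog
// Simulation-only sketch: split the four BIOS words into I-type fields.
// The module name is hypothetical; the bit ranges are the RV32I ones.
module decode_bios_words;
	logic [31:0] words [0:3];
	logic [31:0] instruction;

	initial begin
		words[0] = 32'h00000093; // li ra,0
		words[1] = 32'h00008113; // mv sp,ra
		words[2] = 32'h00008193; // mv gp,ra
		words[3] = 32'h00008213; // mv tp,ra

		for (int i = 0; i < 4; i++) begin
			instruction = words[i];
			// opcode, destination, func3, source and sign-extended immediate
			$display("%08h opcode=%02h rd=x%0d func3=%03b rs1=x%0d imm=%0d",
				instruction,
				instruction[6:0],
				instruction[11:7],
				instruction[14:12],
				instruction[19:15],
				$signed(instruction[31:20]));
		end
	end
endmodule
```

Every word should report opcode 13h, func3 000 and an immediate of 0. The first word has rs1 = x0 and rd = x1, while the other three have rs1 = x1 and rd running from x2 to x4, which is exactly "ADDI rd, rs1, 0" under the LI and MV disguises.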

Another interesting thing in the manual is that instructions in the same group mostly share the same opcode, for instance the ALU instructions. They are most often distinguished by looking at the func3 and func7 fields (spelled funct3 and funct7 in the manual), whose bits tell us which sub-instruction from the group is being used. For instance, the SLTI instruction is in the same group as ADDI, so they have the same base opcode, but their func3 fields differ: 000 for ADDI and 010 for SLTI.
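
As a rough illustration of how func3 and func7 pick the sub-instruction, here is a small simulation-only sketch for the group that ADDI and SLTI belong to (opcode 13h). The module and task names are hypothetical, and the sketch assumes its input already carries that opcode.

```systemverilog
// Simulation-only sketch: name an OP-IMM instruction (opcode 13h)
// from its func3 field, and from bit 30 (func7[5]) for right shifts.
// Module and task names are hypothetical.
module opimm_func3_demo;
	task automatic show_opimm(input logic [31:0] instruction);
		case (instruction[14:12])
			3'b000: $display("%08h: ADDI", instruction);
			3'b001: $display("%08h: SLLI", instruction);
			3'b010: $display("%08h: SLTI", instruction);
			3'b011: $display("%08h: SLTIU", instruction);
			3'b100: $display("%08h: XORI", instruction);
			// SRLI and SRAI share func3; func7 tells them apart
			3'b101: $display("%08h: %s", instruction,
				instruction[30] ? "SRAI" : "SRLI");
			3'b110: $display("%08h: ORI", instruction);
			3'b111: $display("%08h: ANDI", instruction);
			default: $display("%08h: unknown", instruction);
		endcase
	endtask

	initial begin
		show_opimm(32'h00000093); // addi ra,zero,0
		show_opimm(32'h0000a093); // slti ra,ra,0
		show_opimm(32'h4010d093); // srai ra,ra,1
	end
endmodule
```

The three test words should be reported as ADDI, SLTI and SRAI respectively, even though they all share the same opcode.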

Decoding

Decoding is a convenience operation to set up some flags, figure out source and target registers and the type of operation to follow. It involves parsing through the bits of the instruction word to form immediate values, and grabbing the indices of registers to work on.

If you open the project for part7, in the decoder.sv file you’ll see that we’re reading bit ranges from the incoming instruction, which correspond to the fields in our instruction set manual, as well as forming immediate values for each instruction group in case they use one:

	opcode = instruction[6:0];
	rs1 = instruction[19:15];
	rs2 = instruction[24:20];
	rs3 = instruction[31:27]; // Used by fused float ops
	rd = instruction[11:7];
	func3 = instruction[14:12];
	func7 = instruction[31:25];
	selectimmedasrval2 = opcode==`OPCODE_OP_IMM ? 1'b1 : 1'b0;
//...
	`OPCODE_LOAD: begin
		immed = {{20{instruction[31]}},instruction[31:20]};
	end
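
The excerpt above only shows the immediate for the load group. As a hedged sketch of what the other groups need, the module below forms the immediate for each of the RV32I formats (I, S, B, U and J) as laid out in the instruction set manual. It is not the project's decoder.sv: the module name immed_sketch is made up, and the opcodes are written as literals so the block stands on its own.

```systemverilog
// Sketch of immediate generation for the RV32I instruction formats.
// Not the project's decoder.sv; the module name is hypothetical.
module immed_sketch(
	input  wire  [31:0] instruction,
	output logic [31:0] immed );

always_comb begin
	case (instruction[6:0])
		// I-type: OP-IMM, LOAD, JALR
		7'b0010011, 7'b0000011, 7'b1100111:
			immed = {{20{instruction[31]}}, instruction[31:20]};
		// S-type: STORE
		7'b0100011:
			immed = {{20{instruction[31]}}, instruction[31:25], instruction[11:7]};
		// B-type: BRANCH (bit 0 is always zero)
		7'b1100011:
			immed = {{19{instruction[31]}}, instruction[31], instruction[7],
				instruction[30:25], instruction[11:8], 1'b0};
		// U-type: LUI, AUIPC (upper 20 bits, lower 12 bits zero)
		7'b0110111, 7'b0010111:
			immed = {instruction[31:12], 12'd0};
		// J-type: JAL (bit 0 is always zero)
		7'b1101111:
			immed = {{11{instruction[31]}}, instruction[31], instruction[19:12],
				instruction[20], instruction[30:21], 1'b0};
		// No immediate for the remaining groups
		default:
			immed = 32'd0;
	endcase
end

endmodule
```

In every format the sign bit lives in instruction[31], so the sign extension is the same replication in each branch and only the lower bits get shuffled.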

The decoder itself, as can be seen from the decoder.sv file, is a combinatorial logic circuit. This means that, from our point of view, the decoded output is ready almost as soon as the instruction is set. This is handy: if we use this behavior to drive, say, our register file, we can have the output values ready by the next clock.

Simulating the new code

If you start the simulation as we saw in earlier parts of the series, you will notice that we’re now getting a whole lot more output from the CPU. Notice how all four instructions decode to the same ’13h’ opcode: they are all ADDI instructions, even though they look different from the disassembler’s point of view.

We also get something interesting going on: notice the value selectimmedasrval2. Some instructions do not use the second source register, but a constant value instead. This constant is generated in the decoder itself, and a bit flag is set when the immediate should be used, so that the CPU (or rather, mostly the ALU) can do the right thing with this value.
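
A minimal sketch of how such a flag could be consumed is a two-input multiplexer in front of the ALU's second operand. The project has no ALU yet at this point, so the module name and the rval2 and aluoperand2 signals below are assumptions for illustration only.

```systemverilog
// Sketch: pick the ALU's second operand using the decoder's flag.
// rval2 and aluoperand2 are hypothetical names; immed and
// selectimmedasrval2 are the decoder outputs from this article.
module operand_select_sketch(
	input  wire [31:0] rval2,              // value read from register rs2
	input  wire [31:0] immed,              // immediate formed by the decoder
	input  wire        selectimmedasrval2, // 1: use immediate, 0: use rs2
	output wire [31:0] aluoperand2 );

assign aluoperand2 = selectimmedasrval2 ? immed : rval2;

endmodule
```

With this in place the ALU itself does not need to know whether it is running ADD or ADDI; it only ever sees two 32-bit operands.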

Also worth mentioning here is that once the program counter reaches the end of our tiny 4K memory, it will wrap around and keep repeating the same instructions, over and over again. This is because we have no proper flow control, but only a very simple decoder that does not know about the program state properly.
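
The wrap-around follows directly from using only bits [11:2] of the program counter as the memory address, as the CPU code further down does. The simulation-only sketch below (the module name is made up) steps a PC across the 4K boundary to show the effect.

```systemverilog
// Simulation-only sketch: a 10-bit word address taken from PC[11:2]
// wraps back to zero once PC passes the end of a 4K memory.
// The module name is hypothetical.
module pc_wrap_demo;
	logic [31:0] PC;
	logic [9:0] memaddress;

	initial begin
		PC = 32'h00000FF8; // two words before the end of 4K
		repeat (4) begin
			// Same truncation the CPU applies in its RETIRE state
			memaddress = PC[11:2];
			$display("PC=%08h -> word address %0d", PC, memaddress);
			PC = PC + 4;
		end
	end
endmodule
```

The word address goes 1022, 1023, then back to 0 and 1, while the PC itself keeps counting up past 00001000h.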

CPU-decoder interface

We need to examine the main CPU state machine after the changes to see exactly how we’re driving the decoder and where the decoded instruction ends up.

Please recall that we had a one-hot state machine built in the previous article, with the states FETCH / DECODE / EXEC / RETIRE. It still looks the same, though we have some changes, for example we have the CPU opcodes in a define file (cpuops.vh) and we are using an instance of the ‘decoder’, implemented in our decoder.sv file.

The decoder accepts a ‘variable’ (that is, a 32 bit logic group) named ‘instruction’ as input. In FETCH state, we are still reading memory, and in DECODE state we have our memory output. We simply assign this to the input of our decoder circuit in the DECODE state, and since the decoder is a combinatorial unit, its output is available pretty quickly, somewhere within the same clock. We do not really require this behavior right now since we don’t have an ALU yet, but it will come in handy once we do. After all this decoding, we arrive at the EXEC state, where we have all the bits and pieces required to decide on register file activity and program flow.

`timescale 1ns / 1ps

`include "cpuops.vh"

module riscvcpu(
	input clock,
	input reset,
	output logic [3:0] diagnosis = 4'b0000,
	output logic [9:0] memaddress = 10'd0,
	output logic [31:0] cpudataout = 32'd0,
	input wire [31:0] cpudatain  );

// Start from RETIRE state so that we can
// set up instruction fetch address and read
// data which will be available on the next
// clock, in FETCH state.
logic [`CPUSTAGECOUNT-1:0] cpustate = `CPUSTAGEMASK_RETIREINSTRUCTION;

logic [31:0] PC = 32'd0;
logic [31:0] nextPC = 32'd0;
logic [31:0] instruction = 32'd0; // Illegal instruction

// Instruction decoder and related wires
wire [6:0] opcode;
wire [2:0] func3;
wire [6:0] func7;
wire [4:0] rs1;
wire [4:0] rs2;
wire [4:0] rs3;
wire [4:0] rd;
wire [31:0] immed;
wire selectimmedasrval2;
decoder mydecoder(
	.instruction(instruction),
	.opcode(opcode),
	.func3(func3),
	.func7(func7),
	.rs1(rs1),
	.rs2(rs2),
	.rd(rd),
	.immed(immed),
	.selectimmedasrval2(selectimmedasrval2) );

always @(posedge clock) begin
	if (reset) begin
		//
	end else begin

		// Clear the state bits for next clock
		cpustate <= `CPUSTAGEMASK_NONE;

		// Selected state can now set the bit for the
		// next state for the next clock, which will
		// override the above zero-set.
		case (1'b1)
			cpustate[`CPUFETCH]: begin
				// Fetching from memory
				diagnosis[0] <= 1'b0;
				cpustate[`CPUDECODE] <= 1'b1;
			end
			cpustate[`CPUDECODE]: begin
				// cpudatain now contains our
				// first instruction to decode
				// Set it as decoder input
				instruction <= cpudatain;
				nextPC <= PC + 4;
				cpustate[`CPUEXEC] <= 1'b1;
			end
			cpustate[`CPUEXEC]: begin
				// At this stage decoder output is ready
				cpustate[`CPURETIREINSTRUCTION] <= 1'b1;
			end
			cpustate[`CPURETIREINSTRUCTION]: begin
				// Set new PC
				PC <= nextPC;
				// Truncated
				memaddress <= nextPC[11:2];
				diagnosis[0] <= 1'b1;
				cpustate[`CPUFETCH] <= 1'b1;
			end
		endcase
	end
end

endmodule

Some words about the code

Please note that this entire series aims to be readable while remaining practically synthesizable, so some code might not exactly look up to standard. Please feel free to modify, change, rip apart and redo everything here as you prefer, but the idea is that once you’re simulating and/or debugging, a readable code base is quite a valuable tool.

This concludes part 7. In part 8 we will integrate an ALU and the register file so we can run a few simple instructions.