RISC-V CPU & CUSTOM GPU ON AN FPGA PART 7 – DECODING RISC-V INSTRUCTIONS
Before we begin
In this part we’ll take a brief look at instruction decoding for the base instruction set, namely rv32i.
But before we go further: the files are getting too large to post here, so we’ll grab them from a git repository instead. First create a folder of your choice with sufficient free disk space and change into it, since this is where we’ll be working on all parts of the series from now on. Then head over to https://github.com/ecilasun/nekoichiarticle and use the following command to pull the files for all parts of the series:
git clone https://github.com/ecilasun/nekoichiarticle.git
NOTE: Starting with this part, the project has a small change: a neatly arranged include file, cpuops.vh. It contains all the instruction groups and CPU states, so both the decoder and the CPU itself can refer to them without polluting our code with inline defines.
Instructions
At this point, we’ll have to start referring to the RISC-V instruction set manual, which we pointed to in the first article.
In Chapter 19 of the manual we’ll find the Instruction Set Listings, which show the binary pattern of every RISC-V instruction we’d care to decode. Scanning through these, we can see that the four DWORDs we used in our BIOS file in part 6 decode to the following instructions:
00000093 li ra,0
00008113 mv sp,ra
00008193 mv gp,ra
00008213 mv tp,ra
So this is a series of instructions that loads the return address register with a zero immediate, then copies that register to the stack pointer, global pointer and thread pointer registers, zeroing out four registers as it goes along. This is part of the boot-up sequence, where we initialize all 32 integer and float registers in software before doing any further processing. This saves hardware space, since the FPGA design itself doesn’t need logic to clear them.
Dissecting these instructions further, we see a common pattern: the ‘source register’, ‘destination register’ and opcode fields sit at the same bit positions across instructions. This is deliberate, and it makes our decoder quite simple, since we can read each of these fields from the same bits regardless of which instruction we are looking at.
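To make the fixed field positions concrete, here is a small software model in Python. It is not part of the repository, and the names `decode_fields`, `ABI_NAMES` and `BIOS_WORDS` are made up for illustration. It slices the four BIOS words with the same bit ranges decoder.sv uses:

```python
# Software model (not from the repository) of the fixed RV32I field positions.

def decode_fields(instruction: int) -> dict:
    """Slice a 32-bit instruction word into the fields decoder.sv reads."""
    return {
        "opcode": instruction & 0x7F,          # instruction[6:0]
        "rd":     (instruction >> 7) & 0x1F,   # instruction[11:7]
        "func3":  (instruction >> 12) & 0x7,   # instruction[14:12]
        "rs1":    (instruction >> 15) & 0x1F,  # instruction[19:15]
        "rs2":    (instruction >> 20) & 0x1F,  # instruction[24:20]
        "func7":  (instruction >> 25) & 0x7F,  # instruction[31:25]
    }

# ABI names of integer registers x0..x4
ABI_NAMES = ["zero", "ra", "sp", "gp", "tp"]

BIOS_WORDS = [0x00000093, 0x00008113, 0x00008193, 0x00008213]

for word in BIOS_WORDS:
    f = decode_fields(word)
    print(f"{word:08x}: opcode={f['opcode']:02x}h func3={f['func3']} "
          f"rd={ABI_NAMES[f['rd']]} rs1={ABI_NAMES[f['rs1']]}")
```

Every word reports opcode 13h and func3 0. Only rd (ra, sp, gp, tp) and rs1 (zero for the first word, ra for the rest) change.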
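But as you may recall, some instructions do not appear in this list at all. These are pseudo-instructions that the assembler maps onto real ones. LI (load immediate) is an example: for a small immediate like ours, it is none other than the ADDI (add immediate) instruction with rs1 set to x0 (the zero register). Here rd (destination register) is 1, which is the ra (return address) register. Likewise, MV is simply ADDI with an immediate of zero.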
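As a quick cross-check of the pseudo-instruction mapping, the hypothetical Python encoder below builds ADDI words from the I-type layout. It shows that li ra,0 and the three mv instructions assemble to exactly the BIOS words listed earlier. It only covers immediates that fit in 12 bits; a real assembler expands larger LI values into more than one instruction.

```python
# Hypothetical helpers for illustration; not part of the project.
OPCODE_OP_IMM = 0b0010011  # 13h
FUNC3_ADDI = 0b000

def encode_addi(rd: int, rs1: int, imm: int) -> int:
    """Encode ADDI rd, rs1, imm using the I-type layout."""
    if not -2048 <= imm <= 2047:
        raise ValueError("ADDI only takes a 12-bit signed immediate")
    return (((imm & 0xFFF) << 20) | (rs1 << 15) | (FUNC3_ADDI << 12)
            | (rd << 7) | OPCODE_OP_IMM)

def li_small(rd: int, imm: int) -> int:
    """li rd, imm (12-bit imm only) is addi rd, x0, imm."""
    return encode_addi(rd, 0, imm)

def mv(rd: int, rs: int) -> int:
    """mv rd, rs is addi rd, rs, 0."""
    return encode_addi(rd, rs, 0)

# x1 = ra, x2 = sp, x3 = gp, x4 = tp
program = [li_small(1, 0), mv(2, 1), mv(3, 1), mv(4, 1)]
print([f"{w:08x}" for w in program])
```

This prints `['00000093', '00008113', '00008193', '00008213']`, the same four words the BIOS contains.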
Another interesting thing in the manual is that instructions within the same group, for instance the ALU instructions, mostly share the same opcode. They are distinguished by the func3 and func7 fields, whose extra bits tell us which sub-instruction of the group is being used. For instance, SLTI is in the same group as ADDI, so both have the same base opcode, but their func3 fields differ: 000 for ADDI and 010 for SLTI.
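Here is a sketch of how func3 (and, for the right shifts, func7) selects the sub-instruction within the OP-IMM group, written as a Python lookup table. The table follows the RV32I listings. The names are illustrative and are not taken from cpuops.vh. It also simplifies slightly by treating any func7 other than 0100000 on a right shift as SRLI rather than flagging it as illegal:

```python
OP_IMM = 0b0010011  # 13h, shared by the whole group

# func3 -> mnemonic for the OP-IMM group (RV32I); 101 is resolved via func7
OP_IMM_FUNC3 = {
    0b000: "addi",
    0b001: "slli",
    0b010: "slti",
    0b011: "sltiu",
    0b100: "xori",
    0b110: "ori",
    0b111: "andi",
}

def op_imm_mnemonic(instruction: int) -> str:
    """Name the OP-IMM sub-instruction encoded in a 32-bit word."""
    if instruction & 0x7F != OP_IMM:
        raise ValueError("not an OP-IMM instruction")
    func3 = (instruction >> 12) & 0x7
    func7 = (instruction >> 25) & 0x7F
    if func3 == 0b101:
        # Shift right: func7 0100000 means arithmetic, otherwise logical
        return "srai" if func7 == 0b0100000 else "srli"
    return OP_IMM_FUNC3[func3]

print(op_imm_mnemonic(0x00008113))  # addi
```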
Decoding
Decoding is a convenience step that sets up some flags and works out the source and destination registers and the type of operation to follow. It involves slicing the instruction word into bit fields, both to form immediate values and to grab the indices of the registers to work on.
If you open the project for part 7, you’ll see in the decoder.sv file that we read bit ranges from the incoming instruction, which correspond to the fields in our instruction set manual. We also form an immediate value for each instruction group that uses one:
opcode = instruction[6:0];
rs1 = instruction[19:15];
rs2 = instruction[24:20];
rs3 = instruction[31:27]; // Used by fused float ops
rd = instruction[11:7];
func3 = instruction[14:12];
func7 = instruction[31:25];
selectimmedasrval2 = opcode==`OPCODE_OP_IMM ? 1'b1 : 1'b0;
//...
`OPCODE_LOAD: begin
immed = {{20{instruction[31]}},instruction[31:20]};
end
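The snippet only shows the I-type immediate used by loads. For reference, here is a Python model of how the immediates of all five RV32I formats are assembled from the instruction bits, following the base ISA specification. Treat it as a cross-checking aid under that assumption; decoder.sv in the repository may organize the same logic differently:

```python
def sign_extend(value: int, width: int) -> int:
    """Interpret the low `width` bits of value as a two's complement number."""
    sign_bit = 1 << (width - 1)
    value &= (1 << width) - 1
    return (value ^ sign_bit) - sign_bit

def field(word: int, hi: int, lo: int) -> int:
    """Return word[hi:lo] like a Verilog part-select."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def imm_i(inst: int) -> int:  # LOAD, OP-IMM, JALR
    return sign_extend(field(inst, 31, 20), 12)

def imm_s(inst: int) -> int:  # STORE
    return sign_extend((field(inst, 31, 25) << 5) | field(inst, 11, 7), 12)

def imm_b(inst: int) -> int:  # BRANCH (bit 0 is always zero)
    raw = ((field(inst, 31, 31) << 12) | (field(inst, 7, 7) << 11)
           | (field(inst, 30, 25) << 5) | (field(inst, 11, 8) << 1))
    return sign_extend(raw, 13)

def imm_u(inst: int) -> int:  # LUI, AUIPC (upper 20 bits, low 12 zero)
    return sign_extend(field(inst, 31, 12) << 12, 32)

def imm_j(inst: int) -> int:  # JAL (bit 0 is always zero)
    raw = ((field(inst, 31, 31) << 20) | (field(inst, 19, 12) << 12)
           | (field(inst, 20, 20) << 11) | (field(inst, 30, 21) << 1))
    return sign_extend(raw, 21)

print(imm_i(0xFFF00293))  # addi t0, zero, -1 -> -1
```

Note how the sign bit always comes from instruction[31]. This is why the SystemVerilog above can use the {20{instruction[31]}} replication for every signed format.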
As can be seen in decoder.sv, the decoder itself is a purely combinational (combinatorial) logic circuit. From our point of view, this means the decoded output is ready almost as soon as the instruction is set. This is handy: if we use this behavior to drive, say, our register file, we can have the register values ready by the next clock.
Simulating the new code
If you start the simulation as we did in earlier parts of the series, you will notice that the CPU now produces a lot more output. Notice how all four instructions decode to the same opcode value, 13h (OP-IMM), with func3 set to 000. This makes every one of them an ADDI instruction, even though they look different from the disassembler’s point of view.
There is also something interesting going on with the value selectimmedasrval2. Some instructions do not use the second source register (rs2) but a constant value instead. This constant is generated in the decoder itself, and a flag bit is set whenever the immediate should be used, so that the CPU (or rather, mostly the ALU) can do the right thing with it.
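The following is a minimal sketch of what the consumer of selectimmedasrval2 might do. It assumes a simple 32-entry register list and mirrors the decoder rule that only OP-IMM sets the flag. The function name `select_operand2` is hypothetical, since the actual ALU only arrives in part 8:

```python
OPCODE_OP_IMM = 0b0010011  # 13h

def select_operand2(regs: list, instruction: int, immed: int) -> int:
    """Hypothetical model: pick the ALU's second operand.

    Mirrors selectimmedasrval2 = (opcode == OP_IMM). For OP-IMM, bits [24:20]
    are part of the immediate, so regs[rs2] would be meaningless there.
    """
    opcode = instruction & 0x7F
    rs2 = (instruction >> 20) & 0x1F
    selectimmedasrval2 = opcode == OPCODE_OP_IMM
    return immed if selectimmedasrval2 else regs[rs2]

regs = [0] * 32
regs[1], regs[2] = 10, 32

addi_x3_x1_5 = 0x00508193   # addi x3, x1, 5
add_x3_x1_x2 = 0x002081B3   # add  x3, x1, x2
print(regs[1] + select_operand2(regs, addi_x3_x1_5, 5))   # 15
print(regs[1] + select_operand2(regs, add_x3_x1_x2, 0))   # 42
```

Because the selection is a single flag, the ALU itself can stay oblivious to where its second operand came from.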
Also worth mentioning is what happens once the program counter runs past the end of our tiny 4KB memory. Only bits 11:2 of the PC are used to address memory, so the memory address wraps around and the CPU keeps executing the same instructions over and over again. This is because we have no proper flow control yet, only a very simple decoder that does not know about the program state.
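The wrap-around is easy to see with a little arithmetic. This Python model of memaddress <= nextPC[11:2] assumes the 10-bit word address and 4KB memory of this part. It shows that byte address 1000h maps back to word 0:

```python
MEM_WORDS = 1 << 10     # 10-bit memaddress: 1024 words x 4 bytes = 4KB

def word_address(next_pc: int) -> int:
    """Model of memaddress <= nextPC[11:2]: drop the byte offset, keep 10 bits."""
    return (next_pc >> 2) & (MEM_WORDS - 1)

for pc in (0x0000, 0x0004, 0x0FFC, 0x1000, 0x1004):
    print(f"PC={pc:04x}h -> memaddress={word_address(pc)}")
```

The PC register itself keeps counting upward; only the address sent to memory wraps.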
CPU-decoder interface
We need to examine the main CPU state machine after the changes to see exactly how we drive the decoder and where the decoded instruction ends up.
Recall that we built a one-hot state machine in the previous article, with the states FETCH / DECODE / EXEC / RETIRE. It still looks the same, with a few changes: the CPU opcodes and states now live in an include file (cpuops.vh), and we use an instance of the ‘decoder’ module implemented in our decoder.sv file.
The decoder takes a ‘variable’ (that is, a 32-bit logic vector) named ‘instruction’ as input. In the FETCH state we are still reading memory, and in the DECODE state the memory output is available on cpudatain. We simply assign it to the decoder’s input in the DECODE state. Since this is a nonblocking assignment, instruction takes its new value at the clock edge that ends DECODE. Because the decoder is a combinational unit, its outputs then settle shortly after that edge and are ready for use during the EXEC cycle. We don’t really require this behavior right now since we don’t have an ALU yet, but it will come in handy once we do. After all this decoding, we arrive at the EXEC state, where we have all the bits and pieces required to decide on register file activity and program flow.
`timescale 1ns / 1ps
`include "cpuops.vh"
module riscvcpu(
input clock,
input reset,
output logic [3:0] diagnosis = 4'b0000,
output logic [9:0] memaddress = 10'd0,
output logic [31:0] cpudataout = 32'd0,
input wire [31:0] cpudatain );
// Start from RETIRE state so that we can
// set up instruction fetch address and read
// data which will be available on the next
// clock, in FETCH state.
logic [`CPUSTAGECOUNT-1:0] cpustate = `CPUSTAGEMASK_RETIREINSTRUCTION;
logic [31:0] PC = 32'd0;
logic [31:0] nextPC = 32'd0;
logic [31:0] instruction = 32'd0; // Illegal instruction
// Instruction decoder and related wires
wire [6:0] opcode;
wire [2:0] func3;
wire [6:0] func7;
wire [4:0] rs1;
wire [4:0] rs2;
wire [4:0] rs3;
wire [4:0] rd;
wire [31:0] immed;
wire selectimmedasrval2;
decoder mydecoder(
.instruction(instruction),
.opcode(opcode),
.func3(func3),
.func7(func7),
.rs1(rs1),
.rs2(rs2),
.rs3(rs3),
.rd(rd),
.immed(immed),
.selectimmedasrval2(selectimmedasrval2) );
always @(posedge clock) begin
if (reset) begin
//
end else begin
// Clear the state bits for next clock
cpustate <= `CPUSTAGEMASK_NONE;
// Selected state can now set the bit for the
// next state for the next clock, which will
// override the above zero-set.
case (1'b1)
cpustate[`CPUFETCH]: begin
// Fetching from memory
diagnosis[0] <= 1'b0;
cpustate[`CPUDECODE] <= 1'b1;
end
cpustate[`CPUDECODE]: begin
// cpudatain now contains our
// first instruction to decode
// Set it as decoder input
instruction <= cpudatain;
nextPC <= PC + 4;
cpustate[`CPUEXEC] <= 1'b1;
end
cpustate[`CPUEXEC]: begin
// At this stage decoder output is ready
cpustate[`CPURETIREINSTRUCTION] <= 1'b1;
end
cpustate[`CPURETIREINSTRUCTION]: begin
// Set new PC
PC <= nextPC;
// Truncate to a 10-bit word address (4KB memory)
memaddress <= nextPC[11:2];
diagnosis[0] <= 1'b1;
cpustate[`CPUFETCH] <= 1'b1;
end
endcase
end
end
endmodule
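The following rough cycle-level model in Python shows the four-clock rhythm of this loop without running a simulator. It makes several simplifying assumptions. Memory reads are treated as ready by DECODE. The one-hot bit positions are made up, since cpuops.vh defines the real ones. Reset is ignored. The function `simulate` is hypothetical:

```python
# One-hot state bits (hypothetical positions; cpuops.vh defines the real ones)
FETCH, DECODE, EXEC, RETIRE = 1 << 0, 1 << 1, 1 << 2, 1 << 3
NEXT_STATE = {FETCH: DECODE, DECODE: EXEC, EXEC: RETIRE, RETIRE: FETCH}

def simulate(memory: list, clocks: int) -> list:
    """Step the FETCH/DECODE/EXEC/RETIRE loop and return the instruction
    words that were present at the decoder input during each EXEC clock."""
    state = RETIRE              # start in RETIRE, like the RTL
    pc = next_pc = 0
    memaddress = 0
    instruction = 0             # all-zero word: illegal instruction
    seen_in_exec = []
    for _ in range(clocks):
        if state == DECODE:
            instruction = memory[memaddress]   # cpudatain -> decoder input
            next_pc = pc + 4
        elif state == EXEC:
            seen_in_exec.append(instruction)   # decoder outputs valid here
        elif state == RETIRE:
            pc = next_pc
            memaddress = (next_pc >> 2) & 0x3FF  # nextPC[11:2]
        # FETCH: memory read in flight, nothing else to do
        state = NEXT_STATE[state]
    return seen_in_exec

bios = [0x00000093, 0x00008113, 0x00008193, 0x00008213] + [0] * 1020
print([f"{w:08x}" for w in simulate(bios, 17)])
```

Starting from RETIRE, the first instruction reaches EXEC on the fourth clock, and each following one takes four more. After 1024 words the address wraps back to the first BIOS word, matching the repeating behavior seen in simulation.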
Some words about the code
Please note that this entire series aims to be readable while remaining practically synthesizable, so some of the code might not look entirely up to standard. Feel free to modify, change, rip out and redo everything here to your own preference. The idea is that once you’re simulating and/or debugging, a readable code base is a very valuable tool.
This concludes part 7. In part 8 we will integrate an ALU and the register file so we can run a few simple instructions.