RISC-V CPU & CUSTOM GPU ON AN FPGA PART 13 – INTEGER MULTIPLY/DIVIDE AND FLOATING POINT MATH, CSRs & HARDWARE INTERRUPTS

BEFORE WE BEGIN

In this part we’ll talk about integer and floating point math extensions, M and F. In addition, we’ll mention the CSR register file and the interrupt mechanism.

But before we go further: the files are getting too large to post here, so we need to grab them from a git repository. Create and change to a folder of your preference with sufficient free disk space (this is where we’ll be working on all parts of the series from now on), then head over to https://github.com/ecilasun/nekoichiarticle and use the following command to pull the files for all parts of the series:

git clone https://github.com/ecilasun/nekoichiarticle.git

A couple words about this series

At this point, the project files in the part13 folder of the NekoIchi repo implement a 100 MHz CPU and an 85 MHz GPU. This article quickly goes over the additional units added to help software development (such as floating point math, integer math and interrupt support via machine control registers).

The series up to this point is meant to be a read-along for the project files, but after this article the text will take precedence and more software related parts will begin.

Integer Multiplication using DSPs

So far we’ve been running all math code in software emulation mode using existing add/sub/shift and other operations. This is obviously quite slow compared to hardware, therefore in this part we’ll start adding the necessary units to handle math for us, starting with integer multiplication and division.

For multiplication, to benefit from the internal DSP in the FPGA fabric, we’ll be using the ‘multiplier’ IP that comes with the Vivado software. However, we are going to do something different than usual with it to save space for our multiplication operation variants, namely the MUL / MULH / MULHSU / MULHU.

These operations work in combinations of signed and unsigned values as input. We have access to a variety of signed/unsigned input in our multiplier IP, however that would create too many DSPs with many inputs and outputs that we’d need to deal with.

Instead, we take a shortcut, and use a 33 bit input and a 66 bit output, and synthetically make up a new number by either sign extending or zero extending to build a signed/unsigned number out of the 32bit inputs:

`F3_MUL, `F3_MULH: begin
	A <= {multiplicand[31], multiplicand};
	B <= {multiplier[31], multiplier};
end
`F3_MULHSU: begin
	A <= {multiplicand[31], multiplicand};
	B <= {1'b0, multiplier};
end
`F3_MULHU: begin
	A <= {1'b0, multiplicand};
	B <= {1'b0, multiplier};
end

The concatenation syntax {X,Y} merges the bits into a single 33-bit number, either repeating the top bit of the input if it’s supposed to be signed, or placing a zero there if it’s supposed to be unsigned. The result of the extended multiplication then ends up in our product register after the recommended delay for the multiplier unit (7 cycles in this case). Thus, we have a multiplication result in approximately 3+7+1 cycles (including fetch, retire and others).
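To see why one widened multiply covers all four variants, here is a small host-side C model of the idea. It is only a sketch: the names mulop, widen and mul_model are made up for illustration and do not come from the project files. It relies on the fact that the low 64 bits of the 66-bit product are all we ever read back.

```c
#include <stdint.h>

enum mulop { OP_MUL, OP_MULH, OP_MULHSU, OP_MULHU };

/* Widen a 32-bit operand the same way the 33-bit extension does:
   repeat bit 31 upward for a signed operand, pad with zeros otherwise. */
static uint64_t widen(uint32_t value, int is_signed)
{
    return is_signed ? (uint64_t)(int64_t)(int32_t)value : (uint64_t)value;
}

/* Hypothetical software model of MUL / MULH / MULHSU / MULHU. */
static uint32_t mul_model(enum mulop op, uint32_t multiplicand, uint32_t multiplier)
{
    int a_signed = (op != OP_MULHU);                /* only MULHU treats A as unsigned */
    int b_signed = (op == OP_MUL || op == OP_MULH); /* MULHSU and MULHU treat B as unsigned */

    /* Unsigned 64-bit multiplication wraps modulo 2^64, which keeps
       exactly the low 64 bits of the full product. */
    uint64_t product = widen(multiplicand, a_signed) * widen(multiplier, b_signed);

    /* MUL returns the low word, the MULH* variants return the high word. */
    return (op == OP_MUL) ? (uint32_t)product : (uint32_t)(product >> 32);
}
```

For instance, with both inputs set to 0xFFFFFFFF, the model gives 0 for MULH (since -1 times -1 is 1) and 0xFFFFFFFE for MULHU, using the same multiply and only a different extension bit.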

Integer division

For division, FPGAs don’t provide any hardware block, therefore we’ll have to do it manually. In the file dividers.sv, you’ll find a 32 clock signed and unsigned integer divider pair. The difference between the two is the sign processing and sign tracking. Both modules also produce a remainder at the same time as the quotient, which means we can share them between the DIV and REM instructions and pick the correct data output accordingly.
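As a rough software picture of a divider that takes one step per quotient bit, here is a restoring division sketch in C. This is an assumption for illustration only: the actual algorithm in dividers.sv may differ, and the names divresult_t, divide_unsigned and divide_signed are hypothetical. Each loop iteration stands in for one clock, and the signed version only adds sign tracking around the unsigned core.

```c
#include <stdint.h>

typedef struct { uint32_t quotient; uint32_t remainder; } divresult_t;

/* Restoring division: 32 iterations, one quotient bit per iteration. */
static divresult_t divide_unsigned(uint32_t dividend, uint32_t divisor)
{
    uint64_t rem = 0; /* wider than 32 bits so the shift below cannot overflow */
    uint32_t quo = 0;
    for (int i = 31; i >= 0; --i) {
        rem = (rem << 1) | ((dividend >> i) & 1u); /* bring down the next dividend bit */
        if (rem >= divisor) {                      /* divisor fits: subtract and set bit */
            rem -= divisor;
            quo |= (1u << i);
        }
    }
    /* A zero divisor falls out as quotient = all ones, remainder = dividend,
       which matches what RISC-V specifies for DIVU/REMU. */
    return (divresult_t){ quo, (uint32_t)rem };
}

/* Signed wrapper: divide magnitudes, then fix up the signs. */
static divresult_t divide_signed(int32_t dividend, int32_t divisor)
{
    if (divisor == 0) /* RISC-V DIV/REM by zero: quotient -1, remainder = dividend */
        return (divresult_t){ 0xFFFFFFFFu, (uint32_t)dividend };

    int neg_a = dividend < 0, neg_b = divisor < 0;
    uint32_t a = (uint32_t)dividend, b = (uint32_t)divisor;
    if (neg_a) a = 0u - a; /* magnitude via unsigned negation */
    if (neg_b) b = 0u - b;

    divresult_t r = divide_unsigned(a, b);
    if (neg_a != neg_b) r.quotient = 0u - r.quotient; /* quotient is negative if signs differ */
    if (neg_a) r.remainder = 0u - r.remainder;        /* remainder follows the dividend's sign */
    return r;
}
```

Since both results come out of the same loop, DIV and REM (or DIVU and REMU) can share one run of the divider and simply pick the field they need.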

There are ways to get the division to go faster than 32 clocks (even down to 1 clock cycle) but that requires quite a large number of logic gates and wastes a large portion of the FPGA fabric, so we’ll steer away from it and choose a multi-clock approach.

Multi-cycle operations on the CPU

As with multiplication, a division starts when the ‘start’ line is held high for one clock cycle. When the operation finishes, the ‘busy’ signal drops to zero and we can pick up the result.

The CPU kicks the integer math operations in its EXEC state. Once the decoded instruction is known to be of the _OP type, and its subtype is a multiply, divide or a remainder operation, we start these operations in their respective modules.

Since NekoIchi is not a pipelined CPU (at least not yet), we have to start these operations and go to a wait state (stall) until they’re complete. This adds 1 clock no-op EXEC, several clocks for the operation itself, and another clock for the ‘done’ state to transition towards the RETIRE state. This means for example integer divide and remainder operations take slightly longer than 32 clocks to complete, which is rather slow. In pipelined CPU scenarios, some part of this slow operation might be hidden depending on how many pipeline stages exist on the CPU, but we’ll keep things simple for the sake of exercise.
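The clock count described above can be sketched with a toy cycle model in C. Everything here is an assumption made for illustration: the state names, the unit_t structure and the way ‘busy’ is modeled are hypothetical and are not taken from the CPU source. The point is only to show where the extra two clocks around the operation come from.

```c
#include <stdint.h>

enum cpu_state { EXEC, STALL, RETIRE };

/* Hypothetical multi-cycle unit: busy while 'remaining' clocks are left. */
typedef struct { int busy; int remaining; } unit_t;

/* Advance the unit by one clock. A start pulse loads the latency counter. */
static void unit_clock(unit_t *u, int start, int latency)
{
    if (start) {
        u->busy = 1;
        u->remaining = latency;
    } else if (u->busy && --u->remaining == 0) {
        u->busy = 0;
    }
}

/* Count clocks from entering EXEC until the CPU reaches RETIRE. */
static int cycles_for_op(int latency)
{
    unit_t unit = { 0, 0 };
    enum cpu_state state = EXEC;
    int cycles = 0;
    while (state != RETIRE) {
        int start = (state == EXEC);     /* start is high only during EXEC */
        if (state == EXEC)
            state = STALL;               /* kick the operation, then wait */
        else if (!unit.busy)
            state = RETIRE;              /* the 'done' clock: move on */
        unit_clock(&unit, start, latency);
        cycles++;
    }
    return cycles;
}
```

Under these assumptions a 32 clock divide costs 34 clocks in total: one for EXEC, 32 for the divider, and one more to notice that it is done.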

Floating point math

The RISC-V F (float) extension has a floating point control register as part of the CSR registers. This controls rounding modes and holds some exception flags coming from the floating point unit.

NekoIchi uses existing IPs for floating point math to save FPGA space. That means we have no control over the rounding modes, and some operations do not exist (which we work around by changing the input slightly).

For the current version, NekoIchi uses 13 different floating point math IPs. These include comparators, integer/float conversion units, dividers, multipliers, square root and fused multiply/add/sub units. They are driven the same way the integer math modules are: a ‘valid’ input acts as a start signal during the EXEC state, and each unit returns its result in its own register along with a result valid flag.

// IP instance for the multiply-subtract fused operation
// fmsubvalid starts the operation during EXEC
// fmsubresultvalid turns high when result is available

wire fmsubvalid = isexecuting & (opcode==`OPCODE_FLOAT_MSUB);

logic [31:0] fmsubresult;
logic fmsubresultvalid;
fp_msub floatfmsub(
	.s_axis_a_tdata(frval1), // A
	.s_axis_a_tvalid(fmsubvalid),
	.s_axis_b_tdata(frval2), // *B
	.s_axis_b_tvalid(fmsubvalid),
	.s_axis_c_tdata(frval3), // -C
	.s_axis_c_tvalid(fmsubvalid),
	.aclk(clock),
	.m_axis_result_tdata(fmsubresult),
	.m_axis_result_tvalid(fmsubresultvalid) );

While the floating point operations are pending, the CPU sits in the FSTALL state and monitors a wide OR of several flags:

fmulresultvalid | fdivresultvalid | fi2fresultvalid | ff2iresultvalid | faddresultvalid | fsubresultvalid | fsqrtresultvalid | feqresultvalid | fltresultvalid | fleresultvalid

If any of these flags is high, the CPU assigns the result of the corresponding operation to the floating point register and goes to the RETIRE state.

Floating point register file

This unit is very similar to the integer register file, and has the exact same code except that it allows writes into register number zero. As you might see from the floatregs.sv file, it also has three source registers, the third one being used by the fused math operations of the form A*B+C.

The current architecture of NekoIchi includes only 32bit floating point math, therefore the register file entries are also 32bits in size each.

Control and Status Registers

The RISC-V ISA tells us that we have space for 4096 special registers, namely the CSRs. The related instructions that access the CSR are listed under the Zicsr section of the RISC-V Spec.

NekoIchi is a very simple architecture, therefore we’re interested in only two things from this list: counters and machine interrupt registers.

C00: CYCLE (CPU Clock, low bits)
C01: TIME (Wall Clock, low bits)
C02: RETI (Retired Instructions, low bits)
C80: CYCLE (high bits)
C81: TIME (high bits)
C82: RETI (high bits)
300: MSTATUS (Machine Status)
304: MIE (Machine Interrupt Enable)
305: MTVEC (Machine Interrupt Service Vector)
341: MEPC (Machine Interrupt Service Routine Return Address)
342: MCAUSE (Machine Interrupt Cause)
344: MIP (Machine Interrupt Pending)
800: TIMECMP (Custom CSR, Timer Interrupt Compare Value, low bits)
801: TIMECMP (high bits)

This makes a total of 14 registers, 2 of which are custom CSRs. They hold the timer trigger value for timer interrupts. There is a slight deviation from the way these last two registers are described in the RISC-V manual, where they’re memory mapped. To avoid access delays causing missed interrupts, NekoIchi places them into two custom CSRs that can be accessed directly without stalls.

The way we can access the CSR registers is via the Zicsr instructions. We can read and swap their contents with a register, and set or clear some bits if we have enough privileges to do so. Since NekoIchi always works in machine mode, it bypasses the privileges and can always write to these registers except the first 6, as they’re counters and will get stomped over by the hardware anyways.
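The read/swap, set and clear behavior can be modeled on the host with a few lines of C. This is a simplified sketch, not NekoIchi’s implementation: csr_file, csr_store and the three helper functions are hypothetical names, and the model ignores details such as the special handling of a zero source register in the real instructions. Writes to the six counter addresses are dropped here to mirror the statement that hardware owns them.

```c
#include <stdint.h>

#define CSR_COUNT 4096

static uint32_t csr_file[CSR_COUNT]; /* hypothetical flat model of the CSR space */

/* The counters at 0xC00..0xC02 and 0xC80..0xC82 are driven by hardware. */
static int csr_is_counter(uint32_t addr)
{
    return (addr >= 0xC00 && addr <= 0xC02) || (addr >= 0xC80 && addr <= 0xC82);
}

static void csr_store(uint32_t addr, uint32_t value)
{
    if (!csr_is_counter(addr)) /* software writes to counters are ignored in this model */
        csr_file[addr] = value;
}

/* CSRRW: return the old value, replace it with rs1. */
static uint32_t csrrw(uint32_t addr, uint32_t rs1)
{
    uint32_t old = csr_file[addr];
    csr_store(addr, rs1);
    return old;
}

/* CSRRS: return the old value, set the bits that are set in rs1. */
static uint32_t csrrs(uint32_t addr, uint32_t rs1)
{
    uint32_t old = csr_file[addr];
    csr_store(addr, old | rs1);
    return old;
}

/* CSRRC: return the old value, clear the bits that are set in rs1. */
static uint32_t csrrc(uint32_t addr, uint32_t rs1)
{
    uint32_t old = csr_file[addr];
    csr_store(addr, old & ~rs1);
    return old;
}
```

All three return the previous contents, which is why a plain read can be expressed as a set operation with no bits selected.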

Interrupts

In machine mode, NekoIchi implements three distinct interrupt types:

  • Machine External Interrupt
  • Timer Interrupt
  • Debugger (soft) interrupts

Machine external interrupts, as the name suggests, come from an external source, which is the UART port in our case. There is a very small circuit in the device router module which monitors the incoming data FIFO, and keeps an IRQ line high as long as there’s data in the FIFO. Since we turn off interrupt handling while we service this interrupt, it won’t cause any re-entrancy issues.

Timer interrupts are software controlled, and are armed by loading the two custom registers, 0x800 and 0x801, with a 64-bit ‘future’ time value, which is when the interrupt should be triggered. Once that time arrives, our interrupt service routine will be called, where we’ll have to set the future time again if we wish to fire further timer interrupts. This mechanism is used by NekoIchi’s mini OS in the ROM_expterimental module to implement time-sliced multithreading.
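The arming and comparison logic can be sketched in C as follows. This is a host-side model under stated assumptions: timer_csrs_t, timer_arm and timer_pending are hypothetical names, and the greater-or-equal comparison is a guess at how the hardware decides that the future time has arrived.

```c
#include <stdint.h>

/* Stand-ins for the two custom CSRs at 0x800 (low) and 0x801 (high). */
typedef struct { uint32_t timecmp_lo; uint32_t timecmp_hi; } timer_csrs_t;

/* Split a 64-bit future time across the two 32-bit registers. */
static void timer_arm(timer_csrs_t *t, uint64_t now, uint64_t timeslice)
{
    uint64_t future = now + timeslice;
    t->timecmp_hi = (uint32_t)(future >> 32);
    t->timecmp_lo = (uint32_t)(future & 0xFFFFFFFFu);
}

/* Assumed trigger condition: the wall clock has reached the compare value. */
static int timer_pending(const timer_csrs_t *t, uint64_t now)
{
    uint64_t compare = ((uint64_t)t->timecmp_hi << 32) | t->timecmp_lo;
    return now >= compare;
}
```

A handler that wants periodic interrupts would call something like timer_arm again with the current time on every invocation, which is the re-arming step described above.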

Debugger interrupts are only triggered by the EBREAK instruction. NekoIchi currently listens to the machine external interrupt, reads any debugger commands, and backs up some instructions and replaces them with an EBREAK to set debugger breakpoints. In turn, when a breakpoint is hit, the handler can talk back to the debugger to manage the debugging session.
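The back up and replace step might look roughly like this in C. It is a sketch under assumptions: breakpoint_t and the two functions are hypothetical, memory is modeled as an array of 32-bit words, and only word-aligned, uncompressed instructions are considered. The constant 0x00100073 is the standard 32-bit encoding of EBREAK.

```c
#include <stdint.h>

#define EBREAK_INSTRUCTION 0x00100073u /* 32-bit encoding of EBREAK */

typedef struct {
    uint32_t address;           /* byte address of the patched instruction */
    uint32_t saved_instruction; /* original instruction word */
    int active;
} breakpoint_t;

/* Save the original instruction and patch an EBREAK in its place. */
static void breakpoint_set(breakpoint_t *bp, uint32_t *memory, uint32_t address)
{
    bp->address = address;
    bp->saved_instruction = memory[address / 4];
    bp->active = 1;
    memory[address / 4] = EBREAK_INSTRUCTION;
}

/* Put the original instruction back so execution can resume normally. */
static void breakpoint_clear(breakpoint_t *bp, uint32_t *memory)
{
    if (!bp->active)
        return;
    memory[bp->address / 4] = bp->saved_instruction;
    bp->active = 0;
}
```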

However, to get any of these working, we will have to use a CSR to enable machine interrupts; the MIE register. It has the following layout:

       11                   7                    3
[....][MEIE][_][SEIE][UEIE][MTIE][_][STIE][UTIE][MSIE][_][SSIE][USIE]

The only bits we’re interested in for NekoIchi are the 11/7/3 set (MEIE / MTIE / MSIE). They control, in order, the external, timer, and software interrupts. The rest of the bits control the same interrupt types at the supervisor and user privilege levels, which we do not use in the current architecture, and are ignored.
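Written out as C constants, the three bits look like this. The macro and function names are made up for illustration; only the bit positions come from the layout above.

```c
#include <stdint.h>

#define MIE_MSIE (1u << 3)  /* machine software interrupt enable */
#define MIE_MTIE (1u << 7)  /* machine timer interrupt enable */
#define MIE_MEIE (1u << 11) /* machine external interrupt enable */

/* Build a value for the MIE register from three on/off choices. */
static uint32_t mie_mask(int external, int timer, int software)
{
    uint32_t mask = 0;
    if (external) mask |= MIE_MEIE;
    if (timer)    mask |= MIE_MTIE;
    if (software) mask |= MIE_MSIE;
    return mask;
}
```

Enabling all three gives 0x888, which is the same value as (1 << 7) | (1 << 11) | (1 << 3).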

For example, let’s assume we wish to service all types of interrupts. Here’s how we might set up things initially:

// Choose a time in the future
uint64_t future = now + DEFAULT_TIMESLICE;
// Each CSR is 32 bits wide, so write the two halves separately
asm volatile("csrrw zero, 0x801, %0" :: "r" ((uint32_t)(future >> 32)));
asm volatile("csrrw zero, 0x800, %0" :: "r" ((uint32_t)(future & 0xFFFFFFFF)));

// Set the interrupt handler vector
asm volatile("csrrw zero, mtvec, %0" :: "r" (interrupt_handler));

// Enable machine interrupts
int mstatus = (1 << 3); // Bit 3 (MIE) of the MSTATUS register
asm volatile("csrrw zero, mstatus,%0" :: "r" (mstatus));

// Enable machine timer (bit 7), machine external (bit 11) and debug (bit 3) interrupts
int msie = (1 << 7) | (1 << 11) | (1 << 3);
asm volatile("csrrw zero, mie,%0" :: "r" (msie));

void __attribute__((interrupt("machine"))) interrupt_handler()
{
   // Get cause of the interrupt
   register uint32_t causedword;
   asm volatile("csrr %0, mcause" : "=r"(causedword));

   if (causedword==7) // Timer
      timer_interrupt();
   else if (causedword==3) // Breakpoint
      breakpoint_interrupt();
   else if (causedword==11) // External
      external_interrupt();
   // This function returns with an MRET to the MEPC, resuming code as usual
}

It’s quite straightforward, as we can see. However, in the real world one might want to make sure the registers saved on the stack by the function prologue are in a specific order so that we may implement multithreading easily. Currently NekoIchi uses __attribute__((naked)) instead of interrupt("machine") to do its own register management, which is a subject for future parts of this series.

Next

This concludes part 13. In part 14, we will go into further details about the system.