RISC-V CPU & CUSTOM GPU ON AN FPGA PART 12 – VIDEO OUTPUT AND THE GPU

BEFORE WE BEGIN

In this part we’ll talk about video output and the GPU, and the intermediate devices needed to get them working.

But before we go further, and since the files are getting too large to post here, we need to grab them from a git repository. Please head over to https://github.com/ecilasun/nekoichiarticle, create and change into a folder of your preference with sufficient free disk space (this is where we’ll be working on all parts of the series from now on), then use the following command to pull the files for all parts of the series:

git clone https://github.com/ecilasun/nekoichiarticle.git

NOTE: A small change to the project, starting in this part, brings us a nicely arranged include file, namely cpuops.vh, which contains all the instruction groups and CPU states so we can refer to them from the decoder and the CPU itself without having to pollute our code with inline defines.

Adding a second port to our system memory

To allow for GPU reads and writes to the CPU memory, we need to add a second port to our system RAM device. To achieve this, we’ll change its type to a ‘True Dual Port RAM’.

A true dual port RAM allows for access to the internal storage from two different clock domains, with their independent address, data and read/write signals. In effect, this allows the CPU to keep reading instructions and writing data to the system memory as if nothing else is there, while the GPU is also free to do the same. We use this mechanism to allow the GPU to read bitmap data out from the system memory and write it to the current frame buffer, and also let the GPU write data back to the system memory to let the CPU know that it’s past a certain point in its command queue. The latter behavior can be used to let the CPU do other work while the GPU is busy to overlap their execution in a more efficient way, or to simply provide flicker free graphics.
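For reference, a true dual port block RAM in SystemVerilog looks something like the minimal sketch below. The module and signal names here are mine rather than the ones in the NekoIchi sources, but Vivado will happily infer a block RAM with an independent clock per port from a description along these lines:

// Minimal sketch of a true dual port RAM with independent clocks.
// Port A is meant for the CPU, port B for the GPU; names are illustrative.
module dualportram #(
	parameter ADDRWIDTH = 16,
	parameter DATAWIDTH = 32 )
(
	// Port A (e.g. CPU side)
	input wire clka,
	input wire wea,
	input wire [ADDRWIDTH-1:0] addra,
	input wire [DATAWIDTH-1:0] dina,
	output logic [DATAWIDTH-1:0] douta,
	// Port B (e.g. GPU side)
	input wire clkb,
	input wire web,
	input wire [ADDRWIDTH-1:0] addrb,
	input wire [DATAWIDTH-1:0] dinb,
	output logic [DATAWIDTH-1:0] doutb );

logic [DATAWIDTH-1:0] mem [0:(1<<ADDRWIDTH)-1];

// Each port lives in its own clock domain, with its own write enable
always_ff @(posedge clka) begin
	if (wea) mem[addra] <= dina;
	douta <= mem[addra];
end

always_ff @(posedge clkb) begin
	if (web) mem[addrb] <= dinb;
	doutb <= mem[addrb];
end

endmodule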

Video frame buffers and the video scan output circuit

Before we can build a GPU, we need a mechanism that allows us to drive some external video memory. There will be two copies of these units, one open for writes by the GPU, one being scanned out to the DVI port for us to see.

The layout of our video memory will be small and simple. It will be a 256×192 pixel 2D array, with each pixel consisting of an 8 bit value. This 8 bit value uses a 3-3-2 bit layout that stores a narrow band of RGB values. This arrangement gives us 256 distinct colors (8×8×4), quantized into the 3-3-2 bit color space. We could have used a paletted approach to give us more precise control over the output values, and we will probably allow for that in the future by extending our GPU. For now, we’ll be working with this single color space.
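For reference, converting a full 8 bit per channel RGB value to this format simply keeps the top bits of each channel. A minimal sketch (the function name is mine):

// Pack 8-bit-per-channel RGB into the 3-3-2 pixel format by keeping
// only the most significant bits of each channel (3 red, 3 green, 2 blue)
function automatic [7:0] PackRGB332(input [7:0] r, input [7:0] g, input [7:0] b);
	PackRGB332 = {r[7:5], g[7:5], b[7:6]};
endfunction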

The memory layout of our 256×192 buffers is also pretty straightforward: 256 bytes for each row, all adjacent to each other in address space. To address a single pixel with a known x/y coordinate pair, all we need to do is compute this value:

// In C terms, this would be
relativeaddress = x + (y<<8);

// Which in Verilog is a concatenation of two values, therefore involves no math
logic [7:0] x;
logic [7:0] y;
relativeaddress = {y, x};

However, even though addressing in this layout is pretty simple, writing these pixels one at a time to clear the screen can become somewhat tedious.

If we look at the vertical pixel count (192), we can see that it’s a multiple of 16 pixels. This is quite convenient, as that’s exactly a slice of 256×16 (4096) bytes of memory. Since we’re building hardware, we can choose to divide our screen into 12 such 4K blocks, and add a ‘force enable’ line to each.

By adding 12x4K blocks, and allowing for writes to them simultaneously using this force enable line (which we’ll call the lane mask in our GPU code), we can loop over only 4K pixels, yet simultaneously write to each slice, thus clearing the VRAM 12x faster. This approach is currently only used by the screen clear command in NekoIchi, but will help handle parallel rasterization in the future as well.
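As a rough sketch of the idea (the signal names are mine, not the ones in the GPU sources), the per-slice write enables could be generated like this, with the lane mask forcing a slice on regardless of which slice the write address selects:

// Illustrative per-slice write enable generation for the 12 VRAM slices.
// 'lanemask' force-enables a slice; the screen clear sets all 12 bits.
logic [11:0] lanemask;     // one bit per 4K slice
logic [3:0] writeslice;    // slice selected by the write address (0..11)
logic writeenable;         // incoming write request
logic [11:0] slicewe;      // resulting per-slice write enables

always_comb begin
	for (int i = 0; i < 12; i = i + 1)
		slicewe[i] = writeenable & (lanemask[i] | (writeslice == i));
end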

As mentioned earlier, there are two sets of these 12 slices, for a total of 24 slices. The first 12 are denoted ‘page 0’ and the rest make up ‘page 1’ of our video memory circuitry. The GPU controls the line that selects either of these pages, named ‘videopage’ in the circuit.

Briefly, the videopage selection works as follows:

  • The GPU selects a page using videopage
  • This will be picked up by two identical video memory circuits, one of which will turn off writes to itself, while the other enables writes
  • The circuit that disabled the writes will now become the video scan out page

In effect this allows us to double-buffer our writes so that the video scan out does not show individual write operations since it’s displaying the other page at the time. By including a vertical synchronization wait event together with this mechanism, we can provide flicker-free graphics to the user.
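A minimal sketch of that steering (signal names like gpuwrite, page0pixel and page1pixel are mine, and the actual wiring in videocontroller.sv differs in detail) might look like this:

// Illustrative double-buffer steering: GPU writes go to the page selected
// by 'videopage', while scan-out always reads from the other page
wire page0we = gpuwrite & (videopage == 1'b0);
wire page1we = gpuwrite & (videopage == 1'b1);

// The displayed pixel comes from whichever page is currently not written to
wire [7:0] scanoutpixel = (videopage == 1'b0) ? page1pixel : page0pixel;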

Scanline caching

The source for the frame buffers lives in the videocontroller.sv file, where you’ll also find the mechanism for scanline caching.

What the scanline cache does is crucial for correct video display. First, let’s take a second to look at what our video output looks like.

The white 512×384 region inside the black 640×480 region is our scaled up frame buffer. We read each pixel, and scan it out twice, for two scan rows inside this region. The black border around this region shows no image, since we don’t have sufficient memory for video output here.

If you’re using the 1BitSquared DVI PMOD, you will notice that this region does not appear in the final output image, and all we get is a scaled up 512×384 region on the display. This is due to the fact that we enable the DVI_DE (data enable) bit only inside the white region, causing the final hardware to scale the incoming image. If you’re still able to see this border, that means your display device does not zoom in to the data enable region, which is equally fine, and you’ll be presented with a slightly smaller image centered on the screen with a black border around it.

Back to the matter at hand. As you can see from the above image, there’s a red region marked ‘Fill Cache’ which is responsible for populating the scanline cache with the current scanline’s data from memory. The reason it does this is actually quite simple. Since the pixel counter is tied to a clocked circuit, we can only see the current pixel’s value on the next clock, which is too late to output that pixel’s color, plus we get one more clock of latency before the block memory read becomes visible. By caching things ahead of time into an array of 64x32bit registers (256 pixels worth), we can guarantee that there is no noticeable read latency, and we can pull the current pixel’s data from these registers using a combinatorial circuit, without being tied to a clock. This way we get no half- or one-pixel-off artifacts on our video output.
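A sketch of the caching idea, with made-up signal names (videoclock, fillcache, fillindex, vramreaddata and pixelx are all stand-ins here): during the ‘Fill Cache’ region we load one 32 bit word per clock into a 64 entry register array, and during scan-out the pixel is picked out of those registers combinatorially:

// Illustrative scanline cache: 64 x 32 bits = 256 one-byte pixels
logic [31:0] scanlinecache [0:63];

// Fill phase: one 32-bit VRAM word per clock while inside the 'Fill Cache' region
always_ff @(posedge videoclock) begin
	if (fillcache)
		scanlinecache[fillindex] <= vramreaddata;
end

// Scan-out phase: pick the pixel combinatorially, so there is no extra read latency.
// pixelx[7:2] selects the 32-bit word, pixelx[1:0] selects the byte within it.
wire [31:0] cacheword = scanlinecache[pixelx[7:2]];
wire [7:0] pixelcolor = cacheword[{pixelx[1:0], 3'b000} +: 8];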

GPU command queue and CPU/GPU synchronization

As we saw above, the GPU is supposed to control which video page is open for writes and which one is the currently displayed one. But how does the GPU know how to choose them? Simply, it doesn’t. We’re supposed to drive the GPU from our CPU by some means to get it to do the right thing.

For this, we will use a clock domain crossing FIFO as we did with the UART circuit. It will be an independent clock FIFO, with one port written at CPU clock speed and the other port read at GPU clock speed. The GPU, when in its IDLE state, will check to see if there are any pending commands in the FIFO and go process them, while the CPU writes more. If we don’t synchronize the CPU to the GPU busy state, we’ll quickly fill the FIFO and stall our CPU, therefore we use the mechanism depicted below to balance things.
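On the GPU side, the IDLE state simply watches the FIFO’s empty flag and pops a command when one arrives. A rough sketch of that loop (the state and port names gpumode, fifoempty, fifore and fifodout are mine, though they match what a typical asynchronous FIFO exposes):

// Illustrative GPU idle loop popping commands from the CDC FIFO
always_ff @(posedge gpuclock) begin
	fifore <= 1'b0;
	case (gpumode)
		GPUIDLE: begin
			if (~fifoempty) begin
				fifore <= 1'b1;       // pop one command word from the FIFO
				gpumode <= GPUEXEC;   // go decode and execute it
			end
		end
		GPUEXEC: begin
			// ... decode fifodout, run the command, then return to idle
			gpumode <= GPUIDLE;
		end
	endcase
end

That covers getting commands across the clock domains; the balancing between the CPU and the GPU happens on top of this.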

But how does this work?

Assume for a second that we have two values, ‘gpustate’ and a counter variable ‘cnt’. They are both initialized to the same value at the start of the frame. If they are equal, we increment the ‘cnt’ value, push a list of draw commands to the GPU queue, insert a vsync wait command for the GPU, then ask the GPU to write this new incremented cnt value over the address where gpustate lives, and also reset gpustate to zero on the CPU side.

While the GPU is busy, these counters will be different from each other, and at some point after the vsync wait the GPU will overwrite the ‘gpustate’ value with the new counter, allowing us to send another frame since they’re now equal once more. Let’s see it in code to make it a bit clearer. This is a core loop from any application running on NekoIchi that wants to implement GPU/CPU synchronization and flicker free drawing:

#include <cstdint> // for uint32_t
#include "gpu.h"

int main(int argc, char ** argv)
{
   volatile unsigned int gpustate = 0x00000000;
   unsigned int cnt = 0x00000000;
   // Set initial video page
   uint32_t page = 0;
   GPUSetRegister(6, page);
   GPUSetVideoPage(6);

   while(1)
   {
      if (gpustate == cnt) // GPU work complete, push more
      {
         ++cnt;
         // Draw the current frame by sending commands to GPU FIFO
         DrawSomething();
         // Add a stall command to GPU until vsync is reached
         GPUWaitForVsync();
         // Swap to new video page after vsync
         page = (page+1)%2;
         GPUSetRegister(1, page);
         GPUSetVideoPage(1);
         // GPU status address in G1
         uint32_t gpustateDWORDaligned = uint32_t(&gpustate);
         GPUSetRegister(1, gpustateDWORDaligned);
         // Write 'end of processing' from GPU so that CPU can resume its work
         GPUSetRegister(2, cnt);
         GPUWriteSystemMemory(2, 1);
         // Reset gpustate; GPU will overwrite it
         gpustate = 0;
      }
   }
   return 0;
}

Connecting the DVI PMOD

For NekoIchi’s default setup, we need to use the 1BitSquared 12bit DVI PMOD. Please make sure you have this attached to the leftmost two PMOD ports (A and B). You can then either attach a small LCD panel, or use an HDMI cable to drive an external display such as a monitor or a capture device.

The rest of the configuration is already done in the code, therefore the pins of PMOD ports A and B are already wired to the device.

Here’s a version where I use a small LCD panel directly attached to the Arty board using a u-shaped HDMI bridge, which cuts the need for all the extra HDMI cables:

HEADS UP: If you have ordered the DVI PMOD, it will more than likely arrive with its connectors unsoldered, so if that’s the case be prepared for some very light soldering work. I used a hot air gun and some liquid paste to put it together in a few minutes, but your mileage may vary. If you’re not into soldering, I’ll provide a VGA PMOD version in later articles to make life easier, so you might want to skip ahead to other sections before the GPU. Also note that the video output is entirely optional: if there’s no module connected, simply nothing will happen, and even though the GPU will still do the work, your only output will be via UART or any other device you might devise on your own.

Testing the GPU

First, we’ll need to attach the 1BitSquared 12bit DVI PMOD to the two leftmost ports on the Arty A7 board. These pins have already been assigned in the project files for this part, so we don’t need to do anything more here.

If you have the files for part 12 from the git repo, and the riscvtool, we can test the GPU using one of our test samples. However, since we don’t have any hardware math support yet, we’ll have to edit the build.sh script and change the following line:

riscv64-unknown-elf-g++ -o gpupipetest.elf test/gpupipetest.cpp test/utils.cpp test/console.cpp -fno-builtin -mcmodel=medany -std=c++11 -Wall -Ofast -march=rv32imf -mabi=ilp32f -ffunction-sections -fdata-sections -Wl,-gc-sections -fPIC -lgcc -lm

to this, in order to turn off hardware integer and float math and use software versions instead:

riscv64-unknown-elf-g++ -o gpupipetest.elf test/gpupipetest.cpp test/utils.cpp test/console.cpp -fno-builtin -mcmodel=medany -std=c++11 -Wall -Ofast -march=rv32i -mabi=ilp32 -ffunction-sections -fdata-sections -Wl,-gc-sections -fPIC -lgcc -lm

and then run this command to build it:

build.sh

For this part, we’ll need to build a new .bit file to program our device, as in the previous part. Please refer to the previous article, or simply open the part12 project file and use the Generate Bitstream shortcut in the Flow Navigator. Once the bitstream is built, open the Hardware Manager and choose Open Target/Auto Connect, then either right click your xc7a100t device and choose Program Device, or right click the s25fl128 device and choose Program Configuration Memory Device. Respectively, choose either the .bit file or the .bin file to match your choice, which will program your device (don’t forget to reset the board if you’ve programmed the configuration memory device).

Once the device is alive (which you can tell by a green tinted background in the video output), we can run our GPU test on it. However, note that this one will run very slowly since we don’t have (and have turned off) any hardware math operations, which the particles in this demo use. But for a first test, it will be sufficient:

# Replace the /dev/ttyUSB1 part with your USB port connected to the Arty board
./build/release/riscvtool gpupipetest.elf -sendelf 0x10000 /dev/ttyUSB1

If everything went well, you’ll be presented with (a very slow version of) the GPU pipeline test, showing small rotating triangles, a large triangle pair at a fixed angle, and some text on screen. The one that shows ‘time’ will probably show an unusual, fixed number for you, since we do not have any clock circuit in our device yet.

Another example you can run (again, incredibly slowly for now) is the Mandelbrot sample. Simply replace the rv32imf/ilp32f pair with rv32i/ilp32 as before, and you’ll have a software emulated floating point library drawing a Mandelbrot set quite inefficiently:

DMA

The GPU has a mechanism to move bytes from a SYSRAM address onto the VRAM, using the second port of the system memory module.

The source and destination addresses are set up in GPU registers, and a DMA operation is queued up. The GPU will then, upon receiving this request, go to a DMA loop, moving 32 bits at a time. Optionally, it can generate a write mask automatically using the bit pattern of this 32 bit word, effectively skipping the write any time it sees a zero. This is especially useful for sprites with transparent bits, and for font rendering onto different backgrounds.
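A sketch of how such a mask could be generated (the names dmaword and dmawritemask are mine): any byte of the 32 bit source word that is zero disables the write for the corresponding destination byte.

// Illustrative auto write mask for DMA: a zero byte in the 32-bit source
// word masks off the corresponding byte write, leaving the destination untouched
wire [3:0] dmawritemask = { |dmaword[31:24],   // pixel 3
                            |dmaword[23:16],   // pixel 2
                            |dmaword[15:8],    // pixel 1
                            |dmaword[7:0] };   // pixel 0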

Apart from the auto-mask generation, there are features planned such as auto-row skipping, to avoid sending individual commands for each scanline of the DMA operation. This will effectively allow the GPU to DMA a rectangular region without any assistance, as long as it knows the memory stride between source rows and the stride between target rows.

Another use of the DMA is to implement small frame buffers that one can blit onto the actual video output buffer, in case the CPU wants to draw some pre-generated bitmaps or do some software rasterization work. Using the auto-bit-masking, it’s also quite easy to overlay any UI elements generated on the CPU side from an offscreen buffer onto the VRAM.

Fast clear

As mentioned in the section on the 12-way sliced nature of the VRAM, the GPU handles fast VRAM clears using the lane mask. If there is a clear command in the queue, the GPU, after seeing it, will enable all lanes for forced writes, set the clear color, and loop 0x1000 (4096) times to clear the VRAM quite quickly. It will then turn off the forced write mask for all lanes and resume normal operation.
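As a rough sketch (the state and signal names are mine, not the actual GPU code), the clear state could look something like this inside the GPU state machine:

// Illustrative fast clear state: force all 12 lanes on, write the clear
// color for 0x1000 iterations, then release the lane mask and go idle
GPUCLEAR: begin
	lanemask <= 12'hFFF;             // force-enable every slice
	vramwritedata <= clearcolor;     // clear color for this write
	vramaddr <= clearcounter;
	if (clearcounter == 12'hFFF) begin
		lanemask <= 12'h000;         // done: release the forced writes
		gpumode <= GPUIDLE;          // resume normal operation
	end else begin
		clearcounter <= clearcounter + 12'd1;
	end
end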

Direct system memory write from GPU

As mentioned in the synchronization section, the GPU is capable of writing to system memory, using any DWORD aligned pointer, as soon as it processes the write command. The only thing that the GPU can’t do (currently) is to write to the memory mapped devices, as this would cause quite a bit of havoc in the CPU execution path.

The way the GPU will do this in the future is to use more queues to stash future work in, once we come to the point where we’re implementing more advanced rasterization methods and per-pixel programmability.

Triangle rasterization

The rasterizer in NekoIchi’s GPU is currently quite primitive compared to modern GPUs. It implements a very simple algorithm:

  • Store 16bit signed vertex data for the current triangle in the GPU registers, alongside with a single color
  • Using combinatorial circuitry, select min/max bounds, creating an AABB
  • Make sure the bounds are clipped to screen bounds, and X coordinates are multiples of 4 pixels
  • Set up the ‘scan’ cursor to the first 4×1 pixel block in the rectangle covered by the triangle
  • This generates a 4×1 pixel write mask for VRAM
  • Write the single color, replicated to 32bits, using the write mask, and move the tile cursor
  • Once we reach the end of current AABB row, step to the next row and go to the leftmost position
  • Stop once we reach the last tile of the last row

Here we can see the mask generation module, along with how it’s driven across the 4×1 pixel tile for one of the triangle edges:

module LineRasterMask(
	input wire reset,
	input wire signed [15:0] pX,
	input wire signed [15:0] pY,
	input wire signed [15:0] x0,
	input wire signed [15:0] y0,
	input wire signed [15:0] x1,
	input wire signed [15:0] y1,
	output wire outmask );
logic signed [31:0] lineedge;
// Edge equation components
wire signed [15:0] A = (pY-y0);
wire signed [15:0] B = (pX-x0);
wire signed [15:0] dy = (y1-y0);
wire signed [15:0] dx = (x0-x1);
always_comb begin
	if (reset) begin
		// No reset logic required yet; assign a default so no latch is inferred
		lineedge = 32'sd0;
	end else begin
		// Calculate the edge equation
		lineedge = A*dx + B*dy;
	end
end
// We only care about the sign bit, rest is not required (yet)
assign outmask = lineedge[31];
endmodule

// We can now use it to generate a 4x1 mask for one of the vertices
// tileX0 and tileY0 are the coordinates of the current 4x1 tile being
// scanned by the sweep logic. As soon as this input changes, we
// get a new mask due to the combinatorial nature of this circuit,
// ready to use in the nearest available clock.
LineRasterMask m0(reset, tileX0,         tileY0, x0,y0, x1,y1, edgemask[0]);
LineRasterMask m1(reset, tileX0+16'sd1,  tileY0, x0,y0, x1,y1, edgemask[1]);
LineRasterMask m2(reset, tileX0+16'sd2,  tileY0, x0,y0, x1,y1, edgemask[2]);
LineRasterMask m3(reset, tileX0+16'sd3,  tileY0, x0,y0, x1,y1, edgemask[3]);
//... 4x1 mask generators for the other two edges go here

// We can then combine 3 of these edge masks together to create our write mask
// If and only if all edges have negative sign on a pixel of the 4x1 tile
// we create a 'true' to mark a write enable. Other pixels that yield zero
// are outside of the primitive, or the primitive is backfacing, and are discarded
assign tilecoverage = edgemask[3:0] & edgemask[7:4] & edgemask[11:8];
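The sweep over the bounding box is then just a cursor walking the AABB in 4 pixel steps. As a rough sketch (the names here are mine, and the actual state machine in the GPU differs):

// Illustrative tile cursor sweep over the triangle's AABB, 4 pixels at a time.
// At each position, tilecoverage above is used as the byte write mask into VRAM.
// lastTileX / lastTileY mark the final 4x1 tile of a row and the final row.
always_ff @(posedge gpuclock) begin
	if (rasterbusy) begin
		if (tileX0 == lastTileX) begin
			// End of this AABB row: back to the left edge, one row down
			tileX0 <= minX;
			tileY0 <= tileY0 + 16'sd1;
			if (tileY0 == lastTileY)
				rasterbusy <= 1'b0;    // last tile of the last row: done
		end else begin
			tileX0 <= tileX0 + 16'sd4; // next 4x1 tile on the same row
		end
	end
end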

As you can see, it currently doesn’t do rectangular macro/micro tiles, nor does it run in parallel across the 12 VRAM slices, and no barycentric coordinates are generated. In addition we don’t have a bi-directional sweep rasterizer, which would be even more ideal since it would skip empty spaces efficiently. This is all for convenience and to keep the GPU simple while we focus on the overall design, and it will be improved upon gradually as we progress.

Next

This concludes part 12. We will dive a little bit more into the system in the next parts.