RISC-V CPU & CUSTOM GPU ON AN FPGA PART 16 – DDR3 ACCESS AND THE CACHE

BEFORE WE BEGIN

In this part we’ll start talking about adding a cache in front of the DDR3 module from the previous part.

But before we go further: since the files are getting too large to post here, we need to grab them from a git repository. Create and change into a folder of your preference with sufficient free disk space, since this is where we’ll be working for all parts of the series from now on, then head over to https://github.com/ecilasun/nekoichiarticle and pull the files for all parts of the series with:

git clone https://github.com/ecilasun/nekoichiarticle.git

Vivado DDR3 IP and access speed

The built-in IP that comes out of the box with the Vivado 2020 series is not really made for accessing 32-bit data one word at a time, but its performance improves when it applies batch operations instead. These batch operations are called ‘bursts’ and, as the name suggests, a burst applies a series of reads or writes in a row, back to back, without any intervention in between.

In our case, each burst is 128 bits wide. This means we can read or write a group of 4 DWORDs with one read or write command. This is also usually much faster than accessing individual DWORDs, as each separate read or write command carries extra clock cycles around it, for start/stop states and more.

The reason we have 128 bits in a row comes down to two things. First, the DDR3 device on the Arty A7 board is a 16-bit part, which means its addressing is aligned to 16-bit words and it can’t operate on a smaller unit. Second, the UI layer that interfaces us to the physical layer is wired for a fixed burst length of 8 words. Together, 8 words of 16 bits each make 128-bit reads and writes possible.

However, a read or write this wide is pretty useless by itself without a second mechanism around it that can take advantage of wide operations. The reason we lose speed in this plain mode, without that second mechanism, is that each 32-bit access wastes the other 96 bits of the burst, so a simple linear memory operation such as writing zeros to a region of memory ends up re-reading or re-writing the same 128 bits four times in a row.
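
As a concrete example, filling a 4 KB region with zeros one DWORD at a time issues 1024 separate bursts, each carrying only 32 useful bits out of 128; handled a full 128-bit line at a time, the same region needs only 256 bursts, each of them fully used.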

Simple direct mapping cache

To build on the burst mechanics above, we can add one of several types of units that cache the incoming and outgoing data in a faster type of memory: the block RAM.

The simplest one we can use is called a Direct Mapped Cache, which will work as follows:

  • First, a cache line size and a cache line count are decided
  • Then, memory is treated as divided into slices of (cache line size x cache line count) bytes
  • An access within any of these ‘slices’ is mapped onto our cache region
  • Two different memory slices can end up competing for the same cache line
  • We resolve such conflicts using one very simple eviction rule

To show this on a simple diagram:

The right-hand side of the figure is our memory, and the left-hand side is our cache. Each cache LINE holds four DWORD slots (0-3) that we’ll index using an OFFSET; these represent the 4 contiguous DWORDs brought in from memory. The TAG tells us which memory slice is currently associated with the cache on any given LINE.

This mechanism allows us to associate the 128-bit data at a given memory slice’s line with a line in the cache, and we later use the OFFSET to select which DWORD of those 128 bits we want. The association can overlap for two slices of memory though, as you can see above: the blue and yellow lines on the right can both map to the yellow line on the cache, in which case one of them must be evicted to make room for the other.

The cache does not only contain data though; we need some means of knowing which line from memory is currently in the cache, by storing its TAG in a second array which we’ll call the tag memory. Furthermore, to know whether we’ve written to a cache line (so that we can write it back to main memory later), we need a one-bit array of ‘dirty’ values.
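
Though the snippets below don’t show the backing storage itself, it could be declared roughly like this, using the sizes implied by the address split that follows (256 lines of 128 bits each, 16-bit tags); names and exact types here are a sketch rather than the actual module’s declarations:

// Cache storage (a sketch; sizes follow the address split below)
logic [127:0] cache[0:255];     // line data: 4 DWORDs per line
logic [15:0] cachetags[0:255];  // which memory slice each line currently holds
logic [255:0] cachedirty;       // one dirty bit per line, set on writes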

All of the TAG, LINE and OFFSET values come from our memory address as shown below:

// The division of address into cache, device and byte index data is as follows
// Most significant bit is on the left
// device  tag              line     offset  byteindex
// 0000    0000000000000000 00000000 00      00
// 'device' and 'byteindex' are unused for the cache

// Memory address to tag/line/offset:

wire [15:0] ctag = busaddress[27:12]; // 65536 cache pages (slices)
wire [7:0] cline = busaddress[11:4]; // 256 cache lines
wire [1:0] coffset = busaddress[3:2]; // 4 DWORDS

// Tag/line/offset to 128bit aligned memory address:

memaddress = {1'b0, ctag, cline, 3'b000};

Using the above mappings, we can now go from a given 32-bit memory address on the bus to a cache cell and its associated 128-bit memory address on the DDR3.
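
As a quick sanity check, here is how one (arbitrarily chosen) bus address decomposes under this scheme:

// Example: busaddress = 32'h0000_1234
// ctag    = busaddress[27:12] = 16'h0001
// cline   = busaddress[11:4]  = 8'h23
// coffset = busaddress[3:2]   = 2'b01 (second DWORD within the line)
// memaddress = {1'b0, 16'h0001, 8'h23, 3'b000} = 28'h000_0918
// i.e., assuming memaddress is counted in 16-bit words, this is the byte
// address halved and aligned down to the 8-word burst boundary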

Cache operation

But how does our cache work, once we’ve found the TAG/LINE/OFFSET for a memory request?

Let’s assume we have a small state machine that handles the read/write requests coming in from our CPU or another device. The first thing we need to do is to check whether the request already lives within a cache line before we try to read or write anything.

There are two cases here: a read or write request either lives in the cache or it doesn’t. To check, we calculate the TAG/LINE/OFFSET values from the incoming address as shown above, then use the LINE to index our tag memory and see whether the TAG stored there matches the TAG of the incoming address:

wire cachehit = (cachetags[cline] == ctag) ? 1'b1 : 1'b0;

The next step depends on the result of the above comparison. Assuming the line we wish to read (or write) is already in the cache, we can directly return the value at the given cache offset, or modify it. However, if we modify the cache for any reason at all, we need to mark the modified line as dirty by storing a 1 in the dirty array.
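
On a hit, the read path is little more than a multiplexer on coffset. A minimal sketch (busdataout and cachedirty are illustrative names, not taken from the article’s code):

if (cachehit) begin
  // Read hit: return the requested DWORD out of the 128-bit line
  case (coffset)
    2'b00: busdataout <= cache[cline][31:0];
    2'b01: busdataout <= cache[cline][63:32];
    2'b10: busdataout <= cache[cline][95:64];
    2'b11: busdataout <= cache[cline][127:96];
  endcase
  // Write hit: merge busdatain into the line (see the masked write
  // further below) and remember that this line now differs from DDR3:
  // cachedirty[cline] <= 1'b1;
end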

Now the fun part. We might have missed the cache according to the above comparison (that is, the TAG of the current memory address does not match the one stored in our tag memory). In this case, we need to follow a set of rules to ensure healthy operation of our system so it doesn’t accidentally lose any data. We can simplify this to a very short list:

  • If the cache line is ‘dirty’, first evict its contents to main memory by writing the whole 128-bit line back to its old address {old tag, line}, and clear the dirty bit
  • Whether the line was dirty or not (and always after any eviction write has completed), read the new contents from {new tag, line} and update the tag memory with the new tag, as sketched below
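
A minimal sketch of how these two rules might translate into logic; the ddr3read*/ddr3write* signals are illustrative placeholders for whatever handshake the actual DDR3 command module exposes, and in the real design the two steps are sequenced across states while the bus request stalls:

if (~cachehit) begin
  if (cachedirty[cline]) begin
    // Step 1: evict the old line back to its home address {old tag, line}
    ddr3writeaddress <= {1'b0, cachetags[cline], cline, 3'b000};
    ddr3writedata <= cache[cline];
    ddr3writerequest <= 1'b1;
    cachedirty[cline] <= 1'b0;
  end
  // Step 2: fetch the requested line from {new tag, line}
  ddr3readaddress <= {1'b0, ctag, cline, 3'b000};
  ddr3readrequest <= 1'b1;
  cachetags[cline] <= ctag;
  // Once the read data arrives: cache[cline] <= ddr3readdata;
  // the original bus request is then retried and hits the cache
end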

And that’s all there is to making a very simple cache work. The shortest path is when we hit the cache, in which case the result is available on the next clock; if we don’t hit the cache we take a small detour and the value is available several (maybe a dozen) cycles later. This is why a cache miss is a very expensive thing, and why randomly thrashing around memory is not a great idea on any device.

Modifying DDR3 interface

Our DDR3 interface from part 15 was doing a very naive thing, applying an individual read or write operation for every 32-bit value and always going out to memory. The new interface uses the cache described above to handle memory requests faster, but there is one trick it does for updating parts of memory that I haven’t mentioned yet.

Assume for a second that we wish to write a single byte to memory. We would ordinarily do this by providing a write mask (a bitmask showing which bytes out of a DWORD end up in memory). However, this mask is too narrow for our current purposes, so we first expand it to a full 32-bit mask that we can use to mask our incoming DWORD and keep only the parts to be written to memory:

wire [31:0] cwidemask = {{8{buswe[3]}}, {8{buswe[2]}}, {8{buswe[1]}}, {8{buswe[0]}}};

This wide mask by itself is not sufficient; we also need its complement, which clears out the bytes of the existing cache line that are about to be overwritten:

wire [31:0] cwidemaskn = {{8{~buswe[3]}}, {8{~buswe[2]}}, {8{~buswe[1]}}, {8{~buswe[0]}}};
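
For example, a store to the second byte of a DWORD arrives with buswe = 4'b0010 and the two masks expand to:

// buswe      = 4'b0010
// cwidemask  = 32'h0000FF00  (the byte being written)
// cwidemaskn = 32'hFFFF00FF  (the bytes to be preserved)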

Now, any time we wish to apply a write mask to a cache line, we can simply do the following, which first selects the correct DWORD to modify out of the four stored, then combines the existing value there with the masked parts of the incoming data:

case (coffset)
  2'b00: cache[cline][31:0] <= (cwidemaskn&cache[cline][31:0]) | (cwidemask&busdatain);
  2'b01: cache[cline][63:32] <= (cwidemaskn&cache[cline][63:32]) | (cwidemask&busdatain);
  2'b10: cache[cline][95:64] <= (cwidemaskn&cache[cline][95:64]) | (cwidemask&busdatain);
  2'b11: cache[cline][127:96] <= (cwidemaskn&cache[cline][127:96]) | (cwidemask&busdatain);
endcase
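
Note that for a full-DWORD store (buswe = 4'b1111) cwidemaskn becomes all zeros, so the expression above collapses to simply storing busdatain, and cached reads need no masking at all.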

Performance improvement

We can now look at some side-by-side figures to see how much of a performance gain this brings us:

Before adding our cache: 1.442 Mbps write speed
After adding our cache: 9.909 Mbps write speed

This is a great deal better, at roughly a 6.9x speed improvement.

But surely we can do better than this. The NekoIchi architecture currently uses only one cache for both code and data, to keep the code simple. We could start by splitting our cache into two, one for code access and one for data access, so that they don’t interfere with each other. However, this is a bit tricky: if code is read and modified through the data side, the change has to be reflected in the code cache as well, possibly stalling the CPU. We’ll avoid doing this for now.

Another way to optimize further is to adjust our TAG/LINE/OFFSET widths so that each line hosts perhaps more than 4 DWORDs, but then we’d need to start measuring the performance benefits and losses of such a change (due to the fixed 8x16-bit burst of the UI interface, we might need to switch to the AXI4 bus version of the IP).

Also, as a small victory run, I’ve ported DOOM 1.10 over to NekoIchi and run it to see the cache in action. You can check out a video of it here:

Next

Now that we’re closer to an actual, proper system, with some I/O, a screen, and a means to access the entire available memory, we can start thinking about more things to add. I’ll leave that open for now since I haven’t decided yet. Until next time.