tinysys part three: OS / hardware interop and the kernel
Previously on tinysys
Last time I gave some overviews of tinysys to slowly introduce the system, but I haven't yet fully covered what I am aiming to accomplish with a custom OS and a custom system design.
In this part I will start doing that, and go bit by bit until we get a full overview of what and why everything is the way it is.
My design approach here was that the hardware should be complementary to the software that I wanted to write, or rather to how I wanted to write it. In other words, I didn't design some hardware around a CPU and then try to write an OS for it.
The hardware layout
If one must, tinysys can be squeezed into a rough block diagram as follows:
All of the units on the left-hand side have SDRAM access (cached or direct read/write), whereas the right-hand side devices are only accessible via the D$ of the two CPUs. (The D$ is smart; it knows that it has to do uncached accesses to these devices and will directly grab an arbiter line to talk to them.)
If you noticed an oddity in the above diagram: yes, the two CSRs (control and status register files) are outside the CPU fabric and are mapped onto the uncached memory device address space. This allows a CPU to read/write directly into the other CPU's CSR file, which is quite handy. If you recall from the previous post, this mapping is used to prepare the CPUs for reboot, whereupon they'll find their entry point addresses written to these CSRs.
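To make this concrete, here's a minimal sketch of what poking the other CPU's memory-mapped CSR file could look like from C. The base address and the CSR slot index are placeholders I made up for illustration; they are not the real tinysys memory map.

```c
#include <stdint.h>

/* Hypothetical sketch: each CPU's CSR file is mapped into the uncached
 * device address space, so one hart can read/write the other's CSRs.
 * CSR1_BASE and CSR_ENTRY_SLOT are placeholder values, not the real
 * tinysys addresses. */
#define CSR1_BASE      ((volatile uint32_t *)0x80009000u) /* placeholder */
#define CSR_ENTRY_SLOT 0x7B1u                             /* placeholder index */

/* Write CPU 1's post-reboot entry point into its memory-mapped CSR
 * file; after a reboot request, CPU 1 branches to the address it
 * finds there. */
static void set_entry_point(volatile uint32_t *csr_file, uint32_t entry)
{
    csr_file[CSR_ENTRY_SLOT] = entry;
}
```

In real use this would be called as `set_entry_point(CSR1_BASE, entry_addr)` from CPU 0 just before asking CPU 1 to reboot.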
Another oddity here is the device called 'MAIL', which is not actually a mailbox as one might guess. It used to be one, but I found that I don't need further interrupt logic there, so it's essentially an uncached memory area which all CPUs can read/write, and this is where all the task context data (including saved task registers) resides. This allows the D$ on both CPUs to stay intact during task switching.
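As an illustration of why this is convenient, here is a guess at what a task context living in the MAIL region might look like. The field layout is invented for this sketch, not the actual tinysys structure; the point is that with zfinx there is no separate FP register file to save, and because the region is uncached, a context stored by one CPU is immediately visible to the other with no cache maintenance.

```c
#include <stdint.h>

/* Illustrative layout of a task context in the shared, uncached MAIL
 * region: 32 integer registers plus the resume PC. With zfinx the FPU
 * uses the integer register file, so this is the whole register state.
 * The exact layout is a guess, not the real tinysys structure. */
struct task_context {
    uint32_t regs[32]; /* x0..x31, saved at task-switch time */
    uint32_t pc;       /* address to resume this task at */
};

/* Store a context into MAIL. Since MAIL is uncached, neither CPU's D$
 * is touched by this copy, and the other CPU sees it immediately. */
static void save_context(volatile struct task_context *ctx,
                         const uint32_t regs[32], uint32_t pc)
{
    for (int i = 0; i < 32; ++i)
        ctx->regs[i] = regs[i];
    ctx->pc = pc;
}
```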
If you think this is somewhat of a curious setup, you should first look at the CPU before you decide on that:
This is the tinysys CPU, which is semi-pipelined; only the FETCH/DECODE/INTERRUPT unit runs fully in parallel with the EXECUTE/MEMORY unit, and MEMORY somewhat overlaps most of the execution. This design allows for fast changes without having to pipe an entire set of signals through the whole system, possibly over 5 or more stages, just to make a small change. Ease of modification is the tradeoff this design makes.
Another oddity here is the INTERRUPT ROUTINE ROM. Normally, you'd get an interrupt branch to your MTVEC, upon which you'd do all the fun stuff of handling your interrupt by looking at some mcause bits and other information from the hardware. Since I have chosen to handle interrupts in the fetch unit itself, what happens instead is that the 'fixed' code that populates the mcause bits and preps the branch into the mtvec is all handled by a routine chosen by the fetch unit at runtime. That also lets me push a bunch of instructions into the instruction fifo without having to hit the I$, as a direct copy from a small ROM. As a result, interrupt entry and exit operations are quite fast on tinysys.
For one instruction, the execution logic of the CPU is as follows:
- initialize the CPU at a fixed reset vector (PC = 0x0FFE0000)
- fetch one instruction and decode it
- put all decoded bits into the instruction fifo, unless that instruction is illegal, is an ecall, ebreak, mret or wfi, is an I$ fence operation, or an IRQ is pending (those are handled in the fetch unit)
- while pushing those bits into the fifo, also decide on the next PC (JAL instruction is handled immediately since we know it’ll be a jump to PC+immed)
- if this was a branch instruction, stall waiting for the new PC
- if this was an interrupt (or ebreak/ecall), push the entire set of interrupt entry routine instructions (selected based on hardware/software/timer interrupt type)
- for WFI, we actually wait here (since if there was an interrupt we’d be inserting the handler prologue here anyways)
- if we encounter an mret, then push the entire set of interrupt exit routine instructions (selected using flags saved at interrupt entry time)
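The steps above can be condensed into a per-instruction decision. Here is a rough C sketch of that classification, using standard RISC-V encodings; the names and the function itself are illustrative, not the actual fetch unit logic (illegal instructions and the I$ fence are omitted for brevity).

```c
#include <stdint.h>

enum fetch_action {
    PUSH,         /* normal: push decoded bits into the instruction fifo */
    STALL_PC,     /* conditional branch / jalr: wait for the new PC */
    INJECT_ENTRY, /* IRQ/ecall/ebreak: stream the interrupt entry ROM routine */
    INJECT_EXIT,  /* mret: stream the interrupt exit ROM routine */
    WAIT          /* wfi: park until an interrupt arrives */
};

/* Illustrative sketch of the fetch unit's per-instruction decision,
 * not the real tinysys implementation. Encodings are standard RV32. */
static enum fetch_action classify(uint32_t instr, int irq_pending)
{
    uint32_t opcode = instr & 0x7f;
    if (irq_pending)             return INJECT_ENTRY;
    if (instr == 0x00000073u ||  /* ecall  */
        instr == 0x00100073u)    /* ebreak */
        return INJECT_ENTRY;
    if (instr == 0x30200073u)    return INJECT_EXIT; /* mret */
    if (instr == 0x10500073u)    return WAIT;        /* wfi  */
    if (opcode == 0x63 || opcode == 0x67)
        return STALL_PC;  /* branch/jalr: next PC unknown until execute */
    return PUSH;          /* JAL resolves locally (PC+imm), keeps streaming */
}
```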
Its counterpart, the execution unit, has quite a few things to do:
- pull one instruction from the instruction fifo (which comes with decoded immediates, and the PC from which the instruction was fetched)
- handle it by looking at the decoded bits in the fifo
- if this is a load or store, do the initial operation, but do not wait for the completion of reads or writes (NOTE: since we're using the zfinx extension, the FPU shares registers, and therefore load and store instructions, with the rest of the CPU, so we only have to track one set of load/store instructions)
- pending loads and stores are turned ‘off’ by D$ ready bits and might overlap some amount of other code execution
- for now, the execute unit does not track which register was the target of a load/ALU op before deciding to wait; it will always stall on the next use of any register while one is pending (this will be improved later on)
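The conservative stall rule in the last step can be sketched in a couple of lines, alongside the finer-grained check it could grow into. Both functions are illustrative, not tinysys code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Current (conservative) rule from the text: while a load is in flight,
 * stall any instruction that reads registers, regardless of whether it
 * actually reads the load's destination. */
static bool must_stall(bool load_pending, bool instr_reads_regs)
{
    return load_pending && instr_reads_regs;
}

/* The per-register version a later improvement might use: only stall
 * when a source register matches the pending load's destination.
 * x0 is hardwired to zero, so it can never create a hazard. */
static bool must_stall_precise(bool load_pending, uint8_t load_rd,
                               uint8_t rs1, uint8_t rs2)
{
    return load_pending && load_rd != 0 &&
           (rs1 == load_rd || rs2 == load_rd);
}
```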
As you can see, there isn't much pipelining going on, but here's a neat thing: since the fetch and execute parts talk via a fifo, it would be possible in the future to run the execute unit at a faster clock if one took care of the D$ access (which would then also need a fifo). This would help with long operations such as floating point division, where wait times would diminish somewhat.
One thing that's currently missing, but in the works, is atomic load/store instructions (load-reserved / store-conditional).
The OS
Perhaps it is not that obvious just yet, but up to this point every bit of device memory layout and wiring choice has been geared towards fast interrupt handling, direct access to devices, uncached reads/writes to shared memory locations, and the use of the zfinx extension; all of these help our OS execution and task handling performance.
Since the OS is nothing more than a glorified interrupt handler and task context switcher, it is very important for tinysys to detect an interrupt and handle it as quickly as possible. Currently, the ISR that handles interrupts services all OS calls (for example, file open or print), and the quicker we can return to the caller, the better for the currently executing program.
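To give a feel for the ISR-as-OS idea, here is a sketch of a single trap dispatcher that fields everything and branches on mcause. The cause codes follow the standard RISC-V encoding (timer interrupt = 7, external interrupt = 11, ecall = 8/11 as exceptions, ebreak = 3), but the action names are hypothetical, not tinysys APIs.

```c
#include <stdint.h>

enum os_action {
    RUN_SCHEDULER,   /* timer interrupt: run the task context switcher */
    SERVICE_DEVICE,  /* external interrupt: talk to a peripheral */
    SERVICE_OS_CALL, /* ecall from a program: file open, print, ... */
    DEBUG_TRAP,      /* ebreak */
    FAULT            /* anything else: deal with the offending task */
};

#define MCAUSE_IRQ_BIT 0x80000000u /* top bit set = asynchronous interrupt */

/* Illustrative sketch of a one-stop trap dispatcher; cause codes are
 * standard RISC-V, everything else here is made up for the example. */
static enum os_action dispatch(uint32_t mcause)
{
    if (mcause & MCAUSE_IRQ_BIT) {
        switch (mcause & ~MCAUSE_IRQ_BIT) {
        case 7:  return RUN_SCHEDULER;  /* machine timer interrupt */
        case 11: return SERVICE_DEVICE; /* machine external interrupt */
        default: return FAULT;
        }
    }
    switch (mcause) {
    case 8: case 11: return SERVICE_OS_CALL; /* ecall from U- or M-mode */
    case 3:          return DEBUG_TRAP;      /* ebreak */
    default:         return FAULT;
    }
}
```

The faster this dispatch (plus the entry/exit ROM routines feeding it) runs, the less time is stolen from the interrupted program, which is exactly what the hardware choices above are optimizing for.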
Next time
Next time we will dive a bit more into the peripherals and take a look at memory access arbitration between them. Until then, stay in machine mode.