1 Learning Outcomes

How should we time our single-cycle datapath? How should we set the clock frequency? In this section, we develop an approximation of instruction timing using the five steps to a RISC-V instruction.

2 Timing Diagram for add

First, let’s consider the delays in our beloved add instruction. Review the add datapath in Figure 1.

Figure 1: The add datapath, updated from an earlier section’s simple add-only datapath.

Figure 2 shows the waveforms for executing an add x1 x2 x3 instruction at address 0x100, followed by add x6 x7 x9 at address 0x104.

Figure 2: Timing diagram for add. Only relevant signal waveforms are shown.

3 Critical path delay by instruction

Different instructions use different components of the datapath. We now update our definition of critical path to consider the path between clocked element inputs and outputs that matter for the given instruction. For example, accessing DMEM does not matter for an add, whereas setting up the RegFile data to write back does not matter for sw.

Table 1: Timing descriptions of components.

| Delay | Description |
|---|---|
| $t_{\texttt{clk-to-q}}$ | clk-to-q delay to transfer register input value to the output. |
| $t_{\texttt{setup}}$ | Setup time to hold the register input stable before the rising clock edge. |
| $t_{\texttt{mux}}$ | Propagation delay through a mux; assume the same delay for all muxes. |
| $t_{\texttt{add}}$ | Propagation delay through the simple adder that increments PC to the next instruction. |
| $t_{\texttt{RegFile}}$ | Delay to read a register value from RegFile. |
| $t_{\texttt{IMEM}}$ | Delay to read the instruction from IMEM. |
| $t_{\texttt{DMEM}}$ | Delay to read a word from DMEM. |
| $t_{\texttt{ALU}}$ | Propagation delay through the ALU. |
| $t_{\texttt{Imm}}$ | Propagation delay through the immediate generator. |
| $t_{\texttt{BrComp}}$ | Propagation delay through the branch comparator. |
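
As a sketch of how these delays combine, the critical path for add can be written in terms of the symbols in Table 1. This expression is an assumption pieced together from the footnotes rather than a formula stated in the text: it supposes that the IMEM read dominates the parallel PC+4 adder, that the RegFile read dominates the control logic decode, that the ASel and BSel muxes count as a single mux delay, and that the second mux term is the write-back mux in front of the RegFile.

$$
t_{\texttt{add, critical}} = t_{\texttt{clk-to-q}} + t_{\texttt{IMEM}} + t_{\texttt{RegFile}} + t_{\texttt{mux}} + t_{\texttt{ALU}} + t_{\texttt{mux}} + t_{\texttt{setup}}
$$

Under the same assumptions, $t_{\texttt{DMEM}}$, $t_{\texttt{Imm}}$, and $t_{\texttt{BrComp}}$ do not appear, because add uses none of those components.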

Figure 3: The beq datapath, updated from an earlier section’s simpler datapath.

Figure 4: The lw datapath, updated from an earlier section’s simpler datapath.

4 The single-cycle datapath clock is slow

To determine the clock frequency for the single-cycle datapath, we compute delays of each instruction’s critical path, then set the clock period as the worst-case delay incurred over all instructions.

To put some numbers to our earlier analysis, we will simplify our time estimates with Table 2, which assumes that the timing of each of the five steps to a RISC-V instruction is dominated by a major functional hardware unit.

Table 2: Assume each of the five steps is dominated by a major hardware unit. Multiplexers, control unit, PC accesses, immediate generation, and branch comparison incur minimal delay.

| Step | Operation time | Major hardware unit |
|---|---|---|
| Instruction Fetch (IF) | 200 ps | Read an instruction word from IMEM. |
| Instruction Decode (ID) | 100 ps | Read register values from the RegFile. |
| Execute (EX) | 200 ps | Perform arithmetic/logical operations in the ALU. |
| Memory Access (MEM) | 200 ps | Read or write data from DMEM. |
| Write Back (WB) | 100 ps | Write back to the RegFile. For single-cycle, we assume this is the delay of the WBSel mux and setup time. |

We can then produce the simplified timing diagram in Figure 5 for an instruction that uses all phases—like our lw instruction from earlier. We can additionally construct Table 3, which shows the time required for various instruction formats.

Figure 5: Approximate timing diagram for the five steps to a RISC-V instruction in the single-cycle datapath.

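Table 3: (P&H Figure 4.28) Total time for each instruction, calculated from the simplified time for each phase.

| Instruction | IF (200 ps) | ID (100 ps) | EX (200 ps) | MEM (200 ps) | WB (100 ps) | Total |
|---|---|---|---|---|---|---|
| add | X | X | X |   | X | 600 ps |
| beq | X | X | X |   |   | 500 ps |
| jal | X | X | X |   |   | 500 ps |
| lw | X | X | X | X | X | 800 ps |
| sw | X | X | X | X |   | 700 ps |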

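While Table 3 shows the shortest time to complete each instruction, the single-cycle datapath, like all synchronous digital systems, shares a single clock. The clock period must therefore cover the slowest instruction, lw, at 800 ps, so even a 500 ps beq occupies a full 800 ps cycle.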
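
As a rough check of the arithmetic, the sketch below recomputes the totals in Table 3 from the stage delays in Table 2 and takes the worst case as the clock period. The names `STAGE_PS`, `STAGES_USED`, and the helper functions are illustrative choices, not part of the course material, and the stage assignment for each instruction is inferred from the totals in Table 3.

```python
# Stage delays in picoseconds, taken from Table 2.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Stages each instruction uses. Assumption: inferred from the totals in Table 3.
STAGES_USED = {
    "add": ["IF", "ID", "EX", "WB"],
    "beq": ["IF", "ID", "EX"],
    "jal": ["IF", "ID", "EX"],
    "lw": ["IF", "ID", "EX", "MEM", "WB"],
    "sw": ["IF", "ID", "EX", "MEM"],
}


def instruction_time_ps(instr):
    """Sum the delays of the stages that one instruction uses."""
    return sum(STAGE_PS[stage] for stage in STAGES_USED[instr])


def clock_period_ps():
    """The single shared clock must cover the slowest instruction."""
    return max(instruction_time_ps(instr) for instr in STAGES_USED)


def max_frequency_ghz():
    """Convert a period in picoseconds to a frequency in GHz."""
    return 1000 / clock_period_ps()


if __name__ == "__main__":
    for instr in STAGES_USED:
        print(f"{instr}: {instruction_time_ps(instr)} ps")
    print(f"clock period: {clock_period_ps()} ps")
    print(f"max frequency: {max_frequency_ghz()} GHz")
```

With the numbers in Table 2, the worst case is lw at 800 ps, which corresponds to a maximum clock frequency of 1.25 GHz under this simplified model.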

We further note that each instruction’s critical path often involves accessing major hardware units in sequence. In other words, for most of each clock period, much of our hardware is idle and not computing additional data!
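
To make the idle-hardware observation concrete, the sketch below estimates what fraction of each clock period a unit is active. It assumes, as a simplification not stated in the text, that a unit does useful work only during its own step, and it uses the 800 ps worst-case period implied by Table 3. The names `STAGE_PS`, `CLOCK_PS`, and `busy_fraction` are hypothetical.

```python
# Stage delays in picoseconds, taken from Table 2.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Clock period set by the slowest instruction (lw) in Table 3.
CLOCK_PS = 800


def busy_fraction(stage):
    """Fraction of one clock period during which a stage's unit is active.

    Assumption (not from the text): a unit does useful work only during
    its own step and is idle for the rest of the cycle.
    """
    return STAGE_PS[stage] / CLOCK_PS


if __name__ == "__main__":
    for stage in STAGE_PS:
        print(f"{stage}: busy {busy_fraction(stage):.1%} of each cycle")
```

Under these assumptions, no unit is busy for more than a quarter of the cycle, even for an instruction such as lw that uses every step.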

We address these performance issues and more in our pipelined datapath design up next. Stay tuned!

Footnotes
  1. These processes take a comparable amount of time, though which is longer depends on the specific technology. In Figure 2, the adder happens to complete faster than the IMEM memory fetch.

  2. Note that the waveforms represent bundles of wires carrying a hexadecimal value (contrast this with the clock’s binary high-low signal). The wires in the PC output bundle pc all update at the same time, because the flip-flops are wired in parallel. By contrast, the bits of the pc+4 output do not stabilize simultaneously. Because the adder cascades single-bit adders in series, the less significant bits stabilize sooner than the more significant bits. In timing diagrams, we always show the transition to the correct value. For pc+4, this occurs after the propagation delay of the most significant bit.

  3. In Figure 2, the control logic decoding of the instruction happens to complete faster than the RegFile register read. We will assume this precedence in later analysis.

  4. There are two multiplexers, controlled by ASel and BSel, respectively. Both propagation delays occur concurrently, so we count only one mux’s propagation delay.