Pipelining
   Wikipedia topics :
     Instruction pipeline
     classic RISC pipeline  

* Early processor designs often required one read to fetch the instruction
  opcode, followed by additional reads to fetch operands once decoding had
  started.

* Because memory access is often slower than activity inside the CPU, reading
  the complete instruction in a single read can save clock cycles.

* RISC architectures use a single read to obtain the whole instruction,
   i.e. instructions are equal in size to the data bus.
  However, to access the full address range, a RISC architecture may resort to
   indirect addressing, where a register holds the target memory address.
  Setting this register may take multiple instructions.

* Modern CPUs (32-, 64-, or 128-bit data bus) are capable of fetching a whole
   instruction, or even multiple instructions, in a single read, even when
   instructions vary in length.

* CISC systems often support complex indirect addressing modes and combine
  data fetch with more complex activity (for example, multiplying a register
  by a value found in memory), which can cause delays in the fetch-execute
  cycle.

  RISC separates the fetching and storing of data in and out of the CPU from
  more complex tasks such as addition or multiplication, allowing these
  activities only between registers in the CPU.

* Another feature of RISC is a set of simpler, shorter-running instructions;
  the programmer builds more complex operations in software, and only where
  they are needed.
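
The point above about a register needing multiple instructions to hold a full address can be sketched in Python. This is only an illustration, assuming a hypothetical 32-bit machine whose fixed-size instructions carry a 16-bit immediate value. The names load_upper and or_lower are made up for the example and do not belong to any particular instruction set.

```python
# Hypothetical 32-bit RISC machine: each instruction carries at most a
# 16-bit immediate, so a full 32-bit address takes two instructions to set.
IMM_BITS = 16
IMM_MASK = (1 << IMM_BITS) - 1

def load_upper(imm):
    """Instruction 1 (hypothetical): put imm in the upper half, clear the lower half."""
    return (imm & IMM_MASK) << IMM_BITS

def or_lower(reg, imm):
    """Instruction 2 (hypothetical): OR imm into the lower half of the register."""
    return reg | (imm & IMM_MASK)

def build_address(address):
    """Set a register to a 32-bit address using two fixed-size instructions."""
    reg = load_upper(address >> IMM_BITS)  # upper 16 bits
    reg = or_lower(reg, address)           # lower 16 bits
    return reg

target = 0x1234ABCD
print(hex(build_address(target)))
```

Once the register holds the address, a single load or store instruction can use it for indirect addressing.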

Non-pipelined - 1 MHz clock
FI = fetch instruction, DI = decode instruction, EI = execute instruction, WB = write back

      FI    DI    EI    WB
T1    I1
T2          I1
T3                I1
T4                      I1

Instruction type 1
1 clock per step in FE cycle * 4 steps = 4 clocks/ins.
1,000,000 cycles per sec./4 cycles per instruction = 250,000 ins/sec.

Instruction type 2
1 clock/FI, 1 clock/DI, 4 clock/EI, 2 clock/WB = 8 clocks/ins.
1,000,000 cycles per sec./8 cycles per instruction = 125,000 ins/sec.

Instruction type 3
1 clock/FI, 1 clock/DI, 3 clock/EI, 5 clock/WB = 10 clocks/ins.
1,000,000 cycles per sec./10 cycles per instruction = 100,000 ins/sec.
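
The non-pipelined arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name non_pipelined_rate and the dictionary layout are choices made for this example, while the clock counts per step are the ones listed above.

```python
CLOCK_HZ = 1_000_000  # 1 MHz clock, as in the examples above

# Clocks spent in each step (FI, DI, EI, WB) for the three instruction types.
INSTRUCTION_TYPES = {
    "type 1": (1, 1, 1, 1),
    "type 2": (1, 1, 4, 2),
    "type 3": (1, 1, 3, 5),
}

def non_pipelined_rate(stage_clocks, clock_hz=CLOCK_HZ):
    """Instructions per second when each instruction finishes before the next starts."""
    return clock_hz / sum(stage_clocks)

for name, clocks in INSTRUCTION_TYPES.items():
    rate = non_pipelined_rate(clocks)
    print(f"{name}: {sum(clocks)} clocks/ins, {rate:,.0f} ins/sec")
```

Without pipelining, the cost of an instruction is the sum of its steps, so any slow step lowers the rate directly.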

Pipelined - 1 MHz clock - example assumes single instruction type for simplicity.


      FI    DI    EI    WB
T1    I1
T2    I2    I1
T3    I3    I2    I1
T4    I4    I3    I2    I1

Instruction type 1
Since all steps take the same time, the longest step is 1 cycle, giving 1 instruction/clock.
1,000,000 cycles per sec. / 1 cycle per instruction = 1,000,000 ins/sec.

Instruction type 2
Longest step = 4 clocks for EI step
1,000,000 cycles per sec. / 4 cycles per instruction = 250,000 ins/sec.

Instruction type 3
Longest step = 5 clocks for WB step
1,000,000 cycles per sec. / 5 cycles per instruction = 200,000 ins/sec.
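
The pipelined figures follow one rule: once the pipeline is full, one instruction completes every time the slowest step finishes. A minimal Python sketch of that rule, assuming steady state (the few clocks needed to fill the pipeline are ignored) and using the hypothetical function name pipelined_rate:

```python
CLOCK_HZ = 1_000_000  # 1 MHz clock, as in the examples above

# Clocks spent in each step (FI, DI, EI, WB) for the three instruction types.
INSTRUCTION_TYPES = {
    "type 1": (1, 1, 1, 1),
    "type 2": (1, 1, 4, 2),
    "type 3": (1, 1, 3, 5),
}

def pipelined_rate(stage_clocks, clock_hz=CLOCK_HZ):
    """Steady-state instructions per second: the longest step sets the pace."""
    return clock_hz / max(stage_clocks)

for name, clocks in INSTRUCTION_TYPES.items():
    rate = pipelined_rate(clocks)
    print(f"{name}: longest step {max(clocks)} clocks, {rate:,.0f} ins/sec")
```

The only change from the non-pipelined calculation is that the sum of the step times is replaced by their maximum.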

Super-scalar pipeline (execution step) - 1 MHz clock - single instruction type for simplicity.

      FI    DI    EI (2 units)    WB
T1    I1
T2    I2    I1
T3    I3    I2    I1
T4    I4    I3    I1, I2
T5    I5    I4    I3, I2          I1
T6    I6    I5    I4, I3          I2
T7    I7    I6    I5, I4          I3

Instruction type 1
Since all steps are the same, super-scalar has no overall effect.
1 instruction/clock
1,000,000 cycles per sec. / 1 cycle per instruction = 1,000,000 ins/sec.

Instruction type 2
Longest step = 4 clocks for EI step
But this is averaged between two execution units:
1,000,000 cycles/sec. / (4 cycles per instruction / 2 instructions at a time) = 500,000 ins/sec.

Instruction type 3
Longest step = 5 clocks for WB step
1,000,000 cycles/sec. / 5 cycles per instruction = 200,000 ins/sec.
Because the write-back step is NOT parallelled, super-scalar has no effect on throughput for this instruction type.
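
The super-scalar figures can be reproduced by dividing each step's clock count by the number of units working on that step in parallel, then taking the slowest result. The Python below is a rough sketch under that assumption. The function name superscalar_rate and the UNITS tuple (two execution units, one of everything else) are introduced here for illustration, and steady state is assumed.

```python
CLOCK_HZ = 1_000_000  # 1 MHz clock, as in the examples above

# Clocks spent in each step (FI, DI, EI, WB) for the three instruction types.
INSTRUCTION_TYPES = {
    "type 1": (1, 1, 1, 1),
    "type 2": (1, 1, 4, 2),
    "type 3": (1, 1, 3, 5),
}

# Parallel units per step (FI, DI, EI, WB): only the execution step is doubled.
UNITS = (1, 1, 2, 1)

def superscalar_rate(stage_clocks, units=UNITS, clock_hz=CLOCK_HZ):
    """Steady-state instructions per second with some steps duplicated."""
    # Effective clocks per instruction at each step, averaged over its units.
    effective = [clocks / n for clocks, n in zip(stage_clocks, units)]
    return clock_hz / max(effective)

for name, clocks in INSTRUCTION_TYPES.items():
    print(f"{name}: {superscalar_rate(clocks):,.0f} ins/sec")
```

For type 3 the write-back step still costs 5 clocks per instruction, so doubling the execution step leaves the rate at 200,000 ins/sec.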

Check out :
https://arstechnica.com/old/content/2004/09/pipelining-1.ars
and
https://arstechnica.com/old/content/2004/09/pipelining-2.ars