
Pipelining
   Wikipedia topics :
     Instruction pipeline
     classic RISC pipeline  

* Early processor designs often required one read to fetch the instruction
  opcode, followed by additional reads to fetch operands once decoding had
  started.

* Because memory access is often slower than activity inside the CPU, reading
  the complete instruction in a single access can save clock cycles.

* RISC architectures use a single read to obtain the whole instruction, 
   i.e. instructions are the same size as the data bus.
  However, to reach the full address range, a RISC architecture may resort to 
   indirect addressing, where a register holds the target memory address.
  Setting this register may require multiple instructions.

* Modern CPUs (32-, 64-, or 128-bit data bus) can fetch a whole instruction,
   or even multiple instructions, in a single read, even when instructions
   vary in length.

* CISC systems often support complex indirect addressing modes and combine
  a data fetch with more complex activity (for example, multiplying a register
  by a value found in memory), which can cause delays in the fetch-execute cycle.
  
  RISC separates moving data into and out of the CPU from the more complex
  tasks such as addition or multiplication, allowing those operations only
  between registers in the CPU.

* Another feature of RISC is a set of simpler instructions that each take
  less time, leaving the programmer to build more complex operations in
  software only as needed.
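
The point above about a RISC address register needing multiple instructions to set can be sketched as follows. This is a minimal Python model, assuming a 32-bit address and instructions that can each carry only a 16-bit immediate value. The helper names `load_upper` and `or_lower` are hypothetical and are not taken from any particular instruction set.

```python
# Sketch: building a 32-bit address in a register with two instructions,
# each carrying only a 16-bit immediate (sizes are assumptions).

MASK32 = 0xFFFFFFFF

def load_upper(imm16):
    """Hypothetical instruction 1: place a 16-bit immediate in the upper half."""
    return (imm16 & 0xFFFF) << 16

def or_lower(reg, imm16):
    """Hypothetical instruction 2: OR a 16-bit immediate into the lower half."""
    return (reg | (imm16 & 0xFFFF)) & MASK32

target = 0x1234ABCD                     # full address we want in the register
reg = load_upper(target >> 16)          # first instruction sets the top 16 bits
reg = or_lower(reg, target & 0xFFFF)    # second instruction fills in the bottom 16 bits
print(hex(reg))
```

Only after both steps does the register hold the full target address, which can then be used for an indirect load or store.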

Non-pipelined - 1 MHz clock

	FI	DI	EI	WB
T1	I1
T2		I1
T3			I1
T4				I1

Instruction type 1
1 clock per step in FE cycle * 4 steps = 4 clocks/ins.
1,000,000 cycles per sec./4 cycles per instruction = 250,000 ins/sec.

Instruction type 2
1 clock/FI, 1 clock/DI, 4 clock/EI, 2 clock/WB = 8 clocks/ins.
1,000,000 cycles per sec./8 cycles per instruction = 125,000 ins/sec.

Instruction type 3
1 clock/FI, 1 clock/DI, 3 clock/EI, 5 clock/WB = 10 clocks/ins.
1,000,000 cycles per sec./10 cycles per instruction = 100,000 ins/sec.
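
The non-pipelined arithmetic above can be checked with a short script. This is only a sketch: the per-step clock counts are the ones given in the examples, while the names `STEP_CLOCKS` and `non_pipelined_rate` are made up for illustration.

```python
CLOCK_HZ = 1_000_000  # 1 MHz clock, as in the examples

# Clocks spent in each step (FI, DI, EI, WB) for the three instruction types.
STEP_CLOCKS = {
    "type 1": (1, 1, 1, 1),
    "type 2": (1, 1, 4, 2),
    "type 3": (1, 1, 3, 5),
}

def non_pipelined_rate(steps, clock_hz=CLOCK_HZ):
    """Instructions per second when each instruction must finish
    all of its steps before the next one starts."""
    return clock_hz / sum(steps)

for name, steps in STEP_CLOCKS.items():
    print(name, sum(steps), "clocks/ins,",
          int(non_pipelined_rate(steps)), "ins/sec")
```

Without pipelining, the cost of an instruction is the sum of its step times, so every extra clock in any step lowers throughput.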

Pipelined - 1 MHz clock - example assumes single instruction type for simplicity.

	FI	DI	EI	WB
T1	I1
T2	I2	I1
T3	I3	I2	I1
T4	I4	I3	I2	I1
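
The pipelined timing diagram can also be generated programmatically. The sketch below assumes every stage takes exactly one clock and that there are no stalls, matching the single-instruction-type simplification. The function name `pipeline_rows` is hypothetical.

```python
STAGES = ("FI", "DI", "EI", "WB")

def pipeline_rows(n_clocks, stages=STAGES):
    """For each clock T1..Tn, list the instruction number occupying
    each stage, or None if the stage is still empty."""
    rows = []
    for t in range(1, n_clocks + 1):
        row = []
        for s in range(len(stages)):
            # Instruction k enters FI at clock k and advances one stage
            # per clock, so at clock t stage s holds instruction t - s.
            k = t - s
            row.append(k if k >= 1 else None)
        rows.append(row)
    return rows

print("\t" + "\t".join(STAGES))
for t, row in enumerate(pipeline_rows(4), start=1):
    cells = ["I%d" % k if k is not None else "" for k in row]
    print("T%d\t%s" % (t, "\t".join(cells)))
```

From T4 onward every stage is busy, and one instruction leaves the WB stage on each clock.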

Instruction type 1
Since all steps take the same time, the longest step is 1 clock, giving 1 instruction/clock.
1,000,000 cycles per sec. / 1 cycle per instruction = 1,000,000 ins/sec.

Instruction type 2
Longest step = 4 clocks for EI step
1,000,000 cycles per sec. / 4 cycles per instruction = 250,000 ins/sec.

Instruction type 3
Longest step = 5 clocks for WB step
1,000,000 cycles per sec. / 5 cycles per instruction = 200,000 ins/sec.
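
The pipelined figures follow from dividing the clock rate by the longest step. A minimal sketch, assuming steady-state operation (pipeline already full, no stalls or hazards), with hypothetical names `STEP_CLOCKS` and `pipelined_rate`:

```python
CLOCK_HZ = 1_000_000  # 1 MHz clock, as in the examples

# Clocks spent in each step (FI, DI, EI, WB) for the three instruction types.
STEP_CLOCKS = {
    "type 1": (1, 1, 1, 1),
    "type 2": (1, 1, 4, 2),
    "type 3": (1, 1, 3, 5),
}

def pipelined_rate(steps, clock_hz=CLOCK_HZ):
    """Steady-state instructions per second: the slowest step limits
    how often an instruction can advance through the pipeline."""
    return clock_hz / max(steps)

for name, steps in STEP_CLOCKS.items():
    print(name, "longest step", max(steps), "clocks,",
          int(pipelined_rate(steps)), "ins/sec")
```

Compared with the non-pipelined case, the divisor changes from the sum of the step times to the maximum step time.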

Super-scalar pipeline (execution step) - 1 MHz clock - single instruction type for simplicity.

	FI	DI	EI	WB
T1	I1
T2	I2	I1
T3	I3	I2	I1
T4	I4	I3	I1
			I2
T5	I5	I4	I3
			I2	I1
T6	I6	I5	I4
			I3	I2
T7	I7	I6	I5
			I4	I3

Instruction type 1
Since all steps take the same time, super-scalar has no overall effect.
1 instruction/clock
1,000,000 cycles per sec. / 1 cycle per instruction = 1,000,000 ins/sec.

Instruction type 2
Longest step = 4 clocks for EI step
But this is shared between two execution units:
1,000,000 cycles/sec. / ( 4 cycles per instruction / 2 instructions at a time ) = 500,000 ins/sec.

Instruction type 3
Longest step = 5 clocks for WB step
1,000,000 cycles/sec. / 5 cycles per instruction = 200,000 ins/sec.
Because the write-back step is NOT parallelized, super-scalar has no effect on throughput for this instruction type.
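
The super-scalar figures can be reproduced by dividing only the EI step time by the number of execution units and then taking the longest step. This is a sketch under the same steady-state assumption as before. The name `superscalar_rate` and the `ei_units` parameter are hypothetical.

```python
CLOCK_HZ = 1_000_000  # 1 MHz clock, as in the examples

# Clocks spent in each step (FI, DI, EI, WB) for the three instruction types.
STEP_CLOCKS = {
    "type 1": (1, 1, 1, 1),
    "type 2": (1, 1, 4, 2),
    "type 3": (1, 1, 3, 5),
}

def superscalar_rate(steps, ei_units=2, clock_hz=CLOCK_HZ):
    """Steady-state instructions per second when only the EI step
    is duplicated across ei_units parallel execution units."""
    fi, di, ei, wb = steps
    # Duplicating EI divides its effective time per instruction;
    # the other steps are unchanged.
    effective = (fi, di, ei / ei_units, wb)
    return clock_hz / max(effective)

for name, steps in STEP_CLOCKS.items():
    print(name, int(superscalar_rate(steps)), "ins/sec")
```

For type 2 the bottleneck drops from 4 clocks to 2, doubling throughput. For type 3 the 5-clock WB step remains the bottleneck, so the extra execution unit does not help.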

Check out :
https://arstechnica.com/old/content/2004/09/pipelining-1.ars
and
https://arstechnica.com/old/content/2004/09/pipelining-2.ars