Pipe-lining

Clock speeds and MIPS

The CPU uses a clock to switch between stable and transition states.

Depending on the complexity of the instruction,
  the number of steps and clock cycles needed to finish it can vary greatly.

Semi-simplified fetch/execute steps for an 8-bit system bus supporting
multi-byte instructions (CISC). Total time (clock cycles) can vary widely.

  Fetch instruction op-code (FI)

  Decode instruction op-code (DI)

  Fetch data (FD)

  Execute instruction (EI)

  Write-back (WB) - write result into user-accessible register.


Non-pipe-lined - 1 MHz clock

	FI	DI	FD	EI	WB
T1	I1
T2		I1
T3			I1
T4				I1
T5					I1

Instruction type 1
1 clock per step in FE cycle * 5 steps = 5 clocks/ins.
1,000,000 cycles per sec./5 cycles per instruction = 200,000 ins/sec.

Instruction type 2
FI = 1 clock, DI = 1 clock, FD = 2 clocks, EI = 4 clocks, WB = 2 clocks = 10 clocks/ins.
1,000,000 cycles per sec./10 cycles per instruction = 100,000 ins/sec.

Instruction type 3
FI = 1 clock, DI = 1 clock, FD = 5 clocks, EI = 1 clock, WB = 0 clocks = 8 clocks/ins.
1,000,000 cycles per sec./8 cycles per instruction = 125,000 ins/sec.
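
The non-pipe-lined arithmetic above can be checked with a short Python sketch. The names CLOCK_HZ, STEP_CLOCKS and non_pipelined_rate are illustrative choices, not part of the original notes; the step counts are copied from the three instruction types.

```python
CLOCK_HZ = 1_000_000  # 1 MHz clock, as in the examples above

# Clocks spent in each step (FI, DI, FD, EI, WB) per instruction type.
STEP_CLOCKS = {
    "type 1": (1, 1, 1, 1, 1),
    "type 2": (1, 1, 2, 4, 2),
    "type 3": (1, 1, 5, 1, 0),
}


def non_pipelined_rate(step_clocks, clock_hz=CLOCK_HZ):
    """Instructions per second when each instruction runs alone, start to finish."""
    clocks_per_instruction = sum(step_clocks)
    return clock_hz / clocks_per_instruction


for name, clocks in STEP_CLOCKS.items():
    print(f"{name}: {sum(clocks)} clocks/ins, {non_pipelined_rate(clocks):,.0f} ins/sec")
```

Without pipe-lining, every clock of every step counts against the instruction, so the rate is simply the clock frequency divided by the sum of the step clocks.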

Pipe-lined - 1 MHz clock - example assumes single instruction type for simplicity.

	FI	DI	FD	EI	WB
T1	I1
T2	I2	I1
T3	I3	I2	I1
T4	I4	I3	I2	I1
T5	I5	I4	I3	I2	I1
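
The staggered pattern in the table can be generated mechanically. The sketch below assumes every step takes exactly 1 clock and that instruction k enters FI at tick k; pipeline_rows and format_table are hypothetical helper names used only for illustration.

```python
STAGES = ["FI", "DI", "FD", "EI", "WB"]


def pipeline_rows(num_ticks, stages=STAGES):
    """Return one row per clock tick listing the instruction in each stage."""
    rows = []
    for tick in range(1, num_ticks + 1):
        row = []
        for position in range(len(stages)):
            # Instruction k enters FI at tick k and moves one stage per tick.
            instr = tick - position
            row.append(f"I{instr}" if instr >= 1 else "")
        rows.append(row)
    return rows


def format_table(rows, stages=STAGES):
    """Render the rows as a tab-separated table with T1, T2, ... labels."""
    lines = ["\t" + "\t".join(stages)]
    for tick, row in enumerate(rows, start=1):
        lines.append((f"T{tick}\t" + "\t".join(row)).rstrip())
    return "\n".join(lines)


print(format_table(pipeline_rows(5)))
```

From T5 onward every stage is busy, which is what "full pipeline" means in the calculations that follow.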

Instruction type 1
Since all steps are the same, the longest step is 1 clock cycle.
1 instruction can be completed every clock cycle with full pipeline.
1,000,000 cycles per sec. / 1 cycle per instruction = 1,000,000 ins/sec.

Instruction type 2
Longest step = 4 clocks for EI step
1 instruction can be completed every 4 clock cycles with full pipeline.
1,000,000 cycles per sec. / 4 cycles per instruction = 250,000 ins/sec.

Instruction type 3
Longest step = 5 clocks for FD step
1 instruction can be completed every 5 clock cycles with full pipeline.
1,000,000 cycles per sec. / 5 cycles per instruction = 200,000 ins/sec.
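
A minimal Python sketch of the pipe-lined calculation, assuming a full pipeline and a stream made of a single instruction type. The helper name pipelined_rate is an illustrative choice; the step counts are the ones given for the three instruction types.

```python
CLOCK_HZ = 1_000_000  # 1 MHz clock

# Clocks spent in each step (FI, DI, FD, EI, WB) per instruction type.
STEP_CLOCKS = {
    "type 1": (1, 1, 1, 1, 1),
    "type 2": (1, 1, 2, 4, 2),
    "type 3": (1, 1, 5, 1, 0),
}


def pipelined_rate(step_clocks, clock_hz=CLOCK_HZ):
    """Steady-state instructions per second: the longest step sets the pace."""
    bottleneck = max(step_clocks)
    return clock_hz / bottleneck


for name, clocks in STEP_CLOCKS.items():
    print(f"{name}: longest step {max(clocks)} clocks, {pipelined_rate(clocks):,.0f} ins/sec")
```

The only change from the non-pipe-lined case is that the divisor is the longest step rather than the sum of all steps.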

Super-scalar pipeline (execution step) - 1 MHz clock - single instruction type for simplicity.

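	FI	DI	FD	EI(a)	EI(b)	WB
T1	I1
T2	I2	I1
T3	I3	I2	I1
T4	I4	I3	I2	I1
T5	I5	I4	I3	I1	I2
T6	I6	I5	I4	I3	I2	I1
T7	I7	I6	I5	I3	I4	I2

EI(a) and EI(b) are the two parallel execution units. In this diagram each
instruction occupies its execution unit for 2 clocks.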

Instruction type 1
Since all steps are the same, super-scalar has no overall effect.
1 instruction can be completed every clock cycle with full pipeline.
1,000,000 cycles per sec. / 1 cycle per instruction = 1,000,000 ins/sec.

Instruction type 2
Longest step = 4 clocks for EI step
But this is shared between two execution units:
2 instructions can be completed every 4 clock cycles with full pipeline
or 1 instruction every 2 clock cycles (averaged).
1,000,000 cycles per sec. / (4 cycles per instruction / 2 instructions at a time) = 500,000 ins/sec.

Instruction type 3
Longest step = 5 clocks for FD step
1,000,000 cycles per sec. / 5 cycles per instruction = 200,000 ins/sec.
Because the operand fetch unit is NOT duplicated, it will still
require 5 clocks to complete an instruction.
Super-scalar is not applied to the operand fetch stage,
so it has no effect.
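
The super-scalar case can be sketched the same way by dividing each step's clocks by the number of parallel units serving that step. This is an idealized model that assumes work is spread perfectly across the duplicated units; UNITS and superscalar_rate are hypothetical names, and only the EI step is duplicated, as in the example above.

```python
CLOCK_HZ = 1_000_000  # 1 MHz clock

# Clocks spent in each step (FI, DI, FD, EI, WB) per instruction type.
STEP_CLOCKS = {
    "type 1": (1, 1, 1, 1, 1),
    "type 2": (1, 1, 2, 4, 2),
    "type 3": (1, 1, 5, 1, 0),
}

# Number of parallel units per step: only EI is duplicated.
UNITS = (1, 1, 1, 2, 1)


def superscalar_rate(step_clocks, units=UNITS, clock_hz=CLOCK_HZ):
    """Steady-state instructions per second with some steps duplicated."""
    # Effective clocks per instruction at each step, averaged over its units.
    effective = [clocks / n for clocks, n in zip(step_clocks, units)]
    # A step cannot hand off more than once per clock, so the floor is 1.
    bottleneck = max(max(effective), 1)
    return clock_hz / bottleneck


for name, clocks in STEP_CLOCKS.items():
    print(f"{name}: {superscalar_rate(clocks):,.0f} ins/sec")
```

For type 2 the EI step drops from 4 to an effective 2 clocks, matching the 2-clock FD and WB steps; for type 3 the 5-clock FD step is untouched, so the rate does not change.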

The slowest (longest) step in the sequence limits the throughput. This bottleneck may appear in different parts of the pipeline depending on the instruction.

In a perfect world, each step of the fetch/execute cycle works without competing with other steps for common resources.

In reality, the execution time of an individual instruction is often longer because of resource competition and the overhead of the staging itself.

But overall performance (average execution time per instruction) improves greatly.
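
To illustrate the difference between single-instruction time and overall performance, the sketch below counts total clocks for a run of n identical instructions. It is an idealized model: the first instruction takes the full sum of its steps to fill the pipeline, and each later one finishes one bottleneck step after the previous. Resource competition and staging overhead are not modeled, and the function names are illustrative only.

```python
def non_pipelined_clocks(step_clocks, n):
    """Total clocks for n instructions run one after another."""
    return n * sum(step_clocks)


def pipelined_clocks(step_clocks, n):
    """Total clocks for n identical instructions in an ideal pipeline."""
    if n == 0:
        return 0
    fill = sum(step_clocks)        # first instruction passes through every step
    bottleneck = max(step_clocks)  # later instructions finish this far apart
    return fill + (n - 1) * bottleneck


type_2 = (1, 1, 2, 4, 2)  # FI, DI, FD, EI, WB clocks for instruction type 2
n = 1000
serial = non_pipelined_clocks(type_2, n)
piped = pipelined_clocks(type_2, n)
print(f"non-pipe-lined: {serial} clocks, pipe-lined: {piped} clocks, "
      f"speedup: {serial / piped:.2f}x")
```

In this model each instruction still needs at least 10 clocks from fetch to write-back, yet the run as a whole finishes in well under half the clocks.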