Techniques to improve on the simple fetch/execute cycle.

  Simple pipe-lining
    Multiple instructions in CPU

    Different steps of the F/E cycle being processed on different instructions
      at same time.

    Pipe-lining can create conflicts.
      Different steps want the same resources, or need results not yet ready.
       * fetching an operand while another instruction writes its result.
       * attempting a conditional branch before the tested result is finalized.

    May offer 20-60% speed improvement.
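
    A rough cycle count shows where the gain comes from and how conflicts
    eat into it. This is an idealized sketch, not a model of any real CPU:
    it assumes every stage takes one cycle, and stall_cycles is a
    hypothetical input standing in for the conflicts listed above.

```python
def unpipelined_cycles(n_instructions, n_stages):
    """Each instruction completes every stage before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages, stall_cycles=0):
    """The first instruction needs n_stages cycles to fill the pipe; each
    later one finishes one cycle after its predecessor, plus any stalls."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1) + stall_cycles


def speedup(n_instructions, n_stages, stall_cycles=0):
    """Ratio of unpipelined time to pipelined time."""
    return unpipelined_cycles(n_instructions, n_stages) / pipelined_cycles(
        n_instructions, n_stages, stall_cycles)
```

    With 100 instructions and 4 stages the ideal count drops from 400 cycles
    to 103. Adding stall cycles pulls the ratio back toward 1, which is why
    real gains are far below the stage count.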

    Datapath description
    Improved Datapath

  Super-scalar
    Duplication or variation of complex task circuitry, most commonly the ALU. 

    Arithmetic Logic Units.
      Copies often not symmetrical. One ALU performs addition better and the 
        other favors multiplication/division.

    Floating-point units. May technically qualify as a secondary processor.

    Less competition for same resources.
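
    A toy issue model illustrates the benefit of duplicated units. It is a
    sketch under simplifying assumptions: instructions issue in order, every
    unit can run every instruction, each takes one cycle, and only
    read-after-write dependences are checked. The (dest, src1, src2) tuple
    format and the name issue_cycles are made up for the illustration.

```python
def issue_cycles(instructions, width=2):
    """Count cycles for in-order issue with `width` execution units.

    Each instruction is a (dest, src1, src2) tuple of register names.
    An instruction cannot issue in the same cycle as an earlier
    instruction whose result it reads.
    """
    cycles = 0
    i = 0
    while i < len(instructions):
        written = set()  # registers written by instructions issued this cycle
        issued = 0
        while i < len(instructions) and issued < width:
            dest, src1, src2 = instructions[i]
            if src1 in written or src2 in written:
                break  # operand not ready yet: wait for the next cycle
            written.add(dest)
            issued += 1
            i += 1
        cycles += 1
    return cycles


independent = [("r1", "a", "b"), ("r2", "c", "d"),
               ("r3", "e", "f"), ("r4", "g", "h")]
chained = [("r1", "a", "b"), ("r2", "r1", "c"),
           ("r3", "r2", "d"), ("r4", "r3", "e")]
```

    Independent instructions finish in half the cycles with two units; the
    chained sequence gains nothing, since each instruction waits on the one
    before it.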

  Vector processing
    A vector processor will apply a single instruction to multiple data units.
      Units are a small set of identical processing circuits.
      Customized version of super-scalar.

    Useful in numeric tasks, not so much in word processing.

    Intel supports SIMD instructions on 4 packed 32-bit values.
      SSE (1999) handled 4 floats; SSE2 (circa 2000) added packed integers.
      small number of useful instructions.
      fetch will read 4 32-bit values from a starting place in memory (array).
      single arithmetic instruction applied to all 4 values. 
    - instructions may not be supported on older CPUs.
    - requires compiler to recognize target CPUs ability.
    + Use of SDRAM, caching, and 64-bit data bus means values can be read
      in very quickly.
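
    The read-4-then-operate pattern can be emulated in plain Python. This
    only imitates the behavior in software; it does not use the actual CPU
    instructions. The names LANES, load4, add4 and vector_add are invented
    for this sketch, and the 32-bit wraparound is an assumption about how
    the packed integer add behaves.

```python
LANES = 4  # four 32-bit values handled per instruction


def load4(memory, start):
    """One 'fetch': read 4 consecutive values starting at index `start`."""
    return tuple(memory[start:start + LANES])


def add4(a, b):
    """One 'arithmetic instruction': add all 4 lanes, wrapping at 32 bits."""
    return tuple((x + y) & 0xFFFFFFFF for x, y in zip(a, b))


def vector_add(xs, ys):
    """Add two arrays 4 elements at a time."""
    if len(xs) != len(ys) or len(xs) % LANES != 0:
        raise ValueError("arrays must have equal length, a multiple of 4")
    out = []
    for i in range(0, len(xs), LANES):
        out.extend(add4(load4(xs, i), load4(ys, i)))
    return out
```

    An 8-element add takes 2 passes through the loop instead of 8, which is
    the saving the hardware version delivers per instruction.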

  Hyper-threading
    CPU core simulates 2 CPUs. 
    Not quite true parallel processing.

    Takes advantage of super-scalar features.
    Duplicates architectural state, one copy per logical CPU:
      Control registers: status, interrupt mask, memory management.
      General purpose registers.

    Shares everything else: a single instruction decoder/execution engine,
      single cache mechanism, single MAR/MDR interfaces, etc.

    Allows two separate processes or threads to co-exist in CPU core.
    If one thread stalled waiting for something like I/O
      other thread allowed full access to resources.

    Requires an OS that is multi-CPU enabled.
    Invisible to program - but programs with parallel potential may benefit.

    Up to 30% performance improvement but very application dependent.  
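
    A toy cycle count shows where the gain comes from. This is a sketch
    under strong assumptions, not a model of real hardware: each thread is a
    string of one-cycle steps, "X" needs the shared execution engine and "S"
    is a stall (waiting on something else). The names run_sequential and
    run_smt are invented here.

```python
def run_sequential(a, b):
    """One thread runs to completion, then the other; stalls waste cycles."""
    return len(a) + len(b)


def run_smt(a, b):
    """Both threads are resident; one engine is shared cycle by cycle."""
    threads = [a, b]
    pos = [0, 0]   # next step of each thread
    prefer = 0     # which thread gets first claim on the engine this cycle
    cycles = 0
    while pos[0] < len(a) or pos[1] < len(b):
        engine_free = True
        for t in (prefer, 1 - prefer):
            if pos[t] >= len(threads[t]):
                continue             # this thread has finished
            if threads[t][pos[t]] == "S":
                pos[t] += 1          # stall cycle elapses, engine not needed
            elif engine_free:
                pos[t] += 1          # this thread uses the engine this cycle
                engine_free = False
        cycles += 1
        prefer = 1 - prefer          # alternate priority for fairness
    return cycles
```

    When one thread stalls, the other fills the idle engine cycles. Two
    threads that never stall gain nothing, which matches the
    "very application dependent" caveat.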

    # Primarily Intel; AMD has Clustered Multi-Threading.
    # With the abundance of multi-core CPUs, these are somewhat out of favor.
    
  Multi-core

    Most of CPU's core circuits duplicated on same silicon chip.
      2, 4, 6, 8 or more cores.

    Each core has its own level 1 Harvard-style caches
      (separate instruction and data caches).

    May share single level 2 cache circuits.

    May share on- or off-chip level 3 cache.

    Share single set of address, data,
      and control lines connecting CPU to system buses.
      Data lines now 64 bits (8 bytes) wide.

    Requires OS be aware of multiprocessing abilities.

    Multi-core requires different coding for a single application to take
      advantage of multiple cores.

      Super-scalar attempts to execute different parts of a single program
      in parallel, whereas multi-core can run different programs.
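
    The "different coding" usually means splitting the work explicitly and
    combining partial results. A minimal sketch using Python's standard
    concurrent.futures module; the helper names split and parallel_sum are
    invented for the example. Threads are the default here only to keep the
    sketch simple to run. In CPython, threads do not spread pure-Python
    arithmetic across cores, so ProcessPoolExecutor would be passed as
    executor_cls to use several cores.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Cut `data` into at most `parts` contiguous chunks of similar size."""
    size = max(1, (len(data) + parts - 1) // parts)
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum(data, workers=4, executor_cls=ThreadPoolExecutor):
    """Sum each chunk on its own worker, then combine the partial sums."""
    chunks = split(data, workers)
    with executor_cls(max_workers=workers) as pool:
        partials = list(pool.map(sum, chunks))
    return sum(partials)
```

    The program, not the hardware, decides how the task is divided; a
    program written as one sequential loop would use a single core no matter
    how many are present.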

  Multiple CPUs or Multiprocessing
    Separate CPUs that share bus and memory.

    Symmetrical
      - many identical CPUs.

      Individual CPUs are often simple.
        SIMD, single instruction, multiple data (vector).
        MIMD, multiple instructions, multiple data (mesh).
          Modern 'cloud' computing may be a kind of MIMD.

      # parallel processing - different CPUs all working on same task 
         (more common).

      # multiprocessing - different CPUs acting on different tasks.
         
    Asymmetrical
      - different CPUs handle different system tasks.

      Memory management unit, Math Co-processor

      Video processor, Sound processor