Software tweaks :

  Compile time:
    Branch delay slot
      Insert NOP (no operations) to provide some additional time for 
        previous instruction to complete.

      Out or order execution. Inserting an unrelated instruction after 
        branch test. Similar to NOP but gets something done.

    Compiler-optimization

      Compiler generates out of order instruction sequence when a set of 
      instructions in a particular sequence are NOT dependent on each other, 
      the compiler can reorder.

      
      Initial re-ordered
      
      
        Add A to B

        Store B

        Set C to 7

        Set D to 0

      
      
        Add A to B

        Set C to 7

        Set D to 0

        Store B

      
      
 

      - Out of order compiling time expensive, 
        so usually reserved for final compile of program.

    Branch Prediction
      Rewrite test to favor linear (sequential) branch.
      Code recompiled so more often executed code immediately follows
        branch test (branch not taken).

        May require additional information or compiler simulation.

        'Static' - compiler assumes branches back to lower address or
          previous inline code more likely taken.

    Substitute instructions with more efficient instructions.
      e.g.  Multiply by 2 replaced with shift-left.

    - Optimization takes time. If code is in development stage, may not
      be tolerable.

    - Because branching often conditional on data,
      compiler best guess may or may not be.


  Run-time :

    Branch Predictor
      As more executions occur, note the preferred behaviour of branch.

      Requires a counter of 1 or more bits.
        1-bit notes last evaluation of branch.

        2-bit notes last 4 evaluations. 

        More bits, better observation but more complex logic.

      Two level predictor
        Counter replaced with or kept in a small table that records the history 
          of branch taken/not taken.

        Table allows predictor circuit to see patterns such as branch taken
          every third time.

        * One of the most popular branch handing techniques in current use.

        Example :
          2-bit by 4 entry table.
          First time branch encountered, 1st entry adjusted.
          2nd time, 2nd entry adjusted.
          3nd time, 3nd entry adjusted.
          4nd time, 4nd entry adjusted.
          
          We now can recognize patterns in the branch behaviour.

          Longer the table, the better the prediction.

          Additonal field required to handle different/nested branches 
            in program.
    
          Branch history table starts to perform as a cache with its
            associated overhead and logic.

    Branch target prediction table.
      Branch addresses are often relative to PC 
        or may be specified in a pointer (indirect addressing).

      Precalculating the target address and storing it in a small cache table
        can improve performance.

      Branch predicition table - if branch is taken.
      Branch target predicition table - where to branch to.
   
      Both of these are often implemented in the same logic module.


      Predict which branch more likely taken based on direction of branch.
        Branches back (PC - X) are more commonly at the bottom of a loop and
          more likely to be taken.

        Branches forward (PC + X) are more commonly breaking out of a loop and
          less likely to be taken. 


    Use conditional instruction (predicate)

      Useful for short branch conditions.

        sub r3,r2,r1 # subtract r1 from r2  r3=r2-r1
        brp debit    # branch to code marked by debit if positive
        xor r3,r3    # other wise zero out r3
  debit sub r2,r2,1  # subract 1 from r2

        sub  r3,r2,r1 # subtract r1 from r2  r3=r2-r1
        xorn r3,r3    # zero out r3 if negative flag set
  debit sub  r2,r2,1  # subract 1 from r2

      Same test condition can be checked by multiple instructions
        as long as they don't modify condition.

      Works well when 'branch' is fairly short.

      - requires additional bits in instruction opcode. bigger code.
      - requires additional logic.
      - may require additional execution time
        extra clock cycle or slightly slower clock.

      # Intel uses movcc and fcmov to conditionally move data in a 
      #   registers based on 8 (3-bit) recognized conditons. 
      #
      # ARM uses a 4-bit conditional test (16 possible condtions) on
      #   a number its opcodes.


  Out of order execution
    Cache next instruction if data dependent.

      Out of order execution - requires advanced hardware logic.
      More common in CISC CPUs - Intel P6.


  Speculative execution

"Speculative execution is an optimization technique where a computer system performs some task that may not be needed. Work is done before it is known whether it is actually needed, so as to prevent a delay that would have to be incurred by doing the work after it is known that it is needed. If it turns out the work was not needed after all, most changes made by the work are reverted and the results are ignored.

From the wikipedia page.

    The branch prediction covered above is a form of "predicitve execution"
      If 'prediction' wrong, pipeline has to be flushed and reinitalized
      with code from correct branch.

    Eager execution assumes availability of circuitry to be able to run
      both branches as separate sub-tasks and discard results of branch
      not taken without affecting the overall state of the pipeline.

      Super-scalar ALUs may allow for some cases of eager execution.

      IBM 360 supported.
 
      Systems using CUDA - library that allows a CPU to offload computation
        on to NVidia graphics co-processors.

        Requires compile time library support.

        Mainly used to process large arrays of data in parallel (?).

      Resource needs grow exponentially as number of unresolved branches
        increases.


    Spectre
    Meltdown (Intel CPUs)

    On most CPUs built in the last 20 yrs. using branch prediction, 
      an abandoned branch is not purged after abandonment.

    Unlike main memory where memory for one program is protected from access
      by another program, the branch prediction area inside the CPU 
      is not protected.

    A clever program can mine information from this area about what is in 
      the CPU cache or even in the CPU itself.

Lectures

Initial	re-ordered
Add A to B Store B Set C to 7 Set D to 0	Add A to B Set C to 7 Set D to 0 Store B