Because the CPU is usually much faster than memory (fast memory is expensive), the CPU often spends much of its time waiting for the data it has requested. Improvements in CPU technology, including pipelining and superscalar design, have made this problem worse, while improvements in memory have been concentrated in size rather than speed.
Memory suffers from a trait called latency, which has three parts. It takes time for the control and address buses to select a particular unit of memory, it takes time to deliver a copy of that data to the CPU, and, because of the nature of economical memory, it takes time for the memory to recover after it has been accessed.
A solution to this bottleneck is to provide a small amount of memory that is fast and easily accessible to the CPU. This memory, called cache memory, is usually located in the CPU or immediately adjacent to it, connected by high speed connections. It is quick but expensive.
Since programs show a behavior called locality, the cache does not have to be a large amount of memory, which keeps the cost down.
Locality is exhibited when a table of data is accessed. Usually a program will proceed through the table in sequential order, and that table will also be stored sequentially in memory, with all of its bytes local to each other.
It is also often seen in the branching or jumping of execution code. While jumps to other functions may be a large distance away, loops within a program or simple decision branches tend to be very local to each other.
Because of this locality, it is possible to copy, with very efficient commands, the blocks of memory containing the current code and data from main memory to the cache and execute the program from the cache.
There are several ways to cache a block of code or data. Each offers advantages and costs.
The simplest is to mix the code and data in a single cache. This is the simplest and smallest (cheapest) configuration, but it comes at the cost of increasing the chance that the information you need is not available in the cache.
An improvement, at the cost of a larger cache and some additional control circuits, is to split the cache into a code section and a separate data section, called a split or Harvard architecture. The code and data then do not compete for the same slots in the cache. Additionally, on most systems code is never modified and so never has to be written back out to main memory.
Cache is implemented in two or three levels. Level 1 is memory directly on the CPU. Level 2 is memory separate from the CPU but directly connected via a specially designed bus that runs at near CPU speed. Level 3 (static) cache may run only at bus speed but has no recovery latency.
In a cached system, the memory being cached tends to be duplicated at all levels.
Caching depends on the principle of locality. Locality exists in two forms, temporal and spatial. Memory accesses tend to be sequential or within a near range (spatial), and these accesses tend to stay in that area for a while. Think of a for loop initializing an array of numbers.
for i = 1 to 1000 do
    set array[i] to 10
done
The for, do, set, and done commands and the arrangement of the array elements are spatially near each other, and the loop sequence is going to run a thousand times (temporal).
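The same loop can be rendered as a short C program (a minimal sketch; the array name and size are just placeholders for illustration):

#include <stdio.h>

int main(void)
{
    int array[1000];              /* the elements sit next to each other in memory */

    /* Each pass touches the element right after the previous one (spatial
       locality), and the same few loop instructions run 1000 times in a row
       (temporal locality). */
    for (int i = 0; i < 1000; i++)
        array[i] = 10;

    printf("%d\n", array[0]);     /* use the array so the loop is not optimized away */
    return 0;
}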
To create a cache architecture, the main memory is first broken up into blocks the size of the cache.
Next the cache memory is divided up into fixed size blocks called cache lines. Each line contains three pieces of information.
First is a valid flag to indicate that that particular line of the cache is in use. Next comes the tag, which holds the address of the main memory block from which the line is being cached. The tag is treated somewhat like a segment register.
Finally, there is the cached data itself, which is between 4 and 64 bytes long.
For example, 32 bytes/line * 2K lines = 64K of cache memory.
Main memory is divided up into 64K blocks, and these blocks are divided up into 32 byte units.
The unit at a given position within each and every 64K block of memory is assigned to one particular line of the cache and only that line.
0-31 line 0
32-63 line 1
...
65536-65567 line 0
65568-65599 line 1
...
When an address in main memory is accessed, it is broken down into a set of fields.
First, the 64K block of memory being accessed is found. Then the 32 byte line within that block is found.
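As a rough sketch of that breakdown, using the 32 byte line, 2K line (64K) example above, the block, line, and offset can be computed with simple integer arithmetic; the address value here is arbitrary and only for illustration.

#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE  32u       /* bytes per cache line        */
#define NUM_LINES  2048u     /* 2K lines, so 64K of cache   */

int main(void)
{
    uint32_t address = 0x0001A2C7;                        /* arbitrary example address  */

    uint32_t offset = address % LINE_SIZE;                /* which byte within the line */
    uint32_t line   = (address / LINE_SIZE) % NUM_LINES;  /* which cache line           */
    uint32_t block  = address / (LINE_SIZE * NUM_LINES);  /* which 64K block (the tag)  */

    printf("block %u, line %u, offset %u\n", block, line, offset);
    return 0;
}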
The cache mechanism then checks the valid flag for that particular line to see if it is set. If it is not, then an access to real main memory is performed and the data is moved into the cache at the appropriate line.
Calculate the 64K block and line to access
If the cache has an entry at the appropriate line then
    Compare the block address of the memory to the tag entry for the desired line
    If they match then
        The bytes wanted are accessed
    Else
        Write out the current contents of the line to main memory
        Read in the line from the desired main memory
    Endif
Else
    Read in the line from the desired main memory
Endif
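The same decision sequence can be sketched in C for the direct-mapped example above. This is only an illustration of the logic, not any particular processor's circuitry; the main_memory array and the line transfer helpers are simulated stand-ins for the real bus traffic, and a dirty flag is included so that only changed lines are written back (anticipating the write policies discussed later).

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE  32u
#define NUM_LINES  2048u
#define BLOCK_SIZE (LINE_SIZE * NUM_LINES)     /* 64K */

struct cache_line {
    bool     valid;                /* is this line holding anything?             */
    bool     dirty;                /* has the line changed since it was loaded?  */
    uint32_t tag;                  /* which 64K block the data came from         */
    uint8_t  data[LINE_SIZE];      /* the cached bytes themselves                */
};

static struct cache_line cache[NUM_LINES];
static uint8_t main_memory[1 << 20];           /* simulated main memory          */

/* Stand-ins for the bus transfers between the cache and main memory. */
static void read_line(uint32_t tag, uint32_t line, uint8_t *dest)
{
    memcpy(dest, &main_memory[tag * BLOCK_SIZE + line * LINE_SIZE], LINE_SIZE);
}

static void write_line(uint32_t tag, uint32_t line, const uint8_t *src)
{
    memcpy(&main_memory[tag * BLOCK_SIZE + line * LINE_SIZE], src, LINE_SIZE);
}

/* Return the byte at 'address', filling the cache line first on a miss. */
uint8_t cache_read(uint32_t address)
{
    uint32_t offset = address % LINE_SIZE;
    uint32_t line   = (address / LINE_SIZE) % NUM_LINES;
    uint32_t tag    = address / BLOCK_SIZE;
    struct cache_line *cl = &cache[line];

    if (!cl->valid || cl->tag != tag) {        /* miss: line empty or holding another block */
        if (cl->valid && cl->dirty)            /* save the current contents first           */
            write_line(cl->tag, line, cl->data);
        read_line(tag, line, cl->data);        /* bring in the wanted line                  */
        cl->valid = true;
        cl->dirty = false;
        cl->tag   = tag;
    }
    return cl->data[offset];                   /* hit (or freshly filled line)              */
}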
Obviously, there are issues concerning the data currently in the cache at that line if it is not the data needed.
Because the calculations are reasonably simple and can be done by hardware circuits in parallel, even the additional step of finding the tag and line is quick.
Note that the data stored in a line is contiguous but the data stored in any two adjacent lines may be from different blocks in main memory.
As seen in the Intel assembly architecture, data and code often start at the beginning of segments. The use of the split architecture protects against competition for the same lines in the cache.
But it is also common for different program modules to be aligned at the beginning of a memory block, especially in a large program.
A solution to this is to again split the cache, this time into multiple identical parts, called a set associative cache.
Set associative caches tend to consist of 2 or 4 identical caches of the above nature. The difference is that if a line in the first cache is in use, then the same line in the next is used. Only if a particular line in all of the caches is in use does the system swap out one of the lines in use to make room for the needed data.
There are several possible mechanisms for deciding which copy of the line to sacrifice.
FIFO - first in first out - assumes the oldest line will probably not be needed.
LRU - least recently used - the line accessed the longest time ago.
LFU - least frequently used - the line used the fewest times.
Caches with a larger number of ways can be created, but checking the lines becomes more complex and expensive. It has been found that 2 way and 4 way caches offer the best of both worlds.
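Below is a minimal sketch of how a 2 way set associative lookup might choose which copy of a line to use or to sacrifice, here using the LRU policy; the names and sizes are illustrative assumptions, not taken from any specific processor.

#include <stdbool.h>
#include <stdint.h>

#define LINE_SIZE  32u
#define NUM_LINES  2048u
#define NUM_WAYS   2u                 /* a 2 way set associative cache */

struct cache_line {
    bool     valid;
    uint32_t tag;
    uint8_t  data[LINE_SIZE];
    uint32_t last_used;               /* counter value at the last access, for LRU */
};

static struct cache_line cache[NUM_LINES][NUM_WAYS];
static uint32_t access_counter;       /* incremented on every access */

/* Find the way holding 'tag' at 'line'; on a miss, return the way to sacrifice
   (an empty way if one exists, otherwise the least recently used one). */
uint32_t choose_way(uint32_t line, uint32_t tag, bool *hit)
{
    uint32_t victim = 0;

    *hit = false;
    for (uint32_t w = 0; w < NUM_WAYS; w++) {
        struct cache_line *cl = &cache[line][w];
        if (cl->valid && cl->tag == tag) {     /* the line is already cached in this way */
            *hit = true;
            return w;
        }
        if (!cl->valid)
            victim = w;                        /* prefer an empty way                    */
        else if (cache[line][victim].valid &&
                 cl->last_used < cache[line][victim].last_used)
            victim = w;                        /* older access time: better LRU victim   */
    }
    return victim;
}

/* After any access, the caller records the recency used by the LRU decision. */
void touch(uint32_t line, uint32_t way)
{
    cache[line][way].last_used = ++access_counter;
}

Swapping the last_used comparison for a load-time stamp would give FIFO, and swapping it for a use count would give LFU.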
Another issue with caching (for the data cache) is the problem of writing back a line because its data has changed. Reading and writing between main memory and the cache (or CPU) is expensive (that is why we have the cache), but the integrity of the data is at risk if it exists only in the cache. There are two solutions to this.
Write through (wt) - changes to data in the cache are immediately written to main memory. This is a simple design and guarantees the data is in main memory, but it causes additional memory traffic.
Write deferred (wd) - changes are written back to main memory only when the line is normally swapped out.
What happens in these two schemes if the memory being written is not already in the cache?
In wt, the change will be passed directly to main memory.
In wd, the memory to be written will be loaded into the cache with the assumption that additional writes will be occurring.
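The contrast between the two schemes can be sketched as a single write routine with a compile time switch; the WRITE_THROUGH flag, the main_memory array, and the layout constants are all illustrative assumptions, and the write-back of an evicted dirty line is left out to keep the sketch short.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE      32u
#define NUM_LINES      2048u
#define BLOCK_SIZE     (LINE_SIZE * NUM_LINES)
#define WRITE_THROUGH  1              /* set to 0 to model the write deferred policy */

struct cache_line {
    bool     valid;
    bool     dirty;                   /* only meaningful for write deferred */
    uint32_t tag;
    uint8_t  data[LINE_SIZE];
};

static struct cache_line cache[NUM_LINES];
static uint8_t main_memory[1 << 20];  /* simulated main memory */

void cache_write(uint32_t address, uint8_t value)
{
    uint32_t offset = address % LINE_SIZE;
    uint32_t line   = (address / LINE_SIZE) % NUM_LINES;
    uint32_t tag    = address / BLOCK_SIZE;
    struct cache_line *cl = &cache[line];
    bool hit = cl->valid && cl->tag == tag;

#if WRITE_THROUGH
    /* Write through: main memory is always updated immediately; on a miss the
       write simply bypasses the cache. */
    main_memory[address] = value;
    if (hit)
        cl->data[offset] = value;     /* keep the cached copy consistent */
#else
    /* Write deferred: on a miss the line is loaded into the cache first, on the
       assumption that more writes to it will follow; main memory is updated only
       when the dirty line is eventually swapped out. */
    if (!hit) {
        memcpy(cl->data, &main_memory[tag * BLOCK_SIZE + line * LINE_SIZE], LINE_SIZE);
        cl->valid = true;
        cl->tag   = tag;
    }
    cl->data[offset] = value;
    cl->dirty = true;                 /* written back later, when swapped out */
#endif
}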