CSCI 340 Assignment 5

Purpose

This assignment includes a review of C++ template classes, the STL list class, the concept of function objects, and the hash table data structure.

Assignment

Implement a hash table as a C++ template class, using the division method for the hash function and separate chaining to resolve collisions. Then write a program that uses your hash table class to implement a spell-checking dictionary.

Hash Table Implementation

You will need to write the header files for two classes: The StringHashKey class is a function object that converts a C++ string into an integer key, and the HashTable class implements a hash table using the division method and separate chaining. The data for the hash table will be stored in a STL vector of STL lists.

`class StringHashKey`

The StringHashKey class is used to create a function object that converts a string key to an integer key. This class has no data members and a single public method, the overloaded operator (). The method will take a reference to a constant string as its single parameter and return an integer. The method should convert the string parameter into an integer. There are a number of possible ways to accomplish this; the logic for one technique is outlined in the pseudocode below:

set an integer sum to 0
for each character in the string
   multiply the sum times 8, then add the value of the current string character
   if the result is greater than 2**27, calculate the remainder mod 2**27

`class HashTable`

The HashTable class implements a hash table. This will be a template class that takes two parameters. The first parameter is the data type of the items to be stored in the hash table. The second parameter is a function object to convert an item of the type being stored into an integer key that can be hashed. The definition for the HashTable class should be placed in the HashTable.h header file. Because this is a template class, the code for the methods of the class should also be placed in the header file.

A HashTable contains four data members. The first is an integer that contains the number of pointers to linked lists in the hash table; these pointers represent the buckets of the hash table. The second is an integer that contains the number of data items currently stored in the hash table. The third is a function object of the type of the second template parameter which will convert a data item to an integer hash key. The fourth data member is the vector of linked-list buckets, which should be declared as follows:

   vector <list<T> > buckets;

where T is the first template class parameter. Note the space between the two > symbols - it is required. Failure to include it will cause a syntax error.

`HashTable` constructor

The constructor for the class should take a single parameter, an integer specifying the number of buckets. The constructor should set the number of buckets to this parameter, set the size to 0, and create the vector of buckets. You may need to use the member initialization syntax for the latter.

Even though this class involves dynamic memory, you do not need to write a copy constructor, destructor, or overloaded assignment operator for it because all of the dynamically-allocated storage for the hash table is managed by the embedded STL container classes, which already have the appropriate methods implemented for them.

`HashTable clear()` method

This method returns nothing and takes no parameters. It clears the hash table, releasing all of the dynamic memory associated with it. The method should reset the size to 0 and call the clear() method for the vector of buckets.

`HashTable size()` method

This method takes no parameters. It returns the number of data items currently stored in the hash table.

`HashTable bucketsUsed()` method

This method takes no parameters. It returns the number of buckets currently in use in the hash table, i.e., the number of non-empty lists.

`HashTable maxLength()` method

This method takes no parameters. It returns the maximum length of any list currently in the hash table.

`HashTable empty()` method

This method takes no parameters. It returns true if the hash table contains no data items; otherwise it returns false.

`HashTable insert()` method

This method returns a Boolean value and takes a single parameter, a reference to a constant item of the first template parameter type (the item to insert). The method should "call" the hash key function object, passing it the item to insert. The integer returned by the function object should be divided by the number of buckets; the remainder is the hash value, which can be used as an index to a particular bucket in the vector of buckets. You should perform a linear search of the linked list associated with that bucket. If the item to be inserted is already present in the linked list, return false. Otherwise, insert the item into the list for that bucket, increment the size of the hash table, and return true.

`HashTable find()` method

This method returns a Boolean value and takes a single parameter, a reference to a constant item of the first template parameter type (the item to find). The method does not alter the hash table data. The method should "call" the hash key function object, passing it the item to insert. The integer returned by the function object should be divided by the number of buckets; the remainder is the hash value, which can be used as an index to a particular bucket in the vector of buckets. You should perform a linear search of the linked list associated with that bucket. If the item to be found is present in that bucket, return true; otherwise return false.

Program

In the second part of this assignment, you will write a main program that uses your hash table class. The main program will simulate the logic of the Linux utility ispell.

The program will take three command line parameters: the number of buckets in the hash table, the name of the dictionary file, and a file of words to spell check.

There are three possible outputs for each word:

"OK" means that the word is in the dictionary.
"Not found" means that the word is not in the dictionary and neither are any of its near misses.
Otherwise, print a list of near misses in alphabetical order. Near misses are words in the dictionary that are similar to the input. The ispell program recognizes several types of near-misses: single-letter omissions, single-letter additions, transpositions of two neighboring letters, substitution of one letter for another, and two words run together. Your program should do the same. The easiest way to sort the possible corrections is to store them in a list of strings. Then you can use the sort member function of list. (The algorithm described below is a simplified form of the spelling algorithm used by most word processors today and first described in Kernighan's COLING '90 paper.)

We will provide you with a file of about 25,000 words to use as a dictionary. You should test with a much smaller dictionary first. Read the dictionary file and add the words to a hash table of the specified number of buckets. Convert each word to lower case first.

After the dictionary is loaded, print the following information about it, one per line, properly aligned:

Number of entries
Number of buckets in the hash table
Number of buckets in use
Average length of list (round to two decimal places)
Maximum length of list
Time needed to load the dictionary

Then read the file specified in the second argument. Convert each word to lower case, then check to see whether it is in the dictionary. If so, it should print "ok". If not, it should look for all possible near misses. If the program finds any near misses in the dictionary, it should print them in alphabetical order, four across. If not, it should print "not found". (Make sure you print the original word first!)

To search for near misses, you need to generate every possible near miss and look for it in the dictionary. Specifically:

Construct every string that can be made by deleting one letter from the word. (n possibilities, where n is the length of the word)
Construct every string that can be made by inserting any letter of the alphabet at any position in the string. (26*(n+1) possibilities)
Construct every string that can be made by swapping two neighboring characters in the string. (n-1 possibilities)
Construct every string that can be made by replacing each letter in the word with some letter of the alphabet. (26*n possibilities, including n times for replacing the letter by itself, which may be easier than trying to bypass this case)
Construct every pair of strings that can be made by inserting a space into the word. (n-1 pairs of words; you have to check separately in the dictionary for each word in the pair. The pair of words is a near miss only if both words are in the dictionary)

At the end of your program, print the time needed to spell check the file.

Programming Notes

The declaration of an iterator for a container class that stores a generic template parameter data type requires the addition of the keyword typename, e.g.:
```
typename list<T>::iterator iter;
```
In your main program, the data type of your hash table dictionary will be HashTable<string,StringHashKey>. Since this is rather lengthy, you may want to use the typedef instruction to create a shorter alias for this data type to save on typing, e.g.:
```
typedef HashTable<string,StringHashKey> NewName;
```
Use STL algorithms wherever possible. For example, you can use the transform function to convert the input word to lower case. You can use for_each or another algorithm to calculate the hash key. You can use the STL find algorithm to search a linked list. You will have to refer to it as stl::find to avoid ambiguity with the find method of the hash table.
This web site explains how to obtain CPU time: CPU time calculation. CLOCKS_PER_SEC is a constant defined in the #include. It is 1,000,000 on turing, which means that the system measures times to the nearest millionth of a second. That means that a total time less than one millionth of a second will appear to be 0.

In a real-world example, you want to keep the average length of an overflow chain less than 1, i.e., you want plenty of extra space in your hash table. We are trying this hash table with some unrealistically small sizes (e.g., 97) so you can see the huge time difference between a too-small hash table and a more realistic one.
Here is an example of using a class variable in a template that will be filled in with the name of a function object at compile time:
```
template <class T, class HF>
class HashTable
   {
    ...
   };


template <class T, class HF>
HashTable<T, HF>::HashTable(int nb) : buckets(nb)
  {
    ...
  }
```
You can define variables of type T or HF:
```
T my_vbl;
HF my_hashfn;
```
The program that declares an instance of the class needs to provide actual values for the class variables, e.g.:
```
HashTable<string, StringHash> dictionary(bucket_count);
```
The variables T and HF in the class definition are filled in with your choices at compile time, just as an STL class like list might be instantiated as list<int> when you write your code. Remember that if a class is defined with a template, all of its members must use the same template. For an example of the syntax, see the following example from 241: template example
The ":" in the example above is an example of member initialization syntax. You could initialize all your variables with this syntax if you prefer, e.g.: buckets(nb), bucketsUsedCount(0),... It's actually more efficient to do it that way. For a nested STL type like vector, it's required.
The purpose of declaring a function object and overloading the "()" operator is to let you use the normal calling sequence for a function whose name is a variable. For example, using the example above you could write my_hashfn(input).

Standards

Use good object-oriented practices. Methods that do not alter the class data should be declared const. Objects should normally be passed by reference to save on overhead; if a method does not alter the object, pass a reference to const data rather than passing the object by value.
Remember that any auxiliary variables you define, such as iterators, should be local to the function they are used in. They should not be class members or global variables. Similarly, variables should not be defined inside a loop.
You must have a makefile with a rule for compiling an object file for each source code file. A linking rule should link all of your object files together to create one executable. The name of your final executable should be hw5.exe. You should also provide a "clean" rule.
Your program must follow all the documentation standards listed on the course web site.
Don't forget to get rid of all compiler warnings when the -Wall compilation option is used.
As always, programs that do not compile on turing automatically receive 0 points.

Extensions

You can improve the performance of your program by using binary search instead of linear search. Another option would be to use maps to store the contents of the buckets. The STL makes it possible to try these changes with very little programming.

Submission

A dictionary is available at dict.txt.
Here is a sample input words.txt to spell check. Here are three outputs showing the effect of the hash table size: hw5-out97.txt hw5-out997.txt hw5-out9973.txt. (Note that the format is not exactly what you have been requested to produce.)

The word "assert" occurs three times in the output because "asssert" can be corrected to "assert" three ways depending on which 's' is deleted. The duplicates have been left in so that you can see whether your program is working correctly. In a real-world application you would delete the duplicates because they are not meaningful to the user.

The output also shows some other spell-check issues that you would have to deal with to get a professional-quality program. The letters of the alphabet might occur in a document, and Lund is a city in Sweden, but that doesn't mean that "b lund" is a very likely correction for "blund." You could add rules to restrict the possible pairs of words, or you could do as most spell checkers do today, and sort the corrections by probability of occurrence.

Note that you need to hash the input, not do a linear (or binary) search through the dictionary, to get correct timings.

Execute your program as ./hw5.exe buckets dictionary.txt words.txt.
Run your program 3 times, using 97, 997 and 9973 buckets, respectively. Make sure your output is the same except for the times. Compare the times.
Submit your project using mailprog.340. Submit the 3 files you have written plus your makefile. There are only three (hw5.cpp, HashTable.h, and StringHash.h) because code for template classes and functions needs to be in the .h file.

CSCI 340 Assignment 5

Purpose

Assignment

Hash Table Implementation

class StringHashKey

class HashTable

HashTable constructor

HashTable clear() method

HashTable size() method

HashTable bucketsUsed() method

HashTable maxLength() method

HashTable empty() method

HashTable insert() method

HashTable find() method