This assignment includes a review of C++ template classes, the STL
list
class, the concept of function objects, and the hash
table data structure.
Implement a hash table as a C++ template class, using the division method for the hash function and separate chaining to resolve collisions. Then write a program that uses your hash table class to implement a spell-checking dictionary.
You will need to write the header files for two classes:
The StringHashKey
class is a function object that
converts a C++ string into an integer key, and
the HashTable
class implements a hash table using the
division method and separate chaining. The data for the hash table
will be stored in a STL vector
of
STL list
s.
class StringHashKey
The StringHashKey
class is used to create a function
object that converts a string key to an integer key. This class has no
data members and a single public method, the overloaded
operator ()
. The method will take a reference to a
constant string as its single parameter and return an integer. The
method should convert the string parameter into an
integer. There are a number of possible ways to accomplish this; the
logic for one technique is outlined in the pseudocode below:
set an integer sum to 0 for each character in the string multiply the sum times 8, then add the value of the current string character if the result is greater than 2**27, calculate the remainder mod 2**27
class HashTable
The HashTable
class implements a hash table. This will
be a template class that takes two parameters. The first parameter is
the data type of the items to be stored in the hash table. The second
parameter is a function object to convert an item of the type being
stored into an integer key that can be hashed. The definition for
the HashTable
class should be placed in the
HashTable.h
header file. Because this is a template class,
the code for the methods of the class should also be placed in the
header file.
A HashTable
contains four data members. The first is an
integer that contains the number of pointers to linked lists in the hash
table; these pointers represent the buckets of the hash table. The second
is an integer that contains the number of data items currently stored in
the hash table. The third is a function object of the type of the second
template parameter which will convert a data item to an integer hash key.
The fourth data member is the vector
of linked-list buckets,
which should be declared as follows:
vector <list<T> > buckets;
where T
is the first template class parameter. Note
the space between the two > symbols - it
is required. Failure to include it will cause a
syntax error.
HashTable
constructorThe constructor for the class should take a single parameter, an
integer specifying the number of buckets. The constructor should set
the number of buckets to this parameter, set the size to 0,
and create the vector
of buckets. You may need to use the
member initialization syntax for the latter.
Even though this class involves dynamic memory, you do not need to write a copy constructor, destructor, or overloaded assignment operator for it because all of the dynamically-allocated storage for the hash table is managed by the embedded STL container classes, which already have the appropriate methods implemented for them.
HashTable clear()
methodThis method returns nothing and takes no parameters. It clears the
hash table, releasing all of the dynamic memory associated with
it. The method should reset the size to 0 and
call the clear()
method for the vector
of
buckets.
HashTable size()
methodThis method takes no parameters. It returns the number of data items currently stored in the hash table.
HashTable bucketsUsed()
methodThis method takes no parameters. It returns the number of buckets currently in use in the hash table, i.e., the number of non-empty lists.
HashTable maxLength()
methodThis method takes no parameters. It returns the maximum length of any list currently in the hash table.
HashTable empty()
methodThis method takes no parameters. It returns true
if the hash table
contains no data items; otherwise it returns false
.
HashTable insert()
methodThis method returns a Boolean value and takes a single parameter, a
reference to a constant item of the first template parameter type (the
item to insert). The method should "call" the hash key function
object, passing it the item to insert. The integer returned by the
function object should be divided by the number of buckets; the
remainder is the hash value, which can be used as an index to a
particular bucket in the vector
of buckets. You should
perform a linear search of the linked list associated with that
bucket. If the item to be inserted is already present in the linked
list, return
false
. Otherwise, insert the item into the list for that bucket,
increment the size of the hash table, and return true
.
HashTable find()
methodThis method returns a Boolean value and takes a single parameter, a
reference to a constant item of the first template parameter type (the
item to find). The method does not alter the hash table data. The
method should "call" the hash key function object, passing it the item
to insert. The integer returned by the function object should be
divided by the number of buckets; the remainder is the hash value,
which can be used as an index to a particular bucket in the
vector
of buckets. You should perform a linear search of
the linked list associated with that bucket. If the item to be found
is present in that bucket, return true
; otherwise return
false
.
In the second part of this assignment, you will write a main program that uses your hash table class. The main program will simulate the logic of the Linux utility ispell.
The program will take three command line parameters: the number of buckets in the hash table, the name of the dictionary file, and a file of words to spell check.
There are three possible outputs for each word:
list
of string
s. Then you can use the
sort
member function of list
.
(The
algorithm described below is a simplified form of the spelling
algorithm used by most word processors today and first described
in Kernighan's
COLING '90 paper.)We will provide you with a file of about 25,000 words to use as a dictionary. You should test with a much smaller dictionary first. Read the dictionary file and add the words to a hash table of the specified number of buckets. Convert each word to lower case first.
After the dictionary is loaded, print the following information about it, one per line, properly aligned:
To search for near misses, you need to generate every possible near miss and look for it in the dictionary. Specifically:
n
possibilities, where n
is the length of the word)26*(n+1)
possibilities)n-1
possibilities)26*n
possibilities, including
n
times for replacing the letter by itself, which may be
easier than trying to bypass this case)n-1
pairs of words; you
have to check separately in the dictionary for each word in the
pair. The pair of words is a near miss only if both words are in the
dictionary)At the end of your program, print the time needed to spell check the file.
The declaration of an iterator for a container class that
stores a generic template parameter data type requires the addition of
the keyword typename
, e.g.:
typename list<T>::iterator iter;
In your main program, the data type of your hash table dictionary will be
HashTable<string,StringHashKey>
. Since this is rather lengthy,
you may want to use the typedef
instruction to create a shorter
alias for this data type to save on typing, e.g.:
typedef HashTable<string,StringHashKey> NewName;
Use STL algorithms wherever possible. For example, you can use
the transform
function to convert the input word to lower
case. You can use for_each
or another algorithm to
calculate the hash key. You can use the STL find
algorithm to search a linked list. You will have to refer to it
as stl::find
to avoid ambiguity with
the find
method of the hash table.
In a real-world example, you want to keep the average length of an overflow chain less than 1, i.e., you want plenty of extra space in your hash table. We are trying this hash table with some unrealistically small sizes (e.g., 97) so you can see the huge time difference between a too-small hash table and a more realistic one.
Here is an example of using a class variable in a template that will be filled in with the name of a function object at compile time:
template <class T, class HF> class HashTable { ... }; template <class T, class HF> HashTable<T, HF>::HashTable(int nb) : buckets(nb) { ... }
You can define variables of type T
or HF
:
T my_vbl; HF my_hashfn;
The program that declares an instance of the class needs to provide actual values for the class variables, e.g.:
HashTable<string, StringHash> dictionary(bucket_count);
The variables T
and HF
in the class
definition are filled in with your choices at compile time, just as an
STL class like list
might be instantiated
as list<int>
when you write your code. Remember that if a
class is defined with a template, all of its members must use the same
template. For an example of the syntax, see the following example from
241:
template example
The ":" in the example above is an example of member initialization
syntax. You could initialize all your variables with this syntax if
you prefer, e.g.: buckets(nb), bucketsUsedCount(0),
...
It's actually more efficient to do it that way. For a nested STL
type like vector
, it's required.
The purpose of declaring a function object and overloading the "()"
operator is to let you use the normal calling sequence for a function
whose name is a variable. For example, using the example above you could
write my_hashfn(input)
.
const
. Objects should
normally be passed by reference to save on overhead; if a method does
not alter the object, pass a reference to
const
data rather than passing the object by value.You must have a makefile with a rule for compiling an object
file for each source code file. A linking rule should link all of
your object files together to create one executable. The name of your
final executable should be hw5.exe
. You should also
provide a "clean" rule.
Don't forget to get rid of all compiler warnings when the
-Wall
compilation option is used.
As always, programs that do not compile on turing automatically receive 0 points.
You can improve the performance of your program by using binary
search instead of linear search. Another option would be to
use map
s to store the contents of the buckets. The STL
makes it possible to try these changes with very little programming.
A dictionary is available at dict.txt.
Here is a sample input words.txt to spell check. Here are three outputs showing the effect of the hash table size: hw5-out97.txt hw5-out997.txt hw5-out9973.txt. (Note that the format is not exactly what you have been requested to produce.)
The word "assert" occurs three times in the output because "asssert" can be corrected to "assert" three ways depending on which 's' is deleted. The duplicates have been left in so that you can see whether your program is working correctly. In a real-world application you would delete the duplicates because they are not meaningful to the user.
The output also shows some other spell-check issues that you would have to deal with to get a professional-quality program. The letters of the alphabet might occur in a document, and Lund is a city in Sweden, but that doesn't mean that "b lund" is a very likely correction for "blund." You could add rules to restrict the possible pairs of words, or you could do as most spell checkers do today, and sort the corrections by probability of occurrence.
Note that you need to hash the input, not do a linear (or binary) search through the dictionary, to get correct timings.
Execute your program as ./hw5.exe
buckets dictionary.txt words.txt
.
Submit your project using mailprog.340. Submit the 3 files you
have written plus your makefile. There are only three
(hw5.cpp
, HashTable.h
,
and StringHash.h
) because code for template classes and
functions needs to be in the .h
file.