Module 8: Hash Tables
Lab
Abstract: We have seen hashing as a technique for obtaining
expected constant-time operations on finding or inserting elements in a set.
Hashing is also used to compute digital signatures for confirming the authenticity of messages.
In this assignment, you will apply a hashing scheme in pursuit of finding
occurences of one string in another,
using the Rabin–Karp algorithm.
That algorithm is discussed in Section 32.2 of our text, on page 990.
Introduction
Suppose we have two strings of interest:
- target
- is a relatively long string of length n.
- query
- is a relatively short string of length m.
We are interested in finding all occurrences of query in target.
For example, consider the query aaba and
the target aabaacaadaabaabaa.
The last character of the query occurs within the
target as shown below:
a a b a a c a a d a a b a a b a a
| | |
a a b a | |
a a b a |
a a b a
Notice that the second and third matches overlap. As discused in class,
the most straightforward algorithm is as follows:
for each i in the interval [0, n)
match = true
for each j in the interval [0, m)
if target[i+j] ≠ query[j]
match = false
end if
end for
if match
// definite match of query starting at index i of target
end if
end for
The worst-case complexity of this approach is
Θ(mn).
Rabin–Karp Algorithm
Instead of trying to match the query at each location in the target, the
Rabin–Karp Algorithm computes a rolling hash of a window
of m characters of the target. Near the beginning of the target, where
there may not be m characters in the window, the value 0 is
used for those
entries outside of the target.
We will use in particular the following hash function for a window
of size m in a character array, with the first character of
the window at index start:
private static int hashFor(char[] chars, int start, int m) {
int ans = 0;
for (int i=start; i < start+m; ++i) {
int val = 0;
//
// Within the bounds of the chars array?
//
if (i ≥ 0 && i < chars.length) {}
val = chars[i]; // then use the character value there
}
ans = mod( mod(31*ans, 511) + mod(val, 511), 511 );
}
return ans;
}
The above function is similar to how Java computes .hashCode() for
a String, except that we confine the results to being strictly less
than 511 using mod (the % operator). The following
identities related to mod should be useful in this assignment:
- (a + b) mod p = ( (a mod p) + (b mod p) ) mod p
- (a × b) mod p = ( (a mod p) × (b mod p) ) mod p
You should investigate how large the intermediate results of addition
and multiplication can get before the final mod p is applied.
Applied to our running example, the rolling hash values at the end of each window within our target string are as follows:
a | | a | | b | | a | | a | | c | | a | | a | | d | | a | | a | | b | | a | | a | | b | | a | | a |
| 97 | | 38 | | 254 | | 306 | | 214 | | 430 | | 337 | | 153 | | 73 | | 368 | | 92 | | 224 | | 306 | | 214 | | 429 | | 306 | | 214
|
-
Note that at the end of each aaba substring, the hash value
306 is produced.
- Thus, each time the rolling hash produces 306, we
probably have a match of aaba in the target.
- No other window produces the 306 hash value,
so we were lucky in this example: the hash value 306
produced matches only for our query.
The Rabin–Karp (RK) algorithm is essentially as follows:
- Compute the RK hash for the query string of interest.
In our running
example, that hash value is 306, which can be computed by
invoking
hashFor("aaba".toCharArray(), 0, 4)
- For each index i of the target string,
compute the rolling hash of the target that begins at index
i-m+1 and has length m.
-
In the example, the value for the first a is computed at indices
-3, -2, -1, and 0, for a window of length 4 that ends at the first index
of the target. The numerical values for characters outside the window are 0,
and the numerical value for a is its ASCII encoding of 97.
- Thus, the window for a (the first character in our
target string) has hash value 97.
- Moving one character to the right encounters another a. Its
hash value is at indices -2, -1, 0, and 1, which produces 31×97+97,
which is 3104, but taken mod 511 is 38.
- Thus the window for a a (the first two characters in
our target string) has value 38.
-
However, if we compute the rolling hash by calling the hashFor method above,
the complexity is still
Θ(mn).
- Your task is to compute that rolling hash incrementally,
as described below.
Your assignment
- Update your repository as usual.
- In the rabinkarp package of the labs source
folder, find and open the RK class.
- Implement the methods with FIXME as described in the comments.
- You MUST write out your pseudocode for each FIXME before you
write your Java code. No credit will be given for Java code without pseudocode!
- The constructor accepts the value of m, which is the length
of the query string in this assignment.
- The method int nextCh(char c) accepts a character supplied
from outside, and returns the rolling KR hash of it and the previous
m-1 characters.
You must complete the constructor and the nextCh method,
subject to the constraints stated below.
In terms of our running example, the following calls after instantiation of
a KR(4) object would return values as follows:
- nextCh('a') → 97
- nextCh('a') → 38
- nextCh('b') → 254
- nextCh('a') → 306
- nextCh('a') → 214
- nextCh('c') → 430
To receive full credit for this lab, your solution must:
- Produce the correct hash results using an incremental approach,
taking constant (Θ(1)) time. This means your approach must
take less than Ω(m) time.
- Suppose h is the hash value of a window of m characters.
- Suppose character c is about to leave that window, and
character d is the next to be incorporated into the window.
- We can compute the new value of h using:
h = (h×31 – 31m×c + d) mod 511
- Ensure that the above computation does not cause overflow, taking care
to keep the numbers mod 511 wherever possible.
- Use space proportional only to m, which is the size of
the sliding window.
- Implement a circular buffer as described in class, of size m,
to keep track of the
last m characters seen.
- Pass the unit tests.
- Check the Grading Rubric for this lab.
Having trouble?
Post problems on piazza, but do not
post code publicly there!
How to submit your work
- The unit tests are random. Run them several times so that you know you code is functional.
- Make sure all of your work is committed and pushed in your repository.
If you get a rejected message when pushing, pull and then try pushing again.
- If you do not push your code, we cannot grade it; your lab will be considered late!
- Your work will be graded as of the due date for this assignment.
Submitting your work (read carefully)
- You must commit all of your work to your repository and push it. It's best to do this
from the top-most level of your repository, which bears your wustl key.
Last modified 09:46:24 CDT 22 August 2016