CSE 247: Hash Tables

Module 8: Hash Tables

Lab

Abstract: We have seen hashing as a technique for obtaining expected constant-time operations on finding or inserting elements in a set. Hashing is also used to compute digital signatures for confirming the authenticity of messages.

In this assignment, you will apply a hashing scheme in pursuit of finding occurences of one string in another, using the Rabin–Karp algorithm.

That algorithm is discussed in Section 32.2 of our text, on page 990.

Introduction

Suppose we have two strings of interest:

target: is a relatively long string of length n.
query: is a relatively short string of length m.

We are interested in finding all occurrences of query in target. For example, consider the query aaba and the target aabaacaadaabaabaa. The last character of the query occurs within the target as shown below:

a a b a a c a a d a a b a a b a a
       |                 |     |
a a b a                  |     |
                  a a b a      |
                        a a b a

Notice that the second and third matches overlap. As discused in class, the most straightforward algorithm is as follows:

   for each i in the interval [0, n)
      match = true
      for each j in the interval [0, m)
          if target[i+j] ≠ query[j]
             match = false
          end if
      end for
      if match
         // definite match of query starting at index i of target
       end if
   end for

The worst-case complexity of this approach is Θ(mn).

Rabin–Karp Algorithm

Instead of trying to match the query at each location in the target, the Rabin–Karp Algorithm computes a rolling hash of a window of m characters of the target. Near the beginning of the target, where there may not be m characters in the window, the value 0 is used for those entries outside of the target.

We will use in particular the following hash function for a window of size m in a character array, with the first character of the window at index start:

private static int hashFor(char[] chars, int start, int m) {
   int ans = 0;
   for (int i=start; i < start+m; ++i) {
      int val = 0;
      //
      // Within the bounds of the chars array?
      //
      if (i ≥ 0 && i < chars.length) {}
         val = chars[i];                        // then use the character value there
      }
      ans = mod( mod(31*ans, 511) + mod(val, 511), 511 );
   }
   return ans;
}

The above function is similar to how Java computes .hashCode() for a String, except that we confine the results to being strictly less than 511 using mod (the % operator). The following identities related to mod should be useful in this assignment:

(a + b) mod p = ( (a mod p) + (b mod p) ) mod p
(a × b) mod p = ( (a mod p) × (b mod p) ) mod p

You should investigate how large the intermediate results of addition and multiplication can get before the final mod p is applied.

Applied to our running example, the rolling hash values at the end of each window within our target string are as follows:

a a b a a c a a d a a b a a b a a
97 38 254 306 214 430 337 153 73 368 92 224 306 214 429 306 214

Note that at the end of each aaba substring, the hash value 306 is produced.
Thus, each time the rolling hash produces 306, we probably have a match of aaba in the target.
No other window produces the 306 hash value, so we were lucky in this example: the hash value 306 produced matches only for our query.

The Rabin–Karp (RK) algorithm is essentially as follows:

Compute the RK hash for the query string of interest.
In our running example, that hash value is 306, which can be computed by invoking hashFor("aaba".toCharArray(), 0, 4)
For each index i of the target string, compute the rolling hash of the target that begins at index i-m+1 and has length m.
- In the example, the value for the first a is computed at indices -3, -2, -1, and 0, for a window of length 4 that ends at the first index of the target. The numerical values for characters outside the window are 0, and the numerical value for a is its ASCII encoding of 97.
- Thus, the window for a (the first character in our target string) has hash value 97.
- Moving one character to the right encounters another a. Its hash value is at indices -2, -1, 0, and 1, which produces 31×97+97, which is 3104, but taken mod 511 is 38.
- Thus the window for a a (the first two characters in our target string) has value 38.
However, if we compute the rolling hash by calling the hashFor method above, the complexity is still Θ(mn).
Your task is to compute that rolling hash incrementally, as described below.

Your assignment

Update your repository as usual.
In the rabinkarp package of the labs source folder, find and open the RK class.
Implement the methods with FIXME as described in the comments.
You MUST write out your pseudocode for each FIXME before you write your Java code. No credit will be given for Java code without pseudocode!
The constructor accepts the value of m, which is the length of the query string in this assignment.
The method int nextCh(char c) accepts a character supplied from outside, and returns the rolling KR hash of it and the previous m-1 characters.

You must complete the constructor and the nextCh method, subject to the constraints stated below.

In terms of our running example, the following calls after instantiation of a KR(4) object would return values as follows:

nextCh('a') → 97
nextCh('a') → 38
nextCh('b') → 254
nextCh('a') → 306
nextCh('a') → 214
nextCh('c') → 430

To receive full credit for this lab, your solution must:

Produce the correct hash results using an incremental approach, taking constant (Θ(1)) time. This means your approach must take less than Ω(m) time.
- Suppose h is the hash value of a window of m characters.
- Suppose character c is about to leave that window, and character d is the next to be incorporated into the window.
- We can compute the new value of h using: h = (h×31 – 31^m×c + d) mod 511
Ensure that the above computation does not cause overflow, taking care to keep the numbers mod 511 wherever possible.
Use space proportional only to m, which is the size of the sliding window.
Implement a circular buffer as described in class, of size m, to keep track of the last m characters seen.
Pass the unit tests.
Check the Grading Rubric for this lab.

Having trouble?

Post problems on piazza, but do not post code publicly there!

How to submit your work

The unit tests are random. Run them several times so that you know you code is functional.
Make sure all of your work is committed and pushed in your repository.
If you get a rejected message when pushing, pull and then try pushing again.
If you do not push your code, we cannot grade it; your lab will be considered late!
Your work will be graded as of the due date for this assignment.

Submitting your work (read carefully)

You must commit all of your work to your repository and push it. It's best to do this from the top-most level of your repository, which bears your wustl key.

Last modified 09:46:24 CDT 22 August 2016

`a`		`a`		`b`		`a`		`a`		`c`		`a`		`a`		`d`		`a`		`a`		`b`		`a`		`a`		`b`		`a`		`a`
	97		38		254		306		214		430		337		153		73		368		92		224		306		214		429		306		214

`a`		`a`		`b`		`a`		`a`		`c`		`a`		`a`		`d`		`a`		`a`		`b`		`a`		`a`		`b`		`a`		`a`
	97		38		254		306		214		430		337		153		73		368		92		224		306		214		429		306		214

`a`		`a`		`b`		`a`		`a`		`c`		`a`		`a`		`d`		`a`		`a`		`b`		`a`		`a`		`b`		`a`		`a`
	97		38		254		306		214		430		337		153		73		368		92		224		306		214		429		306		214