CS61B(19): Hashing

This is the lecture note of CS61B - Lecture 19.

What we have done by far?

Today we’ll see the answer to both of the questions above is yes.

Data Indexed Arrays

Let's start from a strange approach.

And what's nice about this idea is that the implementation is very simple.

However, this approach has an issue that it extremely wastes memory. Furthermore, since we want a data indexed set that can store arbitrary types ideally, so we also need some way to generalize beyond integers.

However, this approach has two problems,

  • Collisions: other words start with 'C'.
    • set.contains('church') --> true ❌
  • Can't store other string which doesn't start with EN letters, such as "98yawef".

Let's try to tackle these problems.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
/** Converts i-th character of string to a letter number. */
public static int letterNum(String s, int i) {
int ithChar = s.charAt(i);
if ((ithChar < 'a') || (ithChar > 'z')) {
throw new IllegalArgumentException();
}
return ithChar - 'a' + 1;
}

public static int englishToInt(String s) {
int intRep = 0;
for (int i = 0; i < s.length(); i += 1) {
intRep = intRep * 27;
intRep = intRep + letterNum(s, i);
}
return intRep;
}

Data Indexed String Set

The approach above can only use lowercase English characters, which is too restrictive. What if we want to store strings like “2pac” or “eGg!”? To understand what value we need to use for our base, let’s discuss briefly discuss the ASCII standard.

Hash Codes

In Java, the largest possible integer is 2,147,483,647.

  • If you go over this limit, you overflow, starting back over at the smallest integer, which is -2,147,483,648.
  • In other words, the next number after 2,147,483,647 is -2,147,483,648.

Because Java has a maximum integer, we will run into overflow even for short strings. And overflow can result in collisions, causing incorrect answers.

The official term for the number we’re computing is hash code.

  • A hash code “projects a value from a set with many (or even an infinite number of) members to a value from a set with a fixed number of (fewer) members.”
  • Our target set is the set of Java integers, which is of size 4294967296.

However, Pigeonhole principle tells us that if there are more than 4294967296 possible items, multiple items will share the same hash code. So, Collisions are inevitable!

Hash Tables: Handling Collisions

Hash Table Performance

The answer is, do not use a fixed number of buckets. Use an increasing number of buckets.

Even distribution of item is critical for good hash table performance. So we still need to discuss how to ensure even distribution (see lecture slide).

Hash Tables in Java

Python dictionaries are just hash tables in disguise.

In Java, implemented as java.util.HashMap and java.util.HashSet.

  • compute an object’s hash code: .hashCode()
    • if an object's hash code is negative, use Math.floorMod method.