Using the standard English letters and underscore only, how many characters can be used at a maximum without causing a potential collision in a hashtable/dictionary?
So strings like:
blur
Blur
b
Blur_The_Shades_Slightly_With_A_Tint_Of_Blue
...
-
There's no guarantee that you won't get a collision between single letters.
You probably won't, but the algorithm used in string.GetHashCode isn't specified, and could change. (In particular it changed between .NET 1.1 and .NET 2.0, which burned people who assumed it wouldn't change.)
Note that hash code collisions won't stop well-designed hashtables from working - you should still be able to get the right values out; the table will just potentially need to check more than one key using equality if they've got the same hash code.
Any dictionary which relies on hash codes being unique is missing important information about hash codes, IMO :) (Unless it's operating under very specific conditions where it absolutely knows they'll be unique, i.e. it's using a perfect hash function.)
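To make that concrete, here is a minimal sketch of a chained hash table (not the real Dictionary<TKey, TValue> implementation; the class name is made up for illustration): the hash code only selects a bucket, and equality decides which entry within the bucket, if any, is the right one.

    using System.Collections.Generic;

    class TinyHashMap<TKey, TValue>
    {
        private readonly List<KeyValuePair<TKey, TValue>>[] buckets =
            new List<KeyValuePair<TKey, TValue>>[64];

        private int BucketFor(TKey key)
        {
            // Mask off the sign bit so the bucket index is non-negative.
            return (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
        }

        public void Add(TKey key, TValue value)
        {
            int i = BucketFor(key);
            if (buckets[i] == null)
                buckets[i] = new List<KeyValuePair<TKey, TValue>>();
            buckets[i].Add(new KeyValuePair<TKey, TValue>(key, value));
        }

        public bool TryGetValue(TKey key, out TValue value)
        {
            int i = BucketFor(key);
            if (buckets[i] != null)
            {
                foreach (KeyValuePair<TKey, TValue> pair in buckets[i])
                {
                    // Same bucket (possibly same hash code), but only an equal key matches.
                    if (EqualityComparer<TKey>.Default.Equals(pair.Key, key))
                    {
                        value = pair.Value;
                        return true;
                    }
                }
            }
            value = default(TValue);
            return false;
        }
    }

Two keys with the same hash code simply land in the same bucket and are told apart by Equals; lookups still return the right value, just after scanning a couple of entries instead of one.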
Joshua: FYI, the known-unique case is called perfect hashing and is very rarely seen these days except with MD5 and SHA1 sums.
Jon Skeet: It can only possibly work when the hash code space is larger than the potential key space, for starters :)
Jon Skeet: (Will edit answer to reference the Wikipedia article on perfect hashes.)
Joan Venge: Thanks Jon. I see what you mean. I know I don't need perfect uniqueness; I just thought that if I had a limited number of characters, maybe that would guarantee uniqueness in the hashtable, so as to avoid the additional lookup. Also, can I see string.GetHashCode in Reflector?
ShuggyCoUk: Incidentally, MD5 and SHA are NOT perfect hashes. They are just very convoluted and computationally expensive excellent hashes. You can do just as well with much cheaper ones for non-crypto uses.
ShuggyCoUk: @Joan you can see the implementation of GetHashCode in Reflector. It's ok-ish; there are better ones available, though I would question whether you really need to bother, since they aren't *that* much better until you get to *really* big (millions+) dictionaries.
Joan Venge: Thanks, ShuggyCoUk. I see. Mine was just out of curiosity. :)
-
If your key is a string (e.g., in a Dictionary) then its GetHashCode() will be used. That's a 32-bit integer. Hashtable defaults to a load factor of 1 and increases the number of buckets to maintain that load factor. So if you do see collisions they should tend to occur around reallocation boundaries (and decrease shortly after reallocation).
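Purely for illustration (the exact values depend on the runtime, and newer runtimes randomise them per process), each of the strings from the question boils down to a single 32-bit integer:

    using System;

    class ShowHashCodes
    {
        static void Main()
        {
            string[] keys = { "blur", "Blur", "b",
                              "Blur_The_Shades_Slightly_With_A_Tint_Of_Blue" };
            foreach (string s in keys)
            {
                // The dictionary only ever sees this int (plus Equals for confirmation).
                Console.WriteLine("{0,-45} -> {1}", s, s.GetHashCode());
            }
        }
    }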
-
A hash algorithm isn't supposed to guarantee uniqueness. Given that there are far more potential strings (26^n for length n, even ignoring special characters, spaces, capitalization, non-English characters, etc.) than there are places in your hashtable, there's no way such a guarantee could be fulfilled. It's only supposed to guarantee a good distribution.
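A quick back-of-the-envelope check of that counting argument (the class and variable names are made up for the example): even identifiers of exactly six characters over a 53-character alphabet already outnumber all possible 32-bit hash codes.

    using System;

    class PigeonholeDemo
    {
        static void Main()
        {
            double hashCodes = Math.Pow(2, 32);   // ~4.29e9 possible 32-bit hash codes
            double strings6  = Math.Pow(53, 6);   // ~2.22e10 strings of exactly 6 characters
            Console.WriteLine("hash codes:       {0:N0}", hashCodes);
            Console.WriteLine("length-6 strings: {0:N0}", strings6);
        }
    }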
-
Given a perfect hashing function (which you're not typically going to have, as others have mentioned), you can find the maximum possible number of characters that guarantees no two strings will produce a collision, as follows:
No. of unique hash codes available = 2^32 = 4,294,967,296 (assuming a 32-bit integer is used for hash codes)
Size of character set = 2 * 26 + 1 = 53 (26 lower case plus 26 upper case letters in the Latin alphabet, plus underscore)
Then you must consider that a string of length l (or less) has a total of 54^l representations. Note that the base is 54 rather than 53 because the string can terminate after any character, adding an extra possibility per character - not that it greatly affects the result.

Taking the number of unique hash codes as your maximum number of string representations, you get the following simple equation:

    54^l = 2^32

And solving it:

    log2(54^l) = 32
    l * log2(54) = 32
    l = 32 / log2(54) ≈ 5.56

(Where log2 is the logarithm function of base 2.)
Since string lengths clearly can't be fractional, you take the integral part to give a maximum length of just 5. Very short indeed, but observe that this restriction would prevent even the remotest chance of a collision given a perfect hash function.
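If you want to reproduce the arithmetic, a throwaway snippet (names are illustrative) gives the same figure:

    using System;

    class MaxCollisionFreeLength
    {
        static void Main()
        {
            double l = 32 / Math.Log(54, 2);   // 32 / log2(54)
            Console.WriteLine(l);              // prints ~5.56, so at most 5 whole characters
        }
    }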
This is largely theoretical however, as I've mentioned, and I'm not sure how much use it might be in the design consideration of anything. That said, hopefully it should help you understand the matter from a theoretical viewpoint, on top of which you can add the practical considerations (e.g. non-perfect hash functions, non-uniformity of distribution).
ShuggyCoUk: +1 I liked your maths so much I did my own. Mind me integrating your maths into my answer to include the practical limitations on the perfect hashing?
Noldorin: @ShuggyCoUk: Yeah, no problem... I'd be interested to see what you come up with. :)
Joan Venge: Yeah this is a good one.
ShuggyCoUk: Integrated and attributed in comments; now community wiki, feel free to edit if you wish. Thanks.
-
Universal Hashing
To calculate the probability of collisions with S strings of length L, with W bits per character, mapped to a hash of length H bits, assuming an optimal universal hash (1), you could calculate the collision probability based on a hash table of size (number of buckets) N.

First things first, we can assume an ideal hashtable implementation (2) that splits the H bits in the hash perfectly into the available buckets N (3). This means H becomes meaningless except as a limit for N. W and L are simply the basis for an upper bound for S. For simpler maths, assume that strings of length < L are simply padded to L with a special null character. If we are interested in the worst case, this is 54^L (26*2 + '_' + null); plainly this is a ludicrous number, and the actual number of entries is more useful than the character set and the length, so we will simply work as if S were a variable in its own right.

We are left trying to put S items into N buckets. This then becomes a very well known problem: the birthday paradox. Solving this for various probabilities and numbers of buckets is instructive, but assuming we have 1 billion buckets (so about 4 GB of memory on a 32-bit system) we would need only about 37K entries before we hit a 50% chance of there being at least one collision. Given that, trying to avoid any collisions in a hashtable becomes plainly absurd.
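The 37K figure follows from the usual birthday-paradox approximation: with N buckets, roughly sqrt(2 * N * ln 2) entries give a 50% chance of at least one collision. A quick, purely illustrative check:

    using System;

    class BirthdayBound
    {
        static void Main()
        {
            double buckets = 1e9;   // the one-billion-bucket table assumed above
            double entriesFor50PercentChance = Math.Sqrt(2 * buckets * Math.Log(2));
            Console.WriteLine(entriesFor50PercentChance);   // ~37,233 entries
        }
    }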
All this does not mean that we should not care about the behaviour of our hash functions. Clearly these numbers assume ideal implementations; they are an upper bound on how good we can get. A poor hash function can give far worse collisions in some areas, or waste some of the possible 'space' by never or rarely using it, all of which can make hashing less than optimal and even degrade performance to something that looks like a list, but with much worse constant factors.
The .NET framework's implementation of the string hash function is not great (in that it could be better), but it is probably acceptable for the vast majority of users and is reasonably efficient to calculate.
An Alternative Approach: Perfect Hashing
If you wish, you can generate what are known as perfect hashes. This requires full knowledge of the input values in advance, however, so it is not often useful. In a similar vein to the above maths, we can show that even perfect hashing has its limits:
Recall the limit of 54^L strings of length L. However, we only have H bits (we shall assume 32), which is about 4 billion different numbers. So if you can have truly any string, and any number of them, then you have to satisfy:

    54^L <= 2^32

And solving it:

    log2(54^L) <= 32
    L * log2(54) <= 32
    L <= 32 / log2(54) ≈ 5.56

Since string lengths clearly can't be fractional, you are left with a maximum length of just 5. Very short indeed.
If you know that you will only ever have a set of strings well below 4 billion in size, then perfect hashing would let you handle any value of L, but restricting the set of values can be very hard in practice: you must know them all in advance, or degrade to what amounts to a database of string -> hash, to which you add new strings as they are encountered.
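A minimal sketch of that degenerate fallback (the class and method names are made up for illustration): hand out sequential ids as new strings arrive, which is collision-free by construction, at the cost of keeping the entire string-to-id mapping around.

    using System.Collections.Generic;

    // Sketch only: a "perfect hash" maintained incrementally as a lookup table.
    class StringIdRegistry
    {
        private readonly Dictionary<string, int> ids = new Dictionary<string, int>();

        public int GetOrAssignId(string s)
        {
            int id;
            if (!ids.TryGetValue(s, out id))
            {
                id = ids.Count;   // next unused id; unique for as many strings as fit in an int
                ids.Add(s, id);
            }
            return id;
        }
    }

Note that this is itself just a dictionary lookup, which is why the approach rarely buys you anything unless you need stable, dense ids for some other purpose.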
(1) For this exercise the universal hash is optimal, as we wish to reduce the probability of any collision; i.e. for any input, the probability of it having output x from a set of R possibilities is 1/R.
(2) Note that doing an optimal job on the hashing (and the internal bucketing) is quite hard, but you should expect the built-in types to be reasonable if not always ideal.
(3) In this example I have avoided the question of closed and open addressing. This does have some bearing on the probabilities involved, but not significantly.
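For note (1), a purely illustrative sketch of the idea (not the framework's string hash, and strictly only "almost" universal: two distinct strings of length L collide with probability of roughly L divided by the modulus, taken over the random seed):

    using System;

    static class AlmostUniversalStringHash
    {
        // Illustrative choices: a large prime just below 2^32 and a per-process random seed.
        const ulong Prime = 4294967291;
        static readonly ulong Seed = (ulong)new Random().Next(2, int.MaxValue);

        public static uint Hash(string s)
        {
            ulong h = 0;
            foreach (char c in s)
                h = (h * Seed + c) % Prime;   // Horner's rule: evaluate a polynomial at Seed
            return (uint)h;
        }
    }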
ShuggyCoUk: This answer now reflects Noldorin's efforts on the perfect hashing, so it is now community wiki.