Planetarion Forums

22 Nov 2002, 00:17

I need 2 fairly quick and simple, and easy to write in java, hashing algorithms, to hash a string into an int, and im too lazy to sift through hundreds of google pages.

any help you could give would be appreciated

MT · 22 Nov 2002, 02:01

could i suggest a book on data algorithms?

from my current favourite book ..

this is for a CBHT, choose m to be a prime s.t. you would expect a load factor of about 0.6, c[i] being the ith char of the word

hash(word) = sigma(c[i]*i,0,length(word)) % m

.. its ahrd to do sigma signs, so that means the sum of (c[i]*i) from i = 0 to i < length(word)

or in java, as I'm a softy,

Code:

static int hash(String s) throws Exception {
  int hash = 0;
  if (s==null) throw new IllegalArgumentException();
  for (int i = 0;i < s.length();i++)
    hash+=(i*s.charAt(i));
  return hash % PRIME;
}

22 Nov 2002, 16:49

ah magic, i assume i would substitute prime with my table size, as its for a hashing table.

cheers

Structural Integrity · 22 Nov 2002, 17:06

What is hashing actually for? I never really understood (I even had problems understanding the concept of hash-tables in mIRC, but that's besides the point).

Can you give examples in which situation one would like to use hashing, why and what you accomplish with it?

Nodrog · 22 Nov 2002, 20:47

Quote:

Originally posted by Structural Integrity
What is hashing actually for? I never really understood (I even had problems understanding the concept of hash-tables in mIRC, but that's besides the point).

Can you give examples in which situation one would like to use hashing, why and what you accomplish with it?

A hashing algorithm 'converts' an obect into an integer, for storing in a hash table.

A hash table is a way to store data (like an array/list/tree/whatever), but a lot more efficient, as it has constant time on all operations. Most other data structures have certain operations for which they are slow (eg, finding an object in an unsorted array takes linear time, insertations into a binary tree take log(n) time, searching a binary tree takes log(n) time, etc etc etc).

A hash table however takes constant time for all operations. It is normally implemented using a large array (for example, you could use a hash table of size 1000 to store strings). Every string has an integer representation, which is calculated using a hashing algortithm (so the string "what" might be 997, while the string "lol" might be 445). This number is the strings 'key' in the array. The string "what" would be stored at the 997th position in the array (array[997]), while the string "lol" would be stored in array[445]. When you wanted to look up a string in the array, you would first find the strings key, and then look at that position in the array.

This is a horrible explanation, and it gets more complex because different strings may have the same key, and so forth. Do a search on google for a better one.

22 Nov 2002, 20:50

they are used in places when you want a quick lookup table.
You can perform a hashing algorithm on a string, to get an index for it instead of scanning an array for it which aparently is much quicker.

eg, you could have an array of 1000 elements long, and you want to find a certain string which is held in one of these elements, it could be in the last one, so instead of scanning through the array to get it, you just perform a quick hash on it, and jump to its place.

Ofc there are problems such as one element already being at that place etc, but there are ways around this and it still proves much quicker than just a regular search algorithm.

the only practicle aplication i can think of where this is used is a compilers table of reserved words.

Hope this helps

22 Nov 2002, 20:52

beat me to it

meglamaniac · 22 Nov 2002, 21:32

SI - a very basic example of using hash tables is in the numerous mIRC lookup scripts (ie. !seen <nick>).

I made one myself about a year ago which was above average in that it would search for other matches to the requested nick's address.
The first time round it used a text file.
This was a very bad idea since every time someone spoke, joined, left, etc the text file had to be searched and updated, and then people looking up nicks caused it to be searched as well.
Bear in mind this was on a P200, and I quite often went in creators hour then - the machine would practically fall over (it did, on more than one occasion) simply because constantly searching and writing in a 3Mb text file is too much.
I then re-wrote the storage system (kept the "front end" of the script the same) to use a hash table. There was very little difference in how the information was stored (something along the lines of nick:address:$ctime:<various parameters depending on what it was the user last did>. The hash table was dumped to a file every 5 minutes (in case the machine crashed) and was dumped on exit. It was reloaded from file when mIRC opened.
The search time went from something like 15 seconds to complete a lookup with 1000 nicks in the text file to about 2 seconds to complete a lookup with 1000 nicks in the hash table.

This is probably not a great example, but its much easier to store data like this where you don't know the length of each field, never mind each line, or how many entries you have like this - and it's fast.
I know mIRC hash tables aren't exactly 'real' either, but there you go.

22 Nov 2002, 21:39

/me puts on the 'to-play-with' list and goes in search of code examples

22 Nov 2002, 21:49

i used a hashmap once to represent a (very small) db-relation-table, because i needed that relation quite often (the <option>-s in a select-field) and didnt want to access the db all the time.
(i didnt write anything on my own though, i used java.util.HashMap )

MT · 22 Nov 2002, 23:25

Two different things ..

hashcodes are useful in two ways - generating indexs to hash tables, and equality tests on certain objects. If its prohibitivly expensive to directly compare two objects, you can first thin the field by seeing if their hashcodes are the same. If they are, then do the expensive operation, otherwise dont.

hashtables are a time efficient way of retrieving data, comparing objects (the ubiquitous contains() method) and storing data. The constraints on a hash table are that for maximum efficiency, the load factor (the number of elements contained in the table/the number of buckets in the table) must be kept below 1, optimally (to allow for growth) below 0.6. If the load factor becomes high (say 1000), then essentially you degrade to O(n) time performance.

Btw nod, I believe you meant O(n) time for linked list traversal.

meglamaniac · 22 Nov 2002, 23:38

My bad then (tut - what a bad example...).

Erm - one more thing.
If we now have hashing algorithms and hash tables, the what on earth do you call the type of hash used to secure passwods and other sensitive 1 way data (MD5 and others of thier ilk)?

MT · 22 Nov 2002, 23:58

thats still a hash (a hash is just a routine that will convert one set of information into another set of information, and reliabley return the same info time after time (or similar)), hashtables and hashcodes are just yet another abstract data type that your lecturers will start to bang on about in a bit if you are unlucky.

meglamaniac · 23 Nov 2002, 00:01

They're allready banging on about set theory and bolean logic fs. It's not particularly hard, it's just that after an hour of either of them your brain has turned into cream cheese purely because it wanted to do something more exciting.

The irony is that either is not really necessary at this stage. Take Aberystwyth with it's excellent CS reputation - my mate is in his second year there and they had no maths modules at all in the first year.
The only reason we're doing all this maths at Loughborough is because some of the countries top Logic bods live in the CS department and like thier egos boosting.

*sigh*

JetLinus · 23 Nov 2002, 02:28

Quote:

Originally posted by meglamaniac
My bad then (tut - what a bad example...).

Erm - one more thing.
If we now have hashing algorithms and hash tables, the what on earth do you call the type of hash used to secure passwods and other sensitive 1 way data (MD5 and others of thier ilk)?

Ok, maybe it's just like this:
As someone said above, "what" might give you 997, while "lol" gives you 445.

Ok, now I got my password and the server has it, and I want to convince the server I got it right.
Let my password be "lol" for this example.
I compile a hash, the server does, and we both check... If the hashes are the same, chances are very high we got the same original string (ie. password).
The advantage of this method is, that we only had to transfer the hash-code. And apparently it's hard to guess the password just from its hash (one-way-function).
Well basically I think the server and client both add some random part to the pw and then apply MD5 to it (which should return pretty unique hashes), so that its no value if you get hold of the hashkey, as for the next login procedure you will need a new one...

Sorry if I didn't make myself clear...

queball · 23 Nov 2002, 09:18

Quote:

Originally posted by MT
abstract data type

Do you just put in the word "abstract" because it makes the phrase complete?

Structural Integrity · 23 Nov 2002, 12:12

So, when I understand correctly hash tables are primailly used to check the presence of, for example, a string. When you want to get the x'th entry it becomes less efficient than a simple array because you can have void entries in your hash-array.

When a duplicate hash index occurs it puts it on the same place in the array but with a sub index. (my little theory)
That explains why mIRC manual mentiones that a hash table becomes more efficient with more hash-array-fields (or something like that) because more strings can be assigned to an unique index that way and are not assigned a sub index.

\o/

Nodrog · 23 Nov 2002, 15:07

Quote:

Originally posted by Structural Integrity
So, when I understand correctly hash tables are primailly used to check the presence of, for example, a string. When you want to get the x'th entry it becomes less efficient than a simple array because you can have void entries in your hash-array./

Yes, however a decent hash table implementation will dynamically resize the array if it gets too full. This is where the "load factor" comes into play - if a hash table has a load factor of 0.5, it will resize itself when it is 50% full.

Quote:

Originally posted by Structural Integrity
When a duplicate hash index occurs it puts it on the same place in the array but with a sub index. (my little theory)

Depends on the implelmentation. You can indeed make linked lists or whatever at indexes with multiple objects (this is called 'chaining'), but there's also other ways to do it. Probing is one of the major ones, and this is where all the shenanigans with prime numbers comes into play.

http://ciips.ee.uwa.edu.au/~morris/Y...sh_tables.html