

A common task for any authentication system is to store and retrieve passwords. Doing this securely is key to building a system that is not only stable, but also relatively safe if it is ever compromised and potential attackers are able to view stored account information. Passwords should never (or rarely) be stored as plaintext: this is where one-way cryptographic hashing can save the day, or at least save plenty of difficult work.
Early in my career, one commonly used technique that often confused me was password hashing.
To authenticate a user, most systems do not need to store the user's actual password. Instead, they employ a technique called hashing, in which a complex and irreversible operation transforms a user's password into a hash of fixed length. Practically speaking, most modern hashing wrappers use the MD5 or SHA-1 algorithms, but other methods of differing complexity are also available.
The idea of hashing is a bit hard to grasp at first. The point is to feed sensitive data into one end of an operation and have the other end deliver a much less sensitive blob of data, in an irreversible manner. I say less sensitive and not insensitive for reasons that we will discuss later in this article.
One analogy that helped me understand this concept has to do with colours. Every grade school student knows that mixing blue and yellow paint makes green paint. Different shades and amounts of blue and yellow form a different shade of green. Once the blue and yellow are mixed together to form green, it's impossible—without an elaborate and expensive process that's out of reach for most people—to separate them back into their exact original colours. Given an exact amount and shade of blue, however, a user could supply an exact amount and shade of yellow to form exactly the same green. This is similar to hashing in that the operation has one known and one unknown, and that the operation is not (easily) reversible.
Hashed strings are valuable for a few reasons. Primarily, hashes let you store password information in an area that is mostly secure (again, see below), and hashing algorithms produce output of a fixed length: a 4-character password and a 400-character password will hash to the same number of output bits using MD5, for example.
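The fixed-length property is easy to check for yourself. Here is a minimal sketch, assuming nothing beyond PHP's built-in md5() function; the two input strings are arbitrary placeholders.

```php
<?php
// Two inputs of very different lengths (arbitrary placeholder data).
$short = str_repeat('a', 4);
$long  = str_repeat('a', 400);

// md5() returns the 128-bit digest as 32 hexadecimal characters,
// regardless of how long the input is.
echo strlen(md5($short)), PHP_EOL; // 32
echo strlen(md5($long)), PHP_EOL;  // 32
```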
The irreversible nature of hashing a string lends itself well to storing passwords in a database. Consider the following:
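As a rough illustration, assume a hypothetical user table with an email column and a password column. The column names and the email address below are placeholders of mine; the table is modelled as a PHP array to keep the sketch self-contained.

```php
<?php
// Hypothetical table contents: one row, password stored as plaintext.
$users = [
    ['email' => 'user@example.com', 'password' => 'qwert123'],
];

foreach ($users as $row) {
    // Anyone who can read the table sees the password itself.
    echo $row['email'], ' | ', $row['password'], PHP_EOL;
}
```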
As you can see, anyone with read access to this particular user table can now impersonate me by using my password, as found in plaintext in this fictitious database.
To make this a bit safer, we can hash the password on insert with a query that looks something like this:
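One way such a query might be written is sketched here. The table name, column names and placeholder names are assumptions of mine, and the statement is only built as a string, not executed. MD5() is the SQL function discussed in this article.

```php
<?php
// Hypothetical schema: users(email, password).
// The :email and :password placeholders would be bound by a database
// layer such as PDO; MD5() is applied by the database server on insert.
$sql = 'INSERT INTO users (email, password) VALUES (:email, MD5(:password))';

// Values that would be bound to the placeholders.
$params = [
    ':email'    => 'user@example.com', // placeholder address
    ':password' => 'qwert123',
];

echo $sql, PHP_EOL;
```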
This results in:
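The stored row then holds the hash rather than the password. As a sketch with the same hypothetical columns and placeholder email, PHP's md5() can stand in for the database's MD5(), since both return the digest as 32 lowercase hexadecimal characters.

```php
<?php
// Hypothetical row after the insert: the password column holds the
// MD5 digest of 'qwert123', not the password itself.
$stored = [
    'email'    => 'user@example.com',
    'password' => md5('qwert123'),
];

echo $stored['email'], ' | ', $stored['password'], PHP_EOL;
```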
If you didn't know the original password, it would be very difficult to guess that the data passed to MD5() to produce 824a67f29e97b8798a9df7f00189f3e1 was in fact qwert123.
As far as cryptographers are concerned, the only way to determine the input of a secure hashing algorithm is to actually test all possible inputs until a match is found. This means that given the output 824a67f29e97b8798a9df7f00189f3e1, we'd have to start guessing by actually computing all possible values:
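Exhaustive guessing of this kind can be sketched in a few lines. The function names, the tiny three-letter alphabet and the target value below are my own choices to keep the example fast. With an alphabet of n characters there are n^k candidates of length k, so the search space grows exponentially with password length.

```php
<?php
/**
 * Yield every string over $alphabet with length 1 up to $maxLength.
 * Keeps all strings of the current length in memory, which is fine
 * for a demonstration but not for a large search.
 */
function candidates(string $alphabet, int $maxLength): Generator
{
    $chars   = str_split($alphabet);
    $current = [''];
    for ($len = 1; $len <= $maxLength; $len++) {
        $next = [];
        foreach ($current as $prefix) {
            foreach ($chars as $c) {
                $next[] = $prefix . $c;
            }
        }
        foreach ($next as $candidate) {
            yield $candidate;
        }
        $current = $next;
    }
}

/**
 * Hash each candidate and compare it with the target hash.
 * Returns the matching input, or null if none is found.
 */
function bruteForceMd5(string $targetHash, string $alphabet, int $maxLength): ?string
{
    foreach (candidates($alphabet, $maxLength) as $guess) {
        if (md5($guess) === $targetHash) {
            return $guess;
        }
    }
    return null;
}

// Toy target: 3 + 9 + 27 = 39 candidates in total.
$target = md5('cab');
echo bruteForceMd5($target, 'abc', 3), PHP_EOL; // prints cab
```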
Obviously, this large number of computations is expensive—computing resources are not cheap, and calculating exhaustively like this takes a long time—so your password is safe... or is it?
Practically speaking, no, your password is not safe. Many years ago it may have been, but although computing resources are not cheap on an individual basis, the same collective chunk of CPU-hours, disk space and memory is becoming much more easily attainable by individuals and small groups. We'll get into the mechanics and math of this in the next part of this series, but before leaving you hoping that no one bothers to calculate your data, we need to touch on rainbow tables.
As with most bits of computation-intensive data, there's not much sense in performing the same expensive calculations more than once. In a modern application of “don't reinvent the wheel,” wouldn't it make sense to simply calculate MD5 hashes once, and then simply store the results so they can be easily referenced when needed?
Consider this:
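The idea can be sketched as a precomputed lookup table. The word list and function name here are hypothetical. Strictly speaking, real rainbow tables use hash chains to trade lookup time for storage, so this is only the simplest form of the technique.

```php
<?php
// Hypothetical word list; real tables cover vastly more candidates.
$wordlist = ['password', 'letmein', 'qwert123', 'secret'];

// Pay the hashing cost once, keyed by hash for fast reverse lookups.
$lookup = [];
foreach ($wordlist as $word) {
    $lookup[md5($word)] = $word;
}

/**
 * Return the input that produced $hash, or null if it is not in the table.
 */
function reverseLookup(array $table, string $hash): ?string
{
    return $table[$hash] ?? null;
}

echo reverseLookup($lookup, md5('qwert123')), PHP_EOL;    // prints qwert123
var_dump(reverseLookup($lookup, md5('not-in-the-list'))); // NULL
```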
“That sort of thing is science fiction!” you must be thinking. Truthfully, the first time I saw this type of data, I thought the same. And back then (almost 5 years ago, now, I think), these data pools were small and mostly insignificant. However, just today, in researching for this article, I stumbled upon one interface to a rainbow table repository, and out of curiosity, entered 824a67f29e97b8798a9df7f00189f3e1, fully expecting it to return “not found.” Turns out I was wrong. The service quickly and easily told me qwert123. This is a bit scary.
So, how does a diligent coder go about protecting his data from this sort of thing? The not-so-secret sauce is something known as “salting” your data before hashing. In the same way that md5('qwert123') returns 824a67f29e97b8798a9df7f00189f3e1, md5('qwert123' . 'mySecretSaltGoesH3re') gives a completely different result: f4b5ee03796df1379fe14766d4cc6821.
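A short sketch of why this helps, using the example salt from the paragraph above. The one-entry table is a stand-in for a real precomputed table built from unsalted candidates.

```php
<?php
$password = 'qwert123';
$salt     = 'mySecretSaltGoesH3re'; // the example salt used above

$plainHash  = md5($password);
$saltedHash = md5($password . $salt);

// Stand-in for a table precomputed from unsalted candidate passwords.
$precomputed = [md5('qwert123') => 'qwert123'];

// The unsalted hash is found in the table; the salted one is not.
var_dump(isset($precomputed[$plainHash]));  // bool(true)
var_dump(isset($precomputed[$saltedHash])); // bool(false)
```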
There are two main types of salting: public-dynamic and private-static. In the private scenario, one salt is applied uniformly to all passwords in a given set. The salt is kept private, and it is “known” by all library code. As long as the salt stays secret, an attacker cannot build a rainbow table that pre-calculates the possible passwords in this set. Each piece of code that verifies or stores a hash must apply this specific salt:
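A minimal sketch of private-static salting follows. The constant and function names are hypothetical, and the salt value is the placeholder used earlier in this article; in practice it would live in configuration outside the database.

```php
<?php
// Hypothetical constant: one private salt shared by all library code.
const SITE_SALT = 'mySecretSaltGoesH3re';

/** Hash a password with the site-wide salt before storing it. */
function hashPassword(string $password): string
{
    return md5($password . SITE_SALT);
}

/** Re-apply the same salt and compare against the stored hash. */
function verifyPassword(string $password, string $storedHash): bool
{
    // hash_equals() compares the two strings in constant time.
    return hash_equals($storedHash, hashPassword($password));
}

$stored = hashPassword('qwert123');
var_dump(verifyPassword('qwert123', $stored)); // bool(true)
var_dump(verifyPassword('qwerty', $stored));   // bool(false)
```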
However, even if the salt does, at some point, leak out, it's far less dangerous than using straight hashes of the password data. This is the basis for the other type of salting that I mentioned: public.
In public-dynamic salting, each hash is assigned its own salt when the hash is stored. Many systems do this by default when using the crypt() function/system call. Let's say we want to create a two-character salt (we'll use XX to illustrate) for each password we're storing; we'd do something like this:
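Here is one way the per-password salt might be generated and stored, computed in PHP instead of in the query. The function names and the salt alphabet are assumptions of mine; the stored format (salt followed by the MD5 of password plus salt) matches the XX example in this article.

```php
<?php
/** Generate a random alphanumeric salt (two characters by default). */
function randomSalt(int $length = 2): string
{
    $alphabet = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
    $salt = '';
    for ($i = 0; $i < $length; $i++) {
        $salt .= $alphabet[random_int(0, strlen($alphabet) - 1)];
    }
    return $salt;
}

/** Prefix the public salt to the hash of password plus salt. */
function saltedHash(string $password, string $salt): string
{
    return $salt . md5($password . $salt);
}

// The value to insert into the password column.
$stored = saltedHash('qwert123', randomSalt());
echo strlen($stored), PHP_EOL; // 34: 2-character salt + 32-character MD5
```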
The inserted value, in this case, is XX7d3c60249fc71ac62b9cf8286fef5549. Validating this hash is a bit more complicated. You'd first have to retrieve the user record based on the email and then you'd have to determine the salt with string manipulation functions, then you'd have to query again for the hash, to match. Of course, the salt could be stored in another column, but I showed this method because this is how many systems handle salting.
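The validation step might look something like this sketch. The function name is hypothetical, and instead of querying a second time it recomputes the hash in PHP and compares it with the value already retrieved for the user.

```php
<?php
/**
 * Check a login attempt against a stored "salt + md5(password + salt)" value.
 */
function verifySaltedHash(string $password, string $stored, int $saltLength = 2): bool
{
    // The salt is public: it is simply the prefix of the stored value.
    $salt     = substr($stored, 0, $saltLength);
    $expected = $salt . md5($password . $salt);

    // Constant-time comparison of stored and recomputed values.
    return hash_equals($stored, $expected);
}

// Stand-in for the value retrieved from the user record.
$stored = 'XX' . md5('qwert123' . 'XX');

var_dump(verifySaltedHash('qwert123', $stored)); // bool(true)
var_dump(verifySaltedHash('wrong', $stored));    // bool(false)
```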
Hopefully you already knew about hashing data, but now you have a better understanding of why you go through the hassle. Salting is an additional hassle, but it's worth it. If you don't believe me, try running your own hashes on some of the public rainbow table interfaces.
Sean Coates wears many hats around php|architect, and is currently in charge of software development and system administration. He was formerly the Editor-in-Chief of php|architect Magazine, and is the co-host of php|architect's PHP Podcast.
Two more thoughts on this subject:
- I think it is much preferable to compute the hash in PHP and not in the SQL. If you have SQL logging enabled (think of MySQL binary logs, which are enabled by default), then your passwords live on as clear text on your hard disk! Plus, if the connection between PHP and your SQL server is not encrypted, passwords travel unencrypted over the wire! So I always prefer to do this in PHP: 'XX' . md5('qwert123' . 'XX') and insert the result in the query.
- You should mention http://www.openwall.com/phpass/ which goes even further on the topic.
It's 2008! md5 and sha1 are "broken". SHA256/SHA512 is where it's at.
At the current exponential rate of growth (estimated at 9x per year) of that MD5 database, all 8-character alphanumeric sequences could be computed in about 3.5 years. The database size would be at least 2.8 trillion entries. The rate of growth could increase as fast parallel computing becomes even cheaper. All the more reason to always salt your hashes (and preferably with a longer hash).
Thanks for explaining this in 'plain English' and not in geek. It makes it easier for a self-taught hack like myself to understand.
@Tetraboy points out that MD5 and SHA-1 have been shown to be crackable in recent years, so SHA-256/SHA-512 is preferable. That's correct. In fact, NIST considers SHA-256 to be the preferred secure hashing algorithm, and mandates its use in US Federal agencies by 2010. @Okin7 says that he prefers to do the hashing in PHP code and then post the hash string to the database, instead of sending an SQL expression containing the plaintext password. That's a good idea. However PHP doesn't seem to support a builtin function for SHA-256 or other SHA-2 hash algorithms yet. FWIW, Python does support the SHA-2 family of hash algorithms in its hashlib module!
@billkarwin PHP actually supports a lot more hashing algos than people think: md4, md5, sha1, sha256, sha384, sha512, ripemd128, ripemd160, whirlpool, tiger128,3, tiger160,3, tiger192,3, tiger128,4, tiger160,4, tiger192,4, snefru, gost, adler32, crc32, crc32b, haval128,3, haval160,3, haval192,3, haval224,3, haval256,3, haval128,4, haval160,4, haval192,4, haval224,4, haval256,4, haval128,5, haval160,5, haval192,5, haval224,5, haval256,5. Just use the hash() function.
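For anyone who wants to try this, here is a small sketch using hash() and hash_algos() from PHP's hash extension. The password and the XX salt are the placeholder values from the article, and the exact list of algorithms varies between PHP builds.

```php
<?php
// List what this PHP build supports; the exact set varies by version.
$algos = hash_algos();
var_dump(in_array('sha256', $algos, true)); // bool(true)

// Same salted construction as in the article, with SHA-256 instead of MD5.
$salt   = 'XX';
$digest = hash('sha256', 'qwert123' . $salt);

echo strlen($digest), PHP_EOL; // 64 hex characters for SHA-256
```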
Re: hash() function. Well, all righty then! Thanks! It'd be nice if the docs for md5() and sha1() mentioned this in their See Also section.
@Bill I've updated the XML sources, next rebuild and push to mirrors it should appear: http://docs.php.net/md5

