Why I Hate MD5 or: How I Learned to Start Worrying and Hate the Misconceptions

January 24, 2008

Posted in Security, Technology

Most people in a technical field who have delved into cryptography, public keys, or file authenticity have at one time or another at least heard of MD5. MD5 is commonly described as a fingerprint that is unique to a file. It is the signature line included on most reputable Linux download sites so you can check that the file you downloaded is not corrupt. It is also the current bane of my existence.

I’m attempting to archive files and want the MD5 to be unique on each file. So I’m attempting to make a digest of the files I’m creating. Wikipedia describes MD5 digesting like this:

MD5 digests have been widely used in the software world to provide some assurance that a transferred file has arrived intact. For example, file servers often provide a pre-computed MD5 checksum for the files, so that a user can compare the checksum of the downloaded file to it. Unix-based operating systems include MD5 sum utilities in their distribution packages, whereas Windows users use third-party applications.

Now, I’m aware that there are ways to spoof an MD5 hash, so in effect you could modify a file and keep the same hash. That isn’t my concern at the moment. For my needs I am generating an MD5 from a group of files, and some of these are zero-byte files. What I was not aware of, though I’m sure it is common knowledge, is that a zero-byte file will always produce d41d8cd98f00b204e9800998ecf8427e as its hash output. I’ll be the first to admit that I’m an idiot for not knowing or thinking about this.
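For anyone who wants to check this for themselves, here is a minimal sketch using Python's standard hashlib module. The variable name EMPTY_MD5 is just mine for illustration. It hashes zero bytes of input directly, with no file involved at all:

```python
import hashlib

# MD5 over zero bytes of input. No file, name, or metadata is involved.
EMPTY_MD5 = hashlib.md5(b"").hexdigest()

print(EMPTY_MD5)  # d41d8cd98f00b204e9800998ecf8427e
```

Any zero-byte file fed to an MD5 tool hands the algorithm exactly this empty input, which is why they all come out the same.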

Essentially, when the MD5 is calculated it looks only at the data inside the file and not at the file per se. A friend I was emailing described it like this:

 Think about it this way. You have 2 boxes, one is red and one is blue
and you also have a clipboard. The clipboard says on it contents, 1 red
box, 1 blue box, both of size 0 (this is the filesystem metadata). Your
md5 program asks for the location of the box from the clipboard, walks
over to the box and opens it up (not caring about color, shape, size,
etc). Grabs the contents (in this case nothing) and runs nothing through
its magical hashing process.

Now, I am aware that a checksum of a file or packet looks at the data as a whole and is usually limited in size, since I believe most checksums built into OSes are short, often 16 or 32 bits. I thought these were later superseded by MD5 checksums and their larger 128-bit digests (written as 32 hexadecimal characters). I thought MD5 would at least look at the file name, since a file named test1 and a file named test2 are still distinct files even if they are zero bytes in size. Paraphrasing my friend’s scenario, I thought MD5 would look at the box, its size, and its contents as a complete package, taking the file name into account to make the result unique. Sadly, I thought wrong.
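The test1 and test2 case can be reproduced with a short sketch like the one below. The helper names md5_of_file and digests_for_empty_files are hypothetical, made up for this illustration. It creates two empty files in a temporary folder and hashes the contents of each, the way a typical MD5 tool would:

```python
import hashlib
import os
import tempfile


def md5_of_file(path):
    """Return the hex MD5 digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digests_for_empty_files(names):
    """Create zero-byte files with the given names and hash each one."""
    results = {}
    with tempfile.TemporaryDirectory() as folder:
        for name in names:
            path = os.path.join(folder, name)
            open(path, "wb").close()  # create a zero-byte file
            results[name] = md5_of_file(path)
    return results


result = digests_for_empty_files(["test1", "test2"])
print(result["test1"] == result["test2"])  # True: the names never reach the hash
print(hashlib.md5().digest_size * 8)       # 128: digest width in bits
```

The file name is only used to find the file. Nothing in md5_of_file ever passes it to the hash, so two files with identical contents cannot be told apart this way.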

MD5 has its flaws: in recent years it has been broken and can be spoofed. Hopefully the next iteration will tie a file’s uniqueness to more than just the contents inside of it. Bob may be male, but that doesn’t make him any less unique than Larry, who is also male. If MD5 were looking at images it would count them as being equal in this regard.

MD5 is something I thought I understood better than I actually did. It’s like the guy you knew at college and ran into years later. Originally you would have been fine if he had dated your sister. However, now that you know him better and have found out the things he has done wrong, you feel that you never knew him as well as you thought you did. This makes you trust him less. Unfortunately, one of the most common uses of MD5 is to promote trust.

I recant what I said originally in the title. I don’t hate MD5, but I don’t trust it the way I used to.

7 thoughts on “Why I Hate MD5 or: How I Learned to Start Worrying and Hate the Misconceptions”

  1. there is a difference between md5 and the common practice of using only the contents of a file as input. i guess this is because the tool md5sum takes a filename as argument, which is like a shortcut for “cat file | md5sum”. you could echo the filename and its contents and pipe them into md5sum, and you would have it.

    so the better title would be “why i hate the practice of only using file-contents as input for md5 hashes and not taking the filename itself into account”

  2. Well, there were a lot of people as ignorant as me, so I thought the misconception title was best. But yes, I hate the fact that it ignores the file name.

  3. It takes, say, 2 minutes to write a program which actually uses file metadata (name, creation and modification date, or even inode data [position on the disk]) for generating hashes, be it md5 (which is in fact harder to spoof than you suggest) or some other algos (sha2 comes to mind). Takes half an hour if you want it a bit faster. :-)

    But your problem is that you fail to know the goal of the method you use: it's for detecting changes in the files (or, actually, the falsification of the data) and not to have a hash unique to the (actually any random) file. Downloaders aren't interested whether the file is called “kernel-latest.tar.bz2” or “linux-2.6.31rc2.tar.bz2” as long as it's the same.

    Actual security tools (like tripwire, integrit, etc) use file metadata hashing as well, so they detect not just data or filename change, but moving the file or having it changed by any unknown means (which changes, say, inode numbers).

    Use tools for what they're for. Don't try to screw in a screw with a sledgehammer. ;-)

  4. I was just pointing out the fallacy some people fall into with MD5. I'm well aware these days not to trust it too much. It has its flaws, and I know it is difficult to spoof, but MD5 collisions caused the new SSL vulnerability issue because people put trust in it and didn't think md5 collisions would be an issue at all. It was considered good enough. If you find something you consider a problem, raise awareness. In security “good enough” is never good enough; the bad guys always work past it.

  5. btw the script I ended up writing with md5 did use some file metadata, so yes, you are right and it does take less than 2 minutes.
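
The idea raised in comments 1 and 3, feeding the file name into the hash along with the contents, could look roughly like the sketch below. This is not the script mentioned in comment 5, which isn't shown here. The function name md5_with_name and the NUL separator are my own assumptions for illustration.

```python
import hashlib


def md5_with_name(name, contents):
    """Hex MD5 over the file name, a NUL separator, and the contents.

    Hypothetical helper: the layout (name, NUL byte, data) is an
    assumption made for this example, not a standard format.
    """
    digest = hashlib.md5()
    digest.update(name.encode("utf-8"))
    digest.update(b"\x00")  # NUL cannot appear in a file name, so it marks the boundary
    digest.update(contents)
    return digest.hexdigest()


a = md5_with_name("test1", b"")
b = md5_with_name("test2", b"")
print(a != b)  # True: two empty files now hash differently
```

The trade-off is the one comment 3 points out: a digest built this way no longer matches what md5sum prints for the same file, and renaming the file changes the result even though the data is untouched.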
