Skip to main content.

Sat, 09 Aug 2008

Finding duplicate files

Every once in a while, I know one file is duplicated in many places. This happens, for example, when I have imported photos from my camera into a photo management program and also stored a copy of them somewhere else. Sometimes I have downloaded files twice from the web.

Detecting duplicate files is not hard - you just compare the file contents. The problem is that with large files, and a large number of files, it can take a long time if you compare every file to every other file.

Because I needed to do this for a few gigabytes of photos, and everything I found I either didn't trust or ran too slowly, I wrote my own. Once you detect duplicate files, you generally want to either delete all but one, or to "merge them" via hardlinks so that all the files exist, but they share storage space on disk.

Summary: I had a fairly good approach, but everyone should use rdfind instead of my code.

My approach

You can check out (using Subversion or a web browser) my code at http://svn.asheesh.org/svn/public/code/merge_dups/ .

This approach has to stat() every file at least once, but many files don't have to be read at all. For my photos, this was a huge time-saver.

(Why delete the one with a longer filename? Usually that's the one in some obscure directory named "camera-backup" or "recovered-from-some-dying-computer".)

I trust my code. Plus, it is verbose, printing out what it is doing and why. And the entire program with comments, status message print-outs, and vertical spacing easily fits on my screen.

Other implementations

Today, I decided to go through Freshmeat to see if I could retire my code and just rely on someone else's. So I checked out the reasonable contenders from this search.

find_duplicates by Fredrik Hubinette

It uses the first few kilobytes of the string as a hash, which is probably more efficient that reading the whole thing. It is safe and reads the whole files before marking them as duplicates.

dmerge.cpp by Jonathan H Lundquist

I stopped caring when I realized it calls external programs. I doubt it does it in a correct/secure way, so forget it.

duff by Camilla Berglund

This looks really good, but it doesn't actually do the merging. It relies on a shell script to do the merging, and I don't trust the correctness of the shell script's handling of filenames (due to the whitespace-separated output format of duff itself).

Note to Camilla: If you provided a -z option (like find -print0) to duff, and made sure the shell script respected it, then it would be practically perfect.

fslint 2.14

It was so litlte fun to use I don't even want to talk about it. The benchmarks on the rdfind web page confirm this with data.

rdfind by Paul Sundvall (WINNER!)

Finding software like this is why I look for software not written by me.

Other tools I didn't fully review

Conclusion

rdfind looks great. Every once in a while, two hours are better spent doing research rather than re-inventing the wheel. This is one of those times where I was more useful to my life as a secretary rather than by trying to be a programmer.

[] permanent link and comments

Comment form

  • The following HTML is supported: <a href>, <em>, <i>, <b>, <blockquote>, <br/>, <p>, <abbr>, <acronym>, <big>, <cite>, <code>, <dfn>, <kbd>, <pre>, <small> <strong>, <sub>, <sup>, <tt>, <var>
  • I do not display your email address. It is for my personal use only.

Name: 
Your email address: 
Your website: 
 
Comment: