/ opensource/ dupefinder

opensource icon DupeFinder

This page is about a duplicate file finding (and deleting) program and the algorithm it uses.

contents

introduction

I wrote this application to demonstrate the use of a new data type I discovered (the 'condensed ternary tree').

It's also a neat duplicate file finder.

The algorithm used could easily be genericised into a map implementation or container for streams (strings), that should be faster and smaller in memory than either a hash or ternary tree.

This program extends a clever idea for finding duplicate files by Antonio Bellezza .

There are a large number of programs already available to find duplicate files (a search returns half a million results), perhaps because such a program is deceptively easy for a beginner to write.

Unfortunately they almost all use the same (wrong) algorithm.
This proves the old aphorism that "for any sufficiently complex problem there is a solution that is simple, intuitive, obvious and wrong".

The way these programs typically work is to calculate the checksums on all files, then look for identical checksums. This causes two problems:
  1. it means that all of every file has to be read, making the programs extremely slow
  2. it is surprisingly easy to get two different files with the same checksum (the so-called birthday paradox)
    e.g. in a system with a 1,000,000 files (a typical number for a desktop system) there is less than a 1 in 1,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 (1 followed by 51 zeros) chance that there won't be at least 2 dissimilar files with the same 32-bit checksum.
This explains why so many of the programs I tried flagged non-identical files as duplicates and some would have automatically deleted them given the right options.

A better algorithm is used in the program DupSeek.

This finds duplicate files by sorting all the files into sets of files that are equal up to a certain length, and subdividing the sets by increasing the length being compared. When any set contains only one file, it must be unique. This is much faster than using checksums as most files can be found to be unique after only a few bytes are read. (Using the dupefinder debug option shows almost all files differ in the first 8 bytes)

I have modified this algorithm in DupeFinder to take better advantage of disk caching and memory localisation.

Because of the unique algorithm it uses it should be faster than any other duplicate file finder (although I haven't tested this)

The algorithm used is shown under algorithm below - this is probably only of interest to programmers, others can skip to installation.

algorithm

A ternary tree (aka ternary search tree) is a compressive form of storing any set of strings that have common prefixes. It is commonly used for word lists, and in fact I use one in my dictionary utility

It is usually faster to read than a hash (i.e. to determine if a word is in the list) when held in memory, and is almost always faster than a hash when paged or read from disk

From the point of view of access times & space ternary trees have the problem that many strings have substrings that are unique, particularly suffixes.
For these substrings each char takes at least 3 pointers to store:
e.g. the tree below contains only the words 'associate' and 'aspect':


          0   0   0   0
         /   /   /   /
        p - e - c - t - 0
       / \   \   \   \
      /   0   0   0   0
     /   .   .
a - s - s - o -...
 \   \   \   \   ...
  0   0   0   0  ...
where 0 = a null pointer

This has a lot of null pointers (although it is still much better than a trie) and reading a word beyond its unique prefix amounts to following a linked list with a node for each character.

The idea behind a condensed ternary tree is to store any contiguous substrings that are in all elements below a particular node, in that node; as illustrated below:

         0
        /
    pect - 0
   /    \    0
  /      0  /
as - sociate - 0  ...
  \         \
   0         0


where 'as', 'pect' & 'sociate' are each in a single node.
This uses much less space for a list containing many unique or long strings, and is faster to access / search because each unique fragment can be read & compared at once.

The main complication is that of splitting the node when a new string is added e.g. after adding 'assume' to the above tree it should look like:

             0
            /
        pect - 0
       /    \
      /      0
     /        0
    /        /
   /   ociate - 0
  /   /      \ 
as - s - 0    0
  \   \ 
   0   \     0
        \   /
         ume - 0
            \
             0

I have an implementation of a 'condensed ternary tree' in the 'DupeFinder' program, which uses it to store (links to) the contents of files.

DupeFinder applies this data structure to the problem of finding identical files while reading the minimum possible from each file.

A clever algorithm for finding duplicate files is used in the program Dupseek.
As described above, this finds duplicate files by sorting all the files into sets of files that are equal up to a certain point. each set is then successively subdivided by reading the next byte. When any set contains only one file, it must be unique. This is much faster than using checksums as most files can be found to be unique after only a few bytes are read. The dupefinder debug option shows almost all files differ in the first 8 bytes, and most in the first 4 bytes.

Unfortunately this algorithm is still not ideal; this is because at each stage it reads a single byte from all the files in each set in order to split it into more sets.
In other words, it reads a single byte from many files then later goes back to many of them to read the next byte.
This means the program is constantly jumping between files, the disk cache is almost never hit and the disk head is moving between sectors.

Reading a thousand bytes from one file is much faster than reading one byte from a thousand files because the disk head does not have to seek, and it takes advantage of caching better.

So I have devised an alternative algorithm that still reads the minimum necessary from each file, but takes advantage of disk caching better.

The solution is to store a list of sets of files, each set contains files identical to the others in the set.
As each file is added it is sorted into the right set, reading only as much of the file as needed to find the correct set.

A tree is used to store the sets, the position in the tree is determined by the contents of the file, so each branch in the tree contains a unique file or a set of files that are all identical up to a certain point.

As new files are added to the tree, it is only necessary to read the files at each node it is compared against. The same file (being inserted) will be read until a byte is found that is unique, or the end of the file is reached. This results in a lot less jumping around between files, and a faster algorithm.

The obvious form of tree to use is a ternary tree.
This means that, when searching for a file in the tree, after having read a certain length of the file a position in the tree is reached such that to compare it to the files in the next node only the next byte of the file need be read. This is analogous to storing a string in a ternary tree of strings, where once a node is reached only the next character of the string need be read.

For a binary tree or similar, the files would need to be re-read up to the point of difference on each compare.

For a normal ternary tree at least as many nodes would be needed as the number of bytes in the longest file, so it is much better to use a condensed ternary tree.

In this implementation, I have not stored the contents of the files in the tree (if I did this, it would be a compressed copy of all the files). Instead each node contains a pointer to a position in a single file and a length that identifies the sub-string associated with the node.

performance

As noted, in practice most files differ in the first 4 bytes. But the average length of a file is 2K. This suggests that DupeFinder should be about 2000 / 4 or 500 times faster than checksum based programs that read all of every file. This of course ignores disk caching that would reduce this difference.
I haven't done any benchmarks, but it does seem much faster than other programs I have tried.

I intend to do a comparison of duplicate file finders some time (similar to my review of directory synchronisers) and I will include benchmarks in that.

A parameter allows you to set the program's internal buffer size, which may be worth tweaking (it should generally be below 128 bytes).

installation

To run DupeFinder:
It can then be run from the command line with a command like java -ea dupes.DupeFinder ...( see usage)

known bugs

This program is still work in progress, so ...
  1. DupeFinder will quite happily follow symlinks (which it probably optionally shouldn't), this means if you have link to a directory above itself, the program will enter an infinite loop. This only applies to OS's which have links (i.e. not windows)
  2. It will simply search each directory passed on the command line in turn and treat any two identical files it finds as duplicates. This means if the same directory is read twice, all files in it will be marked as duplicates of themselves, and potentially deleted. E.g. "DupeFinder a a, DupeFinder a/b/c a/b" will delete all files in a/b/c.

TODOs

usage

DupeFinder [ --hideFirst(-f) | --showUniques(-u) | --debugLevel(-d) <int> | --bufSize(-b) <int> | --deleteDupes(-D) | --linkDupes(-l) | --userString(-s) <string> | --help(-h|-?) | --noPrint(-n) | --extension(-e) | --test(t) | --name(m) | --verbose(v) | --] directory [directory]

All boolean options default to off

hideFirst don't show the first file in the list of dupes (only relevant when printing output)
showUniques show unique files as well as dupes (only relevant when printing)
debugLevel amount of debugging info to show default = 0 (none)
bufSize internal file buffer size (default 16)
deleteDupes delete any duplicate files found (but not first file found)
linkDupes delete any duplicate files found (but not first file found) and replace them with links to the first file (the link will have the name of the deleted file)
userString the user string supplied is invoked in the shell, with any '%1' in the string replaced with the first file, and '%2' replaced with the duplicate file. example DupeFile -s "rm \"%2\"; ln -s \"%1\" \"%2\"" ~
help show this message
noPrint don't print the duplicate files to stdout
extension only treat files as dupes if they have the same extension
test run unit test
name only treat files as dupes if they have the same name (including extension)
verbose verbose output
-- treat the next string as a directory even if it begins with '-'

licence

All sources are released under the General Public License.


Share this post with others:
diggdigg this article
del.icio.ussave in del.icio.us
Slashdotsave in Slashdot
slashdotsubmit to slashdot
redditpost to reddit
YahooMyWebpost to yahoo
Click an icon to share this post on a link forum site so other readers can find it
netscape Fark Simpy Furl connotea blogmarks co.mments Ma.gnolia Spurl blinkbits blinklist feedmelinks LinkaGoGo NewsVine Netvouz scuttle Shadows smarking rawsugar TailRank