Backing up your files is expensive and time consuming when you have many gigabytes of data. When you have many users, it's not uncommon to find multiple copies of the same file on a file server. When you pay for storage and its backup, duplicate files aren't something you want to encourage.
fdf is a small Perl script that grew out of a request from a Windows admin at work to track down duplicate files on their Windows file servers. The initial version was little more than a crude hack; when a friend said he would like a copy, I thought it best to rewrite it as a proper application.
The script first catalogues every file in the requested directories that meets the search criteria. This pass is relatively quick and easy, even for a large file system holding many gigabytes of data.
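This cataloguing pass amounts to walking each directory tree and bucketing file paths by size. A minimal sketch in Python (fdf itself is Perl; the function and parameter names here are illustrative, not fdf's internals):

```python
import os
from collections import defaultdict

def catalogue(paths, lower=0, upper=None):
    """Pass 1: walk each directory tree and group file paths by size."""
    by_size = defaultdict(list)
    for root_dir in paths:
        for dirpath, _dirnames, filenames in os.walk(root_dir):
            for name in filenames:
                full = os.path.join(dirpath, name)
                if not os.path.isfile(full):  # skip broken symlinks, devices, etc.
                    continue
                size = os.path.getsize(full)
                if size < lower or (upper is not None and size > upper):
                    continue
                by_size[size].append(full)
    return by_size
```

Only the size buckets holding two or more paths need to go forward to the later passes.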
Calculating a checksum puts both the file system and the CPU under considerable load, so only files whose sizes are not unique are considered. Even then, most files of the same size are probably not duplicates, so to reduce the number of files to checksum further, the program reads a small sample of each candidate and excludes any file that is unique at the sample point.
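The sampling trick can be sketched like this: within each same-size group, read a small window from each file and drop any file whose sample matches no other file's. The 8192-byte window matches the "Sample Window" shown in the verbose output below; the function name is illustrative:

```python
from collections import defaultdict

SAMPLE_WINDOW = 8192  # bytes read from the start of each candidate file

def filter_by_sample(same_size_files):
    """Keep only files whose first SAMPLE_WINDOW bytes match another file's."""
    by_sample = defaultdict(list)
    for path in same_size_files:
        with open(path, "rb") as fh:
            by_sample[fh.read(SAMPLE_WINDOW)].append(path)
    # A file unique at the sample point cannot be a duplicate of anything.
    return [p for group in by_sample.values() if len(group) > 1 for p in group]
```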
Once this list is complete, the script goes over it calculating a FIPS SHA-1 digest of every remaining file. This can be very time consuming.
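Computing the digests is then straightforward; reading each file in chunks keeps memory use flat even on very large files. A sketch of this final pass (again illustrative Python, not fdf's Perl):

```python
import hashlib
from collections import defaultdict

def sha1_of(path, chunk_size=65536):
    """Stream a file through SHA-1 so it need not fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def group_duplicates(candidates):
    """Group candidate paths by full SHA-1; any group of 2+ is a duplicate set."""
    by_hash = defaultdict(list)
    for path in candidates:
        by_hash[sha1_of(path)].append(path)
    return {h: g for h, g in by_hash.items() if len(g) > 1}
```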
Finally, it outputs a simple report with the hash, the file size in bytes, and the full path to each file. It is up to the administrator to decide what to do with the duplicates; by default the script does not delete a duplicate and create a link to the original, though experimental support for automatic deletion and hard-linking is now available.
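The experimental delete-and-hard-link behaviour boils down to replacing a duplicate with a hard link to the surviving copy. A hedged illustration of that idea (this is not fdf's actual implementation, and hard links only work within a single filesystem):

```python
import os

def relink(original, duplicate):
    """Replace `duplicate` with a hard link to `original` (same filesystem only)."""
    tmp = duplicate + ".fdf-tmp"   # hypothetical temp name for the swap
    os.link(original, tmp)         # create the new link first, so no data is lost
    os.replace(tmp, duplicate)     # atomically swap it into place
```

Creating the new link before removing the old file means a crash mid-operation never leaves the data unreachable.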
$ fdf /tmp/
45def52a1cb24fc02bdcb88c94a46723181528f0 2048 /tmp/copy1
45def52a1cb24fc02bdcb88c94a46723181528f0 2048 /tmp/copy2
45def52a1cb24fc02bdcb88c94a46723181528f0 2048 /tmp/copy3
$ fdf -h
This is Find Duplicate Files version 0.4
Usage: fdf [ -l <bytes> ] [ -u <bytes> ] [ -v ] [ -o <output_log> ] [ -g 'GLOB' ] [ -x ] [ -X ] [paths]...
If no options are passed, fdf will scan the current working directory for files of all sizes, with results output to *STDOUT.
Options:
  -l  Lower limit of file size to scan, in bytes
  -u  Upper limit of file size to scan, in bytes
  -g  Pass a shell-style GLOB to limit the search
  -v  Verbose mode (sent to *STDERR)
  -s  If directories to be scanned are mounted via Samba from CIFS/SMB
  -o  Output file to log results to
  -x  Delete duplicate files and create hard links
  -X  Force deletion (Think very hard before using this...)
  -h  This usage note
$ fdf -l 1024 -u 8192 -v -g 'copy*' -o fdf.log /tmp /var/tmp
Find Duplicate Files v0.4 (verbose mode)
Search GLOB: copy*
Minimum file size: 1024
Maximum file size: 8192
Output Log File: fdf.log
Sample Window : 8192
Finding all files in: /tmp
Finding all files in: /var/tmp
This may take a while...
Pass 1 complete. Possibility of 10240 bytes of duplication.
Now calculating checksums. This may take a little while longer...
Pass 2 complete. 4096 bytes of duplicates found.
Results logged to fdf.log
Files and SHA1 sums:
Beta/development snapshot. May not even work!
Note: I'm working through a number of submitted patches at the moment and am in the process of moving house, so while I am aware of bugs and possible improvements, it's just not practical to do anything about them right now. April 2010.
Digest::SHA (in the Perl 5.10 core)
File::Temp (version 0.20 or later)