It's expensive and time consuming to back up your files when you have many gigabytes of data. When you have many users, it's not uncommon to find many copies of the same file on a file server. When you pay for storage and its backup, duplicate files aren't something you want to encourage.
fdf is a small Perl script that came from a request by a Windows admin at work to track down duplicated files on their Windows file servers. The initial version was little more than a crude hack; when a friend said he would like a copy, I thought it best to write it as a proper application.
The script first catalogues every file in the requested directories, provided it meets the search criteria. This is relatively quick and easy, even for a large file system holding many gigabytes of data.
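To make the first pass concrete, here is a minimal sketch of how such a catalogue could be built with File::Find, grouping paths by size. The variable names are illustrative and not taken from fdf itself.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Find;

    # Pass 1: walk the requested directories and group plain files by size.
    # A size seen only once can never belong to a duplicate.
    my @paths = @ARGV ? @ARGV : ('.');
    my %files_by_size;

    find(sub {
        return unless -f $_;                 # plain files only
        my $size = -s _;                     # reuse the stat() from -f
        push @{ $files_by_size{$size} }, $File::Find::name;
    }, @paths);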
Calculating a checksum puts both the file system and the CPU under considerable load, so only files whose sizes are not unique are considered. Even then, most files of the same size are probably not duplicates, so to reduce the number of files to checksum the program reads a small sample of each file and excludes any file that is unique at the sample point.
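Continuing the sketch above, the sampling pass might look something like this. The 8192-byte window matches the "Sample Window" shown in the verbose output below, but reading from the start of the file is an assumption; where fdf actually samples may differ.

    # For any size that occurs more than once, read a small sample and keep
    # only files whose sample matches at least one other file of that size.
    my $window = 8192;
    my @to_checksum;

    for my $size (keys %files_by_size) {
        my $paths = $files_by_size{$size};
        next unless @$paths > 1;             # unique size => cannot be a duplicate

        my %by_sample;
        for my $path (@$paths) {
            open my $fh, '<:raw', $path or next;
            my $sample = '';
            read $fh, $sample, $window;
            close $fh;
            push @{ $by_sample{$sample} }, $path;
        }
        push @to_checksum, @$_ for grep { @$_ > 1 } values %by_sample;
    }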
Once this list is complete, it goes over it calculating a SHA-1 (FIPS) digest of every remaining file. This can be very time consuming.
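The final pass could then use Digest::SHA (one of the dependencies listed at the end of this page) along these lines; again, this is a sketch rather than fdf's actual code.

    use Digest::SHA;

    # Pass 2: full SHA-1 digest of every remaining candidate.
    # Files sharing a digest are reported as duplicates.
    my %by_digest;
    for my $path (@to_checksum) {
        my $sha = Digest::SHA->new(1);       # 1 => SHA-1
        eval { $sha->addfile($path, 'b') };  # 'b' = read in binary mode
        next if $@;                          # skip unreadable files
        push @{ $by_digest{ $sha->hexdigest } }, $path;
    }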
Finally, it outputs a simple report with the hash, the file size in bytes and the full path to the file. It is up to the administrator to decide what to do with the duplicates; by default the script doesn't attempt to delete a duplicate and create a link to the original, although experimental support for automatic deletion and hard-linking is now available.
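For completeness, here is how the report (and, conceptually, the experimental hard-linking) could fit together. The unlink/link step is only sketched in a comment, because the real -x option may behave differently.

    # Report: digest, size in bytes, full path -- one line per copy.
    for my $digest (sort keys %by_digest) {
        my $paths = $by_digest{$digest};
        next unless @$paths > 1;
        printf "%s %d %s\n", $digest, -s $_, $_ for @$paths;

        # The experimental -x behaviour amounts to keeping the first copy and,
        # for each remaining duplicate $dup, doing:
        #   unlink $dup; link $paths->[0], $dup;
        # (hard links only work within a single filesystem).
    }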
$ fdf /tmp/
45def52a1cb24fc02bdcb88c94a46723181528f0 2048 /tmp/copy1
45def52a1cb24fc02bdcb88c94a46723181528f0 2048 /tmp/copy2
45def52a1cb24fc02bdcb88c94a46723181528f0 2048 /tmp/copy3
$ fdf -h
This is Find Duplicate Files version 0.4
Usage: fdf [ -l <bytes> ] [ -u <bytes> ] [ -v ] [ -o <output_log> ] [ -g 'GLOB' ] [ -x ] [ -X ] [paths]...
If no options are passed, fdf will scan the current working directory for files of all sizes, results outputted to *STDOUT.
Options:
  -l  Lower limit of file size to scan, in bytes
  -u  Upper limit of file size to scan, in bytes
  -g  Pass a shell style GLOB to limit the search
  -v  Verbose mode (sent to *STDERR)
  -s  If directories to be scanned are mounted via Samba from CIFS/SMB
  -o  Output file to log results to
  -x  Delete duplicate files and create hard links
  -X  Force deletion (Think very hard before using this...)
  -h  This usage note
$ fdf -l 1024 -u 8192 -v -g 'copy*' -o fdf.log /tmp /var/tmp
Find Duplicate Files v0.4 (verbose mode)
Search GLOB: copy*
Minimum file size: 1024
Maximum file size: 8192
Output Log File: fdf.log
Sample Window : 8192
Finding all files in: /tmp
Finding all files in: /var/tmp
This may take a while...
Pass 1 complete. Possibility of 10240 bytes of duplication.
Now calculating checksums. This may take a little while longer...
Pass 2 complete. 4096 bytes of duplicates found.
Results logged to fdf.log
Files and SHA1 sums:
58802270037699f93ec7e48478f355ecfe88851a
cea3a501776f1c6e143b7d27c98b45453d9a06fe
3199f4663e273f7da542b544e2e9e929d7d4d371
Beta/development snapshot. May not even work!
923cd7c1b7a272ffd5cbeccb82f3dd7103765ca2
Note: I'm working through a number of submitted patches at the moment and am in the process of moving house, so while I am aware of bugs and possible improvements, it's just not practical to do anything about them right now. April 2010.
Text::Glob
Digest::SHA (in the 5.10 core)
File::Temp (version 0.20 or later)