
Find Duplicate Files

Introduction

Backing up your files is expensive and time-consuming when you have many gigabytes of data. When you have many users, it's not uncommon to find many copies of the same file on a file server. When you pay for storage and its backup, duplicated files aren't something you want to encourage.

fdf

fdf is a small Perl script that came from a request by a Windows admin at work to track down duplicated files on their Windows file servers. The initial version was little more than a crude hack; when a friend said he would like a copy, I thought it best to rewrite it as a proper application.

Theory

The script first catalogues every file in the requested directories, provided they meet the requested criteria. This is relatively quick and easy, even for a large file system holding many gigabytes of data.
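
As a rough illustration of this first pass, a minimal sketch in Perl (using File::Find and a simple group-by-size hash; the variable names are mine, not fdf's) might look like:

#!/usr/bin/perl
# Rough sketch of the cataloguing pass (not fdf's actual code):
# walk the requested directories and group plain files by size.
use strict;
use warnings;
use File::Find;

my %files_by_size;    # size in bytes => list of paths of that size

find(
    sub {
        return unless -f $_;    # plain files only
        my $size = -s _;        # reuse the stat buffer from -f
        push @{ $files_by_size{$size} }, $File::Find::name;
    },
    @ARGV ? @ARGV : '.'
);

# Only a size shared by two or more files can possibly hide duplicates.
my @candidate_sizes = grep { @{ $files_by_size{$_} } > 1 } keys %files_by_size;
print "sizes with more than one file: @candidate_sizes\n";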

Calculating a checksum puts both the file system and the CPU under considerable load, so only files whose size is not unique are considered. Most files of the same size are still not duplicates, so to reduce the number of files to checksum further, the program takes a small sample of each file and excludes any file that is unique at the sample point.
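
A sampling step along these lines could be sketched as follows (the 8192-byte window matches the sample window shown in the verbose output below, but the subroutine names are illustrative and not fdf's own):

# Rough sketch of the sampling step: read a small window from the
# start of each same-size candidate and drop any file whose sample
# is unique within its group. Names and details are illustrative.
use strict;
use warnings;

my $window = 8192;    # bytes to sample from the start of each file

sub sample_of {
    my ($path) = @_;
    open my $fh, '<:raw', $path or return;
    read $fh, my $buf, $window;
    close $fh;
    return $buf;
}

sub prune_by_sample {
    my (@paths) = @_;
    my %by_sample;
    for my $path (@paths) {
        my $sample = sample_of($path);
        push @{ $by_sample{$sample} }, $path if defined $sample;
    }
    # Keep only files whose sample matches at least one other file's.
    return map { @$_ } grep { @$_ > 1 } values %by_sample;
}

print "$_\n" for prune_by_sample(@ARGV);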

Once this list is complete, the script goes over it again, calculating a FIPS SHA-1 digest of every remaining file. This can be very time-consuming.
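
The digest pass itself is straightforward with Digest::SHA; a stand-alone sketch (taking candidate paths from the command line purely to keep it self-contained) could be:

#!/usr/bin/perl
# Rough sketch of the digest pass: SHA-1 every remaining candidate
# and group files by digest. Not fdf's actual code.
use strict;
use warnings;
use Digest::SHA;

sub sha1_of {
    my ($path) = @_;
    my $sha = Digest::SHA->new(1);    # 1 => SHA-1
    $sha->addfile($path, 'b');        # 'b' => read in binary mode
    return $sha->hexdigest;
}

my %by_digest;
push @{ $by_digest{ sha1_of($_) } }, $_ for @ARGV;

# Any digest shared by more than one path is a set of duplicates.
for my $digest (sort keys %by_digest) {
    my @paths = @{ $by_digest{$digest} };
    next unless @paths > 1;
    printf "%s\t%d\t%s\n", $digest, -s $_, $_ for @paths;
}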

Finally, it outputs a simple report with the hash, the file size in bytes and the full path to each file. It is up to the administrator to decide what to do with the duplicates; by default the script does not attempt to delete a duplicate and replace it with a link to the original. Experimental support for automatic deletion and hard-linking is now available.
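
For the curious, replacing a duplicate with a hard link boils down to an unlink-and-link pair per file. The subroutine below is only an illustration of that idea, not fdf's -x implementation; note that hard links only work within a single filesystem:

# Illustration of the idea behind the experimental -x option:
# keep one file from each duplicate set and replace the others
# with hard links to it. Not fdf's actual implementation.
use strict;
use warnings;

sub relink_duplicates {
    my ($keep, @dupes) = @_;
    for my $dupe (@dupes) {
        if (unlink $dupe) {
            # Hard links require both paths to be on the same filesystem.
            link $keep, $dupe
                or warn "link $keep -> $dupe failed: $!\n";
        }
        else {
            warn "unlink $dupe failed: $!\n";
        }
    }
}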

"Screen Shots"

Simple Case

$ fdf /tmp/
45def52a1cb24fc02bdcb88c94a46723181528f0        2048    /tmp/copy1
45def52a1cb24fc02bdcb88c94a46723181528f0        2048    /tmp/copy2
45def52a1cb24fc02bdcb88c94a46723181528f0        2048    /tmp/copy3

Options

$ fdf -h
This is Find Duplicate Files version 0.4

Usage:
    fdf [ -l <bytes> ] [ -u <bytes> ] [ -v ] [ -o <output_log> ]
        [ -g 'GLOB' ] [ -x ] [ -X ] [paths]...

If no options are passed, fdf will scan the current working
directory for files of all sizes, results outputted to
*STDOUT.

Options:
    -l  Lower limit of files size to scan, in bytes
    -u  Upper limit of files size to scan, in bytes
    -g  Pass a shell style GLOB to limit the search
    -v  Verbose mode (sent to *STDERR)
    -s  If directories to be scanned are mounted via Samba from CIFS/SMB
    -o  Output file to log results to
    -x  Delete duplicate files and create hard links
    -X  Force deletion (Think very hard before using this...)
    -h  This usage note

Complex Example

$ fdf -l 1024 -u 8192 -v -g 'copy*' -o fdf.log /tmp /var/tmp
Find Duplicate Files v0.4 (verbose mode)
         Search GLOB: copy*
   Minimum file size: 1024
   Maximum file size: 8192
     Output Log File: fdf.log
     Sample Window  : 8192
Finding all files in: /tmp
Finding all files in: /var/tmp

This may take a while...

Pass 1 complete. Possibility of 10240 bytes of duplication.
Now calculating checksums. This may take a little while longer...

Pass 2 complete. 4096 bytes of duplicates found.
Results logged to fdf.log

Downloads

Files and SHA1 sums:

Beta/development snapshot. May not even work!

Note: I'm working through a number of submitted patches and am in the process of moving house, so while I am aware of bugs and possible improvements, it's just not practical to do anything about them at the moment. April 2010.

Requirements

  • Perl 5.8 or later (may work on earlier versions, but this is untested)
  • Text::Glob
  • Digest::SHA (in the 5.10 core)
  • File::Temp (version 0.20 or later)