dupemerge

Last updated February 13 2023, Version 1.104

Privacy Statement

The privacy statement can be found here

Quick Start

Download
Documentation
Frequently asked questions (FAQ)
History
Donations

Introduction

Most hard disks contain quite a lot of completely identical files, which consume a lot of disk space. This waste of space can be drastically reduced by using the NTFS file system hardlink functionality to link the identical files ("dupes") together.
Dupemerge searches for identical files on a logical drive and creates hardlinks among those files, thus saving lots of hard disk space.

Installation

Dupemerge.exe is a command line utility which runs from a command prompt window or from a batch or cmd script. It needs no formal setup: copy dupemerge.exe to a directory referenced by your PATH environment variable. %systemroot% (that is, c:\winnt or c:\windows) is a good place.

To remove it, delete the dupemerge.exe file from wherever you copied it.

Using dupemerge

dupemerge.exe is controlled by a few command line arguments. The most important ones are:

Specify path
More than one path can be specified to search for dupes:

dupemerge c:\data c:\test\42

The above command causes dupemerge to search c:\data and c:\test\42, including everything below them, for dupes. Dupes may be spread across the given directory trees: if, for example, the files c:\data\a.txt, c:\data\dd.txt and c:\test\42\new.bat are dupes, they get hardlinked together.

--include Include via wildcards
In certain situations only a few files below a path, e.g. .pdb, should be checked for dupes. To accomplish this, dupemerge can be run with filters that match only certain files:

dupemerge --include .dbg --include a.pdb c:\data

In the above example dupemerge only searches for files which match the expressions specified with --include. The --include option can be used more than once.

This option also accepts its arguments from a file:

dupemerge --include @List.txt c:\data

The above example refers to a file List.txt, which contains one matching pattern per line, e.g.

.dbg
*.sbr

--includedir Include directories via wildcards
To selectively run dupemerge on certain directories, the --includedir option can be used with wildcards:

dupemerge --includedir test c:\data

Arbitrary wildcard expressions can be used, because the wildcard expressions are translated into regular expressions. This means that e.g. src\\sub?older* is also a valid wildcard expression for --includedir.

The wildcard expression given to --includedir is applied to the whole path, which means that e.g.

dupemerge --includedir "fotos\\temp" --copy c:\data

includes all directories containing 'fotos\temp' and their subdirectories, for example 'fotos\tempur\myfotos', 'fotos\temp\myfotos' or 'fotos\tempomat\myfotos'. Please note that '\' has to be escaped as '\\'.

This option also accepts its arguments from a file:

dupemerge --includedir @List.txt c:\data

The above example refers to a file List.txt, which contains one matching pattern per line, e.g.

fotos\\temp
aDir

--exclude Exclude via wildcards
In certain situations all but a few files below a path, e.g. .pdb, should be checked for dupes. To accomplish this, dupemerge can be run with filters that exclude certain files:

dupemerge --exclude .dbg --exclude a.pdb c:\data

In the above example dupemerge searches all files except the ones matched by --exclude. The --exclude option can be used more than once.

This option also accepts its arguments from a file:

dupemerge --exclude @List.txt c:\data

The above example refers to a file List.txt, which contains one matching pattern per line, e.g.

.pdb
.sbr
Myfile.*

--excludedir Exclude directories via wildcards
In certain situations not all directories below a path should be checked for dupes. To accomplish this, dupemerge can be run so that it excludes certain directories:

dupemerge --excludedir DontWantIt --excludedir DisLike c:\data

In the above example dupemerge searches all files except the ones in directories excluded via --excludedir. The --excludedir option can be used more than once in one invocation.

Arbitrary wildcard expressions can be used, because the wildcard expressions are translated into regular expressions. This means that e.g. file.ext. is also a valid wildcard expression for --excludedir.

dupemerge --excludedir test c:\data

The wildcard expression given to --excludedir is applied to the whole path, which means that e.g.

dupemerge --excludedir "fotos\\temp" c:\data

excludes all directories containing 'fotos\temp' and their subdirectories, for example 'fotos\tempur\myfotos', 'fotos\temp\myfotos' or 'fotos\tempomat\myfotos'. Please note that '\' has to be escaped as '\\'.

This option also accepts its arguments from a file:

dupemerge --excludedir @List.txt c:\data

The above example refers to a file List.txt, which contains one matching pattern per line, e.g.

fotos\\temp
aDir

--regexp Use regular expressions
In certain situations only some kinds of files, e.g. all files containing only letters, should be checked for dupes. To accomplish this, dupemerge can be run with regular expression filters that match only certain files:

dupemerge --regexp "[a-z]" c:\data

In the above example dupemerge only searches for files which match the regular expressions specified with --regexp. The --regexp option can be used more than once.

This option also accepts its arguments from a file:

dupemerge --regexp @List.txt c:\data

The above example refers to a file List.txt, which contains one matching pattern per line, e.g.

[a-z]
[0-9]*

--list List only
To find out which files are dupes without hardlinking them, dupemerge can be run in list mode:

dupemerge --list c:\data c:\test\42

An extensive report is generated showing which files are dupes in c:\data and below and in c:\test\42 and below.

--minsize --maxsize Size dependent check
The size of the files to be compared can be controlled by two switches:

dupemerge --minsize 3000 --maxsize 500000 c:\data

In the above example dupemerge searches for files bigger than 3000 bytes and smaller than 500000 bytes below c:\data.

--sort Sort Order
The found dupegroups can be printed in random order, by cardinality, or by file size. This is controlled by the --sort switch, which takes either the filesize or the cardinality modifier. By default dupegroups are shown in random order.

dupemerge --sort cardinality c:\data

In the above example dupemerge searches for files below c:\data and prints the output so that the dupegroup with the most identical files is printed first, and the dupegroup with the fewest identical files is printed last.

dupemerge --sort filesize c:\data

In the above example dupemerge searches for files below c:\data and prints the output so that the dupegroup which contains the largest files is printed first, and the dupegroup with the smallest files is printed last.

--supportfs
Nowadays there are many filesystems from third party vendors which support hardlinks. To provide the dupemerge.exe functionality on those drives, the supported filesystems can be specified:

dupemerge --supportfs btrfs x:\location_to_be_deduped

Configuring your favourite filesystem to be recognized by dupemerge.exe is at your own risk. dupemerge.exe performs the same operations on the configured filesystems as it does on NTFS. So make sure your filesystem supports the same primitives as NTFS does, otherwise certain operations will fail.

Output

Dupemerge.exe returns its status at the end of its operation:

-f c:\backup\test\deleteme.dat
!*h c:\backup\test\cannothardlink.dat
!\f (0x00000005) c:\data\failed\AccessDenied.txt
!/f (0x00000005) c:\data\failed\MappingFailed.txt

DupeMerge logs each action it performs, one item per line, and prefixes each line with two characters. The first character denotes the operation which was performed, and the second character specifies the type of item which was processed. If an operation failed, a '!' is put in front, and the error code may follow in parentheses.

Operation Description

* Hardlink a file

- Remove an item from the target that is not present in the source. Used during Smart Mirror

? Enumerate an item.

~ Item has been excluded by command line arguments.

\ Opening a file.

/ Map file into address space.

= Move/Rename a file.

! An error happened.

' Informational Message

Item Description

f A File is processed.

h A Hardlink is processed.

s A Symbolic link file or Symbolic Link Directory.

j A Junction is processed.

d A Directory is processed.

Sample	Description

~d d:\source\mydir	The directory d:\source\mydir has been excluded intentionally, e.g. by --exclude.

~f d:\source\aFile	The file d:\source\aFile has been excluded intentionally, e.g. by --exclude.

!\f (0x00000005) d:\src\deny	Read access to the file 'deny' has been denied. This means the file is not part of the deduping process.

!/f (0x00000005) d:\src\deny	Could not map the file 'deny' into the address space to calculate a checksum. This means the file is not part of the deduping process.

!-f (0x00000005) d:\src\deny	Failed to delete a file because access has been denied.

!=f (0x00000005) d:\src\Dupe.txt	Failed to rename the file before hardlinking.

!*h (0x00000476) d:\s\Gt1023.txt	Dupemerge reached the OS limit of 1024 hardlinks per file.

!*h (0x000005b4) d:\s\changed.txt	The timestamp of the file has changed since it was enumerated.

!*h (0x00000585) d:\	The NTFS implementation of this drive is broken. It returns the same file-index for files with different file sizes. This is only a warning: DupeMerge continues but ignores the retrieved file-indices. It calculates all dupe-info on its own, which takes a bit longer.
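
For post-processing such a log, a line can be split into its parts with a short Python sketch like the one below. The exact grammar is an assumption derived only from the samples above (optional '!', one operation character, one item character, optional hexadecimal error code, then the path); the names LogEntry and parse_log_line are hypothetical, and lines that do not fit this shape simply yield None.

```python
import re
from typing import NamedTuple, Optional

# Operation and item characters as listed in the tables above.
OPERATIONS = {
    "*": "hardlink", "-": "remove", "?": "enumerate", "~": "excluded",
    "\\": "open", "/": "map", "=": "move/rename", "'": "info",
}
ITEM_TYPES = {
    "f": "file", "h": "hardlink", "s": "symbolic link",
    "j": "junction", "d": "directory",
}

# Assumed line layout: [!]<operation><item> [(0x........)] <path>
LINE_RE = re.compile(
    r"^(!?)([*\-?~\\/='])([fhsjd])\s+(?:\((0x[0-9a-fA-F]+)\)\s+)?(.*)$"
)

class LogEntry(NamedTuple):
    failed: bool
    operation: str
    item_type: str
    error_code: Optional[int]
    path: str

def parse_log_line(line):
    match = LINE_RE.match(line.strip())
    if match is None:
        return None  # not a line of the assumed shape
    bang, op, item, code, path = match.groups()
    return LogEntry(
        failed=bool(bang),
        operation=OPERATIONS[op],
        item_type=ITEM_TYPES[item],
        error_code=int(code, 16) if code else None,
        path=path,
    )
```

Filtering on the failed field then gives a quick list of all items that were left out of the deduping process.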

Backgrounders

Dupemerge creates a cryptographic hash for each file found below the given paths and compares those hashes to each other to find the dupes. File dates play no role in detecting dupes, only the size and content of the files.

To speed up comparison, only files with the same size get compared to each other. Furthermore, the hashes for equal sized files are calculated incrementally, which means that during the first pass only the first 4 kilobytes are hashed and compared, and during the following rounds more and more data is hashed and compared.
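
The idea of grouping by size first and then hashing incrementally can be sketched in Python as follows. This is an illustration under assumptions, not dupemerge's implementation: the hash function (SHA-256), the doubling of the chunk size per round, and the function names find_dupes and _refine are all chosen for this example; only the 4 kilobyte first pass is taken from the text above.

```python
import hashlib
import os
from collections import defaultdict

def _refine(candidates, size, first_chunk):
    # Every candidate keeps its own running hash, which is extended round by round.
    pending = [[(path, hashlib.sha256()) for path in candidates]]
    offset, chunk = 0, first_chunk
    while offset < size and pending:
        next_pending = []
        for group in pending:
            buckets = defaultdict(list)
            for path, running in group:
                with open(path, "rb") as handle:
                    handle.seek(offset)
                    running.update(handle.read(chunk))
                buckets[running.digest()].append((path, running))
            # Files whose hash is unique so far cannot be dupes and drop out.
            next_pending.extend(b for b in buckets.values() if len(b) > 1)
        pending = next_pending
        offset += chunk
        chunk *= 2  # assumed growth: hash more data in each later round
    return [[path for path, _ in group] for group in pending]

def find_dupes(paths, first_chunk=4096):
    # Only files of equal size are ever compared with each other.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)
    groups = []
    for size, candidates in by_size.items():
        if len(candidates) > 1:
            groups.extend(_refine(candidates, size, first_chunk))
    return groups
```

The benefit of the incremental rounds is that files which differ early are ruled out after reading only a few kilobytes, so large files are read completely only when they are still candidates.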

Because a run on large disks takes a long time, a file which has already been hashed might change before all dupes of that file are found. To prevent wrong hardlinks caused by such intermediate changes, dupemerge saves the write time of a file when it hashes the file and checks whether this time has changed before it hardlinks dupes.
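
A minimal Python sketch of this safeguard is shown below. It assumes that the modification time reported by os.stat is an adequate stand-in for the NTFS file write time; the function names record_write_time and still_unchanged are hypothetical.

```python
import os

def record_write_time(path):
    # Taken at hashing time; the nanosecond value avoids float rounding.
    return os.stat(path).st_mtime_ns

def still_unchanged(path, recorded_ns):
    # Checked right before hardlinking: any difference means the
    # previously computed hash may no longer describe the file.
    return os.stat(path).st_mtime_ns == recorded_ns
```

A file for which still_unchanged returns False would be skipped instead of hardlinked, which corresponds to the "timestamp of the file has changed" message in the output section.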

Multiple Runs
If dupemerge is run once, hardlinks among identical files are created. To save time during a second run on the same locations, dupemerge checks if a file is already a hardlink, and tries to find the other hardlinks by comparing the unique NTFS file-id. This saves a lot of time, because checksums for large files need not be created twice.
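
How existing hardlinks can be recognised without hashing may be sketched in Python like this. The sketch assumes that the pair (st_dev, st_ino) returned by os.stat identifies a file uniquely on a volume, as the NTFS file-id does; group_by_file_id is a hypothetical name.

```python
import os
from collections import defaultdict

def group_by_file_id(paths):
    # Paths that share volume and file id are hardlinks to the same data,
    # so one hash per group is sufficient.
    groups = defaultdict(list)
    for path in paths:
        info = os.stat(path)
        groups[(info.st_dev, info.st_ino)].append(path)
    return list(groups.values())
```

Only one representative of each group would then have to be hashed, which is where the time saving on a second run comes from.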

Transaction based Hardlinking
Before DupeMerge hardlinks files together, it renames the file to a temporary name, then creates the hardlink, and afterwards deletes the temporary file. This makes it possible to roll back the operation if, for example, the hardlinking fails.
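
The rename, link, delete sequence with its roll-back can be sketched in Python as follows. This is only an illustration of the described order of steps: the temporary suffix ".dupemerge-tmp" and the function name replace_with_hardlink are invented for this example.

```python
import os

def replace_with_hardlink(master, dupe):
    backup = dupe + ".dupemerge-tmp"  # hypothetical temporary name
    os.rename(dupe, backup)           # step 1: move the dupe out of the way
    try:
        os.link(master, dupe)         # step 2: create the hardlink
    except OSError:
        os.rename(backup, dupe)       # roll back: restore the original file
        raise
    os.remove(backup)                 # step 3: drop the now redundant copy
```

Because the original data is only deleted after the hardlink exists, a failure in the middle leaves the dupe exactly as it was.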

TimeStamp Handling
A tuple of hardlinks to one file always shares a single timestamp. This is by design of NTFS. Things can be confusing, however, because after hardlinking the common timestamp is only shown once a hardlink has been opened or accessed. So immediately after running dupemerge one may observe different timestamps within a tuple of hardlinks, but once such a hardlink has been opened, e.g. for reading, its timestamp changes to the timestamp of the whole tuple. This is also by design of NTFS.

Dupemerge has a dupe-find algorithm which is tuned to perform especially well on large server disks, where it has been tested in depth to guarantee data integrity.

Limitations

dupemerge.exe can only be used with NT4/W2K/WXP/W2K3/Windows7/W2K8/W10/W11
Dupes can only be merged within an NTFS volume, under NT4/W2K/WXP/W2K3/Windows7/W2K8/W10/W11
Dupes cannot be merged across NTFS volumes
Dupes can only be merged on *fixed* NTFS volumes
Dupes can only be merged on local NTFS volumes
There is an NTFS limit of not more than 1023 hardlinks to one file. Dupemerge knows about this and won't create more than 1023 hardlinks for one file

Frequently Asked questions

Hello, this may seem a basic question, but how do I know how much space dupemerge has saved by using hard links? If I have two identical directories A & B and run dupemerge.exe C:\A C:\B, I'd imagine that the resulting size of the two directories would be halved. However windows explorer still thinks the size on disk of the two directories combined is double rather than half. Can you not see the saved space via explorer?

A: You can't see the saved space via Explorer, because Explorer simply adds the size of files found below a given location. Explorer does not currently care when two files are hardlinked. It simply reports the size of each file's data, and then totals the sizes even though the total may include duplications. To see the saved space open a command prompt and type 'dir', run dupemerge, and once again run 'dir'. Or via Explorer: Open the drive and take a look at the drive properties, before and after running dupemerge.

One can use the Hard Link Shell Extension or the ln.exe --list command or the ln.exe --truesize command to see how many filenames are hard-linked to any file. That can provide another way to learn how much space is saved by using hard links.
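
The difference between the size Explorer adds up and the space actually used can also be estimated with a short Python sketch. It assumes that (st_dev, st_ino) from os.stat identifies hardlinked files, and it ignores cluster rounding; the function name apparent_and_actual_size is hypothetical.

```python
import os

def apparent_and_actual_size(root):
    apparent = 0   # what a naive per-file sum reports
    actual = 0     # each hardlinked file counted only once
    seen = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            info = os.stat(os.path.join(dirpath, name))
            apparent += info.st_size
            file_id = (info.st_dev, info.st_ino)
            if file_id not in seen:
                seen.add(file_id)
                actual += info.st_size
    return apparent, actual
```

The difference between the two numbers is an estimate of the space saved by hardlinks below the given directory.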

History

August 14 2021	Version 1.104 released. Added the --supportfs option. Recompiled with signed binary and new version number to make it to chocolatey. [Internal] but important change from VS2005 (sic) to VS2017. Basically everything compiled smoothly except for the heap, thus... [Internal] Removed the Rockall fast heap. This was necessary, but also a big performance gain. Memory allocation is 2 times faster, and memory deletion is 10 times faster. Memory allocation is crucial for the core of Dupemerge. [Internal] Dropped the Itanium configuration, since VS2017 does not support it anymore, and I am sure there is no Itanium hardware out in the wild anymore.
June 15 2017	Version 1.080 released. Added the excludedirectories option. Added the includedirectories option. The include/exclude/includedir/excludedir options can take their arguments from a file.
October 18 2014	Version 1.07.001 released. Certain Unicode command line arguments could drive the command line parser crazy.
August 23rd 2013	Version 1.07 released. Dead junctions to a different drive could lead to hardlinks not being detected during all operations. Very nasty, but no data loss was caused.
August 20th 2013	Version 1.06 released. Introduced a sanity check for broken NTFS implementations. When hardlinking a tuple of equal files, the tuple's common date is the date of the oldest file of the tuple. The statistics are printed after the detailed log and not before.
August 7th 2013	Version 1.05 released. Fixed a crash during cleanup at the very end when all was correctly done.
April 5th 2013	Version 1.04 released. Fixed a crash when files were larger than 16gb.
October 28th 2012	Version 1.03 released. The number of new dupegroups was reported non-deterministically when huge amounts of files were scanned, but was always calculated correctly. The --output option didn't redirect output properly. Improved summary statistics. An error message is printed if the 1024 hardlink limit per file is exceeded.
October 1st 2012	Version 1.0 released. In rare situations not all dupes of a group were merged into one group but into e.g. two groups. Fast file enumeration is now the default. Little tweaks here and there, but in general a long journey has ended and this version qualifies for a 1.0 release.
September 16th 2012	Version 0.9998 released. Fixed a bug which caused the message 'Could not map view of file' to show up on certain files, so that those files were not part of the deduping process. Fixed a problem where dupemerge did not find all dupes. Time during operation is printed in readable hh:mm:ss.mss. Dupemerge now uses the fast file enumerator which is also used in LSE and ln.exe. This speeds up file enumeration by a factor of 10+. Fixed a problem where files larger than 4gb were never detected as dupes. Files larger than 4gb sometimes might have caused the 'Could not map view of file' error message. In general file size is not a limiting factor anymore. DupeMerge now also handles files with the ReadOnly attribute set. Added the --output option. Added an error reporting capability. Added the --exclude option. Fixed the statistics output so that only new dupegroups are printed.
February 25th 2012	Version 0.9994 released. Dupemerge does not climb down junctions or symbolic link directories and ignores symbolic link files.
December 17th 2010	Version 0.9993 released. Dupemerge does not merge more than 1022 equal files, because there is an NTFS limit of not more than 1023 hardlinks to one file. This is a temporary solution. Let's see what can be done to improve the situation here. The regular expression machine used in dupemerge has been changed to tre-0.8.0, which means regexp patterns are no longer case sensitive, and the regexp machine works 100%.
June 29th 2008	Version 0.999 released. Dupemerge preserves the timestamps of original files when it merges the files via hardlinks. Itanium binaries are available.
October 21st 2007	Version 0.998 released. Dupemerge respects the NTFS limit of not more than 1023 hardlinks on one file. Binaries are now available for 32bit and 64bit, because the compiler for this tool changed to VS2005.
January 25th 2007	Version 0.997 released. There is an issue with dupemerge when it climbs down junctions/symbolic links. Until a proper fix to the main algorithm is out, dupemerge does not run down junctions/symbolic links, and it keeps files below junctions/symbolic links completely untouched.
November 3rd 2006	Version 0.995 released. Improved the recursive runner performance, which yields faster scanning time. Fixed a super rare merging bug, if dupemerge was run a second time on the same directory and file sizes and numbers were in a super rare constellation. Migrated to the hardlink baseservices components, which also drive LinkShellExt and ln.
March 18th 2006	Version 0.991 released. Fixed a critical merging bug, which occurred in very rare situations when dupemerge was run a second time on the same directory. Did tests on large amounts of data, and checked the output via a second program, to prove integrity. Added calculation of the savings resulting from running dupemerge, even with the -l switch.
February 17th 2006	Version 0.985 released. Added a check if given paths are on NTFS drives.
November 5th 2005	Version 0.98 released. Fixed the problems with bogus progress output.
January 10th 2005	Version 0.95 released

Status

Version 1.100 is the base for ongoing development, which will add support for multi-core machines, so that hash calculation is distributed across all available cores.

Acknowledgements

I wish to thank those who have contributed significantly to the development of dupemerge.

Open Issues

With Dupemerge 1.100 all known open issues have been worked off.

License

This program is provided as is. Please see license.txt
dupemerge.exe uses tre as the regular expression machine. See the BSD style tre license.
dupemerge.exe uses ultragetopt for command line parsing. See the ultragetopt license.

Contact / Donations

Send bug reports or feature requests to Hermann Schinagl.

Dupemerge.exe is and will remain freeware, but if it was really helpful for you and saved you a lot of time, please consider a donation via PayPal, by sending me a gift certificate, or by donating bitcoins:

bc1q4hvevwrmnwt7jg8vws0v8xajywhffl4gwca5av

Download

All Windows 32bit	Simply download and extract dupemerge.zip (174kB) to your favourite tools directory. All necessary runtime dlls should already be installed on your system, but if not, grab them from here.

All Windows 64bit	Simply download and extract dupemerge64.zip (219kB) to your favourite tools directory. All necessary runtime dlls should already be installed on your system, but if not, grab them from here.

Windows Itanium	The Itanium version is not supported anymore, but the last VS2005 based version 1.0.8.0 is kept for legacy use. Please make sure that the necessary runtime .dlls are installed on your system. The prerequisites package can be downloaded from Microsoft: vcredist_IA64.exe for VS2005 SP1, version 6195/June 2011 (6.3 Mb). Afterwards install dupemergeItanium.zip (347kB).