|
|
Last updated December 17th 2010, Version 0.9993
|
|
|
Introduction
|
Most hard disks contain quite a lot of completely identical files,
which consume a lot of disk space. This waste of space can be drastically
reduced by using the hardlink functionality of the NTFS file system: identical
files, aka dupes, are hardlinked together.
Dupemerge searches for identical files on a logical drive and creates hardlinks among those
files, thus saving lots of hard disk space.
|
|
Installation
|
Dupemerge.exe is a command line utility, so copy dupemerge.exe
to some directory referenced by your PATH environment variable.
%systemroot% is a good place, e.g. c:\winnt
|
|
|
Using dupemerge
|
dupemerge.exe is controlled by a few command line arguments.
The most important ones are described below.
|
|
|
|
Specify path
|
More than one path can be specified to search for dupes.
dupemerge c:\data c:\test\42
The above command causes dupemerge to search below
c:\data and c:\test\42 for dupes. Dupes may be
spread across the given subdirectory trees: e.g. if the files c:\data\a.txt,
c:\data\dd.txt and c:\test\42\new.bat are dupes, they get hardlinked
together.
|
|
|
Use wildcards
|
In certain situations only a few files below a path, e.g. *.pdb,
should be checked for dupes. To accomplish this, dupemerge can be run
with wildcard filters, so that only matching files are checked.
dupemerge --wildcard *.dbg --wildcard a*.pdb c:\data
In the above example dupemerge only searches for files which
match the expressions specified with --wildcard. The --wildcard option can be used
more than once.
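The filtering idea can be sketched in Python as follows. This is only an illustration, not dupemerge's own code: it assumes that several --wildcard options are combined with a logical OR and that matching ignores case, neither of which is stated above. The helper name matches_any_wildcard is made up for this sketch.

```python
import fnmatch

def matches_any_wildcard(filename, patterns):
    """Return True if filename matches at least one wildcard pattern.

    Assumption: patterns are OR-combined and compared case-insensitively.
    """
    name = filename.lower()
    return any(fnmatch.fnmatchcase(name, p.lower()) for p in patterns)

patterns = ["*.dbg", "a*.pdb"]
candidates = ["app.pdb", "core.dbg", "zlib.pdb", "notes.txt"]

# Only the files matching one of the patterns would be checked for dupes.
selected = [f for f in candidates if matches_any_wildcard(f, patterns)]
```

Here zlib.pdb is skipped because it does not start with "a", and notes.txt matches neither pattern.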
|
|
|
Use regular expressions
|
In certain situations only some kinds of files, e.g.
all files whose names contain only letters, should be checked for dupes. To accomplish this,
dupemerge can be run with regular expression filters, so that only matching
files are checked.
dupemerge --regexp "[a-z]*" c:\data
In the above example dupemerge only searches for files which
match the regular expressions specified with --regexp. The --regexp option can be
used more than once.
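A general regular expression caveat applies to patterns such as "[a-z]*": because * also allows zero repetitions, an unanchored search succeeds on every name. Whether dupemerge anchors the expression is not stated here, so the following Python sketch only shows the difference between the two interpretations using Python's own re module, not tre.

```python
import re

names = ["readme", "Notes", "file42", "a.txt"]

# Unanchored search: "[a-z]*" can match the empty string, so every name passes.
loose = [n for n in names if re.search(r"[a-z]*", n, re.IGNORECASE)]

# Anchored over the whole name: one or more letters and nothing else.
strict = [n for n in names if re.fullmatch(r"[a-z]+", n, re.IGNORECASE)]
```

If only all-letter names are wanted, an anchored expression such as "^[a-z]+$" is the safer choice in any regular expression dialect.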
|
|
|
List only
|
To find out which files are dupes without hardlinking
them, dupemerge can be run in list mode:
dupemerge --list c:\data c:\test\42
An extensive report is generated showing which files are dupes below
c:\data and c:\test\42.
|
|
|
Size dependent check
|
The size of the files which are compared can be controlled by two
switches:
dupemerge --minsize 3000 --maxsize 500000 c:\data
In the above example dupemerge searches
for files bigger than 3000 bytes and smaller than 500000 bytes below
c:\data.
|
|
|
Sort output
|
The dupegroups found can be printed in random order, by cardinality, or by file size. This
is controlled by the --sort switch, which takes either the filesize or the cardinality
modifier. The default behaviour is to show dupegroups in random order.
dupemerge --sort cardinality c:\data
In the above example dupemerge searches
for files below c:\data and prints the output so that the dupegroup with the most
identical files is printed first, and the dupegroup with the fewest members is printed
last.
dupemerge --sort filesize c:\data
In the above example dupemerge searches
for files below c:\data and prints the output so that the dupegroup which contains the largest
files is printed first, and the dupegroup with the smallest files is printed
last.
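The two sort orders can be pictured with a few lines of Python. The data layout (a dupegroup as a list of path and size pairs) and the sample file names are invented for this sketch and say nothing about dupemerge's internal structures.

```python
# Hypothetical dupegroups: each group lists (path, size_in_bytes) of identical files.
dupegroups = [
    [("a.txt", 100), ("b.txt", 100)],
    [("x.iso", 9000), ("y.iso", 9000)],
    [("1.log", 50), ("2.log", 50), ("3.log", 50)],
]

# "cardinality": the group with the most members comes first.
by_cardinality = sorted(dupegroups, key=len, reverse=True)

# "filesize": the group with the largest files comes first.
# All files in a group are identical, so the first member's size is enough.
by_filesize = sorted(dupegroups, key=lambda group: group[0][1], reverse=True)
```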
|
|
|
|
|
Backgrounders
|
Dupemerge creates a cryptographic hashsum for each file found below
the given paths
and compares those hashes to each other to find the dupes. There is no file date comparison
involved in detecting dupes, which might cause trouble.
To speed up comparison, only files of the same size get compared
to each other. Furthermore, the hashsums for equal-sized files
get calculated incrementally, which means that during the first
pass only the first 4 kilobytes are hashed and compared, and
during the next rounds more and more data is hashed and compared.
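The size grouping and incremental hashing described above might look roughly like the following Python sketch. It is not dupemerge's algorithm: the hash function (SHA-256) and the doubling of the chunk size per round are assumptions made for illustration, since the text only says that a cryptographic hash is used and that more and more data is hashed in later rounds.

```python
import hashlib
import os
from collections import defaultdict

def find_dupes(paths, first_chunk=4096):
    """Group identical files: by size first, then by incrementally grown hashes."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    result = []
    for size, group in by_size.items():
        if len(group) < 2:
            continue  # a unique size cannot have a dupe
        states = {p: hashlib.sha256() for p in group}  # running hash per file
        candidates = [group]
        hashed, chunk = 0, first_chunk
        while candidates and hashed < size:
            survivors = []
            for cand in candidates:
                buckets = defaultdict(list)
                for p in cand:
                    with open(p, "rb") as f:
                        f.seek(hashed)
                        states[p].update(f.read(chunk))
                    buckets[states[p].digest()].append(p)
                # Files whose hash so far is unique drop out early.
                survivors += [b for b in buckets.values() if len(b) > 1]
            candidates = survivors
            hashed += chunk
            chunk *= 2  # assumption: hash a larger piece in each round
        result += candidates
    return result
```

The benefit is that files which differ early are discarded after reading only a few kilobytes, so large files are read completely only when they are still dupe candidates.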
Due to the long runtime on large disks, some files which have already been
hashed might change before all dupes of that file are found.
To prevent false hardlink creation due to such intermediate changes,
dupemerge saves the write time of a file when it hashes
the file, and checks whether this time has changed when it tries to
hardlink dupes.
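A minimal Python sketch of this safeguard is shown below. The function names and the temporary file suffix are hypothetical, and replacing the dupe via a temporary hardlink is just one possible way to do the swap, not necessarily what dupemerge does.

```python
import os

def hash_time_stamp(path):
    """Write time (in nanoseconds) to remember at the moment a file is hashed."""
    return os.stat(path).st_mtime_ns

def link_if_unchanged(keep, dupe, keep_stamp, dupe_stamp):
    """Replace `dupe` by a hardlink to `keep`, unless either file was
    written to since it was hashed. Returns True if the link was made."""
    if hash_time_stamp(keep) != keep_stamp or hash_time_stamp(dupe) != dupe_stamp:
        return False  # a file changed after hashing: its hash is stale
    if os.path.samefile(keep, dupe):
        return False  # already hardlinked together, nothing to do
    tmp = dupe + ".tmp-link"  # hypothetical temporary name
    os.link(keep, tmp)        # create the new hardlink first
    os.replace(tmp, dupe)     # then swap it in place of the dupe
    return True
```

The time stamps are taken while hashing and passed in again at link time, so a file modified in between is simply skipped.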
When dupemerge is run once, hardlinks among identical files
are created. To save time during a second run on the same
locations, dupemerge checks if a file is already a hardlink, and
tries to find the other hardlinks by comparing the unique
NTFS file-id. This saves a lot of time, especially because
checksums for large files need not be created twice.
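The file-id shortcut can be illustrated in Python, where os.stat exposes the file index as st_ino and the volume as st_dev (on Windows these correspond to the NTFS file index and the volume serial number). This is a sketch of the idea, not dupemerge's implementation; the function name is made up.

```python
import os
from collections import defaultdict

def group_existing_hardlinks(paths):
    """Group paths that already refer to the same file (same volume, same file id)."""
    by_id = defaultdict(list)
    for p in paths:
        st = os.stat(p)
        by_id[(st.st_dev, st.st_ino)].append(p)
    # Only groups with two or more names are hardlinked siblings.
    return [group for group in by_id.values() if len(group) > 1]
```

Paths that end up in the same group are known to be identical without reading any file content, so only one of them needs to be hashed.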
Dupemerge has a dupe find algorithm which is tuned
to perform especially well on large server disks, where it has been
tested in depth to guarantee data integrity.
|
|
|
Limitations
|
- dupemerge.exe can only be used with NT4/W2K/WXP/W2K3
- Dupes can only be merged within an NTFS volume, under NT4/W2K/WXP/W2K3
- Dupes cannot be merged across NTFS volumes
- Dupes can only be merged on *fixed* NTFS volumes
- Dupes can only be merged on local NTFS volumes
- There is an NTFS limit of no more than 1023 hardlinks to one file. Dupemerge knows about this and refuses to create more than 1023 hardlinks for one file
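The hardlink limit in the last item can be checked before linking, as in the following Python sketch. It relies on the link count that os.stat reports as st_nlink; the constant and function names are made up for illustration, and the way dupemerge itself enforces the limit is not described here.

```python
import os

MAX_NTFS_LINKS = 1023  # limit as stated in the list above

def can_add_hardlink(path, limit=MAX_NTFS_LINKS):
    """True if the file behind `path` has fewer than `limit` names so far."""
    return os.stat(path).st_nlink < limit
```

A caller would test can_add_hardlink(target) before each link and start a new group of links once the limit is reached.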
|
|
|
Frequently Asked Questions
|
Q: Hello, this may seem a basic question, but how do I know how much space dupemerge has saved by using hard links?
If I have two identical directories A & B and run dupemerge.exe C:\A C:\B, I'd imagine that the resulting size of the two directories would be halved.
However, Windows Explorer still thinks the size on disk of the two directories combined is double rather than half.
Can you not see the saved space via Explorer?
A: You can't see the saved space via Explorer, because
Explorer simply adds up the sizes of the files found
below a given location. Because hardlinks are completely
transparent, Explorer does not know that a file it sums up is a hardlink
to data it has already counted, so it counts it as a separate file.
To see the saved space, open a command prompt and
type 'dir', run dupemerge, and run 'dir' once again; then compare the free space reported.
Or via Explorer: open the drive and take a look at the
drive properties before and after running dupemerge.
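The difference between the naive sum and the space really occupied can be reproduced with a short Python sketch. It counts every file once per name (the naive sum) and once per file id (the actual data); the function name is invented, and cluster rounding is ignored.

```python
import os

def summed_and_actual_size(paths):
    """Return (naive sum of all file sizes, size counting each file id once)."""
    summed, actual, seen = 0, 0, set()
    for p in paths:
        st = os.stat(p)
        summed += st.st_size              # one count per name
        file_id = (st.st_dev, st.st_ino)  # identifies the underlying file
        if file_id not in seen:
            seen.add(file_id)
            actual += st.st_size          # one count per underlying file
    return summed, actual
```

The gap between the two numbers is the space saved by hardlinks among the given paths.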
|
|
|
History
|
|
December 17th 2010
|
Version 0.9993 released.
- Dupemerge does not merge more than 1022 equal files, because there is an
NTFS limit of no more than 1023 hardlinks to one file. This is a temporary
solution. Let's see what can be done to improve the situation here.
- The regular expression machine used in dupemerge has been changed to tre-0.8.0,
which means regexp patterns are no longer case sensitive, and the regexp machine
now fully works.
|
|
June 29th 2008
|
Version 0.999 released.
Dupemerge preserves the timestamps of original files when it merges the files via hardlinks.
Itanium binaries are available.
|
|
October 21st 2007
|
Version 0.998 released.
Dupemerge respects the NTFS limit of no more than 1023 hardlinks to one file.
Binaries are now available for 32bit and 64bit, because the compiler for this tool changed to VS2005.
|
|
January 25th 2007
|
Version 0.997 released.
There is an issue with dupemerge when it climbs down junctions. Until a proper
fix to the main algorithm is out, dupemerge is restricted from descending into junctions, and
it keeps files below junctions completely untouched.
When dupemerge was called with the same path more than once, e.g. dupemerge.exe c:\1 c:\1,
the files in that directory got accidentally deleted.
|
|
November 3rd 2006
|
Version 0.995 released.
Improved the recursive runner performance, which yields faster scanning time.
Fixed a very rare merging bug, which occurred if dupemerge was run a second time on the same directory
and file sizes and numbers were in a very rare constellation.
Migrated to the hardlink baseservices components, which also drive
LinkShellExt and ln.
|
|
March 18th 2006
|
Version 0.991 released.
Fixed a critical merging bug, which occurred in very rare situations when
dupemerge was run a second time on the same directory. Did tests on large
amounts of data, and checked the output via a second program, to prove integrity.
Added calculation of the savings resulting from running dupemerge, even with the -l switch.
|
|
February 17th 2006
|
Version 0.985 released.
Added a check if given paths are on NTFS drives.
|
|
November 5th 2005
|
Version 0.98 released.
Fixed the problems with bogus progress output.
|
|
January 10th 2005
|
Version 0.95 released.
|
|
|
Status
The 0.9991 version
is stable enough to satisfy most needs. A bugfixing release
is scheduled for February 2011, which should contain a fix for the junction
problem and a lot more.
|
|
|
Acknowledgements
|
I wish to thank those
who have contributed significantly to the development of
dupemerge.
|
|
|
Open Issues
|
- There is an issue with junctions: if a file is found twice or more via a junction, it accidentally gets deleted.
- The number of dupegroups gets counted slightly incorrectly.
- There is a problem with Cyrillic characters in path names, which
causes dupemerge to output wrong dupegroups, but behind the scenes
dupegroups get created correctly.
|
|
|
License
|
This program is provided as is. Please see license.txt.
Dupemerge uses tre as its regular expression machine. See the BSD
style tre license.
|
|
|
Contact / Donations
|
Send bug reports or feature requests to
Hermann Schinagl.
Dupemerge.exe is and will be freeware, but if Dupemerge.exe was really
helpful for you and saved lots of your time, please
think of a donation, either via PayPal
or by sending me a gift certificate.
|
|
|
|