Duplicate files and strategies for keeping drive space free
One folder I just checked had 8 copies of the same PDF. How does this happen?
It seems to work by file size alone.
I have had a couple of "false positives" where there were two different files with identical sizes, which was a bit of a surprise as I had imagined that this would be mathematically impossible. But then again my maths is at about the same level as George Osborne's, so I wouldn't trust it at all.
Matching by file size alone is not a particularly good idea, though it is a starting point. I'm not quite sure what data is readily available in Mac OS, but there is probably some descriptive data stored with each file, maybe including one or more hash values (e.g. MD5), which would be easy to use in comparisons.
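For what it's worth, here is a minimal sketch in Python (hashlib is a standard module; nothing here is Mac-specific, and the helper name md5_of_file is just made up for illustration) of computing an MD5 digest for a file, reading it in chunks so large files don't have to fit in memory:

import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    # Return the MD5 hex digest of a file, reading it in ~1 Mbyte chunks.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

Two files with different digests are definitely different; two files with the same digest are almost certainly (but not quite certainly) identical.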
Conceptually, all that is required is to examine matching characteristics for each possible group of replicates. File size might do as a first pass, but if the hash values then differ, the files can be treated as different. If groups of files are still not differentiated by several different hash values, they can be considered identical, though there remains a very low probability that they are not. To be absolutely sure, the remaining candidate files would have to be compared byte by byte to test for exact equality, though the likelihood of that being burdensome is low.

A bad case would be a drive holding a large number of identical files, e.g. 1000 identical 1 Mbyte files, since this forces the same bytes to be read over and over. Comparing two such files means reading 2 Mbytes and making about a million (2^20, if one wants to be fussy and use computer-style numbers) byte comparisons, none of which ever finds a difference, and this has to be repeated for every other possible pair, roughly 500,000 pairs in all. That comes to on the order of a terabyte of reads for only a gigabyte of actual data, i.e. reading each of those files about a thousand times, which would take a very long time indeed.

Normally one would not expect a large number of exact replica large files, and the algorithm would run much more quickly. Byte-wise equality tests also fail very quickly for files which happen to share a hash value but actually differ, since the first mismatching byte ends the comparison. If hash values are not already pre-computed and readily available, they would have to be calculated for all the files on the drive, which would itself require at least one complete scan of the drive.
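As an illustration of that staged approach (just a sketch in Python, not how any particular Mac utility actually works; the function name find_duplicates and its root argument are made up for the example), one can group files by size first, hash only the groups with more than one member, and then confirm surviving candidates with a byte-wise comparison:

import filecmp
import hashlib
import os
from collections import defaultdict

def _md5(path, chunk_size=1 << 20):
    # MD5 of one file, read in chunks to avoid loading it all into memory.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicates(root):
    # Stage 1: group every file under root by its size.
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                pass  # unreadable file, skip it
    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size can never be a duplicate
        # Stage 2: within a size group, group by hash.
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[_md5(path)].append(path)
        for candidates in by_hash.values():
            if len(candidates) < 2:
                continue
            # Stage 3: byte-wise comparison against the first candidate,
            # to rule out the (very unlikely) hash collision.
            first = candidates[0]
            group = [first] + [p for p in candidates[1:]
                               if filecmp.cmp(first, p, shallow=False)]
            if len(group) > 1:
                duplicates.append(group)
    return duplicates

Because most files drop out at the size stage, only a small fraction of the drive normally has to be hashed, and the byte-wise stage only touches files that already agree on both size and hash.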