Duplicate files and strategies for keeping drive space free
One folder I just checked had 8 copies of the same PDF. How does this happen?
It seems to work by file size alone.
I have had a couple of "false positives" where there were two different files with identical sizes, which was a bit of a surprise as I had imagined that this would be mathematically impossible. But then again my maths is at about the same level as George Osborne's, so I wouldn't trust it at all.
Matching by file size alone is not a particularly good idea, though it is a starting point. I'm not quite sure what data is readily available in Mac OS, but there is probably some descriptive data stored with each file, maybe including one or more hash values (e.g. MD5), which would be easy to use in comparisons.
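For what it's worth, here is a minimal sketch in Python (hashlib is a standard module; nothing here is Mac-specific, and the helper name md5_of_file is just made up for illustration) of computing an MD5 digest for a file, reading it in chunks so large files don't have to fit in memory:

import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    # Return the MD5 hex digest of a file, reading it in ~1 Mbyte chunks.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

Two files with different digests are definitely different; two files with the same digest are almost certainly (but not quite certainly) identical.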
Conceptually, all that is required is to examine matching characteristics for each possible group of replicates. File size might do as a first pass, but if the hash values then differ, the files can be treated as different. If groups of files are still not differentiated by several different hash values, they can be considered identical, though there remains a very low probability that they are not. To be absolutely sure, the remaining candidate files would have to be compared byte by byte to test for exact equality, though the likelihood of that being burdensome is low.

A bad case would be a drive holding a large number of identical files, e.g. 1000 identical 1 Mbyte files, since this forces the same bytes to be read over and over. Comparing two such files means reading 2 Mbytes and making about a million (2^20, if one wants to be fussy and use computer-style numbers) byte comparisons, none of which ever finds a difference, and this has to be repeated for every other possible pair, roughly 500,000 pairs in all. That comes to on the order of a terabyte of reads for only a gigabyte of actual data, i.e. reading each of those files about a thousand times, which would take a very long time indeed.

Normally one would not expect a large number of exact replica large files, and the algorithm would run much more quickly. Byte-wise equality tests also fail very quickly for files which happen to share a hash value but actually differ, since the first mismatching byte ends the comparison. If hash values are not already pre-computed and readily available, they would have to be calculated for all the files on the drive, which would itself require at least one complete scan of the drive.
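As an illustration of that staged approach (just a sketch in Python, not how any particular Mac utility actually works; the function name find_duplicates and its root argument are made up for the example), one can group files by size first, hash only the groups with more than one member, and then confirm surviving candidates with a byte-wise comparison:

import filecmp
import hashlib
import os
from collections import defaultdict

def _md5(path, chunk_size=1 << 20):
    # MD5 of one file, read in chunks to avoid loading it all into memory.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicates(root):
    # Stage 1: group every file under root by its size.
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                pass  # unreadable file, skip it
    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size can never be a duplicate
        # Stage 2: within a size group, group by hash.
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[_md5(path)].append(path)
        for candidates in by_hash.values():
            if len(candidates) < 2:
                continue
            # Stage 3: byte-wise comparison against the first candidate,
            # to rule out the (very unlikely) hash collision.
            first = candidates[0]
            group = [first] + [p for p in candidates[1:]
                               if filecmp.cmp(first, p, shallow=False)]
            if len(group) > 1:
                duplicates.append(group)
    return duplicates

Because most files drop out at the size stage, only a small fraction of the drive normally has to be hashed, and the byte-wise stage only touches files that already agree on both size and hash.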