Thursday, October 14, 2010

removing duplicate files

The challenge: Remove from one directory any file that also exists in another directory, where the names of the files cannot be used to match them, only their contents.

my solution:

first, make a list of the hashes of the files in each directory. run this once in each directory, writing list1.txt for the first and list2.txt for the second:
find . -type f -print0 | xargs -0 shasum -a 256 > ~/tmp/list1.txt
edit the file lists if necessary.
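
for reference, each line of a list is the 64-character hex digest, a two-character separator, then the path. a quick check with a throwaway file (demo.txt is just an example name) shows the layout:
echo hello > demo.txt
shasum -a 256 demo.txt
which should print the digest, two spaces, then the name:
5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03  demo.txt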

take a set intersection of the hashes. first pull out and sort the hash column of each list:
cut -f 1 -d ' ' ~/tmp/list1.txt | sort > ~/tmp/hash1.txt
cut -f 1 -d ' ' ~/tmp/list2.txt | sort > ~/tmp/hash2.txt
comm -12 ~/tmp/hash1.txt ~/tmp/hash2.txt > ~/tmp/clean.txt
the result is a list of just the hashes common to both lists. (comm requires sorted input, which is why the sort is there.)
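
a tiny worked example of the intersection, with short fake hashes standing in for real digests:
printf 'aaa\nbbb\nccc\n' > h1.txt
printf 'bbb\nddd\n' > h2.txt
comm -12 h1.txt h2.txt
prints just bbb, the only hash present in both files.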

list all the files with the selected hashes, using a Perl script:
#!/usr/bin/perl -w
use strict;

if (@ARGV != 2) {
    die "use: select hashes list\n";
}

# read the hashes to keep, one per line
my %hashes;
open(my $hashes_fh, "<", $ARGV[0]) or die "open $ARGV[0]: $!";
while (<$hashes_fh>) {
    chomp;
    $hashes{$_} = 1;
}
close($hashes_fh);

# pass through every line of the list whose leading hash is in the set
open(my $list_fh, "<", $ARGV[1]) or die "open $ARGV[1]: $!";
while (<$list_fh>) {
    my ($hash) = m,^(\S+), or next;
    $hashes{$hash} or next;
    print $_;
}
close($list_fh);

invoked like
perl select.pl ~/tmp/clean.txt ~/tmp/list1.txt | cut -c 67- > ~/tmp/remove.txt
which results in the names of the files in the first list whose contents also exist in the second. the cut -c 67- strips the hash column: a SHA-256 digest is 64 hex characters and shasum prints a two-character separator after it, so the file name starts at column 67.
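
for what it's worth, grep can do roughly the same filtering without the script, using its fixed-string pattern-file mode. the caveat is that it matches a hash anywhere in the line, not just at the start, so a digest that happened to appear inside a file name would cause a false match:
grep -F -f ~/tmp/clean.txt ~/tmp/list1.txt | cut -c 67- > ~/tmp/remove.txt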

finally, remove the files
while IFS= read -r x ; do rm -- "$x"; done < ~/tmp/remove.txt
(IFS= and -r keep read from trimming leading whitespace or eating backslashes in file names.)
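
it's worth previewing what will be deleted before running that. the same loop with echo in front is a harmless dry run:
while IFS= read -r x ; do echo rm -- "$x"; done < ~/tmp/remove.txt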
