Sarah Happy: removing duplicate files

The challenge: remove from one directory any file that also exists in another directory, matching on file content only, since the file names cannot be used.

my solution:

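first, make a list of the hashes of the files in each directory. run this once inside each directory, writing to a different list file each time (list1.txt and list2.txt in the steps below)

find . -type f -print0 | xargs -0 shasum -a 256 > ~/tmp/list1.txt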

edit the file lists if necessary.

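take a set intersection of the hashes (the remaining list commands are run in ~/tmp, where the lists were written)

cut -f 1 -d ' ' list1.txt | sort -u > hash1.txt
cut -f 1 -d ' ' list2.txt | sort -u > hash2.txt
comm -12 hash1.txt hash2.txt > clean.txt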

the result is a list of just the hashes common to both lists.
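
the intersection can also be sketched as one function that takes the two lists directly. common_hashes is a made-up name, and the sketch assumes each list line starts with the hash followed by a space, as shasum prints it.

```shell
# common_hashes LIST1 LIST2: print the hashes that appear in both lists.
# common_hashes is a hypothetical helper name.
common_hashes() {
    ha=$(mktemp)
    hb=$(mktemp)
    # keep only the hash column; comm needs sorted input
    cut -f 1 -d ' ' "$1" | sort -u > "$ha"
    cut -f 1 -d ' ' "$2" | sort -u > "$hb"
    # -12 suppresses lines unique to either file, leaving the common ones
    comm -12 "$ha" "$hb"
    rm -f "$ha" "$hb"
}
```

used like `common_hashes list1.txt list2.txt > clean.txt`.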

list all the files with the selected hashes, using a perl script

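#!/usr/bin/perl
use strict;
use warnings;

# usage: select.pl HASHES LIST
# prints each line of LIST whose leading hash appears in HASHES
if (@ARGV != 2) {
    die "usage: select hashes list\n";
}

# load the wanted hashes, one per line
my %hashes;
open(my $in, "<", $ARGV[0]) or die "open $ARGV[0]: $!";
while (<$in>) {
    chomp;
    $hashes{$_} = 1;
}
close($in);

# print the list lines whose first field is a wanted hash
open($in, "<", $ARGV[1]) or die "open $ARGV[1]: $!";
while (<$in>) {
    my ($hash) = m,^(\S+), or next;
    $hashes{$hash} or next;
    print $_;
}
close($in);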

invoked like

perl select.pl clean.txt list1.txt | cut -c 67- > remove.txt

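which results in a list of the file names from the first list whose contents also exist in the second list. cut -c 67- drops the 64-character hash and the two separator characters that shasum prints before each name.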
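
for comparison, the same selection can be sketched in awk instead of perl. this is a swapped-in alternative, not the script above; select_names is a made-up name, and the sketch assumes SHA-256 lines, where the file name starts at column 67.

```shell
# select_names HASHES LIST: print the file names from LIST whose hash is in HASHES.
# select_names is a hypothetical helper name.
select_names() {
    awk '
        NR == FNR { want[$1] = 1; next }        # first file: remember every hash
        ($1 in want) { print substr($0, 67) }   # second file: keep the name part
    ' "$1" "$2"
}
```

used like `select_names clean.txt list1.txt > remove.txt`.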

finally, remove the files. run this from the first directory, since the names in remove.txt are relative to it

while IFS= read -r x ; do rm -- "$x"; done < ~/tmp/remove.txt
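
since the removal cannot be undone, it may be worth previewing it first. a minimal sketch, where remove_listed and its --dry-run argument are made-up names for illustration:

```shell
# remove_listed LIST [--dry-run]: delete, or only print, each file named in LIST.
# remove_listed and --dry-run are hypothetical names.
remove_listed() {
    list=$1
    mode=${2:-}
    # IFS= and -r keep leading spaces and backslashes in names intact
    while IFS= read -r name; do
        if [ "$mode" = "--dry-run" ]; then
            printf 'would remove: %s\n' "$name"
        else
            rm -- "$name"
        fi
    done < "$list"
}
```

run `remove_listed ~/tmp/remove.txt --dry-run` from the first directory to check the list, then again without --dry-run to delete.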

Sarah Happy

Thursday, October 14, 2010
