Sunday, April 18, 2010

large set calculations on the command line

I was working on Happy Archive a little bit, burning some backup disks.

To figure out what to put on a disk, I need to know which keys are missing from a particular volume set, and which are missing from every volume set.

There are two volume sets, offsite and onsite.

I have a list of keys in the store (store.lst), and a tab-separated table of which keys from the store exist on which volumes (volume.grid, with columns volume-set, volume, file-name, key).

First I make a list of what keys are in each volume set already:
grep $'^onsite\t' volume.grid | cut -d $'\t' -f 4 | sort > on-onsite.lst
grep $'^offsite\t' volume.grid | cut -d $'\t' -f 4 | sort > on-offsite.lst
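Then I take those sets away from the store list (outstanding = store - exist). comm needs both inputs sorted, so store.lst has to be sorted as well; the on-*.lst files were already sorted in the previous step, so they do not need sorting again.
comm -13 on-onsite.lst store.lst > outstanding-onsite.lst
comm -13 on-offsite.lst store.lst > outstanding-offsite.lst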
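To make the comm flags concrete, here is a small sketch of the same subtraction on made-up data. The keys k1 to k4 and the temporary directory are assumptions for illustration only, not my real store.

```shell
# Tiny worked example of outstanding = store - exist, on invented keys.
work=$(mktemp -d)
printf '%s\n' k1 k2 k3 k4 > "$work/store.lst"     # every key in the store, sorted
printf '%s\n' k2 k4 > "$work/on-onsite.lst"       # keys already on the volume set, sorted

# comm prints three columns: only in file 1, only in file 2, in both.
# -1 and -3 suppress the first and third columns, leaving the lines that
# appear only in the second file, i.e. store minus exist.
comm -13 "$work/on-onsite.lst" "$work/store.lst" > "$work/outstanding-onsite.lst"

cat "$work/outstanding-onsite.lst"   # prints k1 and k3, one per line
```

If sort and comm disagree about ordering (this can happen with locale-dependent collation), running both with LC_ALL=C keeps them consistent.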
Now I take the intersection of those sets to find the keys that are on neither volume set, and the differences to find the keys that are missing from only one of the volume sets.
comm -12 outstanding-onsite.lst outstanding-offsite.lst > outstanding-common.lst
comm -23 outstanding-onsite.lst outstanding-offsite.lst > outstanding-just-onsite.lst
comm -13 outstanding-onsite.lst outstanding-offsite.lst > outstanding-just-offsite.lst
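Finally, I build each list in order of preference for burning, putting the keys that are not on any backup disk yet first. (Overwriting the outstanding-*.lst files is safe here because the comm steps that read them have already finished.)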
cat outstanding-common.lst outstanding-just-onsite.lst > outstanding-onsite.lst
cat outstanding-common.lst outstanding-just-offsite.lst > outstanding-offsite.lst
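
For reference, the whole sequence can be sketched as one script. This is a minimal sketch on invented sample data: the keys k1 to k5, the volume and file names, and the burn-*.lst output names are all assumptions for illustration. It differs from the commands above in two small ways that are my own choices here: it uses sort -u in case a key appears on more than one volume in a set, and it writes the final lists to separate burn-*.lst files instead of overwriting the outstanding-*.lst files.

```shell
# Sketch of the whole pipeline on invented data (keys, volumes and the
# burn-*.lst names are hypothetical).
export LC_ALL=C                      # same byte-wise ordering for sort and comm
work=$(mktemp -d) && cd "$work"
tab=$(printf '\t')

# Sample inputs: a sorted store list and a tab-separated volume grid.
printf '%s\n' k1 k2 k3 k4 k5 > store.lst
{
  printf 'onsite\tvol1\ta.txt\tk2\n'
  printf 'onsite\tvol1\tb.txt\tk3\n'
  printf 'offsite\tvol9\tb.txt\tk3\n'
  printf 'offsite\tvol9\td.txt\tk4\n'
} > volume.grid

for vs in onsite offsite; do
  # Keys already on this volume set (cut splits on tabs by default).
  grep "^$vs$tab" volume.grid | cut -f 4 | sort -u > "on-$vs.lst"
  # outstanding = store - exist
  comm -13 "on-$vs.lst" store.lst > "outstanding-$vs.lst"
done

# Keys on neither set, and keys missing from only one set.
comm -12 outstanding-onsite.lst outstanding-offsite.lst > outstanding-common.lst
comm -23 outstanding-onsite.lst outstanding-offsite.lst > outstanding-just-onsite.lst
comm -13 outstanding-onsite.lst outstanding-offsite.lst > outstanding-just-offsite.lst

# Burn order: keys with no backup at all first, then the rest.
cat outstanding-common.lst outstanding-just-onsite.lst > burn-onsite.lst
cat outstanding-common.lst outstanding-just-offsite.lst > burn-offsite.lst

cat burn-onsite.lst     # k1, k5, k4 (one per line)
cat burn-offsite.lst    # k1, k5, k2 (one per line)
```

In this sample, k1 and k5 are on neither volume set, so they head both lists; k4 is only missing onsite and k2 is only missing offsite.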

I am going to embed this in a Java program, but in a pinch this kind of data processing is not hard to do on the command line.

There were 141k keys in the store list and 80k keys in the resulting outstanding lists, and the commands did not take long to run.
