I remember this being slow even when using a relational database, taking several minutes when the list and index were both in the database. If the list was not in the database it was painfully slow.
I have the indexes, sorted by key, one file per volume (768 files).
I have a list of keys to find in a file (24309 entries).
I try to do a binary search of each index file for each key and it takes about 27 seconds per file to do the search, and I have not wanted to wait to see how long the entire search takes (a quick estimate is 6 hours). I would expect that if the indexes were merged into one sorted file then it would still take 27 seconds to search that and get the whole answer I am looking for, but that is extra management to deal with. I am already caching reads in the binary search, they only improved the performance slightly when they were added.
To do a straight linear scan of all the files and pick out the lines I am looking for takes 177 seconds (almost 3 minutes).
Just reading all of the files with no processing
cat */* > /dev/nulltakes 63 seconds.
The size of the data set on disk is almost 3G.
excited:~/archive.d/index$ du -h 1.5G ./offsite 1.4G ./onsite 2.9G .
The total number of lines of text in the data set is 26283702.
I am not even sure if it is possible to do faster than the linear scan for performance without merging all of the files together, which I would rather not do to save the size of the volume set and name fields.
Post a Comment