Monday, April 26, 2010

searching for many lines in a text file

I spent the day working on my archive system project, trying to improve the speed of searching the indexes for batches of keys.

I remember this being slow even when using a relational database: it took several minutes when the key list and the index were both in the database, and it was painfully slow if the list was not in the database.

I have the indexes, sorted by key, one file per volume (768 files).

I have a list of keys to look for, stored in a file (24,309 entries).

I tried doing a binary search of each index file for each key, and it takes about 27 seconds per file, so I have not wanted to wait to see how long the entire search takes (a quick estimate is 6 hours for all 768 files). I would expect that if the indexes were merged into one sorted file it would still take about 27 seconds to search that and get the whole answer I am looking for, but that is extra management to deal with. I am already caching reads in the binary search; the caching only improved performance slightly when it was added.
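
For reference, here is a minimal sketch of that per-key lookup, without the read caching. It assumes an index line starts with the key, followed by a space and the rest of the entry, and that the files are sorted in plain byte order; the real index format may differ.

import os

def lookup(index_path, key):
    # Binary search one sorted index file for a single key.
    # Assumes each line starts with the key followed by a space, and that
    # the file is sorted in byte order (as with LC_ALL=C sort).
    key = key.encode()
    with open(index_path, "rb") as f:
        lo, hi = 0, os.fstat(f.fileno()).st_size
        # Narrow the byte window [lo, hi); lo always points at a line start.
        while hi - lo > 4096:
            mid = (lo + hi) // 2
            f.seek(mid)
            f.readline()                  # discard the partial line we landed in
            pos = f.tell()                # start of the next whole line
            line = f.readline()
            if not line or pos >= hi:
                hi = mid + 1              # no whole line starts between mid and hi
                continue
            probe = line.split(b" ", 1)[0]
            if probe == key:
                return line.decode().rstrip("\n")
            if probe < key:
                lo = pos + len(line)      # the key, if present, is after this line
            else:
                hi = pos                  # the key, if present, is before this line
        # Finish the small remaining window with a linear scan.
        f.seek(lo)
        while f.tell() < hi:
            line = f.readline()
            if not line:
                break
            if line.split(b" ", 1)[0] == key:
                return line.decode().rstrip("\n")
    return None

Each lookup does on the order of log2 of the file's line count in scattered reads, so 24,309 lookups per file add up quickly even with cached reads.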

To do a straight linear scan of all the files and pick out the lines I am looking for takes 177 seconds (almost 3 minutes).
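
A sketch of that kind of scan, under the same assumed line format, where the key list is loaded into a set so the per-line membership test is cheap:

import glob

def scan_all(index_glob, keys_path):
    # Read every index file once and keep the lines whose leading key
    # field appears in the key list.  The paths and the space-separated
    # line format are assumptions, not the real index layout.
    with open(keys_path) as f:
        wanted = set(line.strip() for line in f if line.strip())
    hits = []
    for path in sorted(glob.glob(index_glob)):
        with open(path) as f:
            for line in f:
                if line.split(" ", 1)[0] in wanted:
                    hits.append((path, line.rstrip("\n")))
    return hits

# e.g. results = scan_all("*/*", "keys.txt"), run from the index directory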

Just reading all of the files with no processing:
cat */* > /dev/null
takes 63 seconds.

The size of the data set on disk is almost 3G.
excited:~/archive.d/index$ du -h
1.5G ./offsite
1.4G ./onsite
2.9G .

The total number of lines of text in the data set is 26,283,702.

I am not even sure if it is possible to beat the linear scan without merging all of the files together, which I would rather not do because keeping them separate saves storing the volume set and name fields in every entry.
