Saturday, April 24, 2010

binary search of a sorted text file

This is my second time implementing a binary search of a sorted text file, the previous time was many years ago using PERL, this time it is in Java.

Here is the code to do a binary search of a sorted text file.

/**
 * Find the position of the start of the first line in the file that is
 * greater than or equal to the target line, using a binary search.
 * 
 * @param file
 *            the file to search.
 * @param target
 *            the target to find.
 * @return the position of the first line that is greater than or equal to
 *         the target line.
 * @throws IOException
 */
public static long search(RandomAccessFile file, String target)
        throws IOException {
    /*
     * because we read the second line after each seek there is no way the
     * binary search will find the first line, so check it first.
     */
    file.seek(0);
    String line = file.readLine();
    if (line == null || line.compareTo(target) >= 0) {
        /*
         * the start is greater than or equal to the target, so it is what
         * we are looking for.
         */
        return 0;
    }

    /*
     * set up the binary search.
     */
    long beg = 0;
    long end = file.length();
    while (beg <= end) {
        /*
         * find the mid point.
         */
        long mid = beg + (end - beg) / 2;
        file.seek(mid);
        file.readLine();
        line = file.readLine();

        if (line == null || line.compareTo(target) >= 0) {
            /*
             * what we found is greater than or equal to the target, so look
             * before it.
             */
            end = mid - 1;
        } else {
            /*
             * otherwise, look after it.
             */
            beg = mid + 1;
        }
    }

    /*
     * The search falls through when the range is narrowed to nothing.
     */
    file.seek(beg);
    file.readLine();
    return file.getFilePointer();
}

I tested this on a 33MB text file, it took about 15ms to find a line where grep took 600ms. I will be trying to make this faster using caching, and the result will likely be posted later.

8 comments:

  1. I can't see how this works properly - the binary search is looking at byte offsets, which means you could start searching int the middle of a line. You need to also find the beginning of the line before calling readLine() also so you are comparing the entire line.

    ReplyDelete
    Replies
    1. You may notice that during the search there are two readLine() calls, one to advance to the beginning of a line, the other to read it. It does loose some efficiency points for reading the same line over and over again when the search is getting tight. What is returns is the offset after skipping the partial line it finished on.

      Delete
  2. Here is a twist. What if the file contains duplicate lines of the search string?

    ReplyDelete
    Replies
    1. You should end up pointing at the first one in the sequence of duplicated lines.

      Delete
  3. thanks for this. fyi, the double readLine is complex and dangerous, and not working everytime (I had issues where I got back the line right after the one I was looking for) so you better go back to the previous line break after a seek with something like this :
    while (mid >= 0) {
    raf.seek(mid--);
    if (mid <= 0 || (char) raf.readByte() == '\n') {
    break;
    }
    }

    ReplyDelete
    Replies
    1. If you found a case where this fails I would love to see it. At one point I had considered seeking backwards to find the start of the line, but it would have involved a lot of backwards reading which can be much more expensive than forward reading, so I avoided it. I did the search from the perspective of the first full line after the pointer, which would never find the first line, so I had to treat that one specially.

      Delete
  4. I come up with similar solution. However I try to find a way to perform something similar in binary file, but it not seem possible at least for the moment

    ReplyDelete
  5. This comment has been removed by a blog administrator.

    ReplyDelete