Saturday, April 24, 2010

binary search of a sorted text file

This is my second time implementing a binary search of a sorted text file, the previous time was many years ago using PERL, this time it is in Java.

Here is the code to do a binary search of a sorted text file.

/**
 * Find the position of the start of the first line in the file that is
 * greater than or equal to the target line, using a binary search.
 * 
 * @param file
 *            the file to search.
 * @param target
 *            the target to find.
 * @return the position of the first line that is greater than or equal to
 *         the target line.
 * @throws IOException
 */
public static long search(RandomAccessFile file, String target)
        throws IOException {
    /*
     * because we read the second line after each seek there is no way the
     * binary search will find the first line, so check it first.
     */
    file.seek(0);
    String line = file.readLine();
    if (line == null || line.compareTo(target) >= 0) {
        /*
         * the start is greater than or equal to the target, so it is what
         * we are looking for.
         */
        return 0;
    }

    /*
     * set up the binary search.
     */
    long beg = 0;
    long end = file.length();
    while (beg <= end) {
        /*
         * find the mid point.
         */
        long mid = beg + (end - beg) / 2;
        file.seek(mid);
        file.readLine();
        line = file.readLine();

        if (line == null || line.compareTo(target) >= 0) {
            /*
             * what we found is greater than or equal to the target, so look
             * before it.
             */
            end = mid - 1;
        } else {
            /*
             * otherwise, look after it.
             */
            beg = mid + 1;
        }
    }

    /*
     * The search falls through when the range is narrowed to nothing.
     */
    file.seek(beg);
    file.readLine();
    return file.getFilePointer();
}

I tested this on a 33MB text file, it took about 15ms to find a line where grep took 600ms. I will be trying to make this faster using caching, and the result will likely be posted later.

16 comments:

  1. I can't see how this works properly - the binary search is looking at byte offsets, which means you could start searching int the middle of a line. You need to also find the beginning of the line before calling readLine() also so you are comparing the entire line.

    ReplyDelete
    Replies
    1. You may notice that during the search there are two readLine() calls, one to advance to the beginning of a line, the other to read it. It does loose some efficiency points for reading the same line over and over again when the search is getting tight. What is returns is the offset after skipping the partial line it finished on.

      Delete
  2. Here is a twist. What if the file contains duplicate lines of the search string?

    ReplyDelete
    Replies
    1. You should end up pointing at the first one in the sequence of duplicated lines.

      Delete
  3. thanks for this. fyi, the double readLine is complex and dangerous, and not working everytime (I had issues where I got back the line right after the one I was looking for) so you better go back to the previous line break after a seek with something like this :
    while (mid >= 0) {
    raf.seek(mid--);
    if (mid <= 0 || (char) raf.readByte() == '\n') {
    break;
    }
    }

    ReplyDelete
    Replies
    1. If you found a case where this fails I would love to see it. At one point I had considered seeking backwards to find the start of the line, but it would have involved a lot of backwards reading which can be much more expensive than forward reading, so I avoided it. I did the search from the perspective of the first full line after the pointer, which would never find the first line, so I had to treat that one specially.

      Delete
  4. I come up with similar solution. However I try to find a way to perform something similar in binary file, but it not seem possible at least for the moment

    ReplyDelete
  5. This comment has been removed by a blog administrator.

    ReplyDelete
  6. Altogether, 96% and 78% of the patients were in New York Heart Association functional class I-II in the Marfan and non-Marfan groups respectively. None of the Marfan patients needed reoperation on the aortic valve. Dr. Eyal Nachum is a senior cardiologist working in Heart Transplantation Unit, Sheba Medical Center, Ramat Gan, Israel.

    ReplyDelete
  7. Good one my dear I wish to have more of your articles and also become my entrepreneurship mentor, thank you.binary options demo

    ReplyDelete
  8. Wow those look great. I need to bring some cakes for our last day in the office so might have to give that recipe a try.binaryoptionsdemo

    ReplyDelete
  9. At the point when you've developed the capital and pay of your Forex frameworks activity, and have gotten together significant exchanging experience, you may choose to evaluate exchanging Forex for yourself. Aldi

    ReplyDelete
  10. You have a good point here!I totally agree with what you have said!!Thanks for sharing your views...hope more people will read this article!!! iq option entrar

    ReplyDelete
  11. Yes i am totally agreed with this article and i just want say that this article is very nice and very informative article.I will make sure to be reading your blog more. You made a good point but I can't help but wonder, what about the other side? !!!!!!Thanks robo trader

    ReplyDelete
  12. Momentary exchanging can be additionally characterized in to day exchanging, swing exchanging and position exchanging. forex course etc

    ReplyDelete