Tuesday, January 29, 2013

Java regular expressions on a byte array

Ever wanted to use a regular expression on a byte array in Java? It turns out that Java regular expressions are eight-bit safe, and every byte value maps cleanly onto one of the first 256 values of the char type (0x00 to 0xFF). With a simple CharSequence adapter it becomes a trivial task. Demonstration:
package org.yi.happy.binary_regex;

import static org.junit.Assert.assertEquals;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.junit.Test;

public class BinaryRegexTest {
    /**
     * Find line endings in a byte array using a regular expression.
     */
    @Test
    public void testExpression() {
        byte[] data = new byte[] { 'a', '\r', '\r', 'c' };
        Pattern p = Pattern.compile("\r\n?|\n\r?");
        Matcher m = p.matcher(new ByteCharSequence(data));

        assertEquals(true, m.find(0));
        assertEquals(1, m.start());
        assertEquals(2, m.end());

        assertEquals(true, m.find(2));
        assertEquals(2, m.start());
        assertEquals(3, m.end());

        assertEquals(false, m.find(3));
    }

    /**
     * Find null bytes in a byte array using a regular expression.
     */
    @Test
    public void testNull() {
        byte[] data = new byte[] { 'a', 0, 'b', 0 };

        Pattern p = Pattern.compile("\0");
        Matcher m = p.matcher(new ByteCharSequence(data));

        assertEquals(true, m.find(0));
        assertEquals(1, m.start());
        assertEquals(2, m.end());

        assertEquals(true, m.find(2));
        assertEquals(3, m.start());
        assertEquals(4, m.end());

        assertEquals(false, m.find(4));
    }
}
The adapter is as one might expect:
package org.yi.happy.binary_regex;

public class ByteCharSequence implements CharSequence {

    private final byte[] data;
    private final int length;
    private final int offset;

    public ByteCharSequence(byte[] data) {
        this(data, 0, data.length);
    }

    public ByteCharSequence(byte[] data, int offset, int length) {
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public int length() {
        return this.length;
    }

    @Override
    public char charAt(int index) {
        return (char) (data[offset + index] & 0xff);
    }

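    @Override
    public CharSequence subSequence(int start, int end) {
        return new ByteCharSequence(data, offset + start, end - start);
    }

    /**
     * Matcher.group() calls toString() on a sub-sequence, so return the bytes
     * as characters 0..255 (ISO-8859-1 maps each byte to the same code point).
     */
    @Override
    public String toString() {
        return new String(data, offset, length,
                java.nio.charset.StandardCharsets.ISO_8859_1);
    }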

}
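
Bytes above 0x7f are where the `& 0xff` mask in charAt matters: it turns a negative Java byte into a char between 0x80 and 0xFF, so a pattern can name that value directly. The following is a minimal sketch of three ways to spell the byte 0xFF in a pattern. The names ByteEscapeDemo, Bytes, and indexOf are made up for this illustration, and Bytes is only a stand-in for the adapter above so that the example compiles on its own.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ByteEscapeDemo {
    /** Stand-in for the adapter above: exposes each byte as a char in 0..255. */
    static final class Bytes implements CharSequence {
        private final byte[] data;
        private final int offset;
        private final int length;

        Bytes(byte[] data, int offset, int length) {
            this.data = data;
            this.offset = offset;
            this.length = length;
        }

        @Override
        public int length() {
            return length;
        }

        @Override
        public char charAt(int index) {
            // Mask so that negative bytes become chars 0x80..0xFF.
            return (char) (data[offset + index] & 0xff);
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new Bytes(data, offset + start, end - start);
        }

        @Override
        public String toString() {
            return new String(data, offset, length, StandardCharsets.ISO_8859_1);
        }
    }

    /** Returns the index of the first match of regex in data, or -1 if none. */
    static int indexOf(byte[] data, String regex) {
        Matcher m = Pattern.compile(regex).matcher(new Bytes(data, 0, data.length));
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        byte[] data = { 'A', (byte) 0xFF, 0x00 };

        // Regex hex escape: the pattern text is \xFF.
        System.out.println(indexOf(data, "\\xFF"));
        // Regex octal escape: the pattern text is \0377 (leading 0 required).
        System.out.println(indexOf(data, "\\0377"));
        // Java string-literal octal escape: the compiler emits the char 0xFF itself.
        System.out.println(indexOf(data, "\377"));
    }
}
```

Each of the three calls prints 1. The first two escapes are interpreted by the regex engine, while the third is resolved by the Java compiler before the pattern is ever compiled, which is also how "\0" and "\r\n?|\n\r?" in the tests above work.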

Comments:

  3. This is brilliant! I can now do binary pattern matching using built-in java libs. The only thing I might suggest is to emphasise that in the expression \XYZ is in Oct (not hex, not dec).

    Mate, thanks for sharing this!
