Friday, May 21, 2010

index bloom filter

I experimented with a bloom filter to reduce the number of index files I need to search when looking for the volumes that contain a particular block.

I found that the filter was quite effective at reducing the number of indexes to search for single blocks. However, when I went to search for several thousand blocks at once, I got at least one false positive in each index, so no index could be skipped. This stands to reason: with that many lookups, even a small per-lookup false positive rate makes at least one false match per index almost certain.
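To put numbers to it, here is a quick sketch (the 1% per-lookup false positive rate and the 5000 block count are assumptions for illustration, not measurements from my system):

public class BloomFalsePositiveSketch {
    public static void main(String[] args) {
        double p = 0.01; // assumed per-lookup false positive rate
        int n = 5000; // assumed number of blocks being searched for

        // chance that at least one of the n lookups falsely matches a
        // given index's filter: 1 - (1 - p)^n
        double atLeastOne = 1.0 - Math.pow(1.0 - p, n);

        // prints a value indistinguishable from 1.0
        System.out.println(atLeastOne);
    }
}

So once the batch of blocks gets large, touching every index is the expected outcome.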

parallel index searching

I tried searching the archive system index files in parallel (two at a time) and the wall time went down a little bit (a sketch of the approach follows the timings below). Since there are no locks involved, a CPU-bound scan should have sped up by close to a factor of two, so the small improvement is a sign that the problem is likely IO bound.

real    1m39.428s
user    2m23.523s
sys     0m7.205s
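
The approach is just a fixed pool of two workers. A minimal sketch, where searchIndex is a stand-in for my actual single-index search:

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSearchSketch {
    /**
     * Search a list of index files two at a time.
     */
    public static List<String> searchAll(List<File> indexes)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<Future<List<String>>> tasks = new ArrayList<Future<List<String>>>();
            for (final File index : indexes) {
                tasks.add(pool.submit(new Callable<List<String>>() {
                    public List<String> call() throws Exception {
                        return searchIndex(index);
                    }
                }));
            }

            List<String> found = new ArrayList<String>();
            for (Future<List<String>> task : tasks) {
                found.addAll(task.get());
            }
            return found;
        } finally {
            pool.shutdown();
        }
    }

    /**
     * A stand-in for the real single-index search.
     */
    private static List<String> searchIndex(File index) throws Exception {
        return new ArrayList<String>();
    }
}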

Tuesday, May 18, 2010

Commons-CLI repeated options

I was curious about how repeated options are handled in commons-cli. Here is a test capturing the answer: getOptionValues returns every value given, and getOptionValue returns only the first.

    /**
     * An option that may be specified more than once.
     * 
     * @throws ParseException
     */
    @Test
    public void testMultipleInstance() throws ParseException {
        Options o = new Options().addOption("a", "apple", true, "Apples");

        String[] args = { "-a", "one", "-a", "two" };
        CommandLine c = new GnuParser().parse(o, args);

        assertArrayEquals(new String[] { "one", "two" },
                c.getOptionValues("apple"));

        assertEquals("one", c.getOptionValue("apple"));
    }

Commons-CLI test

I started using commons-cli yesterday, and was curious about the case where an option is given but not defined. In these situations I end up either writing tests to check the condition or small example programs to exercise it. In this case I wrote a JUnit test.

It turns out that an option that is not defined causes parsing to fail with an UnrecognizedOptionException, as the first test below shows.

package org.yi.happy.archive;

import static org.junit.Assert.assertArrayEquals;

import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.GnuParser;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;
import org.apache.commons.cli.UnrecognizedOptionException;
import org.junit.Test;

/**
 * Experimental tests for the commons-cli library.
 */
public class CommandLineTest {
    /**
     * Show what happens when an invalid option is given.
     * 
     * @throws ParseException
     */
    @Test(expected = UnrecognizedOptionException.class)
    public void testBadOption() throws ParseException {
        Options o = new Options().addOption("a", "apple", true, "Apples");

        String[] arguments = { "--orange", "5" };
        new GnuParser().parse(o, arguments);
    }

    /**
     * End the argument list to catch an option with dashes.
     * 
     * @throws ParseException
     */
    @Test
    public void testArguments() throws ParseException {
        Options o = new Options().addOption("a", "apple", true, "Apples");

        String[] arguments = { "--", "--orange", "5" };
        CommandLine c = new GnuParser().parse(o, arguments);

        assertArrayEquals(new String[] { "--orange", "5" }, c.getArgs());
    }

}

Tuesday, May 11, 2010

compressed index searching

I am continuing to try to make searching my disk indexes faster. This time I tried compressing the index files to reduce the IO required to get them into memory; however, the run time went up.

First, the uncompressed index has a size of 3017824 K, and the times of two consecutive scans are
real    2m35.547s
user    2m18.172s
sys     0m6.146s

real    2m36.304s
user    2m17.152s
sys     0m6.242s

Next, the compressed index has a size of 1164228 K, and the times of two consecutive scans are
real    3m20.342s
user    3m4.939s
sys     0m8.805s

real    3m20.269s
user    3m3.866s
sys     0m8.699s

Since the system time went up, I wonder if some extra buffering would help. Also, since this machine has two cores, I wonder if I could do the scan in parallel.
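
By buffering I mean wrapping the read path in a BufferedInputStream. A sketch of the compressed case, with the buffer on the decompressed side (the method and its place in my scanner are stand-ins for illustration):

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public class IndexStreamSketch {
    /**
     * Open a compressed index file for scanning, buffering the
     * decompressed stream.
     */
    public static InputStream openCompressed(File file) throws IOException {
        return new BufferedInputStream(new GZIPInputStream(
                new FileInputStream(file)));
    }
}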

Uncompressed with buffering makes it slower:
real    2m42.024s
user    2m15.048s
sys     0m6.214s

Compressed with buffering is slower too:
real    3m34.062s
user    3m3.281s
sys     0m5.488s

I ran each timing multiple times, and the times were mostly the same for each run.

So far straight reading of uncompressed data is fastest when scanning the whole data set.

At a later time I will try to make a parallel version.

Saturday, May 8, 2010

Sustainable Test-Driven Development

I just watched "InfoQ: Sustainable Test-Driven Development" and saw some things I really should do in my own tests.

One thing I should do is make builders for putting test structures together. I also liked the idea of helper methods that hide the details of what is going on, so that the tests essentially declare the intent or feature.
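
For example, a test data builder might look like this (FileEntry and its fields are invented for illustration, not from my code base):

public class FileEntryBuilder {
    /**
     * A hypothetical value object for the example.
     */
    public static class FileEntry {
        public final String name;
        public final long size;

        public FileEntry(String name, long size) {
            this.name = name;
            this.size = size;
        }
    }

    private String name = "file.txt";
    private long size = 0;

    public FileEntryBuilder name(String name) {
        this.name = name;
        return this;
    }

    public FileEntryBuilder size(long size) {
        this.size = size;
        return this;
    }

    public FileEntry build() {
        return new FileEntry(name, size);
    }
}

A test can then say new FileEntryBuilder().name("a.txt").build() and mention only the details it actually cares about.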

Friday, May 7, 2010

tee and digest output stream

I was working on storing files in Happy Archive. The way it is currently constructed, I store a file by writing the contents to an output stream that breaks the data into chunks, then encodes, encrypts, and stores the chunks.

I also wanted a hash of the whole data stream to add to the file meta-data.

I considered making a decorator output stream that would calculate the hash of all the data passing through it, but decided it would be simpler to build a sink that does the hashing, and a tee to send the data to the two sinks (hashing and encoding). Another argument for not using a filter is that the hashing does not change the data passing through.

The tee looks like this:
package org.yi.happy.archive;

import java.io.IOException;
import java.io.OutputStream;

/**
 * An output stream that writes to two output streams.
 */
public class TeeOutputStream extends OutputStream {
    private final OutputStream out1;
    private final OutputStream out2;

    /**
     * create an output stream that writes to two output streams.
     * 
     * @param out1
     *            the first stream to write to.
     * @param out2
     *            the second stream to write to.
     */
    public TeeOutputStream(OutputStream out1, OutputStream out2) {
        this.out1 = out1;
        this.out2 = out2;
    }

    @Override
    public void write(int b) throws IOException {
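        // the finally clause makes sure out2 still gets the byte even
        // if the write to out1 throws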
        try {
            out1.write(b);
        } finally {
            out2.write(b);
        }
    }

    @Override
    public void write(byte[] b) throws IOException {
        try {
            out1.write(b);
        } finally {
            out2.write(b);
        }
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        try {
            out1.write(b, off, len);
        } finally {
            out2.write(b, off, len);
        }
    }

    @Override
    public void flush() throws IOException {
        try {
            out1.flush();
        } finally {
            out2.flush();
        }
    }

    @Override
    public void close() throws IOException {
        try {
            out1.close();
        } finally {
            out2.close();
        }
    }
}

Also the hashing output stream looks like this:
package org.yi.happy.archive;

import java.io.IOException;
import java.io.OutputStream;
import java.security.MessageDigest;

import org.yi.happy.archive.key.HashValue;

/**
 * An output stream that calculates the digest of whatever is written to it.
 */
public class DigestOutputStream extends OutputStream {
    private MessageDigest md;
    private HashValue hash;
    private long size;

    /**
     * set up an output stream that calculates the digest of whatever is written
     * to it.
     * 
     * @param md
     *            the {@link MessageDigest} to use. The {@link MessageDigest} is
     *            assumed to be freshly created and not shared.
     */
    public DigestOutputStream(MessageDigest md) {
        this.md = md;
        this.hash = null;
        this.size = 0;
    }

    @Override
    public void write(int b) throws IOException {
        if (md == null) {
            throw new ClosedException();
        }

        md.update((byte) b);
        size += 1;
    }

    @Override
    public void write(byte[] b) throws IOException {
        if (md == null) {
            throw new ClosedException();
        }

        md.update(b);
        size += b.length;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        if (md == null) {
            throw new ClosedException();
        }

        md.update(b, off, len);
        size += len;
    }

    @Override
    public void close() throws IOException {
        if (md == null) {
            return;
        }

        hash = new HashValue(md.digest());
        md = null;
    }

    /**
     * Get the final digest value.
     * 
     * @return the final digest value.
     * @throws IllegalStateException
     *             if {@link #close()} has not been called.
     */
    public HashValue getHash() throws IllegalStateException {
        if (hash == null) {
            throw new IllegalStateException();
        }
        return hash;
    }

    /**
     * @return the number of bytes written to this stream.
     */
    public long getSize() {
        return size;
    }
}

HashValue is a value object representing a hash as a string of bytes. ClosedException is an IOException subclass.
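
Wired together, storing looks roughly like this (a sketch: the ByteArrayOutputStream stands in for the chunking and encoding sink, and the printout assumes HashValue has a readable toString):

import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.security.MessageDigest;

public class StoreSketch {
    public static void main(String[] args) throws Exception {
        // stands in for the chunk, encode, encrypt, and store sink
        OutputStream store = new ByteArrayOutputStream();

        DigestOutputStream digest = new DigestOutputStream(MessageDigest
                .getInstance("SHA-256"));

        OutputStream out = new TeeOutputStream(store, digest);
        out.write("file contents".getBytes("UTF-8"));
        out.close();

        System.out.println(digest.getHash() + " " + digest.getSize());
    }
}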

I also noticed that there is a DigestOutputStream in the Java standard library, which is an output stream filter.

Wednesday, May 5, 2010

pancake syrup

To go with the pancakes that I make, I make syrup too.

Take a cup (250 ml) of brown sugar, add two cups of water, and boil the mixture down until it just starts to foam, or until it thickens slightly and no longer pours off the spoon like water.

Remove from heat and add some vanilla for flavor and serve.

If it is too thick, boil it again and add some water.

If it is too runny, boil it again.

Tuesday, May 4, 2010

Mac OSX Quicktime Codecs

For playing video on various platforms I often use VLC. It worked well until I ran into a file using an audio codec that was not included in it, and since I installed VLC on this Mac from a binary package, I do not know how to add more codecs to it.

In response I tried the other players that were installed here, DivX Player and QuickTime Player. Neither of them could handle the audio codec either, but they gave a useful error message, which I searched for and found the Perian codec pack for Mac OS X.

It worked perfectly; all the media that I have is now recognized by the player that came with the computer.