I experimented with a Bloom filter to reduce the number of index files I need to search when looking for the volumes that contain a particular block.
I found that the filter was quite effective at reducing the number of indexes to search for a single block, but when I went to search for several thousand blocks I would get at least one false positive in each index. Since I was dealing with probabilities, this stands to reason.
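To put numbers to that reasoning, here is a rough back-of-the-envelope sketch; the 1% false positive rate and the 5000 block count are illustrative assumptions, not measurements from my filter:

public class BloomFalsePositiveEstimate {
    public static void main(String[] args) {
        double p = 0.01;   // assumed false positive rate per lookup
        int blocks = 5000; // assumed number of blocks being searched for

        // expected number of false positives against one index
        double expected = blocks * p;

        // probability of seeing at least one false positive in that index
        double atLeastOne = 1.0 - Math.pow(1.0 - p, blocks);

        System.out.printf("expected false positives: %.1f%n", expected);
        System.out.printf("chance of at least one:   %.10f%n", atLeastOne);
    }
}

With numbers like those, the chance of an index getting through a large batch of lookups with no false positives at all is essentially zero, which matches what I saw.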
Friday, May 21, 2010
parallel index searching
I tried searching the archive system index files in parallel (two at a time) and the wall time went down a little bit. Since there are no locks involved, a CPU-bound scan should have sped up much more on two cores, so this is a sign that the problem is likely IO bound.
real    1m39.428s
user    2m23.523s
sys     0m7.205s
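For reference, a minimal sketch of how a two-at-a-time scan could be structured with a fixed thread pool; the "indexes" directory and the searchIndex method are placeholders for illustration, not the actual archive code:

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelIndexSearch {
    public static void main(String[] args) throws Exception {
        // two worker threads, matching the two-at-a-time scan
        ExecutorService pool = Executors.newFixedThreadPool(2);

        List<Future<List<String>>> pending = new ArrayList<Future<List<String>>>();
        for (final File index : new File("indexes").listFiles()) {
            pending.add(pool.submit(new Callable<List<String>>() {
                public List<String> call() throws Exception {
                    return searchIndex(index);
                }
            }));
        }

        // collect the results as the workers finish
        for (Future<List<String>> result : pending) {
            for (String hit : result.get()) {
                System.out.println(hit);
            }
        }
        pool.shutdown();
    }

    private static List<String> searchIndex(File index) {
        // placeholder for scanning one index file for the wanted keys
        return new ArrayList<String>();
    }
}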
Tuesday, May 18, 2010
Commons-CLI repeated options
I was curious about how repeated options are handled in commons-cli. Here is a test showing what I found: both values are available through getOptionValues, and getOptionValue returns the first.
/**
 * An option that may be specified more than once.
 *
 * @throws ParseException
 */
@Test
public void testMultipleInstance() throws ParseException {
    Options o = new Options().addOption("a", "apple", true, "Apples");
    String[] args = { "-a", "one", "-a", "two" };

    CommandLine c = new GnuParser().parse(o, args);

    assertArrayEquals(new String[] { "one", "two" },
            c.getOptionValues("apple"));
    assertEquals("one", c.getOptionValue("apple"));
}
Commons-CLI test
I started using commons-cli yesterday and was curious about the case where an option is given but not defined. In these situations I end up either writing tests to check the condition or small example programs to exercise it. In this case I wrote a JUnit test.
It turns out that options that are not defined raise an error condition.
package org.yi.happy.archive;

import static org.junit.Assert.assertArrayEquals;

import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.GnuParser;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;
import org.apache.commons.cli.UnrecognizedOptionException;
import org.junit.Test;

/**
 * Experimental tests for the commons-cli library.
 */
public class CommandLineTest {
    /**
     * Show what happens when an invalid option is given.
     *
     * @throws ParseException
     */
    @Test(expected = UnrecognizedOptionException.class)
    public void testBadOption() throws ParseException {
        Options o = new Options().addOption("a", "apple", true, "Apples");
        String[] arguments = { "--orange", "5" };

        new GnuParser().parse(o, arguments);
    }

    /**
     * End the argument list to catch an option with dashes.
     *
     * @throws ParseException
     */
    @Test
    public void testArguments() throws ParseException {
        Options o = new Options().addOption("a", "apple", true, "Apples");
        String[] arguments = { "--", "--orange", "5" };

        CommandLine c = new GnuParser().parse(o, arguments);

        assertArrayEquals(new String[] { "--orange", "5" }, c.getArgs());
    }
}
Tuesday, May 11, 2010
compressed index searching
I am continuing to try to make searching my disk indexes faster. This time I tried compressing the index files to reduce the IO required to get them into memory; however, the run time went up.
First, the uncompressed index has a size of 3017824 K, and the times of two consecutive scans are
real    2m35.547s
user    2m18.172s
sys     0m6.146s

real    2m36.304s
user    2m17.152s
sys     0m6.242s
Next, the compressed index has a size of 1164228 K, and the times of two consecutive scans are
real    3m20.342s
user    3m4.939s
sys     0m8.805s

real    3m20.269s
user    3m3.866s
sys     0m8.699s
Since the system time went up, I wonder if some extra buffering would help; also, since this machine has two cores, I wonder if I could do the scan in parallel.
Uncompressed with buffering makes it slower:
real    2m42.024s
user    2m15.048s
sys     0m6.214s
Compressed with buffering is slower too:
real    3m34.062s
user    3m3.281s
sys     0m5.488s
I ran each timing multiple times, and the times were mostly the same for each run.
So far straight reading of uncompressed data is fastest when scanning the whole data set.
At a later time I will try and make a parallel version.
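For reference, the stream setups being compared look roughly like this; a sketch that assumes the indexes are gzip-compressed and uses a 64 KB buffer, with the actual scanning code left out:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public class IndexStreams {
    // plain read of an uncompressed index file
    static InputStream plain(String name) throws IOException {
        return new FileInputStream(name);
    }

    // uncompressed read with extra buffering
    static InputStream buffered(String name) throws IOException {
        return new BufferedInputStream(new FileInputStream(name), 64 * 1024);
    }

    // compressed index, decompressed on the fly
    static InputStream compressed(String name) throws IOException {
        return new GZIPInputStream(new FileInputStream(name));
    }

    // compressed index with extra buffering under the decompressor
    static InputStream compressedBuffered(String name) throws IOException {
        return new GZIPInputStream(new BufferedInputStream(
                new FileInputStream(name), 64 * 1024));
    }
}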
Saturday, May 8, 2010
Sustainable Test-Driven Development
I just watched "InfoQ: Sustainable Test-Driven Development" and saw some things I really should do in my own tests.
One thing I should do is make builders for putting structures together in my tests; the other idea is to have methods that hide the details of what is going on, so that the tests essentially declare the intent or feature.
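As a reminder to myself of the shape of the builder idea, a small sketch; the FileMeta class and its fields are made up for illustration:

// a made-up value object that a test might need to construct
class FileMeta {
    final String name;
    final long size;

    FileMeta(String name, long size) {
        this.name = name;
        this.size = size;
    }
}

// a builder with sensible defaults, so a test only states the detail it cares about
class FileMetaBuilder {
    private String name = "file.txt";
    private long size = 0;

    FileMetaBuilder withName(String name) {
        this.name = name;
        return this;
    }

    FileMetaBuilder withSize(long size) {
        this.size = size;
        return this;
    }

    FileMeta build() {
        return new FileMeta(name, size);
    }
}

A test that only cares about the size can then say new FileMetaBuilder().withSize(1024).build(), which reads as intent rather than construction detail.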
Friday, May 7, 2010
tee and digest output stream
I was working on storing files in Happy Archive. The way it is currently constructed, I store files by writing the contents to an output stream that makes chunks, then encodes, encrypts, and stores the chunks.
I also wanted a hash of the whole data stream to add to the file meta-data.
I considered making a decorator output stream that would calculate the hash of all the data that passed through it, but decided that it would be simpler to build a sink that does the hashing, and a tee to send the data to the two sinks (hashing and encoding). Another argument for not using a filter is that the hashing does not change the data passing through.
The tee looks like this:
package org.yi.happy.archive;

import java.io.IOException;
import java.io.OutputStream;

/**
 * An output stream that writes to two output streams.
 */
public class TeeOutputStream extends OutputStream {
    private final OutputStream out1;
    private final OutputStream out2;

    /**
     * create an output stream that writes to two output streams.
     *
     * @param out1
     *            the first stream to write to.
     * @param out2
     *            the second stream to write to.
     */
    public TeeOutputStream(OutputStream out1, OutputStream out2) {
        this.out1 = out1;
        this.out2 = out2;
    }

    @Override
    public void write(int b) throws IOException {
        try {
            out1.write(b);
        } finally {
            out2.write(b);
        }
    }

    @Override
    public void write(byte[] b) throws IOException {
        try {
            out1.write(b);
        } finally {
            out2.write(b);
        }
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        try {
            out1.write(b, off, len);
        } finally {
            out2.write(b, off, len);
        }
    }

    @Override
    public void flush() throws IOException {
        try {
            out1.flush();
        } finally {
            out2.flush();
        }
    }

    @Override
    public void close() throws IOException {
        try {
            out1.close();
        } finally {
            out2.close();
        }
    }
}
Also the hashing output stream looks like this:
package org.yi.happy.archive;

import java.io.IOException;
import java.io.OutputStream;
import java.security.MessageDigest;

import org.yi.happy.archive.key.HashValue;

/**
 * An output stream that calculates the digest of whatever is written to it.
 */
public class DigestOutputStream extends OutputStream {
    private MessageDigest md;
    private HashValue hash;
    private long size;

    /**
     * set up an output stream that calculates the digest of whatever is
     * written to it.
     *
     * @param md
     *            the {@link MessageDigest} to use. The {@link MessageDigest}
     *            is assumed to be freshly created and not shared.
     */
    public DigestOutputStream(MessageDigest md) {
        this.md = md;
        this.hash = null;
        this.size = 0;
    }

    @Override
    public void write(int b) throws IOException {
        if (md == null) {
            throw new ClosedException();
        }

        md.update((byte) b);
        size += 1;
    }

    @Override
    public void write(byte[] b) throws IOException {
        if (md == null) {
            throw new ClosedException();
        }

        md.update(b);
        size += b.length;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        if (md == null) {
            throw new ClosedException();
        }

        md.update(b, off, len);
        size += len;
    }

    @Override
    public void close() throws IOException {
        if (md == null) {
            return;
        }

        hash = new HashValue(md.digest());
        md = null;
    }

    /**
     * Get the final digest value.
     *
     * @return the final digest value.
     * @throws IllegalStateException
     *             if {@link #close()} has not been called.
     */
    public HashValue getHash() throws IllegalStateException {
        if (hash == null) {
            throw new IllegalStateException();
        }
        return hash;
    }

    /**
     * @return the number of bytes written to this stream.
     */
    public long getSize() {
        return size;
    }
}
HashValue is a value object that represents strings of bytes that are hashes. ClosedException is an IOException.
I also noticed that there is a DigestOutputStream in the Java standard library which is an output stream filter.
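Putting the two pieces together when storing a file would look roughly like this; a sketch only, assuming it sits in the same package as the two classes above, where encodingStream stands in for the chunking, encoding, encrypting, and storing stream, and the choice of SHA-256 is an assumption:

import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.MessageDigest;

public class StoreFileSketch {
    public static void storeFile(String name, OutputStream encodingStream)
            throws Exception {
        // the hashing sink from above
        DigestOutputStream digestStream = new DigestOutputStream(
                MessageDigest.getInstance("SHA-256"));

        // the tee sends every byte to both the encoder and the hasher
        OutputStream out = new TeeOutputStream(encodingStream, digestStream);

        InputStream in = new FileInputStream(name);
        try {
            byte[] buffer = new byte[8192];
            int got;
            while ((got = in.read(buffer)) > 0) {
                out.write(buffer, 0, got);
            }
        } finally {
            in.close();
            out.close();
        }

        // after close() the whole-file hash and size are available for the meta-data
        System.out.println(digestStream.getHash() + " " + digestStream.getSize());
    }
}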
Wednesday, May 5, 2010
pancake syrup
To go with the pancakes that I make, I make syrup too.
Take a cup (250 ml) of brown sugar, add two cups of water, and boil the mix down until it just starts to foam or thickens slightly, so it no longer pours like water off the spoon.
Remove from heat and add some vanilla for flavor and serve.
If it is too thick, boil it again and add some water.
If it is too runny, boil it again.
Tuesday, May 4, 2010
Mac OSX Quicktime Codecs
For playing video on various platforms I often use VLC. It worked well until I ran into a file using an audio codec that was not included in it. Since I installed it on this Mac from a binary package, I do not know how to add more codecs to it.
In response I tried the other players that were installed here, DivX Player and QuickTime Player. Neither of them could handle the audio codec either, but they gave a useful error message, which I searched for and found the Perian codec pack for Mac OS X.
It worked perfectly; all the media that I have is now recognized by the player that came with the computer.