Tuesday, April 27, 2010

ubuntu custom kernel build script

As I customized my kernel on a ubuntu laptop, I found myself looking up and running the same commands each time I made a change, so I wrote myself a little script to do the work.

cd /usr/src/linux
make-kpkg clean
 --append-to-version=-custom kernel_image kernel_headers modules_image

I put the script in my path and called it build-kernel, so I could use it with
sudo build-kernel

This made things easy.

Monday, April 26, 2010

searching for many lines in a text file

I spent the day working on my archive system project trying to improve the speed of searching the indexes for batches of keys.

I remember this being slow even when using a relational database, taking several minutes when the list and index were both in the database. If the list was not in the database it was painfully slow.

I have the indexes, sorted by key, one file per volume (768 files).

I have a list of keys to find in a file (24309 entries).

I try to do a binary search of each index file for each key and it takes about 27 seconds per file to do the search, and I have not wanted to wait to see how long the entire search takes (a quick estimate is 6 hours). I would expect that if the indexes were merged into one sorted file then it would still take 27 seconds to search that and get the whole answer I am looking for, but that is extra management to deal with. I am already caching reads in the binary search, they only improved the performance slightly when they were added.

To do a straight linear scan of all the files and pick out the lines I am looking for takes 177 seconds (almost 3 minutes).

Just reading all of the files with no processing
cat */* > /dev/null
takes 63 seconds.

The size of the data set on disk is almost 3G.
excited:~/archive.d/index$ du -h
1.5G ./offsite
1.4G ./onsite
2.9G .

The total number of lines of text in the data set is 26283702.

I am not even sure if it is possible to do faster than the linear scan for performance without merging all of the files together, which I would rather not do to save the size of the volume set and name fields.

reformat iso date

I was given a tiny challenge, write a function that changes "yyyy-mm-dd HH:MM" into "mm/dd/yyyy".

First I wrote a test:
package org.yi.happy;

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class DateReformatTest {
    public void test1() {
        String have = DateReformat.isoToAccess("2010-02-01 14:40");

        assertEquals("02/01/2010", have);


Then I wrote the code, and used a regular expression to do the work:
package org.yi.happy;

public class DateReformat {

    public static String isoToAccess(String value) {
        return value.replaceAll(
                "^(\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d) \\d\\d:\\d\\d$",


All it does is capture some groups of digits, then re-arranges them. If it does not match it just returns the original string.

Saturday, April 24, 2010

binary search of a sorted text file

This is my second time implementing a binary search of a sorted text file, the previous time was many years ago using PERL, this time it is in Java.

Here is the code to do a binary search of a sorted text file.

 * Find the position of the start of the first line in the file that is
 * greater than or equal to the target line, using a binary search.
 * @param file
 *            the file to search.
 * @param target
 *            the target to find.
 * @return the position of the first line that is greater than or equal to
 *         the target line.
 * @throws IOException
public static long search(RandomAccessFile file, String target)
        throws IOException {
     * because we read the second line after each seek there is no way the
     * binary search will find the first line, so check it first.
    String line = file.readLine();
    if (line == null || line.compareTo(target) >= 0) {
         * the start is greater than or equal to the target, so it is what
         * we are looking for.
        return 0;

     * set up the binary search.
    long beg = 0;
    long end = file.length();
    while (beg <= end) {
         * find the mid point.
        long mid = beg + (end - beg) / 2;
        line = file.readLine();

        if (line == null || line.compareTo(target) >= 0) {
             * what we found is greater than or equal to the target, so look
             * before it.
            end = mid - 1;
        } else {
             * otherwise, look after it.
            beg = mid + 1;

     * The search falls through when the range is narrowed to nothing.
    return file.getFilePointer();

I tested this on a 33MB text file, it took about 15ms to find a line where grep took 600ms. I will be trying to make this faster using caching, and the result will likely be posted later.

Friday, April 23, 2010

reordering columns in a text grid

I just wanted to quickly move a column around in a grid file (tab separated text data file), and had to look up a few commands to do it.

cut - I use this program quite often to select columns from a grid.

paste - join two files line by line with corresponding lines separated with a tab.

paste <(cut -d $'\t' -f 3 $file) <(cut -d $'\t' -f 1,2,4,5,6 $file) | sort > $out

Basically the two cuts run in parallel, and are piped into paste which reads each a line at a time and outputs the pasted lines, the rest of the command line just sorts the result and saves it to a file.

I am doing this to try and do a binary search on the text file.

Monday, April 19, 2010

finding and visualizing disk usage

I was working with someone to clean up their hard disk because it was getting full, and it was not very apparent where the space was going, so I introduced them to WinDirStat, and it showed the sore spots for usage.

If you are on Linux, the equivalent that came first is KDirStat.

I like it when cleaning up, but usually I end up using
du | sort -n
and working my way from the bottom of the list.

Sunday, April 18, 2010

large set calculations on the command line

I was working on Happy Archive a little bit, burning some backup disks.

To figure out what to put on a disk, I need to figure out which keys are not stored in a particular volume set, and not stored on any volume set.

There are two volume sets, offsite and onsite.

I have a list of keys in the store (store.lst), and a table of what keys from the store exist on what volumes (volume.grid: volume-set, volume, file-name, key).

First I make a list of what keys are in each volume set already:
grep $'^onsite\t' volume.grid | cut -d $'\t' -f 4 | sort > on-onsite.lst
grep $'^offsite\t' volume.grid | cut -d $'\t' -f 4 | sort > on-offsite.lst
Then I take those sets away from the store list (outstanding = store - exist).
sort on-onsite.lst | comm -13 - store.lst > outstanding-onsite.lst
sort on-offsite.lst | comm -13 - store.lst > outstanding-offsite.lst
Now I take the intersection of those sets to find the keys not on either disk set, and the difference to find the keys that are only on one of the disk sets.
comm -12 outstanding-onsite.lst outstanding-offsite.lst > outstanding-common.lst
comm -23 outstanding-onsite.lst outstanding-offsite.lst > outstanding-just-onsite.lst
comm -13 outstanding-onsite.lst outstanding-offsite.lst > outstanding-just-offsite.lst
And finally make the list in the order of preference for burning, giving priority to keys that are not on the backup disks yet.
cat outstanding-common.lst outstanding-just-onsite.lst > outstanding-onsite.lst
cat outstanding-common.lst outstanding-just-offsite.lst > outstanding-offsite.lst

I am going to be embedding this into a java program, but in a pinch this data processing is not hard to do.

There were 141k keys in the store list, and 80k keys in the resulting outstanding lists, and it did not take long to run these commands.

Wednesday, April 14, 2010

streaming radio capture

I was offered the challenge of capturing an mp3 stream of live radio for later listening, so I threw together a short shell script for that purpose.

out=stream-`date +%Y-%m-%d-%H%M`.mp3

curl -s -G "$url" > "$out" &

function cleanup() {
  kill $curl_pid

trap cleanup EXIT

sleep $time

# cleanup gets implicitly called here

url is the raw stream address, found in the mp3 playlist, the example stream is cbc radio halifax.

time is the length to record in seconds.

output is what to name the output file (stream-yyyy-mm-dd-HHMM.mp3).

Tuesday, April 13, 2010

dumping a table from a sqlite database

While trying to extract the text messages from an iPhone backup, I found a need to extract a table to a text file to make it easier to work with. This extraction ended up being the final result, since it was good enough and could be imported into a spreadsheet for easy viewing.

The table extraction just does a grid escape of the results of a select statement.
#!/usr/bin/env python
import sys, sqlite3

if len(sys.argv) != 3:
  print "expected database file name and table name"
file = sys.argv[1]
table = sys.argv[2]

def grid_escape(x):
  if x == None:
    return "\\N"
  x = unicode(x)
  return x.replace("\\", "\\\\").replace("\t", "\\t").replace("\n","\\n")

conn = sqlite3.connect(file)
c = conn.cursor()

c.execute('select * from ' + table)
print "\t".join([col[0] for col in c.description]).encode('utf-8')
for row in c:
  print "\t".join(map(grid_escape,row)).encode('utf-8')

the grid escape function demonstrates how simple it is to encode that format, which is part of why I like it, it is trivial to produce in many languages.

Monday, April 12, 2010

setting gnome options

While searching for how to disable the face browser (list of available users on the login screen) in gnome 2.22, I found the command line method for updating system wide gnome settings.

First to disable the face browser and shutdown buttons on the gnome login screen
sudo gconftool-2 --direct \
 --config-source xml:readwrite:/etc/gconf/gconf.xml.defaults \
 --type bool --set /apps/gdm/simple-greeter/disable_user_list true
sudo gconftool-2 --direct \
 --config-source xml:readwrite:/etc/gconf/gconf.xml.defaults \
 --type bool --set /apps/gdm/simple-greeter/disable_restart_buttons true

And while initially installing the server I wanted to change the power settings for the laptop lid, ideally from the command line, I see that I can do these with these commands,
sudo gconftool-2 --direct \
 --config-source xml:readwrite:/etc/gconf/gconf.xml.defaults \
 --type string --set /apps/gnome-power-manager/buttons/lid_ac blank
sudo gconftool-2 --direct \
 --config-source xml:readwrite:/etc/gconf/gconf.xml.defaults \
 --type string --set /apps/gnome-power-manager/buttons/lid_battery blank
to make the closing of the lid just blank the screen instead of going to sleep. I actually set these through the power configuration applet when I did them the first time.

Sunday, April 11, 2010

samba listening to ipv4 and ipv6

I was having a problem with windows boxes connecting to the server that is being built. After some investigating I figured out that samba was not listening on ipv4, just ipv6.

The solution that worked for me I found on "Flameeyes's Weblog: Tip of the day: if Samba 3.4 fails to work...".

I had to add listen directives to smb.conf.
interfaces = eth0
bind interfaces only = yes

Now the file server is operational for all the computers on the network.

simple samba configuration

I am building a new home server from a laptop with a smashed screen, and I just configured the file server part of it.

After stripping out the comments and most of the configuration directives from the stock smb.conf file, I was left with a nice short file that works. The objective was to have the home directory available for users that are configured, and a public directory for anyone, also there should be no permissions restrictions on the public directory, people should be able to modify each others files.

The resulting /etc/samba/smb.conf file is
dns proxy = no
map to guest = Bad User
guest account = guest
unix extensions = no

server string = Home server
log file = /var/log/samba/log.%m
max log size = 1000
syslog = 0

browsable = no
read only = no
valid users = %S

read only = no
guest ok = yes
force user = guest
path = /export/public

I had also created a guest account for this setup.
sudo useradd guest -m -p \*

And of course the public directory to share
sudo mkdir /export
sudo mkdir /export/public
sudo chown guest /export/public

The server is behind a home router, which acts as a firewall, I would not put a configuration like this directly attached to the internet.

Saturday, April 10, 2010

dumping the list of tables in a sqlite database

I was poking through an iPhone backup for a friend, trying to recover the text message log after he accidentally deleted a bunch of messages. I found that the text messages were stored in a SQLite database. It also turned out that the backup was before the time that the messages were deleted, so most of the deleted messages were intact.

I decided to try and dump the whole database to the simple grid format discussed previously. First I needed a list of tables to dump, and I decided to make this an exercise in learning some Python database API.

Here is a Python program to dump the list of tables in a SQLite database.

import sys, sqlite3

if len(sys.argv) != 2:
  print "expected database file name"
file = sys.argv[1]

conn = sqlite3.connect(file)
c = conn.cursor()

c.execute('select name from sqlite_master where type=\'table\'')
for table in c:
  print table[0]


Wednesday, April 7, 2010

VNC Login on Ubuntu 9.10

I am setting up a new home server to replace the existing one, in the hope of reducing power consumption. The new server is a laptop with a broken screen.

I just configured a VNC Login screen, so here are the notes that I made. (The previous post used an earlier version of Ubuntu.)

The first requirement for the VNC login screen is a XDMCP server, such as GDM. The new GDM configuration applet does not provide much, but after a quick search I found a guide on Enabling XDMCP on Karmic Koala.

I created the file /etc/gdm/custom.conf containing
and restarted gdm
sudo restart gdm
which enabled XDMCP.

I also installed a VNC server
sudo apt-get install vnc4server
which will provide the X server for VNC logins.

Finally, an inetd needed to be installed
sudo apt-get install xinetd
so that the vnc service can be configured.

To configure the VNC service I created the configuration file /etc/xinetd.d/vnc
service vnc
    type = UNLISTED
    port = 5900
    only_from = localhost
    disable = no
    socket_type = stream
    protocol = tcp
    wait = no
    user = nobody
    server = /usr/local/bin/xvnc-inetd
and the script /usr/local/bin/xvnc-inetd
exec /usr/bin/Xvnc \
 -inetd \
 -query localhost \
 -once \
 -SecurityTypes=None \
 -pn \
 -extension XFIXES
which was named in the configuration. The script is just to keep the line lengths down, making things look cleaner.

Finally I restarted xinetd.
sudo /etc/init.d/xinetd restart

And it works.

The problems I have with the setup at the moment is that the login screen lists the users instead of asking for the name to be typed, and allows shutdown of the system from the login screen.

VNC Login on Ubuntu

I have a laptop running Ubuntu Desktop 9.04, which has been sitting on a shelf getting occasional remote use. I am also in the process of building a replacement home server for the house from a broken laptop, in a hope to replace the current home server with one that uses less power. So, I went about figuring out how to get a gdm login screen over VNC, for remote graphical login.

I found that I needed to use the vnc4server package instead of the other options for vnc servers on the laptop so that the keyboard would translate properly into the remote session.

sudo apt-get install vnc4server

I already knew that I needed to turn on XDMCP, but figuring out how was a little bit of a challenge since each display manager has the option in a different place, and it also changes with versions of the display managers. I eventually found a graphical tool to turn it on under "System -> Administration -> Login Window", in there I just set "Remote -> Style -> Same as Local" and XDMCP was turned on.

Now I needed to have an inetd to launch the vncserver from, on incoming connections.

sudo apt-get install xinetd

and add a vnc service name to /etc/services (if it is not already listed in that file)

vnc 5600/tcp

and add the vnc service to xinetd (/etc/xinetd.d/vnc)

service vnc
        only_from = localhost
        disable = no
        id = vnc
        socket_type = stream
        protocol = tcp
        wait = no
        user = nobody
        server = /usr/bin/Xvnc4
        server_args = -inetd -query localhost -once -SecurityTypes=None -extension XFIXES -pn -fp /usr/share/fonts/X11/misc/,/usr/share/fonts/X11/75dpi/,/usr/share/fonts/X11/100dpi/
        log_on_failure += USERID

The long line starting with server_args is server_args = -inetd -query localhost -once -SecurityTypes=None -extension XFIXES -pn -fp /usr/share/fonts/X11/misc/,/usr/share/fonts/X11/75dpi/,/usr/share/fonts/X11/100dpi/, which I could put into a script somewhere to make the line shorter, if I feel the need later.

connections are restricted to the local network only, after restarting xinetd it finally worked.

I will be setting this up again on the server that I am building, and if something is missing I will update it then.

Tuesday, April 6, 2010

ISO9660 Size Estimating

When I was coming up with the format for the backup disks I decided on a flat directory of numbered files because it was simple and usually worked well. Also it made each block addressable by the underlying operating system, externalizing some complexity, and allowed for media defects to have the least amount of impact, allowing for many partially burned disks to still be useful.

The first thought I had for filling the disks was to estimate the image size based on just the total size of the files being put on the disk, this worked well except where there were large numbers of small files (on the order of 100k files).

total = \sum{\lceil\frac{size}{2048}\rceil}2048

For a while I used this estimator with a pad factor based on the number of files, which worked fairly well but sometimes needed adjustment.

Last June I did some research and found the ISO9660 Filesystem specifications, and wrote up a better estimator for my specific case.
  • All files are 8.3 format.
  • There are no extended formats
  • All the files are in the root directory
The resulting estimator for image size in 2 KiB blocks is
174 + \lfloor\frac{count}{42}\rfloor + \sum{\lceil\frac{size}{2048}\rceil}
where the 174 is the disk overhead including padding, super blocks, and path tables; count is the number of file names on the disk, and size is the size of each file. This can be calculated incrementally with just two variables, a count and the sum of sizes so far.

This estimator came out to be right on every time.

Some day I may even use the ECMA-119 Standard (which got approved as the ISO9660 standard) to directly write a disk image instead of populating a directory and using mkisofs to make the image, but I have more important things to do before that.

Here is that same estimator as a PERL code example, calculating 1000 instances of a 2048 byte file.
#!/usr/bin/perl -w
use strict;
use POSIX;

sub sum {
    my $out = 0;
    for(@_) {
        $out += $_;
    return $out;

my @sizes = ( 2048 ) x 1000;
my $file_count = @sizes;

my $data_size = sum(map { ceil($_ / 2048) } @sizes);
my $dir_size = floor( $file_count / 42 ) + 1;
my $overhead = 173;

my $size = $overhead + $dir_size + $data_size;

$\ = "\n";
print $size;

Monday, April 5, 2010

splitting a PDF file into one page chunks

Yesterday I found how to split and merge PDF files using the iText Java library.

Today I will post the split program, which I simplified and cleaned up from an example.

package org.yi.happy.pdf;

import java.io.FileOutputStream;

import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;

public class PdfSplitMain {
    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.out.println("use: infile [outbase]");

        String inFile = args[0];

        String outBase;
        if (args.length < 2) {
            if (inFile.endsWith(".pdf")) {
                outBase = inFile.substring(0, inFile.length() - 4) + "-";
            } else {
                outBase = inFile;
        } else {
            outBase = args[1];

        PdfReader reader = new PdfReader(inFile);
        try {
            for (int i = 1; i <= reader.getNumberOfPages(); i++) {
                String outFile = outBase + String.format("%04d", i) + ".pdf";

                Document document = new Document();
                FileOutputStream output = new FileOutputStream(outFile);
                try {
                    PdfCopy writer = new PdfCopy(document, output);
                    PdfImportedPage page = writer.getImportedPage(reader, i);
                } finally {
        } finally {

Basically the input file is opened, and each page from it is written to a separate output file.

After figuring out these programs, with the intention of being able to remix PDF files along page boundaries, I learned that the Preview application of the MAC that I am using can edit PDF files very easily, so I will probably be using that instead most of the time.

Sunday, April 4, 2010

joining pdf files

I want to be able to split and join PDF files to allow me to remix virtual paper like I do physical paper, at least on page boundaries. Most of the pages I deal with come from scans, so are actually images of pages.

After doing a brief search I came across an iText tutorial on merging PDF files. I then took the example code and vastly simplified it for my own purposes.

This code requires the iText jar file and Java5.

package org.yi.happy.pdf;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.Arrays;

import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;

public class PdfJoinMain {
    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("use: output input...");

        String[] pdfs = Arrays.copyOfRange(args, 1, args.length);
        OutputStream output = new FileOutputStream(args[0]);
        try {
            Document document = new Document();
            PdfCopy writer = new PdfCopy(document, output);
            for (String pdf : pdfs) {
                FileInputStream input = new FileInputStream(pdf);
                try {
                    PdfReader reader = new PdfReader(input);
                    for (int p = 1; p <= reader.getNumberOfPages(); p++) {
                        PdfImportedPage page = writer
                                .getImportedPage(reader, p);
                } finally {
        } finally {

The code opens the output PDF file, then for each input PDF file, copies each page from the input file into the output file, and finally closes the output file. There are try ... finally constructs to ensure that the files get closed. When the program is invoked with no arguments it prints help and exits.

I feel that this program may actually be counter-intuitive since the target comes first on the command line, instead of last; take copy and move (cp and mv) for instance, the target comes last on the command line.

The splitting code is similar, and will be posted tomorrow.

Saturday, April 3, 2010

grid file format

I do data analysis on the command line at times. One of the tools that gets much use is cut, which extracts delimited columns from lines. Another that gets well used is PERL, to make simple summary or translation filters.

When doing this kind of data analysis I like to use tab and newline delimited text, which I like to call grid. I find CSV to be harder to work with because it does not have delimiters that can not also show up in the values.

grid := line*
line := field* ( cr | lf | cr lf )
field := ( [ not cr lf tab "\" ] | "\r" | "\n" | "\t" | "\\" )* | "\N"

This is actually an 8-bit clean format, any binary value can be put in the field as long as the escape character (backslash), newline characters, and tab character are escaped. Finally this format can also support database NULL values, with "\N". I typically use UTF-8 encoded text in this format.

I originally saw this format in the MySQL manual, it is the import data format.

The main reason I use this format is that it is extremely easy to parse and write, and it also works with many of the stock text processing tools that come with the shell.

Friday, April 2, 2010

reversing log

I was tutoring, and one particular type of problem came up as difficult, solving for a variable in a log.

2 = \log x
To solve this, first exponent both sides on 10, which will reverse the log. The hard part seems to be here.
10^2 = 10^{\log x}
and simplify (10^{\log x} = x).
100 = x

To make is harder, multiply the x by a constant inside the log.
2 = \log 2x
put both sides as exponents of 10
10^2 = 10^{\log 2x}
100 = 2x
divide both sides by 2
\frac{100}{2} = \frac{2x}{2}
50 = x

and this time add something to the x inside the log
2 = \log\left( 2x + 1 \right)
put both as exponents of 10
10^2 = 10^{\log\left( 2x + 1 \right)}
100 = 2x + 1
subtract 1 from both sides
100 - 1 = 2x + 1 - 1
99 = 2x
divide both sides by two
\frac{99}{2} = \frac{2x}{2}
\frac{99}{2} = x

what can I do to make it harder? Two layers of log? two layers of logs of different bases? that should be plenty hard.

2 = \ln\log x
The \ln x is the outside log, so it needs to be dealt with first, make both sides exponents of e
e^2 = e^{\ln\log x}
e^2 = \log x
make both sides exponents of 10
10^{e^2} = 10^{\log x}
10^{e^2} = x

And now for one that looks a bit scary, the log is in an exponent.
2 = e^{\log x}
First there is the e^x to deal with, so take the natural logarithm of both sides.
\ln 2 = \ln e^{\log x}
\ln 2 = \log x
and put both sides on an exponent of 10, as in the previous solutions
10^{\ln 2} = 10^{\log x}
10^{\ln 2} = x

It is all a matter of removing one layer at a time to get to the thing being solved for.

all good?

Thursday, April 1, 2010

revealing the algebraic falacy, 2 != 1

Previously I used only basic algebraic manipulation I showed that 2 = 1, which is also false, now I am going to show where it went wrong, by doing the same steps as before, using the same forms as before.

a = b

Substitute b for a in the entire rest of the article, since they are the same.

b = b

we can do whatever we want to an equation as long as we do the same thing to both sides, so first add b

b + b = b + b

partially simplify

2b = b + b

subtract 2b

2b - 2b = b + b - 2b

partially group like terms

2b - 2b = b + (1-2)b


2b - 2b = b + (-1)b

adding a negative number is the same as subtracting

2b - 2b = b - b

factor out (b - b)

2(b - b) = 1(b - b)

divide by (b - b), which is also known as zero.

\frac{2(b - b)}{b - b} = \frac{1(b - b)}{b - b}

re-factor to isolate the one, now we have zero over zero, anything can happen

2\left(\frac{b - b}{b-b}\right) = 1\left(\frac{b - b}{b-b}\right)

reduce the one, and we are just not allowed to get here

2(1) = 1(1)


2 = 1

Division by zero is not allowed, and zero over zero is just undefined.