## Tuesday, April 27, 2010

### ubuntu custom kernel build script

As I customized my kernel on an Ubuntu laptop, I found myself looking up and running the same commands each time I made a change, so I wrote a little script to do the work.

#!/bin/bash
cd /usr/src/linux
make-kpkg clean
INSTALL_MOD_STRIP=1 CONCURRENCY_LEVEL=3 make-kpkg --initrd \


I put the script in my path and called it build-kernel, so I could use it with
sudo build-kernel


## Monday, April 26, 2010

### searching for many lines in a text file

I spent the day working on my archive system project trying to improve the speed of searching the indexes for batches of keys.

I remember this being slow even when using a relational database, taking several minutes when the list and index were both in the database. If the list was not in the database it was painfully slow.

I have the indexes, sorted by key, one file per volume (768 files).

I have a list of keys to find in a file (24309 entries).

I tried doing a binary search of each index file for each key, and it takes about 27 seconds per file, so I have not wanted to wait to see how long the entire search takes (a quick estimate is 6 hours). I would expect that if the indexes were merged into one sorted file it would still take about 27 seconds to search and get the whole answer I am looking for, but that is extra management to deal with. I am already caching reads in the binary search; the caching only improved the performance slightly when it was added.

To do a straight linear scan of all the files and pick out the lines I am looking for takes 177 seconds (almost 3 minutes).
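
The linear scan is simple to sketch. Assuming the key list fits in memory and each index line starts with its key followed by a tab (the exact index layout here is my assumption), it is just a set lookup per line:

```python
def scan_for_keys(key_file, index_files):
    # load the key list into a set for constant-time membership tests
    with open(key_file) as f:
        keys = set(line.rstrip("\n") for line in f)

    matches = []
    for name in index_files:
        with open(name) as f:
            for line in f:
                # assumes each index line starts with its key, tab separated
                if line.split("\t", 1)[0] in keys:
                    matches.append(line)
    return matches
```

This is one pass over every index file regardless of how many keys are being looked up, which is why it beats repeating a per-key binary search across all 768 files.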

Just reading all of the files with no processing
cat */* > /dev/null

takes 63 seconds.

The size of the data set on disk is almost 3G.
## Saturday, April 24, 2010

### binary search of a sorted text file

This is my second time implementing a binary search of a sorted text file; the previous time was many years ago using PERL, this time it is in Java.

Here is the code to do a binary search of a sorted text file.

/**
 * Find the position of the start of the first line in the file that is
 * greater than or equal to the target line, using a binary search.
 *
 * @param file
 *            the file to search.
 * @param target
 *            the target to find.
 * @return the position of the first line that is greater than or equal to
 *         the target line.
 * @throws IOException
 */
public static long search(RandomAccessFile file, String target)
        throws IOException {
    /*
     * because we read the second line after each seek there is no way the
     * binary search will find the first line, so check it first.
     */
    file.seek(0);
    String line = file.readLine();
    if (line == null || line.compareTo(target) >= 0) {
        /*
         * the start is greater than or equal to the target, so it is what
         * we are looking for.
         */
        return 0;
    }

    /*
     * set up the binary search.
     */
    long beg = 0;
    long end = file.length();
    while (beg <= end) {
        /*
         * find the mid point.
         */
        long mid = beg + (end - beg) / 2;
        file.seek(mid);
        file.readLine();
        line = file.readLine();
        if (line == null || line.compareTo(target) >= 0) {
            /*
             * what we found is greater than or equal to the target, so
             * look before it.
             */
            end = mid - 1;
        } else {
            /*
             * otherwise, look after it.
             */
            beg = mid + 1;
        }
    }

    /*
     * The search falls through when the range is narrowed to nothing.
     */
    file.seek(beg);
    file.readLine();
    return file.getFilePointer();
}

I tested this on a 33MB text file; it took about 15ms to find a line where grep took 600ms. I will be trying to make this faster using caching, and the result will likely be posted later.
## Friday, April 23, 2010

### reordering columns in a text grid

I just wanted to quickly move a column around in a grid file (a tab separated text data file), and had to look up a few commands to do it.

cut - I use this program quite often to select columns from a grid.

paste - join two files line by line, with corresponding lines separated with a tab.

file=2007-01-23-1520
out=index.txt
paste <(cut -d$'\t' -f 3 $file) <(cut -d$'\t' -f 1,2,4,5,6 $file) | sort > $out


Basically the two cuts run in parallel and are piped into paste, which reads each one a line at a time and outputs the pasted lines; the rest of the command line just sorts the result and saves it to a file.

I am doing this to try and do a binary search on the text file.

## Monday, April 19, 2010

### finding and visualizing disk usage

I was working with someone to clean up their hard disk because it was getting full, and it was not very apparent where the space was going, so I introduced them to WinDirStat, and it showed the sore spots for usage.

If you are on Linux, the equivalent that came first is KDirStat.

I like it when cleaning up, but usually I end up using
du | sort -n

and working my way from the bottom of the list.

## Sunday, April 18, 2010

### large set calculations on the command line

I was working on Happy Archive a little bit, burning some backup disks.

To figure out what to put on a disk, I need to figure out which keys are not yet stored in a particular volume set, and which are not stored in any volume set.

There are two volume sets, offsite and onsite.

I have a list of keys in the store (store.lst), and a table of what keys from the store exist on what volumes (volume.grid: volume-set, volume, file-name, key).

First I make a list of what keys are in each volume set already:
grep $'^onsite\t' volume.grid | cut -d$'\t' -f 4 | sort > on-onsite.lst
grep $'^offsite\t' volume.grid | cut -d$'\t' -f 4 | sort > on-offsite.lst

Then I take those sets away from the store list (outstanding = store - exist).
sort on-onsite.lst | comm -13 - store.lst > outstanding-onsite.lst
sort on-offsite.lst | comm -13 - store.lst > outstanding-offsite.lst

Now I take the intersection of those sets to find the keys not on either disk set, and the difference to find the keys that are only on one of the disk sets.
comm -12 outstanding-onsite.lst outstanding-offsite.lst > outstanding-common.lst
comm -23 outstanding-onsite.lst outstanding-offsite.lst > outstanding-just-onsite.lst
comm -13 outstanding-onsite.lst outstanding-offsite.lst > outstanding-just-offsite.lst

And finally make the lists in the order of preference for burning, giving priority to keys that are not on either backup disk set yet.
cat outstanding-common.lst outstanding-just-onsite.lst > outstanding-onsite.lst
cat outstanding-common.lst outstanding-just-offsite.lst > outstanding-offsite.lst
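
The comm steps above map directly onto set operations. As a sketch, the same algebra in Python with made-up keys (the real lists are files with one key per line):

```python
store = {"k1", "k2", "k3", "k4"}
on_onsite = {"k1", "k2"}
on_offsite = {"k1", "k3"}

# outstanding = store - exist (the comm -13 step)
outstanding_onsite = store - on_onsite
outstanding_offsite = store - on_offsite

# intersection (comm -12): keys not on either disk set
common = outstanding_onsite & outstanding_offsite
# difference (comm -23): keys missing from only one set
just_onsite = outstanding_onsite - outstanding_offsite

print(sorted(common))       # ['k4']
print(sorted(just_onsite))  # ['k3']
```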


I am going to be embedding this into a java program, but in a pinch this data processing is not hard to do.

There were 141k keys in the store list, and 80k keys in the resulting outstanding lists, and it did not take long to run these commands.

## Wednesday, April 14, 2010

### capturing an mp3 stream

I was offered the challenge of capturing an mp3 stream of live radio for later listening, so I threw together a short shell script for that purpose.

#!/bin/sh
url=http://out1.cbc.icy.abacast.com:80/cbc-cbcr1hfx-96
time=60
out=stream-$(date +%Y-%m-%d-%H%M).mp3

curl -s -G "$url" > "$out" &
curl_pid=$!

function cleanup() {
    kill $curl_pid
}

trap cleanup EXIT

sleep $time
# cleanup gets implicitly called here


url is the raw stream address, found in the mp3 playlist; the example stream is CBC Radio Halifax. time is the length to record in seconds. out is what to name the output file (stream-yyyy-mm-dd-HHMM.mp3).

## Tuesday, April 13, 2010

### dumping a table from a sqlite database

While trying to extract the text messages from an iPhone backup, I found a need to extract a table to a text file to make it easier to work with. This extraction ended up being the final result, since it was good enough and could be imported into a spreadsheet for easy viewing.

The table extraction just does a grid escape of the results of a select statement.

#!/usr/bin/env python
import sys, sqlite3

if len(sys.argv) != 3:
    print "expected database file name and table name"
    sys.exit()

file = sys.argv[1]
table = sys.argv[2]

def grid_escape(x):
    if x == None:
        return "\\N"
    x = unicode(x)
    return x.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n")

conn = sqlite3.connect(file)
c = conn.cursor()
c.execute('select * from ' + table)
print "\t".join([col[0] for col in c.description]).encode('utf-8')
for row in c:
    print "\t".join(map(grid_escape, row)).encode('utf-8')
conn.close()

The grid_escape function demonstrates how simple it is to encode that format, which is part of why I like it; it is trivial to produce in many languages.

## Monday, April 12, 2010

### setting gnome options

While searching for how to disable the face browser (the list of available users on the login screen) in GNOME 2.22, I found the command line method for updating system wide GNOME settings.
First, to disable the face browser and the shutdown buttons on the gnome login screen

sudo gconftool-2 --direct \
    --config-source xml:readwrite:/etc/gconf/gconf.xml.defaults \
    --type bool --set /apps/gdm/simple-greeter/disable_user_list true
sudo gconftool-2 --direct \
    --config-source xml:readwrite:/etc/gconf/gconf.xml.defaults \
    --type bool --set /apps/gdm/simple-greeter/disable_restart_buttons true

And while initially installing the server I wanted to change the power settings for the laptop lid, ideally from the command line. I see that I can do that with these commands,

sudo gconftool-2 --direct \
    --config-source xml:readwrite:/etc/gconf/gconf.xml.defaults \
    --type string --set /apps/gnome-power-manager/buttons/lid_ac blank
sudo gconftool-2 --direct \
    --config-source xml:readwrite:/etc/gconf/gconf.xml.defaults \
    --type string --set /apps/gnome-power-manager/buttons/lid_battery blank

to make closing the lid just blank the screen instead of going to sleep. I actually set these through the power configuration applet when I did them the first time.

## Sunday, April 11, 2010

### samba listening to ipv4 and ipv6

I was having a problem with windows boxes connecting to the server that is being built. After some investigating I figured out that samba was not listening on ipv4, just ipv6. The solution that worked for me I found on "Flameeyes's Weblog: Tip of the day: if Samba 3.4 fails to work...". I had to add listen directives to smb.conf.

[global]
    interfaces = 127.0.0.1 eth0
    bind interfaces only = yes

Now the file server is operational for all the computers on the network.

### simple samba configuration

I am building a new home server from a laptop with a smashed screen, and I just configured the file server part of it. After stripping out the comments and most of the configuration directives from the stock smb.conf file, I was left with a nice short file that works.
The objective was to have the home directory available for the users that are configured, and a public directory for anyone. There should also be no permission restrictions on the public directory; people should be able to modify each others' files.

The resulting /etc/samba/smb.conf file is

[global]
    dns proxy = no
    map to guest = Bad User
    guest account = guest
    unix extensions = no
    server string = Home server
    log file = /var/log/samba/log.%m
    max log size = 1000
    syslog = 0

[homes]
    browsable = no
    read only = no
    valid users = %S

[public]
    read only = no
    guest ok = yes
    force user = guest
    path = /export/public

I had also created a guest account for this setup.

sudo useradd guest -m -p \*

And of course the public directory to share

sudo mkdir /export
sudo mkdir /export/public
sudo chown guest /export/public

The server is behind a home router, which acts as a firewall; I would not put a configuration like this directly attached to the internet.

## Saturday, April 10, 2010

### dumping the list of tables in a sqlite database

I was poking through an iPhone backup for a friend, trying to recover the text message log after he accidentally deleted a bunch of messages. I found that the text messages were stored in a SQLite database. It also turned out that the backup was from before the time that the messages were deleted, so most of the deleted messages were intact.

I decided to try to dump the whole database to the simple grid format discussed previously. First I needed a list of tables to dump, and I decided to make this an exercise in learning some of the Python database API.

Here is a Python program to dump the list of tables in a SQLite database.
#!/usr/bin/python
import sys, sqlite3

if len(sys.argv) != 2:
    print "expected database file name"
    sys.exit()

file = sys.argv[1]

conn = sqlite3.connect(file)
c = conn.cursor()
c.execute('select name from sqlite_master where type=\'table\'')
for table in c:
    print table[0]
conn.close()

## Wednesday, April 7, 2010

### VNC Login on Ubuntu 9.10

I am setting up a new home server to replace the existing one, in the hope of reducing power consumption. The new server is a laptop with a broken screen. I just configured a VNC login screen, so here are the notes that I made. (The previous post used an earlier version of Ubuntu.)

The first requirement for the VNC login screen is an XDMCP server, such as GDM. The new GDM configuration applet does not provide much, but after a quick search I found a guide on Enabling XDMCP on Karmic Koala. I created the file /etc/gdm/custom.conf containing

[xdmcp]
Enable=true

and restarted gdm

sudo restart gdm

which enabled XDMCP.

I also installed a VNC server

sudo apt-get install vnc4server

which will provide the X server for VNC logins.

Finally, an inetd needed to be installed

sudo apt-get install xinetd

so that the vnc service can be configured.

To configure the VNC service I created the configuration file /etc/xinetd.d/vnc

service vnc
{
    type = UNLISTED
    port = 5900
    only_from = localhost 192.168.0.0/24
    disable = no
    socket_type = stream
    protocol = tcp
    wait = no
    user = nobody
    server = /usr/local/bin/xvnc-inetd
}

and the script /usr/local/bin/xvnc-inetd

#!/bin/sh
exec /usr/bin/Xvnc \
    -inetd \
    -query localhost \
    -once \
    -SecurityTypes=None \
    -pn \
    -extension XFIXES

which was named in the configuration. The script is just to keep the line lengths down, making things look cleaner.

Finally I restarted xinetd.

sudo /etc/init.d/xinetd restart

And it works. The problems I have with the setup at the moment are that the login screen lists the users instead of asking for the name to be typed, and that it allows shutdown of the system from the login screen.
### VNC Login on Ubuntu

I have a laptop running Ubuntu Desktop 9.04, which has been sitting on a shelf getting occasional remote use. I am also in the process of building a replacement home server for the house from a broken laptop, in the hope of replacing the current home server with one that uses less power. So, I went about figuring out how to get a gdm login screen over VNC, for remote graphical login.

I found that I needed to use the vnc4server package, instead of the other options for vnc servers, so that the keyboard would translate properly into the remote session.

sudo apt-get install vnc4server

I already knew that I needed to turn on XDMCP, but figuring out how was a little bit of a challenge, since each display manager has the option in a different place, and it also changes between versions of the display managers. I eventually found a graphical tool to turn it on under "System -> Administration -> Login Window"; in there I just set "Remote -> Style -> Same as Local" and XDMCP was turned on.

Now I needed an inetd to launch the vnc server from on incoming connections.

sudo apt-get install xinetd

Then add a vnc service name to /etc/services (if it is not already listed in that file)

vnc 5600/tcp

and add the vnc service to xinetd (/etc/xinetd.d/vnc)

service vnc
{
    only_from = localhost 192.168.0.0/24
    disable = no
    id = vnc
    socket_type = stream
    protocol = tcp
    wait = no
    user = nobody
    server = /usr/bin/Xvnc4
    server_args = -inetd -query localhost -once -SecurityTypes=None -extension XFIXES -pn -fp /usr/share/fonts/X11/misc/,/usr/share/fonts/X11/75dpi/,/usr/share/fonts/X11/100dpi/
    log_on_failure += USERID
}

The long server_args line could go into a script somewhere to make it shorter, if I feel the need later.
Connections are restricted to the local network only. After restarting xinetd it finally worked.

I will be setting this up again on the server that I am building, and if something is missing I will update it then.

## Tuesday, April 6, 2010

### ISO9660 Size Estimating

When I was coming up with the format for the backup disks I decided on a flat directory of numbered files, because it was simple and usually worked well. It also made each block addressable by the underlying operating system, externalizing some complexity, and let media defects have the least amount of impact, so that many partially burned disks could still be useful.

The first thought I had for filling the disks was to estimate the image size based on just the total size of the files being put on the disk. This worked well except where there were large numbers of small files (on the order of 100k files).

total = \sum{\lceil\frac{size}{2048}\rceil} \cdot 2048

For a while I used this estimator with a pad factor based on the number of files, which worked fairly well but sometimes needed adjustment.

Last June I did some research, found the ISO9660 filesystem specifications, and wrote up a better estimator for my specific case:

• All files are 8.3 format.
• There are no extended formats.
• All the files are in the root directory.

The resulting estimator for the image size in 2 KiB blocks is

174 + \lfloor\frac{count}{42}\rfloor + \sum{\lceil\frac{size}{2048}\rceil}

where 174 is the disk overhead including padding, super blocks, and path tables; count is the number of file names on the disk; and size is the size of each file. This can be calculated incrementally with just two variables, a count and the sum of sizes so far.

This estimator came out to be right on every time.
Some day I may even use the ECMA-119 Standard (which got approved as the ISO9660 standard) to directly write a disk image instead of populating a directory and using mkisofs to make the image, but I have more important things to do before that.

Here is that same estimator as a PERL code example, calculating 1000 instances of a 2048 byte file.

#!/usr/bin/perl -w
use strict;
use POSIX;

sub sum {
    my $out = 0;
    for (@_) {
        $out += $_;
    }
    return $out;
}

my @sizes = ( 2048 ) x 1000;

my $file_count = @sizes;
my $data_size = sum(map { ceil($_ / 2048) } @sizes);
my $dir_size = floor($file_count / 42) + 1;
my $overhead = 173;

my $size = $overhead + $dir_size + $data_size;

$\ = "\n";
print $size;
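
As a cross-check, here is a minimal Python translation of the same estimator (the function name is mine):

```python
import math

def iso_size_blocks(sizes):
    # estimated image size in 2 KiB blocks: fixed overhead (including
    # the first directory block), root directory blocks at 42 8.3
    # entries per block, then the file data
    overhead = 174
    dir_blocks = len(sizes) // 42
    data_blocks = sum(math.ceil(s / 2048) for s in sizes)
    return overhead + dir_blocks + data_blocks

print(iso_size_blocks([2048] * 1000))  # 1197
```

For the 1000-file example this agrees with the PERL version: 174 + 23 + 1000 = 1197 blocks.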


## Monday, April 5, 2010

### splitting a PDF file into one page chunks

Yesterday I found how to split and merge PDF files using the iText Java library.

Today I will post the split program, which I simplified and cleaned up from an example.

package org.yi.happy.pdf;

import java.io.FileOutputStream;

import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;

public class PdfSplitMain {
    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.out.println("use: infile [outbase]");
            return;
        }

        String inFile = args[0];

        String outBase;
        if (args.length < 2) {
            if (inFile.endsWith(".pdf")) {
                outBase = inFile.substring(0, inFile.length() - 4) + "-";
            } else {
                outBase = inFile;
            }
        } else {
            outBase = args[1];
        }

        PdfReader reader = new PdfReader(inFile);
        try {
            for (int i = 1; i <= reader.getNumberOfPages(); i++) {
                String outFile = outBase + String.format("%04d", i) + ".pdf";

                Document document = new Document();
                FileOutputStream output = new FileOutputStream(outFile);
                try {
                    PdfCopy writer = new PdfCopy(document, output);
                    document.open();
                    PdfImportedPage page = writer.getImportedPage(reader, i);
                    writer.addPage(page);
                    document.close();
                    writer.close();
                } finally {
                    output.close();
                }
            }
        } finally {
            reader.close();
        }
    }
}


Basically the input file is opened, and each page from it is written to a separate output file.

After figuring out these programs, with the intention of being able to remix PDF files along page boundaries, I learned that the Preview application on the Mac that I am using can edit PDF files very easily, so I will probably be using that instead most of the time.

## Sunday, April 4, 2010

### joining pdf files

I want to be able to split and join PDF files to allow me to remix virtual paper like I do physical paper, at least on page boundaries. Most of the pages I deal with come from scans, so are actually images of pages.

After doing a brief search I came across an iText tutorial on merging PDF files. I then took the example code and vastly simplified it for my own purposes.

This code requires the iText jar file and Java 5.

package org.yi.happy.pdf;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.Arrays;

import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;

public class PdfJoinMain {
    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("use: output input...");
            return;
        }

        String[] pdfs = Arrays.copyOfRange(args, 1, args.length);
        OutputStream output = new FileOutputStream(args[0]);
        try {
            Document document = new Document();
            PdfCopy writer = new PdfCopy(document, output);
            document.open();
            for (String pdf : pdfs) {
                FileInputStream input = new FileInputStream(pdf);
                try {
                    PdfReader reader = new PdfReader(input);
                    for (int p = 1; p <= reader.getNumberOfPages(); p++) {
                        PdfImportedPage page = writer
                                .getImportedPage(reader, p);
                        writer.addPage(page);
                    }
                    reader.close();
                } finally {
                    input.close();
                }
            }
            document.close();
            writer.close();
        } finally {
            output.close();
        }
    }
}


The code opens the output PDF file, then for each input PDF file copies each page from the input into the output, and finally closes the output file. There are try ... finally constructs to ensure that the files get closed. When the program is invoked with too few arguments it prints a usage line and exits.

I feel that this program may actually be counter-intuitive, since the target comes first on the command line instead of last; with copy and move (cp and mv), for instance, the target comes last.

The splitting code is similar, and will be posted tomorrow.

## Saturday, April 3, 2010

### grid file format

I do data analysis on the command line at times. One of the tools that gets much use is cut, which extracts delimited columns from lines. Another that gets well used is PERL, to make simple summary or translation filters.

When doing this kind of data analysis I like to use tab and newline delimited text, which I like to call grid. I find CSV harder to work with because its delimiters can also show up inside the values.

grid := line*
line := field* ( cr | lf | cr lf )
field := ( [ not cr lf tab "\" ] | "\r" | "\n" | "\t" | "\\" )* | "\N"


This is actually an 8-bit clean format: any binary value can be put in a field as long as the escape character (backslash), the newline characters, and the tab character are escaped. Finally, this format can also support database NULL values, with "\N". I typically use UTF-8 encoded text in this format.
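
Producing the format is just a few string replacements; a minimal Python 3 sketch (the function names are mine):

```python
def grid_escape(value):
    # None becomes the database NULL marker, \N
    if value is None:
        return "\\N"
    value = str(value)
    # escape the backslash first, then the delimiter characters
    return (value.replace("\\", "\\\\")
                 .replace("\t", "\\t")
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))

def grid_line(fields):
    # one line of the grid: tab separated fields, newline terminated
    return "\t".join(grid_escape(f) for f in fields) + "\n"

print(grid_line(["key", 42, None]))
```

Parsing is the reverse: split on tab, then undo the escapes, treating a bare \N as NULL.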

I originally saw this format in the MySQL manual, it is the import data format.

The main reason I use this format is that it is extremely easy to parse and write, and it also works with many of the stock text processing tools that come with the shell.

## Friday, April 2, 2010

### reversing log

I was tutoring, and one particular type of problem came up as difficult, solving for a variable in a log.

2 = \log x

To solve this, first put both sides as exponents of 10, which will reverse the log. The hard part seems to be here.
10^2 = 10^{\log x}

and simplify (10^{\log x} = x).
100 = x


To make it harder, multiply the x by a constant inside the log.
2 = \log 2x

put both sides as exponents of 10
10^2 = 10^{\log 2x}

simplify
100 = 2x

divide both sides by 2
\frac{100}{2} = \frac{2x}{2}

simplify
50 = x


and this time add something to the x inside the log
2 = \log\left( 2x + 1 \right)

put both as exponents of 10
10^2 = 10^{\log\left( 2x + 1 \right)}

simplify
100 = 2x + 1

subtract 1 from both sides
100 - 1 = 2x + 1 - 1

simplify
99 = 2x

divide both sides by two
\frac{99}{2} = \frac{2x}{2}

simplify
\frac{99}{2} = x


What can I do to make it harder? Two layers of log? Two layers of logs with different bases? That should be plenty hard.

2 = \ln\log x

The \ln is the outside log, so it needs to be dealt with first; make both sides exponents of e
e^2 = e^{\ln\log x}

simplify
e^2 = \log x

make both sides exponents of 10
10^{e^2} = 10^{\log x}

simplify
10^{e^2} = x


And now for one that looks a bit scary, the log is in an exponent.
2 = e^{\log x}

First there is the e^x to deal with, so take the natural logarithm of both sides.
\ln 2 = \ln e^{\log x}

simplify
\ln 2 = \log x

and put both sides on an exponent of 10, as in the previous solutions
10^{\ln 2} = 10^{\log x}

simplify
10^{\ln 2} = x


It is all a matter of removing one layer at a time to get to the thing being solved for.
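
The answers worked out above are easy to verify numerically; a quick check in Python:

```python
import math

# 2 = log10(x)  ->  x = 100
assert math.isclose(math.log10(100), 2)

# 2 = log10(2x)  ->  x = 50
assert math.isclose(math.log10(2 * 50), 2)

# 2 = log10(2x + 1)  ->  x = 99/2
assert math.isclose(math.log10(2 * (99 / 2) + 1), 2)

# 2 = ln(log10(x))  ->  x = 10^(e^2)
assert math.isclose(math.log(math.log10(10 ** math.e ** 2)), 2)

# 2 = e^(log10(x))  ->  x = 10^(ln 2)
assert math.isclose(math.e ** math.log10(10 ** math.log(2)), 2)

print("all solutions check out")
```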


## Thursday, April 1, 2010

### revealing the algebraic fallacy, 2 != 1

Previously, using only basic algebraic manipulation, I showed that 2 = 1, which is of course false. Now I am going to show where it went wrong, by doing the same steps as before, using the same forms as before.

a = b


Substitute b for a in the entire rest of the article, since they are the same.

b = b


we can do whatever we want to an equation as long as we do the same thing to both sides, so first add b

b + b = b + b


partially simplify

2b = b + b


subtract 2b

2b - 2b = b + b - 2b


partially group like terms

2b - 2b = b + (1-2)b


simplify

2b - 2b = b + (-1)b


adding a negative number is the same as subtracting

2b - 2b = b - b


factor out (b - b)

2(b - b) = 1(b - b)


divide by (b - b), which is also known as zero.

\frac{2(b - b)}{b - b} = \frac{1(b - b)}{b - b}


re-factor to isolate the one; now we have zero over zero, and anything can happen

2\left(\frac{b - b}{b-b}\right) = 1\left(\frac{b - b}{b-b}\right)


reduce the fraction to one, which is exactly the step we are not allowed to take

2(1) = 1(1)


simplify

2 = 1


Division by zero is not allowed, and zero over zero is just undefined.