Saturday, April 3, 2010

grid file format

I sometimes do data analysis on the command line. One tool I use heavily is cut, which extracts delimited columns from lines. Another is Perl, which I use to write simple summary or translation filters.

When doing this kind of data analysis I like to use tab- and newline-delimited text, which I like to call grid. I find CSV harder to work with because its delimiters can also appear inside the values.

grid := line*
line := field ( tab field )* ( cr | lf | cr lf )
field := ( [ not cr lf tab "\" ] | "\r" | "\n" | "\t" | "\\" )* | "\N"

This is actually an 8-bit clean format: any binary value can be put in a field as long as the escape character (backslash), the newline characters (CR and LF), and the tab character are escaped. The format also supports database NULL values, written as "\N". I typically use UTF-8 encoded text in this format.
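As a minimal sketch of the escaping rules, here is how a single field could be encoded and decoded in Python. The names escape_field and unescape_field are my own, made up for illustration, and the sketch works on text strings rather than raw bytes.

```python
# Hypothetical helpers; only the escape rules come from the grammar.
ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\r": "\\r"}
UNESCAPES = {"\\": "\\", "t": "\t", "n": "\n", "r": "\r"}


def escape_field(value):
    """Encode one value as a grid field; None becomes the NULL marker."""
    if value is None:
        return "\\N"
    return "".join(ESCAPES.get(ch, ch) for ch in value)


def unescape_field(text):
    """Decode one grid field; the NULL marker becomes None."""
    if text == "\\N":
        return None
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            # A backslash must be followed by one of: \ t n r
            if i + 1 >= len(text) or text[i + 1] not in UNESCAPES:
                raise ValueError("bad escape at offset %d" % i)
            out.append(UNESCAPES[text[i + 1]])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Because a literal backslash is always doubled, a value that really is a backslash followed by N is written with two backslashes before the N and never collides with the NULL marker.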

I originally saw this format in the MySQL manual, where it is the import data format.

The main reason I use this format is that it is extremely easy to parse and write, and it also works with many of the stock text processing tools that come with the shell.
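To illustrate how little parsing is needed, here is a rough Python sketch of a cut-like column filter. The name cut_columns is hypothetical. It relies on the fact that a raw tab can never appear inside a field, so a plain split is safe, and it leaves the fields in their escaped form just as cut would.

```python
def cut_columns(lines, columns):
    """Yield the selected zero-based columns of each grid line, tab-joined."""
    for line in lines:
        # Drop the line terminator (cr, lf, or cr lf), then split on raw tabs.
        fields = line.rstrip("\r\n").split("\t")
        yield "\t".join(fields[c] for c in columns)


rows = ["id\tname\tcity\n", "1\t\\N\tOslo\r\n"]
for row in cut_columns(rows, [0, 2]):
    print(row)
```

The same split would be wrong for CSV, where a comma inside a quoted value is not a delimiter.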
