off by one for 2012


Mon Feb 27 13:17:31 PST Storing Research Data -- SQLite vs. The Other Guys :

I generally have a real aversion to using a full-on SQL implementation like MySQL or Postgres when it's not really necessary... and I typically think it's unnecessary until something forces me to change my mind. When I started writing small projects in Perl, I used to use Berkeley DB key-value pairs — lately I've used Python Pickles for similar purposes. It's simple and quick, which is nice, but it basically forces you to roll your own code.

Lately, everyone's been getting on the SQLite bandwagon, and it's pretty awesome. I've moved to using SQLite as my first choice when storing anything beyond flat text data. It has nice portability characteristics (unlike your homemade solution), simple backup and export formats. And, being able to make queries on the data is great for me, since these days most of my data are experimental results, etc.

But another good reason for using an external, portable data sink became obvious when I started to visualize and analyze my data using R. R has the RSQLite package, which made importing data into R for plotting and analysis a breeze. (Or as much of a breeze as anything is in R.) The thing is, roll-your-own formats may be perfectly "good enough" for isolated projects, but the minute you start wanting to use a different tool to view the data, having it already be in a format that is easily accessible is a major win. And if you're like me, you won't always know that in two months you're going to want to use the data in some completely different way. So I feel like I got that capability for "free" just because I decided to store my data in a more structured and portable way.

But SQLite is targeted at a specific niche — projects that would benefit from SQL behaviors but don't need all the robustness and consistency guarantees of "enterprise" databases. While you get good performance (because SQLite works inside your code, rather than through RPCs), if you start wishing for read/write concurrency (say, importing new data and plotting other unrelated data at the same time) you may find yourself frustrated with SQLite's limitations. That's what happened to me. As I started to generate large numbers of plots, I wanted to be able to run multiple scripts (some of which generate plots, some of which update other tables) at the same time — SQLite can balk at this.

So, I switched to a more traditional SQL database, which itself was relatively painless because of the underlying SQL standard. (R has libraries for it as well — another good reason not to roll your own even if you don't need all that SQL provides.) And in turn, this highlighted another unexpected value for the more "enterprise" systems: caching. Re-running long queries just to tweak a plot is considerably quicker with the big server as opposed to SQLite.

None of this is new information — in fact, it's a pretty textbook tour of the hows and whys of data storage. But for me, it inspired a change of heart, thanks to the convenience factor. In the future, I'll probably start with a Big Dog SQL server for research projects (running on my personal laptop) because it avoids the papercuts encountered when my projects get to big for SQLite. But I'll stick to SQLite for simpler things, to avoid the dependencies created by the more enterprise approach.

Sat Mar 10 08:31:42 PST full moon names:

Man, why use lame ego-stroking Roman names like these when we could be using names like these?

Tue Oct 09 08:53:46 PST Doing testing and need to decrypt captured SSL traffic?:

now you can!. (via clickolinko)

Tue Oct 09 13:10:09 PST there IS a method to the madness...:

The vim git syntax was driving me crazy. Turns out I'm not the only one, but there is a reason for the madness.

Thu Oct 11 09:42:30 PST Principle of Least Responsibility:

A lot of the code I write these days is for running experiments (or analyzing data from experiments). And I'm always trying to automate my testing more, in order to keep the system from being idle when there are more cases to test. So, given some experimental program E, and a set of inputs to test {1,2,3}, it's tempting to give E the power (through the magic of computer programming) to take in a list of tasks to work on -- so you can tell it "do {1,2,3}" instead of telling it "do 1", then telling it "do 2", and then "do 3".

But there's a problem here, especially if your tests take a long time to complete. If your code crashes (or a test fails) in the middle of "do {1,2,3}" it may not be obvious where in the test sequence the code failed. Or, even if it is obvious, it may not be easy to pick up the tasks from where things crashed.

Instead, it's more robust to write a simple wrapper W that, when given the inputs {1,2,3}, picks them off one at a time, calling E with the individual tasks. It may take a little extra work to do things this way, but it will pay dividends in the long run.

For example, another benefit is intelligent management of repetitions of experiments. Suppose you have tasks {1,2,3}, but you want to run each of them 3 times. A naive approach will run {1,1,1,2,2,2,3,3,3} but then you have potential ordering bias. And, if (say) tests {1,1,1} succeed, but the 2nd test 2 fails, what happens to your results?

Using a wrapper W, you can generate the list of experiments, randomize them, and write them to a file. W can track the success or failure of each test, keeping your experiments running but also separating the wheat from the chaff. This approach can also give you a good way to track progress through a set of experiments.

My "task log" usually looks like a shuffled file with one line for each task, or a directory of files with each file representing one task; this also allows you to modify the task set on the fly -- adding tests to the end of the task log. W doesn't remove a task until it's been completed successfully, but it moves on to subsequent tasks so it doesn't get stuck in a pathological case. (You could have it put failed tests in a separate list.) Anyway, all this allows for better recovery and progress tracking.

Anyway, I know there are lots of names for this general principle of abstraction or compartmentalization, but I'm calling it the "Principle of Least Responsibility" -- with the first working definition being "give a piece of code the least amount of responsibility that is practical". I say "practical", not "possible" because taking it to the extreme could be burdensome.

This is pretty similar to the UNIX philosophy of "Do one thing and do it well." In my case, it specifically means "don't make the code hold the state of a sequence of experiments because all it needs to be responsible for is the single experiment it is currently performing!"

In the moment, it may seem like a great idea to let your code do all the heavy lifting for you, but a little abstraction/functional decomposition can save you a lot of grief in the long run.

Thu Oct 11 10:26:21 PST one less paper cut:

As a vi user who codes in a GNOME environment with sloppy keyboard habits (you'd think there wouldn't be that many of us, but... surprise!), I hate the GNOME default of having F1 bound to the help menu. I'm constantly hitting F1 when reaching for ESC, so I regularly get annoying help windows popping up.

For a long time, I assumed that it was the top-level GNOME help document -- you know, like "how to use GNOME." I'm still not sure if this was ever the case. But today, I noticed that it's actually showing me the help file for GNOME Terminal. I scoffed that I would actually care to read the help file for my terminal emulator.

The irony is that if I had *read* that annoying help file, I might have actually found the answer to my problem. You can disable it (and set a lot of other nice shortcuts) under Edit | Keyboard Shortcuts.

Fri Oct 12 07:09:54 PST groan:

Chemists at UCLA experiment with the reagents of the University of California. *badaboom*

Sat Oct 13 08:33:39 PST space shuttle in los angeles:

pretty incredible.

Tue Oct 16 12:05:03 PST Sometimes I can be a little too sensitive...:

Case-sensitivity is a given in most *NIX and programming environments. But in your editors and tools, case sensitivity is a double-edged sword. Sometimes it's nice to have case-insensitive search so you can find every matching sequence, but just as often I end up getting "too many hits". And doing find and replace insensitively can be really annoying. If only there was a smarter way to manage case!

There is. vim has a neat feature called "smartcase" that searches insensitively if you use all lower-case, but searches case-sensitively if you include any capital letters in your search. So,


... would find "foo", "Foo", "fOO", etc., but


... would only match on "Foo".

To enable smartcase, you also need to enable case-insensitivity. You can put this in your .vimrc:

set ignorecase " Do case insensitive matching

set smartcase " I'm sensitive, but not too sensitive.

However, if you want to do this on a case by case basis, you can do this in vim at runtime with :set ignorecase and :set smartcase. (:set nosmartcase and :set noignorecase will undo the option.) Happy hacking!

Wed Dec 12 09:32:06 PST a moment of silence:

Linux no longer supports 386.

Tue Dec 18 17:06:46 PST hey, I should do^H^H^H^H see if someone has done that:

So you're diving through folders, looking for some files or data. Of course you're using Nautilus since you're running GNU/Linux and Gnome*. You get to your file, "ah ha, you found it" only to realize that you really wish you had a command line to slice and dice the file. What you'd love is to be able to just "open a terminal here". I just had that experience, and like usual (thanks to the long tail), someone else has already done this. If you're using Ubuntu (or possibly Debian) you can simply 'apt-get install nautilus-open-terminal'. Enjoy!

* No offense implied to anyone else, I'm just being silly.

off by one for 2012



Unless otherwise noted, all content licensed by Peter A. H. Peterson
under a Creative Commons License.