Russell's Blog

New. Improved. Stays crunchy in milk.

Fun with de Bruijn graphs

Posted by Russell on October 29, 2010 at 4:34 a.m.
One of the projects I'm working on right now involves searching for better approaches to assembling short-read metagenomic data. Many of the popular short read assembly algorithms rely on a mathematical object called a de Bruijn graph. I wanted to play around with these things without having to rummage around in the guts of a real assembler. Real assemblers have to be designed with speed and memory conservation in mind -- or, at least they ought to be. So, I decided to write my own de Bruijn graph code. My implementation is written in pure Python, so it's probably not going to win any points for speed (I may add some optimization later). However, it is pretty useful if all you want is to tinker around with de Bruijn graphs.

Anyway, here is the de Bruijn graph for the sequence gggctagcgtttaagttcga projected into 4-mer space :

This is the de Bruijn graph in 32-mer space for a longer sequence (it happens to be a 16S rRNA sequence for a newly discovered, soon-to-be-announced species of Archaea).

It looks like a big scribble because it's folded up to fit into the viewing box. Topologically, it's actually just two long strands: one for the forward sequence, and one for its reverse complement. There are only four termini, and if you follow them around the scribble, you won't find any branching.
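
For anyone who wants to poke at the idea without any library at all, here is a minimal sketch. It is not the implementation described above: the function names (de_bruijn, termini, branch_points) are made up for illustration, and it assumes that nodes are k-mers and that an edge joins two k-mers whenever one directly follows the other in an input sequence.

```python
# Illustrative sketch only; names and graph conventions are assumptions.
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def de_bruijn(seqs, k):
    """Map each k-mer to the set of k-mers that directly follow it."""
    graph = {}
    for seq in seqs:
        mers = list(kmers(seq, k))
        for mer in mers:
            graph.setdefault(mer, set())  # make sure every k-mer is a node
        for a, b in zip(mers, mers[1:]):
            graph[a].add(b)
    return graph


def termini(graph):
    """Nodes with no incoming edge or no outgoing edge."""
    has_incoming = {b for succ in graph.values() for b in succ}
    return {n for n, succ in graph.items() if not succ or n not in has_incoming}


def branch_points(graph):
    """Nodes with more than one incoming or more than one outgoing edge."""
    indegree = {n: 0 for n in graph}
    for succ in graph.values():
        for b in succ:
            indegree[b] += 1
    return {n for n in graph if len(graph[n]) > 1 or indegree[n] > 1}


seq = "gggctagcgtttaagttcga"
graph = de_bruijn([seq], 4)
print(len(graph))              # 17 distinct 4-mers
print(sorted(termini(graph)))  # ['gggc', 'tcga']
print(branch_points(graph))    # set(): the forward strand alone never branches
```

On the 20-base example at k = 4, the forward strand is a single unbranched path with two termini. Passing both seq and reverse_complement(seq) at such a small, even k makes the two strands collide at 4-mers that are their own reverse complement (ctag, for instance), so the clean two-strand picture is easier to get at larger k.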

raygun : a simple NCBI BLAST wrapper for Python

Posted by Russell on October 25, 2010 at 11:42 a.m.
Things have been a little quiet here for the last couple of weeks, but a lot of frantic activity has been going on behind the peaceful lack of blog updates. When I returned from Kamchatka, Jonathan had a little present for me -- he took the DNA from the 2005 Uzon field season for Arkashin and Zavarzin hotsprings, and ran a whole lane of paired-end sequencing on one of our Illumina machines. Charlie made some really beautiful libraries, and the data is really, really excellent. For the last couple of weeks, I've been trying to do justice to it.

I'm not quite ready to talk about what I've been finding, but I thought I would take a moment to share a little tool I wrote along the way. It's made my life a lot easier, and maybe other people could get some use out of it.

It's called raygun, a very simple Python interface for running local NCBI BLAST queries. You initialize a RayGun object with a FASTA file containing your target sequences, and then you can query it with strings or other FASTA files. It parses the BLAST output into a list of dictionary objects, so that you can get right to work.

It doesn't take a lot of scripting chops to do this without an interface, of course, and there are other Python tools for running BLAST queries. The advantage of raygun over either the DIY approach or the BioPython approach is that raygun is extremely simple to use. I wanted something that would basically be point-and-shoot :

import raygun
import cleverness

rg = raygun.RayGun( 'ZOMG_DNA_OMG_OMG.fa' )

hits = rg.blastfile( 'very_clever_query.fa' )

results = []
for hit in hits :
    results.append( cleverness.good_idea( hit[ 'subject' ] ) )

cleverness.output_phd_thesis( results )
Unfortunately, you must furnish your own implementation of the cleverness module.
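
The post doesn't show how BLAST output becomes a list of dictionaries, so here is a rough sketch of that one step. It is not raygun's code. It assumes BLAST's standard 12-column tabular output (-m 8 in legacy blastall, -outfmt 6 in BLAST+), and every key name except 'subject', which appears in the example above, is a guess made for illustration. The sample line is made up.

```python
# Illustrative sketch only; not raygun's parser. Key names are assumptions.
FIELDS = [
    ("query", str), ("subject", str), ("identity", float),
    ("length", int), ("mismatches", int), ("gap_opens", int),
    ("q_start", int), ("q_end", int), ("s_start", int), ("s_end", int),
    ("evalue", float), ("bitscore", float),
]


def parse_tabular(text):
    """Turn 12-column tabular BLAST output into a list of dicts, one per hit."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comment lines
            continue
        values = line.split("\t")
        hits.append({name: cast(v) for (name, cast), v in zip(FIELDS, values)})
    return hits


# Made-up example line, not real data.
sample = "read1\tcontigA\t98.50\t100\t1\t0\t1\t100\t201\t300\t1e-50\t180\n"
for hit in parse_tabular(sample):
    print(hit["subject"], hit["identity"], hit["evalue"])
```

Casting the numeric columns up front means a loop like the one above can filter on hit['evalue'] or hit['identity'] without converting strings each time.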

I designed raygun with interactive use in mind, particularly with ipython (by the way, if you do a lot of work in Python and you're not using ipython, you're being silly). The code is available on github.