Russell's Blog

New. Improved. Stays crunchy in milk.

Blogging my candidacy exam

Posted by Russell on March 04, 2012 at 4:02 p.m.
(This is cross-posted from a guest article I wrote on Jonathan's Blog last week. I thought it would be cool to have it on my own blog too.)

Because this seems to be my default mode of organizing my thoughts when it comes to research, I've decided to write my dissertation proposal as a blog post. This way, when I'm standing in front of my committee on Thursday, I can simply fall back on one of my more annoying habits: talking at length about something I wrote on my blog. Or, since he has graciously lent me his megaphone for the occasion, I can talk at length about something I wrote on Jonathan's blog.

Introduction : Seeking a microbial travelogue

Last summer, I had a lucky chance to travel to Kamchatka with Frank Robb and Albert Colman. It was a learning experience of epic proportions. Nevertheless, I came home with a puzzling question. As I continued to ponder it, the question went from puzzling to vexing to maddening, and eventually became an unhealthy obsession. In other words, a dissertation project. In the following paragraphs, I'm going to try to explain why this question is so interesting, and what I'm going to do to try to answer it.

About a million years ago (the mid-Pleistocene), one of Kamchatka's many volcanoes erupted and collapsed into its magma chamber to form Uzon Caldera. The caldera floor is now a spectacular thermal field, and one of the most beautiful spots on the planet. I regularly read through Igor Shpilenok's Livejournal, where he posts incredible photographs of Uzon and the nature reserve that encompasses it. It's well worth bookmarking, even if you can't read Russian.

The thermal fields are covered in hot springs of many different sizes. Here's one of my favorites :

Each one of these is about the size of a bowl of soup. In some places the springs are so numerous that it is difficult to avoid stepping in them. You can tell just by looking at these three springs that the chemistry varies considerably; I'm given to understand that the different colors are due to the dominant oxidation species of sulfur, and the one on the far left was about thirty degrees hotter than the other two. All three of them are almost certainly colonized by fascinating microbes.

The experienced microbiologists on the expedition set about the business of pursuing questions like Who is there? and What are they doing? I was there to collect a few samples for metagenomic sequencing, and so my own work was completed on the first day. I spent the rest of my time there thinking about the microbes that live in these beautiful hot springs, and wondering How did they get there?

Extremophiles are practically made-to-order for this question. The study of extremophile biology has been a bonanza for both applied and basic science. Extremophiles live differently, and their adaptations have taught us a lot about how evolution works, about the history of life on earth, about biochemistry, and all sorts of interesting things. However, their very peculiarity poses an interesting problem. Imagine you would freeze to death at 80° Celsius. How does the world look to you? Pretty inhospitable; a few little ponds of warmth dotted across vast deserts of freezing death.

Clearly, dispersal plays an essential role for the survival and evolution of these organisms, yet we know almost nothing about how they do it. The model of microbial dispersal that has reigned supreme in microbiology since it was first proposed in 1934 is Lourens Baas Becking's "alles is overal: maar het milieu selecteert" (everything is everywhere, but the environment selects). This is a profound idea; it asserts that microbial dispersal is effectively infinite, and that differences in the composition of microbial communities are due to selection alone. The phenomenon of sites that seem identical but have different communities is explained as a failure to understand and measure their selective properties well enough.

This model has been a powerful tool for microbiology, and much of what we know about cellular metabolism has been learned by the careful tinkering with selective growth media it exhorts one to conduct. Nevertheless, the Baas Becking model just doesn't seem reasonable. Microbes do not disperse among the continents by quantum teleportation; they must face barriers and obstacles, some perhaps insurmountable, as well as conduits and highways. Even with their rapid growth and vast numbers, this landscape of barriers and conduits must influence their spread around the world.

Ecologists have known for a very long time that these barriers and conduits are crucial evolutionary mechanisms. Evolution can be seen as an interaction of two processes: mutation and selection. The nature of the interaction is determined by the structure of the population in which they occur. This structure is determined by biological processes such as sexual mechanisms and recombination, which are in turn determined chiefly by the population's distribution in space and its migration in that space.

As any sports fan knows, the structure of a tournament can be more important than the outcome of any particular game, or even the rules of the game. This is true for life, too. From one generation to the next, genes are shuffled and reshuffled through the population, and the way the population is compartmentalized sets the broad outlines of this process.

A monolithic population -- one in which all players are in the same compartment -- evolves differently than a fragmented population, even if mutation, recombination and selection pressures are identical. And so, if we want to understand the evolution of microbes, we need to know something about this structure. Baas Becking's hypothesis is a statement about the nature of this structure, specifically, that the structure is monolithic. If true, it means that the only difference between an Erlenmeyer flask and the entire planet is the number of unique niches. The difference in size would be irrelevant.

This is a pretty strange thing to claim. And yet, the Baas Becking model has proved surprisingly difficult to knock down. For as long as microbiologists have been systematically classifying microbes, whenever they've found similar environments, they've found basically the same microbes. Baas Becking proposed his hypothesis in an environment of overwhelming evidence.

However, as molecular techniques have allowed researchers to probe deeper into the life and times of microbes (and every other living thing), some cracks have started to show. Rachel Whitaker and Thane Papke have challenged the Baas Becking model by looking at the biogeography of thermophilic microbes (such as Sulfolobus islandicus and Oscillatoria amphigranulata), first by 16S rRNA phylogenetics and later using high resolution, multi-locus methods. Both Rachel's work and Papke's work, as well as many studies of disease evolution, very clearly show that when you look within a microbial species, the populations do not appear quite so cosmopolitan. While Sulfolobus islandicus is found in hot springs all over the world, the evolutionary distance between each pair of its isolates is strongly correlated with the geographic distance between their sources. So, these microbes are indeed getting around the planet, but if we look at their DNA, we see that they are not getting around so quickly.

However, Baas Becking has an answer for this; "...but the environment selects." What if the variation is due to selection acting at a finer scale? It's well established that species sorting effects play a major role in determining the composition of microbial communities at the species level. There is no particular reason to believe that this effect does not apply at smaller phylogenetic scales. The work with Sulfolobus islandicus attempts to control for this by choosing isolates from hot springs with similar physical and chemical properties, but unfortunately there is no such thing as a pair of identical hot springs. Just walk the boardwalks in Yellowstone, and you'll see what I mean. The differences among the sites from which these microbes were isolated can always be offered as an alternative explanation to dispersal. Even if you crank those differences down to nearly zero, one can always suggest that perhaps there is a difference that we don't know about that happened to be important.

This is why the Baas Becking hypothesis is so hard to refute: One must simultaneously establish that there is a non-uniform phylogeographic distribution, and that this non-uniformity is not due to selection-driven effects such as species sorting or local adaptive selection. To do this, we need a methodology that allows us to simultaneously measure phylogeography and selection.

There are a variety of ways of measuring selection. Jonathan's Evolution textbook has a whole chapter about it. I'll go into a bit more detail in Aim 3, but for now, I'd just like to draw attention to the fact that the effect of selection does not typically fall uniformly across a genome. This non-uniformity tends to leave a characteristic signature in the nucleotide composition of a population. Selective sweeps and bottlenecks, for example, are usually identified by examining how a population's nucleotide diversity varies over its genome.

For certain measures of selection (e.g., linkage disequilibrium) one can design a set of marker genes that could be used to assay the relative effect of selection among populations. This could then extend the single species, multi-locus phylogenetic methods that have already been used to measure the biogeography of microbes to include information about selection. This could, in principle, allow one to simultaneously refute "everything is everywhere..." and "...but the environment selects." However, designing and testing all those markers, ordering all those primers and doing all those PCR reactions would be a drag. If selection turned out to work a little differently than initially imagined, the data would be useless.

But, these are microbes, after all. If I've learned anything from Jonathan, it's that there is very little to be gained by avoiding sequencing.

We're getting better and better at sequencing new genomes, but it is not a trivial undertaking. However, re-sequencing genomes is becoming routine enough that it's replacing microarray analysis for many applications. The most difficult part of re-sequencing an isolate is growing the isolate. Fortunately, re-sequencing is particularly well suited for culture-independent approaches. As long as we have complete genomes for the organisms we're interested in, we can build metagenomes from environmental samples using our favorite second-generation sequencing platform. Then we simply map the reads to the reference genomes. The workflow is a bit like ChIP-seq, except without culturing anything and without the ChIP. We go directly from the environmental sample to sequencing to read-mapping. Maybe we can call it Eco-seq? That sounds catchy.

Not only is the whole-genome approach better, but with the right tools, it is easier and cheaper than multi-locus methods, and allows one to include many species simultaneously. The data will do beautifully for phylogeography, and have the added benefit that we can recapitulate the multi-locus methodology by throwing away data, rather than collecting more.

To implement this, I have divided my project into three main steps :

  • Aim 1 : Develop a biogeographical sampling strategy to optimize representation of a natural microbial community
  • Aim 2 : Develop and apply techniques for broad metagenomic sampling, metadata collection and data processing
  • Aim 3 : Test the dispersal hypothesis using a phylogeographic model with controls for local selection
But, before I get into the implementation, I should pause for a moment and make sure I've stated my hypothesis perfectly clearly : I think that dispersal plays a major role in establishing the composition of microbial communities. The Baas Becking hypothesis doesn't deny that dispersal happens; in fact, it asserts that dispersal is infinite, but that it is selection, not dispersal, that ultimately determines which microbes are found in any particular place. If I find instead that dispersal itself plays a major role in determining community composition, then the world is a very different place to be a microbe.

Aim 1 : Develop a biogeographical sampling strategy to optimize the representation of a complete natural community

While I would love to keep visiting places like Kamchatka and Yellowstone, I've decided to study the biogeography of halophiles, specifically in California and neighboring states. Firstly, because I can drive and hike to most of the places where they grow. Secondly, because the places where halophiles like to grow tend to be much easier to get permission to sample from. Some of them are industrial waste sites; no worry about disturbing fragile habitats. Thirdly, because our lab has been heavily involved in sequencing halophile genomes, which are a necessary component of my approach. There is also a fourth reason, but I'm saving it for the Epilogue.

As I have written about before, the US Geological Survey has built a massive catalog of hydrological features across the Western United States. It's as complete a list of the substantial, persistent halophile habitats as one could possibly wish for. It has almost two thousand possible sites in California, Nevada and Oregon alone :


USGS survey sites. UC Davis is marked with a red star.

The database is complete enough that we can get a pretty good sense of what the distribution of sites looks like within this region just by looking at the map. The sites are basically coincident with mountain ranges. Even though they aren't depicted, the Coastal Range, the Sierras, the Cascades and the Rockies all stand out. This isn't surprising; salt lakes require some sort of constraining topography, or the natural drainage would simply carry the salt into the ocean. Interestingly, hot springs are also usually found in mountains (some of these sites are indeed hot springs), but that has less to do with the mountains themselves than it does with the processes that built mountains. To put it more pithily, you find salt lakes where there are mountains, but you find mountains where there are hot springs.

This database obviously contains too many sites to visit. It took Dr. Mariner's team forty years to gather all of this information. I need to choose from among these sites. But which ones? Is there a way to know if I'm making good selections? Does it even matter?

As it turns out, it does matter. When we talk about dispersal in the context of biogeography, we are making a statement about the way organisms get from place to place. Usually, we expect to see a distance decay relationship, because we expect that more distant places are harder to get to, and thus the rates of dispersal across longer distances should be lower. I need to be reasonably confident that I will see the same distance-decay relationship within the sub-sample that I would have seen for every site in the database. This doesn't necessarily mean that the microbes will obey this relationship, but if they do, I need data that would support the measurement.

There is a pretty straightforward way of doing this. If we take every pair of sites in the database, calculate the Great Circle distance between them, and then sort these distances, we can get a spectrum of pairwise distances. Here's what that looks like for the sites in my chunk of the USGS database :


The spectrum of pairwise distances among all sites in the USGS database (solid black), among randomly placed sites over the same geographic area (dashed black), and among a random sub-sample of 360 sites from the database (solid red).

I've plotted three spectra here. The dashed black line is what you'd get if the sites had been randomly distributed over the same geographic area, and the solid black line is the spectrum of the actual pairwise distances. As you can see, the distribution is highly non-random, but we already knew this just by glancing at the map. The red line is the spectrum of a random sub-sample of 360 sites from the database (I chose 360 because that is about how many samples I could collect in five one-week road trips).

This sub-sample matches the spectrum of the database pretty well, but not perfectly. It's easy to generate candidate sub-samples, and they can be scored by how closely their spectra match the database. I'd like to minimize the amount of time it takes me to finish my dissertation, which I expect will be somewhat related to the number of samples I collect. There is a cute little optimization problem there.
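Here is a minimal sketch of how the spectrum and the sub-sample scoring could be computed. It is an illustration under stated assumptions, not the code used for the plot above: sites are assumed to be (latitude, longitude) pairs in degrees, the haversine formula stands in for the Great Circle distance, and the mismatch score (mean absolute difference at matching quantiles) and the random search in `best_subsample` are hypothetical choices.

```python
import math
import random
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def great_circle_km(a, b):
    """Haversine distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def distance_spectrum(sites):
    """Sorted list of all pairwise great-circle distances among the sites."""
    return sorted(great_circle_km(a, b) for a, b in combinations(sites, 2))


def quantiles(sorted_values, n_points=50):
    """Read a sorted spectrum at n_points evenly spaced ranks."""
    last = len(sorted_values) - 1
    return [sorted_values[round(i * last / (n_points - 1))]
            for i in range(n_points)]


def spectrum_mismatch(full_spectrum, sub_spectrum, n_points=50):
    """Mean absolute difference (km) between two spectra at matching quantiles."""
    q_full = quantiles(full_spectrum, n_points)
    q_sub = quantiles(sub_spectrum, n_points)
    return sum(abs(f - s) for f, s in zip(q_full, q_sub)) / n_points


def best_subsample(sites, k, n_candidates=200, seed=0):
    """Draw random sub-samples of k sites and keep the best-scoring one."""
    rng = random.Random(seed)
    full = distance_spectrum(sites)
    best, best_score = None, float("inf")
    for _ in range(n_candidates):
        candidate = rng.sample(sites, k)
        score = spectrum_mismatch(full, distance_spectrum(candidate))
        if score < best_score:
            best, best_score = candidate, score
    return best, best_score
```

Calling `best_subsample(sites, 360)` on the full site list would return the candidate whose spectrum sits closest to the database's. Two thousand sites means roughly two million pairs, which is slow in pure Python but still tractable.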

Although I've outlined the field work, laboratory work and analysis as separate steps, these things will actually take place simultaneously. After I return from the field with the first batch of samples, I will process and submit them for sequencing before going on the next collection trip. I can dispatch the analysis pipeline from pretty much anywhere (even with my mobile phone). That's why I've set aside sample selection and collection as a separate aim. The sample selection process determines where to start, how to proceed, and when I'm done.

Aim 2 : Develop and apply techniques for broad metagenomic sampling, metadata collection and data processing

In order to build all these genomes, I need to solve some technical problems. Building this many metagenomes is a pretty new thing, and so some of the tools I need did not exist in a form (or at a cost) that is useful to me. So, I've developed or adapted some new tools to bring the effort, cost and time for large-scale comparative metagenomics into the realm of a dissertation project.

There are four technical challenges :

  • Quickly collect a large number of samples and transport them to the laboratory without degradation.
  • Build several hundred sequencing libraries.
  • Collect high-quality metadata describing the sites.
  • Assemble thousands of re-sequenced genomes.
To solve each of these problems, I've applied exactly the same principle : Simplify and parallelize. I can't claim credit for the idea here, because I was raised on it. Literally.

Sample collection protocol

When I first joined Jonathan's lab, Jenna Morgan (if you're looking for her newer papers, make sure to add "Lang," as she's since gotten married) was testing how well metagenomic sequencing actually represents the target environment. In her paper, now out in PLoS ONE, one of the key findings is that mechanical disruption is essential.

I learned during my trip to Kamchatka that getting samples back to the lab without degradation is very hard, and it really would be best to do the DNA extraction immediately. Unfortunately, another lesson I learned in Kamchatka is that it is surprisingly difficult to do molecular biology in the woods. One of the ways I helped out while I was there was to kill mosquitoes trying to bite our lab technician so she wouldn't have to swat them with her gloved hands. It's not easy to do this without making an aerosol of bug guts and blood over the open spin columns.

So, I was very excited when I went to ASM last year, and encountered a cool idea from Zymo Research. Basically, it's a battery-operated bead mill, and a combined stabilization and cell lysis buffer. This solves the transportation problem and the bead-beating problem, without the need to do any fiddly pipetting and centrifuging in the field. Also, it looks cool.

Unfortunately, the nylon screw threads on the sample processor tend to get gummed up with dirt, so I've designed my own attachment that uses a quick-release style fitting instead of a screw top.

It's called the Smash-o-Tron 3000, and you can download it on Thingiverse.

Sequencing library construction

The next technical problem is actually building the sequencing libraries. Potentially, there could be a lot of them, especially if I do replicates. If I were to collect three biological replicates from every site on the map, I would have to create about six thousand metagenomes. I will not be collecting anywhere close to six thousand samples, but I thought it was an interesting technical problem. So I solved it.

Well, actually I added some mechanization to a solution that Epicentre (now part of Illumina) marketed, and that my lab-mates Aaron Darling and Qingyi Zhang have refined into a dirt-cheap multiplexed sequencing solution. The standard technique for building Illumina sequencing libraries involves mechanically shearing the source DNA, ligating barcode sequences and sequencing adapters to the fragments, mixing them all together, and then doing size selection and cleanup. The first two steps of this process are fairly tedious and expensive. As it turns out, Tn5 transposase can be used to fragment the DNA and ligate the barcodes and adapters in one easy digest. Qingyi is now growing huge quantities of the stuff.

The trouble is that DNA extraction yields an unpredictable amount of DNA, and the activity of Tn5 is sensitive to the concentration of target DNA. So, before you can start the Tn5 digest, you have to dilute the raw DNA to the right concentration and aliquot the correct amount for the reaction. This isn't a big deal if you have a dozen samples. If you have thousands, the dilutions become the rate limiting step. If I'm the one doing the dilutions, it becomes a show-stopper at around a hundred samples. I'm just not that good at pipetting. (Seriously.)

The usual way of dealing with this problem is to use a liquid handling robot. Unfortunately, liquid handling robots are stupendously expensive. Even at their considerable expense, many of them are shockingly slow.

To efficiently process a large number of samples, we need to be able to treat every sample exactly the same. This way, I can bang through the whole protocol with a multichannel pipetter. It occurred to me that many companies sell DNA extraction kits that use spin columns embedded in 96-well plates, and we have a swinging bucket centrifuge with a rotor that accommodates four plates at a time. So, the DNA extraction step is easy to parallelize. The Tn5 digests work just fine in 96-well plates.

We happen to have (well, actually Marc's lab has) a fluorometer that handles 96-well plates. Once the DNA extraction is finished, I can use a multichannel pipetter to make aliquots from the raw DNA, and measure the DNA yield for each sample in parallel. So far, so good.

Now, to dilute the raw DNA to the right concentration for the Tn5 digest, I need to put an equal volume of raw DNA into differing amounts of water. This violates the principle of treating every sample the same, which means I can't use a multichannel pipetter to get the job done. That is, unless I have a 96-well plate that looks like this :


Programmatically generated dilution plate CAD model

I wrote a piece of software that takes a table of concentration measurements from the fluorometer, and designs a 96-well plate with wells of the correct volume to dilute each sample to the right concentration for the Tn5 digest. If I make one of these plates for each batch of 96 samples, I can use a multichannel pipetter throughout.
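The arithmetic behind such a plate is just the dilution equation. Here is a minimal sketch of that one step, not the plate-design software itself: the function name, the units (ng/µL and µL) and the decision to give under-concentrated samples no water are all assumptions made for illustration.

```python
def water_volumes_ul(concentrations, target, sample_ul):
    """Water (in µL) to add to `sample_ul` of each sample to reach `target`.

    From C1 * V1 = C2 * (V1 + Vw), the water volume is Vw = V1 * (C1 / C2 - 1).
    Samples already at or below the target cannot be diluted up to it, so
    they get 0.0 here (an assumption; they would need separate handling).
    """
    volumes = []
    for c in concentrations:
        if c <= target:
            volumes.append(0.0)
        else:
            volumes.append(sample_ul * (c / target - 1.0))
    return volumes
```

For example, with a target of 2.5 ng/µL and 5 µL of raw DNA per well, a 10 ng/µL sample needs 15 µL of water. Each well's volume would then be sized from its sample-plus-water total.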

Of course, unless you are Kevin Flynn, you can't actually pipette liquids into a 3D computer model and achieve the desired effect. To convert the model from bits into atoms, I ordered a 3D printer kit from Ultimaker. (I love working in this lab!)


The Ultimaker kit

After three days of intense and highly entertaining fiddling around, I managed to get the kit assembled. A few more days of experimentation yielded my first successful prints (a couple of whistles). A few days after that, I was starting my first attempts to build my calibrated volume dilution plates.


Dawei Lin and his daughter waiting for their whistle (thing 1046) to finish printing.

Learning about 3D printing has been an adventure, but I've got the basics down and I'm now refining the process. I'm now printing plates with surprisingly good quality. I've had some help from the Ultimaker community on this, particularly from Florian Horsch.

Much to my embarrassment, the first (very lousy) prototype of my calibrated volume dilution plate ended up on AggieTV. Fortunately, the glare from the window made it look much more awesome than it actually was.

The upshot is that if I needed to make ten or twenty thousand metagenomes, I could do it. I can print twelve 96-well dilution plates overnight. Working at a leisurely pace, these would allow me to make 1152 metagenome libraries in about two afternoons' worth of work.

I'm pretty excited about this idea, and there are a lot of different directions one could take it. The College of Engineering here at UC Davis is letting me teach a class this quarter that I've decided to call "Robotics for Laboratory Applications," where we'll be exploring ways to apply this technology to molecular biology, genomics and ecology. Eight really bright UC Davis undergraduates have signed up (along with the director of the Genome Center's Bioinformatics Core), and I'm very excited to see what they'll do!

Environmental metadata collection

To help me sanity check the selection measurement, I decided that I wanted to have detailed measurements of environmental differences among sample sites. Water chemistry, temperature, weather, and variability of these are known to select for or against various species of microbes. The USGS database has extremely detailed measurements of all of these things, all the way down to the isotopic level. However, I still need to take my own measurements to confirm that the site hasn't changed since it was visited by the USGS team, and to get some idea of what the variability of these parameters might be. It would also be nice if I could retrieve the data remotely, and not have to make return trips to every site.

Unfortunately, commercial environmental data loggers are extraordinarily expensive. The ones that can be left in the field for a few months to log data cost even more. The ones that can transmit the data wirelessly are so expensive that I'd only be able to afford a handful if I blew an entire R01 grant on them.

This bothers me on a moral level. The key components are a few probes, a little lithium polymer battery, a solar panel the size of your hand, and a cell phone. You can buy them separately for maybe fifty bucks, plus the probes. Buying them as an integrated environmental data monitoring solution costs tens of thousands of dollars per unit. A nice one, with weather monitoring, backup batteries and a good enclosure could cost a hundred thousand dollars. You can make whatever apology you like on behalf of the industry, but the fact is that massive overcharging for simple electronics is preventing science from getting done.

So, I ordered a couple of Arduino boards and made my own.


My prototype Arduino-based environmental data logger. This version has a pH probe, Flash storage, and a Bluetooth interface.

The idea is to walk into the field with a data logger and a stick. Then I will find a suitable rock. Then I will pound the stick into the mud with the rock. Then I will strap the data logger to the stick, and leave it there while I go about the business of collecting samples. To keep it safe from the elements, the electronics will be entombed in a protective wad of silicone elastomer with a little solar panel and a battery.

The bill of materials for one of these data loggers is about $200, and so I won't feel too bad about simply leaving them there to collect data. If the site has cell phone service, I will add a GSM modem to the datalogger (I like the LinkSprite SM5100B with SparkFun's GSM shield), and transmit the data to my server at UC Davis through an SMS gateway. Then I don't have to go back to the site to collect the data. This could easily save $200 worth of gasoline. I'll put pre-paid return shipping labels on them so that they can find their way home someday. I'm eagerly looking forward to decades of calls from Jonathan complaining about my old grimy data loggers showing up in his mail.

From the water, the data logger can measure pH, dissolved oxygen, oxidation/reduction potential, conductivity (from which salinity can be calculated), and temperature. I may also add a small weather station to record air temperature, precipitation, wind speed and direction, and solar radiation. I doubt if all of these parameters will be useful, but the additional instrumentation is not very expensive.

Assembling the genomes

The final technical hurdle is assembling genomes from the metagenomic data. If I have 360 sites and 100 reference genomes, I'm going to have to assemble 36,000 genomes. Happily, I am really re-sequencing them, which is much, much easier than de novo sequencing. Nevertheless, 36,000 is still a lot of genomes.

For each metagenome, I must :

  • Remove adapter contamination with TagDust
  • Trim reads for quality, discard low quality reads
  • Remove PCR duplicates
  • Map reads to references with bwa, bowtie, SHRiMP, or whatever
This yields a BAM file for each metagenome, each representing an alignment of reads to each scaffold of each reference genome. All of the reference genomes can be placed into a single FASTA file with a consistent naming scheme for distinguishing among scaffolds belonging to different organisms. A hundred-odd archaeal reference genomes is about 200-400 megabases, or an order of magnitude smaller than the human genome. Using the Burrows-Wheeler Aligner on a reasonably modern computer, this takes just a few minutes for each metagenome.
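As an illustration of the "consistent naming scheme" idea, here is a minimal sketch that merges reference genomes into one FASTA stream and recovers the organism from a scaffold name afterwards. The `organism|scaffold` convention, the function names and the organism IDs in the usage note are hypothetical; any separator that never appears in an organism ID would do.

```python
def merge_references(genomes, out_handle, sep="|"):
    """Write several genomes to one FASTA stream, prefixing scaffold names.

    `genomes` maps an organism ID to an iterable of FASTA lines. Each header
    becomes '>organism|scaffold', keeping only the first word of the original
    header so that the name survives read mapping unchanged.
    """
    for organism, lines in genomes.items():
        if sep in organism:
            raise ValueError(f"organism ID {organism!r} contains {sep!r}")
        for line in lines:
            line = line.rstrip("\n")
            if line.startswith(">"):
                fields = line[1:].split()
                scaffold = fields[0] if fields else "unnamed"
                out_handle.write(f">{organism}{sep}{scaffold}\n")
            elif line:
                out_handle.write(line + "\n")


def organism_of(reference_name, sep="|"):
    """Recover the organism ID from a merged scaffold name."""
    return reference_name.split(sep, 1)[0]
```

With the references merged this way, every alignment record names its organism directly, so a single BAM file per metagenome can later be split by organism with `organism_of("hfx_volcanii|scaffold1")` and the like.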

I'm impatient, though, and so I applied for (and received) an AWS in Education grant. Then I wrote a script that parcels each metagenome off to a virtual machine image, and then unleashes all of them simultaneously on Amazon.com's thundering herd of rental computers. Once they finish their alignment, each virtual machine stores the BAM file in my Dropbox account and shuts down. The going rate for an EC2 Extra Large instance is $0.68 per hour.
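For a rough sense of what this costs, here is a back-of-the-envelope sketch. Only the $0.68 hourly rate comes from the text above; the five-minute run time and the rule that each instance is billed for whole hours are assumptions for illustration.

```python
import math


def ec2_cost_usd(n_jobs, minutes_per_job, rate_per_hour=0.68,
                 bill_whole_hours=True):
    """Estimated cost of running one instance per job, all in parallel."""
    hours = minutes_per_job / 60.0
    if bill_whole_hours:
        # Assumption: any partial hour is billed as a full hour.
        hours = math.ceil(hours)
    return n_jobs * hours * rate_per_hour
```

Under these assumptions, 360 metagenomes at one instance each come to about $245 if every instance is billed for a full hour, or about $20 if billing were proportional to run time.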

This approach could be used for any re-sequencing project, including ChIP-seq, RNA-seq, SNP analysis, and many others.

Aim 3 : Test the dispersal hypothesis using a phylogeographic model with controls for local selection

In order to test my hypothesis, I need to model the dispersal of organisms among the sites. However, in order to do a proper job of this, I need to make sure I'm not conflating dispersal and selective effects in the data used to initialize the model. There are three steps :
  • Identify genomic regions that have recently been under selection
  • Build genome trees with those regions masked out
  • Model dispersal among the sites
In all three cases, there are a large number of methods to choose from.

One way of detecting the effects of selection is Tajima's D. This measures deviation from the neutral model by comparing two estimators of the neutral genetic variation, one based on the nucleotide diversity and one based on the number of polymorphic sites. Neutral theory predicts that the two estimators are equal, and so genomic regions in which these two estimators are not equal are evolving in a way that is not predicted by the neutral model (i.e., they are under some kind of selection). One can do this calculation on a sliding window to measure Tajima's D for each coordinate of the genome of each organism. As it turns out, this exact approach was used by David Begun's lab to study the distribution of selection across the Drosophila genome.
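Here is a minimal sketch of the sliding-window calculation, using the standard formula for Tajima's D. It assumes a set of equal-length, already-aligned sequences with no gaps or ambiguous bases, which is a simplification of what read-mapped metagenomic data would actually provide; the function names are hypothetical.

```python
import math
from itertools import combinations


def tajimas_d(seqs):
    """Tajima's D for aligned, equal-length sequences; None if undefined."""
    n = len(seqs)
    if n < 4:
        return None  # the variance term vanishes for fewer than 4 sequences
    # S: number of segregating (polymorphic) alignment columns
    seg_sites = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    if seg_sites == 0:
        return None
    # pi: mean number of differences over all pairs of sequences
    n_pairs = n * (n - 1) / 2
    pi = sum(sum(x != y for x, y in zip(a, b))
             for a, b in combinations(seqs, 2)) / n_pairs
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    if variance <= 0:
        return None
    # Difference between the two estimators, scaled by its standard deviation
    return (pi - seg_sites / a1) / math.sqrt(variance)


def sliding_tajimas_d(seqs, window, step):
    """List of (window start, D) pairs along the alignment."""
    length = len(seqs[0])
    results = []
    for start in range(0, length - window + 1, step):
        chunk = [s[start:start + window] for s in seqs]
        results.append((start, tajimas_d(chunk)))
    return results
```

Windows with no polymorphic sites return `None` rather than zero, so they can be told apart from windows that were measured and found to be neutral.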

I will delete the regions of the genomes that deviate significantly (say, by more than one standard deviation) from neutral. Then I'll make whole genome alignments, and build a phylogenetic tree for each organism. This tree would contain only characters that (at least insofar as you believe Tajima's D and Wu and Fey's FST) are evolving neutrally, and are not under selection.
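Here is a minimal sketch of the masking step, under one possible reading of "more than one standard deviation": a window is masked when its D value lies further from zero than the root-mean-square of D across all scored windows. Both that interpretation and the choice to overwrite masked columns with N rather than delete them are assumptions.

```python
import math


def windows_to_mask(window_scores, n_sd=1.0):
    """Start positions of windows whose D deviates strongly from zero.

    `window_scores` is a list of (start, D) pairs; D may be None for windows
    where the statistic is undefined, and those windows are left alone.
    """
    values = [d for _, d in window_scores if d is not None]
    if not values:
        return []
    # Spread of D about the neutral expectation of zero (root mean square)
    spread = math.sqrt(sum(d * d for d in values) / len(values))
    return [start for start, d in window_scores
            if d is not None and abs(d) > n_sd * spread]


def mask_alignment(seqs, starts, window, mask_char="N"):
    """Overwrite each flagged window with mask_char in every sequence."""
    masked = [list(s) for s in seqs]
    for start in starts:
        for row in masked:
            for i in range(start, min(start + window, len(row))):
                row[i] = mask_char
    return ["".join(row) for row in masked]
```

Masking instead of deleting keeps the alignment coordinates intact, which makes it easier to trace a masked region back to the reference genome.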

A phylogenetic tree represents evolutionary events that have taken place over time. In order to infer the dispersal of the represented organisms, we would need to model where those events took place. Again, there are a variety of methods for doing this, but my personal favorite is probably the approach used by Isabel Sanmartín for modeling dispersal of invertebrates among the Canary Islands. I don't know if this is necessarily the best method, but I like the idea that the DNA model and the dispersal model use the same mathematics, and are computed together. Basically, they allowed each taxon to evolve its own DNA model, but constrained by the requirement that they share a common dispersal model. Then they did Markov Chain Monte Carlo (MCMC) sampling of the posterior distributions of island model parameters (using MrBayes 4.0).

According to Wikipedia, the most respected and widely consulted authority on this and every topic, the General Time Reversible Model is the most generalized model describing the rates at which one nucleotide replaces another. If we want to know the rate at which a thymine turns into a guanine, we look at element (2,3) of this matrix :

πG is the stationary state frequency for guanine, and rTG is the exchangeability rate from T to G. However, if we think of this a little differently, as Sanmartín suggests in her paper, we can use the GTR model for the dispersal of species among sites (or islands). If we want to know the rate at which a species migrates from island B to island C, we look in cell (2,3) of a very similar matrix :

Here, πC is the relative carrying capacity of island C, and rBC is the relative dispersal rate from island B to island C. Thus, the total dispersal from island i to island j is

$d_{ij} = N \pi_i r_{ij} \pi_j m$

where N is the total number of species in the system, and m is the group-specific dispersal rate. This might look something like this :
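As a numerical illustration of the formula, here is a small Python sketch that fills in the whole dispersal matrix for three islands. The carrying capacities, relative dispersal rates, N and m are all made-up values, the function name is hypothetical, and setting the diagonal to zero (no dispersal from an island to itself) is an assumption made here for readability.

```python
def dispersal_matrix(pi, r, n_species, m):
    """Total dispersal d[i][j] = N * pi_i * r_ij * pi_j * m between each pair of islands."""
    k = len(pi)
    return [[0.0 if i == j else n_species * pi[i] * r[i][j] * pi[j] * m
             for j in range(k)]
            for i in range(k)]


# Hypothetical values for islands A, B and C.
pi = [0.5, 0.3, 0.2]   # relative carrying capacities (sum to 1)
r = [[0, 1, 2],        # relative dispersal rates, symmetric as in a
     [1, 0, 4],        # time-reversible model
     [2, 4, 0]]

d = dispersal_matrix(pi, r, n_species=10, m=0.1)
for row in d:
    print([round(x, 3) for x in row])
```

Because r is symmetric, the resulting matrix is symmetric too: the total flow from B to C equals the flow from C to B, even though the two islands have different carrying capacities.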

One nifty thing I discovered about MrBayes is that it can link against the BEAGLE library, which can accelerate these calculations using GPU clusters. Suspiciously, Aaron Darling is one of the authors. If you were looking for evidence that the Eisen Lab is a den of Bayesians, this would be it.

This brings us, at last, back to the hypothesis and Baas Becking. Here we have a phylogeographic model of dispersal among sites within a metacommunity, with the effects of selection removed. If the model predicts well-supported finite rates of dispersal within the metacommunity, my hypothesis is sustained. If not, then Baas Becking's 78-year reign continues.

Epilogue : Lourens Baas Becking, the man versus the strawman


Lourens Baas Becking

Microbiologists have been taking potshots at the Baas Becking hypothesis for a decade or two now, and I am no exception. I'm certainly hoping that the study I've outlined here will be the fatal blow.

However, it's important to recognize that we've been a bit unfair to Baas Becking himself. The hypothesis that carries his name is a model, and Baas Becking himself fully understood that dispersal must play an important role in community formation. He understood perfectly well that "alles is overal: maar het milieu selecteert" was not literally true; it is only mostly true, and then only in the context of the observational methodology available at the time. In 1934, in the same book where he proposed his eponymous hypothesis, he observed that there are some habitats that were ideally suited for one microbe or another, and yet these microbes were not present. He offered the following explanation: "There thus are rare and less rare microbes. Perhaps there are very rare microbes, i.e., microbes whose possibility of dispersion is limited for whatever reason."

Useful models are never "true" in the usual sense of the word. Models like the Baas Becking hypothesis divide the world into distinct intellectual habitats; one in which the model holds, and one in which it doesn't. At the shore between the two habitats, there is an intellectual littoral zone; a place where the model gives way, and something else rises up. As any naturalist knows, most of the action happens at interfaces; land and sea, sea and air, sea and mud, forest and prairie. The principle applies just as well to the landscape of ideas. The limits of a model, especially one as sweeping as Baas Becking's, provide a lot of cozy little tidal ponds for graduate students to scuttle around in.

By the way, guess where Lourens Baas Becking first developed his hypothesis? He was here in California, studying the halophiles of the local salt lakes. In fact, the very ones I will be studying.

3D printing update

Posted by Russell on October 04, 2011 at 12:19 a.m.
I've been working a bit on the software that generates my 96-well dilution plate. I have a new version that cuts the plastic use by about 80% and print time by about the same. Also, it now prints with the wells upside-down on the build platform, which should help cut down contamination during the printing process.

Things to do :

  • I'm going to try cutting the plastic use even more by adding a skirt around the plate (like a normal titer plate), and adjusting the outer height of each well.
  • Add a fill-line to each well.
  • Raise the well edges a little more, and add drain-holes between wells to prevent spillage between wells and to make filling easier.
  • Add embossed row and column labels.
  • Add an embossed text area for user notations (e.g., for which sample group is this plate calibrated).
Hmm. I might pull this off yet.

Also, if you are interested in this stuff, the UC Davis Biomedical Engineering department made me instructor of a variable unit class (graded P/NP) called "Research internship in robotics for the laboratory" for Winter 2012. Sign up for BIM192, sec 2 (the CRN is 24791).

Questions of microbial ecology

Posted by Russell on April 27, 2011 at 10:50 p.m.
When the first environmental sequencing projects were conducted, the genetic breadth present within an environmental sample so far outstripped the available sequencing capacity at the time that it was only possible to obtain a tiny slice of the genetic material present. This gave researchers two choices; either target a particular gene, or go fishing. Both approaches have been extremely fruitful. Targeted studies of ribosomal RNA led to the discovery of the archaea, among other important accomplishments. The "fishing" approach (which has a shorter history) has also led to exciting discoveries. If you do a literature search for your favorite enzyme with the word "novel," it's quite likely that most of the recent publications will involve some kind of metagenomic survey.

As the cost of sequencing continues to plummet, a third approach to environmental sequencing has suddenly become possible: Exhaustive sequencing. It should be possible not only to survey the entire genomes of the organisms present (although assembling them is another story), but also to survey the population-level variability of the organisms present. This is a rather unprecedented development. Microbial communities have suddenly gone from the most challenging ecologies, with only a handful of observable characters, to a spectacularly detailed quantitative picture.

Here is an example from one of my datasets :

This is a small region in the genome of Roseiflexus castenholzii. I have mapped reads from an environmental sample to the reference genome, yielding an average coverage of about 190x. If you look closely at the column in the middle (position 12519 in the genome, in case you care), we see some clear evidence of a single nucleotide polymorphism in this population of this organism.

As it happens, this coordinate falls in what appears to be an intergenic region, between a phospholipid/glycerol acyltransferase gene on the forward strand to the left and a glycosyl transferase gene on the reverse strand to the right. The two versions appear with roughly equal frequency in the data. For this organism, I've found single nucleotide polymorphisms at thousands of sites. There are also insertions and deletions, and probably rearrangements.

In this ecosystem, I'm able to get between 50x and 300x coverage for almost every taxon present. This should make it possible to see variants that make up only a percent or two of their respective taxon's population. With data like this, it should be possible to do some really beautiful ecology!
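To get a feel for that claim, here is a minimal sketch in Python. It treats each read covering a site as an independent draw from the population and ignores sequencing error, both of which are simplifying assumptions. The function name and the min_reads threshold are hypothetical.

```python
from math import comb


def prob_variant_seen(coverage, freq, min_reads=1):
    """Probability that at least min_reads of the reads at a site carry the variant."""
    # binomial probability of seeing fewer than min_reads supporting reads
    p_fewer = sum(comb(coverage, k) * freq**k * (1 - freq)**(coverage - k)
                  for k in range(min_reads))
    return 1 - p_fewer


# A variant carried by 1% of the population, at the low and high end of coverage.
for cov in (50, 300):
    print(cov, round(prob_variant_seen(cov, 0.01, min_reads=3), 3))
```

Requiring several supporting reads is a crude guard against mistaking a sequencing error for a real variant, and it is the reason deeper coverage matters so much for the rare ones.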

For example, suppose one wanted to see if a community obeys the island biogeography model. One could measure the theory's three parameters, immigration, emigration and extinction, by comparing the arrivals and disappearances of variants between the "mainland" and the "island" over time. The ability to examine variants within taxa should make these measurements very sensitive. Additionally, because these are genomic characters, it should be possible to control for the effects of selection (to some extent) by leveraging our knowledge of their genomic context. The 12519th nucleotide of the R. castenholzii genome is perhaps a good example of a character that is unlikely to be under selection because it happens to sit downstream from both flanking genes.1

So, here is my question to you : What ecological model or process would you be most excited to see studied in this way?

1 Well, actually I haven't looked at this site in detail, so I'm not sure if one would or wouldn't reasonably expect it to be under selection. My hunch is that it is less likely to be under stringent selection than most other sites. I'm basing this hunch on eyeballing the distance of this locus from where I think RNA polymerase would be ejected on either side, and that both transcripts terminate into its neighborhood. My point is that it should be possible to have some idea of how selection might operate on a particular locus based on its genomic context. One should take this with the usual grain of salt that accompanies inferences drawn solely from models. A better example would be a polymorphism among synonymous codons, but I wasn't able to find one in a hurry.

A sequencer of our own

Posted by Russell on January 27, 2011 at 4:12 p.m.
We just finished running our new GS Jr. gene sequencer for the first time. It produced 115,698 shotgun reads of our E. coli. Here is the read length histogram :

And the GC content histogram :

This was our first time going through the shotgun library protocol, which is pretty involved. For example, we're going to have to be more careful next time when we load the picotiter plate. We got a few bubbles trapped in there. It's kind of funny how obvious the bubbles are in the raw fluorescence images (this is an A, around cycle 200) :

I've uploaded the FASTA file and the qual file, in case you want to try to assemble your own E. coli genome.

Last minute preparations

Posted by Russell on August 04, 2010 at 5:10 a.m.
We're planning to return from Uzon on the 11th of August, so assuming we leave this evening, we'll be there for seven days. We have to plan for the possibility that we might get fogged in up there, so we might be stuck for a few extra days. Hopefully that won't happen, because we've got a workshop to prepare for once we get back. Also, I've picked up a bad habit from Jonathan, and I still have to write my talk for the 17th.

We had an exhausting day yesterday.

First, the cost of the helicopter has gone up since last time they made the trip, so Frank and Albert had to arrange to transfer the difference from America to Petropavlovsk. This turned out to be an agonizing process, and I'm not even sure of all the details. Albert came back to the apartment after the first day of working on it and passed out instantly. Suffice it to say that both of them have extremely patient and resourceful spouses, without whom we would now be stranded in town with no way to get to our research site.

There remains a great deal of confusion and uncertainty about the status of the generator (or generators?) at Uzon, and so we've had to prepare for the worst. I spent the day with Albert and Alex hunting down motor oil, spark plugs, and two-stroke oil (in case it's a two-stroke engine), and other small-engine stuff. Supposedly there is a new American-market Honda generator up there, a Soviet-era machine that can still be persuaded to work, and perhaps something else of unknown provenance and status. We were also told that there was no generator at all, sending us scrambling all over town to buy a new generator, but that was evidently a miscommunication. Fortunately we got it straightened out before we actually started laying out Rubles for the first generator we could carry away!

In the summertime, the research station would be a truly ideal place for an off-grid solar array. One of the things I'm going to do while I'm there is to study the structure and write up a proposal for its owners to install one, if they should so desire.

After much looking around, I found that nobody sells regular fuel canisters for backpacking stoves in this part of Russia. However, they do sell adapters that let you plug them into butane refill canisters. The canisters are very cheap, but they are shaped like cans of hairspray; narrow and tall. Not a very stable platform for cooking! I'm going to set up my stove in a bucket, and pack dirt around the fuel canister to keep it stable and upright (and far from anything that might melt or burn). And yes, I'll only use it outside.


Demonstrating the use of mosquito protection gear for Bo -- you can tell I'm not really excited about mosquitoes

I was able to find a SIM card for my MyTouch 3G, which is awesome. Unfortunately, MTS doesn't know how to automatically configure Android phones for GPRS. At least, that's what I could understand from the girl at the MTS store. That conversation was conducted mostly through hand gestures and giggling, and was a testament to the power of technology-related acronyms to puncture language barriers. It's strange to say, "IP for DNS server?" and see the light of understanding spread across a person's face.


We bought more than 15,000 Rubles of food for the trip! Actually, that's pretty reasonable for seven people.

Last of all, there was the food. By the time we all got to the grocery at 7:00 in the evening, we were almost totally spent. Still, we had to shop for another two hours before we had everything we need (at least, I hope we have everything we need).

This morning a truck from the Institute arrived at the apartment to pick up our food and laboratory equipment. We're not totally sure if we will be riding with it to Uzon, or if it will go on a separate helicopter. So, we had to waterproof everything last night in case it had to spend the day (or evening) on the landing site in the rain. I am glad we had plenty of plastic bags and tape!


Our food and lab equipment getting picked up

With luck, we will catch our helicopter to Uzon this evening.

Science, the practice of

Posted by Russell on July 25, 2010 at 6:40 a.m.

This is the first in a series of articles I plan to write over the next three weeks covering my field expedition to Uzon Caldera and attendance at the 2010 International Workshop on Biodiversity, Molecular Biology and Biogeochemistry of Thermophiles. In this post, I'll outline my plans for the series and explain why I'm writing it.

If you would like to follow along, check in here, or subscribe to my RSS feed. Or if you would like to follow the series and not the rest of my blog, I will be tagging all of the posts in the series kamchatka. At Uzon Caldera, I will be posting updates to my Twitter feed by satellite phone (you can also subscribe to my Twitter RSS feed.)

Before I leave on Tuesday, I will post articles introducing the natural history of Kamchatka, my plans and preparations for getting there and working there, and maybe a few other things.

I have two broad goals :

  • Study the biochemistry, genomics, and physiology of thermophilic organisms in their natural habitat.
  • Document and share the experience.
These are two fairly distinct missions. First of all, I'm looking for material for my thesis, particularly a metagenomic target suitable for the technique I'm developing. For the hard science, I will try to confine myself to observations and avoid drawing conclusions. I'll save that for the journals.

The second mission is to bring you along. I've been asked by my thesis advisor to write about, photograph, tweet and film as much of the field expedition and the workshop as possible, and present it as an example of what it's like to actually do science. My goal is to present the company, the food, the work, the travel, the joys, the annoyances, the surprises, the good, the bad, and the ridiculous.

Science remains firmly misunderstood by the public. My personal experience suggests that the public actually understands the products of science -- powerful theories and key facts -- a bit better than polling data suggests. The core of public misunderstanding, I think, rests in how people believe science works as an institution and as a profession.

A couple of years ago, Fermilab invited a group of seventh graders to visit the laboratory to check out the various awesome things they have available for the public to see. Before the visit, the students were asked to write about what they thought scientists were like, and to draw a picture to go along with it. After the visit, they were asked to repeat the exercise. The results were eye-opening. Here is an example I particularly liked, from a girl named Rachel :

before

after

Most of the before pictures feature lab coats filled by older, white men without much hair. Many of the kids mentioned that they thought scientists were "a little bit crazy," and most represented their scientist as some sort of authority figure. The after-visit results are equally interesting; many of the comments seem astonished that scientists have families, and that they enjoy things other than science.

The phrase "regular people" comes up again and again in their after-visit writing. Students are usually pretty good at ignoring phrases that are deliberately emphasized. When you see a bunch of students incorporate exactly the same phrase into a free-form writing assignment, it's usually something that an adult mentioned without anticipating the impact it would have. The concept that scientists could be "regular people" was evidently a bit of a shock.

Obviously this is anecdotal, and it's important not to read too much into it. It is, however, a useful example of the sort of challenges we face if we want society to understand science itself, rather than simply memorizing the things science produces. None of this is original to me. If you want an entertaining treatment of science in the media, check out Christopher Frayling's Mad, Bad and Dangerous?: The Scientist and the Cinema (I apologize for the bizarre question-mark colon thing).

I've written about this before. Last November, I wrote :

The problem is that scientists do not spend enough time talking with the general public. Only a small minority of scientists take the trouble to arrange their findings in a form digestible by the lay audience, as Darwin did. When they do, it is almost never cutting-edge research that fills the pages. Very few scientists go on television or the radio. The practice today is to bring research to the lay audience only when it is neatly tied up (or, the research community feels that it is, anyway). There are those who do otherwise, but there is a negative stigma to it; scientists who announce their findings with press releases instead of peer-reviewed papers are usually regarded with suspicion.
Scientists have a responsibility to share what they do.

Over the next three weeks, I'm going to put that thought into action.

I'm going to Kamchatka!

Posted by Russell on July 19, 2010 at 5:47 p.m.
I just got the reservations for my flight to Petropavlovsk-Kamchatsky for the International Workshop on Biodiversity, Molecular Biology and Biogeochemistry of Thermophiles, hosted by Moscow State University and Winogradsky Institute of Microbiology.

I've been working on the analysis of environmental samples from two sites at Uzon Caldera (about 10,000 Sanger reads from each sequenced at the JGI), and I'm hoping that I'll be able to reprocess the DNA here at the UC Davis Genome Center using some of our high-throughput machines. Licensing and customs restrictions will probably make it impossible to bring my own samples back, but I may be able to entrust them to a colleague with fancier credentials than my own.

Insofar as it will be possible, I will be blogging from Kamchatka and uploading photographs and data, so please ask questions in the comments!

I'll be arriving in Petropavlovsk on the 30th of July, with the help of a generous grant from the Carnegie Institution for Science Deep Carbon Observatory.

What Google knows

Posted by Russell on April 28, 2010 at 11:26 a.m.
After six months of using Google Latitude, I've amassed about 7108 location updates, or about 38 a day. It would probably be a lot more if I hadn't managed on occasion to break the GPS or automatic updating by fiddling with the software.

It's actually quite useful to have this data, especially if it's correlated with some richer information. For example, I've consulted the data to answer questions like, "Where was that awesome sandwich place I ate at last month?" It's also extremely useful to be able to share this data with Google because it allows me to quickly cross-reference location coordinates with Google's database of businesses and addresses. You can also download your complete location history in one giant blob (just ignore the warning that the History map only displays 500 datapoints, and download the KML file). Once you have the KML file, you can do whatever you want with it. For example, I uploaded mine to Indiemapper to map my wanderings for the last six months (Indiemapper is cool, but I quickly found that this dataset is really much too big for a Flash-based web application).

Not surprisingly, I spent most of my time in California, mostly in Davis and the Bay Area, with a few trips to Los Angeles via I-5, the Coast Starlight, and the San Joaquin (the density of points along those routes is indicative of the data service along the way).

The national map shows my trip to visit my dad's family in New Jersey and Massachusetts, as well as a layover in Denver that I'd completely forgotten about.

I have somewhat mixed feelings about this dataset. On one hand, it's very useful to have, and sharing it with my friends and with Google is very useful. It's also cool to have this sort of quantitative insight into my recent past so easily accessible. On the other hand, I'm not particularly happy with the idea that Google controls this data. I chose the word controls deliberately. I don't mind that they have the data -- after all, I did give it to them. As far as I know, Google has been a good citizen when it comes to keeping personal location data confidential. The Latitude documentation makes their policy pretty clear :

Privacy

Google Location History is an opt-in feature that you must explicitly enable for the Google Account you use with Google Latitude. Until you opt in to Location History, no Latitude location history beyond your most recently updated location if you aren't hiding is stored for your account. Your location history can only be viewed when you're signed in to your Google Account.

You may delete your location history by individual location, date range, or entire history. Keep in mind that disabling Location History will stop storing your locations from that point forward but will not remove existing history already stored for your Google Account.

...

If I delete my history, does Google keep a copy or can I recover it?

No. When you delete any part of your location history, it is deleted completely and permanently within 24 hours. Neither you nor Google can recover your deleted location history.

So, that's what they'll do with it, and I'm happy with that. What bothers me is this: Who owns this data?

This question leads directly to one of the most scorchingly controversial questions you could ask for, and there are profound legal, social, economic and moral outcomes riding on how we answer it. This isn't just about figuring out what coffee shops I like. If you want to see how high the stakes go, buy one of 23andMe's DNA tests. You're giving them access to perhaps the most personal dataset imaginable. In fairness, 23andMe has a very strong confidentiality policy.

But therein lies the problem -- it's a policy. Ambiguous or fungible confidentiality policies are at the heart of an increasing number of lawsuits and public snarls. For example, there is the case of the blood samples taken from the Havasupai Indians for use in diabetes research that turned up in research on schizophrenia. The tribe felt insulted and misled, and sued Arizona State University (the case was recently settled, the tribe prevailing on practically every item).

You can't mention informed consent and not revisit HeLa, the first immortal human cells known to science. HeLa was cultured from a tissue biopsy from Henrietta Lacks and shared among thousands of researchers -- even sold as a commercial product -- making her and her family one of the most studied humans in medical history. The biopsy, the culturing, the sharing and the research all happened without her knowledge or consent, or the knowledge or consent of her family.

And, of course, there is Facebook -- again. Their new "Instant Personalization" feature amounts to sharing information about personal relationships and cultural tastes with commercial partners on an opt-out basis. Unsurprisingly, people are pissed off.

Some types of data are specifically protected by statute. If you hire a lawyer, the data you share with them is protected by attorney-client privilege, and cannot be disclosed even by court order. Conversations with a psychiatrist are legally confidential under all but a handful of specifically described circumstances. Information you disclose to the Census cannot be used for any purpose other than the Census. Nevertheless, there are many types of data that have essentially no statutory confidentiality requirements, and these types of data are becoming more abundant, more detailed, and more valuable.

While I appreciate Google's promises, I'm disturbed that the only thing protecting my data is the goodwill of a company. While a company might be full of lots of good people, public companies are always punished for altruistic behavior sooner or later. There is always a constituency of assholes among shareholders who believe that the only profitable company is a mean company, and they'll sue to get their way. Managers must be very mindful of this fact as they navigate the ever changing markets, and so altruistic behavior in a public company can never be relied upon.

We cannot rely on thoughtful policies, ethical researchers or altruistic companies to keep our data under our control. The data we generate in the course of our daily lives is too valuable, and the incentives for abuse are overwhelming. I believe we should go back to the original question -- who owns this data? -- and answer it. The only justifiable answer is that the person described by the data owns the data, and may dictate the terms under which the data may be used.

People who want the data -- advertisers, researchers, statisticians, public servants -- fear that relinquishing their claim on this data will mean that they will lose it. I strongly disagree. I believe that people will share more freely if they know they can change their mind, and that the law will back them up.

Update

The EFF put together a very sad timeline of Facebook's privacy policies as they've evolved from 2005 to now. They conclude, depressingly :
Viewed together, the successive policies tell a clear story. Facebook originally earned its core base of users by offering them simple and powerful controls over their personal information. As Facebook grew larger and became more important, it could have chosen to maintain or improve those controls. Instead, it's slowly but surely helped itself — and its advertising and business partners — to more and more of its users' information, while limiting the users' options to control their own information.

Clones!

Posted by Russell on October 24, 2009 at 5:58 p.m.
Looks like I got my gene to grow in E. coli!

The colonies that didn't get the plasmid I'm using to carry MXAN7396 turn blue when grown with X-gal (bromo-chloro-indolyl-galactopyranoside). The ones that got the plasmid don't.

Neat!

Yays!

Posted by Russell on October 21, 2009 at 4:04 a.m.
I finally got through the double-PCR phase of the protocol without wrecking something. Yay!

My hacked up version of the gene is getting snipped up with everyone's favorite restriction enzymes (BamHI and EcoRI). Then I get to splice it into a plasmid, and electroporate the plasmids into some cells, and maybe they will do something interesting.

The cloning blues

Posted by Russell on October 19, 2009 at 8:43 p.m.
I've been doing my first laboratory rotation in Mitch Singer's lab, and trying to learn what people are actually doing when they publish these spiffy experimental results. So far, I've mostly been wrecking things. Fun disasters of the week :
  • Wrecked a DNA extraction by grabbing the wrong Pipetter and putting 300 microliters into a tube instead of 3.
  • Misread an illegible label and used butanol instead of ethanol, destroyed second attempt at the aforementioned DNA extraction.
  • Dropped the wrong tube in the trash, screwed up the third attempt at the aforementioned DNA extraction.
  • Kept a gel on the UV bench too long while trying to chop out little cubes with a razor blade, annihilated all the DNA, and screwed up fourth attempt at aforementioned DNA extraction.
  • The PCR cycler didn't close correctly, and my reaction tubes evaporated; screwing up fifth attempt at aforementioned DNA extraction. (At least this one wasn't my fault.)
I'm now spending the evening in the lab running everything over again, for the sixth time. Yays!

I'm definitely sticking to informatics -- that part of the rotation is going pretty well. I'm just not cut out for benchwork.

Google for bioinformatics

Posted by Russell on April 30, 2009 at 12:53 p.m.
Interesting thing of the day : If you plug the following nucleotide sequence into Google :
"gctagttaaa aaaggaaatt catacccaaa"
The only hit you will find is the Swine Flu genome. Google is a sequence homology tool!
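What a quoted search does here is presumably closer to exact string matching than to a true homology search: it finds the sequence only if the same run of bases appears verbatim. A minimal Python sketch of that idea follows; the function names are hypothetical and the subject string is made up, not a real genome.

```python
def normalize(seq):
    """Drop whitespace (such as the 10-base blocks above) and lowercase."""
    return "".join(seq.split()).lower()


def find_exact(query, subject):
    """Position of query in subject after normalizing both, or -1 if absent."""
    return normalize(subject).find(normalize(query))


query = "gctagttaaa aaaggaaatt catacccaaa"
# Made-up subject sequence that happens to contain the query.
subject = "ttt" + "GCTAGTTAAAAAAGGAAATTCATACCCAAA" + "ggg"

print(find_exact(query, subject))
```

A single mismatched base makes the exact search come back empty, which is where real homology tools, built to tolerate mismatches and gaps, earn their keep.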