The second chapter of my dissertation has now been published. It is freely available from the BMC Bioinformatics website. In this post I hope to provide you with a short overview of the research. The reference is:

Cartwright RA (2006) Logarithmic gap costs decrease alignment accuracy. BMC Bioinformatics 7:527.

Our genomes evolve not only via point mutations (where one base changes into another) but also via insertions and deletions, i.e. the addition or removal of bases. Collectively, insertions and deletions are known as “indels”. Several studies over the last fifteen years have found that the sizes of insertions and deletions follow a power law, i.e. a log-log plot of size versus frequency is linear. However, this observation has had little impact on bioinformatics, for various reasons. Why is it important? For starters, several scientists have proposed that sequences should be aligned using logarithmic gap costs (w_{k} = a + b k + c ln k) instead of the standard affine gap costs (w_{k} = a + b k).
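
To make the power-law claim concrete, here is a short Python sketch (my own illustration, not code from the paper or from Dawg) that samples indel sizes with P(k) proportional to k^(-z), truncated at a maximum size. The slope z = 1.5 and cutoff 1000 match the simulation parameters I describe below.

```python
import bisect
import math
import random

Z, MAX_SIZE = 1.5, 1000  # power-law slope and truncation point

# Cumulative weights for P(k) proportional to k**(-Z), k = 1..MAX_SIZE.
cum = []
total = 0.0
for k in range(1, MAX_SIZE + 1):
    total += k ** (-Z)
    cum.append(total)

def sample_indel_size(rng=random):
    """Draw an indel size by inverse-CDF lookup on the cumulative weights."""
    return bisect.bisect_left(cum, rng.random() * total) + 1

random.seed(1)
sizes = [sample_indel_size() for _ in range(10000)]
# A log-log plot of size versus frequency for these draws is roughly
# linear with slope -Z.
```

Tabulating the draws and plotting log-frequency against log-size recovers a line with slope close to -1.5, which is what “power law” means here.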

Because in my first chapter I had created a sequence simulation program, Dawg, that could simulate evolution using a power law of indel sizes, I felt that it was important to test this suggestion: specifically, whether the slowness of algorithms using logarithmic gap costs is offset by their improved accuracy. So that is exactly what I did for my second chapter, and the results were surprising.

Before I could begin testing logarithmic gap costs, I needed to get a dataset of aligned sequences. So, using Dawg, I generated 10,000 independent pairwise alignments from the following template.

```
Model     = "JC"
Lambda    = 0.15
GapModel  = "PL"
GapParams = {1.5, 1000}
Length    = 1000
Tree      = (A:${BranchLength}):B;
```

In this template, BranchLength is a random variable drawn from an exponential distribution with a mean of 0.2. I was thus able to average across a set of branch lengths that were consistent with the process of evolution.
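
As a sketch of that step (my own illustration; `MEAN_T` is a name I made up), each replicate's branch length can be drawn like this. Note that Python's `expovariate` takes a *rate*, the reciprocal of the mean, which is an easy thing to get wrong:

```python
import random

MEAN_T = 0.2  # mean branch length between each pair of sequences

random.seed(2)
# expovariate expects a rate parameter, i.e. 1 / mean.
branch_lengths = [random.expovariate(1 / MEAN_T) for _ in range(10000)]
```

One such draw would fill in `${BranchLength}` for each of the 10,000 simulated pairs.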

Once I had these sequence pairs, I then aligned them using another tool that I had written, Ngila, which is the only tool out there capable of doing global alignments using logarithmic gap costs. (I am currently revising my Ngila paper after the first round of reviews.) After a couple of test runs, I decided to align each sequence pair 512 different ways, under the following setup.

```
Match Cost    = 0
Mismatch Cost = 1
Cost of a gap of size k = a + b*k + c*log(k)
a, b, c ∈ {0, 0.125, 0.25, 0.5, 1, 2, 4, 8}
```
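
Since each of a, b, and c independently ranges over those eight values, there are 8³ = 512 gap-cost combinations. A minimal sketch of enumerating them (my own illustration; `gap_cost` is a hypothetical helper, not part of Ngila):

```python
import itertools
import math

VALUES = [0, 0.125, 0.25, 0.5, 1, 2, 4, 8]

# One (a, b, c) triple per gap-cost scheme: 8 ** 3 == 512 combinations.
gap_costs = list(itertools.product(VALUES, repeat=3))

def gap_cost(a, b, c, k):
    """Cost of a single gap of size k under w_k = a + b*k + c*ln(k)."""
    return a + b * k + c * math.log(k)
```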

Once Ngila had found the best alignment for each pair under each of the 512 different gap costs, I calculated how accurate each of Ngila’s hypothesized alignments was relative to the true alignments given by Dawg. (See the paper for an explanation of my accuracy metric.) After about two weeks of running the alignments on my multiprocessor workstation, I had 5.12 million data points that I could use to determine how much better logarithmic gap costs are than affine gap costs.
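
The paper spells out the exact accuracy metric; purely for illustration, here is one simple notion of pairwise alignment accuracy, the fraction of truly aligned residue pairs that the hypothesized alignment recovers. This is my own sketch and not necessarily the metric used in the paper.

```python
def aligned_pairs(row_a, row_b):
    """Residue-index pairs (i, j) aligned to each other in a gapped alignment."""
    pairs, i, j = set(), 0, 0
    for x, y in zip(row_a, row_b):
        if x != "-" and y != "-":
            pairs.add((i, j))
        if x != "-":
            i += 1
        if y != "-":
            j += 1
    return pairs

def accuracy(true_a, true_b, test_a, test_b):
    """Fraction of the true alignment's residue pairs the test alignment recovers."""
    truth = aligned_pairs(true_a, true_b)
    guess = aligned_pairs(test_a, test_b)
    return len(truth & guess) / len(truth) if truth else 1.0
```

Under this notion, `accuracy("A-C", "ABC", "AC-", "ABC")` is 0.5: only one of the two true residue pairings survives in the test alignment.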

So after two weeks of generating my data, what was the first thing I did? I threw half of it away, because R could not handle that much data. Oh well, *c’est la vie*.

I then analyzed my data by looking at three different gap cost schemes: Log-Affine, Affine, and Logarithmic. For each of these schemes, I was able to determine the gap costs that produced the most accurate alignments on average. They were as follows:

- Log-Affine: w_{k} = 2 + 0.25 k + 0.5 ln k (94.1% accuracy)
- Affine: w_{k} = 4 + 0.25 k (92.5% accuracy)
- Logarithmic: w_{k} = 0.125 + 8 ln k (68.7% accuracy)
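
To see how differently these three optima behave, here is a quick sketch (my own, using the best-fit values above) evaluating each cost over a range of gap sizes. The affine and log-affine costs stay within a few units of each other even for very long gaps, while the logarithmic cost grows so slowly that it barely penalizes long gaps at all:

```python
import math

# Best-fit gap costs for each scheme, taken from the results above.
def log_affine(k):
    return 2 + 0.25 * k + 0.5 * math.log(k)

def affine(k):
    return 4 + 0.25 * k

def logarithmic(k):
    return 0.125 + 8 * math.log(k)

for k in (1, 10, 100, 1000):
    print(f"k={k:>4}  log-affine={log_affine(k):8.2f}  "
          f"affine={affine(k):8.2f}  logarithmic={logarithmic(k):8.2f}")
```

At k = 1000 the logarithmic cost is only about a fifth of the other two, which helps explain why it produces such poor alignments: it makes enormous gaps far too cheap.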

This was quite a shock. Not only had I expected logarithmic gap costs to perform better than affine gap costs, but I was also surprised to see so little difference between affine and log-affine gap costs. The paper looks at the data in a couple of other ways, but the same conclusion holds in each case: log-affine is the best, followed closely by affine, while logarithmic gap costs do poorly.

For a long time I was at a loss to explain this discrepancy between our expectations and the real world. Eventually, I found a hint of it in a paper that I was looking at: when I picked my match and mismatch costs to be 0 and 1, I caused the best gap cost to change. This change introduces the unexpected linear component into the best gap cost.

Okay, so I had an idea where it came from, but how could I demonstrate it? In the appendix to my paper (which is not vestigial at all), I use a statistical model to derive an equation for the best gap cost based on the three evolutionary parameters that went into my simulations: the mean branch length between pairs (t = 0.2), the rate of indel formation (r = 0.15), and the slope parameter of the power law (z = 1.5). According to the model, the best gap cost is a log-affine function of these parameters; the full equation is given in the paper's appendix.

Now what do I find when I substitute my values into this equation? That the gap cost should be w_{k} = 1.69 + 0.23 k + 0.56 ln k. As I was working on my statistical model for gap costs, I really had no idea how it would come out; in fact, I was worried that I would have to explain some large discrepancy. So how does the theoretical prediction compare to the cost that I found earlier, w_{k} = 2 + 0.25 k + 0.5 ln k? Eureka, Frankenstein’s monster is alive; my two equations, derived via different means, are extremely similar. I have to confess that I did a little cabbage patch when I finished my calculations.
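
One way to see just how close the agreement is (my own check, not from the paper): the empirical search was restricted to the grid {0, 0.125, 0.25, 0.5, 1, 2, 4, 8}, and each theoretical coefficient rounds to exactly the grid value the search selected:

```python
# Grid of values searched for each gap-cost coefficient.
GRID = [0, 0.125, 0.25, 0.5, 1, 2, 4, 8]

def nearest(x):
    """Grid value closest to x."""
    return min(GRID, key=lambda g: abs(g - x))

theory = {"a": 1.69, "b": 0.23, "c": 0.56}  # model prediction
search = {"a": 2, "b": 0.25, "c": 0.5}      # best cost found empirically

rounded = {name: nearest(value) for name, value in theory.items()}
# rounded equals search: the prediction lands on the same grid points
# that the 512-way search picked out.
```

In other words, the theoretical equation predicts the empirical optimum as well as the granularity of the search allows.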

So what have I shown with this research?

1. That, in contrast to suggestions found in the literature, logarithmic gap costs do not provide accurate alignments.
2. That log-affine gap costs are in fact implied by the power-law distribution of indel sizes.
3. That affine gap costs can be good approximations of log-affine gap costs.
4. That it is possible to derive gap costs from the average branch length, indel rate, and power-law parameter.
