Logarithmic Gap Costs Decrease Alignment Accuracy

The second chapter of my dissertation has now been published. It is freely available from the BMC Bioinformatics website. In this post I hope to provide you will a short overview of the research. The reference is

Cartwright RA (2006) Logarithmic gap costs decrease alignment accuracy. BMC Bioinformatics 7:527.

Our genomes evolve not only via point mutations (where one base changes into another base) but also via insertions and deletions. That is the addition or removal of bases. Collectively insertions and deletions are known as “indels”. Now several studies over the last fifteen years have found that the size of insertions and deletions follows a power law, i.e. a log-log plot of size versus frequency is linear. However, this observation has had little impact on bioinformatics for various reasons. Now why is this observation important? For starters, several scientists have proposed that sequences should be aligned using logarithmic gap costs (wk=a + b k + c ln k) instead of the standard affine gap costs (wk=a + b k).

Because in my first chapter I created a sequence simulation program, Dawg, that could simulate evolution using a power law of indel sizes, I felt that it was important to test this suggestion out. Specifically, whether the slowness of algorithms using logarithmic gap costs are offset by their improved accuracy. So that is exactly what I did for my second chapter, and the results were surprising.

Before I could begin testing logarithmic gap costs, I needed get a dataset of aligned sequences. So, using Dawg I generated 10,000 independent pairwise alignments using the following template.

Model = "JC"
Lambda = 0.15
GapModel = "PL"
GapParams = {1.5, 1000}
Length = 1000
Tree = (A:${BranchLength}):B;

In this template, BranchLength is a random variable drawn from an exponential distribution with mean of 0.2. I was thus able to average across a set of branch lengths that were consistent with the process of evolution.

Once I had these sequence pairs, I then aligned them using another tool that I had written, Ngila, which is the only tool out there capable of doing global alignments using logarithmic gap costs. (I am currently revising my Ngila paper after the first round of reviews.) After a couple of test runs, I decided to align each sequence pair 512 different ways, under the following setup.

Match Cost = 0
Mismatch Cost = 1
Gap of size k Cost = a+b*k+c*log(k)
{a,b,c} = {0, 0.125, 0.25, 0.5, 1, 2, 4, 8}

Once Ngila had found the best alignment for each pair using each of the 512 different gap costs, I calculated how accurate each of Ngila’s hypothesized alignments were to the real alignments given by Dawg. (See the paper for an explanation of my accuracy metric.) After about two weeks of running the alignments on my multiprocesser workstation, I had 5.12 million data points that I could use to determine how much better logarithmic gap cost are than affine gap costs.

So after two weeks of generating my data, what is the first thing that I did? I threw half of it away because R could not handle that much data. Oh well, c’est la vie.

I then analyzed my data by looking at three different gap cost schemes: Log-Affine, Affine, and Logarithmic. For each of these schemes, I was able to determine the gap costs that produced the most accurate alignments on average. They were as follows:

Log-Affine:   2 + 0.25 k + 0.5 ln k   (w/ 94.1% accuracy)
Affine:       4 + 0.25 k              (w/ 92.5% accuracy)
Logarithmic:  0.125 + 8 ln k          (w/ 68.7% accuracy)

This was quite a shock. Not only had I expected logarithmic gap costs to perform better than affine gap costs, I was also surprised to see such little difference between Affine and Log-Affine gap costs. The paper looks at the data a couple of other ways, but the same conclusion holds in each case. Log-Affine is the best followed closely by Affine, while Logarithmic gap costs do poorly.

Now I was at a loss for a long time on how to explain this discrepancy between our expectations and the real world. Eventually, I found a hint of it in a paper that I was looking at. For you see, when I picked my match and mismatch costs to be 0 and 1, I caused the best gap cost to change. This change introduces the unexpected linear component to the best gap cost.

Okay, so I had an idea where it came from, but how can I demonstrate it? In the appendix to my paper (which is not vestigial at all), I use a statistical model to derive an equation for the best gap cost based on the three evolutionary parameters that went into my simulations: the mean branch length between pairs (t = 0.2), the rate of indel formation (r = 0.15), and the slope parameter of the power law (z = 1.5). According to the model, the best gap cost should be

equation

Now what do I find when I substitute my values into this equation? That the gap cost should be wk= 1.69 + 0.23 k + 0.56 ln k. As I was working on my statistical model for gap costs, I really had no idea how it would come out. In fact, I was very worried that I’d have to explain some large discrepancy. Now that I have the theoretical prediction, how does that compare to the cost that I found earlier: wk= 2 + 0.25 k + 0.5 ln k. Eureka, Frankenstein’s monster is alive; my two equations, derived via different means, are extremely similar. I have to confess that I did a little cabbage patch when I finished my calculations.

So what have I shown with this research? 1) That in contrast to suggestions found in the literature, logarithmic gap costs do not provide accurate alignments. 2) That log-affine gap costs are in fact implied by the power-law distribution. 3) That affine-gap costs can be good approximations for log-affine gap costs. And 4) that it is possible to derive gap costs based on the average branch length, indel rate, and power-law parameter.

TrackBack URL for this entry: http://scit.us/cgi-bin/mt/mt-tb.fcgi/872.

Use KwickXML Formatting to markup your comments, acceptable tags: <b> <blockquote> <br> <code> <em> <email> <h1> <h2> <h3> <h4> <h5> <h6> <i> <li> <list> <ol> <p> <qref> <quote> <s> <strong> <sub> <sup> <u> <ul> <url>. You may need to refresh before you will see your comment.




Remember personal info?

  


Posted by slpage on January 19, 2007 8:23 PM

I prefer to do my alignments by eye :)

Posted by Tom on January 19, 2007 11:33 PM

Was there ever any valid justification for assigning logarithmic gap costs? It seems to me (a mere mortal molecular biologist, albeit one who passed through the UGA genetics program back in the Devonian) that, at a given position, either gaps occur or they don’t. This seems like a linear function, regardless of the length.

Posted by Reed A. Cartwright on January 20, 2007 1:40 AM

Tom wrote:

Was there ever any valid justification for assigning logarithmic gap costs? It seems to me (a mere mortal molecular biologist, albeit one who passed through the UGA genetics program back in the Devonian) that, at a given position, either gaps occur or they don’t. This seems like a linear function, regardless of the length.

(Comment # 35047)

Tom,

It is not that simple because a single evolutionary event will insert or remove residues in blocks. Therefore, a gap of length two does not have the same probability of occurring as a two gaps of length one.

Linear gap costs [cost(k) = b k] treat every position as if it is independent of every other position. That is a very bad assumption to make if you scoring gaps.

Several papers argue that gap sizes in nature follow a power-law distribution, f(k) = k^-z/Zeta(z). If we take the log of this, we can find the most likely alignment with an algorithm that uses addition instead of multiplication. Taking the negative log gives us the following cost for a gap of size k, cost(k) = z log k + log Zeta(z). This is rather straight forward save two points. 1) The probability of a gap occurring has to be factored into the cost, which is easily done and was never an issue. 2) Applying any transformations to the costs of substitutions will introduce a linear cost to the gap cost as well.

Posted by Reed A. Cartwright on January 20, 2007 1:41 AM

slpage wrote:

I prefer to do my alignments by eye :)

(Comment # 35045)

That is what Larry Moran said to me tonight at dinner. My response to him was to ask if he’d ever tried to align human and mice introns by hand.

Posted by David vun Kannon on January 20, 2007 9:43 AM

Given the results you got, wouldn’t a better title for this post be “Logarithmic Gap Costs Help, But Not Very Much”? It’s great that you put all the details and data that doesn’t fit expectations out in public. This is real science at work. I must read too much ID blogging, something that is not just PR is too refreshing.

Posted by Reed A. Cartwright on January 23, 2007 4:44 PM

David vun Kannon wrote:

Given the results you got, wouldn’t a better title for this post be “Logarithmic Gap Costs Help, But Not Very Much”?

(Comment # 35054)

No. Logarithmic gap costs do not help at all, unless they are paired with affine costs. My paper is the first to point out this missed fact. In the paper I distinguish between logarithmic and log-affine gap costs.

Besides, the existing title is rather catchy. I’ve had one reviewer on another paper actually do a virtual double-take when he saw that title.

Posted by slpage on January 24, 2007 11:52 AM

Reed A. Cartwright wrote:

slpage wrote:

I prefer to do my alignments by eye :)

(Comment # 35045)

That is what Larry Moran said to me tonight at dinner. My response to him was to ask if he’d ever tried to align human and mice introns by hand.

(Comment # 35049)

(grinning) Yeah, I hear ya. I once tried to align the complete mitochondrial genomes of 32 species by eye. It was actually easier than I had expected, at least initially. Then I hit the HV regions, and I put it on a back-burner…

I do usually use something like ClustalX to get an initial alignment when working with “new” data, but I always run through it by eye afterwards just to pick up the ‘errors’ and weird spots.

Do you think your work will be incorporated into an alignment program package at some point?

Posted by Reed A. Cartwright on January 25, 2007 5:41 AM

slpage wrote:

Do you think your work will be incorporated into an alignment program package at some point?

(Comment # 35068)

My Ngila app is currently under review. It can do log-affine alignments, but you have to calculate your own costs based on the above model. Maybe a future version will include the conversion from evolutionary parameters to alignment costs.