понедельник, 14 апреля 2014 г.

On direct construction of de Bruijn graphs: the Sibelia way

Recently I saw a blog post about the SplitMEM paper. The paper introduces a way to construct compressed de Bruijn graphs without using an ordinary de Bruijn graph first and compressing it explicitly. Authors utilize the suffix tree framework to do this and introduce additional data structure called "suffix skip table", their method has time complexity O(nlogn). A similar method lies in the core of Sibelia, a synteny block finding tool that uses direct construction of de Bruijn graphs as well.

The method implemented in Sibelia is linear in time/space and uses suffix array and LCP array without additional data structures. We didn't describe it in the main paper due to the following reasons:
  1. Space limitations
  2. It is pretty straightforward
But appearance of the SplitMEM paper made me think that our ideas may be useful for other people, so I decided to cover the algorithm in a blog post.

For those, who are not familiar with de Bruijn graphs and their application in genomics, there is an excellent introduction here. So called compressed de Bruijn graph can be obtained from an ordinary graph by replacing all non-branching paths with single edges. Let's consider the de Bruijn graph constructed with k = 2 from the string S = "ACTACGTACGTAT":



And it's compressed (or condensed) version:


Obviously, the condensed form is more favourable -- you have to store less vertices and edges. The naive approach to obtain the condensed form it is to construct the ordinary graph and traverse it. However, one can construct it directly using the suffix array. Let's see how to do this.

Our goal is to construct the compressed de Bruijn graph G(S, k) given a string S and an integer k. If you compare two pictures, you may notice that vertices preserved in the condensed form either have at least two outgoing or at least two ingoing edges, let's call those vertices (and corresponding k-mers) bifurcations. The only exceptions are vertices corresponding to the first and the last k-mers of S. Thus, to identify the vertex set of G(S, k), we have to identify the set of bifurcation k-mers. Let's take a look at the suffix array SA(S), constructed from S:

ACGTACGTAT
ACGTAT
ACTACGTACGTAT
AT
CGTACGTAT
CGTAT
CTACGTACGTAT
GTACGTAT
GTAT
T
TACGTACGTAT
TACGTAT
TAT

You may notice two things:
  1. Suffixes that start with the same k-mer (highlighted by colouring) are grouped in the suffix array due to its sorted nature. Speaking in stringology terms, those suffixes have longest common prefix, or LCP of at least k
  2. We can easily enumerate outgoing edges by looking at (k + 1)-th characters of the suffixes (they are underscored). For example, the vertex "AC" has two edges: "ACG" and "ACT", while the vertex "CG" has only one edge "CGT"
Given these two properties we can find all vertices that have at least two outgoing edges: just look over bundles of suffixes that start with the same k-mer and consider their (k + 1)-th characters. To do this efficiently, we utilize the LCP array -- it stores the length of the longest common prefix between adjacent pairs of suffixes in the suffix array. For our example, the LCP array is the following:
12345678910111213
_521041030062
Each value represents the length of LCP between a suffix and a previous one; first value is absent since there is no previous suffix for the first one. Putting all together, we have the following algorithm:


The algorithm is linear in time and space:
  1. Both suffix and LCP arrays construction can be linear
  2. Our algorithm visits each suffix only once
  3. Maintaining a set of four DNA characters is trivial
Of course, you have to care about all ingoing edges as well, but it's easy: take a look not only at (k + 1)-th, but at the preceding character of a suffix as well. And don't forget about the first and the last k-mers of the string!

Given the vertex set of the condensed graph we then construct edge set. Let's write down all k-mers of S in the order that they occur in the string:

     AC CT TA AC CG GT TA AC CG GT TA AT

I highlighted with red k-mers belonging to the vertex set of the condensed graph. Notice that any sequence of k-mers (including empty ones) between a pair of bifurcations represents a simple path in the de Bruijn graph or, equivalently, an edge in the condensed graph (just look at the picture above). So we remove all non-bifurcation k-mers and treat each pair of consecutive k-mers generates as an edge:

      AC TA AC TA AC TA AT

After removing duplicates we have the following edges:

      (AC, TA) (TA, AC), (TA, AT)

One can easily generalize the method for multiple strings. Please note that condensation of a graph may generate multi-edges with different labels, I ignored this for the sake of simplicity.

In this post I described how to construct condensed de Bruijn graphs from suffix arrays in linear time and space. The algorithm is pretty simple in implementation: there are ready-made suffix and LCP arrays libraries available and the rest is trivial. However, I'm not sure whether this approach can compete with state-of-the-art de Bruijn graph construction algorithms. In Sibelia we mostly selected this way due to its simplicity and that unlike hash tables (another straightforward implementation) its complexity doesn't depend on the value of k used -- Sibelia works with pretty big k's, hundreds or even thousands.

Комментариев нет:

Отправить комментарий