Scaffolding (bioinformatics)

From Wikipedia, the free encyclopedia


Scaffolding is a technique used in bioinformatics. It is defined as follows:[1]

Link together a non-contiguous series of genomic sequences into a scaffold, consisting of sequences separated by gaps of known length. The sequences that are linked are typically contiguous sequences corresponding to read overlaps.

When creating a draft genome, individual reads of DNA are first assembled into contigs, which, by the nature of their assembly, have gaps between them. The next step is to bridge the gaps between these contigs to create a scaffold.[2] This can be done using either optical mapping or mate-pair sequencing.[3]

The sequencing of the Haemophilus influenzae genome marked the advent of scaffolding. That project generated a total of 140 contigs, which were oriented and linked using paired-end reads. The success of this strategy prompted The Institute for Genomic Research to develop the scaffolding program Grouper for their other sequencing projects. Until 2001, Grouper was the only stand-alone scaffolding software.[4] After the Human Genome Project and Celera proved that it was possible to create a large draft genome, several similar programs were created. Bambus, created in 2003, was a rewrite of the original Grouper software that also gave researchers the ability to adjust scaffolding parameters.[4] It additionally allowed the optional use of other linking data, such as contig order in a reference genome.

Algorithms used by assembly software are very diverse, and can be classified as either iterative marker ordering or graph-based. Graph-based applications can order and orient over 10,000 markers, whereas iterative marker ordering applications handle a maximum of 3,000 markers.[5] Algorithms can be further classified as greedy or non-greedy, and as conservative or non-conservative. Bambus uses a greedy algorithm, so called because it joins the contigs with the most links first. The algorithm used by Bambus 2 removes repetitive contigs before orienting and ordering the remainder into scaffolds. SSPACE also uses a greedy algorithm, which begins building its first scaffold with the longest contig provided by the sequence data. SSPACE is the most commonly cited assembly tool in biology publications, likely because it is rated as significantly more intuitive to install and run than other assemblers.[6]
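The greedy idea of joining the contigs with the most links first can be sketched in a few lines of Python. This is a simplified illustration, not the actual Bambus or SSPACE implementation: it assumes each link is just a count of read pairs connecting two contigs, and it ignores strand orientation, gap size estimation and repeat handling. The function name greedy_scaffold and the input format are hypothetical.

```python
def greedy_scaffold(contigs, links):
    """Greedily chain contigs, processing the best-supported links first.

    contigs: list of contig names (hypothetical input format).
    links: dict mapping (contig_a, contig_b) to the number of supporting read pairs.
    Returns a list of scaffolds, each a list of contig names in order.
    """
    contigs = list(contigs)
    chain_of = {c: [c] for c in contigs}  # contig -> the chain (list) that holds it
    for (a, b), _count in sorted(links.items(), key=lambda kv: -kv[1]):
        chain_a, chain_b = chain_of[a], chain_of[b]
        if chain_a is chain_b:
            continue  # already in the same scaffold; joining would form a cycle
        # A contig can only be extended while it sits at an end of its chain.
        if a not in (chain_a[0], chain_a[-1]) or b not in (chain_b[0], chain_b[-1]):
            continue
        if chain_a[-1] != a:
            chain_a.reverse()  # put a at the tail of its chain
        if chain_b[0] != b:
            chain_b.reverse()  # put b at the head of its chain
        chain_a.extend(chain_b)
        for c in chain_b:
            chain_of[c] = chain_a
    scaffolds, seen = [], set()
    for c in contigs:
        chain = chain_of[c]
        if id(chain) not in seen:
            seen.add(id(chain))
            scaffolds.append(chain)
    return scaffolds


# Toy example: the B-C link (9 pairs) is joined first, then A-B (5 pairs).
links = {("A", "B"): 5, ("B", "C"): 9, ("A", "C"): 2, ("C", "D"): 1}
print(greedy_scaffold(["A", "B", "C", "D"], links))
```

Because a greedy method commits to the strongest link immediately, a weaker link that conflicts with an earlier join (such as A-C in the toy data) is simply skipped rather than reconsidered.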

In recent years, new kinds of assemblers have emerged that can integrate linkage data from multiple types of linkage maps. ALLMAPS was the first such program and can combine data from genetic maps, created using SNPs or recombination data, with physical maps such as optical or synteny maps.[7]

Some software, such as ABySS and SOAPdenovo 1 and 2,[8] contains gap-filling algorithms which, although they do not create any new scaffolds, decrease the gap length between contigs of individual scaffolds. Standalone programs such as GapFiller and TGS Gap Closer[9] can close a larger number of gaps, while using less memory, than the gap-filling algorithms contained within assembly programs.[10]

Utturkar et al. investigated the utility of several different assembly software packages in combination with hybrid sequence data. They concluded that the ALLPATHS-LG and SPAdes algorithms were superior to other assemblers in terms of the number of, maximum length of, and N50 length of contigs and scaffolds.[11]
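For readers unfamiliar with the N50 statistic mentioned above, the following Python sketch computes it under the common definition: the length L such that contigs (or scaffolds) of length at least L together contain at least half of the total assembled bases. The function name n50 and the example lengths are assumptions made for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Total is 290 bases; the two longest contigs (80 + 70 = 150) cover half of it.
print(n50([80, 70, 50, 40, 30, 20]))  # prints 70
```

A higher N50 generally indicates a more contiguous assembly, although it says nothing about whether the joins are correct.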

