Li D.,University of Hong Kong | Luo R.,University of Hong Kong | Liu C.-M.,L3 Bioinformatics Ltd | Leung C.-M.,University of Hong Kong | And 4 more authors.
Methods | Year: 2016

Metagenomics has benefited greatly from low-cost, high-throughput sequencing technologies, yet the tremendous amount of data generated makes analyses such as de novo assembly consume too many computational resources. In late 2014 we released MEGAHIT v0.1 (together with a brief note, Li et al. (2015) [1]), the first NGS metagenome assembler that can assemble genome sequences from metagenomic datasets of hundreds of giga base-pairs (bp) in a time- and memory-efficient manner on a single server. The core of MEGAHIT is an efficient parallel algorithm for constructing succinct de Bruijn graphs (SdBG), implemented on a graphics processing unit (GPU). The software has been well received by the assembly community, and there is interest in how to adapt the algorithms to integrate popular assembly practices so as to improve assembly quality, as well as in how to speed up the software using better CPU-based algorithms (instead of GPU). In this paper we first describe the details of the core algorithms in MEGAHIT v0.1, and then present the new modules that upgrade MEGAHIT to version v1.0, which gives better assembly quality, runs faster and uses less memory. For the Iowa Prairie Soil dataset (252 Gbp after quality trimming), the assembly quality of MEGAHIT v1.0 improves significantly over v0.1, namely a 36% increase in assembly size and a 23% increase in N50. More interestingly, MEGAHIT v1.0 is no slower than before, even when running with the extra modules. This is primarily due to a new CPU-based algorithm for SdBG construction that is faster and requires less memory. Using CPU only, MEGAHIT v1.0 can assemble the Iowa Prairie Soil sample in about 43 h, reducing the running time of v0.1 by at least 25% and memory usage by up to 50%. With its smaller memory footprint, MEGAHIT v1.0 can process even larger datasets. The Kansas Prairie Soil sample (484 Gbp), the largest publicly available dataset, can now be assembled using no more than 500 GB of memory in 7.5 days. The assemblies of these datasets (and other large metagenomic datasets), as well as the software, are available at the website https://hku-bal.github.io/megabox. © 2016 Elsevier Inc.
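
As an illustration of the data structure the abstract refers to, the following is a minimal sketch of a plain (non-succinct) de Bruijn graph built from reads. It is not MEGAHIT's algorithm: MEGAHIT constructs a succinct representation in parallel, whereas this sketch simply stores every edge in a Python dictionary. The function name `build_de_bruijn`, the toy reads and the value of `k` are assumptions made for the example.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mers that follow it.

    Illustrative only: every k-mer in every read contributes one edge
    from its (k-1)-length prefix to its (k-1)-length suffix.
    """
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of width k across the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Hypothetical toy reads that overlap on "GTAC".
reads = ["ACGTAC", "GTACGG"]
graph = build_de_bruijn(reads, k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

In this toy graph the node `ACG` has two successors (`CGG` and `CGT`), which is the kind of branching an assembler has to resolve when it turns graph paths into contigs.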

Mai H.,Hong Kong University of Science and Technology | Li D.,Hong Kong University of Science and Technology | Zhang Y.,Hong Kong University of Science and Technology | Leung H.C.-M.,Hong Kong University of Science and Technology | And 5 more authors.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) | Year: 2016

Speeding up the alignment of DNA reads or assembled contigs against a protein database remains a challenge. The recent tool DIAMOND has significantly improved on the speed of BLASTX and RAPSearch while giving a similar degree of sensitivity. Yet for applications like metagenomics, where a large amount of data is involved, DIAMOND still takes a long time. This paper introduces an even faster protein alignment tool, called AC-DIAMOND, which attempts to speed up DIAMOND via better SIMD parallelization and more space-efficient indexing of the reference database; the latter allows more queries to be loaded into memory and processed together. Experimental results show that AC-DIAMOND is about 4 times faster than DIAMOND at aligning DNA reads or contigs, while retaining the same sensitivity as DIAMOND. For example, the latest assembly of the Iowa prairie soil metagenomic dataset generates over 9 million contigs, with a total size of about 7 Gbp; when aligning these contigs to the protein database NCBI-nr, DIAMOND takes 4 to 5 days, and AC-DIAMOND takes about 1 day. AC-DIAMOND is available for testing at http://ac-diamond.sourceforge.net. © Springer International Publishing Switzerland 2016.

Ou M.,University of Hong Kong | Ma R.,L3 Bioinformatics Ltd | Cheung J.,L3 Bioinformatics Ltd | Lo K.,L3 Bioinformatics Ltd | And 9 more authors.
Bioinformatics | Year: 2015

Rapid advances in next-generation sequencing technology have led to the integration of genetic information with clinical care. The genetic basis of diseases and of drug response provides new ways of diagnosing disease and using drugs more safely. This integration reveals the urgent need for effective and accurate tools to analyze genetic variants. Due to the number and diversity of annotation sources, automating variant analysis is a challenging task. Here, we present database.bio, a web application that combines variant annotation, prioritization and visualization so as to support insight into individual genetic characteristics. It enhances annotation speed by preprocessing data on a supercomputer, and reduces database space via a unified database representation with compressed fields. © The Author 2015. Published by Oxford University Press. All rights reserved.

Li D.,University of Hong Kong | Liu C.-M.,L3 Bioinformatics Ltd | Luo R.,L3 Bioinformatics Ltd | Sadakane K.,L3 Bioinformatics Ltd | And 3 more authors.
Bioinformatics | Year: 2015

Summary: MEGAHIT is an NGS de novo assembler for assembling large and complex metagenomics data in a time- and cost-efficient manner. It finished assembling a soil metagenomics dataset of 252 Gbp in 44.1 h and 99.6 h on a single computing node with and without a graphics processing unit, respectively. MEGAHIT assembles the data as a whole, i.e. no pre-processing such as partitioning or normalization is needed. When compared with previous methods on assembling the soil data, MEGAHIT generated an assembly three times larger, with longer contig N50 and average contig length; furthermore, 55.8% of the reads were aligned to the assembly, a fourfold improvement. © The Author 2015. Published by Oxford University Press.
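
The contig N50 and average contig length mentioned above can be computed from a list of contig lengths. The sketch below uses the common definition of N50 (the length of the shortest contig among the longest contigs that together cover at least half of the total assembly size); the function name `n50` and the toy lengths are assumptions for illustration and do not come from the MEGAHIT assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (in bp) for a tiny assembly.
contig_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
assembly_n50 = n50(contig_lengths)
average_length = sum(contig_lengths) / len(contig_lengths)
print(assembly_n50, average_length)
```

For these toy lengths the total is 54 bp, and the three longest contigs (10, 9 and 8) already cover 27 bp, so the N50 is 8 while the average length is 6.0.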
