Minocherhomji S.,Copenhagen University |
Seemann S.,Center for Non coding in Technology and Health |
Seemann S.,Copenhagen University |
Mang Y.,Copenhagen University |
And 15 more authors.
Nucleic Acids Research | Year: 2012
The finished human genome assemblies comprise several hundred un-sequenced euchromatic gaps, which may be rich in long polypurine/polypyrimidine stretches. Human chromosome 20 (chr 20) currently has three unfinished gaps remaining on its q-arm. All three gaps are within gene-dense regions and/or overlap disease-associated loci, including the DLGAP4 locus. In this study, we sequenced ∼99% of all three unfinished gaps on human chr 20, determined their complete genomic sizes and assessed epigenetic profiles using a combination of Sanger sequencing, mate pair paired-end high-throughput sequencing and chromatin, methylation and expression analyses. We found histone 3 trimethylated at lysine 27 to be distributed across all three gaps in immortalized B-lymphocytes. In one gap, five novel CpG islands were predominantly hypermethylated in genomic DNA from peripheral blood lymphocytes and human cerebellum. One of these CpG islands was differentially methylated and paternally hypermethylated. We found all chr 20 gaps to comprise structured non-coding RNAs (ncRNAs) and to be conserved in primates. We verified expression for 13 candidate ncRNAs, some of which showed tissue specificity. Four ncRNAs expressed within the gap at DLGAP4 show elevated expression in the human brain. Our data suggest that unfinished human genome gaps are likely to comprise numerous functional elements. © The Author(s) 2012.
Will S.,Albert Ludwigs University of Freiburg |
Will S.,University of Leipzig |
Schmiedl C.,Albert Ludwigs University of Freiburg |
Miladi M.,Albert Ludwigs University of Freiburg |
And 3 more authors.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) | Year: 2013
Motivation: There is increasing evidence of pervasive transcription, resulting in hundreds of thousands of ncRNAs of unknown function. Standard computational analysis tasks for inferring functional annotations, like clustering, require fast and accurate RNA comparisons based on sequence and structure similarity. The gold standard for the latter is Sankoff's algorithm, which simultaneously aligns and folds RNAs. Because of its extreme time complexity of O(n^6), numerous faster "Sankoff-style" approaches have been suggested. Several such approaches introduce heuristics based on sequence alignment, which compromises the alignment quality for RNAs with sequence identities below 60%. Avoiding such heuristics, as e.g. in LocARNA, has been assumed to prohibit time complexities better than O(n^4), which strongly limits large-scale applications. © 2013 Springer-Verlag.
Velandia-Huerto C.A.,National University of Colombia |
Gittenberger A.A.,Leiden University |
Gittenberger A.A.,Naturalis Biodiversity Center |
Brown F.D.,University of Los Andes, Colombia |
And 8 more authors.
BMC Genomics | Year: 2016
Background: The colonial ascidian Didemnum vexillum, sea carpet squirt, is not only a key marine organism for studying ancestral morphological patterns of chordate evolution but is also of great ecological importance due to its status as a major invasive species. Non-coding RNAs, in particular microRNAs (miRNAs), are important regulatory genes that impact development and environmental adaptation. Beyond miRNAs, not much is known about tunicate ncRNAs. Results: We provide here a comprehensive homology-based annotation of non-coding RNAs in the recently sequenced genome of D. vexillum. To this end we employed a combination of several computational approaches, including blast searches with a wide range of parameters and a secondary-structure-centered survey with infernal. The resulting candidate set was curated extensively to produce a high-quality ncRNA annotation of the first draft of the D. vexillum genome. It comprises 57 miRNA families, 4 families of ribosomal RNAs, 22 isoacceptor classes of tRNAs (of which more than 72% of loci are pseudogenes), 13 snRNAs, 12 snoRNAs, and 1 other RNA family. Additionally, it contains 21 families of mitochondrial tRNAs, 2 families of mitochondrial ribosomal RNAs, and 1 long non-coding RNA. Conclusions: The comprehensive annotation of the D. vexillum non-coding RNAs provides a starting point towards a better understanding of the restructuring of the small RNA system in ascidians. Furthermore, it provides a valuable resource for efforts to establish detailed non-coding RNA annotations for other recently published and recently sequenced tunicate genomes. © 2016 The Author(s).
Indrischek H.,University of Leipzig |
Indrischek H.,The Interdisciplinary Center |
Wieseke N.,University of Leipzig |
Stadler P.F.,University of Leipzig |
And 8 more authors.
Algorithms for Molecular Biology | Year: 2016
Background: The accurate annotation of genes in newly sequenced genomes remains a challenge. Although sophisticated comparative pipelines are available, computationally derived gene models are often less than perfect. This is particularly true when multiple similar paralogs are present. The issue is aggravated further when genomes are assembled only at a preliminary draft level to contigs or short scaffolds. However, these genomes deliver valuable information for studying gene families. High accuracy models of protein coding genes are needed in particular for phylogenetics and for the analysis of gene family histories. Results: We present a pipeline, ExonMatchSolver, that is designed to help the user to produce and curate high quality models of the protein-coding part of genes. The tool in particular tackles the problem of identifying those coding exon groups that belong to the same paralogous genes in a fragmented genome assembly. This paralog-to-contig assignment problem is shown to be NP-complete. It is phrased and solved as an Integer Linear Programming problem. Conclusions: The ExonMatchSolver-pipeline can be employed to build highly accurate models of protein coding genes even when spanning several genomic fragments. This sets the stage for a better understanding of the evolutionary history within particular gene families which possess a large number of paralogs and in which frequent gene duplication events occurred. © 2016 Indrischek et al.
Velandia-Huerto C.A.,National University of Colombia |
Berkemer S.J.,Max Planck Institute for Mathematics in the Sciences |
Berkemer S.J.,The Interdisciplinary Center |
Hoffmann A.,The Interdisciplinary Center |
And 11 more authors.
BMC Genomics | Year: 2016
Background: Transfer RNAs (tRNAs) are ubiquitous in all living organisms. They implement the genetic code so that most genomes contain distinct tRNAs for almost all 61 codons. They behave similarly to mobile elements and proliferate in genomes, spawning both local and non-local copies. Most tRNA families are therefore typically present as multicopy genes. The members of the individual tRNA families evolve under concerted or rapid birth-death evolution, so that paralogous copies maintain almost identical sequences over long evolutionary time-scales. To a good approximation these are functionally equivalent. Individual tRNA copies thus are evolutionarily unstable and easily turn into pseudogenes and disappear. This leads to a rapid turnover of tRNAs and often large differences in the tRNA complements of closely related species. Since tRNA paralogs are not distinguished by sequence, common methods cannot be used to establish orthology between tRNA genes. Results: In this contribution we introduce a general framework to distinguish orthologs and paralogs in gene families that are subject to concerted evolution. It is based on the use of uniquely aligned adjacent sequence elements as anchors to establish syntenic conservation of sequence intervals. In practice, anchors and intervals can be extracted from genome-wide multiple sequence alignments. Syntenic clusters of concertedly evolving genes of different families can then be subdivided by list alignments, leading to usually small clusters of candidate co-orthologs. On the basis of recent advances in phylogenetic combinatorics, these candidate clusters can be further processed by cograph editing to recover their duplication histories. We developed a workflow that can be conceptualized as stepwise refinement of a graph of homologous genes. We apply this analysis strategy with different types of synteny anchors to investigate the evolution of tRNAs in primates and fruit flies. We identified a large number of tRNA remolding events concentrated at the tips of the phylogeny. With one notable exception, none of the phylogenetically old tRNA remoldings changes the isoacceptor class. Conclusions: Gene families evolving under concerted evolution are not amenable to classical phylogenetic analyses since paralogs maintain identical, species-specific sequences, precluding the estimation of correct gene trees from sequence differences. This leaves conservation of syntenic arrangements with respect to "anchor elements" that are not subject to concerted evolution as the only viable source of phylogenetic information. We have demonstrated here that a purely synteny-based analysis of tRNA gene histories is indeed feasible. Although the choice of synteny anchors influences the resolution, in particular when tight gene clusters are present, and the quality of sequence alignments, genome assemblies, and genome rearrangements limits the scope of the analysis, largely coherent results can be obtained for tRNAs. In particular, we conclude that a large fraction of the tRNAs are recent copies. This proliferation is compensated by rapid pseudogenization, as exemplified by many very recent alloacceptor remoldings. © 2016 The Author(s).
zu Siederdissen C.H.,University of Leipzig |
zu Siederdissen C.H.,University of Vienna |
zu Siederdissen C.H.,The Interdisciplinary Center |
Prohaska S.J.,University of Leipzig |
And 8 more authors.
BMC Bioinformatics | Year: 2015
Background: Dynamic programming algorithms provide exact solutions to many problems in computational biology, such as sequence alignment, RNA folding, hidden Markov models (HMMs), and scoring of phylogenetic trees. Structurally analogous algorithms compute optimal solutions, evaluate score distributions, and perform stochastic sampling. This is explained in the theory of Algebraic Dynamic Programming (ADP) by a strict separation of state space traversal (usually represented by a context free grammar), scoring (encoded as an algebra), and choice rule. A key ingredient in this theory is the use of yield parsers that operate on the ordered input data structure, usually strings or ordered trees. The computation of ensemble properties, such as a posteriori probabilities of HMMs or partition functions in RNA folding, requires the combination of two distinct, but intimately related algorithms, known as the inside and the outside recursion. Only the inside recursions are covered by the classical ADP theory. Results: The ideas of ADP are generalized to a much wider scope of data structures by relaxing the concept of parsing. This allows us to formalize the conceptual complementarity of inside and outside variables in a natural way. We demonstrate that outside recursions are generically derivable from inside decomposition schemes. In addition to rephrasing the well-known algorithms for HMMs, pairwise sequence alignment, and RNA folding we show how the TSP and the shortest Hamiltonian path problem can be implemented efficiently in the extended ADP framework. As a showcase application we investigate the ancient evolution of HOX gene clusters in terms of shortest Hamiltonian paths. Conclusions: The generalized ADP framework presented here greatly facilitates the development and implementation of dynamic programming algorithms for a wide spectrum of applications. © 2015 zu Siederdissen et al.
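The shortest Hamiltonian path problem mentioned in this abstract can be illustrated with the classic dynamic program over (visited set, last node) pairs. The following is a minimal Python sketch of that textbook recursion only; it is not the authors' ADP framework or their implementation, and the function name `shortest_hamiltonian_path` and the toy distance matrix are assumptions made for illustration.

```python
from itertools import combinations


def shortest_hamiltonian_path(dist):
    """Return (cost, path) of a cheapest path visiting every node exactly once.

    dist is a square matrix of pairwise costs. Start and end nodes are free.
    Hypothetical helper for illustration; not taken from the paper.
    """
    n = len(dist)
    if n == 0:
        return 0, []
    # best[(mask, last)] = (cost, predecessor) of the cheapest path that
    # visits exactly the nodes in the bitmask `mask` and ends at `last`.
    best = {(1 << v, v): (0, None) for v in range(n)}
    for size in range(2, n + 1):
        for subset in combinations(range(n), size):
            mask = sum(1 << v for v in subset)
            for last in subset:
                prev_mask = mask ^ (1 << last)
                # Extend every shorter path ending at p by the edge p -> last.
                best[(mask, last)] = min(
                    (best[(prev_mask, p)][0] + dist[p][last], p)
                    for p in subset
                    if p != last
                )
    full = (1 << n) - 1
    cost, last = min((best[(full, v)][0], v) for v in range(n))
    # Trace the stored predecessors back to recover one optimal path.
    path, mask = [], full
    while last is not None:
        path.append(last)
        prev = best[(mask, last)][1]
        mask ^= 1 << last
        last = prev
    path.reverse()
    return cost, path


# Toy symmetric distance matrix (made-up values).
toy = [[0, 1, 4],
       [1, 0, 2],
       [4, 2, 0]]
cost, path = shortest_hamiltonian_path(toy)
```

The table has O(2^n · n) entries and each is filled in O(n) time, so this sketch is only practical for small inputs such as a handful of genes in a cluster.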
Mork So.,Center for Non coding in Technology and Health |
Pletscher-Frankild S.,Novo Nordisk AS |
Caro A.P.,Novo Nordisk AS |
Gorodkin J.,Center for Non coding in Technology and Health |
Jensen L.J.,Novo Nordisk AS
Bioinformatics | Year: 2014
Motivation: MicroRNAs (miRNAs) are a highly abundant class of noncoding RNA genes involved in cellular regulation and thus also diseases. Despite miRNAs being important disease factors, miRNA-disease associations remain low in number and of variable reliability. Furthermore, existing databases and prediction methods do not explicitly facilitate forming hypotheses about the possible molecular causes of the association, thereby making the path to experimental follow-up longer. Results: Here we present miRPD in which miRNA-Protein-Disease associations are explicitly inferred. Besides linking miRNAs to diseases, it directly suggests the underlying proteins involved, which can be used to form hypotheses that can be experimentally tested. The inference of miRNAs and diseases is made by coupling known and predicted miRNA-protein associations with protein-disease associations text mined from the literature. We present scoring schemes that allow us to rank miRNA-disease associations inferred from both curated and predicted miRNA targets by reliability and thereby to create high- and medium-confidence sets of associations. Analyzing these, we find statistically significant enrichment for proteins involved in pathways related to cancer and type I diabetes mellitus, suggesting either a literature bias or a genuine biological trend. We show by example how the associations can be used to extract proteins for disease hypothesis. © The Author 2013.
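The coupling step described in this abstract, joining miRNA-protein associations with protein-disease associations through the shared protein, can be sketched as a simple join. This is an illustration under stated assumptions, not the miRPD method: the abstract does not specify the scoring scheme, so the product of the two scores used below is a placeholder, and the function name and all identifiers in the toy data are hypothetical.

```python
from collections import defaultdict


def infer_mirna_disease(mirna_protein, protein_disease):
    """Join two scored association lists on the shared protein.

    mirna_protein:   iterable of (mirna, protein, score)
    protein_disease: iterable of (protein, disease, score)
    Returns {(mirna, disease): [(protein, combined_score), ...]} with the
    supporting proteins sorted by descending combined score.
    """
    by_protein = defaultdict(list)
    for protein, disease, pd_score in protein_disease:
        by_protein[protein].append((disease, pd_score))

    inferred = defaultdict(list)
    for mirna, protein, mp_score in mirna_protein:
        for disease, pd_score in by_protein.get(protein, []):
            # ASSUMPTION: multiplying the scores is a placeholder, not the
            # scoring scheme presented in the paper.
            inferred[(mirna, disease)].append((protein, mp_score * pd_score))

    return {
        pair: sorted(support, key=lambda item: item[1], reverse=True)
        for pair, support in inferred.items()
    }


# Made-up identifiers and scores, for illustration only.
toy_mirna_protein = [("miR-A", "P1", 0.5), ("miR-A", "P2", 1.0), ("miR-B", "P3", 1.0)]
toy_protein_disease = [("P1", "disease X", 0.5), ("P2", "disease X", 0.5)]
associations = infer_mirna_disease(toy_mirna_protein, toy_protein_disease)
```

Keeping the linking proteins alongside each inferred miRNA-disease pair mirrors the point made in the abstract: the proteins themselves are the candidates for an experimentally testable hypothesis.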