Missirian V.,University of California at Davis |
Comai L.,Genome Center |
Filkov V.,University of California at Davis
BMC Bioinformatics | Year: 2011
Background: TILLING (Targeting induced local lesions IN genomes) is an efficient reverse genetics approach for detecting induced mutations in pools of individuals. Combined with the high-throughput of next-generation sequencing technologies, and the resolving power of overlapping pool design, TILLING provides an efficient and economical platform for functional genomics across thousands of organisms. Results: We propose a probabilistic method for calling TILLING-induced mutations, and their carriers, from high throughput sequencing data of overlapping population pools, where each individual occurs in two pools. We assign a probability score to each sequence position by applying Bayes' Theorem to a simplified binomial model of sequencing error and expected mutations, taking into account the coverage level. We test the performance of our method on variable quality, high-throughput sequences from wheat and rice mutagenized populations. Conclusions: We show that our method effectively discovers mutations in large populations with sensitivity of 92.5% and specificity of 99.8%. It also outperforms existing SNP detection methods in detecting real mutations, especially at higher levels of coverage variability across sequenced pools, and in lower quality short reads sequence data. The implementation of our method is available from: http://www.cs.ucdavis.edu/filkov/CAMBa/. © 2011 Missirian et al; licensee BioMed Central Ltd.
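The scoring idea in the Results paragraph above (Bayes' Theorem applied to a binomial model of sequencing error versus expected mutation, given the coverage) can be illustrated with a minimal Python sketch. This is not the CAMBa implementation: the function names, the error rate, the pool size, the prior, and the assumption that a heterozygous mutation in one of N pooled diploid individuals appears at read frequency 1/(2N) are all illustrative assumptions.

```python
from math import exp, lgamma, log


def log_binom_pmf(k, n, p):
    """Log of the binomial probability of k successes in n trials (0 < p < 1)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + k * log(p) + (n - k) * log(1 - p)


def mutation_posterior(k, n, error_rate=0.005, pool_size=8, prior=1e-4):
    """Posterior probability that a position carries a real mutation.

    k: non-reference reads at the position, n: coverage in this pool.
    error_rate, pool_size and prior are assumed values, not taken from the paper.
    """
    # Assumption: one heterozygous carrier among pool_size diploid individuals.
    mutant_freq = 1.0 / (2 * pool_size)
    log_mut = log(prior) + log_binom_pmf(k, n, mutant_freq)
    log_err = log(1 - prior) + log_binom_pmf(k, n, error_rate)
    # Normalise in log space to avoid underflow at high coverage.
    top = max(log_mut, log_err)
    w_mut = exp(log_mut - top)
    w_err = exp(log_err - top)
    return w_mut / (w_mut + w_err)
```

With these assumed parameters, a position with 15 non-reference reads out of 200 scores close to 1, while a position with a single non-reference read scores close to 0. Because the likelihoods depend on n, the same count of non-reference reads is judged differently at different coverage levels.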
Dong X.,University of Massachusetts Medical School |
Greven M.C.,University of Massachusetts Medical School |
Kundaje A.,Stanford University |
Djebali S.,Center for Genomic Regulation and |
And 7 more authors.
Genome Biology | Year: 2012
Background: Previous work has demonstrated that chromatin feature levels correlate with gene expression. The ENCODE project enables us to further explore this relationship using an unprecedented volume of data. Expression levels from more than 100,000 promoters were measured using a variety of high-throughput techniques applied to RNA extracted by different protocols from different cellular compartments of several human cell lines. ENCODE also generated the genome-wide mapping of eleven histone marks, one histone variant, and DNase I hypersensitivity sites in seven cell lines. Results: We built a novel quantitative model to study the relationship between chromatin features and expression levels. Our study not only confirms that the general relationships found in previous studies hold across various cell lines, but also makes new suggestions about the relationship between chromatin features and gene expression levels. We found that expression status and expression levels can be predicted by different groups of chromatin features, both with high accuracy. We also found that expression levels measured by CAGE are better predicted than by RNA-PET or RNA-Seq, and different categories of chromatin features are the most predictive of expression for different RNA measurement methods. Additionally, PolyA+ RNA is overall more predictable than PolyA- RNA among different cell compartments, and PolyA+ cytosolic RNA measured with RNA-Seq is more predictable than PolyA+ nuclear RNA, while the opposite is true for PolyA- RNA. Conclusions: Our study provides new insights into transcriptional regulation by analyzing chromatin features in different cellular contexts. © 2012 Dong et al.; licensee BioMed Central Ltd.
Parra G.,Genome Center |
Bradnam K.,Genome Center |
Rose A.B.,Genome Center |
Korf I.,Genome Center |
Korf I.,University of California at Davis
Nucleic Acids Research | Year: 2011
Introns in a wide range of organisms including plants, animals and fungi are able to increase the expression of the gene that they are contained in. This process of intron-mediated enhancement (IME) is most thoroughly studied in Arabidopsis thaliana, where it has been shown that enhancing introns are typically located near the promoter and are compositionally distinct from downstream introns. In this study, we perform a comprehensive comparative analysis of several sequenced plant genomes. We find that enhancing sequences are conserved in the multi-cellular plants but are either absent or unrecognizable in algae. IME signals are preferentially located towards the 5′-end of first introns but also appear to be enriched in 5′-UTRs and coding regions near the transcription start site. Enhancing introns are found most prominently in genes that are highly expressed in a wide range of tissues. Through site-directed mutagenesis in A. thaliana, we show that IME signals can be inserted or removed from introns to increase or decrease gene expression. Although we do not yet know the specific mechanism of IME, the predicted signals appear to be both functional and highly conserved. © 2011 The Author(s).
No statistical methods were used to predetermine sample size. The experiments were not randomized and the investigators were not blinded to allocation during experiments and outcome assessment. Sperm DNA from adult males was extracted for sequencing as described in Supplementary Note 2. A single male was used for each species to minimize the impact of heterozygosity on assembly. For Saccoglossus, approximately eightfold redundant random shotgun coverage (totalling 8.1 Gb) was obtained with Sanger dideoxy sequencing at the Baylor College of Medicine Genome Center, including 34,279 BAC ends and 459,052 fosmid ends. For Ptychodera, 1.3 Gb in Sanger shotgun sequences, 15.3 Gb in Roche 454 pyrosequence reads, and 52-Gb paired-end sequences with Illumina MiSeq, along with mate-pairs, were generated at the Okinawa Institute of Science and Technology Graduate University. More sequencing details are available in Supplementary Note 2. We assembled the Saccoglossus genome with Arachne (ref. 50), combined with BAC/fosmid pair information to produce the final assembly. This Saccoglossus assembly includes 7,282 total scaffold sequences spanning a total length of 758 Mb. The relatively modest nucleotide heterozygosity (0.5%) of S. kowalevskii, coupled with longer read lengths, enabled assembly of a single composite reference sequence. Half of the assembly is in scaffolds longer than 552 kb (the N50 scaffold length), and 82% of the assembled sequence is found in 1,602 scaffolds longer than 100 kb. For Ptychodera we used the Platanus (ref. 51) assembler. The resulting total scaffold length was 1,229 Mb, with half the assembly in scaffolds longer than 196 kb (N50 scaffold length). P. flava exhibited a notably higher heterozygosity (1.3% single nucleotide heterozygosity with frequent indels) than S. kowalevskii, presumably related to its pelagic dispersal and larger effective population size (ref. 52).
We therefore initially produced stringent separate assemblies of the two divergent haplotypes, and found that many scaffolds had a closely related second scaffold with ~94% BLASTN identity (over longer stretches, including indels). To avoid reporting both haplotypes at these loci, scaffolds with less than 6% divergence over at least 75% of their length were merged into a single haploid reference for comparative analysis. To further classify regions with ‘double’ depth and single haplotype regions we implemented a Hidden Markov Model classifier. We find that at least 63% of the initial Platanus assembly constitutes merged haplotypes. The inferred SNP rate for those regions is 1.3%, while for the remaining haplotype regions it is below 0.1%. Further details of assemblies are described in Supplementary Note 2. Transcriptome data for both species were used, along with homology-guided and ab initio methods, to predict protein-coding genes (Supplementary Note 3). For Saccoglossus, 8.6 million RNAseq reads were generated from 7 adult tissues and 15 developmental stages using Roche 454 sequencing, along with previously deposited ESTs in GenBank. For Ptychodera, extensive EST data from egg, blastulae, gastrulae, larvae, juveniles, adult proboscis, stomochord, and gills defining 34,159 cDNA clones (ref. 53), and 879,000 Roche/454 RNAseq reads from a mixed library of developmental stages (ref. 54) were used. The Saccoglossus genome was annotated using the JGI gene prediction pipeline (ref. 55), while Augustus (ref. 56) was used to produce gene models for Ptychodera. We find a total of 34,239 gene predictions for Saccoglossus (68% with transcript evidence) and 34,687 for Ptychodera (43% with transcript evidence), although these are overestimates of the true gene number due to fragmented gene predictions, mis-annotated repetitive sequences, and spurious predictions. As described in the main text, 18,000–19,000 gene models in each species have known annotations and/or orthologues in other species.
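The N50 scaffold lengths quoted for the two assemblies (552 kb for Saccoglossus, 196 kb for Ptychodera) follow the definition given in the text: half of the assembled sequence lies in scaffolds at least that long. A minimal Python sketch of the calculation is shown here; the function name and the toy input are illustrative and are not taken from the authors' pipeline.

```python
def n50(scaffold_lengths):
    """Return the N50: the length L such that scaffolds of length >= L
    together contain at least half of the total assembled bases."""
    ordered = sorted(scaffold_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no scaffold lengths given")


# Toy example with made-up lengths (not real assembly data).
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
toy_n50 = n50(toy_lengths)
```

Sorting from longest to shortest and accumulating lengths until half the total is reached makes the statistic robust to the large number of very short scaffolds typical of draft assemblies.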
Gene family clustering was done using a progressive (leaf to root) BLASTP-based clustering algorithm, where at a given phylogenetic node the gene families are constructed taking into account protein similarities among ingroups and outgroups (ref. 57). For the inference of deuterostome gene families we use the bilaterian node of the clustering. To call gene families present in the deuterostome ancestor, we required (1) at least two ambulacrarian orthologues out of the three available ambulacrarian genomes and at least two chordate orthologues, or (2) at least two deuterostomes (chordates and/or ambulacrarians) and two outgroups in the bilaterian level clusters. Repetitive sequences were identified using RepeatScout (ref. 58), followed by manual curation and annotation using both a Repbase release (version 20140131; ref. 59) and BLASTX-based search against a custom collection of transposons, using a previously described repeat identification and annotation pipeline (ref. 57) (Supplementary Note 5). The assemblies were then masked with RepeatMasker version open-4.0.5 (ref. 60). The repetitive complements of the two hemichordate genomes are summarized in Supplementary Table 5.1. Phylogenetic analyses were done using metazoan-level gene family clusters based on whole-genome sequences (Supplementary Note 4), selecting a single orthologue per genome with the best cumulative BLASTP to other species, and best reciprocal BLASTP hits to species with transcriptome-only information (Supplementary Note 6). Single gene alignments were built using Muscle (ref. 61) and filtered using Trimal (ref. 62) for each orthologue, and were concatenated, yielding a supermatrix of 506,428 positions with 34.9% missing data. This supermatrix was analysed with ExaML assuming a site-homogeneous LG+Γ model partitioned for each gene (ref. 63). A slow-fast analysis was conducted to stratify marker genes based on the length of the branch leading to acoels in individual trees.
A subset of the slowest 10% of genes was analysed with the site-heterogeneous CAT+GTR+Γ model using Phylobayes (ref. 24). Molecular dating was carried out with Phylobayes (ref. 24) using the log-normal relaxed clock model and the calibrations described in Supplementary Table 6.2. Macro- and micro-syntenic linkages were calculated as described in Supplementary Note 7. For Fig. 3a, we merged the amphioxus scaffolds into 17 pre-defined scaffold groups as suggested in ref. 27. These 17 merged scaffold groups represent the 17 ancestral linkage groups (ALGs) shared in chordates. Then we calculated the orthologous gene groups shared by each amphioxus ALG–Saccoglossus scaffold pair and generated the dot plot as described in Supplementary Note 7. For micro-synteny we required at least three genes (separated by a maximum of ten genes) to be present in pairwise comparisons. Under random reshuffling of the genome, this yields 10% false positives in pairwise genome comparisons, that is, we observe approximately one-tenth as many micro-syntenic blocks between the two genomes when gene orders are shuffled. This false-positive rate, however, falls to 1% when considering more than two species. For our inference of deuterostome ancestral and novel synteny we therefore focus on blocks present in at least three species (and both ingroup representatives, that is, ambulacrarians and chordates). This yields 698 blocks that can be traced back to the deuterostome ancestor, including 71 blocks found exclusively in deuterostome species (shared among ambulacrarians and chordates), among them the pharyngeal cluster discussed in Fig. 4. Whole-genome alignments were conducted with MEGABLAST (ref. 64) using parameters previously reported (ref. 65). We assessed the distribution of the resulting 12,722 aligned loci across known gene annotations in ENSEMBL (ref. 66), previously identified conserved pan-vertebrate elements (ref. 65), as well as known enhancers in human according to the LBL database (ref. 67).
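The micro-synteny criterion and the shuffling control described above (at least three genes, separated by at most ten intervening genes, with the false-positive rate estimated from shuffled gene orders) can be sketched in Python as follows. This is a deliberately simplified greedy version and not the procedure of Supplementary Note 7: it assumes a single scaffold per genome, one gene per family per genome, and hypothetical function names.

```python
import random


def count_blocks(order_a, order_b, min_genes=3, max_gap=10):
    """Count runs of shared gene families that stay close together in both
    gene orders. order_a and order_b are lists of gene family identifiers."""
    pos_b = {family: i for i, family in enumerate(order_b)}
    # Anchors: (position in A, position in B) for families found in both.
    anchors = [(i, pos_b[f]) for i, f in enumerate(order_a) if f in pos_b]
    blocks = 0
    chain = 0
    prev = None
    for a, b in anchors:
        close_in_a = prev is not None and a - prev[0] <= max_gap + 1
        close_in_b = prev is not None and abs(b - prev[1]) <= max_gap + 1
        if close_in_a and close_in_b:
            chain += 1
        else:
            if chain >= min_genes:
                blocks += 1
            chain = 1
        prev = (a, b)
    if chain >= min_genes:
        blocks += 1
    return blocks


def shuffled_fraction(order_a, order_b, trials=100, seed=0):
    """Mean block count after shuffling genome B, relative to the observed
    count: an empirical stand-in for the false-positive rate."""
    observed = count_blocks(order_a, order_b)
    if observed == 0:
        raise ValueError("no blocks observed in the unshuffled comparison")
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        shuffled = list(order_b)
        rng.shuffle(shuffled)
        total += count_blocks(order_a, shuffled)
    return total / trials / observed
```

A returned fraction near 0.1 would correspond to the roughly 10% pairwise false-positive rate reported in the text. Requiring a block in three or more species, as the authors do, would need an additional intersection step that this sketch omits.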
Deuterostome gene novelties were assessed initially through bilaterian gene clusters (Supplementary Note 10) by requiring at least two species on both the ambulacrarian and chordate sides to be present. The novelties were further automatically subdivided into four categories: G1 (gain type I), with no BLASTP hit outside of deuterostomes; G2 (gain type II), with a novel PFAM domain present only in deuterostomes; G3 (gain type III), having a novel PFAM combination unique to deuterostomes; and G4 (gain type IV), those that do not fall under any of the G1–3 categories and define novelties due to acceleration in the substitution rate on the deuterostome stem. To confirm the novel nature, especially for G4 novelties, we have constructed phylogenies for the members and non-deuterostome BLASTP hits (up to an e-value of 1 × 10^-20) using MAFFT-alignment-based FastTree calculations. The trees were assessed for the accelerated rate of evolution at the deuterostome stem (Supplementary Fig. 9.1.1). The final result is provided in the Supplementary Information. We examined in detail gene families found broadly in deuterostomes whose encoded peptides were readily alignable to microbial sequences but had no detectable similarity in non-deuterostome animals. Criteria for evaluation included: (1) the hemichordate gene matches microbial genes at least ten orders of magnitude in the e-value better than it matches sequences of non-deuterostome metazoans (most of the putative HGTs we describe have no non-deuterostome metazoan hit at all); (2) it has a defined genomic locus among bona fide metazoan genes; (3) it shares an exon–intron structure with genes of chordates and other ambulacraria; and (4) when a low bitscore match is found to a non-deuterostome metazoan sequence, that sequence is identified as containing different domains (domain structure according to CDD; ref. 68) and/or different exon–intron structure, implying dubious relatedness.
When phylogenetic trees are constructed for these HGT-candidate proteins, the trees contain numerous branches for microbial sequences and none for non-deuterostome metazoan sequences, or only very long branches for dubious relatives, and hence the trees differ greatly from the metazoan species tree, except within the deuterostome clade. Original data and code can be accessed at https://groups.oist.jp/molgenu.
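Criterion (1) of the HGT evaluation above, that the microbial match be at least ten orders of magnitude better in e-value than any non-deuterostome metazoan match, can be expressed as a small Python check. This is an illustrative sketch only: the function name, the handling of a missing metazoan hit, and the floor applied to e-values reported as 0 are assumptions, and criteria (2) to (4) are not covered.

```python
from math import log10

# Assumption: e-values reported as 0.0 are clamped to this floor before
# taking logarithms.
EVALUE_FLOOR = 1e-300


def passes_evalue_criterion(microbial_evalue, metazoan_evalue=None,
                            min_orders=10.0):
    """True if the microbial hit is at least min_orders orders of magnitude
    better than the best non-deuterostome metazoan hit.

    metazoan_evalue=None means no such hit was found, which the text
    describes as the most common situation for the putative HGTs.
    """
    if metazoan_evalue is None:
        return True
    microbial = log10(max(microbial_evalue, EVALUE_FLOOR))
    metazoan = log10(max(metazoan_evalue, EVALUE_FLOOR))
    return metazoan - microbial >= min_orders
```

For example, a microbial hit at 1e-50 against a metazoan hit at 1e-30 passes (a gap of about 20 orders of magnitude), whereas the same microbial hit against a metazoan hit at 1e-45 does not.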
News Article | November 2, 2015
Enterprise apps are becoming the norm in the workplace, but too many are unappealing to employees. Here are the most important items to consider if you want to build compelling internal mobile apps.