Cold Spring Harbor, United States

Fortunato S.A.V.,University of Bergen | Adamski M.,University of Bergen | Ramos O.M.,University of St. Andrews | Ramos O.M.,Stanley Institute for Cognitive Genomics | And 5 more authors.
Nature | Year: 2014

Sponges are simple animals with few cell types, but their genomes paradoxically contain a wide variety of developmental transcription factors, including homeobox genes belonging to the Antennapedia (ANTP) class, which in bilaterians encompass Hox, ParaHox and NK genes. In the genome of the demosponge Amphimedon queenslandica, no Hox or ParaHox genes are present, but NK genes are linked in a tight cluster similar to the NK clusters of bilaterians. It has been proposed that Hox and ParaHox genes originated from NK cluster genes after divergence of sponges from the lineage leading to cnidarians and bilaterians. On the other hand, synteny analysis lends support to the notion that the absence of Hox and ParaHox genes in Amphimedon is a result of secondary loss (the ghost locus hypothesis). Here we analysed complete suites of ANTP-class homeoboxes in two calcareous sponges, Sycon ciliatum and Leucosolenia complicata. Our phylogenetic analyses demonstrate that these calcisponges possess orthologues of bilaterian NK genes (Hex, Hmx and Msx), a varying number of additional NK genes and one ParaHox gene, Cdx. Despite the generation of scaffolds spanning multiple genes, we find no evidence of clustering of Sycon NK genes. All Sycon ANTP-class genes are developmentally expressed, with patterns suggesting their involvement in cell type specification in embryos and adults, metamorphosis and body plan patterning. These results demonstrate that ParaHox genes predate the origin of sponges, thus confirming the ghost locus hypothesis, and highlight the need to analyse the genomes of multiple sponge lineages to obtain a complete picture of the ancestral composition of the first animal genome. ©2014 Macmillan Publishers Limited. All rights reserved.


Yao J.,Stanley Institute for Cognitive Genomics | Yao J.,Merck And Co. | Zhang K.X.,Howard Hughes Medical Institute | Kramer M.,Stanley Institute for Cognitive Genomics | And 2 more authors.
Bioinformatics | Year: 2014

FamAnn is an automated variant annotation pipeline designed for facilitating target discovery for family-based sequencing studies. It can apply a different inheritance pattern or a de novo mutations discovery model to each family and select single nucleotide variants and small insertions and deletions segregating in each family or shared by multiple families. It also provides a variety of variant annotations and retains and annotates all transcripts hit by a single variant. Excel-compatible outputs including all annotated variants segregating in each family or shared by multiple families will be provided for users to prioritize variants based on their customized thresholds. A list of genes that harbor the segregating variants will be provided as well for possible pathway/network analyses. FamAnn uses the de facto community standard Variant Call Format as the input format and can be applied to whole exome, genome or targeted resequencing data. © 2014 The Author.
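The de novo discovery model mentioned in the abstract can be illustrated with a minimal sketch. This is not FamAnn's code: the function names and the dictionary record layout are hypothetical, and the only assumption taken from the source is that genotypes come from VCF-style GT strings.

```python
from typing import Dict, List, Set


def alleles(genotype: str) -> Set[str]:
    """Split a VCF GT string such as '0/1' or '1|0' into its allele codes."""
    return set(genotype.replace("|", "/").split("/"))


def is_de_novo(child: str, father: str, mother: str) -> bool:
    """True when the child carries a non-reference allele seen in neither parent."""
    c, f, m = alleles(child), alleles(father), alleles(mother)
    if "." in c | f | m:  # any missing call makes the site uninformative
        return False
    return bool({a for a in c if a != "0"} - f - m)


def de_novo_variants(variants: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep records (hypothetical dicts with child/father/mother GT fields) that pass."""
    return [v for v in variants if is_de_novo(v["child"], v["father"], v["mother"])]
```

A real pipeline would also weigh read depth and genotype quality before calling a variant de novo; the sketch checks only the genotype pattern.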


Ballouz S.,Stanley Institute for Cognitive Genomics | Gillis J.,Stanley Institute for Cognitive Genomics
PLoS Computational Biology | Year: 2016

In addition to detecting novel transcripts and higher dynamic range, a principal claim for RNA-sequencing has been greater replicability, typically measured in sample-sample correlations of gene expression levels. Through a re-analysis of ENCODE data, we show that replicability of transcript abundances will provide misleading estimates of the replicability of conditional variation in transcript abundances (i.e., most expression experiments). Heuristics which implicitly address this problem have emerged in quality control measures to obtain ‘good’ differential expression results. However, these methods involve strict filters such as discarding low expressing genes or using technical replicates to remove discordant transcripts, and are costly or simply ad hoc. As an alternative, we model gene-level replicability of differential activity using co-expressing genes. We find that sets of housekeeping interactions provide a sensitive means of estimating the replicability of expression changes, where the co-expressing pair can be regarded as pseudo-replicates of one another. We model the effects of noise that perturbs a gene’s expression within its usual distribution of values and show that perturbing expression by only 5% within that range is readily detectable (AUROC~0.73). We have made our method available as a set of easily implemented R scripts. © 2016 Ballouz, Gillis.


Verleyen W.,Stanley Institute for Cognitive Genomics | Ballouz S.,Stanley Institute for Cognitive Genomics | Gillis J.,Stanley Institute for Cognitive Genomics
Bioinformatics | Year: 2016

Motivation: Gene networks have become a central tool in the analysis of genomic data but are widely regarded as hard to interpret. This has motivated a great deal of comparative evaluation and research into best practices. We explore the possibility that this may lead to overfitting in the field as a whole. Results: We construct a model of 'research communities' sampling from real gene network data and machine learning methods to characterize performance trends. Our analysis reveals an important principle limiting the value of replication, namely that targeting it directly causes 'easy' or uninformative replication to dominate analyses. We find that when sampling across network data and algorithms with similar variability, the relationship between replicability and accuracy is positive (Spearman's correlation, rs ∼ 0.33) but where no such constraint is imposed, the relationship becomes negative for a given gene function (rs ∼ -0.13). We predict factors driving replicability in some prior analyses of gene networks and show that they are unconnected with the correctness of the original result, instead reflecting replicable biases. Without these biases, the original results also vanish replicably. We show these effects can occur quite far upstream in network data and that there is a strong tendency within protein-protein interaction data for highly replicable interactions to be associated with poor quality control. © The Author 2015. Published by Oxford University Press. All rights reserved.


Ballouz S.,Stanley Institute for Cognitive Genomics | Verleyen W.,Stanley Institute for Cognitive Genomics | Gillis J.,Stanley Institute for Cognitive Genomics
Bioinformatics | Year: 2015

Motivation: RNA-seq co-expression analysis is in its infancy and reasonable practices remain poorly defined. We assessed a variety of RNA-seq expression data to determine factors affecting functional connectivity and topology in co-expression networks. Results: We examine RNA-seq co-expression data generated from 1970 RNA-seq samples using a Guilt-By-Association framework, in which genes are assessed for the tendency of co-expression to reflect shared function. Minimal experimental criteria to obtain performance on par with microarrays were >20 samples with read depth >10 M per sample. While the aggregate network constructed shows good performance (area under the receiver operator characteristic curve ∼0.71), the dependency on number of experiments used is nearly identical to that present in microarrays, suggesting thousands of samples are required to obtain 'gold-standard' co-expression. We find a major topological difference between RNA-seq and microarray co-expression in the form of low overlaps between hub-like genes from each network due to changes in the correlation of expression noise within each technology. © The Author 2015. Published by Oxford University Press. All rights reserved.
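To make the guilt-by-association evaluation concrete, here is a minimal sketch that assumes a neighbor-voting scorer and a rank-based AUROC. The abstract does not say which scorer was used, so the function names, the toy network, and the labels are all illustrative assumptions.

```python
from typing import List, Sequence


def neighbor_voting_scores(network: Sequence[Sequence[float]],
                           labels: Sequence[int]) -> List[float]:
    """Score each gene by the fraction of its connection weight going to annotated genes."""
    scores = []
    for i, row in enumerate(network):
        # Skip the diagonal so a gene never votes for itself.
        degree = sum(w for j, w in enumerate(row) if j != i)
        votes = sum(w * labels[j] for j, w in enumerate(row) if j != i)
        scores.append(votes / degree if degree > 0 else 0.0)
    return scores


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random annotated gene outranks a random unannotated one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Ties count as half a win, matching the Mann-Whitney formulation.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy co-expression network: genes 0 and 1 share a function and are tightly linked.
toy_network = [
    [0.0, 1.0, 0.1, 0.1],
    [1.0, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 1.0],
    [0.1, 0.1, 1.0, 0.0],
]
toy_labels = [1, 1, 0, 0]
toy_auroc = auroc(neighbor_voting_scores(toy_network, toy_labels), toy_labels)
```

A full evaluation would hold out annotations by cross-validation and repeat the calculation over many gene functions; the sketch scores a single function on one small network.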


Verleyen W.,Stanley Institute for Cognitive Genomics | Ballouz S.,Stanley Institute for Cognitive Genomics | Gillis J.,Stanley Institute for Cognitive Genomics
Bioinformatics | Year: 2015

Motivation: Network-based gene function inference methods have proliferated in recent years, but measurable progress remains elusive. We wished to better explore performance trends by controlling data and algorithm implementation, with a particular focus on the performance of aggregate predictions. Results: Hypothesizing that popular methods would perform well without hand-tuning, we used well-characterized algorithms to produce verifiably 'untweaked' results. We find that most state-of-the-art machine learning methods obtain 'gold standard' performance as measured in critical assessments in defined tasks. Across a broad range of tests, we see close alignment in algorithm performances after controlling for the underlying data being used. We find that algorithm aggregation provides only modest benefits, with a 17% increase in area under the ROC (AUROC) above the mean AUROC. In contrast, data aggregation gains are enormous with an 88% improvement in mean AUROC. Altogether, we find substantial evidence to support the view that additional algorithm development has little to offer for gene function prediction. © The Author 2014. Published by Oxford University Press. All rights reserved.


Pavlidis P.,University of British Columbia | Gillis J.,Stanley Institute for Cognitive Genomics
F1000Research | Year: 2012

In this opinion piece, we attempt to unify recent arguments we have made that serious confounds affect the use of network data to predict and characterize gene function. The development of computational approaches to determine gene function is a major strand of computational genomics research. However, progress beyond using BLAST to transfer annotations has been surprisingly slow. We have previously argued that a large part of the reported success in using "guilt by association" in network data is due to the tendency of methods to simply assign new functions to already well-annotated genes. While such predictions will tend to be correct, they are generic; it is true, but not very helpful, that a gene with many functions is more likely to have any function. We have also presented evidence that much of the remaining performance in cross-validation cannot be usefully generalized to new predictions, making progressive improvement in analysis difficult to engineer. Here we summarize our findings about how these problems will affect network analysis, discuss some ongoing responses within the field to these issues, and consolidate some recommendations and speculation, which we hope will modestly increase the reliability and specificity of gene function prediction. © 2012 Pavlidis P et al.


Gillis J.,Stanley Institute for Cognitive Genomics | Ballouz S.,Stanley Institute for Cognitive Genomics | Pavlidis P.,University of British Columbia
Journal of Proteomics | Year: 2014

Networks constructed from aggregated protein-protein interaction data are commonplace in biology. But the studies these data are derived from were conducted with their own hypotheses and foci. Focusing on data from budding yeast present in BioGRID, we determine that many of the downstream signals present in network data are significantly impacted by biases in the original data. We determine the degree to which selection bias in favor of biologically interesting bait proteins goes down with study size, while we also find that promiscuity in prey contributes more substantially in larger studies. We analyze interaction studies over time with respect to data in the Gene Ontology and find that reproducibly observed interactions are less likely to favor multifunctional proteins. We find that strong alignment between co-expression and protein-protein interaction data occurs only for extreme co-expression values, and use this data to suggest candidates for targets likely to reveal novel biology in follow-up studies. Biological significance: Protein-protein interaction data finds particularly heavy use in the interpretation of disease-causal variants. In principle, network data allows researchers to find novel commonalities among candidate genes. In this study, we detail several of the most salient biases contributing to aggregated protein-protein interaction databases. We find strong evidence for the role of selection and laboratory biases. Many of these effects contribute to the commonalities researchers find for disease genes. In order for characterization of disease genes and their interactions to not simply be an artifact of researcher preference, it is imperative to identify data biases explicitly. Based on this, we also suggest ways to move forward in producing candidates less influenced by prior knowledge. © 2014 Elsevier B.V.


Gillis J.,Stanley Institute for Cognitive Genomics | Pavlidis P.,University of British Columbia
Bioinformatics | Year: 2013

Motivation: The Gene Ontology (GO) is heavily used in systems biology, but the potential for redundancy, confounds with other data sources and problems with stability over time have been little explored. Results: We report that GO annotations are stable over short periods, with 3% of genes not being most semantically similar to themselves between monthly GO editions. However, we find that genes can alter their 'functional identity' over time, with 20% of genes not matching to themselves (by semantic similarity) after 2 years. We further find that annotation bias in GO, in which some genes are more characterized than others, has declined in yeast, but generally increased in humans. Finally, we discovered that many entries in protein interaction databases are owing to the same published reports that are used for GO annotations, with 66% of assessed GO groups exhibiting this confound. We provide a case study to illustrate how this information can be used in analyses of gene sets and networks. Availability: Data available at http://chibi.ubc.ca/assessGO. © 2013 The Author.


Narzisi G.,Simons Center for Quantitative Biology | Narzisi G.,New York Genome Center | O'Rawe J.A.,Stanley Institute for Cognitive Genomics | O'Rawe J.A.,State University of New York at Stony Brook | And 11 more authors.
Nature Methods | Year: 2014

We present an open-source algorithm, Scalpel (http://scalpel.sourceforge.net/), which combines mapping and assembly for sensitive and specific discovery of insertions and deletions (indels) in exome-capture data. A detailed repeat analysis coupled with a self-tuning k-mer strategy allows Scalpel to outperform other state-of-the-art approaches for indel discovery, particularly in regions containing near-perfect repeats. We analyzed 593 families from the Simons Simplex Collection and demonstrated Scalpel's power to detect long (≥30 bp) transmitted events and enrichment for de novo likely gene-disrupting indels in autistic children.
