Spiliopoulou A.,University of Edinburgh |
Spiliopoulou A.,Pharmatics Ltd |
Colombo M.,University of Edinburgh |
Orchard P.,Pharmatics Ltd |
And 2 more authors.
Genetics | Year: 2017
We address the task of genotype imputation to a dense reference panel given genotype likelihoods computed from ultralow coverage sequencing as inputs. In this setting, the data have a high level of missingness or uncertainty, and are thus more amenable to a probabilistic representation. Most existing imputation algorithms are not well suited for this situation, as they rely on prephasing for computational efficiency, and, without definite genotype calls, the prephasing task becomes computationally expensive. We describe GeneImp, a program for genotype imputation that does not require prephasing and is computationally tractable for whole-genome imputation. GeneImp does not explicitly model recombination; instead, it capitalizes on the existence of large reference panels (comprising thousands of reference haplotypes) and assumes that the reference haplotypes can adequately represent the target haplotypes over short regions unaltered. We validate GeneImp based on data from ultralow coverage sequencing (0.5×), and compare its performance to the most recent version of BEAGLE that can perform this task. We show that GeneImp achieves imputation quality very close to that of BEAGLE, using one to two orders of magnitude less time, without an increase in memory complexity. Therefore, GeneImp is the first practical choice for whole-genome imputation to a dense reference panel when prephasing cannot be applied, for instance, in datasets produced via ultralow coverage sequencing. A related future application for GeneImp is whole-genome imputation based on the off-target reads from deep whole-exome sequencing. © 2017 by the Genetics Society of America.
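The "probabilistic representation" mentioned in the abstract above can be illustrated with a minimal sketch. This is not GeneImp's algorithm: it only shows the generic step of turning per-site genotype likelihoods into posterior genotype probabilities and an expected alternate-allele dosage. The function names and the flat default prior are assumptions made for this example.

```python
from typing import Sequence, Tuple


def genotype_posteriors(
    likelihoods: Sequence[float],
    prior: Sequence[float] = (1 / 3, 1 / 3, 1 / 3),
) -> Tuple[float, ...]:
    """Posterior over genotypes 0/0, 0/1, 1/1 at one biallelic site.

    likelihoods[g] is P(reads | genotype g); prior[g] is P(genotype g).
    The flat default prior is an assumption of this sketch.
    """
    if len(likelihoods) != 3 or len(prior) != 3:
        raise ValueError("expected three values, for genotypes 0/0, 0/1, 1/1")
    weighted = [lik * pri for lik, pri in zip(likelihoods, prior)]
    total = sum(weighted)
    if total <= 0:
        raise ValueError("likelihoods and prior have no positive overlap")
    return tuple(w / total for w in weighted)


def expected_dosage(posteriors: Sequence[float]) -> float:
    """Expected number of alternate alleles (between 0 and 2)."""
    return posteriors[1] + 2 * posteriors[2]
```

With very few reads the likelihoods are nearly flat, so the posterior stays close to the prior; this is the uncertainty an imputation method then has to resolve using the reference panel.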
Agency: European Commission | Branch: FP7 | Program: MC-ITN | Phase: FP7-PEOPLE-2012-ITN | Award Amount: 3.76M | Year: 2013
Over the last decade, enormous progress has been made in recording the health state of an individual patient down to the molecular level of gene activity and genomic information; even sequencing a patient's genome for less than 1000 dollars is no longer an unrealistic goal. However, the ultimate hope to use all this information for personalized medicine, that is, to tailor medical treatment to the needs of an individual, remains largely unfulfilled. To turn the vision of personalized medicine into reality, many methodological problems remain to be solved: there is a lack of methods that allow us to gain a causal understanding of the underlying disease mechanisms, including gene-gene and gene-environment interactions. Similarly, there is an urgent need for integration of the heterogeneous patient data currently available, for improved and robust biomarker discovery for disease diagnosis, prognosis and therapy outcome prediction. The field of machine learning, which tries to detect patterns, rules and statistical dependencies in large datasets, has also witnessed dramatic progress over the last decade and has had a profound impact on the Internet. Amongst others, advanced methods for high-dimensional feature selection, causality inference, and data integration have been developed or are topics of current research. These techniques address many of the key methodological challenges that personalized medicine faces today and that keep it from rising to the next level. Despite this rich potential of machine learning in personalized medicine, its impact on data-driven medicine remains low, due to a lack of experts with knowledge in both machine learning and statistical genetics. Our ITN aims to close this gap by bringing together leading European research institutes in Machine Learning and Statistical Genetics, from both the private and public sectors, to train 14 early stage researchers.
Agency: European Commission | Branch: FP7 | Program: CP-FP | Phase: HEALTH.2012.2.1.1-3 | Award Amount: 7.88M | Year: 2012
MIMOmics develops statistical methods for the integrated analysis of metabolomics, proteomics, glycomics and genomic datasets in large studies. Our project is based on our involvement in studies participating in EU funded projects, i.e. GEHA, IDEAL, Mark-Age, ENGAGE and EuroSpan. In these consortia the primary goal is to identify molecular profiles that monitor and explain complex traits, with novel findings so far. Support for methodological development is missing. The state-of-the-art methodology falls far short of the complexity of the biological problem. Complex data are being analysed in a rather simple way, which misses the opportunity to uncover combinations of predictive profiles among the omics data. The objectives of MIMOmics are: to develop a statistical framework of methods for all analysis steps needed for identifying and interpreting omics-based biomarkers; and to integrate data derived from multiple omics platforms across several study designs and populations. Specific steps include: experimental design; pipelines for data gathering; cleaning of noisy spectra; predictive modeling of biomarkers; meta-analysis; and causality assessment. To enhance our understanding, systems approaches will be considered for pathways and structural modelling of biological networks. The major challenge in the joint analysis of omics datasets will be to develop methods that deal with the high dimensionality, noisy spectral data, heterogeneity, and structure of these datasets. To perform these tasks successfully we bring together established EU academic and industrial researchers in metabolomics, glycomics, biostatistics, bioinformatics, scientific computing and epidemiology, with complementary expertise. A key feature of our project is the validation of novel methodology by performing a proof of principle (Metabolic Health). Special effort will be made for rapid uptake of methods by communication with associated consortia and development of user-friendly software.
Zgaga L.,University of Edinburgh |
Zgaga L.,University of Zagreb |
Theodoratou E.,University of Edinburgh |
Kyle J.,Aberdeen Group |
And 10 more authors.
PLoS ONE | Year: 2012
Introduction: Hyperuricemia is a strong risk factor for gout. The incidence of gout and hyperuricemia has increased recently, which is thought to be, in part, due to changes in diet and lifestyle. The objective of this study was to investigate the association between plasma urate concentration and: a) food items: dairy, sugar-sweetened beverages (SSB) and purine-rich vegetables; b) related nutrients: lactose, calcium and fructose. Methods: A total of 2,076 healthy participants (44% female) from a population-based case-control study in Scotland (1999-2006) were included in this study. Dietary data were collected using a semi-quantitative food frequency questionnaire (FFQ). Nutrient intake was calculated using FFQ and composition of foods information. Urate concentration was measured in plasma. Results: Mean urate concentration was 283.8±72.1 mmol/dL (females: 260.1±68.9 mmol/dL and males: 302.3±69.2 mmol/dL). Using multivariate regression analysis we found that dairy, calcium and lactose intakes were inversely associated with urate (p = 0.008, p = 0.003, p = 0.0007, respectively). Overall SSB consumption was positively associated with urate (p = 0.008); however, energy-adjusted fructose intake was not associated with urate (p = 0.66). The intake of purine-rich vegetables was not associated with plasma urate (p = 0.38). Conclusions: Our results suggest that limiting purine-rich vegetable intake for lowering plasma urate may be ineffectual, despite current recommendations. Although a positive association between plasma urate and SSB consumption was found, there was no association with fructose intake, suggesting that fructose is not the causal agent underlying the SSB-urate association. The abundant evidence supporting the inverse association between plasma urate concentration and dairy consumption should be reflected in dietary guidelines for hyperuricemic individuals and gout patients. Further research is needed to establish which nutrients and food products influence plasma urate concentration, to inform the development of evidence-based dietary guidelines. © 2012 Zgaga et al.
Sivakumaran S.,University of Edinburgh |
Agakov F.,University of Edinburgh |
Agakov F.,Pharmatics Ltd |
Theodoratou E.,University of Edinburgh |
And 8 more authors.
American Journal of Human Genetics | Year: 2011
We present a systematic review of pleiotropy among SNPs and genes reported to show genome-wide association with common complex diseases and traits. We find abundant evidence of pleiotropy; 233 (16.9%) genes and 77 (4.6%) SNPs show pleiotropic effects. SNP pleiotropic status was associated with gene location (p = 0.024; pleiotropic SNPs more often exonic [14.5% versus 4.9% for nonpleiotropic, trait-associated SNPs] and less often intergenic [15.8% versus 23.6%]), "predicted transcript consequence" (p = 0.001; pleiotropic SNPs more often predicted to be structurally deleterious [5% versus 0.4%] but not more often in regulatory sequences), and certain disease classes. We develop a method to calculate the likelihood that pleiotropic links between traits occurred more often than expected and demonstrate that this approach can identify etiological links that are already known (such as between fetal hemoglobin and malaria risk) and those that are not yet established (e.g., between plasma campesterol levels and gallstones risk; and between immunoglobulin A and juvenile idiopathic arthritis). Examples of pleiotropy will accumulate over time, but it is already clear that pleiotropy is a common property of genes and SNPs associated with disease traits, and this will have implications for identification of molecular targets for drug development, future genetic risk-profiling, and classification of diseases. © 2011 The American Society of Human Genetics.
Zgaga L.,University of Edinburgh |
Zgaga L.,University of Zagreb |
Agakov F.,University of Edinburgh |
Agakov F.,Pharmatics Ltd |
And 7 more authors.
PLoS ONE | Year: 2013
Introduction: Vitamin D deficiency has been associated with increased risk of colorectal cancer (CRC), but a causal relationship has not yet been confirmed. We investigate the direction of causation between vitamin D and CRC by extending the conventional approaches to allow pleiotropic relationships and by explicitly modelling unmeasured confounders. Methods: Plasma 25-hydroxyvitamin D (25-OHD), genetic variants associated with 25-OHD and CRC, and other relevant information were available for 2645 individuals (1057 CRC cases and 1588 controls) and included in the model. We investigate whether 25-OHD is likely to be causally associated with CRC, or vice versa, by selecting the best modelling hypothesis according to Bayesian predictive scores. We examine consistency for a range of prior assumptions. Results: Model comparison showed preference for the causal association between low 25-OHD and CRC over the reverse causal hypothesis. This was confirmed for posterior mean deviances obtained for both models (11.5 natural log units in favour of the causal model), and also for deviance information criteria (DIC) computed for a range of prior distributions. Overall, models ignoring hidden confounding or pleiotropy had significantly poorer DIC scores. Conclusion: Results suggest a causal association between 25-OHD and colorectal cancer, and support the need for randomised clinical trials for further confirmation. © 2013 Zgaga et al.
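For readers unfamiliar with the deviance information criterion used in the abstract above, the following is a minimal sketch of its standard definition: the posterior mean deviance plus an effective number of parameters, where deviance is -2 times the log-likelihood. The abstract does not say how the scores were computed in the study, so the function name and its inputs are hypothetical.

```python
from statistics import fmean
from typing import Sequence


def dic(deviance_samples: Sequence[float], deviance_at_posterior_mean: float) -> float:
    """Deviance information criterion from posterior (e.g. MCMC) output.

    deviance_samples: deviance evaluated at each posterior draw.
    deviance_at_posterior_mean: deviance evaluated at the posterior mean
    of the parameters.
    """
    mean_deviance = fmean(deviance_samples)            # posterior mean deviance
    p_d = mean_deviance - deviance_at_posterior_mean   # effective number of parameters
    return mean_deviance + p_d
```

Lower values indicate the preferred model, so two competing hypotheses fitted to the same data can be compared by the difference of their DIC scores.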
Agakov F.V.,Pharmatics Ltd |
Orchard P.,University of Edinburgh |
Storkey A.,University of Edinburgh
Journal of Machine Learning Research | Year: 2012
We describe a simple and efficient approach to learning structures of sparse high-dimensional latent variable models. Standard algorithms either learn structures of specific predefined forms, or estimate sparse graphs in the data space ignoring the possibility of the latent variables. In contrast, our method learns rich dependencies and allows for latent variables that may confound the relations between the observations. We extend the model to conditional mixtures with side information and non-Gaussian marginal distributions of the observations. We then show that our model may be used for learning sparse latent variable structures corresponding to multiple unknown states, and for uncovering features useful for explaining and predicting structural changes. We apply the model to real-world financial data with heavy-tailed marginals covering the low- and high-market volatility periods of 2005-2011. We show that our method tends to give rise to significantly higher likelihoods of test data than standard network learning methods exploiting the sparsity assumption. We also demonstrate that our approach may be practical for financial stress-testing and visualization of dependencies between financial instruments.
PubMed | Cancer Research UK Research Institute, Pharmatics Ltd, University of Edinburgh, Genos Glycoscience Research Laboratory and BlackRock
Journal: Scientific Reports | Year: 2016
In this study we demonstrate the potential value of Immunoglobulin G (IgG) glycosylation as a novel prognostic biomarker of colorectal cancer (CRC). We analysed plasma IgG glycans in 1229 CRC patients and correlated them with survival outcomes. We assessed the predictive value of clinical algorithms and compared this to algorithms that also included glycan predictors. Decreased galactosylation, decreased sialylation (of fucosylated IgG glycan structures) and increased bisecting GlcNAc in IgG glycan structures were strongly associated with all-cause (q<0.01) and CRC mortality (q=0.04 for galactosylation and sialylation). Clinical algorithms showed good prediction of all-cause and CRC mortality (Harrell's C: 0.73, 0.77; AUC: 0.75, 0.79; IDI: 0.02, 0.04, respectively). The inclusion of IgG glycan data did not lead to any statistically significant improvements overall, but it improved the prediction over clinical models for stage 4 patients with the shortest follow-up time until death, with a median gain in the test AUC of 0.08. These glycan differences are consistent with significantly increased IgG pro-inflammatory activity being associated with poorer CRC prognosis, especially in late stage CRC. In the absence of validated biomarkers to improve upon prognostic information from existing clinicopathological factors, the potential of these novel IgG glycan biomarkers merits further investigation.
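The Harrell's C values reported above measure how well predicted risk ranks patients by survival time. As a hedged illustration (a simplified textbook form, not the study's code), the sketch below counts a pair as comparable when the patient with the shorter follow-up had an observed event, and as concordant when that patient also had the higher predicted risk. Pairs with tied times are ignored, which is a simplification of this example; the function name is hypothetical.

```python
from typing import Sequence


def harrell_c(
    times: Sequence[float],
    events: Sequence[bool],
    risk_scores: Sequence[float],
) -> float:
    """Simplified concordance index for right-censored survival data."""
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives some context for the 0.73 and 0.77 quoted for the clinical algorithms.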
PubMed | Roslin Institute, Pharmatics Ltd and University of Edinburgh
Journal: Scientific Reports | Year: 2015
In this study, we investigated the effect of five feature selection approaches on the performance of a mixed model (G-BLUP) and a Bayesian (Bayes C) prediction method. We predicted height, high density lipoprotein cholesterol (HDL) and body mass index (BMI) within 2,186 Croatian and into 810 UK individuals using genome-wide SNP data. Using all SNP information, Bayes C and G-BLUP had similar predictive performance across all traits within the Croatian data, and for the highly polygenic traits height and BMI when predicting into the UK data. Bayes C outperformed G-BLUP in the prediction of HDL, which is influenced by loci of moderate size, in the UK data. Supervised feature selection of a SNP subset in the G-BLUP framework provided a flexible, generalisable and computationally efficient alternative to Bayes C, but careful evaluation of predictive performance is required when supervised feature selection has been used.
PubMed | Pharmatics Ltd and Genos Glycoscience Research Laboratory
Journal: Methods in Molecular Biology (Clifton, N.J.) | Year: 2016
Ultra-performance liquid chromatography (UPLC) is the established technology for accurate analysis of IgG Fc N-glycosylation due to its superior sensitivity, resolution, speed, and its capability to provide branch-specific information of glycan species. Correct and cost-efficient preprocessing of chromatographic data is the major prerequisite for subsequent analyses ranging from inference of structural isomers to biomarker discovery and prediction of humoral immune response from characterized changes in glycosylation. The complexity of glycomic chromatograms poses a number of challenges for developing automated data annotation and quantitation algorithms, which has frequently necessitated manual or semi-manual approaches to preprocessing, most notably to peak detection and integration. Such procedures are meticulous and time-consuming, and may be a source of confounding due to their dependence on human labelers. Although liquid chromatography is a mature field and a number of methods have been developed for automatic peak detection outside the area of glycomics analysis, we found that hardly any of them are suitable for automatic integration of UPLC glycomic profiles without substantial modifications. In this chapter, we illustrate practical challenges of automatic peak detection of UPLC glycomics chromatograms. We outline a robust, semi-supervised method, ACE (Automatic Chromatogram Extraction), for automated alignment and detection of glycan peaks in chromatograms, developed by Pharmatics Limited (UK) in collaboration with Genos Limited (Croatia). Application of the tool requires minimal human intervention, which results in a significant reduction in the time and cost of IgG glycomics signal integration using a Waters Acquity UPLC instrument (Milford, MA, USA) in several human cohorts with blind technical replicas.
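To make "peak detection and integration" concrete, here is a deliberately naive sketch: strict local maxima above a height threshold, and trapezoidal integration between two sample indices. It is not the ACE method described above, and the abstract itself notes that generic approaches of this kind are rarely adequate for glycomic profiles without substantial modification. All names and the thresholding rule are assumptions of this example.

```python
from typing import List, Sequence


def find_peaks(signal: Sequence[float], min_height: float) -> List[int]:
    """Indices of strict local maxima whose height is at least min_height."""
    return [
        i
        for i in range(1, len(signal) - 1)
        if signal[i] >= min_height and signal[i - 1] < signal[i] > signal[i + 1]
    ]


def peak_area(
    times: Sequence[float], signal: Sequence[float], start: int, end: int
) -> float:
    """Trapezoidal area under the signal between indices start and end (inclusive)."""
    return sum(
        0.5 * (signal[k] + signal[k + 1]) * (times[k + 1] - times[k])
        for k in range(start, end)
    )
```

Real chromatograms add baseline drift, overlapping shoulders and retention-time shifts between runs, which is where simple rules like these break down and manual or semi-supervised correction has been needed.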