Data Mining and Modeling Group
Data Mining and Modeling Group
Pahikkala T.,University of Turku |
Pahikkala T.,Turku Center for Computer Science |
Okser S.,University of Turku |
Okser S.,Turku Center for Computer Science |
And 8 more authors.
Algorithms for Molecular Biology | Year: 2012
Background: Through the wealth of information contained within them, genome-wide association studies (GWAS) have the potential to provide researchers with a systematic means of associating genetic variants with a wide variety of disease phenotypes. Due to the limitations of approaches that have analyzed single variants one at a time, it has been proposed that the genetic basis of these disorders could be determined through detailed analysis of the genetic variants themselves and in conjunction with one another. The construction of models that account for these subsets of variants requires methodologies that generate predictions based on the total risk of a particular group of polymorphisms. However, due to the excessive number of variants, constructing these types of models has so far been computationally infeasible.Results: We have implemented an algorithm, known as greedy RLS, that we use to perform the first known wrapper-based feature selection on the genome-wide level. The running time of greedy RLS grows linearly in the number of training examples, the number of features in the original data set, and the number of selected features. This speed is achieved through computational short-cuts based on matrix calculus. Since the memory consumption in present-day computers can form an even tighter bottleneck than running time, we also developed a space efficient variation of greedy RLS which trades running time for memory. These approaches are then compared to traditional wrapper-based feature selection implementations based on support vector machines (SVM) to reveal the relative speed-up and to assess the feasibility of the new algorithm. As a proof of concept, we apply greedy RLS to the Hypertension - UK National Blood Service WTCCC dataset and select the most predictive variants using 3-fold external cross-validation in less than 26 minutes on a high-end desktop. On this dataset, we also show that greedy RLS has a better classification performance on independent test data than a classifier trained using features selected by a statistical p-value-based filter, which is currently the most popular approach for constructing predictive models in GWAS.Conclusions: Greedy RLS is the first known implementation of a machine learning based method with the capability to conduct a wrapper-based feature selection on an entire GWAS containing several thousand examples and over 400,000 variants. In our experiments, greedy RLS selected a highly predictive subset of genetic variants in a fraction of the time spent by wrapper-based selection methods used together with SVM classifiers. The proposed algorithms are freely available as part of the RLScore software library at http://users.utu.fi/aatapa/RLScore/. © 2012 Pahikkala et al.; licensee BioMed Central Ltd.
Okser S.,University of Turku |
Lehtimaki T.,University of Tampere |
Elo L.L.,University of Turku |
Elo L.L.,Data Mining and Modeling Group |
And 17 more authors.
PLoS Genetics | Year: 2010
The relative contribution of genetic risk factors to the progression of subclinical atherosclerosis is poorly understood. It is likely that multiple variants are implicated in the development of atherosclerosis, but the subtle genotypic and phenotypic differences are beyond the reach of the conventional case-control designs and the statistical significance testing procedures being used in most association studies. Our objective here was to investigate whether an alternative approach-in which common disorders are treated as quantitative phenotypes that are continuously distributed over a population-can reveal predictive insights into the early atherosclerosis, as assessed using ultrasound imaging-based quantitative measurement of carotid artery intima-media thickness (IMT). Using our population-based follow-up study of atherosclerosis precursors as a basis for sampling subjects with gradually increasing IMT levels, we searched for such subsets of genetic variants and their interactions that are the most predictive of the various risk classes, rather than using exclusively those variants meeting a stringent level of statistical significance. The area under the receiver operating characteristic curve (AUC) was used to evaluate the predictive value of the variants, and cross-validation was used to assess how well the predictive models will generalize to other subsets of subjects. By means of our predictive modeling framework with machine learning-based SNP selection, we could improve the prediction of the extreme classes of atherosclerosis risk and progression over a 6-year period (average AUC 0.844 and 0.761), compared to that of using conventional cardiovascular risk factors alone (average AUC 0.741 and 0.629), or when combined with the statistically significant variants (average AUC 0.762 and 0.651). The predictive accuracy remained relatively high in an independent validation set of subjects (average decrease of 0.043). These results demonstrate that the modeling framework can utilize the "gray zone" of genetic variation in the classification of subjects with different degrees of risk of developing atherosclerosis. © 2010 Okser et al.
Mysara M.,Belgian Nuclear Research Center |
Mysara M.,Vrije Universiteit Brussel |
Mysara M.,Center for the Biology of Disease |
Saeys Y.,Data Mining and Modeling Group |
And 6 more authors.
Applied and Environmental Microbiology | Year: 2015
In ecological studies, microbial diversity is nowadays mostly assessed via the detection of phylogenetic marker genes, such as 16S rRNA. However, PCR amplification of these marker genes produces a significant amount of artificial sequences, often referred to as chimeras. Different algorithms have been developed to remove these chimeras, but efforts to combine different methodologies are limited. Therefore, two machine learning classifiers (reference-based and de novo CATCh) were developed by integrating the output of existing chimera detection tools into a new, more powerful method. When comparing our classifiers with existing tools in either the reference-based or de novo mode, a higher performance of our ensemble method was observed on a wide range of sequencing data, including simulated, 454 pyrosequencing, and Illumina MiSeq data sets. Since our algorithm combines the advantages of different individual chimera detection tools, our approach produces more robust results when challenged with chimeric sequences having a low parent divergence, short length of the chimeric range, and various numbers of parents. Additionally, it could be shown that integrating CATCh in the preprocessing pipeline has a beneficial effect on the quality of the clustering in operational taxonomic units. © 2015, American Society for Microbiology.