Sonabend A.M., Columbia University | Bansal M., Center for Computational Biology and Bioinformatics | Guarnieri P., Center for Computational Biology and Bioinformatics | Lei L., Columbia University | And 17 more authors.
Cancer Research | Year: 2014

Proneural glioblastoma is defined by an expression pattern resembling that of oligodendrocyte progenitor cells and carries a distinctive set of genetic alterations. Whether there is a functional relationship between the proneural phenotype and the associated genetic alterations is unknown. To evaluate this possible relationship, we performed a longitudinal molecular characterization of tumor progression in a mouse model of proneural glioma. In this setting, the tumors acquired remarkably consistent genetic deletions at late stages of progression, similar to those observed in human proneural glioblastoma. Further investigations revealed that p53 is a master regulator of the transcriptional network underlying the proneural phenotype. This p53-centric transcriptional network and its associated phenotype were observed at both the early and late stages of progression, and preceded the proneural-specific deletions. Remarkably, deletion of p53 at the time of tumor initiation obviated the acquisition of later deletions, establishing a link between the proneural transcriptional network and the subtype-specific deletions selected during glioma progression. © 2013 American Association for Cancer Research.


Trifonov V., Center for Computational Biology and Bioinformatics | Pasqualucci L., Columbia University | Favera R.D., Columbia University | Rabadan R., Center for Computational Biology and Bioinformatics
BMC Systems Biology | Year: 2013

Background: Most tumors are the result of accumulated genomic alterations in somatic cells. The emerging spectrum of alterations in tumors is complex, and the identification of relevant genes and pathways remains a challenge. Furthermore, key cancer genes are usually found amplified or deleted in chromosomal regions containing many other genes. Point mutations, on the other hand, provide exquisite information about amino acid changes that could be implicated in the oncogenic process. Current large-scale genomic projects provide high-throughput genomic data in a large number of well-characterized tumor samples. Methods: We define a Bayesian approach designed to identify candidate cancer genes by integrating copy number and point mutation information. Our method exploits the concept that small and recurrent alterations in tumors are more informative in the search for cancer genes. Thus, the algorithm (Mutations with Common Focal Alterations, or MutComFocal) seeks focal copy number alterations and recurrent point mutations within high-throughput data from large panels of tumor samples. Results: We apply MutComFocal to Diffuse Large B-cell Lymphoma (DLBCL) data from four different high-throughput studies, totaling 78 samples assessed for copy number alterations by single nucleotide polymorphism (SNP) array analysis and 65 samples assayed for protein-changing point mutations by whole-exome/whole-transcriptome sequencing. In addition to recapitulating known alterations, MutComFocal identifies ARID1B, ROBO2 and MRS1 as candidate tumor suppressors and KLHL6, IL31 and LRP1 as putative oncogenes in DLBCL. Conclusions: We present a Bayesian approach for the identification of candidate cancer genes that integrates data collected from large numbers of cancer patients across different studies. When trained on a well-studied dataset, MutComFocal is able to identify most of the previously reported alterations. The application of MutComFocal to large-scale cancer data provides the opportunity to pinpoint the key functional genomic alterations in tumors. © 2013 Trifonov et al.; licensee BioMed Central Ltd.
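The core idea, that small (focal) copy number lesions and recurrent point mutations are more informative than broad events, can be illustrated with a minimal scoring sketch. This is not the published MutComFocal algorithm; the input formats, normalization, and weighting below are assumptions made for illustration only.

```python
from collections import defaultdict

def focality_scores(lesions):
    """Score genes by focal copy number evidence.

    `lesions` is a list of per-sample copy number events, each given as the
    list of genes it spans (hypothetical input format). A gene inside a
    2-gene deletion earns 1/2 from that sample; a gene inside a 200-gene
    deletion earns only 1/200, so small, recurrent lesions dominate.
    """
    scores = defaultdict(float)
    for genes_in_lesion in lesions:
        for gene in genes_in_lesion:
            scores[gene] += 1.0 / len(genes_in_lesion)
    return scores

def recurrence_scores(mutations):
    """Count protein-changing point mutations per gene across samples.

    `mutations` is a list of (sample_id, gene) pairs (assumed format).
    """
    scores = defaultdict(float)
    for _sample, gene in mutations:
        scores[gene] += 1.0
    return scores

def combined_ranking(lesions, mutations):
    """Rank genes by the product of normalized focality and recurrence."""
    foc, rec = focality_scores(lesions), recurrence_scores(mutations)
    max_f = max(foc.values(), default=1.0)
    max_r = max(rec.values(), default=1.0)
    genes = set(foc) | set(rec)
    combined = {g: (foc.get(g, 0.0) / max_f) * (rec.get(g, 0.0) / max_r) for g in genes}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Toy example: GENE_A sits in small deletions and is recurrently mutated,
# so it outranks GENE_C, which sits inside one broad deletion.
lesions = [["GENE_A", "GENE_B"], ["GENE_A"], ["GENE_C"] + [f"G{i}" for i in range(200)]]
mutations = [("s1", "GENE_A"), ("s2", "GENE_A"), ("s3", "GENE_C")]
print(combined_ranking(lesions, mutations)[:3])
```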


Wang Y., Indiana University | Cardenas H., Indiana University | Fang F., Indiana University | Condello S., Indiana University | And 8 more authors.
Cancer Research | Year: 2014

Emerging results indicate that cancer stem-like cells contribute to chemoresistance and poor clinical outcomes in many cancers, including ovarian cancer. As epigenetic regulators play a major role in the control of normal stem cell differentiation, epigenetics may offer a useful arena to develop strategies to target cancer stem-like cells. Epigenetic aberrations, especially DNA methylation, silence tumor-suppressor and differentiation-associated genes that regulate the survival of ovarian cancer stem-like cells (OCSC). In this study, we tested the hypothesis that DNA-hypomethylating agents may be able to reset OCSC toward a differentiated phenotype by evaluating the effects of the new DNA methyltransferase inhibitor SGI-110 on OCSC phenotype, as defined by expression of the cancer stem-like marker aldehyde dehydrogenase (ALDH). We demonstrated that ALDH+ ovarian cancer cells possessed multiple stem cell characteristics, were highly chemoresistant, and were enriched in residual xenografts after platinum therapy. Low-dose SGI-110 reduced the stem-like properties of ALDH+ cells, including their tumor-initiating capacity, resensitized these OCSCs to platinum, and induced re-expression of differentiation-associated genes. Maintenance treatment with SGI-110 after carboplatin inhibited OCSC growth, causing global tumor hypomethylation and decreased tumor progression. Our work offers preclinical evidence that epigenome-targeting strategies have the potential to delay tumor progression by reprogramming residual cancer stem-like cells. Furthermore, the results suggest that SGI-110 might be administered in combination with platinum to prevent the development of recurrent and chemoresistant ovarian cancer. © 2014 American Association for Cancer Research.


Cardenas H., Indiana University | Vieth E., Indiana University | Lee J., Indiana University | Segar M., Center for Computational Biology and Bioinformatics | And 6 more authors.
Epigenetics | Year: 2014

A key step in the process of metastasis is the epithelial-to-mesenchymal transition (EMT). We hypothesized that epigenetic mechanisms play a key role in EMT, and to test this hypothesis we analyzed global and gene-specific changes in DNA methylation during TGF-β-induced EMT in ovarian cancer cells. Epigenetic profiling using the Infinium HumanMethylation450 BeadChip (HM450) revealed extensive (P < 0.01) methylation changes after TGF-β stimulation (468 and 390 CpG sites altered at 48 and 120 h after cytokine treatment, respectively). The majority of gene-specific TGF-β-induced methylation changes occurred in CpG islands located in or near promoters (193 and 494 genes hypermethylated at 48 and 120 h after TGF-β stimulation, respectively). Furthermore, methylation changes were sustained for the duration of TGF-β treatment and were reversible after cytokine removal. Pathway analysis of the hypermethylated loci identified functional networks strongly associated with EMT and cancer progression, including cellular movement, cell cycle, organ morphology, cellular development, and cell death and survival. Genes with altered methylation and corresponding expression changes during TGF-β-induced EMT included CDH1 (E-cadherin) and COL1A1 (collagen 1A1). Furthermore, TGF-β induced both the expression and activity of DNA methyltransferases (DNMT) -1, -3A, and -3B, and treatment with the DNMT inhibitor SGI-110 prevented TGF-β-induced EMT. These results demonstrate that dynamic changes in the DNA methylome are implicated in TGF-β-induced EMT and metastasis. We suggest that targeting DNMTs may inhibit this process by reactivating genes silenced by DNA methylation during EMT in cancer. © 2014 Taylor & Francis Group, LLC.
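As a rough illustration of how CpG sites gaining methylation after TGF-β treatment might be flagged from HM450-style beta values, the sketch below compares treated and control replicates site by site. The replicate layout, paired t-test, and thresholds are assumptions for illustration only, not the statistics used in the study.

```python
import numpy as np
from scipy.stats import ttest_rel

def hypermethylated_cpgs(beta_control, beta_treated, min_delta=0.1, alpha=0.01):
    """Flag CpG sites with a methylation gain after treatment.

    beta_control, beta_treated: arrays of shape (n_cpgs, n_replicates) with
    beta values in [0, 1], replicates paired column-wise (assumed layout).
    Returns indices of CpGs whose mean beta increases by at least `min_delta`
    with a paired t-test p-value below `alpha`.
    """
    delta = beta_treated.mean(axis=1) - beta_control.mean(axis=1)
    _t, p = ttest_rel(beta_treated, beta_control, axis=1)
    return np.where((delta >= min_delta) & (p < alpha))[0]

# Toy data: 1,000 CpGs x 3 paired replicates; spike in a methylation gain at site 0.
rng = np.random.default_rng(0)
control = rng.uniform(0.2, 0.4, size=(1000, 3))
treated = control + rng.normal(0, 0.01, size=(1000, 3))
treated[0] += 0.3  # simulated treatment-induced hypermethylation
print(hypermethylated_cpgs(control, treated))
```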


Tiacci E., University of Perugia | Trifonov V., Center for Computational Biology and Bioinformatics | Schiavoni G., University of Perugia | Holmes A., Center for Computational Biology and Bioinformatics | And 33 more authors.
New England Journal of Medicine | Year: 2011

Background: Hairy-cell leukemia (HCL) is a well-defined clinicopathological entity whose underlying genetic lesion is still obscure. Methods: We searched for HCL-associated mutations by performing massively parallel sequencing of the whole exome of leukemic and matched normal cells purified from the peripheral blood of an index patient with HCL. Findings were validated by Sanger sequencing in 47 additional patients with HCL. Results: Whole-exome sequencing identified five missense somatic clonal mutations that were confirmed on Sanger sequencing, including a heterozygous mutation in BRAF that results in the BRAF V600E variant protein. Since BRAF V600E is oncogenic in other tumors, further analyses were focused on this genetic lesion. The same BRAF mutation was noted in all the other 47 patients with HCL who were evaluated by means of Sanger sequencing. None of the 195 patients with other peripheral B-cell lymphomas or leukemias who were evaluated carried the BRAF V600E variant, including 38 patients with splenic marginal-zone lymphomas or unclassifiable splenic lymphomas or leukemias. In immunohistologic and Western blot studies, HCL cells expressed phosphorylated MEK and ERK (the downstream targets of the BRAF kinase), indicating a constitutive activation of the RAF-MEK-ERK mitogen-activated protein kinase pathway in HCL. In vitro incubation of BRAF-mutated primary leukemic hairy cells from 5 patients with PLX-4720, a specific inhibitor of active BRAF, led to a marked decrease in phosphorylated ERK and MEK. Conclusions: The BRAF V600E mutation was present in all patients with HCL who were evaluated. This finding may have implications for the pathogenesis, diagnosis, and targeted therapy of HCL. (Funded by Associazione Italiana per la Ricerca sul Cancro and others.) Copyright © 2011 Massachusetts Medical Society.


Zhou F.C., Stark Neuroscience Research Institute | Balaraman Y., Indiana University | Teng M., Center for Computational Biology and Bioinformatics | Teng M., Harbin Institute of Technology | And 3 more authors.
Alcoholism: Clinical and Experimental Research | Year: 2011

Background: Potential epigenetic mechanisms underlying fetal alcohol syndrome (FAS) include alcohol-induced alterations of methyl metabolism, resulting in aberrant patterns of DNA methylation and gene expression during development. Having previously demonstrated an essential role for epigenetics in neural stem cell (NSC) development and that inhibiting DNA methylation prevents NSC differentiation, here we investigated the effect of alcohol exposure on genome-wide DNA methylation patterns and NSC differentiation. Methods: Neural stem cells in culture were treated with or without a 6-hour, 88 mM ("binge-like") alcohol exposure and examined at 48 hours for migration, growth, and genome-wide DNA methylation. DNA methylation was examined using DNA methylation immunoprecipitation followed by microarray analysis, with further validation by independent Sequenom analysis. Results: Neural stem cells differentiated within 24 to 48 hours, showing migration, neuronal expression, and morphological transformation. Alcohol exposure retarded the migration, neuronal formation, and growth of NSCs, similar to treatment with the methylation inhibitor 5-aza-cytidine. When NSCs departed from the quiescent state, a genome-wide diversification of DNA methylation was observed; that is, many moderately methylated genes altered their methylation levels and became hyper- or hypomethylated. Alcohol prevented many genes from undergoing this diversification, including genes related to neural development, neuronal receptors, and olfaction, while retarding differentiation. Validation of specific genes by Sequenom analysis demonstrated that alcohol exposure prevented methylation of genes associated with neural development [cut-like 2 (Cutl2), insulin-like growth factor 1 (Igf1), epidermal growth factor-containing fibulin-like extracellular matrix protein 1 (Efemp1), and SRY-box-containing gene 7 (Sox7)], eye development [lens intrinsic membrane protein 2 (Lim2)], the epigenetic regulator Smarca2 (SWI/SNF-related, matrix-associated, actin-dependent regulator of chromatin, subfamily a, member 2), and a developmental disorder gene [DiGeorge syndrome critical region gene 2 (Dgcr2)]. Specific sites with altered DNA methylation also correlated with transcription factor binding sites known to be critical for regulating neural development. Conclusion: The data indicate that alcohol prevents normal DNA methylation programming of key neural stem cell genes and retards NSC differentiation. Thus, the role of DNA methylation in FAS warrants further investigation. © 2011 by the Research Society on Alcoholism.


Gusev A., Columbia University | Palamara P.F., Columbia University | Aponte G., Columbia University | Zhuang Z., Columbia University | And 4 more authors.
Molecular Biology and Evolution | Year: 2012

Homologous long segments along the genomes of close or remote relatives that are identical by descent (IBD) from a common ancestor provide clues about recent events in human genetics. We set out to extensively map such IBD segments in large cohorts and investigate their distribution within and across different populations. We report analysis of several data sets, demonstrating that IBD is more common than expected under naïve models of population genetics. We show that the frequency of IBD pairs is population dependent and can be used to cluster individuals into populations, detect a homogeneous subpopulation within a larger cohort, and infer bottleneck events in such a subpopulation. Specifically, we show that Ashkenazi Jewish individuals are all connected through transitive remote family ties, evidenced by sharing of 50 cM of IBD with a publicly available data set of fewer than 400 individuals. We further expose regions where long-range haplotypes are shared significantly more often than elsewhere in the genome; these regions are observed across multiple populations and are enriched for common long structural variation. They are inconsistent with recent relatedness and suggest ancient common ancestry with limited recombination between haplotypes. © The Author 2011. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved.
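The observation that the amount of pairwise IBD sharing carries population structure can be sketched as follows: accumulate the total shared cM for each pair of individuals and cluster on the resulting dissimilarity matrix. This is an illustrative sketch under an assumed input format, not the IBD detection or clustering pipeline used in the study.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_by_ibd(individuals, segments, n_clusters=2):
    """Cluster individuals by total pairwise IBD sharing.

    `segments` is an iterable of (id1, id2, length_cm) tuples, one per
    detected IBD segment (assumed input format). Pairs sharing more total
    cM are treated as closer.
    """
    idx = {ind: i for i, ind in enumerate(individuals)}
    n = len(individuals)
    shared = np.zeros((n, n))
    for a, b, cm in segments:
        i, j = idx[a], idx[b]
        shared[i, j] += cm
        shared[j, i] += cm
    # Convert sharing to a distance: heavy sharers end up close together.
    dist = 1.0 / (1.0 + shared)
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist), method="average")
    labels = fcluster(tree, n_clusters, criterion="maxclust")
    return dict(zip(individuals, labels))

# Toy example: two trios of individuals, dense IBD sharing within each trio.
inds = ["a1", "a2", "a3", "b1", "b2", "b3"]
segs = [("a1", "a2", 60), ("a1", "a3", 45), ("a2", "a3", 50),
        ("b1", "b2", 55), ("b1", "b3", 40), ("b2", "b3", 65)]
print(cluster_by_ibd(inds, segs))
```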


News Article | March 11, 2016
Site: www.scientificcomputing.com

Handling big data can sometimes feel like driving on an unpaved road for researchers with a need for speed and supercomputers. "When you're in the world of data, there are rocks and bumps in the way, and a lot of things that you have to take care of," said Niall Gaffney, a former Hubble Space Telescope scientist who now heads the Data Intensive Computing group at the Texas Advanced Computing Center (TACC). Gaffney led the effort to bring online a new kind of supercomputer, called Wrangler. Like the old Western cowboys who tamed wild horses, Wrangler tames beasts of big data, such as computing problems that involve analyzing thousands of files that need to be quickly opened, examined and cross-correlated.

Wrangler fills a gap in the supercomputing resources of XSEDE, the Extreme Science and Engineering Discovery Environment, supported by the National Science Foundation (NSF). XSEDE is a collection of advanced digital resources that scientists can easily use to share and analyze the massive datasets being produced in nearly every field of research today. In 2013, NSF awarded TACC and its academic partners Indiana University and the University of Chicago $11.2 million to build and operate Wrangler, a supercomputer to handle data-intensive high performance computing. Wrangler was designed to work closely with the Stampede supercomputer, the 10th most powerful in the world according to the twice-yearly Top500 list and the flagship of TACC at The University of Texas at Austin (UT Austin). Stampede has computed over six million jobs for open science since it came online in 2013.

"We kept a lot of what was good with systems like Stampede," said Gaffney, "but added new things to it like a very large flash storage system, a very large distributed spinning disc storage system, and high-speed network access. This allows people who have data problems that weren't being fulfilled by systems like Stampede and Lonestar to be able to do those in ways that they never could before."

Gaffney compared supercomputers like Stampede to racing sports cars, with fantastic compute engines optimized for going fast on smooth, well-defined racetracks. Wrangler, on the other hand, is built like a rally car to go fast on unpaved, bumpy roads with muddy gravel. "If you take a Ferrari off-road you may want to change the way that the suspension is done," Gaffney said. "You want to change the way that the entire car is put together, even though it uses the same components, to build something suitable for people who have a different job."

At the heart of Wrangler lie 600 terabytes of flash memory shared via PCI interconnect across Wrangler's more than 3,000 Haswell compute cores. "All parts of the system can access the same storage," Gaffney said. "They can work in parallel together on the data that are stored inside this high-speed storage system to get larger results they couldn't get otherwise."

This massive amount of flash storage comes from DSSD, a startup co-founded by Andy Bechtolsheim of Sun Microsystems fame and acquired by EMC in May 2015. Bechtolsheim's influence at TACC goes back to the 'Magnum' InfiniBand network switch he led the design of for the now-decommissioned Ranger supercomputer, the predecessor to Stampede. What's new is that DSSD took a shortcut between the CPU and the data. "The connection from the brain of the computer goes directly to the storage system. There's no translation in between," Gaffney said. "It actually allows people to compute directly with some of the fastest storage that you can get your hands on, with no bottlenecks in between."
"It actually allows people to compute directly with some of the fastest storage that you can get your hands on, with no bottlenecks in between." Gaffney recalled the hang-up scientists had with code called OrthoMCL, which combs through DNA sequences to find common genetic ancestry in seemingly unrelated species. The problem was that OrthoMCL let loose databases wild as a bucking bronco. "It generates a very large database and then runs computational programs outside and has to interact with this database," said biologist Rebecca Young of the Department of Integrative Biology and the Center for Computational Biology and Bioinformatics at UT Austin. She added, "That's not what Lonestar and Stampede and some of the other TACC resources were set up for." Young recounted how at first, using OrthoMCL with online resources, she was only able to pull out 350 comparable genes across 10 species. "When I run OrthoMCL on Wrangler, I'm able to get almost 2,000 genes that are comparable across the species," Young said. "This is an enormous improvement from what is already available. What we're looking to do with OrthoMCL is to allow us to make an increasing number of comparisons across species when we're looking at these very divergent, these very ancient species separated by 450 million years of evolution." "We were able to go through all of these work cases in anywhere between 15 minutes and 6 hours," Gaffney said. "This is a game changer." Gaffney added that getting results quickly lets scientists explore new and deeper questions by working with larger collections of data and driving previously unattainable discoveries. Computer scientist Joshua New with the Oak Ridge National Laboratory (ORNL) hopes to take advantage of Wrangler's ability to tame big data. New is the principal investigator of the Autotune project, which creates a software version of a building and calibrates the model with over 3,000 different data inputs from sources like utility bills to generate useful information, such as what an optimal energy-efficient retrofit might be. "Wrangler has enough horsepower that we can run some very large studies and get meaningful results in a single run," New said. He currently uses the Titan supercomputer of ORNL to run 500,000 simulations and write 45 TB of data to disk in 68 minutes. He said he wants to scale out his parametric studies to simulate all 125.1 million buildings in the U.S. "I think that Wrangler fills a specific niche for us in that we're turning our analysis into an end-to-end workflow, where we define what parameters we want to vary," New said. "It creates the sampling matrix. It creates the input files. It does the computationally challenging task of running all the simulations in parallel. It creates the output. Then we run our artificial intelligence and statistic techniques to analyze that data on the back end. Doing that from beginning to end as a solid workflow on Wrangler is something that we're very excited about." When Gaffney talks about storage on Wrangler, he's talking about is a lot of data storage — a 10 petabyte Lustre-based file system hosted at TACC and replicated at Indiana University. "We want to preserve data," Gaffney said. "The system for Wrangler has been set up for making data a first-class citizen amongst what people do for research, allowing one to hold onto data and curate, share and work with people with it. Those are the founding tenants of what we wanted to do with Wrangler." 
"Data is really the biggest challenge with our project," said UT Austin astronomer Steve Finkelstein. His NSF-funded project is called HETDEX, the Hobby-Eberly Telescope Dark Energy Experiment. It's the largest survey of galaxies ever attempted. Scientists expect HETDEX to map over a million galaxies in three dimensions, in the process discovering thousands of new galaxies. The main goal is to study dark energy, a mysterious force pushing galaxies apart. "Every single night that we observe — and we plan to observe more or less every single night for at least three years — we're going to make 200 GB of data," Finkelstein said. It'll measure the spectra of 34,000 points of skylight every six minutes. "On Wrangler is our pipeline," Finkelstein said. "It's going to live there. As the data comes in, it's going to have a little routine that basically looks for new data, and as it comes in every six minutes or so it will process it. By the end of the night, it will actually be able to take all the data together to find new galaxies." Another example of a new HPC user Wrangler enables is an NSF-funded science initiative called PaleoCore. It hopes to take advantage of Wrangler's swiftness with databases to build a repository for scientists to dig through geospatially-aware data on all fossils related to human origins. This would combine older digital collections in formats like Excel worksheets and SQL databases with newer ways of gathering data, such as real-time fossil GPS information collected from iPhones or iPads. "We're looking at big opportunities in linked open data," PaleoCore principal investigator Denne Reed said. Reed is an associate professor in the Department of Anthropology at UT Austin. Linked open data allows for queries to get meaning from the relationships of seemingly disparate pieces of data. "Wrangler is the type of platform that enables that," Reed said. "It enables us to store large amounts of data, both in terms of photo imagery, satellite imagery and related things that go along with geospatial data. Then also, it allows us to start looking at ways to effectively link those data with other data repositories in real time." Wrangler's shared memory supports data analytics on the Hadoop and Apache Spark frameworks. "Hadoop is a big buzzword in all of data science at this point," Gaffney said. "We have all of that and are able to configure the system to be able to essentially be like the Google Search engines are today in data centers. The big difference is that we are servicing a few people at a time, as opposed to Google." Users bring data in and out of Wrangler in one of the fastest ways possible. Wrangler connects to Internet2, an optical network which provides 100 gigabytes per second worth of throughput to most of the other academic institutions around the country. What's more, TACC has tools and techniques to transfer their data in parallel. "It's sort of like being at the supermarket," explained Gaffney. "If there's only one lane open, it is just as fast as one person checking you out. But if you go in and have 15 lanes open, you can spread that traffic across and get more people through in less time." Biologists, astronomers, energy efficiency experts, and paleontologists are just a small slice of the new user community Wrangler aims to attract. Wrangler is also more web-enabled than typically found in high performance computing. 
Wrangler is also more web-enabled than is typical in high performance computing. A web portal allows users to manage the system and to use web interfaces such as VNC, RStudio, and Jupyter Notebooks for more desktop-like interactions with the system. "We need these bigger systems for science," Gaffney said. "We need more kinds of systems. And we need more kinds of users. That's where we're pushing towards with these sort of portals. This is going to be the new face, I believe, for many of these systems that we're moving forward with now. Much more web-driven, much more graphical, much less command line driven."

Wrangler is primed to lead the way in computing the bumpy world of data-intensive science research. "There are some great systems and great researchers out there who are doing groundbreaking and very important work on data, to change the way we live and to change the world," Gaffney said. "Wrangler is pushing forth on the sharing of these results, so that everybody can see what's going on."


Petrey D., Center for Computational Biology and Bioinformatics | Petrey D., Howard Hughes Medical Institute | Honig B., Center for Computational Biology and Bioinformatics | Honig B., Howard Hughes Medical Institute
Annual Review of Biophysics | Year: 2014

The past decade has seen a dramatic expansion in the number and range of techniques available to obtain genome-wide information and to analyze this information so as to infer both the functions of individual molecules and how they interact to modulate the behavior of biological systems. Here, we review these techniques, focusing on the construction of physical protein-protein interaction networks and highlighting approaches that incorporate protein structure, which is becoming an increasingly important component of systems-level computational techniques. We also discuss how network analyses are being applied to enhance our basic understanding of biological systems and their dysregulation, as well as how these networks are being used in drug development. Copyright © 2014 by Annual Reviews. All rights reserved.
