Villar D.,University of Cambridge |
Flicek P.,European Bioinformatics Institute |
Odom D.T.,University of Cambridge
Nature Reviews Genetics | Year: 2014
Differences in transcription factor binding can contribute to organismal evolution by altering downstream gene expression programmes. Genome-wide studies in Drosophila melanogaster and mammals have revealed common quantitative and combinatorial properties of in vivo DNA binding, as well as marked differences in the rate and mechanisms of evolution of transcription factor binding in metazoans. Here, we review the recently discovered rapid 're-wiring' of in vivo transcription factor binding between related metazoan species and summarize general principles underlying the observed patterns of evolution. We then consider what might explain the differences in genome evolution between metazoan phyla and outline the conceptual and technological challenges facing this research field. © 2014 Macmillan Publishers Limited.
Clarke L.,European Bioinformatics Institute
Nature methods | Year: 2012
The 1000 Genomes Project was launched as one of the largest distributed data collection and analysis projects ever undertaken in biology. In addition to the primary scientific goals of creating both a deep catalog of human genetic variation and extensive methods to accurately discover and characterize variation using new sequencing technologies, the project makes all of its data publicly available. Members of the project data coordination center have developed and deployed several tools to enable widespread data access.
McWilliam H.,European Bioinformatics Institute
Nucleic acids research | Year: 2013
Since 2004 the European Bioinformatics Institute (EMBL-EBI) has provided access to a wide range of databases and analysis tools via Web Services interfaces. This comprises services to search across the databases available from the EMBL-EBI and to explore the network of cross-references present in the data (e.g. EB-eye), services to retrieve entry data in various data formats and to access the data in specific fields (e.g. dbfetch), and analysis tool services, for example, sequence similarity search (e.g. FASTA and NCBI BLAST), multiple sequence alignment (e.g. Clustal Omega and MUSCLE), pairwise sequence alignment and protein functional analysis (e.g. InterProScan and Phobius). The REST/SOAP Web Services (http://www.ebi.ac.uk/Tools/webservices/) interfaces to these databases and tools allow their integration into other tools, applications, web sites, pipeline processes and analytical workflows. To get users started using the Web Services, sample clients are provided covering a range of programming languages and popular Web Service tool kits, and a brief guide to Web Services technologies, including a set of tutorials, is available for those wishing to learn more and develop their own clients. Users of the Web Services are informed of improvements and updates via a range of methods.
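As an illustration of how a retrieval service such as dbfetch might be called from a script, here is a minimal Python sketch. The base URL and the parameter names (`db`, `id`, `format`, `style`) are assumptions based on the commonly documented dbfetch URL syntax, not details given in the abstract; check them against the current EMBL-EBI documentation. The helper names `build_dbfetch_url` and `fetch_entries` are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL for dbfetch; verify against the current EMBL-EBI documentation.
DBFETCH_BASE = "https://www.ebi.ac.uk/Tools/dbfetch/dbfetch"


def build_dbfetch_url(db, entry_ids, fmt="fasta", style="raw"):
    """Build a dbfetch-style retrieval URL for one or more entry identifiers."""
    if not entry_ids:
        raise ValueError("at least one entry identifier is required")
    # Parameter names are assumptions; identifiers are joined with commas.
    params = {"db": db, "id": ",".join(entry_ids), "format": fmt, "style": style}
    return f"{DBFETCH_BASE}?{urlencode(params)}"


def fetch_entries(db, entry_ids, fmt="fasta", timeout=30):
    """Retrieve the entries as text. This performs a network request."""
    url = build_dbfetch_url(db, entry_ids, fmt)
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Separating URL construction from the network call keeps the request logic easy to inspect and test offline; `fetch_entries("uniprotkb", ["P12345"])` would then issue the actual request.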
del-Toro N.,European Bioinformatics Institute
Nucleic acids research | Year: 2013
The Proteomics Standard Initiative Common QUery InterfaCe (PSICQUIC) specification was created by the Human Proteome Organization Proteomics Standards Initiative (HUPO-PSI) to enable computational access to molecular-interaction data resources by means of a standard Web Service and query language. Currently providing >150 million binary interaction evidences from 28 servers globally, the PSICQUIC interface allows the concurrent search of multiple molecular-interaction information resources using a single query. Here, we present an extension of the PSICQUIC specification (version 1.3), which has been released to be compliant with the enhanced standards in molecular interactions. The new release also includes a new reference implementation of the PSICQUIC server available to the data providers. It offers augmented web service capabilities and improves the user experience. PSICQUIC has been running for almost 5 years, with a user base growing from only 4 data providers to 28 (April 2013) allowing access to 151 310 109 binary interactions. The power of this web service is shown in PSICQUIC View web application, an example of how to simultaneously query, browse and download results from the different PSICQUIC servers. This application is free and open to all users with no login requirement (http://www.ebi.ac.uk/Tools/webservices/psicquic/view/main.xhtml).
Agency: GTR | Branch: BBSRC | Program: | Phase: Research Grant | Award Amount: 543.72K | Year: 2016
Research on domesticated animals has important socio-economic impacts, including underpinning and accelerating improvements in the animal sector of agriculture (animal breeding and animal health), contributing to human and veterinary medicine by providing animal models, and improving animal health and welfare. The chicken also serves as a model for all other avian species, so is important in the fields of embryology and development, neurobiology and behaviour, and the ecology and evolution of natural populations. The genome is the entire DNA content of an organism. For the genome sequence to be useful, the sequence needs to be annotated with the location of genes and their regulatory elements along the DNA sequence. Information on the location of at least coding genes (that is, genes which make proteins) is now available for many economically important farmed and companion animals, and efforts are underway in many parts of the world to increase our knowledge of non-coding regions (i.e. non-coding RNAs and regulatory elements). In addition, the advent of new DNA sequencing technologies now means that it is possible to sequence many individuals, compare their differences in DNA sequence and gene content, and relate these to their physiological differences. It is also possible to generate functional data by sequencing (assay by sequencing). Functional sequence data tells us which genes are active, which genes code for proteins and which genes code for regulatory RNAs. It can also tell us about other features within the DNA sequence that are responsible for genes being switched on or off, for example in specific tissues or in response to signalling molecules. Functional sequence data is therefore very important in informing us about how differences in the DNA sequence between individuals can affect gene activity and are therefore likely to affect phenotypes, such as production or disease resistance traits.
Some DNA databases contain functional data for farmed and companion animals, often in its raw form (e.g. sequence reads); however, this data is most useful when it has been checked for quality and processed further, for example by assigning it to specific genes and transcripts. It would also be preferable if more data were submitted to these databases from the research community around the world. Our proposed research aims to look at the data that is available for these animals in the public DNA databases, and check it for quality. We will also work to ensure that future datasets that are submitted to the databases have as much useful information associated with them as possible, for example breed, sex and tissue type. We will also define quality standards for such data and improve data discoverability by drawing together datasets from disparate projects into a cohesive collection that is accessible both programmatically and via a website.
Agency: GTR | Branch: BBSRC | Program: | Phase: Research Grant | Award Amount: 307.67K | Year: 2016
The structure of a protein dictates the manner in which it interacts with other proteins and whether or how it binds and changes the compounds it is exposed to. Knowing a protein's structure can help rationalise the mechanism by which it performs its biological role. It is also important for understanding how genetic changes, such as mutations in the residues that make up the protein, can destroy or modify the way in which it performs that role. Revolutionary new technologies in biology, known as next generation sequencing, are now allowing biologists to collect vast amounts of genetic variation data. Examples include changes in the sequences of proteins collected from humans suffering from diseases such as cancer or heart disease, and sequences of proteins from species important in an agricultural context, such as different strains of wheat that may be more resistant to frost or produce higher yields. However, it is much harder and more expensive to determine the 3D structure of a protein than its sequence. It is particularly difficult for human, mouse, chicken, plants and other eukaryotic organisms that we need to study to understand disease or ensure food security. Currently, on average less than 15% of proteins from these important model organisms have an experimentally determined 3D structure. To address this deficit of structural data, algorithms have been developed for predicting the structure of a protein. The most successful approaches identify a relative having a known structure and inherit 3D information by exploiting the known conservation of structural features between evolutionarily related proteins. Five of the top world-leading resources generating such annotations are based in the UK (SUPERFAMILY, Gene3D, Phyre, FUGUE, pDomTHREADER).
These exploit structural relatives in the SCOP and CATH structural classifications, the two world-leading resources capturing information on domain structures, to use as templates for predicting structures of uncharacterised relatives. The Genome3D resource, which was launched in 2012, integrates domain structure predictions from all five resources for ten model organisms used to study biological systems and important for the study of human health (e.g. human, mouse) or agriculture and food security (e.g. plant). Although the algorithms used by the resources are powerful for recognising even very remote relationships and inheriting structural information between relatives, their accuracy is below 90%. However, by combining all the data in a single resource and identifying positions in the protein where all the methods agree, it is possible to provide much more reliable annotations. Since it is easier to find these consensus regions if equivalent sets of relatives (i.e. families) in SCOP and CATH have been identified, a large part of the project involves mapping between these resources. We now wish to continue this project, improving the mapping of SCOP and CATH and using this to increase the amount of reliable consensus data that Genome3D provides. We will include additional organisms important for health and agriculture. However, a major benefit from this project will be the integration of the Genome3D structural data with structurally uncharacterised sequences in InterPro, a world-leading resource that combines information on protein families from 11 different resources worldwide. By including Genome3D data for families in InterPro we will be able to increase the number of proteins for which we can provide structural data ten-fold. In addition we will provide a very intuitive web-based viewer for looking at the structures and assessing the likely impacts of any changes in the sequence on the function of the protein.
Since many biologists are unfamiliar with the value of structural data in assessing genetic variations we will develop web-based training material and arrange workshops both in our institutes and at international meetings.
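The consensus idea described in this abstract (keeping positions in the protein where all prediction methods agree) can be sketched in a few lines of Python. This is a simplified illustration, not the Genome3D implementation: it assumes each method reports inclusive residue ranges, and it ignores whether the methods assign the same SCOP or CATH family, which the abstract indicates is also part of identifying consensus. The function names and the `min_methods` parameter are hypothetical.

```python
def covered_positions(ranges):
    """Expand a list of inclusive (start, end) residue ranges into a set of positions."""
    covered = set()
    for start, end in ranges:
        covered.update(range(start, end + 1))
    return covered


def consensus_regions(predictions, min_methods=None):
    """Return residue ranges predicted by at least `min_methods` methods.

    predictions: dict mapping a method name to a list of inclusive (start, end) ranges.
    min_methods: agreement threshold; defaults to all methods (strict consensus).
    """
    if not predictions:
        return []
    required = len(predictions) if min_methods is None else min_methods

    # Count how many methods cover each residue position.
    counts = {}
    for ranges in predictions.values():
        for pos in covered_positions(ranges):
            counts[pos] = counts.get(pos, 0) + 1

    # Merge consecutive agreed positions back into ranges.
    regions = []
    for pos in sorted(p for p, c in counts.items() if c >= required):
        if regions and pos == regions[-1][1] + 1:
            regions[-1][1] = pos
        else:
            regions.append([pos, pos])
    return [tuple(region) for region in regions]
```

Relaxing `min_methods` trades reliability for coverage: with three methods, requiring only two to agree will typically annotate more residues than strict consensus does.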
Agency: GTR | Branch: BBSRC | Program: | Phase: Research Grant | Award Amount: 817.23K | Year: 2016
Micro-organisms are found in virtually all environments. Typically, they form the base of the food chain (such as plankton in the sea) and play essential roles in their ecosystems. There is often a complex interplay between different micro-organisms, with some organisms requiring that others be present in order for them to exist. When there is an imbalance within a community, this can lead to severe effects, such as disease in the human gut, or the inability of plants to grow efficiently in soil. An understanding of the composition and interplay within the communities allows us to potentially manipulate them. Thus, there is intense research into micro-organism communities in many different fields, such as improving livestock yields, the recovery from bacterial infections using fecal transplants and the efficient production of biofuels. Many of these communities also contain important proteins that could be useful to the biotechnological and pharmaceutical industries, such as enzymes involved in the production of antibiotics. Metagenomics is the study of these different micro-organism communities, which is achieved by isolating the DNA from the organisms within an environmental sample (e.g. water, soil, animal stool), sequencing the DNA, and then using computational analysis to decode which organisms are present and the functions they might be performing. This computation is complicated: (1) there is a huge amount of data; (2) the sequence data is a jumbled mix of fragments from different organisms; (3) decoding the DNA is hard, as typically >90% of organisms within a sample are not well characterised. This proposal brings together three major resources within the field of metagenomics data archiving and analysis. The European Nucleotide Archive (ENA) is a repository of DNA sequence data. Importantly, ENA also captures metagenomic contextual data, such as where and when the sample was taken, and how the DNA was extracted and sequenced.
The EBI metagenomics portal (EMG, UK) and MG-RAST (MGR, US) are two metagenomics sequence analysis platforms. They are the only free-to-use services to which researchers can upload sequence data and have it analysed without restriction. Despite the widespread use of metagenomics, the community currently lacks standards to ensure that metagenomics sequence data and the derived functional and taxonomic information are deposited within a database of record. Consequently, navigating between metagenomics datasets is very difficult even for experienced users. As the two platforms offer slightly different, yet complementary, analysis services, there is often the desire to have a metagenomics dataset analysed by both resources. But the number of equivalent datasets between the two resources is unknown. Unless a user has prior knowledge about equivalent projects, they remain disconnected. Also, sequence data submitted to MGR may not necessarily be deposited in ENA. We propose to set up a computational framework, termed Metagenomics Exchange (ME), to enable metagenomics datasets and the results of their analysis to be linked. All sequences will become available to the research community via ENA and analysis results will be automatically exchanged between EMG and MGR. The ME will be implemented to enable other metagenomics analysis providers to join, and so that it can be used by researchers wishing to perform large scale analyses. We will also investigate ways that our own pipelines can be enhanced through the use of the ME, sharing software and processing tasks, for example. This will lead to computational savings, increasing the capacity for metagenomics analysis. We will also generate a knowledge transfer forum, enabling the exchange of ideas on a range of topics, from hardware solutions to algorithms.
Finally, we will undertake a research program to investigate the optimal combination of pipeline analysis components, and whether a single, unified analysis pipeline could be engineered.
Agency: GTR | Branch: BBSRC | Program: | Phase: Research Grant | Award Amount: 327.05K | Year: 2016
Even the simplest living organisms perform a huge number of different processes, which are interconnected in complex ways to ensure that the organism responds appropriately to its environment. One of the ways of ensuring that we really understand how these processes fit together is to build quantitative mathematical models of them which can be simulated using computers. This approach is known as Computational Systems Biology. As in most domains of science, databases are essential to allow the proverbial standing on the shoulders of giants. BioModels provides access to quantitative biology models that have been published in the scientific literature, and verified to be accurate. For instance, in order to develop a quantitative model of cell tumorigenesis, one may choose a suitable model of the cell cycle, and attempt to merge it with models of relevant cell signalling pathways such as the MAP kinase cascade. Since its creation at EMBL-EBI in 2005 - and supported by the BBSRC since 2008 - BioModels has undergone exponential growth to become the worldwide reference for quantitative models of biological processes. Deposition of models upon publication is advised by several hundred scientific journals, and the resource receives around a million page requests a year from around 65 000 distinct users. Over recent years, mathematical models of biological processes have become larger and more complex. To represent living structures more accurately, multi-scale models are developed, with processes taking place at different scales described by components coupled together. Because of the nature of the processes as well as the experimental information available, a variety of modelling approaches are used. In 2012, a landmark publication described the first such model of a bacterium (Mycoplasma genitalium) that accounts for all of its components and their interactions.
Multi-scale models of organs and organisms (for instance the BBSRC-funded model of Arabidopsis) are now available. Those models must be made available, regardless of their structure, the modelling approach used and the formats they are encoded in. The MultiMod project, developed collaboratively by the Babraham Institute and EMBL-EBI, will extend BioModels coverage considerably, to support a large variety of mathematical models developed in the biosciences. This will be achieved by accepting more file formats, improving support for standard formats and expanding the information distributed (for instance including necessary experimental datasets, simulation descriptions etc). We will also expand the spectrum of models we verify by reproducing published results. Such an increase in scope will be accompanied by an improvement of the software and hardware infrastructure underlying the resource. To make the resource more useful we will develop better browsing, search and download capabilities. Finally, we will improve and further develop our documentation and training offer, including online and in-person tutorials and courses. We expect that MultiMod will improve the user experience significantly, increase the number of users and the communities using the resource, and make it more useful for the UK biosciences.
Agency: GTR | Branch: BBSRC | Program: | Phase: Research Grant | Award Amount: 685.41K | Year: 2016
In molecular biology, the central dogma explains that the genes in DNA code for RNA. RNA molecules are then translated into proteins that are the mini-machines that carry out the main processes in the cell. Recently it has become apparent that potentially many thousands of human genes code for RNAs that are not translated into proteins, but rather carry out important functions in the cell as RNA. These molecules are often known as non-coding RNAs. Much of the focus in biology over the past thirty years of research has been on DNA and proteins, but recently there has been a surge of interest in non-coding RNAs. In fact, the core of the machine that makes proteins from RNA, called the ribosome, has itself been shown to be made of RNA. Non-coding RNAs have also been shown to be widely involved in regulating the levels of other genes and may be useful in making treatments for patients with a variety of diseases. The role of non-coding RNAs in plant and animal development is evident, but a deeper understanding of the biology is essential if they are to be modulated to enhance features such as yield or resistance to diseases. Unsurprisingly, aberrant expression of non-coding RNAs has also been implicated in numerous disease states. Research and innovation in the area of non-coding RNAs, and in molecular biology more generally, is hampered by the lack of an authoritative and complete resource collecting together all known non-coding RNAs. There are over 30 different online databases that contain information about different types of RNA molecules. Each of these resources makes its information available in a different way. The scattered nature of these resources has made it nearly impossible for biologists to discover what is known about non-coding RNAs related to their research area. To address this problem we created a resource called RNAcentral that brings together information from all the different RNA databases in one place.
The most important information stored in RNAcentral is the sequence of the RNA. Many existing RNA resources (called RNAcentral Expert Databases) have provided their data to RNAcentral. In this proposal we will add further, more detailed information about the structure and function of RNAs to RNAcentral. We will work closely with one specific expert database, called miRBase, based at the University of Manchester, which will test out the system for searching the RNAcentral sequence database on specific subsets of RNAs. By the end of this project, researchers from around the UK and the rest of the world will have access to an increased set of information about RNAs. This information will be freely available in a variety of ways including via a website and as a downloadable database. Having access to this information will help researchers connect RNAs to their work and make new discoveries sooner.
Agency: GTR | Branch: BBSRC | Program: | Phase: Research Grant | Award Amount: 310.68K | Year: 2016
Over the last three decades, DNA sequencing has become a key technology across and beyond the life sciences. Indeed, few areas of biological research remain untouched by either the direct use of the technology or knowledge that is derived from others' work in which sequencing has been used. The technology has advanced rapidly. In the mid-2000s, a second generation of sequencing technologies, quite unlike the first, brought a step change in the rate at which sequencing machines could operate, and a corresponding vast reduction in the cost. These technologies now dominate and have led to a wealth of new and impactful scientific findings, not least as the core sequencing technology behind many thousands of animal, plant, fungal and bacterial projects. We are now on the cusp of a third wave of technology, nanopore sequencing, again quite unlike those that precede it, that promises similar game-changing advances. In the Adaptive Sampling project, we recognise the potential of nanopore sequencing and focus on a particular, as yet under-explored, feature of the technology that promises very significant impact. Nanopore sequencing uses microscopic pores that can be engineered and organised onto a surface. The pores allow DNA molecules to pass through one at a time from one side of the surface to the other. As they transit, the pores provide a direct read-out of the bases (A, C, G and T) that pass the inner surface of the pore. The user places a mixture of DNA molecules (fragments of a whole genome) above the pore, which then captures the end of a DNA molecule and starts to draw it through, reading its sequence as it goes. The control of the system is so refined that, if desired, a DNA molecule can be rejected from a pore before it has been fully sequenced and the capture process can start again rapidly. A key challenge for all sequencing platforms is that some parts of genomes are difficult to sequence and others are not.
Because of this, to be certain that a genome sequencing experiment has captured all parts of a genome, the user must set the experiment up to read the genome many times (often 30 times), so that the difficult regions are read at least once. With Adaptive Sampling, we plan to overcome this obstacle with software that will rapidly read the early sequence from a pore, and make a decision about whether the part of the genome that is emerging from the pore has been read already or is yet to be read. Based on this, a decision can be made as to whether to reject the DNA molecule from the pore or to carry on reading to the end. The time saving to be achieved by avoiding re-sequencing in this way will be substantial, leading to far more cost-effective, rapid and targeted sequencing. While our technology will be useful broadly, we will work specifically with five example challenges in which the tools will be useful. These cover detection and identification of infectious bacteria, the study of agricultural livestock, investigation of crop plant genomes, work on farmed fish to understand responses to disease-causing species and the analysis of communities of microbial species in the environment. There is substantial novelty in this approach. In previous work on Ebola virus, we have shown that rejecting reads using a prototype of our software has potential. What we now propose will be the first example, to the best of our knowledge, of a sequencing approach in which data analysis (previously something that happened after sequencing was completed) has direct impact on the way in which the physical sequencing machine itself is operated during a sequencing experiment. As part of the project, aiming at the broadest possible benefit to the research community, we plan to publish the software and hold two workshops in which we disseminate what we have developed to technologists, genomics laboratories, research scientists and industry.
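The reject-or-continue decision described above can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the project's software: it assumes the early sequence from a pore has already been mapped to a contig and position by some external step (not shown), tracks coverage in fixed-size bins, and counts a read toward only the bin where it starts. The class name `AdaptiveSampler` and its parameters are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AdaptiveSampler:
    """Toy decision logic: reject reads from regions that are already covered."""

    target_depth: int = 1   # reads wanted per bin before rejecting further ones
    bin_size: int = 1000    # genome is tracked in fixed-size bins (an assumption)
    depth: dict = field(default_factory=dict)

    def decide(self, contig, position):
        """Return "reject" or "continue" for a read whose first bases map to `position`.

        `position` is None when the early sequence could not be placed; such reads
        are kept, since they may come from a region not yet seen.
        """
        if position is None:
            return "continue"
        key = (contig, position // self.bin_size)
        if self.depth.get(key, 0) >= self.target_depth:
            return "reject"
        # Simplification: assume the read will be sequenced to the end and
        # credit it to the bin containing its start position only.
        self.depth[key] = self.depth.get(key, 0) + 1
        return "continue"
```

For example, with `target_depth=2`, the first two reads starting in a given bin are sequenced in full and a third is rejected, freeing the pore to capture a molecule from a less-covered region.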