Center for Excellence in Bioinformatics

Belo Horizonte, Brazil


Protasio A.V.,Wellcome Trust Sanger Institute | Tsai I.J.,Wellcome Trust Sanger Institute | Babbage A.,Wellcome Trust Sanger Institute | Nichol S.,Wellcome Trust Sanger Institute | And 22 more authors.
PLoS Neglected Tropical Diseases | Year: 2012

Schistosomiasis is one of the most prevalent parasitic diseases, affecting millions of people in developing countries. Amongst the human-infective species, Schistosoma mansoni is also the most commonly used in the laboratory, and here we present the systematic improvement of its draft genome. We used Sanger capillary and deep-coverage Illumina sequencing from clonal worms to upgrade the highly fragmented draft 380 Mb genome to one with only 885 scaffolds and more than 81% of the bases organised into chromosomes. We have also used transcriptome sequencing (RNA-seq) from four time points in the parasite's life cycle to refine gene predictions and profile their expression. More than 45% of predicted genes have been extensively modified and the total number has been reduced from 11,807 to 10,852. Using the new version of the genome, we identified trans-splicing events occurring in at least 11% of genes and identified clear cases where it is used to resolve polycistronic transcripts. We have produced a high-resolution map of temporal changes in expression for 9,535 genes, covering an unprecedented dynamic range for this organism. All of these data have been consolidated into a searchable format within the GeneDB and SchistoDB databases. With further transcriptional profiling and genome sequencing increasingly accessible, the upgraded genome will form a fundamental dataset to underpin further advances in schistosome research. © 2012 Protasio et al.

Coimbra R.S.,Center for Excellence in Bioinformatics | Coimbra R.S.,Genomics and Computational Biology Group | Artiguenave F.,University of Évry Val d'Essonne | Jacques L.S.R.Z.,Center for Excellence in Bioinformatics | And 2 more authors.
Journal of Clinical Microbiology | Year: 2010

Escherichia coli and Shigella O antigens can be inferred using the rfb-restriction fragment length polymorphism (RFLP) molecular test. We present herein software based on a dynamic programming algorithm that compares the rfb-RFLP patterns of clinical isolates with those in a database containing the 171 previously published patterns corresponding to all known E. coli/Shigella O antigens. Copyright © 2010, American Society for Microbiology. All Rights Reserved.
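The abstract does not describe the scoring scheme, so the following is only a minimal sketch of how a dynamic programming comparison of two RFLP patterns could look. The function name `rflp_similarity`, the 5% size tolerance and the Dice-style normalisation are illustrative assumptions, not the published method.

```python
def rflp_similarity(pattern_a, pattern_b, tolerance=0.05):
    """Dice-style similarity between two fragment-size patterns (sizes in bp).

    Hypothetical scoring: two fragments "match" when their sizes differ by at
    most `tolerance` (relative to the larger fragment). Dynamic programming
    finds the largest order-preserving set of matched fragments.
    """
    a = sorted(pattern_a, reverse=True)
    b = sorted(pattern_b, reverse=True)
    if not a and not b:
        return 1.0
    # dp[i][j] = maximum number of matched pairs between a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            best = max(dp[i - 1][j], dp[i][j - 1])
            limit = tolerance * max(a[i - 1], b[j - 1])
            if abs(a[i - 1] - b[j - 1]) <= limit:
                best = max(best, dp[i - 1][j - 1] + 1)
            dp[i][j] = best
    return 2 * dp[len(a)][len(b)] / (len(a) + len(b))
```

An isolate's pattern could then be scored against every database entry and the highest-scoring O antigen reported, subject to whatever similarity threshold the laboratory chooses.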

Coimbra R.S.,Center for Excellence in Bioinformatics | Coimbra R.S.,Genomics and Computational Biology Group | Vanderwall D.E.,Glaxosmithkline | Oliveira G.C.,Center for Excellence in Bioinformatics | Oliveira G.C.,Genomics and Computational Biology Group
BMC Genomics | Year: 2010

Background: Retrieving pertinent information from biological scientific literature requires cutting-edge text mining methods which may be able to recognize the meaning of the very ambiguous names of biological entities. Aliases of a gene share a common vocabulary in their respective collections of PubMed abstracts. This may be true even when these aliases are not associated with the same subset of documents. This gene-specific vocabulary defines a unique fingerprint that can be used to disclose ambiguous aliases. The present work describes an original method for automatically assessing the ambiguity levels of gene aliases in large gene terminologies based exclusively on the content of their associated literature. The method can deal with the two major problems restricting the usage of current text mining tools: 1) different names associated with the same gene; and 2) one name associated with multiple genes, or even with non-gene entities. Importantly, this method does not require training examples. Results: Aliases were considered "ambiguous" when their Jaccard distance to the respective official gene symbol was equal to or greater than the smallest distance between the official gene symbol and one of the three internal controls (randomly picked unrelated official gene symbols). Otherwise, they were assigned the status of "synonyms". We evaluated the coherence of the results by comparing the frequencies of the official gene symbols in the text corpora retrieved with their respective "synonyms" or "ambiguous" aliases. Official gene symbols were mentioned in the abstract collections of 42% (70/165) of their respective synonyms. No official gene symbol occurred in the abstract collections of any of their respective ambiguous aliases.
Overall, querying PubMed with official gene symbols and "synonym" aliases allowed a 3.6-fold increase in the number of unique documents retrieved. Conclusions: These results confirm that this method is able to distinguish between synonyms and ambiguous gene aliases based exclusively on their vocabulary fingerprint. The approach we describe could be used to enhance the retrieval of relevant literature related to a gene. © 2010 Coimbra et al; licensee BioMed Central Ltd.
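The decision rule stated in the Results can be sketched as follows. The rule itself (compare the alias-to-symbol Jaccard distance with the smallest symbol-to-control distance) comes from the abstract; representing each vocabulary fingerprint as a plain set of terms, and the function names, are assumptions made for illustration.

```python
def jaccard_distance(vocab_a, vocab_b):
    """Jaccard distance between two vocabularies (1 - |A ∩ B| / |A ∪ B|)."""
    a, b = set(vocab_a), set(vocab_b)
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


def classify_alias(alias_vocab, symbol_vocab, control_vocabs):
    """Label an alias "ambiguous" or "synonym" relative to its official symbol.

    `control_vocabs` holds the vocabularies of the unrelated official gene
    symbols used as internal controls (three in the abstract).
    """
    threshold = min(jaccard_distance(symbol_vocab, c) for c in control_vocabs)
    distance = jaccard_distance(alias_vocab, symbol_vocab)
    return "ambiguous" if distance >= threshold else "synonym"
```

How terms are extracted from the PubMed abstracts and filtered before this comparison is not specified in the abstract and is left out of the sketch.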

Camargo D.R.A.,Oswaldo Cruz Foundation | Camargo D.R.A.,Ezequiel Dias Foundation FUNED | Pais F.S.,Genomics and Computational Biology Group | Pais F.S.,Center for Excellence in Bioinformatics | And 4 more authors.
BMC Genomics | Year: 2015

Background: Ninety-two Streptococcus pneumoniae serotypes have been described so far, but the pneumococcal conjugate vaccine introduced in the Brazilian basic vaccination schedule in 2010 covers only the ten most prevalent in the country. Pneumococcal serotype-shifting after massive immunization is a major concern and monitoring this phenomenon requires efficient and accessible serotyping methods. Pneumococcal serotyping based on antisera produced in animals is laborious and restricted to a few reference laboratories. Alternatively, molecular serotyping methods assess polymorphisms in the cps gene cluster, which encodes key enzymes for capsular polysaccharide synthesis in pneumococci. In one such approach, cps-RFLP, the PCR-amplified cps loci are digested with an endonuclease, generating serotype-specific fingerprints on agarose gel electrophoresis. Methods: In this work, in silico and in vitro approaches were combined to demonstrate that XhoII is the most discriminating endonuclease for cps-RFLP, and to build a database of serotype-specific fingerprints that accommodates the genetic diversity within the cps locus of 92 known pneumococcal serotypes. Results: The expected specificity of cps-RFLP using XhoII was 76% for serotyping and 100% for serogrouping. The database of cps-RFLP fingerprints was integrated into Molecular Serotyping Tool (MST), a previously published web-based software for molecular serotyping. In addition, 43 isolates representing 29 serotypes prevalent in the state of Minas Gerais, Brazil, from 2007 to 2013, were examined in vitro; 11 serotypes (nine serogroups) matched the respective in silico patterns calculated for reference strains. The remaining experimental patterns, despite their resemblance to their expected in silico patterns, did not reach the threshold of similarity score to be considered a match and were then added to the database.
Conclusion: The cps-RFLP method with XhoII outperformed the antisera-based and other molecular serotyping methods with regard to the expected specificity. In order to accommodate the genetic variability of the pneumococcal cps loci, the database of cps-RFLP patterns will be progressively expanded to include new variant in vitro patterns. The cps-RFLP method with endonuclease XhoII coupled with MST for computer-assisted interpretation of results may represent a relevant contribution to the real-time detection of changes in regional pneumococcal population diversity in response to mass immunization programs. © 2015 Camargo et al.; licensee BioMed Central Ltd.
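As a rough illustration of the in silico half of cps-RFLP, the sketch below digests a linear amplicon and returns the expected fragment lengths. It assumes the commonly documented XhoII recognition site R^GATCY (R = A or G, Y = C or T, cutting after the first base); the function name `digest` and the toy sequence in the usage note are hypothetical and not taken from the paper.

```python
import re

# Assumed XhoII site: RGATCY, cut after the first base (R^GATCY).
# A lookahead is used so that overlapping sites are also reported.
XHOII_SITE = re.compile(r"(?=[AG]GATC[CT])")


def digest(sequence, site=XHOII_SITE, cut_offset=1):
    """Return fragment lengths (bp, largest first) of a linear DNA sequence."""
    seq = sequence.upper()
    cuts = [m.start() + cut_offset for m in site.finditer(seq)]
    bounds = [0] + cuts + [len(seq)]
    fragments = [end - start for start, end in zip(bounds, bounds[1:])]
    return sorted(fragments, reverse=True)
```

For example, `digest("TTTTAGATCTCCCCCCCCGGATCCAA")` yields three fragments whose lengths sum to the amplicon length. Real gels resolve fragments only within a size window, so a practical fingerprint would also discard bands too small or too close together to distinguish.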

Torrieri R.,Center for Excellence in Bioinformatics | Oliveira F.S.,Center for Excellence in Bioinformatics | Oliveira G.,Center for Excellence in Bioinformatics | Oliveira G.,Genomics and Computational Biology Group | And 2 more authors.
PLoS ONE | Year: 2012

In recent years, there has been an exponential increase in the number of publicly available genomes. Once finished, most genome projects lack financial support to review annotations. A few of these gene annotations are based on a combination of bioinformatics evidence; however, in most cases, annotations are based solely on sequence similarity to a previously known gene, which was most probably annotated in the same way. As a result, a large number of predicted genes remain unassigned to any functional category despite the fact that there is enough evidence in the literature to predict their function. We developed a classifier trained with term-frequency vectors automatically disclosed from text corpora of an ensemble of genes representative of each functional category of the J. Craig Venter Institute Comprehensive Microbial Resource (JCVI-CMR) ontology. The classifier achieved up to 84% precision with 68% recall (for confidence ≥ 0.4), F-measure 0.76 (recall and precision equally weighted) in an independent set of 2,220 genes, from 13 bacterial species, previously classified by JCVI-CMR into unambiguous categories of its ontology. Finally, the classifier assigned (confidence ≥ 0.7) to functional categories a total of 5,235 out of the ~24 thousand genes previously in categories "Unknown function" or "Unclassified" for which there is literature in MEDLINE. Two biologists reviewed the literature of 100 of these genes, randomly picked, and assigned them to the same functional categories predicted by the automatic classifier. Our results confirmed the hypothesis that it is possible to confidently assign genes of a real-world repository to functional categories, based exclusively on the automatic profiling of their associated literature. The LitProf - Gene Classifier web server is accessible online. © 2012 Torrieri et al.
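The abstract does not name the learning algorithm, so the following is only a toy sketch of the general idea: build one term-frequency profile per functional category from example literature, then assign a gene to the closest profile if the similarity clears a confidence cut-off. The nearest-centroid approach, the use of cosine similarity as the "confidence", and all function names are assumptions and should not be read as the LitProf implementation.

```python
import math
from collections import Counter


def tf_vector(text):
    """Term-frequency vector of a text (whitespace tokenisation, lower-cased)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def train(labelled_texts):
    """Sum the term frequencies of all training texts in each category."""
    centroids = {}
    for category, text in labelled_texts:
        centroids.setdefault(category, Counter()).update(tf_vector(text))
    return centroids


def classify(text, centroids, min_confidence=0.4):
    """Return (category, score); category is None below the cut-off."""
    vec = tf_vector(text)
    category, score = max(
        ((c, cosine(vec, centroid)) for c, centroid in centroids.items()),
        key=lambda pair: pair[1],
    )
    return (category, score) if score >= min_confidence else (None, score)
```

Raising `min_confidence` trades recall for precision, which mirrors the two cut-offs (0.4 and 0.7) quoted in the abstract, although the abstract's confidence measure may be defined differently.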

Pereira U.P.,Federal University of Minas Gerais | Pereira U.P.,Federal University of Lavras | Dos Santos A.R.D.,Federal University of Minas Gerais | Hassan S.S.,Federal University of Minas Gerais | And 21 more authors.
Standards in Genomic Sciences | Year: 2013

Streptococcus agalactiae (Lancefield group B; GBS) is the causative agent of meningoencephalitis in fish, mastitis in cows, and neonatal sepsis in humans. Meningoencephalitis is a major health problem for tilapia farming and is responsible for high economic losses worldwide. Despite its importance, the genomic characteristics and the main molecular mechanisms involved in virulence of S. agalactiae isolated from fish are still poorly understood. Here, we present the genomic features of the 1,820,886 bp long complete genome sequence of S. agalactiae SA20-06 isolated from a meningoencephalitis outbreak in Nile tilapia (Oreochromis niloticus) from Brazil, and its annotation, consisting of 1,710 protein-coding genes (excluding pseudogenes), 7 rRNA operons, 79 tRNA genes and 62 pseudogenes.

Thiele E.A.,Purdue University | Correa-Oliveira G.,Center for Excellence in Bioinformatics | Correa-Oliveira G.,Instituto Nacional Of Ciencia E Tecnologia Em Doencas Tropicais | Gazzinelli A.,Instituto Nacional Of Ciencia E Tecnologia Em Doencas Tropicais | And 2 more authors.
Tropical Medicine and International Health | Year: 2013

Objective: The freshwater snail Biomphalaria glabrata is the principal intermediate host for the parasite Schistosoma mansoni within Brazil. We assessed the potential effects of snail population dynamics on parasite transmission dynamics via population genetics. Methods: We sampled snail populations located within the confines of three schistosome-endemic villages in the state of Minas Gerais, Brazil. Snails were collected from individual microhabitats following seasonal periods of flood and drought over the span of 1 year. Snail spatio-temporal genetic diversity and population differentiation of 598 snails from 12 sites were assessed at seven microsatellite loci. Results: Average genetic diversity was relatively low, ranging from 4.29 to 9.43 alleles per locus, and overall, subpopulations tended to exhibit heterozygote deficits. Genetic diversity was highly spatially partitioned among subpopulations, while virtually no partitioning was observed across temporal sampling. Comparison with previously published parasite genetic diversity data indicated that S. mansoni populations are significantly more variable and less subdivided than those of the B. glabrata intermediate hosts. Discussion: Within individual Brazilian villages, observed distributions of snail genetic diversity indicate temporal stability and very restricted gene flow. This is contrary to observations of schistosome genetic diversity over the same spatial scale, corroborating the expectation that parasite gene flow at the level of individual villages is likely driven by vertebrate host movement. © 2013 John Wiley & Sons Ltd.
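To make "alleles per locus" and "heterozygote deficit" concrete, the sketch below computes both for a single microsatellite locus from diploid genotypes. It uses the textbook definitions (expected heterozygosity He = 1 − Σ p_i², inbreeding coefficient F_IS = 1 − Ho/He) without small-sample correction; the abstract does not state which estimators the authors used, and the function name and toy genotypes are hypothetical.

```python
from collections import Counter


def locus_stats(genotypes):
    """Summarise one locus from a list of diploid genotypes (allele pairs).

    Returns the number of distinct alleles, observed heterozygosity (Ho),
    uncorrected expected heterozygosity (He) and F_IS = 1 - Ho / He.
    A positive F_IS indicates fewer heterozygotes than expected.
    """
    allele_counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(allele_counts.values())
    he = 1 - sum((n / total) ** 2 for n in allele_counts.values())
    ho = sum(1 for a, b in genotypes if a != b) / len(genotypes)
    fis = 1 - ho / he if he else 0.0
    return {"n_alleles": len(allele_counts), "Ho": ho, "He": he, "Fis": fis}
```

With the toy sample `[(100, 100), (100, 100), (104, 104), (100, 104)]`, only one of four snails is heterozygous, so F_IS comes out positive, which is the pattern the abstract calls a heterozygote deficit.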

Pais F.S.M.,Center for Excellence in Bioinformatics | Pais F.S.M.,Instituto Nacional Of Ciencia E Tecnologia | Ruy P.D.C.,Center for Excellence in Bioinformatics | Oliveira G.,Center for Excellence in Bioinformatics | And 3 more authors.
Algorithms for Molecular Biology | Year: 2014

Background: Multiple sequence alignment (MSA) is an extremely useful tool for molecular and evolutionary biology and there are several programs and algorithms available for this purpose. Although previous studies have compared the alignment accuracy of different MSA programs, their computational time and memory usage have not been systematically evaluated. Given the unprecedented amount of data produced by next generation deep sequencing platforms, and increasing demand for large-scale data analysis, it is imperative to optimize the application of software. Therefore, a balance between alignment accuracy and computational cost has become a critical indicator of the most suitable MSA program. We compared both accuracy and cost of nine popular MSA programs, namely CLUSTALW, CLUSTAL OMEGA, DIALIGN-TX, MAFFT, MUSCLE, POA, Probalign, Probcons and T-Coffee, against the benchmark alignment dataset BAliBASE and discuss the relevance of some implementations embedded in each program's algorithm. Accuracy of alignment was calculated with the two standard scoring functions provided by BAliBASE, the sum-of-pairs and total-column scores, and computational costs were determined by collecting peak memory usage and time of execution. Results: Our results indicate that the consistency-based programs Probcons, T-Coffee, Probalign and MAFFT mostly outperformed the other programs in accuracy. Whenever sequences with large N/C terminal extensions were present in the BAliBASE suite, Probalign, MAFFT and also CLUSTAL OMEGA outperformed Probcons and T-Coffee. The drawback of these programs is that they are more memory-greedy and slower than POA, CLUSTALW, DIALIGN-TX, and MUSCLE. CLUSTALW and MUSCLE were the fastest programs, with CLUSTALW being the least memory-demanding program. Conclusions: Based on the results presented herein, all four programs Probcons, T-Coffee, Probalign and MAFFT are well recommended for better accuracy of multiple sequence alignments.
T-Coffee and recent versions of MAFFT can deliver faster and reliable alignments, which are especially suited for larger datasets than those encountered in the BAliBASE suite, if multi-core computers are available. In fact, parallelization of alignments for multi-core computers should probably be addressed by more programs in the near future, which will certainly improve performance significantly. © 2014 Pais et al.; licensee BioMed Central Ltd.
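The two accuracy measures named in the Background can be sketched in a simplified form: the sum-of-pairs (SP) score is the fraction of residue pairs aligned in the reference that are also aligned in the test alignment, and the total-column (TC) score is the fraction of reference columns reproduced exactly. This sketch scores every reference column containing at least two residues; the BAliBASE scoring program itself may restrict scoring to annotated core regions, and the function names here are illustrative assumptions.

```python
from itertools import combinations


def columns(alignment):
    """Describe each column as a tuple of residue indices (None for a gap)."""
    counters = [0] * len(alignment)
    cols = []
    for col in zip(*alignment):
        entry = []
        for k, ch in enumerate(col):
            if ch == "-":
                entry.append(None)
            else:
                entry.append(counters[k])
                counters[k] += 1
        cols.append(tuple(entry))
    return cols


def aligned_pairs(cols):
    """Set of (seq_i, residue_i, seq_j, residue_j) pairs sharing a column."""
    pairs = set()
    for col in cols:
        for (i, a), (j, b) in combinations(enumerate(col), 2):
            if a is not None and b is not None:
                pairs.add((i, a, j, b))
    return pairs


def sp_tc_scores(test, reference):
    """Return (SP, TC) of a test alignment against a reference alignment.

    Both alignments are lists of gapped strings with sequences in the same order.
    """
    ref_cols, test_cols = columns(reference), columns(test)
    ref_pairs = aligned_pairs(ref_cols)
    shared = ref_pairs & aligned_pairs(test_cols)
    sp = len(shared) / len(ref_pairs) if ref_pairs else 0.0
    # Only reference columns that align at least two residues are scored.
    scored = [c for c in ref_cols if sum(x is not None for x in c) >= 2]
    test_set = set(test_cols)
    tc = sum(c in test_set for c in scored) / len(scored) if scored else 0.0
    return sp, tc
```

Comparing the reference `["AC-GT", "ACTGT"]` with the test alignment `["ACG-T", "ACTGT"]`, one of the four reference residue pairs and one of the four scored columns are lost, so both scores drop below 1.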
