Uratim Ltd.



Szalkai B.,Eötvös Loránd University | Grolmusz V.,Eötvös Loránd University | Grolmusz V.,Uratim Ltd.
Methods | Year: 2017

Biological sequences can be considered as data items of high-, non-fixed dimensions, corresponding to the length of those sequences. The comparison and the classification of biological sequences in their relations to large databases are important areas of research today. Artificial neural networks (ANNs) have gained a well-deserved popularity among machine learning tools upon their recent successful applications in image- and sound processing and classification problems. ANNs have also been applied for predicting the family or function of a protein, knowing its residue sequence. Here we present two new ANNs with multi-label classification ability, showing impressive accuracy when classifying protein sequences into 698 UniProt families (AUC = 99.99%) and 983 Gene Ontology classes (AUC = 99.45%). © 2017 Elsevier Inc.

Kerepesi C.,Eötvös Loránd University | Grolmusz V.,Eötvös Loránd University | Grolmusz V.,Uratim Ltd.
Current Microbiology | Year: 2016

DNA sequencing technologies are applied widely and frequently today to describe metagenomes, i.e., microbial communities in environmental or clinical samples, without the need for culturing them. These technologies usually return short (100–300 base-pair) DNA reads, and these reads are processed by metagenomic analysis software that assigns phylogenetic composition information to the dataset. Here we evaluate three metagenomic analysis tools (AmphoraNet, a webserver implementation of AMPHORA2; MG-RAST; and MEGAN5) for their ability to assign quantitative phylogenetic information to the data, describing the frequency of appearance of microorganisms of the same taxon in the sample. The difficulty of the task arises from the fact that longer genomes produce more reads from the same organism than shorter genomes, and some software assigns higher frequencies to species with longer genomes than to those with shorter ones. This phenomenon is called the “genome length bias.” Dozens of complex artificial metagenome benchmarks can be found in the literature. Because of the complexity of those benchmarks, it is usually difficult to judge the resistance of a metagenomic tool to this “genome length bias.” Therefore, we have made a simple benchmark for the evaluation of “taxon-counting” in a metagenomic sample: we took the same number of copies of three full bacterial genomes of different lengths, broke them up randomly into short reads with an average length of 150 bp, and mixed the reads. Because of its simplicity, the benchmark is not supposed to serve as a mock metagenome, but if a tool fails on that simple task, it will surely fail on most real metagenomes. We applied the three tools to the benchmark. The ideal quantitative solution would assign the same proportion to the three bacterial taxa. We have found that AMPHORA2/AmphoraNet gave the most accurate results and that the other two tools under-performed: they quite reliably counted each short read toward its respective taxon, producing the typical genome length bias. The benchmark dataset is available at http://pitgroup.org/static/3RandomGenome-100kavg150bps.fna. © 2016, Springer Science+Business Media New York.
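
The benchmark construction described in this abstract can be illustrated with a minimal sketch. It is not the authors' script: the function names, the uniform read-length distribution, and the random toy genomes are all assumptions made for illustration. It shows why naive per-read counting gives larger shares to longer genomes even when every genome is present in the same number of copies.

```python
import random
from collections import Counter


def fragment_genome(genome, rng, mean_len=150, spread=30):
    """Cut one genome copy into consecutive reads of roughly mean_len bases."""
    reads, pos = [], 0
    while pos < len(genome):
        # Read lengths are uniform in [mean_len - spread, mean_len + spread];
        # this distribution is an assumption, not the paper's procedure.
        length = rng.randint(mean_len - spread, mean_len + spread)
        reads.append(genome[pos:pos + length])
        pos += length
    return reads


def build_benchmark(genomes, copies=1, seed=0):
    """Return a shuffled list of (taxon, read) pairs, equal copies per genome."""
    rng = random.Random(seed)
    labelled = []
    for taxon, sequence in genomes.items():
        for _ in range(copies):
            labelled.extend((taxon, r) for r in fragment_genome(sequence, rng))
    rng.shuffle(labelled)
    return labelled


def read_shares(labelled_reads):
    """Naive per-read taxon counting: the fraction of reads per taxon."""
    counts = Counter(taxon for taxon, _ in labelled_reads)
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


# Toy genomes of different lengths (random bases, hypothetical taxa).
toy_rng = random.Random(42)
toy_genomes = {
    name: "".join(toy_rng.choice("ACGT") for _ in range(size))
    for name, size in [("short", 3000), ("medium", 6000), ("long", 12000)]
}
print(read_shares(build_benchmark(toy_genomes, copies=2)))
# The ideal quantitative answer would be 1/3 for each taxon.
```

In this toy setting the read share of each taxon tracks its genome length rather than its copy number, which is the bias the benchmark is meant to expose.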

Szalkai B.,Eötvös Loránd University | Grolmusz V.,Eötvös Loránd University | Grolmusz V.,Uratim Ltd.
Genomics | Year: 2016

Discoveries of new biomarkers for frequently occurring diseases are of special importance in today's medicine. While fully developed type II diabetes (T2D) can be detected easily, the early identification of high-risk individuals is an area of interest in T2D, too. Metagenomic analysis of the human bacterial flora has shown subtle changes in diabetic patients, but no specific microbes are known to cause or promote the disease. Moderate changes were also detected in the microbial gene composition of the metagenomes of diabetic patients, but again, no specific gene was found that is present in disease-related metagenomes and missing in healthy ones. However, these fine differences in microbial taxon and gene composition are difficult to apply as quantitative biomarkers for diagnosing or predicting type II diabetes. In the present work we report some nucleotide 9-mers with significantly differing frequencies in diabetic and healthy intestinal flora. To our knowledge, it is the first time such short DNA fragments have been associated with T2D. The automated, quantitative analysis of the frequencies of short nucleotide sequences seems to be more feasible than accurate phylogenetic and functional analysis, and thus it might be a promising direction of diagnostic research. © 2016 Elsevier Inc.
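
As a rough illustration of what counting 9-mer frequencies involves, the sketch below slides a window of length k over each read and normalizes the counts. The function names are hypothetical, and the ranking by raw frequency difference merely stands in for a proper statistical test, which the abstract does not specify.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def kmer_frequencies(reads, k=9):
    """Relative frequency of every k-mer seen in the reads (sliding window)."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip windows containing ambiguous bases such as N.
            if set(kmer) <= VALID_BASES:
                counts[kmer] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def largest_differences(freq_a, freq_b, top=10):
    """Rank k-mers by absolute frequency difference between two samples.

    This is only a descriptive ranking; a real analysis would need a
    significance test and a multiple-testing correction.
    """
    kmers = set(freq_a) | set(freq_b)
    diffs = [(kmer, freq_a.get(kmer, 0.0) - freq_b.get(kmer, 0.0)) for kmer in kmers]
    diffs.sort(key=lambda item: abs(item[1]), reverse=True)
    return diffs[:top]


# Two tiny hypothetical samples; real metagenomes contain millions of reads.
sample_a = ["ACGTACGTACGTAAAC", "TTTTACGTACGTACGG"]
sample_b = ["GGGGCCCCAAAATTTT", "ACGTACGTAGGGGCCC"]
for kmer, diff in largest_differences(kmer_frequencies(sample_a),
                                      kmer_frequencies(sample_b), top=3):
    print(kmer, round(diff, 3))
```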

Tothmeresz L.,H+ Technology | Grolmusz V.,H+ Technology | Grolmusz V.,Uratim Ltd.
Protein and Peptide Letters | Year: 2013

New methods for reliable quantitative analysis of biological network data are in high demand in today's bioinformatics and systems biology. Here we demonstrate the applicability of co-citation, a measure developed earlier for the analysis of scientific literature, to finding functionally similar nodes in protein-protein interaction networks in several model organisms. We demonstrate the power of our approach in a novel way: the predicted closely related enzymes are compared by the closeness of their enzyme commission (EC) numbers, so we can numerically evaluate our prediction method. We have found a clear correspondence between related enzymatic functions and high co-citation of proteins in interaction networks. © 2013 Bentham Science Publishers.
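
One common reading of co-citation on an undirected interaction network is the number of interaction partners two proteins share, by analogy with two papers being cited together. The sketch below assumes that definition, which may differ in detail from the one used in the paper; the protein names are placeholders.

```python
from itertools import combinations


def cocitation_scores(edges):
    """Count shared neighbours for every pair of nodes in an undirected graph.

    Assumption: co-citation(u, v) = number of nodes adjacent to both u and v.
    Pairs with no shared neighbour are omitted from the result.
    """
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    scores = {}
    for u, v in combinations(sorted(neighbours), 2):
        shared = len(neighbours[u] & neighbours[v])
        if shared:
            scores[(u, v)] = shared
    return scores


# Hypothetical toy interaction network.
toy_edges = [("A", "X"), ("B", "X"), ("A", "Y"), ("B", "Y"), ("C", "Y")]
for pair, score in sorted(cocitation_scores(toy_edges).items(),
                          key=lambda item: -item[1]):
    print(pair, score)
```

Here proteins A and B share both of their partners, so they receive the highest score and would be predicted to be functionally similar under this assumption.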

Kerepesi C.,Eötvös Loránd University | Grolmusz V.,Eötvös Loránd University | Grolmusz V.,Uratim Ltd.
Archives of Virology | Year: 2016

The Kutch Desert (Great Rann of Kutch, Gujarat, India) is a unique ecosystem: in the larger part of the year it is a hot, salty desert that is flooded regularly in the Indian monsoon season. In the dry season, the crystallized salt deposits form the “white desert” in large regions. The first metagenomic analysis of the soil samples of Kutch was published in 2013, and the data were deposited in the NCBI Sequence Read Archive. At the same time, the sequences were analyzed phylogenetically for prokaryotes, especially for bacteria. In the present work, we identified DNA sequences of recently discovered giant viruses in the soil samples from the Kutch Desert. Since most giant viruses have been discovered in biofilms in industrial cooling towers, ocean water, and freshwater ponds, we were surprised to find their DNA sequences in soil samples from a seasonally very hot and arid, salty environment. © 2015, Springer-Verlag Wien.

Grolmusz V.,Eötvös Loránd University | Grolmusz V.,Uratim Ltd.
Information Processing Letters | Year: 2015

The PageRank is a widely used scoring function of networks in general and of the World Wide Web graph in particular. The PageRank is defined for directed graphs, but in some special cases applications for undirected graphs occur. In the literature it is widely - but not exclusively - noted that the PageRank for undirected graphs is proportional to the degrees of the vertices of the graph. We prove that statement for a particular personalization vector in the definition of the PageRank, and we also show that in general, the PageRank of an undirected graph is not exactly proportional to the degree distribution of the graph: our main theorem gives an upper and a lower bound to the l1 norm of the difference of the PageRank and the degree distribution vectors. A necessary and sufficient condition is also given for the PageRank for being proportional to the degree. © 2015 Elsevier B.V. All rights reserved.
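
The dependence on the personalization vector can be checked numerically on a small example. The sketch below is a plain power iteration, not the paper's construction. It assumes a damping factor of 0.85 and compares a uniform personalization vector with one proportional to the degrees; the abstract does not say which vector the theorem uses, so the degree-proportional choice is an assumption that can be verified directly, since the degree distribution is stationary for the random walk on an undirected graph.

```python
def pagerank(adj, personalization=None, damping=0.85, iters=200):
    """Power-iteration PageRank on an adjacency-list graph.

    Assumes every node has at least one neighbour (no dangling nodes)
    and that the personalization vector sums to 1.
    """
    nodes = sorted(adj)
    pers = personalization or {v: 1.0 / len(nodes) for v in nodes}
    rank = dict(pers)
    for _ in range(iters):
        new = {v: (1 - damping) * pers[v] for v in nodes}
        for u in nodes:
            share = damping * rank[u] / len(adj[u])
            for v in adj[u]:
                new[v] += share
        rank = new
    return rank


# Small undirected graph given as symmetric adjacency lists (toy example).
graph = {"a": ["b"], "b": ["a", "c", "d"], "c": ["b", "d"], "d": ["b", "c"]}
total_degree = sum(len(nbrs) for nbrs in graph.values())
degree_share = {v: len(nbrs) / total_degree for v, nbrs in graph.items()}

uniform_rank = pagerank(graph)
degree_rank = pagerank(graph, personalization=degree_share)

for v in sorted(graph):
    print(v, round(degree_share[v], 4), round(uniform_rank[v], 4),
          round(degree_rank[v], 4))
```

With the degree-proportional vector the PageRank coincides with the degree distribution, while the uniform vector gives values that are close to it but not equal.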

Szalkai B.,Eötvös Loránd University | Grolmusz V.,Eötvös Loránd University | Grolmusz V.,Uratim Ltd.
Biochimica et Biophysica Acta - General Subjects | Year: 2016

Background: Metagenomic analysis of environmental and clinical samples is gaining considerable importance in today's literature. Changes in the composition of the intestinal microbial communities, relative to the healthy control, are reported in numerous conditions. Methods: We have carefully analyzed the frequencies of the short nucleotide sequences in the metagenomes of two different enterotypes, namely of Chinese and European origins. Results: We have identified 255 nucleotide sequences of length up to 9, such that their frequencies significantly differ in the two enterotypes examined. Conclusions: We have demonstrated that short nucleotide sequences are capable of differentiating enterotypes, and not only metagenomes originating from healthy and diseased subjects. General significance: Our results may imply that the frequency differences of certain short nucleotide sequences have diagnostic value if properly applied for different clusters of metagenomes. © 2016 Elsevier B.V.

Kerepesi C.,Eötvös Loránd University | Banky D.,Eötvös Loránd University | Banky D.,Uratim Ltd. | Grolmusz V.,Eötvös Loránd University | Grolmusz V.,Uratim Ltd.
Gene | Year: 2014

Motivation: Metagenomics went through an astonishing development in the past few years. Today not only gene sequencing experts, but numerous laboratories of other specializations need to analyze DNA sequences gained from clinical or environmental samples. Phylogenetic analysis of the metagenomic data presents significant challenges for the biologist and the bioinformatician. The program suite AMPHORA and its workflow version are examples of publicly available software that yields reliable phylogenetic results for metagenomic data. Results: Here we present AmphoraNet, an easy-to-use webserver that is capable of assigning a probability-weighted taxonomic group for each phylogenetic marker gene found in the input metagenomic sample; the webserver is based on the AMPHORA2 workflow. Since a large proportion of molecular biologists uses the BLAST program and its clones on public webservers instead of the locally installed versions, we believe that the occasional user may find it comfortable that, in this version, no time-consuming installation of every component of the AMPHORA2 suite or expertise in Linux environment is required. Availability: The webserver is freely available at http://amphoranet.pitgroup.org; no registration is required. © 2013 Elsevier B.V.

Banky D.,Eötvös Loránd University | Banky D.,Uratim Ltd. | Ivan G.,Eötvös Loránd University | Ivan G.,Uratim Ltd. | And 2 more authors.
PLoS ONE | Year: 2013

Biological network data, such as metabolic, signaling or physical interaction graphs of proteins, are increasingly available in public repositories for important species. Tools for the quantitative analysis of these networks are being developed today. Protein network-based drug target identification methods usually return protein hubs with large degrees in the networks as potentially important targets. Some known, important protein targets, however, are not hubs at all, and perturbing protein hubs in these networks may have several unwanted physiological effects, due to their interaction with numerous partners. Here, we show a novel method applicable in networks with directed edges (such as metabolic networks) that compensates for the low degree (non-hub) vertices in the network, and identifies important nodes, regardless of their hub properties. Our method computes the PageRank for the nodes of the network, and divides the PageRank by the in-degree (i.e., the number of incoming edges) of the node. This quotient is the same in all nodes in an undirected graph (even for large- and low-degree nodes, that is, for hubs and non-hubs as well), but may differ significantly from node to node in directed graphs. We suggest assigning importance to non-hub nodes with a large PageRank/in-degree quotient. Consequently, our method gives high scores to nodes with large PageRank, relative to their degrees: therefore non-hub important nodes can easily be identified in large networks. We demonstrate that these relatively high PageRank scores have biological relevance: the method correctly finds numerous already validated drug targets in distinct organisms (Mycobacterium tuberculosis, Plasmodium falciparum and MRSA Staphylococcus aureus), and consequently, it may suggest new possible protein targets as well. Additionally, our scoring method was not chosen arbitrarily: its value for all nodes of all undirected graphs is constant; therefore its high value captures importance in the directed edge structure of the graph. © 2013 Bánky et al.
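
The PageRank/in-degree quotient described in this abstract can be sketched as follows. This is an illustrative implementation only: the damping factor of 0.85, the uniform redistribution of rank from nodes without outgoing edges, and the choice to leave nodes with in-degree 0 unscored are assumptions, and the node names are placeholders.

```python
def pagerank_directed(out_edges, damping=0.85, iters=200):
    """Power-iteration PageRank on a directed graph {node: [targets]}.

    Assumption: rank held by nodes without outgoing edges is spread
    uniformly over all nodes, as is the uniform teleportation term.
    """
    nodes = sorted(set(out_edges) | {v for ts in out_edges.values() for v in ts})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        dangling = sum(rank[u] for u in nodes if not out_edges.get(u))
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for u in nodes:
            targets = out_edges.get(u, [])
            for v in targets:
                new[v] += damping * rank[u] / len(targets)
        rank = new
    return rank


def pagerank_per_indegree(out_edges, **kwargs):
    """Divide each node's PageRank by its in-degree.

    Nodes with in-degree 0 are omitted because the quotient is undefined.
    """
    rank = pagerank_directed(out_edges, **kwargs)
    indegree = {v: 0 for v in rank}
    for targets in out_edges.values():
        for v in targets:
            indegree[v] += 1
    return {v: rank[v] / indegree[v] for v in rank if indegree[v] > 0}


# Toy directed network: "hub" has three incoming edges, "t" has only one.
toy = {"s1": ["hub"], "s2": ["hub"], "s3": ["hub"],
       "hub": ["t"], "t": ["s1", "s2", "s3"]}
scores = pagerank_per_indegree(toy)
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 4))
```

In this toy network the hub has the largest PageRank, but the non-hub node `t`, which receives all of the hub's rank through a single edge, gets the largest quotient.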

Ivan G.,Eötvös Loránd University | Ivan G.,Uratim Ltd. | Grolmusz V.,Eötvös Loránd University | Grolmusz V.,Uratim Ltd.
Bioinformatics | Year: 2011

Motivation: An enormous and constantly increasing quantity of biological information is represented in metabolic and protein interaction network databases. Most of these data are freely accessible through large public depositories. The robust analysis of these resources needs novel technologies, which are being developed today. Results: Here we demonstrate a technique, originating from the PageRank computation for the World Wide Web, for analyzing large interaction networks. The method is fast, scalable and robust, and its capabilities are demonstrated on metabolic network data of the tuberculosis bacterium and the proteomics analysis of the blood of melanoma patients. © The Author 2010. Published by Oxford University Press. All rights reserved.
