Shanghai Key Laboratory of Intelligent Information Processing

Shanghai, China


Pan Y.,Central South University | Liu D.,Central South University | Deng L.,Central South University | Deng L.,Shanghai Key Laboratory of Intelligent Information Processing
PLoS ONE | Year: 2017

Single amino acid variations (SAVs) potentially alter biological functions, including causing diseases or natural differences between individuals. Identifying the relationship between a SAV and a certain disease provides the starting point for understanding the underlying mechanisms of specific associations, and can help further the prevention and diagnosis of inherited disease. We propose PredSAV, a computational method that can effectively predict how likely SAVs are to be associated with disease by incorporating the gradient tree boosting (GTB) algorithm and optimally selected neighborhood features. A two-step feature selection approach is used to explore the most relevant and informative neighborhood properties that contribute to the prediction of disease association of SAVs across a wide range of sequence and structural features, especially some novel structural neighborhood features. In cross-validation experiments on the benchmark dataset, PredSAV achieves promising performance with an AUC score of 0.908 and a specificity of 0.838, which are significantly better than those of the other existing methods. Furthermore, we validate the capability of our proposed method by an independent test and gain a competitive advantage as a result. PredSAV, which combines gradient tree boosting with optimally selected neighborhood features, can return reliable predictions in distinguishing between disease-associated and neutral variants. Compared with existing methods, PredSAV shows improved specificity as well as increased overall performance. © 2017 Pan et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Liu B.,Harbin Institute of Technology | Liu B.,Shanghai Key Laboratory of Intelligent Information Processing | Liu B.,Gordon Life Science Institute | Zhang D.,Shenyang Aerospace University | And 6 more authors.
Bioinformatics | Year: 2014

Motivation: Owing to its importance in both basic research (such as molecular evolution and protein attribute prediction) and practical application (such as timely modeling the 3D structures of proteins targeted for drug development), protein remote homology detection has attracted a great deal of interest. It is intriguing to note that the profile-based approach is promising and holds high potential in this regard. To further improve protein remote homology detection, a key step is how to find an optimal means to extract the evolutionary information into the profiles. Results: Here, we propose a novel approach, the so-called profile-based protein representation, to extract the evolutionary information via the frequency profiles. The latter can be calculated from the multiple sequence alignments generated by PSI-BLAST. Three top performing sequence-based kernels (SVM-Ngram, SVM-pairwise and SVM-LA) were combined with the profile-based protein representation. Various tests were conducted on a SCOP benchmark dataset that contains 54 families and 23 superfamilies. The results showed that the new approach is promising, and can obviously improve the performance of the three kernels. Furthermore, our approach can also provide useful insights for studying the features of proteins in various families. It has not escaped our notice that the current approach can be easily combined with the existing sequence-based methods so as to improve their performance as well. Availability and implementation: For users' convenience, the source code for generating the profile-based proteins and the multiple kernel learning was also provided at main/∼binliu/remote/ Contact: or bliu@gordonlifescience.org Supplementary information: Supplementary data are available at Bioinformatics online. © 2013 The Author.

Ding Y.,Shanghai University | Ding Y.,Shanghai Key Laboratory of Intelligent Information Processing
IEEE Transactions on Information Theory | Year: 2016

Burst errors are a type of distortion in many data communications and data storage channels. In this paper, we consider the list decodability of codes for the single-burst error case and the phased-burst error case independently. First, we analyze the list decodability of random codes, and we show that the burst list decoding radius and the rate of random codes achieve the Singleton bound and the Gilbert-Varshamov bound for the single case and the phased case, respectively. Second, we illustrate that cyclic codes and algebraic geometry codes are good burst list-decodable codes for the single case and the phased case, respectively. © 2015 IEEE.
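
The distinction between the two error models can be illustrated with a small sketch. It assumes one common convention, which the abstract itself does not spell out: a single burst of length at most b is an error vector whose nonzero symbols fit inside a window of b consecutive positions, while a phased burst confines the nonzero symbols to one of the aligned blocks of length b. The function names are hypothetical.

```python
def support(error):
    """Indices of the nonzero symbols of an error vector."""
    return [i for i, symbol in enumerate(error) if symbol != 0]


def is_single_burst(error, b):
    """True if all nonzero symbols fit in some window of b consecutive positions.

    Assumption: non-cyclic bursts (the window does not wrap around).
    """
    s = support(error)
    return not s or (s[-1] - s[0] + 1) <= b


def is_phased_burst(error, b):
    """True if all nonzero symbols fall in one aligned block of length b.

    Assumption: blocks are [0, b), [b, 2b), ... along the codeword.
    """
    return len({i // b for i in support(error)}) <= 1


error = [0, 0, 1, 0, 1, 0, 0, 0, 0]
print(is_single_burst(error, 3))   # window covering positions 2..4
print(is_phased_burst(error, 3))   # positions 2 and 4 lie in different blocks
```

Under these assumed definitions, every phased burst is also a single burst of the same length, but not the other way round, as the example vector shows.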

Jin L.,Shanghai Key Laboratory of Intelligent Information Processing
IEEE Transactions on Information Theory | Year: 2014

It has been a great challenge to construct new quantum maximum-distance-separable (MDS) codes. In particular, it is very hard to construct the quantum MDS codes with relatively large minimum distance. So far, except for some sparse lengths, all known $q$-ary quantum MDS codes have minimum distance $\le q/2+1$. In this paper, we provide a construction of the quantum MDS codes with minimum distance $> q/2+1$. In particular, we show the existence of the $q$-ary quantum MDS codes with length $n = q^2+1$ and minimum distance $d$ for any $d \le q+1$ (this result extends those given in [10], [12], and [13]); and with length $(q^2+2)/3$ and minimum distance $d$ for any $d \le (2q+2)/3$ if $3 \mid (q+1)$. Our method is through Hermitian self-orthogonal codes. The main idea of constructing the Hermitian self-orthogonal codes is based on the solvability in $\mathbb{F}_q$ of a system of homogeneous equations over $\mathbb{F}_{q^2}$. © 2014 IEEE.
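
As a rough illustration of the parameter ranges stated above, the following sketch enumerates the code parameters [[n, k, d]] for a given q. It relies on the standard quantum Singleton bound, under which a quantum MDS code satisfies k = n - 2d + 2. The function name is hypothetical, the lower end d = 2 of each range is an assumption made for the example, and the code does not check that q is a prime power.

```python
def quantum_mds_parameters(q):
    """List (n, k, d) triples for the two families described in the abstract.

    Assumptions: q is a prime power (not checked here), d starts at 2,
    and k follows from the quantum Singleton bound with equality.
    """
    # Family 1: length q^2 + 1, any minimum distance d <= q + 1.
    n1 = q * q + 1
    first = [(n1, n1 - 2 * d + 2, d) for d in range(2, q + 2)]

    # Family 2: length (q^2 + 2) / 3, any d <= (2q + 2) / 3, when 3 divides q + 1.
    second = []
    if (q + 1) % 3 == 0:
        n2 = (q * q + 2) // 3
        d_max = (2 * q + 2) // 3
        second = [(n2, n2 - 2 * d + 2, d) for d in range(2, d_max + 1)]
    return first, second


first, second = quantum_mds_parameters(5)
# Keep only the distances that exceed the earlier barrier q/2 + 1.
print([p for p in first if p[2] > 5 / 2 + 1])
print([p for p in second if p[2] > 5 / 2 + 1])
```

For q = 5 the earlier barrier q/2 + 1 equals 3.5, so the filtered lists show only the parameter sets with d of at least 4.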

Xu M.,Shanghai Maritime University | Xu M.,Shanghai Key Laboratory of Intelligent Information Processing | Liu G.,Shanghai Maritime University
International Journal of Distributed Sensor Networks | Year: 2013

Low data delivery efficiency and high energy consumption are the inherent problems in Underwater Wireless Sensor Networks (UWSNs) characterized by the acoustic channels. Existing energy-efficient routing algorithms have been shown to reduce energy consumption of UWSNs to some extent, but still neglect the correlation existing in the local data of sensor nodes. In this paper, we present a Multi-population Firefly Algorithm (MFA) for correlated data routing in UWSNs. We design three kinds of fireflies and their coordination rules in order to improve the adaptability of building, selecting, and optimizing routing paths, considering the data correlation and the sampling rates of the various sensor nodes. Different groups of fireflies conduct their optimization during the evolution in order to improve the convergence speed and solution precision of the algorithm. Moreover, after the data packets are merged during the process of routing path finding, MFA can also eliminate redundant information before they are sent to the sink node, which in turn saves energy and bandwidth. Simulation results have shown that MFA achieves better performance than existing protocols in terms of packet delivery ratio, energy consumption, and network throughput. © 2013 Ming Xu and Guangzhong Liu.

Liu B.,Harbin Institute of Technology | Liu B.,Shanghai Key Laboratory of Intelligent Information Processing | Wang X.,Harbin Institute of Technology | Zou Q.,Xiamen University | And 2 more authors.
Molecular Informatics | Year: 2013

Protein remote homology detection is a key problem in bioinformatics. Currently, discriminative methods such as the Support Vector Machine (SVM) achieve the best performance. The most efficient approach to improve the performance of SVM-based methods is to find a general protein representation method that is able to convert proteins with different lengths into fixed-length vectors and captures the different properties of the proteins for the discrimination. The bottleneck of designing the protein representation method is that native proteins have different lengths. Motivated by the success of the pseudo amino acid composition (PseAAC) proposed by Chou, we applied this approach to protein remote homology detection. Some new indices derived from the amino acid index (AAIndex) database are incorporated into the PseAAC to improve the generalization ability of this method. Finally, the performance is further improved by combining the modified PseAAC with profile-based protein representation containing the evolutionary information extracted from the frequency profiles. Our experiments on a well-known benchmark show this method achieves superior or comparable performance with current state-of-the-art methods. © 2013 Wiley-VCH Verlag GmbH and Co.

Liu B.,Harbin Institute of Technology | Liu B.,Shanghai Key Laboratory of Intelligent Information Processing | Liu B.,Gordon Life Science Institute | Xu J.,Harbin Institute of Technology | And 6 more authors.
PLoS ONE | Year: 2014

Playing crucial roles in various cellular processes, such as recognition of specific nucleotide sequences, regulation of transcription, and regulation of gene expression, DNA-binding proteins are essential ingredients for both eukaryotic and prokaryotic proteomes. With the avalanche of protein sequences generated in the postgenomic age, it is a critical challenge to develop automated methods for accurately and rapidly identifying DNA-binding proteins based on their sequence information alone. Here, a novel predictor, called "iDNA-Prot|dis", was established by incorporating the amino acid distance-pair coupling information and the amino acid reduced alphabet profile into the general pseudo amino acid composition (PseAAC) vector. The former can capture the characteristics of DNA-binding proteins so as to enhance its prediction quality, while the latter can reduce the dimension of the PseAAC vector so as to speed up its prediction process. It was observed by the rigorous jackknife and independent dataset tests that the new predictor outperformed the existing predictors for the same purpose. As a user-friendly web-server, iDNA-Prot|dis is accessible to the public. Moreover, for the convenience of the vast majority of experimental scientists, a step-by-step protocol guide is provided on how to use the web-server to get their desired results without the need to follow the complicated mathematical equations that are presented in this paper just for the integrity of its developing process. It is anticipated that the iDNA-Prot|dis predictor may become a useful high throughput tool for large-scale analysis of DNA-binding proteins, or at the very least, play a complementary role to the existing predictors in this regard. © 2014 Liu et al.

Zeng J.,Soochow University of China | Zeng J.,Shanghai Key Laboratory of Intelligent Information Processing | Hannenhalli S.,University of Maryland University College
BMC Genomics | Year: 2013

Background: Gene duplication, followed by functional evolution of duplicate genes, is a primary engine of evolutionary innovation. In turn, gene expression evolution is a critical component of overall functional evolution of paralogs. Inferring evolutionary history of gene expression among paralogs is therefore a problem of considerable interest. It also represents significant challenges. The standard approaches of evolutionary reconstruction assume that at an internal node of the duplication tree, the two duplicates evolve independently. However, because of various selection pressures functional evolution of the two paralogs may be coupled. The coupling of paralog evolution corresponds to three major fates of gene duplicates: subfunctionalization (SF), conserved function (CF) or neofunctionalization (NF). Quantitative analysis of these fates is of great interest and clearly influences evolutionary inference of expression. These two interrelated problems of inferring gene expression and evolutionary fates of gene duplicates have not been studied together previously and motivate the present study. Results: Here we propose a novel probabilistic framework and algorithm to simultaneously infer (i) ancestral gene expression and (ii) the likely fate (SF, NF, CF) at each duplication event during the evolution of gene family. Using tissue-specific gene expression data, we develop a nonparametric belief propagation (NBP) algorithm to predict the ancestral expression level as a proxy for function, and describe a novel probabilistic model that relates the predicted and known expression levels to the possible evolutionary fates. We validate our model using simulation and then apply it to a genome-wide set of gene duplicates in human. Conclusions: Our results suggest that SF tends to be more frequent at the earlier stage of gene family expansion, while NF occurs more frequently later on. © 2013 Zeng and Hannenhalli.

Ding J.,Fudan University | Ding J.,Shanghai Key Laboratory of Intelligent Information Processing | Zhou S.,Fudan University | Zhou S.,Shanghai Key Laboratory of Intelligent Information Processing | Guan J.,Tongji University
BMC Bioinformatics | Year: 2011

Background: MicroRNAs (miRNAs) are ~22 nt long integral elements responsible for post-transcriptional control of gene expression. After the identification of thousands of miRNAs, the challenge is now to explore their specific biological functions. To this end, it will be greatly helpful to construct a reasonable organization of these miRNAs according to their homologous relationships. Given an established miRNA family system (e.g. the miRBase family organization), this paper addresses the problem of automatically and accurately classifying newly found miRNAs to their corresponding families by supervised learning techniques. Concretely, we propose an effective method, miRFam, which uses only primary information of pre-miRNAs or mature miRNAs and a multiclass SVM, to automatically classify miRNA genes. Results: An existing miRNA family system prepared by miRBase was downloaded online. We first employed n-grams to extract features from known precursor sequences, and then trained a multiclass SVM classifier to classify new miRNAs (i.e. those whose families are unknown). Compared with miRBase's sequence alignment and manual modification, our study shows that the application of machine learning techniques to miRNA family classification is a general and more effective approach. When the testing dataset contains more than 300 families (each of which holds no fewer than 5 members), the classification accuracy is around 98%. Even with the entire miRBase15 (1056 families, more than 650 of which hold fewer than 5 samples), the accuracy surprisingly reaches 90%. Conclusions: Based on experimental results, we argue that miRFam is suitable for application as an automated method of family classification, and it is an important supplementary tool to the existing alignment-based small non-coding RNA (sncRNA) classification methods, since it only requires primary sequence information. Availability: The source code of miRFam, written in C++, is freely and publicly available. © 2011 Ding et al; licensee BioMed Central Ltd.
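
The n-gram feature extraction step mentioned in the Results can be sketched as follows. This is only an illustrative sketch, not the miRFam implementation (which the abstract says is written in C++): the function name, the RNA alphabet "ACGU", the choice n = 2, and the frequency normalization are all assumptions, and the multiclass SVM training step is omitted.

```python
from collections import Counter
from itertools import product


def ngram_vector(sequence, n=2, alphabet="ACGU"):
    """Map a sequence of any length to a fixed-length vector of n-gram frequencies.

    Assumptions: overlapping n-grams, a four-letter RNA alphabet, and
    normalization by the number of n-grams found in the vocabulary.
    """
    # Count every overlapping window of length n.
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    # Fixed vocabulary order, so all sequences share the same feature layout.
    vocabulary = ["".join(p) for p in product(alphabet, repeat=n)]
    total = sum(counts[gram] for gram in vocabulary)
    return [counts[gram] / total if total else 0.0 for gram in vocabulary]


features = ngram_vector("ACGUAC")
print(len(features))   # 4**2 features for n = 2
print(features[1])     # frequency of the 2-gram "AC"
```

Because the vocabulary is fixed, sequences of different lengths map to vectors of the same dimension, which is what a downstream classifier such as an SVM requires.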

Lu H.,Fudan University | Wei H.,Fudan University | Wei H.,Shanghai Key Laboratory of Intelligent Information Processing
Physica A: Statistical Mechanics and its Applications | Year: 2012

Determining community structure in networks is fundamental to the analysis of the structural and functional properties of those networks, including social networks, computer networks, and biological networks. Modularity function Q, which was proposed by Newman and Girvan, was once the most widely used criterion for evaluating the partition of a network into communities. However, modularity Q is subject to a serious resolution limit. In this paper, we propose a new function for evaluating the partition of a network into communities. This is called community coefficient C. Using community coefficient C, we can automatically identify the ideal number of communities in the network, without any prior knowledge. We demonstrate that community coefficient C is superior to the modularity Q and does not have a resolution limit. We also compared the two widely used community structure partitioning methods, the hierarchical partitioning algorithm and the normalized cuts (Ncut) spectral partitioning algorithm. We tested these methods on computer-generated networks and real-world networks whose community structures were already known. The Ncut algorithm and community coefficient C were found to produce better results than hierarchical algorithms. Unlike several other community detection methods, the proposed method effectively partitioned the networks into different community structures and indicated the correct number of communities. © 2012 Elsevier B.V. All rights reserved.
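
For reference, the Newman-Girvan modularity Q that the abstract compares against can be computed as in the sketch below, using the standard form Q = sum over communities of (l_c / m - (d_c / 2m)^2), where m is the number of edges, l_c the number of edges inside community c, and d_c the total degree of its nodes. The function name and the toy graph are illustrative assumptions. The community coefficient C is not defined in the abstract, so it is not sketched here.

```python
def modularity(edges, community):
    """Newman-Girvan modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs; community: dict mapping node -> community label.
    """
    m = len(edges)
    internal = {}      # l_c: edges with both endpoints in community c
    degree_sum = {}    # d_c: sum of node degrees in community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


# Toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}
print(modularity(edges, two_groups))
print(modularity(edges, one_group))
```

Splitting the toy graph into its two triangles gives Q = 5/14, while placing every node in a single community gives Q = 0.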
