Hong Kong Bioinformatics Center

Hong Kong, Hong Kong


Yip K.Y.,Hong Kong Bioinformatics Center | Yip K.Y.,CUHK-BGI Innovation Institute of Trans-omics | Yip K.Y.,Chinese University of Hong Kong
Bioinformatics | Year: 2016

Motivation: The three-dimensional structure of genomes makes it possible for genomic regions not adjacent in the primary sequence to be spatially proximal. These DNA contacts have been found to be related to various molecular activities. Previous methods for analyzing DNA contact maps obtained from Hi-C experiments have largely focused on studying individual interactions, forming spatial clusters composed of contiguous blocks of genomic locations, or classifying these clusters into general categories based on some global properties of the contact maps. Results: Here, we describe a novel computational method that can flexibly identify small clusters of spatially proximal genomic regions based on their local contact patterns. Using simulated data that highly resemble Hi-C data obtained from real genome structures, we demonstrate that our method identifies spatial clusters that are more compact than those identified by methods previously used for clustering genomic regions based on DNA contact maps. The clusters identified by our method enable us to confirm functionally related genomic regions previously reported to be spatially proximal in different species. We further show that each genomic region can be assigned a numeric affinity value that indicates its degree of participation in each local cluster, and these affinity values correlate quantitatively with DNase I hypersensitivity, gene expression, super enhancer activities and replication timing in a cell type specific manner. We also show that these cluster affinity values can precisely define boundaries of reported topologically associating domains, and further define local sub-domains within each domain. © 2016 The Author 2016. Published by Oxford University Press.

Li J.-W.,Hong Kong Bioinformatics Center | Li J.-W.,Chinese University of Hong Kong | Wan R.,Hong Kong Bioinformatics Center | Yu C.-S.,Hong Kong Bioinformatics Center | And 3 more authors.
Bioinformatics | Year: 2013

Summary: Insertional mutagenesis from virus infection is an important pathogenic risk for the development of cancer. Despite the advent of high-throughput sequencing, discovery of viral integration sites and expressed viral fusion events is still limited. Here, we present ViralFusionSeq (VFS), which combines soft-clipping information, read-pair analysis and targeted de novo assembly to discover and annotate viral-human fusions. VFS was used in an RNA-Seq experiment, a simulated DNA-Seq experiment and a re-analysis of published DNA-Seq datasets. Our experiments demonstrated that VFS is both sensitive and highly accurate. © 2013 The Author 2013. Published by Oxford University Press.
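
The soft-clipping signal mentioned in the summary above can be illustrated with a small sketch. This is not the VFS implementation; it only assumes that alignments are reported with standard SAM CIGAR strings, in which the `S` operation marks read bases left unaligned at either end. The function name `soft_clips` is hypothetical. A read with a long soft-clipped end is a candidate for spanning a breakpoint, for example between a human and a viral sequence.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clips(cigar):
    """Return (left, right) soft-clipped lengths for a CIGAR string.

    Hypothetical helper for illustration; hard clips (H) are skipped
    because they may sit outside the soft clips at either end.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    if not ops:
        return (0, 0)
    left = ops[0][0] if ops[0][1] == "S" else 0
    # Require more than one operation so a lone "S" is not counted twice.
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return (left, right)


# Toy CIGAR strings, made up for this example.
examples = ["30S70M", "70M30S", "5H10S80M10S", "100M"]
clip_lengths = {c: soft_clips(c) for c in examples}
```

In practice one would likely apply a minimum clip length before treating a read as breakpoint evidence; the threshold is a design choice and is not taken from the summary.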

Chan T.-M.,Chinese University of Hong Kong | Leung K.-S.,Chinese University of Hong Kong | Lee K.-H.,Chinese University of Hong Kong | Wong M.-H.,Chinese University of Hong Kong | And 3 more authors.
Nucleic Acids Research | Year: 2012

In protein-DNA interactions, particularly transcription factor (TF) and transcription factor binding site (TFBS) bindings, associated residue variations form patterns denoted as subtypes. Subtypes may lead to changed binding preferences, distinguish conserved from flexible binding residues and reveal novel binding mechanisms. However, subtypes must be studied in the context of core bindings. While solving 3D structures would require huge experimental efforts, recent sequence-based associated TF-TFBS pattern discovery has been shown to be promising, upon which a large-scale subtype study is possible and desirable. In this article, we investigate residue-varying subtypes based on associated TF-TFBS patterns. By re-categorizing the patterns with respect to varying TF amino acids, statistically significant (P values ≤ 0.005) subtypes leading to varying TFBS patterns are discovered without using TF family or domain annotations. Resultant subtypes have various biological meanings. The subtypes reflect familial and functional properties and exhibit changed binding preferences supported by 3D structures. Conserved residues critical for maintaining TF-TFBS bindings are revealed by analyzing the subtypes. In-depth analysis on the subtype pair PKVVIL-CACGTG versus PKVEIL-CAGCTG shows the V/E variation is indicative for distinguishing Myc from MRF families. Discovered from sequences only, the TF-TFBS subtypes are informative and promising for more biological findings, complementing and extending recent one-sided subtype and familial studies with comprehensive evidence. © 2012 The Author(s).

Chan T.-M.,Chinese University of Hong Kong | Wong K.-C.,Chinese University of Hong Kong | Wong K.-C.,King Abdullah University of Science and Technology | Lee K.-H.,Chinese University of Hong Kong | And 5 more authors.
Bioinformatics | Year: 2011

Motivation: The bindings between transcription factors (TFs) and transcription factor binding sites (TFBSs) are fundamental protein-DNA interactions in transcriptional regulation. Extensive efforts have been made to better understand the protein-DNA interactions. Recent mining of exact TF-TFBS-associated sequence patterns (rules) has shown great potential and achieved very promising results. However, exact rules cannot handle variations in real data, resulting in limited informative rules. In this article, we generalize the exact rules to approximate ones for both TFs and TFBSs, which are essential for biological variations. Results: A progressive approach is proposed to address the approximation to alleviate the computational requirements. First, similar TFBSs are grouped from the available TF-TFBS data (TRANSFAC database). Second, approximate and highly conserved binding cores are discovered from TF sequences corresponding to each TFBS group. A customized algorithm is developed for the specific objective. We discover the approximate TF-TFBS rules by associating the grouped TFBS consensuses and TF cores. The rules discovered are evaluated by matching (verifying with) the actual protein-DNA binding pairs from Protein Data Bank (PDB) 3D structures. The approximate results exhibit many more verified rules and up to 300% better verification ratios than the exact ones. The customized algorithm achieves over 73% better verification ratios than traditional methods. Approximate rules (64-79%) are shown to be statistically significant. Detailed variation analysis and conservation verification on NCBI records demonstrate that the approximate rules reveal both the flexible and specific protein-DNA interactions accurately. The approximate TF-TFBS rules discovered show great generalized capability of exploring more informative binding rules. © The Author 2010. Published by Oxford University Press. All rights reserved.

Leung K.-S.,Chinese University of Hong Kong | Wong K.-C.,Chinese University of Hong Kong | Chan T.-M.,Chinese University of Hong Kong | Wong M.-H.,Chinese University of Hong Kong | And 4 more authors.
Nucleic Acids Research | Year: 2010

Protein-DNA bindings between transcription factors (TFs) and transcription factor binding sites (TFBSs) play an essential role in transcriptional regulation. Over the past decades, significant efforts have been made to study the principles for protein-DNA bindings. However, it is considered that there are no simple one-to-one rules between amino acids and nucleotides. Many methods impose complicated features beyond sequence patterns. Protein-DNA bindings are formed from associated amino acid and nucleotide sequence pairs, which determine many functional characteristics. Therefore, it is desirable to investigate associated sequence patterns between TFs and TFBSs. With increasing computational power, availability of massive experimental databases on DNA and proteins, and mature data mining techniques, we propose a framework to discover associated TF-TFBS binding sequence patterns in the most explicit and interpretable form from TRANSFAC. The framework is based on association rule mining with the Apriori algorithm. The patterns found are evaluated by quantitative measurements at several levels on TRANSFAC. With further independent verifications from the literature, Protein Data Bank and homology modeling, there is strong evidence that the patterns discovered reveal real TF-TFBS bindings across different TFs and TFBSs, which can drive further work toward a better understanding of TF-TFBS bindings. © The Author(s) 2010. Published by Oxford University Press.
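
As a rough illustration of the association rule mining named in the abstract above, the following is a minimal sketch of Apriori frequent-itemset mining. It is not the authors' framework. It assumes each record has already been reduced to a set of tokens, with hypothetical `aa:` tokens standing for short TF subsequences and `dna:` tokens for short TFBS subsequences; the toy records are made up and do not come from TRANSFAC.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items and keep those meeting min_support.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join frequent (k-1)-itemsets into k-itemset candidates and prune
        # any candidate that has an infrequent (k-1)-subset.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Count how many transactions contain each surviving candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Toy records (hypothetical tokens, for illustration only).
records = [
    {"aa:KVV", "dna:CACGTG"},
    {"aa:KVV", "dna:CACGTG"},
    {"aa:KVE", "dna:CAGCTG"},
    {"aa:KVV", "dna:CAGCTG"},
]
frequent = apriori(records, min_support=2)
```

With a minimum support of 2, the only frequent pair in these toy records is `aa:KVV` together with `dna:CACGTG`. A frequent itemset that mixes one `aa:` token and one `dna:` token is the kind of co-occurrence from which a TF-TFBS rule could then be derived.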

Tsui S.K.W.,Center for Microbial Genomics and Proteomics | Tsui S.K.W.,Hong Kong Bioinformatics Center | Fong N.-Y.,Center for Microbial Genomics and Proteomics | Li S.-K.,Center for Microbial Genomics and Proteomics | And 6 more authors.
AIDS Research and Human Retroviruses | Year: 2010

With considerable capacity for genetic diversification, new HIV-1 genotypes have been reported over the years. Three HIV-1 isolates previously genotyped as B using gag and env sequences were completely sequenced and reanalyzed. Several amino acid mutations were found in vif, rev, and nef genes but not in gag or env sequences. These alterations have not previously been reported in Hong Kong. The investigation of phylogenetic relatedness revealed that a region of the vif of the studied Hong Kong isolates subtype B cluster contains several subtype D signature amino acid residues. Several unique mutations on vif in these three isolates were also identified. © 2010, Mary Ann Liebert, Inc.
