
Zhu J., Northeastern University China | Ma M., Scientific Works
ACM Transactions on Speech and Language Processing

This article deals with pool-based active learning with uncertainty sampling. While existing uncertainty sampling methods emphasize the selection of instances near the decision boundary to increase the likelihood of selecting informative examples, our position is that this heuristic is a surrogate for selecting examples that the current iteration of the learning algorithm is likely to misclassify. To model this intuition more directly, this article augments such uncertainty sampling methods and proposes a simple instability-based selective sampling approach to improving uncertainty-based active learning, in which the instability degree of each unlabeled example is estimated during the learning process. Experiments on seven evaluation datasets show that instability-based sampling methods achieve significant improvements over the traditional uncertainty sampling method. In terms of the average percentage of actively selected examples required for the learner to reach 99% of the performance obtained by training on the entire dataset, the instability sampling and sampling-by-instability-and-density methods reduce annotation cost more effectively than random sampling and traditional entropy-based uncertainty sampling. Our experimental results also show that instability-based methods yield no significant improvement for active learning with SVMs when a popular sigmoid function is used to transform SVM outputs into posterior probabilities. © 2012 ACM.
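
As a rough illustration of the two selection strategies contrasted above, the following Python sketch shows entropy-based uncertainty sampling alongside a simple instability score that counts how often an example's predicted label changed across recent learning iterations. The function names and scoring details are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def entropy_uncertainty(probs):
    """Entropy of the model's predictive distribution for each unlabeled example.

    probs: (n_examples, n_classes) array of posterior probabilities.
    Higher entropy means lower confidence, i.e. closer to the decision boundary.
    """
    eps = 1e-12
    return -np.sum(probs * np.log(probs + eps), axis=1)

def instability_score(prediction_history):
    """A simple instability estimate: the fraction of successive active-learning
    iterations in which an example's predicted label changed.

    prediction_history: (n_iterations, n_examples) array of predicted labels.
    Examples whose labels keep flipping are likely misclassified by the current
    model and are therefore candidates for annotation.
    """
    changes = prediction_history[1:] != prediction_history[:-1]
    return changes.mean(axis=0)

def select_batch(scores, batch_size=10):
    """Pick the highest-scoring unlabeled examples for human annotation."""
    return np.argsort(scores)[::-1][:batch_size]
```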

Zhu J., Northeastern University China | Wang H., Northeastern University China | Zhu M., Northeastern University China | Tsou B.K., The Hong Kong Institute of Education | Ma M., Scientific Works
IEEE Transactions on Affective Computing

Opinion polling has traditionally been done via customer satisfaction studies, in which questions are carefully designed to gather customer opinions about target products or services. This paper studies aspect-based opinion polling from unlabeled, free-form textual customer reviews without requiring customers to answer any questions. First, a multi-aspect bootstrapping method is proposed to learn the aspect-related terms of each aspect, which are used for aspect identification. Second, an aspect-based segmentation model is proposed to segment a multi-aspect sentence into multiple single-aspect units that serve as the basic units for opinion polling. Finally, an aspect-based opinion polling algorithm is presented in detail. Experiments on real Chinese restaurant reviews demonstrate that our approach achieves 75.5 percent accuracy in aspect-based opinion polling tasks. The proposed opinion polling method does not require labeled training data. It is thus easy to implement and can be applied to other languages (e.g., English) or other domains such as product or movie reviews. © 2011 IEEE.
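
The pipeline described above can be sketched very roughly as follows: given aspect-related term lists (assumed here to have already been learned, e.g. by bootstrapping from seed words), each single-aspect segment is assigned to the aspect it overlaps most, and per-aspect opinions are tallied. The aspects, terms, and helper names below are illustrative assumptions, not the paper's method.

```python
# Hypothetical aspect-related term lists for a restaurant-review domain.
ASPECT_TERMS = {
    "food":    {"taste", "dish", "portion", "flavor", "menu"},
    "service": {"waiter", "staff", "friendly", "slow", "attentive"},
    "price":   {"price", "cheap", "expensive", "value", "bill"},
}

def identify_aspect(segment_tokens):
    """Assign a single-aspect segment to the aspect whose term list it overlaps
    most; return None if no aspect term is present."""
    tokens = set(t.lower() for t in segment_tokens)
    overlaps = {a: len(tokens & terms) for a, terms in ASPECT_TERMS.items()}
    best = max(overlaps, key=overlaps.get)
    return best if overlaps[best] > 0 else None

def poll_opinions(segments, sentiment):
    """Aggregate opinion polling results per aspect over single-aspect segments.

    segments: list of token lists; sentiment: any callable mapping a segment to
    +1 (positive) or -1 (negative), e.g. a lexicon- or classifier-based scorer.
    """
    tallies = {a: {"pos": 0, "neg": 0} for a in ASPECT_TERMS}
    for seg in segments:
        aspect = identify_aspect(seg)
        if aspect is None:
            continue
        key = "pos" if sentiment(seg) > 0 else "neg"
        tallies[aspect][key] += 1
    return tallies
```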

Zhu J., Northeastern University China | Wang H., Northeastern University China | Hovy E., University of Southern California | Ma M., Scientific Works
ACM Transactions on Speech and Language Processing

The labor-intensive task of labeling data is a serious bottleneck for many supervised learning approaches in natural language processing applications. Active learning aims to reduce the human labeling cost of supervised learning methods. Determining when to stop the active learning process is a very important practical issue in real-world applications. This article addresses the stopping criterion issue of active learning and presents four simple stopping criteria based on confidence estimation over the unlabeled data pool: the maximum uncertainty, overall uncertainty, selected accuracy, and minimum expected error methods. Further, to obtain a proper threshold for a stopping criterion in a specific task, this article presents a strategy that considers the label change factor to dynamically update the predefined threshold of a stopping criterion during the active learning process. To empirically analyze the effectiveness of each stopping criterion for active learning, we design several comparison experiments on seven real-world datasets for three representative natural language processing applications: word sense disambiguation, text classification, and opinion analysis. © 2010 ACM.
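
A minimal sketch of two of the confidence-based stopping criteria named above (maximum uncertainty and overall uncertainty), together with a simple threshold update driven by how many pool predictions changed since the last iteration. The thresholds, constants, and update rule are assumptions standing in for the paper's label-change strategy, not its exact formulation.

```python
import numpy as np

def max_uncertainty(probs, eps=1e-12):
    """Maximum-uncertainty criterion: entropy of the single most uncertain
    example remaining in the unlabeled pool."""
    entropies = -np.sum(probs * np.log(probs + eps), axis=1)
    return entropies.max()

def overall_uncertainty(probs, eps=1e-12):
    """Overall-uncertainty criterion: average entropy over the whole pool."""
    entropies = -np.sum(probs * np.log(probs + eps), axis=1)
    return entropies.mean()

def should_stop(probs, threshold, prev_labels=None, curr_labels=None, relax=0.1):
    """Stop when pool uncertainty falls below the threshold; optionally relax
    the threshold when few predicted labels changed since the last iteration
    (a stand-in for the label-change factor). Returns (stop, new_threshold)."""
    if prev_labels is not None and curr_labels is not None:
        change_rate = np.mean(np.asarray(prev_labels) != np.asarray(curr_labels))
        if change_rate < 0.01:           # predictions have nearly stabilized
            threshold *= (1.0 + relax)   # make the criterion easier to satisfy
    return overall_uncertainty(probs) <= threshold, threshold
```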

Zhu J., Northeastern University China | Zhang C., Northeastern University China | Ma M.Y., Scientific Works
IEEE Transactions on Affective Computing

This paper explores the problem of content-based rating inference from online opinion-based texts, which often express differing opinions on multiple aspects. To sufficiently capture information from various aspects, we propose an aspect-based segmentation algorithm that first segments a user review into multiple single-aspect textual parts, and an aspect-augmentation approach that generates an aspect-specific feature vector for each aspect for aspect-based rating inference. To tackle the problem of inconsistent rating annotation, we present a tolerance-based criterion to optimize training sample selection for parameter updating during the model training process. Finally, we present a collaborative rating inference model that explores meaningful correlations between ratings across a set of aspects of user opinions for multi-aspect rating inference. We compared our proposed methods with several other approaches, and experiments on real Chinese restaurant reviews demonstrate that our approaches achieve significant improvements over them. © 2010-2012 IEEE.
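
One way to picture the aspect-augmentation step described above is to tag each segment's features with its aspect so that a single joint feature space still keeps aspects distinguishable. The tagging scheme and example below are illustrative assumptions, not the paper's exact formulation.

```python
from collections import Counter

def aspect_augmented_features(segments):
    """segments: list of (aspect, tokens) pairs produced by aspect-based
    segmentation of one review. Returns a bag-of-words-style feature counter
    with aspect-prefixed keys, e.g. 'food|delicious'."""
    features = Counter()
    for aspect, tokens in segments:
        for tok in tokens:
            features[f"{aspect}|{tok.lower()}"] += 1
    return features

# Example: one review segmented into two single-aspect units.
review = [
    ("food",    ["The", "dumplings", "were", "delicious"]),
    ("service", ["but", "the", "waiter", "was", "slow"]),
]
print(aspect_augmented_features(review))
```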

Zhu J., Northeastern University China | Wang H., Northeastern University China | Tsou B.K., City University of Hong Kong | Ma M., Scientific Works
IEEE Transactions on Audio, Speech and Language Processing

To solve the knowledge bottleneck problem, active learning has been widely used for its ability to automatically select the most informative unlabeled examples for human annotation. One of the key enabling techniques of active learning is uncertainty sampling, which uses a single classifier to identify the unlabeled examples about which it is least confident. Uncertainty sampling often presents problems when outliers are selected. To solve the outlier problem, this paper presents two techniques: sampling by uncertainty and density (SUD) and density-based re-ranking. Both techniques prefer examples that are not only the most informative in terms of the uncertainty criterion but also the most representative in terms of the density criterion. Experimental results of active learning for word sense disambiguation and text classification tasks on six real-world evaluation datasets demonstrate the effectiveness of the proposed methods. © 2010 IEEE.
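
The idea of combining uncertainty with representativeness can be sketched as below: entropy measures informativeness, and the average cosine similarity of an example to its most similar pool neighbours serves as a density measure that down-weights outliers. This is a minimal sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

def entropy(probs, eps=1e-12):
    """Uncertainty of each unlabeled example under the current classifier."""
    return -np.sum(probs * np.log(probs + eps), axis=1)

def density(X, k=10):
    """Density of each example: average cosine similarity to its k most
    similar neighbours in the unlabeled pool. Outliers receive low density."""
    unit = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)          # exclude self-similarity
    topk = np.sort(sims, axis=1)[:, -k:]
    return topk.mean(axis=1)

def sud_scores(probs, X, k=10):
    """Sampling by uncertainty and density: prefer examples that are both
    uncertain and representative, which filters out uncertain outliers."""
    return entropy(probs) * density(X, k)
```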
