Thomee B.,Yahoo! Research |
Thomee B.,Leiden University |
Lew M.S.,Leiden University
International Journal of Multimedia Information Retrieval | Year: 2012
We are living in an Age of Information where the amount of accessible data from science and culture is almost limitless. However, this also means that finding an item of interest is increasingly difficult, a digital needle in the proverbial haystack. In this article, we focus on the topic of content-based image retrieval using interactive search techniques, i.e., how does one interactively find any kind of imagery from any source, regardless of whether it is photographic, MRI or X-ray? We highlight trends and ideas from over 170 recent research papers aiming to capture the wide spectrum of paradigms and methods in interactive search, including its subarea relevance feedback. Furthermore, we identify promising research directions and several grand challenges for the future. © 2012, The Author(s).
Francoisse K.,Catholic University of Louvain |
Kivimaki I.,Catholic University of Louvain |
Kivimaki I.,Aalto University |
Mantrach A.,Yahoo! Research |
And 3 more authors.
Neural Networks | Year: 2017
This work develops a generic framework, called the bag-of-paths (BoP), for link and network data analysis. The central idea is to assign a probability distribution on the set of all paths in a network. More precisely, a Gibbs–Boltzmann distribution is defined over a bag of paths in a network, that is, on a representation that considers all paths independently. We show that, under this distribution, the probability of drawing a path connecting two nodes can easily be computed in closed form by simple matrix inversion. This probability captures a notion of relatedness, or more precisely accessibility, between nodes of the graph: two nodes are considered as highly related when they are connected by many, preferably low-cost, paths. As an application, two families of distances between nodes are derived from the BoP probabilities. Interestingly, the second distance family interpolates between the shortest-path distance and the commute-cost distance. In addition, it extends the Bellman–Ford formula for computing the shortest-path distance in order to integrate sub-optimal paths (exploration) by simply replacing the minimum operator by the soft minimum operator. Experimental results on semi-supervised classification tasks show that both of the new distance families are competitive with other state-of-the-art approaches. In addition to the distance measures studied in this paper, the bag-of-paths framework enables straightforward computation of many other relevant network measures. © 2017 Elsevier Ltd
Daldal R.,Sabanci University |
Gamzu I.,Yahoo! Research |
Segev D.,Haifa University |
Unluyurt T.,Sabanci University
Naval Research Logistics | Year: 2016
We introduce and study a generalization of the classic sequential testing problem, asking to identify the correct state of a given series system that consists of independent stochastic components. In this setting, costly tests are required to examine the state of individual components, which are sequentially tested until the correct system state can be uniquely identified. The goal is to propose a policy that minimizes the expected testing cost, given a-priori probabilistic information on the stochastic nature of each individual component. Unlike the classic setting, where variables are tested one after the other, we allow multiple tests to be conducted simultaneously, at the expense of incurring an additional set-up cost. The main contribution of this article consists in showing that the batch testing problem can be approximated in polynomial time within factor 6.829 + ϵ, for any fixed ϵ ∈ (0,1). In addition, we explain how, in spite of its highly nonlinear objective function, the batch testing problem can be formulated as an approximate integer program of polynomial size, while blowing up its expected cost by a factor of at most 1 + ϵ. Finally, we conduct extensive computational experiments, to demonstrate the practical effectiveness of these algorithms as well as to evaluate their limitations. © 2016 Wiley Periodicals, Inc. Naval Research Logistics 63: 275–286, 2016. © 2016 Wiley Periodicals, Inc.
Giannopoulos G.,IMIS Institute |
Koniaris M.,National Technical University of Athens |
Weber I.,Qatar Computing Research Institute |
Jaimes A.,Yahoo! Research |
Sellis T.,RMIT University
Journal of Intelligent Information Systems | Year: 2014
In this paper, we introduce an approach for diversifying user comments on news articles. We claim that, although content diversity suffices for the keyword search setting, as proven by existing work on search result diversification, it is not enough when it comes to diversifying comments of news articles. Thus, in our proposed framework, we define comment-specific diversification criteria in order to extract the respective diversification dimensions in the form of feature vectors. These criteria involve content similarity, sentiment expressed within comments, named entities, quality of comments and combinations of them. Then, we apply diversification on comments, utilizing the extracted features vectors. The outcome of this process is a subset of the initial set that contains heterogeneous comments, representing different aspects of the news article, different sentiments expressed, different writing quality, etc. We perform an experimental analysis showing that the diversity criteria we introduce result in distinctively diverse subsets of comments, as opposed to the baseline of diversifying comments only w.r.t. to their content. We also present a prototype system that implements our diversification framework on news articles comments. © 2014, Springer Science+Business Media New York.
Gullo F.,Yahoo! Research |
Domeniconi C.,George Mason University |
Tagarelli A.,University of Calabria
Machine Learning | Year: 2013
The Projective Clustering Ensemble (PCE) problem is a recent clustering advance aimed at combining the two powerful tools of clustering ensembles and projective clustering. PCE has been formalized as either a two-objective or a single-objective optimization problem. Two-objective PCE has been recognized as more accurate than its single-objective counterpart, although it is unable to jointly handle the object-based and feature-based cluster representations.In this paper, we push forward the current PCE research, aiming to overcome the limitations of all existing PCE formulations. We propose a novel single-objective PCE formulation so that (i) the object-based and feature-based cluster representations are jointly considered, and (ii) the resulting optimization strategy follows a metacluster-based methodology borrowed from traditional clustering ensembles. As a result, the proposed formulation features best suitability to the PCE problem, thus guaranteeing improved effectiveness. Experiments on benchmark datasets have shown how the proposed approach achieves better average accuracy than all existing PCE methods, as well as efficiency superior to the most accurate existing metacluster-based PCE method on larger datasets. © 2013, The Author(s).
Gamzu I.,Yahoo! Research |
Medina M.,Tel Aviv University
Algorithmica | Year: 2016
An instance of the maximum mixed graph orientation problem consists of a mixed graph and a collection of source-target vertex pairs. The objective is to orient the undirected edges of the graph so as to maximize the number of pairs that admit a directed source-target path. This problem has recently arisen in the study of biological networks, and it also has applications in communication networks. In this paper, we identify an interesting local-to-global orientation property. This property enables us to modify the best known algorithms for maximum mixed graph orientation and some of its special structured instances, due to Elberfeld et al. (Theor. Comput. Sci. 483:96–103, 2013), and obtain improved approximation ratios. We further proceed by developing an algorithm that achieves an even better approximation guarantee for the general setting of the problem. Finally, we study several well-motivated variants of this orientation problem. © 2014, Springer Science+Business Media New York.
Singh G.,University College London |
Mantrach A.,Yahoo! Research |
Silvestri F.,Yahoo! Research
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) | Year: 2016
The majority of online users do not engage highly with services that are offered via Web. This is a well-known fact and it is also one of the main issues that personalization algorithms try to overcome. A popular way of personalizing an online service is to record users’ actions into user profiles. Weakly-engaged users lead to sparsely populated user profiles, or weak profiles as we name them. Such weak profiles constitute a source of potential increase in user engagement and as a consequence, windfall profits for Internet companies. In this paper, we define the novel problem of enhancing weak profiles in positive space and propose an effective solution based on learning collective embedding space in order to capture a low-dimensional manifold designed to specifically reconstruct sparse user profiles. Our method consistently outperforms baselines consisting of kNN and collective factorization without constraints on user profile. Experiments on two datasets, news and video, from a popular online portal show improvements of up to more than 100% in terms of MAP for extremely weak profiles, and up to around 10% for moderately weak profiles. In order to evaluate the impact of our method on learned latent space embeddings for users and items, we generate recommendations exploiting our user profile constrained approach. The generated recommendations outperform state-of-the-art techniques based on low-rank collective matrix factorization in particular for users that clicked at most four times (78–82% of the total) on the items published by the online portal we consider. © Springer International Publishing Switzerland 2016.
Sadeghi S.,RMIT University |
Blanco R.,Yahoo! Research |
Mika P.,Yahoo! Research |
Sanderson M.,RMIT University |
And 2 more authors.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) | Year: 2015
In this study, we address the problem of identifying if users are attempting to re-find information and estimating the level of difficulty of the refinding task. We propose to consider the task information (e.g. multiple queries and click information) rather than only queries. Our resultant prediction models are shown to be significantly more accurate (by 2%) than the current state of the art. While past research assumes that previous search history of the user is available to the prediction model, we examine if re-finding detection is possible without access to this information. Our evaluation indicates that such detection is possible, but more challenging. We further describe the first predictive model in detecting re-finding difficulty, showing it to be significantly better than existing approaches for detecting general search difficulty. © Springer International Publishing Switzerland 2015.
PubMed | Yahoo! Research
Type: Journal Article | Journal: Behavior research methods | Year: 2012
Amazons Mechanical Turk is an online labor market where requesters post jobs and workers choose which jobs to do for pay. The central purpose of this article is to demonstrate how to use this Web site for conducting behavioral research and to lower the barrier to entry for researchers who could benefit from this platform. We describe general techniques that apply to a variety of types of research and experiments across disciplines. We begin by discussing some of the advantages of doing experiments on Mechanical Turk, such as easy access to a large, stable, and diverse subject pool, the low cost of doing experiments, and faster iteration between developing theory and executing experiments. While other methods of conducting behavioral research may be comparable to or even better than Mechanical Turk on one or more of the axes outlined above, we will show that when taken as a whole Mechanical Turk can be a useful tool for many researchers. We will discuss how the behavior of workers compares with that of experts and laboratory subjects. Then we will illustrate the mechanics of putting a task on Mechanical Turk, including recruiting subjects, executing the task, and reviewing the work that was submitted. We also provide solutions to common problems that a researcher might face when executing their research on this platform, including techniques for conducting synchronous experiments, methods for ensuring high-quality work, how to keep data private, and how to maintain code security.
PubMed | Yahoo! Research
Type: Journal Article | Journal: PloS one | Year: 2012
We live in a computerized and networked society where many of our actions leave a digital trace and affect other peoples actions. This has lead to the emergence of a new data-driven research field: mathematical methods of computer science, statistical physics and sociometry provide insights on a wide range of disciplines ranging from social science to human mobility. A recent important discovery is that search engine traffic (i.e., the number of requests submitted by users to search engines on the www) can be used to track and, in some cases, to anticipate the dynamics of social phenomena. Successful examples include unemployment levels, car and home sales, and epidemics spreading. Few recent works applied this approach to stock prices and market sentiment. However, it remains unclear if trends in financial markets can be anticipated by the collective wisdom of on-line users on the web. Here we show that daily trading volumes of stocks traded in NASDAQ-100 are correlated with daily volumes of queries related to the same stocks. In particular, query volumes anticipate in many cases peaks of trading by one day or more. Our analysis is carried out on a unique dataset of queries, submitted to an important web search engine, which enable us to investigate also the user behavior. We show that the query volume dynamics emerges from the collective but seemingly uncoordinated activity of many users. These findings contribute to the debate on the identification of early warnings of financial systemic risk, based on the activity of users of the www.