Chen S., Turn Inc.
Proceedings of the VLDB Endowment | Year: 2010
Large-scale data analysis has become increasingly important for many enterprises. Recently, a new distributed computing paradigm called MapReduce, together with its open source implementation Hadoop, has been widely adopted for its impressive scalability and its flexibility in handling structured as well as unstructured data. In this paper, we describe our data warehouse system, Cheetah, built on top of MapReduce. Cheetah is designed specifically for our online advertising application, which allows various simplifications and custom optimizations. First, we take a fresh look at data warehouse schema design. In particular, we define a virtual view on top of the common star or snowflake data warehouse schema. This virtual-view abstraction not only allows us to design a SQL-like but much more succinct query language, but also makes it easier to support many advanced query processing features. Next, we describe a stack of optimization techniques ranging from data compression and access methods to multi-query optimization and the exploitation of materialized views. In fact, each commodity-hardware node in our cluster is able to process raw data at 1 GByte/s. Lastly, we show how to seamlessly integrate Cheetah into ad-hoc MapReduce jobs, allowing MapReduce developers to fully leverage the power of both MapReduce and data warehouse technologies. © 2010 VLDB Endowment.
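The virtual-view idea above can be illustrated with a minimal sketch: dimension keys in a fact table are resolved through dimension tables at query time, so a query can reference dimension attributes as if they were columns of one flat view. All table names, column names, and data below are hypothetical; Cheetah's actual query language and storage format are not shown here.

```python
# Hypothetical star schema: an impressions fact table whose campaign_id
# resolves through a campaign dimension table.
fact_rows = [
    {"campaign_id": 1, "impressions": 100},
    {"campaign_id": 2, "impressions": 250},
    {"campaign_id": 1, "impressions": 50},
]
campaign_dim = {
    1: {"advertiser": "acme"},
    2: {"advertiser": "globex"},
}

def virtual_view(facts, dim):
    """Yield fact rows flattened with their dimension attributes,
    simulating a virtual view over the star schema."""
    for row in facts:
        flat = dict(row)
        flat.update(dim[row["campaign_id"]])
        yield flat

# Equivalent of "SELECT advertiser, SUM(impressions) GROUP BY advertiser"
# phrased directly against the virtual view:
totals = {}
for r in virtual_view(fact_rows, campaign_dim):
    totals[r["advertiser"]] = totals.get(r["advertiser"], 0) + r["impressions"]

print(totals)  # {'acme': 150, 'globex': 250}
```

Because the join to the dimension table is hidden inside the view, the query itself never has to spell it out, which is one way a SQL-like language over such a view can stay succinct.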
Elmeleegy K.,Turn Inc.
Proceedings of the VLDB Endowment | Year: 2013
Cluster computing has emerged as a key parallel processing platform for large-scale data, and all major internet companies use it as their central processing platform. One of cluster computing's most popular examples is MapReduce and its open source implementation Hadoop. These systems were originally designed for batch and massive-scale computations. Interestingly, over time their production workloads have evolved into a mix of a small fraction of large, long-running jobs and a much bigger fraction of short jobs. This came about because these systems end up being used as data warehouses, which store most of the data sets and attract ad-hoc, short, data-mining queries. Moreover, the availability of higher-level query languages that operate on top of these cluster systems has contributed to the proliferation of such ad-hoc queries. Since existing systems were not designed for short, latency-sensitive jobs, short interactive jobs suffer from poor response times. In this paper, we present Piranha, a system for optimizing short jobs on Hadoop without affecting the larger jobs. It runs on existing, unmodified Hadoop clusters, facilitating its adoption. Piranha exploits characteristics of short jobs learned from production workloads at Yahoo! clusters to reduce the latency of such jobs. To demonstrate Piranha's effectiveness, we evaluated its performance using three realistic short queries. Piranha was able to reduce the queries' response times by up to 71%. © 2013 VLDB Endowment.
Turn Inc. | Date: 2014-02-20
At a marketing platform, an aggregated profile device graph is provided that associates different unique aggregated profile identifiers with different sets of related devices and their associated user identifiers. At least a portion of the aggregated profile device graph also associates user profile data and activity data with each device's user identifier and corresponding aggregated profile identifier. In response to the marketing platform obtaining a report request for a performance metric for a particular user segment, the performance metric is determined from the activity data of specific ones of the aggregated profile identifiers and their associated sets of related devices. Each of these specific aggregated profile identifiers and its associated set of related devices is associated with user profile data that includes the particular user segment. A report on the performance metric for the particular user segment is then provided, for example, to the report requester.
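The computation described in the abstract can be sketched as follows, with entirely hypothetical identifiers, segments, and activity counts: each aggregated profile id maps to a set of device/user identifiers, and a segment-level metric is summed over the devices of every aggregated profile whose profile data contains the requested segment.

```python
# Hypothetical aggregated-profile device graph and per-device profile data.
device_graph = {
    "agg-1": {"devices": ["u1", "u2"]},
    "agg-2": {"devices": ["u3"]},
}
profiles = {
    "u1": {"segments": {"sports"}, "clicks": 3},
    "u2": {"segments": {"sports"}, "clicks": 1},
    "u3": {"segments": {"news"}, "clicks": 7},
}

def segment_metric(segment):
    """Sum the clicks metric over all related devices of every aggregated
    profile whose user profile data includes the requested segment."""
    total = 0
    for agg in device_graph.values():
        devices = agg["devices"]
        if any(segment in profiles[d]["segments"] for d in devices):
            total += sum(profiles[d]["clicks"] for d in devices)
    return total

print(segment_metric("sports"))  # 4
```

The key point the graph enables is that activity from all of a person's related devices is counted together under one aggregated profile, rather than per device.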
Turn Inc. | Date: 2014-04-22
Systems, methods, and apparatus are disclosed herein for allocating a budget among sub-campaigns of an advertisement campaign. The methods may include retrieving data associated with a plurality of users. The data may include data points and action identifiers associated with each user of the plurality of users. Each data point may identify an interaction between a user and a sub-campaign. Each action identifier may include one or more data values identifying a user action. The methods may also include determining a plurality of performance metrics based on the retrieved data. A performance metric may be determined for each sub-campaign. The methods may further include determining a plurality of allocated budgets based on the plurality of performance metrics. An allocated budget may be determined for each sub-campaign. Moreover, each allocated budget may be a portion of a total budget associated with the advertisement campaign.
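The abstract does not fix a specific allocation formula; a proportional split by performance metric is one simple instance of the kind of allocation it describes. The sketch below assumes hypothetical sub-campaign names and uses conversion counts as the per-sub-campaign performance metric.

```python
def allocate_budget(total_budget, metrics):
    """Split total_budget across sub-campaigns in proportion to each
    sub-campaign's performance metric. A proportional rule is only an
    illustrative choice, not the patent's specific method."""
    metric_sum = sum(metrics.values())
    return {name: total_budget * m / metric_sum for name, m in metrics.items()}

# Hypothetical metrics: conversions attributed to each sub-campaign.
metrics = {"sub_a": 10, "sub_b": 30}
print(allocate_budget(1000.0, metrics))  # {'sub_a': 250.0, 'sub_b': 750.0}
```

Each allocated budget is, as the abstract requires, a portion of the total campaign budget, and better-performing sub-campaigns receive larger portions.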
Turn Inc. | Date: 2014-04-08
Methods and apparatus for finding similar on-line users for advertisement or content targeting are disclosed. In one embodiment, a plurality of user data sets associated with a plurality of user identifiers for a plurality of anonymous users are obtained, and each user data set of each user identifier specifies one or more user attributes and on-line user events that have occurred for such user identifier. For each attribute, a correlation to a success metric value is determined for a particular type of event or attribute that has occurred for the plurality of user identifiers that are each associated with such attribute. The plurality of user identifiers and associated data sets are then clustered into a plurality of user groups, each having similar data sets, by weighting based on each attribute's relative correlation to the success metric.
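The weighting step above can be sketched as computing, for each attribute, its correlation with the success metric across users, then scaling attribute vectors by those correlations before handing them to any standard clustering algorithm. All user data below is invented for illustration, and Pearson correlation is only one plausible choice of correlation measure.

```python
import math

# Hypothetical users: binary attribute vectors plus a success metric value.
users = {
    "u1": {"attrs": [1, 0], "success": 1.0},
    "u2": {"attrs": [1, 1], "success": 0.9},
    "u3": {"attrs": [0, 1], "success": 0.1},
}

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

n_attrs = 2
success = [u["success"] for u in users.values()]
# Weight each attribute by the strength of its correlation with success.
weights = [
    abs(pearson([u["attrs"][i] for u in users.values()], success))
    for i in range(n_attrs)
]
# Weighted vectors would then feed any off-the-shelf clustering algorithm,
# so attributes predictive of success dominate the similarity computation.
weighted = {k: [w * a for w, a in zip(weights, u["attrs"])]
            for k, u in users.items()}
```

Here attribute 0 tracks the success metric closely and so receives a larger weight than attribute 1, which is the effect the clustering-by-relative-correlation step relies on.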