Xu Y., Teradata |
Kostamaa P., Teradata |
Qi Y., Teradata |
Wen J., University of California at Riverside |
Zhao K.K., University of California at San Diego
Proceedings of the ACM SIGMOD International Conference on Management of Data | Year: 2011
One critical part of building and running a data warehouse is the ETL (Extraction, Transformation, Loading) process. In fact, the growing ETL tool market is already a multi-billion-dollar market. Getting data into data warehouses has been a hindering factor for wider potential database applications such as scientific computing, as discussed in recent panels at various database conferences. One particular problem with current load approaches to data warehouses is that while data are partitioned and replicated across all nodes in data warehouses powered by a parallel DBMS (PDBMS), load utilities typically reside on a single node, which raises the issues of i) data loss/data availability if the node or its hard drives crash; ii) the file size limit on a single node; iii) load performance. These issues are mostly handled manually, or only alleviated to some degree by tools. We observe that Hadoop and the Teradata Enterprise Data Warehouse (EDW) have one thing in common: data in both systems are partitioned across multiple nodes for parallel computing, which creates parallel loading opportunities not possible for DBMSs running on a single node. In this paper we describe our approach of using Hadoop as a distributed load strategy for Teradata EDW. We use Hadoop as an intermediate load server to store data to be loaded to Teradata EDW. We gain all the benefits of HDFS (Hadoop Distributed File System): i) significantly increased disk space for the file to be loaded; ii) once the data are written to HDFS, the data sources need not retain the data, even before the file is loaded to Teradata EDW; iii) MapReduce programs can be used to transform and add structure to unstructured or semi-structured data; iv) most importantly, since a file is distributed in HDFS, it can be loaded in parallel to Teradata EDW much more quickly, which is the main focus of this paper.
When both Hadoop and Teradata EDW coexist on the same hardware platform, as is increasingly required by customers because of reduced hardware and system administration costs, we have a further optimization opportunity: directly loading HDFS data blocks to Teradata parallel units on the same nodes. However, due to the inherent non-uniform data distribution in HDFS, we can rarely avoid transferring HDFS blocks to remote Teradata nodes. We designed a polynomial-time optimal algorithm and a polynomial-time approximate algorithm that assign HDFS blocks to Teradata parallel units evenly while minimizing network traffic. We performed experiments on synthetic and real data sets to compare the performance of the algorithms. © 2011 ACM.
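The block-assignment problem described in this abstract can be illustrated with a toy greedy heuristic: each HDFS block (replicated on a few nodes) is assigned to a co-located parallel unit, preferring a local replica holder, subject to an even-load cap. This is a minimal hypothetical sketch of the flavor of the problem, not the paper's optimal or approximate algorithm; all node names and the replica placement below are invented:

```python
from collections import Counter

def assign_blocks(blocks, nodes, cap):
    """Greedily assign each HDFS block to a node, preferring a node that
    already holds a local replica, subject to an even-load cap per node."""
    load = Counter()          # blocks assigned to each node so far
    remote = 0                # blocks that must travel over the network
    assignment = {}
    for block, replicas in blocks.items():
        # candidate local nodes: replica holders still under the cap
        local = [n for n in replicas if load[n] < cap]
        if local:
            target = min(local, key=lambda n: load[n])
        else:
            # no local replica has room -> remote transfer required
            target = min(nodes, key=lambda n: load[n])
            remote += 1
        assignment[block] = target
        load[target] += 1
    return assignment, remote

# 6 blocks, 3 nodes, replication factor 2; even load cap = 2 blocks/node.
# n1 holds a replica of every block (a skewed, non-uniform HDFS layout).
blocks = {
    "b1": ["n1", "n2"], "b2": ["n1", "n2"], "b3": ["n1", "n3"],
    "b4": ["n1", "n3"], "b5": ["n1", "n2"], "b6": ["n1", "n2"],
}
assignment, remote = assign_blocks(blocks, ["n1", "n2", "n3"], cap=2)
```

Even in this tiny skewed layout, the greedy pass keeps the load perfectly even while forcing only one block to a remote node, which is the trade-off the paper's algorithms optimize exactly.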
Scanlon J.R., Teradata |
Gerber M.S., University of Virginia
IEEE Transactions on Information Forensics and Security | Year: 2015
The Internet's increasing use as a means of communication has led to the formation of cyber communities, which have become appealing to violent extremist (VE) groups. This paper presents research on forecasting the daily level of cyber-recruitment activity of VE groups. We used a previously developed support vector machine model to identify recruitment posts within a Western jihadist discussion forum. We analyzed the textual content of this data set with latent Dirichlet allocation (LDA), and we fed these analyses into a variety of time series models to forecast cyber-recruitment activity within the forum. Quantitative evaluations showed that employing LDA-based topics as predictors within time series models reduces forecast error compared with naive (random-walk), autoregressive integrated moving average, and exponential smoothing baselines. To the best of our knowledge, this is the first result reported on this forecasting task. This research could ultimately assist with the efficient allocation of intelligence analysts in response to predicted levels of cyber-recruitment activity. © 2015 IEEE.
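The core claim, that a lagged topic-based predictor can beat a random-walk baseline, can be illustrated with a toy least-squares autoregression. This is a hypothetical sketch on synthetic data, not the paper's LDA/ARIMA pipeline; the series is constructed so that yesterday's topic signal drives today's activity:

```python
def solve(A, b):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit(y, x):
    """Least-squares fit of y_t = a + b*y_{t-1} + c*x_{t-1}."""
    rows = [[1.0, y[t - 1], x[t - 1]] for t in range(1, len(y))]
    targets = y[1:]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    bvec = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    return solve(A, bvec)

# synthetic daily series: x = topic proportion, y = recruitment posts
x = [0.1, 0.4, 0.2, 0.5, 0.3, 0.6, 0.2, 0.5, 0.4]
y = [3.0] + [3.0 + 2.0 * xi for xi in x[:-1]]

a, b, c = fit(y[:-1], x[:-1])          # train on all but the last day
forecast = a + b * y[-2] + c * x[-2]   # topic-informed one-step forecast
rw = y[-2]                             # random-walk baseline: repeat yesterday
err_model = abs(forecast - y[-1])
err_rw = abs(rw - y[-1])
```

On this constructed series the topic-informed forecast has lower error than the random-walk baseline, which is the shape of the paper's quantitative result (the paper itself uses LDA topics and ARIMA/exponential-smoothing models, not this toy regression).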
News Article | November 10, 2016
SAN DIEGO, Nov. 10, 2016 /PRNewswire/ -- Teradata (NYSE: TDC), a leading analytics solutions company, has announced that Teradata Everywhere™, introduced in September, is the winner of the 2016 Ventana Research Technology Innovation Award for Information Management. The Ventana awards,...
Anandan B., Purdue University |
Clifton C., Purdue University |
Jiang W., Missouri University of Science and Technology |
Murugesan M., Teradata |
And 2 more authors.
Transactions on Data Privacy | Year: 2012
De-identified data has the potential to be shared widely to support decision making and research. While significant advances have been made in the anonymization of structured data, the anonymization of textual information is in its infancy. Document sanitization requires finding and removing personally identifiable information. While current tools are effective at removing specific types of information (names, addresses, dates), they fail on two counts. The first is that complete text redaction may not be necessary to prevent re-identification, and it can degrade the readability and usability of the text. More serious is that identifying information, as well as sensitive information, can be quite subtle and remain in the text even after the removal of obvious identifiers. Observe that the diagnosis "tuberculosis" is sensitive, but in some situations it can also be identifying. Replacing it with the less sensitive term "infectious disease" also reduces identifiability. That is, instead of simply removing sensitive terms, these terms can be hidden behind more general but semantically related terms to protect sensitive and identifying information without unnecessarily degrading the amount of information contained in the document. Based on this observation, the main contribution of this paper is a novel information-theoretic approach to text sanitization, along with efficient heuristics for sanitizing text documents.
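The generalization idea in the abstract, replacing a sensitive term with a broader related term rather than redacting it, can be sketched in a few lines. This is a hypothetical illustration only: the hypernym map below is hand-built, whereas the paper selects generalizations information-theoretically (a real sanitizer would draw them from a domain ontology):

```python
import re

# toy hypernym map (invented entries); each sensitive term maps to a
# more general, less identifying substitute
HYPERNYMS = {
    "tuberculosis": "infectious disease",
    "42 Elm Street": "a residential address",
}

def sanitize(text, hypernyms=HYPERNYMS):
    """Replace each sensitive term with a more general related term
    instead of redacting it outright, preserving readability."""
    for term, general in hypernyms.items():
        text = re.sub(re.escape(term), general, text, flags=re.IGNORECASE)
    return text

doc = "The patient at 42 Elm Street was treated for tuberculosis."
clean = sanitize(doc)
```

The output sentence remains readable ("... was treated for infectious disease") while both the identifying address and the sensitive-and-identifying diagnosis have been generalized away, which is exactly the trade-off the paper formalizes.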
News Article | November 14, 2016
SAN DIEGO, Nov. 14, 2016 /PRNewswire/ -- Teradata Corp. (NYSE: TDC) ("Teradata") will host its previously announced Analyst Day for financial analysts and institutional investors at its research and development facility in Rancho Bernardo, California, on Thursday, November 17, 2016 from 8...
News Article | February 28, 2017
SAN DIEGO, Feb. 28, 2017 /PRNewswire/ -- Teradata is positioned as a leader in the Gartner, Inc. 2017 Magic Quadrant for Data Management Solutions for Analytics1 (DMSA) issued February 20, 2017, by Gartner analysts Roxane Edjlali, Adam M. Ronthal, Rick Greenwald, Mark A. Beyer, and Donald...
News Article | November 14, 2016
SAN DIEGO, Nov. 14, 2016 /PRNewswire/ -- Teradata (NYSE: TDC), a leading analytics solutions company, today announced the immediate availability of Teradata Consulting and Managed Services for Amazon Web Services (AWS), increasing the company's ability to accelerate positive business...
Johnston J., CGG |
Proceedings of the Annual Offshore Technology Conference | Year: 2015
In the current cost-saving and high-tech environment, this paper aims to demonstrate that significant business value can be derived from advanced information technology. The objective was to identify and reduce risk in the drilling and wells domains using iterative, multi-disciplinary Big Data analytics and workflows. Examples of operational risk identified in this project include low borehole quality, poor wellbore stability, and stuck pipe. Subject-matter expertise and advanced analytical capabilities were assembled to mine and analyze large amounts of different data types across drilling parameters, petrophysics and well logs, and geological formation tops for a released data set of approximately 350 oil and gas wells in the UK North Sea. The data set covered a large geographical area, which conventional analysis techniques would find difficult, if not impossible, to handle and analyze in its entirety. Results of this study showed that iterative Big Data "discovery workflows" uncover hidden patterns and unexpected correlations across the data set. The study also confirmed that drilling models can be improved using business analytics. In addition, the correlations found allow predictive statistics to be computed. Finally, advanced visualization capabilities provided an aid to interpreting, understanding, and making recommendations for drilling plans and operations. This novel approach showed that patterns and correlations can be detected across a disparate data set, where data types are not traditionally linked, by integrating a large variety and complexity of data in one analytical environment. Furthermore, the multi-domain analyses run during the study were all performed 'on the fly', without preconceptions or predefined business requirements. As a final point, Big Data analytics can also be used as a quality-control tool and will certainly be leveraged for further multivariate analysis in oil and gas.
Copyright © (2015) by the Offshore Technology Conference All rights reserved.
Xu Y., Teradata |
Proceedings - International Conference on Data Engineering | Year: 2010
Large enterprises have been relying on parallel database management systems (PDBMS) to process their ever-increasing data volume and complex queries. Business intelligence tools used by enterprises frequently generate a large number of outer joins and require high performance from the underlying database systems. A common type of outer join in business applications is the small-large table outer join studied in this paper, where one table is relatively small and the other is large. We present an efficient and easy-to-implement algorithm called DER (Duplication and Efficient Redistribution) for small-large table outer joins. Our experimental results show that the DER algorithm significantly speeds up query elapsed time and scales linearly. © 2010 IEEE.
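The shape of a duplication-based small-large outer join can be shown in a single-process simulation. This is a hypothetical sketch of the general idea (duplicate the small table to every partition of the large table, join locally, then NULL-pad each small-table row that matched nowhere exactly once), not Teradata's implementation of DER, and all table contents are invented:

```python
def small_large_outer_join(small, large_partitions):
    """Simulate a duplication-based outer join: the small table is sent
    to every partition, each partition joins locally, and small rows
    matched in no partition are NULL-padded once at the end."""
    matched_ids = set()   # row ids of small rows matched somewhere
    results = []
    for part in large_partitions:          # each partition joins locally
        index = {}
        for key, lval in part:
            index.setdefault(key, []).append(lval)
        for rid, (key, sval) in enumerate(small):
            for lval in index.get(key, []):
                results.append((key, sval, lval))
                matched_ids.add(rid)
    # only the (typically few) unmatched row ids need redistribution;
    # here we simply NULL-pad each of them exactly once
    for rid, (key, sval) in enumerate(small):
        if rid not in matched_ids:
            results.append((key, sval, None))
    return results

small = [(1, "a"), (2, "b"), (3, "c")]               # small outer table
parts = [[(1, "x")], [(1, "y"), (2, "z")]]           # partitioned large table
rows = small_large_outer_join(small, parts)
```

In a real PDBMS the "matched nowhere" step is the delicate part, since each parallel unit only sees local matches; the DER algorithm's contribution is redistributing only the compact set of unmatched row ids so each unmatched row is padded exactly once.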
Xu Y., Teradata |
Kostamaa P., Teradata |
Proceedings of the ACM SIGMOD International Conference on Management of Data | Year: 2010
Teradata's parallel DBMS has been successfully deployed in large data warehouses over the last two decades for large-scale business analysis in various industries, over data sets ranging from a few terabytes to multiple petabytes. However, due to the explosive data volume increase in recent years at some customer sites, some data such as web logs and sensor data are not managed by Teradata EDW (Enterprise Data Warehouse), partially because it is very expensive to load such extremely large volumes of data into an RDBMS, especially when those data are not frequently used to support important business decisions. Recently the MapReduce programming paradigm, started by Google and made popular by the open source Hadoop implementation with major support from Yahoo!, has been gaining rapid momentum in both academia and industry as another way of performing large-scale data analysis. By now most data warehouse researchers and practitioners agree that both the parallel DBMS and MapReduce paradigms have advantages and disadvantages for various business applications, and thus both paradigms are going to coexist for a long time. In fact, a large number of Teradata customers, especially those in the e-business and telecom industries, have seen increasing needs to perform BI over both data stored in Hadoop and data in Teradata EDW. One thing Hadoop and Teradata EDW have in common is that data in both systems are partitioned across multiple nodes for parallel computing, which creates integration optimization opportunities not possible for DBMSs running on a single node. In this paper we describe our three efforts towards tight and efficient integration of Hadoop and Teradata EDW. Copyright 2010 ACM.