R and Python are two of the most popular languages for data science. Python is often praised as a general-purpose language traditionally applied in engineering disciplines, while R was initially developed with a focus on statistics. Whatever your language of choice is, chances are you have run into the same problems:
Did you know that Teradata enables you to solve these problems with your language of choice by processing data directly within a Vantage system? Taking the analytic processing into the Teradata Vantage platform has tremendous processing and performance benefits, namely:
- R/Python code can leverage Vantage’s massively parallel processing (MPP) platform for performance and scalability,
- Data resident in the Vantage ecosystem need not be moved to another analytic processing platform, and
- It provides a platform for operationalizing analytics.
Why is this important?
While in the past, running R/Python on laptops or an analytic server might have been sufficient, the exponentially growing data volumes in contemporary use cases can make similar tasks unbearably slow. This need gave rise to scalable MPP platforms for efficient analytics. Scalable performance has a profound effect on analyst productivity and accuracy. If it takes several minutes to run an analysis, the expert can willingly iterate through parameter refinements and resubmit the analysis. When the processing requires several hours, their willingness to make accuracy improvements fades away, along with the spirit of agile data exploration and analytic modelling.
Teradata Vantage was built from the ground up for efficient analytics. Over the last decades, facilities have been added to the evolving core platform that enable users to bring analytics to the data. The design of the shared nothing architecture* of Vantage allows much more than the simple storage and reporting of voluminous detailed data records. Additionally, Teradata Vantage now includes a Machine Learning Engine that adds multivariate statistical, machine learning and graph functions to the existing core capabilities. Together, these engines allow the execution of advanced analytical tasks directly on the data without data movement, a processing capability Teradata pioneered in the late 1990s and continues to evolve today.
How is this done in Vantage?
Depending upon where the interpreters and packages are installed and executed, R/Python are processed in Teradata Vantage in two distinctly different ways:
1. The languages run outside Teradata Vantage, with Teradata open source packages and native language SQL drivers providing interfaces to the functionality available in the Advanced SQL and Machine Learning Engines.
2. The languages run directly on the Vantage platform.
Let’s take a look at both of these approaches.
Client-Side Language and Packages
In approach (1), R/Python users add to their installations the native language SQL driver along with one of two open source libraries from Teradata – tdplyr for R and teradataml for Python. Both tdplyr and teradataml provide the following common functionality:
As the name indicates, tdplyr is based upon the well-known R package dplyr, arguably the most widely used R package for data manipulation and preparation. R users can transform their data in-database by using dplyr’s verbs and treating tables in Vantage as R Data Frames. In the case of Python, the teradataml package provides similar functionality; it is based upon the SQLAlchemy package and offers a construct that mimics the core functionality of pandas DataFrames.
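The idea of a local object that merely stands in for a table in the database can be sketched in a few lines of plain Python. This is a toy illustration only and not the teradataml or tdplyr API: the `LazyTable` class, its methods and the `sales` table are hypothetical names chosen for this example. Each call records an operation instead of fetching rows, and SQL text is produced only on request, which is roughly how work can be handed to the database rather than pulled to the client.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyTable:
    """Hypothetical stand-in for a remote table: holds a query plan, not rows."""
    name: str
    columns: tuple = ("*",)
    conditions: tuple = ()

    def select(self, *cols):
        # Record the projection; no data is fetched.
        return LazyTable(self.name, tuple(cols), self.conditions)

    def filter(self, condition):
        # Record the predicate; multiple predicates are combined with AND.
        return LazyTable(self.name, self.columns, self.conditions + (condition,))

    def to_sql(self):
        # Render everything recorded so far as one SQL statement.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.name}"
        if self.conditions:
            sql += " WHERE " + " AND ".join(self.conditions)
        return sql


sales = LazyTable("sales")  # assumed table name
query = sales.filter("region = 'EMEA'").select("store_id", "amount")
print(query.to_sql())
```

The object `query` never holds any rows. In a real client library, a statement of this kind would be sent through the SQL driver and only the result would travel back to the R or Python session.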
Server-Side Languages & Packages
For approach (2), it is important to understand Vantage’s shared nothing MPP architecture. On the Vantage system, data is evenly distributed across all its virtual units of parallelism, known as Access Module Processors (AMPs). To enable R/Python processing, the respective interpreters, base packages and any desired add-on packages need to be installed on every node.*** Vantage’s Table Operator mechanism drives execution, and the processing is simultaneously performed on every AMP against the data available to that unit of parallelism. The number of AMPs per node varies from 30 to 45 units, based on the version of the Intel CPU and overall node performance. Thus, a ten-node system with 40 AMPs per node will run 400 parallel processes of R/Python.
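To make the arithmetic and the data distribution concrete, here is a small simulation in plain Python. It is a sketch under stated assumptions, not Vantage code: CRC32 stands in for the platform's own row hashing, and the `customer_id` key, the row layout and the `script` function are hypothetical. It shows rows being assigned to AMPs by a hash of a key, and the same script running once per AMP against only that AMP's rows.

```python
import zlib
from collections import defaultdict

NODES = 10
AMPS_PER_NODE = 40
TOTAL_AMPS = NODES * AMPS_PER_NODE  # 400 parallel script instances


def amp_for(key: str) -> int:
    # Assumption: CRC32 is a stand-in for the platform's real hash function.
    return zlib.crc32(key.encode("utf-8")) % TOTAL_AMPS


# Hypothetical input table with 10,000 rows.
rows = [{"customer_id": f"C{i:05d}", "amount": float(i % 100)} for i in range(10_000)]

# Distribute the rows: each row lands on exactly one AMP.
amp_rows = defaultdict(list)
for row in rows:
    amp_rows[amp_for(row["customer_id"])].append(row)


def script(local_rows):
    # One script instance sees only the rows on its own AMP.
    return len(local_rows)


# Run the same script independently on every AMP that holds data.
per_amp_counts = {amp: script(local) for amp, local in amp_rows.items()}

print(TOTAL_AMPS)
print(sum(per_amp_counts.values()))
```

No instance ever sees the full table, yet the per-AMP row counts add up to the size of the input.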
Notably, each instance of R/Python is operating independently, with no inter-process communication across nodes or AMPs within a node. Hence, data scientists must make a script cognizant of the data available to its running instance. On this basis, the following use case types can be addressed:
- Row-Independent Processing – Depends only on the input from individual data rows on a single AMP.
- Partition-Independent Processing – Depends on the input from individual data partitions on a single AMP. Examples include model fitting for a given location, time period or product.
- System-Wide Processing – Based upon the entire input table which is evenly spread across every system AMP. In this situation, additional design or programming may be needed. Examples include calculating a global average or building an attrition model for the entire customer base.
Out of the box, Vantage provides support for the first two processing patterns. For system-wide processing, the data scientist must construct a master processing level to combine and appropriately process the partial results returned from every AMP process.
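The three patterns can be illustrated with a small plain-Python simulation. This is a hedged sketch rather than Vantage code: the `amps` list, the store names and all function names are hypothetical, and the toy data assumes that every row of a given partition already sits on the same AMP. The last step shows one way a master level could combine partial results into a global average.

```python
from collections import defaultdict

# Toy data: each inner list holds the (store, amount) rows of one AMP.
amps = [
    [("north", 10.0), ("north", 20.0)],
    [("south", 30.0)],
    [("east", 40.0), ("east", 50.0), ("east", 60.0)],
]


def row_independent(local_rows):
    # Each output row depends on one input row only: convert amounts to cents.
    return [(store, round(amount * 100)) for store, amount in local_rows]


def partition_independent(local_rows):
    # Each result depends on one partition (here: one store) on this AMP.
    groups = defaultdict(list)
    for store, amount in local_rows:
        groups[store].append(amount)
    return {store: sum(vals) / len(vals) for store, vals in groups.items()}


def partial_aggregate(local_rows):
    # System-wide, step 1: every AMP returns a partial (sum, count).
    amounts = [amount for _, amount in local_rows]
    return sum(amounts), len(amounts)


def master_combine(partials):
    # System-wide, step 2: the master level merges the partial results.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


cents = [row_independent(local) for local in amps]

store_means = {}
for local in amps:
    store_means.update(partition_independent(local))

global_mean = master_combine([partial_aggregate(local) for local in amps])

# Averaging the per-AMP averages ignores how many rows each AMP holds.
naive_mean = sum(sum(a for _, a in local) / len(local) for local in amps) / len(amps)

print(store_means)
print(global_mean, naive_mean)
```

The master step works on sums and counts rather than on per-AMP averages because the AMPs hold different numbers of rows. In this toy data the combined result is 35.0, while simply averaging the three per-AMP averages gives a different, incorrect value.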
This is part one of a three-part blog series. As with most complex topics, the devil is in the details - and in two subsequent blogs, we will dive more deeply into each of the approaches described above.
Start Optimizing Your Data Science Process
Whichever method you prefer, with Teradata Vantage you can use R and Python while taking advantage of its massively parallel processing (MPP) platform for performance and scalability. If you’re on a previous version and curious about upgrading to Vantage, contact us today.
* In a shared nothing distributed computing architecture, each processing node is independent and self-sufficient. The nodes share no memory or disk storage, and there is no single point of contention across the system, so that maximum performance and scalability are achieved.
** Both tdplyr and teradataml make R Data Frames and pandas Data Frames appear locally to the programmer but are virtually pointing to tables or views in Vantage.
*** Vantage makes the installation process easier by providing bundles of R/Python base and add-on packages that have been tested against the base operating system and vetted for security and legal constraints.
Tim Miller has been in a wide variety of R&D roles at Teradata over his 30+ year career. He has been involved in all aspects of enterprise systems software development, from software architecture and design to system test and quality assurance. Tim has developed software in domains ranging from transaction processing to decision support, with the last 20 years dedicated to predictive analytics. He is one of two principals in the development of the first commercial in-database data mining system, Teradata Warehouse Miner. As a member of Teradata's Partner Integration Lab, he consulted with Teradata's advanced analytics ISV partners, including SAS, IBM SPSS, RStudio and Dataiku, to integrate and optimize their products with Teradata's platform family. He spent several years with Teradata’s Data Science Practice, working closely with customers to optimize their analytic environments. Today, Tim is a Sr. Technologist in Teradata’s Technology Innovation Office, focused on the Vantage platform.