In part 1 of this blog series, we introduced the two approaches that R and Python programmers can use to leverage the Teradata Vantage™ platform. In part 2, we focused on the client-side languages and packages for R and Python – tdplyr. In this third and final blog, we describe the server-side options for executing R and Python directly in Vantage.
Teradata has always been on the bleeding edge of exploiting its shared-nothing, massively parallel processing (MPP) architecture to scale advanced analytics. The Table Operator (TO) database construct, which was developed as part of the Teradata-SAS partnership, allowed SAS PROCs and the DATA step language to operate directly within the Teradata Database years ago. Today, TOs provide a similar processing capability in Vantage for R and Python.
As a reminder from part 1, a wide variety of use cases can be addressed with TOs. To explain them all, we introduced the following processing nomenclature:
- Row-independent processing (RI) – The analytic result depends only on the input from individual data rows on a single AMP – e.g. model scoring.
- Partition-independent processing (PI) – The analytic result depends on the input from individual data partitions on a single AMP – e.g. simultaneous model building.
- System-wide processing – The analytic result is based upon the entire input table which is evenly spread across every AMP in the system – e.g. single model building on the corpus.
How Vantage specifically handles each of these processing paradigms is described in subsequent sections. First, let’s talk about the new R and Python package bundles for Vantage!
Vantage R and Python Package Bundles
R and Python programmers are used to having ANY mathematical, statistical or scientific package at their fingertips, installing it with a few clicks of the mouse or a single line of code. However, in production MPP environments, each package needs to be inspected for security vulnerabilities and potential licensing issues prior to deployment. Then each package needs to be installed and validated on every node within the MPP platform. This unwieldy process often results in major conflicts between the IT organization and the data science community it needs to serve.
In the recent release of Vantage, Teradata is helping to resolve that conflict by offering R and Python Distribution packages. Each language package bundle includes an Interpreter package and an Add-Ons package. In their initial release, the Add-Ons packages are collections of some 400 of the most utilized R packages and over 300 of the most utilized Python packages. These package bundles will evolve and be updated multiple times a year depending upon customer requests and will include a change control for an easy Teradata Customer or Managed Service installation.
Vantage SCRIPT Table Operator for R and Python
The first table operator we will discuss is a generic language processor known as the SCRIPT table operator. For an R or Python script to be processed in Vantage through SCRIPT, there are several simple rules that must be followed. First, the script must be “installed” or registered to the database. Vantage provides a very simple one-line SQL command that performs this registration. Second, the R or Python script must read data from Vantage through the Standard Input Stream, commonly referred to as stdin*, and write back through the Standard Output Stream or stdout*. For the first processing model described above (RI), these are the only two rules.
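To make the stdin/stdout rule concrete, here is a minimal sketch of a row-independent (RI) scoring script in Python. Several details are assumptions for illustration and do not come from Vantage itself: the tab delimiter, the three-column input layout (row id plus two numeric features), the function name `score_rows`, and the logistic-regression coefficients, which are made up. The point is only the shape of the program: read a row, score it, write a row.

```python
import io
import math

# Hypothetical model coefficients (illustration only, not a trained model).
INTERCEPT = -1.5
WEIGHTS = (0.8, 0.3)


def score_rows(stream_in, stream_out, delimiter="\t"):
    """Score every input row on its own and write 'id<delim>probability'."""
    for line in stream_in:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        # Assumed input layout: row id, feature x1, feature x2.
        row_id, x1, x2 = line.split(delimiter)
        z = INTERCEPT + WEIGHTS[0] * float(x1) + WEIGHTS[1] * float(x2)
        probability = 1.0 / (1.0 + math.exp(-z))
        stream_out.write(f"{row_id}{delimiter}{probability:.6f}\n")


# Local demonstration with in-memory streams standing in for stdin/stdout.
demo_in = io.StringIO("1\t2.0\t1.0\n2\t0.0\t0.0\n")
demo_out = io.StringIO()
score_rows(demo_in, demo_out)
scored = demo_out.getvalue().splitlines()
```

In a script registered for use with SCRIPT, the demonstration lines would be replaced by a single call such as `score_rows(sys.stdin, sys.stdout)` after `import sys`. Because no row depends on any other, the same script can run unchanged on every AMP.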
The second processing model (PI) requires that an additional rule be followed. As these programs are executed on every AMP independently, Vantage provides a partitioning mechanism which guarantees at runtime that distinct data partitions will land on distinct AMPs. By using a PARTITION BY clause when writing the SQL statement to source the data for use within the script, each AMP will simultaneously execute the installed script on its partition of data.
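As a rough sketch of the partition-independent (PI) case, the Python script below builds one small model per partition: a least-squares line fitted to the rows it receives. It assumes that each invocation sees the rows of exactly one partition, that rows arrive tab-delimited as partition key, x, y, and that one output row per partition is wanted. The function name `fit_partition` and the column layout are hypothetical.

```python
import io


def fit_partition(stream_in, stream_out, delimiter="\t"):
    """Fit y = slope * x + intercept to all rows of one partition."""
    n = 0
    sum_x = sum_y = sum_xx = sum_xy = 0.0
    key = None
    for line in stream_in:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        # Assumed input layout: partition key, x, y.
        key, x_text, y_text = line.split(delimiter)
        x, y = float(x_text), float(y_text)
        n += 1
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_xy += x * y
    if n == 0:
        return  # no rows reached this invocation
    denominator = n * sum_xx - sum_x * sum_x
    if denominator == 0:
        return  # all x values identical, so the slope is undefined
    slope = (n * sum_xy - sum_x * sum_y) / denominator
    intercept = (sum_y - slope * sum_x) / n
    stream_out.write(f"{key}{delimiter}{slope:.6f}{delimiter}{intercept:.6f}\n")


# Local demonstration: one partition whose points lie on y = 2x + 1.
demo_in = io.StringIO("storeA\t1\t3\nstoreA\t2\t5\nstoreA\t3\t7\n")
demo_out = io.StringIO()
fit_partition(demo_in, demo_out)
model_row = demo_out.getvalue().strip().split("\t")
```

With the data partitioned on the key column, every partition yields its own model row, and the partitions are processed simultaneously rather than one after another.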
For system-wide style processing, the data scientist must construct a master process to combine and appropriately process the partial results returned from every AMP process. This can be done either by using a MapReduce style that nests multiple calls to the SCRIPT table operator or by embedding calls to the SCRIPT table operator within a C++ or Java external stored procedure (XSP). In either case, the results are aggregated across all AMPs and processed further to produce a meaningful final answer.
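The MapReduce style described above can be simulated locally in a few lines of Python. In this sketch the "map" step stands in for the per-AMP script, which emits partial sums instead of a finished model, and the "reduce" step stands in for the master process that combines them into a single regression line. The function names, the choice of a simple linear regression, and the in-memory lists used in place of real SCRIPT calls are all assumptions made for illustration.

```python
def map_partial(rows):
    """Per-AMP step: reduce local (x, y) rows to partial sums."""
    n = 0
    sum_x = sum_y = sum_xx = sum_xy = 0.0
    for x, y in rows:
        n += 1
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_xy += x * y
    return (n, sum_x, sum_y, sum_xx, sum_xy)


def reduce_partials(partials):
    """Master step: add up the partial sums and solve for one global line."""
    n = sum(p[0] for p in partials)
    sum_x = sum(p[1] for p in partials)
    sum_y = sum(p[2] for p in partials)
    sum_xx = sum(p[3] for p in partials)
    sum_xy = sum(p[4] for p in partials)
    denominator = n * sum_xx - sum_x * sum_x
    if n == 0 or denominator == 0:
        raise ValueError("not enough distinct x values to fit a line")
    slope = (n * sum_xy - sum_x * sum_y) / denominator
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept


# Local demonstration: two lists play the role of the rows held by two AMPs.
amp_1_rows = [(1.0, 3.0), (2.0, 5.0)]
amp_2_rows = [(3.0, 7.0), (4.0, 9.0)]
partials = [map_partial(amp_1_rows), map_partial(amp_2_rows)]
slope, intercept = reduce_partials(partials)
```

The design choice that matters is that the per-AMP output is small and additive: each AMP returns five numbers regardless of how many rows it holds, so the combining step stays cheap even when the input table is large.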
Get a step-by-step demonstration on how to use Python and the SCRIPT table operator in this short video, Using R and Python with Vantage, Part 5 - Python and Table Operators.
Vantage ExecR Table Operator for R
As the name indicates, the ExecR table operator is specific to R programs. It forces R to run in a special protected mode server for code licensed under the General Public License (GPL), and has the same processing considerations for RI, PI and system-wide analytics as described for SCRIPT.
One difference between SCRIPT and ExecR is that ExecR does not require a script file installation or registration process. Instead, the R code is passed directly within the SQL statement calling ExecR. This code comes in two pieces – the “contract,” which specifies the result schema returned by the R script, and the R code itself, called the “operator.” Additionally, while SCRIPT is limited to a single input – referred to as an ON clause – ExecR can have up to 16. Finally, I/O is not limited to STDIN and STDOUT as with SCRIPT; instead, ExecR supports the FNC API, which is used in standard C, C++ or Java User Defined Functions to read and write data and read metadata from the Vantage data dictionary. The R FNC APIs are defined in the “tdr” add-on provided by the teradata-udfgpl package, available from Teradata At Your Service.
Get a step-by-step demonstration on how to use R with the SCRIPT and ExecR table operators in this short video, Using R and Python with Vantage, Part 4 - R and Table Operators.
Scaling Your Data Science Process
With Teradata Vantage, you can use R and Python to take advantage of its MPP architecture for performance and scalability. With faster analytic processing in Vantage, the highly iterative tasks required of the data scientist are accomplished in minutes or hours, rather than days. If you are on a previous version of Teradata and curious about upgrading to Vantage, contact us today.
* The standard input (stdin) and standard output (stdout) streams are preconnected input and output communication channels between a computer program and its environment when it begins execution.
Tim Miller has been in a wide variety of R&D roles at Teradata over his 30+ year career. He has been involved in all aspects of enterprise systems software development, from software architecture and design; to system test and quality assurance. Tim has developed software in domains ranging from transaction processing to decision support, with the last 20 years dedicated to predictive analytics. He is one of two principals in the development of the first commercial in-database data mining system, Teradata Warehouse Miner. As a member of Teradata's Partner Integration Lab, he consulted with Teradata's advanced analytics ISV partners, including SAS, IBM SPSS, RStudio and Dataiku, to integrate and optimize their products with Teradata's platform family. He spent several years with Teradata’s Data Science Practice, working closely with customers to optimize their analytic environments. Today, Tim is a Sr. Technologist in Teradata’s Technology Innovation Office, focused on the Vantage platform.