Run scripts to move and transform data leveraging Vantage
Many Teradata customers are interested in integrating Teradata Vantage with Microsoft Azure First Party Services. This Getting Started Guide will help you connect to Teradata Vantage using the Azure Data Factory Custom Activity feature.
Although this approach has been implemented and tested internally, it is offered on an as-is basis. Microsoft does not provide validation of Teradata Vantage with the Azure Data Factory Custom Activity feature.
Disclaimer: This guide includes content from both Microsoft and Teradata product documentation.
Overview
We will use a Teradata Access Module for Azure load script for high-speed data movement from Azure Blob Storage to Teradata Vantage, followed by a BTEQ script for further in-database transformation, using the
Azure Data Factory Custom Activity feature.
Azure Data Factory is a cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation.
There are two types of activities that you can use in an Azure Data Factory pipeline: data movement activities and data transformation activities.
The current, ODBC-based
Azure Data Factory Teradata Connector supports data movement with Vantage as a data source, but it does not support data transformation activities using the Data Flow feature.
Some customers have large data movement requirements where ODBC may not perform well, or they need to transform or process data in a way that isn't supported by Data Factory. Fortunately, you can create a
Custom Activity with your own data movement or transformation logic and use the activity in a pipeline in Data Factory. The custom activity runs your customized code logic on an
Azure Batch pool of virtual machines.
This is a diagram of the workflow. You will need to install the Vantage client software on a compute node.
Prerequisites
You are expected to be familiar with Azure services and Teradata.
You will need the following accounts, objects, and systems. Links have been included with setup instructions.
Getting Started
For the Azure Data Factory Custom Activity to run scripts against Vantage, you need an Azure Blob Storage account and container, an Azure Batch Service account and a pool of compute nodes (i.e., Windows or Linux VMs), Vantage client software on a virtual machine, and scripts in an Azure Blob Storage account. Finally, we will create an Azure Data Factory pipeline job using the Custom Activity feature to execute your scripts on the compute node VM against Vantage.
If you are new to Azure Data Factory, see
Introduction to Azure Data Factory
Create an Azure Blob Storage account and container
We need an Azure Blob Storage account to store the following:
- A dataset to load into a Vantage table.
- The load and transformation scripts, which the Custom Activity will access and execute.
If you do not have an Azure Blob Storage account, please
Create an Azure Storage account.
We suggest creating the storage account with two (2) containers: one named
data for the dataset and the other named
files for the scripts. This document will refer to the containers as
data and
files. You can call them something entirely different, but you will then need to adjust the directions for the names that you actually used.
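If you prefer to script this step rather than use the Azure portal, you can create the storage account and both containers with the Azure CLI. The sketch below is illustrative only; the resource group, account name, and region are placeholder values, and it assumes you have already signed in with az login.

    # Placeholder names -- substitute your own resource group, storage account, and region.
    az storage account create --name mystorageacct --resource-group myresourcegroup --location eastus --sku Standard_LRS

    # Create the two containers used throughout this guide.
    az storage container create --name data --account-name mystorageacct
    az storage container create --name files --account-name mystorageacct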
Create an Azure Batch Service
Azure Batch creates and manages a pool of virtual machines, installs the applications you want to run, and schedules jobs to run on the virtual machines.
1. Logon to the
Azure portal and
create a Batch account.
2. After deployment, click on
Go to resource to view the Batch account in preparation for the next step.
Create a pool of compute nodes
We need to create a pool of virtual machines, which can be either Windows- or Linux-based. We will use Windows.
For simplicity, we will only
create a pool with a single virtual machine. In production, you would create more virtual machines for scaling and redundancy.
1. In the Batch account, click on
Pools in the left pane under Features and then
+Add. Enter a
Pool ID name.
2. For the
Operating System, choose the following.
3. For the
Node Size, choose the following.
4. For the
Scale, choose the following.
5. Leave the other options at their default values. Click the
Save button.
Batch will create the pool immediately, but it will take a few minutes to allocate and start the virtual machines.
6. Once the
Allocation state shows as
Steady, click on your
Pool ID.
You may need to click on
Refresh to update the Allocation state.
7. Click on
Nodes in the left pane and then click on your specific node.
8. Create a user on the virtual machine by clicking on
Connect. Select
Specify your own option. Enter a
Username and
Password, set
Is administrator to
true, and set an appropriate
Expiry Time. Click on
Add user account. Click on
Download RDP file. (Alternatively, you can copy the IP address and port and manually connect with the Remote Desktop application).
Note that the password must be sufficiently complex. An error message will remind you of what characters and length are required.
We will use the remote connection to the server to install software in the next section.
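As an alternative to the portal steps in this section, the pool itself can also be created with the Azure CLI (for example, from the Azure Cloud Shell in Bash). The sketch below is illustrative only; the account, resource group, pool ID, VM size, and image values are placeholders to replace with your own, and node users are still added as described above.

    # Placeholder names -- substitute your own Batch account, resource group, and pool ID.
    az batch account login --name mybatchaccount --resource-group myresourcegroup --shared-key-auth

    # Create a single-node Windows pool; adjust the VM size and image to your needs.
    az batch pool create \
        --id vantagepool \
        --vm-size Standard_D2s_v3 \
        --target-dedicated-nodes 1 \
        --image "MicrosoftWindowsServer:WindowsServer:2019-Datacenter" \
        --node-agent-sku-id "batch.node.windows amd64"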
Install Vantage client software
We use the Teradata Access Module for Azure as well as BTEQ scripts in this document. However, you can use other client software tools and scripting languages, such as Teradata Parallel Transporter or the
Teradata Vantage Python package for Advanced Analytics, as an alternative.
1. Log into the Windows virtual machine created in the previous step, if you have not already done so. (You can use the downloaded RDP file as a shortcut). You will need the username and password you created in the previous section.
2. Open
Server Manager. Select
Local Server in the left pane.
3. Set
IE Enhanced Security to
Off.
While this is normally a security concern, it is acceptable for our purposes. You can always turn it back on.
4. Open
Internet Explorer and go to the
Teradata Downloads site. If you do not have a login, you will need to register. Click on
TTU Windows Install and then on the
TTU 16.20.xx.yy Windows – Base link. You will be asked to log in and agree to the licensing provisions. Choose to save the file.
The client software is periodically updated. The version will change. The latest version is acceptable.
5. Once completed, open the downloaded file. It appears as a folder as it is a compressed file. Right-click on the file and select
Extract all. A new Explorer window will appear. Open the folder
TeradataToolsAndUtilitiesBase. Double-click to run the
setup program. Accept any defaults. Select
BTEQ,
Teradata Parallel Transporter Base,
Teradata Parallel Transporter Stream, and
Teradata Access Module for Azure and click the
Install button.
You can verify that the software installed by opening a command prompt or PowerShell window and running
BTEQ. The BTEQ program will start. Enter “.quit” to exit.
Create Teradata Access Module for Azure and BTEQ scripts
Azure Data Factory Custom Activity requires the scripts to be in an Azure Blob Storage account container. We will put them in the previously created container,
files.
We will then load a data file from an Azure Blob Storage container, called
data, into Vantage.
For our access module load script to access Azure Blob Storage account, we need to create a credentials file. For more information and an example of a credentials file, see
Teradata Tools and Utilities Access Module Reference.
1. You may place the credentials file where you want as you can use the
-ConfigDir parameter to specify this path. We will create a directory,
azureaxsmod, in the Teradata directory. (That is,
c:\users\public\azureaxsmod.) The file name is
credentials.
Here is a sample credentials file. The StorageAccountKey below is shown as a placeholder; you will need to ensure that your full key is in your file. Note that the file name has
no extension.
Replace any value within the angled brackets with actual values first.
Here are some examples.
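The original sample is not reproduced here. The following minimal sketch assumes the simple name/value layout described in the Access Module Reference; confirm the exact syntax there, and replace both placeholder values with your own.

    StorageAccountName = <your-storage-account-name>
    StorageAccountKey = <your-full-storage-account-key>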
You can find the "StorageAccountName" and "StorageAccountKey" properties by selecting
Storage Resource group, then
Storage account, and finally
Access keys in the Azure portal.
2. Create a batch file,
tbuild_load.bat, for the Custom Activity. This can be placed in the azureaxsmod directory.
Note that the full path name to the script and variable files is included. You will need to adjust for your batch file.
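A minimal sketch of such a batch file, assuming a default Teradata Tools and Utilities installation (tbuild on the PATH), the c:\users\public\azureaxsmod directory used in this guide, and an illustrative job name of azure_load_job:

    @echo off
    rem Run the TPT job that loads the Azure Blob Storage object into Vantage.
    rem Adjust the full paths and the job name for your environment.
    tbuild -f "c:\users\public\azureaxsmod\tdaxs4az(load).txt" ^
           -v "c:\users\public\azureaxsmod\jobvars(load).txt" azure_load_job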
3. Create an access module load script,
tdaxs4az(load).txt, which will be run by the Custom Activity. This can be placed in the azureaxsmod directory. Replace any value within the angled brackets with actual values first.
Here are some examples.
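The original sample is not reproduced here; the sketch below is illustrative only. It assumes a two-column staging table, job variables named TargetTdpId, TargetUserName, and TargetUserPassword (supplied by the jobvars file in the next step), and an access module name and init string whose exact options you should confirm in the Access Module Reference.

    DEFINE JOB LOAD_FROM_AZURE
    DESCRIPTION 'Load the dataset object from Azure Blob Storage into a Vantage staging table'
    (
        /* Layout of the incoming delimited file -- adjust to match your own dataset */
        DEFINE SCHEMA STAGING_SCHEMA
        (
            item_id   VARCHAR(10),
            item_name VARCHAR(50)
        );

        /* DDL operator used to drop old tables and create the staging table */
        DEFINE OPERATOR DDL_OP
        TYPE DDL
        ATTRIBUTES
        (
            VARCHAR TdpId        = @TargetTdpId,
            VARCHAR UserName     = @TargetUserName,
            VARCHAR UserPassword = @TargetUserPassword,
            VARCHAR ErrorList    = '3807'   /* ignore "table does not exist" on DROP */
        );

        /* Load operator that writes rows into the staging table */
        DEFINE OPERATOR LOAD_OP
        TYPE LOAD
        SCHEMA *
        ATTRIBUTES
        (
            VARCHAR TdpId        = @TargetTdpId,
            VARCHAR UserName     = @TargetUserName,
            VARCHAR UserPassword = @TargetUserPassword,
            VARCHAR TargetTable  = '<database>.staging_table',
            VARCHAR LogTable     = '<database>.staging_table_log'
        );

        /* DataConnector producer that reads the "dataset" object from the "data" container
           through the Teradata Access Module for Azure. The module name and init-string
           options are illustrative; check the Access Module Reference for the exact syntax. */
        DEFINE OPERATOR AZURE_READER
        TYPE DATACONNECTOR PRODUCER
        SCHEMA STAGING_SCHEMA
        ATTRIBUTES
        (
            VARCHAR FileName            = 'dataset',
            VARCHAR Format              = 'Delimited',
            VARCHAR TextDelimiter       = ',',
            VARCHAR OpenMode            = 'Read',
            VARCHAR AccessModuleName    = 'libazureaxsmod.dll',
            VARCHAR AccessModuleInitStr = '-ConfigDir "c:\users\public\azureaxsmod" -Container data'
        );

        /* Drop any previous staging and error tables, then create the staging table */
        STEP SETUP_TABLES
        (
            APPLY
                ('DROP TABLE <database>.staging_table;'),
                ('DROP TABLE <database>.staging_table_ET;'),
                ('DROP TABLE <database>.staging_table_UV;'),
                ('CREATE TABLE <database>.staging_table (item_id VARCHAR(10), item_name VARCHAR(50));')
            TO OPERATOR (DDL_OP);
        );

        /* Load the rows read from Azure Blob Storage into the staging table */
        STEP LOAD_STAGING
        (
            APPLY ('INSERT INTO <database>.staging_table (item_id, item_name) VALUES (:item_id, :item_name);')
            TO OPERATOR (LOAD_OP)
            SELECT * FROM OPERATOR (AZURE_READER);
        );
    );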
The script will first drop any previous staging and error tables and then create our Vantage staging table. Next, the script will perform a load to the staging table using an existing data file,
dataset, in our Azure Blob Storage container
data.
4. Create a job variable (jobvars) file,
jobvars(load).txt, which will supply the job variables used by the access module load script. This can be placed in the azureaxsmod directory. Replace any value within the angled brackets with actual values first.
Here are some examples.
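A minimal sketch of a job variables file, matching the variable names used in the load script sketch above (the names themselves are illustrative, not required):

    /* Vantage logon values used by the DDL and Load operators in the load script */
     TargetTdpId        = '<vantage-system-host-or-ip>'
    ,TargetUserName     = '<vantage-user>'
    ,TargetUserPassword = '<vantage-password>'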
5. Create a batch file,
bteq_transform.bat, for the Custom Activity to use when invoking a BTEQ script. This can be placed in the azureaxsmod directory.
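A minimal sketch, assuming bteq is on the PATH and that the BTEQ script lives in c:\users\public\azureaxsmod. Output is redirected to output.txt in the job's working directory, which is where we look for it in the troubleshooting section:

    @echo off
    rem Run the BTEQ transform script; capture all output in output.txt for troubleshooting.
    bteq < "c:\users\public\azureaxsmod\update_inssel.txt" > output.txt 2>&1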
6. Create a BTEQ script,
update_inssel.txt, which will be run by the Custom Activity. This can be placed in the azureaxsmod directory. Replace any value within the angled brackets with actual values first.
Here are some examples.
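The original sample is not reproduced here; the sketch below is illustrative only and reuses the staging table layout from the load script sketch. The angle-bracket values are placeholders.

    .LOGON <vantage-system>/<vantage-user>,<vantage-password>

    /* Ignore "table does not exist" (error 3807) when dropping a table that is not there */
    .SET ERRORLEVEL 3807 SEVERITY 0

    /* Recreate the target table with the same layout as the staging table */
    DROP TABLE <database>.target_table;
    CREATE TABLE <database>.target_table AS <database>.staging_table WITH NO DATA;

    /* Example in-database transformation on the recently loaded staging rows */
    UPDATE <database>.staging_table
    SET item_name = TRIM(item_name);

    /* Copy the transformed rows into the target table */
    INSERT INTO <database>.target_table
    SELECT * FROM <database>.staging_table;

    .QUIT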
The script will drop and then create our target table, perform an UPDATE statement on the recently loaded rows in the staging table, and then perform an INSERT/SELECT into our target table. This can be made much more complex in a production scenario, but it shows how additional processing can be performed.
Upload the files to Azure Blob Storage
1. Upload the previously created files to the container,
files, in Azure Blob Storage.
- tbuild_load.bat
- bteq_transform.bat
- tdaxs4az(load).txt
- jobvars(load).txt
- update_inssel.txt
You can use the Azure portal to upload files into your Azure Blob Storage account.
We created these files on the Windows virtual machine. You will need to copy them to your local system or whichever system that you are using to access the Azure portal.
Alternatively, you can use the
Azure Storage Explorer. We do not cover how to configure or use the Azure Storage Explorer but it is fairly intuitive.
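If you prefer to script the upload rather than use the portal or Storage Explorer, an Azure CLI sketch (run from the directory that holds the five files, after az login; the storage account name is a placeholder) could look like this:

    # "mystorageacct" is a placeholder storage account name -- substitute your own.
    az storage blob upload --account-name mystorageacct --container-name files --file tbuild_load.bat --name tbuild_load.bat
    az storage blob upload --account-name mystorageacct --container-name files --file bteq_transform.bat --name bteq_transform.bat
    az storage blob upload --account-name mystorageacct --container-name files --file "tdaxs4az(load).txt" --name "tdaxs4az(load).txt"
    az storage blob upload --account-name mystorageacct --container-name files --file "jobvars(load).txt" --name "jobvars(load).txt"
    az storage blob upload --account-name mystorageacct --container-name files --file update_inssel.txt --name update_inssel.txt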
Create an Azure Data Factory pipeline job with a Custom Activity
We need an Azure Data Factory instance to create our Custom Activity pipeline job.
1.
Create a data factory. De-select
Enable GIT as we are not using it.
2. After deployment succeeds, open the Data Factory instance. Select the
Author & Monitor tile.
This starts the Azure Data Factory user interface (UI) application on a separate tab. You may need to (re‑)select your Azure account as this is opening a new browser tab.
3. Click on
Create pipeline.
4. Open the
Batch Service folder under
Activities on the left pane. Drag
Custom into your pipeline area on the right.
5. Drag another instance of
Custom into the pipeline area. Drag the green connector from the first Custom instance to the second Custom instance.
6. Select the first Custom instance. In the
General section below, rename it from
Custom1 to
Load to Vantage.
Select the second Custom instance and rename it from
Custom2 to
Transform in Vantage.
7. In the other browser tab, which should still be open to the Azure portal, look up the Batch account. You will need information about the Batch account that you previously created. Click on
Keys in the left pane under
Settings for the URL and access key. Click on
Pools in the left pane under
Features for the pool ID.
8. Select the
Load to Vantage object. Click the
Azure Batch tab below. Click
+New to add a new linked service.
9. In
New linked service (Azure Batch) dialog, enter the batch account name into the
Account name, the batch access key into
Access key, the batch account URL into
Batch URL, and the batch account pool ID into
Pool name.
10. Select the
Storage linked service name dropdown and then click on
+New. Select your Azure subscription, your Storage account name, and then click
Create. Once the dialog box is dismissed, click
Create again.
11. Select the
Transform in Vantage object. Click the
Azure Batch tab below. In the
Azure Batch linked service, select the Azure Batch linked service you just created (e.g. AzureBatch1).
12. Select the
Load to Vantage object. Click the
Settings tab below. In the
Command text box, enter “cmd /c tbuild_load.bat”. This will run the access module load script. Set the
Resource linked service to the Azure Blob Storage linked service connection you just created (e.g. AzureBlobStorage1). Set the
Folder path to our container,
files.
13. Select the
Transform in Vantage object. Click the
Settings tab below. In the
Command text box, enter “cmd /c bteq_transform.bat”. This will run the BTEQ script. Set the
Resource linked service to the Azure Blob Storage linked service connection you just created (e.g. AzureBlobStorage1). Set the
Folder path to our container,
files.
14. Click on
Publish all. In the dialog box, click on
Publish.
15. Once the pipeline has been published, click on
Debug to run the job.
The job status will appear in the bottom pane.
If the job ran successfully,
Status will display Succeeded. You can check the target table on Vantage.
Troubleshooting a job failure
A job failure will show as Failed in the status pane.
If you look in the drawing pane, you should see a red X that will give you an indication of where the failure occurred.
The first step is to ensure that the pipeline can make successful connections to the pool and storage. Select the
Load to Vantage object and then the
Azure Batch tab below. Click on
Test Connection. Fix any errors encountered.
Click on the
Settings tab and, again, click on
Test Connection. Fix any errors encountered.
Repeat this for the
Transform in Vantage object.
If the job still fails, log into the Windows virtual machine. Open a PowerShell (or command prompt) window and run the batch job,
tbuild_load.bat, manually. Fix any errors encountered. Then run the batch job,
bteq_transform.bat, manually. Fix any errors encountered.
Run the Data Factory job again.
If there are still errors, you can investigate them from Azure.
Open
Batch Service resource group >
Batch account >
Pools > [
Pool ID] >
Nodes >
Name. In your node folders, click on
workitems >
adfv2-[Pool ID] >
job-[n] >
[unique job identifier], where the
stderr.txt and
stdout.txt results of your job can be found.
To view our load process, click the
unique job identifier and view the
stdout.txt results.
To view our transform process, click the
unique job identifier and open the
wd directory, where you will find the
output.txt file we specified in
bteq_transform.bat.
Similarly, you can find the job run results on the Windows VM under
D:\batch\tasks\workitems\adfv2-[Pool ID]\job-[n]\[unique job identifier].
Summary
You can use Azure Data Factory with the Custom Activity feature to load and transform data in Teradata Vantage. We used the Azure Batch Service, created a Windows virtual machine compute pool, and installed Vantage client software to run our scripts against Vantage. We also used an Azure Blob Storage container to store the scripts for the Custom Activity pipeline job.