Teradata Vantage - Row-based Tables
Transcript
Welcome to TechBytes. My name is Tom Fastner with Teradata Labs, and in this brief video I am going to give you an overview of row-based table design options. With Teradata 14.0 we introduced columnar tables as an additional option in physical data modeling. Since then there have been 2 more flavors added, and I want to take this opportunity to discuss similarities and differences between those physical design options. This first part is a quick review of row-based table options, and the second part will discuss the 3 columnar-based options. After a quick review of the hash based data distribution that was there from the very beginning of Teradata, we will look at the 2 row-based options: tables using a Primary Index (PI), and tables that use No Primary Index (NoPI). This short discussion focuses on characteristics that are helpful explaining the fundamental differences between the 5 basic table layouts and will not discuss many other fundamental aspects. Specifically this will not include a discussion of Partitioned Primary Indexes (PPI). Row-based tables have been the only choices for a physical table layout for a long time and have recently been augmented by the columnar-based options. PI and NoPI tables still suit most use cases, specifically when the tables are small or have only a few columns, or when the queries are asking for most of the columns regardless of the table size. The data distribution on a MPP system is a fundamental aspect of the physical data model and influences the performance of operations performed on it. On Teradata the data is usually distributed across the AMPs based on the values in the defined Primary Index columns. It is a 2 step process where the PI column values are concatenated and run through a mathematical function to produce a 20-bit long bucket number, followed by a lookup in the hash map to determine which AMP actually owns the data. In this little example the function returns bucket #29 for the PI values “17, … , Aloha”, and this bucket is assigned to AMP #8 in the hash map. I want to point out implications for 3 operations that are important for the comparison of our options later on. 1st – As long as a request contains all the PI column values for a table, Teradata knows which AMP owns it – the request becomes a single-AMP operation. 2nd – If my SQL contains a GROUP BY and all PI columns are part of it, then all the aggregations can be calculated locally on the AMP. There is no data distribution required. And 3rd – When 2 tables have the same PI columns, they are called co-located. An equality-join between those 2 tables based on the PI columns can be done AMP-local. With that in mind let’s look at a row-based table with a Primary Index. Our example table used throughout the slide deck contains information on albums and artists. So when you have a row-based table, you pick a primary index – a column or a set of columns – that’s used to generate something called the row hash for every row. The row hash contains the hash bucket number as discussed on the previous slides. Other information like partition number, Row#, and some parity bits are stored with a row as well but are not of concern for this discussion. As data is being loaded, this 32-bit row hash is stored with each row automatically. The first half of the row hash is used to find the AMP who owns the data, and the full row hash is used to sort the rows within the AMP. A later lookup of the row specifying the Primay Index column values will use the same information to locate the row on the AMP. So what does this mean to you when you are dealing with row-based tables with a primary index? It allows for single AMP access when the values of all primary index columns are specified. So if you only want to look for one row, you can reach into the table directly, generate the row hash and pick out the row you want without having to read the full table – you know the AMP, and within the AMP the data is sorted by row hash. For an aggregation the data must be redistributed by the GROUP BY columns to perform the operation. However if all PI columns are included in the group by then a costly redistribution is not necessary and the group by operation can be performed locally on the AMP. For a join, one or both tables must be distributed based on the joining columns in order to perform the join. But if both tables have the exact same PI, and the equality-join includes all PI columns - and potentially others – the join can be performed AMP local avoiding the cost of moving data across AMPs. Here you see the DDL for a row-based table with a primary index. You can see this is how you specify a primary index designation. This has been true since the first day of Teradata. Next we are moving on to a row-based table with no primary index, also known as a NoPI table. It has a different set of characteristics and a different set of considerations for when and where you might want to use this instead of the row-based table with primary index. The first thing to notice here is when you load a row-based table with no primary index designation, there’s no row hash generated by the system. The rows are distributed across the AMPs in a round-robin fashion – basically like shuffling deck of cards across all players. Since there’s no row hash, there’s no row hash to sort at the end of the load operation. What does this mean from the considerations point of view? A NoPI table will reduce skew in intermediate tables that have no natural Primary Index. This in fact is actually why the NoPI feature was created to begin with. A poor selection of primary index columns can cause the data not being evenly distributed across all the AMPs, leading to skew related performance issues. Very often this happened because users forgot to specify a PI and that defaulted to the first column in the table. NoPI tables eliminate all the issues associated with defaulting of the primary index to the first column. Unlike the row-based tables with PI, when you load a table with no primary index there’s no sorting of rows at the end because there’s no row hash. You might see a 25% to 30% performance gain in loading a NoPI table over loading row-based tables with a primary index. But you will be missing the row-hash in cases where it can help directly with performance. There is no single AMP operation support, and no co-location for Joins when using NoPI tables. The facts that NOPI tables do not require the selection of a Primary Index and demonstrate great load performance make it a good choice for loading unknown data. Once you understand the data demographics and use cases, you are in a better position to chose the appropriate physical table layout from the options discussed. Here we have the DDL for a row-based table with no primary index. And here’s the no primary index designation. So we have quickly gone through key characteristics and considerations for row-based tables – row-based table with a primary index and row-based table with no primary index. These characteristics and considerations are all summarized in this table for a direct comparison. In part II of the row and column partitioned tables TechBytes, we will complete this summary as we go over the three column-based tables.
Did you know Vantage offers two physical implementation options for row-based tables, in addition to Columnar tables? Learn the basics of the row-based tables and their implementation considerations. Also find out how Vantage distributes data to achieve Massively Parallel Processing (MPP) using hash-based data distribution.