Teradata Vantage - MAPS 3: Query Behavior
Transcript
I'm Carrie Ballinger. I am a Senior Technologist in the Technology & Innovation Office here at Teradata. Thank you for joining me on this TechByte. This is the third in a series of three TechBytes on the MAPS feature. The MAPS feature is new in Teradata 16.10, and what I want to focus on in this session is how queries behave when multiple maps exist in an environment.

Now, we touched on in our first and second segments that two key types of maps are available to you in the MAPS feature. The contiguous map covers a contiguous range of AMPs, always starting with AMP 0. When you expand a system, you are given a second contiguous map, but the first one doesn't go away. You have the option of leaving tables in that first contiguous map, bringing the system up after the expansion, and then moving those tables. The second type of map we discussed was the sparse map, which is always defined within a contiguous map and is the means of moving a table to a single AMP or to a small number of AMPs. You get two sparse maps by default: a single-AMP sparse map and a multi-AMP sparse map. As we discussed in our last segment, you can move multiple small tables to either one of the sparse maps.

The Optimizer understands maps. Once you are on 16.10, the Optimizer will detect which map owns a particular table, and it will know how many AMPs are in that map and how the data is spread across it. It is important that the Optimizer have this knowledge and insight, because there will be times when you issue a query and the tables in that query reside on different maps. For instance, look at the graphic on this slide: here we have a Product table on the old contiguous map, TD_Map1, which covers only 4 AMPs, and we are going to join it to the Sales table in a query. The Sales table, on the other hand, is spread across all of the AMPs in the second, larger contiguous map, TD_Map2. So the Optimizer needs to see this and needs to understand how to build a plan that makes the join work when different maps are involved in the activity.

Let's take a step back: you've just expanded your system, and you have not yet moved any tables onto the new AMPs. The Optimizer at this point will know that your tables are only in the old TD_Map1, but it will consider how it can take advantage of the new processing power in TD_Map2. Here's an example we discovered in our own experimentation with the MAPS feature. We had all of our tables in TD_Map1. When the Optimizer joined two of those tables, it read the tables in the old map but redistributed the spool files across all of the AMPs in the configuration, and then it performed the join across all of the AMPs in the configuration. So the data resided on a subset of the AMPs, but the Optimizer found that it was more cost effective, and produced a faster plan, to redistribute the data across all the AMPs first and join it there.

That particular plan was the result of the Optimizer doing some costing, and it will make that determination on a case-by-case basis. I will tell you one thing: if you have a very large table, an extremely large table, a very big fact table for example, that has not yet been moved to the new TD_Map2, the Optimizer will probably do most of your joins to that table on the smaller number of AMPs, because it wants to keep that big table AMP local, which I'm sure you can understand. But if you were joining moderate or similarly sized tables, it will consider, and in some cases use, that extra processing power.
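To make the idea of map ownership concrete, here is a minimal SQL sketch using the Product and Sales tables from the slide. The database and column names are illustrative, and the dictionary query assumes the MapName column that DBC.TablesV exposes on 16.10 and later; check your own dictionary views before relying on it.

```sql
-- The owning map is declared as a table-level option.
-- Product stays in the old 4-AMP contiguous map ...
CREATE TABLE Sales_DB.Product, MAP = TD_Map1
( product_key  INTEGER NOT NULL,
  product_name VARCHAR(100)
)
PRIMARY INDEX ( product_key );

-- ... while Sales lives in the larger post-expansion map.
CREATE TABLE Sales_DB.Sales, MAP = TD_Map2
( sale_id     INTEGER NOT NULL,
  product_key INTEGER,
  amount      DECIMAL(12,2)
)
PRIMARY INDEX ( sale_id );

-- The Optimizer reads this ownership from the dictionary; you can
-- see the same information yourself (assumes the 16.10+ MapName
-- column in DBC.TablesV).
SELECT TableName, MapName
FROM   DBC.TablesV
WHERE  DatabaseName = 'Sales_DB';
```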
So now you've started moving your tables into TD_Map2. There may be a period of time, perhaps several weeks, perhaps several months, where a certain number of queries are joining tables that are in different maps: you may have one of the tables in TD_Map1 and another in TD_Map2. The Optimizer has to make sure that the rows to be joined are in the same map before it can perform that join; it has to use the same hash map for the spool and the table being joined to. So it's very possible that the Optimizer will read a table in one map, build a spool file, and write that spool file in another map, much like the example I showed you on the previous slide. The Optimizer uses a costing algorithm to determine the most efficient way to get the rows of the two tables being joined onto a single map so the join can be performed efficiently. You might see something like this happening: step one in a query plan reads the Product table, which is in TD_Map1, creates a spool file, and redistributes that spool file to all of the AMPs in TD_Map2 in preparation for a join to the second table, the Orders table. So again, the spool and the table being joined have to be on the same map in order to make the join happen.

Some of the things the Optimizer will consider are the same kinds of things it considered in pre-MAPS plans: duplicating small tables, or redistributing one or both tables. The map it chooses to redistribute to will be determined by the costing algorithms. The nice thing, though, is that you can see this in the EXPLAIN text. It will very clearly tell you the source map and the destination map if more than one map is involved in a given step. So there will be no confusion in your mind about how the Optimizer has been making its decisions when it comes to situations like the one I'm describing on this slide.

You need to be a little careful when you start moving tables if those tables have primary index join relationships; in other words, they share the same primary index domain. For instance, take an Orders table and an OrderItem table, both of which use Orderkey as their primary index: one uses it as a NUPI, and the other has a unique primary index. When both of those tables resided in TD_Map1, the associated rows, the rows that share the same primary index value, ended up on the same AMP. So when we joined rows between Orders and OrderItem, it was an AMP-local activity. As soon as we move one of those tables into TD_Map2, and in this example we are moving OrderItem, the larger one, its rows get rehashed and placed on different AMPs. The hash is based on the same Orderkey value, but it picks a different AMP now because it's a different hash map. So in our example, the Orderkey value of 55 for the OrderItem table is now on AMP 17, and we can no longer do an AMP-local join between these two tables. We have introduced the need for the Optimizer to do either a redistribution or a duplication, and that's exactly what will happen here; it will do it in the most efficient way possible. However, if you are moving tables, it is usually a good idea to identify the tables that have this kind of relationship and move them in the same Mover job, as the sketch below illustrates.
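Here is a minimal sketch of moving such a pair by hand, assuming the ALTER TABLE ... MAP syntax that came in with MAPS in 16.10; in practice you would usually let a Viewpoint Mover job handle this, and the Orders/OrderItem names and columns are the illustrative ones from the example above.

```sql
-- Move both sides of the primary-index join relationship together,
-- so that Orders and OrderItem rows sharing an Orderkey value hash
-- to the same AMP again in the new map.
ALTER TABLE Sales_DB.Orders,    MAP = TD_Map2;
ALTER TABLE Sales_DB.OrderItem, MAP = TD_Map2;

-- Verify: with both tables in TD_Map2, the EXPLAIN text for this
-- join should show no cross-map redistribution or duplication step.
EXPLAIN
SELECT o.Orderkey, i.ItemAmount
FROM   Sales_DB.Orders o
JOIN   Sales_DB.OrderItem i
  ON   o.Orderkey = i.Orderkey;
```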
The Viewpoint Maps Manager portlet will help you identify such situations. It takes as input the step-level output from the DBQL step table, sees when those primary index joins are done, and feeds that information into its analysis. It will then recommend moving those types of tables in the same Mover job.

I want to suggest that you make it a goal to get all of your tables moved. The period of time in which you are issuing queries where some of the tables are in one map and some are in another, and we are talking about contiguous maps here, should be seen as temporary. Even though the Optimizer is well equipped to make good costing decisions, you are never going to get performance as good as you will get when all of your tables have been moved to the larger contiguous map. That's when the greatest degree of parallelism can be applied to all the operations taking place in that map. So that should be your goal. It will also be easier for the Optimizer, simpler for you to understand what's going on, and better for performance. What you want to avoid is going through several expansion cycles and ending up with several contiguous maps, each with its own group of tables. Once you get all of your tables moved to the most current contiguous map, you can drop that old contiguous map and keep your environment clean and simple. And I think that's a good formula for success when you are using the MAPS feature.

So in conclusion, I want to remind you that the Optimizer is map-aware; you don't have to do the thinking for the Optimizer here at all. From my experience using MAPS, I really want to recommend that you move your very large tables first. Get them moved into that large contiguous map, and they will act like a magnet: they will pull the other tables into the large map for most of the joins, and things will perform better. Be cautious about your primary index join queries. The Optimizer will still do the joins as efficiently as possible, but you'll lose the AMP-local join capability if one of the tables is moved and the other is not. And again, the best performance is when you get everything moved. I wish you the best of luck in using MAPS. Thank you very much.
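If you want to track progress toward that goal, a dictionary query along these lines can list what still lives in the old map, largest tables first, so the big fact tables get moved ahead of the rest. This is a sketch that assumes the MapName column in DBC.TablesV on 16.10 and later, with TD_Map1 as the illustrative old map.

```sql
-- Tables still owned by the old contiguous map, biggest first.
-- DBC.TableSizeV returns one row per table per AMP, so CurrentPerm
-- is summed to get each table's total size.
SELECT t.DatabaseName,
       t.TableName,
       SUM(s.CurrentPerm) AS TotalPermBytes
FROM   DBC.TablesV    t
JOIN   DBC.TableSizeV s
  ON   s.DataBaseName = t.DatabaseName
 AND   s.TableName    = t.TableName
WHERE  t.MapName   = 'TD_Map1'
  AND  t.TableKind = 'T'        -- base tables only
GROUP BY 1, 2
ORDER BY TotalPermBytes DESC;
```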
See how the Vantage Optimizer works with the MAPS feature to produce the best query plans after a system expansion. Plus, find out some best-practice tips for maximizing query performance while expanding your Vantage system.