DevelopTime: The power of T-SQL's APPLY operator

APPLY, in both its inner (CROSS APPLY) and outer (OUTER APPLY) forms, allows a correlated sub-query or table-valued function to be part of the FROM clause. This is different to a regular JOIN, which is between two sets that are independent of each other, but able to be joined according to some predicate that lives in the ON clause. Using APPLY, that restriction disappears, and the predicate can be built into the sub-query (or TVF) itself.
These two queries are exactly the same. They have identical execution plans, and produce the same result.

SELECT p.Name, s.Name, s.ProductCategoryID
FROM Production.Product AS p
JOIN Production.ProductSubcategory AS s
ON s.ProductSubcategoryID = p.ProductSubcategoryID;

SELECT p.Name, s.Name, s.ProductCategoryID
FROM Production.Product AS p
CROSS APPLY (
SELECT subcat.Name, subcat.ProductCategoryID
FROM Production.ProductSubcategory AS subcat
WHERE subcat.ProductSubcategoryID = p.ProductSubcategoryID) AS s;

You may look at this and figure that the first option is entirely adequate, and see no reason to switch. There isn’t a reason to switch for most situations, and I don’t actually want you to change the way you write regular joins. But I do want you to consider the power of having a predicate as part of a correlated table-sub-query.
One common scenario is being able to hook easily into the first row in a particular join. It’s easy – you just add TOP (1) to the sub-query, and put an ORDER BY clause in there. I do this happily in BI solutions, where I want to grab a record from a dimension table that is current at the time that an event occurred.

OUTER APPLY (
SELECT TOP (1) d.DimKey
FROM dbo.DimensionTable AS d
WHERE d.BusinessKey = e.DimensionCode
AND d.StartDate < e.EventDate
ORDER BY d.StartDate DESC) AS d

I know, you can do this very easily using a correlated sub-query in the SELECT clause, since the sub-query returns a single value (yes, it’s scalar). But what if we needed to return multiple columns from a sub-query, or needed to return the best three records for a particular category. Here I’m finding the three best-selling products for each subcategory.

SELECT s.Name, TopProds.*
FROM Production.ProductSubcategory AS s
OUTER APPLY (
SELECT TOP (3) p.Name, SUM(o.OrderQty) AS NumOrdered
FROM Production.Product AS p
JOIN Sales.SalesOrderDetail AS o
ON o.ProductID = p.ProductID
WHERE p.ProductSubcategoryID = s.ProductSubcategoryID
GROUP BY p.ProductID, p.Name
ORDER BY SUM(o.OrderQty) DESC) as TopProds
;

This could be done with a Common Table Expression (CTE) and the ROW_NUMBER() functdion, but that solution isn’t nearly as elegant as this. This is the kind of solution that was designed for APPLY. One of those times when a problem and solution are nicely matched, made for each other.
Here’s an example that achieves the same using a CTE:

WITH OrderedProds AS
(
SELECT p.ProductSubcategoryID,
p.Name,
SUM(o.OrderQty) AS NumOrdered,
ROW_NUMBER() OVER (PARTITION BY p.ProductSubcategoryID
ORDER BY SUM(o.OrderQty) DESC) AS ProdOrder
FROM Production.Product AS p
JOIN Sales.SalesOrderDetail AS o
ON o.ProductID = p.ProductID
GROUP BY p.ProductID, p.Name, p.ProductSubcategoryID
)
SELECT s.Name, o.Name, o.NumOrdered
FROM Production.ProductSubcategory s
LEFT JOIN OrderedProds o
ON o.ProductSubcategoryID = s.ProductSubcategoryID
WHERE o.ProdOrder <= 3
;

I’m sure you’ll appreciate that this feels a lot more contrived than using APPLY. The Query Optimizer should really break these two queries down to exactly the same, recognising that joining on the PARTITION BY column(s) is a correlation mechanism. It doesn’t do that yet, but one day, I’m sure it might.
(The CTE query runs quicker than the APPLY equivalent in this particular case on the AdventureWorks database, but it shouldn’t need to. The behaviour largely depends on a number of other factors, in the same way that multiple Seeks can sometimes be faster than a Scan, but not always.)
Let’s think about views for a moment. They’re sometimes referred to as “Virtual Tables” (yes, SQL Books Online, I’m looking at you!)
People who have heard my ~~preaching~~ presenting about writing good T-SQL will have heard me refer to views as “stored sub-queries”. I reject the notion that they’re “virtual tables”. That’s rubbish – they’re treated in by the Query Optimizer as a stored sub-query, and should be referred to as such. The natural extension of this is that an inline table-valued function is a parameterised view, or stored correlated table sub-query. Basically the same as what we see here.
So let’s create it.

CREATE FUNCTION dbo.Top3ProductsBySubcategory(@subcat int) RETURNS TABLE
AS RETURN(
SELECT TOP (3) p.Name, SUM(o.OrderQty) AS NumOrdered
FROM Production.Product AS p
JOIN Sales.SalesOrderDetail AS o
ON o.ProductID = p.ProductID
WHERE p.ProductSubcategoryID = @subcat
GROUP BY p.ProductID, p.Name
ORDER BY SUM(o.OrderQty) DESC
);

Notice I’ve simply changed the correlation hook (s.ProductSubcategoryID) into the parameter. Also note that I’m NOT using BEGIN and END, which would render the function procedural and non-simplifiable. To hear me go on about that in more detail, have a look at the video at http://bit.ly/SimpleRob – a URL I have to thank Farmer Paul for, which will direct you to a presentation I did at SQLBits V, in late 2009. (I figure what Paul meant when he created “Simple Rob” is that I present simple truths that have a profound importance, and not suggesting that I’m ‘simple’ at all. He’s not suggesting that my IQ is low or anything like that. He wouldn’t do that, he’s far too nice a bloke, and my friend. I mean, I know he’s a Kiwi, and Kiwis have a friendly rivalry with Aussies – I think they’re jealous of our weather – but he wouldn’t pick on me like that. Besides, Paul and I are both English originally, he just moved to NZ instead of Australia. Maybe he just enjoys colder weather.)
So now my query can become:

SELECT s.Name, TopProds.*
FROM Production.ProductSubcategory AS s
OUTER APPLY dbo.Top3ProductsBySubcategory(s.ProductSubcategoryID) AS TopProds
;

This executes in exactly the same way, but we’ve contained our logic nicely by storing the sub-query in an inline TVF. Notice the join condition – it’s buried inside the function, abstracted away, and yet completely simplifiable by the Query Optimizer. It’s still very much a join – if you were to look at the execution plan, you’d see a Nested Loop (Left Outer Join), with a Seek Predicate being used further down. (I’ve created indexes to help, of course.) You should be able to click on the image to see it in more detail.

I really could keep going about APPLY, and talk about a bunch of other uses for it, but I’ll leave that for another time.
What I want to leave you with is the reminder that APPLY is just a more powerful way of doing a JOIN, because the set that you’re joining to doesn’t need to exist yet. You may have heard people talk about APPLY being bad, in the same breath as people will tell you that a correlated sub-query is bad. Unfortunately, there are times when this is true. But there are times when any join is bad, it largely depends on what indexes you have in place, and what kind of simplification the Query Optimizer can do in working out your plan. If you have considered issues such as SARGability and Residual Predicates, then you really shouldn’t have much of a problem.
Of course you understand joining to Tables, and to Views. If you aren’t used to joining to sub-queries, then try expanding a view yourself. Then think about what happens if you use APPLY and sneak the join condition into a predicate within the sub-query. All of a sudden you’ll find yourself able to write richer queries, and have a much stronger T-SQL arsenal.

------------------------------------------------------------------------------------------
The article shows how to write a function that returns the top three countries per category. That's always been something that was difficult to do in SQL so I was curious about the approach. The article started out well but I was very unhappy by the end. It's just soooo much easier to do this in SQL Server 2005 using the new CROSS APPLY clause in Transact-SQL. So I'm going to write a query to return the top 3 orders for each customer and I'm going to do it in about 10 lines of SQL. (UPDATE: An alert reader found an even better approach!)

First please take a second to read the MSDN article. Pay special attention to how much C# code is required for the solution and how flexible it is.Note: I also want to mention here that I started out trying to replicate their solution. I tried to install the MSI but that failed with an invalid characters message. I tried to recreate the CLR code but since he mostly posted snippets and had three versions I finally gave up trying to get that to work. One thing I really wanted to do was compare performance on identical data. Unfortunately I wasn't able to do that. My example will use the AdventureWorks database

Let's start at the TOP

Along the way to making this solution work we'll discuss a few new features of SQL Server 2005. The first is that the TOP clause can now take a variable:

DECLARE @Rows INT
SET @Rows = 10

SELECT TOP ( @Rows ) *
FROM Sales.SalesOrderHeader

This will return the top 10 rows from SalesOrderHeader. You can also replace @Rows with anything that evaluates to a number. The following query looks odd but runs just fine:

SELECT TOP ( 
 SELECT COUNT(*)
 FROM Sales.Customer
   ) *
FROM Sales.SalesOrderHeader

There are 19,185 rows in the Customer table and this query returns the top 19,185 rows from SalesOrderHeader. You can also use the TOP clause for INSERT, UPDATE and DELETE statements. If you wanted to DELETE in batches of 500 you can now do that using the TOP clause.

My Function

Next we need a function to return the TOP X rows from SalesOrderHeader based on the total sales amount. That function looks like this:

CREATE FUNCTION dbo.fn_GetTopOrders(@custid AS int, @n AS INT)
  RETURNS TABLE
AS
RETURN
  SELECT TOP(@n) *
  FROM Sales.SalesOrderHeader
  WHERE CustomerID = @custid
  ORDER BY TotalDue DESC
GO

Notice that it accepts the number of orders to return as a parameter. Also notice that I'm using SELECT * all over the place. I really encourage you to explicitly list out the columns when you write this for real. If you call this with a CustomerID and a number of rows it will return that many rows ordered by the total amount of the order in descending order. Also notice that there is an ORDER BY clause in this function. We'll talk about that in a minute.

Apply the APPLY Clause

The real magic happens when you use SQL Server 2005's new APPLY clause. The APPLY clause let's you join a table to a table-valued-function. That let's you write a query like this:

SELECT  C.CustomerID, 
 O.SalesOrderID,
 O.TotalDue
FROM 
 AdventureWorks.Sales.Customer AS C
CROSS APPLY
 AdventureWorks.dbo.fn_GetTopOrders(C.CustomerID, 3) AS O
ORDER BY 
 CustomerID ASC, TotalDue DESC

which results in this...

CustomerID  SalesOrderID TotalDue
----------- ------------ ---------------------
1           45283        37643.1378
1           46042        34722.9906
1           44501        26128.8674
2           46976        10184.0774
2           47997        5469.5941
2           57044        4537.8484
3           53616        92196.9738
3           47439        78578.9054
3           48378        56574.3871
4           47658        132199.8023
. . .

The APPLY clause acts like a JOIN without the ON clause comes in two flavors: CROSS and OUTER. The OUTER APPLY clause returns all the rows on the left side (Customers) whether they return any rows in the table-valued-function or not. The columns that the table-valued-function returns are null if no rows are returned. The CROSS APPLY only returns rows from the left side (Customers) if the table-valued-function returns rows.
Notice that I'm just passing in the CustomerID to the function. It returns the TOP 3 rows based on the amount of the order. Since I'm using CROSS APPLY a customer without orders won't appear in the list. I can also pass in a number other than 3 to easily return a different number of orders per customer. Even better I can pass in a different number of orders for each customer. So I could list the top 5 orders for one type of customer but the top 10 for another type of customer. How cool is that?!?
And it gets even better. Remember the function has an ORDER BY in it. It will always return the top orders based on the amount. However you can change the ORDER BY clause in the query that calls the function to display those rows in whatever order you want. You could easily display the top 3 orders in ascending order instead.

Performance and Conclusion

Unfortunately I wasn't able to compare the performance of my solution to MSDN's. The query plan from my solution doesn't look that bad. (And did I mention that it showed me a full query plan for the query with the UDF rather than just a UDF step? Sweet!) If anyone gets their solution running send me the code and I'll update the article.
I also think their solution returns the result as a comma separated value. Now that would be a great use for the CLR. You could easily wrap that around the results of this function for a fast, effecient procedure. And it happens to be the next article I'm working on. :) If you want to see a great article on using the CLR inside SQL Server I'd start with Using CLR Integration in SQL Server 2005. The code is from an earlier beta but it gives you a great idea of when to use CLR code.
In summary I think that 5-10 lines of SQL is much simpler than 50+ lines of C# (or VB) for this type of solution. The article provides a lot of good information on the CLR inside SQL Server. I just don't think this is the best place to use it. One of my biggest fears is that people will use the CLR incorrectly and cause performance problems. This is the first real case I've seen of this. Is this really how Microsoft wants us to use the CLR?

UPDATE

I was curious when I posted this how long it would take to find an even better solution. It didn't take long. Arnold Fribble posted it in the forums. The syntax needed a little tweaking but it appears you can do this in a single query. And how did I forget the PARTITION BY clause of the ROW_NUMBER function. Doh! Here's the cleaned up query:

SELECT 
 CustomerID,
 SalesOrderID,
 TotalDue
FROM (
 SELECT
  O.CustomerID,
  O.SalesOrderID, 
  O.TotalDue, 
  ROW_NUMBER = ROW_NUMBER() OVER (
   PARTITION BY O.CustomerID 
   ORDER BY O.TotalDue DESC)
 FROM 
  Sales.SalesOrderHeader AS O
 ) AS d
WHERE
 d.ROW_NUMBER <= 3

You have to use a derived table (or a Common Table Expression) since the row number function can't be used in the WHERE clause of the query where it's defined. Now I'll have to write an article on the new ranking functions.

see also
http://www.developer.com/db/article.php/3798361/Using-T-SQL-CROSS-APPLY-and-OUTER-APPLY.htm

DevelopTime

Thursday, March 8, 2012

The power of T-SQL's APPLY operator

Let's start at the TOP

My Function

Apply the APPLY Clause

Performance and Conclusion

UPDATE

No comments:

Post a Comment