Thursday, September 3, 2009

Understanding INNER JOIN in detail


The objective of this article is dig into detail of how INNER joins produce the result set. The INNER join is the common join type used in many cases. Even though it is used very often I have seen most people are not certain how it produces the result set when changing the join condition.




Before go into the article contents, I need to create the two tables below to illustrate the examples used here.
Table 1
CREATE TABLE #InnerJoinTest1    
(    
 InnerJoinTest1_PK INT IDENTITY(1,1)     
)    
GO
Table 2
CREATE TABLE #InnerJoinTest2    
(    
 InnerJoinTest2_PK int    
)    
GO
Since the tables above are created in tempdb, the database currently used in your query editor will not matter. Then I use the queries below to populate the created tables with some sample data.
To keep the explanation easy I used only 5 records in each table. However once you understand the theory behind it you can map it to larger tables as well. The concept is similar despite of the number records in the tables.
Populate table 1
WHILE SCOPE_IDENTITY()<5>
INSERT INTO #InnerJoinTest1    
DEFAULT VALUES
Populate table 2
INSERT INTO #InnerJoinTest2    
SELECT * FROM #InnerJoinTest1
Once you have executed all the queries above successfully, everything is set for our discussion.
The Scenario
I will present you all the queries which involve an INNER JOIN in advance so that you can manipulate and decide how many records each query will return. Finally you can compare your answers with the result set that actual query execution produces by the SQL Server.
Query 1
SELECT a.*,b.* FROM #InnerJoinTest1 a    
INNER JOIN #InnerJoinTest2 b    
ON a.InnerJoinTest1_PK=b.InnerJoinTest2_PK
Query 2
SELECT a.*,b.* FROM #InnerJoinTest1 a    
INNER JOIN #InnerJoinTest2 b    
ON a.InnerJoinTest1_PK>b.InnerJoinTest2_PK
Query 3
SELECT a.*,b.* FROM #InnerJoinTest1 a    
INNER JOIN #InnerJoinTest2 b    
ON a.InnerJoinTest1_PK>=b.InnerJoinTest2_PK
Query 4
SELECT b.*,a.* FROM #InnerJoinTest1 a    
INNER JOIN #InnerJoinTest2 b    
ON a.InnerJoinTest1_PK
ORDER BY a.InnerJoinTest1_PK
Query 5
SELECT b.*,a.* FROM #InnerJoinTest1 a    
INNER JOIN #InnerJoinTest2 b    
ON a.InnerJoinTest1_PK<=b.InnerJoinTest2_PK    
ORDER BY a.InnerJoinTest1_PK
Query 6
SELECT a.*,b.* FROM #InnerJoinTest1 a    
INNER JOIN #InnerJoinTest2 b    
ON a.InnerJoinTest1_PK<>b.InnerJoinTest2_PK
Take a piece of paper and manually determine the result set of each query will return.
Cartesian Product
First of all, I would like to mention that I will consider table 1 as set 1 and table 2 as set 2. (according to the set theory).
Let's produce the Cartesian product of the two sets. The Cartesian product of the two sets is given below.

You can produce the Cartesian product when you join every row of one table to every row of another table. You can also get one by joining every row of a table to every row of itself.
It is very important to understand how the result set is produced for the Cartesian product from the queries above. The Cartesian product will give you the maximum possible number of combinations that can be created from the given two sets. The number of records in the Cartesian product can be computed by using the following formula.
Number of records = No.of records of table 1 X No.of record of table2
In our case it is 25. (5x5)
Rationalization
Let's look at the first query. This kind of INNER JOIN is the most popular and often you may be able to determine the result set it produces. This INNER JOIN with the equal condition will give you the matching records of both sets. If you look at the above Cartesian product, how many matching records you will be able to find?

I have highlighted the matching instances in the above Cartesian product. So that the first query will give you the highlighted records as the output.
Then consider the second query.
In this case you need to find all the possible record combinations where the first table value is greater than the second table values.
The image below shows the Cartesian product and I have highlighted the record combination for query 2.

Execute query 2 and compare the result set with the highlighted records shown above. The actual query execution result should match with the above highlighted result set.
Applying the same theory for rest of the queries, you should be able to figure out the result set of each one.
Let's take query 3. Again I'm using the Cartesian product to determine the result set. This query will produce the records where InnerJoinTest1_PK is greater than or equal to the InnerJoinTest2_PK.

It is quite easier now to determine the result set if you consider the Cartesian product of two tables.
Let's look at the next two queries. (Query 4 &5) The Cartesian product below is shown the result set of both queries 4 and 5.
The green highlighted records are the output of query 4 where as query 5 will give you the combination of green highlighted and yellow highlighted records. Because the difference of query 4 and 5 is the '=' operator. So that the query 5 will output equal values of both columns in addition to the output of the query 4. The query 4 will produce the records where InnerJoinTest1_PK is less than the InnerJoinTest2_PK of the table 2.


The order of results from the actual execution of this query may be different with the Cartesian product. But it is not really a matter for our discussion. You use simply an ORDER BY clause to order the result set as you desire.
Finally consider the last query. (Query 6) This query will output the records where both columns are not equal. Again look at out most important Cartesian product shown below.
The green highlighted records show where both columns are not equal. Theoretically this result set is similar to Cartesian product - intersection. The intersection is the INNER JOIN result with equal operator. (Output of the query 1)

Conclusion
I assume you all got the better understanding over the different types of INNER JOIN presented in this article and how the result set is determined without actually executing the query. This understanding is very important when it comes to development or the administration work. The article is open for further discussions.

No comments:

Post a Comment

How to interpret Disk Latency

I was analyzing IO stats in one of our SQL Servers and noticed that IO latency (Read / Write) are very high. As a rule of thumb, we know tha...