Sunday, September 27, 2009

Query best practices

In this blog I'm stating several best practices when writing SQL queries in MS SQL Server. They are;
  • Choose the most appropriate data type
  • Write correctly formed queries
  • Return only the columns/rows needed
  • Fully qualified object names
  • Avoid long actions in triggers
  • Avoid expensive operators such as NOT LIKE
  • Minimize cursor use
  • Limits query and index hints
  • Avoid implicit or explicit functions in WHERE clause

Monday, September 21, 2009

Name begins with sp_

If your stored procedures names begin with the sp_ prefix and is not in the master database, you will see SP:CacheMiss before the cache hit for each execution even if the stored procedure call is owner qualified. The reason for this is that the sp_ prefix tells SQL Server that the stored procedure is a system stored procedure. System stored procedures have different name resolution rules. With system stored procedures, SQL Server will look first in the master database if the call is not database qualified. Then will look in the current database. To avoid this additional work, do not use stored procedure names that begin with sp_

Saturday, September 19, 2009

Loop through all the databases in a server by using SSIS

As a database developer or administrator you may need to loop through all the databases in a server and perform various tasks for each database. There are various methods you can perform this in T-SQL. However here I’m presenting a simple solution by using an SSIS package.Let me go through the solution and explain how it is implemented.

Below are the steps;


  1. Start SQL Server Business Intelligence Development Studio.
  2. Create an Integration Project. Name the directory and project name.
  3. Rename the package to EachDatabase.dtsx by using solution explorer.

  1. Create an OLE DB connection in Connection Manager to the target server. Set the RetainSameConnection property to True in connection manager's property window.
  2. Add two Execute SQL Tasks, Script Task and For Loop Container to the Control Flow section.
  3. Connect them as shown below;

  1. I have given meaningful names for each component.
  2. You may be getting errors and warnings in Execute SQL Tasks and ignore them for the movement.
  3. You need to add four variables to the package. Scope of the variables will be different according to the place where they use.
  4. See the diagram below for the variables.

  1. Max_ID and Min_ID variables have the package scope while “database_name” and “loop” are having For Loop Container scope.
  2. Now let’s configure Execute SQL Tasks, For Loop Container and Script Task to get the final output.
Configure “Get all databases” Execute SQL Task



Change the properties mentioned below.
  1. Name, Description (You can give any name)
  2. ResultSet: Single row
  3. ConnectionType: OLE DB
  4. Connection: Select the connection you’ve created in the beginning
  5. SQLSourceType: Direct input
  6. SQL Statement: Add the SQL statement mentioned below;


  1. Set the variables “Result Set”. This is the place where you initialize Max_ID and Min_ID variables. Max_ID and Min_ID are not the database id rather it is an identity value of the temp table.



Configure For Loop Container

  1. Double click on the container and you will get the For Loop Editor as shown below;



  1. Set the For Loop Properties like above.

Configure “Get database name” Execute SQL Task

  1. Double click on “Get database name” Execute SQL Task.
  2. Set the property values;
    1. ResultSet: Single row
    2. ConnectionType: OLE DB
    3. Connection: (The connection you’ve created)
    4. SQLSourceType: Direct input
    5. SQLStatement: [select dbname from #databases where id=?]

  1. Set the parameter mapping. Variable “loop” will be the input and by using that value you get the database name.

  1. Setting Result Set. Assigning database name to the variable, “User::database_name



Configure “Display database name” Script Task

  1. Double click on the Script Task.
  2. Select Script tab in the left pan.
  3. Enter User::database_name in ReadWriteVariables section.
  4. Click on Design Script button.
  5. Add the VB.NET code mentioned below;

Dim db_name As String

db_name = Dts.Variables("User::database_name").Value.ToString()

MsgBox(db_name)

  1. Save and exit.

Now the package is all set for the final execution. Start the package and see its execution. Database name will be displayed in a message box and when you click on “OK” the next one will be displayed.

The advantage of this solution is to define the databases which you are going to loop through from the beginning.


I have used SQL Server 2005 BIDS (Business Intelligence Development Manager) to demonstrate the solution.

Wednesday, September 9, 2009

More about ORDER BY clause


Every one knows the purpose and the use of ORDER BY clause in SQL Server, but most may not know how ORDER BY works, in-side SQL Server. The implementation of T-SQL is based on set theory. Each table is considered as a set. Let’s take an example of Customer and Order tables. They are actually considered as two sets. The records of the table are identical to members or elements in sets. The T-SQL statements will not work as sequential language execution. Consider the T-SQL statement below.


use AdventureWorks
GO

SELECT TOP 100 SH.CustomerID,SH.SalesOrderNumber,SD.OrderQty,SD.UnitPrice FROM Sales.SalesOrderHeader SH
INNER JOIN Sales.SalesOrderDetail SD
ON SH.SalesOrderID=SD.SalesOrderID
ORDER BY SH.CustomerID


When you execute the query above, it will not execute sequentially within the SQL Server. Even though the SELECT statement is appeared first it will not execute as the first statement. Instead it will execute at a later part of the query (The execution order of a T-SQL query is beyond the scope of the article). During execution, T-SQL query will go through several virtual tables. The final result set of the query also displays as a virtual table.



But when you include the ORDER BY clause in your query the final result will not be a table, instead it returns as a cursor output. Because, a set does not have predetermined order of its members. It is a logical collection of members and the order of the members should not matter. In other words you can not perform an ORDER BY in a set simply by using set concepts. This is the reason the SQL Server uses cursors for this operation. ORDER BY is a costly operation to SQL Server (The use of cursors is anyway a costly operation).


Consider the T-SQL query below.



--Query1
USE AdventureWorks
GO

SELECT SalesOrderID,OrderQty,ProductID,UnitPrice FROM Sales.SalesOrderDetail
ORDER BY SalesOrderID
GO


Look at the execution plan of this query.



You may wonder why the execution plan does not contain the SORT operator. The obvious reason is SalesOrderID is the clustered index of the table, Sales.SalesOrderDetail which means the data is physically ordered by using the key of SalesOrederID column. So there will not be any additional cost for this sorting.

Let’s look at the query below;


--Query2

USE AdventureWorks
GO

SELECT SalesOrderID,OrderQty,ProductID,UnitPrice FROM Sales.SalesOrderDetail
ORDER BY UnitPrice
GO
Look at the execution plan of this query.



Now you can see the SORTing is happening here and it consumes 71% percent from the total cost of the query.
Consider this query;



--Query3
USE AdventureWorks
GO

SELECT SalesOrderID,OrderQty,ProductID,UnitPrice FROM
(SELECT SalesOrderID,OrderQty,ProductID,UnitPrice FROM Sales.SalesOrderDetail
ORDER BY UnitPrice) A




I have used ORDER BY clause withing the sub-query, let’s execute this and see the result.

What do you expect as the result of this query?

Think before it executes.

Below is the output;



Msg 1033, Level 15, State 1, Line 3
The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP or FOR XML is also specified.


Were you able to figure out why we are getting this error?

As I already explained, the query will execute by creating several virtual tables, keeping the intermediate results and final result also a virtual table. But ORDER BY clause works quite differently. It returns a cursor. So the sub-query is returning a cursor result while outer query is implementing a virtual table. This is the reason why you can not use the ORDER BY clause in sub-queries, views, inline functions, etc (as mentioned in the error message).

The error message gives a hint to resolve the issue.

Consider the query below;


--Query4

USE AdventureWorks
GO

SELECT SalesOrderID,OrderQty,ProductID,UnitPrice FROM
(SELECT TOP 100 PERCENT SalesOrderID,OrderQty,ProductID,UnitPrice FROM Sales.SalesOrderDetail
ORDER BY UnitPrice) A





Now you can see the result of the query and the execution plan.

The difference of query 3 & 4 is the TOP 100 PERCENT statement. When you use the TOP operator (which is not a relational operator rather it is a T-SQL specific) the ORDER BY clause will not use a cursor when implementing. It uses a virtual table thus it prevents the above error and also the cost is very low with compared to the query 2.



Conclusion



As a rule of thumb it is a good practice to avoid ORDER BY clause unless you really need the rows sorted. Use ORDER BY clause in sub-queries with TOP 100 PERCENT if you really need it. A query with ORDER BY clause can not be used as a table expression that is, a view, inline table-valued function, sub query, derived table, or common table expression (CTE).

SQL version used for this article is, Microsoft SQL Server 2005 - 9.00.3073.00 (Intel X86) Copyright (c) 1988-2005 Microsoft Corporation Enterprise Edition on Windows NT 5.2 (Build 3790: Service Pack 2)

Thursday, September 3, 2009

Understanding INNER JOIN in detail


The objective of this article is dig into detail of how INNER joins produce the result set. The INNER join is the common join type used in many cases. Even though it is used very often I have seen most people are not certain how it produces the result set when changing the join condition.




Before go into the article contents, I need to create the two tables below to illustrate the examples used here.
Table 1
CREATE TABLE #InnerJoinTest1    
(    
 InnerJoinTest1_PK INT IDENTITY(1,1)     
)    
GO
Table 2
CREATE TABLE #InnerJoinTest2    
(    
 InnerJoinTest2_PK int    
)    
GO
Since the tables above are created in tempdb, the database currently used in your query editor will not matter. Then I use the queries below to populate the created tables with some sample data.
To keep the explanation easy I used only 5 records in each table. However once you understand the theory behind it you can map it to larger tables as well. The concept is similar despite of the number records in the tables.
Populate table 1
WHILE SCOPE_IDENTITY()<5>
INSERT INTO #InnerJoinTest1    
DEFAULT VALUES
Populate table 2
INSERT INTO #InnerJoinTest2    
SELECT * FROM #InnerJoinTest1
Once you have executed all the queries above successfully, everything is set for our discussion.
The Scenario
I will present you all the queries which involve an INNER JOIN in advance so that you can manipulate and decide how many records each query will return. Finally you can compare your answers with the result set that actual query execution produces by the SQL Server.
Query 1
SELECT a.*,b.* FROM #InnerJoinTest1 a    
INNER JOIN #InnerJoinTest2 b    
ON a.InnerJoinTest1_PK=b.InnerJoinTest2_PK
Query 2
SELECT a.*,b.* FROM #InnerJoinTest1 a    
INNER JOIN #InnerJoinTest2 b    
ON a.InnerJoinTest1_PK>b.InnerJoinTest2_PK
Query 3
SELECT a.*,b.* FROM #InnerJoinTest1 a    
INNER JOIN #InnerJoinTest2 b    
ON a.InnerJoinTest1_PK>=b.InnerJoinTest2_PK
Query 4
SELECT b.*,a.* FROM #InnerJoinTest1 a    
INNER JOIN #InnerJoinTest2 b    
ON a.InnerJoinTest1_PK
ORDER BY a.InnerJoinTest1_PK
Query 5
SELECT b.*,a.* FROM #InnerJoinTest1 a    
INNER JOIN #InnerJoinTest2 b    
ON a.InnerJoinTest1_PK<=b.InnerJoinTest2_PK    
ORDER BY a.InnerJoinTest1_PK
Query 6
SELECT a.*,b.* FROM #InnerJoinTest1 a    
INNER JOIN #InnerJoinTest2 b    
ON a.InnerJoinTest1_PK<>b.InnerJoinTest2_PK
Take a piece of paper and manually determine the result set of each query will return.
Cartesian Product
First of all, I would like to mention that I will consider table 1 as set 1 and table 2 as set 2. (according to the set theory).
Let's produce the Cartesian product of the two sets. The Cartesian product of the two sets is given below.

You can produce the Cartesian product when you join every row of one table to every row of another table. You can also get one by joining every row of a table to every row of itself.
It is very important to understand how the result set is produced for the Cartesian product from the queries above. The Cartesian product will give you the maximum possible number of combinations that can be created from the given two sets. The number of records in the Cartesian product can be computed by using the following formula.
Number of records = No.of records of table 1 X No.of record of table2
In our case it is 25. (5x5)
Rationalization
Let's look at the first query. This kind of INNER JOIN is the most popular and often you may be able to determine the result set it produces. This INNER JOIN with the equal condition will give you the matching records of both sets. If you look at the above Cartesian product, how many matching records you will be able to find?

I have highlighted the matching instances in the above Cartesian product. So that the first query will give you the highlighted records as the output.
Then consider the second query.
In this case you need to find all the possible record combinations where the first table value is greater than the second table values.
The image below shows the Cartesian product and I have highlighted the record combination for query 2.

Execute query 2 and compare the result set with the highlighted records shown above. The actual query execution result should match with the above highlighted result set.
Applying the same theory for rest of the queries, you should be able to figure out the result set of each one.
Let's take query 3. Again I'm using the Cartesian product to determine the result set. This query will produce the records where InnerJoinTest1_PK is greater than or equal to the InnerJoinTest2_PK.

It is quite easier now to determine the result set if you consider the Cartesian product of two tables.
Let's look at the next two queries. (Query 4 &5) The Cartesian product below is shown the result set of both queries 4 and 5.
The green highlighted records are the output of query 4 where as query 5 will give you the combination of green highlighted and yellow highlighted records. Because the difference of query 4 and 5 is the '=' operator. So that the query 5 will output equal values of both columns in addition to the output of the query 4. The query 4 will produce the records where InnerJoinTest1_PK is less than the InnerJoinTest2_PK of the table 2.


The order of results from the actual execution of this query may be different with the Cartesian product. But it is not really a matter for our discussion. You use simply an ORDER BY clause to order the result set as you desire.
Finally consider the last query. (Query 6) This query will output the records where both columns are not equal. Again look at out most important Cartesian product shown below.
The green highlighted records show where both columns are not equal. Theoretically this result set is similar to Cartesian product - intersection. The intersection is the INNER JOIN result with equal operator. (Output of the query 1)

Conclusion
I assume you all got the better understanding over the different types of INNER JOIN presented in this article and how the result set is determined without actually executing the query. This understanding is very important when it comes to development or the administration work. The article is open for further discussions.

How to interpret Disk Latency

I was analyzing IO stats in one of our SQL Servers and noticed that IO latency (Read / Write) are very high. As a rule of thumb, we know tha...