
SSIS Find Non-Imported Files

Traversing a folder looking for files to import into a database table is a common practice in Integration Services. But what if you have little control over the import folder or its files because of security restrictions? What if the folder holds many files and the number grows daily? One way to deal with this is to use a Foreach Loop Container and, for each file, query a database table or log to determine whether the file has already been imported. This may be sufficient when the number of files is small, but as the count grows it can have a big impact on database and/or package performance. A better approach may be to use a Script Task. In this blog post I will show how we can use a Script Task with some simple C# code to capture a list of files, remove the already-processed ones, and use a Foreach Loop Container to traverse only the new files.

File Import Table

Before we get into creating the Integration Services package, we will need a FileImport table to keep a list of all files that have already been imported.

CREATE TABLE dbo.FileImport (
     FileImportID INT IDENTITY(1, 1) NOT NULL
   , FileName NVARCHAR(1000) NOT NULL
   , ImportDate DATETIME NULL
   , RecordCount INT NULL
   , ErrorMessage NVARCHAR(MAX) NULL);

New File Import SSIS package

The New File Import SSIS package will query the FileImport table to capture a list of previously imported files and will also capture a list of the files available in the import folder. Using some code, we will eliminate the files that have already been imported and then loop over the remaining set of files that have not been imported.

Let’s start by creating a new SSIS package and naming it NewFileImport. We will use an OLE DB connection to query the FileImport table and a few package variables to store processing data. We will create the following package variables:


Variable Definitions

  • The FileName variable will be used to store each new file’s path as the Foreach Loop Container traverses the new files.
  • The ImportedFiles variable will store the result set of previously imported files returned from the FileImport table.
  • The ImportedPath variable is the path to the folder where the files to be imported are stored.
  • The NewFiles variable is the list of files that have not yet been imported.

Package Tasks

The package will use an Execute SQL Task, a Script Task, and a Foreach Loop Container.

SQL – Get a list of all processed files

This task will capture a list of all files that have previously been imported and store it in the ImportedFiles variable. The ResultSet property should be set to “Full result set”, and a simple query against the FileImport table, “SELECT FileName FROM dbo.FileImport;”, should be saved as the SQLStatement.


The Result Set should be saved to the User::ImportedFiles variable.


SCR – Get a list of files to Import

This task will create a data table and an array list: the data table stores all of the file names from the FileImport table, and the array list stores all of the file names in the import path. A loop traverses the file names in the data table, and on each pass the matching file name is removed from the array list. When the loop completes, only new file names remain in the array list, which is then stored in the NewFiles variable for the Foreach Loop Container to use. The following C# script implements this logic.

// Requires: using System.Collections; using System.Data;
//           using System.Data.OleDb; using System.IO;
public void Main()
{
   OleDbDataAdapter da = new OleDbDataAdapter();
   DataTable dt = new DataTable();
   object ImportedFiles = Dts.Variables["ImportedFiles"].Value;
   string ImportedPath = Dts.Variables["ImportedPath"].Value.ToString();

   // Load the processed files from the record set into a DataTable variable.
   da.Fill(dt, ImportedFiles);

   // Load the files to be imported from the import directory into an ArrayList variable.
   ArrayList ImportDataFiles = new ArrayList(Directory.GetFiles(ImportedPath));

   // Traverse the DataTable and remove any matching files from the ArrayList.
   // This assumes dbo.FileImport stores the same full paths that Directory.GetFiles returns.
   foreach (DataRow dr in dt.Rows)
   {
      ImportDataFiles.Remove(dr["FileName"].ToString());
   }

   // Return the ArrayList, which now holds only the new files.
   Dts.Variables["NewFiles"].Value = ImportDataFiles;
   Dts.TaskResult = (int)ScriptResults.Success;
}

The ImportedFiles, ImportedPath, and NewFiles variables are used in the script, so we will need to declare them in the Script Task editor: ImportedFiles and ImportedPath as read-only variables and NewFiles as a read-write variable.


FELC – Loop through all files

This Foreach Loop Container will loop through all of the files stored in the NewFiles variable. The enumerator type to use is the “Foreach From Variable Enumerator”, and the selected variable should be User::NewFiles.


The Variable Mapping should store the result of each loop into the User::FileName variable.


On each iteration of the Foreach Loop Container, the full path to the file from the NewFiles variable will be stored in the FileName variable. Having the path to the file will help in the file import process inside the Data Flow Task. The details of importing the file data will be left to another blog post.
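One piece worth sketching now, though: each successful iteration should end by logging the file to dbo.FileImport so the next package run excludes it. A minimal sketch of the statement for such an Execute SQL Task, assuming the ? placeholder is mapped to the User::FileName variable on the Parameter Mapping page:

-- Record the processed file so future runs treat it as already imported.
-- The ? placeholder is an OLE DB parameter mapped to User::FileName.
INSERT INTO dbo.FileImport (FileName, ImportDate)
VALUES (?, GETDATE());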

T-SQL Bitwise Matching Logic – #TSQL2sDay


A good friend of mine, Rick Krueger (@DataOgre | blog), is hosting this month’s T-SQL Tuesday blog party. The T-SQL Tuesday blog parties were started by Adam Machanic (@AdamMachanic | blog) back in December of 2009. This month’s topic is Rube Goldberg Machines: Rick asked us to look through our closets and dig up a skeleton, something we made T-SQL do that might make other DBAs cringe a bit. After hearing this, I knew exactly what I was going to blog about.

Bitwise Matching!

I remember implementing this feature and shaking my head when I came up with the idea of using bitwise matching. The requirements I was given were to match two entities based on assigned lookup values. This would not have been so bad, except the lookup values were not static; they were entered dynamically through a web UI. When I first pitched the idea, I used a dating site to explain the logic, so in this blog post I will use that same concept.


Within a dating site, one person is matched to another based on activities they both enjoy. Activities can be grouped into different categories. In this example we have two categories: Exercise and Things to Do. People must have at least one matching activity in each category to be a match. We are going to try to match Bob with another person.

People and their activities

  • Bob enjoys Walking, Biking, and Camping.
  • Sally enjoys Walking and Biking.
  • Nancy enjoys Camping.
  • Beth enjoys Walking, Running, Golfing, and Camping.

The following T-SQL creates a few temporary tables that we will use to implement the bitwise matching logic.

-- Create the People that can be matched.
CREATE TABLE #Person (
   PersonID smallint
 , Name nvarchar(25));
INSERT   INTO #Person
         (PersonID, Name)
VALUES   (1, N'Bob'), (2, N'Sally'), (3, N'Nancy'), (4, N'Beth');
-- Create the Activities that people may have in common.
CREATE TABLE #Activities (
   ActivityID smallint
 , Activity nvarchar(25)
 , CategoryID smallint);
INSERT   INTO #Activities
         (ActivityID, Activity, CategoryID)
VALUES   (1, N'Walking', 1), (2, N'Biking', 1), (3, N'Running', 1), (4, N'Yoga', 1),        
	 (5, N'Movies', 2), (6, N'Golf', 2), (7, N'Camping', 2);
-- Create the weak entity table to store each person's activities.
CREATE TABLE #PersonActivities (
   ActivityID smallint
 , PersonID smallint);
INSERT   INTO #PersonActivities
         (PersonID, ActivityID)
VALUES   (1, 1), (1, 2), (1, 7),
         (2, 1),(2, 2),
         (3, 7),
         (4, 1),(4, 3),(4, 6),(4, 7);

Bitwise Logic

Using the activities, we can assign each one a bitwise value based on its ActivityID.

Activity   ActivityID   CategoryID   Bitwise Value
Walking    1            1            2^1 = 2
Biking     2            1            2^2 = 4
Running    3            1            2^3 = 8
Yoga       4            1            2^4 = 16
Movies     5            2            2^5 = 32
Golf       6            2            2^6 = 64
Camping    7            2            2^7 = 128
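
These values fall straight out of T-SQL’s POWER() function, which the matching query below also relies on:

-- Each activity's bitwise value is 2 raised to its ActivityID.
SELECT   Activity
       , CategoryID
       , BitwiseValue = POWER(2, ActivityID)
FROM     #Activities;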

If we sum the bitwise values per person and category, we can use the bitwise AND operator (&) to determine whether any matches exist. Example:

Summed Bitwise Value   2^3   2^2   2^1   2^0
6                  =    0     1     1     0
12                 =    1     1     0     0
4                  =    0     1     0     0

The value of 6 represents Walking and Biking, and the value of 12 represents Biking and Running. The intersection of the two is Biking, which corresponds to the matched value of 4. Because the matched value is greater than 0, we know the two sets share at least one activity.
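We can verify this in isolation, since & operates directly on integer values:

-- 6 (Walking + Biking) AND 12 (Biking + Running) intersect on Biking.
SELECT MatchedValue = 6 & 12;   -- returns 4, which is greater than 0: a match

Using this logic across every person and category, we can implement the following query.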

WITH  PeopleSums
        AS (
            SELECT   p.Name
                   , p.PersonID
                   , a.CategoryID
                   , BitWiseSum = SUM(POWER(2, pa.ActivityID))
            FROM     #Person p
            JOIN     #PersonActivities pa ON p.PersonID = pa.PersonID
            JOIN     #Activities a ON a.ActivityID = pa.ActivityID
            GROUP BY p.Name
                   , p.PersonID
                   , a.CategoryID
           )
   SELECT   ps2.Name
   FROM     PeopleSums ps
   JOIN     PeopleSums ps2 ON ps.PersonID != ps2.PersonID
                              AND ps.BitWiseSum & ps2.BitWiseSum <> 0
                              AND ps.CategoryID = ps2.CategoryID
   WHERE    ps.Name = 'Bob'
   GROUP BY ps.Name
          , ps2.Name
   HAVING   COUNT(*) = 2;

This query uses a CTE to calculate and then summarize the bitwise values, grouped by person and category. We then self-join the CTE using the bitwise AND operator (&), keeping rows where the result is anything but zero. This shows us whom Bob matches, as the table below illustrates.

Person that Matches Bob   Category 1 Match   Category 2 Match
Sally                     Walking, Biking    (none)
Nancy                     (none)             Camping
Beth                      Walking            Camping

To make sure we have a match in each category, we count the matched categories and require the count to equal the number of categories we currently have (HAVING COUNT(*) = 2). And voilà, Beth is a match for Bob, because Beth is the only candidate with matches in both categories.

LAG() and LEAD() Analytical Window Functions

I have been finding more and more uses for the OVER clause lately, and I figured it was about time I did a blog post on it. The OVER clause defines a window of data that can be aggregated and returned alongside the details of the original result set. A number of window functions can be applied to that window to generate row numbers and ranking values, compute averages, or fetch the previous and next values in a set. Getting the previous and next values in a set of data is exactly the problem I ran into recently. This blog post describes the LAG and LEAD analytic functions, two new window functions introduced in SQL Server 2012.

Recently I was tasked with calculating the difference between two values that lived in different rows. My first thought was to self-join the table and compare the two results. Using AdventureWorks2012, I ran the following query; the results are shown below.

SELECT   soh.PurchaseOrderNumber
       , soh.OrderDate
       , soh.CustomerID
       , PreviousTotal = soh1.TotalDue
       , CurrentTotal = soh.TotalDue
       , NextTotal = soh2.TotalDue
FROM     Sales.SalesOrderHeader soh
JOIN     Sales.SalesOrderHeader soh1 ON soh1.SalesOrderID = soh.SalesOrderID - 1
JOIN     Sales.SalesOrderHeader soh2 ON soh2.SalesOrderID = soh.SalesOrderID + 1;


This result set shows the previous and next values of the TotalDue column, which got me halfway to the difference calculation I needed. However, I needed the results for a given customer rather than all customers, so I updated the query to accept a customer ID; the results are shown below.

DECLARE @CustomerID int = 30117;
SELECT   soh.PurchaseOrderNumber
       , soh.OrderDate
       , soh.CustomerID
       , PreviousTotal = soh1.TotalDue
       , CurrentTotal = soh.TotalDue
       , NextTotal = soh2.TotalDue
FROM     Sales.SalesOrderHeader soh
LEFT JOIN Sales.SalesOrderHeader soh1 ON soh1.SalesOrderID = soh.SalesOrderID - 1
                                         AND soh1.CustomerID = soh.CustomerID
LEFT JOIN Sales.SalesOrderHeader soh2 ON soh2.SalesOrderID = soh.SalesOrderID + 1
                                         AND soh2.CustomerID = soh.CustomerID
WHERE    soh.CustomerID = @CustomerID;


This was a problem. Because I was using SalesOrderID to perform the self-joins and the customer’s orders were not sequential, I was not getting any of the previous and next records.
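
One workaround would have been to number the customer’s orders with ROW_NUMBER() and self-join on that sequence; a sketch of the idea (the CTE and alias names here are just illustrative):

-- Number the customer's orders, then join each row to its neighbors by sequence.
DECLARE @CustomerID int = 30117;
WITH NumberedOrders AS (
    SELECT   PurchaseOrderNumber
           , OrderDate
           , CustomerID
           , TotalDue
           , Seq = ROW_NUMBER() OVER (ORDER BY OrderDate, PurchaseOrderNumber)
    FROM     Sales.SalesOrderHeader
    WHERE    CustomerID = @CustomerID
)
SELECT   no.PurchaseOrderNumber
       , no.OrderDate
       , no.CustomerID
       , PreviousTotal = prev.TotalDue
       , CurrentTotal = no.TotalDue
       , NextTotal = nxt.TotalDue
FROM     NumberedOrders no
LEFT JOIN NumberedOrders prev ON prev.Seq = no.Seq - 1
LEFT JOIN NumberedOrders nxt ON nxt.Seq = no.Seq + 1;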


Instead of going down that path of craziness, I decided to eliminate the self-referencing joins altogether and use the OVER clause with LEAD and LAG to solve the problem. Using the following query and output, we can explain LEAD(), LAG(), and OVER (PARTITION BY).

DECLARE @CustomerID int = 30117;
SELECT   PurchaseOrderNumber
       , OrderDate
       , CustomerID
       , CurrentTotalDue = TotalDue
FROM     Sales.SalesOrderHeader soh
WHERE    soh.CustomerID = @CustomerID;


In the blue box is the CustomerID column; this is the column we want to partition by.

In the red box are the PurchaseOrderNumber and OrderDate columns; these are the columns we want to order the window’s results by.

In the green box is the CurrentTotalDue; this is the column we will pass to the LEAD and LAG window functions.

I updated the query to use the LEAD and LAG window functions; the results are shown below. The red highlights show the results of the LEAD function, and the green highlights show the results of the LAG function.

DECLARE @CustomerID int = 30117;
SELECT   PurchaseOrderNumber
       , OrderDate
       , PreviousTotal = LAG(TotalDue) OVER (PARTITION BY CustomerID ORDER BY OrderDate , PurchaseOrderNumber)
       , CurrentTotal = TotalDue
       , NextTotal = LEAD(TotalDue) OVER (PARTITION BY CustomerID ORDER BY OrderDate , PurchaseOrderNumber)
FROM     Sales.SalesOrderHeader soh
WHERE    soh.CustomerID = @CustomerID;


As you can see, the syntax for using LEAD and LAG is not overly complex. In a future blog post I will dive deeper into the OVER clause.
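To close the loop on the original task, here is a sketch of the difference calculation itself, computing the change from the previous order (the ChangeFromPrevious alias is mine):

DECLARE @CustomerID int = 30117;
SELECT   PurchaseOrderNumber
       , OrderDate
       , CurrentTotal = TotalDue
       -- LAG's optional third argument supplies a default when no previous row
       -- exists, so the first order shows a change of 0 instead of NULL.
       , ChangeFromPrevious = TotalDue - LAG(TotalDue, 1, TotalDue) OVER (PARTITION BY CustomerID ORDER BY OrderDate, PurchaseOrderNumber)
FROM     Sales.SalesOrderHeader soh
WHERE    soh.CustomerID = @CustomerID;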


UNPIVOT Queries

In prior blog posts I talked about a Standard PIVOT query and a Dynamic PIVOT query, so it is only fitting that I talk about how to implement an UNPIVOT query. As DBAs we commonly unpivot data and turn it into normalized data for storage in a relational data store, so understanding the UNPIVOT query is a great skill to have. Just like the PIVOT operator, the UNPIVOT operator was introduced with SQL Server 2005. In this blog post I will show how we can use the UNPIVOT operator.

I got the idea for this blog post from my online grade book, where I noticed the data is displayed as a denormalized set. As an instructor, I thought it would be interesting to see the average, sum, minimum, and maximum assignment scores for each student. Below is an example of the online grade book’s display.

StudentID   Assignment1   Assignment2   Assignment3   Assignment4
101         90            100           95            100
102         70            65            40            80
103         100           100           100           100

I found that writing a simple query to get the sum (adding each column) and the average (adding each column and dividing by the number of columns) was not too bad. However, finding the minimum and maximum proved to be a bit tricky. Sure, we could implement a CASE WHEN statement, but with each added column the complexity and the possibility of error increase significantly, and maintainability dang near goes out the window. In comes UNPIVOT to save the day: after unpivoting the data, we can simply use the AVG(), SUM(), MIN(), and MAX() functions.
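For contrast, here is a sketch of the column-by-column approach that UNPIVOT replaces, using the #ClassGrades table from the full script at the end of this post. The sum and average are tolerable; the CASE pattern for MIN/MAX is where it falls apart:

-- The column-by-column approach that UNPIVOT replaces.
SELECT   StudentID
       , Total    = Assignment1 + Assignment2 + Assignment3 + Assignment4
       , Average  = (Assignment1 + Assignment2 + Assignment3 + Assignment4) / 4.0
       -- MAX of just two columns via CASE; covering all four requires nesting
       -- this pattern, and every new column multiplies the combinations.
       , MaxOfTwo = CASE WHEN Assignment1 >= Assignment2 THEN Assignment1 ELSE Assignment2 END
FROM     #ClassGrades;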

Let’s take a look at the basic structure of an UNPIVOT query. Using the UNPIVOT operator, you specify a value column and a data column, then list each of the columns you want to unpivot into these new columns. The new value and data columns become selectable columns in the query, as shown in the following code.

SELECT   [AnyColumn]
       , [DataColumn]
       , [ValueColumn]
FROM     SomeTable ST 
UNPIVOT ( ValueColumn FOR DataColumn IN (UnPivotCol1, UnPivotCol2, UnPivotCol3) ) AS UNPVT

Applying this UNPIVOT syntax to our data above yields the following query.

SELECT   StudentID
       , Assignment
       , ASMT.Score
FROM     #ClassGrades CG 
UNPIVOT ( Score FOR Assignment IN (Assignment1, Assignment2, Assignment3, Assignment4) ) AS ASMT

This query returns the following normalized view of the data:

StudentID   Assignment    Score
101         Assignment1   90
101         Assignment2   100
101         Assignment3   95
101         Assignment4   100
102         Assignment1   70
102         Assignment2   65
102         Assignment3   40
102         Assignment4   80
103         Assignment1   100
103         Assignment2   100
103         Assignment3   100
103         Assignment4   100

Now that we have this data in a normalized result, applying aggregate functions such as AVG() and SUM() is relatively simple. Below is an updated version of the query using the AVG(), SUM(), MIN(), and MAX() functions, grouped by StudentID.

SELECT   StudentID
       , Average = AVG(ASMT.Score)
       , MaxScore = MAX(ASMT.Score)
       , MinScore = MIN(ASMT.Score)
       , Total = SUM(ASMT.Score)
FROM     #ClassGrades CG
UNPIVOT ( Score FOR Assignment IN (Assignment1, Assignment2, Assignment3, Assignment4) ) AS ASMT
GROUP BY StudentID;

This query returns the following view of the data:

StudentID   Average   MaxScore   MinScore   Total
101         96        100        90         385
102         63        80         40         255
103         100       100        100        400

As you can see, the PIVOT and UNPIVOT operators are not overly complex and are definitely great T-SQL tools for any DBA to keep in their back pocket. Below is the entire script with test data included.

IF OBJECT_ID('tempdb..#ClassGrades') IS NOT NULL 
   DROP TABLE #ClassGrades;
CREATE TABLE #ClassGrades (
   StudentID int PRIMARY KEY
                 NOT NULL
 , Assignment1 int NULL
 , Assignment2 int NULL
 , Assignment3 int NULL
 , Assignment4 int NULL);
INSERT   INTO #ClassGrades
         (StudentID, Assignment1, Assignment2, Assignment3, Assignment4)
VALUES   (101, 90, 100, 95, 100),
         (102, 70, 65, 40, 80),
         (103, 100, 100, 100, 100); 
SELECT   StudentID
       , Assignment
       , ASMT.Score
FROM     #ClassGrades CG
UNPIVOT ( Score FOR Assignment IN (Assignment1, Assignment2, Assignment3, Assignment4) ) AS ASMT;
SELECT   StudentID
       , Average = AVG(ASMT.Score)
       , MaxScore = MAX(ASMT.Score)
       , MinScore = MIN(ASMT.Score)
       , Total = SUM(ASMT.Score)
FROM     #ClassGrades CG
UNPIVOT ( Score FOR Assignment IN (Assignment1, Assignment2, Assignment3, Assignment4) ) AS ASMT
GROUP BY StudentID;

Rules of Engagement – NoSQL Graph Databases

Well, Rick Krueger (blog | @DataOgre) and I are back at it again. We have decided to pair up and do a talk on a rather controversial topic in the SQL Server community: NoSQL. Now, we are not jumping ship and saying that NoSQL is the way of the future, because that is simply not true. Relational databases have been around since the 70s and still outperform NoSQL solutions at most tasks. The idea we are going to seed is that these NoSQL solutions may have a place in enterprise applications. We are not going to say you should use SQL Server or a NoSQL solution exclusively; rather, we are going to advocate polyglot persistence, the idea of using multiple data stores to solve multiple problems. Hence: Rules of Engagement: NoSQL is SQL Server’s Ally. In a previous blog post I discussed key value stores and column stores; in this blog post I will discuss another branch of the NoSQL family, graph databases.

Graph Databases

Graph databases are data structures that consist of nodes, properties, and edges. In RDBMS terms, a node is like a record in a table, a property is like the data within that record, and an edge is similar to a relationship between tables. However, edges can store properties too; these properties describe the relationship between the nodes.

In RDBMS land, writing queries that join many large tables, especially self-referencing tables, can degrade query performance and be rather costly, even with proper index usage. This is where graph databases really shine: they traverse nodes by following the edges and the properties within those edges. Using a graph database to find relationships in highly interconnected data is very fast, and finding data that would take multiple joins or a recursive query in an RDBMS is a simple task for a graph database.

You can visualize a graph database as a flattened RDBMS table structure with named connections. Looking at the relational tables below, we see a simple car buying and selling database, which holds a list of cars that people currently own and the types of cars they are looking for. If we were to flatten and denormalize the structure, we would get something that looks like this.


Here the Car node is the CarType and Manufacture tables denormalized, the Person node is the Person table, and the edges that connect these two nodes are the Request and Cars tables. Depending on the data you are looking for, both of these structures have their merits. Let’s say you need to write a query that returns a list of cars whose owners are willing to trade with another owner. This is a simple query for both. But what if there is no direct match, yet there is a match using a third party? Or what if a fourth or fifth party is needed for a four- or five-way trade? That query gets very complicated in an RDBMS, and performance degrades progressively at each level. This is where a graph database really stands out: because the emphasis is on the edges between the nodes, the query stays simple and gives you the performance your application desires.
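
To make the escalation concrete, here is a sketch of just the direct two-way trade in T-SQL, assuming hypothetical Cars (PersonID, CarTypeID) and Request (PersonID, CarTypeID) tables standing in for the diagram above. Every additional trading party adds another Cars/Request join pair:

-- Direct two-way trade: Party2 wants a car Party1 owns, and vice versa.
-- Cars(PersonID, CarTypeID) and Request(PersonID, CarTypeID) are hypothetical
-- stand-ins for the tables in the diagram above.
SELECT DISTINCT
         Party1 = c1.PersonID
       , Party2 = c2.PersonID
FROM     Cars c1
JOIN     Request r2 ON r2.CarTypeID = c1.CarTypeID      -- Party2 wants what Party1 owns
JOIN     Cars c2    ON c2.PersonID  = r2.PersonID
JOIN     Request r1 ON r1.PersonID  = c1.PersonID
                   AND r1.CarTypeID = c2.CarTypeID      -- Party1 wants what Party2 owns
WHERE    c1.PersonID <> c2.PersonID;

A three-way trade needs a third pair of joins, a four-way trade a fourth, and so on. In a graph database, the same search is a single variable-length traversal over the ownership and request edges, no matter how many parties are involved.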

Graph Database Implementations

One of the most recognized graph database implementations is Neo4j, currently made available by Neo Technology.

“Neo4j is a robust (fully ACID) transactional property graph database. Due to its graph data model, Neo4j is highly agile and blazing fast. For connected data operations, Neo4j runs a thousand times faster than relational databases.” – http://www.neo4j.org/

Neo4j is available for most platforms and comes with a WebAdmin management console for viewing the graph database. Below are some screen captures of the WebAdmin utility, showing a dashboard for overall database health and some ways to view the data inside the database.

Polyglot Persistence, SQL Server and Graph Databases

Recently Facebook announced a graph search feature in their application.

“With graph search you can look up anything shared with you on Facebook, and others can find stuff you’ve shared with them.” – https://www.facebook.com/about/graphsearch

This use of a graph database within their application is the idea behind polyglot persistence: their primary data could be stored in any data structure, such as an RDBMS, while the graph search feature is backed by a graph database.


I am not recommending that you rewrite your applications to make room for a NoSQL solution; I am simply suggesting that you, as the DBA (keeper of the data), keep an open mind to other possibilities. If there is another solution out there that could assist SQL Server, or is perhaps a better fit, exploring that avenue might result in a better data solution. As always, a round peg will fit into a square hole if you have a big enough hammer.