T-SQL Bitwise Matching Logic – #TSQL2sDay

TSQLTUES

A good friend of mine Rick Kruger (@DataOgre|blog) is hosting this month’s T-SQL Tuesday Blog party. The T-SQL Tuesday Blog parties were started by Adam Machanic (@AdamMachanic|blog) back in December of 2009. This month’s topic is on Rube Goldberg Machines. Rick asked us to look through our closets to dig up a skeleton, something that we made T-SQL do, which might make other DBA’s cringe a bit. After hearing this, I knew exactly what I was going to blog about.

Bitwise Matching!

I remember when I implemented this feature and shook my head when I came up with the idea of using bitwise matching. The requirements I was given was to match two entities based on assigned lookup values. This would not have been that bad, except the lookup values were not static, they were dynamically entered through a web UI. When I first pitched the idea I used a dating site as a way to explain the logic, so in this blog post I will use that same concept.

Concept

Within a dating site one person is matched to another based on activities they both enjoy. Activities can be grouped into different categories. In this example we have two different categories; Exercise and Things to do. People must have at least one matching activity in each category to be a match. We are going to try to match Bob with another person.

People and their activities

  • Bob enjoys Walking, Biking, and Camping.
  • Sally enjoys Walking and Biking
  • Nancy enjoys Camping
  • Beth enjoys Walking, Running, Golfing, and Camping.

The following T-SQL implements a few temporary tables we will use to implement the bitwise matching logic.

-- Create the People that can be matched.
CREATE TABLE #Person (
   PersonID smallint
 , Name nvarchar(25));
INSERT   INTO #Person
         (Name)
VALUES   (1, N'Bob'), (2, N'Sally'), (3, N'Nancy'), (4, N'Beth');
 
-- Create the Activities that people may have in common.
CREATE TABLE #Activities (
   ActivityID smallint
 , Activity nvarchar(25)
 , CategoryID smallint);
 
INSERT   INTO #Activities
         (Activity, CategoryID)
VALUES   (1, N'Walking', 1), (2, N'Biking', 1), (3, N'Running', 1), (4, N'Yoga', 1),        
	 (5, N'Movies', 2), (6, N'Golf', 2), (7, N'Camping', 2);
 
-- Create the Weak Entity Table to store the Persons Activities.
CREATE TABLE #PersonActivities (
   ActivityID smallint
 , PersonID smallint);
 
INSERT   INTO #PersonActivities
         (PersonID, ActivityID)
VALUES   (1, 1), (1, 2), (1, 7),
         (2, 1),(2, 2),
         (3, 7),
         (4, 1),(4, 3),(4, 6),(4, 7);

Bitwise Logic

Using the activities we can assign a bitwise value based on the identity and category.
Example:

Activity Activity ID Category ID Bitwise Value
Walking 1 1 21 = 2
Biking 2 1 22 = 4
Running 3 1 23 = 8
Yoga 4 1 24 = 16
Movies 1 2 21 = 2
Golf 2 2 22 = 4
Camping 3 2 23 = 8

If we summarize the bitwise values we can use the binary and operator (&) to determine if any matches exist. Example:

Summed Bitwise Value 23 22 21 20
6 = 0 1 1 0
12 = 1 1 0 0
4 = 0 1 0 0

The value of 6 represents Walking and Biking, the value of 12 represents Biking and Running. The intersection of the two is Biking, this would be the matched value of 4. Being that we have a matched value, the result is greater than 0. Using this logic we can implement the following query.

WITH  PeopleSums
        AS (
            SELECT   p.Name
                   , p.PersonID
                   , a.CategoryID
                   , BitWiseSum = SUM(POWER(2, pa.ActivityID))
            FROM     #Person p
            JOIN     #PersonActivities pa ON p.PersonID = pa.PersonID
            JOIN     #Activities a ON a.ActivityID = pa.ActivityID
            GROUP BY p.Name
                   , p.PersonID
                   , a.CategoryID
           )
   SELECT   ps2.Name
   FROM     PeopleSums ps
   JOIN     PeopleSums ps2 ON ps.PersonID != ps2.PersonID
                              AND ps.BitWiseSum & ps2.BitWiseSum <> 0
                              AND ps.CategoryID = ps2.CategoryID
   WHERE    ps.Name = 'Bob'
   GROUP BY ps.Name
          , ps2.Name
   HAVING   COUNT(*) = 2;

This query uses a CTE to calculate and then summarize the bitwise values, grouped by person and category. We then self-reference the CTE using the binary AND operator (&) where the result is anything but zero. This concept can show us who Bob matches, we can use the table below to illustrate this.

Persons that Match Bob Category 1 Category 2
Sally Walking, Biking
Nancy Camping
Sally Walking Camping

To make sure that we have a match for each category, we do a count and ensure that it matches the number of categories we currently have. And voila Beth is a match for Bob, because Beth is the only candidate that had matches in both categories.

Standard T-SQL UNPIVOT

In prior blog posts I talked about a Standard PIVOT query and a Dynamic PIVOT query. It is only fitting that I talk about how to implement an UNPIVOT query. As DBA’s we commonly unpivot data and turn it into normalized data for storing in a relational data store. Understanding how to use the UNPIVOT query is a great skill to have as a DBA. Just like the PIVOT operator, the UNPIVOT operator was introduced with SQL Server 2005. In this blog post I will show how we can use the UNPIVOT operator.

I got the idea for this blog post as a result from my online grade book. I noticed that the data is displayed as a denormalized set of data. As an instructor I thought it would be interesting to see average, sum, minimum and maximum assignment scores for each student. Below is an example of the display of the online grade book.

StudentID Assignment1 Assignment2 Assignment3 Assignment4
101 90 100 95 100
102 70 65 40 80
103 100 100 100 100

I found that writing a simple query to get the sum (adding each column) and average (adding each column and dividing by the number of columns) was not too bad. However, finding the minimum and maximum proved to be a bit tricky. Sure we could implement a CASE WHEN statement, however for each column the complexity and possibility of error increases significantly, and maintainability dang near goes out the window. In comes UNPIVOT to save the day. After unpivoting the data, we simply use the AVG(), SUM(), MIN() and MAX() functions.

Let’s take a look at the basic structure of the UNPIVOT query. Using the UNPIVOT statement, you specify a value and data column, then select each of the columns you want to unpivot into these new columns. The new value and data columns become select-able columns in the query, as shown in the following code.

SELECT   [AnyColumn]
       , [DataColumn]
       , [ValueColumn]
FROM     SomeTable ST 
UNPIVOT ( ValueColumn FOR DataColumn IN (UnPivotCol1, UnPivotCol2, UnPivotCol2) ) AS UNPVT

Applying this UNPIVOT syntax to our data above we yield the following query.

SELECT   StudentID
       , Assignment
       , ASMT.Score
FROM     #ClassGrades CG 
UNPIVOT ( Score FOR Assignment IN (Assignment1, Assignment2, Assignment3, Assignment4) ) AS ASMT

This query returns the following normalized view of the data

StudentID Assignment Score
101 Assignment1 90
101 Assignment2 100
101 Assignment3 95
101 Assignment4 100
102 Assignment1 70
102 Assignment2 65
102 Assignment3 40
102 Assignment4 80
103 Assignment1 100
103 Assignment2 100
103 Assignment3 100
103 Assignment4 100

Now that we have this data in a normalized result, applying SQL functions such as AVG() and SUM(), is relatively simple. Below is an updated version of the query using the AVG(), SUM(), MIN(), and MAX() functions grouping by StudentID.

SELECT   StudentID
       , Average = AVG(ASMT.Score)
	   , MaxScore = MAX(ASMT.Score)
	   , MinScore = MIN(ASMT.Score)
	   , Total = SUM(ASMT.Score)
FROM     #ClassGrades CG 
UNPIVOT ( Score FOR Assignment IN (Assignment1, Assignment2, Assignment3, Assignment4) ) AS ASMT
GROUP BY StudentID

This query returns the following view of the data

StudentID Average MaxScore MinScore Total
101 96 100 90 385
102 63 80 40 255
103 100 100 100 400

As you can see the PIVOT and UNPIVOT statements are not overly complex and are definitely great TSQL tools for any DBA to keep in their back-pocket. Below is the entire script with test data included.

IF OBJECT_ID('tempdb..#ClassGrades') IS NOT NULL 
   DROP TABLE #ClassGrades;
 
CREATE TABLE #ClassGrades (
   StudentID int PRIMARY KEY
                 NOT NULL
 , Assignment1 int NULL
 , Assignment2 int NULL
 , Assignment3 int NULL
 , Assignment4 int NULL);
 
INSERT   INTO #ClassGrades
         (StudentID, Assignment1, Assignment2, Assignment3, Assignment4)
VALUES   (101, 90, 100, 95, 100),
         (102, 70, 65, 40, 80),
         (103, 100, 100, 100, 100); 
 
SELECT   StudentID
       , Assignment
       , ASMT.Score
FROM     #ClassGrades CG 
UNPIVOT ( Score FOR Assignment IN (Assignment1, Assignment2, Assignment3, Assignment4) ) AS ASMT
 
SELECT   StudentID
       , Average = AVG(ASMT.Score)
	   , MaxScore = MAX(ASMT.Score)
	   , MinScore = MIN(ASMT.Score)
	   , Total = SUM(ASMT.Score)
FROM     #ClassGrades CG 
UNPIVOT ( Score FOR Assignment IN (Assignment1, Assignment2, Assignment3, Assignment4) ) AS ASMT
GROUP BY StudentID;

Dynamic T-SQL Pivot

In a prior blog post (Standard T-SQL PIVOT) I showed how we can use a PIVOT query to take distinct values in a column and make those values new columns. However,  data changes and new columns may be needed. Looking at the data from the prior post, we only had three types of candy bars. What happens when a new candy bar shows up in our transactions? Well, we could update our query to use the new column value. If another new candy bar shows up we could update the query again. This process could go on and on and on. I am going to propose an alternative: using a dynamic PIVOT query. In this blog post I am going to show how we can use dynamic SQL to implement a dynamic PIVOT query.

Let’s start out by looking at the original data:

CustomerID ProductName Quantity
1 Snickers 1
1 Butterfinger 1
1 Rolo 2
1 Snickers 1
1 Butterfinger 1
2 Rolo 2

After using a PIVOT query we will see the data as described in the table below:

CustomerIDButterfingerRoloSnickers

CustomerID Butterfinger Rolo Snickers
1 2 2 2
2 NULL 2 NULL

Now update the original data with a few new products:

CustomerID ProductName Quantity
1 Snickers 1
1 Butterfinger 1
1 Rolo 2
1 Snickers 1
1 Butterfinger 1
2 Rolo 2
1 MilkyWay 3
2 Twix 2

If we apply that same query from the prior blog post, listed below, we see the same results as before.

SELECT   [CustomerID], 
         [Butterfinger], [Milkyway], [Rolo], [Snicker]
FROM    #CustomerProducts P 
PIVOT ( SUM(Quantity) FOR P.ProductName IN (
                          [Butterfinger], [Milkyway], [Rolo], [Snicker])
) AS PVT
CustomerID Butterfinger Rolo Snickers
1 2 2 2
2 NULL 2 NULL

But what we really wanted was

CustomerID Butterfinger Milkyway Rolo Snickers Twix
1 2 3 2 2 NULL
2 NULL NULL 2 NULL 2

As I mentioned before, we could simply update our query and this blog post is over. However, we are going to push on and create our dynamic version. Let’s take a look at creating the dynamic portion of the PIVOT query. The key part of the dynamic query is the distinct column list. Getting the distinct columns is just a simple DISTINCT query. To get the comma separated column list we will employ a trick that involves a COALESCE or ISNULL statement.

SELECT @PivotList = COALESCE(@PivotList + ',', '')

Once we have the distinct comma separated list, we can start to build out the dynamic query. We will create an nvarchar variable that we will use to store our dynamic SQL statement, and then we will execute that statement with the sp_executesql function.

Looking at our PIVOT query again, what we will want to do is replace the distinct columns with our comma separated list of values. Then we will take and store the new query into our variable

SELECT   [CustomerID], 
         [Butterfinger], [Milkyway], [Rolo], [Snicker]
FROM    #CustomerProducts P 
PIVOT ( SUM(Quantity) FOR P.ProductName IN (
                          [Butterfinger], [Milkyway], [Rolo], [Snicker])
) AS PVT

Converts to this

SET @Output = 'SELECT  [CustomerID]
	, ' + @PivotList + '
FROM    #CustomerProducts P 
PIVOT ( SUM(Quantity) FOR P.ProductName IN (' + @PivotList + ') ) AS PVT'

Now all that is left is to execute the stored query using the sp_executesql function. Here is the entire Dynamic SQL Pivot Script:

CREATE TABLE #CustomerProducts (
   CustomerID int
 , ProductName nvarchar(50)
 , Quantity int)
 
INSERT   INTO #CustomerProducts
         (CustomerID, ProductName, Quantity)
VALUES   (1, N'Snicker', 1),
         (1, N'Butterfinger', 1),
         (1, N'Rolo', 2),
         (1, N'Milkyway', 1),
         (1, N'Snicker', 1),
         (1, N'Butterfinger', 1),
         (2, N'Rolo', 2),
         (1, N'Milkyway', 3),
         (2, N'Twix', 2);
 
DECLARE @Output nvarchar(max) = N''
 , @PivotList varchar(max)
 
SELECT   @PivotList = COALESCE(@PivotList + ', ', N'') + N'[' + ProductName + N']'
 
FROM     (
          SELECT DISTINCT
                  ProductName
          FROM    #CustomerProducts
         ) AS CustProds;
 
SET @Output = 'SELECT  [CustomerID]
	, ' + @PivotList + '
FROM    #CustomerProducts P 
PIVOT ( SUM(Quantity) FOR P.ProductName IN (' + @PivotList + ') ) AS PVT'
 
EXEC sp_executesql @Output;
 
IF OBJECT_ID('tempdb..#CustomerProducts') IS NOT NULL	
	DROP TABLE #CustomerProducts

Standard T-SQL PIVOT

Pivoting data is a common practice while analyzing data. The basic idea is to take a single column’s data and pivot the unique values into separate columns. In this post I am going to show a simple example of how we can implement a PIVOT statement in a query. In a future post I will discuss how we can create a dynamic PIVOT based on the functionality below. SQL Server 2005 introduced functionality to help with pivoting data. You can PIVOT data using the following syntax

SELECT  [AnyColumn], [PivotCol1],[PivotCol2],...,[PivotColN]
FROM    (
	   SELECT  [AnyColumn]
                 , [DataColumn]
		 , [AggregateColumn]
	   FROM    SomeTable	   
	  ) P PIVOT ( SUM(AggregateColumn) FOR DataColumn IN ([PivotCol1],[PivotCol2],...,[PivotColN]) ) AS PVT

So, let’s take a look at a bit of data. The temp table #CustomerProducts below stores a snippet of transaction data. This data is describing the number of candy bars ordered per transaction by a given customer, over time.

CREATE TABLE #CustomerProducts (
   CustomerID int
 , ProductName nvarchar(50)
 , Quantity int)
 
INSERT   INTO #CustomerProducts
         (CustomerID, ProductName, Quantity)
VALUES   (1, N'Snickers', 1),
         (1, N'Butterfinger', 1),
         (1, N'Rolo', 2),
         (1, N'Snickers', 1),
         (1, N'Butterfinger', 1),
         (2, N'Rolo', 2);
CustomerID ProductName Quantity
1 Snickers 1
1 Butterfinger 1
1 Rolo 2
1 Snickers 1
1 Butterfinger 1
2 Rolo 2

However, we would like to see how many candy bars each customer has ordered total, as described in the table below:

CustomerID Butterfinger Rolo Snickers
1 2 2 2
2 NULL 2 NULL

So how do we accomplish this task? Simple, we will implement a PIVOT query…

Before we can develop the PIVOT query we need to determine the PIVOT columns. These columns will come from the data located in the product name column. According the the table above we want to use the following columns: Butterfinger, Rolo, and Snickers. Next we will want to determine the column that we are going to aggregate. In our example, we will use the Quantity column and we will summarize the results. With this information we can plug it into the a PIVOT query and get the query listed below.

CREATE TABLE #CustomerProducts (
   CustomerID int
 , ProductName nvarchar(50)
 , Quantity int)
 
INSERT   INTO #CustomerProducts
         (CustomerID, ProductName, Quantity)
VALUES   (1, N'Snickers', 1),
         (1, N'Butterfinger', 1),
         (1, N'Rolo', 2),         
         (1, N'Snickers', 1),
         (1, N'Butterfinger', 1),
         (2, N'Rolo', 2);
 
 SELECT  [CustomerID]
	, [Butterfinger],[Rolo],[Snickers]
FROM    (
	   SELECT  [CustomerID]
			 , [ProductName]
			 , [Quantity]
	   FROM    #CustomerProducts	   
	  ) P PIVOT ( SUM(Quantity) FOR P.ProductName IN ([Butterfinger],[Rolo],[Snickers]) ) AS PVT
 
IF OBJECT_ID('tempdb..#CustomerProducts') IS NOT NULL	
	DROP TABLE #CustomerProducts

So, whats next? Well, what if another candy bar type is sold? What if 10 different types are sold? We can continue to update the PIVOT query above, or we can use a Dynamic PIVOT Query. In another post I will describe how to create a Dynamic PIVOT Query.