Rules of Engagement – NoSQL Column Data Stores

Well Rick Krueger( blog | @DataOgre ) and I are back at it again. We have decided to pair up and do a talk about a rather controversial topic in the SQL Server community, NoSQL. Now we are not jumping ship and saying that NoSQL is the way of the future, because that is simply not true. Relational databases have been around since the 70′s and still outperform NoSQL Solutions at most tasks. The idea that we are going to seed is that these NoSQL solutions may have a place in enterprise applications. We are not going to say you should use SQL Server or a NoSQL Solution exclusively, rather we are going to claim Polyglot Persistence. Polyglot Persistence is the idea of using multiple data stores to solve multiple problems, hence Rules of Engagement: NoSQL is SQL Server’s Ally. In a previous blog post I discussed Key Value Stores, in this blog post I will be discussing another version of the NoSQL solutions, Column Data Stores.

Column Data Stores

Column Data Stores and Relational Database share some concepts like rows and columns. However, column data stores do not require columns to be defined, nor is space allocated for each column. A column is simply a key value pair, where the key is an identifier and the value stores values related to the key. Columns can be added without all of the overhead that is incurred with a Relational Database. A column can exist for every row, some rows, or only one row. A row can have many different columns or just one. Being that the primary focus of column data stores is the column, performing aggregations or returning column data becomes very fast. The image below depicts a sample column store, with a concept of column families, which are groupings of like column data. This concept of groupings help with more complex queries.ColumnDataStoreThe image above shows a simple example of what a column’s key values could look like, however they can store much more. The columns themselves can store rows and columns. Thinking of this as a table with-in a table would be correct. In a relational database we would use a secondary table linking back to the primary with a foreign key. The example below shows the comparison between a relational model and a column store model.ColumnDataStoreAdvancedThe big advantage here is that we are bringing the related data closer to each other, which makes data retrieval fast. However trying to aggregate this type of data becomes difficult and requires a Map Reduce job.

Just like the Key Value stores two big benefits to Column Stores are horizontal scaling and the lack of a schema requirement. However, with Column Stores, having a good understanding of the tables and key structure defined prior to development helps implementing a useful data store.

Column Data Stores Implementations

One of the most recognized Column Data Stores comes from the Hadoop world, it’s called HBase. Apache HBase is the database that sits on top of Hadoop and HDFS (Hadoop File System). HBase was designed to handle big data, millions of records with millions of columns type of big data. This is something you should keep in mind when selecting a data store. If you have big data, then looking at a data store solution like HBase is worth the effort. One of the biggest benefits that I have seen is the ability to store all the data that we would normally have to archive in a RDBMS. Here is an example, think of an Automobile company like GM. They have been selling cars and servicing cars for years. This would be big data! In a RDBMS we would have to eventually archive data to keep costs down, however with an HBase solution, we simply add more nodes to the Hadoop cluster and let HDFS work its magic. Imagine being able to go back and getting every service record for a car built in the 50′s, this is what HBase and Hadoop can give you.

Polyglot Persistence, SQL Server and Column Data Stores

An example would be an enterprise application that stores US census data. Here again we have big data. Let’s say the requirements are that we need to store the past 50 years of census data. However only the past 3 years need to be returned in real time, whereas the rest of the data could be returned over time, maybe in a queued report. We could store all 50 years of data in a RDBMS, however that could get rather costly and most likely will impact performance. Rather, we could store the past 3 years in a RDBMS and push the remaining data into a NoSQL solution and scale horizontally as needed. This hybrid data solution is polyglot persistence and best of all solves the problem at hand.

Conclusion

I am not recommending that you re-write your applications to make room for a NoSQL solution, I am simply suggesting that you as the DBA (keeper of data), should keep an open mind to other possibilities. If there is another solution out there that could assist SQL Server, or is maybe a better fit, exploring the avenue might result in a better data solution. As always, a round peg will fit into a square hole if you have a big enough hammer.

Rules of Engagement – Key Value Stores

Well Rick Krueger( blog | @DataOgre ) and I are back at it again. We have decided to pair up and do a talk about a rather controversial topic in the SQL Server community, NoSQL. Now we are not jumping ship and saying that NoSQL is the way of the future, because that is simply not true. Relational databases have been around since the 70’s and still outperform NoSQL Solutions at most tasks. The idea that we are going to seed is that these NoSQL solutions may have a place in enterprise applications. We are not going to say you should use SQL Server or a NoSQL Solution exclusively, rather we are going to claim Polyglot Persistence. Polyglot Persistence is the idea of using multiple data stores to solve multiple problems, hence Rules of Engagement: NoSQL is SQL Server’s Ally. In this blog post I will be discussing the simplest version of the NoSQL solutions, Key Value Stores.

Key Value Stores

KeyValueStore

There are many Key Value store implementations, however at the root of each solution, they are simply storing values by a specific key. You can think of this as a two column table, a hash table, or an associative array.
They use a unique indexed key, which allows for fast retrieval of data, and a blob type of field for the value. The key value stores are typically written in some type of programming language, commonly Java. This gives the application developer the freedom to store data how they see fit, in a schema-less data store. Yes I used application developer, freedom, data, and schema-less all in the same sentence. This is a good thing. Typically we, as the DBA’s, are not going to be distracted by how the data is stored in the Key Value Store. Rather, we will be involved with serving up the data that will be pushed into the Key Value store, and helping with any transaction support that is needed.

One of the biggest benefit for most NoSQL solutions, including Key Value Stores, would be horizontal scaling. We all know that horizontal scaling and SQL Server, while it’s possible, does not play well. Typically if you need more from SQL Server you scale vertically, which can be costly.

Another benefit for Key Value stores is a lack of schema, this allows for changing the data structure as needed, thus being a bit more flexible. Whereas with SQL Server altering a table could result in stored procedures, functions, views, etc… needing updates, which take time and a DBA resource.

Key Value stores support “Eventual Consistency”, if a feature in your application doesn’t need to fully support ACID, then may not be a significant draw back. However the features of your application that need ACID transaction support should use a RDBMS, like SQL Server.

Another drawback is querying data from a Key Value store. Simple queries for keys in Key Value stores are a minor task because they are indexed. However, supporting joins or complex queries takes coding by an application developer. Code developed queries lose out on years of SQL Server optimization and caching features.

A Key Value Store Implementation

RedisExampleRedis
Redis is an advanced Key Value store, meaning that it can support more complex data types such as Lists, Objects, Hashes, Sets, etc… Redis is a combination of an in-memory and persisted to disk data store. This gives Redis great performance on reads and writes, however it puts a limitation on what can be stored, as memory becomes a constraint. The basic commands for CRUD operations in Redis are rather simple: GET, SET, and DEL. http://try.redis.io/ offers an online utility for learning some of the commands.

Polyglot Persistence, SQL Server and Key Value Stores

Obviously there are pros and cons for Key Value Stores, however creating a hybrid solution, could solve the drawbacks for both SQL Server and Key Value Stores. Every application is different and ultimately you will need to decide what the best fit is, for your application and budget.

With that said an example for a Key Value Store could be session data. Where the session id would be the key and all the session data would be in the value. An in-memory database, such as Redis, could produce quick results to the UI, significantly improving the web applications performance. However when that session ends, all of the updated detailed user information would ultimately need to be persisted, let’s say in a SQL Server Users table.

Or maybe a shopping cart feature, again the shopping cart id would be the key and the contents of the shopping cart would be in the value. What happens when the user checks out? That’s right, we will need to store that transaction, user information, product information, and inventory into a RDBMS such as SQL Server

Conclusion

I am not recommending that you re-write your applications to make room for a NoSQL solution, I am simply suggesting that you as the DBA (keeper of data), should keep an open mind to other possibilities. If there is another solution out there that could assist SQL Server, or is maybe a better fit, exploring the avenue might result in a better data solution. A square object will fit into a round hole if you have a big enough hammer, but could take time and produce a bad product.

T-SQL Tuesday #39 – Can you shell what the PoSH is Cooking?

TSQLTUES

Wayne Sheffield (blog | @DBAWayne) invited me to a party, and anyone that knows me, knows I love a good party. This month’s party is all about how we have integrated Posh into our SQL lives. Thank you Wayne for hosting this month’s T-SQL Tuesdays #39 Can you shell what the PoSH is Cooking?

I have to admit that I am relatively new to PowerShell and all its functionality, however I have quickly learned the power of such a tool.

Before this party started (T-SQL Tuesday #39), I was tasked with creating a process to upload reports to Reporting Services. This process had to be repeatable, automatic, and configurable. My initial suggestion was to just upload each report to the necessary folders, however that still includes the human factor, so repeat-ability is out. After digging around on the inter-webs I came across examples of SSRS report push using a Web Services Proxy. After a few modifications, I was able to use a couple Cmdlets and the Web Services method to upload reports. In this blog post I am going to show how to use Powershell v2.0 Cmdlets and Proxy calls for uploading SSRS reports to a Reporting Services Web Server.

The following example demonstrates how to upload a report (RDL file) and load the report to appropriate folder using the Reporting Services Web Services.

$SSRSService = "http://localhost/ReportServer/ReportService2005.asmx"
$SSRSReport = "C:\Dev\SSRSReport.rdl"
$SSRSNameWithoutExtension = [System.IO.Path]::GetFileNameWithoutExtension($SSRSReport);
$SSRSPath = "/"
 
Write-Output "Reporting Services Server: $ReportingServices"
$SSRSProxyObject = New-WebServiceProxy -Uri $SSRSService -UseDefaultCredential 
 
Write-Output "Converting $SSRSReport to a byte array"
$ByteArray = Get-Content $SSRSReport -Encoding byte     
 
Write-Output "SSRS Report Name: $SSRSNameWithoutExtension "
$SSRSProxyObject.CreateReport($SSRSNameWithoutExtension, $SSRSPath, $true, $ByteArray, $null)

The key PowerShell statements are:

New-WebServiceProxy (Cmdlet)

The New-WebServiceProxy Cmdlet is used for creating a web services proxy object. We will use this to create a proxy object from our Reporting Services Servers Web Services.
New-WebServiceProxy Cmdlet
Example:

$SSRSProxyObject = New-WebServiceProxy -Uri $SSRSService -UseDefaultCredential

Get-Content (Cmdlet)

We will use the Get-Content Cmdlet to get the byte array representation of the RDL file.
Get-Content for FileSystem Cmdlet
Example:

$ByteArray = Get-Content $SSRSReport -Encoding byte

CreateReport (SSRS Web-Service Call)

After we have our Reporting Services Proxy object, we can use it to call a CreateReport() method, which uploads the report to your selected $SSRSPath.
ReportingService2005.CreateReport Method
Example:

$SSRSProxyObject.CreateReport($SSRSNameWithoutExtension, $SSRSPath, $true, $ByteArray, $null)

Dynamic T-SQL Pivot

In a prior blog post (Standard T-SQL PIVOT) I showed how we can use a PIVOT query to take distinct values in a column and make those values new columns. However,  data changes and new columns may be needed. Looking at the data from the prior post, we only had three types of candy bars. What happens when a new candy bar shows up in our transactions? Well, we could update our query to use the new column value. If another new candy bar shows up we could update the query again. This process could go on and on and on. I am going to propose an alternative: using a dynamic PIVOT query. In this blog post I am going to show how we can use dynamic SQL to implement a dynamic PIVOT query.

Let’s start out by looking at the original data:

CustomerID ProductName Quantity
1 Snickers 1
1 Butterfinger 1
1 Rolo 2
1 Snickers 1
1 Butterfinger 1
2 Rolo 2

After using a PIVOT query we will see the data as described in the table below:

CustomerIDButterfingerRoloSnickers

CustomerID Butterfinger Rolo Snickers
1 2 2 2
2 NULL 2 NULL

Now update the original data with a few new products:

CustomerID ProductName Quantity
1 Snickers 1
1 Butterfinger 1
1 Rolo 2
1 Snickers 1
1 Butterfinger 1
2 Rolo 2
1 MilkyWay 3
2 Twix 2

If we apply that same query from the prior blog post, listed below, we see the same results as before.

SELECT   [CustomerID], 
         [Butterfinger], [Milkyway], [Rolo], [Snicker]
FROM    #CustomerProducts P 
PIVOT ( SUM(Quantity) FOR P.ProductName IN (
                          [Butterfinger], [Milkyway], [Rolo], [Snicker])
) AS PVT
CustomerID Butterfinger Rolo Snickers
1 2 2 2
2 NULL 2 NULL

But what we really wanted was

CustomerID Butterfinger Milkyway Rolo Snickers Twix
1 2 3 2 2 NULL
2 NULL NULL 2 NULL 2

As I mentioned before, we could simply update our query and this blog post is over. However, we are going to push on and create our dynamic version. Let’s take a look at creating the dynamic portion of the PIVOT query. The key part of the dynamic query is the distinct column list. Getting the distinct columns is just a simple DISTINCT query. To get the comma separated column list we will employ a trick that involves a COALESCE or ISNULL statement.

SELECT @PivotList = COALESCE(@PivotList + ',', '')

Once we have the distinct comma separated list, we can start to build out the dynamic query. We will create an nvarchar variable that we will use to store our dynamic SQL statement, and then we will execute that statement with the sp_executesql function.

Looking at our PIVOT query again, what we will want to do is replace the distinct columns with our comma separated list of values. Then we will take and store the new query into our variable

SELECT   [CustomerID], 
         [Butterfinger], [Milkyway], [Rolo], [Snicker]
FROM    #CustomerProducts P 
PIVOT ( SUM(Quantity) FOR P.ProductName IN (
                          [Butterfinger], [Milkyway], [Rolo], [Snicker])
) AS PVT

Converts to this

SET @Output = 'SELECT  [CustomerID]
	, ' + @PivotList + '
FROM    #CustomerProducts P 
PIVOT ( SUM(Quantity) FOR P.ProductName IN (' + @PivotList + ') ) AS PVT'

Now all that is left is to execute the stored query using the sp_executesql function. Here is the entire Dynamic SQL Pivot Script:

CREATE TABLE #CustomerProducts (
   CustomerID int
 , ProductName nvarchar(50)
 , Quantity int)
 
INSERT   INTO #CustomerProducts
         (CustomerID, ProductName, Quantity)
VALUES   (1, N'Snicker', 1),
         (1, N'Butterfinger', 1),
         (1, N'Rolo', 2),
         (1, N'Milkyway', 1),
         (1, N'Snicker', 1),
         (1, N'Butterfinger', 1),
         (2, N'Rolo', 2),
         (1, N'Milkyway', 3),
         (2, N'Twix', 2);
 
DECLARE @Output nvarchar(max) = N''
 , @PivotList varchar(max)
 
SELECT   @PivotList = COALESCE(@PivotList + ', ', N'') + N'[' + ProductName + N']'
 
FROM     (
          SELECT DISTINCT
                  ProductName
          FROM    #CustomerProducts
         ) AS CustProds;
 
SET @Output = 'SELECT  [CustomerID]
	, ' + @PivotList + '
FROM    #CustomerProducts P 
PIVOT ( SUM(Quantity) FOR P.ProductName IN (' + @PivotList + ') ) AS PVT'
 
EXEC sp_executesql @Output;
 
IF OBJECT_ID('tempdb..#CustomerProducts') IS NOT NULL	
	DROP TABLE #CustomerProducts

Standard T-SQL PIVOT

Pivoting data is a common practice while analyzing data. The basic idea is to take a single column’s data and pivot the unique values into separate columns. In this post I am going to show a simple example of how we can implement a PIVOT statement in a query. In a future post I will discuss how we can create a dynamic PIVOT based on the functionality below. SQL Server 2005 introduced functionality to help with pivoting data. You can PIVOT data using the following syntax

SELECT  [AnyColumn], [PivotCol1],[PivotCol2],...,[PivotColN]
FROM    (
	   SELECT  [AnyColumn]
                 , [DataColumn]
		 , [AggregateColumn]
	   FROM    SomeTable	   
	  ) P PIVOT ( SUM(AggregateColumn) FOR DataColumn IN ([PivotCol1],[PivotCol2],...,[PivotColN]) ) AS PVT

So, let’s take a look at a bit of data. The temp table #CustomerProducts below stores a snippet of transaction data. This data is describing the number of candy bars ordered per transaction by a given customer, over time.

CREATE TABLE #CustomerProducts (
   CustomerID int
 , ProductName nvarchar(50)
 , Quantity int)
 
INSERT   INTO #CustomerProducts
         (CustomerID, ProductName, Quantity)
VALUES   (1, N'Snickers', 1),
         (1, N'Butterfinger', 1),
         (1, N'Rolo', 2),
         (1, N'Snickers', 1),
         (1, N'Butterfinger', 1),
         (2, N'Rolo', 2);
CustomerID ProductName Quantity
1 Snickers 1
1 Butterfinger 1
1 Rolo 2
1 Snickers 1
1 Butterfinger 1
2 Rolo 2

However, we would like to see how many candy bars each customer has ordered total, as described in the table below:

CustomerID Butterfinger Rolo Snickers
1 2 2 2
2 NULL 2 NULL

So how do we accomplish this task? Simple, we will implement a PIVOT query…

Before we can develop the PIVOT query we need to determine the PIVOT columns. These columns will come from the data located in the product name column. According the the table above we want to use the following columns: Butterfinger, Rolo, and Snickers. Next we will want to determine the column that we are going to aggregate. In our example, we will use the Quantity column and we will summarize the results. With this information we can plug it into the a PIVOT query and get the query listed below.

CREATE TABLE #CustomerProducts (
   CustomerID int
 , ProductName nvarchar(50)
 , Quantity int)
 
INSERT   INTO #CustomerProducts
         (CustomerID, ProductName, Quantity)
VALUES   (1, N'Snickers', 1),
         (1, N'Butterfinger', 1),
         (1, N'Rolo', 2),         
         (1, N'Snickers', 1),
         (1, N'Butterfinger', 1),
         (2, N'Rolo', 2);
 
 SELECT  [CustomerID]
	, [Butterfinger],[Rolo],[Snickers]
FROM    (
	   SELECT  [CustomerID]
			 , [ProductName]
			 , [Quantity]
	   FROM    #CustomerProducts	   
	  ) P PIVOT ( SUM(Quantity) FOR P.ProductName IN ([Butterfinger],[Rolo],[Snickers]) ) AS PVT
 
IF OBJECT_ID('tempdb..#CustomerProducts') IS NOT NULL	
	DROP TABLE #CustomerProducts

So, whats next? Well, what if another candy bar type is sold? What if 10 different types are sold? We can continue to update the PIVOT query above, or we can use a Dynamic PIVOT Query. In another post I will describe how to create a Dynamic PIVOT Query.