Friday, April 2, 2010

Microsoft Association Rules / Data mining in SQL Server 2008

Since the release of SQL Server 2005 Microsoft has dabbled quite a bit with Data mining models. While I have read that Data mining capability has been available with one of the later releases of SQL Server 2000, I haven't been able to come across too much material on the same.

First up I would like say the best place to start if you are serious about learning about Data mining is to go get yourself this really good book on the topic. Its extremely easy to read and the examples used are quite practical.

The book is

Data Mining with SQL Server® 2008

Jamie MacLennan
ZhaoHui Tang
Bogdan Crivat

Back to original reason of the post, being a techno functional individual, I was keen to see how we could effectively cross sell products to stores since we had their complete sales history. Ensuring that our consumers never left the store without the products they wanted goes a long way in ensuring consumer loyalty and company performance.

Having most of the data already available in SQL Server 2008 I was keen to see what we could use to help identify these gaps. In SQL Server 2008 Microsoft has the following Data Mining Models

Microsoft Na─▒ve Bayes
Decision Trees Algorithm
Time Series Algorithm
Microsoft Clustering
Sequence Clustering
Association Rules
Neural Network and Logistic Regression

The best fit for the requirement I had was the Microsoft Association Rules. It allowed me to build the model utilizing the classification of outlets that we already had.

I built the model using three classifications we had for our outlets along with products that the outlet already sold. In order to do this I needed to create a nested table for the products as one outlet could sell multiple products. Also important to note that the data types for Data mining objects are slightly different from the data types that are used by standard SQL queries. The DMX I used is below.

Post creating the structure , I applied the Microsoft Association Rules Algorithm requested the prediction of the product utilizing the standard support and probability the model is generally used with.

The next step was to feed in data into the model to let it build the association rules. This was done using the following syntax. Am important point to note here is that the key data needs to be sorted. Sometimes in spite of sorting it still gives errors saying the data most probably is not sorted. The solution for this is to ensure that the key that is used is specifically a nvarchar source data type. I faced a problem when my key happened to have numeric data as well as string data. Needed to convert everything before I could successfully load the data into the mining structure.

Once the data is loaded into the model, the last step is to retrieve out the recommendations that the model would generate. I have done this for the store ABC by passing the store along with the products that the store purchased into the model and asking it to return 5 recommendations based on the model as well as the classification of the store and the products that the store sells.
This is done through the following DMX statement.

In case you want this data in the standard query windows you could always create a Linked server to OLAP where the Data Mining model exists.

Do let me know if you have any comments !