Thursday, October 29, 2015

Microsoft goes All-In on Analytics

It has been a busy few months in the data analytics space, especially at Microsoft.  To say the least.  Next year will be even busier, when many of these solutions converge further.

Microsoft announced General Availability of PowerBI, bringing many of its cloud features to the desktop.

Jen Underwood reveals the splendor in PowerBI Desktop.  PowerBI Desktop can be published to a Pyramid Analytics server for on-prem solutions.  PowerBI isn't just in the cloud anymore.

Microsoft acquires Datazen, and mobile dashboards come out of a Datazen server in SQL 2016.

SQL 2016 with R & JSON integration.  Hold up, Revolution Analytics is coming to SQL?  JSON in SQL?  No more FOR XML madness?

New JSON support in T-SQL, the SQL Server query language, includes OPENJSON to parse JSON text and return tabular data, JSON_VALUE and JSON_QUERY for querying JSON data, and ISJSON for validating JSON-formatted text.
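For a taste, here's a minimal sketch of the new functions against a made-up JSON document (the names and shape are illustrative; OPENJSON needs database compatibility level 130):

DECLARE @json NVARCHAR(MAX) =
    N'{"order":{"id":42,"customer":"Contoso","lines":[{"sku":"A1","qty":2}]}}';

-- Validate, pull scalar values, and pull a JSON fragment
SELECT ISJSON(@json)                         AS IsValidJson,
       JSON_VALUE(@json, '$.order.customer') AS Customer,
       JSON_QUERY(@json, '$.order.lines')    AS LinesJson;

-- Shred the array into rows and columns
SELECT sku, qty
FROM   OPENJSON(@json, '$.order.lines')
       WITH (sku VARCHAR(10) '$.sku', qty INT '$.qty');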
The syntax for the R external functions seems suspiciously similar to what Apache Hive provides...
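For comparison, here's a minimal sketch of the SQL Server R Services call as it appears in the SQL 2016 previews (the table and column names are made up):

EXEC sp_execute_external_script
     @language     = N'R',
     @script       = N'OutputDataSet <- data.frame(AvgSales = mean(InputDataSet$SalesAmount))',
     @input_data_1 = N'SELECT SalesAmount FROM dbo.FactSales'
WITH RESULT SETS ((AvgSales FLOAT));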


ML Blog Team talks about Jupyter Notebooks for Azure ML Studio and Nitin Mehotra gets us started with Zeppelin Notebooks and Apache Spark on Azure HDInsight.

Michael Erasmus gives us an interesting mashup of Python and SQL running in a Jupyter notebook. 

What's nice about this is the inline SQL statement, effectively removing the need for quoting & appending multi-line SQL together.  Also useful is the ability to capture the results of the SQL query as a data frame for analysis and visualization.

A similar method might be used with MDX queries and Analysis Services OLAP cubes.

Perhaps if you have some web services connected to Excel Services or PowerPivot, this method could get really useful for exposing your data.  I would imagine writing some nicely formatted DAX queries in Jupyter Python and returning data frames to R would be of interest to a couple of people... well, maybe just me. ;)

Speaking of DAX...

Kasper de Jonge writes that SQL 2016 includes over 50 new DAX functions.

Some of the cool ones are NATURALINNERJOIN,

CONCATENATEX(Employees, [FirstName] & " " & [LastName], ", ")
Returns "Alan Brewer, Michael Blythe"

SUBSTITUTEWITHINDEX (Left Semi Join in PowerPivot?)

ADDMISSINGITEMS (What do you mean we have no sales for March in Tuvalu? We need to see a March column for Tuvalu on the report!)

Only 46 functions left to check out...

Tuesday, October 13, 2015

Unit Testing and Mocking T-SQL Objects with tSQLt

tSQLt is an open-source framework for unit testing SQL Server code.  You can get it here.  The Red Gate SQL Test application provides an installation, management framework, support, and testing harness for SQL Server Management Studio.  ApexSQL has also announced a competing offering, to be released in 2016.

The most useful functionality, in my opinion, is the ability to mock or fake objects.  See Greg Lucas' post "A mock too far?" for some useful scenarios.  There are many other tSQLt tutorials and examples on the DataCentricity web site for Test Driven Database Development (TD3) using mocks.

What do you mean, mock or fake objects?
When running a test or test suite with a SQL testing framework like tSQLt, you arrange your test, act on the item under test, and assert your results match expectations.  The framework handles the assertions, teardown and rollback of any items changed by the test.  What mocking allows for is the faking of dependent items not central to the purpose of the test.  It provides a shell or template object that is a clone of an existing object without the clutter of reams of data, foreign keys or constraints unrelated to the test.

tSQLt allows you to fake tables.  Behind the scenes, tables are renamed, constraints are removed and all columns are set to nullable.  This allows you to assert / check only the columns and rows of data required for your test.  Instead of sifting through millions of rows looking for your changes, you can assert that your input row was added to the destination table with a simple count.  You can create an expected table, and use tSQLt's AssertEqualsTable to compare that the columns you are testing contain your expected results.  You can take your expected results from a table, fake the table, then ensure that the results are repeated after execution of the object under test.
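Here's a rough sketch of what that looks like, assuming a hypothetical dbo.LoadOrders procedure that moves rows from dbo.StagingOrders to dbo.Orders:

EXEC tSQLt.NewTestClass 'OrderTests';
GO
CREATE PROCEDURE OrderTests.[test LoadOrders inserts one row per staged order]
AS
BEGIN
    -- Arrange: fake both tables so constraints and existing data don't interfere
    EXEC tSQLt.FakeTable 'dbo.StagingOrders';
    EXEC tSQLt.FakeTable 'dbo.Orders';
    INSERT INTO dbo.StagingOrders (OrderId, Amount) VALUES (1, 10.00);

    -- Act
    EXEC dbo.LoadOrders;

    -- Assert: compare only the columns that matter
    CREATE TABLE #Expected (OrderId INT, Amount MONEY);
    INSERT INTO #Expected (OrderId, Amount) VALUES (1, 10.00);
    SELECT OrderId, Amount INTO #Actual FROM dbo.Orders;
    EXEC tSQLt.AssertEqualsTable '#Expected', '#Actual';
END;
GO
EXEC tSQLt.Run 'OrderTests';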

It's kind of like Where's Waldo, with only a white page and Waldo sitting in the middle of it.  It becomes very obvious when things are and aren't working, since the noise has been removed from your test.

tSQLt allows you to spy on procedures.  Behind the scenes, procedures are emptied out, and only the shell or "contract" remains.  The parameters of the procedure remain, with an insert into a logging table.  This allows you to inspect what is being sent to procedures to ensure the item under test is passing along the expected parameters.  It also allows you to inject code into dependent procedures, perhaps to test functionality or changes that would exist in the future once that other team gets finished with their deliverable.
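A sketch of the spy side, with hypothetical names (dbo.ProcessOrder is the item under test and calls dbo.SendOrderAlert internally):

EXEC tSQLt.SpyProcedure 'dbo.SendOrderAlert';

EXEC dbo.ProcessOrder @OrderId = 42;

-- tSQLt logs each call and its parameter values into an auto-created log table
SELECT OrderId
FROM   dbo.SendOrderAlert_SpyProcedureLog;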

These types of actions follow the Test Double pattern.

How can I use this when debugging SQL code?
Fake out and spy on all dependent objects to get a better understanding of how the procedure under test is acting against them.

I have found Faking and Spying are also useful when debugging and understanding poorly documented procedures, since you can interactively view your data during a test, roll back the changes, and visualize what occurred in the destination table.  If you are working with a large, legacy procedure or set of procedures, this tool becomes invaluable for tracing data lineage and transformation logic.   A side effect is you'll probably end up with some automated unit tests on your legacy code.

Martin Fowler describes these mocks or fakes as a kind of Stunt Double, taking the bombardment of your tests while your true actor is sitting on the sidelines.

In the past, I used synonyms for similar purposes, however creating fakes for testing rather than replacing synonyms is a much easier task with tSQLt.

If you are using SQL Server for more than just a big dumb ORM storage or CRUD layer, it's worth looking at a unit testing framework such as tSQLt and the ecosystem around it.

Sunday, September 13, 2015

Reporting Services 2016

After a couple of years' hiatus, the SSRS Team (via Riccardo Muti) has published some highlights of SQL Server Reporting Services 2016 CTP 2.3.

Most of the highlights are for usability and appearance. No Metro tiles?

Rather than SSRS features, my bigger interests within SQL 2016 Enterprise are Polybase and the SQL-R functionality.  Polybase will allow you to join your Data Lake with the Data Ocean that is your Enterprise Data Warehouse, and all things in between.

"PolyBase only works within PDW for now, but later it might be added to SQL Server (but there are no plans for that).  PolyBase relies on the Data Movement Service (DMS) in PDW, and DMS does not exist in SQL Server"

Apparently things change.  For those companies having file shares filled with structured, semi-structured, and unstructured data, replicating that data to a Hadoop HDFS cluster can allow you to immediately join it to SQL Server without ETL.

Or so I hear... :)

This might be a game-changer for some organizations.

Andrew Peterson spent some time with Polybase on his blog.

A few years ago I built a solution that used nothing but views in the data source view to feed an Analysis Services multidimensional cube.  Some of those views actually queried Excel spreadsheets, Access databases and CSV files instead of SQL Server.  At the time I thought it was kind of a mashup hack, though cool nonetheless due to lack of ETL effort and instant data availability.

Glad to see MS is catching up with this idea of schema-on-read, query anything, anywhere.  You're still going to need to create external, structured views on each file format against a specific Hadoop cluster, which seems rather linked-server'y.  There are some other limitations to be aware of.  You'll also need access to a Hadoop cluster, of course.  You do have one of those, right?
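A rough sketch of those external objects in the SQL 2016 previews (the locations, formats and names here are placeholders):

CREATE EXTERNAL DATA SOURCE MyHadoop
WITH (TYPE = HADOOP, LOCATION = 'hdfs://namenode:8020');

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

CREATE EXTERNAL TABLE dbo.ExtWebLogs
(   LogDate DATETIME2,
    Url     VARCHAR(500),
    Hits    INT )
WITH (LOCATION = '/data/weblogs/', DATA_SOURCE = MyHadoop, FILE_FORMAT = CsvFormat);

-- Join HDFS data straight to warehouse tables, no ETL
SELECT d.CalendarYear, SUM(w.Hits) AS Hits
FROM   dbo.ExtWebLogs w
       JOIN dbo.DimDate d ON d.FullDate = CAST(w.LogDate AS DATE)
GROUP  BY d.CalendarYear;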

The Hybrid BI story is just beginning...

Wednesday, September 09, 2015

SQL Treeo adds folders to your SSMS

Microsoft, why can't you have packages like Oracle or at least a folder structure in Object Explorer?

SQL Treeo brings sanity to monolithic databases and multi-tenant servers, capturing objects into folders.

Saturday, August 08, 2015

SQL Server Collation Settings, Quelque chose de vraiment grave

Saw this post about a Windows 10 error message a couple weeks ago on Facebook and laughed my head off.

Microsoft has gone ahead and adopted the Sad Mac philosophy of error handling.
From a consumer's perspective, I guess this yodel is a tad better than Catastrophic Failure and reassures you that something might happen next.

From a developer's perspective, you really should have included the bugcheck code, stack trace, or at least some other kind of error identifying the method call you are finally giving up on.

If not, one fix for fellow Canadians installing Windows 10 will be to ensure you meld with our neighbours in the south by selecting English (US) as the regional locale.  You selected French (Canada), didn't you?

Zut alors.  Didn't SQL have this issue a few years ago?  And Analysis Services?

Sorry MS, as a Windows 10 Insider I did not catch this bug.  Actually, I didn't really catch any bugs.  My Windows 10 install over the last year was a pleasant surprise, and I expect my main desktop to be upgraded soon.  Perhaps my only suggestion would be that if English (Canada) is selected, then error messages should be bilingual.  Seems like a waste of the second line of Something happened.

A family member I provide tech support for asked me to just hide the "Upgrade to Windows 10" icon rather than upgrade him.  From a support perspective I thought that was wise for now...  I used the scheduled tasks and registry edit methods.

A few times in the last year I've encountered various collation issues related to Windows Regional Settings and SQL Server installs, and the defaults MS uses for products like Sharepoint and Reporting Services.

To switch over collation of a single database, you can try this at your own risk.

1. ALTER the database to change the collation (after killing all processes that are using the database and dropping all user defined functions with varchar max that bind to the collation)

2. ALTER the columns (this script may need to be modified to include schema IDs when dealing with multiple schemas; it was tested on SQL 2008 R2):

SELECT 'ALTER TABLE ' + o.NAME + ' ALTER COLUMN '
       + c.NAME + ' ' + t.NAME + '('
       + Cast(c.max_length AS NVARCHAR)
       + ') COLLATE Latin1_General_CI_AS'
       + CASE WHEN c.is_nullable = 1 THEN ' NULL' ELSE ' NOT NULL' END
       + ';'
-- note: max_length is in bytes, so halve it for nchar/nvarchar and handle -1 (MAX) separately
--select c.collation_name, c.is_nullable, c.max_length
FROM   sys.objects o
       INNER JOIN sys.columns c
               ON c.object_id = o.object_id
       INNER JOIN sys.types t
               ON t.user_type_id = c.user_type_id
WHERE  c.collation_name IS NOT NULL
       AND o.type = 'U' 
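For step 1 itself, the database-level change is a one-liner (the database name is illustrative):

ALTER DATABASE MyDatabase COLLATE Latin1_General_CI_AS;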

Wednesday, July 29, 2015

Ninite - this mass installer is a pretty awesome way to build up a new PC.  Fresh install of Windows 10?  Check off Classic Start Menu and whatever optional software you want...

Ninite will

  • start working as soon as you run it
  • not bother you with any choices or options
  • install apps in their default location
  • say no to toolbars or extra junk
  • install 64-bit apps on 64-bit machines
  • install apps in your PC's language or one you choose
  • do all its work in the background
  • install the latest stable version of an app
  • skip up-to-date apps
  • skip any reboot requests from installers
  • use your proxy settings from Internet Explorer
  • download apps from each publisher's official site
  • verify digital signatures or hashes before running anything
  • work best if you turn off any web filters or firewalls
  • save you a lot of time!

Wednesday, February 11, 2015

Kerberos & Web Site

Sometimes all you need is one tool for a tough job.  If you're setting up Kerberos, that tool is the Kerberos Delegation Config & Report.

Monday, November 03, 2014

DOS tricks & tips

Because sometimes simplicity is best.

type *.csv >> data.txt - consolidate multiple files to a single file
copy a.csv + b.csv c.csv - consolidate multiple files to a single file

Date formatting

Friday, July 04, 2014

The Myth of the Full-stack DBA

Andy Shora recently published a great article about "The Myth of the Full-stack Developer."  From his posts and tweets, I peg Andy as a front-end developer specializing in CSS and Javascript frameworks.

Andy's comment about the 30-point spider diagram really resonated with me.

Where do your web development skills lie?

The term Full-Stack applies to knowing different layers of architecture, of which the database (SQL/NoSQL/SomeSQL) can be one of those layers.  However, if you were to consider yourself a "Full-Stack" SQL Server DBA, you might know the details of some or all of the following components of SQL:
  1. SQL Server
    1. Transact-SQL
    2. Backup/Restore
    3. Replication
    4. Log Shipping
    5. Query Tuning and Indexing
    6. Policies and Resource Governor
    7. High availability
    8. In Memory Querying
  2. SQL Agent
    1. Jobs
    2. Alerts
  3. Reporting Services
    1. Sharepoint Integrated Mode
    2. Native Mode
  4. Integration Services
    1. Importing/Exporting/Transforming Various Data Sources
    2. Connection Types
    3. Toolbox Components
    4. Project /Package Deployment Methods
  5. Analysis Services
    1. MDX/DMX/DAX
    2. Tabular
    3. Multidimensional
    4. Sharepoint
  6. Powershell Scripting
  7. TFS Integration
And of course the various ways to setup, configure, debug, maintain, monitor, license, upgrade and secure above tools along with the associated hardware and software.  Perhaps your job description may even include mentoring developers on best practices on interfacing with the layer that is the "database".

This is by no means an exhaustive list, and depending on your environment, your role as a DBA could include Active Directory maintenance and Voicemail support.

If you're a Microsoft consultant or a partner of Microsoft, you're probably intimately familiar with the old Microsoft Hamburger and its associated toppings.  

(RIP standalone PerformancePoint Server)

If you have expertise in one or more of the grills at the bottom of the stack, (SAP, Oracle, etc) your job just got a bit more important.

For some of those components in the stack, hopefully you're using best-of-breed third party tools from companies like ApexSQL, Idera and Red Gate to improve your experience.  The ecosystem of helper apps is vast and wide.  Many feature limitations of products are considered "partner opportunities" by software companies.  If you want to be a Full-Stack DBA, you'll want to know what tools keep your stack upright.

If you were to consider yourself an expert on the internals of SQL, you might know about all of the components of this diagram, and take great pleasure in using tools like Internals Viewer to see how SQL arranges its storage and understanding the details of how GAM, SGAM & PFS Pages work.

Or you may have had a "crash" course on above internals while troubleshooting why a DBCC CheckDB job failed and your customers can't access their data.  It's best to be proactive about these things, start studying the diagram!

The ScaryDBA's recent post about The Curse Of Relational Databases also comments on The Stack of Information Technology in general, and how deep, wide, and unfathomable it is.  The comments provide insight into the mindset of the community, and one mentions the idea of "Technical Debt".  That's a topic perhaps worth exploring in detail in a future post, since this one is getting a tad rambling and epic.

Yesterday I attended a Toronto Hadoop User Group Meetup, with a discussion from the co-author of a book on YARN.  The topic was a deep-dive into the internals and features of the resource scheduler and application manager that is YARN.  Fascinating stuff (I'm not kidding!) that shows how much effort goes into the foundation of a scalable distributed data solution.

Speaking of deep, wide, and unfathomable....

The Big Data Developer Toolkit  

The Apache Open-Source Developer Toolkit

Hopefully your DBA stack expertise includes some Linux knowledge.

Monday, June 23, 2014

Choosing a model - Tabular vs. Multidimensional

Understand the limitations of the Tabular model and when Multidimensional model implementation is more appropriate.

It isn't as simple as Distinct Count = Tabular, Parent-Child = Multidimensional...

If you're still using Office 2007 or SQL 2008 R2, it might be a bit easier to make a choice...

Tuesday, June 17, 2014

Tuning SQL Server - Is your power model sucking performance?

As stated in this hilarious Knowledge Base article, changing your power options to “High Performance” (an option Microsoft conveniently hides under a “more” menu to prevent you from accidentally finding it) resolves this issue by using more power all the time. In return, you need to accept that global warming is entirely your fault. Amazingly, no performance analyser of SQL server suggests you change this option.

Sometimes the simplest things can drastically change performance.  There should be an interview-style wizard or report for those who are new to setting up a server that would interrogate WMI to find out which settings may be sub-optimal for your usage scenario.

cURL Super Powers

cURL is an extremely powerful tool that you can use, available on many platforms including Linux or Windows (via tools like Cygwin). It's powered by libcurl. Wget is another like-minded utility to download files recursively.

Here's an example to parallel download all 150+ Microsoft eBooks available from this site.
curl | grep -o -E "[0-9A-Za-z]+" | uniq | xargs -P 5 -n 1 wget --content-disposition

So what's it doing?
First, download (or curl) the url specified.
Second, grep the downloaded page for strings matching the regular expression.
Third, uniq(ue) the resultset.
Fourth, xargs to deal with the large parameter list
Fifth, wget to download the resulting URLs in a parallel fashion.

Why use both tools when they appear to do similar things?
If you're a Windows user, and can't install Cygwin, Powershell can do some of the things that wget can do.
(new-object System.Net.WebClient).DownloadFile('','C:\tmp\file.txt')
Here's a ton of GNU manuals, including wget.

Wednesday, April 02, 2014

Analysis Services 2012 and 2014 Performance Guide

Microsoft has released an Analysis Services 2012 and 2014 Performance Guide, 111 pages of multidimensional performance tuning goodness.

Is there a future for the multidimensional cube at Microsoft?  Sounds like it may stick around for awhile longer.... watch for improvements in SQL vNext.

Performance is important, and not just for improving business processes. Licensing rules for SQL 2014  will probably cause server administrators to try and keep their server utilization at full capacity.  Do more with less.

A similar whitepaper exists for PowerPivot and the tabular model in SQL 2012.

Tuesday, April 01, 2014

SQL 2014 Released, so what's the big deal?

SQL 2014 is released, in 2014 no less!  You can download the bits from Microsoft or get your free Azure trial (with a cap to ensure you don't get charged anything) to try it out.  Perhaps SQL Spade will help you to install it.  Or just browse the new features on Virtual Labs.

Microsoft has officially killed support for SQL 2005 compatibility mode, so before upgrading from prior versions be sure you've upgraded the compatibility mode of your databases.
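Checking and bumping the level is straightforward (the database name is illustrative):

SELECT name, compatibility_level FROM sys.databases;
ALTER DATABASE SalesDB SET COMPATIBILITY_LEVEL = 110;  -- 110 = SQL 2012, 120 = SQL 2014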

The database engine has various improvements, fixes and tweaks to query plan optimization and partitioning.  Hopefully the query engine is tuned for the better.

There are some additional options for impersonation and separation of duties within the database.

Reporting Services, Analysis Services, Integration Services, Data Quality Services, and Master Data Services are mostly left in the dust.  Sigh....

Some of the cloud integration stories have improved dramatically.  Backup to URL.  Migrate to VM.  Hybrid blob storage.  AlwaysOn to Azure.  Azure's ExpressRoute functionality to move the public cloud to a private direct connection is now available.  This should improve adoption for those not wanting to transfer data over the public internet.  And provide companies with a secure form of offsite backup, DRP and replication to Azure.
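Backup to URL, for example, boils down to a credential and a target (the storage account, container and key below are placeholders):

CREATE CREDENTIAL AzureBackupCred
WITH IDENTITY = 'mystorageaccount',          -- storage account name
     SECRET   = '<storage access key>';

BACKUP DATABASE SalesDB
TO URL = 'https://mystorageaccount.blob.core.windows.net/backups/SalesDB.bak'
WITH CREDENTIAL = 'AzureBackupCred', COMPRESSION;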

Perhaps the biggest story with SQL 2014 is its In-Memory (formerly Hekaton) engine.  Or what I like to call DBCC PINTABLE+++.  This, along with its ability to use solid state storage for the SQL Buffer Pool, should provide for much faster query times when it's used correctly.  Think columnstore indexes but for an entire table, and updateable, and not a separately stored object.  Columnstore indexes in SQL 2014 can now be created as updateable, clustered objects that contain the data itself, in addition to the indexes.  In Enterprise Edition, anyway...
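As a flavour of the syntax, a minimal memory-optimized table might look like this (the filegroup, path and names are made up):

ALTER DATABASE SalesDB ADD FILEGROUP imoltp_fg CONTAINS MEMORY_OPTIMIZED_DATA;
ALTER DATABASE SalesDB ADD FILE (NAME = 'imoltp_dir', FILENAME = 'C:\Data\imoltp_dir')
    TO FILEGROUP imoltp_fg;

CREATE TABLE dbo.SalesOrder
(
    OrderId   INT       NOT NULL PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
    OrderDate DATETIME2 NOT NULL,
    Amount    MONEY     NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);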

Note for SQL Server 2005:
This (DBCC Pintable) functionality was introduced for performance in SQL Server version 6.5. DBCC PINTABLE has highly unwanted side-effects. These include the potential to damage the buffer pool. DBCC PINTABLE is not required and has been removed to prevent additional problems. The syntax for this command still works but does not affect the server.

Ahhh, memories of data corruption, and arguments with other technology experts that there is no such feature post-SQL 2000 for pinning tables in memory.  "But I typed DBCC Pintable in and it came back with no errors!"  Whenever I want to do something that bypasses the internal workings of a database, I like to visit Ask Tom.  Even if you're not using Oracle, his site is a very fun and informative read.

Now back to SQL and the fine print.

There are some licensing challenges to be aware of.  One challenge: where can you get some budget for 128 GB of memory for your SQL Server Standard boxes, and 512 GB of flash memory for your buffer pool?  How about 4 terabytes on Datacenter with Enterprise edition, and some of those larger tables pinned in memory?

Fun times ahead with in-memory databases.

Friday, March 28, 2014

OCD Programming

A great StackOverflow answer uses the railroad analogy to explain branch prediction and why processors love predictability.

I think the response won the votes because of its picture and the opening line "Consider a railroad junction," which succinctly describes the problem you may have when dealing with unsorted data and path analysis.

I am consistently amazed at how much difference a good path makes in programming, especially with databases.  Ensure your tables all have primary keys.  Performance improves.  Add covering indexes for long running queries.  Performance improves.

These two things can sometimes fix many of the performance issues living within the database layer.

Clean up your code.  Ensure state is kept to a minimum, and conversations with slower systems such as databases and disk are batched up and cached.

And apparently sort anything coming down the pipe before trying to analyze it.

Predictability and patterns in your code, and being obsessive-compulsive when it comes to clean coding seems to mean a lot.  To some compilers, anyway.

Others will ignore all your hard work and try to determine the best way to solve your problem.  Seems to be the norm for the Oracle instance I'm currently using.

If you use a tool such as Windows Sysinternals process explorer, you can see what kind of effect these types of incorrect paths have on system performance.  You'll notice a lot of Access Denied, Key not Found or File not Found messages.  Systems could be so much faster if it just found the correct registry entry, filename, or internet address the first time, without going down multiple paths blindly.   Perhaps predictive analytics will soon solve this issue, so that each time you use your computer it will intelligently speed up your daily tasks.

Or perhaps I'm just being obsessive.

Tuesday, February 25, 2014

Modern IE

Great site from MS to download VMs for IE testing.

As a side effect, you can get VMs of all major MS operating systems for Linux & Mac OS hosts!

Thursday, February 06, 2014

Windows 2012 Command Prompt

Gates Spends Entire First Day Back in Office Trying to Install Windows 8.1 : The New Yorker:

The New Yorker should have done their fact-checking.  From what I have read Bill Gates is known for his trucker mouth when dealing with staff.

I downloaded a 4 GB VM of the Windows Server 2012 trial yesterday to brush up on my Microsoft O/S skills and missed the fact it was the "core" edition.  I was back in 1989 heaven when the DOS prompt came up on startup, after a nice UI for the login prompt.

It took me a few searches to figure out the UI wasn't broken, and some hacker-like PowerShell to install the UI components.  That ended up failing since, I guess, I needed to download the install ISO too.

I wonder what Bill Gates would think of his 31-year history of DOS operating systems coming full circle with Windows Server 2012.  Who would have thought in 2014 I would be staring at a command prompt typing "help".

'via Blog this'

Thursday, November 29, 2012

Performance tips for SSAS | James Serra's Blog

Performance tips for SSAS | James Serra's Blog: "Over the past few years I have developed a list of ways to improve performance in SSAS and in two of the tools that use it: ProClarity and PerformancePoint.  Below are those performance tips along with other important items to be aware of:"

The one that really helped us was to move parameters from the where clause to a subquery.  This can improve performance drastically.

'via Blog this'

Ranting about Cumulative Updates

Don't get me wrong.  Fix it fast and fix it often is a good motto to have when addressing software bugs.  However, trying to decode what was changed between versions isn't so easy.  Especially with products like Microsoft Sharepoint.

Sharepoint has had umpteen patches since Service Pack 1.  Each one is outlined in separate KB articles by install file.  This works out to about 100 tabs open in the browser to figure out what is in each cumulative update.

Is there not a database that has this stuff?

On another note, SQL 2012 SP1 is released, which doesn't include the fixes contained in CU3 & CU4, so download SQL 2012 SP1 and SQL 2012 SP1 CU1.

Monday, November 12, 2012

Big Data Vendor Landscape

A very comprehensive post on the current (well, June anyway) picture of Big Data Vendors along with plenty of contact information.

Vertica is really eating everyone else's pie.  They partner with many of the traditional front-end BI app developers, including Cognos, Microstrategy and Tableau.

Monday, October 29, 2012


Some helpful tricks with LINKMEMBER.  Linkmember is used when the cube has multiple hierarchies that could be linked together.  Eg. Order Date & Ship Date could be linked together, even though no relationship exists at the fact level.  Or in my case Security Code and Security Underlier Code.

The key issue I had was that the IgnoreUnrelatedDimensions property was set to false in the cube.  This means that you need to explicitly specify the "all" member in the LINKMEMBER function, and exclude the all member from the scope of the calculation statement.

Details here.

Thursday, October 04, 2012

Interesting MDX Feature

Apparently there's a keyword test in MDX that returns an empty value, even if the cube doesn't contain a measure called test...

select test on 0
from Adventureworks

Easter egg?

Wednesday, October 03, 2012

Humor in code

So I've been getting this error fairly regularly and thought that Mario's arch enemy had done something to my PC.  Lately it has been shutting off without warning.

The Source was "Bowser"

The browser driver has received too many illegal datagrams from the remote computer ---- to name WORKGROUP on transport NetBT_Tcpip_{-----}. The data is the datagram. No more events will be generated until the reset frequency has expired.

Apparently this means my computer is beyond repair, or at least the network card is sending out line noise. 

Another one that happens every minute since I installed Windows 8:

A corrected hardware error has occurred.
Reported by component: Processor Core
Error Source: Corrected Machine Check
Error Type: No Error
Processor ID: 1
The details view of this entry contains further information.

If it's corrected, how come it keeps showing up every minute?
If it's not an error, why is it called a corrected hardware error?

I always thought the event viewer should have ads for software that fixes event viewer errors or offers up places to buy hardware. That could be a great idea for extra MS revenue...

Reading Larry's blog, I finally figured out how I fixed that insurance office network 17 years ago.

Flipping the network cards from Auto Detect to Full Duplex solved the issue, since 1 card was blowing up the entire network.


Tuesday, September 25, 2012

Grouping in SQL RS

One issue we recently had was that grouping wasn't filtering properly on the report.  If you are having issues grouping and filtering items in Reporting Services, make sure the detail row is not deleted.  If you have to, set the Row Visibility to hidden, but don't delete the detail.

This solved our filtering issue.

Thursday, August 30, 2012

Understanding Analysis Services Deployment Modes - Analysis Services and PowerPivot Team Blog - Site Home - MSDN Blogs

Analysis Services has 3 instance install modes.  Multidimensional is for the "legacy" Analysis Services cubes.  Tabular is for a standalone tabular server hosting persisted tabular models, or temporary models created by Visual Studio.  Sharepoint mode is for integration with PowerPivot functionality in Sharepoint.

There are some key features which may force you to select both the Tabular and PowerPivot modes for different functional requirements.

Understanding Analysis Services Deployment Modes - Analysis Services and PowerPivot Team Blog - Site Home - MSDN Blogs: "Obviously, this way of connecting to a PowerPivot database is very different from connecting to a Tabular database, but the differences between Tabular and SharePoint modes go even deeper than that because the databases themselves are very different in nature. Tabular databases are permanent databases. PowerPivot databases, on the other hand, are temporary. They can be deleted from the SSAS server and reloaded from their workbooks at any time."

'via Blog this'

Wednesday, August 08, 2012

Spider Schema

The Spider Schema is an interesting approach to the distinct count performance issue with Analysis Services, and modelling data in general.  It is an approach to fix some of the flaws with star schema design.  It adds an intermediate table between dimension tables and fact tables, which house the unique key combinations for dimensions in the fact.

It sounds a bit like the Data Vault architecture promoted by Dan Linstedt, though targeted at reporting on the data rather than vaulting it.

More details here.

About Me | Spider Schema:

Tuesday, August 07, 2012

Excel 2013's Data Model Feature

Excel is turning into a scaled-down version of Microsoft Access.
The new Data Model feature lets you build out tables and relationships in your spreadsheet.  No more vlookups!

I have highlighted a new option in the create PivotTable dialog which is to “Add this data to the Data Model”. So what is this Data Model I speak of?

“A Data Model is a new approach for integrating data from multiple tables, effectively building a relational data source inside the Excel workbook. Within Excel, Data Models are used transparently, providing data used in PivotTables, PivotCharts, and Power View reports“. Read more here…
Data Models utilize the features of PowerPivot, now embedded into Excel.  They are managed by the PowerPivot add-in and are more structured than Excel tables.

It would be nice to see if there is some way of 'upsizing' or linking these data tables to SQL Server, as we had with Access.  Being able to give users the ability to store large datasets on the server without admin intervention could change the way they work with Excel.

Wednesday, August 01, 2012

Oracle ORAOLEDB Data Type Hokeyness

My current project pits me against an Oracle Data Source within an Analysis Services multidimensional cube.  We've had a few battles so far, and I think I'm getting closer to winning.

The first quirk is just connecting.  Using the ORAOLEDB.1 provider, we specify / as the user name with a blank password, and ensure the connection is trusted.  Under Advanced - Extended Properties, we specify FetchSize=10000;DefaultRowPrefetch=10000; (perhaps larger).  This is supposed to improve performance of processing the cube.

The second quirk is creating your DSV.  When using the Oracle connection, it lists all tables in all schemas.  The Oracle environment here is defined by separate schemas instead of separate databases; Microsoft likes the database segmentation approach.  In a development environment, we have access to everything.  It's not _that_ slow with 18,000 tables and three times as many views, but preselecting a schema would be a nice thing to have.

This laundry list of schemas and tables has a larger impact on the audience using Power Pivot, which relegates Oracle to an "Other Data Source" while my client uses it as their primary data source.  Searching through 18,000 tables, with the views placed at the bottom in unsorted order, is not the most friendly interface for building models.  Exposing the data using OData links is probably the preferred approach anyway... but there has to be a better way to deal with using Oracle as a data source for Power Pivot.

One major quirk/show stopper using Analysis Services against Oracle is with data types.  Sometimes the data types don't get reflected properly.  A Numeric() data type in Oracle is a dynamic type, which SQL doesn't like.  Casting this to NUMERIC(9,0) should theoretically provide you with Int, or Int32 in Analysis Services world.

It doesn't always.  In an Analysis Services data source view, sometimes it doesn't pick up your changes.  Casting a number to NUMERIC(9,0) in the Oracle view and refreshing doesn't appear to do anything.  My data type is still Int64 or Decimal or whatever Oracle decides is in my future.

The workaround is to use named queries everywhere.  Any change to the underlying data model requires changing something (case, a space, anything) in the named query.  Refresh View doesn't seem to work.

Analysis Services appears to then pick up data types properly.
NUMERIC(5,0) becomes Byte (SmallInt)
NUMERIC(6,0 - 9,0) Becomes Int32 (Int)
NUMERIC(10,0 - xxx,0) becomes Int64 (Bigint)

Anything with decimal places becomes a decimal.  Cast your decimals and round them to ensure precision doesn't blow up the cube.
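A hypothetical named query along these lines keeps the types explicit (the table and column names are made up):

SELECT CAST(ORDER_ID   AS NUMERIC(9,0))  AS ORDER_ID,    -- Int32
       CAST(UNITS      AS NUMERIC(18,0)) AS UNITS,       -- Int64
       CAST(SALE_PRICE AS NUMERIC(18,4)) AS SALE_PRICE   -- explicit decimal precision
FROM   SALES.FACT_SALES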

Seems like an acceptable workaround.  The other way is to manually change the DSV xml.  I will be avoiding that one at all costs.

When you build out your multidimensional cube, all surrogate keys and numeric identity keys should be of data type Int32.  You can use smaller sizes for smaller dimension tables, but usually the tradeoff for size/performance vs. consistency is negligible.  The exception to the Int rule is when you're expecting lots of rows in your dimensions or facts... 2 billion rows?  Int64 or BigInt is what you want.

Would like to hear from others building cubes and PowerPivot models against Oracle data sources.   What tricks or quirks do you see when dealing with non-Microsoft sources?

Wednesday, July 25, 2012

The beginning of the end of NoSQL — Too much information

Earlier this year, Gartner killed the term Business Intelligence and its associated Business Intelligence Competency Center (BICC) acronym, and "introduced" the term Business Analytics and the concept of Business Analytic Teams (BATs).

Some NoSQL-categorized vendors now prefer to be called anything but NoSQL, since this connotation lumps together various technologies into a single buzzword acronym.

CouchDB is often categorized as a “NoSQL” database, a term that became increasingly popular in late 2009, and early 2010. While this term is a rather generic characterization of a database, or data store, it does clearly define a break from traditional SQL-based databases. A CouchDB database lacks a schema, or rigid pre-defined data structures such as tables. Data stored in CouchDB is a JSON document(s). The structure of the data, or document(s), can change dynamically to accommodate evolving needs.

So if the term NoSQL is dead, what is the replacement?  NewSQL?  The lack of creativity is amazing.

A database is only a database if it is an organized collection of data.  Are NoSQL databases really organized or are they freeform?  If it's not a database, what is this unstructured set of information called?

Another term that could be headed for the Furby pile is "Big Data" which is apparently a trademark of a corporation.  Massive Data sounds better to me anyway...

The beginning of the end of NoSQL — Too much information

Speaking of Furbys...

Query Languages and technology Mashups

SQL (Structured Query Language, See-Kwell) has been around for a while.  IBM engineers developed SEQUEL (Structured English Query Language) in the early 70s to support building and querying relational models.  It was adopted as an ANSI standard in the mid-80s.  I've been using it since the days of Informix & Sybase, prior to SQL Server 6.5.  Various flavours and implementations have been created since then, with the key ones being SQL Server's T-SQL, Oracle's PL/SQL, IBM's SQL PL, and Postgres' PL/pgSQL, along with the various offshoots of those languages.

MDX (Multidimensional Expressions, or Mosha's Data Expressions) is used to query Analysis Services multidimensional and tabular cube models.  Instead of a single axis of rows being returned, there is the potential to return 128 axes of data.  Usually it's just rows, columns, and sometimes pages, since thinking in 128 dimensions hurts most people's brains.

DMX (Data Mining Extensions) is a SQL-like query language used for creating and pulling data from multidimensional mining models.  The original team lead for Data Mining at MS is now CTO at a company called Predixions, implementing mining models using PowerPivot.

DAX (Data Analysis Expressions) supersedes MDX for the Analysis Services / PowerPivot tabular models.  MDX is still used in the background when Excel queries PowerPivot models, but DAX is the way you write expressions and calculated functions.

NoSQL (Not Only SQL) isn't really a language, it is a technology categorization.  NoSQL databases like those used by Facebook and Google aren't necessarily relational in nature.  Usually they are key-value stores.  Think thin columns and deep rows.  Microsoft has Windows Azure Blob Storage for its NoSQL offering, with others on the way.  HQL (Hive Query Language) is one way of querying a NoSQL database running on Hadoop.  Not to be confused with Hibernate Query Language, used to query the ORM framework NHibernate.  Most NoSQL databases fail the ACID test, and aren't necessarily good for transactional systems.  They're great when you need to distribute and analyze massive amounts of data.

NewSQL is either a Sourceforge LDBC driver implementation created around 2003 that offers a simplified SQL syntax, or yet another buzzword coined last year to describe something that isn't SQL and isn't NoSQL.  It's easier to use, more scalable, and performs better.  It provides ACID support and all the features of new and old database technologies.  Or so the vendors of this technology space suggest...

Some products in the NewSQL space include Akiban Server and (perhaps) Microsoft SQL Azure

More info here:

Augmenting legacy SQL databases and technologies with their new "Big Data" columnstore or key-pair counterparts seems to be the Next Big Thing.  Eventually one of the Big Dogs will fill this space, probably Microsoft, Oracle, IBM or SAP.  For now the niche players will nibble at each other's dogfood, until somebody decides to buy out the dogfood supplier.

What we really need is a database to track and compare the companies, names, technologies, features and query languages used for each of these products.  Maybe this database could just be a query language against Wikipedia.  As more and more players enter the database market, research companies and consultants who understand the "Big Picture" become more important, as do content aggregators and search tools like Wikipedia, Linked-In and Google.  

Wednesday, July 18, 2012

Using Sequence Identity Column in SQL 2012

So the conclusion is that identity is 10 times slower than sequence with default cache size. You will not gain significant performance by defining very big cache size.

It sounds like sequence is the way to go when performance and concurrency are issues.  Sequence requires rewriting your application inserts to handle the syntax for sequence.  eg. INSERT INTO x (id) VALUES (NEXT VALUE FOR seqname)
An insert trigger could potentially replicate the functionality of the identity column.
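A small sketch of the syntax (the object names are made up):

CREATE SEQUENCE dbo.OrderSeq AS INT START WITH 1 INCREMENT BY 1 CACHE 50;

CREATE TABLE dbo.Orders (Id INT NOT NULL, Amount MONEY NOT NULL);

INSERT INTO dbo.Orders (Id, Amount)
VALUES (NEXT VALUE FOR dbo.OrderSeq, 10.00);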

SQL Server 2012 Sequence Internal » John Huang's Blog
Further details on the internals of identity vs. sequence are here.
The primary reasons I wouldn't use sequence - in some cases anyway.
1. No DISTINCT keyword allowed when using sequences in query.
2. Can't use in views.
3. Adds complexity vs. identity column.
4. Can't use in defaults for column.
5. Requires trigger to implement identity-style insert logic.
6. Can be updated without setting identity insert option.

Monday, July 16, 2012

Tuning Analysis Services

Tuning Analysis Services for better aggregation - parallel partitioning.

Set AggregationMemoryLimitMax and AggregationMemoryLimitMin for performance with multiple-cpu environments.

Wednesday, July 11, 2012

Data Quality in an Enterprise Warehouse

Enterprise customers, especially for O/Ses and database systems, usually run at least 1-2 versions behind current product releases in my experience.  Enterprise IT doesn't like adopting the latest and greatest technologies until they have been proven, and rightfully so.  Some of the latest releases could stop your business in a second with a critical support issue.

Not adopting the latest technologies can also stop your business, though less like a train wreck and more like   an old mall department store.  If maintenance isn't kept up, prices don't appear competitive and new products don't keep filling the shelves, the store is probably doomed to stagnate to a slow death.  If software patches and releases aren't kept up, support lifecycles expire and similar platforms become harder to integrate.  Upgrading through 3 versions of an O/S while maintaining compatibility with your in-house proprietary apps becomes not just painful, but nearly impossible.

It's usually the same with data.  Fixing problems immediately without doing root cause analysis can be just a band-aid solution.  Not fixing problems at all could be even worse, and fixing historical data could cost more than it's worth.

Historians say that it is harder to predict the past than it is to predict the future.  The internet is making predicting the past a bit easier, at least the past beyond the mid 90s.

Here's an article from 2005 that's still relevant, regarding data cleanliness in the warehouse.

Despite the great efforts from Inmon, Kimball, and the Others, the world of Data Warehousing is still facing great challenges. Even in 2005, after 14 years of Inmon explaining the concept, more than 50% of today’s data warehouse projects are anticipated to fail [Robert Jaques]. In fact, Ted Friedman, a Principal Analyst in Gartner wrote in 2005, “Many enterprises fail to recognize that they have an issue with data quality. They focus only on identifying, extracting, and loading data to the warehouse, but do not take the time to assess the quality.”

Today’s data warehouses suffer from poor quality of the data. Whether or not the poor quality of data existed a decade ago is a questionable hypothesis. In the 90s, the new breed of software products and the ease of implementing data-moving techniques have opened several avenues that resulted in data duplication. As a result, any data inconsistencies in source systems have been remedied by scrubbing and cleansing them on “local copies” of the data sets rather than taking efforts to correct them at the sources.
If Inmon or Kimball had foreseen the wave of software product proliferation in the 90s that implemented duplicated functionality in an organization, they might have stressed on architecting for better quality.

The premise behind many data warehousing strategies is to bring "all the things" into a central area for reporting, and perhaps analysis.  More so just reporting, or "giving the users the numbers" instead of the Semantic Meaning behind them.  Just like a real warehouse, the historical data starts to collect dust.  Without a proper inventory control system, it becomes hard to find things.  The older data elements appear a bit yellowed and sun-bleached, and perhaps have warped or had some design flaws when they were originally input that were only caught in later versions.  The newest items may look shiny due to improved validation processes during input time, but could have critical defects due to ETL "Chinese Whispers" gone haywire.

The way companies deal with interpreting and fixing bad data is probably the number one reason why most BI projects fail.

 As in a warehouse, it can be hard or just doesn't make ROI sense to go back and fix those old elements.  It is much easier to write them off as being known bad data.  This can cause grief and confusion for those trying to perform historical comparisons and trend analysis on the data.

In a dimensional model, a measure is also called a fact.  What is a fact if it is not quite true?  It doesn't become an opinion, it becomes an error.  Loading errors into a fact table is not a good idea. So we have a conundrum.

The missing component here is being able to publish scores that tell us the data is flawed, and commentary describing why we are not going to fix it, or how data quality is being addressed, or at least how to interpret the data, and ways to ignore data based on a threshold of quality.

As we move toward the "big data" experience, identifying trust-levels within data sources becomes more important.  Facts become opinions instead of errors.  Opinion tables sound much nicer to me than Error tables, and for dealing with data that doesn't have to be down to the penny, like internet sentiment analysis, perhaps opinions may work.

I propose an approach to augmenting current star/snowflake dimension models: opinionate your data models (a rough sketch follows the steps below).

1. Add an opinion table to your fact table.  Ensure it uses the same grain as the fact table.
2. Include status columns for each column in your fact that you wish to track errors.
3. Include an aggregated status column, and a checksum that can be matched against the source record for auditing purposes.
4. Include a type 2 with history slowly-changing dimension for managing when each fact table load occurred and what the most current load is.
5. Track all errors, questionable data, or data out of the statistical norm using this table.
6. Ensure you have dimensions for ratings, tags and link to a wiki for people to maintain commentary.

This table load could be automated using standard ETL tools, or manually augmented using data input forms.  If you're using Excel 2007/2010 and Analysis Services cubes, you could use the writeback feature to assign scores to existing data and provide 1/0 values for true/false flagging.  Some tools, like Oracle and SQL, automatically create statistics and histograms on tables which you could leverage to find out what data are outliers.  Data Quality Services tools can be trained to understand your data and identify suggestions or even fix your data.

Gamify the process by giving out weekly prizes for those business users that find the most issues with the data, and those technical users that propose the best solutions for fixing the issues.

Sometimes automating data processes can uncover unsettling facts, especially with people & processes that resist automation.  Improving data quality gives more power to the users of that data, and less power to the controllers of the data.  This isn't always a good thing.... depending on your opinion.

Wednesday, June 20, 2012

SSIS SCD Wizard Performance Issue « Data Warehousing and Business Intelligence

In Microsoft BI, when we do a slowly changing dimension type 2, we instinctively do SCD Wizard. The problem with SCD Wizard is the performance. For each source row coming to the SCD box, SSIS does a lookup on the target dimension table to find out if the row already exists or not. That is a crazy idea. And SSIS also update the target dimension table on row by row basis. That is also a crazy idea.

If your dimension is 1000 rows and the source data is 1000 rows, the SCD takes a few minutes. But if your dimension is 500k and the source data is 500k, it’s 5-6 hours performance. The whole idea is fundamentally flawed. On the welcome screen of the SSIS box, there should be a warning: Warning, do not use SSIS SCD Wizard if your dimension is above 1000 rows.

So, if your dimension is 1m rows, what should you do?
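One set-based alternative is the classic two-step expire-and-insert pattern in T-SQL; here's a sketch with hypothetical customer columns:

DECLARE @Now DATETIME2 = SYSDATETIME();

-- 1. Expire the current version of any customer whose tracked attributes changed
UPDATE d
SET    d.IsCurrent = 0, d.ExpiryDate = @Now
FROM   dbo.DimCustomer d
       JOIN stg.Customer s ON s.CustomerCode = d.CustomerCode
WHERE  d.IsCurrent = 1
       AND (d.CustomerName <> s.CustomerName OR d.City <> s.City);  -- add ISNULLs for nullable columns

-- 2. Insert a fresh current row for brand-new customers and for the rows just expired
INSERT INTO dbo.DimCustomer (CustomerCode, CustomerName, City, EffectiveDate, ExpiryDate, IsCurrent)
SELECT s.CustomerCode, s.CustomerName, s.City, @Now, NULL, 1
FROM   stg.Customer s
       LEFT JOIN dbo.DimCustomer d
              ON d.CustomerCode = s.CustomerCode AND d.IsCurrent = 1
WHERE  d.CustomerCode IS NULL;

Two set-based statements replace the row-by-row lookup and update that makes the wizard crawl.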

SSIS SCD Wizard Performance Issue « Data Warehousing and Business Intelligence

Tuesday, June 19, 2012

Filtering by logged in user

How to get the current user name in MDX.


Filtering by logged in user

Wednesday, June 06, 2012

Add Report Server Content Types to a Library (Reporting Services in SharePoint Integrated Mode)

To configure Reporting Services against an existing Sharepoint 2010 farm, you need to go to the root site and enable the proper features.  Then configure libraries in subsites to expose the proper content types (Report Builder Model, Report, Data Source).

It can get a bit tricky.  Here are some links.

Add Report Server Content Types to a Library (Reporting Services in SharePoint Integrated Mode)

SQL Error 909 when restoring ContosoRetailDW.bak using MS SQL Server Mgt. Studio

If you’re trying to restore the Contoso database samples on a BI or Standard edition of SQL 2012, you’ll get an error about Enterprise features being used.  Here is a script that can be run on a trial Enterprise or Developer edition of SQL 2012 to remove that dependency.


SQL Error 909 when restoring ContosoRetailDW.bak using MS SQL Server Mgt. Studio

Thursday, May 31, 2012

Share cache across SSIS packages

Shared caching is a new feature of SSIS 2012 that could improve performance when using the same large lookup table across multiple packages.  It could be replicated by saving to a raw file in SSIS 2005/8, however this should be faster.

Prior to SSIS 2012 it was not possible to re-use or share the same cache across packages. What that meant is if you created a cache you will be able to consume it only in the very package where the cache was instantiated. Therefore, a developer could take only two routes from here – either make as much heavy ETL processing as possible inside the same package where the cache resides, or populate yet another cache object in every other package where it is needed. The latter was especially harmful leading to unnecessary heavy extra database calls and an extended development time. This limitation has been overcome in the SSIS 2012 release.

Monday, May 14, 2012

SQL Live Monitor


A .NET application that provides realtime performance data on the target SQL Server instance.
No installation required, data displayed in realtime, and it can also be logged to CSV for offline analysis. Will also capture SQL data for processing using PAL.

SQL Live Monitor

Thursday, May 10, 2012

Layerscape - Home Page

Layerscape is a fascinating Microsoft Research tool that integrates with WorldWide Telescope to provide deep spatial visualizations.


Layerscape - Home Page

Thursday, May 03, 2012

Download: Microsoft® SSIS Balanced Data Distributor - Microsoft Download Center - Download Details

Useful for distributing workloads and speeding up SSIS performance.

Microsoft® SSIS Balanced Data Distributor (BDD) is a new SSIS transform. This transform takes a single input and distributes the incoming rows to one or more outputs uniformly via multithreading. The transform takes one pipeline buffer worth of rows at a time and moves it to the next output in a round robin fashion. It’s balanced and synchronous so if one of the downstream transforms or destinations is slower than the others, the rest of the pipeline will stall so this transform works best if all of the outputs have identical transforms and destinations. The intention of BDD is to improve performance via multi-threading. Several characteristics of the scenarios BDD applies to: 1) the destinations would be uniform, or at least be of the same type. 2) the input is faster than the output, for example, reading from flat file to OleDB.
NOTE: An updated version of Microsoft SSIS Balanced Data Distributor (BDD) is available. This includes a fix for the problem causing BIDS to crash when removing the BDD transform. For more details, and download information, see KB 2616527.

Download: Microsoft® SSIS Balanced Data Distributor - Microsoft Download Center - Download Details

Monday, April 30, 2012

SQL Server Security - Site Home - MSDN Blogs

Instead of using CHECKSUM, timestamp, or other methods, hashing data may be a good alternative.

A common scenario in data warehousing applications is knowing what source system records to update, what data needs to be loaded and which data rows can be skipped as nothing has changed since they were last loaded. Another possible scenario is the need to facilitate searching data that is encrypted using cell level encryption or storing application passwords inside the database.

Data Hashing can be used to solve this problem in SQL Server.
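A minimal sketch of change detection with HASHBYTES (the table and columns are hypothetical):

SELECT CustomerId,
       HASHBYTES('SHA2_256',
                 CONCAT(CustomerName, '|', Address, '|', Phone)) AS RowHash
FROM   dbo.Customer;
-- Compare RowHash against the hash stored at the last load to decide insert, update or skip.
-- Note: HASHBYTES input is limited to 8000 bytes prior to SQL 2016.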

SQL Server Security - Site Home - MSDN Blogs

Thursday, April 26, 2012

Twitter Kills The Majesty of Pleasant Conversation

Words are powerful things.  As of late, I have been working on text mining social data feeds, and investigating how Hadoop, R, Azure and SQL Server 2012 fit into the big picture.  I wrote a SQL Integration Services package using a script task that pulls Twitter data from a syndication feed, parses out the words, hash tags, screen names and links, and stores the results in a SQL 2012 (or optionally SQL Azure) database.  Performing some text mining of Twitter status keywords against a selection of “best” and “worst” words in the English language brings back some interesting and slightly depressing results.

I started by entering into the database the Twitter screen names to follow.  I targeted a few Canadian bike companies (don’t ask why).  Each time the package is run, it adds the last 20 tweets, and parses screen names of users that are mentioned in each tweet.  And so on.  This recursion builds a very quick six-degrees-of-separation trail and some fairly random discussions.

Running this process for about 4 days, sometimes 2-3 times per day, produced 5599 tweets.  Originally I was looking at using R and Hadoop to analyze the results, which is a bit like bringing a ballistic missile to a knife fight.  To slice this data with SQL takes only a couple of seconds or less.  Perhaps reading the entire Twitter firehose or analyzing historic tweet data might change the architecture in the future.  For now, things are working pretty well. 
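The slicing itself is just an aggregate, something along these lines (the table and column names are made up):

SELECT TOP (5) w.Word, COUNT(*) AS Occurrences
FROM   dbo.TweetWord w
       JOIN dbo.BestWords b ON b.Word = w.Word
GROUP  BY w.Word
ORDER  BY COUNT(*) DESC;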

Of a selection of 5599 individual tweets, 9 contain the “best words” and 2135 have the “worst words” as rated by Vocabula Review.  That’s 38% of the sample that have an aura of foolishness or odium, and 0.1% that have an aura of fun and majesty.  The sampling is fairly small, with the top word “valley” only coming up 3 times. 

Another dataset, seeded with a more technology-centric list of Twitter users like Robert Scoble and some Microsoft folks I follow, brought back similar results.  Running this process over the course of a month saved 59,583 tweets containing 796,171 words, links, screen names, emoticons and hash tags.

Of the 796k words, 24,171 came up in the “worst words” filter.  That’s about 3%.  A measly 282 came up in the “best words” filter.  That’s less than 0.04%.

The following Top 5 Words came up.

valley 74
azure 19
simplicity 11
bliss 10
recherche 8

  • Valley makes sense, with Silicon Valley, Napa Valley, and those other west coast valleys being discussed.
  • Azure makes sense, since a steady stream of Windows Azure propaganda continually bubbles from Microsoft.
  • Simplicity comes up a few times when people talk about Apple or iPad.
  • Bliss comes up because of Rob Bliss, a Washington Times columnist, and some comments about cranberry bliss bars.
  • Recherche, well, let’s chalk that up to the fact that some of the best words in the English language are French.  Mon dieu.

With only 140 characters to leverage, you would think that people would use words like “animadversion” or “cachinnation” to provide deep and meaningful expression.  Instead, you get the logorrhea that is the Twitter dataset.

Check out to improve your tweets and amaze your followers with fun and majesty.