Friday, July 04, 2014

The Myth of the Full-stack DBA

Andy Shora recently published a great article about "The Myth of the Full-stack Developer."  From his posts and tweets, I peg Andy as a front-end developer specializing in CSS and Javascript frameworks.

Andy's comment about the 30-point spider diagram really resonated with me.

Where do your web development skills lie?

The term "Full-Stack" applies to knowing the different layers of an architecture, of which the database (SQL/NoSQL/SomeSQL) can sit as one of those layers.  However, if you were to consider yourself a "Full-Stack" SQL Server DBA, you might know the details of some or all of the following components of SQL Server:
  1. SQL Server
    1. Transact-SQL
    2. Backup/Restore
    3. Replication
    4. Log Shipping
    5. Query Tuning and Indexing
    6. Policies and Resource Governor
    7. High availability
    8. In Memory Querying
  2. SQL Agent
    1. Jobs
    2. Alerts
  3. Reporting Services
    1. Sharepoint Integrated Mode
    2. Native Mode
  4. Integration Services
    1. Importing/Exporting/Transforming Various Data Sources
    2. Connection Types
    3. Toolbox Components
    4. Project/Package Deployment Methods
  5. Analysis Services
    1. MDX/DMX/DAX
    2. Tabular
    3. Multidimensional
    4. Sharepoint
  6. Powershell Scripting
  7. TFS Integration
And of course there are the various ways to set up, configure, debug, maintain, monitor, license, upgrade and secure the above tools, along with the associated hardware and software.  Perhaps your job description even includes mentoring developers on best practices for interfacing with the layer that is the "database".

This is by no means an exhaustive list, and depending on your environment, your role as a DBA could also include Active Directory maintenance and voicemail support.

If you're a Microsoft consultant or partner, you're probably intimately familiar with the old Microsoft Hamburger and its associated toppings.

(RIP standalone PerformancePoint Server)

If you have expertise in one or more of the grills at the bottom of the stack (SAP, Oracle, etc.), your job just got a bit more important.

For some of those components in the stack, hopefully you're using best-of-breed third-party tools from companies like ApexSQL, Idera and Red Gate to improve your experience.  The ecosystem of helper apps is vast and wide.  Many feature limitations of products are considered "partner opportunities" by software companies.  If you want to be a Full-Stack DBA, you'll want to know what tools keep your stack upright.

If you were to consider yourself an expert on the internals of SQL Server, you might know about all of the components of this diagram, and take great pleasure in using tools like Internals Viewer to see how SQL arranges its storage, and in understanding the details of how GAM, SGAM & PFS pages work.



Or you may have had a "crash" course on the above internals while troubleshooting why a DBCC CHECKDB job failed and your customers can't access their data.  It's best to be proactive about these things, so start studying the diagram!
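If you'd rather poke at those allocation pages before the outage, here's a minimal T-SQL sketch using the undocumented DBCC PAGE command; the database name is hypothetical, so point it at a test system:

-- Send DBCC PAGE output to the client instead of the error log
DBCC TRACEON (3604);

-- In file 1 of every database: page 1 = PFS, page 2 = GAM, page 3 = SGAM
DBCC PAGE ('AdventureWorks2012', 1, 1, 3);  -- PFS page, full detail
DBCC PAGE ('AdventureWorks2012', 1, 2, 3);  -- GAM page
DBCC PAGE ('AdventureWorks2012', 1, 3, 3);  -- SGAM page

-- And the proactive part: check the database before your customers find the problem
DBCC CHECKDB ('AdventureWorks2012') WITH NO_INFOMSGS, ALL_ERRORMSGS;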

The ScaryDBA's recent post about The Curse Of Relational Databases also comments on The Stack of Information Technology in general, and how deep, wide, and unfathomable it is.  The comments provide insight into the mindset of the community, and one mentions the idea of "Technical Debt", a topic perhaps worth exploring in detail in a future post, since this one is getting a tad rambling and epic.

Yesterday I attended a Toronto Hadoop User Group Meetup, with a discussion from the co-author of a book on YARN.  The topic was a deep-dive into the internals and features of the resource scheduler and application manager that is YARN.  Fascinating stuff (I'm not kidding!) that shows how much effort goes into the foundation of a scalable distributed data solution.

Speaking of deep, wide, and unfathomable....

The Big Data Developer Toolkit  

The Apache Open-Source Developer Toolkit

Hopefully your DBA stack expertise includes some Linux knowledge.





Monday, June 23, 2014

Choosing a model - Tabular vs. Multidimensional

Understand the limitations of the Tabular model and when a Multidimensional model implementation is more appropriate.

It isn't as simple as Distinct Count = Tabular, Parent-Child = Multidimensional...
http://msdn.microsoft.com/en-us/library/hh994774.aspx
http://msdn.microsoft.com/en-us/library/hh212940.aspx

If you're still using Office 2007 or SQL 2008 R2, it might be a bit easier to make a choice...

Tuesday, June 17, 2014

Tuning SQL Server - Is your power plan sucking performance?

As stated in this hilarious Knowledge Base article, changing your power options to “High Performance” (an option Microsoft conveniently hides under a “more” menu to prevent you from accidentally finding it) resolves this issue by using more power all the time. In return, you need to accept that global warming is entirely your fault. Amazingly, no performance analyser of SQL server suggests you change this option.

Sometimes the simplest things can drastically change performance.  There should be an interview-style wizard or report for those who are new to setting up a server that would interrogate WMI to find out which settings may be sub-optimal for your usage scenario.

http://www.masterofmalt.com/software-development/blog/why-microsoft-windows-server-2012-produces-slow-and-inconsistent-sql-query-speeds/
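If you want to check (and fix) this from a query window, here's a rough sketch; it assumes xp_cmdshell is enabled, which it often isn't, so running powercfg from a regular command prompt works just as well:

-- Show the active Windows power plan from inside SQL Server
EXEC xp_cmdshell 'powercfg /getactivescheme';

-- Switch to High Performance (SCHEME_MIN is the built-in alias for that plan)
EXEC xp_cmdshell 'powercfg /setactive SCHEME_MIN';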

cURL Super Powers

cURL is an extremely powerful tool, available on many platforms including Linux and Windows (via tools like Cygwin).  It's powered by libcurl.  Wget is another like-minded utility that can download files recursively.

Here's an example to download, in parallel, all 150+ Microsoft eBooks available from this site.
curl http://blogs.msdn.com/b/mssmallbiz/archive/2013/06/28/almost-150-free-microsoft-ebooks-covering-windows-7-windows-8-office-2010-office-2013-office-365-office-web-apps-windows-server-2012-windows-phone-7-windows-phone-8-sql-server-2008-sql-server-2012-sharepoint-server-2010-s.aspx | grep -o -E "http://ligman.me/[0-9A-Za-z]+" | uniq | xargs -P 5 -n 1 wget --content-disposition

So what's it doing?
First, curl downloads the page at the specified URL.
Second, grep searches that page for links matching the regular expression (the http://ligman.me/ short links).
Third, uniq removes duplicate links from the result set.
Fourth, xargs handles the large argument list, passing one URL at a time (-n 1) to up to five parallel processes (-P 5).
Fifth, wget downloads each resulting URL, with --content-disposition naming the files from the response headers.

Why use both tools when they appear to do similar things?
http://stackoverflow.com/questions/636339/what-is-better-curl-or-wget
If you're a Windows user, and can't install Cygwin, Powershell can do some of the things that wget can do.
http://superuser.com/questions/25538/what-is-the-windows-equivalent-of-wget
(new-object System.Net.WebClient).DownloadFile('http://www.xyz.net/file.txt','C:\tmp\file.txt')
Here's a ton of GNU manuals, including wget.

Wednesday, April 02, 2014

Analysis Services 2012 and 2014 Performance Guide

Microsoft has released an Analysis Services 2012 and 2014 Performance Guide, 111 pages of multidimensional performance tuning goodness.

Is there a future for the multidimensional cube at Microsoft?  Sounds like it may stick around for a while longer.... watch for improvements in SQL vNext.

Performance is important, and not just for improving business processes.  Licensing rules for SQL 2014 will probably cause server administrators to try to keep their server utilization at full capacity.  Do more with less.

A similar whitepaper exists for PowerPivot and the tabular model in SQL 2012.

Tuesday, April 01, 2014

SQL 2014 Released, so what's the big deal?

SQL 2014 is released, in 2014 no less!  You can download the bits from Microsoft or get your free Azure trial (with a cap to ensure you don't get charged anything) to try it out.  Perhaps SQL Spade will help you to install it.  Or just browse the new features on Virtual Labs.

Microsoft has officially killed support for SQL 2005 compatibility mode, so before upgrading from prior versions, be sure you've upgraded the compatibility level of your databases.
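A quick way to check and fix this; the database name in the ALTER is hypothetical:

-- Find databases still sitting at the SQL 2005 (90) compatibility level
SELECT name, compatibility_level
FROM sys.databases
WHERE compatibility_level < 100;

-- Bump one up before the upgrade
ALTER DATABASE SalesDW SET COMPATIBILITY_LEVEL = 110;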

The database engine has various improvements, fixes and tweaks to query plan optimization and partitioning.  Hopefully the query engine is tuned for the better.

There are some additional options for impersonation and separation of duties within the database.

Reporting Services, Analysis Services, Integration Services, Data Quality Services and Master Data Services are mostly left in the dust.  Sigh....

Some of the cloud integration stories have improved dramatically.  Backup to URL.  Migrate to VM.  Hybrid blob storage.  AlwaysOn to Azure.  Azure's ExpressRoute functionality, which moves you from the public cloud to a private direct connection, is now available.  This should improve adoption for those not wanting to transfer data over the public internet, and provides companies with a secure form of offsite backup, DRP and replication to Azure.

Perhaps the biggest story with SQL 2014 is its In-Memory (formerly Hekaton) engine.  Or what I like to call DBCC PINTABLE+++.  This, along with its ability to use solid state storage for the SQL Buffer Pool, should provide for much faster query times when it's used correctly.  Think columnstore indexes but for an entire table, and updateable, and not a separately stored object.  Columnstore indexes in SQL 2014 can now be created as updateable, clustered objects that contain the data itself, in addition to the indexes.  In Enterprise Edition, anyway...
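Here's a rough sketch of both features; the database, filegroup, table and index names are all made up, and the syntax is from my notes rather than a production script:

-- In-Memory OLTP needs a memory-optimized filegroup and container first
ALTER DATABASE SalesDW ADD FILEGROUP SalesDW_imoltp CONTAINS MEMORY_OPTIMIZED_DATA;
ALTER DATABASE SalesDW ADD FILE (NAME = 'SalesDW_imoltp', FILENAME = 'D:\Data\SalesDW_imoltp')
    TO FILEGROUP SalesDW_imoltp;

-- A durable memory-optimized table with a hash index on the key
CREATE TABLE dbo.SessionState
(
    SessionId INT NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
    Payload   VARBINARY(8000) NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);

-- The updateable clustered columnstore: the index IS the table storage
CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales ON dbo.FactSales;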

Note for SQL Server 2005:
This (DBCC Pintable) functionality was introduced for performance in SQL Server version 6.5. DBCC PINTABLE has highly unwanted side-effects. These include the potential to damage the buffer pool. DBCC PINTABLE is not required and has been removed to prevent additional problems. The syntax for this command still works but does not affect the server.


Ahhh, memories of data corruption, and arguments with other technology experts that there is no such feature post-SQL 2000 for pinning tables in memory.  "But I typed DBCC Pintable in and it came back with no errors!"   Whenever I want to do something that bypasses the internal workings of a database, I like to visit Ask Tom.   Even if you're not using Oracle, his site is a very fun and informative read.

Now back to SQL and the fine print.

There are some licensing challenges to be aware of.  One challenge: where can you get some budget for 128 GB of memory for your SQL Server Standard boxes, and 512 GB of flash memory for your buffer pool?  How about 4 terabytes on Datacentre with Enterprise edition, and some of those larger tables pinned in memory?

Fun times ahead with in-memory databases.



Friday, March 28, 2014

OCD Programming

A great StackOverflow answer using the railroad analogy to explain branch prediction and why processors love predictability.

http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array

I think the response won the votes because of the picture and the opening line "Consider a railroad junction", which succinctly describes the problem you may have when dealing with unsorted data and path analysis.

I am consistently amazed at how much difference a good path makes in programming, especially with databases.  Ensure your tables all have primary keys.  Performance improves.  Add covering indexes for long running queries.  Performance improves.

These two things can sometimes fix many of the performance issues living within the database layer.
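For example, a sketch of the covering-index fix; the table and column names are invented:

-- A long-running query filters on OrderDate and returns CustomerId and TotalDue.
-- A covering index answers it without key lookups back to the clustered index:
CREATE NONCLUSTERED INDEX IX_Orders_OrderDate_Covering
    ON dbo.Orders (OrderDate)
    INCLUDE (CustomerId, TotalDue);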

Clean up your code.  Ensure state is kept to a minimum, and conversations with slower systems such as databases and disk are batched up and cached.

And apparently sort anything coming down the pipe before trying to analyze it.

Predictability and patterns in your code, and being obsessive-compulsive when it comes to clean coding seems to mean a lot.  To some compilers, anyway.

Others will ignore all your hard work and try to determine the best way to solve your problem.  Seems to be the norm for the Oracle instance I'm currently using.

If you use a tool such as Windows Sysinternals Process Monitor, you can see what kind of effect these types of incorrect paths have on system performance.  You'll notice a lot of Access Denied, Key Not Found or File Not Found messages.  Systems could be so much faster if they just found the correct registry entry, filename, or internet address the first time, without going down multiple paths blindly.   Perhaps predictive analytics will soon solve this issue, so that each time you use your computer it will intelligently speed up your daily tasks.

Or perhaps I'm just being obsessive.

Tuesday, February 25, 2014

Modern IE

Great site from MS to download VMs for IE testing.

http://modern.ie/en-us

As a side effect, you can get VMs of all major MS operating systems to run on Linux & Mac OS!


Thursday, February 06, 2014

Windows 2012 Command Prompt

Gates Spends Entire First Day Back in Office Trying to Install Windows 8.1 : The New Yorker:

The New Yorker should have done their fact-checking.  From what I have read Bill Gates is known for his trucker mouth when dealing with staff.

I downloaded a 4GB VM of the Windows Server 2012 trial yesterday to brush up on my Microsoft O/S skills, and missed the fact that it was the "core" edition.  I was back in 1989 heaven when the DOS-style prompt came up on startup, after a nice UI for the login prompt.

It took me a few searches to figure out the UI wasn't broken, and some hacker-like Powershell to install the UI components.   That ended up failing, since I guess I need to download the install ISO too.

I wonder what Bill Gates would think of his 31-year history of DOS operating systems coming full circle with Windows Server 2012.  Who would have thought that in 2014 I would be staring at a command prompt typing "help".




Thursday, November 29, 2012

Performance tips for SSAS | James Serra's Blog

Performance tips for SSAS | James Serra's Blog: "Over the past few years I have developed a list of ways to improve performance in SSAS and in two of the tools that use it: ProClarity and PerformancePoint.  Below are those performance tips along with other important items to be aware of:"

The one that really helped us was to move parameters from the where clause to a subquery.  This can improve performance drastically.


Ranting about Cumulative Updates

Don't get me wrong.  Fix it fast and fix it often is a good motto to have when addressing software bugs.  However, trying to decode what was changed between versions isn't so easy.  Especially with products like Microsoft Sharepoint.

Sharepoint has had umpteen patches since Service Pack 1.  Each one is outlined in separate KB articles by install file.  This works out to about 100 tabs open in the browser to figure out what is in each cumulative update.

Is there not a database that has this stuff?

On another note, SQL 2012 SP1 is released, which doesn't include the fixes contained in CU3 & CU4, so download SQL 2012 SP1 and SQL 2012 SP1 CU1.

Monday, November 12, 2012

Big Data Vendor Landscape

A very comprehensive post on the current (well, June anyway) picture of Big Data Vendors along with plenty of contact information.

http://www.rosebt.com/1/post/2012/06/big-data-vendor-landscape.html

Vertica is really eating everyone else's pie.  They partner with many of the traditional front-end BI app developers, including Cognos, Microstrategy and Tableau.

http://www.vertica.com/partners/business-intelligence/

Monday, October 29, 2012

LINKMEMBER MDX Syntax

Some helpful tricks with LINKMEMBER.  LinkMember is used when the cube has multiple hierarchies that could be linked together, e.g. Order Date & Ship Date, even though no relationship exists at the fact level.  Or in my case, Security Code and Security Underlier Code.

The key issue I had was that the HideUnrelatedDimension property was set to false in the cube.  This means that you need to explicitly specify the "all" member in the linkmember function, and exclude the all member from the scope of the calculation statement.

Details here.

http://rklh.blogspot.ca/2012/01/scope-statement-on-calculated-measure.html

Thursday, October 04, 2012

Interesting MDX Feature

Apparently there's a keyword test in MDX that returns an empty value, even if the cube doesn't contain a measure called test...

select test on 0
from Adventureworks

Easter egg?

Wednesday, October 03, 2012

Humor in code

So I've been getting this error fairly regularly and thought that Mario's arch enemy had done something to my PC.  Lately it has been shutting off without warning.

The Source was "Bowser"



The browser driver has received too many illegal datagrams from the remote computer ---- to name WORKGROUP on transport NetBT_Tcpip_{-----}. The data is the datagram. No more events will be generated until the reset frequency has expired.
http://blogs.msdn.com/b/larryosterman/archive/2011/05/02/reason-number-9-999-999-why-you-don-t-ever-use-humorous-elements-in-a-shipping-product.aspx

Apparently this means my computer is beyond repair, or at least the network card is sending out line noise. 

Another one that happens every minute since I installed Windows 8:

A corrected hardware error has occurred.
Reported by component: Processor Core
Error Source: Corrected Machine Check
Error Type: No Error
Processor ID: 1
The details view of this entry contains further information.

If it's corrected, how come it keeps showing up every minute?
If it's not an error, why is it called a corrected hardware error?

I always thought the event viewer should have ads for software that fixes event viewer errors or offers up places to buy hardware. That could be a great idea for extra MS revenue...

Reading Larry's blog, I finally figured out how I fixed that insurance office network 17 years ago.
http://blogs.msdn.com/b/larryosterman/archive/2005/01/11/350800.aspx

Flipping the network cards from Auto Detect to Full Duplex solved the issue, since 1 card was blowing up the entire network.

 

Tuesday, September 25, 2012

Grouping in SQL RS

One issue we recently had was that grouping wasn't filtering properly on the report.  If you are having issues grouping and filtering items in Reporting Services, make sure the detail row is not deleted.  If you have to, set the Row Visibility to hidden, but don't delete the detail.

This solved our filtering issue.


Thursday, August 30, 2012

Understanding Analysis Services Deployment Modes - Analysis Services and PowerPivot Team Blog - Site Home - MSDN Blogs

Analysis Services has 3 instance install modes.  Multidimensional is for the "legacy" Analysis Services cubes.  Tabular is for a standalone tabular server hosting persisted tabular models, or temporary models created by Visual Studio.  Sharepoint mode is for integration with PowerPivot functionality in Sharepoint.

There are some key features which may force you to select both the Tabular and PowerPivot modes for different functional requirements.

Understanding Analysis Services Deployment Modes - Analysis Services and PowerPivot Team Blog - Site Home - MSDN Blogs: "Obviously, this way of connecting to a PowerPivot database is very different from connecting to a Tabular database, but the differences between Tabular and SharePoint modes go even deeper than that because the databases themselves are very different in nature. Tabular databases are permanent databases. PowerPivot databases, on the other hand, are temporary. They can be deleted from the SSAS server and reloaded from their workbooks at any time."


Wednesday, August 08, 2012

Spider Schema

The Spider Schema is an interesting approach to the distinct count performance issue with Analysis Services, and to modelling data in general.  It is an approach to fix some of the flaws with star schema design.  It adds an intermediate table between the dimension tables and the fact table, which houses the unique key combinations for the dimensions in the fact.

It sounds a bit like the Data Vault architecture promoted by Dan Linstedt, though targeted at reporting on the data rather than vaulting it.
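Here's my rough reading of that intermediate table, sketched in T-SQL; all of the names are hypothetical and this is just how I picture the approach, not an official definition:

-- One row per distinct combination of dimension keys that actually occurs in the fact
CREATE TABLE dbo.KeySales
(
    KeySalesId  INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    DateKey     INT NOT NULL,
    ProductKey  INT NOT NULL,
    CustomerKey INT NOT NULL,
    CONSTRAINT UQ_KeySales UNIQUE (DateKey, ProductKey, CustomerKey)
);

-- The fact then references the key table instead of each dimension directly
CREATE TABLE dbo.FactSales
(
    KeySalesId  INT NOT NULL REFERENCES dbo.KeySales (KeySalesId),
    SalesAmount MONEY NOT NULL,
    Quantity    INT NOT NULL
);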

More details here.

About Me | Spider Schema:

Tuesday, August 07, 2012

Excel 2013's Data Model Feature

Excel is turning into a scaled-down version of Microsoft Access.
The new Data Model feature lets you build out tables and relationships in your spreadsheet.  No more vlookups!

http://office.microsoft.com/en-us/excel-help/create-a-data-model-in-excel-HA102923361.aspx

I have highlighted a new option in the create PivotTable dialog which is to “Add this data to the Data Model”. So what is this Data Model I speak of?


“A Data Model is a new approach for integrating data from multiple tables, effectively building a relational data source inside the Excel workbook. Within Excel, Data Models are used transparently, providing data used in PivotTables, PivotCharts, and Power View reports“. Read more here…

http://blogs.office.com/b/microsoft-excel/archive/2012/08/23/introduction-to-the-data-model-and-relationships.aspx
Data Models utilize the features of PowerPivot, now embedded into Excel.  They are managed by the PowerPivot add-in, and are more structured than Excel tables.

It would be nice to see if there is some way of 'upsizing' or linking these data tables to SQL Server, as we had with Access.  Being able to give users the ability to store large datasets on the server without admin intervention could change the way they work with Excel.




Wednesday, August 01, 2012

Oracle ORAOLEDB Data Type Hokeyness

My current project pits me against an Oracle Data Source within an Analysis Services multidimensional cube.  We've had a few battles so far, and I think I'm getting closer to winning.

The first quirk is just connecting.   Using the ORAOLEDB.1 provider, we specify / as the user name with a blank password, and ensure the connection is trusted.  Under Advanced - Extended Properties, we specify FetchSize=10000;DefaultRowPrefetch=10000; (perhaps larger).  This is supposed to improve the performance of processing the cube.

The second quirk is creating your DSV.  When using the Oracle connection, it lists all tables in all schemas.  The Oracle environment here is organized into separate schemas instead of separate databases; Microsoft prefers the database segmentation approach.  In a development environment, we have access to everything.   It's not _that_ slow with 18,000 tables and 3x that number of views, but preselecting a schema would be a nice thing to have.

This laundry list of schemas and tables has a larger impact on the audience using Power Pivot, which relegates Oracle to an "Other Data Source" while my client uses it as their primary data source.  Searching through 18,000 tables, with the views placed at the bottom in unsorted order, is not the most friendly interface for building models.  Exposing the data using OData links is probably the preferred approach anyway... but there has to be a better way to deal with using Oracle as a data source for Power Pivot.

One major quirk/show stopper using Analysis Services against Oracle is with data types.  Sometimes the data types don't get reflected properly.  An unconstrained NUMBER data type in Oracle is effectively dynamic, which SQL doesn't like.  Casting this to NUMERIC(9,0) should theoretically provide you with Int, or Int32 in Analysis Services world.

It doesn't always.  In an Analysis Services data source view, sometimes it doesn't pick up your changes.  Casting a number to NUMERIC(9,0) in the Oracle view and refreshing doesn't appear to do anything.  My data type is still Int64 or Decimal or whatever Oracle decides is in my future.

The workaround is to use named queries everywhere.  Any change to the underlying data model requires changing something (case, a space, anything) in the named query; Refresh View doesn't seem to work on its own.

Analysis Services appears to then pick up data types properly.
NUMERIC(5,0) becomes Byte (SmallInt)
NUMERIC(6,0 - 9,0) Becomes Int32 (Int)
NUMERIC(10,0 - xxx,0) becomes Int64 (Bigint)

Anything with decimal places becomes a decimal.  Cast your decimals and round them to ensure precision doesn't blow up the cube.
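A hedged example of what one of those named queries ends up looking like; the schema and column names are invented:

SELECT
    CAST(ACCOUNT_ID   AS NUMERIC(9,0))  AS ACCOUNT_ID,    -- surfaces as Int32
    CAST(POSITION_QTY AS NUMERIC(18,0)) AS POSITION_QTY,  -- surfaces as Int64
    CAST(MARKET_VALUE AS NUMERIC(18,2)) AS MARKET_VALUE   -- decimal with a fixed precision and scale
FROM SECMASTER.POSITIONS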

Seems like an acceptable workaround.  The other way is to manually change the DSV xml.  I will be avoiding that one at all costs.

When you build out your multidimensional cube, all surrogate keys and numeric identity keys should be of data type Int32.  You can use smaller sizes for smaller dimension tables, but usually the tradeoff for size/performance vs. consistency is negligible.  The exception to the Int rule is when you're expecting lots of rows in your dimensions or facts... 2 billion rows?  Int64 or BigInt is what you want.

Would like to hear from others building cubes and PowerPivot models against Oracle data sources.   What tricks or quirks do you see when dealing with non-Microsoft sources?


Wednesday, July 25, 2012

The beginning of the end of NoSQL — Too much information

Earlier this year, Gartner killed the term Business Intelligence and its associated Business Intelligence Competency Center (BICC) acronym, and "introduced" the term Business Analytics and the concept of Business Analytic Teams (BATs).

Some NoSQL-categorized vendors now prefer to be called anything but NoSQL, since this connotation lumps together various technologies into a single buzzword acronym.

CouchDB is often categorized as a “NoSQL” database, a term that became increasingly popular in late 2009, and early 2010. While this term is a rather generic characterization of a database, or data store, it does clearly define a break from traditional SQL-based databases. A CouchDB database lacks a schema, or rigid pre-defined data structures such as tables. Data stored in CouchDB is a JSON document(s). The structure of the data, or document(s), can change dynamically to accommodate evolving needs.

So if the term NoSQL is dead, what is the replacement?  NewSQL?  The lack of creativity is amazing.

A database is only a database if it is an organized collection of data.  Are NoSQL databases really organized or are they freeform?  If it's not a database, what is this unstructured set of information called?

Another term that could be headed for the Furby pile is "Big Data" which is apparently a trademark of a corporation.  Massive Data sounds better to me anyway...



The beginning of the end of NoSQL — Too much information

Speaking of Furbys...
http://www.engadget.com/2012/07/06/furby-hands-on-video/

Query Languages and technology Mashups

SQL (Structured Query Language, "see-kwell") has been around for a while.  IBM engineers developed SEQUEL (Structured English Query Language) in the early 70s to support building and querying relational models.  It was adopted as an ANSI standard in the mid-80s.  I've been using it since the days of Informix & Sybase, prior to SQL Server 6.5.  Various flavours and implementations have been created since then, the key ones being SQL Server's T-SQL, Oracle's PL/SQL, IBM's SQL PL, and Postgres' PL/pgSQL, plus the various offshoots of those languages.

MDX (Multidimensional Expressions, or Mosha's Data Expressions) is used to query Analysis Services multidimensional and tabular cube models.  Instead of a single axis of rows being returned, there is the potential to return 128 axes of data.  Usually it's just rows, columns, and sometimes pages, since thinking in 128 dimensions hurts most people's brains.

DMX (Data Mining Extensions) is a SQL-like query language used for creating and pulling data from multidimensional mining models.  The original team lead for Data Mining at MS is now CTO at a company called Predixion, implementing mining models using PowerPivot.

DAX (Data Analysis Expressions) supersedes MDX for the Analysis Services / PowerPivot tabular models.  MDX is still used in the background when Excel queries PowerPivot models, but DAX is the way you write expressions and calculated functions.

NoSQL (Not Only SQL) isn't really a language, it is a technology categorization.  NoSQL databases like those used by Facebook and Google aren't necessarily relational in nature.  Usually they are key-value stores; think thin columns and deep rows.  Microsoft has Windows Azure Blob Storage for its NoSQL offering, with others on the way.  HQL (Hive Query Language) is one way of querying a NoSQL database running on Hadoop.  Not to be confused with Hibernate Query Language, used to query the NHibernate ORM framework.  Most NoSQL databases fail the ACID test, and aren't necessarily good for transactional systems.  They're great when you need to distribute and analyze massive amounts of data.

NewSQL is either a Sourceforge LDBC driver implementation created around 2003 that offers a simplified SQL syntax, or yet another buzzword coined last year to describe something that isn't SQL and isn't NoSQL.  It's easier to use, more scalable, and performs better.  It provides ACID support and all the features of new and old database technologies.  Or so the vendors of this technology space suggest...

Some products in the NewSQL space include Akiban Server and (perhaps) Microsoft SQL Azure

More info here:
http://www.readwriteweb.com/cloud/2011/04/the-newsql-movement.php

Augmenting legacy SQL databases and technologies with their new "Big Data" columnstore or key-pair counterparts seems to be the Next Big Thing.  Eventually one of the Big Dogs will fill this space, probably Microsoft, Oracle, IBM or SAP.  For now the niche players will nibble at each other's dogfood, until somebody decides to buy out the dogfood supplier.

What we really need is a database to track and compare the companies, names, technologies, features and query languages used for each of these products.  Maybe this database could just be a query language against Wikipedia.  As more and more players enter the database market, research companies and consultants who understand the "Big Picture" become more important, as do content aggregators and search tools like Wikipedia, Linked-In and Google.  

Wednesday, July 18, 2012

Using Sequence Identity Column in SQL 2012

So the conclusion is that identity is 10 times slower than sequence with default cache size. You will not gain significant performance by defining very big cache size.


It sounds like sequence is the way to go when performance and concurrency are issues.  Sequence requires rewriting your application inserts to handle the new syntax, e.g. INSERT INTO x (id) VALUES (NEXT VALUE FOR seqname).
An insert trigger could potentially replicate the functionality of the identity column.
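A minimal sketch of the sequence approach; the object names are made up:

-- Sequence with an explicit cache size, per the article's performance notes
CREATE SEQUENCE dbo.seq_OrderId
    AS INT
    START WITH 1
    INCREMENT BY 1
    CACHE 50;

-- The rewritten insert
INSERT INTO dbo.Orders (OrderId, CustomerId)
VALUES (NEXT VALUE FOR dbo.seq_OrderId, 42);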

SQL Server 2012 Sequence Internal » John Huang's Blog
 
Further details on the internals of identity vs. sequence are here.
http://dba.stackexchange.com/questions/1635/why-are-denali-sequences-supposed-to-perform-better-than-identity-columns
 
The primary reasons I wouldn't use sequence - in some cases anyway.
 
1. No DISTINCT keyword allowed when using sequences in query.
2. Can't use in views.
3. Adds complexity vs. identity column.
4. Can't use in defaults for column.
5. Requires trigger to implement identity-style insert logic.
6. Can be updated without setting identity insert option.

Monday, July 16, 2012

Tuning Analysis Services

Tuning Analysis Services for better aggregation - parallel partitioning.

Set AggregationMemoryLimitMax and AggregationMemoryLimitMin for better performance in multi-CPU environments.

http://geekswithblogs.net/ManicArchitect/archive/2010/11/02/142558.aspx

Wednesday, July 11, 2012

Data Quality in an Enterprise Warehouse

Current product releases for enterprise customers, especially O/Ses and database systems, are usually at least 1-2 versions behind in my experience.  Enterprise IT doesn't like adopting the latest and greatest technologies until they have been proven, and rightfully so.  Some of the latest releases could stop your business in a second with a critical support issue.  

Not adopting the latest technologies can also stop your business, though less like a train wreck and more like   an old mall department store.  If maintenance isn't kept up, prices don't appear competitive and new products don't keep filling the shelves, the store is probably doomed to stagnate to a slow death.  If software patches and releases aren't kept up, support lifecycles expire and similar platforms become harder to integrate.  Upgrading through 3 versions of an O/S while maintaining compatibility with your in-house proprietary apps becomes not just painful, but nearly impossible.

It's usually the same with data.  Fixing problems immediately without doing root cause analysis can be just a band-aid solution.  Not fixing problems at all could be even worse, and fixing historical data could cost more than it's worth.

Historians say that it is harder to predict the past than it is to predict the future.  The internet is making predicting the past a bit easier, at least the past beyond the mid 90s.

Here's an article from 2005 that's still relevant, regarding data cleanliness in the warehouse.

Despite the great efforts from Inmon, Kimball, and the Others, the world of Data Warehousing is still facing great challenges. Even in 2005, after 14 years of Inmon explaining the concept, more than 50% of today’s data warehouse projects are anticipated to fail [Robert Jaques]. In fact, Ted Friedman, a Principal Analyst in Gartner wrote in 2005, “Many enterprises fail to recognize that they have an issue with data quality. They focus only on identifying, extracting, and loading data to the warehouse, but do not take the time to assess the quality.”


Today’s data warehouses suffer from poor quality of the data. Whether or not the poor quality of data existed a decade ago is a questionable hypothesis. In the 90s, the new breed of software products and the ease of implementing data-moving techniques have opened several avenues that resulted in data duplication. As a result, any data inconsistencies in source systems have been remedied by scrubbing and cleansing them on “local copies” of the data sets rather than taking efforts to correct them at the sources.
If Inmon or Kimball had foreseen the wave of software product proliferation in the 90s that implemented duplicated functionality in an organization, they might have stressed on architecting for better quality.

The premise behind many data warehousing strategies is to bring "all the things" into a central area for reporting, and perhaps analysis.  More so just reporting, or "giving the users the numbers" instead of the Semantic Meaning behind them.  Just like a real warehouse, the historical data starts to collect dust.  Without a proper inventory control system, it becomes hard to find things.  The older data elements appear a bit yellowed and sun-bleached, and perhaps have warped or had some design flaws when they were originally input that were only caught in later versions.  The newest items may look shiny due to improved validation processes during input time, but could have critical defects due to ETL "Chinese Whispers" gone haywire.

The way companies deal with interpreting and fixing bad data is probably the number one reason why most BI projects fail.

 As in a warehouse, it can be hard or just doesn't make ROI sense to go back and fix those old elements.  It is much easier to write them off as being known bad data.  This can cause grief and confusion for those trying to perform historical comparisons and trend analysis on the data.

In a dimensional model, a measure is also called a fact.  What is a fact if it is not quite true?  It doesn't become an opinion, it becomes an error.  Loading errors into a fact table is not a good idea. So we have a conundrum.



The missing component here is being able to publish scores that tell us the data is flawed, and commentary describing why we are not going to fix it, or how data quality is being addressed, or at least how to interpret the data, and ways to ignore data based on a threshold of quality.

As we move toward the "big data" experience, identifying trust-levels within data sources becomes more important.  Facts become opinions instead of errors.  Opinion tables sound much nicer to me than Error tables, and for dealing with data that doesn't have to be down to the penny, like internet sentiment analysis, perhaps opinions may work.

I propose an approach to augmenting current star/snowflake dimension models: opinionate your data models.  A rough DDL sketch follows the list below.

1. Add an opinion table to your fact table.  Ensure it uses the same grain as the fact table.
2. Include status columns for each column in your fact that you wish to track errors.
3. Include an aggregated status column, and a checksum that can be matched against the source record for auditing purposes.
  4. Include a type 2 (with history) slowly changing dimension for tracking when each fact table load occurred and which load is the most current.
5. Track all errors, questionable data, or data out of the statistical norm using this table.
6. Ensure you have dimensions for ratings, tags and link to a wiki for people to maintain commentary.
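Here's the rough DDL sketch mentioned above; every table, column and key name is hypothetical:

CREATE TABLE dbo.OpinionFactSales
(
    -- Same grain as the fact table
    DateKey           INT NOT NULL,
    ProductKey        INT NOT NULL,
    StoreKey          INT NOT NULL,
    -- One status column per fact column you want to track (0 = ok, 1 = questionable, 2 = known bad)
    SalesAmountStatus TINYINT NOT NULL,
    QuantityStatus    TINYINT NOT NULL,
    -- Aggregated status plus a checksum for auditing against the source record
    OverallStatus     TINYINT NOT NULL,
    SourceChecksum    INT NOT NULL,
    -- Links to the load-history (type 2) dimension and to ratings/tags/wiki commentary
    LoadHistoryKey    INT NOT NULL,
    RatingKey         INT NOT NULL,
    TagKey            INT NOT NULL,
    WikiUrl           NVARCHAR(400) NULL
);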

This table load could be automated using standard ETL tools, or manually augmented using data input forms.  If you're using Excel 2007/2010 and Analysis Services cubes, you could use the writeback feature to assign scores to existing data and provide 1/0 values for true/false flagging.  Some tools, like Oracle and SQL, automatically create statistics and histograms on tables which you could leverage to find out what data are outliers.  Data Quality Services tools can be trained to understand your data and identify suggestions or even fix your data.



Gamify the process by giving out weekly prizes for those business users that find the most issues with the data, and those technical users that propose the best solutions for fixing the issues.

Sometimes automating data processes can uncover unsettling facts, especially with people & processes that resist automation.  Improving data quality gives more power to the users of that data, and less power to the controllers of the data.  This isn't always a good thing.... depending on your opinion.


Wednesday, June 20, 2012

SSIS SCD Wizard Performance Issue « Data Warehousing and Business Intelligence

In Microsoft BI, when we do a slowly changing dimension type 2, we instinctively do SCD Wizard. The problem with SCD Wizard is the performance. For each source row coming to the SCD box, SSIS does a lookup on the target dimension table to find out if the row already exists or not. That is a crazy idea. And SSIS also update the target dimension table on row by row basis. That is also a crazy idea.



If your dimension is 1000 rows and the source data is 1000 rows, the SCD takes a few minutes. But if your dimension is 500k and the source data is 500k, it’s 5-6 hours performance. The whole idea is fundamentally flawed. On the welcome screen of the SSIS box, there should be a warning: Warning, do not use SSIS SCD Wizard if your dimension is above 1000 rows.


So, if your dimension is 1m rows, what should you do?

SSIS SCD Wizard Performance Issue « Data Warehousing and Business Intelligence
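One common set-based answer to that question, sketched below; this is my sketch rather than the quoted author's solution, and the table, column and attribute names are all hypothetical.  Stage the incoming rows, then expire and insert in two statements instead of row-by-row lookups:

-- Step 1: expire the current row for any member whose tracked attributes changed
UPDATE d
SET    d.EndDate   = GETDATE(),
       d.IsCurrent = 0
FROM   dbo.DimCustomer AS d
JOIN   staging.Customer AS s
       ON s.CustomerCode = d.CustomerCode
WHERE  d.IsCurrent = 1
  AND (s.City <> d.City OR s.Segment <> d.Segment);

-- Step 2: insert a new current row for every member that no longer has one
-- (brand-new members plus the members expired above)
INSERT INTO dbo.DimCustomer (CustomerCode, City, Segment, StartDate, EndDate, IsCurrent)
SELECT s.CustomerCode, s.City, s.Segment, GETDATE(), NULL, 1
FROM   staging.Customer AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM dbo.DimCustomer AS d
                   WHERE d.CustomerCode = s.CustomerCode
                     AND d.IsCurrent = 1);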

Tuesday, June 19, 2012

Filtering by logged in user

How to get the current user name in MDX.

WITH MEMBER Measures.UserID AS 'UserName()'
SELECT [Measures].[UserID] ON COLUMNS
FROM -- (your cube name here)
Filtering by logged in user

Wednesday, June 06, 2012

Add Report Server Content Types to a Library (Reporting Services in SharePoint Integrated Mode)

To configure Reporting Services against an existing Sharepoint 2010 farm, you need to go to the root site and enable the proper features.  Then configure libraries in subsites to expose the proper content types (Report Builder Model, Report, Data Source).

It can get a bit tricky.  Here are some links.

http://blog.cloudshare.com/2012/05/22/step-by-step-guide-to-installing-sharepoint-with-sql-2012-powerpivot-powerview-and-reporting-services/

Add Report Server Content Types to a Library (Reporting Services in SharePoint Integrated Mode)

SQL Error 909 when restoring ContosoRetailDW.bak using MS SQL Server Mgt. Studio

If you’re trying to restore the Contoso database samples on a BI or Standard edition of SQL 2012, you’ll get an error about Enterprise features being used.  Here is a script that can be run on a trial Enterprise or Developer edition of SQL 2012 to remove that dependency.

ALTER INDEX ALL ON DimChannel REBUILD WITH (DATA_COMPRESSION = None);
ALTER INDEX ALL ON DimEntity REBUILD WITH (DATA_COMPRESSION = None);
ALTER INDEX ALL ON DimProduct REBUILD WITH (DATA_COMPRESSION = None);
ALTER INDEX ALL ON DimStore REBUILD WITH (DATA_COMPRESSION = None);
ALTER INDEX ALL ON FactExchangeRate REBUILD WITH (DATA_COMPRESSION = None);
ALTER INDEX ALL ON FactInventory REBUILD WITH (DATA_COMPRESSION = None);
ALTER INDEX ALL ON FactITMachine REBUILD WITH (DATA_COMPRESSION = None);
ALTER INDEX ALL ON FactOnlineSales REBUILD WITH (DATA_COMPRESSION = None);
ALTER INDEX ALL ON FactSales REBUILD WITH (DATA_COMPRESSION = None);
ALTER INDEX ALL ON FactSalesQuota REBUILD WITH (DATA_COMPRESSION = None);
ALTER INDEX ALL ON FactStrategyPlan REBUILD WITH (DATA_COMPRESSION = None);

SQL Error 909 when restoring ContosoRetailDW.bak using MS SQL Server Mgt. Studio

Thursday, May 31, 2012

Share cache across SSIS packages

Shared caching is a new feature of SSIS 2012 that could improve performance when using the same large lookup table across multiple packages.  It could be approximated by saving the cache to a raw file in SSIS 2005/2008, but this should be faster.


Prior to SSIS 2012 it was not possible to re-use or share the same cache across packages. What that meant is if you created a cache you will be able to consume it only in the very package where the cache was instantiated. Therefore, a developer could take only two routes from here – either make as much heavy ETL processing as possible inside the same package where the cache resides, or populate yet another cache object in every other package where it is needed. The latter was especially harmful leading to unnecessary heavy extra database calls and an extended development time. This limitation has been overcome in the SSIS 2012 release.

Monday, May 14, 2012

SQL Live Monitor

 

A .NET application that provides realtime performance data on the target SQL Server instance.
No installation required, data displayed in realtime, and it can also be logged to CSV for offline analysis.  Will also capture SQL data for processing using PAL.

SQL Live Monitor

Thursday, May 10, 2012

Layerscape - Home Page

Layerscape is a fascinating Microsoft Research tool that integrates with Worldwide Telescope to provide deep spatial visualizations.

 

Layerscape - Home Page

Thursday, May 03, 2012

Download: Microsoft® SSIS Balanced Data Distributor - Microsoft Download Center - Download Details

Useful for distributing workloads and speeding up SSIS performance.

Microsoft® SSIS Balanced Data Distributor (BDD) is a new SSIS transform. This transform takes a single input and distributes the incoming rows to one or more outputs uniformly via multithreading. The transform takes one pipeline buffer worth of rows at a time and moves it to the next output in a round robin fashion. It’s balanced and synchronous so if one of the downstream transforms or destinations is slower than the others, the rest of the pipeline will stall so this transform works best if all of the outputs have identical transforms and destinations. The intention of BDD is to improve performance via multi-threading. Several characteristics of the scenarios BDD applies to: 1) the destinations would be uniform, or at least be of the same type. 2) the input is faster than the output, for example, reading from flat file to OleDB.
NOTE:  An updated version of Microsoft SSIS Balanced Data Distributor (BDD) is available. This includes a fix for the problem causing BIDS to crash when removing the BDD transform. For more details, and download information, see KB 2616527.

Download: Microsoft® SSIS Balanced Data Distributor - Microsoft Download Center - Download Details

Monday, April 30, 2012

SQL Server Security - Site Home - MSDN Blogs

Instead of using CHECKSUM, timestamp, or other methods, hashing data may be a good alternative.

A common scenario in data warehousing applications is knowing what source system records to update, what data needs to be loaded and which data rows can be skipped as nothing has changed since they were last loaded. Another possible scenario is the need to facilitate searching data that is encrypted using cell level encryption or storing application passwords inside the database.

Data Hashing can be used to solve this problem in SQL Server.
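A small sketch of the change-detection case; the table and column names are invented, and the linked post covers the caveats (algorithm choice, delimiters, NULL handling):

-- Compute a hash over the business columns of each incoming row...
SELECT s.CustomerCode,
       HASHBYTES('SHA1', CONCAT(s.FirstName, '|', s.LastName, '|', s.City)) AS RowHash
FROM   staging.Customer AS s;

-- ...and reload only the rows whose hash no longer matches the stored one
SELECT s.*
FROM   staging.Customer AS s
JOIN   dbo.DimCustomer AS d
       ON d.CustomerCode = s.CustomerCode
WHERE  HASHBYTES('SHA1', CONCAT(s.FirstName, '|', s.LastName, '|', s.City)) <> d.RowHash;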

SQL Server Security - Site Home - MSDN Blogs

Thursday, April 26, 2012

Twitter Kills The Majesty of Pleasant Conversation


Words are powerful things.  As of late, I have been working on text mining social data feeds, and investigating how Hadoop, R, Azure and SQL Server 2012 fit into the big picture.  I wrote a SQL Integration Services package using a script task that pulls Twitter data from a syndication feed, parses out the words, hash tags, screen names and links, and stores the results in a SQL 2012 (or optionally SQL Azure) database.  Performing some text mining of Twitter status keywords against a selection of “best” and “worst” words in the English language brings back some interesting and slightly depressing results.

I started by entering the Twitter screen names to follow into the database.  I targeted a few Canadian bike companies (don't ask why).  Each time the package is run, it adds the last 20 tweets, and parses the screen names of users mentioned in each tweet.  And so on.  This recursion builds a very quick six-degrees-of-separation trail and some fairly random discussions.

Running this process for about 4 days, sometimes 2-3 times per day, produced 5599 tweets.  Originally I was looking at using R and Hadoop to analyze the results, which is a bit like bringing a ballistic missile to a knife fight.  To slice this data with SQL takes only a couple of seconds or less.  Perhaps reading the entire Twitter firehose or analyzing historic tweet data might change the architecture in the future.  For now, things are working pretty well. 

Of a selection of 5599 individual tweets, 9 contain the “best words” and 2135 have the “worst words” as rated by Vocabula Review.  That’s 38% of the sample that have an aura of foolishness or odium, and 0.1% that have an aura of fun and majesty.  The sampling is fairly small, with the top word “valley” only coming up 3 times. 

Another dataset, seeded with a more technology-centric list of Twitter users (Robert Scoble and some Microsoft folks I follow), brought back similar results.  Running this process over the course of a month saved 59,583 tweets containing 796,171 words, links, screen names, emoticons and hash tags.

Of the 796k words, 24,171 came up in the "worst words" filter.  That's about 3%.  A measly 282 came up in the "best words" filter.  That's less than 0.04%.

The following Top 5 Words came up.

valley 74
azure 19
simplicity 11
bliss 10
recherche 8

  • Valley makes sense, with Silicon Valley, Napa Valley, and those other west coast valleys being discussed.
  • Azure makes sense, since a steady stream of Windows Azure propaganda continually bubbles from Microsoft.
  • Simplicity comes up a few times when people talk about Apple or iPad.
  • Bliss comes up because of Rob Bliss, a Washington Times columnist, and some comments about cranberry bliss bars.
  • Recherche, well, let’s chalk that up to the fact that some of the best words in the English language are French.  Mon dieu.

With only 140 characters to leverage, you would think that people would use words like “animadversion” or “cachinnation” to provide deep and meaningful expression.  Instead, you get the logorrhea that is the Twitter dataset.

Check out www.vocabula.com to improve your tweets and amaze your followers with fun and majesty.

The BI Software Tool Belt

Excellent list of must-have software for SQL Server and everything else.

My must-have software is:

  • Microsoft OneNote, Live Writer
  • SSMS Tools
  • BIDS Helper
  • WinDirStat

The BI Software Tool Belt

Wednesday, April 25, 2012

SQL Server FineBuild

A wizard for installing SQL?  Yes, it exists…

FineBuild provides 1-click install and best-practice configuration of SQL Server 2012, SQL Server 2008 R2, SQL Server 2008, and SQL Server 2005.

SQL Server FineBuild

Tuesday, April 24, 2012

CQRS & Separation of Church and State

CQRS is a design pattern acronym which stands for Command Query Responsibility Segregation.  I would call it the R/RW pattern, or the separation of Church and State.  In a reporting application design, it would silo the read queries from the write commands.

At its simplest, it is designing two services for each application interface: one that reads and one that writes.

That is it. That is the entirety of the CQRS pattern.

CQRS, Task Based UIs, Event Sourcing agh! | Greg Young

Since the bulk of applications have highly concurrent reads and minimal or bulk writes, thinking about this pattern during design is very important.  What if you were able to architect a system where you could spin up unlimited services that provide querying, and a few services that provide writes/updates, queuing them up for later?  Would this improve performance and scalability? What if you could split the data for these systems into unlimited read-only models and a few read/write models?

What if the data in the read-only models was effortlessly in-synch with the read/write models, with minimal latency and lazy consistency?

This is one of the tenets behind Azure and cloud computing in general.

You can do this in a SQL Server environment, albeit not so effortlessly and with some gotchas, using Log Shipping, Undo and database snapshots for a read-only picture of the database.  There is some latency, and keeping the databases in synch adds lots of overhead. 

SQL 2012 appears to solve some of the limitations of SQL 2008 R2 log shipping with its AlwaysOn capabilities for rolling out read-only, distributed clusters. It still doesn't seem as effortless as it should be.

Replication also offers a solution, though it can be brittle and has its limitations and maintenance overhead. 

SANs also offer a high-cost solution for fast bulk loads, with some downtime.

You cannot use SAN snapshots of an active SQL Server database file(s) unless you detach them first. This is very intrusive.

I actually do this for a situation where I need 800GB of data from one server to another every night.

  • Detach the database on ServerA
  • SAN snapshot serverA drives F:,G:,H:,I:
  • reattach database on ServerA
  • detach database on ServerB
  • detach drive L,M,N,O on ServerB
  • mount new SAN snapshots L,M,N,O on ServerB
  • attach database files from L,M,N,O on ServerB

It requires an outage on both, but is very fast to virtually move the data to the other server.

http://social.msdn.microsoft.com/Forums/eu/sqldatabaseengine/thread/8f01f55a-66ab-4f96-a7bf-dca10bea64b8
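The T-SQL side of those steps looks roughly like this; the SAN snapshot and drive-mount commands are storage-vendor specific and omitted, and the database and file names are hypothetical:

-- On ServerA: detach, let the SAN snapshot F:, G:, H:, I:, then reattach
USE master;
EXEC sp_detach_db @dbname = N'BigReportingDB';
-- ... SAN snapshot of the drives happens here ...
CREATE DATABASE BigReportingDB
ON (FILENAME = N'F:\Data\BigReportingDB.mdf'),
   (FILENAME = N'G:\Data\BigReportingDB_1.ndf'),
   (FILENAME = N'H:\Data\BigReportingDB_2.ndf'),
   (FILENAME = N'I:\Log\BigReportingDB_log.ldf')
FOR ATTACH;

-- On ServerB: detach, swap the old L:, M:, N:, O: volumes for the new snapshots, then attach
EXEC sp_detach_db @dbname = N'BigReportingDB';
-- ... dismount the old drives, mount the new snapshot LUNs ...
CREATE DATABASE BigReportingDB
ON (FILENAME = N'L:\Data\BigReportingDB.mdf'),
   (FILENAME = N'M:\Data\BigReportingDB_1.ndf'),
   (FILENAME = N'N:\Data\BigReportingDB_2.ndf'),
   (FILENAME = N'O:\Log\BigReportingDB_log.ldf')
FOR ATTACH;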

Change data capture is another solution for maintaining multiple concurrent copies of a data stream.  SQL 2012 now has an Oracle CDC feature which improves the integration and reporting story.

The above solutions are very hammer-based approaches and don't necessarily relate to CQRS.  Message queuing, eventing, change data capture and pub/sub systems all seem to fall within the CQRS design pattern, though they are really just an extension of the thought process.

Chad Meyers has an excellent posting on Laws, Rules, Principles, Patterns, and Practices.

Another set of excellent posts on the concept of CQRS.

http://thinkbeforecoding.com/

There doesn’t seem to be a silver bullet to the solution just yet, though I’m sure someone will get it right sooner or later. The CQRS pattern is merely that, a pattern. The devil is in the details.

Monday, April 23, 2012

Ctrl+E not working for executing statements | Microsoft Connect

Upgrading from Denali CTP1?  Your keyboard shortcuts (CTRL-E, CTRL-R) are probably not working.  Use Options – Keyboard – Reset to fix.

Ctrl+E not working for executing statements | Microsoft Connect

Reloading Agent Settings appears in SQL Agent Logs every 90 seconds | Microsoft Connect

One of the things you may want to do as soon as you install SQL 2012 is disable the auto refresh feature of SQL Agent, which will fill up your agent logs with superfluous messages.

We investigated this issue and it's indeed painful behavior from the user perspectives, we will address this issue in nearest Service Pack.
For now I can suggest the following workaround:
Go to the registry on your SQL Server machine and navigate to the corresponding SQL instance registry key: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL11.YOUR_INSTANCE_NAME\SQLServerAgent
Change the registry setting called "AutoRegistryRefresh" from 1 to 0.
The side impact of this: if you change any default settings behavior for SQL Agent, you may need to restart SQL Agent to get it.

Reloading Agent Settings appears in SQL Agent Logs every 90 seconds | Microsoft Connect

Exporting via HTML instead of MHTML - CSS SQL Server Engineers - Site Home - MSDN Blogs

How to configure additional output formats (HTML) for Reporting Services.

There was a question on Twitter about how to display a report in HTML instead of MHTML due to some browser issues.  Based on the fact that it was MHTML, I’m assuming we are talking about exporting a report as the default report view is HTML.  First off, if we look at our export options for a report, we see the following:

Exporting via HTML instead of MHTML - CSS SQL Server Engineers - Site Home - MSDN Blogs

Thursday, April 19, 2012

Jonathan Kehayias | Installing and Configuring SQL Server 2012 Distributed Replay

The distributed replay feature of SQL 2012 allows for load testing a sampled workload against a different SQL server, and spinning up multiple workloads to simulate real activity.

Here’s the install instructions.

Jonathan Kehayias | Installing and Configuring SQL Server 2012 Distributed Replay

Monday, April 16, 2012

SQL Server 2012 Command Line Utilities — DatabaseJournal.com

The MAK talks about command line utilities in SQL 2012.

Microsoft provided many Command line utilities for SQL Server 2012.  In this article I am going to explain the functionality and use of these command line utilities based on the feature it is dealing with.

SQL Server 2012 Command Line Utilities — DatabaseJournal.com

Also, the details on PowerShell cmdlets for SQL