Wednesday, July 25, 2012

The beginning of the end of NoSQL — Too much information

Earlier this year, Gartner killed the term Business Intelligence and its associated Business Intelligence Competency Center (BICC) acronym, and "introduced" the term Business Analytics and the concept of Business Analytic Teams (BATs).

Some NoSQL-categorized vendors now prefer to be called anything but NoSQL, since the label lumps disparate technologies together under a single buzzword.

CouchDB is often categorized as a “NoSQL” database, a term that became increasingly popular in late 2009 and early 2010. While this term is a rather generic characterization of a database, or data store, it does clearly define a break from traditional SQL-based databases. A CouchDB database lacks a schema, or rigid pre-defined data structures such as tables. Data stored in CouchDB is a collection of JSON documents. The structure of the data, or documents, can change dynamically to accommodate evolving needs.

So if the term NoSQL is dead, what is the replacement?  NewSQL?  The lack of creativity is amazing.

A database is only a database if it is an organized collection of data.  Are NoSQL databases really organized or are they freeform?  If it's not a database, what is this unstructured set of information called?

Another term that could be headed for the Furby pile is "Big Data", which is apparently a trademark of a corporation.  Massive Data sounds better to me anyway...




Speaking of Furbys...
http://www.engadget.com/2012/07/06/furby-hands-on-video/

Query Languages and Technology Mashups

SQL (Structured Query Language, pronounced "See-Kwell") has been around for a while.  IBM engineers developed SEQUEL (Structured English Query Language) in the early 70s to support building and querying relational models.  It was adopted as an ANSI standard in the mid-80s.  I've been using it since the days of Informix & Sybase, prior to SQL Server 6.5.  Various flavours and implementations have been created since then, with the key ones being SQL Server's T-SQL, Oracle's PL/SQL, IBM's SQL PL, and PostgreSQL's PL/pgSQL, plus the various offshoots of those languages.

MDX (Multidimensional Expressions, or Mosha's Data Expressions) is used to query Analysis Services multidimensional and tabular cube models.  Instead of a single axis of rows being returned, there is the potential to return up to 128 axes of data.  Usually it's just rows, columns, and sometimes pages, since thinking in 128 dimensions hurts most people's brains.

DMX (Data Mining Extensions) is a SQL-like query language used for creating and pulling data from data mining models.  The original team lead for Data Mining at MS is now CTO at a company called Predixion Software, implementing mining models using PowerPivot.

DAX (Data Analysis Expressions) supersedes MDX for the Analysis Services / PowerPivot tabular models.  MDX is still used in the background when Excel queries PowerPivot models, but DAX is the way you write expressions, calculated columns and measures.

NoSQL (Not Only SQL) isn't really a language, it is a technology categorization.  NoSQL databases like those used at Facebook and Google aren't necessarily relational in nature.  Usually they are key-value stores.  Think thin columns and deep rows.  Microsoft has Windows Azure Blob Storage as its NoSQL offering, with others on the way.  HQL (Hive Query Language) is one way of querying a NoSQL database running on Hadoop.  Not to be confused with Hibernate Query Language, used to query the Hibernate/NHibernate ORM frameworks.  Most NoSQL databases fail the ACID test, and aren't necessarily good for transactional systems.  They're great when you need to distribute and analyze massive amounts of data.

NewSQL is either a Sourceforge LDBC driver implementation created around 2003 that offers a simplified SQL syntax, or yet another buzzword coined last year to describe something that isn't SQL and isn't NoSQL.  It's easier to use, more scalable, and performs better.  It provides ACID support and all the features of new and old database technologies.  Or so the vendors of this technology space suggest...

Some products in the NewSQL space include Akiban Server and (perhaps) Microsoft SQL Azure.

More info here:
http://www.readwriteweb.com/cloud/2011/04/the-newsql-movement.php

Augmenting legacy SQL databases and technologies with their new "Big Data" columnstore or key-value counterparts seems to be the Next Big Thing.  Eventually one of the Big Dogs will fill this space, probably Microsoft, Oracle, IBM or SAP.  For now the niche players will nibble at each other's dogfood, until somebody decides to buy out the dogfood supplier.
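
As a rough illustration of that augmentation (assuming a hypothetical dbo.FactSales table), SQL Server 2012 can bolt a nonclustered columnstore index onto an existing row-store table:

-- Add a columnstore copy of selected columns to an existing relational fact table
CREATE NONCLUSTERED COLUMNSTORE INDEX ix_FactSales_columnstore
ON dbo.FactSales (DateKey, ProductKey, StoreKey, SalesAmount, Quantity);
-- The table stays relational and large scans/aggregations get faster,
-- but in SQL 2012 the index also makes the table read-only until it is disabled or dropped.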

What we really need is a database to track and compare the companies, names, technologies, features and query languages used for each of these products.  Maybe this database could just be a query language against Wikipedia.  As more and more players enter the database market, research companies and consultants who understand the "Big Picture" become more important, as do content aggregators and search tools like Wikipedia, LinkedIn and Google.

Wednesday, July 18, 2012

Using Sequence vs. Identity Columns in SQL 2012

The conclusion from the post linked below is that identity is about 10 times slower than sequence with the default cache size, and that you will not gain significant performance by defining a very big cache size.


It sounds like sequence is the way to go when performance and concurrency are issues.  Using a sequence means rewriting your application inserts to handle the new syntax, e.g. INSERT INTO x (id) VALUES (NEXT VALUE FOR seqname), or binding NEXT VALUE FOR to the column as a default constraint.
An insert trigger could also replicate the functionality of the identity column.
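
A minimal sketch in T-SQL (the sequence, table and column names are just placeholders):

-- Create the sequence with a modest cache
CREATE SEQUENCE dbo.seq_OrderID AS INT START WITH 1 INCREMENT BY 1 CACHE 50;

-- Option 1: pull the next value explicitly in each insert
INSERT INTO dbo.Orders (OrderID, CustomerName)
VALUES (NEXT VALUE FOR dbo.seq_OrderID, 'Acme');

-- Option 2: bind it as a column default, so existing inserts keep working unchanged
ALTER TABLE dbo.Orders
ADD CONSTRAINT df_Orders_OrderID DEFAULT (NEXT VALUE FOR dbo.seq_OrderID) FOR OrderID;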

SQL Server 2012 Sequence Internal » John Huang's Blog
 
Further details on the internals of identity vs. sequence are here.
http://dba.stackexchange.com/questions/1635/why-are-denali-sequences-supposed-to-perform-better-than-identity-columns
 
The primary reasons I wouldn't use a sequence, in some cases anyway:
 
1. No DISTINCT keyword allowed when using sequences in a query.
2. Can't use in views.
3. Adds complexity vs. an identity column.
4. Can't use in check constraints or user-defined functions.
5. Needs a default constraint or trigger to mimic identity-style insert logic.
6. Can be updated without setting the identity insert option.

Monday, July 16, 2012

Tuning Analysis Services

Tuning Analysis Services for better aggregation - parallel partitioning.

Set AggregationMemoryLimitMax and AggregationMemoryLimitMin for better performance in multi-CPU environments.

http://geekswithblogs.net/ManicArchitect/archive/2010/11/02/142558.aspx

Wednesday, July 11, 2012

Data Quality in an Enterprise Warehouse

In my experience, the product releases that enterprise customers run, especially O/Ses and database systems, are usually at least 1-2 versions behind the current ones.  Enterprise IT doesn't like adopting the latest and greatest technologies until they have been proven, and rightfully so.  Some of the latest releases could stop your business in a second with a critical support issue.

Not adopting the latest technologies can also stop your business, though less like a train wreck and more like an old mall department store.  If maintenance isn't kept up, prices don't appear competitive and new products don't keep filling the shelves, the store is probably doomed to stagnate to a slow death.  If software patches and releases aren't kept up, support lifecycles expire and similar platforms become harder to integrate.  Upgrading through 3 versions of an O/S while maintaining compatibility with your in-house proprietary apps becomes not just painful, but nearly impossible.

It's usually the same with data.  Fixing problems immediately without doing root cause analysis can be just a band-aid solution.  Not fixing problems at all could be even worse, and fixing historical data could cost more than it's worth.

Historians say that it is harder to predict the past than it is to predict the future.  The internet is making predicting the past a bit easier, at least the past beyond the mid 90s.

Here's an article from 2005 that's still relevant, regarding data cleanliness in the warehouse.

Despite the great efforts from Inmon, Kimball, and the Others, the world of Data Warehousing is still facing great challenges. Even in 2005, after 14 years of Inmon explaining the concept, more than 50% of today’s data warehouse projects are anticipated to fail [Robert Jaques]. In fact, Ted Friedman, a Principal Analyst in Gartner wrote in 2005, “Many enterprises fail to recognize that they have an issue with data quality. They focus only on identifying, extracting, and loading data to the warehouse, but do not take the time to assess the quality.”


Today’s data warehouses suffer from poor quality of the data. Whether or not the poor quality of data existed a decade ago is a questionable hypothesis. In the 90s, the new breed of software products and the ease of implementing data-moving techniques have opened several avenues that resulted in data duplication. As a result, any data inconsistencies in source systems have been remedied by scrubbing and cleansing them on “local copies” of the data sets rather than taking efforts to correct them at the sources.
If Inmon or Kimball had foreseen the wave of software product proliferation in the 90s that implemented duplicated functionality in an organization, they might have stressed on architecting for better quality.

The premise behind many data warehousing strategies is to bring "all the things" into a central area for reporting, and perhaps analysis.  Often it is just reporting, or "giving the users the numbers", rather than the semantic meaning behind them.  Just like a real warehouse, the historical data starts to collect dust.  Without a proper inventory control system, it becomes hard to find things.  The older data elements appear a bit yellowed and sun-bleached, and perhaps warped, or had design flaws when they were originally input that were only caught in later versions.  The newest items may look shiny due to improved validation processes at input time, but could have critical defects due to ETL "Chinese Whispers" gone haywire.

The way companies deal with interpreting and fixing bad data is probably the number one reason why most BI projects fail.

As in a warehouse, it can be hard, or it just doesn't make ROI sense, to go back and fix those old elements.  It is much easier to write them off as known bad data.  This can cause grief and confusion for those trying to perform historical comparisons and trend analysis on the data.

In a dimensional model, a measure is also called a fact.  What is a fact if it is not quite true?  It doesn't become an opinion, it becomes an error.  Loading errors into a fact table is not a good idea. So we have a conundrum.



The missing component here is being able to publish scores that tell us the data is flawed, along with commentary describing why we are not going to fix it, how data quality is being addressed, or at least how to interpret the data, plus ways to ignore data that falls below a quality threshold.

As we move toward the "big data" experience, identifying trust-levels within data sources becomes more important.  Facts become opinions instead of errors.  Opinion tables sound much nicer to me than Error tables, and for dealing with data that doesn't have to be down to the penny, like internet sentiment analysis, perhaps opinions may work.

I propose an approach to augmenting current star/snowflake dimension models.   Opinionate your data models.

1. Add an opinion table to your fact table (see the sketch after this list).  Ensure it uses the same grain as the fact table.
2. Include a status column for each fact column whose errors you wish to track.
3. Include an aggregated status column, and a checksum that can be matched against the source record for auditing purposes.
4. Include a type 2 (with history) slowly changing dimension for tracking when each fact table load occurred and which load is the most current.
5. Track all errors, questionable data, or data outside the statistical norm using this table.
6. Ensure you have dimensions for ratings and tags, and link to a wiki where people can maintain commentary.
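
A rough sketch of what such an opinion table might look like (all table and column names here are made up for illustration):

CREATE TABLE dbo.FactSalesOpinion (
    SalesFactKey      BIGINT       NOT NULL,  -- same grain: one row per fact table row
    LoadBatchKey      INT          NOT NULL,  -- points at the type 2 load-history dimension
    SalesAmountStatus TINYINT      NOT NULL,  -- per-column status: 0 = ok, 1 = suspect, 2 = known bad
    QuantityStatus    TINYINT      NOT NULL,
    OverallStatus     TINYINT      NOT NULL,  -- aggregated status across the tracked columns
    SourceChecksum    INT          NOT NULL,  -- e.g. CHECKSUM() of the source record, for auditing
    RatingKey         INT          NULL,      -- ratings/tags dimension
    WikiUrl           VARCHAR(400) NULL       -- link to commentary maintained on a wiki
);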

This table load could be automated using standard ETL tools, or manually augmented using data input forms.  If you're using Excel 2007/2010 and Analysis Services cubes, you could use the writeback feature to assign scores to existing data and provide 1/0 values for true/false flagging.  Some tools, like Oracle and SQL Server, automatically create statistics and histograms on tables, which you could leverage to find out which data points are outliers.  Data Quality Services can be trained to understand your data and suggest, or even apply, fixes.
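
For example, in SQL Server you can peek at the histogram the optimizer already maintains on a column (the statistic name below is hypothetical):

-- Show the value distribution for an existing statistic on the fact table
DBCC SHOW_STATISTICS ('dbo.FactSales', 'ix_FactSales_SalesAmount') WITH HISTOGRAM;
-- Buckets whose RANGE_ROWS or EQ_ROWS are far out of line with the rest
-- are a cheap way to spot candidate outliers for the opinion table.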



Gamify the process by giving out weekly prizes to the business users who find the most issues with the data, and the technical users who propose the best solutions for fixing those issues.

Sometimes automating data processes can uncover unsettling facts, especially with people & processes that resist automation.  Improving data quality gives more power to the users of that data, and less power to the controllers of the data.  This isn't always a good thing... depending on your opinion.