Wednesday, July 11, 2012

Data Quality in an Enterprise Warehouse

Current product releases for enterprise customers, especially O/Ses and database systems, are usually at least 1-2 versions behind in my experience.  Enterprise IT doesn't like adopting the latest and greatest technologies until they have been proven, and rightfully so.  Some of the latest releases could stop your business in a second with a critical support issue.  

Not adopting the latest technologies can also stop your business, though less like a train wreck and more like   an old mall department store.  If maintenance isn't kept up, prices don't appear competitive and new products don't keep filling the shelves, the store is probably doomed to stagnate to a slow death.  If software patches and releases aren't kept up, support lifecycles expire and similar platforms become harder to integrate.  Upgrading through 3 versions of an O/S while maintaining compatibility with your in-house proprietary apps becomes not just painful, but nearly impossible.

It's usually the same with data.  Fixing problems immediately without doing root cause analysis can be just a band-aid solution.  Not fixing problems at all could be even worse, and fixing historical data could cost more than it's worth.

Historians say that it is harder to predict the past than it is to predict the future.  The internet is making predicting the past a bit easier, at least the past beyond the mid 90s.

Here's an article from 2005 that's still relevant, regarding data cleanliness in the warehouse.

Despite the great efforts from Inmon, Kimball, and the Others, the world of Data Warehousing is still facing great challenges. Even in 2005, after 14 years of Inmon explaining the concept, more than 50% of today’s data warehouse projects are anticipated to fail [Robert Jaques]. In fact, Ted Friedman, a Principal Analyst in Gartner wrote in 2005, “Many enterprises fail to recognize that they have an issue with data quality. They focus only on identifying, extracting, and loading data to the warehouse, but do not take the time to assess the quality.”

Today’s data warehouses suffer from poor quality of the data. Whether or not the poor quality of data existed a decade ago is a questionable hypothesis. In the 90s, the new breed of software products and the ease of implementing data-moving techniques have opened several avenues that resulted in data duplication. As a result, any data inconsistencies in source systems have been remedied by scrubbing and cleansing them on “local copies” of the data sets rather than taking efforts to correct them at the sources.
If Inmon or Kimball had foreseen the wave of software product proliferation in the 90s that implemented duplicated functionality in an organization, they might have stressed on architecting for better quality.

The premise behind many data warehousing strategies is to bring "all the things" into a central area for reporting, and perhaps analysis.  More so just reporting, or "giving the users the numbers" instead of the Semantic Meaning behind them.  Just like a real warehouse, the historical data starts to collect dust.  Without a proper inventory control system, it becomes hard to find things.  The older data elements appear a bit yellowed and sun-bleached, and perhaps have warped or had some design flaws when they were originally input that were only caught in later versions.  The newest items may look shiny due to improved validation processes during input time, but could have critical defects due to ETL "Chinese Whispers" gone haywire.

The way companies deal with interpreting and fixing bad data is probably the number one reason why most BI projects fail.

 As in a warehouse, it can be hard or just doesn't make ROI sense to go back and fix those old elements.  It is much easier to write them off as being known bad data.  This can cause grief and confusion for those trying to perform historical comparisons and trend analysis on the data.

In a dimensional model, a measure is also called a fact.  What is a fact if it is not quite true?  It doesn't become an opinion, it becomes an error.  Loading errors into a fact table is not a good idea. So we have a conundrum.

The missing component here is being able to publish scores that tell us the data is flawed, and commentary describing why we are not going to fix it, or how data quality is being addressed, or at least how to interpret the data, and ways to ignore data based on a threshold of quality.

As we move toward the "big data" experience, identifying trust-levels within data sources becomes more important.  Facts become opinions instead of errors.  Opinion tables sound much nicer to me than Error tables, and for dealing with data that doesn't have to be down to the penny, like internet sentiment analysis, perhaps opinions may work.

I propose an approach to augmenting current star/snowflake dimension models.   Opinionate your data models.

1. Add an opinion table to your fact table.  Ensure it uses the same grain as the fact table.
2. Include status columns for each column in your fact that you wish to track errors.
3. Include an aggregated status column, and a checksum that can be matched against the source record for auditing purposes.
4. Include a type 2 with history slowly-changing dimension for managing when each fact table load occurred and what the most current load is.
5. Track all errors, questionable data, or data out of the statistical norm using this table.
6. Ensure you have dimensions for ratings, tags and link to a wiki for people to maintain commentary.

This table load could be automated using standard ETL tools, or manually augmented using data input forms.  If you're using Excel 2007/2010 and Analysis Services cubes, you could use the writeback feature to assign scores to existing data and provide 1/0 values for true/false flagging.  Some tools, like Oracle and SQL, automatically create statistics and histograms on tables which you could leverage to find out what data are outliers.  Data Quality Services tools can be trained to understand your data and identify suggestions or even fix your data.

Gamify the process by giving out weekly prizes for those business users that find the most issues with the data, and those technical users that propose the best solutions for fixing the issues.

Sometimes automating data processes can uncover unsettling facts, especially with people & processes that resist automation.  Improving data quality gives more power to the users of that data, and less power to the controllers of the data.  This isn't always a good thing.... depending on your opinion.

No comments: