Tuesday, September 27, 2005

SSIS 2005 Naming Conventions


ETL/SSIS/Naming Conventions
We are developing SSIS solutions with project teams, and figured it would be a good idea to define a set of ETL, SSIS, and naming conventions in order to make it easier for team members to collaborate. We would like to share our thoughts on these conventions, and wonder what strategies other SSIS developers out there are currently using.
ETL StrategyAs a general strategy for our ETL process, we use three SQL Server databases: SA_IN, SA_OUT, and DWH (where SA means Staging Area). Data flows through these databases in the following steps (see this picture for a graphical representation of the entire process):
First, all relevant data from the source systems (flat files, production databases, etc.) will be copied 1:1 to the SA_IN database.
In the SA_IN database, we build a view for each dimension and fact table we want to create. Each view's SELECT statement selects and joins the SA_IN columns and tables that are needed to populate the corresponding dimension or fact table. We also use the view to give columns correct names (aliases) and to perform some simple transformations (e.g., replace NULL values with the ISNULL function).
Then we use SSIS to copy the result data from each SA_IN view into the SA_OUT database. Note that the tables in the SA_OUT database can be created on the fly: when configuring the SQL Server Destination components in our SSIS package, we use the "Create..." button to create a new table in the SA_OUT database (the table design will then be based on the output columns of the data source, in this case our SA_IN view).
On the SA_OUT database we perform complex transformations with SSIS components when necessary.
Next, we use SSIS to copy all the data 1:1 from the SA_OUT database to the DWH database. On the DWH database we have defined foreign key constraints (relationships) to maintain referential integrity. Rows that do not satisfy these constraints will be written to an error table in the SA_OUT database and will be reported back to the administrator.
Finally, we use SSIS to process the SSAS cubes that are built on top of the DWH data warehouse database.
SSIS Package StructureOne of the decisions we had to make, was whether we should create many small SSIS packages or a few large packages. We decided on a middle course. Each SSIS solution contains one _Overall package that executes an _Extract, _Transform, and _Load package. These three packages in turn execute a package for each source table / data warehouse table that needs to be extracted, transformed, or loaded into the data warehouse (this picture shows how these packages fit in the ETL process as a whole). We are using separate packages for each table so different persons can work on the ETL for different tables at the same time.
Naming ConventionsWe also decided to define naming conventions for both database objects and SSIS packages. SSAS seems to detect fact and dimension tables based on their names: when a table name contains the string "dim", SSAS will treat it by default as a dimension table. The same holds for the string "fact" for fact tables. So we use a Dim and Fact prefix for dimension and fact tables respectively. Another naming convention we have often seen in Microsoft products is to use the prefixes PK and FK for primary keys and foreign keys. We've chosen to follow this notation, and also added a Key and AppKey suffix to differentiate surrogate keys from application (OLTP) keys. We have also posted a summary of the naming conventions we are currently considering. Note that some package names start with an underscore. The purpose of this prefix is twofold: (a) it differentiates the top-level packages from the lower-level packages, and (b) it makes these packages easy to find in the solution explorer (the solution explorer orders packages alphabetically by package name, so all the packages that start with an underscore will be grouped together at the top of the list).
Maarten Strunk [Macaw]

No comments: