Thursday, September 17, 2009

Handling history failures in data warehousing

I work with Business Intelligence and use slowly changing history (Wikipedia) extensively. We have a multi-layered topology and use multiple modeling techniques, including "data vault", "star schema" and the relational model.

The way we implement history…
- It requires an identified unique key (one or more fields) in every source system table
- One row in the original source system initially produces two rows in the destination (in each layer of the topology)
- A change in any field in the source produces a new row in the destination
- A delete in the source also creates a new row in the destination

This way a single row in the source system can easily become ten or more rows in the data warehouse. With modern hardware this is not really a problem, even when we get into the terabytes.
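
To make the mechanism concrete, here is a minimal sketch in Python. It is not our actual implementation: the row format, load_history and its parameters are all hypothetical, and it simplifies to one row per version (the two-rows-on-initial-load convention above is not modeled).

from datetime import date

META = ("valid_from", "valid_to", "deleted")   # versioning metadata fields

def payload(row):
    """The source fields of a destination row, without the metadata."""
    return {k: v for k, v in row.items() if k not in META}

def load_history(history, source_rows, key_field, load_date):
    """Append new versions to history for inserts, changes and deletes."""
    current = {r[key_field]: r for r in history if r["valid_to"] is None}
    seen = set()

    for src in source_rows:
        seen.add(src[key_field])
        cur = current.get(src[key_field])
        if cur is None or cur["deleted"] or payload(cur) != src:
            if cur is not None:
                cur["valid_to"] = load_date          # close the old version
            history.append({**src, "valid_from": load_date,
                            "valid_to": None, "deleted": False})

    # A delete in the source also creates a new row in the destination.
    for key, cur in current.items():
        if key not in seen and not cur["deleted"]:
            cur["valid_to"] = load_date
            history.append({**payload(cur), "valid_from": load_date,
                            "valid_to": None, "deleted": True})

Three loads then turn one source row into three warehouse rows:

hist = []
load_history(hist, [{"id": 1, "name": "Ann"}], "id", date(2009, 9, 1))   # insert
load_history(hist, [{"id": 1, "name": "Anna"}], "id", date(2009, 9, 2))  # change
load_history(hist, [], "id", date(2009, 9, 3))                           # delete
# hist now holds three versions of the single source row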

Problem

The problem comes when the history mechanisms themselves fail. A typical failure is that one static row in a source system produces a new row in the destination on every load, even though nothing has really changed.
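
One way such a failure can arise (a hypothetical illustration in the spirit of the sketch above, not our actual bug): a volatile technical field slips into the change comparison, so a completely static row looks changed on every load.

def stage(source_row, load_no):
    # Staging accidentally stamps the row with a volatile technical field.
    return {**source_row, "load_no": load_no}

def changed_buggy(old, new):
    # Bug: compares every field, including load_no, instead of only the
    # real source fields -- so the same source data differs on every load.
    return old != new

row = {"id": 1, "name": "Ann"}                      # a static source row
print(changed_buggy(stage(row, 1), stage(row, 2)))  # True: spurious change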

If introduced in an early layer, this will spread like the plague through the warehouse (and not like bird flu, either).

These errors take time to solve and require expensive talent.

Solution

We now have a set of functionality that automatically fixes things up and centralizes any misbehaving data in our metadata repository or elsewhere.
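
To give the flavor of the idea (a hypothetical sketch reusing the row format from the first example; collapse_false_versions and the quarantine list are invented names, not our production code):

META = ("valid_from", "valid_to", "deleted")

def payload(row):
    return {k: v for k, v in row.items() if k not in META}

def collapse_false_versions(history, key_field, quarantine):
    """Merge consecutive versions whose source fields never changed; the
    removed rows go to quarantine (a stand-in for the metadata repository)."""
    cleaned, latest = [], {}
    for row in sorted(history, key=lambda r: (r[key_field], r["valid_from"])):
        prev = latest.get(row[key_field])
        if (prev is not None and not prev["deleted"] and not row["deleted"]
                and payload(prev) == payload(row)):
            prev["valid_to"] = row["valid_to"]   # extend the real version
            quarantine.append(row)               # keep the bad row inspectable
        else:
            cleaned.append(row)
            latest[row[key_field]] = row
    return cleaned

Run after each load, something like this keeps spurious versions from piling up, and the quarantine list tells us which source tables and keys are misbehaving.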

Result

We can use that talent to solve problems that are more interesting to the customer, and be less proactive (more execution-focused and reactive) about this technical problem, because failures will no longer grow severe.

; )

I find data warehousing pretty nice, since there is untouched ground here and there with great promise. Of course, the technical side of these projects is not necessarily the hardest part to deliver on, but it would be sad if it were the showstopper.

I also like some of the metaphors used in this trade, especially the timeline (I'll use it next time).