Monday, October 17, 2011

Handling historical data with SQL - Common errors in datawarehousing


It has been a while since my last post. Here is an old post I put here because it comes up in mature enterprise data warehouses. Two easy points that are too hard to put in Power Point, but easy enough to take care of in the warehousing solution. Problems are intervened.

1) GETDATE() used directly
In MS SQL Server GETDATE() is nondeterministic which is suboptimal when it comes to performance and schema binding.

2) BETWEEN is used
BETWEEN is an inclusive operator, which means that one can't really step with it the way I see it. Say you step one day at the time, you would have to use for date d: "x BETWEEN d AND d+.9999999" or risk having

a) the same data at two consecutive dates
b) two historic versions for the same BK

to give an example, for
- 1000002; "1901-01-01 00:00"; "2009-10-24 00:00"; "1234"; "Helly hansen";...
- 1000003; "2009-10-24 00:00"; "2049-12-31 23:59"; "1234"; "Helly Hansen";...

the query "d BETWEEN Invoice.ValidFrom AND Invoice.ValidTo" would for d="2009-10-24 00:00" give both rows, which should crash the ETL job or pollute the result.

3) One usual practice
For this reason some use between statements like

"d BETWEEN '09:00:00.0000000' AND '16:59:59.9999999';"
"d BETWEEN '09:00:00.0000000' AND '16:59:59';"

which might be working as long as you are absolutely sure that this will not be cast, or it might be rounded up, which is even worse (looks like it works but don't).

SELECT CAST(N'2009-12-12 16:59:59.199' AS DATETIME) -> "2009-12-12 16:59:59.200"

4) The simple solution
"d >= '09:00' AND d < '17:00';"
will work no matter what data type is used and therefore is more portable, I think.

So, now that’s documented
Have a nice one,
Gorm Braarvig