Monday, October 17, 2011

Handling historical data with SQL - Common errors in datawarehousing


It has been a while since my last post. Here is an old post I put here because it comes up in mature enterprise data warehouses. Two easy points that are too hard to put in Power Point, but easy enough to take care of in the warehousing solution. Problems are intervened.

1) GETDATE() used directly
In MS SQL Server GETDATE() is nondeterministic which is suboptimal when it comes to performance and schema binding.

2) BETWEEN is used
BETWEEN is an inclusive operator, which means that one can't really step with it the way I see it. Say you step one day at the time, you would have to use for date d: "x BETWEEN d AND d+.9999999" or risk having

a) the same data at two consecutive dates
b) two historic versions for the same BK

to give an example, for
- 1000002; "1901-01-01 00:00"; "2009-10-24 00:00"; "1234"; "Helly hansen";...
- 1000003; "2009-10-24 00:00"; "2049-12-31 23:59"; "1234"; "Helly Hansen";...

the query "d BETWEEN Invoice.ValidFrom AND Invoice.ValidTo" would for d="2009-10-24 00:00" give both rows, which should crash the ETL job or pollute the result.

3) One usual practice
For this reason some use between statements like

"d BETWEEN '09:00:00.0000000' AND '16:59:59.9999999';"
"d BETWEEN '09:00:00.0000000' AND '16:59:59';"

which might be working as long as you are absolutely sure that this will not be cast, or it might be rounded up, which is even worse (looks like it works but don't).

SELECT CAST(N'2009-12-12 16:59:59.199' AS DATETIME) -> "2009-12-12 16:59:59.200"

4) The simple solution
"d >= '09:00' AND d < '17:00';"
will work no matter what data type is used and therefore is more portable, I think.

So, now that’s documented
Have a nice one,
Gorm Braarvig

Sunday, April 17, 2011

Alike twitter-app

This is the initial page for the "Alike" twitter-app. That's all.

Friday, March 18, 2011

"Domain Driven Design" and "Information Islands"

Domain driven design, as it has been explained to me, dictates that the data shall have the truth in the domain model layer at the same time as you must have a bounded context.

This way you create an information Island bound on the application layer!

The integration work between these systems, as they grow in fast iterations (say Scrum-sprints) will surely be huge, and the "single version of the truth" across the value chains will be a distant dream, evens when it comes to central dimensions like "Customer" and "Product"!

Will be exciting to see the solutions to this, so far I have heard
- Shared Kernel
Well... you need it in the database to solve all the integration points, you can't expect ETL and EAI to use an application layer that changes all the time, you want access to the databases in most cases. Many reasons for that across scalability, security, ease of implementation and testing etc

- Anticorruption Layer
Well, in a siloed system this would in my mind include context tables on every single shared entity, which I haven't really seen samples of yet, I expect this is not thought through. Might be promissing.

- Explicit model delimination
I didn't understand the consept, which tells me that it might be too complex for a simple common problem. The solution shouldn't be so much more complex than the problem

I believe the DDD is great for Cabin-rental systems and Access card systems, but the SOA-solution? Well prove that you can handle a strategy of the common version of your data. I personally like to be threated as the same identity across the functions of the companys I interact with...

...this is a reason for working on MDM (Master data management).

just my 2 cent

have a good one