2013-12-15

Some Data Warehouse

I am writing a serie of posts about parallel processing of computer background workflows  in general, and more specific how I  parallel process workflows with my Integration Tag Language ITL . ITL is not a real computer language, it doesn’t generate executing code, it only process  a parse tree, nonetheless it passes for a language in a duck test. I can use ITL  to describe and execute processes and I’m happy with that. Some computer systems I created have been labeled ‘not real’ by others, I don’t mind. Once I created the fastest search engine there was in the IBM Z-server environment, it was labeled (by a competitor) not a real search engine. It was a column based database with bitmap indexes and massively parallel search, it kicked ass with contenders, and I was happy with that. I have created a Data Warehouse, it has also been labeled not a real Data Warehouse, it beats the crap out of competitors though, and I’m happy with that. With my XML based Integration Tag Language I can describe and execute complex parallel workflows easier than anything else I have seen, and I’m happy with that too. Interestingly the host language I use PHP, is often labeled not a real language. PHP a perfect match for ITL, it’s so not real.

Not a real Data Warehouse

My not real Data warehouse has a rather long history, it started around 1995 when I was working as a Business Intelligence Analyst. I got hold of a 4Gb harddisk for my PC, I realised I could fit a BI storage on this huge  disk. At that time I had read a lot about the spintronic revolution that was about to come with gigantic disks, RAM  and super processors to ridiculously low prices. I started to play with the idea of building a future BI system, where you do not have squeeze data onto the disks, where large portions of data could reside in RAM and be parallel processed by many processors.

Simple tables, no stupefying multi dimensional extended bogus schema

The first thing I did was to get rid of the traditional data models , the normalized OLTP, and the denormalized OLAP models, I was thinking denormalize on a grand scale, each report should have it’s own table from which users could slice, dice and pivot as much they liked in their own spreadsheet (Excel) applications. I called these tables Business Query Sets, since in future we would have oceans of disk space , we could afford to build databases as the normal user perceived them, as tables not multi dimensional extended snowflakes or star or whatever the traditional BI storages are called. Have you ever heard a user ask for reports in extended cube format?

Simple extraction, table based full loads

In future data access will be so fast you can extract entire tables, no more delta loads, I thought. No special extractor code in the source system, just simple SQL on tables. Delta loads are hell, the only thing you can be sure of delta loads break, no matter what you been told delta loads fails. In rare situations you need delta loads, you should have good procedures in place to mend broken delta loads, otherwise you will have a corrupt system. As for having special extractor code in each source system, I would probably not be allowed to put in any extractor code in the source system and you lose control spreading out code all over. Special extractor code was never a clever idea anyway.

Tired of waiting for reports?

It’s fast as hell to select * from table . ‘In future I could trade disk space for speed’ I reasoned. And lots of indexes, it’s just disk space. All frequently used data will reside in huge RAM caches. I also envisioned a Data Warehouse in every LAN, since hardware will be cheap you can have BI servers in every LAN . A LAN database  server is faster than a WAN application  server.  

A user Business Intelligence App, not a Business Intelligence System

Not only did I skip the traditional database models, I also scrapped the System , I wanted to build my Data Warehouse around the users, not build a Data Warehouse System. I wanted to invite users to explore the data with tools of their own choice. I didn’t want to build a System BI the user had to log in to and have funny ‘data explorer’ tools straightjacketed onto them.

2001 a start

Year 2001 I could start build my futuristic Data Warehouse. It wasn’t a start I had wished for. Actually no one in the company believed it was possible to build a Data Warehouse my way. I could not get a sponsor, it was only the fact I was the CIO, with my own (small) budget I could start build my Data Warehouse together with one member in my group. I had to start on a shoestring budget. We used scrapped desktops for servers, with a  new large 100Gb disk, 1Gb RAM and an extra network card. I only used free software. Payback time for the first version was one month. I soon began to design and build my own hardware  and since I used low cost components I could afford large RAM pools and keep much of the frequent data in memory. From a humble start with one user the, Data Warehouse  now produces six to twelve million queries (including ETL) a day, it has about 500 users and feed other applications with data, this includes the BI tool Qlikview . From start 2001 until May 2013 the system only has been down at three power outages. When I moved to corporate IT, the managers of the the product company migrated the Data Warehouse from my two of everything hardware design to single hardware , so now we have to take down the Data Warehouse once a year for service. Twelve years of continuous operation, not many system started 2001 have a track record like that.
For not being a real Data Warehouse, it’s quite some Data Warehouse.
Today I’m very happy for the lack of funds at the start. It forced me to think in directions I probably would not have done otherwise. Another piece of happiness; My colleague I started with was completely ignorant (and unimpressed) of Business Intelligence database theories. The few times I complexed the database design he said ‘I will not do that, I create simple tables it is faster and the users want  tables’. Then I measured the different approaches and his simpler models were always better.

Been there, done that

These days Business Intelligence vendors talk a lot about Big Data , hardware accelerators, in-memory databases, nearly online reporting  etc, I can say with a real pride ‘been there, done that’. I do not say other BI apps are not real, they are real, some really good.
I actually have heard a BI sales representative refer to my Data Warehouse as not real. My Data Warehouse has been called a simple Excel app. I have put in a lot of hard work into my Data Warehouse. Countless nights and weekends of coding, testing and measuring. I’m not happy when the appreciation of all that hard work is ‘a simple Excel app’. On the other hand the feeling of knowing ‘not many people know what I know of Business Intelligence applications’ makes me happy. And now seeing the big guys catching up on ideas I conceived some twenty years ago, makes me real happy.

No comments:

Post a Comment