Saturday, February 25, 2012

Data, data, who's got the data? (basic theory question)

I'm trying to understand aggregations, storage modes (MOLAP, ROLAP, HOLAP), proactive caching, and loading data into my cube.

We are currently building both an OLTP db and OLAP db 100% in SQL Server.

The OLTP users will add several hundred thousand rows a day.

I'm going to create an SSIS package to initially take data from the OLTP db and load it into the OLAP db and then build my cube.

Now my understanding becomes fuzzy. I've built partitions, created aggregations, and enabled proactive caching, but I'm not sure why I am doing so. When talking about real-time MOLAP, proactive caching, and MOLAP-to-ROLAP-to-MOLAP transitions, the various literature refers to "changes in the relational data" triggering various actions based on my settings. Are the authors referring to changes in the original OLTP db or changes in the OLAP db? If the latter, then changes will only result when I run another SSIS package to update my data.

Or does this discussion assume I am using a UDM db which seems to say I can build my star schema on top of the OLTP db. But then who has that data?

Where the heck is my data?

You want to update your cubes after data has been added to your relational OLAP database. Your relational OLAP database is the source of the data for your cubes, so there really isn't any reason to process your cubes until that database has been updated.

Though data will be added to your OLTP system throughout the day, how frequently will you be adding data to your OLAP database? If it's on a fixed schedule, you may want to do without the proactive caching and just start the processing yourself as part of your SSIS package.
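For example, an Analysis Services Execute DDL Task at the end of the SSIS package could send an XMLA Process command like the following sketch (the "SalesDW" database ID and "Sales" cube ID are placeholders, not names from this thread):

```xml
<!-- Sketch of an XMLA Process command an SSIS Analysis Services
     Execute DDL Task might send at the end of the ETL run.
     "SalesDW" and "Sales" are hypothetical object IDs. -->
<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Process>
    <Object>
      <DatabaseID>SalesDW</DatabaseID>
      <CubeID>Sales</CubeID>
    </Object>
    <Type>ProcessFull</Type>
  </Process>
</Batch>
```

With this approach the cube is only ever processed when your own ETL says so, which sidesteps the change-detection question entirely.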

Regarding the UDM, I would not recommend building your DSV using references to tables in the OLTP system. Though this can be done, it just seems like this is bound to cause you problems.

Good luck,

Bryan

|||

My problem is my users want "near real time" data, so I'm looking for solutions. Proactive caching seems like a piece of the puzzle, but I wasn't sure what the data source was for the proactive caching.

If all the hubbub is about detecting data that changes in the OLAP db, I don't get it. Data will only change in the OLAP db when I tell it to change (run an SSIS program), so I would think I'd know about it and go ahead and process the cube.

|||

What's nice about proactive caching is that you can have SSAS handle the processing on your behalf. You can configure it to detect a change but then wait for some period of time for things to settle down before processing. During that time, you might have 2 or 3 ETL cycles, some with changes and some without. It really just depends on your needs. If you can process quickly enough that it doesn't interfere with your ETL cycles, then your suggested approach is probably the best for you.
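To illustrate the "settle down" behavior, a partition's proactive-caching settings in ASSL might look roughly like this (the interval values are made up for illustration): SilenceInterval is how long SSAS waits for the source to go quiet after a change before rebuilding, and SilenceOverrideInterval caps that wait if changes never stop arriving.

```xml
<!-- Sketch of ASSL proactive-caching settings on a partition.
     Interval values are illustrative only. -->
<ProactiveCaching>
  <!-- start rebuilding the MOLAP cache after 10 quiet seconds -->
  <SilenceInterval>PT10S</SilenceInterval>
  <!-- but rebuild anyway after 10 minutes if changes keep coming -->
  <SilenceOverrideInterval>PT10M</SilenceOverrideInterval>
</ProactiveCaching>
```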

Bryan

|||Does it "detect a change" in the OLAP db or the original OLTP db? I suspect OLAP, which is only going to change when I run an SSIS update, so I don't see much value in detecting a change since I am causing the change and am fully aware of when it happens. As part of my SSIS package I suppose I could trigger the reprocessing of my cube.|||

Just so anyone else reading this is clear on what we are talking about, let me define a few things. OLTP and OLAP refer to how data is used. OLTP systems typically support entity-relationship models while OLAP systems typically support dimensional models (though others are available).

Both entity-relationship and dimensional models can be supported through relational database technologies. The SQL Server Database Engine is an example of a relational database technology. Dimensional models are also supported by multidimensional database technologies, such as SQL Server Analysis Services, which store data in a different kind of structure and allow access through languages like MDX.

So, when you say "OLAP db", I interpret that to mean the relational database containing your dimensional model supporting OLAP activities, and when you say "OLTP db", I interpret that to mean the relational database containing your entity-relationship model supporting OLTP activities.

Whew!

With that out of the way, the answer to your question is: normally, you are watching your OLAP db for changes. "Watching" isn't the most accurate way to describe this, though. You can certainly configure SSAS to poll the underlying tables for changes, but you can also configure some relational database technologies to notify SSAS when a change occurs. And there are a ton of other options that make proactive caching really useful in some scenarios.
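As a sketch of the polling option: the partition's proactive-caching source can carry a polling query that SSAS runs on a schedule against the relational db, rebuilding the cache when the returned value changes. The element names below follow my reading of the ASSL polling binding, and the table and column names are hypothetical:

```xml
<!-- Sketch: poll the relational source every 30 seconds and rebuild
     when the result changes. dbo.FactOrders / LoadBatchID are
     hypothetical names, not from this thread. -->
<Source xsi:type="ProactiveCachingQueryBindingSource">
  <RefreshInterval>PT30S</RefreshInterval>
  <QueryNotifications>
    <QueryNotification>
      <Query>SELECT MAX(LoadBatchID) FROM dbo.FactOrders</Query>
    </QueryNotification>
  </QueryNotifications>
</Source>
```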

Still, my general philosophy is to keep your system as simple as possible while achieving your critical goals (one of which is supportability!). If you can avoid proactive caching by taking a more straightforward approach while still meeting your goals, I would recommend doing so.

Good luck,

Bryan

|||

I'm sorry, I did leave out quite a bit, but yes, that is what I am talking about. In my case the only reason to have a "middle" OLAP db is to create some calculated fields and pull together some columns from different tables. I could probably do most everything during cube processing and build it right on top of my OLTP, which is a single SQL Server db. But the main reason I'm using the OLAP "staging" db is so I am not placing an additional load on my OLTP db, which is expected to receive up to a million new rows of data a day. I foresee huge latency and functionality issues and am trying to get ahead of the curve.

But, from your answer, it looks like I can use either db to trigger a cache update (if that's the proper SSAS term).

Thanks.

|||

Just wanted to mention one thing.... You refer to your OLAP db as a "staging" database. We often implement a full-blown dimensional model in a relational database. Lots of reasons for this. (The Kimball books go into this in detail.)

This database + the OLAP cube serve as the data warehouse layer of our BI infrastructure. You could build a cube off a DSV that assembles data right out of your OLTP database, but we don't often recommend this.

Good Luck,
Bryan

|||

Yes. The main reason I'm building an OLAP db is to offload data retrieval requests from the main db. So I tend to think of it as a "staging" db, though that term more properly refers to an intermediate db one might need when gathering data from diverse sources.

We are currently building both the OLTP db and the BI solution. In theory, I could build my cube right on top of the live db (OLTP), or on a snapshot of it, or some other replicated image. My "truth" lies in my OLTP db, and I may be able to use changes there to update caching. I was reading the Sams book and it referred to changes in the relational database triggering updates to your cache, and I just wondered WHICH relational db.

Given the constant updating that is going to be done on the OLTP source, I am probably better off updating caching and/or processing the cube on a timed schedule. I realize there are a couple of configurable "windows," but my data updates will probably be constant throughout the day. Of course, clients want terabyte-size cubes that generate instantaneous reports on data with nanosecond latency.

I'm trying to get my arms around the basic theory of this new (to me) multidimensional data model. I can recite Codd's 12 rules in my sleep (or is it 13...I think I need a nap), which is perhaps part of my problem: I am trying to force everything through a relational prism.

But you've been quite helpful.

Thanks.

John
