Saturday, February 25, 2012
Data Warehousing: Best way to write SP?
I'm dealing with 8 fact tables, 8 dimension tables and 9 keys. I'm currently doing:
SELECT COUNT(DISTINCT key)
FROM t_fact
WHERE key NOT IN
(
SELECT DISTINCT key FROM t_dimension
)
If I do one of those for each key-fact table combo, there are about 50 queries in total. Not every key exists in every fact table.
I'm a Stored Procedure novice. What is the best way to check all of the fact tables, aside from running 50 counts with subqueries? If I run the queries one fact table at a time, it will take about 30 minutes. I've tried to run one query per fact table, by counting all keys, and doing a subselect to each dimension table, but got misleading results.
Any tips will be greatly appreciated. Abandoning data warehousing isn't a current option!
Mike
In Oracle I would find missing keys like this:
SELECT key FROM t_fact
MINUS
SELECT key FROM t_dimension
BTW, why not use a foreign key constraint to ensure all fact keys are based on dimension keys?
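For SQL Server (2005 and later), the equivalent of Oracle's MINUS is EXCEPT, and the roughly 50 separate counts can be collapsed into one pass per fact table with outer joins. A sketch, where all table and column names (t_fact, t_dim_customer, t_dim_product, customer_key, product_key) are illustrative placeholders:

```sql
-- EXCEPT is SQL Server's equivalent of Oracle's MINUS:
SELECT customer_key FROM t_fact
EXCEPT
SELECT customer_key FROM t_dim_customer;

-- Or count orphaned keys for every dimension of one fact table in a
-- single pass, one LEFT JOIN per dimension:
SELECT
    COUNT(DISTINCT CASE WHEN dc.customer_key IS NULL
                        THEN f.customer_key END) AS orphan_customer_keys,
    COUNT(DISTINCT CASE WHEN dp.product_key IS NULL
                        THEN f.product_key END)  AS orphan_product_keys
FROM t_fact AS f
LEFT JOIN t_dim_customer AS dc ON dc.customer_key = f.customer_key
LEFT JOIN t_dim_product  AS dp ON dp.product_key  = f.product_key;
```

The LEFT JOIN form scans each fact table once per query instead of once per key, which is usually where the 30 minutes goes.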
data warehouse truncate question
I have an SSIS project with one package that drops and creates the
database and creates the tables, and another package that loads the data into
the data warehouse. Now I've been asked to only load the data, and not drop
and create the database every time the package runs, because the package is
going to run every 5-10 minutes. I had never created an SSIS package until
now, so how can I make it only load the tables each time it runs,
instead of dropping and creating the database every time?
So, in a nutshell: I need a package that loads the data only if the data does
not already exist, without dropping and creating the database every time.
Any suggestions on how this can be done?
Two methods come to mind.
1) Simple INSERT ... WHERE NOT EXISTS statements for each table. Can be VERY
inefficient, but easy to construct.
2) Build a system that tracks the last-inserted key value for each table.
Reference the appropriate key in the WHERE clause of each insert statement.
If you also need to track modified data, then it gets MUCH more complicated,
because you need triggers on every base table to insert the modified rows'
PKs into a refresh holding table, which would also be referenced by the
warehouse update statements.
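Hedged sketches of both methods; every table and column name here (dw.FactSales, staging.Sales, sale_id, dw.LoadWatermark) is invented for illustration:

```sql
-- Method 1: insert only rows not already present. Simple but can be slow.
INSERT INTO dw.FactSales (sale_id, sale_date, amount)
SELECT s.sale_id, s.sale_date, s.amount
FROM staging.Sales AS s
WHERE NOT EXISTS (SELECT 1 FROM dw.FactSales AS f
                  WHERE f.sale_id = s.sale_id);

-- Method 2: track the last-loaded key per table and load past it.
DECLARE @last_key INT;
SELECT @last_key = LastKey
FROM dw.LoadWatermark
WHERE TableName = 'FactSales';

INSERT INTO dw.FactSales (sale_id, sale_date, amount)
SELECT sale_id, sale_date, amount
FROM staging.Sales
WHERE sale_id > @last_key;

-- Advance the watermark once the load commits.
UPDATE dw.LoadWatermark
SET LastKey = (SELECT MAX(sale_id) FROM dw.FactSales)
WHERE TableName = 'FactSales';
```

In SSIS both would typically run as Execute SQL tasks, with the watermark read into a package variable.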
TheSQLGuru
President
Indicium Resources, Inc.
Data Warehouse Structure and Indexing when using MOLAP
Hi there,
I am building a cube in AS2005 which is made up of one fact table (20 million rows) and 7 dimension tables. I am planning to use MOLAP storage, and my question is this: is it important to add indexes, primary keys and foreign keys to my data source to improve performance? I can appreciate that it might speed up processing time, but since all the data and aggregations are stored in the cube, I can't see how it would have any impact on performance when querying the cube.
If there is a benefit to adding indexes, keys etc. then could somebody please tell the best strategy, i.e where to put them, clustered or nonclustered etc.
Many thanks,
David
I have the same opinion: with MOLAP, indexes will only help when you process a cube. MOLAP is separated from the data source once the cube is finished/processed.
If you use the same datamart for Reporting Services you will benefit when running queries.
The strategy for indexes is clustered indexes on dimension primary keys and non-clustered indexes on their foreign keys in the fact table. A non-clustered composite index on all the foreign keys in the fact table is also recommended. A clustered index on the fact table is not recommended.
You pay for indexes on inserts, deletes and updates, so it is often recommended to drop indexes when you update your data mart or data warehouse, and recreate them when the update/load is finished.
Primary and foreign keys are helpful if you would like to receive errors while you are loading the data mart/warehouse. With them you know that all fact records have related dimension members; otherwise these errors will appear when you process the cube.
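A T-SQL sketch of that strategy; the table and column names (DimCustomer, FactSales, CustomerKey, etc.) are assumptions:

```sql
-- Clustered primary key on each dimension:
ALTER TABLE dbo.DimCustomer
    ADD CONSTRAINT PK_DimCustomer PRIMARY KEY CLUSTERED (CustomerKey);

-- Non-clustered index per foreign key in the fact table:
CREATE NONCLUSTERED INDEX IX_FactSales_CustomerKey
    ON dbo.FactSales (CustomerKey);

-- Composite non-clustered index over all fact foreign keys:
CREATE NONCLUSTERED INDEX IX_FactSales_AllKeys
    ON dbo.FactSales (CustomerKey, ProductKey, DateKey);

-- Around a large load: drop, load, recreate.
DROP INDEX IX_FactSales_AllKeys ON dbo.FactSales;
-- ... bulk load here ...
CREATE NONCLUSTERED INDEX IX_FactSales_AllKeys
    ON dbo.FactSales (CustomerKey, ProductKey, DateKey);
```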
Here is a web site I can recommend for performance issues:
http://www.sql-server-performance.com/default.asp
Regards
Thomas Ivarsson
Data warehouse question
I am building a data warehouse in SQL 2005 and have the following
questions I want to clarify.
1. Any method or suggestion for incremental loading of a data warehouse? If
I add a field to each record in my source table and select the ones that
haven't been loaded yet in the ETL, is that a workable method, or is there a
better suggestion?
2. If there are schema changes in the source table and the data warehouse,
is it necessary (or best practice) to reload all data?
Thanks a lot.
Ryan
Ryan:
I am stumbling through my ETL systems initial design and have had to work
this one out as well. Take this following bit of free advice with caution:
I created a tracking table in a separate db with TableName and LastLoadDate
columns. After running a package that is my "Initial Load" package (i.e.
truncating the whole Fact table), it runs a SQL task that populates the
LastLoadDate in the corresponding record.
For Incremental Loads, I run two data flows, one for 'changed' rows and one
for 'new' rows:
"Changed": SELECT * FROM x WHERE [Created] < LastLoadDate AND [Modified] >
LastLoadDate. That goes into a staging table. At the end of the package,
another SQL task runs a simple UPDATE query that updates the FACT table from
the staging table.
"New rows": SELECT * FROM x WHERE [Created] > LastLoadDate. That data flow
gets appended directly to the FACT table.
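A rough T-SQL outline of the approach just described; the names (etl.LoadTracking, src.Orders, dw.FactOrders) are placeholders, and in practice each statement would live in its own SSIS task:

```sql
-- Tracking table, one row per loaded table:
CREATE TABLE etl.LoadTracking (
    TableName    sysname  PRIMARY KEY,
    LastLoadDate datetime NOT NULL
);

DECLARE @last datetime;
SELECT @last = LastLoadDate
FROM etl.LoadTracking
WHERE TableName = 'FactOrders';

-- Changed rows: created before, but modified after, the last load.
SELECT * INTO #staging
FROM src.Orders
WHERE [Created] < @last AND [Modified] > @last;

-- New rows go straight to the fact table.
INSERT INTO dw.FactOrders
SELECT * FROM src.Orders
WHERE [Created] > @last;

-- Apply staged changes, then advance the watermark.
UPDATE f
SET f.Amount = s.Amount
FROM dw.FactOrders AS f
JOIN #staging AS s ON s.OrderID = f.OrderID;

UPDATE etl.LoadTracking
SET LastLoadDate = GETDATE()
WHERE TableName = 'FactOrders';
```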
Check out books by Ralph Kimball on Microsoft Data Warehouse Design.
HTH
Todd Chittenden
Data Warehouse Presentation
Hello
I have been tasked to give a presentation (introductory) on Data warehouse to Health professionals and Health information analysts. I would be grateful if you could share with me any slides and or presentation on the topic that you might have. Anything that graphically explains the concepts of cubes and its applications would be much appreciated.
Help. Thanks.
If you would like to give a general presentation about data warehousing in the health industry, you could run a search for generic information about data warehousing and see how some of it applies to the health industry.
This forum focuses mostly on questions about the Microsoft SQL Server Analysis Services product.
You might find interesting a case study of a data warehouse implementation using Analysis Services by one of the most prominent Israeli health providers: http://members.microsoft.com/CustomerEvidence/Search/EvidenceDetails.aspx?EvidenceID=13653&LanguageID=1
Hope that helps.
Edward.
--
This posting is provided "AS IS" with no warranties, and confers no rights.
Data warehouse or data mart
I'm working on a project at the moment where we have an existing warehouse, and
need to bring in data from new source systems, which are from a similar (but
not quite the same) part of the business. These two parts of the business
merged recently, and they want to have a consolidated view of the data, as
well as replace some existing (Excel and Access based) reporting.
We are coming to the end of the analysis and design phase, and I want to get
some opinions on whether the approach I'm taking is valid.
I have tried to bring everything into the one database. I've added some new
fact tables for new source systems, as the original and new source systems
don't quite work the same. I've also tried to conform the dimensions as far
as possible, so that the data will aggregate consistently for the management
reporting.
This seems to fit pretty well, and I haven't had to alter the current data
model very much.
My question is basically: Is this the right way to go, or am I better off
breaking the warehouse into separate data marts?
A lot of the reporting will only apply to each part of the business, but
there will be some management reporting that wants a consolidated view of the
new merged business.
Any comments will be appreciated.
Thanks.
Hi,
In my view, there is no need to break them into separate data
marts in this situation, because they have a similar model and need to be
merged in the same analysis report. You may consider creating different
cubes, and you could at least share some common dimensions such as
time/location, etc.
Regards,
Peter Yang
MCSE2000/2003, MCSA, MCDBA
Microsoft Online Partner Support
When responding to posts, please "Reply to Group" via your newsreader so
that others may learn and benefit from your issue.
=====================================================
This posting is provided "AS IS" with no warranties, and confers no rights.
Hi Peter,
That's pretty much the approach I'm taking. The new source systems will have
their own fact tables, with conformed measures, joining to conformed
dimensions.
I'll build separate cubes for the fact tables, and put the old and new cubes
in a single virtual cube, joined on the common dimensions.
Thanks,
Wreck.
Data Warehouse Nulls
I was wondering if anyone out there could tell me how they deal with
NULL values in a data warehouse? I am looking to implement a warehouse
in SQL 2005 and have some fields which will have NULL values and I
would like some further ideas on how to deal with them. At my last job
in dealing with Oracle we were just going to leave the fields NULL, but
in SQL how would you best recommend cleaning the data? I greatly
appreciate your help and look forward to your responses. Thank you.
Very interesting question. The answer is "it depends."
As a general rule, I'll leave money as null usually. My logic is that
if a dollar amount is unknown, that is different than the dollar amount
zero, and I probably need to deal with the unknowns wherever and
whenever the amounts are shown to the end user. This leads to
interesting discussions with the users, as you get to explain to them
the issues, and ask them what they want the defaults to be, or whether
they want to skip the data, and how they want the reports to be
documented.
For text fields, I will usually convert NULL to ' '. OLAP likes that
better. Sometimes I will convert NULLs to 'BLANK'. It just kind of
depends.
Dates need to stay null. A null date is the easiest thing to deal
with.
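Put together, those rules might look like this in the ETL select; all table and column names here are illustrative:

```sql
-- Cleansing NULLs on the way into the warehouse:
SELECT
    OrderID,
    Amount,                               -- money: left NULL (unknown <> zero)
    ISNULL(Region,   '')      AS Region,  -- text: blank works better for OLAP
    ISNULL(Category, 'BLANK') AS Category,-- or a visible placeholder
    ShipDate                              -- dates: stay NULL
FROM src.Orders;
```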
Does this answer your questions? Are you doing OLAP? You might try
creating a cube with your denormalized data. OLAP is pretty neat for
data warehouses where users want to extract data.
Regards,
Doug
Thanks so much, Doug! Your answer is of great help!
Data Warehouse Load Hints?
(SQL 2000) So if we are loading large tables while no one else is logged in
and no other processes will occur, should we provide hints for the table
inserts and updates for a performance boost? If so, which hints? Can any
help avoid deadlocks? And will this be different for 2005?
Does setting the db to singleuser help speed-up updates?
Thanks
Greg C
A deadlock is when two processes have locked resources (typically in an
uncommitted transaction) that the other needs to continue. How would this
happen if the ETL is the only process running?
http://support.microsoft.com/defaul...kb;en-us;169960
http://msdn.microsoft.com/library/d...a_3hdf.asp
As for maximizing the performance of your ETL, I would recommend the "bulk
load" method, because it supports features such as minimal transaction
logging.
http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/incbulkload.mspx
Setting the database to DBO Use Only would also be a good idea; it may
improve performance, but mostly it ensures that no one and nothing else logs
in.
>|||Just a couple of tips that I've learned:
1. If you have a clustered index, make sure that it's located on a
monotonically increasing value. This will minimize page splitting.
2. The clustered index does not have to be your primary key; in fact,
if you use a natural key, then it should not be. Our data is very date
and time sensitive, so I index the date and time of load; others use a
sequential numbering system (like an identity column), but I needed the
date of load for other reasons, and I prefer not to add columns that
have no meaning.
3. Depending on the size of your data, it may be more appropriate to
drop all of your non-clustered indexes, load the data, and rebuild the
indexes.
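Tips 1 and 2 above might look like this in T-SQL; the table and column names are assumptions:

```sql
-- Clustered index on a monotonically increasing load timestamp
-- (minimizes page splits on insert):
CREATE CLUSTERED INDEX IX_FactSales_LoadDate
    ON dbo.FactSales (LoadDateTime);

-- The natural/primary key then stays non-clustered:
ALTER TABLE dbo.FactSales
    ADD CONSTRAINT PK_FactSales PRIMARY KEY NONCLUSTERED (SaleID);
```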
HTH,
Stu
"JT" <someone@.microsoft.com> wrote in
news:uTboj28PGHA.720@.TK2MSFTNGP14.phx.gbl:
Well...you are asking someone who really needs to ask the question. Did
that make sense? :-) I am not a dba, and have learned through trial and
error what seems to work, at least most of the time.
I thought 'bulk load' recovery was just beneficial when importing text
files or using 'select into', no?
In our ETL, we have a number of procedures (which can call other
procedures) to test and process the data after the raw data is imported
w/ DTS. The DTS part is always fast, some of the subsequent processing
will lock-up periodically...I assume this would be a deadlock or lock
escalation? (When I look at the locks I will see many pages on one
object). It just happened to occur on one machine...I stopped and
restarted w/o mods to the code, and the next time it processed w/o
halting.
I drop indexes for the DTS, for the data transformation I optimize sql
for performance, individual statements are broken into transactions,
most of the largest updates/inserts are broken into smaller transactions,
temp tables are used when needed, I aggressively manage the multiple data
and log files on multiple raids and hard drives. But I have not been
able to understand the hints but more importantly why a process will run
fine most times then decide to freeze once (btw, I also check for open
transactions to make certain that there was no minor error leaving an
open tran).
thanks for the references
"Stu" <stuart.ainsworth@.gmail.com> wrote in
news:1141535576.352832.290970@.e56g2000cwe.googlegroups.com:
> 1. If you have a clustered index, make sure that it's located on a
> monotonically increasing value. This will minimize page splitting.
Not sure that I understand this one.
If I load some data into the field of the clustered key, are you saying
that it makes a difference what order that I load the records?
If so, is that different whether the clustered index is the primary key or
not?
> 2. The clustered index does not have to be your primary key; in fact,
> if you use a natural key, then it should not be. Our data is very date
> and time sensitive, so I index the date and time of load; others use a
> sequential numbering system (like an identity column), but I needed the
> date of load for other reasons, and I prefer not to add columns that
> have no meaning.
I use natural keys for something like a client id, but convert almost all
of the dimensions to surrogate (identity) keys.
> 3 Depending on the size of your data, it may be more appropriate to
> drop all of your non-clustered indexes , load the data, and rebuild the
> indexes.
>
I do this every time. 1.5 out of 3... :-)
You can tell if a process is being blocked by another process by executing
sp_who2 and seeing if the [blkby] column for a spid (process id) contains
the spid of another blocking process.
When importing large amounts of data, use the bulk copy program (BCP.EXE) or
BULK INSERT command.
http://msdn.microsoft.com/library/d...bcp_9esz.asp
Try reducing the batch size of your bulk copy process:
http://support.microsoft.com/defaul...=kb;en-us;81339
Consider what impact transaction logging is having on your process. There
are a few conditions (such as dropping indexes and specifying the TABLOCK
hint) that are required for minimal logging to take effect.
http://support.microsoft.com/defaul...=kb;en-us;59462
http://support.microsoft.com/defaul...kb;en-us;110139
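A minimal sketch of a batched bulk load along those lines; the staging table and file path are my own invention, and minimal logging additionally requires the bulk-logged or simple recovery model and no indexes on the target:

```sql
-- Load a flat file in batches, taking a table lock for minimally logged inserts
BULK INSERT dbo.t_fact_staging
FROM 'C:\loads\fact_extract.txt'
WITH (
    FIELDTERMINATOR = '\t',
    ROWTERMINATOR   = '\n',
    BATCHSIZE       = 50000,  -- commit every 50k rows to keep the log small
    TABLOCK                   -- required for minimal logging
);
```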
"gc" <nospam@.hotmail.com> wrote in message
news:Xns977CEEB88B493nospam@.24.93.43.121...
> "JT" <someone@.microsoft.com> wrote in
> news:uTboj28PGHA.720@.TK2MSFTNGP14.phx.gbl:
> Well...you are asking someone who really needs to ask the question. Did
> that make sense? :-) I am not a dba, and have learned through trial and
> error what seems to work, at least most of the time.
> I thought 'bulk load' recovery was just beneficial when importing text
> files or using 'select into', no?
> In our ETL, we have a number of procedures (which can call other
> procedures) to test and process the data after the raw data is imported
> w/ DTS. The DTS part is always fast, some of the subsequent processing
> will lock up periodically...I assume this would be a deadlock or lock
> escalation? (When I look at the locks I will see many pages on one
> object). It just happened to occur on one machine...I stopped and
> restarted w/o mods to the code and the next time it processed w/o
> halting.
> I drop indexes for the DTS, for the data transformation I optimize sql
> for performance, individual statements are broken into transactions,
> most of the largest updates/inserts are broken into smaller transactions,
> temp tables are used when needed, I aggressively manage the multiple data
> and log files on multiple raids and hard drives. But I have not been
> able to understand the hints, or more importantly why a process will run
> fine most times and then decide to freeze once (btw, I also check for open
> transactions to make certain that there was no minor error leaving an
> open tran).
> thanks for the references
>
> ?url=/library/en-us/acdata/ac_8_con_7a_3hdf.asp As for maximizing the
> performance of your ETL, I
> http://www.microsoft.com/technet/pr...tain/incbulkload.mspx
> Setting the database to DBO Use Only would also be a good
>|||Ugh. Page splitting is difficult to explain; think of your table as a
blank notebook. The clustered index is kind of like a page number for
your other indexes to go and retrieve information; it's a way for the
optimizer to go figure out where row X is located in your table.
As you write data to your notebook, it begins to fill up each page. As
you continue to write, it skips to the next page; if your clustered
index does not increase in a sequential fashion, when the optimizer
encounters a value that is out of order, it has to split the page at
the insertion point, and move records below that point on a page to a
new page; that's why you should cluster on a sequential value that is
independent of the order of the data outside of the warehouse. That's
why I recommend a datetime representation so it increases without
splitting the pages.
If your primary key is a natural key (like a combination of clientID
and rowID), then the likelihood is that your rows will be out of order
as they load into the target database, and hence some reordering will
be necessary if you cluster on that.
Clear as mud?
Stu|||Also consider that if a good candidate for a clustered index cannot be
found, then it is best not to implement one.
"Stu" <stuart.ainsworth@.gmail.com> wrote in message
news:1141539867.013564.240070@.e56g2000cwe.googlegroups.com...
> Ugh. Page splitting is difficult to explain; think of your table as a
> blank notebook. The clustered index is kind of like a page number for
> your other indexes to go and retrieve information; it's a way for the
> optimizer to go figure out where row X is located in your table.
> As you write data to your notebook, it begins to fill up each page. As
> you continue to write, it skips to the next page; if your clustered
> index does not increase in a sequential fashion, when the optimizer
> encounters a value that is out of order, it has to split the page at
> the insertion point, and move records below that point on a page to a
> new page; that's why you should cluster on a sequential value that is
> independent of the order of the data outside of the warehouse. That's
> why I recommend a datetime representation so it increases without
> splitting the pages.
> If your primary key is a natural key (like a combination of clientID
> and rowID), then the likelihood is that your rows will be out of order
> as they load into the target database, and hence some reordering will
> be necessary if you cluster on that.
> Clear as mud?
> Stu
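As a concrete sketch of Stu's advice (table, column, and index names are hypothetical; this assumes SQL Server, where the clustered index determines the physical order of rows):

```sql
CREATE TABLE dbo.t_fact (
    load_dt   datetime NOT NULL DEFAULT (GETDATE()),  -- date/time of load
    client_id int      NOT NULL,  -- natural key, NOT the cluster key
    row_id    int      NOT NULL,
    amount    money    NULL
);

-- Cluster on the load datetime: new rows always land at the end of the
-- table, so inserts append new pages instead of splitting existing ones
CREATE CLUSTERED INDEX ix_t_fact_load_dt ON dbo.t_fact (load_dt);

-- The primary key can stay on the natural key as a nonclustered constraint
ALTER TABLE dbo.t_fact
    ADD CONSTRAINT pk_t_fact PRIMARY KEY NONCLUSTERED (client_id, row_id);
```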
>|||gc (nospam@.hotmail.com) writes:
> In our ETL, we have a number of procedures (which can call other
> procedures) to test and process the data after the raw data is imported
> w/ DTS. The DTS part is always fast, some of the subsequent processing
> will lock-up periodically...I assume this would be a deadlock or lock
> escalation? (When I look at the locks I will see many pages on one
> object). It just happened to occurr on one machine...I stopped and
> restarted w/o mods to the code and the next time it processed w/o
> halting.
A deadlock is when two (or more) processes are waiting for each other
to release resources. SQL Server detects a deadlock, and will select one
of the processes as a deadlock victim and cancel execution for that
process, so that at least one of the processes can continue working.
If your processing "locks up", this can be due to blocking, but also due
to long-running queries. With the information you have provided, it is
difficult to tell. As JT said, use sp_who2 to check for blocking.
Another issue to watch out for is auto-grow of the database. By default,
SQL Server auto-grows a database by 10% when you run out of space. If
your database is huge - and data warehouses often are - then 10% can
take quite some time to grow. This can be dealt with by expanding the
database to a reasonable size in advance.
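Erland's point about pre-sizing can be done with ALTER DATABASE; a sketch, assuming a database and logical file name of my own invention:

```sql
-- Grow the data file to its expected size up front, and switch growth
-- from a percentage to a fixed increment to avoid 10% auto-grow pauses
ALTER DATABASE MyWarehouse
MODIFY FILE (NAME = MyWarehouse_Data, SIZE = 50GB, FILEGROWTH = 1GB);
```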
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se
Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pr...oads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodin...ions/books.mspx
Data warehouse forum
Hi, all,
Thanks for your kind attention.
Just have some enquiries about some issues of data warehouse design and would like to hear from any of you about any prestigious forum on data warehouses?
Thanks a lot in advance and I am looking forward to hearing from you shortly.
With best regards,
Yours sincerely,
Check out http://www.kimballgroup.com/. They have a nice email newsletter with design tips.
I like the books by Ralph Kimball, particularly "The data warehouse toolkit."
Eric
|||Hi, Eric,
Thanks a lot for your help.
With kindest regards,
Yours sincerely,
Data Warehouse Estimation
Looking out for data size estimation, and server size estimation templates.
To start with data size estimation templates will also do.
Any help will be appreciated. The requirement is kind of urgent.
Regards,
Bharat
Hello Bharat,
I wouldn't have expected a warehouse estimation process to be any different
to a normal database.
I always find that each project requires its own version, a generic one doesn't
cut it.
Simon Sabin
SQL Server MVP
http://sqlblogcasts.com/blogs/simons
> Hi
> Looking out for data size estimation, and server size estimation
> templates. To start with data size estimation templates will also do.
> Any help will be appreciated. The requirement is kind of urgent.
> Regards,
> Bharat
Data warehouse data refresh/update
I'm a member of a team planning a data warehousing project. We have multiple data sources which are aggregated in a staging area. This is then denormalised and imported into the data warehouse database.
I am looking at ideas for incremental data refresh, rather than a drop and re-import of all data. This would allow us to have historic data.
Does anyone have any tips that might be helpful for detecting changes in the source data for import? We have had some bad experiences with triggers on our source database in the past, so would rather not use these. I have considered replication and log shipping, but these just give a replica of the source data and do not flag the updated/new data.
Any help would be greatly appreciated.
Thanks.
Ben.
Look in SQL Books Online for CHECKSUM_AGG to identify changes in a table. The difficulty then is in trying to identify what has changed. Another route is to try and identify fields within the production data that will identify when it was last changed. Typically these tend to be datetime, timestamp or rowversion data types.
If you keep production keys in your fact table as additional attributes then this will give you another option for identifying new data.
"Ben" wrote:
> Hello,
> Im a member of a team planning a data warehousing project. We have multiple data sources which are aggregated in a staging area. This is then denormalised and imported into the datawarehouse database.
> I am looking at ideas for incremental data refresh, rather than a drop and re-import of all data. This would allow us to have historic data.
> Does anyone have any tips that might be helpful for detecting changes in the source data for import? We have had some bad experiences with triggers on our source database in the past, so would rather not use these. I have considered replication and log shipping, but these just give a replica of the source data and do not flag the updated/new data.
> Any help would be greatly appreciated.
> Thanks.
> Ben.
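The CHECKSUM_AGG idea above can be sketched like this; the source table src.Customer and tracking table etl.LastChecksum are hypothetical names, not from the thread:

```sql
-- Compute one aggregate checksum over the whole source table
SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) AS table_checksum
FROM src.Customer;

-- Compare it with the value captured at the previous load; if it differs,
-- something in the table changed and it needs to be re-examined/reloaded
SELECT c.table_name
FROM etl.LastChecksum c
WHERE c.table_name = 'Customer'
  AND c.checksum_value <>
      (SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM src.Customer);
```

Note that a matching checksum is not an absolute guarantee of no change (checksums can collide), so this is a cheap first-pass filter rather than proof.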
|||Timestamp your source data when it gets changed. That is definitely the best way.
Regards
Jamie
"Ben" wrote:
> Hello,
> Im a member of a team planning a data warehousing project. We have multiple data sources which are aggregated in a staging area. This is then denormalised and imported into the datawarehouse database.
> I am looking at ideas for incremental data refresh, rather than a drop and re-import of all data. This would allow us to have historic data.
> Does anyone have any tips that might be helpful for detecting changes in the source data for import? We have had some bad experiences with triggers on our source database in the past, so would rather not use these. I have considered replication and log
shipping, but these just give a replica of the source data and does not flag the updated/new data.
> Any help would be greatly appreciated.
> Thanks.
> Ben.
Data Warehouse backup/recover
Does anybody have experience with data warehouse backup/recovery? What I want to know is how to back up/recover the database in the data warehouse, and the cubes.
Thanks.
ZYT|||Do you refer to facts/cubes that may get out of sync?
You might want to do the backup when there's no loading of facts/dimensions, or have the loading processes do the cubes prior to the loading of facts. Also, in case some batches are not processed 100%, have some automated processes that check this and are able to repair (delete/reload batches).|||Thanks for the reply. The difficulty I have to face is I cannot connect to the OLAP database, such as FoodMark2000, and cubes from the backup box in EM. Any suggestions?
ZYT|||Do you get an error message?|||No. I mean I cannot find the cube and database of the DW in the SQL Server Backup box.|||OK, so either no backup is being done at all, or a third-party tool is installed that does the backup. I hope this tool is able to connect to the SQL Server though, otherwise the backups may not be so useful after all.
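For what it's worth, the relational half of the warehouse is backed up with a plain BACKUP DATABASE statement (the database name and path below are placeholders); the cubes live in Analysis Services and are archived separately through Analysis Manager, not through this command:

```sql
-- Full backup of the warehouse relational database to a disk file
BACKUP DATABASE MyWarehouse
TO DISK = 'D:\backups\MyWarehouse_full.bak'
WITH INIT,  -- overwrite any previous backup set in the file
     NAME = 'MyWarehouse full backup';
```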
Data warehouse and a data base? difference?
I need someone to give me a small briefing on how an enterprise data warehouse (EDW)
differs from a regular database.
Currently we have an application that accesses a database with about 18 tables. We also have
a Data Warehouse. For some reason I was thinking that it would be possible to migrate the
database into the data warehouse. The reason is that in looking at the schematic design for
the data warehouse there are some data tables that could also be used by our application
that uses the DB.
I guess I am confused because I am not sure if a data warehouse is used in the same way
as a database?
Data warehouse and database setup and uses differ. A database is traditionally used to store normalized data for a transactional system. A data warehouse is a way to abstract the database tables in such a way as to allow denormalization. Think of it this way: I would use a warehouse to store data in a shape suited to querying for reporting purposes, because I would not want some user or users running a 20-minute report on my core transactional database. Hope this helps.
|||Ok, thank you. I also found this info which is pretty good.
http://www.lc.leidenuniv.nl/awcourse/oracle/server.920/a96520/concept.htm