Saturday, February 25, 2012

Data Warehouse Load Hints?

(SQL 2000) So if we are loading large tables while no one else is logged in
and no other processes will occur, should we provide hints for the table
inserts and updates for a performance boost? If so, which hints? Can any
help avoid deadlocks? And will this be different for 2005?
Does setting the db to singleuser help speed-up updates?
Thanks
Greg CA deadlock is when two processes have locked resources (typically in an
uncommitted transaction) that the other needs to continue. How would this
happen if the ETL is the only process running?
http://support.microsoft.com/defaul...kb;en-us;169960
http://msdn.microsoft.com/library/d... />
a_3hdf.asp
As for maximizing the performance of your ETL, I would reccomend the "bulk
load" method, becuase it supports features such as minimal transaction
logging.
[url]http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/incbulkload.mspx[/u
rl]
Setting the database to DBO Use Only would also be a good idea; it may
improve performance, but mostly to insure that noone or nothing else logs
in.
"gc" <nospam@.hotmail.com> wrote in message
news:Xns977C861DB2E90nospam@.24.93.43.121...
> (SQL 2000) So if we are loading large tables while no one else is logged
> in
> and no other processes will occur, should we provide hints for the table
> inserts and updates for a performance boost? If so, which hints? Can any
> help avoid deadlocks? And will this be different for 2005?
> Does setting the db to singleuser help speed-up updates?
> Thanks
> Greg C
>
>|||Just a couple of tips that I've learned:
1. If you have a clustered index, make sure that it's located on a
monotonically increasing value. This will minimize page splitting.
2. The clustered index does not have to be your primary key; in fact,
if you use a natural key, then it should not be. Our data is very date
and time sensitive, so I index the date and time of load; others use a
sequential numbering system (like an identity column), but I needed the
date of load for other reasons, and I prefer not to add columns that
have no meaning.
3 Depending on the size of your data, it may be more appropriate to
drop all of your non-clustered indexes , load the data, and rebuild the
indexes.
HTH,
Stu|||"JT" <someone@.microsoft.com> wrote in
news:uTboj28PGHA.720@.TK2MSFTNGP14.phx.gbl:
Well...you are asking someone who really needs to ask the question. Did
that make sense? :-) I am not a dba, and have learned through trial and
error what seems to work, at least most of the time.
I thought 'bulk load' recovery was just beneficial when importing text
files or using 'select into', no?
In our ETL, we have a number of procedures (which can call other
procedures) to test and process the data after the raw data is imported
w/ DTS. The DTS part is always fast, some of the subsequent processing
will lock-up periodically...I assume this would be a deadlock or lock
escalation? (When I look at the locks I will see many pages on one
object). It just happened to occurr on one machine...I stopped and
restarted w/o mods to the code and the next time it processed w/o
halting.
I drop indexes for the DTS, for the data transformation I optimize sql
for performance, individual statements are broken into transactions,
most of the largest updates/inserts are broken into smaller transactions,
temp tables are used when needed, I aggressively manage the multiple data
and log files on multiple raids and hard drives. But I have not been
able to understand the hints but more importantly why a process will run
fine most times then decide to freeze once (btw, I also check for open
transactions to make certain that there was no minor error leaving an
open tran).
thanks for the references

> A deadlock is when two processes have locked resources (typically in
> an uncommitted transaction) that the other needs to continue. How
> would this happen if the ETL is the only process running?
> http://support.microsoft.com/defaul...kb;en-us;169960
> http://msdn.microsoft.com/library/default.asp
?url=/library/en-us/acdata/ac_8_con_7a_3hdf.asp As for maximizing the
performance of your ETL, I
> would reccomend the "bulk load" method, becuase it supports features
> such as minimal transaction logging.
http://www.microsoft.com/technet/pr...tain/incbulkloa
d.mspx Setting the database to DBO Use Only would also be a good
> idea; it may improve performance, but mostly to insure that noone or
> nothing else logs in.
> "gc" <nospam@.hotmail.com> wrote in message
> news:Xns977C861DB2E90nospam@.24.93.43.121...
>
>|||"Stu" <stuart.ainsworth@.gmail.com> wrote in news:1141535576.352832.290970
@.e56g2000cwe.googlegroups.com:

> 1. If you have a clustered index, make sure that it's located on a
> monotonically increasing value. This will minimize page splitting.
Not sure that I understand this one.
If I load some data into the field of the clustered key, are you saying
that it makes a difference what order that I load the records?
If so, is that different whether the clustered index is the primary key or
not?

> 2. The clustered index does not have to be your primary key; in fact,
> if you use a natural key, then it should not be. Our data is very date
> and time sensitive, so I index the date and time of load; others use a
> sequential numbering system (like an identity column), but I needed the
> date of load for other reasons, and I prefer not to add columns that
> have no meaning.
I use natural keys for something like a client id, but convert almost all
of the dimensions to surrogate (identity) keys.

> 3 Depending on the size of your data, it may be more appropriate to
> drop all of your non-clustered indexes , load the data, and rebuild the
> indexes.
>
I do this everytime. 1.5 out of 3... :-)|||You can know if a process if being blocked by another process by executing
sp_who2 and seeing if the [blkby] column for a spid (process id) contains
the spid of another blocking process.
When importing large amounts of data, use the bulk copy program (BCP.EXE) or
BULK INSERT command.
http://msdn.microsoft.com/library/d...>
bcp_9esz.asp
Try reducing the batch size of your bulk copy process:
http://support.microsoft.com/defaul...=kb;en-us;81339
Consider what impact transaction logging is having on your process. There
are a few conditions (such as dropping indexes and specifying the TABLOCK
hint) that are required for minimal logging to take effect.
http://support.microsoft.com/defaul...=kb;en-us;59462
http://support.microsoft.com/defaul...kb;en-us;110139
"gc" <nospam@.hotmail.com> wrote in message
news:Xns977CEEB88B493nospam@.24.93.43.121...
> "JT" <someone@.microsoft.com> wrote in
> news:uTboj28PGHA.720@.TK2MSFTNGP14.phx.gbl:
> Well...you are asking someone who really needs to ask the question. Did
> that make sense? :-) I am not a dba, and have learned through trial and
> error what seems to work, at least most of the time.
> I thought 'bulk load' recovery was just beneficial when importing text
> files or using 'select into', no?
> In our ETL, we have a number of procedures (which can call other
> procedures) to test and process the data after the raw data is imported
> w/ DTS. The DTS part is always fast, some of the subsequent processing
> will lock-up periodically...I assume this would be a deadlock or lock
> escalation? (When I look at the locks I will see many pages on one
> object). It just happened to occurr on one machine...I stopped and
> restarted w/o mods to the code and the next time it processed w/o
> halting.
> I drop indexes for the DTS, for the data transformation I optimize sql
> for performance, individual statements are broken into transactions,
> most of the largest updates/inserts are broken into smaller transactions,
> temp tables are used when needed, I aggressively manage the multiple data
> and log files on multiple raids and hard drives. But I have not been
> able to understand the hints but more importantly why a process will run
> fine most times then decide to freeze once (btw, I also check for open
> transactions to make certain that there was no minor error leaving an
> open tran).
> thanks for the references
>
> ?url=/library/en-us/acdata/ac_8_con_7a_3hdf.asp As for maximizing the
> performance of your ETL, I
> http://www.microsoft.com/technet/pr...tain/incbulkloa
> d.mspx Setting the database to DBO Use Only would also be a good
>|||Ugh. Page splitting is difficult to explain; think of your table as a
blank notebook. The clustered index is kind of like a page number for
your other indexes to go and retrieve information; it's a way for the
optimizer to go figure out where row X is located in your table.
As you write data to your notebook, it begins to fill up each page. As
you continue to write, it skips to the next page; if your clustered
index does not increase in a sequential fashion, when the optimizer
encounters a a value that is out of order, it has to split the page at
the insertion point, and move records below that point on a page to a
new page; that's why you should cluster on a sequential value that is
independent of the order of the data outside of the warehouse. That's
why I recommend a datetime representation so it increases without
splitting the pages.
If your primary key is a natural key (like a combination of clientID
and rowID), then the likelihood is that your rows will be out of order
as they load into the target database, and hence some reordering will
be necessary if you cluster on that.
Clear as mud?
Stu|||Also consider that if a good candidate for a clustered index cannot be
found, then best not to implement one.
"Stu" <stuart.ainsworth@.gmail.com> wrote in message
news:1141539867.013564.240070@.e56g2000cwe.googlegroups.com...
> Ugh. Page splitting is difficult to explain; think of your table as a
> blank notebook. The clustered index is kind of like a page number for
> your other indexes to go and retrieve information; it's a way for the
> optimizer to go figure out where row X is located in your table.
> As you write data to your notebook, it begins to fill up each page. As
> you continue to write, it skips to the next page; if your clustered
> index does not increase in a sequential fashion, when the optimizer
> encounters a a value that is out of order, it has to split the page at
> the insertion point, and move records below that point on a page to a
> new page; that's why you should cluster on a sequential value that is
> independent of the order of the data outside of the warehouse. That's
> why I recommend a datetime representation so it increases without
> splitting the pages.
> If your primary key is a natural key (like a combination of clientID
> and rowID), then the likelihood is that your rows will be out of order
> as they load into the target database, and hence some reordering will
> be necessary if you cluster on that.
> Clear as mud?
> Stu
>|||gc (nospam@.hotmail.com) writes:
> In our ETL, we have a number of procedures (which can call other
> procedures) to test and process the data after the raw data is imported
> w/ DTS. The DTS part is always fast, some of the subsequent processing
> will lock-up periodically...I assume this would be a deadlock or lock
> escalation? (When I look at the locks I will see many pages on one
> object). It just happened to occurr on one machine...I stopped and
> restarted w/o mods to the code and the next time it processed w/o
> halting.
A deadlock is when two (or more) processes are waiting for each other
to release resources. SQL Server detects a deadlock, and will select one
of the processes as a deadlock victim and cancel execution for that
process, so that at least one of the processes can continue working.
If you processing "locks up" this can be due to blocking, but also due
to long-running queries. With the information you have provided, it is
difficult to tell. As JT said, use sp_who2 to check for blocking.
Another issue to watch out for is auto-grow of the database. By default,
SQL Server auto-grows a database with 10% when you run out of space. If
your database is huge - and data warehouses often are - then 10% can
take quite some to time grow. This can be dealt with expanding the database
to a reasonable size in advance.
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se
Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pr...oads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodin...ions/books.mspx

No comments:

Post a Comment