Ssis Is An In Memory Pipeline Computer Science Essay

Since SSIS is an in-memory pipeline, one has to ensure that transactions occur in the memory for performance benefits. To check if your package is staying within memory limits, one should review the SSIS performance counter Buffers spooled. This has an initial value of 0. Any value above 0 is an indication that the engine has started disk-swapping activities.

Capacity planning to understand resource utilization

In order to understand resource utilization it is very important to monitor CPU, Memory, I/O and Network utilization of the SSIS package.

CPU

It is important to understand how much CPU is being utilized by SSIS and how much of CPU is being utilized by overall SQL Server while Integration Services is running. This latter point is very important, especially if you have SSIS and SQL Server on the same box, because if there is resource contention, SQL Server will surely win that will result into disk spilling from Integration Services resulting in slower transformation speed.

The performance counter that should be monitored is Process / % Processor Time (Total). One should measure this counter for both sqlservr.exe and dtexec.exe. If SSIS is not close to 100% CPU load, then this indicates:

Application contention – For e.g. SQL Server takes more processor resources, makes it unavailable for SSIS

Hardware contention – Probably a suboptimal disk I/O or not enough memory to handled the amount of data to be processed

Design limitation – The SSIS design is not making use of parallelism, and/or the package has too many single-threaded tasks

Network

SSIS moves data as fast as your network is able to handle it. Hence, it is important to understand your network topology and ensure that the path between the source and destination have both low latency and high throughput. Following performance counters can help you tune the topology:

Network Interface / Current Bandwidth – Provides estimate of current bandwidth

Network Interface / Bytes Total/Sec – The rates at which bytes are sent and received on each network adapter

Network Interface / Transfers/Sec – How many network transfers per second are occurring. If the number is close to 40,000 IOPs, then get another NIC card and use teaming between the NIC cards

Input / Output (I/O)

A good SSIS package should hit the disk only when it reads from the sources and writes back to the target. But if the I/O is slow, reading and especially writing can create a bottleneck. So it is very important to understand that the I/O system is not only specified in size (like 1 TB, 2 TB) but also its sustainable speed (like 20,000 IOPs).

Memory

The key counters to monitor memory for SSIS and SQL Server are as follows:

Process / Private Bytes (DTEXEC.EXE) – amount of memory currently used by Integration Services that cannot be shared with other processes

Process / Working Set (DTEXEC.EXE) – amount of allocated memory by Integration Services

SQL Server: Memory Manager / Total Server Memory – amount of allocated memory for SQL Server. This counter is the best indicator of total memory used by SQL, because SQL Server has another way to allocate memory using the AWE API

Memory / Page Reads/sec – total memory pressure on the system. If this consistently goes above 500, it is an indication that the system is under memory pressure

Baseline Source System Extract Speed

It is important to understand the source system and the speed at which data can be extracted from it. Measure the speed of the source system by creating a simple package that reads data from some source with the destination that says “Row Count”

Execute this package from the command line and measure the time it took for it to complete the task. Using Integration Services log output, you can measure the time taken. Formula to be used:

Rows/Sec = RowCount / Time

Based on the above value, you can judge the maximum number of rows per second that can be read from the source. To increase the Rows/Sec calculation, you can perform one of the following operations:

Improve drivers and driver configurations: Ensure you are using the up-to-date driver configurations for the network, data source and disk I/O.

Start multiple connections: To overcome limitations of drivers, you can start multiple connections to your data source. If the source is able to handle many concurrent connections, the throughput will increase if you start several extracts at once. If concurrency causes locking or blocking issues, consider partitioning the source having your packages read from different partitions to more evenly distribute the load

Use multiple NIC cards: If network is the bottleneck and you have ensured you are using gigabit network cards and routers, then a potential solution is to use multiple NIC cards per server.

Optimize SQL data source, Lookup transformations and Destination

Here are some optimization tips that you can implement in your SSIS packages:

Use NOLOCK or TABLOCK hints to remove locking overhead

Refrain from using “SELECT *” in SQL queries. Mention each column name in the SELECT clause for which data needs to be retrieved

If possible, perform datetime conversions at source or target databases

In SQL Server 2008 Integration Services, there is a new feature of shared lookup cache. During the use of parallel pipelines, it provides high-speed, shared cache

If Integration Services and SQL Server run on the same box, use SQL Server destination instead of OLE DB

Commit size 0 is fastest on heap bulk targets. If you cannot use 0, use the highest possible value of commit size to reduce overhead of multiple-batch writing. Commit size = 0 is bad while inserting into BTree – because all incoming rows must be sorted at once into the target BTree, and if the memory is limited, there is a likelihood of spill. Batchsize=0 is ideal for inserting into a heap. Please note that a commit size value of 0 might cause the running package to stop responding if the OLE DB destination and another data flow component are updating the same source table. To ensure that the package does not stop, set the maximum insert commit size option to 2147483647

Use a commit size of < 5000 to avoid lock escalation when inserting

Heap inserts are typically faster than using a clustered index. This means it is recommended to drop and rebuild all the indexes if there is a large part of the destination table getting changed.

Use partitions and partition SWITCH command. In other words load a work table that contains single partition and SWITCH it into the main table after the indexes are build and then put the constraints on

Network tuning

Packet size is the main property of the network that needs to be monitored / looked at in order to take decisions for Network tuning. By default this value is set to 4,096 bytes. As noted in SqlConnection.PacketSize property in .Net Framework Class Library, when the packet size is increased, it will improve performance because fewer network read and write operations are required to transfer a large data set. If your system is transactional in nature, lowering the value will improve the performance.

Another network tuning technique is to use network affinity at the operating system level to increase the performance at high throughputs.

Use Data Type wisely

Following are some best practices related to usage of data types:

Define data types as narrow as possible

Do not perform excessing casting of data types. Match your data types to the source or destination and explicitly specify data type casting

Take care of precision when using money, float and decimal data types. Money data type is always faster than decimal and has fewer precision considerations than float.

Change the design

Following are some best practices related to SSIS design:

Do not SORT within Integration Services unless absolutely necessary. In order to sort the data Integration Services allocates memory space for the entire data set that needs to be transformed. Preferably, presort the data before hand. Another way to sort the data is by using ORDER BY clause to sort large data in the database.

There are times where using Transact-SQL will be faster than processing the data in SSIS. Generally all set-based operations will perform faster in Transact-SQL because the problem can be transformed into a relational algebra formulation that SQL Server is optimized to resolve.

Set-based UPDATE statements – these are more efficient than row-by-row OLE DB calls

Aggregation statements like GROUP BY and SUM are also calculated faster using T-SQL instead of in-memory calculations by a pipeline

Delta detection is a technique where you change existing rows in the target table instead of reloading the table. To perform delta detection, one can change detection mechanism such as the new SQL Server 2008 Change Data Capture (CDC) functionality. As a rule of thumb, if the target table has changed > 10 %, it is often faster to simply reload than to perform the delta detection

Partition the problem

For ETL design, partition source data into smaller chunks of equal size. Here are some more partitioning tips:

Use partitioning on your target table. Multiple versions of the same package can be executed in parallel to insert data into different partitions of the same table. The SWITCH statement should be used during partitioning. It not only increases parallel load speed, but also allows efficient transfer of data.

As implied above, the package should have a parameter defined that specifies which partition should it work on.

Minimize logged operations

If possible, used minimal logged operations while inserting data into your target SQL Server database. When data is inserted into a database in fully logged mode, the size of the log grows quickly, because each row that is written in the database is also written to the log. Therefore, consider the following while designing SSIS packages:

Try to perform data flows in bulk mode instead of row by row. This will help minimize the number of entries to the log file. This eventually results into less disk I/O hence improving the performance

Read also  Integrated Development Environment (IDE) Components

If for any reason you need to delete data, organize the data in such a way that you can use TRUNCATE instead of DELETE. The later places an entry of each row that is deleted into the log file. The former will delete all the data and just put one entry into the log file

If for any reason partition need to be move around, use the SWITCH statement. This is a minimally logged operation

If you use DML statements along with your INSERT statements, minimum logging is suppressed.

Schedule and distribute it correctly

Good way to handle execution is to create a priority queue for your package and then execute multiple instances of the same package (with different partition parameter values). This queue can be a simple SQL Server table. A simple loop in the control flow should be a part of each package to:

Pick a relevant chunk from the queue

“Relevant” means that is not already been processed and that all chunks it depends on have already executed

Exit the package if no item is returned from the queue

Perform work required on the chunk

Mark the chunk as “done” in the queue

Return to the start of the loop

Picking an item from the queue and marking it as “done” can be implemented as a stored procedure. Once you have the queue in place, you can simple start multiple copies of DTEXEC to increase parallelism.

Keep it simple

Unnecessary use of components should be avoided. Here is one of the way to avoid it:

Step 1: Declare the variable varServerDate

Step 2: Use ExecuteSQLTask in the control flow to execute a SQL query to get the server datatime and store it in the variable

Step 3: Use the dataflow task and insert/update database with the server datatime from the variable varServerDate

This sequence is advisable only in cases where the time difference from Step 2 to Step 3 really matters. If that does not matter, then just use the getdate() command at Step 3 as shown below:

Create table #Table1(t_ID int, t_date datetime)

Insert into #Table1(t_ID, t_date) values(1, getdate())

Executing a child package multiple times from a parent with different parameter values

While executing a child package from a master package, parameters that are passed from the master package should be configured in the child package. Use the ‘Parent Package Configuration’ option in the child package to implement this feature. But for using this option, you need to specify the name of the ‘Parent Package Variable’ that is passed to the child package. If there is a need to call the same child package multiple times (each time with a different parameter value), declare the parent package variables (with the same name as given in the child package) with a scope limited to ‘Execute Package Tasks’. SSIS allows declaring variables with the same name but the scope limited to different tasks – all inside the same package.

SQL Job with many atomic steps

For the SQL job that calls the SSIS packages, create multiple steps, each performing small tasks rather than one step that performs all the tasks. Creating one big step, the transaction log grows too big and if a rollback takes place, it make take the full processing space of the server.

Avoid unnecessary typecasts

Avoid unnecessary typecasts. For e.g., flat file connection manager, be default, uses the string [DT-STR] data type for all columns. You will have to manually change it, if there is a need to use the actual data type. It is always a good option to change it at the source-level itself to avoid unnecessary type casting.

Transactions

Usually, ETL processes handle large volume of data. In such scenarios, do not attempt a transaction on the whole package logic. SSIS does support transactions, and it is advisable to use transactions.

Distributed transaction that span across multiple tasks

The control flow of an SSIS package threads together various control tasks. In SSIS it is possible to set a transaction that can span into multiple tasks using the same connection. To enable this, set value of the “retainsameconnection” property of the Connection Manager to “true”

Limit the package name to maximum of 100 characters

When a SSIS package with a package name exceeding 100 characters is deployed in SQL Server, it trims the package name to 100 characters, which may cause an execution failure.

SELECT * FROM

Do not pass any unnecessary columns from the source to the destination. With the OLEDB connection manager source, using the “Table or View” data access mode is equivalent to “SELECT * FROM tablename”, which will fetch all the columns. Use ‘SQL Command’ to fetch only required columns and pass that to the destination.

Excel source and 64-bit runtime

Excel Source or Excel Connection manager works only with the 32-bit runtime. Whenever a package that uses Excel Source is enabled for 64-bit runtime (by default, this is enabled), it will fail on the production server using the 64-bit runtime. Go to solution property pages debugging and set Run64BitRuntime to FALSE.

On failure of a component, stop / continue the execution with the next component

When a component fails, the property failParentonFailure can be effectively used either to stop the package execution or continue with the next component execution in the sequence container. The constraint value connecting the components in the sequence should be set to “Completion”. Also the failParentonFailure property should be set to FALSE.

Protection

To avoid most of the package deployment error from one system to other, set the package protection level to ‘DontSaveSensitive’

Copy pasting script component

Once you copy-paste a script component and execute the package, it may fail. As a work-around, open the script editor of the pasted script component, save the script and then execute the package.

Configuration filter – Use as a filter

As a best practice use the package name as the configuration filter for all the configuration items that are specific to a package. This is typically useful when there are so many packages with package specific configuration items. Use a generic name for configuration items that are general to many packages.

Optimal use of configuration records

Avoid using the same configuration item recorded under different filter / object name. For e.g. there should be only one configuration record created if two packages are using the same connection string. This can be achieved by using the same name for the connection manager in both the packages. This is quite useful at the time of porting from one environment to other (like UAT to Prod).

Pulling High Volume data

Process of pulling high volume is represented in the following flowchart:

The recommendation is to consider dropping all indexes from the target tables if possible before inserting data especially when the volume inserts are high.

Effect of OLEDB Destination Settings

Certain settings with OLEDB destination will impact the performance of the data transfer. Let’s look at some of them:

Data Access Mode – This setting provides ‘fast load’ option, which internally uses BULK INSERT statement for uploading data into the destination table.

Keep Identity – By default this setting is unchecked which means the destination table (if it has an identity column) will create identity values on its own. On checking this setting, the dataflow engine will ensure that the source identity values are preserved and same value is inserted into the destination table.

Keep NULLs – By default this setting is unchecked which means default value will be inserted (if the default constraint is defined on the target column) during INSERT into the destination table if NULL value is coming from the source for that particular column. On checking this option, the default constraint on the destination table’s column will be ignored and preserved NULL of the source column will be inserted into the destination column.

Table Lock – By default this setting is checked and the recommendation is to let it be checked unless the same table is being used by some other process at the same time.

Check Constraints – By default this setting is checked and recommendation is to have it unchecked if you are sure the incoming data is not going to violate constraints of the destination table. This setting indicates that the dataflow pipeline engine will validate the incoming data against the constraints of target table. Performance of data load can be improved by unchecking this option.

Effects of Rows per Batch and Maximum Insert Commit Size settings

Rows per batch – The default value for this setting is -1 which means all incoming rows will be treated as a single batch. If required you can change this to a positive integer value to break all incoming rows into multiple batches. The positive integer value will represent the total number of rows in a batch

Maximum insert commit size – Default value for this setting is ‘2147483647’ which means all incoming rows will be committed once on successful completion. If required, you can change this positive integer to any other positive integer number that would represent that the commit will be done for those specified number of records. This might put an overhead on the dataflow engine to commit several times, but on the other side it will release the pressure on the transaction log and save tempdb from growing tremendously especially during high volume data transfers.

The above two settings are mainly focused on improving the performance of tempdb and transaction log.

Avoid Synchronous/Asynchronous transformations

While executing the package, SSIS runtime engine executes every task other than data flow task in defined sequence. On encountering a data flow task the execution of the data flow task is taken over by the data flow pipeline engine. The dataflow pipeline engine then breaks the execution of the data flow task into one ore more execution tree(s). It may also execute these trees in parallel to achieve high performance.

Read also  Analysis of Data Mining Tools

To make things a bit clearly, here is what an Execution Tree means. An Execution tree starts at a source or an asynchronous transformation and ends at a destination or first asynchronous transformation in the hierarchy. Each tree has a set of allocated buffer and scope of these buffers is associated to this tree. Also in addition to this every tree is allocated an OS thread (worker-thread) and unlike buffers other execution tree may share this thread.

Synchronous transformation gets a record, processes it and passes it to the other transformation or destination in the sequence. The processing of a record does not dependent on the other incoming rows. Since synchronous transformations output the same number of rows as the input, it does not require new buffers to be created and hence is faster in processing. For e.g., in the Derived column transformation, a new column gets added in each incoming row, without adding any additional records to the output.

In case of asynchronous transformation, different number of rows can be created than the input requiring new buffers to be created. Since an output is dependent on one or more records it is called blocking transformation. It might be partial or full blocking. For e.g., the Sort Transformation is a fully blocking transformation as it requires all the incoming rows to arrive before processing.

Since the asynchronous transformation requires additional buffers it performs slower than synchronous transformations. Hence asynchronous transformations must be avoided wherever possible. For e.g. instead of using Sort Transformation to get sorted results, use ORDER BY clause in the source itself.

Implement Parallel Execution in SSIS

Parallel execution in allowed by SQL Server Integration Services (SSIS) in two different ways by controlling two properties mentioned below:

MaxConcurrentExecutables – this property defines how many tasks (executable) can run simultaneously. This property defaults to -1, which is translated to the number of processors plus 2. In case, hyper-threading is turned on in your box, it is the logical processor rather than the physically present processor that is counted. For e.g. we have a package with 3 Data Flow tasks where every task has 10 flows in the form of “OLE DB Source -> SQL Server Destination”. To execute all 3 Data Flow Tasks simultaneously, set the value of MaxConcurrentExecutables to 3.

The second property named EngineThreads controls whether all 10 flows in each individual Data Flow Task get started concurrently.

 

EngineThreads – this property defines how many work threads the schedule will create and run in parallel. The default value for this property is 5.

In the above example, if we set the EngineThreads to 10 on all 3 Data Flow Tasks, then all the 30 flows will start at the same time.

 

One thing we want to be clear about EngineThreads is that it governs both source threads (for source components) and work threads (for transformation and destination components). Source and work threads are both engine threads created by the Data Flow’s scheduler. Looking back at the above example, setting a value of 10 for Engine Threads means up to 10 source and 10 work threads each.

 

In SSIS, we don’t affinitize the threads that we create to any of the processors. If the number of threads surpasses the number of available processors, it might hurt the throughput due to an excessive amount of context switches.

Package restart without losing pipeline data

SSIS has a cool feature called Checkpoint. This feature allows your package to start from the last point of failure on next execution. You can save a lot of time by enabling this feature to start the package execution from the task that failed in the last execution. To enable this feature for your package set values for three properties CheckpointFileName, CheckpointUsage and SaveCheckpoints. Apart from this you should also set FailPackageOnFailure property to TRUE for all tasks that you want to be considered in restarting.

By doing this, on failure of that task, the package fails and the information is captured in the checkpoint file and on subsequent execution, the execution starts from that tasks.

It is very important to note that you can enable a task to participate in checkpoint including data flow task but it does not apply inside the data flow task. Let’s consider a scenario, where you have a data flow task for which you have set FailPackageOnFailure property to TRUE to participate in checkpoint. Lets assume that inside the data flow task there are five transformations in sequence and the execution fails at 5th transformation (assumption is that earlier 4 transformations complete successfully). On the following execution instance, the execution will start from the data flow task and the first 4 transformations will run again before coming to 5th one.

It is worth noting below points.

For loop and for each loop do not honor Checkpoint.

Checkpoint is enabled at only control flow level and not at data level, so regardless of checkpoint the package will execute the control flow/data flow from the start in a case of restart.

If package fails, checkpoint file, all server configurations and variables values are stored and also point of failure. So if package restarted, it takes all configuration values from checkpoint file. During failure you cannot change the configuration values.

Best practices for logging

Integration Services includes logging features that write log entries when run-time events occur and can also write custom messages. Logging, to help you in auditing and troubleshooting a package every time it is run, can capture run-time information about a package. For e.g., name of the operator who ran the package and the time the package began and finished can be captured in the log.

Logging (or tracing the execution) is a great way of diagnosing the problem occurring during runtime. This is especially very useful when your code does not work as expected. Not only that, SSIS allows you to choose different events of a package and components of the packages to log as well as the location where the log information is to be written (text files, SQL Server, SQL Server Profiler, Windows Events, or XML files).

The logging saves you from several hours of frustration that you might get while finding out the causes of problem if you are not using logging, but the story doesn’t end here. It’s true, it helps you in identifying the problem and its root cause, but at the same time it’s an overhead for SSIS that ultimately affects the performance as well, especially if you are excessively using logging. So the recommendation here is to use logging in a case of error (OnError event of package and containers) . Enable logging on other containers only if required, you can dynamically set the value of the LoggingMode property (of a package and its executables) to enable or disable logging without modifying the package.

You can create your own custom logging which can be used for troubleshooting, package monitoring, ETL operations performance dashboard creation etc.

However the best approach is to use the built-in SSIS logging where appropriate and augment it with your own custom logging. A normal custom logging can provide all the information you need as per requirement.

Security audit and data audit is out of scope of this document.

To help you understand which bulk load operations will be minimally logged and which will not, the following table lists the possible combinations.

Table Indexes

Rows in table

Hints

Without TF 610

With TF 610

Concurrent possible

Heap

Any

TABLOCK

Minimal

Minimal

Yes

Heap

Any

None

Full

Full

Yes

Heap + Index

Any

TABLOCK

Full

Depends (3)

No

Cluster

Empty

TABLOCK, ORDER (1)

Minimal

Minimal

No

Cluster

Empty

None

Full

Minimal

Yes (2)

Cluster

Any

None

Full

Minimal

Yes (2)

Cluster

Any

TABLOCK

Full

Minimal

No

Cluster + Index

Any

None

Full

Depends (3)

Yes (2)

Cluster + Index

Any

TABLOCK

Full

Depends (3)

No

(1) It is not necessary to specify the ORDER hint, if you are using the INSERT … SELECT method, but the rows need to be in the same order as the clustered index. While using BULK INSERT it is necessary to use the ORDER hint.

(2) Concurrent loads are only possible under certain conditions. Only rows those are written to newly allocated pages are minimally logged.

(3) Based on the plan chosen by the optimizer, the non-clustered index on the table may either be fully- or minimally logged.

Best practices for error handling

There are two methods of extending the logging capability,

Build a custom log provider

Use event handlers

We can extent SSIS’s event handler for error logging. We can capture error on OnError event of package and let package handle it gracefully. We can capture actual error using script task and log it in text file or in a SQL server tables. You can capture error details using system variables System::ErrorCode, System::ErrorDescription, System::SourceDescription etc.

If you are using custom logging, log the error in same table.

In some cases you may wish to ignore it or handle the error at container level or in some cases at task level.

Event handlers can be attached to any container in the package and that event handler will catch all events raised by that container and any child containers of that container. Hence, by attaching an event handler to the package (which is parent container) we can catch all events raised of that event type by every container in the package. This is powerful because it saves us from building event handlers for each task in the package.

A container has an option to “opt out” of having its events captured by an event handler. Let’s say, you had a sequence container for which you didn’t find it important to capture events, you can then simply switch them off using the sequence container’s DisableEventHandlers property.

Read also  Computer Is Derived From The Latin Term Computer Science Essay

If are looking to capture only certain events of that sequence task by an event handler, you could control this using the System::Propogate variable.

We recommend you to use sequence container to group tasks based on activities.

Options on Lookup optimization

In the Data Warehousing world, it’s a frequent requirement to have records from a source by matching them with a lookup table.  SSIS has a built-in Lookup transformation feature for the same.

Lookup transformation has been designed to perform optimally; for example by default it uses Full Caching mode (all reference dataset records are brought into memory at the start, pre-execute phase of the package, and kept for reference). This helps the lookup operation to perform faster and at the same time reducing the load on the reference data table, since it does not have to fetch each individual record one by one as required.

Though it sounds great there are some things to keep in mind. First you need to have enough physical memory for storage of the complete reference dataset. This is to ensure that if it runs out of memory it does not swap the data to the file system and resulting into a data flow task failure. This mode is recommended when you have enough memory to hold the reference dataset (assuming it does not change frequently). In other words, changes made to the reference table will not be reflected once data is fetched into memory.

Use Partial caching mode or No Caching mode when you have enough memory or the data does change frequently.

In Partial Caching mode, whenever a record is required it is pulled from the reference table and kept in memory. You can also specify the maximum amount of memory to be used for caching. On crossing the limits it removes the least used records from memory to make room for new records. In case of memory constraints or if the reference data does not change frequently, this mode is highly recommended.

No Caching mode performs slower as every time it needs a record it pulls from the reference table and no caching is done except the last row. While working with large reference data set with not enough memory to hold it or if your reference data changes quite frequently and you are looking for the latest data, this mode is highly recommended.

To summarize the above recommendations:

Analyze your environment and after thorough testing, choose the appropriate caching mode.

Ensure you have and index on the reference table, if you plan to use the Partial Caching or No Caching mode. This will help boost the performance.

Use a SELECT statement with a list of required columns instead of specifying a reference table in he lookup configuration.

Use WHERE clause for filtering unwanted rows for the lookup.

In SSIS 2008, there is feature to save your cache to be shared by different lookup transformations or data flow tasks and packages and utilize this feature wherever applicable.

Optimize buffer size

As mentioned earlier in the document, Execution Tree creates buffers for storing incoming rows and performing transactions.

The number of buffer created depends on how many rows fit into a buffer and how many rows fit into a buffer is dependent on few other factors. The first factor being the estimated row size that is the sum of the maximum sizes of all the columns from the incoming records. The second factor is the DefaultBufferMaxSie property of the Data Flow Task. This property indicates the default maximum size of the buffer. The default value bing 10 MB and its upper and lower boundaries are constrained by two internal properties of SSIS which are MaxBufferSize (100MB) and MinBufferSize (64 KB). The third factor is, DefaultBufferMaxRows, which is a Data Flow Task property that specifies the default number of rows in a buffer. The default value is 10000.

There are two things that we can do for better buffer performance. First being removal of unwanted columns from the source and setting data type in each column appropriately, especially in the case when your source is flat file.  This help in accommodating as many rows as possible in the buffer. Second, if the system has sufficient memory available, tune these properties to have a small number of large buffers, which could help in improving the performance. There is an adverse impact on the performance if you change the values of these properties to a point where page spooling begins. So before you set a value for these properties, please thoroughly test the same in your environment and then set the values appropriately.

You can enable logging of the BufferSizeTuning event to learn how many rows a buffer contains and you can monitor “Buffers spooled” performance counter to see if the SSIS has began page spooling. 

General guidelines

Here are some generic guidelines that can be implemented while creating SSIS packages:

Use of Variables

Package variable names should describe their contents and use. Variables should be defined using property expressions. For a given variable, set EvaluateAsExpression to true and enter the expression.

 

Creating template packages for reuse.

This can have standard logging mechanism. Standard comment and frequently used task.

 

Additional to Package naming convention.

Avoid using dot (.) naming convention for your package names. Dot (.) naming convention sometime confuses with the SQL Server object naming convention and hence should be avoided. We recommend using underscore (_) instead of using dot.

   

Annotations/Comments

Package annotations perform the task of descriptive labels helping to illustrate how a package works and can be placed anywhere within the background of a Control Flow or Dataflow.

It is good practice to place large annotations inside a collapsible Sequence Container. This minimizes the impact on space used by the annotation.

Configuration tips

XML & SQL Configurations are widely used. Avoid registry entry & environment variable configuration mode. It is recommended to use SQL Server configuration mode.

 

If you are using XML configuration, don’t pull all configurations into a single XML configuration file. Instead, create a separate XML configuration file for each configuration. This is a good approach that helps in reusing the configuration files by different packages.

 

When storing information about an OLE DB Connection Manager in a configuration, do not store individual properties like Initial Catalog, Username, Password etc., instead just store the ConnectionString property.

 

Configuration Mode

Security

Maintainability

Portability

XML

No

No

Yes

SQL Server

Yes

Yes

No

Environment variable

No

No

N/A

Registry Entry

No

No

N/A

Choosing between Bulk Load Methods

Here is an overview of the different bulk methods available in SQL Server and Integration Services.

Functionality

Integration Services

BULK INSERT

BCP

INSERT … SELECT

SQL Dest.

OLE DB Dest

Protocol

Shared Memory

TCP/IP

Named Pipes

In Memory

TCP/IP

Shared Memory

Named Pipes

In Memory

Speed

Faster / Fastest (4)

Fast / Fastest (1)

Fastest

Fast

Slow / Fastest (2)

Data Source

Any

Any

Data File Only

Data File Only

Any OLE DB

Bulk API Support

Not Native

Not ORDER

Not Native

All

All

No Hints Allowed

Lock taken with TABLOCK hint on heap

BU

BU

BU

BU

X

Can transform in transit

Yes

Yes

No

No

Yes

I/O Read block Size

Depends (3)

Depends (3)

64 kilobytes (KB)

64 KB

Up to 512 KB

SQL Server Version

2005 and 2008

2005 and 2008

7.0, 2000, 2005 and 2008

6.0, 7.0, 2000, 2005 and 2008

2008

Invoked from

DTEXEC / BIDS

DTEXEC / BIDS

Transact-SQL

Command Line

Transact-SQL

(1) If you run DTEXEC on a different server than SQL Server, Integration Services delivers very high speed by offloading data conversions from the Database Engine.


(2) Note that INSERT … SELECT does not allow concurrent inserts into a single table. In a situation when you need to populate a single table, Integration Services would be a faster option because you can now run multiple streams in parallel


(3) The read block size depends on source. 128 KB block sizes are used in case of text files.

(4) SQL Server Destination will use more CPU cycles than BULK INSERT, limiting max speeds. Since it offloads the data conversion, the amount of a single stream insert is faster than BULK INSERT.

Naming Conventions  

Acronyms should be used at the beginning of the names of tasks to identify what type of task it is. e.g. (Execute Package Task – EPT, Conditional Split – CSPL)Following are some guidelines on naming conventions:

Task

Prefix

For Loop Container

FLC

For each Loop Container

FELC

Sequence Container

SEQC

ActiveX Script

AXS

Analysis Services Execute DDL

ASE

Analysis Services Processing

ASP

Bulk Insert

BLK

Data Flow

DFT

Data Mining Query

DMQ

Execute DTS 2000 Package

EDPT

Execute Package

EPT

Execute Process

EPR

Execute SQL

SQL

File System

FSYS

FTP

FTP

Message Queue

MSMQ

Script

SCR

Send Mail

SMT

Transfer Database

TDB

Transfer Error Messages

TEM

Transfer Jobs

TJT

Transfer Logins

TLT

Transfer Master Stored Procedures

TSP

Transfer SQL Server Objects

TSO

Web Service

WST

WMI Data Reader

WMID

WMI Event Watcher

WMIE

XML

XML

Data Reader Source

DR_SRC

Excel Source

EX_SRC

Flat File Source

FF_SRC

OLE DB Source

OLE_SRC

Raw File Source

RF_SRC

XML Source

XML_SRC

Aggregate

AGG

Audit

AUD

Character Map

CHM

Conditional Split

CSPL

Copy Column

CPYC

Data Conversion

DCNV

Data Mining Query

DMQ

Derived Column

DER

Export Column

EXPC

Fuzzy Grouping

FZG

Fuzzy Lookup

FZL

Import Column

IMPC

Lookup

LKP

Merge

MRG

Merge Join

MRGJ

Multicast

MLT

OLE DB Command

CMD

Percentage Sampling

PSMP

Pivot

PVT

Row Count

CNT

Row Sampling

RSMP

Script Component

SCR

Slowly Changing Dimension

SCD

Sort

SRT

Term Extraction

TEX

Term Lookup

TEL

Union All

ALL

Unpivot

UPVT

Data Mining Model Training

DMMT_DST

Data Reader Destination

DR_DST

Dimension Processing

DP_DST

Excel Destination

EX_DST

Flat File Destination

FF_DST

OLE DB Destination

OLE_DST

Partition Processing

PP_DST

Raw File Destination

RF_DST

Recordset Destination

RS_DST

SQL Server Destination

SS_DST

SQL Server Mobile Destination

SSM_DST

File Watcher Task

FW

Order Now

Order Now

Type of Paper
Subject
Deadline
Number of Pages
(275 words)