SQLSaturday Vienna

Don’t forget to join us at SQLSaturday in Vienna!

SQLSaturday Vienna #sqlsatVienna
Free training event
Registration required!

Thursday, March 6, 2014

A Happy New Year

A Happy New Year to you all!

I know it is late in coming, but you know how things go sometimes. It’s also been a while since I posted new articles. The last year was quite a turbulent one: I became the father of a beautiful daughter, I moved from Ireland back to Austria, I changed jobs, I became self-employed. That’s quite a year, I reckon.

2014 should be a lot quieter and give me the chance to update this blog more frequently (= at all).

So, a Happy New Year to you all. Let’s hope it’s a good one.

Replication without Snapshot Initialisation

Recently I had a requirement for a customer that was not really “by the book”.
The requirement was to replicate a table, but only replicate anything new coming in, no historical data. The table has close to 7 billion rows and is several terabytes in size.
Replicating all the historical data would kill the target server, which simply doesn’t have the capacity. The target server will only keep a rolling 30 days of data and discard everything older.
That said, I had to find a way to replicate all new transactions without the need for a snapshot or backup initialisation. This is not as straightforward as it sounds, but then again not really too complicated either, once you know how.
After some trial and error on a test lab installation I got it going. Here is an account of how to do it:

Create the schema on the target server

To use replication without snapshot initialisation you need to have all the data structures set up before you start. Create the target table on the target server. Don’t forget to plan how to handle identity columns. (If you need some data in there, also transfer and, if need be, modify the data to your needs.)
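
A minimal sketch of what that could look like (all column names here are made up, only the article name 'Source Table' is taken from the scripts below; adjust it to your real schema). One common way to handle the identity column on the subscriber is to mark it NOT FOR REPLICATION, or to drop the identity property on the target altogether:

-- sketch of the target table on the subscriber; columns are placeholders
create table dbo.[Source Table]
    (ID       bigint identity(1,1) not for replication not null
    ,LoadDate datetime2    not null
    ,Payload  varchar(100) null
    ,constraint PK_SourceTable primary key clustered (ID));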

Create publication on source server

What I like to do is use the wizard to create my publications, but not actually create the publication, just generate the scripts. Then I check and modify them before actually creating the publication. (And the same for subscriptions.)

exec sp_addpublication 
         @publication = N'ReplWoSnapshot'
        ,@description = N'Transactional publication of database ''Source'' from Publisher ''SourceServer''.'
        ,@sync_method = N'concurrent'
        ,@retention = 0
        ,@allow_push = N'true'
        ,@allow_pull = N'true'
        ,@allow_anonymous = N'false'
        ,@enabled_for_internet = N'false'
        ,@snapshot_in_defaultfolder = N'true'
        ,@compress_snapshot = N'false'
        ,@ftp_port = 21
        ,@allow_subscription_copy = N'false'
        ,@add_to_active_directory = N'false'
        ,@repl_freq = N'continuous'
        ,@status = N'active'
        ,@independent_agent = N'true'
        ,@immediate_sync = N'false'
        ,@allow_sync_tran = N'false'
        ,@allow_queued_tran = N'false'
        ,@allow_dts = N'false'
        ,@replicate_ddl = 1
        ,@allow_initialize_from_backup = N'false'
        ,@enabled_for_p2p = N'false'
        ,@enabled_for_het_sub = N'false'
        ,@allow_partition_switch = N'true'
        ,@replicate_partition_switch = N'false'

The important part here is that immediate synchronisation (@immediate_sync) is switched off. Also switch off initialisation from backup (@allow_initialize_from_backup).
Note: The last two parameters are needed in this case because it is a partitioned table I am replicating and I still want to be able to switch partitions on the source database without generating errors. But I don’t need the partition switch itself to be replicated; the target database will have its own split-switch-merge logic.

exec sp_addpublication_snapshot 
         @publication = N'ReplWoSnapshot'
        ,@frequency_type = 1
        ,@frequency_interval = 1
        ,@frequency_relative_interval = 1
        ,@frequency_recurrence_factor = 0
        ,@frequency_subday = 8
        ,@frequency_subday_interval = 1
        ,@active_start_time_of_day = 0
        ,@active_end_time_of_day = 235959
        ,@active_start_date = 0
        ,@active_end_date = 0
        ,@job_login = N'login'
        ,@job_password = 'pwd'
        ,@publisher_security_mode = 1

The snapshot agent is still created as usual even though it won’t be used.

exec sp_addarticle 
          @publication = N'ReplWoSnapshot'
        , @article = N'Source Table'
        , @source_owner = N'dbo'
        , @source_object = N'Source Table'
        , @type = N'logbased'
        , @description = null
        , @creation_script = null
        , @pre_creation_cmd = N'drop'
        , @schema_option = 0x00000000081B509F
        , @identityrangemanagementoption = N'manual'
        , @destination_table = N'Source Table'
        , @destination_owner = N'dbo'
        , @vertical_partition = N'false'
        , @ins_cmd = N'CALL sp_MSins_dboSourceTable'
        , @del_cmd = N'CALL sp_MSdel_dboSourceTable'
        , @upd_cmd = N'SCALL sp_MSupd_dboSourceTable'

All the articles are added as usual, too.

Create subscription

Again, I used the wizard to create the subscription, but only let it generate the scripts. In my case I created a Pull subscription, but it should also work with a Push subscription.

First add it on the distributor (= source server in this case):

exec sp_addsubscription 
         @publication = N'ReplWoSnapshot'
        ,@subscriber = N'Target Server'
        ,@destination_db = N'Source Database'
        ,@sync_type = N'replication support only'
        ,@subscription_type = N'pull'
        ,@update_mode = N'read only'

Here lies the most important part of the process: the synchronisation type. You need to use Replication Support Only. This ensures that no initialisation is attempted; anything coming in from now on is simply replicated, no questions asked.

Now add the subscription on the target server.

exec sp_addpullsubscription 
		 @publisher = N'Source Server'
		,@publication = N'ReplWoSnapshot'
		,@publisher_db = N'Source DB'
		,@independent_agent = N'True'
		,@subscription_type = N'pull'
		,@description = N''
		,@update_mode = N'read only'
		,@immediate_sync = 0

Again, Immediate Synchronisation must be switched off.

exec sp_addpullsubscription_agent 
		 @publisher = N'source server'
		,@publisher_db = N'Source DB'
		,@publication = N'ReplWoSnapshot'
		,@distributor = N'Source Server'
		,@distributor_security_mode = 1
		,@distributor_login = N''
		,@distributor_password = null
		,@enabled_for_syncmgr = N'False'
		,@frequency_type = 64
		,@frequency_interval = 0
		,@frequency_relative_interval = 0
		,@frequency_recurrence_factor = 0
		,@frequency_subday = 0
		,@frequency_subday_interval = 0
		,@active_start_time_of_day = 0
		,@active_end_time_of_day = 235959
		,@active_start_date = 20130111
		,@active_end_date = 99991231
		,@alt_snapshot_folder = N''
		,@working_directory = N''
		,@use_ftp = N'False'
		,@job_login = N'login'
		,@job_password = 'pwd'
		,@publication_type = 0

The subscription agent is then added as usual.

With these few steps the replication is fully configured and running. New transactions from the moment the publication was created onwards are replicated to the subscriber without any need for initialisation, be it from snapshot or from backup.

But there is one more thing you should do!
You need to be aware that a lot of data is missing from the target database. So every update or delete on a record that exists in the source database but not in the target database will cause an error in the replication and actually stop it.
Even if you are sure only inserts happen on the source server, you should take precautions for transactions that touch data not available on the target system.

Skip Consistency Errors in Replication

In order to avoid these errors (or rather, to tell the replication to ignore them) you need to use agent profiles.
You can find a step-by-step guide here:
How to: Work with Replication Agent Profiles (SQL Server Management Studio)
For this particular distribution we need to use an Agent Profile called Continue On Data Consistency Errors.
To use this profile (a scripted alternative is sketched after the list below):

  • Open the Replication Monitor
  • Right-Click on the distribution and choose Agent Profiles
  • Tick the box next to Continue On Data Consistency Errors
  • Click the button Change Existing Agents (if you just click OK, only newly created Agents will use this profile).
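
    If you prefer to script this instead of clicking through the Replication Monitor, something along these lines should work as well. This is only a sketch: the ids passed to sp_update_agent_profile are placeholders that you have to look up first with the two queries, and as far as I know this system profile simply runs the distribution agent with -SkipErrors 2601:2627:20598.

    -- sketch only, run at the distributor; replace the placeholder ids with the
    -- values returned by the two lookup queries

    -- 1. look up the profile_id of the system profile (agent profiles live in msdb)
    select profile_id, profile_name
    from msdb.dbo.MSagent_profiles
    where profile_name like 'Continue on data consistency errors%';

    -- 2. look up the id of the distribution agent for this publication
    select id, name
    from distribution.dbo.MSdistribution_agents
    where publication = N'ReplWoSnapshot';

    -- 3. assign the profile to that agent (agent_type 3 = Distribution Agent)
    exec sp_update_agent_profile
         @agent_type = 3
        ,@agent_id   = 1    -- placeholder: id from query 2
        ,@profile_id = 4;   -- placeholder: profile_id from query 1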

    Caveat 1:

    Please be aware that replication without initialisation might not be the best solution for your requirements (or even for the requirement described here). In fact, Service Broker might provide a more suitable solution, or some ETL process, or something hand-made.
    Please always look into all kinds of solutions and don’t just take the first one that offers itself (we do that often enough anyway). It might not be the best one.

    Caveat 2:

    The description given here is not a full step-by-step guide. It presupposes some knowledge of SQL Server replication. Some obvious steps are omitted, others are only vaguely described.
    Please always refer to the SQL Server documentation for a full description of all parameters and steps.

    Merry Christmas!

    A very merry Christmas to you all!
    And a happy new year!

    Capacity Specifications for SQL Server

    Here is a useful link for all those who ever wondered about the maximum capacity of SQL Server. Here you’ll find a table specifying things like the number of partitions a table can have or the number of bytes of code a single stored procedure can have, but also fairly basic (and most important) stuff like the maximum row size, maximum index key size, maximum columns per table and so on.

    So, add the link to your bookmarks. You’ll need it over and over again:
    Maximum Capacity Specifications for SQL Server

    Get number of rows per partition

    How can you get the number of rows in each partition of your partitioned tables?
    Here is a simple query to retrieve that data, including the boundary value for each partition:

    SELECT	    
         p.partition_id                             as PartitionID
        ,t.name                                     as [Table]
        ,ps.name                                    as PartitionScheme
        ,pf.name                                    as PartitionFunction
        ,p.partition_number                         as [Partition]
        ,p.rows                                     as [Rows]
        ,prv.value                                  as Boundary
        --,pf.type_desc                               as BoundaryType
        ,case when pf.boundary_value_on_right = 1 
            then 'Right'
            else 'Left'
            end                                     as BoundarySide
    
    FROM	
        sys.tables t
        inner join sys.partitions p 
            on t.object_id = p.object_id 
            and p.index_id = 1
        inner join sys.indexes i on p.object_id = i.object_id and p.index_id = i.index_id
        inner join sys.data_spaces ds on i.data_space_id = ds.data_space_id
        inner join sys.partition_schemes ps on ds.data_space_id = ps.data_space_id
        inner join sys.partition_functions pf on ps.function_id = pf.function_id
        left outer join sys.partition_range_values prv 
            on pf.function_id = prv.function_id
            and p.partition_number = prv.boundary_id
    WHERE	
        is_ms_shipped = 0   
        and exists (select 1 from sys.partitions p2
                    where t.object_id = p2.object_id
                    and partition_number > 1)
    order by 
         t.name 
        ,p.partition_number desc

    This will show you

  • Tables that have at least two partitions. If you also want to see the number of rows for all the other tables, remove the and exists (…) condition from the where-clause.
  • Only one entry per partition. If you want to see an entry per partitioned index, remove and p.index_id = 1 from the join-condition.

    Error in replication: the subscription(s) have been marked inactive and must be reinitialized

    There are several reasons for this error to appear; mostly they are network connection problems, but it can also be caused by downtime of the subscriber beyond the subscription retention period, or by simple schema changes.
    So the first thing to do is to investigate the reason for the issue. As always, Google is your friend and you’ll find some great sites helping you with this investigation.

    Relevant data for this investigation can be found in

    select * from distribution..MSdistribution_agents
    select * from distribution..MSdistribution_history order by time desc
    select top(50) * from distribution..MSrepl_errors order by time desc
    	
    exec sp_browsereplcommands
    	
    select * from distribution..MSrepl_commands

    Reinitializing the replication can be a pain, especially with large data volumes. In some cases reinitializing can be avoided. You can re-activate the subscriptions, but you’ll have to do so directly in the distribution database.

    First find the subscriptions that need to be reactivated.

    select * from distribution..Mssubscriptions

    Take a look at the status column:

  • 0 = Inactive
  • 1 = Subscribed
  • 2 = Active

    Then you can reactivate them by resetting the status to Active:

    update distribution..Mssubscriptions set status = 2 where status = 0 and publisher_db = '<YourDB>'

    After reactivation, check the Replication Monitor and the tables above to see how your replication is doing. There might be several errors you still need to solve, but generally this should get the replication going again.

    Bit columns

    I am not a big fan of the bit data type. It’s one of those things that seem useful and helpful, but in the end they cause trouble, or at least annoyance. I’m not saying they shouldn’t be used, but I think one should be aware of what that entails.
    Like integer, bit is a very simple data type, so there is no problem with processing time. It only takes up one byte, which is the least any data type can use, so I’m also okay with that (kind of: why does something that is actually just a bit, and is even called a bit, need a whole byte?). But that’s about it. Everything else about them (or their usage) annoys me.

    Nullable bit columns

    What’s the sense of that? A bit column makes sense as a flag, defining whether a state is true or false. If a third state is needed, like ‘undefined’ or ‘who cares’, use another data type (see further down).

    Bit columns without a default

    I admit, sometimes the state of something isn’t known or doesn’t matter at insert time. To my understanding that is a third state (see above), but let’s assume for now that it is not, and that we don’t want to include the column in every insert where it isn’t necessary and where we don’t actually know whether the state is true. Then please define a default of 0 = false and make the column NOT NULL.
    Just because you’re too lazy to write the column into every insert statement doesn’t mean it has to be NULL.
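
    A minimal sketch of what I mean (table and column names are made up):

    -- made-up example: a flag that is NOT NULL and defaults to 0 = false
    alter table dbo.tablename
        add isProcessed bit not null
            constraint DF_tablename_isProcessed default (0);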

    Not 0 is 1

    Everything you insert into a bit column that is not 0 (zero) automatically becomes 1 (true). So even if you insert -42, it will be 1. That’s fine, and part of the data type, but it’s contrary to some other programming languages.
    Let’s try this:

    declare @a bit = -100
    select @a as bitVariable
    select case when @a = -100 then 'True' else 'False' end bitVarCheck


    So it stores -100 as TRUE, but an equality check against -100 returns FALSE. That’s inconsistent. I wouldn’t expect the second query to return TRUE, but I would also expect everything except TRUE to be treated as FALSE when inserting data, and not the other way round. Then again, it is called bit and not boolean, so the behaviour is actually correct (0 = no value, 1 = (any) value). But still.

    And on a related topic:

    select * from tablename where bitColumn <> 0;

    There are only two possible values. So if it is not 0 it is 1. It’s the same as:

    select * from tablename where bitColumn = 1;

    Equality operators are always more exact and need less processor time than inequality operators. And they help the query optimizer to decide on its plan faster. I’ll admit that the difference might be minuscule in such a case, but still: do it right. For your karma.

    Indexes on bit columns

    Again, bit columns only hold two possible values, so an index on a bit column gives you a selectivity of just two distinct values. Most likely the breakdown will be something like 60-40 or 70-30. Such an index will never be chosen because it returns too many rows. Unless you have a breakdown of 95-5 and always query the 5% value, an index on a bit column makes no sense.
    A bit column in an index can make sense as

  • A part of a multi-column index, placed after columns with larger diversification
  • The where-clause of a filtered index, as an additional criterion to keep the index small (see the sketch below)

    The query optimizer sometimes even suggests an index on a bit column. But that’s just because it doesn’t know the actual statistics of the column. So don’t create indexes on bit columns, unless you really only ever query the 5%.
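
    To illustrate the filtered-index case, here is a made-up sketch (table and column names are assumptions, not from any real schema):

    -- made-up example: a small filtered index that only covers the rare 'unprocessed' rows
    create nonclustered index IX_Orders_Unprocessed
        on dbo.Orders (OrderDate)
        include (CustomerID)
        where isProcessed = 0;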

    Data conversion

    This is my biggest issue with bit columns: Almost every query results in a data conversion.

    select * from tablename where bitColumn = 1;
    
    insert into	
    	dbo.tablename
    	(intColumn
    	,nvarcharColumn
    	,bitColumn)
    values
    	(1
    	,N'1'
    	,1)
    
    update
    	dbo.tablename
    set
    	bitColumn = 1
    where
    	intColumn > 10
    
    select 
    	case when bitColumn = 1 then 'True' else 'False' end
    from 
    	dbo.tablename

    This is how we would write those queries, right? In stored procedures or in views, wherever. Each of these queries forces SQL Server to do a data conversion. As in most (all?) other programming languages, the default type for an untyped whole-number literal is integer. So all the 1s are interpreted as integers. Only when the actual plan is compiled does SQL Server realise they are actually meant to be bit, and has to convert them.
    In the above cases this is one conversion per statement, and converting integer to bit is not the most costly conversion. But still. If you want to avoid the conversion, write the literal as bit explicitly:

    select * from tablename where bitColumn = cast(1 as bit);
    
    select 
    	case when bitColumn = cast(1 as bit) then 'True' else 'False' end
    from 
    	dbo.tablename

    Database Bit in other programming languages

    Every programming language interprets a database bit differently. Some map it to true and false, some return it as 1 and 0. Some return it as a byte, some as a boolean.
    I know, once you’ve mapped it, you know how it behaves and can use it accordingly. But still.

    Alternative to bit

    What I like to use instead of bit columns is tinyint with a check constraint that only allows the values 0 and 1 (a sketch follows the list below).
    The advantages are obvious:

  • Integers are just as simple as bits
  • They also need a byte for storage but can actually store eight bits in it, not just one
  • If for some reason at some point in my development I need a third state or even more, I don’t need to change the data type of the column or any of my queries. I simply change the check constraint.
  • The queries I write won’t cause data conversion, because intColumn = 1 is really meant to be an integer.
  • The check constraint will give an error if I try to insert 100 or -42. Which is more consistent.
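
    A made-up sketch of what such a column can look like:

    -- made-up example: a tinyint flag restricted to 0 and 1, NOT NULL and with a default
    create table dbo.tablename
        (ID         int     not null identity(1,1) primary key clustered
        ,statusFlag tinyint not null
            constraint DF_tablename_statusFlag default (0)
            constraint CHK_tablename_statusFlag check (statusFlag in (0, 1)));

    If a third state ever becomes necessary, only the check constraint changes (allow 0, 1, 2); the data type and all existing queries stay as they are.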

    varchar(max), varchar(1), varchar(n), char(100), nchar, nvarchar

    I recently had a discussion with developers, DBAs and Data Designers on the usage of the character data types in SQL Server. A topic, I thought, that has been discussed so many times that everything about it has to be common knowledge. I was wrong. Many things seem to be left to assumptions still. So let’s take a closer look at those types.

    All of them store character data. All of them can be given a size (1 to 8000, or max). But what does that actually mean?

    Char

    Char stores one byte for each character, or rather one byte for each character of its declared size. Char(50), for instance, always stores 50 bytes.

    if object_id('dbo.t1') is not null
      drop table dbo.t1;
    
    set ansi_padding on
    
    create table dbo.t1
      (col1 char(50) null
      ,col2 char(50) not null)
    
    insert dbo.t1 values ('1234567890', '1234567890')
    insert dbo.t1 values (null, '1234567890')
    insert dbo.t1 values ('12345678901', '12345678901')
    
    select
       col1
      ,datalength(col1) as datalength
      ,len(col1)      as stringlength
      ,col2
      ,datalength(col2) as datalength
      ,len(col2)      as stringlength
    from
      dbo.t1

    Here is what we get:

    [Result grid: every non-NULL value shows a datalength of 50, while stringlength shows the actual content length (10 or 11).]

    No surprises there. Char columns have a fixed data length independent of their content. When the content is shorter than the fixed length, the rest is padded with spaces.
    Let’s do the same again but let’s set ansi_padding off:

    if object_id('dbo.t1') is not null
      drop table dbo.t1;
    
    set ansi_padding off
    
    create table dbo.t1
      (col1 char(50) null
      ,col2 char(50) not null)
    
    insert dbo.t1 values ('1234567890', '1234567890')
    insert dbo.t1 values (null, '1234567890')
    insert dbo.t1 values ('12345678901', '12345678901')
    
    select
       col1
      ,datalength(col1) as datalength
      ,len(col1)      as stringlength
      ,col2
      ,datalength(col2) as datalength
      ,len(col2)      as stringlength
    from
      dbo.t1

    And what we get is this:

    [Result grid: the nullable col1 now shows a datalength of 10 and 11, matching its content, while the NOT NULL col2 still shows a datalength of 50 in every row.]

    On a nullable column the datalength is the same as the content length, no padding occurs. On the mandatory column col2 the content is still padded with spaces up to 50 bytes.
    SET ANSI_PADDING ON/OFF must be set at table creation time to have an effect.

    Varchar

    Now let’s look at varchar columns. Varchar doesn’t preemptively store whatever size is declared, but only stores as much as it needs, up to its maximum size. To do that, varchar additionally needs to store the length of its content, which takes up an additional two bytes.
    So varchar(50) always stores 2 bytes + (n characters * 1 byte), i.e. up to 52 bytes.

    if object_id('dbo.t1') is not null
      drop table dbo.t1;
    
    set ansi_padding on
    
    create table dbo.t1
      (col1 varchar(50) null
      ,col2 varchar(50) not null)
    
    insert dbo.t1 values ('1234567890', '1234567890')
    insert dbo.t1 values (null, '1234567890')
    insert dbo.t1 values ('12345678901', '12345678901')
    
    select
       col1
      ,datalength(col1) as datalength
      ,len(col1)      as stringlength
      ,col2
      ,datalength(col2) as datalength
      ,len(col2)      as stringlength
    from
      dbo.t1

    And what we get is this:

    [Result grid: for both columns the datalength equals the content length (10 or 11), regardless of the ansi_padding setting or nullability.]

    Independent of the ansi_padding setting and the null/not null constraint, the datalength is always the same as the content length.
    Looks like char can be just as variable as varchar with the appropriate settings, doesn’t it? No, it can’t. Don’t mistake datalength for storage length. Char still stores a byte for each character. Char(50) reserves 50 bytes for every row that is created. Varchar only stores the data that is actually there. Plus 2 bytes.
    The ansi_padding setting may have some benefits in application development, but it does not change the storage needs of character columns.
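
    If you want to see this on your own system, here is a small sketch. The reported sizes include a few bytes of per-row overhead, so don’t expect exactly 50 versus 12, but the difference between the fixed char row and the variable varchar row will be obvious:

    -- compare the real record sizes of a char(50) and a varchar(50) column
    create table dbo.t_char    (col1 char(50)    not null);
    create table dbo.t_varchar (col1 varchar(50) not null);

    insert dbo.t_char    values ('1234567890');
    insert dbo.t_varchar values ('1234567890');

    select object_name(object_id) as tablename, avg_record_size_in_bytes
    from sys.dm_db_index_physical_stats(db_id(), object_id('dbo.t_char'), null, null, 'DETAILED')
    union all
    select object_name(object_id), avg_record_size_in_bytes
    from sys.dm_db_index_physical_stats(db_id(), object_id('dbo.t_varchar'), null, null, 'DETAILED');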

    Varchar(1) vs. Char(100)

    So when to use which data type? Well, the answer seems pretty obvious:

  • 1 to 3 characters => use char
  • 4 to 7 characters => it depends. If most of your data fills the 7 characters, use char. Storage-wise the maintenance of varchar has a slight overhead, so char could be the wiser decision.
  • More than 7 characters => use varchar. Unless, of course, your data has a fixed length that never varies.

    You often hear that the cut should be made at 15, 20 or even 50 characters. Fact is, even at as little as 7 or 8 characters varchar is the more efficient choice.

    Varchar(255)

    A common practice I have seen many times is to use varchar(255) or even varchar(max) just for everything. Hey, it only stores what is there, so who cares?
    First, varchar(255). What is this artificial 255 threshold? I know, I am also old enough to remember database systems that only allowed 255 characters, and of course 11111111 is 255, bla bla. But come on: 64-bit systems, 32 GB of RAM or more, TBs of data storage. So please get rid of the 255. Or do your users really enter exactly 255 characters?

    Varchar(max)

    Then varchar(max). Varchar(max) is not really a character data type. It is treated as a large data field, just like varbinary(max), text, ntext and image.
    Well, it is a hybrid. If the data fits on the data page it is treated like varchar; if it does not fit, it is treated as a large object. This means that instead of the data itself a 16-byte pointer is stored that points to a pointer tree containing pointers to the actual data. That costs at least one additional read operation, up to several, however many it takes to gather all the pointers to all the data.
    Note that this does not depend on how much data is actually stored in the field, but solely on whether the data still fits in the data row on the page.

    We can force SQL Server to treat varchar(max) as In-Row data by switching off the large value types out of row option (default is on, which makes more sense).

    sp_tableoption 'tablename', 'large value types out of row', 'OFF'

    With this set to off, SQL Server will store the data in-row, which basically means that fewer rows fit on a data page. Use this option only when the large data is the main read and write target of the table. (For more on this topic see here.)
    Oh, and even with this option off, the maximum value that is stored in-row is 8000 bytes. So why not use varchar(8000) instead of varchar(max)?

    External applications like SSIS treat varchar(max) as a LOB, which for SSIS means that no matter what data is actually in there, data transfer will always be done by swapping the data to disk and then reading it back from disk. Which makes data transfer a hell of a lot slower.

    And of course you cannot create indexes on varchar(max) columns. But you can still use it in the include clause of an index. If you like really big indexes.
    If after all of this you still feel the need to create varchar(max) columns, consider limiting the length with a constraint. That doesn’t really help with the issues described above, but it could limit the size of your database. Do you really want users to load 2 GB of data into every data row? Why should you allow them to, if all you actually want to store are some small values?

    create table dbo.t1
      (col1 varchar(50) null
      ,col2 varchar(50) not null
      ,col3 varchar(max) null)
    
    alter table dbo.t1 add constraint CHK_col3_MaxLength
      CHECK (len(col3) <= 32000)

    Besides all the technical details, this leads to the foremost reason why you just shouldn’t set every varchar() column to 255 or 8000 or max:

    The key to a well designed, well performing database is to maintain data integrity.
    Don’t store data you don’t need.
    Define every data you store as well as possible.

    This means: limit the size of every column (not only character columns) to the size and data type you actually need, not what is easy to implement or what you might need in 100 years’ time. Varchar() is the most flexible data type of them all and devours almost any data you throw at it. The least we can do is limit the amount.

    Nchar, nvarchar

    Nchar and nvarchar are the unicode-equivalents to char and varchar. They need two bytes per character for storage, but can store any unicode character. Apart from that they behave exactly like char and varchar.
    Well, actually it is the other way around. Char and varchar are actually subsets of nchar and nvarchar. So every varchar column can implicitly be converted to nvarchar, but not the other way around!
    Beware of things like this

    where colVarchar = N'Unicode Text'

    This forces SQL Server to explicitly convert colVarchar to nvarchar, and convert, even if it is executed in the background by SQL Server, is a function. This means that the rows of the query have to be evaluated row by row (hidden RBAR) and no indexes on colVarchar can be used.

    where colNVarchar = 'Non-Unicode Text'

    This can be converted implicitly and indexes can be used, since varchar is a subset of nvarchar.

    Summary

  • Choose varchar over char also for small values.
  • Use varchar(max) only when you really, actually, without doubt need it.
  • Better still, use nvarchar.

    Table Variables vs. Temporary Tables

    Many articles have been written already about this topic, many online discussions carried out. I won’t recap all of the arguments here, but merely concentrate on the maybe lesser known facts (at least they were new to me not so long ago) about temp tables and table variables.

    Creation

    Temporary tables are created like every other table, only the name is prefixed by a hash sign.

    create table #tabX
      (ID int not null identity(1,1) primary key clustered
      ,Col1 bit not null
      ,Col2 datetime2 not null
      ,Col3 varchar(20) not null
      ,Col4 tinyint not null);

    Temp tables are not only created like other tables, you can also do (almost) everything with them that you would do with a normal table, like creating constraints, indexes and so forth.

    Table variables are not created but declared.

    declare @tabX as table (ID int not null identity(1,1) primary key clustered
          ,Col1 bit not null default 0
          ,Col2 datetime2 not null default sysdatetime()
          ,Col3 varchar(20) not null unique
          ,Col4 tinyint not null check (Col4 >=1 and Col4 <=5));

    As you can see, you can declare any constraint on a table variable, but you cannot create any indexes.

    Both tables can have constraints, only temporary tables can have indexes.

    Resources

    A common myth is that temporary tables are created in tempdb and table variables are created in memory. That’s nonsense. Both are created in tempdb.
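
    You can see this for yourself with a small sketch (run it as a single batch, since the table variable only lives for the duration of the batch):

    -- both objects show up as user tables in tempdb
    create table #temptab (ID int);
    declare @tabvar table (ID int);

    select name, type_desc, create_date
    from tempdb.sys.objects
    where type = 'U'
      and create_date >= dateadd(minute, -1, getdate());

    drop table #temptab;

    You should get (at least) two rows: one named #temptab plus a long suffix, and one with a short, system-generated # name for the table variable.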

    Scope

    Table variables only exist in the same scope as variables, i.e. within the same block of code, defined by begin… end or ended with go.
    Temporary tables exist in the scope of the session. So if you create a temporary table it will exist until you drop it or disconnect the session, but only for this session.
    That means you can create a temporary table, call a stored procedure and use that same table within the stored procedure (see the sketch at the end of this section). This can be handy sometimes, but it can also lead to very nasty behaviour if you forget to drop tables or lose track of where exactly your data is coming from.
    You can even create global temporary tables by prefixing the name with two hash signs:

    create table ##tabX
      (ID int ...

    This table is not exclusive to the session, but is available to all sessions until it is dropped or SQL Server is stopped.

    Table variables only exist in the code block scope, temporary tables in the session scope and global temporary tables in the server scope.
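
    Here is the promised sketch of sharing a temp table with a stored procedure (procedure name and the simplified #tabX are made up for this example):

    -- the procedure does not create #tabX itself, it relies on the calling session having done so
    create procedure dbo.usp_UseCallersTempTable
    as
    begin
        select ID, Col3 from #tabX;
    end
    go

    create table #tabX
        (ID   int identity(1,1) not null
        ,Col3 varchar(20)       not null);
    insert #tabX (Col3) values ('from the caller');

    exec dbo.usp_UseCallersTempTable;   -- sees the caller's #tabX

    drop table #tabX;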

    Naming

    Another common misbelief is that you need to generate unique names for temporary tables, since they are created in tempdb and no two objects may have the same name.
    Yes, objects must have different names, but in the case of temp tables SQL Server takes care of the uniqueness.
    To prove this, we open two sessions and in each session we create a temp table #tabX. Then we run

    select name, object_id from tempdb.sys.objects where name like '#tabX%'

    and have this result:

    SQL Server adds a unique number for each session to the name of the temporary table, thus taking care of the uniqueness of names. Our session can still use it as #tabX, not having to care about the suffix. Isn’t that awesome?
    Oh, and if we look at global temp tables

    No suffix here, so for global temp tables we really have to take care of uniqueness. Makes sense.

    Temporary tables don’t need to generate a unique name. SQL Server does that internally.

    Transaction scope

    A huge difference is the transaction scope of temp tables and table variables. Only for temporary tables are transaction logs recorded, which is essential for rolling back transactions.

    begin transaction
      insert #tabX (Col3, Col4) values ('New Value', 1);
      insert @tabX (Col3, Col4) values ('New Value', 1);
    rollback

    So if we run something like this, the insert into #tabX will be rolled back and the new row will be gone, while @tabX will still contain the new row.

    Only temp tables write transaction logs, and only their writes can be rolled back.

    Compilation scope

    A stored procedure that creates and/or drops a temporary table can never have a precompiled execution plan. Since a temporary table behaves like a normal table, only with a different scope, it is also treated as a normal object.
    If you were to create a real table and/or drop it after using it for storage or data aggregation in a stored procedure, the execution plan would also have to be recompiled every time.
    If you use a table variable, the execution plan can be precompiled, since no object is created, merely the image of one.

    Only table variables can let a stored procedure have precompiled, reusable execution plans.
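
    If you want to check this yourself, here is a rough sketch (procedure names are made up; the exact recompilation behaviour also depends on your SQL Server version):

    -- one procedure using a temp table, one using a table variable
    create procedure dbo.usp_WithTempTable
    as
    begin
        create table #t (ID int);
        insert #t (ID) values (1);
        select ID from #t;
        drop table #t;
    end
    go

    create procedure dbo.usp_WithTableVariable
    as
    begin
        declare @t table (ID int);
        insert @t (ID) values (1);
        select ID from @t;
    end
    go

    -- run both a few times, then compare plan_generation_num (recompiles) and execution_count
    exec dbo.usp_WithTempTable;
    exec dbo.usp_WithTableVariable;
    go

    select st.text, qs.plan_generation_num, qs.execution_count
    from sys.dm_exec_query_stats qs
    cross apply sys.dm_exec_sql_text(qs.sql_handle) st
    where st.objectid in (object_id('dbo.usp_WithTempTable'), object_id('dbo.usp_WithTableVariable'));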

    Performance

    There are many, many articles with benchmarks and performance test scripts. As far as I gather, the conclusion is there is no real difference. A common opinion is that temporary tables perform better with large data sets. I don’t really know if that is true, but for now I believe it.
    So for everything over 50,000 rows I rather tend towards a temp table, below that to a table variable.
    But the main decision for or against one or the other shouldn’t be based on performance but on other facts like scope, transaction handling, execution plans.

    No real performance difference. On larger volumes of data temp tables are believed to be better.

    Unique advantages

    Temporary tables

    Temporary tables can be used in select… into… statements. Just as real tables can be created with select… into… , temporary tables can also be created with it.

    select top(100)
       ID
      ,Col1
      ,Col2
      ,Col3
    into
      #tabX
    from
      dbo.MyTable
    where
      ID > 100
    ;

    This can be very handy at times, but it has the same problems as with a real table (see my post on select… into…).

    Table variables

    Since SQL Server 2008 you can pass table variables as parameters to stored procedures. This is a great functionality. It has some restrictions (http://www.sqlteam.com/article/sql-server-2008-table-valued-parameters) but it’s a really cool and powerful feature for complex application development.
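
    A minimal sketch of how that looks (type, procedure and column names are made up):

    -- a user-defined table type ...
    create type dbo.IdList as table
        (ID int not null primary key);
    go

    -- ... a procedure taking it as a readonly parameter ...
    create procedure dbo.usp_ProcessIds
        @Ids dbo.IdList readonly
    as
    begin
        select ID from @Ids;
    end
    go

    -- ... and a call passing a filled table variable of that type
    declare @myIds dbo.IdList;
    insert @myIds (ID) values (1), (2), (3);
    exec dbo.usp_ProcessIds @Ids = @myIds;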

    Conclusion

    As always with development, there is no real answer which one has the most advantages. It all depends on what you actually need. Base your decision on the advantages and disadvantages of both options, the result you want to accomplish, the scope and lifetime of the objects, the transactional scope.
    I tend to rather use table variables, because I am a developer who does a lot with stored procedures, so I like that the execution plan can still be precompiled with variables, and that I don’t need to worry about cleaning up, since at the end of the execution the variable is gone anyway.
