Monthly Archives: March 2012

varchar(max), varchar(1), varchar(n), char(100) ,nchar, nvarchar

I recently had a discussion with developers, DBAs and Data Designers on the usage of the character data types in SQL Server. A topic, I thought, that has been discussed so many times that everything about it has to be common knowledge. I was wrong. Many things seem to be left to assumptions still. So let’s take a closer look at those types.

All of them store character data. All of them can be given a size (1 to 8000 or max). But what does that actually mean?

Char

Char stores one byte per character, or rather one byte for each character of its declared size. Char(50), for instance, always stores 50 bytes.

if object_id('dbo.t1') is not null
  drop table dbo.t1;

set ansi_padding on

create table dbo.t1
  (col1 char(50) null
  ,col2 char(50) not null)

insert dbo.t1 values ('1234567890', '1234567890')
insert dbo.t1 values (null, '1234567890')
insert dbo.t1 values ('12345678901', '12345678901')

select
   col1
  ,datalength(col1) as datalength
  ,len(col1)      as stringlength
  ,col2
  ,datalength(col2) as datalength
  ,len(col2)      as stringlength
from
  dbo.t1

Here is what we get:

col1          datalength  stringlength  col2          datalength  stringlength
------------  ----------  ------------  ------------  ----------  ------------
1234567890    50          10            1234567890    50          10
NULL          NULL        NULL          1234567890    50          10
12345678901   50          11            12345678901   50          11

No surprises there. Char columns have a fixed data length independent of their content. When the content is shorter than the fixed length, the rest is padded with spaces.
Let’s do the same again but let’s set ansi_padding off:

if object_id('dbo.t1') is not null
  drop table dbo.t1;

set ansi_padding off

create table dbo.t1
  (col1 char(50) null
  ,col2 char(50) not null)

insert dbo.t1 values ('1234567890', '1234567890')
insert dbo.t1 values (null, '1234567890')
insert dbo.t1 values ('12345678901', '12345678901')

select
   col1
  ,datalength(col1) as datalength
  ,len(col1)      as stringlength
  ,col2
  ,datalength(col2) as datalength
  ,len(col2)      as stringlength
from
  dbo.t1

And what we get is this:

col1          datalength  stringlength  col2          datalength  stringlength
------------  ----------  ------------  ------------  ----------  ------------
1234567890    10          10            1234567890    50          10
NULL          NULL        NULL          1234567890    50          10
12345678901   11          11            12345678901   50          11

On a nullable column the datalength is the same as the length of the content. No padding occurs. On the mandatory column col2 the content is still padded with spaces up to 50.
SET ANSI_PADDING ON/OFF must be set at table creation time to have an effect.
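By the way, you can check afterwards which setting a column was created with: sys.columns exposes it as the is_ansi_padded flag.

select name, is_ansi_padded
from sys.columns
where object_id = object_id('dbo.t1');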

Varchar

Now let's look at varchar columns. Varchar doesn't preemptively store whatever size is declared in its predicate, but only stores as much as it needs, up to that size. To do that, varchar has to store the length of its content as well, which takes up an additional two bytes.
So varchar(50) always stores 2 bytes + (n characters * 1 byte), i.e. up to 52 bytes.

if object_id('dbo.t1') is not null
  drop table dbo.t1;

set ansi_padding on

create table dbo.t1
  (col1 varchar(50) null
  ,col2 varchar(50) not null)

insert dbo.t1 values ('1234567890', '1234567890')
insert dbo.t1 values (null, '1234567890')
insert dbo.t1 values ('12345678901', '12345678901')

select
   col1
  ,datalength(col1) as datalength
  ,len(col1)      as stringlength
  ,col2
  ,datalength(col2) as datalength
  ,len(col2)      as stringlength
from
  dbo.t1

And what we get is this:

col1          datalength  stringlength  col2          datalength  stringlength
------------  ----------  ------------  ------------  ----------  ------------
1234567890    10          10            1234567890    10          10
NULL          NULL        NULL          1234567890    10          10
12345678901   11          11            12345678901   11          11

Independent of the ansi_padding setting and the null/not null constraint, the datalength is always the same as the content length.
Looks like char can be just as variable as varchar with the appropriate settings, doesn’t it? No, it can’t. Don’t mistake datalength for storage length. Char still stores a byte for each character. Char(50) reserves 50 bytes for every row that is created. Varchar only stores the data that is actually there. Plus 2 bytes.
The ansi_padding setting may have some benefits in application development, but it does not change the storage needs of character columns.
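To see the difference in actual storage (and not just in datalength), here is a quick sketch using sp_spaceused on two hypothetical tables that hold the same 10,000 short values, created with the default ansi_padding on:

set ansi_padding on

create table dbo.t_char    (col1 char(50)    not null)
create table dbo.t_varchar (col1 varchar(50) not null)

-- fill both tables with the same 10-character value
insert dbo.t_char (col1)
select top (10000) '1234567890' from sys.all_objects a cross join sys.all_objects b

insert dbo.t_varchar (col1)
select top (10000) '1234567890' from sys.all_objects a cross join sys.all_objects b

-- the char table needs roughly 50 bytes per value,
-- the varchar table roughly 12 (10 characters + 2 length bytes)
exec sp_spaceused 'dbo.t_char'
exec sp_spaceused 'dbo.t_varchar'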

Varchar(1) vs. Char(100)

So when to use which data type? Well, the answer seems pretty obvious:

  • 1 to 3 characters => use char
  • 4 to 7 characters => depends. If most of your data fills the 7 characters, use char. Storage-wise the maintenance of varchar has a slight overhead, so char could be the wiser choice.
  • More than 7 characters => use varchar. Unless, of course, your data has a fixed length that never varies.
  • You often hear to make the cut at 15, 20 or even 50 characters. Fact is, that even at as little as 7 or 8 characters varchar is the more efficient choice.

    Varchar(255)

    A common practice I have seen many times is to use varchar(255) or even varchar(max) just for everything. Hey, it only stores what is there, so who cares?
    First, varchar(255). What is this artificial 255 threshold? I know, I am also old enough to remember database systems that only allowed 255 characters. And of course, 11111111 in binary is 255, bla bla. But come on: 64-bit systems, 32 GB of RAM or more, terabytes of data storage, so please get rid of the 255. Or do your users really enter exactly 255 characters?

    Varchar(max)

    Then varchar(max). Varchar(max) is not really a character data type. It is treated as a large data field, just like varbinary(max), text, ntext and image.
    Well, it is a hybrid. If the data fits on the data page it is treated as varchar; if it does not fit, it is treated as a large object. In that case, instead of the data itself a 16-byte pointer is stored that points to a pointer tree containing pointers to the actual data. That means at least one additional read operation, up to however many it takes to gather all the pointers to all the data.
    Note that this does not depend on how much data is actually stored in the field, but solely on whether the data still fits in the data row on the page.
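    If you want to see where the data of a table actually ends up, one way (a sketch, assuming the table is dbo.t1) is to look at its allocation units: in-row data shows up as IN_ROW_DATA, pushed-out varchar(max) data as LOB_DATA.

    select
       object_name(p.object_id) as table_name
      ,au.type_desc             as allocation_unit   -- IN_ROW_DATA vs. LOB_DATA
      ,au.used_pages
    from sys.allocation_units au
    join sys.partitions p
      on au.container_id in (p.hobt_id, p.partition_id)
    where p.object_id = object_id('dbo.t1')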

    Whether SQL Server keeps varchar(max) values in-row is controlled by the large value types out of row table option. The default is off, which means values that fit into the row are stored in-row; switched on, the values are always stored out of row.

    sp_tableoption 'tablename', 'large value types out of row', 'OFF'

    With this set to off, SQL Server will store the data in-row, which basically means that fewer rows fit on each data page. Store the data in-row only when the large data is the main read and write target of the table. (For more on this topic see here)
    Oh, and even with this option off the maximum value that is stored in-row is 8000 bytes, so why not use varchar(8000) instead of varchar(max)?

    External applications like SSIS treat varchar(max) as a LOB, which means for SSIS that no matter what data is actually in there, the data transfer will always be done by swapping the data to disk and then reading it back from disk. Which makes data transfer a hell of a lot slower.

    And of course you cannot create indexes on varchar(max) columns. But you can still use it in the include-clause of an index. If you like really big indexes.
    If after all of this you still feel the need to create varchar(max) columns, consider limiting the length with a constraint. That doesn't really help with the issues described above, but it could limit the size of your database. Do you really want users to load 2 GB of data into every data row? Why should you allow them to if all you actually want to store is some reasonably small text?

    create table dbo.t1
      (col1 varchar(50) null
      ,col2 varchar(50) not null
      ,col3 varchar(max) null)
    
    alter table dbo.t1 add constraint CHK_col3_MaxLength
      CHECK (len(col3) <= 32000)

    Besides all the technical details, this leads to the foremost reason why you just shouldn't set every varchar() column to 255 or 8000 or max:

    The key to a well designed, well performing database is to maintain data integrity.
    Don’t store data you don’t need.
    Define every data you store as well as possible.

    This means: limit every column (not only character columns) to the size and data type you actually need, not what is easy to implement or what you might need in 100 years' time. Varchar() is the most flexible data type of them all and devours almost any data you throw at it. The least we can do is limit the amount.

    Nchar, nvarchar

    Nchar and nvarchar are the Unicode equivalents of char and varchar. They need two bytes per character for storage, but can store any Unicode character. Apart from that they behave exactly like char and varchar.
    Well, actually it is the other way around: char and varchar are subsets of nchar and nvarchar. So every varchar value can implicitly be converted to nvarchar, but not the other way around!
    Beware of things like this

    where colVarchar = N'Unicode Text'

    This forces SQL Server to implicitly convert colVarchar to nvarchar, and convert, even if it is executed in the background by SQL Server, is a function. This means that the rows of the query have to be evaluated row by row (hidden RBAR) and indexes on colVarchar cannot be used.

    where colNVarchar = 'Non-Unicode Text'

    This can be converted implicitly and indexes can be used, since varchar is a subset of nvarchar.
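    Here is a minimal sketch of the whole effect (table, column and index names are made up): the nvarchar literal forces the varchar column to be converted, so the index on it can no longer be used efficiently; the plain varchar literal compares without converting the column.

    create table dbo.Customers
      (CustomerID   int         not null identity(1,1) primary key
      ,CustomerName varchar(50) not null)

    create index ix_Customers_Name on dbo.Customers (CustomerName)

    -- CustomerName is implicitly converted to nvarchar; the index loses its usefulness
    select CustomerID from dbo.Customers where CustomerName = N'Smith'

    -- varchar literal: no conversion of the column, the index can be used
    select CustomerID from dbo.Customers where CustomerName = 'Smith'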

    Summary

  • Choose varchar over char also for small values.
  • Use varchar(max) only when you really, actually, without doubt need it.
  • Better still, use nvarchar.

    Table Variables vs. Temporary Tables

    Many articles have been written already about this topic, many online discussions carried out. I won’t recap all of the arguments here, but merely concentrate on the maybe lesser known facts (at least they were new to me not so long ago) about temp tables and table variables.

    Creation

    Temporary tables are created like every other table, only the name is prefixed by a hash sign.

    create table #tabX
      (ID int not null identity(1,1) primary key clustered
      ,Col1 bit not null
      ,Col2 datetime2 not null
      ,Col3 varchar(20) not null
      ,Col4 tinyint not null);

    Temp tables are not only created like other tables, you can also do (almost) everything with them that you would do with a normal table, like creating constraints, indexes and so forth.

    Table variables are not created but declared.

    declare @tabX as table (ID int not null identity(1,1) primary key clustered
          ,Col1 bit not null default 0
          ,Col2 datetime2 not null default sysdatetime()
          ,Col3 varchar(20) not null unique
          ,Col4 tinyint not null check (Col4 >=1 and Col4 <=5));

    As you can see, you can declare any constraint on a table variable, but you cannot create any indexes.
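    For instance, nothing stops you from adding another index to the temp table after it has been created; the equivalent statement against a table variable simply isn't valid syntax.

    create nonclustered index ix_tabX_Col3 on #tabX (Col3)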

    Both tables can have constraints, only temporary tables can have indexes.

    Resources

    A common myth is that temporary tables are created in tempdb and table variables are created in memory. That’s nonsense. Both are created in tempdb.

    Scope

    Table variables only exist in the same scope as other variables, i.e. within the same batch of code (ended with go).
    Temporary tables exist in the scope of the session. So if you create a temporary table it will exist until you drop it or the session is disconnected, but only for this session.
    So you can create a temporary table, call a stored procedure and use that same table within the stored procedure. This can be handy sometimes, but it can also lead to very nasty behaviour if you forget to drop tables or lose track of where exactly your data is coming from.
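    A small sketch of that session scope (the procedure name is made up and #tabX is simplified): the procedure references a temp table it does not create itself, and still finds it at execution time because the calling session owns it.

    create procedure dbo.usp_UseCallersTempTable
    as
      select ID, Col3 from #tabX   -- resolved at execution time, not at creation time
    go

    create table #tabX
      (ID   int         not null identity(1,1) primary key clustered
      ,Col3 varchar(20) not null)

    insert #tabX (Col3) values ('created by the caller')

    exec dbo.usp_UseCallersTempTable   -- works, #tabX lives in this session

    drop table #tabX                   -- or it stays around until the session ends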
    You can even create global temporary tables by prefixing the name with two hash signs:

    create table ##tabX
      (ID int ...

    This table is not exclusive to the session, but is available to all sessions until it is dropped or until the session that created it ends and no other session references it any more.

    Table variables only exist in the batch scope, temporary tables in the session scope and global temporary tables are visible to all sessions.

    Naming

    Another common misbelief is that you need to generate unique names for temporary tables, since they are created in tempdb and no two objects may have the same name.
    Yes, objects must have different names, but in the case of temp tables, SQL Server takes care of the uniqueness.
    To prove this, we open two sessions and in each session we create a temp table #tabX. Then we run

    select name, object_id from tempdb.sys.objects where name like '#tabX%'

    and the result shows two objects: SQL Server appends a unique suffix for each session to the name of the temporary table, thus taking care of the uniqueness of names. Our session can still use it as #tabX, not having to care about the suffix. Isn't that awesome?
    Oh, and if we look at global temp tables the same way: no suffix there, so for global temp tables we really do have to take care of uniqueness ourselves. Makes sense.

    You don't need to generate unique names for temporary tables. SQL Server does that internally.

    Transaction scope

    A huge difference is the transaction scope of temp tables and table variables. Transaction log records are written only for temporary tables, and that is what makes it possible to roll their changes back.

    begin transaction
      insert #tabX (Col3, Col4) values ('New Value', 1);
      insert @tabX (Col3, Col4) values ('New Value', 1);
    rollback

    So if we run something like this, the insert into #tabX is rolled back and the new row is gone, while @tabX still contains the new row.

    Only temp tables write transaction log records, so only their writes can be rolled back.

    Compilation scope

    A stored procedure that creates and/or drops a temporary table can never have a precompiled execution plan. Since a temporary table behaves like a normal table, only with a different scope, it is also treated as a normal object.
    If you were to create and/or drop a real table after using it for storage or data aggregation in a stored procedure, the execution plan would also have to be recompiled every time.
    If you use a table variable, the execution plan can be precompiled, since no real object is created, merely the image of one.

    Only table variables can let a stored procedure have precompiled, reusable execution plans.

    Performance

    There are many, many articles with benchmarks and performance test scripts. As far as I gather, the conclusion is there is no real difference. A common opinion is that temporary tables perform better with large data sets. I don’t really know if that is true, but for now I believe it.
    So for everything over 50,000 rows I rather tend towards a temp table, below that to a table variable.
    But the main decision for or against one or the other shouldn’t be based on performance but on other facts like scope, transaction handling, execution plans.

    No real performance difference. On larger volumes of data temp tables are believed to be better.

    Unique advantages

    Temporary tables

    Temporary tables can be used in select… into… statements. Just as real tables can be created with select… into… , temporary tables can also be created with it.

    select top(100)
       ID
      ,Col1
      ,Col2
      ,Col3
    into
      #tabX
    from
      dbo.MyTable
    where
      ID > 100
    ;

    This can be very handy at times, but it has the same problems as with a real table (see my post on select… into…).

    Table variables

    Since SQL Server 2008 you can pass table variables as parameters to stored procedures. This is a great functionality. It has some restrictions (http://www.sqlteam.com/article/sql-server-2008-table-valued-parameters) but it’s a really cool and powerful feature for complex application development.
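    The mechanics look roughly like this (all names are made up): the variable has to be declared from a user-defined table type, and the parameter has to be marked readonly.

    -- a user-defined table type is required for table-valued parameters
    create type dbo.IntList as table (ID int not null primary key)
    go

    create procedure dbo.usp_ProcessIds
      @Ids dbo.IntList readonly        -- table-valued parameters are always readonly
    as
      select ID from @Ids
    go

    declare @MyIds dbo.IntList
    insert @MyIds (ID) values (1), (2), (3)
    exec dbo.usp_ProcessIds @Ids = @MyIds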

    Conclusion

    As always with development, there is no real answer which one has the most advantages. It all depends on what you actually need. Base your decision on the advantages and disadvantages of both options, the result you want to accomplish, the scope and lifetime of the objects, the transactional scope.
    I rather tend to use table variables, because I am a developer who does a lot with stored procedures, so I like that the execution plan can still be precompiled with variables and that I don't have to remember to clean up, since at the end of the execution the variable is gone anyway.

    24 Hours of PASS

    March 21st brings a new edition of 24 Hours of PASS. 24 back-to-back one hour sessions of free SQL Server training by the leading experts in the world.
    Don’t miss out on that and register now!

    select… into…

    A pattern I see frequently used, especially in conjunction with temporary tables, is this:

    select
       ID
      ,Col1
      ,Col2
      ,Col3
    into
      dbo.tab2
    from
      dbo.tab1
    where
      ID > 100

    This syntax automatically creates a table for the result set and inserts the records of the result set into it. It is fast and efficient. And it can be really handy, especially in complex stored procedures or scripts, storing data from complex queries for other complex queries.
    What makes it so popular is the possibility to create not only permanent tables, but also temporary tables:

    select
       ID
      ,Col1
      ,Col2
      ,Col3
    into
      #tmpTab
    from
      dbo.tab1
    where
      ID > 100

    You need not define or create the table beforehand; in fact, it must not exist beforehand.
    This is all very neat, and sometimes I use it myself, but only sometimes and never in stored procedures. More likely in some ad-hoc scripts.

    And here is why:

    Schema Locks

    select… into… always takes a schema lock on the table it is selecting from. Even with (nolock) it needs a schema lock to get the schema for the table it creates. A schema lock may not sound too bad and is actually not that big a deal.
    But the problem is not so much the lock itself or what it does, but what it takes to get it. Imagine an OLTP system with a high data load, a table with approx. 1000 records inserted per second. And you have a select… into… from this table, possibly joined with two or three other tables, and your statement, even with the nolock hint, needs to be able to place a schema lock on all four tables at once.
    This means the query has to wait for a window in which it can actually place these locks. Sometimes nanoseconds, sometimes a few milliseconds, in the worst case, although unlikely, even seconds.
    While this may not seem much it can amount to quite a lot in the long run. And what you get in any case are highly different, unreliable execution times.
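    If you want to watch this happening, one way (a sketch; 53 stands for whichever session runs the statement) is to start the select… into… inside an open transaction and inspect its locks from a second session:

    select
       resource_type
      ,request_mode        -- expect Sch-S on the source tables, Sch-M on the new table
      ,request_status
      ,resource_associated_entity_id
    from sys.dm_tran_locks
    where request_session_id = 53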

    Crappy Schema

    The created table gets its schema from the selected data. So, for instance, if the longest selected string value for column X has 4 characters, the column will be varchar(4). Might be ok, but when I later need space for longer values, I am in trouble.
    Example:

    select
       1      as ID
      ,'XXX'    as Col1
      ,getdate()  as Col2
      ,200.23   as Col3
      ,N'djjdh' as Col4
    into
      tabInto

    This gives you this schema

    CREATE TABLE dbo.tabInto
      (ID int NOT NULL
      ,Col1 varchar(3) NOT NULL
      ,Col2 datetime NOT NULL
      ,Col3 numeric(5, 2) NOT NULL
      ,Col4 nvarchar(5) NOT NULL)

    A heap with no primary key, all the columns are exactly the maximum length the data had, and all columns are not null, just because there were no nulls in the data.
    Of course you can create your keys and indexes afterwards, but why let the engine work twice to create a decent table (a sketch of that cleanup follows below). And we all know what increasing the size of not null columns does to a table (basically tear it apart).
    If you just leave it as the heap it is, the Query Optimizer will have a hard time working with it.
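    For completeness, this is roughly what the after-the-fact cleanup looks like for the table above; exactly the double work a properly defined table would have avoided.

    -- widen the column and add the missing primary key after the fact
    alter table dbo.tabInto alter column Col1 varchar(50) not null
    alter table dbo.tabInto add constraint PK_tabInto primary key clustered (ID)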

    And speaking of execution plans:

    Non-Reuseable Execution Plans

    If you have a create table-command in your stored procedure, which you implicitly have with select… into…, the procedure has to be recompiled for every execution. It will never get a reusable execution plan. This increases the execution time of the procedure immensely. And it doesn’t matter if it is a real table or a temporary table, they are treated the same way.

    Whereas if you use a well-defined table variable and insert… select… into it, you will get a reusable execution plan for your procedure, and, if you defined all the data types correctly and used constraints where possible, you will also get the best possible execution plan for your data.
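    As a sketch (column types are assumptions, matching the earlier examples), that pattern looks like this:

    declare @tmpTab table
      (ID   int         not null primary key
      ,Col1 bit         not null
      ,Col2 datetime2   not null
      ,Col3 varchar(20) not null)

    insert @tmpTab (ID, Col1, Col2, Col3)
    select ID, Col1, Col2, Col3
    from dbo.tab1
    where ID > 100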

    SQL Saturday in Dublin, March 24th

    Just a quick reminder for all you Developers in and around Dublin: on Saturday, March 24th, SQLSaturday #105 will be hosted in Dublin. Make sure to visit, there will be some great sessions. And it’s free.
    sqlsaturday#105

    Pitfalls of SQL Server Replication

    If you are planning replication for one of your SQL Server databases there are some pitfalls you will encounter and need to solve. There are some little things you need to take care of, which can be hard to find, since the error messages SQL Server gives may describe the error but don't help at all in finding its source or how to resolve it.

    1. Inform yourself about replication.
    Before you even start planning Replication read up on the principal concepts: Snapshot replication, Transactional Replication, Merge Replication. How they work. What resources they need.

    2. Plan your strategy for identity columns.
    You need to plan your replication especially carefully if you plan to let users update the replicated data. Then you need to take special care of how both databases – master data and replicated data – handle identity columns.

    3. Only replicate the data you really need replicated.
    Don't just replicate your whole database. This might seem the easiest solution, especially if your database is small. But databases tend to grow, and what is a small database at one customer's instance can be a huge database at another customer's instance. Always plan your databases (and therefore also your replication) to be scalable, so they can grow without performance loss.
    This said, replicate only the tables, views, functions and stored procedures you need. Replication adds additional stress to the system, so minimize that stress to the absolutely necessary.

    4. Check also what the application needs.
    Depending on how you plan to use the replicated data, what you actually need to replicate will differ a lot. It is not only necessary to check all the dependencies within the database and maybe even with other databases, but also to check the applications which you are planning to use on the replicated data.
    For instance, the database and its views, procedures and functions may not show any dependency on a certain user table or lookup table, but the application you are trying to run on this data absolutely needs that data to work properly (or at all).
    So always check with application developers what data apart from the obvious database dependencies you also will need to replicate (or substitute in some other way).

    5. Check your objects for their “replication-fitness”.
    There are some (simple) prerequisites to take care of before replicating a table:

    a. It needs a Primary Key (for transactional replication).
    You might shrug at this point, but there are databases out there, especially legacy databases, where not every table has a primary key. If you need to replicate such a table, you need to create a primary key. In many cases you will find unique columns on such tables, either defined by a unique constraint or index, or undefined but clearly detectable. Talk to the application developers about what might be unique, whether adding a column might work, and what consequences a primary key might have.
    If there is no way of adding a Primary Key, you cannot replicate that table and you might need to find another way to copy the data – some automated ETL process maybe.
    b. Take care of identity columns
    When you add a table with an identity column to the replicated articles, SQL Server automatically marks that column as NOT FOR REPLICATION. This is sufficient if you don't let the users update the replicated data (read-only database). If you do, you need to take special care of the identity columns (see above; a small sketch of such a column follows at the end of this item).
    c. Check for triggers.
    Everyone always forgets about triggers. An enabled trigger on the target database can fail the whole replication process. Check the triggers, see what they do, and decide whether you need them on the target database. You might even need to alter them because you only need parts of what they do.
    Tip: If you have a trigger that on insert into Table1 generates data for Table2, there are two ways of handling this:
    i. Replicate Table1 and Table2, disable the trigger on the target database.
    ii. Replicate only Table1 and leave the trigger enabled on the target database. So every replicated data in Table1 fires the trigger and fills Table2.
    Method (i) lets all the resources be used on the master database: first the trigger fires, then both tables need to be distributed. Method (ii) balances the load by letting the target database do some of the work. Choose for yourself which to use. It just goes to show that there is no straight answer on what to do with triggers. Each one has to be checked on its own.
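    As mentioned under b. above, this is roughly what a replication-aware identity column looks like (hypothetical table; SQL Server applies the equivalent change itself when the article is added):

    create table dbo.Orders
      (OrderID   int identity(1,1) not for replication not null primary key
      ,OrderDate datetime not null)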

    6. You need sufficient rights.
    When setting up replication you need to have Windows users available that have sufficient rights. We are talking Windows/Domain users here, not SQL Server users:
    a. Your user needs rights to set the executing user context of a service, create a Network share, maybe install components on the servers.
    b. A user for each database that can reach the database snapshot share.
    c. A user for each database under which the SQL Agent job can run.
    d. A user with SQL Server sysadmin rights.

    7. SQL Server Replication needs to be installed on both servers.
    Don't laugh! This happens more often than you think. Everything is prepared, all the objects and user rights are checked, you want to start setting up the replication, and the first message you get is an error telling you that the replication components are not installed.

    So get your Setup files ready and install SQL Server Replication. It is a straightforward process and shouldn't present any problems, as long as you have sufficient rights to do so (and the Setup files handy).
    Important: even though setup doesn't tell you to, you must restart the SQL Agent service after the setup is finished. Otherwise replication won't work. And SQL Server will never tell you why, but will throw cryptic error messages that some settings are missing or incorrect.

    8. Run SQL Agent with a valid user context.
    Database snapshots are distributed via a network share that both servers can access. Depending on your security configuration (and best practice implementations) you might need to set SQL Agent service to run under a specific user context.
    Important: Don’t change the service’s user context in the computer management screen. Even though this screen automatically gives the user account the Logon as Service right, SQL Server and SQL Agent need some additional rights to work correctly.
    Always use SQL Configuration Manager to change the service’s user context. This will set all the rights that SQL Server needs.

    9. Replication needs TCP/IP name resolution.
    The replication setup screens let you connect to the distribution server via its IP address. No problem. You can set up everything, set the user rights and so on. Only when replication tries to start its SQL Agent jobs will it tell you that it can't reach the distribution server, because it has no name. So always use the server name instead of the IP address.
    Tip: Depending on your network configuration and name resolution configuration it might not be possible to reach the server by name. In such a case you can edit the hosts file in %WINDOWSDIR%\system32\drivers\etc . Here enter the IP and the name and fool SQL Server into having a name to use.

    10. The owner of the database has to be valid.
    This issue only happens if you have an environment with, let’s say, test and production servers with different SQL users and databases get copied across different environments. It is a quite obvious and rare issue, but it happens, so I mention it.
    Even though everything else works and the users seem to have sufficient rights, if the owner’s SID is not recognized, replication won’t work. This can be solved simply by resetting the ownership of the database:

    alter authorization on database::test to mySAUser;

    Even though the user name might be the same, the SID would be different and therefore we need to make this change.
