varchar(max), varchar(1), varchar(n), char(100), nchar, nvarchar

I recently had a discussion with developers, DBAs and data designers about the usage of the character data types in SQL Server. A topic, I thought, that has been discussed so many times that everything about it must be common knowledge. I was wrong. Many things still seem to be left to assumption. So let’s take a closer look at those types.

All of them store character data. All of them can be given a size (1 to 8000, or max). But what does that actually mean?

Char

Char stores one byte for each character of its size predicate, no matter what the content is. Char(50), for instance, always stores 50 bytes.

-- recreate the test table with ansi_padding on (the default)
if object_id('dbo.t1') is not null
  drop table dbo.t1;

set ansi_padding on

create table dbo.t1
  (col1 char(50) null
  ,col2 char(50) not null)

-- a 10-character value, a null, and an 11-character value
insert dbo.t1 values ('1234567890', '1234567890')
insert dbo.t1 values (null, '1234567890')
insert dbo.t1 values ('12345678901', '12345678901')

select
   col1
  ,datalength(col1) as datalength
  ,len(col1)      as stringlength
  ,col2
  ,datalength(col2) as datalength
  ,len(col2)      as stringlength
from
  dbo.t1

Here is what we get:

col1          datalength  stringlength  col2          datalength  stringlength
------------  ----------  ------------  ------------  ----------  ------------
1234567890    50          10            1234567890    50          10
NULL          NULL        NULL          1234567890    50          10
12345678901   50          11            12345678901   50          11

No surprises there. Char columns have a fixed data length independent of their content. When the content is shorter than the fixed length, the rest is padded with spaces.
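
A quick way to make the padding visible is to wrap the value in delimiters; the trailing spaces then show up before the closing bracket (a small sketch against the dbo.t1 table from above):

select '[' + col2 + ']' as padded_value
from dbo.t1

-- returns values like [1234567890                                        ]
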
Let’s do the same again but let’s set ansi_padding off:

if object_id('dbo.t1') is not null
  drop table dbo.t1;

set ansi_padding off

create table dbo.t1
  (col1 char(50) null
  ,col2 char(50) not null)

insert dbo.t1 values ('1234567890', '1234567890')
insert dbo.t1 values (null, '1234567890')
insert dbo.t1 values ('12345678901', '12345678901')

select
   col1
  ,datalength(col1) as datalength
  ,len(col1)      as stringlength
  ,col2
  ,datalength(col2) as datalength
  ,len(col2)      as stringlength
from
  dbo.t1

And what we get is this:

col1          datalength  stringlength  col2          datalength  stringlength
------------  ----------  ------------  ------------  ----------  ------------
1234567890    10          10            1234567890    50          10
NULL          NULL        NULL          1234567890    50          10
12345678901   11          11            12345678901   50          11

On the nullable column the datalength now matches the content; no padding occurs. On the mandatory column col2 the content is still padded with spaces up to 50.
SET ANSI_PADDING ON/OFF must be set at table creation time to have an effect; the setting is captured per column when the table is created.
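
Whether a column was created with padding on or off can be checked afterwards; sys.columns exposes the flag per column:

select name, is_ansi_padded
from sys.columns
where object_id = object_id('dbo.t1')
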

Varchar

Now let’s look at varchar columns. Varchar doesn’t preemptively store whatever size its predicate declares, but only stores as much as it needs, up to that size. To do that, varchar needs to additionally store the length of its content, which takes up an additional two bytes.
So varchar(50) stores 2 bytes + (n characters * 1 byte), up to 52 bytes. A 10-character string in a varchar(50) column therefore takes 2 + 10 = 12 bytes.

if object_id('dbo.t1') is not null
  drop table dbo.t1;

set ansi_padding on

create table dbo.t1
  (col1 varchar(50) null
  ,col2 varchar(50) not null)

insert dbo.t1 values ('1234567890', '1234567890')
insert dbo.t1 values (null, '1234567890')
insert dbo.t1 values ('12345678901', '12345678901')

select
   col1
  ,datalength(col1) as datalength
  ,len(col1)      as stringlength
  ,col2
  ,datalength(col2) as datalength
  ,len(col2)      as stringlength
from
  dbo.t1

And what we get is this:

col1          datalength  stringlength  col2          datalength  stringlength
------------  ----------  ------------  ------------  ----------  ------------
1234567890    10          10            1234567890    10          10
NULL          NULL        NULL          1234567890    10          10
12345678901   11          11            12345678901   11          11

Independent of the ansi_padding setting and the null/not null constraint, the datalength is always the same as the content length.
Looks like char can be just as variable as varchar with the appropriate settings, doesn’t it? No, it can’t. Don’t mistake datalength for storage length. Char still stores one byte for each character: char(50) reserves 50 bytes for every row that is created. Varchar only stores the data that is actually there, plus 2 bytes.
The ansi_padding setting may have some benefits in application development, but it does not change the storage needs of character columns.
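
If you want to see that on disk rather than take my word for it, here is a rough sketch: fill one char table and one varchar table with the same short strings and compare their sizes with sp_spaceused (the table names and the row count are arbitrary):

create table dbo.t_char (col1 char(50) not null)
create table dbo.t_varchar (col1 varchar(50) not null)

-- 100,000 identical 10-character rows in each table
insert dbo.t_char (col1)
select top (100000) '1234567890'
from sys.all_objects a cross join sys.all_objects b

insert dbo.t_varchar (col1)
select top (100000) '1234567890'
from sys.all_objects a cross join sys.all_objects b

-- the char table comes out considerably larger
exec sp_spaceused 't_char'
exec sp_spaceused 't_varchar'
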

Varchar(1) vs. Char(100)

So when to use which data type? Well, the answer seems pretty obvious:

  • 1 to 3 characters => use char
  • 4 to 7 characters => depends. If most of your data fills all 7 characters, use char. Storage-wise, the maintenance of varchar has a slight overhead, so char could be the wiser choice.
  • More than 7 characters => use varchar. Unless, of course, your data has a fixed length that never varies.
  • You often hear to make the cut at 15, 20 or even 50 characters. The fact is that even at as little as 7 or 8 characters, varchar is the more efficient choice.

Varchar(255)

A common practice I have seen many times is to use varchar(255) or even varchar(max) for just about everything. Hey, it only stores what is there, so who cares?
First, varchar(255). What is this artificial 255 threshold? I know, I am also old enough to remember database systems that only allowed 255 characters. And of course, 11111111 is 255, bla bla. But come on: 64-bit systems, 32 GB of RAM or more, terabytes of data storage, so please get rid of the 255. Or do your users really enter exactly 255 characters?

Varchar(max)

Then varchar(max). Varchar(max) is not really a character data type. It is treated as a large data field, just like varbinary(max), text, ntext and image.
Well, it is a hybrid. If the data fits on the data page, it is treated like varchar; if it does not fit, it is treated as a large object. This means that instead of the data itself, a 16-byte pointer is stored that points to a pointer tree containing pointers to the actual data. That costs at least one additional read operation, up to several: however many it takes to gather all the pointers to all the data.
Note that this does not depend on how much data is actually stored in the field, but solely on whether the data still fits in the data row on the page.
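
You can watch this split in the catalog: a table’s allocation units list in-row pages and LOB pages separately. A small sketch, assuming the table dbo.t1 has a varchar(max) column:

select au.type_desc, au.total_pages, au.used_pages
from sys.allocation_units au
join sys.partitions p
  on au.container_id in (p.hobt_id, p.partition_id)
where p.object_id = object_id('dbo.t1')

-- IN_ROW_DATA: pages holding the regular rows (including small varchar(max) values)
-- LOB_DATA:   pages holding varchar(max) values that did not fit in-row
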

We can force SQL Server to treat varchar(max) as in-row data by switching off the large value types out of row option (the default is on, which makes more sense).

sp_tableoption 'tablename', 'large value types out of row', 'OFF'

With this set to off, SQL Server will store the data in-row, which basically means that fewer rows fit on each data page. Use this option only when the large data is the main read and write target of the table. (For more on this topic see here.)
Oh, and even with this option off, the maximum value that is stored in-row is 8000 bytes, so why not use varchar(8000) instead of varchar(max)?
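
By the way, whether a table currently keeps its large value types out of row can be read from sys.tables:

select name, large_value_types_out_of_row
from sys.tables
where name = 't1'
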

External applications like SSIS treat varchar(max) as a LOB, which for SSIS means that no matter what data is actually in there, data transfer will always be done by swapping the data to disk and then reading it back from disk. Which makes data transfer a hell of a lot slower.

And of course you cannot create indexes on varchar(max) columns. But you can still use one in the INCLUDE clause of an index, if you like really big indexes (see the sketch after the next code block).
If after all of this you still feel the need to create varchar(max) columns, consider limiting the length with a constraint. That doesn’t really help with the issues described above, but it can limit the size of your database. Do you really want users to load 2 GB of data into every data row? Why should you allow them to if all you want to store is some small pieces of text?

create table dbo.t1
  (col1 varchar(50) null
  ,col2 varchar(50) not null
  ,col3 varchar(max) null)

-- limit the varchar(max) column to 32,000 characters
alter table dbo.t1 add constraint CHK_col3_MaxLength
  CHECK (len(col3) <= 32000)
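
And the INCLUDE behavior mentioned above, sketched against the same table (the index name IX_t1_col1 is just for illustration):

-- this would fail: varchar(max) cannot be an index key column
-- create index IX_t1_col3 on dbo.t1 (col3)

-- this works, but every col3 value is copied into the index
create index IX_t1_col1 on dbo.t1 (col1) include (col3)
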

This leads to the foremost reason, besides all the technical details, why you just shouldn’t set every varchar() column to 255 or 8000 or max:

The key to a well-designed, well-performing database is to maintain data integrity.
Don’t store data you don’t need.
Define all the data you do store as precisely as possible.

This means: limit the size of every column (not only character columns) to the size and data type you actually need. Not what is easy to implement or what you might need in 100 years’ time. Varchar() is the most flexible data type of them all and swallows almost any data you throw into it. The least we can do is limit the amount.

Nchar, nvarchar

Nchar and nvarchar are the Unicode equivalents of char and varchar. They need two bytes per character for storage, but can store any Unicode character. Apart from that, they behave exactly like char and varchar.
Well, actually it is the other way around: char and varchar are subsets of nchar and nvarchar. So every varchar column can implicitly be converted to nvarchar, but not the other way around!
Beware of things like this:

where colVarchar = N'Unicode Text'

This forces SQL Server to explicitly convert colVarchar to nvarchar, and convert, even when SQL Server executes it in the background, is a function. This means that the rows of the query have to be evaluated row by row (hidden RBAR, row by agonizing row) and no indexes on colVarchar can be used.

where colNVarchar = 'Non-Unicode Text'

This one can be converted implicitly and indexes can be used, since varchar is a subset of nvarchar.
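
To see both cases side by side, here is a minimal sketch; the table dbo.customers and the index IX_customers_name are made-up names for the illustration:

create table dbo.customers (name varchar(50) not null)
create index IX_customers_name on dbo.customers (name)

-- the column must be converted to nvarchar to match the N'' literal:
-- the index on name cannot be used for a seek
select name from dbo.customers where name = N'John'

-- the plain literal already is varchar: the index can be used
select name from dbo.customers where name = 'John'
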

Summary

  • Choose varchar over char also for small values.
  • Use varchar(max) only when you really, actually, without doubt need it.
  • Better still, use nvarchar.