Data Modelling

Bit columns

I am not a big fan of the bit data type. It’s one of those things that seem useful and helpful, but in the end cause trouble, or at least annoyance. I’m not saying it shouldn’t be used, but one should be aware of what using it entails.
Like integer, bit is a very simple data type, so there is no problem with processing time. It only takes up one byte, which is the least any data type can use, so I’m also okay with that (kind of: why does something that actually is just a bit, is even called a bit, need a whole byte? Although, to be fair, SQL Server packs up to eight bit columns of a table into the same byte). But that’s about it. Everything else about bit columns (or their usage) annoys me.

Nullable bit columns

What’s the sense of that? A bit column makes sense as a flag, defining whether a state is true or false. If a third state is needed, like ‘undefined’ or ‘who cares’, use another data type (see further down).

Bit columns without a default

I admit, sometimes the state of something isn’t known or doesn’t matter at insert time. To my understanding that is a third state (see above), but let’s assume for now that it is not, and that we simply don’t want to include the column in every insert where it isn’t necessary. Then please define a default of 0 = false and make the column NOT NULL.
Just because you’re too lazy to write the column into every insert statement doesn’t mean it has to be NULL.
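
A minimal sketch of that (table and column names are made up):

    create table dbo.tablename
      (intColumn int not null
      ,bitColumn bit not null
        constraint DF_tablename_bitColumn default (0))

    -- inserts that leave the flag out now get 0 = false instead of NULL
    insert dbo.tablename (intColumn) values (1)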

Not 0 is 1

Everything you insert into a bit column that is not 0 (zero) automatically becomes 1 (true). So even if you insert -42, it will be stored as 1. That’s fine, and part of the data type, but it’s contrary to some other programming languages.
Let’s try this:

    declare @a bit = -100
    select @a as bitVariable    -- returns 1: any non-zero value becomes true
    select case when @a = -100 then 'True' else 'False' end as bitVarCheck    -- returns 'False'


So the assignment turns -100 into TRUE, but the equality check on -100 returns FALSE: in the comparison the bit variable is implicitly converted to int, giving 1, which does not equal -100. That’s inconsistent. I wouldn’t expect the second query to return TRUE, but I would also expect everything except 1 to be stored as FALSE when inserting data, not the other way round. Then again, it is called bit and not boolean, so the behaviour is arguably correct (0 = no value, 1 = (any) value). But still.

And on a related topic:

    select * from tablename where bitColumn <> 0;

There are only two possible values, so if it is not 0 it is 1. It’s the same as:

    select * from tablename where bitColumn = 1;

An equality predicate is more exact and cheaper to evaluate than an inequality, and it helps the query optimizer settle on an execution plan faster. I’ll admit that the difference might be minuscule in a case like this, but still: do it right. For your karma.

Indexes on bit columns

Again, bit columns only hold two possible values, so an index on a bit column distinguishes exactly two groups of data. Most likely the distribution will be something like 60-40 or 70-30, and such an index will never be chosen because either value returns far too many rows. Unless you have a distribution of 95-5 and always query for the 5% value, an index on a bit column makes no sense.
A bit column in an index can still make sense as

  • part of a multi-column index, with more selective columns in front of it, or
  • the where-clause of a filtered index, as an additional criterion that keeps the index small (see the sketch below).

The query optimizer sometimes even suggests an index on a bit column. But that’s just because it doesn’t know the actual statistics of the column. So don’t create indexes on bit columns, unless you really only query the 5%.
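
A made-up example of the filtered-index variant, assuming almost all rows have bitColumn = 1 and the queries always look for the few rows with 0:

    -- hypothetical names: only the rare bitColumn = 0 rows are indexed,
    -- so the index stays small and stays selective
    create nonclustered index IX_tablename_bitColumn0
      on dbo.tablename (intColumn)
      where bitColumn = 0;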

Data conversion

This is my biggest issue with bit columns: almost every query results in a data conversion.

    select * from tablename where bitColumn = 1;
    
    insert into	
    	dbo.tablename
    	(intColumn
    	,nvarcharColumn
    	,bitColumn)
    values
    	(1
    	,N'1'
    	,1)
    
    update
    	dbo.tablename
    set
    	bitColumn = 1
    where
    	intColumn > 10
    
    select 
    	case when bitColumn = 1 then 'True' else 'False' end
    from 
    	dbo.tablename

This is how we would write those queries, right? In stored procedures, in views, wherever. Each of these queries forces SQL Server to do a data conversion. As in most (all?) other programming languages, the default type for plain whole-number literals is integer, so all the 1s are interpreted as integers. Only when the actual plan is compiled does SQL Server realise they are actually meant to be bit, and it has to convert them.
In the above cases this is one conversion per statement, and converting integer to bit is not the most costly conversion. But still. To avoid it, state the data type explicitly:

    select * from tablename where bitColumn = cast(1 as bit);
    
    select 
    	case when bitColumn = cast(1 as bit) then 'True' else 'False' end
    from 
    	dbo.tablename
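
The same trick applies to the insert and update statements; a sketch of the update:

    update
        dbo.tablename
    set
        bitColumn = cast(1 as bit)    -- explicitly bit, nothing left to convert
    where
        intColumn > 10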

Database Bit in other programming languages

Every programming language interprets a database bit differently. Some map it to true and false, some return it as 1 and 0. Some return it as a byte, some as a boolean.
I know, once you’ve mapped it, you know how it behaves and can use it accordingly. But still.

Alternative to bit

What I like to use instead of bit columns is tinyint with a check constraint that only allows the values 0 and 1 (see the sketch after this list).
The advantages are obvious:

  • Integers are just as simple as bits
  • They also need a byte for storage, but that byte can actually store eight bits’ worth of information, not just a single flag
  • If for some reason at some point in my development I need a third state or even more, I don’t need to change the data type of the column or any of my queries. I simply change the check constraint.
  • The queries I write won’t cause data conversion, because intColumn = 1 is really meant to be an integer.
  • The check constraint will give an error if I try to insert 100 or -42. Which is more consistent.
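
A minimal sketch of the pattern (table and column names are made up):

    create table dbo.tablename
      (intColumn int not null
      ,flagColumn tinyint not null
        constraint DF_tablename_flagColumn default (0)
        constraint CHK_tablename_flagColumn check (flagColumn in (0, 1)))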

varchar(max), varchar(1), varchar(n), char(100), nchar, nvarchar

I recently had a discussion with developers, DBAs and data designers about the usage of the character data types in SQL Server. A topic, I thought, that has been discussed so many times that everything about it must be common knowledge. I was wrong. Many things still seem to be left to assumption. So let’s take a closer look at those types.

All of them store character data. All of them can be given a size (1 to 8000, or max). But what does that actually mean?

Char

Char stores one byte for every character of its declared size. Char(50), for instance, always stores 50 bytes.

    if object_id('dbo.t1') is not null
      drop table dbo.t1;
    
    set ansi_padding on
    
    create table dbo.t1
      (col1 char(50) null
      ,col2 char(50) not null)
    
    insert dbo.t1 values ('1234567890', '1234567890')
    insert dbo.t1 values (null, '1234567890')
    insert dbo.t1 values ('12345678901', '12345678901')
    
    select
       col1
      ,datalength(col1) as datalength
      ,len(col1)      as stringlength
      ,col2
      ,datalength(col2) as datalength
      ,len(col2)      as stringlength
    from
      dbo.t1

Here is what we get:

    col1         datalength  stringlength  col2         datalength  stringlength
    -----------  ----------  ------------  -----------  ----------  ------------
    1234567890   50          10            1234567890   50          10
    NULL         NULL        NULL          1234567890   50          10
    12345678901  50          11            12345678901  50          11

No surprises there. Char columns have a fixed data length, independent of their content. When the content is shorter than the fixed length, the rest is padded with spaces.
Let’s do the same again, but this time with ansi_padding off:

    if object_id('dbo.t1') is not null
      drop table dbo.t1;
    
    set ansi_padding off
    
    create table dbo.t1
      (col1 char(50) null
      ,col2 char(50) not null)
    
    insert dbo.t1 values ('1234567890', '1234567890')
    insert dbo.t1 values (null, '1234567890')
    insert dbo.t1 values ('12345678901', '12345678901')
    
    select
       col1
      ,datalength(col1) as datalength
      ,len(col1)      as stringlength
      ,col2
      ,datalength(col2) as datalength
      ,len(col2)      as stringlength
    from
      dbo.t1

And what we get is this:

    col1         datalength  stringlength  col2         datalength  stringlength
    -----------  ----------  ------------  -----------  ----------  ------------
    1234567890   10          10            1234567890   50          10
    NULL         NULL        NULL          1234567890   50          10
    12345678901  11          11            12345678901  50          11

On the nullable column the datalength now is the same as the content length; no padding occurs. On the mandatory column col2 the content is still padded with spaces up to length 50.
SET ANSI_PADDING ON/OFF must be set at table creation time to have an effect.
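
If you want to check which setting a column was created with, sys.columns exposes it; a quick look at the demo table:

    -- is_ansi_padded shows the padding behaviour baked in at creation time
    select
       name
      ,is_ansi_padded
    from
      sys.columns
    where
      object_id = object_id('dbo.t1')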

Varchar

Now let’s look at varchar columns. Varchar doesn’t preemptively store its full declared size, it only stores as much as its content actually needs, up to that size. To do so, varchar additionally has to store the length of its content, which takes up two extra bytes.
So varchar(50) stores 2 bytes + (n characters * 1 byte), i.e. up to 52 bytes.

    if object_id('dbo.t1') is not null
      drop table dbo.t1;
    
    set ansi_padding on
    
    create table dbo.t1
      (col1 varchar(50) null
      ,col2 varchar(50) not null)
    
    insert dbo.t1 values ('1234567890', '1234567890')
    insert dbo.t1 values (null, '1234567890')
    insert dbo.t1 values ('12345678901', '12345678901')
    
    select
       col1
      ,datalength(col1) as datalength
      ,len(col1)      as stringlength
      ,col2
      ,datalength(col2) as datalength
      ,len(col2)      as stringlength
    from
      dbo.t1

And what we get is this:

    col1         datalength  stringlength  col2         datalength  stringlength
    -----------  ----------  ------------  -----------  ----------  ------------
    1234567890   10          10            1234567890   10          10
    NULL         NULL        NULL          1234567890   10          10
    12345678901  11          11            12345678901  11          11

Independent of the ansi_padding setting and the null/not null constraint, the datalength is always the same as the content length.
Looks like char can be just as variable as varchar with the appropriate settings, doesn’t it? No, it can’t. Don’t mistake datalength for storage length. Char still stores a byte for each character: char(50) reserves 50 bytes for every row that is created. Varchar only stores the data that is actually there. Plus 2 bytes.
The ansi_padding setting may have some benefits in application development, but it does not change the storage needs of character columns.
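
If you want to see that difference on disk for yourself, here is a quick made-up sketch (the exact numbers depend on your system, but the ratio speaks for itself):

    -- two one-column tables with identical content, different types
    create table dbo.fixedWidth (col1 char(50) not null);
    create table dbo.variableWidth (col1 varchar(50) not null);

    -- fill both with 100,000 identical 10-character strings
    insert dbo.fixedWidth
      select top (100000) '1234567890' from sys.all_objects a cross join sys.all_objects b;
    insert dbo.variableWidth
      select top (100000) '1234567890' from sys.all_objects a cross join sys.all_objects b;

    -- compare the reserved and data sizes of the two tables
    exec sp_spaceused 'dbo.fixedWidth';
    exec sp_spaceused 'dbo.variableWidth';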

Varchar(1) vs. Char(100)

So when to use which data type? Well, the answer seems pretty obvious:

  • 1 to 3 characters => use char.
  • 4 to 7 characters => it depends. If most of your data fills all 7 characters, use char; storage-wise the length bookkeeping of varchar has a slight overhead, so char could be the wiser decision.
  • More than 7 characters => use varchar. Unless, of course, your data has a fixed length that never varies.

You often hear to make the cut at 15, 20 or even 50 characters. Fact is that even at as little as 7 or 8 characters, varchar is the more efficient choice.

Varchar(255)

A common practice I have seen many times is to use varchar(255), or even varchar(max), for just about everything. Hey, it only stores what is there, so who cares?
First, varchar(255). What is this artificial 255 threshold? I know, I am also old enough to remember database systems that only allowed 255 characters. And of course, 11111111 in binary is 255, bla bla. But come on: 64-bit systems, 32 GB of RAM or more, terabytes of data storage. Please get rid of the 255. Or do your users really enter exactly 255 characters?

Varchar(max)

Then varchar(max). Varchar(max) is not really a character data type; it is treated as a large data type, just like varbinary(max), text, ntext and image.
Well, actually it is a hybrid. If the data fits on the data page, it is treated like varchar; if it does not fit, it is treated as a large object. That means that instead of the data itself, a 16-byte pointer is stored which points to a pointer tree containing pointers to the actual data. This costs at least one additional read operation, and possibly several, depending on how many it takes to gather all the pointers to all the data.
Note that this does not depend on how much data is actually stored in the field, but solely on whether the data still fits in the data row on the page.

This behaviour is controlled by the table option large value types out of row. Set to ON, varchar(max) values are always stored out of row; set to OFF (the default), they are stored in-row as long as they fit:

    exec sp_tableoption 'dbo.tablename', 'large value types out of row', 'OFF';

With the option off, SQL Server stores the data in-row whenever it fits, which basically means that fewer rows fit on a data page. Keep in-row storage only when the large data is the main read and write target of the table; otherwise, switching the option on makes more sense.
Oh, and even with the option off, the maximum value that is stored in-row is 8000 bytes. So why not use varchar(8000) instead of varchar(max) in the first place?

External applications like SSIS treat varchar(max) as a LOB, which for SSIS means that no matter what data is actually in there, data transfer will always be done by swapping the data to disk and reading it back from disk. Which makes data transfer a hell of a lot slower.

And of course you cannot create indexes on varchar(max) columns. You can still use such a column in the include clause of an index. If you like really big indexes.
If after all of this you still feel the need to create varchar(max) columns, consider limiting the length with a constraint. That doesn’t really help with the issues described above, but it can limit the size of your database. Do you really want users to load 2 GB of data into every data row? Why should you allow them to, if all you actually want to store are some reasonably small values?

    create table dbo.t1
      (col1 varchar(50) null
      ,col2 varchar(50) not null
      ,col3 varchar(max) null)
    
    alter table dbo.t1 add constraint CHK_col3_MaxLength
      CHECK (len(col3) <= 32000)

Besides all the technical details, this leads to the foremost reason why you just shouldn’t set every varchar column to 255 or 8000 or max:

The key to a well designed, well performing database is to maintain data integrity.
Don’t store data you don’t need.
Define every piece of data you store as precisely as possible.

This means: limit the size of every column (not only character columns) to the size and data type you actually need. Not to what is easy to implement, or what you might need in 100 years’ time. Varchar is the most flexible data type of them all and devours almost any data you throw into it. The least we can do is limit the amount.

Nchar, nvarchar

Nchar and nvarchar are the unicode equivalents of char and varchar. They need two bytes per character for storage, but can store any unicode character. Apart from that, they behave exactly like char and varchar.
Well, actually it is the other way around: char and varchar are subsets of nchar and nvarchar. So every varchar value can implicitly be converted to nvarchar, but not the other way around!
Beware of things like this:

    where colVarchar = N'Unicode Text'

This forces SQL Server to explicitly convert colVarchar to nvarchar, and convert, even when it is executed in the background by SQL Server, is a function. This means the rows of the query have to be evaluated row by row (hidden RBAR) and no indexes on colVarchar can be used.

    where colNVarchar = 'Non-Unicode Text'

This can be converted implicitly, and indexes can still be used, since varchar is a subset of nvarchar.
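
To see the difference, a small made-up sketch (the exact plan also depends on the column’s collation, but the direction of the implicit conversion is always the same):

    -- assume dbo.tablename has colVarchar varchar(50) with a nonclustered index on it

    -- the N'' literal is nvarchar, so the whole column is converted up:
    -- with a SQL collation this turns an index seek into a scan
    select * from dbo.tablename where colVarchar = N'SomeText';

    -- the plain literal compares as varchar, the index can be used for a seek
    select * from dbo.tablename where colVarchar = 'SomeText';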

Summary

  • Choose varchar over char, also for small values.
  • Use varchar(max) only when you really, actually, without doubt need it.
  • Better still, use nvarchar.

Using constraints in Data Models

When talking to application developers about databases, I have encountered many long-lived myths that don’t seem to want to die out. One of the most persistent is: constraints, especially foreign key constraints, are bad for performance, so use as few constraints as possible.
The contrary is true: use as many constraints as possible. Primary key, foreign key, check, default, unique and not null constraints.

For two main reasons:

  • Data integrity
  • (and, yes) Performance

Data Integrity

The more constraints you use, the surer you can be that the data will look exactly the way it was intended to. The more check and key constraints you have, the fewer data inconsistencies you’ll have. The additional effort in the design and implementation phase is marginal compared to the effort that comes with data inconsistency issues.
I have never heard of expensive consultants and migration experts having to be hired because the data was too consistent. Implementing as much data integrity as possible in your data model will save you a lot of headaches and maintenance effort.

Performance

Yes, there is a (small) performance overhead in maintaining all these constraints, but it is mostly negligible. If you write code where the overhead of a foreign key or check constraint matters, you already have a database that is fast as hell.

More importantly, the more constraints you have, the easier it is for the Query Optimizer to find the best execution plan for a query. Every constraint helps define the data and the connections between the data, without the optimizer having to look at the data itself. Thus it is more likely to find a good execution plan than without constraints. And so constraints are more likely to make the database faster than slower.

Here is an example with a simple check constraint.
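
Assume two small tables like these (a made-up setup sketch; in t2, col2 only ever holds the values 1 to 3):

    -- hypothetical tables, just enough structure for the example
    create table dbo.t2
      (id int not null primary key
      ,col1 int null
      ,col2 int not null)

    create table dbo.t1
      (col1 int null
      ,t2Id int null)

Now we join them and filter on t2.col2: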

    select
        t1.col1
       ,t2.col1
    from
        dbo.t1
    left outer join
        dbo.t2 on t1.t2Id = t2.id
              and t2.col2 = 15

Col2 in table t2 only has values from 1 to 3. The optimizer doesn’t know that, so it creates a plan with a table scan on t2.

Let’s create a check constraint that tells the database that col2 in t2 can only contain the values 1, 2 and 3.

    alter table dbo.t2 add constraint CHK_t2_Col2
      CHECK (col2 >= 1 AND col2 <= 3)

Now let’s run the same statement again. And, surprise, the execution plan looks completely different.

The optimizer now knows that t2.col2 = 15 can never return a row, so it doesn’t even bother to include the table t2 in the execution plan. The execution plan is now far more efficient.

This is just a simple example, but it goes to show:

The more the Query Optimizer knows about the data, the better the execution plan will be.

And the way to tell SQL Server about the data is to create constraints. Thus:

Create as many constraints as possible!
