How do you choose which columns to include in your SQL Server clustered index?

Do I need to have an incrementing identity int/bigint as my clustered index in a SQL Server database?

When you want to produce a professional table design that will scale in the future and keep you from being called at 4 AM to fix a performance issue, will you regret the decision not to add an incrementing “id” column to that core table?

When you look in forums, do you see people fiercely guarding their opinion that you should/should not include an id column?

In this post, we are going to discuss the points that should help you decide, on a table-by-table basis, which column or columns you should use as the clustered index for each of your tables.

Types of databases

There are many different ways people use SQL Server databases, from fast, row-by-row transactional data stores to large analytical data warehouses, and how you use the database completely defines how you should design it. This post is about OLTP, or transactional, systems: things like lists of customers and orders, where a typical request asks for one or a few specific rows. The opposite is an OLAP database, where a typical request retrieves millions of rows in a single query.

Types of indexes

This post is about clustered indexes; it is not about non-clustered, columnstore, or clustered columnstore indexes.

Why this topic?

I have seen this debated in forums across the internet for decades, and the advice we gave ten years ago isn’t as valid today as it was then. Ten years ago, servers had considerably less memory, and disks were spinning rust. The advent of SSDs, and the ability to buy servers with more memory than data even for large systems, has changed how we should think about designing and maintaining databases.

What are the options for a clustered index?

Option 1 – create an “id” column that is either an int or bigint and is also an “identity”

This is often recommended as the default and is widely seen as the way to go. The int/bigint is used because it is a small data type that is efficient to search and compare. The “identity” property makes the column auto-incrementing, so new rows always go at the end of the data.
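As a rough sketch (the table and column names here are made up purely for illustration), that design looks something like this:

```sql
-- A surrogate-key design: an ever-increasing identity column used as the
-- clustered primary key, so new rows always land at the end of the table.
CREATE TABLE dbo.Customers
(
    CustomerId   int IDENTITY(1,1) NOT NULL,
    FirstName    nvarchar(100)     NOT NULL,
    LastName     nvarchar(100)     NOT NULL,
    EmailAddress nvarchar(256)     NOT NULL,
    CONSTRAINT PK_Customers PRIMARY KEY CLUSTERED (CustomerId)
);
```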

Option 2 – Use a unique natural key made up of one or more columns

If you consider a users table, a natural key might be something like an email address or a combination of first name, last name, and date of birth. The natural key might also be a customer id that happens to be an int or bigint generated by another system. If the key is an auto-incrementing number, then you may well have the perfect choice for your clustered index, but that is rare.
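A hedged sketch of what clustering on a natural key might look like (again, the table and columns are illustrative):

```sql
-- Clustering on a natural key (the email address) instead of a surrogate
-- identity column: the key is unique, but much wider than an int.
CREATE TABLE dbo.Users
(
    EmailAddress nvarchar(256) NOT NULL,
    FirstName    nvarchar(100) NOT NULL,
    LastName     nvarchar(100) NOT NULL,
    DateOfBirth  date          NOT NULL,
    CONSTRAINT PK_Users PRIMARY KEY CLUSTERED (EmailAddress)
);
```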

Option 3 – Use a non-unique natural key

If you use a non-unique key, whether it is made up of one or more columns, SQL Server will, behind the scenes, add a hidden column to make each key value unique, bloating your data with a column you can’t even see in your results. The hidden column is called the “uniqueifier”.
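For example (an illustrative table, not from any real system), a clustered index on a non-unique column looks like this; any rows sharing the same LastName get the hidden uniqueifier added to their clustered key:

```sql
-- A non-unique clustered index: duplicate LastName values are made unique
-- behind the scenes by a hidden 4-byte uniqueifier column.
CREATE TABLE dbo.CustomerOrders
(
    OrderNumber int           NOT NULL,
    LastName    nvarchar(100) NOT NULL,
    OrderDate   datetime2(0)  NOT NULL
);

CREATE CLUSTERED INDEX CIX_CustomerOrders_LastName
    ON dbo.CustomerOrders (LastName);
```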

What concerns do people have about choosing the clustered index?

Key width

When you use a wide column or columns as your clustered index, those columns are the key used to locate a row. They are like directions to the data: the longer they are, the harder they are to follow. Whatever the clustered key is, it is also copied into every non-clustered index. A non-clustered index finds the row or rows it needs and then, if it has to go back to the clustered index, uses the clustered index key to find the data. If that key is long or wide, finding the data is harder than with something narrow like an int or bigint.
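To make that concrete, here is an illustrative sketch (names invented for the example) of a wide composite clustered key; the non-clustered index on the table silently carries all three key columns as its row locator:

```sql
-- A wide composite clustered key: FirstName, LastName and DateOfBirth are
-- copied into IX_People_Email as the pointer back to the clustered index.
CREATE TABLE dbo.People
(
    FirstName    nvarchar(100) NOT NULL,
    LastName     nvarchar(100) NOT NULL,
    DateOfBirth  date          NOT NULL,
    EmailAddress nvarchar(256) NOT NULL,
    CONSTRAINT PK_People PRIMARY KEY CLUSTERED (FirstName, LastName, DateOfBirth)
);

CREATE NONCLUSTERED INDEX IX_People_Email
    ON dbo.People (EmailAddress);
```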

Non-unique keys / uniqueifiers

There is a great video from Kimberly Tripp in which she discusses how a non-unique clustered index can lead to massive amounts of extra data in your database, wasting your RAM and disk I/O and making your maintenance and backups more expensive. I would recommend you watch the video here: www.youtube.com/embed/QjCEkI8Qm5c

Joining datasets on strings is more expensive than joining on numbers

To compare two integers, the CPU can load the two numbers into registers and check whether they are equal with a single instruction. With a string, you may need to compare every character. Once you add things like case and accent insensitivity, a string comparison gets even more complicated and uses even more CPU cycles.
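As a rough illustration (the tables, the CustomerEmail column, and the collation are all assumed for the example), compare the two joins below:

```sql
-- Joining on a narrow integer key: a simple equality comparison per row pair.
SELECT o.OrderNumber, c.CustomerId
FROM dbo.CustomerOrders AS o
JOIN dbo.Customers      AS c ON o.CustomerId = c.CustomerId;

-- Joining on a string key with a case- and accent-insensitive collation:
-- every comparison has to walk the characters through collation rules.
SELECT o.OrderNumber, c.EmailAddress
FROM dbo.CustomerOrders AS o
JOIN dbo.Customers      AS c
    ON o.CustomerEmail = c.EmailAddress COLLATE Latin1_General_CI_AI;
```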

Fragmentation and page splits

If the key you use to cluster your table grows in a known way, such as the current UTC date or an ever-increasing number, then new rows are only ever inserted at the end of the table. If you have a randomly ordered key, then data can be inserted anywhere in the table, and you will get page splits and fragmentation: you pay the cost of the splits themselves and then pay again for whatever index maintenance you need to do to fix the fragmentation. If you have randomly ordered clustered keys, you also need to think about the fill factor on the pages that store the table.
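A small sketch of the difference (table names made up): a clustered key generated by NEWID() scatters inserts across the table, while NEWSEQUENTIALID(), an identity column, or a date-led key keeps them at the end:

```sql
-- Random key: new rows land on random pages, causing page splits and fragmentation.
CREATE TABLE dbo.Sessions_Random
(
    SessionId uniqueidentifier NOT NULL DEFAULT NEWID(),
    StartedAt datetime2(0)     NOT NULL,
    CONSTRAINT PK_Sessions_Random PRIMARY KEY CLUSTERED (SessionId)
);

-- Ever-increasing key: new rows are appended to the last page of the table.
CREATE TABLE dbo.Sessions_Sequential
(
    SessionId uniqueidentifier NOT NULL DEFAULT NEWSEQUENTIALID(),
    StartedAt datetime2(0)     NOT NULL,
    CONSTRAINT PK_Sessions_Sequential PRIMARY KEY CLUSTERED (SessionId)
);
```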

Don’t blindly do X

There is a feeling, rightly so, that blindly following a rule is the wrong thing to do and can cause issues down the line. Instead of saying “you must have an incrementing id identity column”, you should be thinking about the logical and then the physical design of the table.

Last page contention

When you write data to SQL Server, it puts the data in a page. If you optimise the clustered index so that writes always happen at the end of the table, then because of the way SQL Server latches the page it is writing to, every insert is fighting over the same last page, so even though you may have plenty of cores, writes effectively get serialised. Those latches are seriously fast, so you will still get high volume, but where you generate a high number of concurrent inserts you may find this becomes an issue. To read all about it, see: https://www.sqlpassion.at/archive/2014/04/15/an-ever-increasing-clustered-key-value-doesnt-scale/
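If you are on SQL Server 2019 or later, one option worth knowing about (a hedged sketch, not something the linked article covers) is the OPTIMIZE_FOR_SEQUENTIAL_KEY index option, which is designed to reduce exactly this last-page contention:

```sql
-- Assumes SQL Server 2019+: keep the ever-increasing key but ask the engine
-- to manage contention on the hot last page.
CREATE TABLE dbo.Events
(
    EventId bigint IDENTITY(1,1) NOT NULL,
    Payload nvarchar(max)        NULL,
    CONSTRAINT PK_Events PRIMARY KEY CLUSTERED (EventId)
        WITH (OPTIMIZE_FOR_SEQUENTIAL_KEY = ON)
);
```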

Mitigations

Key width / duplicate data

This used to be more of an issue than it is today; terabytes of SSD are within reach of most developers, and even SSD de-duplication is a thing. When the data is pulled into memory, it is quite possible you will have more RAM than data anyway.
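If you want to sanity-check the “more RAM than data” claim on your own server, one rough way (a sketch, there are others) is to see how much of each database is currently sitting in the buffer pool:

```sql
-- Approximate buffer pool usage per database, in MB (8 KB pages).
SELECT DB_NAME(database_id) AS database_name,
       COUNT(*) * 8 / 1024  AS cached_mb
FROM sys.dm_os_buffer_descriptors
GROUP BY database_id
ORDER BY cached_mb DESC;
```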

Joining datasets causes high CPU usage

If your database server is completely CPU bound, then any CPU cycles you save are welcome. If you had an otherwise perfectly functioning system that simply lacked CPU power, I would probably add more cores, or better still move to a more modern CPU architecture to get that extra boost.

It comes back to knowing what queries your data will be exposed to. If every query joins hundreds of large strings in a case- and accent-insensitive way, then you are going to find issues. If you have the odd one here or there where it makes sense and performance isn’t critical, does it matter?

Page splits

We have SSDs, and if we are in Azure we might even have SSDs with write caching for the data, so sub-millisecond response times are to be expected. If you have a highly critical database and you are seeing issues because of page splits, you can do things like changing the fill factor; there are options.
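For example (the index and table names are placeholders), lowering the fill factor on the clustered index leaves free space on each page so out-of-order inserts have somewhere to go before they force a split:

```sql
-- Rebuild the clustered index leaving 20% free space on each leaf page.
ALTER INDEX PK_Users ON dbo.Users
    REBUILD WITH (FILLFACTOR = 80);
```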

Fragmentation

When you have a primary key that isn’t an incrementing number, you can do things (some may say dirty things) like adding a date to the beginning of the key so that new rows always go at the end. And again, with SSDs, is fragmentation the problem it once was? If we have fragmentation in a database that isn’t performance critical, does it matter?
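Before worrying about it at all, it is worth measuring; something like the query below (a sketch against the standard DMVs) shows how fragmented each index in the current database actually is:

```sql
-- Fragmentation per index in the current database, worst first.
SELECT OBJECT_NAME(ips.object_id)          AS table_name,
       i.name                              AS index_name,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
    ON i.object_id = ips.object_id
   AND i.index_id  = ips.index_id
ORDER BY ips.avg_fragmentation_in_percent DESC;
```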

Opinion

I used to be in the camp of pretty much always adding an id column that was an identity and using it as the clustered index. This served me well, and I don’t remember a time when I wished I hadn’t clustered on that column. That being said, if there were another good choice for a clustered index, I would certainly consider it.

For each table you create, you need to understand whether it is going to have a significant number of rows, how the data will be accessed, and whether you will get a high enough volume of queries that the types of the columns you join on could become an issue. If these lead you to needing a small (int/bigint) incrementing number, then go for it. For smaller tables where you have a good natural key and that is how the data is accessed, I would test whether that gives you the performance you need, but don’t be afraid to err on the side of adding an incrementing id column.

Personally, I will still default to adding an id int identity(1,1) column to most of my tables, but I won’t be so quick to ignore the other options.