How to delete a lot of data from a SQL Server database
How to delete data efficiently
When we delete data from a table in SQL Server, the delete is often slower than the original insert, especially if the insert used the bulk insert API. When people find that a delete is taking longer than it should and blocking other users, they often kill it, only to find that the rollback takes as long as, if not longer than, the delete had taken up to that point. Sometimes they panic and restart SQL Server, but when the server starts back up it puts the database into recovery and carries on rolling back that data.
If we put proper planning into how we maintain and delete data, then we can delete it efficiently when we need to.
What do we need to think about?
First we will look at the significant things to think about when deleting large amounts of data, and then we will cover different approaches to actually deleting it.
Foreign Keys
When we have a foreign key relationship defined on a table and we delete data from the parent table, the child table also needs to be checked to see whether the delete is allowed. If cascading deletes are enabled, then the child table needs to be checked to see whether it is itself a parent, and potentially one or more further child tables will need to be checked as well.
If the child tables do not have indexes on the columns being checked, SQL Server can generate poorly performing query plans, and the delete will suffer.
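As a rough sketch, assuming a hypothetical parent table dbo.Orders and a child table dbo.OrderLines that references it, an index on the child's foreign key column lets SQL Server check (or cascade) the delete with a seek rather than a scan of the child table:

```sql
-- Hypothetical child table dbo.OrderLines referencing dbo.Orders(OrderId).
-- Without this index, every delete from dbo.Orders scans dbo.OrderLines
-- to verify (or cascade) the foreign key.
CREATE INDEX IX_OrderLines_OrderId
    ON dbo.OrderLines (OrderId);
```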
Foreign keys are essential for enforcing referential integrity in databases, but they are not a delete's friend.
Triggers
It is impossible to show the exact effect of triggers, because you can write any valid T-SQL code in a trigger, including calling out to the operating system using xp_cmdshell. The thing to note is that inside a trigger you are already in a transaction, which means that any other table you touch in the trigger will have locks placed on it until the delete completes. So when you do have a delete trigger, keep the code as specific as possible and bear in mind the extra transaction that you get whether you want it or not.
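As a minimal sketch, assuming hypothetical dbo.Orders and dbo.OrdersAudit tables, a delete trigger might look like the one below; everything inside it runs in the same transaction as the delete, so any locks it takes on dbo.OrdersAudit are held until that transaction completes:

```sql
-- Hypothetical trigger and table names for illustration only.
CREATE TRIGGER trg_Orders_Delete
ON dbo.Orders
AFTER DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Touch only what is needed; the deleted pseudo-table holds the removed rows.
    -- This insert, and its locks on dbo.OrdersAudit, lives inside the delete's transaction.
    INSERT INTO dbo.OrdersAudit (OrderId, DeletedAt)
    SELECT OrderId, SYSUTCDATETIME()
    FROM deleted;
END;
```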
Locking
Locking and blocking are often treated as synonyms in SQL Server, but they are not the same thing: it is possible to have lots of locking without any blocking, or a small amount of locking with horrendous blocking. It all depends on what you lock and how.
To understand the type of locking you will need to know:
- The table type: does it have a clustered index or not, and if so, is it a clustered columnstore index?
- The transaction isolation level you are running the delete under
- The size of the table the data is being deleted from
The table type matters because if you are removing data from a heap, that is, a table without a clustered index, then you will likely take page locks on any page read and then on any page that has rows deleted from it. If entire pages are removed, then you will also get extent locks.
If the table has a clustered index, then you will see key locks, and if it is a clustered columnstore index, then you will see things like rowgroup locks.
The transaction isolation level and table size are essential to understand because if you run under an isolation level that takes a lot of locks, such as serializable, and you touch enough data, then any page or row locks may be escalated to a table lock. A table lock affects every other query accessing that table and can even cause queries to deadlock in some situations, so lock escalation can be devastating for an online system.
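One way to see what a delete actually locks is to run it inside an open transaction against a test copy and inspect sys.dm_tran_locks before rolling back. A sketch, assuming a hypothetical dbo.Orders table and cut-off date:

```sql
-- Keep the transaction open so the delete's locks are still held while we look at them.
BEGIN TRANSACTION;

DELETE TOP (5000)
FROM dbo.Orders
WHERE OrderDate < '2015-01-01';

-- Summarise the locks this session holds: KEY, PAGE, EXTENT or OBJECT (table) locks.
SELECT resource_type, request_mode, request_status, COUNT(*) AS lock_count
FROM sys.dm_tran_locks
WHERE request_session_id = @@SPID
GROUP BY resource_type, request_mode, request_status;

ROLLBACK TRANSACTION;
```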
Fragmentation
Fragmentation occurs when we delete from pages and there is still data surrounding our data. If we have 100 rows and delete every odd row, we would have perfect fragmentation, in that we have doubled the size of the data that we need. If we delete rows 1-49, even though we remove roughly the same number of rows, we don't have any fragmentation, as the remaining data sits in a continuous block. Knowing how the data is stored on disk and how it will be deleted, whether it is the first x records or every xth record, is vital so that we know whether, after the delete, we should also reorganise the indexes to reclaim the space left by the deleted rows.
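To check whether a reorganise is worthwhile after the delete, sys.dm_db_index_physical_stats reports the fragmentation of each index. A sketch, again using a hypothetical dbo.Orders table and index name:

```sql
-- Check fragmentation per index on the hypothetical dbo.Orders table.
SELECT i.name AS index_name,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Orders'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id = ps.index_id;

-- Reorganise an index (hypothetical name) if the fragmentation warrants it.
ALTER INDEX PK_Orders ON dbo.Orders REORGANIZE;
```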
Log File
The transaction log keeps a copy of all the data it needs so that if an operation doesn't complete, it can be rolled back. For a delete, each row is logged individually rather than in batches or pages. This means that if the data was originally loaded using the bulk load API and the insert was minimally logged, in that not everything was written to the transaction log, it may well have been faster to load the data than it will be to delete it.
This difference in speed can be even worse with clustered columnstore indexes where batch mode inserts were used to parallelise the load. If you inserted rows that way, it will be much slower to delete the same number of rows.
If you do delete large amounts of data, whether it arrived as trickle inserts or as minimally logged inserts, the log file can grow dramatically as well. Even under the simple recovery model, every deleted row is still logged. If you can use simple recovery, then batching the deletes into smaller transactions will keep the log file smaller.
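A simple way to keep an eye on log growth while a large delete runs is to query sys.dm_db_log_space_usage in the database you are deleting from:

```sql
-- Report log size and how much of it is currently in use for the current database.
SELECT total_log_size_in_bytes / 1048576.0 AS total_log_size_mb,
       used_log_space_in_bytes / 1048576.0 AS used_log_space_mb,
       used_log_space_in_percent
FROM sys.dm_db_log_space_usage;
```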
Partitions
Partitions can be used to delete data if the partition key matches the data that you want to delete. If you want to delete an entire partition, then on later versions of SQL Server you can truncate that specific partition, which is a minimally logged operation and therefore fast. If you are on an older version of SQL Server, you can switch the partition out to an empty table, which is also a fast operation.
For partitions to be useful, the table needs to be partitioned by the key you are deleting on and, vitally, you need to be able to delete entire partitions. If you partition by date and the partitions mix different clients, then partitioning doesn't help when you want to delete one specific client. If you partition by year and want to delete whole years' worth of data, then it works great.
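As a sketch, assuming a hypothetical dbo.Sales table partitioned by year and an empty staging table dbo.Sales_Staging with the same structure on the same filegroup:

```sql
-- SQL Server 2016 and later: truncate just the partition(s) you want gone.
TRUNCATE TABLE dbo.Sales WITH (PARTITIONS (1));

-- Older versions: switch the partition out to an empty staging table,
-- then clear out the staging table.
ALTER TABLE dbo.Sales SWITCH PARTITION 1 TO dbo.Sales_Staging;
TRUNCATE TABLE dbo.Sales_Staging;
```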
Approaches to deleting data
We will now look at different ways of managing large deletes.
Don’t delete any data
The first approach, not deleting any data at all, might in fact be the simplest way to deal with the problem. If you can find a way to ignore the data and you don't actually need to remove it, then perhaps that is the best answer.
It might be that you can avoid deleting the data during the week and instead batch up the deletes for when the system is quieter or has a maintenance window.
The downside to never deleting data is that you end up with more data than you need, which can itself cause performance problems.
Batched Deletes
Batching the deletes into smaller transactions can help reduce the amount of locking and potential blocking. If you also schedule the batched deletes at quieter times and spread them over a few days, then you are less likely to cause problems for other users.
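A minimal sketch of a batched delete, assuming a hypothetical dbo.Orders table and a made-up cut-off date; each small transaction releases its locks and gives the log a chance to be reused or backed up between batches:

```sql
DECLARE @BatchSize int = 5000;

WHILE 1 = 1
BEGIN
    -- Each iteration is its own implicit transaction, so locks are held only briefly.
    DELETE TOP (@BatchSize)
    FROM dbo.Orders
    WHERE OrderDate < '2015-01-01';

    -- Stop once a batch deletes nothing.
    IF @@ROWCOUNT = 0
        BREAK;

    -- Optional pause between batches to give other workloads a chance.
    WAITFOR DELAY '00:00:01';
END;
```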
Truncate
Truncate is a minimally logged operation, so any time you truncate a table it is fast, but the downside is that you lose all the data in it. If you want to clear out an entire table, then truncate is the way to go. If you are using partitions and you can truncate one or more partitions, then that is also the way to go.
One other approach is to take the data you want to keep and insert it into a new table, then drop the existing table and rename the new table back to the original name (note: not a temporary # table, as you can't rename a # or ## table into a standard table). When using this approach, think about how much data you want to delete versus how much data you want to keep. You should also think about how often new data is added to the table; if it is getting inserts and updates every second, then keeping the two tables in sync will be hard.
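A sketch of that keep-and-swap approach, with hypothetical table names; note that SELECT ... INTO copies only the data, so any indexes, constraints and permissions need to be recreated on the new table before the swap:

```sql
-- Copy the rows we want to keep into a new table (hypothetical names and date).
SELECT *
INTO dbo.Orders_Keep
FROM dbo.Orders
WHERE OrderDate >= '2015-01-01';

-- Recreate indexes, constraints and permissions on dbo.Orders_Keep here.

-- Swap the tables.
BEGIN TRANSACTION;
    DROP TABLE dbo.Orders;
    EXEC sp_rename 'dbo.Orders_Keep', 'Orders';
COMMIT TRANSACTION;
```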
Any other options?
If there is anything else you consider when you delete large amounts of data let me know in the comments!