Spark Delta Lake, Updates, Deletes and Time Travel

When you use delta lake there are a couple of interesting things to note based around the fact that the data is stored in parquet files which are read-only and delta lake includes the ability to delete and update data and view the state of a table at a specific point in time. Obviously read-only and updates and deletes don’t exactly sound like they work together, so how does it all work and what do we need to be aware of?

Let’s start by creating a new data frame and save it in delta format:

create data frame(

If we look at the file system we can see we have a single parquet file:

One parquet file(

If we load that one parquet file we can see that it has one hunded rows:

One hundred rows(

Now lets update one row:

Update one row(

and check that just one row was updated:

Just one row has changed(

If we look in the file system we can see two files, notice that they are about the same size - this is because the second file contains the updated row as well as the original 99 rows

Two parquet files, same size(

If we re-read the first file and check it and then read the second we can see that they both have one hundred rows:

One hundred rows in each file(

Now, what about deletes? If we delete every record apart from one:

We deleted everything aside from a single row(

We then get a new parquet file that is smaller:

New smaller parquet file(

If we load the new parquet file, we can see that it only has one row:

Single row(

So, what to remember?

The way that delta lake works is that if you update a row then that new updated row and all the existing rows are written, not just the updated rows. If a row is deleted, every row apart from the deleted rows are written out to the new parquet file. There is some additional work that goes on in the log file (json file) to say that the existing parquet files are to be ignored and to use the new one instead.