Revisit Database Isolation

seanhu93
12 min read · Feb 8, 2021


Understanding concurrency problems

When two or more work units (they can be processes, threads, programs, or any other executable units) run at the same time, especially if they access or modify the same piece of data, we have concurrency problems.

Take a real-world example: two cars may crash into each other if they drive into the same intersection at the same time with no stop sign or traffic light, because each used the lane without knowing the other was doing the same.

Similarly, if two programs update the same database row at the same time, one may overwrite the other's write (assuming the database has no mutual-exclusion mechanism such as locks).

Even if each program is correct when executed in a single-threaded environment, such concurrency problems can break correctness when the programs run concurrently with each other.

Database Transactions and how they help

Naturally, human programmers are good at writing single-threaded programs.

The concept of a database transaction is a brilliant idea: it gives human programmers the illusion of a single-threaded environment. It abstracts away the complexity of handling concurrency problems, so programmers only need to focus on the correctness of their programs in a single-threaded environment.

But how do database transactions provide such an illusion? If you studied databases, you probably still remember the acronym ACID. It stands for the four properties that a database transaction should have: Atomicity, Consistency, Isolation, and Durability. We will review the others later; in this post, we focus on the third property, Isolation.

The isolation guarantee states that each transaction executes as if there were no concurrently executing transactions (even though in fact there may be many). In other words, concurrent transactions are executed in a way that appears as if they ran sequentially, one at a time.
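
As a minimal sketch of this programming model (the accounts table and column names are hypothetical), a transfer between two accounts is written as one transaction, and the database's job is to make it behave as if nothing else ran at the same time:

```sql
-- Hypothetical schema: accounts(id, balance).
-- Everything between BEGIN and COMMIT is one transaction; under perfect
-- isolation it behaves as if no other transaction ran concurrently.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 'alice';
UPDATE accounts SET balance = balance + 100 WHERE id = 'bob';
COMMIT;
```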

However, everything comes with a cost. What we described above is a very strict isolation guarantee, and it comes at a performance cost, both in latency (how long a transaction takes to complete) and in throughput (how many transactions the database system can handle per second). In fact, for performance reasons, most existing database implementations either do not provide this strict guarantee or do not enable it by default.

Like many other things in computer science, it is a trade-off. To gain better performance, people often relax the isolation guarantee. This is where the concept of isolation levels comes into the picture: different isolation levels define different degrees of "strictness" of the isolation that a database system provides.

To review and understand the different isolation levels, let's step back and first learn what anomalies in concurrent systems are.

Anomalies in concurrent systems

Anomalies (or concurrency bugs) are problems that we may face in concurrent systems. Note that these anomalies are only possible when transactions are executed concurrently (overlapping in time) in a database system where isolation is not fully guaranteed. If the transactions are executed sequentially, one after another, such anomalies cannot occur.

In my experience, people often refer to them as race conditions. However, there are many different kinds of "race conditions".

Note that in this post I am using database-system terminology: for example, transactions instead of processes, and database rows instead of memory addresses. But the ideas should be applicable to other systems as well.

Lost Update

Also known as a write-write conflict: when two transactions concurrently update the same row, one update is lost.

For example, suppose the database row X is 0 initially and we have two concurrent transactions, T1 and T2. T1 reads X (=0) and increases the value by 1 (X=1). At the same time, T2 reads X (=0) and increases the value by 2 (X=2).

This is problematic: once both transactions complete, X = 2 and the update from T1 is lost. If we had executed the two transactions sequentially, the expected result would be X = 3.
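
Here is a sketch of that interleaving in SQL, using a hypothetical counters(id, value) table (reused in the examples below) and assuming a database with no locking or write-conflict detection; the step comments show the order in time:

```sql
-- Initially: X = 0.
-- Step 1 (T1):
BEGIN;
SELECT value FROM counters WHERE id = 'X';        -- T1 reads 0
-- Step 2 (T2):
BEGIN;
SELECT value FROM counters WHERE id = 'X';        -- T2 also reads 0
-- Step 3 (T1):
UPDATE counters SET value = 1 WHERE id = 'X';     -- 0 + 1
COMMIT;
-- Step 4 (T2):
UPDATE counters SET value = 2 WHERE id = 'X';     -- 0 + 2, silently overwrites T1's write
COMMIT;
-- Final state: X = 2. T1's increment is lost; a serial execution would end with X = 3.
```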

Dirty Writes

A dirty write happens when a transaction writes data based on an uncommitted write from another transaction. In more detail: two transactions concurrently update the same row. The first transaction updates the row with some value but then rolls back, while the second transaction reads the uncommitted write from the first transaction and updates the row based on it.

For example, suppose the database row X = 0 initially. Then (1) T1 reads X (=0), (2) T1 increases the value by 1 (X=1), and (3) T1 aborts and rolls back. If T2 starts between (2) and (3), T2 reads X (=1), increases the value by 1 (X=2), and then commits.

This is problematic because T2 writes X based on uncommitted data from T1. If we had executed the two transactions sequentially, the expected result would be X = 1.
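
A sketch of that interleaving, again with the hypothetical counters table and assuming a database that lets T2 see T1's uncommitted write:

```sql
-- Initially: X = 0.
-- (T1):
BEGIN;
UPDATE counters SET value = value + 1 WHERE id = 'X';   -- X = 1, not yet committed
-- (T2), starting before T1 finishes:
BEGIN;
SELECT value FROM counters WHERE id = 'X';               -- T2 sees the uncommitted value 1
UPDATE counters SET value = 2 WHERE id = 'X';            -- T2 writes 1 + 1 = 2
COMMIT;
-- (T1):
ROLLBACK;                                                -- T1 aborts; its X = 1 never officially existed
-- Final state: X = 2, built on data that was rolled back; a serial execution would end with X = 1.
```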

Dirty Reads

Also known as a write-read conflict, or reading uncommitted data. It is similar to a dirty write, except that the second transaction doesn't necessarily perform a write after it reads the uncommitted data from the first transaction.

For example, suppose X = 0 initially. Then (1) T1 reads X (=0), (2) T1 increases the value by 1 (X=1), and (3) T1 aborts and rolls back. If T2 starts between (2) and (3), T2 reads X (=1) and returns the data to the client. The client then sees X = 1.

This is problematic because the user sees X = 1, which is uncommitted data that is ultimately rolled back. If we had executed the two transactions sequentially, the user would see X = 0.
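
The same interleaving sketched as a dirty read, where T2 only reads the uncommitted value and returns it to the client:

```sql
-- Initially: X = 0.
-- (T1):
BEGIN;
UPDATE counters SET value = value + 1 WHERE id = 'X';   -- X = 1, not yet committed
-- (T2), starting before T1 finishes:
BEGIN;
SELECT value FROM counters WHERE id = 'X';               -- returns 1 to the client
COMMIT;
-- (T1):
ROLLBACK;                                                -- the X = 1 the client saw never officially existed
```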

Non-repeatable Reads

Also known as a read-write conflict. It happens when a transaction reads the same row twice during its execution but the two reads return different values, because another transaction committed a write between the two reads.

For example, T1 reads X = 0. T2 updates X to 1 and then commits. T1 reads X again and now finds X = 1.

This is problematic because T1 observed changes made by T2 during its own execution. If we had executed the two transactions sequentially, T1 would get the same value of X from both reads, either 0 or 1 (depending on whether T1 runs before or after T2).
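
A sketch of that interleaving, assuming an isolation level (such as Read Committed) under which a later read can see other transactions' commits:

```sql
-- Initially: X = 0.
-- (T1):
BEGIN;
SELECT value FROM counters WHERE id = 'X';      -- first read: 0
-- (T2):
BEGIN;
UPDATE counters SET value = 1 WHERE id = 'X';
COMMIT;
-- (T1):
SELECT value FROM counters WHERE id = 'X';      -- second read: 1, different from the first
COMMIT;
```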

Phantom Reads

It happens when a transaction executes the same query on the same table twice but gets different results, because another transaction made changes in between that affect the query result.

For example, suppose table A initially has 2 records. T1 queries table A for its number of rows and finds 2. T2 inserts a new row into table A and commits. Then T1 runs the same query against table A and now finds the result is 3.

This is problematic because T1 observed changes made by another transaction during its own execution. If we had executed the two transactions sequentially, T1 would see the same result from both queries, either 2 or 3 (depending on whether T1 runs before or after T2).
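
A sketch with a hypothetical table A, where the second query sees a "phantom" row that did not exist during the first one:

```sql
-- Initially: table A has 2 rows.
-- (T1):
BEGIN;
SELECT COUNT(*) FROM A;          -- returns 2
-- (T2):
BEGIN;
INSERT INTO A (id) VALUES (3);
COMMIT;
-- (T1):
SELECT COUNT(*) FROM A;          -- returns 3: a phantom row appeared
COMMIT;
```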

Write Skew

It happens when two transactions concurrently read two rows X and Y, concurrently make disjoint updates (T1 updates X, and T2 updates Y), and concurrently commit, with neither having seen the update performed by the other. [Wikipedia]

This is problematic if there is a constraint on X and Y, say that X + Y must always be > 0. The constraint is checked on every update to X or Y, but each concurrent transaction checks it against the state it can see, which does not yet include the other transaction's update, so both checks may pass even though the combined result violates the constraint.

If we had executed the two transactions sequentially, either T1 or T2 would abort due to a constraint violation (depending on which runs first), and the constraint X + Y > 0 would still hold after execution.
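
A sketch of write skew with two hypothetical rows X and Y, both starting at 1, and an application-level rule that X + Y must stay > 0; each transaction validates the rule against its own view, so both pass:

```sql
-- Initially: X = 1, Y = 1, so X + Y = 2 > 0.
-- (T1):
BEGIN;
SELECT SUM(value) FROM counters WHERE id IN ('X', 'Y');  -- sees 2; setting X to 0 keeps the sum at 1 > 0
UPDATE counters SET value = 0 WHERE id = 'X';
-- (T2), concurrently:
BEGIN;
SELECT SUM(value) FROM counters WHERE id IN ('X', 'Y');  -- also sees 2; setting Y to 0 looks equally safe
UPDATE counters SET value = 0 WHERE id = 'Y';
-- Both commit:
COMMIT;   -- T1
COMMIT;   -- T2
-- Final state: X = 0, Y = 0, so X + Y = 0, violating the rule that a serial execution would have caught.
```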

Isolation Levels

How do we define isolation levels that are less strict than perfect isolation?

If a database system guarantees perfect isolation, none of the concurrency anomalies above are possible. With this idea, we can define reduced isolation levels by relaxing some of these constraints, that is, by allowing some anomalies to occur.

The ISO SQL standard defines a list of reduced isolation levels, each of which permits some anomalies and forbids others.

For example, the most relaxed isolation level, Read Uncommitted, allows three anomalies: Dirty Reads, Non-repeatable Reads, and Phantom Reads. The strictest isolation level, Serializable, allows none of the three.
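
In practice, an application requests a level per transaction (or per session). Here is a sketch using the SQL-standard syntax; exactly where the statement may appear varies by vendor (PostgreSQL, for instance, also accepts the shorthand BEGIN ISOLATION LEVEL ...):

```sql
-- Standard syntax: choose the isolation level for the current transaction.
BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT balance FROM accounts WHERE id = 'alice';
-- ... more reads and writes ...
COMMIT;
```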

You may wonder: we mentioned six different anomalies in the previous sections, but the ISO SQL standard mentions only three; what about the rest? Good question! I had the exact same question when I was learning this. Keep reading!

Researchers have already pointed out many problems with how the ISO SQL standard defines isolation levels, and the question you just asked is one of them. Unfortunately, after many revisions, the ISO SQL standard still hasn't fixed them, which is bad :(

Let's review the problems with the ISO SQL standard's existing definition of isolation levels.

Problems with the existing definition of Isolation Levels

First, the current ISO SQL standard fails to define the isolation levels against a comprehensive set of anomalies. It mentions only three types; others, such as Lost Update, Dirty Writes, and Write Skew, are not covered. As a result (which is bad), different database implementations may define their isolation levels against different sets of anomalies, which leads to inconsistency. It gets even worse when such differences are not well documented by the database vendors. This causes a lot of confusion for application developers and, ultimately, concurrency bugs in applications.

Second, the current ISO SQL standard fails to define the snapshot isolation level, which is very popular in practice. As a result (again, bad), different database implementations may allow different sets of anomalies at the snapshot isolation level.

Snapshot isolation is a guarantee that all reads made in a transaction see a consistent snapshot of the database. [Wikipedia] Imagine that the database keeps taking consistent snapshots, where each snapshot contains only committed data. (In practice, snapshot isolation is typically implemented with MVCC, multi-version concurrency control.)

When a transaction starts, it takes a reference to the latest snapshot of the database. All of its reads come from that same snapshot, so repeated reads of the same data return the same value; therefore, the Non-repeatable Reads anomaly is not possible. Writes, on the other hand, produce a new snapshot of the database, and if two transactions write to the same data concurrently, one of them aborts; therefore, the Lost Update anomaly is not possible either. However, if two concurrent transactions write to disjoint data, no write conflict is detected, so snapshot isolation is vulnerable to the Write Skew anomaly. Some implementations may also be vulnerable to the Phantom Reads anomaly.
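
A sketch of that write-conflict behavior, assuming PostgreSQL, whose REPEATABLE READ level is implemented as snapshot isolation (behavior at this level differs between vendors):

```sql
-- (T1):
BEGIN ISOLATION LEVEL REPEATABLE READ;
UPDATE counters SET value = value + 1 WHERE id = 'X';
-- (T2), concurrently:
BEGIN ISOLATION LEVEL REPEATABLE READ;
UPDATE counters SET value = value + 2 WHERE id = 'X';   -- blocks, waiting for T1
-- (T1):
COMMIT;
-- (T2) now fails with:
--   ERROR:  could not serialize access due to concurrent update
-- and must roll back and retry, so the lost update is prevented.
```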

Lastly, the existing definition of isolation levels gives an ambiguous definition of Serializable. On the one hand, the SQL standard defines serializability correctly: a schedule is serializable if its outcome (e.g., the resulting database state) is equal to the outcome of the same transactions executed serially, i.e., without overlapping in time. [Wikipedia] On the other hand, the standard's isolation-level table seems to imply that as long as an isolation level is not vulnerable to the three listed anomalies (Dirty Reads, Non-repeatable Reads, and Phantom Reads), it can be called Serializable.

As an application developer, what can I learn from this?

In this article, we reviewed what concurrency problems are and how database transactions and their ACID properties help.

We dove deep into one of the transaction properties, Isolation. We learned that, in practice, perfect isolation has performance implications. Based on their use cases, application developers should judge wisely which isolation level to use, so that it offers a good balance between isolation guarantees and performance.

We learned that there are six different concurrency bugs (anomalies) in concurrent systems. Unfortunately, the current ISO SQL standard does a poor job, defining its isolation levels against only three of the six anomalies, which leads to a lot of confusion.

Today, different database vendors may provide vague definitions of the isolation levels they offer. People may need to look into implementation details to understand the subtle differences.

Such ambiguity makes application developers' lives harder: they not only need to judge the trade-off between performance and isolation, but also need to understand vendor-specific implementation details when picking an isolation level for their business needs.

As an application developer, what can I do?

I found the original blog post by Prof. Daniel Abadi while googling data-consistency problems in distributed systems. It was astonishing to find such a great post: it contains so much valuable and important content that was unfortunately missing from my database education in school.

However, after reading the post again and again (in order to write this article), I feel a sense of sadness that, as an application developer, I am still not equipped with a better weapon to solve the problem at hand: how to write distributed microservices that are free of race conditions.

In Java, we have interfaces and concrete classes: the interface provides the API contract, and the concrete classes provide different implementations of that API. The consumers of an interface expect a consistent, well-defined contract, so that they can use the functionality it provides without understanding the implementation details of any concrete class. This is called abstraction. Abstraction is everywhere in computer science and, in my view, is the foundation of the success of many modern computer systems.

Database systems provide the same kind of abstraction for their consumers: applications that need data persistence. The ISO SQL standard, in my view, serves as the "interface" between applications and database systems. However, its isolation levels are so poorly defined that consumers like us need to look into individual database implementations before we can make a wise judgement. That is not acceptable.

As a layman in the database realm (I am very interested in database systems and plan to learn more, but today I still consider myself to know little about them), here is what I have learned and what I can recommend:

First, look around and research different database implementations. Read their documentation. If the documentation doesn't clearly explain the isolation levels offered and how to configure them, that database may not be a good candidate for you.

Second, choose the strictest isolation level you can afford. As we mentioned, different implementations behave differently at relaxed isolation levels. Before choosing a relaxed level, be really sure about the anomalies it allows and their impact on your application.
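
For example (assuming PostgreSQL; the parameter name and syntax are vendor-specific, and mydb is a placeholder), you can make Serializable the default so that weaker levels must be opted into explicitly:

```sql
-- PostgreSQL: make SERIALIZABLE the default isolation level for one database,
-- so every new transaction gets the strictest guarantee unless it explicitly
-- requests a weaker one.
ALTER DATABASE mydb SET default_transaction_isolation TO 'serializable';
```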

While performance is definitely an important consideration, there are many ways to work around it. For example: (1) look for another database implementation that provides better performance under the same isolation guarantee, or (2) optimize your application code and database design following best practices.

What's next

The discussion in this article focuses only on traditional database systems. Today, we see a growing need for distributed database systems, where data is replicated across multiple database instances for high availability and scalability. We will look into isolation problems in distributed database systems in the next post. Stay tuned.

References

I highly recommend reading the original blog post by Prof. Daniel Abadi. In this article I tried to explain some of its ideas with pictures and selected some of its basic content, but the original covers much more; you should definitely read it if you want to learn more.
