Exploring Errors in Database Systems in Light of the CAP Theorem

There is growing interest in the CAP theorem among builders of database management applications that span multiple processing sites. The CAP theorem concerns three properties desired in any such DBMS application:

C for Consistency

The goal of consistency is to give multisite transactions all-or-nothing semantics, which commercial database management systems already support. In addition, if replicas are supported, one wants them to have a consistent state.

A for Availability

Availability means that the database is always up and running. In other words, when a failure occurs, the database should keep functioning by switching over to a replica as needed. This capability was popularized by Tandem Computers decades ago and is still widely used.

P for Partition Tolerance

When a system or network failure splits the processing nodes into groups that cannot communicate with each other, partition tolerance means that processing can still continue independently in each subgroup.

In fact, the CAP theorem says that these three goals, C, A, and P, cannot all be achieved simultaneously when errors occur. So, even in commercial databases, one of these objectives has to be given up.

In the NoSQL database community, developers and admins use the CAP theorem as the justification for compromising C, consistency. Since most NoSQL database systems disallow transactions that cross node boundaries, consistency applies only to replicas. The CAP theorem is therefore put forward as a justification for giving up consistent replicas and replacing that goal with eventual consistency. Under this relaxed notion, the only guarantee is that all replicas will eventually converge to the same state, once network connectivity has been re-established and enough time has passed for replica cleanup. The justification is that consistency is compromised in order to preserve availability and partition tolerance.
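
To make the idea of eventual consistency concrete, here is a minimal sketch in Python. It assumes a last-writer-wins rule keyed on a timestamp, which is only one of several possible reconciliation policies. The `Replica` class and the `anti_entropy` function are hypothetical names made up for this illustration, not the API of any particular NoSQL system.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical replica storing key -> (timestamp, value)."""
    name: str
    data: dict = field(default_factory=dict)

    def write(self, key, value, timestamp):
        # Accept the write locally, even while cut off from other replicas.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def anti_entropy(a, b):
    """Reconcile two replicas after the partition heals (last writer wins)."""
    for key in set(a.data) | set(b.data):
        candidates = [r.data[key] for r in (a, b) if key in r.data]
        winner = max(candidates)  # the entry with the highest timestamp
        a.data[key] = winner
        b.data[key] = winner


east = Replica("east")
west = Replica("west")

# During a partition, both sides stay available and accept conflicting writes.
east.write("balance", 100, timestamp=1)
west.write("balance", 80, timestamp=2)
diverged = east.read("balance") != west.read("balance")

# Once connectivity returns, replica cleanup makes the two sides converge.
anti_entropy(east, west)
converged = east.read("balance") == west.read("balance")
```

Note that the write made on `east` is silently discarded during cleanup. The replicas do converge, but a reader on `east` saw a value that the final state never reflects, which is exactly the consistency being traded away.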

In this article, we explore this analysis and discuss further dimensions to consider in terms of recovering data after errors. We assume a standard hardware model in which local storage and data processing nodes are set up as a cluster on a LAN, and such clusters are in turn connected to each other over a WAN. You can get a better insight into this from providers like RemoteDBA.com. To discuss this topic, we have to start with the causes of database errors. Here is a list of such errors, though it is not comprehensive.

Database errors

Application errors

In these types of errors, the application performs one or more incorrect updates. Typically, these go unnoticed for several minutes or even many hours. The database then has to be rolled back to a point before the offending transaction, and the subsequent valid activity has to be redone.
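
The recovery step described above can be sketched as replaying a transaction log on top of an earlier snapshot while leaving out the offending transaction. This is a simplified illustration in Python: the log format of `(txn_id, key, delta)` tuples and the `replay` function are assumptions made for the example, and it ignores the harder problem of later transactions that depended on the bad update.

```python
def replay(snapshot, log, skip_ids=frozenset()):
    """Rebuild state from a snapshot by replaying logged transactions.

    snapshot: dict of key -> value, taken before the offending transaction
    log:      list of (txn_id, key, delta) tuples in commit order
              (a hypothetical log format chosen for this sketch)
    skip_ids: ids of transactions identified as incorrect
    """
    state = dict(snapshot)
    for txn_id, key, delta in log:
        if txn_id in skip_ids:
            continue  # leave out the offending update
        state[key] = state.get(key, 0) + delta
    return state


snapshot = {"stock": 10}
log = [
    (1, "stock", -2),
    (2, "stock", +500),  # incorrect update made by a buggy application
    (3, "stock", -1),
]

corrupted = replay(snapshot, log)               # what the database holds now
repaired = replay(snapshot, log, skip_ids={2})  # state after recovery
```

No replica helps in this situation: every replica would have applied transaction 2 faithfully and would hold the same wrong state.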

Repeatable database management system errors

In this case, the database management system crashes on a given node. Running the same transaction on another processing node that holds a replica will cause that backup to crash as well. These errors are also called Bohrbugs.

Unrepeatable database management system errors

This case differs slightly from repeatable DBMS errors. The database crashes here as well, but a replica is likely to remain fine. Most of the time, these errors are caused by odd corner cases involving asynchronous database operations. They are also known as Heisenbugs.
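
The difference between repeatable and unrepeatable errors shows up when failing over to a replica. The following Python sketch simulates it with two hypothetical node functions, `primary` and `replica`, and made-up transaction names. The crash conditions are hard-coded purely for illustration, since a real Heisenbug cannot be triggered on demand.

```python
class NodeCrash(Exception):
    """Raised when a simulated node crashes while running a transaction."""


def primary(txn):
    # Hypothetical primary: crashes on the deterministic bug and also on
    # a transient fault that only happens to hit this node.
    if txn in ("bohrbug", "heisenbug"):
        raise NodeCrash(txn)
    return f"primary ran {txn}"


def replica(txn):
    # Hypothetical replica: the deterministic bug crashes it too,
    # but the transient fault does not recur here.
    if txn == "bohrbug":
        raise NodeCrash(txn)
    return f"replica ran {txn}"


def run_with_failover(txn, nodes):
    """Try txn on each node in order; return (result, names of crashed nodes)."""
    crashed = []
    for node in nodes:
        try:
            return node(txn), crashed
        except NodeCrash:
            crashed.append(node.__name__)
    return None, crashed


heisen_result, heisen_crashed = run_with_failover("heisenbug", [primary, replica])
bohr_result, bohr_crashed = run_with_failover("bohrbug", [primary, replica])
```

In this sketch the unrepeatable error is absorbed by the replica, while the repeatable one takes down every node that tries the transaction.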

OS Errors

In this case, the operating system crashes on a given node, typically showing nothing more than a blue screen.

Hardware failure at local clusters

Hardware failures range from memory faults to complete disk failures. Typically, they cause a panic stop at the operating system level or the database management system level. Sometimes, however, they show up as Heisenbugs.

Network partitioning at local clusters

The local area network of a cluster may fail, partitioning the cluster so that some nodes can no longer communicate with the others.

A disaster

In this case, a local cluster is wiped out entirely by a natural disaster such as an earthquake, hurricane, or flood. The cluster no longer exists.

Networking failure in the WAN connecting the clusters together

In this case, the WAN fails, and the clusters can no longer communicate with each other.

The first two types of errors, application errors and repeatable DBMS errors, cause problems for any high-availability scheme. In these cases there is no way to keep going, so availability is unachievable. Replica consistency also becomes meaningless, because the current state of the DBMS is simply wrong. In the case of a disaster, the data is recoverable only if each local transaction is committed after an assurance that another WAN-connected cluster has received it.
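
The commit rule for surviving a disaster can be sketched as follows in Python. The `LocalCluster` and `RemoteCluster` classes are hypothetical stand-ins invented for this example. A real system would forward the transaction over the WAN and deal with timeouts and retries, all of which is omitted here.

```python
class RemoteCluster:
    """Hypothetical stand-in for a second cluster reached over the WAN."""

    def __init__(self):
        self.reachable = True
        self.received = []

    def forward(self, txn):
        if not self.reachable:
            raise ConnectionError("remote cluster unreachable")
        self.received.append(txn)
        return True  # acknowledgement that the transaction was received


class LocalCluster:
    def __init__(self, remote):
        self.remote = remote
        self.committed = []

    def commit(self, txn):
        # Commit locally only after the remote cluster has acknowledged
        # the transaction, so a local disaster cannot lose committed work.
        try:
            self.remote.forward(txn)
        except ConnectionError:
            return False  # no assurance, so the commit is refused
        self.committed.append(txn)
        return True


remote = RemoteCluster()
local = LocalCluster(remote)

ok = local.commit("t1")        # acknowledged remotely, then committed

remote.reachable = False       # simulate a WAN failure
blocked = local.commit("t2")   # cannot be assured, so it is not committed
```

Every commit in this scheme pays a round trip over the WAN, and commits stop entirely while the remote cluster is unreachable.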

Some application builders may accept this latency. Where it is not accepted, even eventual consistency cannot be guaranteed, because a transaction may be lost completely if the disaster strikes the local cluster before the transaction has been forwarded elsewhere. So the first two error types and natural disasters are examples of situations where the CAP theorem does not apply. Any real-world database must be prepared to handle recovery in such cases.

LAN partition errors are very rare, especially if the LAN is replicated. The majority of local failures take down only a single node, which is a degenerate case of a network partition that many algorithms survive easily. So, in such cases it is better to give up P than to compromise on C. Finally, consider the networking failure error, a partition in the WAN. In such an instance, the majority of the network can continue with straightforward algorithms, and only a very small portion has to block. It therefore seems unwise to sacrifice consistency in order to gain availability for that small portion.
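
One straightforward rule of the kind mentioned above is a strict-majority check: a group of nodes keeps processing only if it can reach more than half of the cluster. The Python sketch below is an illustration under that assumption. The function name `can_proceed` and the node names are made up, and the article does not prescribe this particular algorithm.

```python
def can_proceed(reachable_nodes, total_nodes):
    """Return True if this side of a partition holds a strict majority."""
    return len(reachable_nodes) > total_nodes // 2


all_nodes = {"n1", "n2", "n3", "n4", "n5"}

# A single failed or isolated node is a degenerate partition: 4 nodes vs 1.
majority_side = {"n1", "n2", "n3", "n4"}
minority_side = all_nodes - majority_side

majority_continues = can_proceed(majority_side, len(all_nodes))
minority_continues = can_proceed(minority_side, len(all_nodes))

# With an even split, neither side has a strict majority and both block.
even_split_continues = can_proceed({"a", "b"}, 4)
```

Because at most one side can ever hold a strict majority, the two sides can never both accept writes, so consistency is kept and only the smaller group loses availability.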

So, one should not give up C too quickly, as there are many real-world errors to which the CAP theorem does not apply. Invoking the theorem to justify dropping consistency may be a bad tradeoff in many failure situations.