Kinds of Failure
Computers in general are known to fail in spectacular and interesting ways. Some bugs are just a simple mistake made while writing code; others are much more complex.
Let's take a moment to review various kinds of failure.
Many distributed systems ultimately rely on some form of physical storage (SSDs, NVMe) to keep their data safe. Like all hardware, these devices can fail due to age, wear, or misuse.
A file that was expected was not present. (open fails)
A file that was not expected was present. (create fails)
A file was removed after being opened. (read / write fails)
A file contains data that is invalid to the reader. (encoding mismatch, missing/extra data)
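To make these concrete, here is a minimal Rust sketch of code that treats each of these as a distinct, recoverable case rather than a crash. The load_state/create_state functions and the four-byte header check are made up for illustration; they are not from any particular system.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, ErrorKind, Read};

// Hypothetical loader: each failure mode above maps to a distinct outcome.
fn load_state(path: &str) -> io::Result<Option<Vec<u8>>> {
    let mut file = match File::open(path) {
        Ok(f) => f,
        // A file that was expected was not present: open fails with NotFound.
        Err(e) if e.kind() == ErrorKind::NotFound => return Ok(None),
        Err(e) => return Err(e),
    };

    // A file removed (or truncated) after being opened surfaces here as a read error.
    let mut buf = Vec::new();
    file.read_to_end(&mut buf)?;

    // A file containing data that is invalid to the reader: validate before trusting it.
    if buf.len() < 4 || &buf[0..4] != b"STAT" {
        return Err(io::Error::new(ErrorKind::InvalidData, "unrecognized header"));
    }
    Ok(Some(buf))
}

// Hypothetical writer: a file that was *not* expected to exist makes create fail,
// instead of silently clobbering it.
fn create_state(path: &str) -> io::Result<File> {
    OpenOptions::new().write(true).create_new(true).open(path)
}

fn main() -> io::Result<()> {
    match load_state("node.state")? {
        Some(bytes) => println!("loaded {} bytes", bytes.len()),
        None => {
            create_state("node.state")?;
            println!("initialized fresh state");
        }
    }
    Ok(())
}
```

Treating NotFound as "start fresh" while bubbling everything else up keeps the unexpected cases loud instead of papering over them.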
Our nodes would exist in an isolated state if not for networks. A request to an external API might pass through hundreds of cables, routers, and switches, all of which (including the API itself) can fail in various ways.
Without a reliable network connection, a node can be in the dark about the world. Perhaps it can reach only some of the nodes it expects, perhaps it can reach none.
One node becomes isolated from the rest.
A partition isolates two (or more) distinct node groups.
Two particular nodes can no longer communicate.
Malformed (or outright hostile) requests
Increased probability of packet corruption (forcing re-transmits)
The network becomes intolerably slow at some or all links.
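One common defense is to bound every network call so that a slow or partitioned link fails loudly instead of hanging forever. Below is a minimal Rust sketch; the peer address, the PING/PONG exchange, and the 500 ms timeouts are assumptions for illustration.

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

fn ping_peer(addr: SocketAddr) -> std::io::Result<Vec<u8>> {
    // An unreachable peer (one node isolated, or a partition between us and it)
    // shows up here as a connect error or a timeout.
    let mut stream = TcpStream::connect_timeout(&addr, Duration::from_millis(500))?;
    // An intolerably slow link should not stall us forever either.
    stream.set_read_timeout(Some(Duration::from_millis(500)))?;
    stream.set_write_timeout(Some(Duration::from_millis(500)))?;

    stream.write_all(b"PING\n")?;
    let mut response = vec![0u8; 64];
    let n = stream.read(&mut response)?;
    response.truncate(n);

    // A malformed (or outright hostile) reply must be validated before it is trusted.
    if !response.starts_with(b"PONG") {
        return Err(std::io::Error::new(
            std::io::ErrorKind::InvalidData,
            "unexpected reply from peer",
        ));
    }
    Ok(response)
}

fn main() {
    let addr: SocketAddr = "127.0.0.1:9999".parse().expect("valid address");
    match ping_peer(addr) {
        Ok(reply) => println!("peer is healthy: {:?}", reply),
        // Isolation, partitions, and slow links all end up here.
        Err(e) => eprintln!("peer unreachable or unhealthy: {}", e),
    }
}
```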
Events such as a network request, a file read, or even a thread context switch go through the scheduler. Depending on the state of the greater system(s), these events can be scheduled in any number of different orders.
Even in a single-machine (but multi-threaded) system, changing this schedule in subtle ways can expose interesting errors.
The system expects to have events ABC in order and gets ACB instead.
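One way to cope is to stamp events with sequence numbers and park anything that arrives early, so an ACB arrival is still applied as ABC. A minimal Rust sketch, where the Event type and the sequence numbering are assumptions for illustration rather than anything from a real system:

```rust
use std::collections::BTreeMap;

#[derive(Debug)]
struct Event {
    seq: u64,
    payload: String,
}

struct OrderedConsumer {
    next_seq: u64,
    // Events that arrived "early" are parked until their turn comes.
    pending: BTreeMap<u64, Event>,
}

impl OrderedConsumer {
    fn new() -> Self {
        Self { next_seq: 0, pending: BTreeMap::new() }
    }

    fn accept(&mut self, event: Event) {
        self.pending.insert(event.seq, event);
        // Apply every event we can in order; stop at the first gap.
        while let Some(event) = self.pending.remove(&self.next_seq) {
            println!("applying {:?}", event);
            self.next_seq += 1;
        }
    }
}

fn main() {
    let mut consumer = OrderedConsumer::new();
    // The system expected A (0), B (1), C (2) but the schedule produced A, C, B.
    consumer.accept(Event { seq: 0, payload: "A".into() });
    consumer.accept(Event { seq: 2, payload: "C".into() });
    consumer.accept(Event { seq: 1, payload: "B".into() });
}
```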
Occasionally power grids (or supplies) hiccup or fail. This is different from simple network isolation, as the node may be in an unexpected state when it returns. (What if the filesystem has errors now? What if the system had to do a rollback?)
A machine loses power and is replaced some time later (minutes, hours) after being repaired. (Try to find the worst cases possible for this, eg in the middle of a two-phase commit.)
A machine reboots, disappearing and reappearing a minute later.
A machine with persistent data reboots, and returns with corrupted data.
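A common mitigation is to never overwrite persistent data in place: write a new copy, flush it to disk, then atomically rename it over the old one, so a crash leaves either the old file or the new file intact. A minimal Rust sketch of this pattern, where the path handling and the tmp suffix are assumptions:

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

fn persist_atomically(path: &Path, data: &[u8]) -> std::io::Result<()> {
    let tmp_path = path.with_extension("tmp");

    let mut tmp = File::create(&tmp_path)?;
    tmp.write_all(data)?;
    // Force the bytes out of the OS cache; without this, a power cut can leave
    // the rename durable but the contents missing or corrupt.
    tmp.sync_all()?;

    // On most platforms, rename within a single filesystem is atomic, so a crash
    // leaves either the old complete file or the new complete one.
    fs::rename(&tmp_path, path)?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    persist_atomically(Path::new("node.state"), b"counter=42")?;
    Ok(())
}
```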
Byzantine Fault Tolerance is described as the following:
Trying to get a sufficiently large number of nodes to agree on something while a small number of bad actors subvert the system.
A node starts sending messages it shouldn't be, under the influence of a bad actor.
A node is under the wrong impression about the state of the greater system due to some bug, and sends messages it shouldn't. (Eg "I'm the master replica now, listen to me!")
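The usual counter-measure is to require agreement from enough replicas that the bad (or confused) actors cannot forge a majority. The sketch below uses the classic bound of n >= 3f + 1 replicas with 2f + 1 matching replies; the reply format and node names are made up for illustration.

```rust
use std::collections::HashMap;

// Tally replies from replicas and only trust a value once enough of them agree
// that up to `f` byzantine replies could not have forged the result.
fn byzantine_quorum(replies: &[(String, String)], f: usize) -> Option<String> {
    // replies are (node_id, claimed_value) pairs.
    let mut tally: HashMap<&str, usize> = HashMap::new();
    for (_node, value) in replies {
        *tally.entry(value.as_str()).or_insert(0) += 1;
    }
    tally
        .into_iter()
        .find(|(_, count)| *count >= 2 * f + 1)
        .map(|(value, _)| value.to_string())
}

fn main() {
    // Four replicas (n = 3f + 1 with f = 1); one is lying about the state.
    let replies = vec![
        ("a".to_string(), "ledger=42".to_string()),
        ("b".to_string(), "ledger=42".to_string()),
        ("c".to_string(), "ledger=42".to_string()),
        ("d".to_string(), "ledger=999".to_string()), // bad actor
    ];
    // With f = 1 we need 3 matching replies, so the honest value wins.
    println!("{:?}", byzantine_quorum(&replies, 1));
}
```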
In many distributed systems cross-data center deployments are used to help protect against regional failures. This is mostly just the "Big sister" of the above network and disk failures.
Data center(s) lose connectivity (Eg an undersea fiber is cut)
Data center is lost entirely and a new one needs to be bootstrapped.
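A simple guard against losing a whole data center is to refuse to acknowledge a write until its replicas span more than one region. A minimal Rust sketch; the region names and the two-region rule are assumptions for illustration.

```rust
use std::collections::HashSet;

// Count how many distinct regions have acknowledged a write; treat it as durable
// against the loss of one data center only once that count is high enough.
fn write_is_durable(acked_regions: &[String], min_regions: usize) -> bool {
    let distinct: HashSet<&str> = acked_regions.iter().map(|r| r.as_str()).collect();
    distinct.len() >= min_regions
}

fn main() {
    // Three replicas acknowledged, spread over two regions.
    let acked = vec![
        "eu-west".to_string(),
        "eu-west".to_string(),
        "us-east".to_string(),
    ];
    assert!(write_is_durable(&acked, 2));
    println!("write survives the loss of a single data center");
}
```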
Byzantine fault tolerance is not relevant to most distributed systems, and it introduces a significant amount of system complexity. Notable exceptions are things like blockchains.