Infrastructure Familiarization

Setting a Baseline

Before we start introducing chaos let's take a bit of time to ensure we are meeting a baseline of:

  • Service failures can be detected.

  • Failed services should restart automatically.

  • Logs are collected. (preferably we can collect all debug logs as well)

  • Metrics are collected.

  • There is some way to view system wide logs and metrics. (Eg Kibana/Grafana)

If you're already doing these things feel free to skip the pages related to them.

Understanding Your System

Try to collect some information about each component in the system:

  • What is the minimum number of nodes fulfilling this role that need to be in a healthy state in order to function? (For example, a raft group requires a majority online)

  • Is the component stateful? If so, where is state saved? Which data is required by the system to not be lost/corrupted?

  • Is the component mission critical? If it entirely fails should the system still work? (albeit in a degraded state)

  • Are there any metrics you should be monitoring on this component during your testing? (Eg QPS on a SQL database)

Additionally, from the perspective of the entire system:

  • What are the main metrics of the system?

  • What is a realistic workload for the system? How can you reproduce this?

  • At what point do you expect the system to fail? (Eg if you switch off the SQL databases entirely the system should break)

  • What is the average cluster topology and size of the system? What is the breaking point of this system? (Eg the majority of nodes of each component is alive)

Last updated