Infrastructure Familiarization

Setting a Baseline

Before we start introducing chaos, let's take a bit of time to ensure we meet a baseline:

  • Service failures can be detected.

  • Failed services restart automatically. (See the sketch below for a quick way to verify this.)

  • Logs are collected (preferably including debug logs).

  • Metrics are collected.

  • There is some way to view system-wide logs and metrics (e.g. Kibana/Grafana).

If you're already doing these things, feel free to skip the pages related to them.
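
As a minimal check of the "automatic restart" point, here's a sketch assuming a hypothetical systemd-managed service named `myservice` (substitute whichever supervisor and service name you actually run):

```bash
# Kill the service's main process, then confirm systemd brought it back.
old_pid=$(systemctl show --value -p MainPID myservice)
sudo kill -9 "$old_pid"
sleep 5
new_pid=$(systemctl show --value -p MainPID myservice)
if [ "$new_pid" != "0" ] && [ "$new_pid" != "$old_pid" ]; then
  echo "myservice restarted automatically (PID $old_pid -> $new_pid)"
else
  echo "myservice did NOT come back on its own"
fi
```

If it doesn't come back, check the unit's `Restart=` setting (e.g. `Restart=on-failure`) before going any further.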

Understanding Your System

Try to collect some information about each component in the system:

  • What is the minimum number of nodes fulfilling this role that must be healthy for it to function? (For example, a Raft group requires a majority online.)

  • Is the component stateful? If so, where is its state saved? Which data must not be lost or corrupted for the system to keep working?

  • Is the component mission-critical? If it fails entirely, should the system still work (albeit in a degraded state)?

  • Are there any metrics you should be monitoring on this component during your testing? (E.g. QPS on a SQL database; see the sketch below.)
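
For the QPS example, one rough way to sample it is to read MySQL's `Questions` status counter twice and take the difference. This sketch assumes a reachable MySQL instance with credentials available to the `mysql` client; other databases expose equivalent counters:

```bash
# Estimate queries-per-second by diffing the Questions counter over 10 seconds.
q1=$(mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'Questions';" | awk '{print $2}')
sleep 10
q2=$(mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'Questions';" | awk '{print $2}')
echo "Approximate QPS: $(( (q2 - q1) / 10 ))"
```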

Additionally, from the perspective of the entire system:

  • What are the main metrics of the system?

  • What is a realistic workload for the system? How can you reproduce it? (A crude sketch follows this list.)

  • At what point do you expect the system to fail? (E.g. if you switch off the SQL databases entirely, the system should break.)

  • What is the typical cluster topology and size of the system? What is its breaking point? (E.g. losing the majority of nodes of any component.)
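
A realistic workload generator is usually system-specific (sysbench for SQL, wrk for HTTP, your own client, and so on), but even a crude loop gives you a steady signal to watch while you inject failures. This sketch assumes a hypothetical HTTP endpoint at `http://localhost:8080/health`:

```bash
# Hit a health endpoint at roughly 10 requests/second and print status codes,
# so dips or errors are visible the moment chaos is introduced.
while true; do
  curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/health
  sleep 0.1
done
```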
