Surviving Chaos

Monitoring and Logging


Last updated 6 years ago

Chaos engineering relies heavily on being able to observe and analyze the state of the system as a whole, not just one component at a time.

In general, you should be able to answer the following questions about your system at any given point in time:

  • How fast is it running?

  • How many errors are occurring?

  • What was logged on a particular node for a particular time span, even if that machine is now destroyed?

  • How many, and which, nodes are presently failing?
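As a rough illustration, the four questions above can be phrased as queries over a central store of events. The sketch below is only that: the `Event` record, the sample data, the node names, and the "no event within a timeout means failing" rule are all assumptions made for the example, not part of any particular monitoring stack. The point is that the events live somewhere other than the node that produced them, so they stay queryable after the node is destroyed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    """One centrally stored log record (hypothetical schema)."""
    node: str
    time: datetime
    level: str  # "INFO" or "ERROR"
    message: str


def events_per_second(events, start, end):
    """How fast is it running? Events per second within [start, end)."""
    count = sum(1 for e in events if start <= e.time < end)
    return count / (end - start).total_seconds()


def error_count(events, start, end):
    """How many errors are occurring within [start, end)?"""
    return sum(1 for e in events
               if e.level == "ERROR" and start <= e.time < end)


def logs_for_node(events, node, start, end):
    """What did a node log in [start, end)? Works even if the node is gone,
    because the events are read from the central store, not from the node."""
    return [e for e in events if e.node == node and start <= e.time < end]


def failing_nodes(events, known_nodes, now, timeout):
    """Which nodes are presently failing? Assumed rule for this sketch:
    a node is failing if it has logged nothing within `timeout` of `now`."""
    last_seen = {}
    for e in events:
        if e.node not in last_seen or e.time > last_seen[e.node]:
            last_seen[e.node] = e.time
    return sorted(n for n in known_nodes
                  if n not in last_seen or now - last_seen[n] > timeout)


# Hypothetical sample data.
T0 = datetime(2020, 1, 1, 12, 0, 0)
EVENTS = [
    Event("node-a", T0 + timedelta(seconds=1), "INFO", "request served"),
    Event("node-b", T0 + timedelta(seconds=2), "ERROR", "upstream timeout"),
    Event("node-a", T0 + timedelta(seconds=8), "INFO", "request served"),
    Event("node-c", T0 + timedelta(seconds=9), "INFO", "request served"),
]

if __name__ == "__main__":
    end = T0 + timedelta(seconds=10)
    print(events_per_second(EVENTS, T0, end))
    print(error_count(EVENTS, T0, end))
    print(logs_for_node(EVENTS, "node-b", T0, end))
    print(failing_nodes(EVENTS, ["node-a", "node-b", "node-c"],
                        now=end, timeout=timedelta(seconds=5)))
```

In a real deployment these queries would be answered by your metrics and logging systems rather than by hand-written functions. If you cannot express them against what you have, that is a gap worth closing before you start breaking things.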

If you do not have such infrastructure in place, or are not familiar with the infrastructure you have, this is the perfect opportunity to build it or to get to know it.

One common way to do this is to deploy with tools such as Prometheus, Grafana, and Kibana. It will take some time and work to get this infrastructure in place, and even more to correctly configure your dashboards.

It will be worth it.
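To give a sense of what a metrics collector such as Prometheus reads from a service, here is a minimal sketch that renders a counter in the Prometheus text exposition format using only the standard library. The metric name, help text, and labels are hypothetical, and a real service would normally use an official client library instead of formatting this by hand.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline inside a label value."""
    return (str(value)
            .replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n"))


def render_counter(name, help_text, samples):
    """Render one counter metric as Prometheus exposition text.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"'
            for key, val in sorted(labels.items())
        )
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric: requests served, split by node and status code.
    print(render_counter(
        "http_requests_total",
        "Total HTTP requests.",
        [
            ({"node": "node-a", "status": "200"}, 1027),
            ({"node": "node-a", "status": "500"}, 3),
        ],
    ), end="")
```

Labelling every sample with the node it came from is what lets a dashboard answer "how many, and which, nodes are failing" rather than only "is something failing".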