Surviving Chaos

Controlling Traffic (`tc`)

While the ip tool gave us a way to fiddle with network links, it didn't give us much control over how traffic behaves once it's on them.

tc fixes that by letting you tinker to your heart's content with the quality and characteristics of a link. You can subtly corrupt, delay, reorder, or outright drop packets.

Be very careful using tc commands on a link you're SSH'd over. You could lock yourself out!
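One way to guard against a lockout is to schedule the rollback before applying the rule, so the link heals itself even if your session hangs. The following is a minimal sketch, not a hardened tool: the names `safe_netem`, `DEV`, `HOLD_SECONDS`, and `DRY_RUN` are made up for this example, and it defaults to printing the commands rather than running them.

```shell
#!/bin/sh
# Sketch: schedule the rule's removal first, then apply the rule.
# DEV, HOLD_SECONDS and DRY_RUN are hypothetical knobs for this example.
DEV="${DEV:-lo}"
HOLD_SECONDS="${HOLD_SECONDS:-30}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run them (needs root)

schedule_rollback() {
  cmd="sleep $HOLD_SECONDS; tc qdisc del dev $DEV root"
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ nohup sh -c '$cmd' &"
  else
    # Detached from the terminal so it still fires if the SSH session dies
    nohup sh -c "$cmd" >/dev/null 2>&1 &
  fi
}

apply_netem() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ tc qdisc add dev $DEV root netem $*"
  else
    tc qdisc add dev "$DEV" root netem "$@"
  fi
}

safe_netem() {
  # Usage: safe_netem <netem options...>
  schedule_rollback
  apply_netem "$@"
}
```

For example, `DEV=eth0 HOLD_SECONDS=60 DRY_RUN=0 sh -c '. ./safe_netem.sh; safe_netem delay 200ms'` (assuming you saved the sketch as `safe_netem.sh`) would delay traffic on eth0 and remove the rule a minute later.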

Causing a Delay on a Link:

tc qdisc add dev lo root netem delay 200ms
# qdisc: Queuing discipline
# dev: Device
# root: Attach at the root of the device's egress (outbound) queue
# netem: Network emulation
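To check that the delay took effect, you can ping the link and look at the average round-trip time. The helper below is a small sketch that assumes the summary line printed by the iputils version of `ping` (`rtt min/avg/max/mdev = ...`); `avg_rtt_ms` is a name invented for this example.

```shell
#!/bin/sh
# Sketch: pull the average RTT out of iputils ping output.
avg_rtt_ms() {
  # Reads ping output on stdin. The summary line looks like:
  #   rtt min/avg/max/mdev = a/b/c/d ms
  # Splitting on "/" puts the average in the fifth field.
  awk -F'/' '/^rtt / { print $5 }'
}
```

Run it as `ping -c 5 127.0.0.1 | avg_rtt_ms`. On the loopback device both the request and the reply leave through lo, so with the 200ms rule above you would expect roughly 400 ms rather than 200 ms.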

Show Rules on a Link:

tc qdisc show dev lo

Try it on a link with no rules and you'll see the kernel's default settings!

Delete Rule on a Link:

tc qdisc del dev lo root
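The delete command exits with an error when there is no rule to remove, which is awkward in scripts. A minimal sketch of a cleanup helper that treats "nothing to delete" as success (`reset_link` is a made-up name):

```shell
#!/bin/sh
# Sketch: remove whatever root qdisc rule is set, and don't fail if none is.
reset_link() {
  # Usage: reset_link <dev>
  # Errors are silenced on purpose: a link with no rule is already clean.
  tc qdisc del dev "$1" root 2>/dev/null || true
}
```

In an experiment script, `trap 'reset_link lo' EXIT` near the top makes sure the link is restored even when the script bails out early.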

Introduce Loss on a Link:

tc qdisc add dev lo root netem loss 10%

Try to avoid going over 15% packet loss; TCP starts to degrade seriously at that point.

Twins! Duplicate on a Link:

tc qdisc change dev lo root netem duplicate 1%
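Note that `change` only works when a netem rule is already attached to the link, and `add` only works when one isn't. `tc qdisc replace` covers both cases, and several netem options can be given in one command. A small sketch, with `set_netem` as a hypothetical wrapper name:

```shell
#!/bin/sh
# Sketch: set the netem rule on a link whether or not one already exists.
set_netem() {
  # Usage: set_netem <dev> <netem options...>
  dev="$1"; shift
  # replace creates the qdisc if it is missing and updates it otherwise
  tc qdisc replace dev "$dev" root netem "$@"
}
```

For example, `set_netem lo delay 100ms loss 5% duplicate 1%` applies all three impairments at once.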

Packets can also be corrupted:

tc qdisc change dev lo root netem corrupt 5%

Note that TCP has a checksum built in, so a corrupted segment is usually discarded by the receiver and retransmitted by the sender. Most user-level applications never see this.
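Although applications don't see the retransmits, the kernel counts them. On Linux the TCP counters live in `/proc/net/snmp` as a header line followed by a value line. The sketch below reads one counter by name; `tcp_counter` is a name invented for this example, and which counters are present depends on your kernel.

```shell
#!/bin/sh
# Sketch: print one TCP counter (e.g. RetransSegs) from /proc/net/snmp.
tcp_counter() {
  # Usage: tcp_counter <name> [snmp-file]
  awk -v name="$1" '
    # First "Tcp:" line is the header: find the column for the counter
    $1 == "Tcp:" && !header_seen {
      for (i = 2; i <= NF; i++) if ($i == name) col = i
      header_seen = 1
      next
    }
    # Second "Tcp:" line holds the values
    $1 == "Tcp:" && col { print $col; exit }
  ' "${2:-/proc/net/snmp}"
}
```

Run `tcp_counter RetransSegs` before and after pushing some traffic through a corrupted link and compare the two numbers; you would expect the count to climb.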

Exercises

  • See how much packet loss you can introduce before a connection between a database and a REPL (e.g. psql) starts failing.

  • If you introduce, say, 2% corruption on the link to a web server, does it still work? Are the pages reliably what they should be? Why? (Hint: Checksums)
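If you'd rather not step through loss levels by hand, a loop can do it for you. This is a rough harness under a few assumptions: `sweep_loss` and `DEV` are names made up here, the probe is any shell command that exits non-zero on failure, and it needs root to run for real.

```shell
#!/bin/sh
# Sketch: try a probe command at increasing loss levels, then clean up.
DEV="${DEV:-lo}"

sweep_loss() {
  # Usage: sweep_loss "<probe command>" <loss%> [<loss%> ...]
  probe="$1"; shift
  for pct in "$@"; do
    # replace creates the netem rule if missing, or updates it if present
    tc qdisc replace dev "$DEV" root netem loss "${pct}%" || return 1
    if sh -c "$probe" >/dev/null 2>&1; then
      echo "loss ${pct}%: ok"
    else
      echo "loss ${pct}%: FAILED"
    fi
  done
  # Leave the link clean when the sweep finishes
  tc qdisc del dev "$DEV" root 2>/dev/null || true
}
```

For the first exercise, something like `sweep_loss "timeout 10 psql -h 127.0.0.1 -c 'select 1'" 5 10 20 40` would do, assuming a database listening on localhost. The `timeout` wrapper keeps a stalled connection from hanging the sweep.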


Last updated 6 years ago