Cassandra Failure Scenarios Plan

The goal of this plan is to list the failure scenarios that should be tested and verified before the cluster is considered production ready.

Test failure scenarios:

  • Spin up DC and let it stream data (check impact of streaming on latency)
    • Inter DC streaming
    • Time of completion
    • Cross-region bandwidth
  • Stop node. Bring up with 1 hour hint window
  • Stop node. Bring up with 3 hour hint window
  • Stop node. Bring up with 6 hour hint window
  • Stop node. Bring up outside of hint window (1, 3, 6 hour). Full repair
  • Kill node and bring up a replacement node with the same IP within the hinted handoff period.
  • Kill node and bring up a replacement node with the same IP outside of the hinted handoff period.
  • Kill node. Copy last snapshots. Bring new node with same IP. Full repair.
  • Kill node and bring up a new node with a new IP in place of the old node. Let it pick up the data.
  • Stop AZ. See performance impact
  • Kill AZ and recover. See performance impact
  • Kill region, see impact
  • Kill link between two VPCs and see impact, i.e. split brain
  • Add node to cluster.
  • Decommission node.
  • Split brain: WAN breaks. What is the effect on time, hints and repairs?
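
The hint-window scenarios above all turn on one comparison: was the node down for less or more than the configured hint window? The sketch below is a minimal illustration of that check, not part of the plan itself. It assumes that a downtime within the window should be covered by hint replay, and that anything longer needs a full repair, as the list states. The function names and the sample downtimes (30 minutes and 7 hours) are hypothetical. In cassandra.yaml the window is normally set through max_hint_window_in_ms (max_hint_window in newer releases), so check the setting name against the version under test.

```python
from datetime import timedelta

# Hint windows named in the plan: 1, 3 and 6 hours.
HINT_WINDOWS = [timedelta(hours=h) for h in (1, 3, 6)]


def expected_recovery(downtime: timedelta, hint_window: timedelta) -> str:
    """Return the recovery path a test run should expect.

    Assumption: hints cover the outage only while the node has been down
    for no longer than the hint window; past that, a full repair is needed.
    """
    if downtime <= hint_window:
        return "hint replay"
    return "full repair"


def scenario_matrix(downtimes):
    """Build (hint_window_hours, downtime_minutes, expected_recovery) rows."""
    rows = []
    for window in HINT_WINDOWS:
        for downtime in downtimes:
            rows.append((
                int(window.total_seconds() // 3600),
                int(downtime.total_seconds() // 60),
                expected_recovery(downtime, window),
            ))
    return rows


if __name__ == "__main__":
    # Hypothetical downtimes: one inside every window, one outside every window.
    for row in scenario_matrix([timedelta(minutes=30), timedelta(hours=7)]):
        print(row)
```

Each row can serve as a checklist entry for a test run: stop the node for the given downtime under the given window, then verify that the observed recovery matches the expected one.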