Cassandra Failure Scenarios Plan
The goal of this plan is to list the failure scenarios that must be tested and verified before the cluster is considered production ready.
Test failure scenarios:
- Spin up DC and let it stream data (check impact of streaming on latency)
- Inter DC streaming
- Time of completion
- Cross-region bandwidth
- Stop node. Bring up with 1 hour hint window
- Stop node. Bring up with 3 hour hint window
- Stop node. Bring up with 6 hour hint window
- Stop node. Bring up outside of hint window (1, 3, 6 hour). Full repair
- Kill node and bring up a replacement node with the same IP within the hinted handoff period.
- Kill node and bring up a replacement node with the same IP outside of the hinted handoff period.
- Kill node. Copy last snapshots. Bring new node with same IP. Full repair.
- Kill node and bring up a new node with a new IP in place of the old node; let it pick up data.
- Stop AZ. See performance impact
- Kill AZ and recover. See performance impact.
- Kill region, see impact
- Kill link between two VPCs and see impact, i.e. split brain
- Add node to cluster.
- Decommission node.
- Split brain: WAN breaks. What is the effect on time, hints and repairs?
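
The stop-node scenarios above (1, 3 and 6 hour hint windows, each tested inside and outside the window) can be written out as a small test matrix. The following is a minimal sketch, not part of the original plan: the names `StopNodeCase` and `build_cases` and the half-hour margin are hypothetical, and the hint window is assumed to correspond to whatever value the cluster's hint window setting (Cassandra's `max_hint_window` family of options) is configured to for that run.

```python
from dataclasses import dataclass

# Hint windows listed in the plan, in hours.
HINT_WINDOWS_HOURS = (1, 3, 6)


@dataclass(frozen=True)
class StopNodeCase:
    """One stop-node scenario: a configured hint window and an outage length."""
    hint_window_hours: float
    downtime_hours: float

    @property
    def recovery(self) -> str:
        # Hints are only kept for a down node while the outage is shorter
        # than the hint window. Past that point the node must be repaired.
        # The exact boundary is treated conservatively as "full repair".
        if self.downtime_hours < self.hint_window_hours:
            return "hint replay"
        return "full repair"


def build_cases(margin_hours: float = 0.5) -> list[StopNodeCase]:
    """Build one case inside and one case outside each hint window.

    margin_hours is an assumed safety margin around the window edge,
    chosen only for illustration.
    """
    cases = []
    for window in HINT_WINDOWS_HOURS:
        cases.append(StopNodeCase(window, window - margin_hours))  # inside window
        cases.append(StopNodeCase(window, window + margin_hours))  # outside window
    return cases


if __name__ == "__main__":
    for case in build_cases():
        print(
            f"window={case.hint_window_hours}h "
            f"downtime={case.downtime_hours}h -> {case.recovery}"
        )
```

Each generated case pairs a downtime with the recovery path the plan expects, so a test run can record whether the observed behaviour (hint replay alone, or a full repair) matches.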