HIGH PERFORMANCE TUNING FOR AD SERVING INDUSTRY
Client: Globally recognized monetization technology platform in the ad serving space
Project Duration: 9 months
Goal: Help the client define a low-latency SLA and tune an application stack that handles in excess of 20K writes and 10K reads per second
Tech: Java, Apache Cassandra, ELK, Grafana, Riemann, InfluxDB, Python, AWS, Ansible
The client stores analytics data for the ads they serve. The data arrives in huge volumes and at high velocity. The main challenge of this engagement was to provide guarantees at high nines (99.999% of requests must complete under a certain threshold). At five nines, even the smallest network glitch can be enough to miss the target. The first challenge was to prove that the requirements could be met by delivering a proof of concept in a controlled environment (we needed to build a load generator that simulates production load, and then meet the SLA in that environment). The second challenge was to bring this proof of concept to production by making it robust (we added automation, monitoring and alerting).
When we joined the project, the proof of concept was already in place, but it did not produce satisfactory results. We decided that we first needed a reliable monitoring stack, so that we could see the effect of each change. We added the ELK stack and plotted the metrics we cared about in Grafana. We built the pieces that were missing (such as OS metrics and slow query monitoring). We started with Cassandra and tuned the stack bottom-up, moving to the next layer only once we were satisfied with the current one.
We implemented slow query monitoring at both the Cassandra cluster level and the application level. It records every query that takes longer than a configured threshold. Simple arithmetic then tells you how many queries above the SLA threshold you can afford while still meeting a five nines SLA: at most one in every 100,000. After three months of tuning and moving through the stack, we reached the target we had set at the beginning. We also built all the monitoring and alerting, and wrote up a set of best practices that the client can rely on when going into production.
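The arithmetic behind that budget can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the monitoring code used on the project: the names `slow_query_budget` and `SlowQueryMonitor` are hypothetical, the 10 ms threshold in the usage note is made up (the actual SLA threshold is not given here), and the rate of 30K requests per second is simply the 20K writes plus 10K reads from the project goal, taken as a floor.

```python
import time
from contextlib import contextmanager


def slow_query_budget(total_queries: int, nines: int = 5) -> int:
    """Most over-threshold queries allowed while still meeting the SLA."""
    # Five nines allows 1 slow query per 10**5 requests.
    # Integer division avoids floating-point rounding at the boundary.
    return total_queries // 10 ** nines


class SlowQueryMonitor:
    """Hypothetical application-level counter of slow queries."""

    def __init__(self, threshold_ms: float, nines: int = 5):
        self.threshold_ms = threshold_ms  # assumed SLA latency threshold
        self.nines = nines
        self.total = 0
        self.slow = 0

    def record(self, latency_ms: float) -> None:
        """Count one query; mark it slow if it exceeded the threshold."""
        self.total += 1
        if latency_ms > self.threshold_ms:
            self.slow += 1

    @contextmanager
    def timed(self):
        """Time the wrapped block and record its latency in milliseconds."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record((time.perf_counter() - start) * 1000.0)

    def meets_sla(self) -> bool:
        """True while the slow-query count is within the budget."""
        return self.slow <= slow_query_budget(self.total, self.nines)


if __name__ == "__main__":
    # 20K writes + 10K reads per second, sustained for one day.
    requests_per_day = (20_000 + 10_000) * 86_400
    print(slow_query_budget(requests_per_day))  # prints 25920
```

At that rate the budget works out to 25,920 slow queries per day, or roughly one every 3.3 seconds across the whole cluster. In an application, a query call would be wrapped as `with monitor.timed(): ...`, where `monitor = SlowQueryMonitor(threshold_ms=10.0)` uses the assumed threshold.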
The result of this project is a low-latency SLA at five nines that the client can quote when speaking with customers. A second result is a reliable monitoring stack that the client will use in production to track system performance and to be alerted when problems occur. The project also produced a lot of automation: every part of the infrastructure (application nodes, Cassandra cluster, metrics collector and so on) can be provisioned using Ansible scripts.