SENTINEL - SMART ALERTING
Client: Internal Project
Project Duration: 3 months
Goal: Use neural networks and machine learning to predict anomalies before they happen and to raise a SMART alert: an alert that carries the cause of the error, all system metrics available at that moment, and a hint from an algorithm that continually learns what causes such problems.
Tech: Kafka, Spark (Streaming and MLlib), Cassandra, Elasticsearch, InfluxDB, Riemann, Telegraf, Machine Learning, Neural Networks
We are frequently hired to solve problems with distributed systems. The first step is always getting to know the system, and that requires a powerful monitoring solution. The next step is good monitoring once the system is in production, so that system admins can step away from their monitors. We wanted to take this a step further: have a machine predict when something bad is about to happen and give system admins hints about the source (or potential source) of the error.
We first needed a testing infrastructure: a load generator that can connect to any kind of distributed system (in this case Cassandra) and monitoring for that system. The monitoring solution saved the parameters of interest (memory, CPU, network, and disk statistics, plus application metrics). With the infrastructure in place, we had everything we needed to store measurements. We stored a couple of days of data and trained our model on the system running under normal conditions. We then switched to anomaly testing: we produced artificial anomalies on the CPU, disk, and network adapter and let our algorithm detect them.
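The train-on-normal, test-on-injected-anomaly workflow can be illustrated with a minimal sketch. It does not reproduce the project's model: a simple per-metric z-score check stands in for the neural network, and the metric names (`cpu`, `disk_io`), the synthetic data, and the threshold of 3 standard deviations are all assumptions made for illustration.

```python
from statistics import mean, stdev


def fit_baseline(samples):
    """Learn (mean, standard deviation) per metric from normal-operation samples."""
    metrics = samples[0].keys()
    return {
        m: (mean(s[m] for s in samples), stdev(s[m] for s in samples))
        for m in metrics
    }


def z_scores(baseline, sample):
    """Return how many standard deviations each metric is from its baseline mean."""
    scores = {}
    for metric, (mu, sigma) in baseline.items():
        scores[metric] = abs(sample[metric] - mu) / sigma if sigma > 0 else 0.0
    return scores


def is_anomalous(baseline, sample, threshold=3.0):
    """Flag a sample when any metric deviates more than `threshold` sigmas."""
    return any(z > threshold for z in z_scores(baseline, sample).values())


# Synthetic "normal" measurements (hypothetical values, not project data).
normal = [
    {"cpu": 30.0 + (i % 5), "disk_io": 100.0 + 2 * (i % 7)}
    for i in range(200)
]
baseline = fit_baseline(normal)

# Artificial anomaly: a CPU spike while disk activity stays in its usual range.
cpu_spike = {"cpu": 95.0, "disk_io": 104.0}
spike_detected = is_anomalous(baseline, cpu_spike)
normal_flagged = is_anomalous(baseline, normal[0])
```

A static baseline like this only catches deviations from values already seen; the appeal of a learned model is that it can also capture relationships between metrics and changes over time.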
The solution consists of a load-testing environment, a monitoring machine that stores metrics, and a machine learning algorithm that works on a stream of measurements and detects anomalies as they happen. The algorithm sends SMART alerts to the system admin, each containing the error, a system snapshot with all the parameters of interest, and a hint about what caused the error.
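The shape of a SMART alert (error, system snapshot, hint) could be sketched as follows. This is an illustrative assumption, not the project's code: the `SmartAlert` class, the `detect` function, the hard-coded `BASELINE` values, and the rule "the hint names the metric that deviates most" are all hypothetical, and a plain Python generator stands in for the streaming pipeline.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Tuple


@dataclass
class SmartAlert:
    error: str                  # what went wrong
    snapshot: Dict[str, float]  # all metrics at the moment of the anomaly
    hint: str                   # the detector's guess at the cause


# Hypothetical baseline: metric name -> (mean, standard deviation).
BASELINE: Dict[str, Tuple[float, float]] = {
    "cpu": (32.0, 1.5),
    "disk_io": (106.0, 4.0),
    "net_rx": (500.0, 25.0),
}


def detect(
    stream: Iterable[Dict[str, float]],
    baseline: Dict[str, Tuple[float, float]],
    threshold: float = 3.0,
) -> Iterator[SmartAlert]:
    """Yield a SmartAlert for each sample whose worst metric exceeds the threshold."""
    for sample in stream:
        deviations = {
            metric: abs(sample[metric] - mu) / sigma
            for metric, (mu, sigma) in baseline.items()
            if sigma > 0
        }
        if not deviations:
            continue
        worst = max(deviations, key=deviations.get)
        if deviations[worst] > threshold:
            yield SmartAlert(
                error=f"anomaly detected on {worst}",
                snapshot=dict(sample),
                hint=f"{worst} is {deviations[worst]:.1f} standard deviations from baseline",
            )


measurements = [
    {"cpu": 31.0, "disk_io": 108.0, "net_rx": 510.0},  # normal
    {"cpu": 96.0, "disk_io": 109.0, "net_rx": 495.0},  # CPU spike
]
alerts = list(detect(measurements, BASELINE))
```

Carrying the full snapshot in the alert means the admin sees every metric at the moment of the anomaly, not only the one that triggered it.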
The result is a better alerting solution that gives system admins far more insight. It also supports our consultancy work: we can help clients achieve better monitoring and fewer problems when running their systems in production, with alerts that pinpoint the exact source of the problem.