MONITORING STACK FOR REAL TIME PROCESSING INFRASTRUCTURE
Client: Advertising company with clickstream data
Project Duration: 3 months
Goal: Build infrastructure for the centralized collection and visualization of logs and metrics from the various tools and applications in the stack
Tech: ELK, Grafana, Telegraf, InfluxDB, AWS, Ansible
We were approached by a client who had built a data lake that accepts clickstream data (events). They wanted to stay in control of the system and to make life easier for their sysadmins when analyzing system status. They had little or no monitoring: whenever a problem occurred, they had to pinpoint the problematic instances/servers, connect to them, and manually check logs and metrics.
We decided to split the problem into three parts and to implement the solution in phases. First, we brought all the relevant logs to a centralized place, since the infrastructure was very complex and spread across many machines, each hosting many services/applications. Second, we selected the relevant metrics and collected them in a centralized place. Finally, we added alerts on top, so sysadmins no longer had to watch their monitors. The requirement was to use an open-source stack, so we went with what we considered the best selection of tools available.
We installed Filebeat as the log collector on each machine that hosted applications of interest. Filebeat was configured to ship logs to Logstash, which filters and augments them and passes them downstream. At the end of the pipeline, the logs were indexed in Elasticsearch and presented via Kibana.
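To illustrate what the filter/augment step in such a pipeline does, here is a minimal Python sketch. It is not the Logstash configuration used in the project. The log layout, the `enrich` function, and the service and host names are all hypothetical and serve only to show the idea: parse a raw line into fields, drop what is not needed, and attach the originating host before the event is indexed.

```python
import json
import re
from typing import Optional

# Hypothetical log layout: "<timestamp> <LEVEL> <service> <message>"
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>\S+)\s+(?P<message>.*)$"
)


def enrich(line: str, host: str) -> Optional[dict]:
    """Parse one raw log line and tag it with its host; return None to drop it."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None  # filter: the line does not follow the expected layout
    event = match.groupdict()
    if event["level"] == "DEBUG":
        return None  # filter: skip noisy debug output (an assumption for this sketch)
    event["host"] = host  # augment: record which machine produced the line
    return event


if __name__ == "__main__":
    raw = "2017-03-01T12:00:00Z ERROR ingest-api Connection refused"
    # The resulting dictionary is the kind of structured document a search index stores.
    print(json.dumps(enrich(raw, host="node-07"), sort_keys=True))
```

Once every event carries structured fields such as `service` and `host`, a problem can be narrowed down to a machine from one central place instead of logging in to each server.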
The metrics part was a bit more complex. We needed to change the application code and add Dropwizard metrics wherever application-specific metrics, such as request latency and throughput, were needed. We decided to store the metrics in InfluxDB, since it is a database well suited to time series data. Telegraf was selected to collect the metrics (e.g. via JMX) and ship them to InfluxDB. The last part was visualization, and for that we placed Grafana on top, with both InfluxDB and Elasticsearch as data sources. In this way, we had the metrics on the dashboard and could combine them with logs.
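For a sense of what a metric data point looks like on its way into InfluxDB, the sketch below formats one point in InfluxDB's line protocol (`measurement,tags fields timestamp`). In the stack described here Telegraf does this work. The `to_line_protocol` helper and the measurement, tag, and field names are hypothetical, and the sketch handles only numeric and boolean field values.

```python
from typing import Dict, Union

FieldValue = Union[bool, int, float]


def _escape_key(text: str) -> str:
    # Tag keys, tag values and field keys escape commas, equals signs and spaces.
    return text.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def _escape_measurement(text: str) -> str:
    # Measurement names escape commas and spaces.
    return text.replace(",", "\\,").replace(" ", "\\ ")


def _format_field(value: FieldValue) -> str:
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    raise TypeError(f"unsupported field type: {type(value).__name__}")


def to_line_protocol(
    measurement: str,
    tags: Dict[str, str],
    fields: Dict[str, FieldValue],
    timestamp_ns: int,
) -> str:
    """Render one data point as a line-protocol string (nanosecond timestamp)."""
    if not fields:
        raise ValueError("a point needs at least one field")
    name = _escape_measurement(measurement)
    # Tags are sorted by key for a stable, canonical ordering.
    tag_part = "".join(
        "," + _escape_key(key) + "=" + _escape_key(tags[key]) for key in sorted(tags)
    )
    field_part = ",".join(
        _escape_key(key) + "=" + _format_field(value) for key, value in fields.items()
    )
    return f"{name}{tag_part} {field_part} {timestamp_ns}"


if __name__ == "__main__":
    print(
        to_line_protocol(
            "request_latency",
            {"service": "ingest-api", "host": "node-07"},
            {"p99_ms": 41.5, "count": 1200},
            1500000000000000000,
        )
    )
```

Tags such as `service` and `host` are indexed, which is what lets a dashboard slice the same metric per application or per machine.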
Alerting was the last step. Grafana has supported alerts since version 4, so we configured a few important ones, which made our monitoring stack even more useful.
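A threshold alert of the kind typically configured in a dashboard tool boils down to a rule such as "the average of a metric over the last few minutes is above a limit". The Python sketch below shows that evaluation logic only. It is not Grafana's implementation, and the `is_alerting` function, the sample values, and the choice to treat an empty window as "not alerting" are assumptions made for illustration.

```python
from statistics import mean
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (unix timestamp in seconds, metric value)


def is_alerting(
    points: Sequence[Point], now: float, window_s: float, threshold: float
) -> bool:
    """Return True when the average value inside the window exceeds the threshold."""
    recent = [value for ts, value in points if now - window_s <= ts <= now]
    if not recent:
        return False  # assumption: no data in the window is not treated as an alert
    return mean(recent) > threshold


if __name__ == "__main__":
    # Hypothetical request latencies in milliseconds, one sample per minute.
    latencies = [(0.0, 120.0), (60.0, 180.0), (120.0, 420.0), (180.0, 510.0)]
    # Average of the last two minutes (180, 420, 510) compared with a 300 ms limit.
    print(is_alerting(latencies, now=180.0, window_s=120.0, threshold=300.0))
```

Averaging over a window rather than checking single samples keeps one short spike from paging anyone, at the cost of reacting a little later to a real problem.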
The result is a functional monitoring system that eases everyday system operations and reduces the time spent on identifying problems. The key metrics and logs can now be visualized and stored for a configurable length of time, which makes it possible to correlate current problems with events and situations that happened in the past. The alerts also made life a lot easier for everyone on the team, and the whole team is more productive, since they can now concentrate on business features and react to alerts only when necessary.