Hawkeye Review Report Example
This document contains a proposed architecture based on the provided functional and nonfunctional system requirements. It provides a high-level overview of the architecture and elaborates the technology stack choice and the required infrastructure.
Functional System Requirements
The functional requirements that the new system architecture needs to satisfy are based on the functional elements of the current system with a set of new elements that add new functions or replace existing ones. The proposed architecture needs to allow for implementation of MVP and also to support future development.
The architecture needs to support:
- Data Acquisition
- Location data ingestion
- Location data archiving
- Location data mapping and storing
- Data Analysis
- Observations classification
One additional requirement is that the architecture needs to allow for near-real-time data processing (data location analytics).
The following picture lays out these requirements:
Nonfunctional System Requirements
The proposed architecture needs to be a scalable solution in order to support the planned workload increase.
The main points during the architecture design were to improve scalability and system resiliency, as well as to reduce the number of different storage types that exist in the current setup. Most important aspect is that this is an analytics system so technology choices are influenced by ease of data science algorithms integration. The following diagram provides a proposed architecture overview with main data ingestion and processing pipelines.
Akka and Apache Kafka are used to implement the data ingestion part.
Using Akka, an HTTP API for accepting CMX Location Data JSON messages, de-duping and mapping into a generic format and passed to a Kafka topic. Akka actor model is selected as it allows for implementation of a robust, highly efficient, fast and scalable message receiving layer.
Transformed location data is then published/stored into a Kafka topic. Using a Kafka topic, together with Akka in front of it, allows for implementation of a very scalable data ingestion pattern with a high-throughput and low-latency. It virtually eliminates backpressure problem by acting as a buffer that can accept input flow peaks. This pattern also can prevent data lost in cases the rest of the system is down (malfunction on maintenance) by buffering incoming location data and replaying/streaming it later when the is back online. Kafka also allows for adding new, parallel data pipelines that might implement new functions without disturbing the existing system parts.
In addition to being used for data ingestion, Kafka can be used for persisting all input messages for a configurable period (message retention) without having a negative impact on real-time performances. This can be used to implement an efficient archive mechanism where old data can be easily accessed in a stream-fashion (log replay) in parallel with current, real-time data.
Once stored in a Kafka topic, location data enter Apache Spark. Spark jobs are used for both classification and location analytics. Spark with its Spark streaming module and libraries allows for implementation of both near real-time and batch processing. As a storage, Spark uses Cassandra database. Processed location data is stored into Cassandra tables while at the same time, previously stored data is read in order to calculate real-time response. Apache Spark and Apache Cassandra are a really good fit. Namely, Spark cluster can be deployed over Cassandra cluster in a way that on every Cassandra node there is a Spark node. It allows both clusters to be scaled together and Spark nodes can rely on data locality to boost processing performance by using only data present in the local Cassandra node.
The calculated real time response from the respective Spark job (e.g. update of location analytics - the number of visitors on a certain location) is passed again to an Akka layer. Akka is used here to implement necessary integration with the dashboard. It is also used to implement an HTTP API for data querying.
Akka is a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM. It provides a high-level API for Scala and Java, built on top of the actor model of concurrency and fault tolerance. Besides the core functionality, Akka provides numerous additional modules and extensions.
The purpose of Akka would be to provide an entrypoint for ingesting the CMX location data, acquired from access points, using its HTTP module, and to transform the ingested data and publish the resulting values into the Kafka topic(s) asynchronously.
Kafka is a distributed messaging platform, designed to be scalable, durable, while ensuring high-throughput, low-latency handling of real-time data feeds. Internally, it stores data into categories called topic, with each record in each topic having a key, a value and a timestamp.
As previously said, the data ingested from the HTTP endpoint will be stored in Kafka with some configurable retention period. This data could be used to replay the stream of events at any point of time, e.g. when deploying new Spark jobs for the purpose of deriving new analytics.
Apache Spark is described as fast, unified execution engine for large-scale data processing. Spark achieves its performance by utilizing memory as primary operational storage, achieving up to 100 times improvement compared to Hadoop. It provides several language bindings (Scala, Java, Python, R), and multiple modules, applicable to various stages of data processing pipeline. One of the available modules is Spark Streaming, which exposes a high-level API built on top of the discretized streams, which provides resiliency, load balancing and unification of streaming, batch and interactive processing out of the box.
Use case for Spark is twofold - the primary job will ingest the data from Kafka topics and store it in its “raw” form in the database, while different job(s) can be deployed to transform the stored data and calculate aggregations as described in the requirements. In the future, Spark’s machine learning algorithms can be utilised as foundation for jobs performing the advanced data analysis.
Apache Cassandra is highly available, distributed, horizontally scalable and decentralized database. Cassandra data models are built around application access patterns, i.e. data queried together is stored in the same partitions. New data is written sequentially to disc, and it is sorted according to the key definition. Due to these facts Cassandra is able to provide both fast read and writes to its clients, making it an ideal time-series data storage.
Since the dashboard application uses MySQL as data source, it would be necessary to adapt its data access layer. One possible solution to this problem is to use Akka as a data access API provider on top of Cassandra.
Even though there are a lot of components in the system, the MVP solution would not require more than one instance to deploy the aforementioned layers. However, achieving high performance might be troublesome with such setup.
To be able to determine the optimal cluster size and estimate its price, it is recommended to perform an adequate load test. Such test can be performed using any of the available load testing tools, combined with SmartCat data generator. Once those numbers are determined, each of the system layers can be scaled-out, i.e. additional instances can be created without a significant operational cost. Moreover, by using tools like Terraform and Ansible, the whole process of creating and provisioning instances can be automated and kept in code.
|Task||Description||Estimate (in days)|
|Infrastructure setup||Akka, Spark, Kafka and Zookeeper clusters need to be properly configured and automated.||5|
|Implement HTTP server||Setup akka-http seed project, expose an API endpoint for accepting observations from Meraki cloud. Once the service is up and running, transform the incoming JSON documents and publish them into Kafka topic(s).||4|
|Implement event classifier||Classify events into visitors and passerby. An algorithm should be implemented on top of Spark streaming.||3|
|Aggregate data||Ingested data should be aggregated according to the provided rules using the Spark SQL.||10|
|Testing||Testing phase should cover e2e and overall performance tests.||5|
Table 1. First phase task breakdown