Client: EU-based Printing Company
Project Duration: 1 month
Goal: Rework the ingestion mechanism, which pulls data from various sources in various formats and transforms it into the internal format. Speed up the transformation process and make it robust. Make ingestion flexible enough that a machine learning algorithm can be hooked in at a later stage.
Tech: Apache Spark, Apache Cassandra, Scala, Ansible, Terraform
The client has a system in place: a web shop with articles from different sources. This data is imported and processed manually, in a slow and error-prone fashion. The challenge was to convince the client that, with an initial investment in technology such as Spark, they could get a robust, scalable solution with pluggable data sources and the possibility of integrating machine learning algorithms on top of the ingestion stream in later phases.
First, we wanted to get familiar with the customer’s business and the problem we were solving. We suggested flying over to review the current architecture and process. After two days at the customer’s premises, we created a review document with a proposal to use Apache Spark as the ingestion ETL layer. The next phase was a proof of concept: we implemented everything for one provider to show that the job could run automatically in scheduled time slots. After that, we proceeded to cover all providers.
The solution uses Apache Spark as the primary ingestion mechanism: it connects to the different data sources on one side and produces data in the internal format on the other. The solution also includes UI monitoring of jobs, and everything is wrapped in Terraform and Ansible scripts that help with automation.
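The idea of pluggable providers feeding one internal format can be illustrated with a minimal sketch. It is written in plain Python rather than as a Spark job, purely to show the shape of the design; the `Article` fields, the provider names, and the two input formats are all hypothetical, since the actual schema and sources are not described here.

```python
import csv
import io
import json
from dataclasses import dataclass
from decimal import Decimal
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Article:
    """Hypothetical internal format; the real schema is not given in the text."""
    provider: str
    sku: str
    title: str
    price_cents: int


# A parser turns one provider's raw payload into internal records.
Parser = Callable[[str], Iterable[Article]]

# Registry of parsers, keyed by (hypothetical) provider name.
PARSERS: Dict[str, Parser] = {}


def register(provider: str) -> Callable[[Parser], Parser]:
    """Decorator that plugs a parser into the registry under a provider name."""
    def decorator(parser: Parser) -> Parser:
        PARSERS[provider] = parser
        return parser
    return decorator


@register("provider_a")
def parse_provider_a(raw: str) -> Iterable[Article]:
    # Assumed format: CSV with columns id, name, price (decimal currency units).
    for row in csv.DictReader(io.StringIO(raw)):
        yield Article(
            provider="provider_a",
            sku=row["id"],
            title=row["name"].strip(),
            price_cents=int(Decimal(row["price"]) * 100),
        )


@register("provider_b")
def parse_provider_b(raw: str) -> Iterable[Article]:
    # Assumed format: JSON array of objects with sku, title, price_cents.
    for item in json.loads(raw):
        yield Article(
            provider="provider_b",
            sku=item["sku"],
            title=item["title"],
            price_cents=int(item["price_cents"]),
        )


def ingest(provider: str, raw: str) -> List[Article]:
    """Look up the provider's parser and normalize its payload."""
    try:
        parser = PARSERS[provider]
    except KeyError:
        raise ValueError(f"no parser registered for provider {provider!r}")
    return list(parser(raw))
```

In this sketch, adding a provider means writing one more function decorated with `register`; `ingest` and the internal format stay untouched. The same separation would be what lets a later step, such as an anomaly check, operate on `Article` records without knowing which provider they came from.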
The result is a better ingestion mechanism that can scale horizontally, if needed, as more providers are added to the system. The solution is also flexible when it comes to adding new providers. Last but not least, it leaves room for machine learning algorithms that run on the input stream and can detect certain anomalies in incoming data.