Business Case and Architecture Review Example
XXX is building a real estate platform that provides information on all real estate in XXX. Although the user interface looks simple, they want to provide the best possible search experience, including personalized recommendations for each user. Besides the website, the business model includes selling the data to other companies; enriching that data with recommendations based on the request increases its value.
Current setup looks like this:
- Data is collected from a public real estate database of XXX and from 3rd-party scrapers that scrape the competition
- During import, the data is converted into a domain data model and saved into Elasticsearch
- MongoDB used to be the single source of truth, but at some point they switched to writing directly to Elasticsearch
- The government statistics bureau provides information on criminal offences, price history,...
- The endpoint API serves user requests by querying Elasticsearch, which provides a separation layer between the website and Elasticsearch
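The separation layer from the last point can be sketched in Go as a small interface between the HTTP handler and the search backend. This is a minimal sketch under stated assumptions, not the existing implementation: `Property`, `SearchQuery`, `Searcher`, the `city` and `max_price` parameters, and the in-memory backend are all hypothetical stand-ins (the real model lives in bag_search.go and the real backend is Elasticsearch).

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
)

// Property is a hypothetical, heavily reduced stand-in for the domain model.
type Property struct {
	ID    string `json:"id"`
	City  string `json:"city"`
	Price int    `json:"price"`
}

// SearchQuery holds the (assumed) filters the endpoint accepts.
type SearchQuery struct {
	City     string
	MaxPrice int // 0 means "no upper bound"
}

// Searcher is the seam between the HTTP layer and the search backend.
type Searcher interface {
	Search(ctx context.Context, q SearchQuery) ([]Property, error)
}

// memorySearcher stands in for an Elasticsearch-backed implementation.
type memorySearcher struct{ items []Property }

func (m memorySearcher) Search(_ context.Context, q SearchQuery) ([]Property, error) {
	out := []Property{}
	for _, p := range m.items {
		if q.City != "" && p.City != q.City {
			continue
		}
		if q.MaxPrice > 0 && p.Price > q.MaxPrice {
			continue
		}
		out = append(out, p)
	}
	return out, nil
}

// searchHandler translates query parameters into a SearchQuery and
// writes the results as JSON; it never sees backend-specific types.
func searchHandler(s Searcher) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		q := SearchQuery{City: r.URL.Query().Get("city")}
		if raw := r.URL.Query().Get("max_price"); raw != "" {
			n, err := strconv.Atoi(raw)
			if err != nil || n < 0 {
				http.Error(w, "invalid max_price", http.StatusBadRequest)
				return
			}
			q.MaxPrice = n
		}
		results, err := s.Search(r.Context(), q)
		if err != nil {
			http.Error(w, "search failed", http.StatusBadGateway)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(results)
	}
}

// get sends one in-process GET request to h and returns status and body.
func get(h http.Handler, target string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	backend := memorySearcher{items: []Property{
		{ID: "1", City: "city-a", Price: 100},
		{ID: "2", City: "city-b", Price: 200},
	}}
	status, body := get(searchHandler(backend), "/search?city=city-a")
	fmt.Println(status, body)
}
```

Because the handler depends only on `Searcher`, the backend behind it can change without the website noticing.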
Most of the instances are hosted in XXXX private hosting, and the current infrastructure consists of:
|Service||Number of instances||Hosting|
|Web server (go, nginx)||3||Exonet|
|MongoDB (Exonet managed?)||3||Exonet|
|PostgreSQL (BAG imported data)||1||local|
Application instances sit behind a load balancer managed by XXX. 3rd-party scrapers export their results to Dropbox, from where the data is collected based on Dropbox notifications.
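If the Dropbox notifications mentioned above are Dropbox webhooks (an assumption; the source does not say how they are delivered), the receiving side could look roughly like the sketch below. It echoes the `challenge` parameter on the verification GET and checks the `X-Dropbox-Signature` header, an HMAC-SHA256 of the request body keyed with the app secret, on each POST. The handler name, the secret value and the `onChange` callback are hypothetical.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
)

// sign returns the hex-encoded HMAC-SHA256 of body keyed with the app secret.
func sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// validSignature compares the expected and received signatures in constant time.
func validSignature(secret, body []byte, header string) bool {
	return hmac.Equal([]byte(sign(secret, body)), []byte(header))
}

// webhookHandler answers the verification request and calls onChange
// (a hypothetical callback that would start the import) for each
// notification with a valid signature. onChange should return quickly,
// for example by queueing the work.
func webhookHandler(secret []byte, onChange func()) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		switch r.Method {
		case http.MethodGet:
			// Verification request: echo the challenge back as plain text.
			w.Header().Set("Content-Type", "text/plain")
			w.Header().Set("X-Content-Type-Options", "nosniff")
			io.WriteString(w, r.URL.Query().Get("challenge"))
		case http.MethodPost:
			body, err := io.ReadAll(io.LimitReader(r.Body, 1<<20))
			if err != nil {
				http.Error(w, "cannot read body", http.StatusBadRequest)
				return
			}
			if !validSignature(secret, body, r.Header.Get("X-Dropbox-Signature")) {
				http.Error(w, "bad signature", http.StatusForbidden)
				return
			}
			onChange()
		default:
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		}
	}
}

func main() {
	secret := []byte("hypothetical-app-secret")
	body := []byte(`{"example":"notification"}`)
	fmt.Println("valid signature accepted:", validSignature(secret, body, sign(secret, body)))
	fmt.Println("forged signature accepted:", validSignature(secret, body, "deadbeef"))
}
```

Rejecting unsigned requests matters here because the webhook URL is publicly reachable and each accepted call triggers an import.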
The codebase is written in Go, and the idea is to implement it as microservices. The system already uses some internal microservices (such as OAuth2).
Currently the recommended properties displayed on the website are static, with no logic behind them. A UI rework is in progress and is expected to affect the data model related to search filters.
The data model is described in the bag_search.go file.
The proposed architecture consists of a set of fast and scalable technologies chosen with the execution of data science algorithms in mind. The backbone of the infrastructure is Apache Kafka: all data is pushed to it for real-time processing with Kafka Streams, and both raw and processed data are stored in the database.
Our database of choice is Apache Cassandra, which has a track record of high write throughput, horizontal scalability and resilience to node failures. Cassandra is a partitioned row store; it is well suited to time-series data such as clickstreams and user tracking, and it can also hold all other entities in this case. Cassandra queries have to be designed around the partition key, so ad hoc complex queries and free-form searches are not practical from Cassandra itself, which is why keeping Elasticsearch makes sense. The team is already comfortable with ES, so there will be no adoption period.
One of the data processing components is Apache Spark, which can process streaming data from Kafka and run batch processing on both incoming and stored (historical) data. Its flexibility makes ETL tasks straightforward to implement on the existing infrastructure. Spark's MLlib component is a scalable machine learning library that contains implementations of most popular algorithms (classification, regression, recommendation, clustering…) and can interoperate with R libraries.
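To make the time-series point concrete, here is a minimal sketch of how clickstream events could be bucketed into Cassandra partitions by user and day, so that one user's events for a day sit together and no partition grows without bound. The table name, column names, the daily bucket size and the Go types are assumptions for illustration, not part of the existing data model.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical table: (user_id, day) is the partition key and
// occurred_at is the clustering column, newest first.
const createTable = `CREATE TABLE clickstream_by_user_day (
  user_id     text,
  day         date,
  occurred_at timestamp,
  event_type  text,
  PRIMARY KEY ((user_id, day), occurred_at)
) WITH CLUSTERING ORDER BY (occurred_at DESC);`

// ClickEvent is an assumed, minimal clickstream record.
type ClickEvent struct {
	UserID     string
	OccurredAt time.Time
	EventType  string
}

// PartitionKey mirrors the (user_id, day) partition key of the table.
type PartitionKey struct {
	UserID string
	Day    string // YYYY-MM-DD in UTC
}

// partitionKeyFor derives the partition an event belongs to.
func partitionKeyFor(e ClickEvent) PartitionKey {
	return PartitionKey{UserID: e.UserID, Day: e.OccurredAt.UTC().Format("2006-01-02")}
}

// eventAt builds an event from a Unix timestamp; it keeps the example short.
func eventAt(userID string, unixSeconds int64) ClickEvent {
	return ClickEvent{UserID: userID, OccurredAt: time.Unix(unixSeconds, 0), EventType: "search"}
}

func main() {
	fmt.Println(createTable)

	events := []ClickEvent{
		eventAt("user-1", 1700000000),
		eventAt("user-1", 1700000600),
		eventAt("user-1", 1700090000), // falls on the following UTC day
		eventAt("user-2", 1700000000),
	}

	// Count how many events land in each partition.
	perPartition := map[PartitionKey]int{}
	for _, e := range events {
		perPartition[partitionKeyFor(e)]++
	}
	for key, n := range perPartition {
		fmt.Printf("%s / %s: %d event(s)\n", key.UserID, key.Day, n)
	}
}
```

Reading "the latest events of user X today" then touches a single partition, which is the access pattern Cassandra handles well. Anything that filters on other attributes stays with Elasticsearch.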
The existing architecture already provides business value, so every step forward needs to be as non-intrusive as possible. The business must continue to operate, and the integration needs to proceed in small, well-defined phases. This phased approach lets the team become familiar with the technologies being introduced and makes insights into the data more accessible.
- Deploy Apache Kafka and write publishers for the data import jobs and for all HTTP queries (clickstream and user tracking data). This phase lets the team run simple algorithms on the Kafka data streams and gain more insight while creating an internal data structure and defining a data model
- Deploy Apache Cassandra and start writing raw and processed data into it. Start indexing data into Elasticsearch, building a new index from the updated data source with the updated data model
- Deploy Apache Spark and deploy Spark jobs that process both Kafka and Cassandra data
- Start building user profiles and the recommendation model
- Run A/B tests with both infrastructures in parallel (a certain percentage of the load is routed to the new setup)
- Remove the old infrastructure (all requests are served from the new infrastructure)
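For the A/B testing phase, one common way to route a fixed percentage of the load is to hash a stable key (for example a user or session ID) into 100 buckets. The sketch below assumes that approach; the function name, the choice of FNV-1a and the key format are illustrative, and in practice the split might just as well be done in the load balancer.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// routeToNew reports whether requests for the given key should be served
// by the new setup. The decision is deterministic per key, so a user
// stays on the same side of the experiment across requests.
func routeToNew(key string, percent uint32) bool {
	if percent >= 100 {
		return true
	}
	h := fnv.New32a()
	h.Write([]byte(key))
	// Bucket is in [0, 100); keys in buckets below percent go to the new setup.
	return h.Sum32()%100 < percent
}

func main() {
	const percent = 10
	const total = 10000
	routed := 0
	for i := 0; i < total; i++ {
		if routeToNew(fmt.Sprintf("session-%d", i), percent) {
			routed++
		}
	}
	fmt.Printf("%d of %d synthetic sessions routed to the new setup\n", routed, total)
}
```

Because a key's bucket never changes, raising the percentage only moves additional users to the new setup; nobody who was already there is moved back. This makes a gradual ramp-up towards the final phase straightforward.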
Once all infrastructure integration phases are complete, the end result will be a flexible and scalable real-time processing and recommendation platform. Web application changes should be minimal to none, because we will retain the original API endpoint and Elasticsearch for serving client queries.
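One way to keep the endpoint unchanged while still capturing HTTP queries as clickstream data is to wrap the existing handler in a middleware, sketched below. `QueryEvent`, `Publisher`, the `session_id` cookie and the in-memory publisher are assumptions; a real publisher would hand events to a Kafka client, typically asynchronously so that tracking adds no latency to the request.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// QueryEvent is an assumed shape for one tracked HTTP query.
type QueryEvent struct {
	Path      string
	Query     string
	SessionID string
	At        time.Time
}

// Publisher abstracts the event sink (hypothetical; Kafka in the proposal).
type Publisher interface {
	Publish(e QueryEvent) error
}

// withTracking publishes an event per request and then calls the
// unchanged handler. Publishing errors are logged, never shown to users.
func withTracking(p Publisher, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		e := QueryEvent{Path: r.URL.Path, Query: r.URL.RawQuery, At: time.Now().UTC()}
		if c, err := r.Cookie("session_id"); err == nil {
			e.SessionID = c.Value
		}
		if err := p.Publish(e); err != nil {
			log.Printf("tracking publish failed: %v", err)
		}
		next.ServeHTTP(w, r)
	})
}

// memoryPublisher collects events in memory; it stands in for Kafka here.
type memoryPublisher struct {
	mu     sync.Mutex
	events []QueryEvent
}

func (m *memoryPublisher) Publish(e QueryEvent) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.events = append(m.events, e)
	return nil
}

// simulate sends one GET request through a wrapped stand-in handler and
// returns the response status, the body and the captured events.
func simulate(target string) (int, string, []QueryEvent) {
	pub := &memoryPublisher{}
	existing := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok") // stands in for the current API endpoint
	})
	rec := httptest.NewRecorder()
	withTracking(pub, existing).ServeHTTP(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Code, rec.Body.String(), pub.events
}

func main() {
	status, body, events := simulate("/search?city=city-a")
	fmt.Println(status, body, len(events), events[0].Path, events[0].Query)
}
```

The wrapped handler's response is untouched, so the website sees the same API before and after tracking is switched on.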
Additional use cases
One of the requirements for this architecture was that it can be reused for a similar use case (a cars portal). With everything above in mind, it will even be possible to reuse the existing infrastructure for that second use case, but this will require proper capacity planning: current monthly visits for Jumba are around 100k, and this number is expected to increase drastically. Implementing the second use case will require defining its data structure and data processing jobs, but running it on the same infrastructure will reduce time to production as well as running costs.