DISTRIBUTED DATA PROCESSING
Client: EU country government
Project Duration: 8 months
Goal: Efficiently process data from multiple data sources with an extremely complex data model in order to provide detailed historical information
Tech: Apache Cassandra, Apache Spark, SQL, Scala, EMR, Solr
Our client is building a new real-estate evaluation system in which tax is calculated from the property value. Property value is calculated by taking into account many factors beyond the property's sale price alone: sale prices of surrounding properties, distance to schools, distance to parks or lakes, etc. The new system is designed to keep historical data for every property in the system, so an operator can always check what a property looked like at a given point in time and, to satisfy legal requirements, show how its tax was calculated.
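To make the idea concrete, here is a minimal Scala sketch of a factor-based valuation. The factors, weights, and formula below are purely illustrative assumptions; the case study does not describe the client's actual model.

```scala
object Valuation {
  // Hypothetical set of valuation inputs; the real model uses many more factors.
  final case class Factors(
    salePrice: BigDecimal,          // the property's own sale price
    avgNearbySalePrice: BigDecimal, // average sale price of surrounding properties
    kmToSchool: Double,             // distance to the nearest school
    kmToParkOrLake: Double          // distance to the nearest park or lake
  )

  // Illustrative only: blend the sale price with nearby sales, then apply a
  // small discount that grows with distance to amenities (capped at 20 km).
  def assessedValue(f: Factors): BigDecimal = {
    val blended         = (f.salePrice + f.avgNearbySalePrice) / 2
    val distancePenalty = BigDecimal(1.0 - 0.01 * (f.kmToSchool + f.kmToParkOrLake).min(20.0))
    blended * distancePenalty
  }
}
```

For example, a property sold for 100,000 in an area averaging 120,000, one kilometre from both a school and a lake, would be assessed at 110,000 discounted by 2%.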
Data collected from multiple sources is stored in an SQL database. All entities in the SQL database are immutable: every change to an existing entity creates a new entry, in audit fashion. To let the rest of the system components query property data easily, we needed to do all the processing ahead of time. The processing complexity lies in the business logic and in how entity states are kept in audit fashion. Since the data model is extremely complex (11 levels deep and 9 entities wide), we needed a distributed processing framework, and Spark was our first choice. The initial implementation was complex, and the total processing time was around 96 hours.
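The audit-fashion storage and point-in-time lookup can be sketched as follows. The entity and field names here are hypothetical simplifications; the real model nests entities many levels deep.

```scala
import java.time.Instant

// Hypothetical simplified entity: each change to a property is written as a
// new immutable row (audit fashion); existing rows are never updated.
final case class PropertyVersion(
  propertyId: Long,
  assessedValue: BigDecimal,
  validFrom: Instant // the moment this version became the current state
)

object AuditQuery {
  // Reconstruct a property's state at a given point in time: the latest
  // version whose validFrom is not after the requested instant.
  def stateAt(versions: Seq[PropertyVersion], propertyId: Long, at: Instant): Option[PropertyVersion] =
    versions
      .filter(v => v.propertyId == propertyId && !v.validFrom.isAfter(at))
      .sortBy(_.validFrom.toEpochMilli)
      .lastOption
}
```

Answering such a query naively would mean walking every ancestor entity's audit trail at read time, which is why all states are precomputed ahead of time instead.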
After a couple of iterations and reworks of the processing code, we decided to keep Spark for its distributed infrastructure but moved all the processing functionality into custom code that is deployed to and run by the workers, utilizing the hardware more efficiently.
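The shape of the reworked design can be sketched as below: Spark's role shrinks to handing batches of work to the workers, while each batch is processed end to end by a single self-contained Scala function with no chain of intermediate transformations. The names and the local batching driver are hypothetical; on the cluster, the same per-batch function would be shipped to executors (for example via `mapPartitions`).

```scala
object BatchProcessing {
  // Hypothetical per-worker function: takes a batch of property ids and
  // produces fully computed results in one step, with no intermediate
  // distributed transformations (and therefore no growing Spark lineage).
  def processBatch(propertyIds: Seq[Long]): Seq[(Long, String)] =
    propertyIds.map(id => id -> s"processed-$id") // placeholder for the real business logic

  // Local stand-in for the driver: split the ids into batches and run the
  // worker function over each batch.
  def run(allIds: Seq[Long], batchSize: Int): Seq[(Long, String)] =
    allIds.grouped(batchSize).toSeq.flatMap(processBatch)
}
```

Because each batch is independent, the workers can keep their cores busy on plain in-memory computation instead of shuffling intermediate results between stages.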
This optimized approach reduced the code complexity significantly, eliminated the Spark lineage entirely, and brought a huge improvement in processing speed. The initial solution was correct and necessary in order to properly validate the output data and the approach, but a 96-hour run was not an option. The optimized solution dropped the processing time to 3 hours and made it easy to extend the codebase or update the processing logic.