Before we get to the interesting part, let me set the context. The problem at hand was to crunch the log data for one of our client applications, analyzing and reporting on various client-defined metrics from the application logs. The application had a user base of more than 100K users, which meant millions of rows of data to process every day. Clearly, we were dealing with “big data.” Given the volume of data involved, we decided to go with Spark running on an Azure HDInsight cluster to benefit from the increased performance offered by Spark’s in-memory RDDs (Resilient Distributed Datasets).