ETL Data Pipeline

An Apache Spark ETL data pipeline that processes large datasets of Google Local reviews and businesses and loads them into a Data Lake on S3 as well as a Redshift Data Warehouse.
Data modelling
Data engineering
Apache Spark
Amazon EMR
AWS Redshift
Feature engineering
SQL
PySpark

Background & Objectives

This project was the capstone for the Udacity Data Engineering Nanodegree.

The (imaginary) goal of this project was to create an ETL data pipeline to clean, process and enrich Google Local review data for US-based businesses and make it available for analysis by two different teams: a team of Data Scientists and a team of business analysts.

Data & Code

We have three datasets as input.

The code to execute this pipeline and launch the Redshift cluster is available on my GitHub page here.

Data Pipeline Architecture

The first step in the project was to define a high-level view of the data pipeline and decide on the most appropriate tools and technologies. As the main dataset contains over 10 million rows, a distributed processing solution was required; Apache Spark was used both for the initial assessment of the data, performed via a Jupyter notebook, and for the ETL pipeline itself. To make full use of Spark's parallel processing capability, the pipeline is executed on an AWS EMR cluster running three instances.
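
As a rough sketch of the pipeline's entry point (the application name and S3 path below are placeholders, not the actual project values), the Spark application submitted to the EMR cluster only needs a minimal session, since executor sizing and S3 access are configured at the cluster level:

```python
from pyspark.sql import SparkSession

# On EMR the cluster manager, executor sizing and S3 credentials come from the
# cluster configuration, so the application code only needs a minimal session.
spark = SparkSession.builder.appName("google-local-etl").getOrCreate()

# Read the raw reviews into a distributed DataFrame (placeholder S3 path).
reviews_df = spark.read.json("s3://my-raw-bucket/google-local/reviews/")
```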

To store the large dataset for the team of Data Scientists, I opted for a Data Lake on S3 using parquet format files, as this storage type is much cheaper than a data warehouse but still fast to query and compatible with Python tools to perform exploratory analyses and ML model development. This low-cost storage option meant we could keep the full text of the reviews for a potential NLP analysis as well.
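
As an illustration of why this format suits the Data Science team, a single partition of the lake can be pulled straight into pandas for exploratory work. The bucket name and partition value below are placeholders, and pyarrow/s3fs are assumed to be installed:

```python
import pandas as pd

# Read one business category back from the S3 Data Lake for exploration.
sample = pd.read_parquet("s3://my-datalake-bucket/reviews/category=Restaurant/")
print(sample.head())
```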

As the business analysts have a different set of technical skills from the Data Scientists and primarily use SQL to query their data directly, I opted to create a smaller Data Warehouse on a Redshift database, containing only the data pertinent to their case study in order to minimise cost: the full review text is excluded, and only categories relating to restaurants are included (the category is parameterised, should a future project require data for a different business category). A temporary data lake was first created on S3 as a staging dataset before loading the data into Redshift.
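
A minimal sketch of how that subsetting might look in PySpark; the single-valued category column, the reviewText column and the parameter value are assumptions about the source schema rather than the exact project code:

```python
from pyspark.sql import functions as F

# The business category of interest is a parameter of the pipeline, so the
# warehouse subset can be regenerated for a different category later on.
CATEGORY_KEYWORD = "Restaurant"   # hypothetical parameter value

dw_subset = (
    reviews_df
    # keep only reviews whose category matches the keyword
    .filter(F.col("category").contains(CATEGORY_KEYWORD))
    # the full review text is excluded from the Data Warehouse
    .drop("reviewText")
)
```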

Below is a diagram of the data pipeline's high-level architecture.

High-level architecture diagram

Data Processing & Transformation

An initial analysis of the data was performed to assess data quality and identify opportunities for feature engineering to enrich the dataset. As a result of this analysis, a number of cleaning and transformation steps were included in the data pipeline.

Some additional features were engineered as they could be a factor in the rating given by a user, e.g. the average rating given by each unique user, the day of the week, and whether the review was written during the day or at night (which could influence the mood of the user at the time of writing).
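
Below is a minimal sketch of these features in PySpark; the gPlusUserId, rating and reviewTimestamp column names are assumptions about the source schema:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Average rating is computed per reviewer using a window over the user id.
user_window = Window.partitionBy("gPlusUserId")

enriched_df = (
    reviews_df
    # average rating given by each unique user
    .withColumn("userAvgRating", F.avg("rating").over(user_window))
    # day of the week the review was written (1 = Sunday ... 7 = Saturday)
    .withColumn("reviewDayOfWeek", F.dayofweek("reviewTimestamp"))
    # crude day/night flag based on the hour of the review
    .withColumn(
        "isDaytime",
        F.hour("reviewTimestamp").between(7, 19).cast("int"),
    )
)
```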

The datasets were then combined into a large Data Lake containing all the reviews for US-based businesses, partitioned by business category. This column was chosen for partitioning as it's a likely candidate for aggregation in an ML model i.e. it is likely that a different model would be developed for each category, or group of categories.
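
A sketch of the corresponding write, with a placeholder bucket and prefix:

```python
# Write the enriched reviews to the Data Lake as parquet files, partitioned by
# business category so each category can be read independently.
(
    enriched_df
    .write
    .mode("overwrite")
    .partitionBy("category")
    .parquet("s3://my-datalake-bucket/reviews/")
)
```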

In parallel to the Data Lake above, a temporary staging data lake was created without column partitioning. This was a necessary step because, when parquet files are written with column partitioning, the partition column is encoded in the directory structure rather than stored in the files themselves, so the COPY command used to load the data into the Redshift Data Warehouse would not pick it up when reading the parquet files. Although this meant an additional step in the pipeline, it was very fast and did not adversely impact the total processing time.
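
A sketch of the load step, assuming a reviews table already exists in Redshift; the connection details, bucket and IAM role below are placeholders:

```python
import psycopg2

# Connection details would come from a config file in the real pipeline.
conn = psycopg2.connect(
    host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    dbname="dev", user="awsuser", password="...", port=5439,
)

copy_reviews = """
    COPY reviews
    FROM 's3://my-staging-bucket/reviews_staging/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
    FORMAT AS PARQUET;
"""

# The connection context manager commits the transaction on success.
with conn, conn.cursor() as cur:
    cur.execute(copy_reviews)
```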

Amazon Redshift Data Warehouse data model

The data model for the Data Warehouse is based on a star schema. This format was selected as it is clear and easy for the business analysts to understand and query. The DW contains one fact table with the reviews data and three dimension tables containing details about places (businesses), users and population.

Data Warehouse data model diagram
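
As an example of the kind of question this model makes easy to answer with plain SQL, an analyst could compute average ratings per category and state with a single join between the fact table and the places dimension. Table and column names here are illustrative rather than the exact schema, and conn is the Redshift connection from the load sketch above:

```python
import pandas as pd

# Average rating and review count per category and state, straight from the
# star schema (illustrative column names).
avg_rating_by_state = pd.read_sql(
    """
    SELECT p.state,
           r.category,
           AVG(r.rating) AS avg_rating,
           COUNT(*)      AS n_reviews
    FROM reviews r
    JOIN places  p ON p.gPlusPlaceId = r.gPlusPlaceId
    GROUP BY p.state, r.category
    ORDER BY avg_rating DESC;
    """,
    conn,
)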

Each row in the database is a review left by a user for a business in a given category. If a business fits several categories, there is one record per category, i.e. the same review is duplicated for each category representing the business, to allow aggregation at the category level.

NB: If there were a single category per business, the category would more logically be included in the places dimension table. However, as a place can be represented by a number of categories (e.g. Restaurant, Italian Restaurant and International Restaurant), I wanted to allow the business analysts to query by category to enable a more granular analysis. Having several records for the same place in the places table would have broken the referential integrity of the gPlusPlaceId foreign key between the reviews table and the places table.
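
A sketch of the transformation that produces this grain, assuming the source data holds the categories of each business in an array column named categories:

```python
from pyspark.sql import functions as F

# Exploding the categories array yields one row per (review, category) pair,
# which is the grain of the fact table.
reviews_per_category = reviews_df.withColumn("category", F.explode("categories"))
```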

To optimise query performance on the Redshift DW, I specified Distribution Keys as well as Sort Keys in the schema.
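
An illustrative fact-table definition showing how such keys are declared in Redshift; the column names, types and key choices below are a sketch rather than the exact production schema:

```python
# Illustrative DDL kept as a Python string, as the pipeline executes its SQL
# against Redshift from Python.
CREATE_REVIEWS_TABLE = """
    CREATE TABLE IF NOT EXISTS reviews (
        review_id      BIGINT IDENTITY(0, 1),
        gPlusPlaceId   VARCHAR(64) DISTKEY,   -- co-locate reviews with their place
        gPlusUserId    VARCHAR(64),
        category       VARCHAR(256),
        rating         DOUBLE PRECISION,
        review_date    DATE SORTKEY           -- one plausible sort key choice
    );
"""
```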

Adapting the pipeline to a different scenario

What if the pipeline needed to be run on a daily basis?

This project is a one-off pipeline run over historical data. However, if we were to collect the same data daily from a Google API, the same pipeline could be reused with slight modifications (appending to the data lake and data warehouse instead of replacing the data with each job) and scheduled and monitored using Apache Airflow.
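
A minimal sketch of such a schedule with Airflow; the DAG id, spark-submit command and script name are placeholders, and passing the execution date to the job would let each daily run append only that day's data:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Run the existing Spark ETL job once a day (placeholder command and script).
with DAG(
    dag_id="google_local_etl_daily",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    run_etl = BashOperator(
        task_id="run_spark_etl",
        bash_command="spark-submit --deploy-mode cluster etl.py --date {{ ds }}",
    )
```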