Cats v Dogs

A Data Streaming application that analyses Tweets in real-time to determine whether cats or dogs are more popular on Twitter at any given time.
data streaming
AWS Kinesis Data Firehose
AWS Elasticsearch Service
Kibana
AWS Lambda
real-time dashboard
Streaming analytics

Objectives

The goal of this project was to learn the fundamental components of a data streaming pipeline by finding a simple, highly abstracted cloud-based method to analyse & visualise a data stream in real-time.

The application aims to analyse the popularity of cats and dogs on Twitter at any particular point in time and visualise its evolution in real-time on a dashboard. Popularity is judged simply by the number of tweets coming through with cat or dog related hashtags. This first implementation was kept as simple as possible as a proof-of-concept of the end-to-end pipeline and further complexity will be added to the application gradually.

The corresponding code is available on my Github page here

Architecture

cats v dogs aws architecture

Twitter API and Python Script

The Python script serves two main purposes:

The script contains a loop to automatically reconnect if the stream is disconnected or encounters any exception such as unexpected record formats.

AWS Firehose Delivery Stream + Lambda function

The Kinesis Data Firehose service ingests, transforms and delivers the streaming data records to Elasticsearch. It also delivers a back-up of each record to an S3 bucket.

The Lambda function processes each record as it goes through the stream. It cleans up each Tweet, keeping only the relevant fields for further analysis (the granularity of the Twitter query parameters allowed means that some fields are included by the Twitter API by default and need to be pruned to save on space and compute time). It also adds a timestamp and converts the record to JSON format (required for Elasticsearch) before pushing the record back into the Firehose stream for delivery to its destination.

Elasticsearch + Kibana

Elasticsearch is a distributed, JSON-based search and analytics engine. The AWS ES service is a fully managed service to deploy and run an Elasticsearch cluster. The integration with Firehose in AWS is pretty seamless and the schema of the records is automatically recognised by ES and reflected into an ES index.

The real-time visualisation dashboard was created using Kibana, which comes built-in with Elasticsearch. It displays a number of metrics and updates automatically every 60 seconds.

cats v dogs aws architecture

Plans & ideas for enhancements