Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and Amazon Athena

Reasons for building a data lake


Characteristics of a data lake

1. Collect Anything: Ability to collect data of any type, from any source
2. Dive in Anywhere: Ability to dive in anywhere, at any level of granularity
3. Flexible Access: Flexible access mechanisms such as BI tools, Spark, ML algorithms and R (raw access / API access)
4. Future Proof: The stored data stays usable as engines and tools improve in the future

Amazon S3 as the data lake

# Amazon S3 is identified as the best place to store data - the 'starting point' of the data lake


Data can arrive in S3 in any form, so we need an ETL process to:
1. Clean
2. Transform
3. Concatenate
4. Convert to better formats
5. Schedule transformations
6. Run event-driven transformations
7. Express transformations as code (sketched below)
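
As a rough illustration, here is a minimal sketch of an AWS Glue (PySpark) job script that reads raw sensor data via the Data Catalog, re-maps fields, and writes the result back to S3 as Parquet. The database, table, and bucket names are hypothetical placeholders:

    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.transforms import ApplyMapping
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    # Standard Glue job boilerplate; Glue passes JOB_NAME at run time
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the raw readings via the Data Catalog (names are hypothetical)
    raw = glue_context.create_dynamic_frame.from_catalog(
        database="sensor_lake", table_name="raw_readings")

    # Clean: keep only the fields we care about, with consistent types
    mapped = ApplyMapping.apply(
        frame=raw,
        mappings=[("device_id", "string", "device_id", "string"),
                  ("ts", "string", "ts", "string"),
                  ("temperature", "double", "temperature", "double")])

    # Convert to a better (columnar) format and write back to S3
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://my-data-lake/curated/readings/"},
        format="parquet")

    job.commit()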


Metadata: AWS Glue Data Catalog (the central metadata catalog for the data lake)
It lets you share metadata between Amazon Athena, Amazon Redshift Spectrum, Amazon EMR and JDBC sources

It has the following extensions:
Search over metadata for data discovery
Connection info: JDBC URLs, credentials
Classification for identifying and parsing files
Versioning of table metadata as schemas evolve and other metadata is updated
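
Because the catalog is shared, any SDK client resolves the same table definitions that Athena, Redshift Spectrum and EMR use. A small boto3 sketch (database and table names are hypothetical):

    import boto3

    glue = boto3.client("glue")

    # List the tables registered in a database, with their S3 locations
    for table in glue.get_tables(DatabaseName="sensor_lake")["TableList"]:
        print(table["Name"], table["StorageDescriptor"]["Location"])

    # Fetch one table's schema; every engine sees this same definition
    response = glue.get_table(DatabaseName="sensor_lake", Name="raw_readings")
    for column in response["Table"]["StorageDescriptor"]["Columns"]:
        print(column["Name"], column["Type"])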

Data Catalog crawlers: AWS Glue Data Catalog - Crawlers (help catalog your data)
Crawlers automatically build your Data Catalog and keep it in sync:

# Automatically discover new data and extract schema definitions
# Detect schema changes and version tables
# Detect Hive-style partitions on Amazon S3

Built-in classifiers for popular file types; custom classifiers using Grok expressions

Run ad hoc or on a schedule; serverless - you pay only while the crawler runs
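
To make this concrete, a crawler can be created and started through boto3. The crawler name, IAM role, database, and S3 path below are hypothetical placeholders:

    import boto3

    glue = boto3.client("glue")

    # Point a crawler at the raw zone of the lake
    glue.create_crawler(
        Name="raw-readings-crawler",
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
        DatabaseName="sensor_lake",
        Targets={"S3Targets": [{"Path": "s3://my-data-lake/raw/readings/"}]},
        Schedule="cron(0 2 * * ? *)")  # nightly; omit Schedule to run ad hoc

    # Kick it off on demand; you pay only while it runs
    glue.start_crawler(Name="raw-readings-crawler")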

Partitioning helps downstream engines reduce the amount of data scanned while querying
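
For example, if the raw readings are laid out in Hive-style date partitions (layout hypothetical):

    s3://my-data-lake/raw/readings/year=2017/month=06/day=01/...
    s3://my-data-lake/raw/readings/year=2017/month=06/day=02/...

a query that filters on WHERE year = '2017' AND month = '06' AND day = '02' reads only the matching prefix instead of the whole dataset.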

Example:

Sensor / IoT device -> record-level data
(e.g. temperature readings)
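
Each reading might land in the stream, and eventually in S3, as a small JSON record like this (field names hypothetical):

    {"device_id": "sensor-0042", "ts": "2017-06-02T10:15:00Z", "temperature": 71.3}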

Business questions
1. What is going on with a specific sensor?
2. Daily aggregations (per device: inefficiencies, average temperature)
3. A real-time view of how many sensors are showing inefficiencies

Operational requirements
1. Scale
2. High availability
3. Low management overhead
4. Pay only for what I need


Querying it in Amazon Athena
# Either create a crawler to auto-generate the schema
or
# Write the DDL in the Athena console / API / JDBC / ODBC driver
# Start querying the data
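
Both options can be driven through the SDK. A boto3 sketch where the DDL defines a table over the raw S3 data and then answers business question #1; all names and the results location are hypothetical (Athena is asynchronous, so each call returns a query-execution id you can poll):

    import boto3

    athena = boto3.client("athena")

    ddl = """
    CREATE EXTERNAL TABLE IF NOT EXISTS sensor_lake.raw_readings (
      device_id string,
      ts string,
      temperature double
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://my-data-lake/raw/readings/'
    """

    # Create the table over the data already sitting in S3
    athena.start_query_execution(
        QueryString=ddl,
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"})

    # "What is going on with a specific sensor?" is now plain SQL
    athena.start_query_execution(
        QueryString=("SELECT * FROM sensor_lake.raw_readings "
                     "WHERE device_id = 'sensor-0042' LIMIT 100"),
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"})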

Daily aggregation
Take the raw dataset and add a Glue-based, event-driven pipeline that ETLs the data

AWS Glue Job
# Serverless, event-driven execution
# Data is written out to S3
# Output table is automatically created in Amazon Athena
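
'Event-driven' can mean, for example, a small Lambda function that starts the Glue job whenever new raw data lands in S3. A minimal sketch; the job name, job argument, and the S3 trigger wiring are hypothetical:

    import boto3

    glue = boto3.client("glue")

    def handler(event, context):
        """Lambda handler fired by an S3 PUT event on the raw prefix."""
        for record in event["Records"]:
            key = record["s3"]["object"]["key"]
            # Start the aggregation job for the newly arrived object
            glue.start_job_run(
                JobName="daily-aggregation-job",
                Arguments={"--input_key": key})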


For the real-time view, 'Kinesis Analytics' computes the inefficiency count over the stream and 'Kinesis Firehose' pushes it into S3
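
On the ingest side, the sensors (or a gateway in front of them) push readings into the Kinesis stream that Kinesis Analytics reads from. A minimal producer sketch with boto3; the stream name and record shape are hypothetical:

    import json
    import boto3

    kinesis = boto3.client("kinesis")

    def publish_reading(device_id, temperature):
        # Partition by device so each sensor's readings stay ordered
        kinesis.put_record(
            StreamName="sensor-readings",
            Data=json.dumps({"device_id": device_id,
                             "temperature": temperature}).encode("utf-8"),
            PartitionKey=device_id)

    publish_reading("sensor-0042", 71.3)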

This architecture has the following characteristics, with "no server to manage":
# Scale to hundreds of thousands of data sources
# Virtually infinite storage scalability
# Real-time and batch processing layers
# Interactive queries
# Highly available and durable
# Pay only for what you use
