Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and Amazon Athena

Reasons for building a data lake


Characteristics of a data lake

1. Collect Anything: Ability to collect data of any type, from any source
2. Dive in Anywhere: Ability to dive in anywhere, at any level of granularity
3. Flexible Access: Flexible access mechanisms such as BI tools, Spark, ML algorithms and R (raw access / API access)
4. Future Proof: The stored data stays usable as engines and tools improve in the future

Amazon S3 as the data lake

# Amazon S3 is identified as the best place to store data - the 'starting point' of the data lake


Data can arrive in S3 in any form, so we need an ETL process to:
1. Clean
2. Transform
3. Concatenate
4. Convert to better formats
5. Schedule transformations
6. Run event-driven transformations
7. Express transformations as code (sketched below)
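
As a rough illustration, here is a minimal sketch of an AWS Glue (PySpark) job script that reads raw sensor data via the Data Catalog, re-maps fields, and writes the result back to S3 as Parquet. The database, table, and bucket names are hypothetical placeholders:

    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.transforms import ApplyMapping
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    # Standard Glue job boilerplate; Glue passes JOB_NAME at run time
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the raw readings via the Data Catalog (names are hypothetical)
    raw = glue_context.create_dynamic_frame.from_catalog(
        database="sensor_lake", table_name="raw_readings")

    # Clean: keep only the fields we care about, with consistent types
    mapped = ApplyMapping.apply(
        frame=raw,
        mappings=[("device_id", "string", "device_id", "string"),
                  ("ts", "string", "ts", "string"),
                  ("temperature", "double", "temperature", "double")])

    # Convert to a better (columnar) format and write back to S3
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://my-data-lake/curated/readings/"},
        format="parquet")

    job.commit()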


Metadata: AWS Glue Data Catalog (the central metadata catalog for the data lake)
It lets you share metadata between Amazon Athena, Amazon Redshift Spectrum, Amazon EMR and JDBC sources

It has the following extensions:
Search over metadata for data discovery
Connection info: JDBC URLs, credentials
Classification for identifying and parsing files
Versioning of table metadata as schemas evolve and other metadata is updated
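
Because the catalog is shared, any SDK client resolves the same table definitions that Athena, Redshift Spectrum and EMR use. A small boto3 sketch (database and table names are hypothetical):

    import boto3

    glue = boto3.client("glue")

    # List the tables registered in a database, with their S3 locations
    for table in glue.get_tables(DatabaseName="sensor_lake")["TableList"]:
        print(table["Name"], table["StorageDescriptor"]["Location"])

    # Fetch one table's schema; every engine sees this same definition
    response = glue.get_table(DatabaseName="sensor_lake", Name="raw_readings")
    for column in response["Table"]["StorageDescriptor"]["Columns"]:
        print(column["Name"], column["Type"])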

Data Catalog crawlers: AWS Glue Data Catalog - Crawlers (help catalog your data)
Crawlers automatically build your Data Catalog and keep it in sync:

# Automatically discover new data and extract schema definitions
# Detect schema changes and version tables
# Detect Hive-style partitions on Amazon S3

Built-in classifiers for popular file types; custom classifiers using Grok expressions

Run ad hoc or on a schedule; serverless - you pay only while the crawler runs
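
To make this concrete, a crawler can be created and started through boto3. The crawler name, IAM role, database, and S3 path below are hypothetical placeholders:

    import boto3

    glue = boto3.client("glue")

    # Point a crawler at the raw zone of the lake
    glue.create_crawler(
        Name="raw-readings-crawler",
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
        DatabaseName="sensor_lake",
        Targets={"S3Targets": [{"Path": "s3://my-data-lake/raw/readings/"}]},
        Schedule="cron(0 2 * * ? *)")  # nightly; omit Schedule to run ad hoc

    # Kick it off on demand; you pay only while it runs
    glue.start_crawler(Name="raw-readings-crawler")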

Partitioning helps downstream engines reduce the amount of data scanned while querying
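
For example, if the raw readings are laid out in Hive-style date partitions (layout hypothetical):

    s3://my-data-lake/raw/readings/year=2017/month=06/day=01/...
    s3://my-data-lake/raw/readings/year=2017/month=06/day=02/...

a query that filters on WHERE year = '2017' AND month = '06' AND day = '02' reads only the matching prefix instead of the whole dataset.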

Example:

Sensor / IoT device -> record-level data
(e.g. temperature readings)
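
Each reading might land in the stream, and eventually in S3, as a small JSON record like this (field names hypothetical):

    {"device_id": "sensor-0042", "ts": "2017-06-02T10:15:00Z", "temperature": 71.3}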

Business questions
1. What is going on with a specific sensor?
2. Daily aggregations (per device: inefficiencies, average temperature)
3. A real-time view of how many sensors are showing inefficiencies

Operational requirements
1. Scale
2. High availability
3. Low management overhead
4. Pay only for what I need


Querying it in Amazon Athena
# Either create a crawler to auto-generate the schema
or
# Write the DDL in the Athena console / API / JDBC / ODBC driver
# Start querying the data
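
Both options can be driven through the SDK. A boto3 sketch where the DDL defines a table over the raw S3 data and then answers business question #1; all names and the results location are hypothetical (Athena is asynchronous, so each call returns a query-execution id you can poll):

    import boto3

    athena = boto3.client("athena")

    ddl = """
    CREATE EXTERNAL TABLE IF NOT EXISTS sensor_lake.raw_readings (
      device_id string,
      ts string,
      temperature double
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://my-data-lake/raw/readings/'
    """

    # Create the table over the data already sitting in S3
    athena.start_query_execution(
        QueryString=ddl,
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"})

    # "What is going on with a specific sensor?" is now plain SQL
    athena.start_query_execution(
        QueryString=("SELECT * FROM sensor_lake.raw_readings "
                     "WHERE device_id = 'sensor-0042' LIMIT 100"),
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"})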

Daily aggregation
Take the raw dataset and add a Glue-based, event-driven pipeline that ETLs the data

AWS Glue Job
# Serverless, event-driven execution
# Data is written out to S3
# Output table is automatically created in Amazon Athena
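
'Event-driven' can mean, for example, a small Lambda function that starts the Glue job whenever new raw data lands in S3. A minimal sketch; the job name, job argument, and the S3 trigger wiring are hypothetical:

    import boto3

    glue = boto3.client("glue")

    def handler(event, context):
        """Lambda handler fired by an S3 PUT event on the raw prefix."""
        for record in event["Records"]:
            key = record["s3"]["object"]["key"]
            # Start the aggregation job for the newly arrived object
            glue.start_job_run(
                JobName="daily-aggregation-job",
                Arguments={"--input_key": key})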


For the real-time view, 'Kinesis Analytics' computes the inefficiency count over the stream and 'Kinesis Firehose' pushes it into S3
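
On the ingest side, the sensors (or a gateway in front of them) push readings into the Kinesis stream that Kinesis Analytics reads from. A minimal producer sketch with boto3; the stream name and record shape are hypothetical:

    import json
    import boto3

    kinesis = boto3.client("kinesis")

    def publish_reading(device_id, temperature):
        # Partition by device so each sensor's readings stay ordered
        kinesis.put_record(
            StreamName="sensor-readings",
            Data=json.dumps({"device_id": device_id,
                             "temperature": temperature}).encode("utf-8"),
            PartitionKey=device_id)

    publish_reading("sensor-0042", 71.3)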

This architecture has the following characteristics, with "no server to manage":
# Scale to hundreds of thousands of data sources
# Virtually infinite storage scalability
# Real-time and batch processing layers
# Interactive queries
# Highly available and durable
# Pay only for what you use
