Amazon Streaming Data in Real-time with Amazon Kinesis Analytics

July 18, 2018

Amazon Streaming Data in Real-time with Amazon Kinesis Analytics (Quick note - 2016)

Processing real-time, streaming data
Ingest -> Transform -> Analyze -> React -> Persist

Key requirements :

Durable
Correct
Continuous
Reactive
Fast
Reliable

Amazon Kinesis : Comprises three different products

Amazon Kinesis Streams :

Send clickstream data to Kinesis Streams -> Kinesis Streams stores and exposes clickstream data for processing -> Custom application built on Kinesis Client Library (KCL) or Spark Streaming makes real-time content recommendations -> Readers see personalized content suggestions

1. For technical developers
2. Collect and stream data for ordered, replayable, real-time processing
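
To make "ordered, replayable" concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the Kinesis API: the class name `ReplayableStream` and its methods are hypothetical, and a list index stands in for a sequence number.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ReplayableStream:
    """Toy in-memory stand-in for one ordered stream (hypothetical, not the Kinesis API)."""
    _records: List[bytes] = field(default_factory=list)

    def put(self, data: bytes) -> int:
        """Append a record and return its sequence number (its position in the log)."""
        self._records.append(data)
        return len(self._records) - 1

    def read_from(self, sequence_number: int) -> List[Tuple[int, bytes]]:
        """Return every record at or after sequence_number, in arrival order."""
        return list(enumerate(self._records))[sequence_number:]


stream = ReplayableStream()
for click in (b"home", b"search", b"product"):
    stream.put(click)

first_pass = stream.read_from(0)  # one consumer reads everything in order
replayed = stream.read_from(1)    # another consumer replays from position 1
```

Because records are kept rather than removed on read, several consumers can process the same data independently and at their own pace.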

Amazon Kinesis Firehose :

Capture & submit streaming data to Firehose -> Firehose loads streaming data continuously into S3, Redshift, and Amazon Elasticsearch domains -> Analyze streaming data using your favourite BI tool

1. For all developers and data scientists
2. Easily load massive volumes of streaming data into Amazon S3, Amazon Redshift, Amazon Elasticsearch Service

Amazon Kinesis Analytics :

Capture streaming data with Kinesis Streams or Kinesis Firehose -> Run standard SQL queries against data streams -> Kinesis Analytics can send processed data to analytics tools so you can create alerts and respond in real-time

1. For all developers
2. Easily analyze data streams using standard SQL queries

# Easy to use
# Automatic elasticity
# Real-time processing
# Pay for only what you use
# Standard SQL for analytics

Use SQL to build real-time applications

1. Connect to streaming source
    # Streaming data sources include Amazon Kinesis Firehose or Amazon Kinesis Streams
    # Input formats include JSON, CSV, variable column, or unstructured text
    # Each input has a schema; schema is inferred, but you can edit
    # Reference data source (S3) for data enrichment
2. Easily write SQL code to process streaming data
    # Build streaming applications with one or more SQL statements
    # Robust SQL support and advanced analytic functions
    # Extensions to the SQL standard to work seamlessly with streaming data
    # Support for at-least-once processing semantics
3. Continuously deliver SQL results
    # Send processed data to multiple destinations
        @ S3, Amazon Redshift, Amazon ES (through Firehose)
        @ Streams (with AWS Lambda integration for custom destination)
    # End-to-end processing latency as low as sub-second
    # Separation of processing and data delivery
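
At-least-once processing means a result may occasionally be delivered more than once, so the destination should tolerate duplicates. A minimal Python sketch of one common approach (an idempotent upsert keyed by a result ID) follows. The function name and the `window-N` IDs are hypothetical and not part of Kinesis Analytics.

```python
from typing import Dict, Iterable, Tuple


def apply_idempotently(results: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Upsert (result_id, value) pairs so that a redelivered pair is harmless."""
    store: Dict[str, int] = {}
    for result_id, value in results:
        # Writing the same key twice still leaves exactly one row.
        store[result_id] = value
    return store


# "window-1" arrives twice, as can happen with at-least-once delivery.
delivered = [("window-1", 10), ("window-2", 7), ("window-1", 10)]
final_state = apply_idempotently(delivered)
```

An append-only destination would instead need an explicit de-duplication step on read.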

What are common uses for Amazon Kinesis Analytics?

1. Generate time series analytics
# Compute key performance indicators over time periods
# Combine with static or historical data in S3 or Amazon Redshift

2. Feed real-time dashboards
# Validate and transform raw data, and then process to calculate meaningful statistics
# Send processed data downstream for visualization in BI and visualization services.

3. Create real-time alarms and notifications
# Build sequences of events from the stream, like user sessions in a clickstream or app behaviour through logs
# Identify events (or a series of events) of interest and react to the data through alarms and notifications
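
As an illustration of building user sessions from a clickstream and reacting to them, here is a rough Python sketch. The 30-minute inactivity gap, the function names, and the "three events in one session" alert rule are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

SESSION_GAP_SECONDS = 30 * 60  # hypothetical inactivity timeout


def sessionize(events: List[Tuple[str, int]]) -> Dict[str, List[List[int]]]:
    """Group (user_id, epoch_seconds) events into per-user sessions."""
    sessions: Dict[str, List[List[int]]] = {}
    for user, ts in sorted(events, key=lambda e: e[1]):
        user_sessions = sessions.setdefault(user, [])
        if user_sessions and ts - user_sessions[-1][-1] <= SESSION_GAP_SECONDS:
            user_sessions[-1].append(ts)  # continues the current session
        else:
            user_sessions.append([ts])    # gap too long: start a new session
    return sessions


def users_to_alert(sessions: Dict[str, List[List[int]]], min_events: int = 3) -> List[str]:
    """Return users with at least one session containing min_events or more events."""
    return sorted(u for u, ss in sessions.items() if any(len(s) >= min_events for s in ss))


clicks = [("alice", 0), ("alice", 600), ("bob", 100), ("alice", 5000), ("alice", 1200)]
sessions = sessionize(clicks)
alerts = users_to_alert(sessions)
```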

How do we aggregate streaming data?

# A common requirement in streaming analytics is to perform set-based operations (count, average, max, min..) over events that arrive within a specified period of time
# Cannot simply aggregate over an entire table as with a typical static database
# How do we define a subset in a potentially infinite stream?
# Windowing functions!

Window concept
# Windows can be tumbling or sliding
# Windows are fixed length

Output record will have the timestamp of the end of the window
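
A tumbling window can be sketched in a few lines of Python. This is a batch approximation for illustration only (a real streaming engine emits each window as it closes). The function name and the ticker data are made up. Each output row carries the timestamp of the end of its window, as described above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def tumbling_counts(
    events: Iterable[Tuple[int, str]], window_seconds: int
) -> List[Tuple[int, str, int]]:
    """Count events per key in fixed, non-overlapping windows.

    Each output row is (window_end, key, count).
    """
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for ts, key in events:
        window_start = ts - ts % window_seconds  # align to the window boundary
        counts[(window_start + window_seconds, key)] += 1
    return sorted((end, key, n) for (end, key), n in counts.items())


ticks = [(0, "AMZN"), (15, "AMZN"), (59, "MSFT"), (60, "AMZN"), (119, "AMZN")]
rows = tumbling_counts(ticks, window_seconds=60)
```

Every event falls into exactly one window, so the counts never overlap.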

Comparing types of windows
# Output created at the end of the window
# The output of the window will be a single event based on the aggregate function used
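
For contrast, here is a rough Python sketch of a sliding window. It assumes events arrive in timestamp order and that one output row is emitted per incoming event, covering the interval (ts - window_seconds, ts]. That emission rule is one common convention and an assumption here. The function name is hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, List, Tuple


def sliding_averages(
    events: Iterable[Tuple[int, float]], window_seconds: int
) -> List[Tuple[int, float]]:
    """Average of the values seen in the last window_seconds, emitted per event."""
    window: Deque[Tuple[int, float]] = deque()
    total = 0.0
    out: List[Tuple[int, float]] = []
    for ts, value in events:
        window.append((ts, value))
        total += value
        # Evict events that have slid out of the fixed-length window.
        while window[0][0] <= ts - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        out.append((ts, total / len(window)))
    return out


prices = [(0, 10.0), (30, 20.0), (60, 30.0), (90, 40.0)]
averages = sliding_averages(prices, window_seconds=60)
```

Unlike a tumbling window, consecutive sliding windows overlap, so one event can contribute to several outputs.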

Amazon Kinesis Analytics Best Practices

1. Managing Applications

1. Set up CloudWatch Alarms
# The MillisBehindLatest metric tracks how far the application lags behind the source
# Alarm on the MillisBehindLatest metric. Consider triggering when the application is 1 hour behind, on a 1-minute average. Adjust accordingly for applications with lower end-to-end processing needs.

2. Increase input parallelism to improve performance
# By default, a single source in-application stream is created
# If the application is not keeping up with the input stream, consider increasing input parallelism to create multiple source in-application streams

3. Limit number of applications reading from same source
# Avoid ReadProvisionedThroughputExceeded exception
# For an Amazon Kinesis Streams source, limit to 2 total applications
# For an Amazon Kinesis Firehose source, limit to 1 application
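
The alarm rule suggested above (1 hour behind, on a 1-minute average) can be expressed as a small check. This Python sketch only models the threshold logic; it does not call CloudWatch, and the function name is hypothetical.

```python
from statistics import mean
from typing import List

ONE_HOUR_MS = 60 * 60 * 1000  # 1 hour, in the metric's unit (milliseconds)


def should_alarm(millis_behind_samples: List[int], threshold_ms: int = ONE_HOUR_MS) -> bool:
    """True when the average of one minute of MillisBehindLatest samples reaches the threshold."""
    if not millis_behind_samples:
        return False  # no data in this minute: nothing to evaluate in this sketch
    return mean(millis_behind_samples) >= threshold_ms


lagging = should_alarm([3_500_000, 3_700_000, 3_900_000])  # average is 3,700,000 ms
healthy = should_alarm([1_000, 2_000])
```

For applications with tighter latency needs, pass a smaller `threshold_ms`.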

2. Defining Input Schema

# Review and adequately test inferred input schema
# Manually update the schema to handle nested JSON with more than two levels of depth
# Use SQL functions in your application for unstructured data
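
When reviewing an inferred schema, it can help to list every leaf of a sample record together with its path, including leaves nested more than two levels deep. The Python sketch below does this with JSONPath-style row paths. The function name, the sample record, and the use of Python type names as column types are assumptions made for illustration.

```python
import json
from typing import Any, Dict


def infer_columns(record: Any, path: str = "$") -> Dict[str, str]:
    """Map each leaf of a JSON record to a JSONPath-style row path and a type name."""
    if isinstance(record, dict):
        columns: Dict[str, str] = {}
        for key, value in record.items():
            columns.update(infer_columns(value, f"{path}.{key}"))
        return columns
    # Non-dict values (including lists) are treated as leaves in this sketch.
    return {path: type(record).__name__}


sample = json.loads('{"ticker": "AMZN", "price": 101.5, "meta": {"venue": {"code": "XNAS"}}}')
columns = infer_columns(sample)
```

The deeply nested `$.meta.venue.code` entry is the kind of column that would need to be added by hand.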

3. Authoring Application Code

# Avoid time-based windows greater than one hour
# Keep window sizes small during development
# Use smaller SQL queries, with multiple in-application streams, rather than a single, large query

Limits

# Maximum row size in an in-application stream is 50 KB
# Maximum input parallelism is 10 in-application streams.
# Each application supports one streaming source, and one reference data source. The reference data source can be no larger than 1 GB in size.
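
These limits can be checked before data reaches the application. The Python sketch below is a rough pre-flight check: it assumes "50 KB" means 50 * 1024 bytes and approximates row size by the UTF-8 JSON encoding, neither of which is confirmed by the notes above. The function names are hypothetical.

```python
import json
from typing import Any, Dict

MAX_ROW_BYTES = 50 * 1024   # assumption: "50 KB" taken as 50 * 1024 bytes
MAX_INPUT_PARALLELISM = 10  # maximum number of in-application streams


def row_fits(row: Dict[str, Any]) -> bool:
    """Approximate a row's size by its UTF-8 JSON encoding and compare to the limit."""
    return len(json.dumps(row).encode("utf-8")) <= MAX_ROW_BYTES


def clamp_parallelism(requested: int) -> int:
    """Keep a requested input parallelism between 1 and the maximum."""
    return max(1, min(requested, MAX_INPUT_PARALLELISM))


small_ok = row_fits({"msg": "ok"})
large_ok = row_fits({"msg": "x" * 60_000})
parallelism = clamp_parallelism(16)
```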

Price

# Pay only for what you use.
# You are charged an hourly rate, based on the average number of Amazon Kinesis Processing Units (KPU) used to run your application.
# A single KPU provides one vCPU and 4 GB of memory
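
The pricing model reduces to simple arithmetic: sum the average KPUs used in each hour and multiply by the hourly rate. In this Python sketch the rate is a placeholder parameter (the notes above do not state a price), and any rounding rules the service may apply are ignored.

```python
from typing import Sequence

MEMORY_GB_PER_KPU = 4  # from the notes: one KPU provides one vCPU and 4 GB of memory


def estimate_cost(avg_kpus_per_hour: Sequence[float], rate_per_kpu_hour: float) -> float:
    """Total charge = sum of hourly average KPUs * hourly rate (rate is hypothetical)."""
    return sum(avg_kpus_per_hour) * rate_per_kpu_hour


def memory_gb(kpus: int) -> int:
    """Memory available to an application running on the given number of KPUs."""
    return kpus * MEMORY_GB_PER_KPU


# Three hours of usage averaging 2, 2, and 4 KPUs, at a placeholder rate of 0.10 per KPU-hour.
cost = estimate_cost([2, 2, 4], rate_per_kpu_hour=0.10)
```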
