AWS Certified Developer Exam Notes – Kinesis

Kinesis

  • Collect, process, and analyze streaming data in real time
  • Logs, metrics, clickstreams, IoT telemetry
  • A Kinesis data stream is divided into shards
    • Each shard allows for 1 MB/s of incoming data and 2 MB/s of outgoing data
    • Insufficient instance capacity -> increase the instance size
    • Insufficient shards -> increase the number of shards
    • The number of consumer instances should not exceed the number of shards (extra instances would sit idle)
    • The order is guaranteed within a shard but not across shards.
  • Producers (apps, clients, SDK, Kinesis Agent) send records (data)
    • A record consists of a partition key and a data blob
    • Write limit: 1 MB/sec or 1000 records/sec per shard
  • The partition key is used to distribute data across shards.
  • Consumers (apps, Lambda, Firehose, Data Analytics)
    • Classic consumer
      • Cheaper
      • 2 MB/sec per shard across all consumers
      • Use it when there are only a few consuming applications
      • Max 5 GetRecords API calls/sec per shard
      • Latency ~200 ms
    • Enhanced consumer (enhanced fan-out)
      • 2 MB/sec per consumer per shard
      • Use it when there are many consuming applications
      • Latency ~70 ms
      • More expensive
  • Billing per shard
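    • There is no hard upper limit on shards; the default account quota can be raised on request
  • Retention: 1 to 365 days
  • Ability to reprocess (replay) data. This is the main difference from SQS.
  • Once data is inserted in Kinesis it cannot be deleted; it is removed only when the retention period expires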
  • Kinesis Client Library (KCL)
    • Java library that helps read records from a Kinesis data stream
    • Each shard is read by only one KCL instance, but a KCL instance can read multiple shards
    • Progress is checkpointed into DynamoDB
    • It can run on EC2
  • Shard Splitting
    • Split shard into two, used to divide a hot shard
    • Method to scale Kinesis
    • The old shard is closed and is deleted once its data expires
    • It is not automatic; scaling is done manually
  • Shard Merging
    • Opposite operation of splitting
    • Reduce capacity and cost
  • Kinesis Data Analytics
    • SQL application to analyze the streaming data
    • It can read from and write to Kinesis Data Streams and Firehose
    • Real-time analytics. Fully managed
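
The shard limits above can be turned into a quick sizing check. The sketch below is only an illustration: the function names are hypothetical, and `shard_index` assumes the shards have evenly split hash key ranges, which stops being true after uneven splits or merges. It relies on the documented behaviour that Kinesis maps a partition key to a 128-bit integer with MD5 and that each shard owns a range of those values.

```python
import hashlib
import math

HASH_KEY_SPACE = 2**128  # number of possible MD5 hash key values


def shards_needed(ingest_mb_per_s, records_per_s, egress_mb_per_s):
    """Smallest shard count that satisfies every per-shard limit in the notes."""
    return max(
        1,
        math.ceil(ingest_mb_per_s / 1.0),   # 1 MB/s of incoming data per shard
        math.ceil(records_per_s / 1000.0),  # 1000 records/s written per shard
        math.ceil(egress_mb_per_s / 2.0),   # 2 MB/s of outgoing data per shard
    )


def shard_index(partition_key, shard_count):
    """Pick the shard for a key, ASSUMING evenly split hash key ranges."""
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return hash_key * shard_count // HASH_KEY_SPACE


def split_range(start, end):
    """Split one shard's hash key range at the midpoint into two child ranges."""
    mid = (start + end) // 2
    return (start, mid), (mid + 1, end)


def read_mb_per_s_per_consumer(consumers, enhanced):
    """Classic consumers share 2 MB/s per shard; enhanced consumers get 2 MB/s each."""
    return 2.0 if enhanced else 2.0 / consumers
```

For example, `shards_needed(4.5, 2500, 6)` gives 5 because the 4.5 MB/s ingest rate is the binding limit. Since the same partition key always hashes to the same value, all of its records land in one shard, which is why ordering holds within a shard but not across shards.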

Kinesis Data Firehose

  • Fully managed service, serverless
  • Pay only for the data going through Firehose
  • Near real-time (about 60 seconds of latency)
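
The near real-time latency comes from buffering: Firehose collects records into a batch and delivers it when a size or time threshold is reached, whichever comes first. The simulation below is a rough sketch of that idea, not Firehose's actual implementation. The function name, the 5 MB and 60 second values, and the choice to start the timer at the first record of a batch are all assumptions made for illustration.

```python
def flush_times(arrivals, buffer_mb=5.0, buffer_seconds=60.0):
    """Return (flush_time_s, batch_mb) pairs for (timestamp_s, size_mb) arrivals.

    Arrivals must be sorted by timestamp. A batch is delivered when it reaches
    buffer_mb, or buffer_seconds after its first record, whichever comes first.
    """
    flushes = []
    batch_mb = 0.0
    batch_start = None
    for ts, size_mb in arrivals:
        # Deliver a batch whose time window expired before this record arrived.
        if batch_start is not None and ts - batch_start >= buffer_seconds:
            flushes.append((batch_start + buffer_seconds, batch_mb))
            batch_mb, batch_start = 0.0, None
        if batch_start is None:
            batch_start = ts
        batch_mb += size_mb
        # Deliver immediately once the size threshold is reached.
        if batch_mb >= buffer_mb:
            flushes.append((ts, batch_mb))
            batch_mb, batch_start = 0.0, None
    # Whatever is still buffered goes out when its time window closes.
    if batch_start is not None:
        flushes.append((batch_start + buffer_seconds, batch_mb))
    return flushes
```

With a slow trickle of small records the time threshold dominates, so a record can wait up to the full interval before delivery. With a heavy stream the size threshold triggers first and batches go out much sooner.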

Firehose destinations (3 kinds)

  • AWS services (S3, Redshift, Elasticsearch)
  • Third-party services
  • Custom destinations (HTTP endpoint)

Firehose vs Data Streams

  • Data Streams requires writing custom code for producers and consumers while Firehose is fully managed
  • Data Streams is real-time (~200 ms) while Firehose is near real-time (~60 seconds)
  • Data Streams stores data for 1 to 365 days while Firehose does not store data
  • Data Streams supports replay while Firehose does not
  • Data Streams is scaled manually (shard splitting and merging) while Firehose scales automatically
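
Retention and replay can be pictured with a small in-memory model. This is only a toy sketch under stated assumptions: the class `RetainedStream` and its methods are hypothetical names, time is passed in explicitly as seconds, and nothing here calls the real Kinesis API. It shows that reading never removes a record, so a consumer can read the same data again, and that records leave the stream only by ageing out.

```python
from collections import deque


class RetainedStream:
    """Toy append-only stream with time-based retention (hypothetical model)."""

    def __init__(self, retention_s):
        self.retention_s = retention_s
        self._records = deque()  # (arrival_time_s, data), oldest first

    def put(self, now_s, data):
        """Append a record; existing records are never modified or deleted."""
        self._records.append((now_s, data))

    def expire(self, now_s):
        """Drop records older than the retention period."""
        while self._records and now_s - self._records[0][0] >= self.retention_s:
            self._records.popleft()

    def read_since(self, start_s):
        """Return all retained records that arrived at or after start_s."""
        return [data for t, data in self._records if t >= start_s]


stream = RetainedStream(retention_s=24 * 3600)
stream.put(0, "a")
stream.put(10, "b")
first_pass = stream.read_since(0)
replay = stream.read_since(0)  # reading again returns the same records
```

Firehose has no equivalent of `read_since`: once a batch is delivered to its destination, it cannot be read back from Firehose itself.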