Kinesis
- Collect, process, and analyze streaming data in real time
- Logs, metrics, clickstreams, IoT telemetry
- Kinesis has shards
- Each shard allows for 1 MB/s of incoming data and 2 MB/s of outgoing data
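- Consumer instances running out of capacity -> increase the instance size
- Not enough shard throughput -> increase the number of shards
- The number of consumer instances should not exceed the number of shards, because each shard is read by only one instance and any extra instances sit idle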
- The order is guaranteed within a shard but not across shards.
- Producers (apps, clients, the SDK, the Kinesis Agent) send records (data)
- A record consists of a partition key and a data blob
- Write limit: 1 MB/sec or 1000 records/sec per shard
- The partition key is used to distribute data across shards.
- Consumers (apps, Lambda, Firehose, Data Analytics)
- Classic consumer
- Cheaper
- 2 MB/sec per shard across all consumers
- Use it when there are only a few consuming applications
- Max 5 GetRecords API calls/sec per shard
- Latency ~200 ms
- Enhanced consumer
- 2 MB/sec per consumer per shard
- Use it when many applications consume the same stream
- Latency ~70 ms
- More expensive
- Billing per shard
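- No fixed upper limit on the number of shards (account quotas apply but can be raised)
- Retention: 1 to 365 days
- Ability to reprocess (replay) data. This is the main difference from SQS.
- Once data is inserted into Kinesis it cannot be deleted; it is removed only when the retention period expires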
- Kinesis Client Library (KCL)
- A Java library that helps read records from a Kinesis Data Stream
- Each shard is read by only one KCL instance, but a KCL instance can read multiple shards
- Progress is checkpointed in DynamoDB
- It can run on EC2
- Shard Splitting
- Splits a shard into two; used to divide a hot shard
- A way to scale Kinesis up
- The old shard is closed and is deleted once its data expires
- It is not automatic; scaling is done manually
- Shard Merging
- The opposite of splitting: two shards are merged into one
- Reduces capacity and cost
- Kinesis Data Analytics
- SQL applications to analyze the data
- It can read from and write to Kinesis Data Streams and Firehose
- Real-time analytics. Fully managed
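
To make the partition key and shard splitting notes above more concrete, here is a minimal sketch in plain Python (no AWS calls). Kinesis maps a partition key to a shard by taking an MD5 hash of the key and finding the shard whose hash key range contains it. The helper names (`hash_key`, `even_ranges`, `shard_for`, `split_range`), the evenly divided ranges, the midpoint split, and the big-endian byte order are assumptions for illustration only, not the service's actual implementation.

```python
import hashlib
from typing import List, Tuple

HASH_SPACE = 2 ** 128  # an MD5 digest is a 128-bit value

Range = Tuple[int, int]  # inclusive (start, end) hash key range of one shard


def hash_key(partition_key: str) -> int:
    """Hash a partition key to a 128-bit integer (byte order is an assumption)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def even_ranges(shard_count: int) -> List[Range]:
    """Divide the hash space into contiguous, inclusive ranges, one per shard."""
    step = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key: str, ranges: List[Range]) -> int:
    """Return the index of the range that contains the key's hash."""
    h = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= h <= end:
            return index
    raise ValueError("ranges do not cover the hash space")


def split_range(ranges: List[Range], index: int) -> List[Range]:
    """Split one range at its midpoint, like dividing a hot shard in two."""
    start, end = ranges[index]
    if start == end:
        raise ValueError("range is too small to split")
    mid = start + (end - start) // 2 + 1  # first hash key of the second child
    return ranges[:index] + [(start, mid - 1), (mid, end)] + ranges[index + 1:]


if __name__ == "__main__":
    ranges = even_ranges(4)
    hot = shard_for("user-42", ranges)
    print("user-42 lands on shard index", hot)
    print("after splitting it:", shard_for("user-42", split_range(ranges, hot)))
```

Because the same key always hashes to the same value, all records for one key land on one shard, which is why ordering holds within a shard but not across shards.
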
Kinesis Data Firehose
- Fully managed service, serverless
- Pay for the data going through Firehose
- Near real-time (about 60 seconds of latency)
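
The near real-time latency comes from buffering: Firehose collects incoming records and delivers them as a batch once a buffer size or a buffer interval is reached, whichever comes first. The sketch below imitates that rule in plain Python. The class name `BufferSketch`, its methods, and the 1 MiB size threshold are hypothetical; only the 60-second interval comes from the notes above.

```python
class BufferSketch:
    """Toy model of size-or-time buffering (not the Firehose API)."""

    def __init__(self, max_bytes=1024 * 1024, max_seconds=60.0):
        self.max_bytes = max_bytes      # assumed size threshold
        self.max_seconds = max_seconds  # interval threshold from the notes
        self.records = []
        self.size = 0
        self.opened_at = None           # time the first buffered record arrived
        self.delivered = []             # batches already "sent" to the destination

    def put(self, record: bytes, now: float) -> None:
        """Buffer one record, then deliver if a threshold has been reached."""
        if self.opened_at is None:
            self.opened_at = now
        self.records.append(record)
        self.size += len(record)
        self.flush_if_due(now)

    def flush_if_due(self, now: float) -> bool:
        """Deliver the batch if it is big enough or old enough."""
        if not self.records:
            return False
        too_big = self.size >= self.max_bytes
        too_old = now - self.opened_at >= self.max_seconds
        if too_big or too_old:
            self.delivered.append(self.records)
            self.records, self.size, self.opened_at = [], 0, None
            return True
        return False
```

With a low-volume stream the size threshold is rarely hit, so a record can wait up to the full interval before delivery, which matches the roughly 60 seconds noted above.
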
Firehose: 3 kinds of destinations
- AWS (S3, Redshift, Elasticsearch)
- 3rd party
- Custom destinations (HTTP endpoint)
Firehose vs Data Streams
- Data Streams requires writing custom code for the producer and consumer, while Firehose is fully managed
- Data Streams is real-time (~200 ms), while Firehose is near real-time (~60 seconds)
- Data Streams stores data for 1 to 365 days, while Firehose does not store data
- Data Streams supports replay, while Firehose does not
- Data Streams is scaled manually, while Firehose auto-scales
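
Since Data Streams capacity is sized by hand, a rough shard estimate can be derived from the per-shard limits listed earlier (1 MB/s or 1000 records/s in, 2 MB/s out shared by classic consumers). This is a back-of-the-envelope sketch: the function name `shards_needed` is hypothetical, and it assumes every classic consumer reads the full stream.

```python
import math

WRITE_MB_PER_SHARD = 1.0        # incoming MB/s per shard
WRITE_RECORDS_PER_SHARD = 1000  # incoming records/s per shard
READ_MB_PER_SHARD = 2.0         # outgoing MB/s per shard, shared by classic consumers


def shards_needed(in_mb_s: float, in_records_s: float, classic_consumers: int = 1) -> int:
    """Estimate the shard count as the tightest of the three per-shard limits."""
    by_bytes = math.ceil(in_mb_s / WRITE_MB_PER_SHARD)
    by_records = math.ceil(in_records_s / WRITE_RECORDS_PER_SHARD)
    # Assumption: each classic consumer reads everything that is written.
    by_reads = math.ceil(in_mb_s * classic_consumers / READ_MB_PER_SHARD)
    return max(1, by_bytes, by_records, by_reads)


if __name__ == "__main__":
    print(shards_needed(in_mb_s=3.5, in_records_s=2000, classic_consumers=1))
```

If the read limit dominates the estimate, enhanced consumers (2 MB/s per consumer per shard) may be a better fit than adding shards.
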