Near real-time processing application via AWS Lambda and S3

Serkan SAKINMAZ
Published in Towards AWS · May 27, 2021

Near real-time processing is a popular approach in which data must be processed within minutes of arrival: once the data lands in the storage layer, you need to process it within a couple of minutes.

There is no exact specification that defines near real-time in terms of processing latency. I use the following graph from O’Reilly as a reference to illustrate the different processing types; for near real-time, the processing time falls roughly between 5 and 60 minutes.

Source: https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/

For me, the most important thing is to focus on how much value you create with the chosen method. In my previous blog, I explained data processing approaches such as batch, micro-batch, and streaming. Apart from latency, you need to deliver accurate data with a stable processing approach. For example, real-time processing is more challenging in terms of observability and requires more investment.

In this blog, I will explain how to build a near real-time processing application with an approach that is both stable and cheap. We will use the Amazon S3 and Lambda services.

AWS S3 is an object storage service that allows you to store whatever you need without managing any infrastructure. It is cheap, scalable, and secure.

Lambda is a computing service that lets you run Java, Python, Node.js, and Go code without provisioning or managing any servers. S3 provides a useful feature for creating event notifications when a file-based action occurs. In summary, when a specific action happens on S3, such as putting a file, S3 can send an event to the following destinations (a sketch of the event payload Lambda receives follows the list):

  • Amazon Simple Notification Service (Amazon SNS)
  • Amazon Simple Queue Service (Amazon SQS) queue
  • AWS Lambda
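For reference, here is a minimal sketch of what the event payload delivered to Lambda looks like for a put action (the bucket and key values below are illustrative; the full S3 event message contains additional fields):

sample_event = {
    "Records": [
        {
            "eventSource": "aws:s3",
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "bikebuyer"},
                "object": {"key": "BikeBuyerData.csv"}
            }
        }
    ]
}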

Please see the following data flow: when a file is put to S3, S3 triggers the Lambda function to process the file.

Let’s implement a simple application. In the following example, we are going to put a CSV file into an S3 bucket and watch the logs in the Lambda application.

Steps for the near-real-time application

Step 1-Use the following file to process data in near real-time. The BikeBuyerData file consists of customer information and a flag that indicates whether the customer bought a bicycle.

You can see the data structure below.

Columns

ID
Marital Status
Gender
Income
Children
Education
Occupation
Home Owner
Cars
Commute Distance
Region
Age
Purchased Bike
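To make the format concrete, a row of the file might look like the following (the values are invented for illustration and are not taken from the actual BikeBuyerData file):

ID,Marital Status,Gender,Income,Children,Education,Occupation,Home Owner,Cars,Commute Distance,Region,Age,Purchased Bike
12496,Married,Female,40000,1,Bachelors,Skilled Manual,Yes,0,0-1 Miles,Europe,42,No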

Step 2-Create an S3 bucket named bikebuyer
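If you prefer to script this step instead of using the console, a minimal boto3 sketch could look like this (the region is an assumption; use the region where your Lambda function will live):

import boto3

# Create the bucket used in this walkthrough.
# The region below is an assumption; adjust it to your own setup.
s3 = boto3.client('s3', region_name='eu-west-1')
s3.create_bucket(
    Bucket='bikebuyer',
    CreateBucketConfiguration={'LocationConstraint': 'eu-west-1'}
)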

Step 3-Lambda function

Create a Lambda function; the code content will be filled in later (Step 6)
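A placeholder handler like the one below is enough to create the function; it is replaced with the CSV-processing code in Step 6 (the function name bike-buyer-lambda is the one referenced by the trigger later on):

def lambda_handler(event, context):
    # Placeholder only; replaced with the CSV-processing code in Step 6.
    return 'OK'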

Step 4-Define a lambda trigger

In the bikebuyer S3 bucket, click on the Properties tab.

Scroll down and click Create event notification.

Step 5-Define the event notification

The definition of the event notification is straightforward; just fill out the following information (a boto3 sketch of the same configuration follows the list):

Event name : bike-buyer-event
Suffix : CSV
Prefix : left empty for now
Event type : Put
Destination : Lambda function
Lambda function : bike-buyer-lambda
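The same notification can also be defined programmatically. Here is a boto3 sketch of the equivalent configuration (the Lambda function ARN is a placeholder; when configuring outside the console, you may also need to grant S3 permission to invoke the function):

import boto3

s3 = boto3.client('s3')

# Attach the put-event notification to the bikebuyer bucket.
# The LambdaFunctionArn is a placeholder, not a real ARN.
s3.put_bucket_notification_configuration(
    Bucket='bikebuyer',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [
            {
                'Id': 'bike-buyer-event',
                'LambdaFunctionArn': 'arn:aws:lambda:<region>:<account-id>:function:bike-buyer-lambda',
                'Events': ['s3:ObjectCreated:Put'],
                'Filter': {
                    'Key': {
                        'FilterRules': [
                            {'Name': 'suffix', 'Value': 'CSV'}
                        ]
                    }
                }
            }
        ]
    }
)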

Step 6-Implement a simple Python function within Lambda that will be triggered once the file is put

I won’t go into deep detail on the Python code; the following code logs some information about the CSV, such as the header, content type, and file name. The point to note is that you don’t need to run the Lambda function yourself: when you put the file to S3, Lambda is triggered automatically.

import json
import urllib.parse
import boto3
import csv
print('Loading function')

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # print("Received event: " + json.dumps(event, indent=2))
    # Get the object from the event and show its content type
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    try:
        response = s3.get_object(Bucket=bucket, Key=key)
        print("CONTENT TYPE: " + response['ContentType'])
        print("Key: " + key)

        # Read the object body and split it into lines for the CSV reader
        s3_resource = boto3.resource('s3')
        s3_object = s3_resource.Object(bucket, key)
        data = s3_object.get()['Body'].read().decode('utf-8').splitlines()

        # Log the header row of the CSV
        lines = csv.reader(data)
        headers = next(lines)
        print('headers: %s' % (headers))
        # for line in lines:
        #     # print the complete line
        #     print(line)
        #     # print columns by index
        #     print(line[0], line[1])
        return response['ContentType']
    except Exception as e:
        print(e)
        print('Error getting object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(key, bucket))
        raise e

Step 7-Run the code

There is no need to run the code directly; all you need to do is put the BikeBuyer data into the S3 bucket. After putting the file, you will see that the Lambda function is triggered.

Upload data to S3
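You can upload through the console, or with a short boto3 sketch like this (the local file name is an assumption; note that S3 suffix filters are case-sensitive, so the key must end with the suffix configured in Step 5):

import boto3

s3 = boto3.client('s3')

# Uploading the object is the only action required; S3 fires the event,
# which triggers bike-buyer-lambda automatically.
s3.upload_file('BikeBuyerData.csv', 'bikebuyer', 'BikeBuyerData.csv')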

Once you upload the file, the Lambda function will be triggered immediately.

When you check the logs via CloudWatch, you can see that the function was triggered.

The Lambda function is called and prints whatever you logged in the code.
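Given the print statements in the handler, a successful invocation produces log lines along these lines (the content type depends on how the file was uploaded):

CONTENT TYPE: text/csv
Key: BikeBuyerData.csv
headers: ['ID', 'Marital Status', 'Gender', 'Income', 'Children', 'Education', 'Occupation', 'Home Owner', 'Cars', 'Commute Distance', 'Region', 'Age', 'Purchased Bike']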

Conclusion

As you can see, near real-time processing is a useful approach that you can implement via S3 and Lambda. Beyond low processing latency, it allows you to implement a stable, cost-effective, and observable method.
