How to Connect to Amazon S3 from PySpark on EMR
In this section, I'm going to explain how to retrieve data from S3 in your PySpark application. Let's start step by step.
Opening EMR cluster
First, you need to launch an EMR cluster on AWS. These steps are very simple; you are using the benefits of the cloud.
Open the EMR service from the AWS console and create your cluster. The only thing you must not skip is selecting or downloading an EMR key pair. Otherwise, you won't be able to connect to the EMR server.
When you click 'Create cluster', your cluster starts launching.
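The same cluster can also be created from code. Below is a minimal sketch using boto3's `run_job_flow`; the cluster name, release label, and instance types are assumptions for illustration, not values from the console steps above.

```python
# Hypothetical sketch: creating an EMR cluster with boto3 instead of the console.
# Cluster name, release label, and instance types are assumptions.

def build_cluster_config(key_pair_name):
    """Build the parameters for emr_client.run_job_flow()."""
    return {
        "Name": "my-pyspark-cluster",          # hypothetical name
        "ReleaseLabel": "emr-6.15.0",          # any recent EMR release
        "Applications": [{"Name": "Spark"}],   # preinstall Spark/PySpark
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "Ec2KeyName": key_pair_name,       # the key pair you selected/downloaded
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

config = build_cluster_config("my-emr-key")
# import boto3
# emr = boto3.client("emr", region_name="eu-central-1")
# response = emr.run_job_flow(**config)
```

The key pair name passed here is the same one you download in the console step, so SSH access keeps working either way.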
Granting S3 Access to EMR
When you launch an EMR cluster, it cannot access S3 by default because of access restrictions. You can attach a policy like the following to your EMR instance role:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::yourbucket",
                "arn:aws:s3:::yourbucket/*"
            ]
        }
    ]
}
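The policy above can also be generated and attached programmatically. This is a sketch: the role name `EMR_EC2_DefaultRole` and the policy name are assumptions, and the boto3 call is left commented since it needs real AWS credentials.

```python
# Sketch: build the same S3 policy for any bucket and (optionally) attach it
# to the cluster's EC2 instance role. Role and policy names are assumptions.
import json

def build_s3_policy(bucket):
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",    # the bucket itself (listing)
                    f"arn:aws:s3:::{bucket}/*",  # every object inside it
                ],
            }
        ],
    }

policy = build_s3_policy("yourbucket")
# import boto3
# iam = boto3.client("iam")
# iam.put_role_policy(
#     RoleName="EMR_EC2_DefaultRole",   # the cluster's instance role
#     PolicyName="emr-s3-access",       # hypothetical policy name
#     PolicyDocument=json.dumps(policy),
# )
```

Note that both ARNs are needed: the bare bucket ARN covers listing, and the `/*` ARN covers the objects themselves.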
Connecting to EMR cluster
When your cluster is ready, you can easily connect to it:
Click SSH and copy your SSH connection command
Open your terminal and paste the command
Congrats, you are inside the EMR cluster.
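The SSH command the console shows you has the general shape sketched below. The key path and master DNS name here are placeholders, not real values.

```python
# Sketch of the SSH command EMR gives you; key path and DNS are placeholders.

def build_ssh_command(key_path, master_dns):
    # EMR cluster nodes are reachable as the 'hadoop' user
    return ["ssh", "-i", key_path, f"hadoop@{master_dns}"]

cmd = build_ssh_command(
    "~/keys/my-emr-key.pem",
    "ec2-0-0-0-0.eu-central-1.compute.amazonaws.com",
)
# import subprocess
# subprocess.run(cmd)  # opens the interactive session
```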
Connecting to PySpark
EMR comes with the Spark library preinstalled. When you type 'pyspark', it launches the PySpark shell.
Changing S3 Endpoint
If your cluster and your S3 bucket are in different regions, you need to configure the S3 endpoint:
sc._jsc.hadoopConfiguration().set('fs.s3a.endpoint', 's3-eu-central-1.amazonaws.com')
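If you need more than the endpoint, the other common s3a settings follow the same pattern. This is a sketch: the endpoint matches the line above, and the credential keys are only needed when you are not relying on the instance role.

```python
# Sketch of s3a settings for a cross-region bucket. The credential keys are
# only needed without an instance role; never hardcode real secrets.

S3A_CONF = {
    "fs.s3a.endpoint": "s3-eu-central-1.amazonaws.com",
    # "fs.s3a.access.key": "...",   # only without an instance role
    # "fs.s3a.secret.key": "...",   # only without an instance role
}

# Inside the pyspark shell, `sc` already exists:
# hadoop_conf = sc._jsc.hadoopConfiguration()
# for key, value in S3A_CONF.items():
#     hadoop_conf.set(key, value)
```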
Retrieve data from S3 in PySpark
Retrieving data from S3 works the same way as from HDFS. You can easily connect to S3 in order to fetch your data.
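A minimal sketch of such a read is shown below; the bucket name and object key are placeholders. Once s3a is configured, the `s3a://` path behaves like any HDFS path.

```python
# Sketch: reading a CSV from S3 inside the pyspark shell.
# Bucket name and object key are placeholders.

def s3a_path(bucket, key):
    return f"s3a://{bucket}/{key}"

path = s3a_path("yourbucket", "data/events.csv")
# Inside pyspark, `spark` (a SparkSession) already exists:
# df = spark.read.csv(path, header=True, inferSchema=True)
# df.show(5)
```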
Final Words
Many data practitioners work with Amazon S3 and PySpark. When AWS comes into play, it is very simple to configure the infrastructure and get to work with your data.