How to Connect to Amazon S3 via EMR-based PySpark

Serkan SAKINMAZ
Feb 23, 2019

In this post, I’m going to explain how to retrieve data from S3 in your PySpark application. Let’s go through it step by step.

Opening an EMR cluster

First, you need to open an EMR cluster on AWS. The steps are very simple, which is one of the benefits of using the cloud.

Open the EMR service from the AWS console and create your cluster. The only thing you need to do is select (or create and download) an EC2 key pair for EMR. Otherwise, you won’t be able to connect to the EMR server.

When you click ‘Create cluster’, your cluster starts up.

Granting S3 access to EMR

When you open an EMR cluster, the cluster may not be able to access your S3 bucket because of access restrictions. You can add the following policy to the IAM role of your EMR cluster, replacing yourbucket with the name of your bucket:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::yourbucket",
        "arn:aws:s3:::yourbucket/*"
      ]
    }
  ]
}
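
The policy above allows every S3 action on the bucket. As a minimal sketch, the same document can be generated for any bucket name and, if you prefer, narrowed to a smaller list of actions. The function name `s3_bucket_policy` and the read-only variant are illustrative assumptions, not part of the original setup.

```python
import json


def s3_bucket_policy(bucket, actions="s3:*"):
    """Build an IAM policy document granting `actions` on a single bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": actions,
                "Resource": [
                    f"arn:aws:s3:::{bucket}",    # the bucket itself
                    f"arn:aws:s3:::{bucket}/*",  # every object inside the bucket
                ],
            }
        ],
    }


# Same policy as shown in the article.
full_access = s3_bucket_policy("yourbucket")

# Hypothetical narrower variant: only list the bucket and read objects.
read_only = s3_bucket_policy("yourbucket", ["s3:GetObject", "s3:ListBucket"])

print(json.dumps(read_only, indent=2))
```

If the application only reads from S3, the narrower variant is probably enough, but that depends on what your job does.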

Connecting to the EMR cluster

When your cluster is ready, you can easily connect to it.

Click SSH in the cluster summary and copy the ssh connection command.

Open your terminal and paste the command.

Congrats, you are now on the EMR cluster.
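
For reference, the command copied from the console usually has the shape `ssh -i <key file> hadoop@<master public DNS>`, since `hadoop` is the login user on EMR nodes. The sketch below only assembles that string; the key file name and the host name are hypothetical placeholders, so use the values shown in your own console.

```python
import shlex


def emr_ssh_command(key_path, master_dns, user="hadoop"):
    """Assemble the ssh command for logging in to the EMR master node."""
    return shlex.join(["ssh", "-i", key_path, f"{user}@{master_dns}"])


# Placeholder values: replace with your key file and your cluster's master public DNS.
command = emr_ssh_command(
    "mykey.pem",
    "ec2-203-0-113-10.eu-central-1.compute.amazonaws.com",
)
print(command)
```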

Connecting to PySpark

EMR ships with the Spark library. When you type ‘pyspark’ in the terminal, it starts the PySpark shell.

Changing S3 Endpoint

If your cluster and your S3 bucket are in different regions, you need to configure the S3 endpoint:

sc._jsc.hadoopConfiguration().set('fs.s3a.endpoint', 's3-eu-central-1.amazonaws.com')
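
If you switch between regions, a small helper can build the endpoint for you. This is a sketch under two assumptions: the helper names are made up for illustration, and the endpoint uses the dot-separated regional form `s3.<region>.amazonaws.com` used by standard AWS regions, which differs slightly from the dash form in the line above.

```python
def s3_endpoint(region):
    """Return the regional S3 endpoint host for a standard AWS region."""
    return f"s3.{region}.amazonaws.com"


def set_s3a_endpoint(sc, region):
    """Point the s3a connector of a SparkContext at the given region's endpoint."""
    endpoint = s3_endpoint(region)
    # Same Hadoop setting as in the article, applied through the JVM configuration.
    sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", endpoint)
    return endpoint


# In the pyspark shell, `sc` already exists, so you could run:
# set_s3a_endpoint(sc, "eu-central-1")
```

Note that `fs.s3a.endpoint` only affects paths that start with `s3a://`.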

Retrieving data from S3 in PySpark

Retrieving data from S3 works the same way as reading from HDFS: you pass an S3 path instead of an HDFS path to the usual Spark read functions.
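
As a minimal sketch, reading a CSV file could look like the code below. The bucket name, the object key, the CSV format, and the helper names are all assumptions for illustration. The `s3a://` scheme is used so that it matches the `fs.s3a.endpoint` setting; on EMR, `s3://` paths are also commonly used.

```python
def s3_uri(bucket, key, scheme="s3a"):
    """Build an S3 path such as s3a://yourbucket/data/sample.csv."""
    return f"{scheme}://{bucket}/{key.lstrip('/')}"


def read_csv_from_s3(spark, bucket, key):
    """Read a CSV object from S3 into a DataFrame, like any HDFS path."""
    return spark.read.csv(s3_uri(bucket, key), header=True, inferSchema=True)


# In the pyspark shell, `spark` already exists, so you could run:
# df = read_csv_from_s3(spark, "yourbucket", "data/sample.csv")
# df.show(5)
```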

Final Words

Many data practitioners work with Amazon S3 and PySpark. With AWS, it is very simple to set up the infrastructure and start working with your data.
