How to use Athena to analyse data in Amazon S3 using standard SQL

Do you have any difficulty analysing the CSV, Json, Avro, Parquet, XML file in your company or research project? In this blog, I am going to explain how to analyse this type of files using standard SQL. Another point is to note that , there is no size limit! Sounds good?

The following image shows just a simple example of how you query CSV file using standard SQL. Let’s deep dive into the architecture and steps that needs to be done

Amazon Athena is a serverless service that allow to execute SQL query on AWS S3 buckets. You can easily query data via Athena SQL editor. The point is to note that, there is no need to manage the infrastructure, hence you can focus on data analysis. Athena also uses AWS Glue as a data catalogue, once you define the data structure within Glue, you are ready to use the Athena.

In the following steps, I will explain how to query CSV file with respective services, will use CSV file, you can use JSON, parquet as well

Step 1 — Upload CSV data to S3

S3 is an object storage service in AWS. The following file is going to be uploaded to S3 and the fields are separated via comma.

Serkan, 1984, Computer Engineer
Serkan Batu, 2013, Student
Maria, 2017, Student
Angel, 1988, Data Scientist

Upload csv to the following bucket : s3://serkan-athena-test/user/

Step 2— Catalog data via Glue

Glue is a serverless data integration services in AWS, also used as a data catalog for multiple services. Let’s define a data structure for user data via Glue

Open Glue and Select “Add Tables” → “Add table manually” section

Give the name of table

Specify the S3 path

Define the file format and delimiter

Define the schema

Review and finish the data mapping.

Step 3 — Query via Athena

Athena is a query service that you easily execute SQL queries based on S3. In the previous step, we uploaded data to S3 and define the data mapping via Glue. Now, we are ready to query data from Athena

Open Athena service in AWS and execute the following query

Query : select * from user

You can also play with SQL query

Query : select * from user where name like ‘Serkan%’

Athena also allow you to download your query results which is helpful for further data analysis.

Cost

In general, using Athena, S3 and Glue are not expensive services. Putting some cost examples at the below, you can have a better idea to check from AWS official website. The cost is also differentiated based on the region that you are working

+---------------+--------------------------------------------------+
| Service | AWS Standard Cost |
+---------------+--------------------------------------------------+
| S3 Standard | $0.023 per GB (First 50 TB / Month) |
| Glue metadata | Free for the first million objects stored |
| Data Transfer | Free-Data Transfer IN To Amazon S3 From Internet |
| Athena | $5.00 per TB of data scanned |
+---------------+--------------------------------------------------+

Conclusion

In this section, we see how to query files using SQL. I would recommend to using this architecture in case have the following use-cases

  • Regularly analyse the files such as CSV, Json, Avro, Parquet, XML
  • Need to analyse massive amount of data
  • Know SQL and doesn’t have a development background
  • Skip to implement the ETL job
  • Looking for serverless and cost-effective solution

Data & Cloud Architect and Trainer .