How to use Athena to analyse data in Amazon S3 using standard SQL
Do you have any difficulty analysing the CSV, Json, Avro, Parquet, XML file in your company or research project? In this blog, I am going to explain how to analyse this type of files using standard SQL. Another point is to note that , there is no size limit! Sounds good?
The following image shows just a simple example of how you query CSV file using standard SQL. Let’s deep dive into the architecture and steps that needs to be done
Amazon Athena is a serverless service that allow to execute SQL query on AWS S3 buckets. You can easily query data via Athena SQL editor. The point is to note that, there is no need to manage the infrastructure, hence you can focus on data analysis. Athena also uses AWS Glue as a data catalogue, once you define the data structure within Glue, you are ready to use the Athena.
In the following steps, I will explain how to query CSV file with respective services, will use CSV file, you can use JSON, parquet as well
Step 1 — Upload CSV data to S3
S3 is an object storage service in AWS. The following file is going to be uploaded to S3 and the fields are separated via comma.
Serkan, 1984, Computer Engineer
Serkan Batu, 2013, Student
Maria, 2017, Student
Angel, 1988, Data Scientist
Upload csv to the following bucket : s3://serkan-athena-test/user/
Step 2— Catalog data via Glue
Glue is a serverless data integration services in AWS, also used as a data catalog for multiple services. Let’s define a data structure for user data via Glue
Open Glue and Select “Add Tables” → “Add table manually” section
Give the name of table
Specify the S3 path
Define the file format and delimiter
Define the schema
Review and finish the data mapping.
Step 3 — Query via Athena
Athena is a query service that you easily execute SQL queries based on S3. In the previous step, we uploaded data to S3 and define the data mapping via Glue. Now, we are ready to query data from Athena
Open Athena service in AWS and execute the following query
Query : select * from user
You can also play with SQL query
Query : select * from user where name like ‘Serkan%’
Athena also allow you to download your query results which is helpful for further data analysis.
Cost
In general, using Athena, S3 and Glue are not expensive services. Putting some cost examples at the below, you can have a better idea to check from AWS official website. The cost is also differentiated based on the region that you are working
+---------------+--------------------------------------------------+
| Service | AWS Standard Cost |
+---------------+--------------------------------------------------+
| S3 Standard | $0.023 per GB (First 50 TB / Month) |
| Glue metadata | Free for the first million objects stored |
| Data Transfer | Free-Data Transfer IN To Amazon S3 From Internet |
| Athena | $5.00 per TB of data scanned |
+---------------+--------------------------------------------------+
Conclusion
In this section, we see how to query files using SQL. I would recommend to using this architecture in case have the following use-cases
- Regularly analyse the files such as CSV, Json, Avro, Parquet, XML
- Need to analyse massive amount of data
- Know SQL and doesn’t have a development background
- Skip to implement the ETL job
- Looking for serverless and cost-effective solution