Preparation for AWS Certified Data Analytics Specialty

Serkan SAKINMAZ
8 min read · Jun 1, 2022

AWS Certified Data Analytics — Specialty is one of the most popular exams when it comes to working with data in AWS. In this article, I will touch on some important points that need to be reviewed.

AWS Athena

Amazon Athena is a serverless service that allows you to execute SQL queries on data stored in S3 buckets. You can easily query data via the Athena SQL editor.

Key Points for exam

  • It is serverless, so you don't need to manage any infrastructure
  • You pay only for the data scanned by each query
  • It can analyse different types of data stored in S3 (Parquet, CSV, JSON, text files)
  • Athena uses the AWS Glue Data Catalog as its metadata store, so SQL queries are mapped to the data in S3 buckets
  • Built-in integration with QuickSight for visualisation: QuickSight can use Athena as a data source
  • It is a good fit when you need to execute ad-hoc queries (a small query sketch follows this list)
  • It can be an option when you need a cost-effective SQL query tool over massive amounts of data
  • Athena is a query service, not a storage service
  • To improve query performance, you might use the Parquet format together with partitioning (e.g. by date or a unique id)
  • Athena can query data stored in S3 buckets in different regions. What you need to do is run a Glue crawler for each region. Some answer options recommend replicating data across regions, but that is not correct
  • Using S3 and Athena in the same region improves query performance
  • Athena can also be accessed via an ODBC driver
  • To put a limit on query usage, you can use Athena workgroups
  • Workgroups also allow you to restrict access to sensitive data per user. To achieve that, create tags for each workgroup and grant the proper permissions in the policy
  • Using compression in S3 can improve Athena query performance
  • Using correct partitioning in S3 will improve Athena query performance. Apart from that, querying columnar formats and optimising the queries can significantly improve performance
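
To make some of these points concrete, here is a minimal sketch of running an ad-hoc query through the Athena API with boto3. The database, table, workgroup and bucket names are hypothetical placeholders, so adapt them to your own setup.

# A minimal sketch of an ad-hoc Athena query via boto3.
# Database, table, workgroup and S3 paths below are hypothetical.
import boto3

athena = boto3.client("athena", region_name="eu-west-1")

response = athena.start_query_execution(
    QueryString=(
        "SELECT customer_id, SUM(amount) AS total "
        "FROM sales_parquet "
        "WHERE dt = '2022-06-01' "   # prunes the scan, assuming dt is a partition column
        "GROUP BY customer_id"
    ),
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/queries/"},
    WorkGroup="analytics-team",       # workgroups can enforce per-query scan limits
)
print(response["QueryExecutionId"])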

Redshift

Redshift is a petabyte-scale data warehouse service in AWS. You can store data in Redshift or use Redshift Spectrum to query data directly from S3.

Key Points for exam

  • In terms of architecture, Redshift is composed of a leader node and compute (worker) nodes. The leader node is responsible for schema, query planning and metadata management. The compute nodes hold the data and execute the queries.
  • It uses columnar storage, which makes OLAP use cases efficient
  • The UNLOAD command is used to export data to S3
  • The Redshift interface is compatible with PostgreSQL (including data types)
  • Redshift uses a compound sort key by default, which performs well when you sort, order or partition the data. If your queries use many different columns in WHERE clauses, an interleaved sort key provides some performance benefits.
  • Redshift can distribute data across slices evenly (round-robin) or by a distribution key, where rows with identical keys are stored on the same slice. Another option is ALL distribution, which copies the table to every node; this makes sense for small utility tables such as a country-code lookup.
  • An example UNLOAD command is shown below:

unload ('select * from table') to 's3://mybucket/target'

  • Use the REGEXP_SUBSTR function to search for a regular expression pattern
  • The COPY command is used to load data from S3
  • COPY with the NOLOAD option checks whether the files in S3 can be loaded without errors, without actually loading the data into Redshift (a sketch using the Redshift Data API follows this list)
  • The native cross-instance restore feature allows you to restore a snapshot to a cluster with a different node size
  • RA3 instances give you better compute and storage capacity than the older Dense Storage and Dense Compute node types
  • Each compute node is divided into slices, each with its own share of memory and disk to execute queries
  • In some cases, long-running queries block short ad-hoc queries. Redshift has a workload management (WLM) feature to prioritise queries.
  • Redshift Spectrum uses S3 as the storage layer, which reduces cost. It only supports SELECT and INSERT statements.
  • For Redshift Spectrum, use the EXTERNAL keyword to create the schema and tables
  • In some cases, you need to reduce Redshift cost with a different solution. If you are able to query the data from S3 via SQL, Athena could be an option
  • You can enable audit logging to investigate logs such as connections, disconnections and authentication attempts
  • Redshift supports HSM (hardware security module) to manage encryption keys and the connection between Amazon Redshift and your HSM
  • Auto resize adds/removes nodes automatically, whereas with elastic resize you add/remove nodes manually
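
As a concrete example, here is a minimal sketch of validating a load with COPY ... NOLOAD through the Redshift Data API in boto3. The cluster, database, IAM role and S3 paths are hypothetical placeholders; you could just as well run the same SQL from any SQL client connected to Redshift.

# A minimal sketch of checking load errors with COPY ... NOLOAD.
# Cluster, database, IAM role and S3 paths are hypothetical.
import boto3

redshift_data = boto3.client("redshift-data", region_name="eu-west-1")

sql = (
    "COPY sales FROM 's3://my-data-bucket/sales/' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole' "
    "FORMAT AS CSV "
    "NOLOAD;"  # validates the files without loading any rows
)

response = redshift_data.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=sql,
)
print(response["Id"])  # use describe_statement(Id=...) to check the outcome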

AWS Glue ETL

AWS Glue is an ETL service that allows you to process and integrate data in an easy way.

Key Points for exam

  • A Glue crawler allows you to populate the Data Catalog from different sources on a scheduled basis
  • If you have lots of small files in S3, Glue can group them with the following connection option (a fuller sketch follows this list)

'groupFiles': 'inPartition'

  • In case the Glue ETL job creates duplicate records in the target database, the optimal approach is to create a temporary staging table and keep only the unique records
  • Glue job bookmarks track the processed data and persist state information, so the job can skip reprocessing old data
  • Glue is able to crawl data from different regions and accounts
  • You can implement PySpark or Scala applications via Glue
  • A Glue crawler can create a catalog to be used within Lake Formation
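
Here is a minimal PySpark sketch of the grouping option above inside a Glue job. The S3 path and group size are hypothetical placeholders.

# A minimal sketch of grouping many small S3 files in a Glue job.
# The S3 path and group size are hypothetical.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-data-bucket/small-json-files/"],
        "recurse": True,
        "groupFiles": "inPartition",   # pack small files into larger read tasks
        "groupSize": "1048576",        # target roughly 1 MB per group
    },
    format="json",
)
print(dyf.count())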

AWS Kinesis

AWS Kinesis is a serverless streaming service that can be used for real-time data integration and processing.

Key Points for exam

  • When using the KPL (Kinesis Producer Library), you need to consider some configuration settings to improve efficiency; one good practice is to increase RecordMaxBufferedTime for higher packing
  • The Amazon Kinesis Agent is generally used to collect and send data to Kinesis Data Firehose
  • The maximum record size is 1 MB, and a single PutRecords request can be up to 5 MB in total
  • The enhanced fan-out feature allows consumers to receive records with dedicated throughput
  • Kinesis Data Streams retains data from 24 hours up to 365 days. If you need a longer retention period, MSK (managed Kafka) would be a good approach.
  • Kinesis Data Streams consist of shards; the default limit is 500 shards but can be raised on request. Each shard can ingest up to 1,000 records per second, with a payload of up to 1 MB per second.
  • For Kinesis Data Streams, the read limit is 2 MB per second per shard and the write limit is 1 MB per second per shard
  • Kinesis stagger windows allow you to group data based on a time value for records that arrive at inconsistent times
  • If you have sensitive data in a Kinesis stream, you can use a Lambda function to remove the sensitive information
  • When you need to stream CloudWatch Logs to Elasticsearch, Kinesis Data Firehose is used as the connector to deliver the logs
  • Kinesis Data Streams enables analysing data in real time
  • When you face performance issues due to throttled read/write requests, one option is to use the UpdateShardCount API
  • To increase read performance via Lambda, you can use the Parallelization Factor to process one shard of a Kinesis stream with more than one Lambda invocation at the same time
  • Kinesis Data Streams allows you to send data via API. PutRecords sends multiple records per HTTP request, whereas PutRecord sends only one record per request (a sketch follows this list)
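
As a concrete example, here is a minimal boto3 sketch of sending a small batch with PutRecords. The stream name and payloads are hypothetical placeholders.

# A minimal sketch of batching records with PutRecords via boto3.
# The stream name and payloads are hypothetical.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")

events = [{"device_id": i, "temperature": 20 + i} for i in range(3)]

response = kinesis.put_records(
    StreamName="sensor-stream",
    Records=[
        {
            "Data": json.dumps(event).encode("utf-8"),
            "PartitionKey": str(event["device_id"]),  # determines the target shard
        }
        for event in events
    ],
)
print(response["FailedRecordCount"])  # retry any failed records yourself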

AWS QuickSight

AWS QuickSight is a serverless service that allows you to create interactive dashboards in an easy way.

Key Points for exam

  • To keep a dashboard up to date with the latest data, you can call the CreateIngestion API to refresh the dataset (a sketch follows this list)
  • In order to restrict access to a dataset, you can use row-level security to restrict what each user sees on the dashboard
  • QuickSight has built-in machine learning functionality to get deeper insights from data
  • To let QuickSight reach a data source in a different region, you need to adjust the security group so that the QuickSight IP ranges of that region are allowed
  • QuickSight also has an ML feature to identify anomalous records
  • It is important to know the visual types for some specific questions. For example, if you need to show something on a map, it would be better to use a geospatial chart (see the "Visual types" documentation)
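
Here is a minimal boto3 sketch of triggering a dataset refresh with the CreateIngestion API. The account id and dataset id are hypothetical placeholders.

# A minimal sketch of refreshing a QuickSight dataset via CreateIngestion.
# The account id and dataset id are hypothetical.
import uuid
import boto3

quicksight = boto3.client("quicksight", region_name="eu-west-1")

response = quicksight.create_ingestion(
    AwsAccountId="123456789012",
    DataSetId="my-sales-dataset-id",
    IngestionId=str(uuid.uuid4()),   # any unique id for this refresh run
)
print(response["IngestionStatus"])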

AWS EMR

AWS EMR allows you to process large amounts of data using popular big data frameworks such as Hadoop, Spark and Hive.

Key Points for exam

  • Since EMR supports native EBS encryption, you can define a security configuration to encrypt the EMR root volume
  • S3DistCp is used to copy data between S3 and EMR (HDFS)
  • You can use SSE-S3 or SSE-KMS for encryption at rest of EMRFS data
  • When you have a limited budget, it is good to use Spot Instances for task nodes
  • Instance fleets allow you to select different provisioning options for the Amazon EC2 instances
  • To enable root volume encryption, you can use a custom AMI with an encrypted root device volume
  • Configuring an external metastore helps you keep metadata even after the EMR cluster is terminated
  • To restrict public access, you can enable "block public access" in the settings
  • In some cases, keeping an EMR cluster running can be expensive. You can use scheduled events to trigger a Lambda function that launches EMR on a regular basis.
  • To transfer data from S3 to EMR, you can use the S3DistCp tool (a sketch follows this list)
  • CloudWatch metrics are used to scale the EMR cluster out or in. What you need to do is set up automatic scaling on the core and task instance groups.
  • To encrypt the EMR cluster's root volumes, you need to enable at-rest encryption and recreate the cluster
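
As a concrete example, here is a minimal boto3 sketch of submitting an S3DistCp step to a running EMR cluster. The cluster id and the S3/HDFS paths are hypothetical placeholders.

# A minimal sketch of adding an S3DistCp step to an existing EMR cluster.
# The cluster id and paths are hypothetical.
import boto3

emr = boto3.client("emr", region_name="eu-west-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[
        {
            "Name": "copy-raw-data-to-hdfs",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",  # runs the s3-dist-cp command on the cluster
                "Args": [
                    "s3-dist-cp",
                    "--src", "s3://my-data-bucket/raw/",
                    "--dest", "hdfs:///data/raw/",
                ],
            },
        }
    ],
)
print(response["StepIds"])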

AWS Elasticsearch

Elasticsearch is a data store and search engine which is provided as a service in AWS.

Key Points for exam

  • When a question mentions search functionality with ranking, Elasticsearch is probably the best option
  • Elasticsearch has an integration with Kinesis Data Firehose, so you can load and monitor real-time data

AWS S3

S3 is an object storage service in AWS. You can store any kind of object, such as CSV, JSON or image files.

Key Points for exam

  • To use S3 Transfer Acceleration, you need to use the bucketname.s3-accelerate.amazonaws.com endpoint (a sketch follows this list)
  • To access S3 from QuickSight, you can configure the permissions from the QuickSight console
  • To transfer data between S3 and EMR, you can use S3DistCp
  • When you put data into S3, S3 is able to trigger a Lambda function
  • Kinesis Data Firehose is able to write data to S3
  • S3 can also be used to serve static websites
  • Compressing data in S3 reduces data transfer I/O
  • You can move Redshift query results to S3 via the UNLOAD command
  • In order to increase read performance, you can add naming prefixes to object keys
  • To improve query performance on S3 data, you can use columnar formats such as ORC
  • When the data is in Glacier, you cannot query a subset of it. In that case, the data should be stored in S3.
  • When you have TBs of data, it is better to split large files into smaller chunks to improve query and transfer performance
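
Here is a minimal boto3 sketch of uploading through the Transfer Acceleration endpoint. The bucket and file names are hypothetical, and the bucket must already have Transfer Acceleration enabled.

# A minimal sketch of using the S3 Transfer Acceleration endpoint with boto3.
# Bucket and file names are hypothetical.
import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    config=Config(s3={"use_accelerate_endpoint": True}),  # routes to bucketname.s3-accelerate.amazonaws.com
)

s3.upload_file(
    Filename="events-2022-06-01.csv.gz",  # compressed to reduce transfer I/O
    Bucket="my-data-bucket",
    Key="raw/events/2022-06-01.csv.gz",
)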

Good luck in the exam!
