High-level overview of Amazon Athena
Athena is a serverless query service to help you analyze the data stored in Amazon S3 buckets.
And to analyze this data, you’re going to use the standard SQL language to query the files.
Behind the scenes, Athena is built on the Presto engine, which uses the SQL language.
So the idea is that you are going to load data into your S3 bucket, and then you would use the Athena service to query and analyze this data in Amazon S3 without moving it.
So Athena is serverless, and it directly analyzes the data living in your S3 bucket.
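To make the idea concrete, here is a minimal Python sketch of what submitting a query could look like. It only builds the keyword arguments that boto3's Athena `start_query_execution` call accepts (`QueryString`, `QueryExecutionContext`, `ResultConfiguration`). The table name, database name, and results bucket are hypothetical placeholders, not something from this lecture.

```python
def build_athena_query_request(sql, database, output_location):
    """Assemble keyword arguments for boto3's Athena start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical table, database, and bucket names, used for illustration only.
request = build_athena_query_request(
    sql="SELECT status, COUNT(*) AS hits FROM access_logs GROUP BY status",
    database="example_db",
    output_location="s3://example-athena-results/",
)
print(request["QueryString"])

# With boto3 installed and AWS credentials configured, the request
# could then be sent like this (not executed here):
#   import boto3
#   boto3.client("athena").start_query_execution(**request)
```

The data itself stays in S3; the request only carries the SQL text, the database to resolve table names in, and where Athena should write the results.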
So it supports different formats, such as CSV, JSON, ORC, Avro, and Parquet, among others.
And the pricing is very simple: you just pay a fixed amount per terabyte of data scanned.
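As a rough sketch of that pricing model, the arithmetic is just the data scanned, in terabytes, times the price per terabyte. The price used below (5 USD per TB) and the 200 GiB query size are assumptions for illustration; check the current price for your Region.

```python
BYTES_PER_TB = 1024 ** 4  # assumption: 1 TB is treated as 2**40 bytes


def estimate_scan_cost(bytes_scanned, price_per_tb):
    """Estimate the cost of one query from the amount of data it scans."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


example_price = 5.0  # assumed USD per TB, not taken from the lecture
cost = estimate_scan_cost(200 * 1024 ** 3, example_price)  # a 200 GiB scan
print(f"Estimated cost: ${cost:.4f}")
```

The takeaway is that the only lever you control is the number of bytes scanned, which is why file format and compression matter so much.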
Again, you don’t need to provision any database, because the whole service is serverless.
So Athena is commonly used with another tool, called Amazon QuickSight to create reports and dashboards.
Amazon QuickSight connects to Athena, which connects to your S3 buckets.
Now, the use cases for Amazon Athena are to do ad hoc queries, Business Intelligence, Analytics, Reporting, and to analyze and query any kind of logs that originate from your AWS services.
So it could be your VPC flow logs, your load balancer logs, your CloudTrail trails, and so on.
Because you pay per terabyte of data scanned, you want to store your data in a format that lets Athena scan less of it.
And for this, you can use a columnar data format for cost savings, because you only scan the columns you need.
So the recommended formats for Amazon Athena are Apache Parquet and ORC, and they're going to give you a huge performance improvement.
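To see why columnar formats scan less data, here is a deliberately simplified model in Python. It assumes a row-based file has to be read in full, while a columnar file lets the engine read only the columns the query selects. The column names and sizes are made up, and the model ignores compression and file metadata.

```python
def bytes_scanned_row_format(column_sizes):
    """A row-based file (for example CSV) is read in full, whatever columns are selected."""
    return sum(column_sizes.values())


def bytes_scanned_columnar(column_sizes, selected_columns):
    """A columnar file lets the engine read only the selected columns."""
    return sum(column_sizes[name] for name in selected_columns)


# Made-up per-column sizes in bytes for a hypothetical log table.
column_sizes = {
    "timestamp": 8_000_000,
    "user_id": 4_000_000,
    "url": 60_000_000,
    "status": 2_000_000,
}
query_columns = ["timestamp", "status"]

row_bytes = bytes_scanned_row_format(column_sizes)
col_bytes = bytes_scanned_columnar(column_sizes, query_columns)
print(f"row-based: {row_bytes} bytes, columnar: {col_bytes} bytes")
```

In this toy example the query touches two small columns, so the columnar layout scans a fraction of what the row-based layout does, and you pay for a fraction of the data.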
And to get your files into the Apache Parquet or ORC format, you can use a service such as Glue.
Glue can be very helpful to convert your data, as an ETL job, from, for example, CSV to Parquet.
Now, also because we want to scan less data, we can compress the data so that less of it has to be retrieved.
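As a small illustration of how much compression can shrink a file, this Python sketch gzips some made-up CSV log lines and compares the sizes. gzip is used here only as one example of a compression format, and the sample data is invented; real ratios depend on your data.

```python
import gzip

# Build a small, repetitive CSV sample (made-up log lines).
rows = ["timestamp,status,url"]
for i in range(1000):
    rows.append(f"2024-01-01T00:00:{i % 60:02d},200,/index.html")
raw = "\n".join(rows).encode("utf-8")

# Compress the whole file in memory and compare sizes.
compressed = gzip.compress(raw)
print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes")
```

Log data tends to be very repetitive, so it usually compresses well, and fewer bytes stored generally means fewer bytes scanned.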
So you know that Athena can query data in S3, but actually you can query data almost anywhere: in relational or non-relational databases, objects, and custom data sources, whether they're on AWS or on-premises.
How? Well, you use what’s called a Data Source Connector.
It’s a Lambda function, and that Lambda function is going to run the federated queries against other services.
So that could be, for example, CloudWatch Logs, DynamoDB, RDS, and so on.
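Here is a hedged sketch of what a federated query might look like once a connector has been registered as a data source. It assumes tables are addressed as catalog, then database, then table, with `AwsDataCatalog` as the default catalog for S3-backed tables. The connector catalog `example_dynamodb_catalog` and all database, table, and column names are hypothetical.

```python
def qualified_table(catalog, database, table):
    """Build a fully qualified, double-quoted table reference."""
    return f'"{catalog}"."{database}"."{table}"'


# S3-backed table in the default catalog (hypothetical database and table).
orders = qualified_table("AwsDataCatalog", "example_db", "orders")
# Table exposed through a hypothetical data source connector.
customers = qualified_table("example_dynamodb_catalog", "default", "customers")

federated_sql = (
    "SELECT o.order_id, c.name "
    f"FROM {orders} AS o "
    f"JOIN {customers} AS c ON o.customer_id = c.customer_id"
)
print(federated_sql)
```

The point of the sketch is that one SQL statement can join data sitting in S3 with data living in another service, with the connector's Lambda function doing the work on the other side.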
So it’s very powerful.