Using Athena to Query CSV Data
This tutorial demonstrates how to analyze sales data using Amazon Athena to query CSV files stored in S3. We'll work with a realistic dataset from a fictional retail company, showing how to load, structure, and analyze sales data to derive business insights.
Setup
1. Create an S3 Bucket
Bucket names must be globally unique across every AWS account and every AWS region, so everyone following this tutorial needs their own name. This tutorial uses the example name goldnode-tutorial-411980; substitute your own name wherever it appears.
Must be 3-63 characters, lowercase letters, numbers, dots, and hyphens only. Cannot start or end with dots or hyphens.
- Sign in to the AWS Management Console
- Navigate to S3
- (Optional) Change to a preferred region
- Click "Create bucket"
- Enter the bucket name: goldnode-tutorial-411980
We can leave the rest of the settings as default, but while we're here, let's take a look at what they are:
- Object ownership: Leave as default with ACLs Disabled. IAM roles and policies are the preferred way to manage permissions.
- Block all public access: Keep enabled as we don't need public access for this tutorial.
- Bucket Versioning: Tracks different versions of objects with the same name. We'll leave it disabled for this tutorial.
- Tags: Used for resource organization and cost tracking. Not needed for this tutorial.
- Default encryption: Leave as SSE-S3 for automatic AES-256 encryption of all objects.
- Object Lock: Prevents object deletion/modification. Not needed for this tutorial.
- Click "Create bucket"
2. Create the folder structure
We will create two folders in our bucket, one for the raw CSV data and one for the query results.
- Click on your newly created bucket (goldnode-tutorial-411980)
- Click "Create folder" in the "Objects" tab
- Enter the folder name: raw-data
- Click "Create folder" to create it
Repeat these steps to create the query-results folder.
3. Upload Sample Data
Right-click this link and save it to download our sample CSV file: store-transactions.csv
This file contains our sample data: a CSV file with a header row and 50 rows of data. The first rows are:
transaction_id,date,store_id,product_id,product_name,category,quantity,unit_price,total_amount
1001,2024-01-01,STORE001,P123,Organic Coffee,Beverages,2,12.99,25.98
1002,2024-01-01,STORE001,P456,Whole Grain Bread,Bakery,3,4.99,14.97
1003,2024-01-01,STORE002,P789,Fresh Milk,Dairy,4,3.99,15.96
...
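Note that total_amount is a derived column: it should always equal quantity × unit_price. A minimal Python sketch of that sanity check, run over the three sample rows shown above (the remaining 47 rows are not reproduced here):

```python
import csv
import io
from decimal import Decimal

# First rows of store-transactions.csv, as shown above
sample = """transaction_id,date,store_id,product_id,product_name,category,quantity,unit_price,total_amount
1001,2024-01-01,STORE001,P123,Organic Coffee,Beverages,2,12.99,25.98
1002,2024-01-01,STORE001,P456,Whole Grain Bread,Bakery,3,4.99,14.97
1003,2024-01-01,STORE002,P789,Fresh Milk,Dairy,4,3.99,15.96
"""

# Check that total_amount == quantity * unit_price on every row
for row in csv.DictReader(io.StringIO(sample)):
    expected = int(row["quantity"]) * Decimal(row["unit_price"])
    assert expected == Decimal(row["total_amount"]), row["transaction_id"]
print("all rows consistent")
```

Decimal (rather than float) is used here for the same reason the table definition below uses DECIMAL(10,2): exact currency arithmetic.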
- Click on the raw-data folder in your bucket
- Click "Upload"
- Drag and drop the store-transactions.csv file into the upload area (or click "Add files" and select the file)
- Click "Upload"
Creating the Table in Athena
Understanding Athena Tables
Before we create our table, it's important to understand that Athena tables are not traditional database tables. Instead, they are metadata definitions that tell Athena how to read your data in S3. When you create a table in Athena:
- No data is actually moved or copied
- You're creating a schema that maps to your S3 data
- The table is "external" because the data lives outside of Athena
- Queries read directly from your S3 files
1. Configure Athena Query Results Location
Before running any queries, you need to tell Athena where to store query results:
- Open the Athena console
- If this is your first time using Athena, you'll see a welcome page; just click "Explore the Athena query editor" to get started
- From the Query editor window, click on the "Settings" tab
- Click "manage"
- Enter your S3 bucket path for query results:
s3://[BUCKET_NAME]/query-results/
- Click "Save"
2. Create a Database
A database in Athena is a logical container for your tables.
- Click the "Editor" tab
- Enter and run this query:
CREATE DATABASE goldnode_tutorial_athena
COMMENT 'Temporary database for goldnode tutorial';
- Click "Run"
You should now find your new database listed in the "Database" dropdown in the left sidebar
3. Create the Table
Now we'll create a table that maps to our CSV data.
- Clear the Query editor by clicking "Clear", then paste in the following:
CREATE EXTERNAL TABLE goldnode_tutorial_athena.sales (
transaction_id STRING,
`date` DATE,
store_id STRING,
product_id STRING,
product_name STRING,
category STRING,
quantity INT,
unit_price DECIMAL(10,2),
total_amount DECIMAL(10,2)
)
-- Tell Athena this is a CSV file
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 's3://[BUCKET_NAME]/raw-data/'
TBLPROPERTIES ('skip.header.line.count'='1');
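One caveat worth knowing: ROW FORMAT DELIMITED with FIELDS TERMINATED BY ',' splits on every comma and does not honor quoted fields (Athena's OpenCSVSerDe is the option that handles quoting). Our sample data contains no embedded commas, so the simple format works here. A quick Python illustration of the difference, using a hypothetical product name containing a comma:

```python
import csv
import io

# A row whose product_name contains a comma, protected by quotes
line = '1051,2024-01-31,STORE001,P999,"Coffee, Decaf",Beverages,1,9.99,9.99\n'

# Naive split on every comma -- roughly what FIELDS TERMINATED BY ',' does
naive = line.strip().split(",")

# A quote-aware CSV parser keeps the quoted field intact
proper = next(csv.reader(io.StringIO(line)))

print(len(naive))   # the quoted name is split into two fields
print(len(proper))  # "Coffee, Decaf" stays one field
```

If your real data quotes fields this way, switch the table definition to OpenCSVSerDe rather than ROW FORMAT DELIMITED.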
Key points about this table definition:
- EXTERNAL TABLE: indicates the data is stored outside Athena (in S3)
- Column types must match our CSV data format
- LOCATION: must point to the S3 folder containing the CSV
- TBLPROPERTIES: tells Athena to skip the CSV header row
- Click "Run"
4. Verify the Table
After creating the table, let's verify it's working:
Clear the Query editor by clicking "Clear", or open a new query tab (the + on the right-hand side)
Run a simple SELECT:
SELECT * FROM goldnode_tutorial_athena.sales LIMIT 5;
And you should see something like this under the results pane:
| # | transaction_id | date | store_id | product_id | product_name | category | quantity | unit_price | total_amount |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1001 | 2024-01-01 | STORE001 | P123 | Organic Coffee | Beverages | 2 | 12.99 | 25.98 |
| 2 | 1002 | 2024-01-01 | STORE001 | P456 | Whole Grain Bread | Bakery | 3 | 4.99 | 14.97 |
| 3 | 1003 | 2024-01-01 | STORE002 | P789 | Fresh Milk | Dairy | 4 | 3.99 | 15.96 |
| 4 | 1004 | 2024-01-01 | STORE003 | P234 | Avocado | Produce | 5 | 1.99 | 9.95 |
| 5 | 1005 | 2024-01-01 | STORE003 | P567 | Free Range Eggs | Dairy | 2 | 5.99 | 11.98 |
- And let's check that all 50 records are being read:
SELECT COUNT(*) as total_records FROM goldnode_tutorial_athena.sales;
Sample Queries
Now that our table is set up, let's try a few more queries to analyze our data.
1. Daily Sales Total
This query shows sales trends over time:
SELECT
date,
COUNT(*) as transactions,
SUM(total_amount) as daily_revenue
FROM goldnode_tutorial_athena.sales
GROUP BY date
ORDER BY date;
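To see what this query computes, here is a local Python equivalent of the GROUP BY, run over the first three sample rows (totals for the full 50-row file will of course differ):

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal

# First rows of store-transactions.csv, as shown earlier
sample = """transaction_id,date,store_id,product_id,product_name,category,quantity,unit_price,total_amount
1001,2024-01-01,STORE001,P123,Organic Coffee,Beverages,2,12.99,25.98
1002,2024-01-01,STORE001,P456,Whole Grain Bread,Bakery,3,4.99,14.97
1003,2024-01-01,STORE002,P789,Fresh Milk,Dairy,4,3.99,15.96
"""

# GROUP BY date: count transactions and sum revenue per day
transactions = defaultdict(int)
revenue = defaultdict(Decimal)
for row in csv.DictReader(io.StringIO(sample)):
    transactions[row["date"]] += 1
    revenue[row["date"]] += Decimal(row["total_amount"])

for day in sorted(revenue):  # ORDER BY date
    print(day, transactions[day], revenue[day])
```

The difference, of course, is that Athena runs this aggregation for you directly over S3, with no code to maintain.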
- COUNT(*): shows the number of transactions per day
- SUM(total_amount): calculates total revenue per day
- Results help identify busy days and revenue patterns
2. Top Selling Categories
Analyze which product categories perform best:
SELECT
category,
SUM(quantity) as total_units_sold,
ROUND(SUM(total_amount), 2) as total_revenue
FROM goldnode_tutorial_athena.sales
GROUP BY category
ORDER BY total_revenue DESC
LIMIT 5;
- Shows both unit sales and revenue by category
- The ROUND() function formats currency values
- Helps identify the most profitable categories
3. Store Performance
Compare performance across stores:
SELECT
store_id,
COUNT(DISTINCT transaction_id) as total_transactions,
COUNT(DISTINCT product_id) as unique_products_sold,
ROUND(SUM(total_amount), 2) as total_revenue,
ROUND(AVG(total_amount), 2) as avg_transaction_value
FROM goldnode_tutorial_athena.sales
GROUP BY store_id
ORDER BY total_revenue DESC;
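The COUNT(DISTINCT ...) and AVG aggregates can be mirrored locally the same way; a sketch over the same three sample rows, using sets for the distinct counts:

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal

# First rows of store-transactions.csv, as shown earlier
sample = """transaction_id,date,store_id,product_id,product_name,category,quantity,unit_price,total_amount
1001,2024-01-01,STORE001,P123,Organic Coffee,Beverages,2,12.99,25.98
1002,2024-01-01,STORE001,P456,Whole Grain Bread,Bakery,3,4.99,14.97
1003,2024-01-01,STORE002,P789,Fresh Milk,Dairy,4,3.99,15.96
"""

# Per store: distinct transactions and products, total and average revenue
txns = defaultdict(set)
products = defaultdict(set)
amounts = defaultdict(list)
for row in csv.DictReader(io.StringIO(sample)):
    store = row["store_id"]
    txns[store].add(row["transaction_id"])   # COUNT(DISTINCT transaction_id)
    products[store].add(row["product_id"])   # COUNT(DISTINCT product_id)
    amounts[store].append(Decimal(row["total_amount"]))

for store in txns:
    total = sum(amounts[store])
    avg = round(total / len(amounts[store]), 2)  # ROUND(AVG(total_amount), 2)
    print(store, len(txns[store]), len(products[store]), total, avg)
```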
- Provides multiple performance metrics per store
- COUNT(DISTINCT ...) shows product variety
- Average transaction value helps identify store patterns
Real-World Scenario
Imagine you're a data analyst at a retail chain with 50 stores across the country. Each store's point-of-sale system exports daily sales data as CSV files that are automatically uploaded to S3. Your tasks include:
- Daily Sales Reports: Marketing needs daily sales totals by category
- Inventory Management: Operations needs to know fast-moving products
- Store Performance: Management needs store-by-store performance metrics
- Seasonal Trends: Planning team needs historical sales patterns
Instead of loading this data into a traditional database, you can use Athena to:
- Query data directly from S3
- Run ad-hoc analysis without infrastructure management
- Scale automatically during high-query periods (end of month reporting)
- Pay only for the queries you run
Clean-Up
Once you've finished trying out a few more queries, let's clean up the resources we created to avoid any ongoing charges.
1. Delete Athena Resources
- Back in the Query "Editor" tab, run the following to drop the table:
DROP TABLE goldnode_tutorial_athena.sales;
- And then run the following to drop the database:
DROP DATABASE goldnode_tutorial_athena;
You should now see the database removed from the left sidebar.
2. Delete S3 Data
- Navigate to the S3 console
- Click on your bucket: goldnode-tutorial-411980
- Select the raw-data and query-results folders
- Click "Delete"
- Confirm the deletion by entering permanently delete where prompted
- Click "Delete objects"
- Close the confirmation window and go back to the main List of buckets
- Select goldnode-tutorial-411980
- Click "Delete"
- Confirm the deletion by entering goldnode-tutorial-411980 where prompted
- Click "Delete bucket"
Hopefully you can see that Amazon Athena provides a powerful way to analyze data without the complexity of managing databases. By using standard SQL to query data directly from S3, you can quickly derive insights from your data without worrying about infrastructure, server management, or complex ETL processes. This makes it particularly valuable for ad-hoc analysis, log processing, and situations where you need quick answers from your data without a heavy technical setup.