Using Athena to Query CSV Data

This tutorial demonstrates how to analyze sales data using Amazon Athena to query CSV files stored in S3. We'll work with a realistic dataset from a fictional retail company, showing how to load, structure, and analyze sales data to derive business insights.

Published 2025-02-02
Estimated completion time: 45 minutes

Low Cost Expectation, with Caution
Following this tutorial is expected to incur minimal costs, between $0.00 and $0.10 USD or equivalent in your local currency. This estimate assumes that the tutorial is fully completed and all resources are cleaned up on the same day.
Leaving resources running can become expensive, especially if they are configured larger than needed. Accuracy of this information is not guaranteed: verify the costs yourself before proceeding, and if you have any doubts, do not follow this tutorial on a live system.

Setup

1. Create an S3 Bucket

Every bucket must be given a name, and that name must be globally unique across all AWS accounts and all AWS regions, so everyone following this tutorial cannot use the same name. A suggested name is provided below, which you may wish to change; changing it will dynamically update the tutorial for your personal use.

Bucket Name

Must be 3-63 characters, lowercase letters, numbers, dots, and hyphens only. Cannot start or end with dots or hyphens.

  1. Sign in to the AWS Management Console
  2. Navigate to S3
  3. (Optional) Change to a preferred region
  4. Click "Create bucket"
  5. Enter the bucket name: goldnode-tutorial-411980

We can leave the rest of the settings at their defaults, but while we're here, let's take a look at what they are:

  • Object ownership: Leave as default with ACLs Disabled. IAM roles and policies are the preferred way to manage permissions.
  • Block all public access: Keep enabled as we don't need public access for this tutorial.
  • Bucket Versioning: Tracks different versions of objects with the same name. We'll leave it disabled for this tutorial.
  • Tags: Used for resource organization and cost tracking. Not needed for this tutorial.
  • Default encryption: Leave as SSE-S3 for automatic AES-256 encryption of all objects.
  • Object Lock: Prevents object deletion/modification. Not needed for this tutorial.
  6. Click "Create bucket"

2. Create the folder structure

We will create two folders in our bucket, one for the raw CSV data and one for the query results.

  1. Click on your newly created bucket (goldnode-tutorial-411980)
  2. Click "Create folder" in the "Objects" tab
  3. Enter the folder name: raw-data
  4. Click "Create folder" to create it

Repeat these steps to create the query-results folder.

3. Upload Sample Data

Right-click this link and save it to download our sample CSV file: store-transactions.csv

This file contains our sample data. It is a CSV file with a header row and 50 rows of data; the first rows are:

transaction_id,date,store_id,product_id,product_name,category,quantity,unit_price,total_amount
1001,2024-01-01,STORE001,P123,Organic Coffee,Beverages,2,12.99,25.98
1002,2024-01-01,STORE001,P456,Whole Grain Bread,Bakery,3,4.99,14.97
1003,2024-01-01,STORE002,P789,Fresh Milk,Dairy,4,3.99,15.96
...
  1. Click on the raw-data folder in your bucket
  2. Click "Upload"
  3. Drag and drop the store-transactions.csv file into the upload area (or click "Add files" and select the file)
  4. Click "Upload"

Creating the Table in Athena

Understanding Athena Tables

Before we create our table, it's important to understand that Athena tables are not traditional database tables. Instead, they are metadata definitions that tell Athena how to read your data in S3. When you create a table in Athena:

  • No data is actually moved or copied
  • You're creating a schema that maps to your S3 data
  • The table is "external" because the data lives outside of Athena
  • Queries read directly from your S3 files

1. Configure Athena Query Results Location

Before running any queries, you need to tell Athena where to store query results:

  1. Open the Athena console
  2. If this is your first time using Athena, you'll see a welcome page; click "Explore the Athena query editor" to get started
  3. From the Query editor window, click on the "Settings" tab
  4. Click "Manage"
  5. Enter your S3 bucket path for query results, replacing [BUCKET_NAME] with your bucket name (goldnode-tutorial-411980 in this tutorial):
    s3://[BUCKET_NAME]/query-results/
  6. Click "Save"

2. Create a Database

A database in Athena is a logical container for your tables.

  1. Click the "Editor" tab
  2. Enter this query:
CREATE DATABASE goldnode_tutorial_athena
COMMENT 'Temporary database for goldnode tutorial';
  3. Click "Run"

You should now find your new database listed in the "Database" dropdown in the left sidebar.
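
If you would rather confirm this from the editor than the sidebar, an optional check (not required for the rest of the tutorial) is to list the databases Athena knows about:

SHOW DATABASES;
-- Or narrow the listing to just this one:
SHOW DATABASES LIKE 'goldnode_tutorial_athena';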

3. Create the Table

Now we'll create a table that maps to our CSV data.

  1. Clear the Query editor by clicking "Clear", then paste in the following:
CREATE EXTERNAL TABLE goldnode_tutorial_athena.sales (
    transaction_id STRING,
    `date` DATE,
    store_id STRING,
    product_id STRING,
    product_name STRING,
    category STRING,
    quantity INT,
    unit_price DECIMAL(10,2),
    total_amount DECIMAL(10,2)
)
-- Tell Athena this is a CSV file
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 's3://[BUCKET_NAME]/raw-data/'
TBLPROPERTIES ('skip.header.line.count'='1');

Key points about this table definition:

  • EXTERNAL TABLE: Indicates the data is stored outside Athena (in S3)
  • Column types must match our CSV data format
  • LOCATION: Must point to the S3 folder containing the CSV (replace [BUCKET_NAME] with your bucket name)
  • TBLPROPERTIES: Tells Athena to skip the CSV header row
  • Backticks around the date column: date is a reserved keyword in Athena DDL, so it has to be quoted when the table is created
  2. Click "Run"
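
As an optional sanity check before querying any data, you can ask Athena to print back the schema it registered; the column names and types should match the CREATE statement above:

DESCRIBE goldnode_tutorial_athena.sales;
-- SHOW CREATE TABLE goldnode_tutorial_athena.sales; prints the full DDL instead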

4. Verify the Table

After creating the table, let's verify it's working:

  1. Clear the Query editor by clicking "Clear", or open a new query tab (the + on the right-hand side)

  2. Run a simple SELECT:

SELECT * FROM goldnode_tutorial_athena.sales LIMIT 5;

You should see something like this in the results pane:

#  transaction_id  date        store_id  product_id  product_name       category   quantity  unit_price  total_amount
1  1001            2024-01-01  STORE001  P123        Organic Coffee     Beverages  2         12.99       25.98
2  1002            2024-01-01  STORE001  P456        Whole Grain Bread  Bakery     3         4.99        14.97
3  1003            2024-01-01  STORE002  P789        Fresh Milk         Dairy      4         3.99        15.96
4  1004            2024-01-01  STORE003  P234        Avocado            Produce    5         1.99        9.95
5  1005            2024-01-01  STORE003  P567        Free Range Eggs    Dairy      2         5.99        11.98
  3. Finally, let's check that all 50 rows of data are being read:
SELECT COUNT(*) as total_records FROM goldnode_tutorial_athena.sales;
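
If the count comes back short, or the date column shows NULL values, the usual causes are stray files in the raw-data folder or dates that are not in yyyy-MM-dd format. Here is a quick optional sanity check on the date column (the exact first and last dates depend on the sample file):

-- Confirm the date column parsed correctly and count any rows where it did not
SELECT
    MIN(date) AS first_date,
    MAX(date) AS last_date,
    COUNT(*) - COUNT(date) AS rows_with_null_date
FROM goldnode_tutorial_athena.sales;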

Sample Queries

Now that our table is set up, let's try a few more queries to analyze our data.

1. Daily Sales Total

This query shows sales trends over time:

SELECT 
    date,
    COUNT(*) as transactions,
    SUM(total_amount) as daily_revenue
FROM goldnode_tutorial_athena.sales
GROUP BY date
ORDER BY date;
  • COUNT(*): Shows number of transactions per day
  • SUM(total_amount): Calculates total revenue
  • Results help identify busy days and revenue patterns
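
If you only care about part of the range, say the first week, a WHERE clause narrows the results. The dates below are just an example window; note that without partitioning Athena still scans the whole CSV, the filter only trims the output:

SELECT
    date,
    COUNT(*) AS transactions,
    SUM(total_amount) AS daily_revenue
FROM goldnode_tutorial_athena.sales
-- Restrict the output to a specific date window
WHERE date BETWEEN DATE '2024-01-01' AND DATE '2024-01-07'
GROUP BY date
ORDER BY date;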

2. Top Selling Categories

Analyze which product categories perform best:

SELECT 
    category,
    SUM(quantity) as total_units_sold,
    ROUND(SUM(total_amount), 2) as total_revenue
FROM goldnode_tutorial_athena.sales
GROUP BY category
ORDER BY total_revenue DESC
LIMIT 5;
  • Shows both unit sales and revenue by category
  • ROUND() keeps the currency values to two decimal places
  • Helps identify most profitable categories
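
As an optional extension beyond the original query, a window function can express each category's revenue as a share of the overall total:

SELECT
    category,
    ROUND(SUM(total_amount), 2) AS total_revenue,
    -- SUM(...) OVER () adds up the per-category sums to get the grand total
    ROUND(100.0 * SUM(total_amount) / SUM(SUM(total_amount)) OVER (), 1) AS revenue_share_pct
FROM goldnode_tutorial_athena.sales
GROUP BY category
ORDER BY total_revenue DESC;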

3. Store Performance

Compare performance across stores:

SELECT 
    store_id,
    COUNT(DISTINCT transaction_id) as total_transactions,
    COUNT(DISTINCT product_id) as unique_products_sold,
    ROUND(SUM(total_amount), 2) as total_revenue,
    ROUND(AVG(total_amount), 2) as avg_transaction_value
FROM goldnode_tutorial_athena.sales
GROUP BY store_id
ORDER BY total_revenue DESC;
  • Provides multiple performance metrics per store
  • COUNT(DISTINCT) shows product variety
  • Average transaction value helps identify store patterns
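
To dig one level deeper, grouping by store and category together shows where each store earns its revenue; this is an optional extra rather than one of the original tutorial steps:

SELECT
    store_id,
    category,
    ROUND(SUM(total_amount), 2) AS revenue
FROM goldnode_tutorial_athena.sales
GROUP BY store_id, category
ORDER BY store_id, revenue DESC;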

Real-World Scenario

Imagine you're a data analyst at a retail chain with 50 stores across the country. Each store's point-of-sale system exports daily sales data as CSV files that are automatically uploaded to S3. Your tasks include:

  1. Daily Sales Reports: Marketing needs daily sales totals by category
  2. Inventory Management: Operations needs to know fast-moving products
  3. Store Performance: Management needs store-by-store performance metrics
  4. Seasonal Trends: Planning team needs historical sales patterns

Instead of loading this data into a traditional database, you can use Athena to:

  • Query data directly from S3
  • Run ad-hoc analysis without infrastructure management
  • Scale automatically during high-query periods (end of month reporting)
  • Pay only for the queries you run
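
For the seasonal-trends task, the same table supports time-bucketed queries directly. A monthly roll-up looks like the sketch below; with the small 50-row sample it will only return a handful of rows at most, but the pattern scales unchanged to years of data:

SELECT
    -- Bucket each transaction date to the first day of its month
    date_trunc('month', date) AS month,
    COUNT(*) AS transactions,
    ROUND(SUM(total_amount), 2) AS monthly_revenue
FROM goldnode_tutorial_athena.sales
GROUP BY date_trunc('month', date)
ORDER BY month;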

Clean-Up

Once you've finished trying out a few more queries, let's clean up the resources we created to avoid any ongoing charges.

1. Delete Athena Resources

  1. Back in the Query "Editor" tab, run the following to drop the table:
DROP TABLE goldnode_tutorial_athena.sales;
  2. Then run the following to drop the database:
DROP DATABASE goldnode_tutorial_athena;
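
If the DROP DATABASE statement complains that the database is not empty, a table is still registered in it; an optional way to see what is left is:

SHOW TABLES IN goldnode_tutorial_athena;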

You should now see the database removed from the left sidebar. Remember that dropping the table and the database only removes Athena's metadata; the CSV file and the query results are still in your S3 bucket, which is why we delete them in the next step.

2. Delete S3 Data

  1. Navigate to the S3 console
  2. Click on your bucket: goldnode-tutorial-411980
  3. Select the raw-data and query-results folders
  4. Click "Delete"
  5. Confirm the deletion by entering "permanently delete" where prompted
  6. Click "Delete objects"
  7. Close the confirmation window and go back to the main list of buckets
  8. Select goldnode-tutorial-411980
  9. Click "Delete"
  10. Confirm the deletion by entering goldnode-tutorial-411980 where prompted
  11. Click "Delete bucket"

Hopefully you can see that Amazon Athena provides a powerful way to analyze data without the complexity of managing databases. By using standard SQL to query data directly from S3, you can quickly derive insights from your data without worrying about infrastructure, server management, or complex ETL processes. This makes it particularly valuable for ad-hoc analysis, log processing, and situations where you need quick answers from your data without a heavy technical setup.

© 2025 Goldnode. All rights reserved.