Using Athena to Query CSV Data
This tutorial demonstrates how to analyze sales data using Amazon Athena to query CSV files stored in S3. We'll work with a realistic dataset from a fictional retail company, showing how to load, structure, and analyze sales data to derive business insights.
Setup
1. Create an S3 Bucket
Bucket names must be globally unique across every AWS account and every AWS region, so everyone following this tutorial needs their own name. This tutorial uses the example name goldnode-tutorial-411980; substitute your own name wherever it appears.
Must be 3-63 characters, lowercase letters, numbers, dots, and hyphens only. Cannot start or end with dots or hyphens.
- Sign in to the AWS Management Console
- Navigate to S3
- (Optional) Change to a preferred region
- Click "Create bucket"
- Enter the bucket name: goldnode-tutorial-411980
We can leave the rest of the settings as default, but while we're here, let's take a look at what they are:
- Object ownership: Leave as default with ACLs Disabled. IAM roles and policies are the preferred way to manage permissions.
- Block all public access: Keep enabled as we don't need public access for this tutorial.
- Bucket Versioning: Tracks different versions of objects with the same name. We'll leave it disabled for this tutorial.
- Tags: Used for resource organization and cost tracking. Not needed for this tutorial.
- Default encryption: Leave as SSE-S3 for automatic AES-256 encryption of all objects.
- Object Lock: Prevents object deletion/modification. Not needed for this tutorial.
- Click "Create bucket"
2. Create the folder structure
We will create two folders in our bucket, one for the raw CSV data and one for the query results.
- Click on your newly created bucket (goldnode-tutorial-411980)
- Click "Create folder" in the "Objects" tab
- Enter the folder name: raw-data
- Click "Create folder" to create it
Repeat these steps to create the query-results folder.
3. Upload Sample Data
Right-click this link and save it to download our sample CSV file: store-transactions.csv
This file contains our sample data: a CSV file with a header row and 50 rows of data. The first rows are:
transaction_id,date,store_id,product_id,product_name,category,quantity,unit_price,total_amount
1001,2024-01-01,STORE001,P123,Organic Coffee,Beverages,2,12.99,25.98
1002,2024-01-01,STORE001,P456,Whole Grain Bread,Bakery,3,4.99,14.97
1003,2024-01-01,STORE002,P789,Fresh Milk,Dairy,4,3.99,15.96
...
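Note that total_amount is a derived column: it should always equal quantity × unit_price. A minimal Python sketch of that sanity check, run over the three sample rows shown above (the remaining 47 rows are not reproduced here):

```python
import csv
import io
from decimal import Decimal

# First rows of store-transactions.csv, as shown above
sample = """transaction_id,date,store_id,product_id,product_name,category,quantity,unit_price,total_amount
1001,2024-01-01,STORE001,P123,Organic Coffee,Beverages,2,12.99,25.98
1002,2024-01-01,STORE001,P456,Whole Grain Bread,Bakery,3,4.99,14.97
1003,2024-01-01,STORE002,P789,Fresh Milk,Dairy,4,3.99,15.96
"""

# Check that total_amount == quantity * unit_price on every row
for row in csv.DictReader(io.StringIO(sample)):
    expected = int(row["quantity"]) * Decimal(row["unit_price"])
    assert expected == Decimal(row["total_amount"]), row["transaction_id"]
print("all rows consistent")
```

Decimal (rather than float) is used here for the same reason the table definition below uses DECIMAL(10,2): exact currency arithmetic.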
- Click on the raw-data folder in your bucket
- Click "Upload"
- Drag and drop the store-transactions.csv file into the upload area (or click "Add files" and select the file)
- Click "Upload"
Creating the Table in Athena
Understanding Athena Tables
Before we create our table, it's important to understand that Athena tables are not traditional database tables. Instead, they are metadata definitions that tell Athena how to read your data in S3. When you create a table in Athena:
- No data is actually moved or copied
- You're creating a schema that maps to your S3 data
- The table is "external" because the data lives outside of Athena
- Queries read directly from your S3 files
1. Configure Athena Query Results Location
Before running any queries, you need to tell Athena where to store query results:
- Open the Athena console
- If this is your first time using Athena, you'll see a welcome page; just click "Explore the Athena query editor" to get started
- From the Query editor window, click on the "Settings" tab
- Click "manage"
- Enter your S3 bucket path for query results:
s3://[BUCKET_NAME]/query-results/
- Click "Save"
2. Create a Database
A database in Athena is a logical container for your tables.
- Click the "Editor" tab
- Enter and run this query:
CREATE DATABASE goldnode_tutorial_athena
COMMENT 'Temporary database for goldnode tutorial';
- Click "Run"
You should now find your new database listed in the "Database" dropdown in the left sidebar
3. Create the Table
Now we'll create a table that maps to our CSV data.
- Clear the Query editor by clicking "Clear", then paste in the following:
CREATE EXTERNAL TABLE goldnode_tutorial_athena.sales (
transaction_id STRING,
`date` DATE,
store_id STRING,
product_id STRING,
product_name STRING,
category STRING,
quantity INT,
unit_price DECIMAL(10,2),
total_amount DECIMAL(10,2)
)
-- Tell Athena this is a CSV file
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 's3://[BUCKET_NAME]/raw-data/'
TBLPROPERTIES ('skip.header.line.count'='1');
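One caveat worth knowing: ROW FORMAT DELIMITED with FIELDS TERMINATED BY ',' splits on every comma and does not honor quoted fields (Athena's OpenCSVSerDe is the option that handles quoting). Our sample data contains no embedded commas, so the simple format works here. A quick Python illustration of the difference, using a hypothetical product name containing a comma:

```python
import csv
import io

# A row whose product_name contains a comma, protected by quotes
line = '1051,2024-01-31,STORE001,P999,"Coffee, Decaf",Beverages,1,9.99,9.99\n'

# Naive split on every comma -- roughly what FIELDS TERMINATED BY ',' does
naive = line.strip().split(",")

# A quote-aware CSV parser keeps the quoted field intact
proper = next(csv.reader(io.StringIO(line)))

print(len(naive))   # the quoted name is split into two fields
print(len(proper))  # "Coffee, Decaf" stays one field
```

If your real data quotes fields this way, switch the table definition to OpenCSVSerDe rather than ROW FORMAT DELIMITED.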
Key points about this table definition:
- EXTERNAL TABLE: indicates the data is stored outside Athena (in S3)
- Column types must match our CSV data format
- LOCATION: must point to the S3 folder containing the CSV
- TBLPROPERTIES: tells Athena to skip the CSV header row
- Click "Run"
4. Verify the Table
After creating the table, let's verify it's working:
Clear the Query editor by clicking "Clear", or open a new query tab (the + on the right-hand side)
Run a simple SELECT:
SELECT * FROM goldnode_tutorial_athena.sales LIMIT 5;
And you should see something like this under the results pane:
| # | transaction_id | date | store_id | product_id | product_name | category | quantity | unit_price | total_amount |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1001 | 2024-01-01 | STORE001 | P123 | Organic Coffee | Beverages | 2 | 12.99 | 25.98 |
| 2 | 1002 | 2024-01-01 | STORE001 | P456 | Whole Grain Bread | Bakery | 3 | 4.99 | 14.97 |
| 3 | 1003 | 2024-01-01 | STORE002 | P789 | Fresh Milk | Dairy | 4 | 3.99 | 15.96 |
| 4 | 1004 | 2024-01-01 | STORE003 | P234 | Avocado | Produce | 5 | 1.99 | 9.95 |
| 5 | 1005 | 2024-01-01 | STORE003 | P567 | Free Range Eggs | Dairy | 2 | 5.99 | 11.98 |
- And let's check that all 50 records are being read:
SELECT COUNT(*) as total_records FROM goldnode_tutorial_athena.sales;
Sample Queries
Now that our table is set up, let's try a few more queries to analyze our data.
1. Daily Sales Total
This query shows sales trends over time:
SELECT
date,
COUNT(*) as transactions,
SUM(total_amount) as daily_revenue
FROM goldnode_tutorial_athena.sales
GROUP BY date
ORDER BY date;
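To see what this query computes, here is a local Python equivalent of the GROUP BY, run over the first three sample rows (totals for the full 50-row file will of course differ):

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal

# First rows of store-transactions.csv, as shown earlier
sample = """transaction_id,date,store_id,product_id,product_name,category,quantity,unit_price,total_amount
1001,2024-01-01,STORE001,P123,Organic Coffee,Beverages,2,12.99,25.98
1002,2024-01-01,STORE001,P456,Whole Grain Bread,Bakery,3,4.99,14.97
1003,2024-01-01,STORE002,P789,Fresh Milk,Dairy,4,3.99,15.96
"""

# GROUP BY date: count transactions and sum revenue per day
transactions = defaultdict(int)
revenue = defaultdict(Decimal)
for row in csv.DictReader(io.StringIO(sample)):
    transactions[row["date"]] += 1
    revenue[row["date"]] += Decimal(row["total_amount"])

for day in sorted(revenue):  # ORDER BY date
    print(day, transactions[day], revenue[day])
```

The difference, of course, is that Athena runs this aggregation for you directly over S3, with no code to maintain.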
- COUNT(*): shows the number of transactions per day
- SUM(total_amount): calculates total revenue per day
- Results help identify busy days and revenue patterns
2. Top Selling Categories
Analyze which product categories perform best:
SELECT
category,
SUM(quantity) as total_units_sold,
ROUND(SUM(total_amount), 2) as total_revenue
FROM goldnode_tutorial_athena.sales
GROUP BY category
ORDER BY total_revenue DESC
LIMIT 5;
- Shows both unit sales and revenue by category
- The ROUND() function formats currency values
- Helps identify the most profitable categories
3. Store Performance
Compare performance across stores:
SELECT
store_id,
COUNT(DISTINCT transaction_id) as total_transactions,
COUNT(DISTINCT product_id) as unique_products_sold,
ROUND(SUM(total_amount), 2) as total_revenue,
ROUND(AVG(total_amount), 2) as avg_transaction_value
FROM goldnode_tutorial_athena.sales
GROUP BY store_id
ORDER BY total_revenue DESC;
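The COUNT(DISTINCT ...) and AVG aggregates can be mirrored locally the same way; a sketch over the same three sample rows, using sets for the distinct counts:

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal

# First rows of store-transactions.csv, as shown earlier
sample = """transaction_id,date,store_id,product_id,product_name,category,quantity,unit_price,total_amount
1001,2024-01-01,STORE001,P123,Organic Coffee,Beverages,2,12.99,25.98
1002,2024-01-01,STORE001,P456,Whole Grain Bread,Bakery,3,4.99,14.97
1003,2024-01-01,STORE002,P789,Fresh Milk,Dairy,4,3.99,15.96
"""

# Per store: distinct transactions and products, total and average revenue
txns = defaultdict(set)
products = defaultdict(set)
amounts = defaultdict(list)
for row in csv.DictReader(io.StringIO(sample)):
    store = row["store_id"]
    txns[store].add(row["transaction_id"])   # COUNT(DISTINCT transaction_id)
    products[store].add(row["product_id"])   # COUNT(DISTINCT product_id)
    amounts[store].append(Decimal(row["total_amount"]))

for store in txns:
    total = sum(amounts[store])
    avg = round(total / len(amounts[store]), 2)  # ROUND(AVG(total_amount), 2)
    print(store, len(txns[store]), len(products[store]), total, avg)
```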
- Provides multiple performance metrics per store
- COUNT(DISTINCT ...) shows product variety
- Average transaction value helps identify store patterns
Real-World Scenario
Imagine you're a data analyst at a retail chain with 50 stores across the country. Each store's point-of-sale system exports daily sales data as CSV files that are automatically uploaded to S3. Your tasks include:
- Daily Sales Reports: Marketing needs daily sales totals by category
- Inventory Management: Operations needs to know fast-moving products
- Store Performance: Management needs store-by-store performance metrics
- Seasonal Trends: Planning team needs historical sales patterns
Instead of loading this data into a traditional database, you can use Athena to:
- Query data directly from S3
- Run ad-hoc analysis without infrastructure management
- Scale automatically during high-query periods (end of month reporting)
- Pay only for the queries you run
Clean-Up
Once you've finished trying out a few more queries, let's clean up the resources we created to avoid any ongoing charges.
1. Delete Athena Resources
- Back in the Query "Editor" tab, run the following to drop the table:
DROP TABLE goldnode_tutorial_athena.sales;
- And then run the following to drop the database:
DROP DATABASE goldnode_tutorial_athena;
You should now see the database removed from the left sidebar.
2. Delete S3 Data
- Navigate to the S3 console
- Click on your bucket: goldnode-tutorial-411980
- Select the raw-data and query-results folders
- Click "Delete"
- Confirm the deletion by entering permanently delete where prompted
- Click "Delete objects"
- Close the confirmation window and go back to the main List of buckets
- Select goldnode-tutorial-411980
- Click "Delete"
- Confirm the deletion by entering goldnode-tutorial-411980 where prompted
- Click "Delete bucket"
Hopefully you can see that Amazon Athena provides a powerful way to analyze data without the complexity of managing databases. By using standard SQL to query data directly from S3, you can quickly derive insights from your data without worrying about infrastructure, server management, or complex ETL processes. This makes it particularly valuable for ad-hoc analysis, log processing, and situations where you need quick answers from your data without a heavy technical setup.