Athena and the lakehouse: querying S3 like a database, The Cloud Ledger

For years my answer to "we need to query these logs" was "spin up a database, write an ingestion job, manage the cluster." Then I started leaving the data in S3 and pointing Athena at it. No servers, no ingestion, pay only for the bytes a query scans. It reframed S3 from a dumb bucket into the storage layer of a queryable warehouse.

That is the lakehouse idea in one sentence: keep raw data in cheap object storage, lay a table schema over it, and run SQL directly. Athena is the serverless query engine that makes it practical on AWS.

You are billed per byte scanned

Athena costs roughly $5 per terabyte scanned. There is no cluster to size and nothing running when you are idle, which is fantastic, but it means query cost is entirely a function of how much data each query has to read. Two changes cut scanned bytes by 10x or more:

Columnar formats. Convert raw JSON/CSV to Parquet. A query selecting 3 of 40 columns reads only those 3 columns' worth of bytes, and Parquet compresses far better than text.
Partitioning. Lay data out by a column you filter on (date is the classic) so a WHERE clause prunes whole prefixes instead of scanning everything.

Partition so queries read less

Storing data as s3://logs/year=2026/month=06/day=24/ lets Athena skip every other day's files when you filter on the date. With partition projection you do not even need to register partitions in the Glue catalog, Athena computes the S3 paths from the query predicate.

CREATE EXTERNAL TABLE access_logs (
  request_id   string,
  status       int,
  bytes_sent   bigint,
  user_agent   string
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-logs/access/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.dt.type' = 'date',
  'projection.dt.range' = '2024-01-01,NOW',
  'projection.dt.format' = 'yyyy-MM-dd',
  'storage.location.template' = 's3://my-logs/access/dt=${dt}/'
);

Now a query like the following scans only one day of Parquet, not the whole table:

SELECT status, count(*) AS n
FROM access_logs
WHERE dt = '2026-06-24' AND status >= 500
GROUP BY status
ORDER BY n DESC;

The catalog is the glue

Athena reads table definitions from the AWS Glue Data Catalog. You can author tables by hand with DDL, or run a Glue Crawler to infer schema from your S3 files. The same catalog is shared by Redshift Spectrum, EMR, and other engines, so one table definition serves multiple tools. I prefer explicit DDL in version control over crawlers, because crawlers occasionally guess types wrong and silently change a schema.

Lakehouse, not just a query tool

Plain external tables are append-and-rewrite, fine for immutable logs, painful when you need updates, deletes, or time travel. That is where open table formats come in. Apache Iceberg, supported natively by Athena, adds ACID transactions, row-level UPDATE/DELETE, schema evolution, and snapshot-based time travel on top of S3.

External tables turn S3 into something you can read like a database. Iceberg turns it into something you can write like one too.

For a slowly changing dimension or a GDPR-deletion requirement, Iceberg is the difference between a clean DELETE FROM and rewriting partitions by hand. The trade-off is more metadata to manage and the need to compact small files periodically.

Takeaways

Athena bills ~$5/TB scanned, so cost is driven by bytes read, Parquet plus partitioning routinely cuts that 10x.
Use partition projection to prune S3 prefixes without maintaining partition metadata.
Define tables explicitly in the Glue Data Catalog and keep the DDL in version control rather than trusting crawlers.
Reach for Iceberg when you need updates, deletes, or time travel; stick to plain external tables for immutable logs.

Athena and the lakehouse: querying S3 like a database

You are billed per byte scanned

Partition so queries read less

The catalog is the glue

Lakehouse, not just a query tool

Takeaways

More on AWS Services