Building a feature pipeline with Glue and SageMaker
From raw data in S3 to model-ready features, orchestrated and repeatable.
The first model I shipped to production failed not because the algorithm was wrong, but because the features I trained on were computed differently from the features served at inference time. Training used a tidy pandas notebook; serving used a hand-written Lambda. The two drifted within weeks. After that, I stopped treating feature engineering as a notebook concern and built a real pipeline around AWS Glue and SageMaker.
This post is the architecture I landed on for batch features: Glue for the heavy transformations, the SageMaker Feature Store as the single source of truth, and the same definitions feeding both training and offline scoring.
Why split Glue and SageMaker at all
Glue is a managed Spark environment. It is good at exactly the part that hurts in a notebook: joining several large tables in S3, deduplicating, and windowed aggregations over months of event data. SageMaker is where the model lives, but its Processing jobs are not as economical for wide shuffles. So I let each do what it is good at.
- Glue job reads raw Parquet from the data lake, computes aggregates, and writes a clean feature frame.
- SageMaker Feature Store ingests that frame, keeping an online store (DynamoDB-backed, single-digit ms reads) and an offline store (S3, Iceberg) in sync.
- Training pulls a point-in-time-correct dataset from the offline store via Athena.
The single most valuable property here is that one feature definition produces both the training set and the online value served at inference. That is what kills training/serving skew.
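To make "point-in-time-correct" concrete, here is a minimal pure-Python sketch of the rule the offline-store query has to enforce: a training label may only see the newest feature row at or before its own timestamp. The function name, the tuple layout, and the sample values are illustrative assumptions, not part of the Feature Store API.

```python
from bisect import bisect_right
from datetime import datetime


def feature_as_of(history, label_time):
    """Return the newest feature value with event_time <= label_time.

    `history` is a list of (event_time, value) tuples sorted by event_time,
    a hypothetical in-memory stand-in for one customer's offline-store rows.
    Returns None when no feature existed yet at label_time.
    """
    times = [t for t, _ in history]
    idx = bisect_right(times, label_time)  # number of rows at or before label_time
    return history[idx - 1][1] if idx else None


# Made-up spend_30d history for a single customer.
history = [
    (datetime(2024, 1, 1), 120.0),
    (datetime(2024, 1, 15), 340.0),
    (datetime(2024, 2, 1), 90.0),
]

# A label observed on Jan 20 must see the Jan 15 value, never the Feb 1 one.
value_at_label = feature_as_of(history, datetime(2024, 1, 20))
```

Joining on "latest row per customer" instead would leak the Feb 1 value into a Jan 20 training example, which is exactly the kind of skew the pipeline is meant to remove.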
The Glue transformation
Here is the core of the Glue job that builds rolling 30-day spend features per customer. I keep it in PySpark rather than the Glue visual editor so it is reviewable in git.
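from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql import functions as F, Window

sc = SparkContext.getOrCreate()
glue = GlueContext(sc)
spark = glue.spark_session

txns = spark.read.parquet("s3://lake/curated/transactions/")

# Trailing 30-day window per customer, ordered by epoch seconds.
w = (Window.partitionBy("customer_id")
     .orderBy(F.col("event_ts").cast("long"))
     .rangeBetween(-30 * 86400, 0))
# Newest transaction first, used to keep exactly one row per customer.
latest = Window.partitionBy("customer_id").orderBy(F.col("event_ts").desc())

features = (txns
    .withColumn("spend_30d", F.sum("amount").over(w))
    .withColumn("txn_count_30d", F.count("*").over(w))
    # Keep the most recent row; dropDuplicates would keep an arbitrary one.
    .withColumn("rn", F.row_number().over(latest))
    .filter(F.col("rn") == 1)
    # ISO-8601 string; assumes the Spark session time zone is UTC.
    .withColumn("event_time", F.date_format("event_ts", "yyyy-MM-dd'T'HH:mm:ss'Z'"))
    .select("customer_id", "spend_30d", "txn_count_30d", "event_time"))

features.write.mode("overwrite").parquet("s3://lake/features/customer_spend/")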
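Note that event_time is written as a string column. The Feature Store requires an explicit event-time feature, accepted either as an ISO-8601 UTC string (for example 2024-05-01T12:00:00Z) or as fractional Unix seconds, and getting its format right early saves a painful re-ingest.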
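A cheap guard is to validate the event-time strings before anything is ingested. The sketch below is plain Python with hypothetical helper names; it only covers the second-precision form and assumes naive timestamps are already in UTC.

```python
import re
from datetime import datetime, timezone

# Second-precision ISO-8601 in UTC, e.g. 2024-05-01T12:00:00Z.
_ISO_UTC = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")


def to_event_time(ts: datetime) -> str:
    """Format a datetime as a second-precision ISO-8601 UTC string."""
    if ts.tzinfo is None:
        # Assumption: naive timestamps are already UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def is_valid_event_time(value: str) -> bool:
    """Check both the shape of the string and that it is a real calendar time."""
    if not _ISO_UTC.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return False
    return True


formatted = to_event_time(datetime(2024, 5, 1, 12, 0, 0))
```

Running `is_valid_event_time` over a sample of the Glue output in a unit test catches the common mistake of a space-separated timestamp such as `2024-05-01 12:00:00`, which is what a plain cast to string produces.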
Ingesting into the Feature Store
Once the frame is in S3, a small SageMaker Python SDK step registers the feature group (once) and batch-ingests. The online store is what your real-time endpoint reads; skip it if you only ever score in batch, because it adds DynamoDB cost.
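import time
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

# Load the Glue output (reading s3:// paths with pandas needs s3fs and pyarrow).
df = pd.read_parquet("s3://lake/features/customer_spend/")
# Object-dtype columns cannot be mapped to a feature type; cast them to "string".
obj_cols = df.select_dtypes("object").columns
df[obj_cols] = df[obj_cols].astype("string")

sess = sagemaker.Session()
fg = FeatureGroup(name="customer-spend-v1", sagemaker_session=sess)
fg.load_feature_definitions(data_frame=df)
# create() is a one-time step; skip it on later runs once the group exists.
fg.create(
    s3_uri="s3://lake/featurestore/",
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn="arn:aws:iam::123456789012:role/sm-featurestore",
    enable_online_store=True,
)
# create() returns immediately; wait until the group is ready before ingesting.
while fg.describe()["FeatureGroupStatus"] == "Creating":
    time.sleep(5)
fg.ingest(data_frame=df, max_workers=4, wait=True)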
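On the serving side, the endpoint reads the same group back through the Feature Store runtime API. The sketch below wraps `get_record` in a small helper; the helper name, the fake client, and the sample customer are hypothetical and exist only so the parsing logic can be exercised without AWS credentials.

```python
def fetch_online_features(client, customer_id, feature_names,
                          feature_group="customer-spend-v1"):
    """Read one customer's latest values from the online store as a dict.

    `client` is expected to be boto3.client("sagemaker-featurestore-runtime").
    Returns an empty dict when the customer has no record yet.
    """
    response = client.get_record(
        FeatureGroupName=feature_group,
        RecordIdentifierValueAsString=str(customer_id),
        FeatureNames=list(feature_names),
    )
    # The "Record" key is absent when nothing is stored for this identifier.
    return {
        item["FeatureName"]: item["ValueAsString"]
        for item in response.get("Record", [])
    }


class FakeRuntimeClient:
    """Hypothetical stand-in for the boto3 client, for local testing only."""

    def __init__(self, rows):
        self._rows = rows

    def get_record(self, FeatureGroupName, RecordIdentifierValueAsString,
                   FeatureNames):
        row = self._rows.get(RecordIdentifierValueAsString)
        if row is None:
            return {}
        return {"Record": [
            {"FeatureName": name, "ValueAsString": str(row[name])}
            for name in FeatureNames
        ]}


fake = FakeRuntimeClient({"c-42": {"spend_30d": 310.5, "txn_count_30d": 7}})
features = fetch_online_features(fake, "c-42", ["spend_30d", "txn_count_30d"])
```

In a real endpoint you would pass `boto3.client("sagemaker-featurestore-runtime")` instead of the fake. Values come back as strings, so cast them to the model's expected types before scoring.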
Cost and scheduling trade-offs
A few numbers from running this daily on ~80M transactions:
| Component | Choice | Why |
|---|---|---|
| Glue workers | G.1X, 10 DPUs | Job finishes in ~12 min; G.2X was faster but not worth ~2x cost here |
| Online store | On only for serving groups | DynamoDB write/read charges add up on wide groups |
| Offline format | Iceberg | Point-in-time queries via Athena without manual partition juggling |
| Orchestration | EventBridge to Step Functions | Glue then ingest then training, with retries |
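The worker-size row is easy to sanity-check with arithmetic, since Glue bills by DPU-hour. This is a rough sketch: the 0.44 USD per DPU-hour default is an assumed list price that varies by region, and the function name is mine.

```python
def glue_run_cost(dpus, minutes, usd_per_dpu_hour=0.44):
    """Estimate the cost of one Glue Spark job run in USD.

    The 0.44 USD per DPU-hour default is an assumed list price; check the
    rate for your region before relying on the result.
    """
    return dpus * (minutes / 60) * usd_per_dpu_hour


daily = glue_run_cost(dpus=10, minutes=12)  # the G.1X configuration in the table
monthly = daily * 30                        # one run per day
```

Under that assumed rate a run costs a bit under one dollar. A G.2X worker counts as 2 DPUs, so the same worker count doubles the DPU total and only breaks even if the runtime drops to about 6 minutes, which is the comparison behind the "not worth ~2x cost" note.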
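The one thing I would not skip: enable Glue job bookmarks if you switch to incremental reads, otherwise you reprocess the whole lake every run. Bookmarks only track sources read through the GlueContext (create_dynamic_frame with a transformation_ctx) in a job that calls job.init() and job.commit(), so a plain spark.read.parquet call has to change as well.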
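For the orchestration row, the Glue then ingest then training chain maps onto a three-state Step Functions definition. The sketch below builds it as a Python dict and prints the JSON. The Glue job name and both Lambda function names are hypothetical, and running ingest and training kickoff as Lambdas is an assumption made to keep the example short.

```python
import json

# Retry policy shared by every step: two retries with exponential backoff.
RETRY = [{
    "ErrorEquals": ["States.ALL"],
    "IntervalSeconds": 60,
    "MaxAttempts": 2,
    "BackoffRate": 2.0,
}]


def task(resource, parameters, next_state=None):
    """Build one Task state; the last state in the chain gets End=True."""
    state = {"Type": "Task", "Resource": resource,
             "Parameters": parameters, "Retry": RETRY}
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


definition = {
    "Comment": "Glue features -> Feature Store ingest -> training",
    "StartAt": "BuildFeatures",
    "States": {
        "BuildFeatures": task(
            "arn:aws:states:::glue:startJobRun.sync",  # waits for the job to finish
            {"JobName": "customer-spend-features"},    # hypothetical job name
            next_state="IngestFeatures"),
        "IngestFeatures": task(
            "arn:aws:states:::lambda:invoke",
            {"FunctionName": "ingest-customer-spend"},  # hypothetical Lambda
            next_state="TrainModel"),
        "TrainModel": task(
            "arn:aws:states:::lambda:invoke",
            {"FunctionName": "start-training"},         # hypothetical Lambda
        ),
    },
}

print(json.dumps(definition, indent=2))
```

An EventBridge schedule rule can then target the state machine for the daily run. For long training jobs, the native `arn:aws:states:::sagemaker:createTrainingJob.sync` integration is likely a better fit than a Lambda, at the price of spelling out the full training job specification in `Parameters`.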
Takeaways
- Use Glue for the wide joins and aggregations; it is cheaper and more natural than SageMaker Processing for Spark-shaped work.
- Let the SageMaker Feature Store be the single definition feeding both training and serving to eliminate skew.
- Only enable the online store for groups you actually serve in real time, since DynamoDB charges scale with feature width.
- Orchestrate the Glue then ingest then train chain with Step Functions and EventBridge so failures retry instead of silently producing stale features.