Building a feature pipeline with Glue and SageMaker, The Cloud Ledger

The first model I shipped to production failed not because the algorithm was wrong, but because the features I trained on were computed differently from the features served at inference time. Training used a tidy pandas notebook; serving used a hand-written Lambda. The two drifted within weeks. After that, I stopped treating feature engineering as a notebook concern and built a real pipeline around AWS Glue and SageMaker.

This post describes the architecture I landed on for batch features: Glue for the heavy transformations, the SageMaker Feature Store as the single source of truth, and the same definitions feeding both training and offline scoring.

Why split Glue and SageMaker at all

Glue is a managed Spark environment. It is good at exactly the part that hurts in a notebook: joining several large tables in S3, deduplicating, and windowed aggregations over months of event data. SageMaker is where the model lives, but its Processing jobs are not as economical for wide shuffles. So I let each do what it is good at.

A Glue job reads raw Parquet from the data lake, computes aggregates, and writes a clean feature frame.
SageMaker Feature Store ingests that frame, keeping an online store (DynamoDB-backed, single-digit ms reads) and an offline store (S3, Iceberg) in sync.
Training pulls a point-in-time-correct dataset from the offline store via Athena.

The single most valuable property here is that one feature definition produces both the training set and the online value served at inference. That is what kills training/serving skew.
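To make "point-in-time correct" concrete, here is a minimal pure-Python sketch of the rule the offline-store query has to enforce: for each training label, use the newest feature row whose event time is not later than the label's timestamp. The function name features_as_of and the sample rows are hypothetical; in the pipeline itself this logic lives in the Athena query, not in Python.

```python
from bisect import bisect_right

def features_as_of(history, label_ts):
    """Return the newest feature row with event_time <= label_ts, or None.

    history: list of (event_time, features) tuples sorted by event_time.
    ISO-8601 UTC strings in one fixed format compare correctly as text.
    """
    times = [event_time for event_time, _ in history]
    idx = bisect_right(times, label_ts)
    return history[idx - 1][1] if idx else None

# Hypothetical feature history for a single customer.
history = [
    ("2024-01-05T00:00:00Z", {"spend_30d": 120.0}),
    ("2024-01-20T00:00:00Z", {"spend_30d": 310.0}),
]

# A label observed on Jan 10 must not see the Jan 20 value (that would leak).
row = features_as_of(history, "2024-01-10T00:00:00Z")
```

Joining labels to the latest feature value regardless of time is the usual way leakage sneaks into a training set, which is why the event-time column matters so much later on.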

The Glue transformation

Here is the core of the Glue job that builds rolling 30-day spend features per customer. I keep it in PySpark rather than the Glue visual editor so it is reviewable in git.

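from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql import functions as F, Window

sc = SparkContext.getOrCreate()
glue = GlueContext(sc)
spark = glue.spark_session

txns = spark.read.parquet("s3://lake/curated/transactions/")

# Rolling 30-day frame per customer, measured in seconds of event time.
w = (Window.partitionBy("customer_id")
           .orderBy(F.col("event_ts").cast("long"))
           .rangeBetween(-30 * 86400, 0))

# Ranks each customer's rows newest-first so only the latest one is kept.
latest = Window.partitionBy("customer_id").orderBy(F.col("event_ts").desc())

features = (txns
    .withColumn("spend_30d", F.sum("amount").over(w))
    .withColumn("txn_count_30d", F.count("*").over(w))
    # Feature Store accepts string event times only as ISO-8601.
    # Assumes event_ts is a timestamp and the session time zone is UTC.
    .withColumn("event_time",
                F.date_format("event_ts", "yyyy-MM-dd'T'HH:mm:ss'Z'"))
    .withColumn("rn", F.row_number().over(latest))
    .filter(F.col("rn") == 1)
    .select("customer_id", "spend_30d", "txn_count_30d", "event_time"))

features.write.mode("overwrite").parquet("s3://lake/features/customer_spend/")

Note that event_time is a string formatted as ISO-8601 (yyyy-MM-dd'T'HH:mm:ssZ) rather than a plain cast to string. Spark's default cast produces a space-separated timestamp, which the Feature Store does not accept. The Feature Store requires an explicit event-time field, and getting its format right early saves a painful re-ingest. The job also keeps only each customer's most recent row, so the features are as of that customer's last transaction; calling dropDuplicates on customer_id alone would keep an arbitrary row instead.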
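When reviewing a window like this, it helps to check the frame boundaries against a tiny in-memory example. The sketch below mirrors the rangeBetween(-30 * 86400, 0) frame in plain Python for a single customer and formats an event time as an ISO-8601 UTC string. The helper names and sample data are hypothetical, and none of this is part of the Glue job.

```python
from datetime import datetime, timezone

WINDOW_SECONDS = 30 * 86400

def rolling_spend(txns):
    """For each (epoch_seconds, amount), aggregate amounts in [ts - 30d, ts].

    Both bounds are inclusive, matching Spark's rangeBetween(-N, 0).
    Returns (ts, sum, count) tuples in input order.
    """
    out = []
    for ts, _ in txns:
        in_frame = [amt for t, amt in txns if ts - WINDOW_SECONDS <= t <= ts]
        out.append((ts, sum(in_frame), len(in_frame)))
    return out

def to_feature_store_time(epoch_seconds):
    """Format epoch seconds as an ISO-8601 UTC string."""
    dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

day = 86400
# Hypothetical transactions for one customer: (epoch seconds, amount).
txns = [(0, 10.0), (10 * day, 20.0), (30 * day, 5.0), (45 * day, 7.0)]
result = rolling_spend(txns)
```

The transaction at exactly 30 days still sees the one at day 0 because the lower bound is inclusive; if you want a strict 30-day cutoff, the frame needs to start one second later.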

Ingesting into the Feature Store

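Once the frame is in S3, a small step using the SageMaker Python SDK registers the feature group (once) and batch-ingests. The online store is what your real-time endpoint reads; skip it if you only ever score in batch, because it adds DynamoDB cost.

import time

import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.feature_store.inputs import TableFormatEnum

# Load the Glue output (reading s3:// paths needs pyarrow and s3fs installed).
df = pd.read_parquet("s3://lake/features/customer_spend/")
# Use pandas' string dtype so event_time is registered as a String feature.
df["event_time"] = df["event_time"].astype("string")

sess = sagemaker.Session()
fg = FeatureGroup(name="customer-spend-v1", sagemaker_session=sess)

fg.load_feature_definitions(data_frame=df)
fg.create(
    s3_uri="s3://lake/featurestore/",
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn="arn:aws:iam::123456789012:role/sm-featurestore",
    enable_online_store=True,
    table_format=TableFormatEnum.ICEBERG,
)
# Creation is asynchronous; ingesting before it finishes fails.
while fg.describe()["FeatureGroupStatus"] == "Creating":
    time.sleep(5)

fg.ingest(data_frame=df, max_workers=4, wait=True)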
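
For completeness, this is roughly what the read side looks like at inference time. It is a sketch, assuming the feature group above exists with its online store enabled. The helper fetch_features and the StubClient class are hypothetical; the client is passed in so the response parsing can be exercised without AWS access. In real use the client would come from boto3.client("sagemaker-featurestore-runtime").

```python
def fetch_features(client, feature_group_name, customer_id):
    """Read the latest record for one customer from the online store.

    client: a boto3 "sagemaker-featurestore-runtime" client (or a stub).
    Returns a dict of feature name -> string value, empty if no record.
    """
    response = client.get_record(
        FeatureGroupName=feature_group_name,
        RecordIdentifierValueAsString=str(customer_id),
    )
    # Treat a response without "Record" as "identifier not found".
    return {
        item["FeatureName"]: item["ValueAsString"]
        for item in response.get("Record", [])
    }

class StubClient:
    """Stand-in that returns a canned response shaped like GetRecord's."""

    def get_record(self, FeatureGroupName, RecordIdentifierValueAsString):
        if RecordIdentifierValueAsString != "c-1":
            return {}
        return {"Record": [
            {"FeatureName": "customer_id", "ValueAsString": "c-1"},
            {"FeatureName": "spend_30d", "ValueAsString": "310.0"},
        ]}

features = fetch_features(StubClient(), "customer-spend-v1", "c-1")
```

Values come back as strings, so the serving code has to cast them to the same types the model saw in training.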

Cost and scheduling trade-offs

A few numbers from running this daily on ~80M transactions:

Component	Choice	Why
Glue workers	G.1X, 10 DPUs	Job finishes in ~12 min; G.2X was faster but not worth ~2x cost here
Online store	On only for serving groups	DynamoDB write/read charges add up on wide groups
Offline format	Iceberg	Point-in-time queries via Athena without manual partition juggling
Orchestration	EventBridge to Step Functions	Glue then ingest then training, with retries
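
As a back-of-the-envelope check on the Glue row, per-run cost is DPUs times hours times the DPU-hour rate. The sketch below assumes a rate of $0.44 per DPU-hour, which is an assumption to verify against current pricing for your region, and it ignores minimum-billing rules and any non-Glue charges.

```python
def glue_run_cost(dpus, minutes, rate_per_dpu_hour):
    """Estimate the cost of one Glue run in dollars."""
    return dpus * (minutes / 60) * rate_per_dpu_hour

ASSUMED_RATE = 0.44  # assumed USD per DPU-hour; check current regional pricing

per_run = glue_run_cost(dpus=10, minutes=12, rate_per_dpu_hour=ASSUMED_RATE)
per_month = per_run * 30  # one run per day
```

With those inputs the job uses 2 DPU-hours per run, so under the assumed rate the Glue side stays under a dollar a day.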

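The one thing I would not skip: if you switch to incremental reads, enable Glue job bookmarks; otherwise you reprocess the whole lake every run. Bookmarks only track sources read through the GlueContext with a transformation_ctx, and the script has to call job.init() and job.commit(), so the plain spark.read.parquet above would need to change as well.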
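
The orchestration row can be sketched as an Amazon States Language definition built in Python. This is a minimal sketch under assumptions rather than a production definition: the Glue job name, the Lambda function names, and the retry numbers are placeholders, and the ingest and training steps are shown as Lambda invocations purely for brevity.

```python
import json

# Retry policy shared by every step; the numbers are placeholders.
RETRY = [{
    "ErrorEquals": ["States.ALL"],
    "IntervalSeconds": 60,
    "MaxAttempts": 2,
    "BackoffRate": 2.0,
}]

def task(resource, parameters, next_state=None):
    """Build one Task state with retries; a state without a successor ends the run."""
    state = {"Type": "Task", "Resource": resource,
             "Parameters": parameters, "Retry": RETRY}
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state

definition = {
    "StartAt": "BuildFeatures",
    "States": {
        # The .sync suffix makes Step Functions wait for the Glue job to finish.
        "BuildFeatures": task(
            "arn:aws:states:::glue:startJobRun.sync",
            {"JobName": "customer-spend-features"},  # hypothetical job name
            next_state="IngestFeatures"),
        "IngestFeatures": task(
            "arn:aws:states:::lambda:invoke",
            {"FunctionName": "ingest-customer-spend"},  # hypothetical function
            next_state="StartTraining"),
        "StartTraining": task(
            "arn:aws:states:::lambda:invoke",
            {"FunctionName": "start-training"}),  # hypothetical function
    },
}

definition_json = json.dumps(definition, indent=2)
```

An EventBridge schedule targeting the state machine then starts the chain once a day, and a step that exhausts its retries fails the execution visibly instead of leaving stale features behind.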

Takeaways

Use Glue for the wide joins and aggregations; it is cheaper and more natural than SageMaker Processing for Spark-shaped work.
Let the SageMaker Feature Store be the single definition feeding both training and serving to eliminate skew.
Only enable the online store for groups you actually serve in real time, since DynamoDB charges scale with feature width.
Orchestrate the Glue then ingest then train chain with Step Functions and EventBridge so failures retry instead of silently producing stale features.
