Serverless CSV Data Pipeline - ETL

Business Implications

The pipeline turns raw CSV drops into governed, analytics-ready data with zero servers to manage. Teams get faster time-to-insight, reliable data quality, and low operational cost. QuickSight dashboards update as new files arrive, enabling near-real-time decisions for business stakeholders.

Check GitHub

Final Outcome

Automated Curated Data & Dashboard

Steps Performed

Automated CSV ingestion with S3 event triggers, preprocessed files with Lambda, transformed them at scale with Glue, stored curated datasets in S3, and built interactive QuickSight dashboards for analysis.

Set Up Buckets And Roles

Created three S3 buckets (raw, processed, final). Configured IAM roles/policies for Lambda and Glue with least-privilege access to S3, Glue, and CloudWatch. Enabled QuickSight access to selected buckets for analytics.
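As a rough illustration of what least-privilege access could look like for the Lambda role, the sketch below builds an IAM policy document that only reads from the raw bucket, only writes to the processed bucket, and writes logs. The bucket names and the function name are hypothetical; the actual policies used in the project are not shown here.

```python
import json


def lambda_s3_policy(raw_bucket: str, processed_bucket: str) -> dict:
    """Build an IAM policy document: read raw objects, write processed objects, write logs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadRawObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{raw_bucket}/*",
            },
            {
                "Sid": "WriteProcessedObjects",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{processed_bucket}/*",
            },
            {
                "Sid": "WriteLogs",
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                # Assumption: in practice this would be narrowed to the function's log group.
                "Resource": "arn:aws:logs:*:*:*",
            },
        ],
    }


# Hypothetical bucket names, for illustration only.
policy = lambda_s3_policy("example-raw-bucket", "example-processed-bucket")
print(json.dumps(policy, indent=2))
```

Keeping read and write permissions in separate statements makes it easy to see that the function cannot modify the raw bucket it reads from.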

Lambda Preprocess On Upload

Wrote a Python Lambda function that triggers on raw-bucket uploads, reads each CSV, filters and cleans rows, and writes the cleaned output to the processed bucket using Boto3, with execution logged to CloudWatch.
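A minimal sketch of such a handler is shown below. The cleaning rule (strip whitespace, drop rows with any empty field) and the bucket name are assumptions made for illustration; the project's actual filter logic is not described in detail.

```python
import csv
import io
import urllib.parse

PROCESSED_BUCKET = "example-processed-bucket"  # hypothetical bucket name


def clean_csv(text: str) -> str:
    """Strip whitespace from every cell and drop blank rows or rows with an empty field."""
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        cells = [cell.strip() for cell in row]
        if cells and all(cells):
            writer.writerow(cells)
    return out.getvalue()


def lambda_handler(event, context):
    # Imported here so clean_csv can be exercised locally without the AWS SDK.
    import boto3

    s3 = boto3.client("s3")
    written = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        s3.put_object(
            Bucket=PROCESSED_BUCKET,
            Key=key,
            Body=clean_csv(body).encode("utf-8"),
        )
        # Output printed by a Lambda function is captured in CloudWatch Logs.
        print(f"cleaned s3://{bucket}/{key} -> s3://{PROCESSED_BUCKET}/{key}")
        written.append(key)
    return {"written": written}
```

Separating the cleaning step from the S3 calls keeps the transformation testable on plain strings before it is deployed.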

Catalog And Discover Schema

Set up a Glue Data Catalog database, created an on-demand crawler to infer the schema from processed CSVs, and materialized tables for downstream ETL and analytics.
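The same setup could be scripted with Boto3 roughly as follows. This is a sketch, not the method used in the project (which may have been done in the console); the crawler name, role ARN, database name, and S3 path are all hypothetical. Omitting a schedule from the request leaves the crawler on-demand.

```python
def crawler_request(name: str, role_arn: str, database: str, s3_path: str) -> dict:
    """Build the arguments for Glue create_crawler; no Schedule means on-demand runs."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_run_crawler(request: dict) -> None:
    """Create the catalog database and crawler, then start one crawl."""
    import boto3  # imported here so crawler_request can be tested without the AWS SDK

    glue = boto3.client("glue")
    # Note: create_database raises an error if the database already exists.
    glue.create_database(DatabaseInput={"Name": request["DatabaseName"]})
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical values, for illustration only.
request = crawler_request(
    name="processed-csv-crawler",
    role_arn="arn:aws:iam::123456789012:role/example-glue-role",
    database="csv_pipeline_db",
    s3_path="s3://example-processed-bucket/",
)
```

Calling `create_and_run_crawler(request)` with valid credentials would populate the catalog with tables inferred from the processed CSVs.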

Transform With Glue ETL

Built a Glue Studio (Visual ETL) job that applies schema changes and business rules and writes compressed CSV outputs to the final bucket, with parameterized paths for repeatable production runs.
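One way parameterized paths might be passed at run time is through Glue job arguments, as sketched below. The argument names `--source_path` and `--target_path` and the job name are assumptions; the job script would need to read them (for example with `getResolvedOptions`) for this to have any effect.

```python
def job_arguments(source_path: str, target_path: str) -> dict:
    """Build Glue job arguments; Glue expects argument names prefixed with '--'."""
    for path in (source_path, target_path):
        if not path.startswith("s3://"):
            raise ValueError(f"expected an s3:// path, got {path!r}")
    return {"--source_path": source_path, "--target_path": target_path}


def start_job(job_name: str, arguments: dict) -> str:
    """Start one run of the Glue job and return its run ID."""
    import boto3  # imported here so job_arguments can be tested without the AWS SDK

    glue = boto3.client("glue")
    response = glue.start_job_run(JobName=job_name, Arguments=arguments)
    return response["JobRunId"]


# Hypothetical paths, for illustration only.
arguments = job_arguments(
    "s3://example-processed-bucket/",
    "s3://example-final-bucket/curated/",
)
```

Validating the paths before starting a run catches a mistyped location early, rather than partway through a job.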

Visualize In QuickSight

Connected QuickSight to final S3 data via a manifest, created bar/line visuals, and published a dashboard. Validated dynamic refresh by dropping new CSVs through the pipeline.
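For reference, a QuickSight S3 manifest is a small JSON file that lists where the data lives and how to parse it. The sketch below generates one for a prefix in the final bucket; the bucket name and prefix are hypothetical, and the upload settings assume comma-delimited files with a header row.

```python
import json


def build_manifest(uri_prefix: str) -> dict:
    """Build a QuickSight S3 manifest covering every file under the given prefix."""
    return {
        "fileLocations": [{"URIPrefixes": [uri_prefix]}],
        "globalUploadSettings": {
            "format": "CSV",
            "delimiter": ",",
            "textqualifier": "\"",
            "containsHeader": "true",
        },
    }


# Hypothetical location of the curated output.
manifest = build_manifest("s3://example-final-bucket/curated/")
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Pointing the manifest at a prefix rather than at individual files means newly written outputs under that prefix are picked up the next time the dataset refreshes.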

AWS Services Used

Amazon S3
AWS Lambda
AWS Glue
Amazon QuickSight
AWS IAM
Amazon CloudWatch

Technical Tools Used

Python
Boto3
AWS Glue Studio
Amazon QuickSight (authoring)

Skills Demonstrated

Serverless Data Engineering
Event-Driven ETL Orchestration
Data Modeling & Governance
Dashboarding & Analytics