
Business Implications

This baseline autoscaler highlights how simple rules often lead to under- or over-provisioning of GPUs during LLM inference. By quantifying cost, latency, and SLA behavior, teams can identify inefficiencies before deploying expensive cloud clusters, creating a data-backed foundation for transitioning toward ML-driven autoscaling policies that reduce infrastructure cost while preserving user experience.

Final Outcome

Threshold-based autoscaler baseline created.

Steps Performed

Implemented a lightweight autoscaler that reacts to LLM token load by scaling GPU instances up or down. Simulated latency, cost, and SLA violations using workload traces from Project 1, generating insights for future ML-based autoscaling improvements. Illustrative code sketches of each stage follow the numbered steps below.

1.

Loaded synthetic LLM workload

Imported the aggregated workload generated in Project 1 (llm-load-simulator), capturing per-minute token counts, request bursts, and prefill/decode behavior.

2.

Built threshold-based autoscaling rules

Implemented simple scale-up and scale-down triggers using total tokens-per-minute thresholds, GPU count limits, and minimum/maximum capacity constraints.

3.

Simulated latency and cost dynamics

Applied a custom latency formula based on token load per GPU and tracked cost-per-minute using configurable GPU pricing assumptions to mimic real cloud environments.

4.

Calculated SLA violations

Compared simulated latency to a 500 ms SLA target and flagged violations to understand performance reliability under different scaling levels.

5.

Visualized results with Matplotlib

Generated clean plots showing GPU count over time, latency curves, cost progression, and SLA violation patterns, all key metrics for evaluating autoscaling performance (see the final sketch after this list).
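To make steps 1 and 2 concrete, here is a minimal Python sketch of loading the workload trace and applying the threshold scaling rules. The file name, column names (minute, total_tokens), thresholds, and GPU limits are illustrative assumptions, not the project's actual values.

```python
import pandas as pd

# Illustrative thresholds and capacity limits (assumed values, not the project's)
SCALE_UP_TOKENS_PER_GPU = 40_000    # scale up when per-GPU load exceeds this
SCALE_DOWN_TOKENS_PER_GPU = 15_000  # scale down when per-GPU load falls below this
MIN_GPUS, MAX_GPUS = 1, 8

# Step 1: load the per-minute workload trace from the llm-load-simulator project
# (file and column names assumed: minute, total_tokens)
workload = pd.read_csv("llm_workload.csv")

def decide_gpu_count(total_tokens: int, current_gpus: int) -> int:
    """Step 2: threshold rules on tokens-per-minute per GPU, clamped to capacity limits."""
    load_per_gpu = total_tokens / current_gpus
    if load_per_gpu > SCALE_UP_TOKENS_PER_GPU and current_gpus < MAX_GPUS:
        return current_gpus + 1
    if load_per_gpu < SCALE_DOWN_TOKENS_PER_GPU and current_gpus > MIN_GPUS:
        return current_gpus - 1
    return current_gpus
```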
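Continuing that sketch, steps 3 and 4 can be modeled with a simple per-minute simulation loop: latency grows with token load per GPU, cost accrues from an assumed hourly GPU price, and each minute is flagged against the 500 ms SLA. The latency constants and the price are placeholders, not the formula or pricing actually used in the project.

```python
# Placeholder model parameters (assumed, not the project's actual formula or pricing)
GPU_PRICE_PER_HOUR = 2.50    # USD per GPU-hour
SLA_TARGET_MS = 500.0        # SLA target from step 4
BASE_LATENCY_MS = 50.0       # fixed overhead per minute of serving
MS_PER_TOKEN_PER_GPU = 0.01  # latency added per token of per-GPU load

def simulate(workload: pd.DataFrame) -> pd.DataFrame:
    """Steps 3-4: walk the trace minute by minute, tracking GPUs, latency, cost, and SLA flags."""
    gpus = MIN_GPUS
    records = []
    for row in workload.itertuples():
        gpus = decide_gpu_count(row.total_tokens, gpus)
        latency_ms = BASE_LATENCY_MS + MS_PER_TOKEN_PER_GPU * (row.total_tokens / gpus)
        records.append({
            "minute": row.minute,
            "gpus": gpus,
            "latency_ms": latency_ms,
            "cost_usd": gpus * GPU_PRICE_PER_HOUR / 60.0,  # per-minute cost
            "sla_violation": latency_ms > SLA_TARGET_MS,
        })
    return pd.DataFrame(records)
```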
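Finally, a sketch of the step 5 plots, built on the simulated results from the previous sketches; the project's actual figures may be laid out differently.

```python
import matplotlib.pyplot as plt

results = simulate(workload)

fig, axes = plt.subplots(3, 1, figsize=(10, 8), sharex=True)

# GPU count over time (step changes as the autoscaler reacts)
axes[0].step(results["minute"], results["gpus"], where="post")
axes[0].set_ylabel("GPU count")

# Latency curve with the SLA target marked
axes[1].plot(results["minute"], results["latency_ms"])
axes[1].axhline(SLA_TARGET_MS, linestyle="--", label="500 ms SLA")
axes[1].set_ylabel("Latency (ms)")
axes[1].legend()

# Cumulative cost progression
axes[2].plot(results["minute"], results["cost_usd"].cumsum())
axes[2].set_ylabel("Cumulative cost (USD)")
axes[2].set_xlabel("Minute")

fig.suptitle("Threshold-based autoscaler: GPU count, latency, and cost")
fig.tight_layout()
plt.show()
```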

AWS Services Used

None

Technical Tools Used

Python
Pandas
Matplotlib
CSV Workloads
Basic Cost/Latency Modeling

Skills Demonstrated

Autoscaling logic
System simulation
Latency modeling
Cost analysis

Baseline LLM Autoscaler (Threshold-Based Scaling)

A simple heuristic autoscaler for LLM inference workloads using latency, cost, and token load thresholds.

I built a baseline autoscaling engine that reads synthetic LLM workload traces and makes scale-up/scale-down decisions using simple threshold rules. It computes cost, latency, and SLA violations over time and serves as a foundational baseline to compare against more advanced RL-based autoscalers.

Related Projects

CI/CD For Dockerized 2048 Game

Amazon ECS

Multi-Cloud Weather Tracker with DR (AWS+Azure)

Azure+AWS

Amazon Polly Text Narrator

Amazon Polly

Automated Receipt Processing System - Amazon Textract

Amazon Textract

Reinforcement Learning Auto-Scaler for LLM Inference

RL-Based LLM Autoscaler

AWS Serverless Event Announcement System

AWS Lambda
