Business Implications
This baseline autoscaler highlights how simple rules often lead to under- or over-provisioning of GPUs during LLM inference. By quantifying cost, latency, and SLA behaviors, teams can identify inefficiencies before deploying expensive cloud clusters—creating a data-backed foundation for transitioning toward ML-driven autoscaling policies that reduce infrastructure cost while preserving user experience.


Steps Performed
Implemented a lightweight autoscaler that reacts to LLM token load by scaling GPU instances up or down. Simulated latency, cost, and SLA violations using workload traces from Project 1, generating insights for future ML-based autoscaling improvements. Illustrative code sketches for each step follow the numbered list below.
1. Loaded synthetic LLM workload
Imported the aggregated workload generated in Project 1 (llm-load-simulator), capturing per-minute token counts, request bursts, and prefill/decode behavior.
2. Built threshold-based autoscaling rules
Implemented simple scale-up and scale-down triggers using total tokens-per-minute thresholds, GPU count limits, and minimum/maximum capacity constraints.
3. Simulated latency and cost dynamics
Applied a custom latency formula based on token load per GPU and tracked cost-per-minute using configurable GPU pricing assumptions to mimic real cloud environments.
4. Calculated SLA violations
Compared simulated latency to a 500 ms SLA target and flagged violations to understand performance reliability under different scaling levels.
5. Visualized results with Matplotlib
Generated clean plots showing GPU count over time, latency curves, cost progression, and SLA violation patterns, all key metrics for evaluating autoscaling performance.
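
A minimal sketch of step 1, loading the per-minute workload trace with Pandas. The file name llm_workload.csv and the column names minute and total_tokens are assumptions for illustration, not the project's actual schema.

```python
import pandas as pd

# Load the per-minute workload trace exported by the llm-load-simulator project.
# The file name and column names below are assumed placeholders.
workload = pd.read_csv("llm_workload.csv")

# Inspect the assumed schema: one row per minute with its total token count.
print(workload[["minute", "total_tokens"]].head())
```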
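One possible shape for the step 2 threshold rules. The specific token thresholds and the 1-8 GPU capacity range are illustrative defaults, not the project's tuned configuration.

```python
def decide_gpu_count(current_gpus, total_tokens,
                     scale_up_tokens=400_000, scale_down_tokens=100_000,
                     min_gpus=1, max_gpus=8):
    """Threshold-based scaling decision for one minute of traffic.

    Thresholds and capacity limits here are illustrative assumptions.
    """
    if total_tokens > scale_up_tokens:
        return min(current_gpus + 1, max_gpus)   # scale up, capped at max capacity
    if total_tokens < scale_down_tokens:
        return max(current_gpus - 1, min_gpus)   # scale down, floored at min capacity
    return current_gpus                          # hold steady inside the dead band
```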
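One way to express the step 3 latency and cost model. The base latency, per-token slope, and $2.50/hour GPU price are placeholder assumptions meant to mimic the shape of the custom formula, not its actual parameters.

```python
def simulate_minute(total_tokens, gpus,
                    base_latency_ms=50.0, ms_per_token_per_gpu=0.002,
                    gpu_price_per_hour=2.50):
    """Return (latency_ms, cost_per_minute) for one simulated minute.

    Latency grows with token load per GPU; cost scales linearly with GPU count.
    All constants here are assumed placeholders.
    """
    tokens_per_gpu = total_tokens / max(gpus, 1)
    latency_ms = base_latency_ms + ms_per_token_per_gpu * tokens_per_gpu
    cost_per_minute = gpus * gpu_price_per_hour / 60.0
    return latency_ms, cost_per_minute
```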
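A driver loop in the spirit of step 4, reusing the workload DataFrame and the two helpers sketched above. The 500 ms target comes from the project; the starting capacity and column names remain assumptions.

```python
records = []
gpus = 1  # assumed starting capacity at the minimum

for _, row in workload.iterrows():
    total_tokens = row["total_tokens"]            # assumed column name
    gpus = decide_gpu_count(gpus, total_tokens)   # threshold rules from the sketch above
    latency_ms, cost = simulate_minute(total_tokens, gpus)
    records.append({
        "minute": row["minute"],
        "gpus": gpus,
        "latency_ms": latency_ms,
        "cost_per_minute": cost,
        "sla_violation": latency_ms > 500,        # 500 ms SLA target
    })

results = pd.DataFrame(records)
print(f"SLA violations: {results['sla_violation'].mean():.1%} of simulated minutes")
```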
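A plotting sketch for step 5 using the results DataFrame built above. The four-panel layout is one reasonable way to show the metrics described, not necessarily the project's exact figures.

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(4, 1, figsize=(10, 12), sharex=True)

# GPU count over time (step plot, since capacity changes discretely).
axes[0].step(results["minute"], results["gpus"], where="post")
axes[0].set_ylabel("GPU count")

# Latency curve with the 500 ms SLA target for reference.
axes[1].plot(results["minute"], results["latency_ms"])
axes[1].axhline(500, color="red", linestyle="--", label="500 ms SLA")
axes[1].set_ylabel("Latency (ms)")
axes[1].legend()

# Cumulative cost progression.
axes[2].plot(results["minute"], results["cost_per_minute"].cumsum())
axes[2].set_ylabel("Cumulative cost ($)")

# SLA violation pattern (1 = violated, 0 = met).
axes[3].plot(results["minute"], results["sla_violation"].astype(int))
axes[3].set_ylabel("SLA violation")
axes[3].set_xlabel("Minute")

plt.tight_layout()
plt.show()
```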
AWS Services Used
None
Technical Tools Used
Python
Pandas
Matplotlib
CSV Workloads
Basic Cost/Latency Modeling
Skills Demonstrated
Autoscaling logic
System simulation
Latency modeling
Cost analysis

Baseline LLM Autoscaler (Threshold-Based Scaling)
A simple heuristic autoscaler for LLM inference workloads using latency, cost, and token load thresholds.
I built a baseline autoscaling engine that reads synthetic LLM workload traces and makes scale-up/scale-down decisions using simple threshold rules. It computes cost, latency, and SLA violations over time and serves as a foundational baseline to compare against more advanced RL-based autoscalers.






