Business Implications
This baseline autoscaler highlights how simple rules often lead to under- or over-provisioning of GPUs during LLM inference. By quantifying cost, latency, and SLA behaviors, teams can identify inefficiencies before deploying expensive cloud clusters—creating a data-backed foundation for transitioning toward ML-driven autoscaling policies that reduce infrastructure cost while preserving user experience.


Steps Performed
Implemented a lightweight autoscaler that reacts to LLM token load by scaling GPU instances up or down. Simulated latency, cost, and SLA violations using workload traces from Project 1, generating insights for future ML-based autoscaling improvements. Illustrative code sketches for each step follow the numbered list below.
1. Loaded synthetic LLM workload
Imported the aggregated workload generated in Project 1 (llm-load-simulator), capturing per-minute token counts, request bursts, and prefill/decode behavior.
2. Built threshold-based autoscaling rules
Implemented simple scale-up and scale-down triggers using total tokens-per-minute thresholds, GPU count limits, and minimum/maximum capacity constraints.
3. Simulated latency and cost dynamics
Applied a custom latency formula based on token load per GPU and tracked cost-per-minute using configurable GPU pricing assumptions to mimic real cloud environments.
4. Calculated SLA violations
Compared simulated latency to a 500 ms SLA target and flagged violations to understand performance reliability under different scaling levels.
5. Visualized results with Matplotlib
Generated clean plots showing GPU count over time, latency curves, cost progression, and SLA violation patterns, all key metrics for evaluating autoscaling performance.
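
A minimal sketch of step 1, loading the per-minute workload trace with Pandas. The file name llm_workload.csv and the column names minute and total_tokens are assumptions for illustration, not the project's actual schema.

```python
import pandas as pd

# Load the per-minute workload trace exported by the llm-load-simulator project.
# The file name and column names below are assumed placeholders.
workload = pd.read_csv("llm_workload.csv")

# Inspect the assumed schema: one row per minute with its total token count.
print(workload[["minute", "total_tokens"]].head())
```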
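One possible shape for the step 2 threshold rules. The specific token thresholds and the 1-8 GPU capacity range are illustrative defaults, not the project's tuned configuration.

```python
def decide_gpu_count(current_gpus, total_tokens,
                     scale_up_tokens=400_000, scale_down_tokens=100_000,
                     min_gpus=1, max_gpus=8):
    """Threshold-based scaling decision for one minute of traffic.

    Thresholds and capacity limits here are illustrative assumptions.
    """
    if total_tokens > scale_up_tokens:
        return min(current_gpus + 1, max_gpus)   # scale up, capped at max capacity
    if total_tokens < scale_down_tokens:
        return max(current_gpus - 1, min_gpus)   # scale down, floored at min capacity
    return current_gpus                          # hold steady inside the dead band
```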
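One way to express the step 3 latency and cost model. The base latency, per-token slope, and $2.50/hour GPU price are placeholder assumptions meant to mimic the shape of the custom formula, not its actual parameters.

```python
def simulate_minute(total_tokens, gpus,
                    base_latency_ms=50.0, ms_per_token_per_gpu=0.002,
                    gpu_price_per_hour=2.50):
    """Return (latency_ms, cost_per_minute) for one simulated minute.

    Latency grows with token load per GPU; cost scales linearly with GPU count.
    All constants here are assumed placeholders.
    """
    tokens_per_gpu = total_tokens / max(gpus, 1)
    latency_ms = base_latency_ms + ms_per_token_per_gpu * tokens_per_gpu
    cost_per_minute = gpus * gpu_price_per_hour / 60.0
    return latency_ms, cost_per_minute
```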
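A driver loop in the spirit of step 4, reusing the workload DataFrame and the two helpers sketched above. The 500 ms target comes from the project; the starting capacity and column names remain assumptions.

```python
records = []
gpus = 1  # assumed starting capacity at the minimum

for _, row in workload.iterrows():
    total_tokens = row["total_tokens"]            # assumed column name
    gpus = decide_gpu_count(gpus, total_tokens)   # threshold rules from the sketch above
    latency_ms, cost = simulate_minute(total_tokens, gpus)
    records.append({
        "minute": row["minute"],
        "gpus": gpus,
        "latency_ms": latency_ms,
        "cost_per_minute": cost,
        "sla_violation": latency_ms > 500,        # 500 ms SLA target
    })

results = pd.DataFrame(records)
print(f"SLA violations: {results['sla_violation'].mean():.1%} of simulated minutes")
```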
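A plotting sketch for step 5 using the results DataFrame built above. The four-panel layout is one reasonable way to show the metrics described, not necessarily the project's exact figures.

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(4, 1, figsize=(10, 12), sharex=True)

# GPU count over time (step plot, since capacity changes discretely).
axes[0].step(results["minute"], results["gpus"], where="post")
axes[0].set_ylabel("GPU count")

# Latency curve with the 500 ms SLA target for reference.
axes[1].plot(results["minute"], results["latency_ms"])
axes[1].axhline(500, color="red", linestyle="--", label="500 ms SLA")
axes[1].set_ylabel("Latency (ms)")
axes[1].legend()

# Cumulative cost progression.
axes[2].plot(results["minute"], results["cost_per_minute"].cumsum())
axes[2].set_ylabel("Cumulative cost ($)")

# SLA violation pattern (1 = violated, 0 = met).
axes[3].plot(results["minute"], results["sla_violation"].astype(int))
axes[3].set_ylabel("SLA violation")
axes[3].set_xlabel("Minute")

plt.tight_layout()
plt.show()
```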
AWS Services Used
None
Technical Tools Used
Python
Pandas
Matplotlib
CSV Workloads
Basic Cost/Latency Modeling
Skills Demonstrated
Autoscaling logic
System simulation
Latency modeling
Cost analysis

Baseline LLM Autoscaler (Threshold-Based Scaling)
A simple heuristic autoscaler for LLM inference workloads using latency, cost, and token load thresholds.
I built a baseline autoscaling engine that reads synthetic LLM workload traces and makes scale-up/scale-down decisions using simple threshold rules. It computes cost, latency, and SLA violations over time and serves as a foundational baseline to compare against more advanced RL-based autoscalers.






