Business Implications
By learning scaling decisions directly from workload patterns, the RL autoscaler reduces GPU cost while avoiding user-visible latency spikes. This approach helps organizations transition from rigid rule-based autoscaling to adaptive, intelligent capacity management, supporting more efficient, reliable, and cost-aware LLM deployments at scale.


Steps Performed
Created a custom RL environment that simulates GPU scaling, latency, cost, and SLA constraints. Trained a Q-learning agent to discover an efficient scaling strategy using workload traces, then visualized the policy’s cost and latency behavior.
1. Designed custom RL environment
Built an environment with state (token load, GPU count), actions (scale down, same, scale up), and rewards based on cost savings and SLA-compliant latency (see the environment sketch after this list).
2. Integrated workload traces
Connected synthetic workload traces from Project 1 (llm-load-simulator), feeding tokens-per-minute data into the RL environment to simulate realistic inference conditions.
3. Implemented Q-learning agent
Created a tabular Q-learning model with epsilon-greedy exploration to iteratively improve scaling decisions across hundreds of training episodes (see the training sketch after this list).
4. Tracked training performance
Logged per-episode reward, SLA violations, GPU usage, and overall behavior to analyze how the agent learned cost-efficient, SLA-safe strategies.
5. Visualized learned policy
Generated plots of training curves, GPU scaling behavior, latency vs. SLA thresholds, and cost per minute, demonstrating the quality of the RL agent's decision-making.
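To make the environment design concrete, here is a minimal sketch of such a scaling environment. The class name ScalingEnv, the toy latency model, and all constants (GPU cost per minute, SLA threshold, penalty, bin counts) are illustrative assumptions, not the project's actual values.

import numpy as np

# A minimal sketch of the custom scaling environment (illustrative, not the project's exact model).
class ScalingEnv:
    """State: (binned token load, GPU count). Actions: 0 = scale down, 1 = keep same, 2 = scale up."""

    def __init__(self, trace_tokens_per_min, min_gpus=1, max_gpus=8,
                 gpu_cost_per_min=0.05, sla_latency_ms=500.0, sla_penalty=2.0):
        self.trace = np.asarray(trace_tokens_per_min, dtype=float)
        self.min_gpus, self.max_gpus = min_gpus, max_gpus
        self.gpu_cost_per_min = gpu_cost_per_min
        self.sla_latency_ms = sla_latency_ms
        self.sla_penalty = sla_penalty
        # Discretize token load into 10 buckets so a tabular Q-table stays small.
        self.load_bins = np.linspace(self.trace.min(), self.trace.max(), 11)[1:-1]

    def reset(self):
        self.t = 0
        self.gpus = self.min_gpus
        return self._state()

    def _state(self):
        load_bucket = int(np.digitize(self.trace[self.t], self.load_bins))
        return (load_bucket, self.gpus)

    def step(self, action):
        # Apply the scaling action, clipped to the allowed GPU range.
        self.gpus = int(np.clip(self.gpus + (action - 1), self.min_gpus, self.max_gpus))
        load = self.trace[self.t]
        # Toy latency model: latency grows with per-GPU load (a stand-in, not the real model).
        latency_ms = 50.0 + 0.01 * load / self.gpus
        cost = self.gpu_cost_per_min * self.gpus
        violated = latency_ms > self.sla_latency_ms
        # Reward trades off GPU cost against an SLA-violation penalty.
        reward = -cost - (self.sla_penalty if violated else 0.0)
        self.t += 1
        done = self.t >= len(self.trace)
        info = {"latency_ms": latency_ms, "cost": cost, "sla_violation": bool(violated)}
        return (None if done else self._state()), reward, done, info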
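The training loop below is a compact sketch of tabular Q-learning with epsilon-greedy exploration, per-episode logging, and training-curve plots, reusing the hypothetical ScalingEnv from the sketch above. The synthetic trace, hyperparameters, and episode count are placeholders, not the project's tuned values.

import numpy as np
import matplotlib.pyplot as plt

# Placeholder synthetic tokens-per-minute trace; the project fed traces from llm-load-simulator.
rng = np.random.default_rng(0)
trace = 40_000 + 20_000 * np.sin(np.linspace(0, 20, 1440)) + rng.normal(0, 3_000, 1440)
env = ScalingEnv(np.clip(trace, 0, None))

n_load_bins = len(env.load_bins) + 1
Q = np.zeros((n_load_bins, env.max_gpus + 1, 3))     # state = (load bucket, GPU count), 3 actions
alpha, gamma = 0.1, 0.95
episode_rewards, episode_violations = [], []

for episode in range(300):
    epsilon = max(0.05, 1.0 - episode / 200)         # linearly decaying exploration
    state, done = env.reset(), False
    total_reward, violations = 0.0, 0
    while not done:
        if rng.random() < epsilon:
            action = int(rng.integers(3))                         # explore
        else:
            action = int(np.argmax(Q[state[0], state[1]]))        # exploit
        next_state, reward, done, info = env.step(action)
        # Standard Q-learning update toward the bootstrapped target.
        target = reward if done else reward + gamma * np.max(Q[next_state[0], next_state[1]])
        Q[state[0], state[1], action] += alpha * (target - Q[state[0], state[1], action])
        total_reward += reward
        violations += int(info["sla_violation"])
        state = next_state
    episode_rewards.append(total_reward)
    episode_violations.append(violations)

# Training curves: per-episode reward and SLA violations.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
ax1.plot(episode_rewards); ax1.set_title("Episode reward")
ax2.plot(episode_violations); ax2.set_title("SLA violations per episode")
plt.tight_layout()
plt.show()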
AWS Services Used
None

Technical Tools Used
Python
NumPy
Pandas
Matplotlib
Reinforcement Learning (Q-learning)

Skills Demonstrated
RL modeling
Policy learning
Systems simulation
Cost–latency tradeoffs

Reinforcement Learning Auto-Scaler for LLM Inference
A Q-learning agent that learns GPU scaling policies for LLM workloads using simulated latency, cost, and SLA feedback.
I developed a reinforcement learning (Q-learning) autoscaler that learns when to scale GPU resources up or down for LLM inference, using workload traces from Project 1. The agent optimizes cost and latency in a custom RL environment and demonstrates how intelligent policies can outperform static threshold rules.
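To illustrate the comparison against static rules, a threshold baseline can be written as a simple two-cutoff policy and played against the greedy policy read out of the learned Q-table. The cutoff values below are placeholders, and Q refers to the Q-table from the training sketch earlier.

import numpy as np

# A static threshold rule of the kind the learned policy is compared against.
# Cutoff values are placeholders, not tuned thresholds from the project.
def threshold_policy(load_tokens_per_min, scale_up_at=60_000, scale_down_at=20_000):
    if load_tokens_per_min > scale_up_at:
        return 2   # scale up
    if load_tokens_per_min < scale_down_at:
        return 0   # scale down
    return 1       # keep the same GPU count

# The learned policy is just a greedy lookup in the trained Q-table.
def learned_policy(state, Q):
    return int(np.argmax(Q[state[0], state[1]]))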






