Business Implications
By learning scaling decisions directly from workload patterns, the RL autoscaler reduces GPU cost while avoiding user-visible latency spikes. This approach helps organizations transition from rigid rule-based autoscaling to adaptive, intelligent capacity management, supporting more efficient, reliable, and cost-aware LLM deployments at scale.


Steps Performed
Created a custom RL environment that simulates GPU scaling, latency, cost, and SLA constraints. Trained a Q-learning agent to discover an efficient scaling strategy using workload traces, then visualized the policy’s cost and latency behavior.
1. Designed custom RL environment
Built an environment with state (token load, GPU count), actions (scale down, same, scale up), and rewards based on cost savings and SLA-compliant latency (see the environment sketch after this list).
2. Integrated workload traces
Connected synthetic workload traces from Project 1 (llm-load-simulator), feeding tokens-per-minute data into the RL environment to simulate realistic inference conditions.
3. Implemented Q-learning agent
Created a tabular Q-learning model with epsilon-greedy exploration to iteratively improve scaling decisions across hundreds of training episodes (see the training sketch after this list).
4. Tracked training performance
Logged per-episode reward, SLA violations, GPU usage, and overall behavior to analyze how the agent learned cost-efficient, SLA-safe strategies.
5. Visualized learned policy
Generated plots of training curves, GPU scaling behavior, latency vs. SLA thresholds, and cost per minute, demonstrating the quality of the RL agent's decision-making.
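To make the environment design concrete, here is a minimal sketch of such a scaling environment. The class name ScalingEnv, the toy latency model, and all constants (GPU cost per minute, SLA threshold, penalty, bin counts) are illustrative assumptions, not the project's actual values.

import numpy as np

# A minimal sketch of the custom scaling environment (illustrative, not the project's exact model).
class ScalingEnv:
    """State: (binned token load, GPU count). Actions: 0 = scale down, 1 = keep same, 2 = scale up."""

    def __init__(self, trace_tokens_per_min, min_gpus=1, max_gpus=8,
                 gpu_cost_per_min=0.05, sla_latency_ms=500.0, sla_penalty=2.0):
        self.trace = np.asarray(trace_tokens_per_min, dtype=float)
        self.min_gpus, self.max_gpus = min_gpus, max_gpus
        self.gpu_cost_per_min = gpu_cost_per_min
        self.sla_latency_ms = sla_latency_ms
        self.sla_penalty = sla_penalty
        # Discretize token load into 10 buckets so a tabular Q-table stays small.
        self.load_bins = np.linspace(self.trace.min(), self.trace.max(), 11)[1:-1]

    def reset(self):
        self.t = 0
        self.gpus = self.min_gpus
        return self._state()

    def _state(self):
        load_bucket = int(np.digitize(self.trace[self.t], self.load_bins))
        return (load_bucket, self.gpus)

    def step(self, action):
        # Apply the scaling action, clipped to the allowed GPU range.
        self.gpus = int(np.clip(self.gpus + (action - 1), self.min_gpus, self.max_gpus))
        load = self.trace[self.t]
        # Toy latency model: latency grows with per-GPU load (a stand-in, not the real model).
        latency_ms = 50.0 + 0.01 * load / self.gpus
        cost = self.gpu_cost_per_min * self.gpus
        violated = latency_ms > self.sla_latency_ms
        # Reward trades off GPU cost against an SLA-violation penalty.
        reward = -cost - (self.sla_penalty if violated else 0.0)
        self.t += 1
        done = self.t >= len(self.trace)
        info = {"latency_ms": latency_ms, "cost": cost, "sla_violation": bool(violated)}
        return (None if done else self._state()), reward, done, info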
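The training loop below is a compact sketch of tabular Q-learning with epsilon-greedy exploration, per-episode logging, and training-curve plots, reusing the hypothetical ScalingEnv from the sketch above. The synthetic trace, hyperparameters, and episode count are placeholders, not the project's tuned values.

import numpy as np
import matplotlib.pyplot as plt

# Placeholder synthetic tokens-per-minute trace; the project fed traces from llm-load-simulator.
rng = np.random.default_rng(0)
trace = 40_000 + 20_000 * np.sin(np.linspace(0, 20, 1440)) + rng.normal(0, 3_000, 1440)
env = ScalingEnv(np.clip(trace, 0, None))

n_load_bins = len(env.load_bins) + 1
Q = np.zeros((n_load_bins, env.max_gpus + 1, 3))     # state = (load bucket, GPU count), 3 actions
alpha, gamma = 0.1, 0.95
episode_rewards, episode_violations = [], []

for episode in range(300):
    epsilon = max(0.05, 1.0 - episode / 200)         # linearly decaying exploration
    state, done = env.reset(), False
    total_reward, violations = 0.0, 0
    while not done:
        if rng.random() < epsilon:
            action = int(rng.integers(3))                         # explore
        else:
            action = int(np.argmax(Q[state[0], state[1]]))        # exploit
        next_state, reward, done, info = env.step(action)
        # Standard Q-learning update toward the bootstrapped target.
        target = reward if done else reward + gamma * np.max(Q[next_state[0], next_state[1]])
        Q[state[0], state[1], action] += alpha * (target - Q[state[0], state[1], action])
        total_reward += reward
        violations += int(info["sla_violation"])
        state = next_state
    episode_rewards.append(total_reward)
    episode_violations.append(violations)

# Training curves: per-episode reward and SLA violations.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
ax1.plot(episode_rewards); ax1.set_title("Episode reward")
ax2.plot(episode_violations); ax2.set_title("SLA violations per episode")
plt.tight_layout()
plt.show()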
AWS Services Used
None

Technical Tools Used
Python
NumPy
Pandas
Matplotlib
Reinforcement Learning (Q-learning)

Skills Demonstrated
RL modeling
Policy learning
Systems simulation
Cost–latency tradeoffs

Reinforcement Learning Auto-Scaler for LLM Inference
A Q-learning agent that learns GPU scaling policies for LLM workloads using simulated latency, cost, and SLA feedback.
I developed a reinforcement learning (Q-learning) autoscaler that learns when to scale GPU resources up or down for LLM inference, using workload traces from Project 1. The agent optimizes cost and latency in a custom RL environment and demonstrates how intelligent policies can outperform static threshold rules.
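To illustrate the comparison against static rules, a threshold baseline can be written as a simple two-cutoff policy and played against the greedy policy read out of the learned Q-table. The cutoff values below are placeholders, and Q refers to the Q-table from the training sketch earlier.

import numpy as np

# A static threshold rule of the kind the learned policy is compared against.
# Cutoff values are placeholders, not tuned thresholds from the project.
def threshold_policy(load_tokens_per_min, scale_up_at=60_000, scale_down_at=20_000):
    if load_tokens_per_min > scale_up_at:
        return 2   # scale up
    if load_tokens_per_min < scale_down_at:
        return 0   # scale down
    return 1       # keep the same GPU count

# The learned policy is just a greedy lookup in the trained Q-table.
def learned_policy(state, Q):
    return int(np.argmax(Q[state[0], state[1]]))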






