
Reinforcement Learning Auto-Scaler for LLM Inference

A Q-learning agent that learns GPU scaling policies for LLM workloads using simulated latency, cost, and SLA feedback.

I developed a reinforcement learning (Q-learning) autoscaler that learns when to scale GPU resources up or down for LLM inference, using workload traces from Project 1. The agent optimizes cost and latency in a custom RL environment and demonstrates how intelligent policies can outperform static threshold rules.

Business Implications

By learning scaling decisions directly from workload patterns, the RL autoscaler reduces GPU cost while avoiding user-visible latency spikes. This approach helps organizations transition from rigid rule-based autoscaling to adaptive, intelligent capacity management, supporting more efficient, reliable, and cost-aware LLM deployments at scale.

Final Outcome

The RL policy learns a cost-efficient, SLA-compliant GPU scaling strategy.

Steps Performed

Created a custom RL environment that simulates GPU scaling, latency, cost, and SLA constraints. Trained a Q-learning agent to discover an efficient scaling strategy using workload traces, then visualized the policy’s cost and latency behavior.

1.

Designed custom RL environment

Built an environment with state (token load, GPU count), actions (scale down, hold, scale up), and rewards that favor cost savings while penalizing SLA-violating latency.
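
A minimal sketch of what such an environment might look like in Python. The class name, load discretization, latency model, and reward constants below are illustrative assumptions, not the project's actual implementation.

import numpy as np

class GPUScalingEnv:
    """Toy GPU-autoscaling environment: state = (load bucket, GPU count)."""

    ACTIONS = (-1, 0, 1)  # scale down, hold, scale up

    def __init__(self, trace, max_gpus=8, sla_ms=500.0,
                 gpu_cost_per_min=1.0, sla_penalty=10.0):
        self.trace = np.asarray(trace, dtype=float)  # tokens per minute
        self.max_gpus = max_gpus
        self.sla_ms = sla_ms
        self.gpu_cost_per_min = gpu_cost_per_min
        self.sla_penalty = sla_penalty
        self.reset()

    def reset(self):
        self.t = 0
        self.gpus = 1
        return self._state()

    def _state(self):
        # Discretize load into coarse buckets so a tabular agent can handle it.
        load_bucket = int(self.trace[self.t] // 10_000)
        return (load_bucket, self.gpus)

    def _latency_ms(self, load, gpus):
        # Stand-in latency model: a fixed base plus a term that grows with
        # load per GPU. The real environment can plug in anything here.
        return 50.0 + 0.01 * load / gpus

    def step(self, action_idx):
        self.gpus = int(np.clip(self.gpus + self.ACTIONS[action_idx],
                                1, self.max_gpus))
        load = self.trace[self.t]
        latency = self._latency_ms(load, self.gpus)
        # Reward: pay for every GPU-minute, plus a penalty per SLA breach.
        reward = -self.gpu_cost_per_min * self.gpus
        if latency > self.sla_ms:
            reward -= self.sla_penalty
        info = {"gpus": self.gpus, "latency_ms": latency}
        self.t += 1
        done = self.t >= len(self.trace)
        next_state = (0, self.gpus) if done else self._state()
        return next_state, reward, done, info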

2.

Integrated workload traces

Connected synthetic workloads from Project 1 (llm-load-simulator), feeding tokens-per-minute data into the RL environment to simulate realistic inference conditions.
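
Loading a trace could look like this; the file name and column name are hypothetical, since the simulator's output format isn't shown here.

import pandas as pd

# Hypothetical output of the llm-load-simulator project; the file and
# column names are assumptions for illustration.
trace = pd.read_csv("llm_load_trace.csv")["tokens_per_minute"].to_numpy()

env = GPUScalingEnv(trace)  # environment sketched in step 1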

3.

Implemented Q-learning agent

Created a tabular Q-learning model with epsilon-greedy exploration to iteratively improve scaling decisions across hundreds of training episodes.
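
A compact sketch of tabular Q-learning with epsilon-greedy exploration against the environment above. The hyperparameters (learning rate, discount, epsilon schedule) are placeholder values, not the ones tuned for the project.

import random
from collections import defaultdict

import numpy as np

def train(env, episodes=500, alpha=0.1, gamma=0.95,
          eps_start=1.0, eps_end=0.05, eps_decay=0.995):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    n_actions = len(env.ACTIONS)
    Q = defaultdict(lambda: np.zeros(n_actions))
    eps = eps_start
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Explore with probability eps, otherwise act greedily.
            if random.random() < eps:
                action = random.randrange(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done, _ = env.step(action)
            # One-step Q-learning update toward the bootstrapped target.
            target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
        eps = max(eps_end, eps * eps_decay)  # decay exploration each episode
    return Q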

4.

Tracked training performance

Logged per-episode reward, SLA violations, GPU usage, and overall behavior to analyze how the agent learned cost-efficient, SLA-safe strategies.
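
One way to collect those metrics is to replay the greedy policy and record a row per simulated minute, assuming the environment returns an info dict as in the step 1 sketch.

import numpy as np
import pandas as pd

def evaluate(env, Q):
    """Replay the greedy policy once and log per-minute metrics."""
    state, done = env.reset(), False
    records, total_reward = [], 0.0
    while not done:
        action = int(np.argmax(Q[state]))
        state, reward, done, info = env.step(action)
        total_reward += reward
        records.append({
            "gpus": info["gpus"],
            "latency_ms": info["latency_ms"],
            "sla_violation": info["latency_ms"] > env.sla_ms,
            "cost_per_min": info["gpus"] * env.gpu_cost_per_min,
        })
    return total_reward, pd.DataFrame(records)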

5.

Visualized learned policy

Generated plots for training curves, GPU scaling behavior, latency vs SLA thresholds, and cost per minute—demonstrating the RL agent’s decision-making quality.
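
The scaling, latency, and cost plots can come straight from that metrics table; this Matplotlib sketch assumes the train and evaluate helpers defined above.

import matplotlib.pyplot as plt

Q = train(env)
total_reward, metrics = evaluate(env, Q)

fig, axes = plt.subplots(3, 1, figsize=(8, 9), sharex=True)
axes[0].plot(metrics["gpus"], drawstyle="steps-post")
axes[0].set_ylabel("GPUs")
axes[1].plot(metrics["latency_ms"])
axes[1].axhline(env.sla_ms, color="red", linestyle="--", label="SLA threshold")
axes[1].set_ylabel("Latency (ms)")
axes[1].legend()
axes[2].plot(metrics["cost_per_min"])
axes[2].set_ylabel("Cost ($/min)")
axes[2].set_xlabel("Minute")
fig.tight_layout()
plt.show()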

AWS Services Used

None

Technical Tools Used

Python
NumPy
Pandas
Matplotlib
Reinforcement Learning (Q-learning)

Skills Demonstrated

RL modeling
Policy learning
Systems simulation
Cost–latency tradeoffs


Related Projects

CI/CD For Dockerized 2048 Game (Amazon ECS)
Multi-Cloud Weather Tracker with DR (AWS+Azure)
Amazon Polly Text Narrator (Amazon Polly)
Automated Receipt Processing System (Amazon Textract)
AWS Serverless Event Announcement System (AWS Lambda)
Serverless CSV Data Pipeline - ETL (AWS Glue)
