Play 12: Model Serving on AKS
Complexity: High · Status: 🔧 Skeleton
Deploy and serve LLMs on AKS with GPU nodes, vLLM, and auto-scaling.
Host your own models on Kubernetes. AKS with NVIDIA GPU node pools runs vLLM for high-throughput inference; the deployment auto-scales on request queue depth and supports health checks and rolling deployments. Quantized models (GPTQ, AWQ) are supported for cost efficiency, and ACR stores the model containers.
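A minimal sketch of what the vLLM serving Deployment could look like on a GPU node pool. The image name, model path, namespace, and node label below are illustrative placeholders, not values defined by this play:

```yaml
# Hypothetical vLLM Deployment sketch; image, model path, and labels are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      nodeSelector:
        accelerator: nvidia                 # schedule onto the GPU node pool
      containers:
        - name: vllm
          image: myregistry.azurecr.io/vllm-server:latest  # placeholder ACR image
          args:
            - --model=/models/llama         # placeholder model weights path
            - --quantization=awq            # AWQ-quantized weights for cost efficiency
          resources:
            limits:
              nvidia.com/gpu: 1             # one GPU per replica
          readinessProbe:                   # health check used by rolling deployments
            httpGet:
              path: /health                 # vLLM's OpenAI-compatible server exposes /health
              port: 8000
```

The GPU resource limit and node selector keep replicas pinned to GPU nodes, and the readiness probe gates traffic during rollouts.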
Architecture Pattern
GPU cluster, custom model hosting, LLM inference, auto-scaling
Azure Services
- AKS (GPU nodes)
- NVIDIA GPU
- Container Registry (ACR)
- vLLM
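The queue-depth auto-scaling described above could be wired up with KEDA reading a vLLM Prometheus metric; the Prometheus address, threshold, and replica bounds below are illustrative assumptions, not values defined by this play:

```yaml
# Hypothetical KEDA ScaledObject; threshold and replica counts are assumptions.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-queue-scaler
spec:
  scaleTargetRef:
    name: vllm-server                     # the vLLM Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: avg(vllm:num_requests_waiting)  # vLLM's exported queue-depth metric
        threshold: "10"                   # add a replica per ~10 waiting requests
```

Scaling on requests waiting in the queue, rather than CPU, matches how GPU inference actually saturates.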
DevKit (.github Agentic OS)
- agent.md — AKS operations persona
- instructions.md — GPU management guide
- mcp/index.js — GPU validation, cluster health
- plugins/ — vLLM configurator, auto-scaler
TuneKit (AI Config)
- config/aks.json — node pools, GPU config, scaling rules
- config/vllm.json — quantization, batching, max concurrent
- infra/main.bicep — AKS cluster definition
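As a sketch of what the GPU node pool declaration in `infra/main.bicep` might look like; the VM size, counts, and taint are illustrative assumptions, not values defined by this play:

```bicep
// Hypothetical GPU node pool fragment; VM size and scaling bounds are illustrative.
resource gpuPool 'Microsoft.ContainerService/managedClusters/agentPools@2024-02-01' = {
  parent: aksCluster                      // assumes an existing managedClusters resource
  name: 'gpupool'
  properties: {
    vmSize: 'Standard_NC24ads_A100_v4'    // one NVIDIA A100 80 GB per node
    count: 1
    minCount: 1
    maxCount: 4
    enableAutoScaling: true
    nodeLabels: {
      accelerator: 'nvidia'
    }
    nodeTaints: [
      'sku=gpu:NoSchedule'                // keep non-GPU workloads off these nodes
    ]
  }
}
```

Tainting the pool ensures only GPU workloads (which tolerate the taint) land on the expensive nodes.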
Tuning Parameters
- GPU node count
- Quantization level (GPTQ/AWQ)
- Batching params
- Scaling rules
- Model weights path
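When choosing GPU node count and quantization level, a rough weight-memory estimate is a useful starting point. A minimal sketch; the bits-per-parameter figures are the usual rule of thumb (FP16 = 16, GPTQ/AWQ 4-bit = 4), not play-specific numbers, and the estimate ignores KV cache and activation memory:

```python
def weight_memory_gb(num_params_billion: float, bits_per_param: int) -> float:
    """Approximate model weight footprint in GB (excludes KV cache and activations)."""
    bytes_total = num_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

# A 70B-parameter model in FP16 vs. 4-bit GPTQ/AWQ:
fp16 = weight_memory_gb(70, 16)   # 140.0 GB -> needs multiple 80 GB GPUs
int4 = weight_memory_gb(70, 4)    # 35.0 GB  -> weights fit on a single 80 GB GPU
print(fp16, int4)
```

This is why the quantization level interacts directly with GPU node count: dropping to 4-bit can cut the required GPU footprint roughly fourfold.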
Estimated Cost
- Dev/Test: $300–600/mo
- Production: $3K–20K+/mo