FrootAI — AmpliFAI your Agentic Ecosystem


Play 12

Model Serving AKS

Complexity: High · 🔧 Skeleton

Deploy and serve LLMs on AKS with GPU nodes, vLLM, and auto-scaling.

Host your own models on Kubernetes. AKS with NVIDIA GPU node pools runs vLLM for high-throughput inference. Auto-scaling is driven by request queue depth, and the cluster handles health checks and rolling deployments. Quantized models (GPTQ, AWQ) are supported for cost efficiency, and ACR stores the model containers.
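The queue-depth-driven scaling described above can be sketched as a simple replica calculator. This is a minimal sketch: the target depth per replica and the min/max bounds here are illustrative assumptions, not values defined by this play, and in practice this logic would live behind a KEDA scaler or custom metrics adapter.

```python
import math

def desired_replicas(queue_depth: int, current: int,
                     target_depth_per_replica: int = 8,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """Return the replica count needed to keep per-replica queue depth
    near the target, clamped to the node pool's bounds (all defaults
    are assumed, not taken from this play's config)."""
    if queue_depth <= 0:
        # No backlog: keep the current count, still within bounds.
        return max(min_replicas, min(current, max_replicas))
    wanted = math.ceil(queue_depth / target_depth_per_replica)
    return max(min_replicas, min(wanted, max_replicas))
```

Clamping to `max_replicas` matters more than usual here, since each replica pins an expensive GPU node.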

Architecture Pattern

GPU cluster, custom model hosting, LLM inference, auto-scaling

Azure Services

  • AKS (GPU nodes)
  • NVIDIA GPU
  • Container Registry (ACR)
  • vLLM

DevKit (.github Agentic OS)

  • agent.md — AKS operations persona
  • instructions.md — GPU management guide
  • mcp/index.js — GPU validation, cluster health
  • plugins/ — vLLM configurator, auto-scaler
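The GPU-validation check that mcp/index.js is responsible for can be sketched (here in Python for brevity) as a filter over node allocatable resources. The node-status shape follows the Kubernetes API's `nvidia.com/gpu` extended resource; the helper name itself is an illustration, not the actual MCP tool.

```python
def gpu_ready_nodes(nodes: list[dict]) -> list[str]:
    """Return names of nodes advertising at least one allocatable
    nvidia.com/gpu (Kubernetes node-status shape assumed)."""
    ready = []
    for node in nodes:
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes reports extended resources as strings.
        if int(allocatable.get("nvidia.com/gpu", "0")) >= 1:
            ready.append(node["metadata"]["name"])
    return ready
```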

TuneKit (AI Config)

  • config/aks.json — node pools, GPU config, scaling rules
  • config/vllm.json — quantization, batching, max concurrent
  • infra/main.bicep — AKS cluster definition
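A config/vllm.json fragment covering the quantization, batching, and concurrency knobs listed above might be loaded and sanity-checked as below. The key names (`quantization`, `max_num_seqs`, `max_model_len`) mirror common vLLM engine arguments, but the actual TuneKit schema is an assumption here.

```python
import json

# Illustrative shape for config/vllm.json; the real TuneKit file
# may use different keys.
EXAMPLE_VLLM_CONFIG = """
{
  "quantization": "awq",
  "max_num_seqs": 64,
  "max_model_len": 8192
}
"""

def load_vllm_config(raw: str) -> dict:
    """Parse and sanity-check a vLLM config fragment (assumed schema)."""
    cfg = json.loads(raw)
    if cfg.get("quantization") not in ("gptq", "awq", None):
        raise ValueError("unsupported quantization method")
    if cfg.get("max_num_seqs", 1) < 1:
        raise ValueError("max_num_seqs must be >= 1")
    return cfg
```

Validating the fragment at startup keeps a bad quantization setting from surfacing only after a GPU pod has already been scheduled.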

Tuning Parameters

  • GPU node count
  • Quantization level (GPTQ/AWQ)
  • Batching params
  • Scaling rules
  • Model weights path
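Quantization level and GPU node count interact directly: weight memory scales with bits per parameter, so a 4-bit GPTQ/AWQ model needs roughly a quarter of the VRAM of fp16. A rough back-of-envelope calculator (weights only; KV cache and activation overhead are excluded and would be assumptions):

```python
def weight_memory_gib(params_billion: float, bits_per_weight: int) -> float:
    """Approximate GPU memory for model weights alone, in GiB.
    Excludes KV cache and runtime overhead."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 2**30

# A 7B model: ~13 GiB at fp16 vs ~3.3 GiB at 4-bit quantization,
# which is the difference between needing a large GPU SKU or not.
```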

Estimated Cost

Dev/Test

$300–600/mo

Production

$3K–20K+/mo