VAI: Zero-Overhead Model Switching for AI Inference

Description: "Why we treat model weights like ROM, not malloc()"

The Problem

Every time you switch models in a typical inference setup, the runtime has to:

1. Unload weights from GPU memory
2. Load new weights from disk
3. Rebuild execution state
4. Warm up again

Time: 2-10 seconds. Sometimes more.

We kept asking one question:

Why are we reloading data that never changes?

Model weights are immutable. They don't change between inference calls. They don't change between users. They don't change between processes.

In hardware terms, weights are ROM. Configuration data. Firmware.

You don't reprogram ROM on every transaction.

The Reframe

We mapped AI concepts to hardware concepts:

| What ML Calls It | What It Actually Is |
|------------------|---------------------|
| Model weights    | ROM / Firmware      |
| KV cache         | SRAM / Scratchpad   |
| Prompt           | Input transaction   |
| Token generation | Pipeline execution  |
| Model switch     | Context select      |
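
To make the last row concrete, here is a minimal sketch of a model switch as a context select. `WeightRegistry`, `load`, and `select` are hypothetical names chosen for illustration, not VAI's actual API, and the byte strings stand in for real weights.

```python
from dataclasses import dataclass, field


@dataclass
class WeightRegistry:
    """Hypothetical registry: weights are loaded once, then only selected."""
    resident: dict = field(default_factory=dict)  # model name -> weights
    loads: int = 0                                # number of real loads performed

    def load(self, name: str, weights: bytes) -> None:
        # Slow path, paid once per model (stands in for a disk-to-GPU transfer).
        if name not in self.resident:
            self.resident[name] = weights
            self.loads += 1

    def select(self, name: str) -> bytes:
        # "Context select": a lookup, with no unload and no reload.
        return self.resident[name]


registry = WeightRegistry()
registry.load("model-a", b"\x01" * 8)
registry.load("model-b", b"\x02" * 8)

# Alternate between the two models ten times.
for i in range(10):
    weights = registry.select("model-a" if i % 2 == 0 else "model-b")

print(registry.loads)  # 2
```

Ten alternating requests trigger only the two initial loads; every switch after that is a dictionary lookup.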

Through this lens, a different architecture suggests itself.

What We Built

VAI (Virtual AI Inference) is a shared-memory architecture for AI inference.

Core principle: Model weights live longer than processes.

┌───────────┐  ┌───────────┐  ┌───────────┐
│ Client A  │  │ Client B  │  │ Client N  │
│ (Inference│  │ (Inference│  │ (Inference│
│  Request) │  │  Request) │  │  Request) │
└─────┬─────┘  └─────┬─────┘  └─────┬─────┘
      │              │              │
      └──────────────┼──────────────┘
                     │
                     ▼
          ┌─────────────────────┐
          │     VAI Daemon      │
          │  Memory Owner       │
          │  GPU Owner          │
          │  Weight Registry    │
          └──────────┬──────────┘
                     │
      ┌──────────────┼──────────────┐
      │              │              │
      ▼              ▼              ▼
┌──────────┐  ┌──────────┐  ┌──────────┐
│ Model A  │  │ Model B  │  │ Model N  │
│  (SHM)   │  │  (SHM)   │  │  (SHM)   │
└──────────┘  └──────────┘  └──────────┘
                     │
                     ▼
              ┌───────────┐
              │    GPU    │
              │ All Models│
              │ Resident  │
              └───────────┘

Implementation:

  • POSIX shared memory for weight registry

  • mmap for zero-copy visibility

  • Daemon owns memory and GPU resources

  • Stateless clients attach/detach without touching weights
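
The attach/detach pattern in the list above can be sketched with Python's standard `multiprocessing.shared_memory` module, which is backed by POSIX shared memory and `mmap` on POSIX systems. This is an illustration under assumptions, not VAI's code: the 16-byte payload stands in for real weights, and the "daemon" and "client" run in one process here so the example stays self-contained.

```python
from multiprocessing import shared_memory

# "Daemon" side: create a named segment and place the weights in it once.
weights = bytes(range(16))  # stand-in for real model weights
owner = shared_memory.SharedMemory(create=True, size=len(weights))
owner.buf[:len(weights)] = weights

# "Client" side: attach by name. In practice this would be a separate process.
client = shared_memory.SharedMemory(name=owner.name)
view = client.buf[:len(weights)]  # memoryview slice: no copy of the bytes
first_bytes = bytes(view[:4])

# A write through the owner is visible through the client's view, which shows
# that both handles refer to the same memory rather than to separate copies.
owner.buf[0] = 255
shared_ok = view[0] == 255

# Detach: release the view and close the client handle. The segment itself
# lives until the owner unlinks it.
view.release()
client.close()
owner.close()
owner.unlink()

print(first_bytes, shared_ok)  # b'\x00\x01\x02\x03' True
```

The client never loads or copies the weights; it only maps memory that the owner already populated, and closing the client leaves the segment intact for the next one.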

The Proof (Real Numbers)

Two models loaded simultaneously:

╔═══════════════════════════════════════════════════════════════╗
║         WROM SHARED MEMORY STATUS                             ║
╠═══════════════════════════════════════════════════════════════╣
║ Daemon PID    : Active                                        ║
║ Models Loaded : 2                                             ║
╠═══════════════════════════════════════════════════════════════╣
║ 1. qwen2-1.5b-q4                                              ║
║    Loaded: Y  GPU: Y  Time: 767 ms                            ║
║    Vocab: 151936  Layers: 28   Embd: 1536                     ║
║ 2. deepseek-coder-6.7b-instruct.Q4_K_M                        ║
║    Loaded: Y  GPU: Y  Time: 525 ms                            ║
║    Vocab: 32256   Layers: 32   Embd: 4096                     ║
╚═══════════════════════════════════════════════════════════════╝

Alternating inference between them:

[ 1] qwen2-1.5b     → "Write Verilog AND gate"    ✓ first=40ms
[ 2] deepseek-6.7b  → "Explain flip-flop"         ✓ first=630ms
[ 3] qwen2-1.5b     → "Write UART module"         ✓ first=13ms
[ 4] deepseek-6.7b  → "What is CDC"               ✓ first=235ms
[ 5] qwen2-1.5b     → "Write FSM"                 ✓ first=10ms
[ 6] deepseek-6.7b  → "Write Verilog AND gate"    ✓ first=325ms
[ 7] qwen2-1.5b     → "Explain flip-flop"         ✓ first=13ms
[ 8] deepseek-6.7b  → "Write UART module"         ✓ first=225ms
[ 9] qwen2-1.5b     → "What is CDC"               ✓ first=10ms
[10] deepseek-6.7b  → "Write FSM"                 ✓ first=178ms

Results

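| #  | Model         | First Token | Total   | TPS   |
|----|---------------|-------------|---------|-------|
| 1  | qwen2-1.5b    | 40 ms       | 239 ms  | 160.8 |
| 2  | deepseek-6.7b | 630 ms      | 3371 ms | 11.7  |
| 3  | qwen2-1.5b    | 13 ms       | 211 ms  | 161.6 |
| 4  | deepseek-6.7b | 235 ms      | 2750 ms | 12.7  |
| 5  | qwen2-1.5b    | 10 ms       | 206 ms  | 163.3 |
| 6  | deepseek-6.7b | 325 ms      | 2844 ms | 12.7  |
| 7  | qwen2-1.5b    | 13 ms       | 210 ms  | 162.4 |
| 8  | deepseek-6.7b | 225 ms      | 2771 ms | 12.6  |
| 9  | qwen2-1.5b    | 10 ms       | 208 ms  | 161.6 |
| 10 | deepseek-6.7b | 178 ms      | 2689 ms | 12.7  |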

Averages:

  • qwen2-1.5b: 17 ms first token

  • deepseek-6.7b: 319 ms first token

  • Model switch overhead: ~0 ms

The latency difference is pure compute: the 6.7B model is larger. Alternating between the two adds no switching penalty.
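
The averages can be recomputed from the ten first-token measurements in the results table. The short script below does that, and also reports the mean with the first request to each model left out, since that request is the slowest in both series. The second figure is a derived number, not one reported in the original run.

```python
from statistics import mean

# First-token latencies in ms, copied from the results table.
first_token_ms = {
    "qwen2-1.5b": [40, 13, 10, 13, 10],
    "deepseek-6.7b": [630, 235, 325, 225, 178],
}

# Mean over all five requests per model.
averages = {model: mean(values) for model, values in first_token_ms.items()}

# Mean with the first (slowest) request to each model excluded.
steady = {model: mean(values[1:]) for model, values in first_token_ms.items()}

for model in first_token_ms:
    print(f"{model}: all={averages[model]} ms, after first={steady[model]} ms")
```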

| Approach                      | Model Switch Time |
|-------------------------------|-------------------|
| Traditional (unload + reload) | 2-10 sec          |
| VAI SHM                       | ~0 ms             |
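
As a rough worked example of what that difference means for the run above, the script below adds up the measured request times and compares them with the reload cost the same sequence would incur at 2-10 seconds per switch. The reload range is the estimate quoted in this article, not a measurement from this run, and counting nine switches assumes the first model is already loaded.

```python
# Total request times in ms, copied from the results table.
total_ms = [239, 3371, 211, 2750, 206, 2844, 210, 2771, 208, 2689]

# Ten alternating requests imply nine model switches after the first request.
switches = len(total_ms) - 1

# Estimated cost of a traditional unload + reload, in seconds (from the article).
reload_low_s, reload_high_s = 2, 10

inference_s = sum(total_ms) / 1000
overhead_low_s = switches * reload_low_s
overhead_high_s = switches * reload_high_s

print(f"inference time: {inference_s:.1f} s")
print(f"hypothetical reload overhead: {overhead_low_s}-{overhead_high_s} s")
```

Under these assumptions, reloading on every switch would add 18-90 seconds to roughly 15.5 seconds of actual inference.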

Why This Matters for Hardware

The abstractions map directly to silicon:

For RTL Engineers: VAI defines a stable contract between weights, context, and compute. This maps to SRAM blocks, weight banks, and MAC arrays. You can design hardware against it.

For DV Engineers: Deterministic behavior. No hidden reload states. Clear lifecycle boundaries. Inference becomes verifiable.

For Physical Design: Model residency means predictable memory footprint. Clear partitioning of static vs dynamic memory. This is how you floorplan an NPU.

Why We Built This

We needed it for our own work — hardware development assisted by AI.

When debugging RTL, you might want:

  • Fast model for quick syntax questions

  • Deeper model for architectural reasoning

  • Specialized model for verification

  • Another for documentation

Waiting 5 seconds on every switch defeats the purpose.


Built by hardware engineers who got tired of waiting.