VAI: Zero-Overhead Model Switching for AI Inference

Description: "Why we treat model weights like ROM, not malloc()"

The Problem

A typical inference setup does all of this every time you switch models:

1. Unload weights from GPU memory
2. Load new weights from disk
3. Rebuild execution state
4. Warm up again

Time: 2-10 seconds. Sometimes more.

We kept asking one question:

Why are we reloading data that never changes?

Model weights are immutable. They don't change between inference calls. They don't change between users. They don't change between processes.

In hardware terms, weights are ROM. Configuration data. Firmware.

You don't reprogram ROM on every transaction.

The Reframe

We mapped AI concepts to hardware concepts:

| What ML Calls It | What It Actually Is |
| --- | --- |
| Model weights | `ROM / Firmware` |
| KV cache | `SRAM / Scratchpad` |
| Prompt | `Input transaction` |
| Token generation | `Pipeline execution` |
| Model switch | `Context select` |

Through this lens, a different architecture suggests itself.

What We Built

VAI (Virtual AI Inference) is a shared-memory architecture for AI inference.

Core principle: Model weights live longer than processes.

┌───────────┐  ┌───────────┐  ┌───────────┐
│ Client A  │  │ Client B  │  │ Client N  │
│ (Inference│  │ (Inference│  │ (Inference│
│  Request) │  │  Request) │  │  Request) │
└─────┬─────┘  └─────┬─────┘  └─────┬─────┘
      │              │              │
      └──────────────┼──────────────┘
                     │
                     ▼
          ┌─────────────────────┐
          │     VAI Daemon      │
          │  Memory Owner       │
          │  GPU Owner          │
          │  Weight Registry    │
          └──────────┬──────────┘
                     │
      ┌──────────────┼──────────────┐
      │              │              │
      ▼              ▼              ▼
┌──────────┐  ┌──────────┐  ┌──────────┐
│ Model A  │  │ Model B  │  │ Model N  │
│  (SHM)   │  │  (SHM)   │  │  (SHM)   │
└──────────┘  └──────────┘  └──────────┘
                     │
                     ▼
              ┌───────────┐
              │    GPU    │
              │ All Models│
              │ Resident  │
              └───────────┘
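
One way to read the diagram: the daemon loads each model once and keeps it resident, so a "model switch" becomes a lookup in its registry rather than a reload. The sketch below is a toy illustration of that idea, not VAI's actual code. `WeightRegistry`, `register`, `select`, and `fake_load` are hypothetical names, and the loader is a stand-in that returns a few bytes instead of real weights.

```python
from typing import Callable, Dict


class WeightRegistry:
    """Toy registry: each model is loaded once, then selected by name."""

    def __init__(self, load_fn: Callable[[str], bytes]) -> None:
        self._load_fn = load_fn              # hypothetical loader (disk -> memory)
        self._resident: Dict[str, bytes] = {}
        self.load_count = 0                  # number of real loads performed

    def register(self, name: str) -> None:
        """Load a model's weights once and keep them resident."""
        if name not in self._resident:
            self._resident[name] = self._load_fn(name)
            self.load_count += 1

    def select(self, name: str) -> bytes:
        """'Context select': return resident weights without reloading."""
        return self._resident[name]


def fake_load(name: str) -> bytes:
    # Stand-in for reading a weight file; real weights would be gigabytes.
    return name.encode() * 4


registry = WeightRegistry(fake_load)
for model in ("model-a", "model-b"):
    registry.register(model)

# Alternate between the two models ten times; no further loads are triggered.
for i in range(10):
    weights = registry.select("model-a" if i % 2 == 0 else "model-b")

print(registry.load_count)
```

The point of the sketch is the load counter: it stays at the number of registered models no matter how often requests alternate between them.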

Implementation:

- POSIX shared memory for weight registry
- mmap for zero-copy visibility
- Daemon owns memory and GPU resources
- Stateless clients attach/detach without touching weights
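
The attach/detach pattern in the list above can be sketched with Python's standard `multiprocessing.shared_memory` module, which on POSIX systems is backed by `shm_open` and `mmap`. This is a minimal single-process illustration under stated assumptions, not VAI's implementation: the "daemon" and "client" are just two handles in one script, the weight data is a small placeholder, and the read-only view is enforced at the Python level only.

```python
from multiprocessing import shared_memory

# "Daemon" side: create a named segment and copy the weights in once.
weights = bytes(range(256)) * 16             # placeholder for real weight data
segment = shared_memory.SharedMemory(create=True, size=len(weights))
segment.buf[:len(weights)] = weights

# "Client" side: attach by name. Attaching maps the existing memory;
# the weight bytes are not copied into the client.
client = shared_memory.SharedMemory(name=segment.name)
view = client.buf.toreadonly()               # read-only memoryview of the segment

first_bytes = bytes(view[:4])                # copy out only the bytes we inspect
matches = view[:len(weights)] == weights     # compare in place, no bulk copy

# Clients should not modify weights; a read-only view rejects writes.
try:
    view[0] = 255
    write_blocked = False
except TypeError:
    write_blocked = True

# Client detaches; the segment itself stays alive until the owner unlinks it.
view.release()
client.close()

# Daemon shutdown: release the mapping and remove the named segment.
segment.close()
segment.unlink()

print(first_bytes, matches, write_blocked)
```

In a real deployment the daemon and clients would be separate processes, and the client would learn the segment name through some agreed channel; how VAI does that is not described here.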

The Proof (Real Numbers)

Two models loaded simultaneously:

╔═══════════════════════════════════════════════════════════════╗
║         WROM SHARED MEMORY STATUS                             ║
╠═══════════════════════════════════════════════════════════════╣
║ Daemon PID    : Active                                        ║
║ Models Loaded : 2                                             ║
╠═══════════════════════════════════════════════════════════════╣
║ 1. qwen2-1.5b-q4                                              ║
║    Loaded: Y  GPU: Y  Time: 767 ms                            ║
║    Vocab: 151936  Layers: 28   Embd: 1536                     ║
║ 2. deepseek-coder-6.7b-instruct.Q4_K_M                        ║
║    Loaded: Y  GPU: Y  Time: 525 ms                            ║
║    Vocab: 32256   Layers: 32   Embd: 4096                     ║
╚═══════════════════════════════════════════════════════════════╝

Alternating inference between them:

[ 1] qwen2-1.5b     → "Write Verilog AND gate"    ✓ first=40ms
[ 2] deepseek-6.7b  → "Explain flip-flop"         ✓ first=630ms
[ 3] qwen2-1.5b     → "Write UART module"         ✓ first=13ms
[ 4] deepseek-6.7b  → "What is CDC"               ✓ first=235ms
[ 5] qwen2-1.5b     → "Write FSM"                 ✓ first=10ms
[ 6] deepseek-6.7b  → "Write Verilog AND gate"    ✓ first=325ms
[ 7] qwen2-1.5b     → "Explain flip-flop"         ✓ first=13ms
[ 8] deepseek-6.7b  → "Write UART module"         ✓ first=225ms
[ 9] qwen2-1.5b     → "What is CDC"               ✓ first=10ms
[10] deepseek-6.7b  → "Write FSM"                 ✓ first=178ms

Results

| # | Model | First Token | Total | TPS |
| --- | --- | --- | --- | --- |
| 1 | qwen2-1.5b | 40 ms | 239 ms | 160.8 |
| 2 | deepseek-6.7b | 630 ms | 3371 ms | 11.7 |
| 3 | qwen2-1.5b | 13 ms | 211 ms | 161.6 |
| 4 | deepseek-6.7b | 235 ms | 2750 ms | 12.7 |
| 5 | qwen2-1.5b | 10 ms | 206 ms | 163.3 |
| 6 | deepseek-6.7b | 325 ms | 2844 ms | 12.7 |
| 7 | qwen2-1.5b | 13 ms | 210 ms | 162.4 |
| 8 | deepseek-6.7b | 225 ms | 2771 ms | 12.6 |
| 9 | qwen2-1.5b | 10 ms | 208 ms | 161.6 |
| 10 | deepseek-6.7b | 178 ms | 2689 ms | 12.7 |

Averages:

- qwen2-1.5b: 17 ms first token
- deepseek-6.7b: 319 ms first token
- Model switch overhead: ~0 ms

The latency gap between the two models is pure compute: the 6.7B model is larger. Switching itself adds no measurable penalty.
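
The per-model averages can be reproduced from the first-token column of the results table. The snippet below is only a convenience check; the numbers are copied from the table, and the variable names are illustrative.

```python
from statistics import mean

# First-token latencies in milliseconds, copied from the results table.
first_token_ms = {
    "qwen2-1.5b": [40, 13, 10, 13, 10],
    "deepseek-6.7b": [630, 235, 325, 225, 178],
}

# Arithmetic mean per model.
averages = {model: mean(samples) for model, samples in first_token_ms.items()}

for model, avg in averages.items():
    print(f"{model}: {avg:.1f} ms")
```

Unrounded, the means come out at 17.2 ms for qwen2-1.5b and 318.6 ms for deepseek-6.7b.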

| Approach | Model Switch Time |
| --- | --- |
| Traditional (unload + reload) | 2-10 sec |
| VAI SHM | ~0 ms |

Why This Matters for Hardware

The abstractions map directly to silicon:

For RTL Engineers: VAI defines a stable contract between weights, context, and compute. This maps to SRAM blocks, weight banks, and MAC arrays. You can design hardware against it.

For DV Engineers: Deterministic behavior. No hidden reload states. Clear lifecycle boundaries. Inference becomes verifiable.

For Physical Design: Model residency means predictable memory footprint. Clear partitioning of static vs dynamic memory. This is how you floorplan an NPU.

Why We Built This

We needed it for our own work — hardware development assisted by AI.

When debugging RTL, you might want:

- Fast model for quick syntax questions
- Deeper model for architectural reasoning
- Specialized model for verification
- Another for documentation

Waiting 5 seconds on every switch between them defeats the purpose.
