VAI: Zero-Overhead Model Switching for AI Inference
Description: "Why we treat model weights like ROM, not malloc()"
The Problem
Every time you switch models in a typical inference setup:
1. Unload weights from GPU memory
2. Load new weights from disk
3. Rebuild execution state
4. Warm up again
Time: 2-10 seconds. Sometimes more.
We kept asking one question:
Why are we reloading data that never changes?
Model weights are immutable. They don't change between inference calls. They don't change between users. They don't change between processes.
In hardware terms, weights are ROM. Configuration data. Firmware.
You don't reprogram ROM on every transaction.
The Reframe
We mapped AI concepts to hardware concepts:
| What ML Calls It | What It Actually Is |
|---|---|
| Model weights | ROM / Firmware |
| KV cache | SRAM / Scratchpad |
| Prompt | Input transaction |
| Token generation | Pipeline execution |
| Model switch | Context select |
Through this lens, a different architecture suggests itself.
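One way to read the table in code: weights are a frozen object created once, the KV cache is mutable per-model scratch state, and a model switch is a lookup among contexts that are already resident. The sketch below is illustrative only. `Weights`, `Context`, and `Engine` are hypothetical names, not part of VAI, and the byte strings stand in for real tensors.

```python
from dataclasses import dataclass, field
from types import MappingProxyType


@dataclass(frozen=True)
class Weights:
    """ROM: written once at load time, never mutated afterwards."""
    name: str
    tensors: MappingProxyType  # read-only view of tensor name -> bytes


@dataclass
class Context:
    """One resident model: immutable weights plus mutable scratch state."""
    weights: Weights
    kv_cache: list = field(default_factory=list)  # SRAM / scratchpad


class Engine:
    """Hypothetical holder of all resident contexts."""

    def __init__(self):
        self.contexts = {}
        self.active = None

    def load(self, name, tensors):
        # The expensive step: happens once per model, not once per switch.
        rom = Weights(name, MappingProxyType(dict(tensors)))
        self.contexts[name] = Context(rom)

    def select(self, name):
        # Model switch == context select: a dictionary lookup, no reload.
        self.active = self.contexts[name]
        return self.active


engine = Engine()
engine.load("model-a", {"embed": b"\x00" * 8})
engine.load("model-b", {"embed": b"\x01" * 8})

a = engine.select("model-a")
a.kv_cache.append("token-0")       # scratch state is writable
b = engine.select("model-b")
again = engine.select("model-a")   # switching back reuses the same objects
```

Switching back to `model-a` returns the same `Context` object, so nothing about the weights is rebuilt, and any attempt to write into `tensors` raises `TypeError`.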
What We Built
VAI (Virtual AI Inference) is a shared-memory architecture for AI inference.
Core principle: Model weights live longer than processes.
┌───────────┐  ┌───────────┐  ┌───────────┐
│ Client A  │  │ Client B  │  │ Client N  │
│ (Inference│  │ (Inference│  │ (Inference│
│  Request) │  │  Request) │  │  Request) │
└─────┬─────┘  └─────┬─────┘  └─────┬─────┘
      │              │              │
      └──────────────┼──────────────┘
                     │
                     ▼
          ┌─────────────────────┐
          │     VAI Daemon      │
          │    Memory Owner     │
          │      GPU Owner      │
          │   Weight Registry   │
          └──────────┬──────────┘
                     │
      ┌──────────────┼──────────────┐
      │              │              │
      ▼              ▼              ▼
 ┌──────────┐   ┌──────────┐   ┌──────────┐
 │ Model A  │   │ Model B  │   │ Model N  │
 │  (SHM)   │   │  (SHM)   │   │  (SHM)   │
 └──────────┘   └──────────┘   └──────────┘
                     │
                     ▼
               ┌───────────┐
               │    GPU    │
               │ All Models│
               │  Resident │
               └───────────┘
Implementation:
- POSIX shared memory for weight registry
- mmap for zero-copy visibility
- Daemon owns memory and GPU resources
- Stateless clients attach/detach without touching weights
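The attach/detach pattern can be sketched with Python's standard `multiprocessing.shared_memory` module, which on POSIX systems is backed by `shm_open` and `mmap`. This is a minimal illustration under stated assumptions, not VAI's code: the segment name, `daemon_publish`, and `client_read` are hypothetical, the "weights" are placeholder bytes, and GPU residency is not modeled at all.

```python
import os
from multiprocessing import shared_memory


def daemon_publish(name: str, weights: bytes) -> shared_memory.SharedMemory:
    """Daemon side: create a named segment and copy the weights in once."""
    seg = shared_memory.SharedMemory(name=name, create=True, size=len(weights))
    seg.buf[:len(weights)] = weights
    return seg  # the owner keeps this handle open for the segment's lifetime


def client_read(name: str, offset: int, length: int) -> bytes:
    """Client side: attach by name, read a slice, detach. Nothing is reloaded."""
    seg = shared_memory.SharedMemory(name=name)  # maps the existing segment
    try:
        return bytes(seg.buf[offset:offset + length])
    finally:
        seg.close()  # detach only; the segment and its contents stay put


# Hypothetical segment name; the PID suffix avoids clashes between runs.
name = f"vai_demo_{os.getpid()}"
weights = bytes(range(256)) * 16   # stand-in for a real weight file

segment = daemon_publish(name, weights)
try:
    first = client_read(name, 0, 4)       # one "client"
    second = client_read(name, 252, 8)    # a later "client", same memory
finally:
    segment.close()
    segment.unlink()  # only the owner removes the segment
```

Each `client_read` call maps the existing segment rather than loading anything from disk, and closing the client handle leaves the data in place for the next caller. In a real deployment the daemon and clients would be separate processes; they are collapsed into one here to keep the example short.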
The Proof (Real Numbers)
Two models loaded simultaneously:
╔═══════════════════════════════════════════════════════════════╗
║ WROM SHARED MEMORY STATUS ║
╠═══════════════════════════════════════════════════════════════╣
║ Daemon PID : Active ║
║ Models Loaded : 2 ║
╠═══════════════════════════════════════════════════════════════╣
║ 1. qwen2-1.5b-q4 ║
║ Loaded: Y GPU: Y Time: 767 ms ║
║ Vocab: 151936 Layers: 28 Embd: 1536 ║
║ 2. deepseek-coder-6.7b-instruct.Q4_K_M ║
║ Loaded: Y GPU: Y Time: 525 ms ║
║ Vocab: 32256 Layers: 32 Embd: 4096 ║
╚═══════════════════════════════════════════════════════════════╝
Alternating inference between them:
[ 1] qwen2-1.5b → "Write Verilog AND gate" ✓ first=40ms
[ 2] deepseek-6.7b → "Explain flip-flop" ✓ first=630ms
[ 3] qwen2-1.5b → "Write UART module" ✓ first=13ms
[ 4] deepseek-6.7b → "What is CDC" ✓ first=235ms
[ 5] qwen2-1.5b → "Write FSM" ✓ first=10ms
[ 6] deepseek-6.7b → "Write Verilog AND gate" ✓ first=325ms
[ 7] qwen2-1.5b → "Explain flip-flop" ✓ first=13ms
[ 8] deepseek-6.7b → "Write UART module" ✓ first=225ms
[ 9] qwen2-1.5b → "What is CDC" ✓ first=10ms
[10] deepseek-6.7b → "Write FSM" ✓ first=178ms
Results
| # | Model | First Token | Total | TPS |
|---|---|---|---|---|
| 1 | qwen2-1.5b | 40 ms | 239 ms | 160.8 |
| 2 | deepseek-6.7b | 630 ms | 3371 ms | 11.7 |
| 3 | qwen2-1.5b | 13 ms | 211 ms | 161.6 |
| 4 | deepseek-6.7b | 235 ms | 2750 ms | 12.7 |
| 5 | qwen2-1.5b | 10 ms | 206 ms | 163.3 |
| 6 | deepseek-6.7b | 325 ms | 2844 ms | 12.7 |
| 7 | qwen2-1.5b | 13 ms | 210 ms | 162.4 |
| 8 | deepseek-6.7b | 225 ms | 2771 ms | 12.6 |
| 9 | qwen2-1.5b | 10 ms | 208 ms | 161.6 |
| 10 | deepseek-6.7b | 178 ms | 2689 ms | 12.7 |
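Averages:
- qwen2-1.5b: 17 ms first token
- deepseek-6.7b: 319 ms first token
- Model switch overhead: ~0 ms
The gap in first-token latency between the two models comes from compute: the 6.7B model has more than four times the parameters of the 1.5B model. It is not a switching penalty. The first request to each model is its slowest (40 ms and 630 ms), and later requests settle lower even though every one of them follows a switch.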
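The averages can be recomputed directly from the results table. The short script below does that, and also reports the mean with each model's first request dropped. Treating that first request as one-time warm-up is an assumption made here for illustration, not something the measurements prove.

```python
from statistics import mean

# First-token latencies in ms, copied from the results table in request order.
first_token_ms = {
    "qwen2-1.5b": [40, 13, 10, 13, 10],
    "deepseek-6.7b": [630, 235, 325, 225, 178],
}

summary = {}
for model, samples in first_token_ms.items():
    overall = mean(samples)
    steady = mean(samples[1:])  # assumption: first request includes warm-up
    summary[model] = (overall, steady)
    print(f"{model}: all={overall:.1f} ms, after first request={steady:.1f} ms")
```

Over all five requests this gives 17.2 ms for qwen2-1.5b and 318.6 ms for deepseek-6.7b.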
| Approach | Model Switch Time |
|---|---|
| Traditional (unload + reload) | 2-10 sec |
| VAI SHM | ~0 ms |
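To put the comparison in context, here is a rough estimate of what the same ten-request run would cost if every model change paid a reload. This is arithmetic on the figures above, not a measurement: it assumes each switch in a traditional setup incurs the full 2-10 second reload with no caching, and it ignores the initial load.

```python
# Measured totals in ms from the results table, in request order.
totals_ms = [239, 3371, 211, 2750, 206, 2844, 210, 2771, 208, 2689]
models = ["qwen2-1.5b", "deepseek-6.7b"] * 5

# Count how often consecutive requests target a different model.
switches = sum(1 for prev, cur in zip(models, models[1:]) if prev != cur)

measured_s = sum(totals_ms) / 1000
reload_low_s, reload_high_s = 2, 10   # the 2-10 s range quoted above

# Hypothetical cost if every switch paid a full reload (an assumption).
est_low_s = measured_s + switches * reload_low_s
est_high_s = measured_s + switches * reload_high_s

print(f"switches={switches}, measured={measured_s:.1f} s")
print(f"estimated with reloads: {est_low_s:.1f}-{est_high_s:.1f} s")
```

The run contains 9 switches and about 15.5 seconds of measured inference, so under these assumptions reloads alone would add between 18 and 90 seconds.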
Why This Matters for Hardware
The abstractions map directly to silicon:
For RTL Engineers: VAI defines a stable contract between weights, context, and compute. This maps to SRAM blocks, weight banks, and MAC arrays. You can design hardware against it.
For DV Engineers: Deterministic behavior. No hidden reload states. Clear lifecycle boundaries. Inference becomes verifiable.
For Physical Design: Model residency means predictable memory footprint. Clear partitioning of static vs dynamic memory. This is how you floorplan an NPU.
Why We Built This
We needed it for our own work — hardware development assisted by AI.
When debugging RTL, you might want:
- Fast model for quick syntax questions
- Deeper model for architectural reasoning
- Specialized model for verification
- Another for documentation
Waiting 5 seconds every time you switch between them defeats the purpose.
Links
VAI Article: https://wiowiz.com/virtual-ai-inference-what-hardware-engineers-see.html
WIOWIZ: https://wiowiz.com
Built by hardware engineers who got tired of waiting.