The easiest way to misunderstand vLLM Semantic Router is to start from the word router.
In a traditional gateway, routing is a forwarding problem. A request arrives, the gateway evaluates policy, picks a backend cluster, and sends the request there. The backend is usually a deterministic service. If it is healthy, the request succeeds. If it fails, the gateway retries, fails over, or returns an error.
LLM systems break that mental model.
The backend is no longer just a service. It is a probabilistic reasoning system. It may need tools. It may need memory. It may need retrieval. It may hallucinate. It may require a verifier. It may run on a GPU pool whose KV cache is warm for one session and cold for another. It may be cheap for short requests and expensive for long-context ones. It may be allowed for one tenant and forbidden for another.
A request is no longer only traffic. It is semantic work.
Routing is the moment where semantic intent becomes infrastructure placement.
That is the starting point of vLLM SR.
The project is not trying to build a prettier model selector. The deeper idea is to turn every LLM request into an explainable capability path: which evidence was observed, which derived facts were produced, which policy matched, which selection or collaboration algorithm ran, which plugins were attached, which backend and hardware path was used, and what trace was left behind.
I want to walk through that design as a system, not as a feature list. We will keep one running example in view, then open the layers one by one: signals, projections, decisions, selection algorithms, looper algorithms, router memory and learning, plugins, Envoy integration, model design, native bindings, hardware support, observability, and evaluation.
A Request Walks Into the Router
Imagine an enterprise AI gateway receives this request:
Analyze this customer contract, compare it with our internal policy, and draft a response. It may contain personal information. Use tools if needed.
If we look only at surface text, this is a contract-analysis request. If we look only at token count, it may be a long-context request. If we look at risk, it touches PII, internal policy, tool access, and factual correctness. If we look at infrastructure, it depends on tenant permissions, pool health, cache warmth, context-window capacity, and model cost.
A naive model picker asks one question:
Which model should answer this prompt?
vLLM SR asks a better question:
What must the system know before it is allowed to spend intelligence?
contract + policy + PII"] --> B["Signals
observe evidence"] B --> C["Projections
derive routing facts"] C --> D["Decisions
match policy"] D --> E["Selection Algorithms
choose one model"] D --> L["Looper Algorithms
choose collaboration path"] E --> M["Router Learning
adapt + protect"] L --> M M --> F["Plugins
attach behavior"] F --> G["Gateway / Backend
execute"] G --> H["Trace / Replay
learn"]
The split is intentional. Each layer owns a different kind of reasoning.
| Layer | It owns | It should not own |
|---|---|---|
| Signals | Observing facts about the request, user, session, content, and environment | Final policy or model choice |
| Projections | Turning noisy evidence into stable routing facts | Gateway mutation or backend calls |
| Decisions | Expressing auditable route policy | Low-level selection or collaboration execution |
| Selection algorithms | Choosing one model from a policy-approved candidate set | Rewriting route eligibility |
| Looper algorithms | Selecting and running a multi-model collaboration path | Pretending to be a single-model selector |
| Router Learning | Adapting model choice and protecting session continuity after base selection | Mutating recipe policy on the request path |
| Plugins | Route-scoped behavior such as cache, RAG, memory, tools, verification, replay | Global hidden middleware |
| Bindings | Fast native ML paths for hot routing components | Policy semantics |
| Gateway integration | Putting semantic policy at the network boundary | Replacing the gateway data plane |
The core design choice is separation of concerns. If a request goes to a frontier model with RAG, tool filtering, and a response verifier, the answer should not be “because the router said so.” The router should be able to show the chain: evidence, projection, policy, algorithm, learning decision, plugins, mutations, backend path, and replay id.
The Full Request Lifecycle
For the contract request, the router first extracts request context: body, headers, tenant identity, session id, conversation id, tool state, prior response markers, and any existing routing metadata.
It then runs only the signals referenced by the active recipe. Domain, PII, context length, authorization, knowledge-base relevance, fact-check need, tool availability, and jailbreak or prompt-injection signals can run independently. The point is not to run every detector on every request; the point is to gather the evidence that the configured policy actually needs.
Projections then compose that evidence into stable route facts: privacy_sensitive, legal_contract_path, long_context, needs_internal_rag, and verification_required. The decision engine matches those facts against route policy. A matched decision exposes candidate models and, depending on its algorithm, either a single-model selection path or a looper collaboration path. Router Learning may then adjust the proposed model or hold the current model if switching would break session continuity. Finally, plugins attach route-scoped behavior before the gateway executes the main traffic path.
headers + body"] E --> R["vLLM SR
request context"] R --> S["Signals
run required evidence"] S --> P["Projections
derive route facts"] P --> D["Decisions
match policy"] D --> A{"Execution mode"} A -- "single model" --> M["Selector"] A -- "collaboration" --> L["Looper"] M --> G["Router Learning
adapt + protect"] L --> G G --> X["Plugins
cache, RAG, tools, safety"] X --> B["Backend pool
or looper call"] B --> T["Trace / replay
response checks"]
The model name is only the visible end of the route. The deeper value is that the system can explain why this was not a cheap direct-answer path. Privacy, internal knowledge, tool exposure, factuality, and context pressure are different concerns. vLLM SR gives each concern a place in the architecture.
The Recipe Is the System Contract
The next design step is easy to miss: vLLM SR is not primarily configured as a collection of ad hoc callbacks. It is configured as a recipe. That recipe is the contract between application intent, routing policy, backend inventory, plugin behavior, learning boundaries, and gateway mutation.
In production, routing policy tends to spread. One part lands in application code, one part in gateway config, one part in prompt templates, one part in a model registry, one part in a dashboard, and one part in a notebook used by the evaluation team. Once that happens, nobody can answer a simple operational question: “Why did this request take this path?”
vLLM SR’s v0.3-style contract tries to keep that answer in one durable shape.
RouterConfig"] --> B["Evidence contract
signals"] B --> C["Fact contract
projections"] C --> D["Policy contract
decisions"] D --> E["Execution contract
modelRefs, algorithm, plugins, adaptations"] E --> F["Runtime contract
backend models, provider profiles, router options"]
| Contract surface | What it owns | Why it belongs in the recipe |
|---|---|---|
| Global runtime | Semantic cache, memory, replay, learning, observability, API listeners, startup behavior | These features are infrastructure state, not per-request improvisation |
routing.signals |
The evidence families that are allowed to run | A decision should only depend on named, reviewable evidence |
routing.projections |
Partitions, scores, and mappings derived from signals | Policy should consume stable facts, not raw detector noise |
routing.decisions |
Named route policies with priority, tier, rules, output contract, modelRefs, algorithms, adaptations, plugins, and emits | A route is an auditable capability bundle |
decision.modelRefs |
The approved candidate boundary for a decision | Selection and learning cannot escape policy by discovering a better-looking model elsewhere |
decision.algorithm |
The execution strategy inside the matched decision | The recipe distinguishes one-model selection from collaboration paths such as Fusion or Flow |
decision.plugins |
Cache, memory, RAG, tools, prompt mutation, response checks, replay, image generation, or fast response | Behavior follows the matched route instead of hiding in global middleware |
decision.adaptations |
Apply, observe, or bypass learning for this policy | Sensitive routes can be protected from online exploration |
| Backend models | Provider profiles, vLLM endpoints, image backends, model metadata, aliases, default model | Logical routing must resolve to real serving pools and provider contracts |
| Router options | Auto model names, body streaming mode, route-cache clearing, skip-processing controls | Gateway behavior and compatibility are explicit operational choices |
That is the operational difference between a router that happens to make good choices and a router that can be trusted. A recipe is diffable. It can be validated. It can be replayed against stored traffic. It can be promoted across environments. It can be patched by offline learning without letting the hot path rewrite production policy.
The rule is conservative on purpose: runtime intelligence may propose better choices, but the recipe defines the legal search space. That is what keeps “adaptive routing” from becoming hidden policy mutation.
Research as Design Pressure
The paper shelf behind vLLM SR is not a citation dump. Each research direction creates pressure on one part of the router. Read the table as a map from problem to architecture.
| Work | Problem | Core idea | Architectural pressure |
|---|---|---|---|
| vLLM Semantic Router: Signal Driven Decision Routing | Single-classification routing cannot express cost, privacy, latency, safety, and multimodal constraints together | Compose heterogeneous signals into route decisions and route-scoped behavior | Establishes signals, projections, decisions, algorithms, and plugins as separate layers |
| Workload-Router-Pool | Routing papers often ignore serving pools, while fleet papers often ignore workload semantics | Close the loop among workload evidence, router policy, and physical pool state | Makes queue, cache, hardware, and pool feedback routing inputs |
| When to Reason | Reasoning models are expensive and not always useful | Detect reasoning need and apply reasoning only when beneficial | Creates complexity signals, reasoning policy, and model-card reasoning capability |
| Category-Aware Semantic Caching | One similarity threshold is unsafe across heterogeneous workloads | Cache thresholds, TTLs, and quotas vary by category | Makes semantic cache a route-scoped plugin, not a global middleware |
| Outcome-Aware Tool Selection | Tool selection cannot rely only on semantic similarity | Refine tool embeddings offline using outcome evidence under latency budgets | Connects tool signals, tool-selection plugins, and replay/outcome data |
| 98x Faster Routing Without Dedicated GPU | A hot-path router cannot take seconds to route | Use prompt compression, Flash Attention, and near-streaming body processing | Pushes native bindings and fast signal runtime |
| Adaptive VLM Routing | Multimodal computer-use steps have very different difficulty | Estimate visual/action difficulty and route cheap or strong VLMs accordingly | Adds modality, visual difficulty, and multimodal safety signals |
| Visual Confused Deputy | Computer-use agents can be attacked through visual perception failures | Check click target and action reasoning independently | Pulls visual safety and action-boundary validation into routing |
| Knowledge Access Beats Model Size | Correct knowledge access can beat a larger model | Use memory and retrieval-grounded routing to recover quality with smaller models | Promotes RAG, memory, KB signals, and knowledge paths to first-class capabilities |
| Fast and Faithful RAG Verification | Retrieval does not guarantee faithful answers | Verify long-document RAG responses in real time | Motivates hallucination and response-verification plugins |
| inference-fleet-sim | Model choice depends on queueing, TTFT, and fleet capacity | Use queueing-grounded simulation for multi-pool planning | Connects router policy to fleet simulation |
| FleetOpt | Minimum-cost pools depend on workload CDF and P99 targets | Derive pool boundaries and deploy them through compress-and-route | Connects token budget, pool boundary, and cost-aware routing |
| 1/W Law | Context window size changes energy efficiency and memory pressure | Analyze context-length routing topology and tokens-per-watt | Makes context routing and long-context pools central |
| Conflict-Free Policy Languages | Probabilistic ML predicates can co-fire silently | Detect and prevent policy conflicts in the DSL | Motivates declarative decisions, priority, confidence, and auditability |
| Cross-Layer Policy Compilation | Policy should not live as scattered gateway, workflow, K8s, and agent code | Compile one declarative source into multiple execution layers | Points toward policy-as-code and cross-layer verification |
| Token-Budget-Aware Pool Routing | Token budget affects KV cache, pool choice, and failure risk | Estimate total token budget and route to short or long pools | Connects context, bytes-per-token, pools, and failure avoidance |
| SIRP and Multi-Provider API | Semantic routing should not require private application protocols | Standardize semantic inference routing and multi-provider surfaces | Reinforces OpenAI-compatible, gateway-compatible control planes |
The pressure from the research shelf is clear: a router cannot be one classifier. If it were, every new paper would become another special case. In vLLM SR, a paper can become a signal, projection, decision primitive, selection algorithm, looper algorithm, plugin, learning signal, model-card field, or pool feedback loop.
Signals: Evidence Before Judgment
Signals are deliberately humble. A signal answers one factual question. It does not choose the model. It does not select a plugin. It does not decide the route.
For the contract example, the router may need to know whether the request is legal or enterprise-related, whether it contains PII, whether the user is authorized for premium or private models, whether it needs internal knowledge, whether tools are available, whether the context is too large for a short-context backend, and whether the response needs verification.
| Signal family | What it observes | Why it matters |
|---|---|---|
| Keyword and lexical | Explicit words, regex, BM25, n-grams, fuzzy anchors | Fast deterministic anchors for compliance, product names, incidents, and explicit task classes |
| Context and structure | Token count, context pressure, prompt shape, JSON/schema/workflow form | Separates short direct paths from long-context, compression, and workflow paths |
| Authz and tenant | User identity, groups, role bindings, tiers, allowlists | Prevents silent access to premium, private, tenant-specific, or sensitive paths |
| Conversation and session | Turn count, tool calls, tool results, active loop state, previous response markers | Protects continuity and avoids switching during non-portable state |
| Domain and KB | Task domain, internal knowledge source, KB relevance | Chooses domain models, RAG sources, and policy constraints |
| Embedding and preference | Semantic similarity, user/task preference, model-description fit | Handles soft semantic routing where exact keywords are insufficient |
| Complexity and repair | Difficulty, uncertainty, repeated dissatisfaction, retry and repair signals | Drives reasoning activation, escalation, and learning |
| Fact-check and safety | Factuality need, jailbreak, prompt injection, PII, response risk | Triggers RAG, verifier, guardrail, local/private route, or block path |
| Modality and event | Text, image, image-generation intent, SRE/SOC event shape | Routes to VLM, image, operational, or event-specific paths |
The maintained runtime surface currently exposes eighteen base signal families: authz, complexity, context, conversation, domain, embedding, fact_check, jailbreak, keyword, language, modality, pii, preference, reask, structure, kb, user_feedback, and event. projection is also a rule condition type, but it is not a raw sensor. It is the decision-visible output of the projection layer.
The runtime behavior is as important as the list of names.
| Runtime behavior | Design implication |
|---|---|
| Used-signal analysis | The classifier builds a map from decisions and projections, then runs only the signal families needed by the active recipe unless forced evaluation is enabled |
| Concurrent dispatch | Independent signal evaluators can run in parallel, keeping the route path from becoming a serial detector chain |
| Readiness checks | A configured signal is only useful if its model, rule, or backend dependency is ready; the router can avoid pretending a missing detector produced evidence |
| Request-scoped caches | Expensive intermediate work, such as image embeddings, can be shared by complexity and embedding signals during one request |
| Trace preservation | Signal output is retained as evidence for projections, decision traces, replay, and learning diagnostics |
The payoff is evidence reuse. A pii signal can influence privacy policy, cache policy, provider selection, and audit policy. A conversation signal can influence router protection, tool filtering, and replay. A context signal can influence long-context pools, compression, and cost-aware selection. If signals directly selected models, that reuse would disappear.
Projections: The Coordination Layer
Signals are often messy. Some are booleans. Some are classifier probabilities. Some are similarity scores. Some are raw metrics such as token count or KB relevance. Production policy should not need to manually combine all of those raw values every time.
Projections turn evidence into stable routing facts.
| Raw evidence | Example value | Why a projection helps |
|---|---|---|
domain=legal |
confidence 0.82 | Domain can overlap with finance, support, or security; policy needs a stable partition |
pii=present |
confidence 0.91 | Privacy should not depend on one detector alone |
context_tokens |
42K | Token count matters relative to model window and pool state |
fact_check=needed |
confidence 0.76 | Factuality should combine with domain and knowledge availability |
authz=premium |
matched | Authorization is a hard gate, not a difficulty signal |
kb=internal_policy |
score 0.68 | KB relevance is retrieval evidence, not a complete route |
| Projection pattern | Logic | Example |
|---|---|---|
| Partition | Pick one winner among competing semantic candidates | Choose legal_contract_path over finance_path when margin is sufficient |
| Weighted score | Combine booleans, confidence values, raw values, and similarity scores | Compute risk or complexity pressure |
| Threshold mapping | Convert a continuous score into stable bands | Map complexity into simple, medium, or complex |
| Multi-emit mapping | Emit several non-exclusive derived facts | Emit both needs_rag and needs_verifier |
| Normalization | Put heterogeneous signal scales into comparable space | Feed hybrid selection and confidence ranking |
One subtle rule keeps projections clean: decisions read projection outputs, not every intermediate score name. A partition or weighted score can be rich internally, but the policy surface should see facts such as legal_contract_path, risk_high, or long_context_lane.
| Projection artifact | Decision-visible? | Trace value |
|---|---|---|
| Partition | Usually through the selected output | Shows competing semantic candidates and margin |
| Score | Not by itself unless mapped or referenced by another projection | Shows weighted input contributions, match and miss values, and normalization behavior |
| Mapping output | Yes | Shows which threshold band or multi-emit output fired |
| Boundary distance | Indirectly through confidence | Explains near-miss cases where a score barely crossed or missed a band |
| Projection trace | Operationally visible | Lets operators debug why raw evidence became a route fact |
That is why the projection layer can grow without making decisions unreadable. A risk score might combine PII, jailbreak, tenant tier, KB source, and response verifier need. The decision should not have to carry that formula inline. It should match a named derived fact and leave the math in a traceable coordination layer.
Projections keep the rest of the system readable. Signals remain fine-grained. Decisions remain policy-oriented. Algorithms remain responsible for selection or collaboration. The coordination work lives in between, where it can be traced.
Decisions: Policy Needs a Shape
Once projections produce stable facts, decisions express route policy.
The decision engine is intentionally closer to a boolean circuit than a hidden Python function. Routing policy should be inspectable, diffable, auditable, sortable, and eventually compilable across infrastructure layers.
For the contract request, the policy can be simplified as:
A route is not just an endpoint. It is a policy-approved capability bundle: candidate models, algorithm, plugins, gateway mutations, retention behavior, fallback, and diagnostics.
| Decision element | Meaning | Why it matters |
|---|---|---|
| Leaf | References a signal or projection | Keeps policy connected to explicit evidence |
| AND | Requires all children to match | Expresses strict gates such as auth plus privacy |
| OR | Accepts any child | Lets multiple evidence patterns imply the same path |
| NOT | Excludes one child | Useful for fallback, denial, or bypass policy |
| Priority and tier | Sort matched decisions | Prevents low-risk paths from shadowing high-risk paths |
| Confidence | Carries evidence strength | Allows ranking without hiding why |
| Emits | Produces route metadata | Connects policy to cache, learning, plugins, and gateway headers |
In the current contract, a decision is a route object, not only a rule tree.
| Decision field | What it contributes to the route |
|---|---|
name and description |
Human-readable identity for traces, dashboards, replay, and review |
priority and tier |
Deterministic ordering when multiple policies match |
output_contract |
Declares the expected API or response shape for the route |
rules |
Recursive Boolean tree over signals and projection outputs |
modelRefs |
Candidate boundary for selectors, loopers, and learning |
algorithm |
Execution strategy after the policy match |
adaptations |
Per-decision learning mode: apply, observe, or bypass |
plugins |
Route-scoped behavior attached after matching |
candidateIterations |
Declarative candidate loops used by richer selection or workflow constructs |
emits |
Declarative side effects such as retention behavior |
Decision selection is intentionally deterministic. If matched decisions use tiers, lower tier values win first, then confidence, priority, and name. Without tiers, strategy=confidence ranks by confidence before priority; the default strategy ranks priority before confidence. Even fallback is explicit: an empty AND can be a catch-all route, but it has zero confidence so it does not outrank real evidence-backed decisions.
Retention emits make policy visible beyond the immediate model choice. A decision can express drop to skip semantic-cache writes, ttl_turns to bound cache lifetime, keep_current_model to protect session continuity, or prefer_prefix_retention to tell the serving pool that KV/prefix reuse matters.
That shift from classification-style routing to signal-decision routing is not cosmetic. Classification is useful for demos. Decision architecture is what production policy needs.
Selection Algorithms: Choosing One Model After Policy
Many router designs start with the selector: embeddings, MLPs, bandits, or an LLM-as-router that picks a model directly. That easily turns one algorithm into a sink for every concern: semantic fit, cost, latency, safety, authorization, session continuity, and provider policy.
vLLM SR keeps the order stricter. The decision matches first. Then a selection algorithm chooses one model from the policy-approved candidate set.
| Selection algorithm | Catalog tier | Core idea | Best use |
|---|---|---|---|
| Static | Supported | Pick the configured order or fixed score | Deterministic fallback, explicit business policy, early rollout |
| RouterDC | Supported | Match query embedding to model-description embeddings | Query-to-capability matching when model cards are meaningful |
| Hybrid | Supported | Combine semantic fit, quality, latency, cost, cache affinity, and other scores | Production tradeoffs where no single signal should dominate |
| Multi-factor | Supported | Filter by SLO, then score quality, latency, cost, and load | Fleet-aware route selection |
| Latency-aware | Supported | Prefer candidates using TTFT/TPOT percentile metrics | SLO-sensitive paths |
| AutoMix | Experimental | Start cheaper and escalate using confidence or verification | Cost-saving cascades where repair is acceptable |
| KNN | Experimental | Route by nearest labeled examples | Interpretable example-based routing |
| KMeans | Experimental | Route by cluster membership | Coarse workload segmentation |
| SVM | Experimental | Route by learned decision boundary | Fast offline-trained classification |
| MLP | Experimental | Non-linear neural selector through native ML artifacts | Mature deployments with trained artifacts |
The boundary is strict: selection algorithms choose one model from modelRefs. They do not run a panel, coordinate a workflow, or hold a session model. Multi-model collaboration belongs to Looper. Session and conversation stability belongs to Router Learning.
Looper Algorithms: Selecting a Model Collaboration Path
Looper is important enough to discuss separately because it changes what the router is selecting.
A selection algorithm chooses one model. A Looper algorithm chooses a model collaboration path: a bounded execution pattern involving escalation, fan-out, panel judgment, multi-round reasoning, or micro-agent workflows. This is usually the path for scaling model capability without exposing a new application protocol. The client may still call one logical model name, but the router executes a structured collaboration behind that name.
backend path"] B -- "looper" --> D["Collaboration path"] D --> E["Sequential
Confidence"] D --> F["Parallel
Ratings / Fusion"] D --> G["Multi-round
ReMoM"] D --> H["Workflow
Router Flow"] E --> I["One API response
headers + replay"] F --> I G --> I H --> I
| Looper algorithm | Catalog tier | Collaboration pattern | What it scales | How to read it |
|---|---|---|---|---|
| Confidence | Supported | Try smaller or cheaper models first, evaluate confidence, escalate when confidence is too low | Cost-efficient quality | A sequential small-to-large cascade with explicit stopping conditions |
| Ratings | Supported | Run multiple candidates concurrently up to a cap and aggregate with rating-aware logic | Ensemble breadth under cost control | A bounded fan-out path for evaluation, A/B, and ensemble-style responses |
| ReMoM | Supported | Run multi-round parallel reasoning with a breadth schedule and final synthesis | Test-time reasoning capacity | A breadth-controlled reasoning tree across models |
| Fusion | Experimental | Run an analysis panel, ask a judge for structured analysis, then synthesize one final answer | Independent model perspectives | A panel-judge-synthesis path for tasks where disagreement and blind spots matter |
| Router Flow / Workflows | Experimental | Execute a static or planner-generated micro-agent workflow behind one model name | Decomposition, verification, tool-aware work | A bounded agent workflow where workers are constrained by decision modelRefs |
These are not just “more algorithms.” They are the router’s answer to capability scaling.
Confidence keeps the cost curve low by starting small and escalating only when confidence is insufficient. It can use average log probability, margin, a hybrid score, self-verification, or an AutoMix-style entailment verifier. The question is not “small or large model?” It is “is the current answer good enough to stop?”
modelRefs"] --> B["Start with cheaper
or smaller model"] B --> C{"Confidence
enough?"} C -- "yes" --> D["Return answer"] C -- "no" --> E["Escalate to stronger
candidate"] E --> F{"Verifier, margin,
or logprob passes?"} F -- "yes" --> D F -- "no" --> G["Next candidate
or fallback"] G --> D
Ratings uses concurrency as a controlled resource. Instead of one winner, several candidates participate, bounded by max_concurrent, and the router aggregates successful responses. This is useful when operators want ensemble behavior or live comparison without letting fan-out become unbounded.
modelRefs"] --> B["Bounded fan-out
max_concurrent"] B --> C["Model A
response"] B --> D["Model B
response"] B --> E["Model C
response"] C --> F["Rating-aware
aggregation"] D --> F E --> F F --> G["One API response
plus trace"]
Fusion is the clean panel pattern. It sends the request to analysis models, asks a judge to identify consensus, contradictions, partial coverage, and blind spots, and then synthesizes a final answer. The important design point is that Fusion policy lives under the matched decision. vllm-sr/auto can decide whether Fusion is warranted; vllm-sr/fusion narrows matching to Fusion-capable decisions instead of silently falling back to ordinary single-model routing.
consensus, conflicts, gaps"] C --> E D --> E E --> F["Synthesis model"] F --> G["Final answer
with panel trace"]
ReMoM is the multi-round version of the same philosophy. It uses a breadth schedule such as [3, 2] or [32, 4], distributes calls across model candidates, compacts intermediate responses when needed, and synthesizes the final answer. This is useful when the value comes from exploration over multiple reasoning paths rather than one panel pass.
breadth schedule"] B --> C["Parallel reasoning
across candidates"] C --> D["Compact or select
intermediate outputs"] D --> E["Next round
reduced breadth"] E --> F["Final synthesis"] F --> G["Answer
with round trace"]
Router Flow turns the route into a bounded micro-agent workflow. A static flow can define roles such as thinker, worker, verifier, and final synthesizer. A dynamic flow can ask a planner model to produce a plan, but worker execution remains constrained to the decision’s modelRefs. Tool calls preserve the OpenAI-compatible contract while the router stores enough workflow state to resume the correct worker after tool results return.
or planner output"] B --> C["Thinker
decompose task"] C --> D["Worker
bounded by modelRefs"] D --> E["Tool calls
and tool results"] E --> D D --> F["Verifier
check result"] F --> G["Final synthesizer"] G --> H["OpenAI-compatible
response + workflow trace"]
The Looper layer is the bridge from routing to model collaboration. It lets the router scale capability through multiple models while keeping policy, traces, cost boundaries, and public API shape explicit.
Router Memory and Learning: Adaptation Is Not an Algorithm
Session-aware and learning-related behavior should not be hidden inside decision.algorithm. In the clean vLLM SR design, this belongs to Router Learning.
The distinction matters. A decision says what is allowed. A selection or looper algorithm produces a base result. Router Learning then asks whether the system should adapt that result from runtime experience, and whether switching is safe in the current session or conversation.
Learning can improve a route inside policy. It must not become a second, invisible policy system.
The runtime order is fixed:
| Component | Question it answers | What it may change | What it must not change |
|---|---|---|---|
| Recipe policy | Which route is allowed for this request? | Matched decision and candidate boundary | Runtime experience |
| Base selector / looper | What is the policy-approved base model or collaboration path? | Base result | Decision eligibility |
| Adaptation | Does experience suggest a better candidate inside the allowed boundary? | Proposal model | Signals, thresholds, decisions, priorities, modelRefs |
| Protection | Is exploration or switching safe now? | Hold, allow, or rescue final model | Model quality scores or policy matching |
| Replay and outcomes | What happened, and how did it perform? | Experience and offline evidence | Live recipe policy |
| Offline recipe learning | What recipe patch should humans review? | Candidate recipe patches and seed packs | Production behavior without review |
The public learning concepts are intentionally small:
| Concept | Public surface | Meaning |
|---|---|---|
| Adaptation | global.router.learning.adaptation |
Online model-choice learning from runtime experience |
| Protection | global.router.learning.protection |
Session and conversation stability control |
| Decision control | routing.decisions[].adaptations |
Apply, observe, or bypass learning for the matched decision |
| Candidate boundary | decision, tier, or global |
How far adaptation may search |
| Outcome | /v1/router/outcomes linked to replay |
Typed feedback for model, route, policy, stability, provider, or router |
| Replay | x-vsr-replay-id and durable record |
Evidence log for diagnostics and offline learning |
Adaptation’s day-0 strategy is routing_sampling. It scores candidates from local experience: quality seed, good-fit outcomes, underpowered outcomes, overprovisioned outcomes, failures, latency evidence, cache reuse, effective input cost, and reliability. The default candidate set is decision, which means adaptation may only choose among the matched decision’s modelRefs. Broader scopes such as tier and global are more powerful, but they need stronger guards.
Protection is the session-aware half. It has a preflight guard and a switch guard. Preflight suppresses stochastic sampling during tool loops, protocol-sensitive continuations, or routine continuation steps. The switch guard decides whether to hold the current model, allow the proposal, or perform a bounded rescue switch. The simplified rule is:
switch if proposal_gain >= switch_margin + stability_weight * switch_cost
The switch cost can include cache warmth, handoff cost, tool-loop state, provider state, turn count, and switch history. Session-aware routing is therefore not sticky sessions. It is controlled continuity. The router keeps a model when switching is unsafe or not worth it, and it can switch again at idle boundaries, decision drift, or rescue conditions.
| Router memory layer | Hot path? | Purpose |
|---|---|---|
| Protection state | Yes | Protected model, identity scope, turn count, cache/tool-loop evidence, switch history |
| Model experience | Yes | Quality, overuse, reliability, latency, cache, and cost evidence for adaptation |
| Router Replay | Write from hot path, read offline | Durable route, response, outcome, and learning diagnostics |
| Offline recipe artifacts | No | Findings, candidate recipes, recipe patches, and optional experience seed packs |
Sensitive routes can bypass learning entirely:
routing:
decisions:
- name: local_privacy_policy
modelRefs:
- model: local-private-model
adaptations:
mode: bypass
That boundary is the contract. Learning can improve choices inside recipe policy. It cannot silently rewrite the recipe, add a new privacy exception, change a decision priority, or mutate modelRefs on the request path. Offline recipe learning can propose those changes as reviewable artifacts, but live routing remains governed by the recipe.
Plugins: Behavior Belongs to the Route
After a route is selected, the request still may not be ready for the model. It may need cache, memory, RAG, tool filtering, request parameter caps, prompt mutation, response verification, fast policy response, image generation, or replay.
These behaviors should not be global decoration. Privacy routes may need to bypass cache. High-risk factual routes may require verification. Agentic routes may need tool boundaries. Low-risk summarization may need only replay. Plugins are therefore route-scoped.
| Plugin | What it changes | Why route scope matters |
|---|---|---|
| Semantic cache | Reads or writes semantic cache with threshold, TTL, and quota | Privacy and category boundaries change cache policy |
| Memory | Retrieves or stores conversational/user memory | Memory scope must respect tenant, privacy, and session policy |
| RAG | Adds retrieval from vector DB, MCP, file search, or external API | Knowledge access is a capability path |
| Tools | Passes through, filters, blocks, or dynamically retrieves tools | Tool exposure depends on route, user, risk, and session |
| Tool selection | Adds or filters tools from a tool database or request subset | Ranking tools is a route decision, not an application afterthought |
| Request params | Caps or rewrites temperature, max tokens, tools, or response format | High-risk routes need tighter request shape |
| System prompt | Injects route-specific instructions | Policy must reach model behavior |
| Header mutation | Adds provider, cluster, audit, or routing headers | Gateway and backend need explicit context |
| Fast response | Returns without model call | Blocks, denies, quotas, or unsupported paths |
| Response jailbreak | Checks response-side safety | Request-only scanning misses output failures |
| Hallucination | Warns, blocks, or rewrites unsupported claims | High-risk factual routes need response governance |
| Router replay | Records request, evidence, decision, model/path, plugins, and response | Debugging and learning need durable artifacts |
| Image generation | Bridges modality-aware routes to image backends | Image routes have different models and policies |
Some plugins are worth reading as miniature subsystems.
| Plugin subsystem | Important runtime detail | Failure it prevents |
|---|---|---|
semantic-cache |
Can override similarity threshold and TTL per decision; personalized RAG or memory routes can skip cache writes | Reusing private or personalized answers as generic cache hits |
memory |
Retrieves with limit and similarity threshold, supports auto-store, hybrid search, and reflection; injected after system/developer messages as a separate user-context message | Blending memory into hidden prompt text that operators cannot reason about |
rag |
Supports Milvus, Qdrant, external API, MCP, OpenAI file search, and hybrid modes; injection can be tool-role or system-prompt | Treating all knowledge access as one opaque retrieval step |
tools |
Supports passthrough, filtered, none, allow/block, semantic selection, and dynamic retrieval modes such as semantic_only and hybrid_history |
Letting an agent see tools just because the client sent them |
request_params |
Can block or strip request parameters, cap max_tokens and n, and optionally strip unknown OpenAI fields |
High-risk paths inheriting unsafe sampling or output shape |
response_jailbreak and hallucination |
Run after the model response and can warn, block, or rewrite warning metadata | Assuming request-time safety checks are enough |
router_replay |
Captures bounded request, response, tool trace, route, and plugin evidence | Losing the evidence needed for debugging, evaluation, and learning |
A route is better understood as an execution contract. The model is one part of it. The route also carries constraints, tools, knowledge, verification, memory, and evidence.
The Hot Path: Header, Body, Route, Response
The conceptual architecture only becomes convincing when it touches the request path. In vLLM SR, the hot path is shaped around a simple constraint: the router must see enough context to make a semantic decision, but it should avoid turning every request into an expensive full parse and full detector run.
The route lifecycle inside the gateway path looks like this.
id, path, protocol, identity"] --> B["Body
fast extraction"] B --> C{"Mutation
needed?"} C -- "no" --> D["Signals
projections
decisions"] C -- "yes" --> E["Full parse
OpenAI / Responses / Anthropic"] E --> D D --> F["Model routing
explicit, auto, looper slug"] F --> G["Request preparation
memory, RAG, tools, params, prompt"] G --> H["Backend or Looper
execution"] H --> I["Response phase
normalize, verify, cache, replay"]
| Phase | What the router extracts or changes | Why it is on the hot path |
|---|---|---|
| Header phase | Request id, method/path, client protocol, identity headers, streaming expectation, replay/model/response API paths, skip-processing opt-out | Routing needs tenant, protocol, and control metadata before reading the full body |
| Body phase | Fast request state first; full OpenAI-compatible parse only when mutation is needed | Most routing decisions should not pay unnecessary parsing and mutation cost |
| Pre-routing | Response API translation when needed, validation, signal dispatch, projection application, decision match, algorithm or looper preflight | Semantic policy must happen before backend selection |
| Model routing | Explicit model, auto model names, direct looper slugs such as Fusion or Flow, Anthropic provider routing, alias resolution, provider profile/auth, reasoning mode |
Logical model names must resolve into real provider and backend behavior |
| Request preparation | System prompt, memory, RAG, request params, tools, tool selection, route headers, trace headers | The selected route becomes concrete model input and gateway metadata |
| Response phase | Normalize OpenAI, Responses, or Anthropic shapes, report usage, calibrate token estimate, update cache, run response jailbreak/hallucination checks, store memory, emit warnings, record replay | The router must observe what happened, not only what it predicted |
Protocol compatibility is part of the same story. The router should not hide provider differences behind a vague facade. It should translate them explicitly at the boundary: OpenAI-compatible chat, Responses-style calls, Anthropic-style provider routing, direct looper model slugs, and backend-specific provider profiles all become inputs to one routing engine. Applications keep a familiar API shape, while the infrastructure keeps the differences visible enough to debug.
The ExtProc path matters because it gives semantic policy a boundary-native shape. vLLM SR can receive enough request and response context to make semantic decisions while still returning control to the real gateway data plane.
Envoy Integration: Put Semantic Policy at the Boundary
The Envoy integration shows the intended boundary clearly.
Envoy owns the data plane: TLS, clusters, endpoint health, timeouts, retries, load balancing, and filter chains. vLLM SR should not rebuild those capabilities. It should act as the semantic policy plane: receive request context through External Processing, evaluate the route, and return header/body mutations.
| Component | Responsibility |
|---|---|
| Client | Sends OpenAI-compatible or provider-compatible requests |
| Envoy | Handles network path, clusters, TLS, timeout, health, retry, and load balancing |
| ExtProc bridge | Sends request context and receives header/body mutations |
| vLLM SR | Extracts evidence, matches policy, selects or orchestrates, applies learning and plugins |
| Backend clusters | Serve model traffic after the semantic route is decided |
The adoption advantage is large. Applications do not need a new private API just to benefit from semantic routing. They can keep calling familiar model APIs while infrastructure maps logical model names to small models, frontier models, RAG paths, verifier paths, Looper collaboration paths, workflows, or private hardware lanes.
Protocol work such as SIRP and multi-provider inference API matters for the same reason. Semantic routing should strengthen the control plane without forcing every application team into a custom gateway dialect.
Model Design: Model Name Is the Wrong Primitive
The router cannot make production-grade decisions from model names alone.
gpt-4, qwen3-32b, claude-opus, or local-private identifies an endpoint or alias. It does not describe reasoning ability, coding strength, tool behavior, vision support, context window, latency distribution, cost, hardware path, privacy boundary, or observed failure modes.
| Model-card field | Examples | Routing implication |
|---|---|---|
| Capability | reasoning, coding, vision, tool use, image generation, verifier, embedding | Determines which routes can legally include the model |
| Economics | price, quality score, cost weight, expected output length | Feeds cost-quality optimization |
| Latency | TTFT, TPOT, p50/p95/p99, warm/cold behavior | Feeds SLO-aware selection |
| Context | context window, compression support, long-context stability | Drives context-aware routing and token-budget routing |
| Hardware | CUDA, ROCm, XPU, CPU, quantization, engine family | Connects logical model to physical pool |
| Policy | provider profile, tenant allowlist, data boundary, reasoning family | Prevents unsafe or unauthorized selection |
| Feedback | replay success, failure type, verifier disagreement, user feedback | Supports learning and recalibration |
Users can ask for auto. The system cannot treat auto as a magic endpoint. Internally, it must expand into model cards, candidate sets, route policy, Looper eligibility, learning boundaries, and execution paths.
Bindings: Fast ML Without Turning the Router Into a Model Server
vLLM SR’s control plane is written around Go because gateway integration, configuration, request mutation, response processing, Envoy ExtProc, and Kubernetes-style infrastructure fit Go well. But routing also has ML hot paths: embeddings, classification, modality detection, LoRA classification, and MLP selectors.
The binding layer keeps those concerns separated.
| Native surface | Role | Capability shape |
|---|---|---|
candle-binding |
Rust/Candle high-performance ML path | Unified batch classification, LoRA classification, batched embeddings, multimodal embeddings, modality routing, MLP selector |
ml-binding |
Rust helpers for classical ML selector artifacts | KNN, KMeans, and SVM-style selector support where trained artifacts exist |
nlp-binding |
Rust lexical routing helpers | BM25, n-gram, and deterministic lexical classifiers for low-latency evidence |
| ONNX backend | Portable runtime path | Batched embedding in the current public capability contract |
| Stub backend | Minimal or unsupported build path | Explicit capability absence for fallback, tests, and non-native builds |
The rule is not “everything must be native.” The rule is capability must be explicit. If a deployment lacks a native classifier, the router should know. If ONNX supports only part of the contract, policy should not pretend otherwise. If backend lifecycle needs reset boundaries, the router should expose that instead of hiding it. Intel/OpenVINO-oriented paths can be valuable deployment options, but they belong in the hardware/runtime discussion unless they are part of the same advertised native capability contract.
Hardware Is a Routing Variable
Hardware support is often described as a compatibility matrix. For semantic routing, it is more than that.
Different hardware paths imply different latency, cost, memory behavior, kernel availability, quantization support, context-window economics, privacy boundary, and energy curve. A router that ignores hardware is only doing half the job.
| Platform path | What matters to routing | Example route |
|---|---|---|
| NVIDIA CUDA | Mature high-throughput vLLM serving, CUDA kernels, quantized paths, broad accelerator availability | Default high-performance pool, frontier or private data-center path |
| AMD ROCm | First-class non-CUDA vLLM platform direction, MI300/MI350-class deployments, AITER kernels and attention paths | Cost/performance diversification, ROCm production pool, AMD Developer Cloud validation path |
| Intel XPU | SYCL/DPC++, oneDNN, XPU kernels, OpenVINO-oriented portability and optimization paths | Enterprise accelerator lane, private infrastructure, CPU/XPU hybrid deployment |
| CPU / local | Intel/AMD x86, ARM AArch64, Apple silicon, edge and offline fallback | PII-sensitive, low-throughput, local-only, or cost-minimal workloads |
| KV / context | Prefix retention, warm state, prefill/decode balance, context-window pressure, bytes-per-token drift | Session-aware protection, long-context pools, token-budget-aware routing |
The acceleration story is not one-size-fits-all. CUDA gives the broadest default serving path. ROCm/AITER makes AMD pools a serious production option rather than a compatibility afterthought. Intel XPU and OpenVINO-style paths matter for enterprises that already own Intel-heavy infrastructure or need CPU/XPU portability. CPU and local paths remain important because privacy, availability, and cost sometimes beat raw throughput.
WRP is the right mental model here. Workload is semantic. Pool is physical. Router translates between them.
intent, risk, context, modality"] --> R["Router
policy + selection + learning"] R --> P["Pool
models, queues, cache, hardware"] P --> R R --> O["Outcome
quality, latency, cost, safety"] O --> W O --> R
A future router should know when CUDA is overloaded, when ROCm is cost-effective, when XPU is sufficient, when CPU/local is preferable for privacy, and when preserving KV cache is worth more than reselecting a theoretically better model.
Observability: The Router Must Leave Evidence Behind
If the system makes semantic decisions, operators need to see those decisions.
| Trace surface | It explains |
|---|---|
| Signal trace | Which signals ran, what matched, and what raw values or confidence appeared |
| Projection trace | How evidence became derived route facts |
| Decision trace | Which policy matched and why it outranked alternatives |
| Selection trace | Which candidates were considered and which model won |
| Looper trace | Which panel, rounds, workers, judge, or synthesis path executed |
| Learning trace | Base model, proposal model, final model, protection action, adaptation reason, cache and switch evidence |
| Plugin trace | Which cache, RAG, memory, tool, safety, verification, and replay behaviors executed |
| Header trace | What x-vsr-* metadata went to the gateway or client |
| Replay record | The durable request/response/decision/model/path artifact for debugging and evaluation |
Replay is not only for dashboards. It is the second control plane. The first control plane makes the live decision; replay preserves enough evidence to debug that decision, compare it against alternatives, attach outcomes, and produce offline recipe patches.
If the router records what it saw, what it decided, what happened, and how users, agents, verifiers, or evals responded, the next generation of selectors, thresholds, learning strategies, and recipes can improve.
Without traces, the router is another opaque model. With traces, it becomes a system component.
Evaluation: The Router Is a Frontier
The wrong way to evaluate a router is to count features. A router with many knobs can still be bad. The right question is whether it improves the frontier between quality, cost, latency, safety, privacy, reliability, and hardware efficiency.
| Evaluation axis | What should improve |
|---|---|
| Quality-cost frontier | Same or better quality with lower model spend |
| Latency frontier | Better SLO compliance without blindly choosing weak models |
| Safety and privacy | Better handling of PII, jailbreak, tool exposure, and local/private paths |
| Factuality | More grounded answers through RAG and verification where needed |
| Collaboration value | Better output from Fusion, ReMoM, Flow, or Ratings than single-model baselines |
| Session stability | Fewer broken tool loops and fewer harmful model switches |
| Fleet efficiency | Better queue, cache, hardware, and pool utilization |
| Debuggability | More replayable and explainable decisions |
| Learning loop | Better calibration from traces, outcomes, and offline evals |
Router evaluation, fleet simulation, replay traces, RouterArena-style comparison, Looper evals, and offline recipe learning matter because semantic routing is infrastructure. It has to be measured like infrastructure.
The Part I Care About Most
The most important design belief in vLLM SR is not any single signal, plugin, model, algorithm, or hardware platform. It is the separation of concerns.
Signals observe. Projections coordinate. Decisions express policy. Selection algorithms choose one model. Looper algorithms scale capability through model collaboration. Router Learning adapts and protects within recipe boundaries. Plugins mutate behavior. Bindings accelerate hot-path ML. Envoy integration places semantic policy at the traffic boundary. Model cards connect logical names to real capabilities. Hardware metadata connects semantic constraints to physical execution. Observability makes every decision accountable.
That separation is what lets the router evolve without turning every new idea into a fork of the hot path.
A new jailbreak detector can become a signal. A new risk formula can become a projection. A new compliance rule can become a decision. A new selector can become a selection algorithm. A new panel or workflow primitive can become a Looper algorithm. A new verifier can become a plugin. A new accelerator lane can become model-card metadata. A new benchmark can become an evaluation loop. A new fleet model can feed pool-aware routing. A new learning strategy can propose better candidates without rewriting policy.
That is why I like the phrase Intelligence Control Plane. Not because the router is always intelligent by itself, but because it gives the system a place to allocate intelligence deliberately.
The first stage of AI infrastructure made intelligence callable.
The next stage has to make intelligence allocatable, explainable, and optimizable.
Calling a model was the first abstraction. Allocating intelligence is the next one.
That is the philosophy of vLLM SR.
Source Trail
This article is based on the vLLM Semantic Router codebase, website research archive, vLLM project blog posts, my earlier essays on LLM routing, and infrastructure references around Envoy, vLLM, ROCm, Intel XPU, and Gateway API inference routing.
| Source | Why it matters here |
|---|---|
| vLLM Semantic Router research archive | Paper table and system-design throughline |
| Canonical config and routing contract code | Recipe-as-contract, signal/projection/decision surfaces, supported algorithm and plugin catalog |
| Signal-Decision Driven Architecture | Shift from single classification to signal-decision routing |
| Iris / Athena / Themis release posts | Signals, model selection, plugins, memory, replay, AMD ROCm, and release progression |
| Fusion API and Looper tutorials | Multi-model collaboration path: Confidence, Ratings, Fusion, ReMoM, Router Flow |
| Router Learning docs and proposal | Adaptation, protection, memory, replay, outcomes, and offline recipe learning |
| Session-Aware Agentic Routing | Tool-loop continuity, provider state, prefix cache, and safe switch boundaries |
| Agentic Routing on AMD ROCm | AMD ROCm deployment, agentic recipe, dashboard, learning, and replay |
| ExtProc runtime pipeline notes | Header/body/model-routing/response phases, protocol normalization, replay and response-time checks |
| Envoy External Processing filter | Semantic policy path versus main traffic data plane |
| Native binding capability matrix | Candle, ML binding, NLP binding, ONNX, and Stub capability boundaries |
| vLLM platform documentation | CUDA, ROCm, XPU, CPU, and deployment background |
| AMD ROCm / AITER / vLLM ROCm attention backend | AMD acceleration path and ROCm serving ecosystem |
| Intel XPU kernels / IPEX XPU ecosystem | Intel accelerator serving path |
| Kubernetes Gateway API Inference Extension | Standardization direction for inference routing at the gateway layer |