System Intelligence — June 25, 2026

The Philosophy of vLLM SR

vLLM Semantic Router is not a smarter model picker. It is a control plane for turning semantic work into explainable capability paths across models, tools, memory, policy, gateways, and hardware.

The Philosophy of vLLM SR

The easiest way to misunderstand vLLM Semantic Router is to start from the word router.

In a traditional gateway, routing is a forwarding problem. A request arrives, the gateway evaluates policy, picks a backend cluster, and sends the request there. The backend is usually a deterministic service. If it is healthy, the request succeeds. If it fails, the gateway retries, fails over, or returns an error.

LLM systems break that mental model.

The backend is no longer just a service. It is a probabilistic reasoning system. It may need tools. It may need memory. It may need retrieval. It may hallucinate. It may require a verifier. It may run on a GPU pool whose KV cache is warm for one session and cold for another. It may be cheap for short requests and expensive for long-context ones. It may be allowed for one tenant and forbidden for another.

A request is no longer only traffic. It is semantic work.

Routing is the moment where semantic intent becomes infrastructure placement.

That is the starting point of vLLM SR.

The project is not trying to build a prettier model selector. The deeper idea is to turn every LLM request into an explainable capability path: which evidence was observed, which derived facts were produced, which policy matched, which selection or collaboration algorithm ran, which plugins were attached, which backend and hardware path was used, and what trace was left behind.

Request flows into signals, projections, decisions, and actions including model, verify, and tools.
The output of vLLM SR is not just a model name. It is a governed capability path.

I want to walk through that design as a system, not as a feature list. We will keep one running example in view, then open the layers one by one: signals, projections, decisions, selection algorithms, looper algorithms, router memory and learning, plugins, Envoy integration, model design, native bindings, hardware support, observability, and evaluation.

A Request Walks Into the Router

Imagine an enterprise AI gateway receives this request:

Analyze this customer contract, compare it with our internal policy, and draft a response. It may contain personal information. Use tools if needed.

If we look only at surface text, this is a contract-analysis request. If we look only at token count, it may be a long-context request. If we look at risk, it touches PII, internal policy, tool access, and factual correctness. If we look at infrastructure, it depends on tenant permissions, pool health, cache warmth, context-window capacity, and model cost.

A naive model picker asks one question:

Which model should answer this prompt?

vLLM SR asks a better question:

What must the system know before it is allowed to spend intelligence?

flowchart LR A["Request
contract + policy + PII"] --> B["Signals
observe evidence"] B --> C["Projections
derive routing facts"] C --> D["Decisions
match policy"] D --> E["Selection Algorithms
choose one model"] D --> L["Looper Algorithms
choose collaboration path"] E --> M["Router Learning
adapt + protect"] L --> M M --> F["Plugins
attach behavior"] F --> G["Gateway / Backend
execute"] G --> H["Trace / Replay
learn"]

The split is intentional. Each layer owns a different kind of reasoning.

Layer It owns It should not own
Signals Observing facts about the request, user, session, content, and environment Final policy or model choice
Projections Turning noisy evidence into stable routing facts Gateway mutation or backend calls
Decisions Expressing auditable route policy Low-level selection or collaboration execution
Selection algorithms Choosing one model from a policy-approved candidate set Rewriting route eligibility
Looper algorithms Selecting and running a multi-model collaboration path Pretending to be a single-model selector
Router Learning Adapting model choice and protecting session continuity after base selection Mutating recipe policy on the request path
Plugins Route-scoped behavior such as cache, RAG, memory, tools, verification, replay Global hidden middleware
Bindings Fast native ML paths for hot routing components Policy semantics
Gateway integration Putting semantic policy at the network boundary Replacing the gateway data plane

The core design choice is separation of concerns. If a request goes to a frontier model with RAG, tool filtering, and a response verifier, the answer should not be “because the router said so.” The router should be able to show the chain: evidence, projection, policy, algorithm, learning decision, plugins, mutations, backend path, and replay id.

The Full Request Lifecycle

For the contract request, the router first extracts request context: body, headers, tenant identity, session id, conversation id, tool state, prior response markers, and any existing routing metadata.

It then runs only the signals referenced by the active recipe. Domain, PII, context length, authorization, knowledge-base relevance, fact-check need, tool availability, and jailbreak or prompt-injection signals can run independently. The point is not to run every detector on every request; the point is to gather the evidence that the configured policy actually needs.

Projections then compose that evidence into stable route facts: privacy_sensitive, legal_contract_path, long_context, needs_internal_rag, and verification_required. The decision engine matches those facts against route policy. A matched decision exposes candidate models and, depending on its algorithm, either a single-model selection path or a looper collaboration path. Router Learning may then adjust the proposed model or hold the current model if switching would break session continuity. Finally, plugins attach route-scoped behavior before the gateway executes the main traffic path.

flowchart LR C["Client request"] --> E["Envoy / Gateway
headers + body"] E --> R["vLLM SR
request context"] R --> S["Signals
run required evidence"] S --> P["Projections
derive route facts"] P --> D["Decisions
match policy"] D --> A{"Execution mode"} A -- "single model" --> M["Selector"] A -- "collaboration" --> L["Looper"] M --> G["Router Learning
adapt + protect"] L --> G G --> X["Plugins
cache, RAG, tools, safety"] X --> B["Backend pool
or looper call"] B --> T["Trace / replay
response checks"]

The model name is only the visible end of the route. The deeper value is that the system can explain why this was not a cheap direct-answer path. Privacy, internal knowledge, tool exposure, factuality, and context pressure are different concerns. vLLM SR gives each concern a place in the architecture.

The Recipe Is the System Contract

The next design step is easy to miss: vLLM SR is not primarily configured as a collection of ad hoc callbacks. It is configured as a recipe. That recipe is the contract between application intent, routing policy, backend inventory, plugin behavior, learning boundaries, and gateway mutation.

In production, routing policy tends to spread. One part lands in application code, one part in gateway config, one part in prompt templates, one part in a model registry, one part in a dashboard, and one part in a notebook used by the evaluation team. Once that happens, nobody can answer a simple operational question: “Why did this request take this path?”

vLLM SR’s v0.3-style contract tries to keep that answer in one durable shape.

flowchart LR A["Recipe
RouterConfig"] --> B["Evidence contract
signals"] B --> C["Fact contract
projections"] C --> D["Policy contract
decisions"] D --> E["Execution contract
modelRefs, algorithm, plugins, adaptations"] E --> F["Runtime contract
backend models, provider profiles, router options"]
Contract surface What it owns Why it belongs in the recipe
Global runtime Semantic cache, memory, replay, learning, observability, API listeners, startup behavior These features are infrastructure state, not per-request improvisation
routing.signals The evidence families that are allowed to run A decision should only depend on named, reviewable evidence
routing.projections Partitions, scores, and mappings derived from signals Policy should consume stable facts, not raw detector noise
routing.decisions Named route policies with priority, tier, rules, output contract, modelRefs, algorithms, adaptations, plugins, and emits A route is an auditable capability bundle
decision.modelRefs The approved candidate boundary for a decision Selection and learning cannot escape policy by discovering a better-looking model elsewhere
decision.algorithm The execution strategy inside the matched decision The recipe distinguishes one-model selection from collaboration paths such as Fusion or Flow
decision.plugins Cache, memory, RAG, tools, prompt mutation, response checks, replay, image generation, or fast response Behavior follows the matched route instead of hiding in global middleware
decision.adaptations Apply, observe, or bypass learning for this policy Sensitive routes can be protected from online exploration
Backend models Provider profiles, vLLM endpoints, image backends, model metadata, aliases, default model Logical routing must resolve to real serving pools and provider contracts
Router options Auto model names, body streaming mode, route-cache clearing, skip-processing controls Gateway behavior and compatibility are explicit operational choices

That is the operational difference between a router that happens to make good choices and a router that can be trusted. A recipe is diffable. It can be validated. It can be replayed against stored traffic. It can be promoted across environments. It can be patched by offline learning without letting the hot path rewrite production policy.

The rule is conservative on purpose: runtime intelligence may propose better choices, but the recipe defines the legal search space. That is what keeps “adaptive routing” from becoming hidden policy mutation.

Research as Design Pressure

The paper shelf behind vLLM SR is not a citation dump. Each research direction creates pressure on one part of the router. Read the table as a map from problem to architecture.

Work Problem Core idea Architectural pressure
vLLM Semantic Router: Signal Driven Decision Routing Single-classification routing cannot express cost, privacy, latency, safety, and multimodal constraints together Compose heterogeneous signals into route decisions and route-scoped behavior Establishes signals, projections, decisions, algorithms, and plugins as separate layers
Workload-Router-Pool Routing papers often ignore serving pools, while fleet papers often ignore workload semantics Close the loop among workload evidence, router policy, and physical pool state Makes queue, cache, hardware, and pool feedback routing inputs
When to Reason Reasoning models are expensive and not always useful Detect reasoning need and apply reasoning only when beneficial Creates complexity signals, reasoning policy, and model-card reasoning capability
Category-Aware Semantic Caching One similarity threshold is unsafe across heterogeneous workloads Cache thresholds, TTLs, and quotas vary by category Makes semantic cache a route-scoped plugin, not a global middleware
Outcome-Aware Tool Selection Tool selection cannot rely only on semantic similarity Refine tool embeddings offline using outcome evidence under latency budgets Connects tool signals, tool-selection plugins, and replay/outcome data
98x Faster Routing Without Dedicated GPU A hot-path router cannot take seconds to route Use prompt compression, Flash Attention, and near-streaming body processing Pushes native bindings and fast signal runtime
Adaptive VLM Routing Multimodal computer-use steps have very different difficulty Estimate visual/action difficulty and route cheap or strong VLMs accordingly Adds modality, visual difficulty, and multimodal safety signals
Visual Confused Deputy Computer-use agents can be attacked through visual perception failures Check click target and action reasoning independently Pulls visual safety and action-boundary validation into routing
Knowledge Access Beats Model Size Correct knowledge access can beat a larger model Use memory and retrieval-grounded routing to recover quality with smaller models Promotes RAG, memory, KB signals, and knowledge paths to first-class capabilities
Fast and Faithful RAG Verification Retrieval does not guarantee faithful answers Verify long-document RAG responses in real time Motivates hallucination and response-verification plugins
inference-fleet-sim Model choice depends on queueing, TTFT, and fleet capacity Use queueing-grounded simulation for multi-pool planning Connects router policy to fleet simulation
FleetOpt Minimum-cost pools depend on workload CDF and P99 targets Derive pool boundaries and deploy them through compress-and-route Connects token budget, pool boundary, and cost-aware routing
1/W Law Context window size changes energy efficiency and memory pressure Analyze context-length routing topology and tokens-per-watt Makes context routing and long-context pools central
Conflict-Free Policy Languages Probabilistic ML predicates can co-fire silently Detect and prevent policy conflicts in the DSL Motivates declarative decisions, priority, confidence, and auditability
Cross-Layer Policy Compilation Policy should not live as scattered gateway, workflow, K8s, and agent code Compile one declarative source into multiple execution layers Points toward policy-as-code and cross-layer verification
Token-Budget-Aware Pool Routing Token budget affects KV cache, pool choice, and failure risk Estimate total token budget and route to short or long pools Connects context, bytes-per-token, pools, and failure avoidance
SIRP and Multi-Provider API Semantic routing should not require private application protocols Standardize semantic inference routing and multi-provider surfaces Reinforces OpenAI-compatible, gateway-compatible control planes

The pressure from the research shelf is clear: a router cannot be one classifier. If it were, every new paper would become another special case. In vLLM SR, a paper can become a signal, projection, decision primitive, selection algorithm, looper algorithm, plugin, learning signal, model-card field, or pool feedback loop.

Signals: Evidence Before Judgment

Signals are deliberately humble. A signal answers one factual question. It does not choose the model. It does not select a plugin. It does not decide the route.

For the contract example, the router may need to know whether the request is legal or enterprise-related, whether it contains PII, whether the user is authorized for premium or private models, whether it needs internal knowledge, whether tools are available, whether the context is too large for a short-context backend, and whether the response needs verification.

Request fan-out to Keyword, Context, Authz, Domain, PII, Jailbreak, Modality, Feedback, and Signal Results.
Signals are sensors, not policy. They produce reusable evidence.
Signal family What it observes Why it matters
Keyword and lexical Explicit words, regex, BM25, n-grams, fuzzy anchors Fast deterministic anchors for compliance, product names, incidents, and explicit task classes
Context and structure Token count, context pressure, prompt shape, JSON/schema/workflow form Separates short direct paths from long-context, compression, and workflow paths
Authz and tenant User identity, groups, role bindings, tiers, allowlists Prevents silent access to premium, private, tenant-specific, or sensitive paths
Conversation and session Turn count, tool calls, tool results, active loop state, previous response markers Protects continuity and avoids switching during non-portable state
Domain and KB Task domain, internal knowledge source, KB relevance Chooses domain models, RAG sources, and policy constraints
Embedding and preference Semantic similarity, user/task preference, model-description fit Handles soft semantic routing where exact keywords are insufficient
Complexity and repair Difficulty, uncertainty, repeated dissatisfaction, retry and repair signals Drives reasoning activation, escalation, and learning
Fact-check and safety Factuality need, jailbreak, prompt injection, PII, response risk Triggers RAG, verifier, guardrail, local/private route, or block path
Modality and event Text, image, image-generation intent, SRE/SOC event shape Routes to VLM, image, operational, or event-specific paths

The maintained runtime surface currently exposes eighteen base signal families: authz, complexity, context, conversation, domain, embedding, fact_check, jailbreak, keyword, language, modality, pii, preference, reask, structure, kb, user_feedback, and event. projection is also a rule condition type, but it is not a raw sensor. It is the decision-visible output of the projection layer.

The runtime behavior is as important as the list of names.

Runtime behavior Design implication
Used-signal analysis The classifier builds a map from decisions and projections, then runs only the signal families needed by the active recipe unless forced evaluation is enabled
Concurrent dispatch Independent signal evaluators can run in parallel, keeping the route path from becoming a serial detector chain
Readiness checks A configured signal is only useful if its model, rule, or backend dependency is ready; the router can avoid pretending a missing detector produced evidence
Request-scoped caches Expensive intermediate work, such as image embeddings, can be shared by complexity and embedding signals during one request
Trace preservation Signal output is retained as evidence for projections, decision traces, replay, and learning diagnostics

The payoff is evidence reuse. A pii signal can influence privacy policy, cache policy, provider selection, and audit policy. A conversation signal can influence router protection, tool filtering, and replay. A context signal can influence long-context pools, compression, and cost-aware selection. If signals directly selected models, that reuse would disappear.

Projections: The Coordination Layer

Signals are often messy. Some are booleans. Some are classifier probabilities. Some are similarity scores. Some are raw metrics such as token count or KB relevance. Production policy should not need to manually combine all of those raw values every time.

Projections turn evidence into stable routing facts.

Input signals flow through a projection layer with linear transforms and thresholds into selected routing bands.
Projections turn raw evidence into policy-readable routing facts.
Raw evidence Example value Why a projection helps
domain=legal confidence 0.82 Domain can overlap with finance, support, or security; policy needs a stable partition
pii=present confidence 0.91 Privacy should not depend on one detector alone
context_tokens 42K Token count matters relative to model window and pool state
fact_check=needed confidence 0.76 Factuality should combine with domain and knowledge availability
authz=premium matched Authorization is a hard gate, not a difficulty signal
kb=internal_policy score 0.68 KB relevance is retrieval evidence, not a complete route
flowchart LR A["Raw signals"] --> B["Risk score"] A --> C["Complexity score"] A --> D["Domain partition"] A --> E["Knowledge need"] B --> F["privacy_sensitive"] C --> G["simple / medium / complex"] D --> H["legal_contract_path"] E --> I["needs_internal_rag"] F --> J["Decision inputs"] G --> J H --> J I --> J
Projection pattern Logic Example
Partition Pick one winner among competing semantic candidates Choose legal_contract_path over finance_path when margin is sufficient
Weighted score Combine booleans, confidence values, raw values, and similarity scores Compute risk or complexity pressure
Threshold mapping Convert a continuous score into stable bands Map complexity into simple, medium, or complex
Multi-emit mapping Emit several non-exclusive derived facts Emit both needs_rag and needs_verifier
Normalization Put heterogeneous signal scales into comparable space Feed hybrid selection and confidence ranking

One subtle rule keeps projections clean: decisions read projection outputs, not every intermediate score name. A partition or weighted score can be rich internally, but the policy surface should see facts such as legal_contract_path, risk_high, or long_context_lane.

Projection artifact Decision-visible? Trace value
Partition Usually through the selected output Shows competing semantic candidates and margin
Score Not by itself unless mapped or referenced by another projection Shows weighted input contributions, match and miss values, and normalization behavior
Mapping output Yes Shows which threshold band or multi-emit output fired
Boundary distance Indirectly through confidence Explains near-miss cases where a score barely crossed or missed a band
Projection trace Operationally visible Lets operators debug why raw evidence became a route fact

That is why the projection layer can grow without making decisions unreadable. A risk score might combine PII, jailbreak, tenant tier, KB source, and response verifier need. The decision should not have to carry that formula inline. It should match a named derived fact and leave the math in a traceable coordination layer.

Projections keep the rest of the system readable. Signals remain fine-grained. Decisions remain policy-oriented. Algorithms remain responsible for selection or collaboration. The coordination work lives in between, where it can be traced.

Decisions: Policy Needs a Shape

Once projections produce stable facts, decisions express route policy.

The decision engine is intentionally closer to a boolean circuit than a hidden Python function. Routing policy should be inspectable, diffable, auditable, sortable, and eventually compilable across infrastructure layers.

Evidence enters a boolean decision engine with AND, OR, NOT gates and outputs selected route, candidate models, plugins, and fallback route.
The decision layer turns evidence into auditable route policy.

For the contract request, the policy can be simplified as:

flowchart LR A["authz: premium"] --> D1["AND"] B["privacy_sensitive"] --> D1 C["legal_contract_path"] --> D1 D1 --> D2["AND"] E["needs_internal_rag"] --> D2 F["verification_required"] --> D2 D2 --> R["enterprise_contract_path"] R --> M["candidate models"] R --> P["RAG + tool filter + verifier + replay"]

A route is not just an endpoint. It is a policy-approved capability bundle: candidate models, algorithm, plugins, gateway mutations, retention behavior, fallback, and diagnostics.

Decision element Meaning Why it matters
Leaf References a signal or projection Keeps policy connected to explicit evidence
AND Requires all children to match Expresses strict gates such as auth plus privacy
OR Accepts any child Lets multiple evidence patterns imply the same path
NOT Excludes one child Useful for fallback, denial, or bypass policy
Priority and tier Sort matched decisions Prevents low-risk paths from shadowing high-risk paths
Confidence Carries evidence strength Allows ranking without hiding why
Emits Produces route metadata Connects policy to cache, learning, plugins, and gateway headers

In the current contract, a decision is a route object, not only a rule tree.

Decision field What it contributes to the route
name and description Human-readable identity for traces, dashboards, replay, and review
priority and tier Deterministic ordering when multiple policies match
output_contract Declares the expected API or response shape for the route
rules Recursive Boolean tree over signals and projection outputs
modelRefs Candidate boundary for selectors, loopers, and learning
algorithm Execution strategy after the policy match
adaptations Per-decision learning mode: apply, observe, or bypass
plugins Route-scoped behavior attached after matching
candidateIterations Declarative candidate loops used by richer selection or workflow constructs
emits Declarative side effects such as retention behavior

Decision selection is intentionally deterministic. If matched decisions use tiers, lower tier values win first, then confidence, priority, and name. Without tiers, strategy=confidence ranks by confidence before priority; the default strategy ranks priority before confidence. Even fallback is explicit: an empty AND can be a catch-all route, but it has zero confidence so it does not outrank real evidence-backed decisions.

Retention emits make policy visible beyond the immediate model choice. A decision can express drop to skip semantic-cache writes, ttl_turns to bound cache lifetime, keep_current_model to protect session continuity, or prefer_prefix_retention to tell the serving pool that KV/prefix reuse matters.

That shift from classification-style routing to signal-decision routing is not cosmetic. Classification is useful for demos. Decision architecture is what production policy needs.

Selection Algorithms: Choosing One Model After Policy

Many router designs start with the selector: embeddings, MLPs, bandits, or an LLM-as-router that picks a model directly. That easily turns one algorithm into a sink for every concern: semantic fit, cost, latency, safety, authorization, session continuity, and provider policy.

vLLM SR keeps the order stricter. The decision matches first. Then a selection algorithm chooses one model from the policy-approved candidate set.

Route enters a selector with Static, RouterDC, Hybrid, and Latency algorithms, then outputs one model or multi model.
Selection algorithms choose inside a matched decision. They do not define route eligibility.
Selection algorithm Catalog tier Core idea Best use
Static Supported Pick the configured order or fixed score Deterministic fallback, explicit business policy, early rollout
RouterDC Supported Match query embedding to model-description embeddings Query-to-capability matching when model cards are meaningful
Hybrid Supported Combine semantic fit, quality, latency, cost, cache affinity, and other scores Production tradeoffs where no single signal should dominate
Multi-factor Supported Filter by SLO, then score quality, latency, cost, and load Fleet-aware route selection
Latency-aware Supported Prefer candidates using TTFT/TPOT percentile metrics SLO-sensitive paths
AutoMix Experimental Start cheaper and escalate using confidence or verification Cost-saving cascades where repair is acceptable
KNN Experimental Route by nearest labeled examples Interpretable example-based routing
KMeans Experimental Route by cluster membership Coarse workload segmentation
SVM Experimental Route by learned decision boundary Fast offline-trained classification
MLP Experimental Non-linear neural selector through native ML artifacts Mature deployments with trained artifacts

The boundary is strict: selection algorithms choose one model from modelRefs. They do not run a panel, coordinate a workflow, or hold a session model. Multi-model collaboration belongs to Looper. Session and conversation stability belongs to Router Learning.

Looper Algorithms: Selecting a Model Collaboration Path

Looper is important enough to discuss separately because it changes what the router is selecting.

A selection algorithm chooses one model. A Looper algorithm chooses a model collaboration path: a bounded execution pattern involving escalation, fan-out, panel judgment, multi-round reasoning, or micro-agent workflows. This is usually the path for scaling model capability without exposing a new application protocol. The client may still call one logical model name, but the router executes a structured collaboration behind that name.

flowchart LR A["Matched decision"] --> B{"Execution choice"} B -- "selection" --> C["One model
backend path"] B -- "looper" --> D["Collaboration path"] D --> E["Sequential
Confidence"] D --> F["Parallel
Ratings / Fusion"] D --> G["Multi-round
ReMoM"] D --> H["Workflow
Router Flow"] E --> I["One API response
headers + replay"] F --> I G --> I H --> I
Looper algorithm Catalog tier Collaboration pattern What it scales How to read it
Confidence Supported Try smaller or cheaper models first, evaluate confidence, escalate when confidence is too low Cost-efficient quality A sequential small-to-large cascade with explicit stopping conditions
Ratings Supported Run multiple candidates concurrently up to a cap and aggregate with rating-aware logic Ensemble breadth under cost control A bounded fan-out path for evaluation, A/B, and ensemble-style responses
ReMoM Supported Run multi-round parallel reasoning with a breadth schedule and final synthesis Test-time reasoning capacity A breadth-controlled reasoning tree across models
Fusion Experimental Run an analysis panel, ask a judge for structured analysis, then synthesize one final answer Independent model perspectives A panel-judge-synthesis path for tasks where disagreement and blind spots matter
Router Flow / Workflows Experimental Execute a static or planner-generated micro-agent workflow behind one model name Decomposition, verification, tool-aware work A bounded agent workflow where workers are constrained by decision modelRefs

These are not just “more algorithms.” They are the router’s answer to capability scaling.

Confidence keeps the cost curve low by starting small and escalating only when confidence is insufficient. It can use average log probability, margin, a hybrid score, self-verification, or an AutoMix-style entailment verifier. The question is not “small or large model?” It is “is the current answer good enough to stop?”

flowchart LR A["Matched decision
modelRefs"] --> B["Start with cheaper
or smaller model"] B --> C{"Confidence
enough?"} C -- "yes" --> D["Return answer"] C -- "no" --> E["Escalate to stronger
candidate"] E --> F{"Verifier, margin,
or logprob passes?"} F -- "yes" --> D F -- "no" --> G["Next candidate
or fallback"] G --> D

Ratings uses concurrency as a controlled resource. Instead of one winner, several candidates participate, bounded by max_concurrent, and the router aggregates successful responses. This is useful when operators want ensemble behavior or live comparison without letting fan-out become unbounded.

flowchart LR A["Matched decision
modelRefs"] --> B["Bounded fan-out
max_concurrent"] B --> C["Model A
response"] B --> D["Model B
response"] B --> E["Model C
response"] C --> F["Rating-aware
aggregation"] D --> F E --> F F --> G["One API response
plus trace"]

Fusion is the clean panel pattern. It sends the request to analysis models, asks a judge to identify consensus, contradictions, partial coverage, and blind spots, and then synthesizes a final answer. The important design point is that Fusion policy lives under the matched decision. vllm-sr/auto can decide whether Fusion is warranted; vllm-sr/fusion narrows matching to Fusion-capable decisions instead of silently falling back to ordinary single-model routing.

flowchart LR A["Matched Fusion decision"] --> B["Analysis model A"] A --> C["Analysis model B"] A --> D["Analysis model C"] B --> E["Judge
consensus, conflicts, gaps"] C --> E D --> E E --> F["Synthesis model"] F --> G["Final answer
with panel trace"]

ReMoM is the multi-round version of the same philosophy. It uses a breadth schedule such as [3, 2] or [32, 4], distributes calls across model candidates, compacts intermediate responses when needed, and synthesizes the final answer. This is useful when the value comes from exploration over multiple reasoning paths rather than one panel pass.

flowchart LR A["Matched ReMoM decision"] --> B["Round 1
breadth schedule"] B --> C["Parallel reasoning
across candidates"] C --> D["Compact or select
intermediate outputs"] D --> E["Next round
reduced breadth"] E --> F["Final synthesis"] F --> G["Answer
with round trace"]

Router Flow turns the route into a bounded micro-agent workflow. A static flow can define roles such as thinker, worker, verifier, and final synthesizer. A dynamic flow can ask a planner model to produce a plan, but worker execution remains constrained to the decision’s modelRefs. Tool calls preserve the OpenAI-compatible contract while the router stores enough workflow state to resume the correct worker after tool results return.

flowchart LR A["Matched Flow decision"] --> B["Static flow
or planner output"] B --> C["Thinker
decompose task"] C --> D["Worker
bounded by modelRefs"] D --> E["Tool calls
and tool results"] E --> D D --> F["Verifier
check result"] F --> G["Final synthesizer"] G --> H["OpenAI-compatible
response + workflow trace"]

The Looper layer is the bridge from routing to model collaboration. It lets the router scale capability through multiple models while keeping policy, traces, cost boundaries, and public API shape explicit.

Router Memory and Learning: Adaptation Is Not an Algorithm

Session-aware and learning-related behavior should not be hidden inside decision.algorithm. In the clean vLLM SR design, this belongs to Router Learning.

The distinction matters. A decision says what is allowed. A selection or looper algorithm produces a base result. Router Learning then asks whether the system should adapt that result from runtime experience, and whether switching is safe in the current session or conversation.

Learning can improve a route inside policy. It must not become a second, invisible policy system.

Timeline showing tool lock, model lane A, KV cache, idle drift boundary, and possible switch to model B.
Router Learning protects continuity and adapts choices after the base route is selected.

The runtime order is fixed:

flowchart LR A["Matched decision"] --> B["Base selector or looper"] B --> C["Protection preflight"] C --> D["Adaptation proposal"] D --> E["Protection switch guard"] E --> F["Final model/path"] F --> G["Learning headers"] F --> H["Replay diagnostics"] H --> I["Outcomes"] I --> J["Experience update"] H --> K["Offline recipe learning"]
Component Question it answers What it may change What it must not change
Recipe policy Which route is allowed for this request? Matched decision and candidate boundary Runtime experience
Base selector / looper What is the policy-approved base model or collaboration path? Base result Decision eligibility
Adaptation Does experience suggest a better candidate inside the allowed boundary? Proposal model Signals, thresholds, decisions, priorities, modelRefs
Protection Is exploration or switching safe now? Hold, allow, or rescue final model Model quality scores or policy matching
Replay and outcomes What happened, and how did it perform? Experience and offline evidence Live recipe policy
Offline recipe learning What recipe patch should humans review? Candidate recipe patches and seed packs Production behavior without review

The public learning concepts are intentionally small:

Concept Public surface Meaning
Adaptation global.router.learning.adaptation Online model-choice learning from runtime experience
Protection global.router.learning.protection Session and conversation stability control
Decision control routing.decisions[].adaptations Apply, observe, or bypass learning for the matched decision
Candidate boundary decision, tier, or global How far adaptation may search
Outcome /v1/router/outcomes linked to replay Typed feedback for model, route, policy, stability, provider, or router
Replay x-vsr-replay-id and durable record Evidence log for diagnostics and offline learning

Adaptation’s day-0 strategy is routing_sampling. It scores candidates from local experience: quality seed, good-fit outcomes, underpowered outcomes, overprovisioned outcomes, failures, latency evidence, cache reuse, effective input cost, and reliability. The default candidate set is decision, which means adaptation may only choose among the matched decision’s modelRefs. Broader scopes such as tier and global are more powerful, but they need stronger guards.

Protection is the session-aware half. It has a preflight guard and a switch guard. Preflight suppresses stochastic sampling during tool loops, protocol-sensitive continuations, or routine continuation steps. The switch guard decides whether to hold the current model, allow the proposal, or perform a bounded rescue switch. The simplified rule is:

switch if proposal_gain >= switch_margin + stability_weight * switch_cost

The switch cost can include cache warmth, handoff cost, tool-loop state, provider state, turn count, and switch history. Session-aware routing is therefore not sticky sessions. It is controlled continuity. The router keeps a model when switching is unsafe or not worth it, and it can switch again at idle boundaries, decision drift, or rescue conditions.

Router memory layer Hot path? Purpose
Protection state Yes Protected model, identity scope, turn count, cache/tool-loop evidence, switch history
Model experience Yes Quality, overuse, reliability, latency, cache, and cost evidence for adaptation
Router Replay Write from hot path, read offline Durable route, response, outcome, and learning diagnostics
Offline recipe artifacts No Findings, candidate recipes, recipe patches, and optional experience seed packs

Sensitive routes can bypass learning entirely:

routing:
  decisions:
    - name: local_privacy_policy
      modelRefs:
        - model: local-private-model
      adaptations:
        mode: bypass

That boundary is the contract. Learning can improve choices inside recipe policy. It cannot silently rewrite the recipe, add a new privacy exception, change a decision priority, or mutate modelRefs on the request path. Offline recipe learning can propose those changes as reviewable artifacts, but live routing remains governed by the recipe.

Plugins: Behavior Belongs to the Route

After a route is selected, the request still may not be ready for the model. It may need cache, memory, RAG, tool filtering, request parameter caps, prompt mutation, response verification, fast policy response, image generation, or replay.

These behaviors should not be global decoration. Privacy routes may need to bypass cache. High-risk factual routes may require verification. Agentic routes may need tool boundaries. Low-risk summarization may need only replay. Plugins are therefore route-scoped.

flowchart LR A["Matched decision"] --> B["Selection / Looper / Learning"] B --> C["Route-scoped plugins"] C --> D["Request mutation"] D --> E["Backend or Looper call"] E --> F["Response plugins"] F --> G["Replay / audit"] C --> C1["Semantic cache"] C --> C2["RAG"] C --> C3["Memory"] C --> C4["Tool selection"] F --> F1["Hallucination check"] F --> F2["Response safety"]
Route connected to Cache, Memory, RAG, Tools, Safety, Replay, with plugin paths after route selection.
Plugins are route-scoped behavior, not hidden global middleware.
Plugin What it changes Why route scope matters
Semantic cache Reads or writes semantic cache with threshold, TTL, and quota Privacy and category boundaries change cache policy
Memory Retrieves or stores conversational/user memory Memory scope must respect tenant, privacy, and session policy
RAG Adds retrieval from vector DB, MCP, file search, or external API Knowledge access is a capability path
Tools Passes through, filters, blocks, or dynamically retrieves tools Tool exposure depends on route, user, risk, and session
Tool selection Adds or filters tools from a tool database or request subset Ranking tools is a route decision, not an application afterthought
Request params Caps or rewrites temperature, max tokens, tools, or response format High-risk routes need tighter request shape
System prompt Injects route-specific instructions Policy must reach model behavior
Header mutation Adds provider, cluster, audit, or routing headers Gateway and backend need explicit context
Fast response Returns without model call Blocks, denies, quotas, or unsupported paths
Response jailbreak Checks response-side safety Request-only scanning misses output failures
Hallucination Warns, blocks, or rewrites unsupported claims High-risk factual routes need response governance
Router replay Records request, evidence, decision, model/path, plugins, and response Debugging and learning need durable artifacts
Image generation Bridges modality-aware routes to image backends Image routes have different models and policies

Some plugins are worth reading as miniature subsystems.

Plugin subsystem Important runtime detail Failure it prevents
semantic-cache Can override similarity threshold and TTL per decision; personalized RAG or memory routes can skip cache writes Reusing private or personalized answers as generic cache hits
memory Retrieves with limit and similarity threshold, supports auto-store, hybrid search, and reflection; injected after system/developer messages as a separate user-context message Blending memory into hidden prompt text that operators cannot reason about
rag Supports Milvus, Qdrant, external API, MCP, OpenAI file search, and hybrid modes; injection can be tool-role or system-prompt Treating all knowledge access as one opaque retrieval step
tools Supports passthrough, filtered, none, allow/block, semantic selection, and dynamic retrieval modes such as semantic_only and hybrid_history Letting an agent see tools just because the client sent them
request_params Can block or strip request parameters, cap max_tokens and n, and optionally strip unknown OpenAI fields High-risk paths inheriting unsafe sampling or output shape
response_jailbreak and hallucination Run after the model response and can warn, block, or rewrite warning metadata Assuming request-time safety checks are enough
router_replay Captures bounded request, response, tool trace, route, and plugin evidence Losing the evidence needed for debugging, evaluation, and learning

A route is better understood as an execution contract. The model is one part of it. The route also carries constraints, tools, knowledge, verification, memory, and evidence.

The Hot Path: Header, Body, Route, Response

The conceptual architecture only becomes convincing when it touches the request path. In vLLM SR, the hot path is shaped around a simple constraint: the router must see enough context to make a semantic decision, but it should avoid turning every request into an expensive full parse and full detector run.

The route lifecycle inside the gateway path looks like this.

flowchart LR A["Headers
id, path, protocol, identity"] --> B["Body
fast extraction"] B --> C{"Mutation
needed?"} C -- "no" --> D["Signals
projections
decisions"] C -- "yes" --> E["Full parse
OpenAI / Responses / Anthropic"] E --> D D --> F["Model routing
explicit, auto, looper slug"] F --> G["Request preparation
memory, RAG, tools, params, prompt"] G --> H["Backend or Looper
execution"] H --> I["Response phase
normalize, verify, cache, replay"]
Phase What the router extracts or changes Why it is on the hot path
Header phase Request id, method/path, client protocol, identity headers, streaming expectation, replay/model/response API paths, skip-processing opt-out Routing needs tenant, protocol, and control metadata before reading the full body
Body phase Fast request state first; full OpenAI-compatible parse only when mutation is needed Most routing decisions should not pay unnecessary parsing and mutation cost
Pre-routing Response API translation when needed, validation, signal dispatch, projection application, decision match, algorithm or looper preflight Semantic policy must happen before backend selection
Model routing Explicit model, auto model names, direct looper slugs such as Fusion or Flow, Anthropic provider routing, alias resolution, provider profile/auth, reasoning mode Logical model names must resolve into real provider and backend behavior
Request preparation System prompt, memory, RAG, request params, tools, tool selection, route headers, trace headers The selected route becomes concrete model input and gateway metadata
Response phase Normalize OpenAI, Responses, or Anthropic shapes, report usage, calibrate token estimate, update cache, run response jailbreak/hallucination checks, store memory, emit warnings, record replay The router must observe what happened, not only what it predicted

Protocol compatibility is part of the same story. The router should not hide provider differences behind a vague facade. It should translate them explicitly at the boundary: OpenAI-compatible chat, Responses-style calls, Anthropic-style provider routing, direct looper model slugs, and backend-specific provider profiles all become inputs to one routing engine. Applications keep a familiar API shape, while the infrastructure keeps the differences visible enough to debug.

The ExtProc path matters because it gives semantic policy a boundary-native shape. vLLM SR can receive enough request and response context to make semantic decisions while still returning control to the real gateway data plane.

Envoy Integration: Put Semantic Policy at the Boundary

The Envoy integration shows the intended boundary clearly.

Envoy owns the data plane: TLS, clusters, endpoint health, timeouts, retries, load balancing, and filter chains. vLLM SR should not rebuild those capabilities. It should act as the semantic policy plane: receive request context through External Processing, evaluate the route, and return header/body mutations.

Client Request to Envoy Gateway, semantic policy path to Semantic Router, and main traffic path to Backend Model Clusters.
Envoy keeps the main traffic path. vLLM SR returns semantic policy decisions and mutations.
Component Responsibility
Client Sends OpenAI-compatible or provider-compatible requests
Envoy Handles network path, clusters, TLS, timeout, health, retry, and load balancing
ExtProc bridge Sends request context and receives header/body mutations
vLLM SR Extracts evidence, matches policy, selects or orchestrates, applies learning and plugins
Backend clusters Serve model traffic after the semantic route is decided

The adoption advantage is large. Applications do not need a new private API just to benefit from semantic routing. They can keep calling familiar model APIs while infrastructure maps logical model names to small models, frontier models, RAG paths, verifier paths, Looper collaboration paths, workflows, or private hardware lanes.

Protocol work such as SIRP and multi-provider inference API matters for the same reason. Semantic routing should strengthen the control plane without forcing every application team into a custom gateway dialect.

Model Design: Model Name Is the Wrong Primitive

The router cannot make production-grade decisions from model names alone.

gpt-4, qwen3-32b, claude-opus, or local-private identifies an endpoint or alias. It does not describe reasoning ability, coding strength, tool behavior, vision support, context window, latency distribution, cost, hardware path, privacy boundary, or observed failure modes.

Request flows to Lexical, Embedding, LoRA, and MLP Selector models, then Calibration, Decision, Small Model, RAG, and Frontier Model with Feedback.
The router needs calibrated model metadata, not just endpoint strings.
Model-card field Examples Routing implication
Capability reasoning, coding, vision, tool use, image generation, verifier, embedding Determines which routes can legally include the model
Economics price, quality score, cost weight, expected output length Feeds cost-quality optimization
Latency TTFT, TPOT, p50/p95/p99, warm/cold behavior Feeds SLO-aware selection
Context context window, compression support, long-context stability Drives context-aware routing and token-budget routing
Hardware CUDA, ROCm, XPU, CPU, quantization, engine family Connects logical model to physical pool
Policy provider profile, tenant allowlist, data boundary, reasoning family Prevents unsafe or unauthorized selection
Feedback replay success, failure type, verifier disagreement, user feedback Supports learning and recalibration

Users can ask for auto. The system cannot treat auto as a magic endpoint. Internally, it must expand into model cards, candidate sets, route policy, Looper eligibility, learning boundaries, and execution paths.

Bindings: Fast ML Without Turning the Router Into a Model Server

vLLM SR’s control plane is written around Go because gateway integration, configuration, request mutation, response processing, Envoy ExtProc, and Kubernetes-style infrastructure fit Go well. But routing also has ML hot paths: embeddings, classification, modality detection, LoRA classification, and MLP selectors.

The binding layer keeps those concerns separated.

Go Router Service connected through FFI Boundary to Candle Backend, ONNX Backend, and Stub Backend capability matrix.
Native bindings expose capability explicitly instead of letting deployments guess.
Native surface Role Capability shape
candle-binding Rust/Candle high-performance ML path Unified batch classification, LoRA classification, batched embeddings, multimodal embeddings, modality routing, MLP selector
ml-binding Rust helpers for classical ML selector artifacts KNN, KMeans, and SVM-style selector support where trained artifacts exist
nlp-binding Rust lexical routing helpers BM25, n-gram, and deterministic lexical classifiers for low-latency evidence
ONNX backend Portable runtime path Batched embedding in the current public capability contract
Stub backend Minimal or unsupported build path Explicit capability absence for fallback, tests, and non-native builds

The rule is not “everything must be native.” The rule is capability must be explicit. If a deployment lacks a native classifier, the router should know. If ONNX supports only part of the contract, policy should not pretend otherwise. If backend lifecycle needs reset boundaries, the router should expose that instead of hiding it. Intel/OpenVINO-oriented paths can be valuable deployment options, but they belong in the hardware/runtime discussion unless they are part of the same advertised native capability contract.

Hardware Is a Routing Variable

Hardware support is often described as a compatibility matrix. For semantic routing, it is more than that.

Different hardware paths imply different latency, cost, memory behavior, kernel availability, quantization support, context-window economics, privacy boundary, and energy curve. A router that ignores hardware is only doing half the job.

Router connected to latency, cost, context, privacy, and hardware paths CUDA, ROCm, XPU, CPU.
Hardware-aware routing connects semantic constraints to physical execution.
Platform path What matters to routing Example route
NVIDIA CUDA Mature high-throughput vLLM serving, CUDA kernels, quantized paths, broad accelerator availability Default high-performance pool, frontier or private data-center path
AMD ROCm First-class non-CUDA vLLM platform direction, MI300/MI350-class deployments, AITER kernels and attention paths Cost/performance diversification, ROCm production pool, AMD Developer Cloud validation path
Intel XPU SYCL/DPC++, oneDNN, XPU kernels, OpenVINO-oriented portability and optimization paths Enterprise accelerator lane, private infrastructure, CPU/XPU hybrid deployment
CPU / local Intel/AMD x86, ARM AArch64, Apple silicon, edge and offline fallback PII-sensitive, low-throughput, local-only, or cost-minimal workloads
KV / context Prefix retention, warm state, prefill/decode balance, context-window pressure, bytes-per-token drift Session-aware protection, long-context pools, token-budget-aware routing

The acceleration story is not one-size-fits-all. CUDA gives the broadest default serving path. ROCm/AITER makes AMD pools a serious production option rather than a compatibility afterthought. Intel XPU and OpenVINO-style paths matter for enterprises that already own Intel-heavy infrastructure or need CPU/XPU portability. CPU and local paths remain important because privacy, availability, and cost sometimes beat raw throughput.

WRP is the right mental model here. Workload is semantic. Pool is physical. Router translates between them.

flowchart LR W["Workload
intent, risk, context, modality"] --> R["Router
policy + selection + learning"] R --> P["Pool
models, queues, cache, hardware"] P --> R R --> O["Outcome
quality, latency, cost, safety"] O --> W O --> R

A future router should know when CUDA is overloaded, when ROCm is cost-effective, when XPU is sufficient, when CPU/local is preferable for privacy, and when preserving KV cache is worth more than reselecting a theoretically better model.

Observability: The Router Must Leave Evidence Behind

If the system makes semantic decisions, operators need to see those decisions.

Pipeline with Request, Signal Trace, Decision Trace, Route, Response, Headers, Replay, and Audit Record.
Traces and replay turn routing from hidden magic into debuggable infrastructure.
Trace surface It explains
Signal trace Which signals ran, what matched, and what raw values or confidence appeared
Projection trace How evidence became derived route facts
Decision trace Which policy matched and why it outranked alternatives
Selection trace Which candidates were considered and which model won
Looper trace Which panel, rounds, workers, judge, or synthesis path executed
Learning trace Base model, proposal model, final model, protection action, adaptation reason, cache and switch evidence
Plugin trace Which cache, RAG, memory, tool, safety, verification, and replay behaviors executed
Header trace What x-vsr-* metadata went to the gateway or client
Replay record The durable request/response/decision/model/path artifact for debugging and evaluation

Replay is not only for dashboards. It is the second control plane. The first control plane makes the live decision; replay preserves enough evidence to debug that decision, compare it against alternatives, attach outcomes, and produce offline recipe patches.

If the router records what it saw, what it decided, what happened, and how users, agents, verifiers, or evals responded, the next generation of selectors, thresholds, learning strategies, and recipes can improve.

Without traces, the router is another opaque model. With traces, it becomes a system component.

Evaluation: The Router Is a Frontier

The wrong way to evaluate a router is to count features. A router with many knobs can still be bad. The right question is whether it improves the frontier between quality, cost, latency, safety, privacy, reliability, and hardware efficiency.

Cost quality frontier with Small, RAG, Frontier, Router, and Waste points.
Routing intelligence should move workloads toward the efficient frontier.
Evaluation axis What should improve
Quality-cost frontier Same or better quality with lower model spend
Latency frontier Better SLO compliance without blindly choosing weak models
Safety and privacy Better handling of PII, jailbreak, tool exposure, and local/private paths
Factuality More grounded answers through RAG and verification where needed
Collaboration value Better output from Fusion, ReMoM, Flow, or Ratings than single-model baselines
Session stability Fewer broken tool loops and fewer harmful model switches
Fleet efficiency Better queue, cache, hardware, and pool utilization
Debuggability More replayable and explainable decisions
Learning loop Better calibration from traces, outcomes, and offline evals

Router evaluation, fleet simulation, replay traces, RouterArena-style comparison, Looper evals, and offline recipe learning matter because semantic routing is infrastructure. It has to be measured like infrastructure.

The Part I Care About Most

The most important design belief in vLLM SR is not any single signal, plugin, model, algorithm, or hardware platform. It is the separation of concerns.

Signals observe. Projections coordinate. Decisions express policy. Selection algorithms choose one model. Looper algorithms scale capability through model collaboration. Router Learning adapts and protects within recipe boundaries. Plugins mutate behavior. Bindings accelerate hot-path ML. Envoy integration places semantic policy at the traffic boundary. Model cards connect logical names to real capabilities. Hardware metadata connects semantic constraints to physical execution. Observability makes every decision accountable.

That separation is what lets the router evolve without turning every new idea into a fork of the hot path.

A new jailbreak detector can become a signal. A new risk formula can become a projection. A new compliance rule can become a decision. A new selector can become a selection algorithm. A new panel or workflow primitive can become a Looper algorithm. A new verifier can become a plugin. A new accelerator lane can become model-card metadata. A new benchmark can become an evaluation loop. A new fleet model can feed pool-aware routing. A new learning strategy can propose better candidates without rewriting policy.

That is why I like the phrase Intelligence Control Plane. Not because the router is always intelligent by itself, but because it gives the system a place to allocate intelligence deliberately.

The first stage of AI infrastructure made intelligence callable.

The next stage has to make intelligence allocatable, explainable, and optimizable.

Calling a model was the first abstraction. Allocating intelligence is the next one.

That is the philosophy of vLLM SR.

Source Trail

This article is based on the vLLM Semantic Router codebase, website research archive, vLLM project blog posts, my earlier essays on LLM routing, and infrastructure references around Envoy, vLLM, ROCm, Intel XPU, and Gateway API inference routing.

Source Why it matters here
vLLM Semantic Router research archive Paper table and system-design throughline
Canonical config and routing contract code Recipe-as-contract, signal/projection/decision surfaces, supported algorithm and plugin catalog
Signal-Decision Driven Architecture Shift from single classification to signal-decision routing
Iris / Athena / Themis release posts Signals, model selection, plugins, memory, replay, AMD ROCm, and release progression
Fusion API and Looper tutorials Multi-model collaboration path: Confidence, Ratings, Fusion, ReMoM, Router Flow
Router Learning docs and proposal Adaptation, protection, memory, replay, outcomes, and offline recipe learning
Session-Aware Agentic Routing Tool-loop continuity, provider state, prefix cache, and safe switch boundaries
Agentic Routing on AMD ROCm AMD ROCm deployment, agentic recipe, dashboard, learning, and replay
ExtProc runtime pipeline notes Header/body/model-routing/response phases, protocol normalization, replay and response-time checks
Envoy External Processing filter Semantic policy path versus main traffic data plane
Native binding capability matrix Candle, ML binding, NLP binding, ONNX, and Stub capability boundaries
vLLM platform documentation CUDA, ROCm, XPU, CPU, and deployment background
AMD ROCm / AITER / vLLM ROCm attention backend AMD acceleration path and ROCm serving ecosystem
Intel XPU kernels / IPEX XPU ecosystem Intel accelerator serving path
Kubernetes Gateway API Inference Extension Standardization direction for inference routing at the gateway layer