The Philosophy of vLLM SR

vLLM Semantic Router is not a smarter model picker. It is a control plane for turning semantic work into explainable capability paths across models, tools, memory, policy, gateways, and hardware.

The easiest way to misunderstand vLLM Semantic Router is to start from the word router.

In a traditional gateway, routing is a forwarding problem. A request arrives, the gateway evaluates policy, picks a backend cluster, and sends the request there. The backend is usually a deterministic service. If it is healthy, the request succeeds. If it fails, the gateway retries, fails over, or returns an error.

LLM systems break that mental model.

The backend is no longer just a service. It is a probabilistic reasoning system. It may need tools. It may need memory. It may need retrieval. It may hallucinate. It may require a verifier. It may run on a GPU pool whose KV cache is warm for one session and cold for another. It may be cheap for short requests and expensive for long-context ones. It may be allowed for one tenant and forbidden for another.

A request is no longer only traffic. It is semantic work.

Routing is the moment where semantic intent becomes infrastructure placement.

That is the starting point of vLLM SR.

The project is not trying to build a prettier model selector. The deeper idea is to turn every LLM request into an explainable capability path: which evidence was observed, which derived facts were produced, which policy matched, which selection or collaboration algorithm ran, which plugins were attached, which backend and hardware path was used, and what trace was left behind.

Request flows into signals, projections, decisions, and actions including model, verify, and tools. — The output of vLLM SR is not just a model name. It is a governed capability path.

I want to walk through that design as a system, not as a feature list. We will keep one running example in view, then open the layers one by one: signals, projections, decisions, selection algorithms, looper algorithms, router memory and learning, plugins, Envoy integration, model design, native bindings, hardware support, observability, and evaluation.

A Request Walks Into the Router

Imagine an enterprise AI gateway receives this request:

Analyze this customer contract, compare it with our internal policy, and draft a response. It may contain personal information. Use tools if needed.

If we look only at surface text, this is a contract-analysis request. If we look only at token count, it may be a long-context request. If we look at risk, it touches PII, internal policy, tool access, and factual correctness. If we look at infrastructure, it depends on tenant permissions, pool health, cache warmth, context-window capacity, and model cost.

A naive model picker asks one question:

Which model should answer this prompt?

vLLM SR asks a better question:

What must the system know before it is allowed to spend intelligence?

flowchart LR A["Request
contract + policy + PII"] --> B["Signals
observe evidence"] B --> C["Projections
derive routing facts"] C --> D["Decisions
match policy"] D --> E["Selection Algorithms
choose one model"] D --> L["Looper Algorithms
choose collaboration path"] E --> M["Router Learning
adapt + protect"] L --> M M --> F["Plugins
attach behavior"] F --> G["Gateway / Backend
execute"] G --> H["Trace / Replay
learn"]

The split is intentional. Each layer owns a different kind of reasoning.

Layer	It owns	It should not own
Signals	Observing facts about the request, user, session, content, and environment	Final policy or model choice
Projections	Turning noisy evidence into stable routing facts	Gateway mutation or backend calls
Decisions	Expressing auditable route policy	Low-level selection or collaboration execution
Selection algorithms	Choosing one model from a policy-approved candidate set	Rewriting route eligibility
Looper algorithms	Selecting and running a multi-model collaboration path	Pretending to be a single-model selector
Router Learning	Adapting model choice and protecting session continuity after base selection	Mutating recipe policy on the request path
Plugins	Route-scoped behavior such as cache, RAG, memory, tools, verification, replay	Global hidden middleware
Bindings	Fast native ML paths for hot routing components	Policy semantics
Gateway integration	Putting semantic policy at the network boundary	Replacing the gateway data plane

The core design choice is separation of concerns. If a request goes to a frontier model with RAG, tool filtering, and a response verifier, the answer should not be “because the router said so.” The router should be able to show the chain: evidence, projection, policy, algorithm, learning decision, plugins, mutations, backend path, and replay id.

The Full Request Lifecycle

For the contract request, the router first extracts request context: body, headers, tenant identity, session id, conversation id, tool state, prior response markers, and any existing routing metadata.

It then runs only the signals referenced by the active recipe. Domain, PII, context length, authorization, knowledge-base relevance, fact-check need, tool availability, and jailbreak or prompt-injection signals can run independently. The point is not to run every detector on every request; the point is to gather the evidence that the configured policy actually needs.

Projections then compose that evidence into stable route facts: privacy_sensitive, legal_contract_path, long_context, needs_internal_rag, and verification_required. The decision engine matches those facts against route policy. A matched decision exposes candidate models and, depending on its algorithm, either a single-model selection path or a looper collaboration path. Router Learning may then adjust the proposed model or hold the current model if switching would break session continuity. Finally, plugins attach route-scoped behavior before the gateway executes the main traffic path.

flowchart LR C["Client request"] --> E["Envoy / Gateway
headers + body"] E --> R["vLLM SR
request context"] R --> S["Signals
run required evidence"] S --> P["Projections
derive route facts"] P --> D["Decisions
match policy"] D --> A{"Execution mode"} A -- "single model" --> M["Selector"] A -- "collaboration" --> L["Looper"] M --> G["Router Learning
adapt + protect"] L --> G G --> X["Plugins
cache, RAG, tools, safety"] X --> B["Backend pool
or looper call"] B --> T["Trace / replay
response checks"]

The model name is only the visible end of the route. The deeper value is that the system can explain why this was not a cheap direct-answer path. Privacy, internal knowledge, tool exposure, factuality, and context pressure are different concerns. vLLM SR gives each concern a place in the architecture.

The Recipe Is the System Contract

The next design step is easy to miss: vLLM SR is not primarily configured as a collection of ad hoc callbacks. It is configured as a recipe. That recipe is the contract between application intent, routing policy, backend inventory, plugin behavior, learning boundaries, and gateway mutation.

In production, routing policy tends to spread. One part lands in application code, one part in gateway config, one part in prompt templates, one part in a model registry, one part in a dashboard, and one part in a notebook used by the evaluation team. Once that happens, nobody can answer a simple operational question: “Why did this request take this path?”

vLLM SR’s v0.3-style contract tries to keep that answer in one durable shape.

flowchart LR A["Recipe
RouterConfig"] --> B["Evidence contract
signals"] B --> C["Fact contract
projections"] C --> D["Policy contract
decisions"] D --> E["Execution contract
modelRefs, algorithm, plugins, adaptations"] E --> F["Runtime contract
backend models, provider profiles, router options"]

Contract surface	What it owns	Why it belongs in the recipe
Global runtime	Semantic cache, memory, replay, learning, observability, API listeners, startup behavior	These features are infrastructure state, not per-request improvisation
`routing.signals`	The evidence families that are allowed to run	A decision should only depend on named, reviewable evidence
`routing.projections`	Partitions, scores, and mappings derived from signals	Policy should consume stable facts, not raw detector noise
`routing.decisions`	Named route policies with priority, tier, rules, output contract, modelRefs, algorithms, adaptations, plugins, and emits	A route is an auditable capability bundle
`decision.modelRefs`	The approved candidate boundary for a decision	Selection and learning cannot escape policy by discovering a better-looking model elsewhere
`decision.algorithm`	The execution strategy inside the matched decision	The recipe distinguishes one-model selection from collaboration paths such as Fusion or Flow
`decision.plugins`	Cache, memory, RAG, tools, prompt mutation, response checks, replay, image generation, or fast response	Behavior follows the matched route instead of hiding in global middleware
`decision.adaptations`	Apply, observe, or bypass learning for this policy	Sensitive routes can be protected from online exploration
Backend models	Provider profiles, vLLM endpoints, image backends, model metadata, aliases, default model	Logical routing must resolve to real serving pools and provider contracts
Router options	Auto model names, body streaming mode, route-cache clearing, skip-processing controls	Gateway behavior and compatibility are explicit operational choices

That is the operational difference between a router that happens to make good choices and a router that can be trusted. A recipe is diffable. It can be validated. It can be replayed against stored traffic. It can be promoted across environments. It can be patched by offline learning without letting the hot path rewrite production policy.

The rule is conservative on purpose: runtime intelligence may propose better choices, but the recipe defines the legal search space. That is what keeps “adaptive routing” from becoming hidden policy mutation.

Research as Design Pressure

The paper shelf behind vLLM SR is not a citation dump. Each research direction creates pressure on one part of the router. Read the table as a map from problem to architecture.

Work	Problem	Core idea	Architectural pressure
vLLM Semantic Router: Signal Driven Decision Routing	Single-classification routing cannot express cost, privacy, latency, safety, and multimodal constraints together	Compose heterogeneous signals into route decisions and route-scoped behavior	Establishes signals, projections, decisions, algorithms, and plugins as separate layers
Workload-Router-Pool	Routing papers often ignore serving pools, while fleet papers often ignore workload semantics	Close the loop among workload evidence, router policy, and physical pool state	Makes queue, cache, hardware, and pool feedback routing inputs
When to Reason	Reasoning models are expensive and not always useful	Detect reasoning need and apply reasoning only when beneficial	Creates complexity signals, reasoning policy, and model-card reasoning capability
Category-Aware Semantic Caching	One similarity threshold is unsafe across heterogeneous workloads	Cache thresholds, TTLs, and quotas vary by category	Makes semantic cache a route-scoped plugin, not a global middleware
Outcome-Aware Tool Selection	Tool selection cannot rely only on semantic similarity	Refine tool embeddings offline using outcome evidence under latency budgets	Connects tool signals, tool-selection plugins, and replay/outcome data
98x Faster Routing Without Dedicated GPU	A hot-path router cannot take seconds to route	Use prompt compression, Flash Attention, and near-streaming body processing	Pushes native bindings and fast signal runtime
Adaptive VLM Routing	Multimodal computer-use steps have very different difficulty	Estimate visual/action difficulty and route cheap or strong VLMs accordingly	Adds modality, visual difficulty, and multimodal safety signals
Visual Confused Deputy	Computer-use agents can be attacked through visual perception failures	Check click target and action reasoning independently	Pulls visual safety and action-boundary validation into routing
Knowledge Access Beats Model Size	Correct knowledge access can beat a larger model	Use memory and retrieval-grounded routing to recover quality with smaller models	Promotes RAG, memory, KB signals, and knowledge paths to first-class capabilities
Fast and Faithful RAG Verification	Retrieval does not guarantee faithful answers	Verify long-document RAG responses in real time	Motivates hallucination and response-verification plugins
inference-fleet-sim	Model choice depends on queueing, TTFT, and fleet capacity	Use queueing-grounded simulation for multi-pool planning	Connects router policy to fleet simulation
FleetOpt	Minimum-cost pools depend on workload CDF and P99 targets	Derive pool boundaries and deploy them through compress-and-route	Connects token budget, pool boundary, and cost-aware routing
1/W Law	Context window size changes energy efficiency and memory pressure	Analyze context-length routing topology and tokens-per-watt	Makes context routing and long-context pools central
Conflict-Free Policy Languages	Probabilistic ML predicates can co-fire silently	Detect and prevent policy conflicts in the DSL	Motivates declarative decisions, priority, confidence, and auditability
Cross-Layer Policy Compilation	Policy should not live as scattered gateway, workflow, K8s, and agent code	Compile one declarative source into multiple execution layers	Points toward policy-as-code and cross-layer verification
Token-Budget-Aware Pool Routing	Token budget affects KV cache, pool choice, and failure risk	Estimate total token budget and route to short or long pools	Connects context, bytes-per-token, pools, and failure avoidance
SIRP and Multi-Provider API	Semantic routing should not require private application protocols	Standardize semantic inference routing and multi-provider surfaces	Reinforces OpenAI-compatible, gateway-compatible control planes

The pressure from the research shelf is clear: a router cannot be one classifier. If it were, every new paper would become another special case. In vLLM SR, a paper can become a signal, projection, decision primitive, selection algorithm, looper algorithm, plugin, learning signal, model-card field, or pool feedback loop.

Signals: Evidence Before Judgment

Signals are deliberately humble. A signal answers one factual question. It does not choose the model. It does not select a plugin. It does not decide the route.

For the contract example, the router may need to know whether the request is legal or enterprise-related, whether it contains PII, whether the user is authorized for premium or private models, whether it needs internal knowledge, whether tools are available, whether the context is too large for a short-context backend, and whether the response needs verification.

Request fan-out to Keyword, Context, Authz, Domain, PII, Jailbreak, Modality, Feedback, and Signal Results. — Signals are sensors, not policy. They produce reusable evidence.

Signal family	What it observes	Why it matters
Keyword and lexical	Explicit words, regex, BM25, n-grams, fuzzy anchors	Fast deterministic anchors for compliance, product names, incidents, and explicit task classes
Context and structure	Token count, context pressure, prompt shape, JSON/schema/workflow form	Separates short direct paths from long-context, compression, and workflow paths
Authz and tenant	User identity, groups, role bindings, tiers, allowlists	Prevents silent access to premium, private, tenant-specific, or sensitive paths
Conversation and session	Turn count, tool calls, tool results, active loop state, previous response markers	Protects continuity and avoids switching during non-portable state
Domain and KB	Task domain, internal knowledge source, KB relevance	Chooses domain models, RAG sources, and policy constraints
Embedding and preference	Semantic similarity, user/task preference, model-description fit	Handles soft semantic routing where exact keywords are insufficient
Complexity and repair	Difficulty, uncertainty, repeated dissatisfaction, retry and repair signals	Drives reasoning activation, escalation, and learning
Fact-check and safety	Factuality need, jailbreak, prompt injection, PII, response risk	Triggers RAG, verifier, guardrail, local/private route, or block path
Modality and event	Text, image, image-generation intent, SRE/SOC event shape	Routes to VLM, image, operational, or event-specific paths

The maintained runtime surface currently exposes eighteen base signal families: authz, complexity, context, conversation, domain, embedding, fact_check, jailbreak, keyword, language, modality, pii, preference, reask, structure, kb, user_feedback, and event. projection is also a rule condition type, but it is not a raw sensor. It is the decision-visible output of the projection layer.

The runtime behavior is as important as the list of names.

Runtime behavior	Design implication
Used-signal analysis	The classifier builds a map from decisions and projections, then runs only the signal families needed by the active recipe unless forced evaluation is enabled
Concurrent dispatch	Independent signal evaluators can run in parallel, keeping the route path from becoming a serial detector chain
Readiness checks	A configured signal is only useful if its model, rule, or backend dependency is ready; the router can avoid pretending a missing detector produced evidence
Request-scoped caches	Expensive intermediate work, such as image embeddings, can be shared by complexity and embedding signals during one request
Trace preservation	Signal output is retained as evidence for projections, decision traces, replay, and learning diagnostics

The payoff is evidence reuse. A pii signal can influence privacy policy, cache policy, provider selection, and audit policy. A conversation signal can influence router protection, tool filtering, and replay. A context signal can influence long-context pools, compression, and cost-aware selection. If signals directly selected models, that reuse would disappear.

Projections: The Coordination Layer

Signals are often messy. Some are booleans. Some are classifier probabilities. Some are similarity scores. Some are raw metrics such as token count or KB relevance. Production policy should not need to manually combine all of those raw values every time.

Projections turn evidence into stable routing facts.

Input signals flow through a projection layer with linear transforms and thresholds into selected routing bands. — Projections turn raw evidence into policy-readable routing facts.

Raw evidence	Example value	Why a projection helps
`domain=legal`	confidence 0.82	Domain can overlap with finance, support, or security; policy needs a stable partition
`pii=present`	confidence 0.91	Privacy should not depend on one detector alone
`context_tokens`	42K	Token count matters relative to model window and pool state
`fact_check=needed`	confidence 0.76	Factuality should combine with domain and knowledge availability
`authz=premium`	matched	Authorization is a hard gate, not a difficulty signal
`kb=internal_policy`	score 0.68	KB relevance is retrieval evidence, not a complete route

flowchart LR A["Raw signals"] --> B["Risk score"] A --> C["Complexity score"] A --> D["Domain partition"] A --> E["Knowledge need"] B --> F["privacy_sensitive"] C --> G["simple / medium / complex"] D --> H["legal_contract_path"] E --> I["needs_internal_rag"] F --> J["Decision inputs"] G --> J H --> J I --> J

Projection pattern	Logic	Example
Partition	Pick one winner among competing semantic candidates	Choose `legal_contract_path` over `finance_path` when margin is sufficient
Weighted score	Combine booleans, confidence values, raw values, and similarity scores	Compute risk or complexity pressure
Threshold mapping	Convert a continuous score into stable bands	Map complexity into simple, medium, or complex
Multi-emit mapping	Emit several non-exclusive derived facts	Emit both `needs_rag` and `needs_verifier`
Normalization	Put heterogeneous signal scales into comparable space	Feed hybrid selection and confidence ranking

One subtle rule keeps projections clean: decisions read projection outputs, not every intermediate score name. A partition or weighted score can be rich internally, but the policy surface should see facts such as legal_contract_path, risk_high, or long_context_lane.

Projection artifact	Decision-visible?	Trace value
Partition	Usually through the selected output	Shows competing semantic candidates and margin
Score	Not by itself unless mapped or referenced by another projection	Shows weighted input contributions, match and miss values, and normalization behavior
Mapping output	Yes	Shows which threshold band or multi-emit output fired
Boundary distance	Indirectly through confidence	Explains near-miss cases where a score barely crossed or missed a band
Projection trace	Operationally visible	Lets operators debug why raw evidence became a route fact

That is why the projection layer can grow without making decisions unreadable. A risk score might combine PII, jailbreak, tenant tier, KB source, and response verifier need. The decision should not have to carry that formula inline. It should match a named derived fact and leave the math in a traceable coordination layer.

Projections keep the rest of the system readable. Signals remain fine-grained. Decisions remain policy-oriented. Algorithms remain responsible for selection or collaboration. The coordination work lives in between, where it can be traced.

Decisions: Policy Needs a Shape

Once projections produce stable facts, decisions express route policy.

The decision engine is intentionally closer to a boolean circuit than a hidden Python function. Routing policy should be inspectable, diffable, auditable, sortable, and eventually compilable across infrastructure layers.

Evidence enters a boolean decision engine with AND, OR, NOT gates and outputs selected route, candidate models, plugins, and fallback route. — The decision layer turns evidence into auditable route policy.

For the contract request, the policy can be simplified as:

flowchart LR A["authz: premium"] --> D1["AND"] B["privacy_sensitive"] --> D1 C["legal_contract_path"] --> D1 D1 --> D2["AND"] E["needs_internal_rag"] --> D2 F["verification_required"] --> D2 D2 --> R["enterprise_contract_path"] R --> M["candidate models"] R --> P["RAG + tool filter + verifier + replay"]

A route is not just an endpoint. It is a policy-approved capability bundle: candidate models, algorithm, plugins, gateway mutations, retention behavior, fallback, and diagnostics.

Decision element	Meaning	Why it matters
Leaf	References a signal or projection	Keeps policy connected to explicit evidence
AND	Requires all children to match	Expresses strict gates such as auth plus privacy
OR	Accepts any child	Lets multiple evidence patterns imply the same path
NOT	Excludes one child	Useful for fallback, denial, or bypass policy
Priority and tier	Sort matched decisions	Prevents low-risk paths from shadowing high-risk paths
Confidence	Carries evidence strength	Allows ranking without hiding why
Emits	Produces route metadata	Connects policy to cache, learning, plugins, and gateway headers

In the current contract, a decision is a route object, not only a rule tree.

Decision field	What it contributes to the route
`name` and `description`	Human-readable identity for traces, dashboards, replay, and review
`priority` and `tier`	Deterministic ordering when multiple policies match
`output_contract`	Declares the expected API or response shape for the route
`rules`	Recursive Boolean tree over signals and projection outputs
`modelRefs`	Candidate boundary for selectors, loopers, and learning
`algorithm`	Execution strategy after the policy match
`adaptations`	Per-decision learning mode: apply, observe, or bypass
`plugins`	Route-scoped behavior attached after matching
`candidateIterations`	Declarative candidate loops used by richer selection or workflow constructs
`emits`	Declarative side effects such as retention behavior

Decision selection is intentionally deterministic. If matched decisions use tiers, lower tier values win first, then confidence, priority, and name. Without tiers, strategy=confidence ranks by confidence before priority; the default strategy ranks priority before confidence. Even fallback is explicit: an empty AND can be a catch-all route, but it has zero confidence so it does not outrank real evidence-backed decisions.

Retention emits make policy visible beyond the immediate model choice. A decision can express drop to skip semantic-cache writes, ttl_turns to bound cache lifetime, keep_current_model to protect session continuity, or prefer_prefix_retention to tell the serving pool that KV/prefix reuse matters.

That shift from classification-style routing to signal-decision routing is not cosmetic. Classification is useful for demos. Decision architecture is what production policy needs.

Selection Algorithms: Choosing One Model After Policy

Many router designs start with the selector: embeddings, MLPs, bandits, or an LLM-as-router that picks a model directly. That easily turns one algorithm into a sink for every concern: semantic fit, cost, latency, safety, authorization, session continuity, and provider policy.

vLLM SR keeps the order stricter. The decision matches first. Then a selection algorithm chooses one model from the policy-approved candidate set.

Route enters a selector with Static, RouterDC, Hybrid, and Latency algorithms, then outputs one model or multi model. — Selection algorithms choose inside a matched decision. They do not define route eligibility.

Selection algorithm	Catalog tier	Core idea	Best use
Static	Supported	Pick the configured order or fixed score	Deterministic fallback, explicit business policy, early rollout
RouterDC	Supported	Match query embedding to model-description embeddings	Query-to-capability matching when model cards are meaningful
Hybrid	Supported	Combine semantic fit, quality, latency, cost, cache affinity, and other scores	Production tradeoffs where no single signal should dominate
Multi-factor	Supported	Filter by SLO, then score quality, latency, cost, and load	Fleet-aware route selection
Latency-aware	Supported	Prefer candidates using TTFT/TPOT percentile metrics	SLO-sensitive paths
AutoMix	Experimental	Start cheaper and escalate using confidence or verification	Cost-saving cascades where repair is acceptable
KNN	Experimental	Route by nearest labeled examples	Interpretable example-based routing
KMeans	Experimental	Route by cluster membership	Coarse workload segmentation
SVM	Experimental	Route by learned decision boundary	Fast offline-trained classification
MLP	Experimental	Non-linear neural selector through native ML artifacts	Mature deployments with trained artifacts

The boundary is strict: selection algorithms choose one model from modelRefs. They do not run a panel, coordinate a workflow, or hold a session model. Multi-model collaboration belongs to Looper. Session and conversation stability belongs to Router Learning.

Looper Algorithms: Selecting a Model Collaboration Path

Looper is important enough to discuss separately because it changes what the router is selecting.

A selection algorithm chooses one model. A Looper algorithm chooses a model collaboration path: a bounded execution pattern involving escalation, fan-out, panel judgment, multi-round reasoning, or micro-agent workflows. This is usually the path for scaling model capability without exposing a new application protocol. The client may still call one logical model name, but the router executes a structured collaboration behind that name.

flowchart LR A["Matched decision"] --> B{"Execution choice"} B -- "selection" --> C["One model
backend path"] B -- "looper" --> D["Collaboration path"] D --> E["Sequential
Confidence"] D --> F["Parallel
Ratings / Fusion"] D --> G["Multi-round
ReMoM"] D --> H["Workflow
Router Flow"] E --> I["One API response
headers + replay"] F --> I G --> I H --> I

Looper algorithm	Catalog tier	Collaboration pattern	What it scales	How to read it
Confidence	Supported	Try smaller or cheaper models first, evaluate confidence, escalate when confidence is too low	Cost-efficient quality	A sequential small-to-large cascade with explicit stopping conditions
Ratings	Supported	Run multiple candidates concurrently up to a cap and aggregate with rating-aware logic	Ensemble breadth under cost control	A bounded fan-out path for evaluation, A/B, and ensemble-style responses
ReMoM	Supported	Run multi-round parallel reasoning with a breadth schedule and final synthesis	Test-time reasoning capacity	A breadth-controlled reasoning tree across models
Fusion	Experimental	Run an analysis panel, ask a judge for structured analysis, then synthesize one final answer	Independent model perspectives	A panel-judge-synthesis path for tasks where disagreement and blind spots matter
Router Flow / Workflows	Experimental	Execute a static or planner-generated micro-agent workflow behind one model name	Decomposition, verification, tool-aware work	A bounded agent workflow where workers are constrained by decision `modelRefs`

These are not just “more algorithms.” They are the router’s answer to capability scaling.

Confidence keeps the cost curve low by starting small and escalating only when confidence is insufficient. It can use average log probability, margin, a hybrid score, self-verification, or an AutoMix-style entailment verifier. The question is not “small or large model?” It is “is the current answer good enough to stop?”

flowchart LR A["Matched decision
modelRefs"] --> B["Start with cheaper
or smaller model"] B --> C{"Confidence
enough?"} C -- "yes" --> D["Return answer"] C -- "no" --> E["Escalate to stronger
candidate"] E --> F{"Verifier, margin,
or logprob passes?"} F -- "yes" --> D F -- "no" --> G["Next candidate
or fallback"] G --> D

Ratings uses concurrency as a controlled resource. Instead of one winner, several candidates participate, bounded by max_concurrent, and the router aggregates successful responses. This is useful when operators want ensemble behavior or live comparison without letting fan-out become unbounded.

flowchart LR A["Matched decision
modelRefs"] --> B["Bounded fan-out
max_concurrent"] B --> C["Model A
response"] B --> D["Model B
response"] B --> E["Model C
response"] C --> F["Rating-aware
aggregation"] D --> F E --> F F --> G["One API response
plus trace"]

Fusion is the clean panel pattern. It sends the request to analysis models, asks a judge to identify consensus, contradictions, partial coverage, and blind spots, and then synthesizes a final answer. The important design point is that Fusion policy lives under the matched decision. vllm-sr/auto can decide whether Fusion is warranted; vllm-sr/fusion narrows matching to Fusion-capable decisions instead of silently falling back to ordinary single-model routing.

flowchart LR A["Matched Fusion decision"] --> B["Analysis model A"] A --> C["Analysis model B"] A --> D["Analysis model C"] B --> E["Judge
consensus, conflicts, gaps"] C --> E D --> E E --> F["Synthesis model"] F --> G["Final answer
with panel trace"]

ReMoM is the multi-round version of the same philosophy. It uses a breadth schedule such as [3, 2] or [32, 4], distributes calls across model candidates, compacts intermediate responses when needed, and synthesizes the final answer. This is useful when the value comes from exploration over multiple reasoning paths rather than one panel pass.

flowchart LR A["Matched ReMoM decision"] --> B["Round 1
breadth schedule"] B --> C["Parallel reasoning
across candidates"] C --> D["Compact or select
intermediate outputs"] D --> E["Next round
reduced breadth"] E --> F["Final synthesis"] F --> G["Answer
with round trace"]

Router Flow turns the route into a bounded micro-agent workflow. A static flow can define roles such as thinker, worker, verifier, and final synthesizer. A dynamic flow can ask a planner model to produce a plan, but worker execution remains constrained to the decision’s modelRefs. Tool calls preserve the OpenAI-compatible contract while the router stores enough workflow state to resume the correct worker after tool results return.

flowchart LR A["Matched Flow decision"] --> B["Static flow
or planner output"] B --> C["Thinker
decompose task"] C --> D["Worker
bounded by modelRefs"] D --> E["Tool calls
and tool results"] E --> D D --> F["Verifier
check result"] F --> G["Final synthesizer"] G --> H["OpenAI-compatible
response + workflow trace"]

The Looper layer is the bridge from routing to model collaboration. It lets the router scale capability through multiple models while keeping policy, traces, cost boundaries, and public API shape explicit.

Router Memory and Learning: Adaptation Is Not an Algorithm

Session-aware and learning-related behavior should not be hidden inside decision.algorithm. In the clean vLLM SR design, this belongs to Router Learning.

The distinction matters. A decision says what is allowed. A selection or looper algorithm produces a base result. Router Learning then asks whether the system should adapt that result from runtime experience, and whether switching is safe in the current session or conversation.

Learning can improve a route inside policy. It must not become a second, invisible policy system.

Timeline showing tool lock, model lane A, KV cache, idle drift boundary, and possible switch to model B. — Router Learning protects continuity and adapts choices after the base route is selected.

The runtime order is fixed:

flowchart LR A["Matched decision"] --> B["Base selector or looper"] B --> C["Protection preflight"] C --> D["Adaptation proposal"] D --> E["Protection switch guard"] E --> F["Final model/path"] F --> G["Learning headers"] F --> H["Replay diagnostics"] H --> I["Outcomes"] I --> J["Experience update"] H --> K["Offline recipe learning"]

Component	Question it answers	What it may change	What it must not change
Recipe policy	Which route is allowed for this request?	Matched decision and candidate boundary	Runtime experience
Base selector / looper	What is the policy-approved base model or collaboration path?	Base result	Decision eligibility
Adaptation	Does experience suggest a better candidate inside the allowed boundary?	Proposal model	Signals, thresholds, decisions, priorities, modelRefs
Protection	Is exploration or switching safe now?	Hold, allow, or rescue final model	Model quality scores or policy matching
Replay and outcomes	What happened, and how did it perform?	Experience and offline evidence	Live recipe policy
Offline recipe learning	What recipe patch should humans review?	Candidate recipe patches and seed packs	Production behavior without review

The public learning concepts are intentionally small:

Concept	Public surface	Meaning
Adaptation	`global.router.learning.adaptation`	Online model-choice learning from runtime experience
Protection	`global.router.learning.protection`	Session and conversation stability control
Decision control	`routing.decisions[].adaptations`	Apply, observe, or bypass learning for the matched decision
Candidate boundary	`decision`, `tier`, or `global`	How far adaptation may search
Outcome	`/v1/router/outcomes` linked to replay	Typed feedback for model, route, policy, stability, provider, or router
Replay	`x-vsr-replay-id` and durable record	Evidence log for diagnostics and offline learning

Adaptation’s day-0 strategy is routing_sampling. It scores candidates from local experience: quality seed, good-fit outcomes, underpowered outcomes, overprovisioned outcomes, failures, latency evidence, cache reuse, effective input cost, and reliability. The default candidate set is decision, which means adaptation may only choose among the matched decision’s modelRefs. Broader scopes such as tier and global are more powerful, but they need stronger guards.

Protection is the session-aware half. It has a preflight guard and a switch guard. Preflight suppresses stochastic sampling during tool loops, protocol-sensitive continuations, or routine continuation steps. The switch guard decides whether to hold the current model, allow the proposal, or perform a bounded rescue switch. The simplified rule is:

switch if proposal_gain >= switch_margin + stability_weight * switch_cost

The switch cost can include cache warmth, handoff cost, tool-loop state, provider state, turn count, and switch history. Session-aware routing is therefore not sticky sessions. It is controlled continuity. The router keeps a model when switching is unsafe or not worth it, and it can switch again at idle boundaries, decision drift, or rescue conditions.

Router memory layer	Hot path?	Purpose
Protection state	Yes	Protected model, identity scope, turn count, cache/tool-loop evidence, switch history
Model experience	Yes	Quality, overuse, reliability, latency, cache, and cost evidence for adaptation
Router Replay	Write from hot path, read offline	Durable route, response, outcome, and learning diagnostics
Offline recipe artifacts	No	Findings, candidate recipes, recipe patches, and optional experience seed packs

Sensitive routes can bypass learning entirely:

routing:
  decisions:
    - name: local_privacy_policy
      modelRefs:
        - model: local-private-model
      adaptations:
        mode: bypass

That boundary is the contract. Learning can improve choices inside recipe policy. It cannot silently rewrite the recipe, add a new privacy exception, change a decision priority, or mutate modelRefs on the request path. Offline recipe learning can propose those changes as reviewable artifacts, but live routing remains governed by the recipe.

Plugins: Behavior Belongs to the Route

After a route is selected, the request still may not be ready for the model. It may need cache, memory, RAG, tool filtering, request parameter caps, prompt mutation, response verification, fast policy response, image generation, or replay.

These behaviors should not be global decoration. Privacy routes may need to bypass cache. High-risk factual routes may require verification. Agentic routes may need tool boundaries. Low-risk summarization may need only replay. Plugins are therefore route-scoped.

flowchart LR A["Matched decision"] --> B["Selection / Looper / Learning"] B --> C["Route-scoped plugins"] C --> D["Request mutation"] D --> E["Backend or Looper call"] E --> F["Response plugins"] F --> G["Replay / audit"] C --> C1["Semantic cache"] C --> C2["RAG"] C --> C3["Memory"] C --> C4["Tool selection"] F --> F1["Hallucination check"] F --> F2["Response safety"]

Route connected to Cache, Memory, RAG, Tools, Safety, Replay, with plugin paths after route selection. — Plugins are route-scoped behavior, not hidden global middleware.

Plugin	What it changes	Why route scope matters
Semantic cache	Reads or writes semantic cache with threshold, TTL, and quota	Privacy and category boundaries change cache policy
Memory	Retrieves or stores conversational/user memory	Memory scope must respect tenant, privacy, and session policy
RAG	Adds retrieval from vector DB, MCP, file search, or external API	Knowledge access is a capability path
Tools	Passes through, filters, blocks, or dynamically retrieves tools	Tool exposure depends on route, user, risk, and session
Tool selection	Adds or filters tools from a tool database or request subset	Ranking tools is a route decision, not an application afterthought
Request params	Caps or rewrites temperature, max tokens, tools, or response format	High-risk routes need tighter request shape
System prompt	Injects route-specific instructions	Policy must reach model behavior
Header mutation	Adds provider, cluster, audit, or routing headers	Gateway and backend need explicit context
Fast response	Returns without model call	Blocks, denies, quotas, or unsupported paths
Response jailbreak	Checks response-side safety	Request-only scanning misses output failures
Hallucination	Warns, blocks, or rewrites unsupported claims	High-risk factual routes need response governance
Router replay	Records request, evidence, decision, model/path, plugins, and response	Debugging and learning need durable artifacts
Image generation	Bridges modality-aware routes to image backends	Image routes have different models and policies

Some plugins are worth reading as miniature subsystems.

Plugin subsystem	Important runtime detail	Failure it prevents
`semantic-cache`	Can override similarity threshold and TTL per decision; personalized RAG or memory routes can skip cache writes	Reusing private or personalized answers as generic cache hits
`memory`	Retrieves with limit and similarity threshold, supports auto-store, hybrid search, and reflection; injected after system/developer messages as a separate user-context message	Blending memory into hidden prompt text that operators cannot reason about
`rag`	Supports Milvus, Qdrant, external API, MCP, OpenAI file search, and hybrid modes; injection can be tool-role or system-prompt	Treating all knowledge access as one opaque retrieval step
`tools`	Supports passthrough, filtered, none, allow/block, semantic selection, and dynamic retrieval modes such as `semantic_only` and `hybrid_history`	Letting an agent see tools just because the client sent them
`request_params`	Can block or strip request parameters, cap `max_tokens` and `n`, and optionally strip unknown OpenAI fields	High-risk paths inheriting unsafe sampling or output shape
`response_jailbreak` and `hallucination`	Run after the model response and can warn, block, or rewrite warning metadata	Assuming request-time safety checks are enough
`router_replay`	Captures bounded request, response, tool trace, route, and plugin evidence	Losing the evidence needed for debugging, evaluation, and learning

A route is better understood as an execution contract. The model is one part of it. The route also carries constraints, tools, knowledge, verification, memory, and evidence.

The Hot Path: Header, Body, Route, Response

The conceptual architecture only becomes convincing when it touches the request path. In vLLM SR, the hot path is shaped around a simple constraint: the router must see enough context to make a semantic decision, but it should avoid turning every request into an expensive full parse and full detector run.

The route lifecycle inside the gateway path looks like this.

flowchart LR A["Headers
id, path, protocol, identity"] --> B["Body
fast extraction"] B --> C{"Mutation
needed?"} C -- "no" --> D["Signals
projections
decisions"] C -- "yes" --> E["Full parse
OpenAI / Responses / Anthropic"] E --> D D --> F["Model routing
explicit, auto, looper slug"] F --> G["Request preparation
memory, RAG, tools, params, prompt"] G --> H["Backend or Looper
execution"] H --> I["Response phase
normalize, verify, cache, replay"]

Phase	What the router extracts or changes	Why it is on the hot path
Header phase	Request id, method/path, client protocol, identity headers, streaming expectation, replay/model/response API paths, skip-processing opt-out	Routing needs tenant, protocol, and control metadata before reading the full body
Body phase	Fast request state first; full OpenAI-compatible parse only when mutation is needed	Most routing decisions should not pay unnecessary parsing and mutation cost
Pre-routing	Response API translation when needed, validation, signal dispatch, projection application, decision match, algorithm or looper preflight	Semantic policy must happen before backend selection
Model routing	Explicit model, `auto` model names, direct looper slugs such as Fusion or Flow, Anthropic provider routing, alias resolution, provider profile/auth, reasoning mode	Logical model names must resolve into real provider and backend behavior
Request preparation	System prompt, memory, RAG, request params, tools, tool selection, route headers, trace headers	The selected route becomes concrete model input and gateway metadata
Response phase	Normalize OpenAI, Responses, or Anthropic shapes, report usage, calibrate token estimate, update cache, run response jailbreak/hallucination checks, store memory, emit warnings, record replay	The router must observe what happened, not only what it predicted

Protocol compatibility is part of the same story. The router should not hide provider differences behind a vague facade. It should translate them explicitly at the boundary: OpenAI-compatible chat, Responses-style calls, Anthropic-style provider routing, direct looper model slugs, and backend-specific provider profiles all become inputs to one routing engine. Applications keep a familiar API shape, while the infrastructure keeps the differences visible enough to debug.

The ExtProc path matters because it gives semantic policy a boundary-native shape. vLLM SR can receive enough request and response context to make semantic decisions while still returning control to the real gateway data plane.

Envoy Integration: Put Semantic Policy at the Boundary

The Envoy integration shows the intended boundary clearly.

Envoy owns the data plane: TLS, clusters, endpoint health, timeouts, retries, load balancing, and filter chains. vLLM SR should not rebuild those capabilities. It should act as the semantic policy plane: receive request context through External Processing, evaluate the route, and return header/body mutations.

Client Request to Envoy Gateway, semantic policy path to Semantic Router, and main traffic path to Backend Model Clusters. — Envoy keeps the main traffic path. vLLM SR returns semantic policy decisions and mutations.

Component	Responsibility
Client	Sends OpenAI-compatible or provider-compatible requests
Envoy	Handles network path, clusters, TLS, timeout, health, retry, and load balancing
ExtProc bridge	Sends request context and receives header/body mutations
vLLM SR	Extracts evidence, matches policy, selects or orchestrates, applies learning and plugins
Backend clusters	Serve model traffic after the semantic route is decided

The adoption advantage is large. Applications do not need a new private API just to benefit from semantic routing. They can keep calling familiar model APIs while infrastructure maps logical model names to small models, frontier models, RAG paths, verifier paths, Looper collaboration paths, workflows, or private hardware lanes.

Protocol work such as SIRP and multi-provider inference API matters for the same reason. Semantic routing should strengthen the control plane without forcing every application team into a custom gateway dialect.

Model Design: Model Name Is the Wrong Primitive

The router cannot make production-grade decisions from model names alone.

gpt-4, qwen3-32b, claude-opus, or local-private identifies an endpoint or alias. It does not describe reasoning ability, coding strength, tool behavior, vision support, context window, latency distribution, cost, hardware path, privacy boundary, or observed failure modes.

Request flows to Lexical, Embedding, LoRA, and MLP Selector models, then Calibration, Decision, Small Model, RAG, and Frontier Model with Feedback. — The router needs calibrated model metadata, not just endpoint strings.

Model-card field	Examples	Routing implication
Capability	reasoning, coding, vision, tool use, image generation, verifier, embedding	Determines which routes can legally include the model
Economics	price, quality score, cost weight, expected output length	Feeds cost-quality optimization
Latency	TTFT, TPOT, p50/p95/p99, warm/cold behavior	Feeds SLO-aware selection
Context	context window, compression support, long-context stability	Drives context-aware routing and token-budget routing
Hardware	CUDA, ROCm, XPU, CPU, quantization, engine family	Connects logical model to physical pool
Policy	provider profile, tenant allowlist, data boundary, reasoning family	Prevents unsafe or unauthorized selection
Feedback	replay success, failure type, verifier disagreement, user feedback	Supports learning and recalibration

Users can ask for auto. The system cannot treat auto as a magic endpoint. Internally, it must expand into model cards, candidate sets, route policy, Looper eligibility, learning boundaries, and execution paths.

Bindings: Fast ML Without Turning the Router Into a Model Server

vLLM SR’s control plane is written around Go because gateway integration, configuration, request mutation, response processing, Envoy ExtProc, and Kubernetes-style infrastructure fit Go well. But routing also has ML hot paths: embeddings, classification, modality detection, LoRA classification, and MLP selectors.

The binding layer keeps those concerns separated.

Go Router Service connected through FFI Boundary to Candle Backend, ONNX Backend, and Stub Backend capability matrix. — Native bindings expose capability explicitly instead of letting deployments guess.

Native surface	Role	Capability shape
`candle-binding`	Rust/Candle high-performance ML path	Unified batch classification, LoRA classification, batched embeddings, multimodal embeddings, modality routing, MLP selector
`ml-binding`	Rust helpers for classical ML selector artifacts	KNN, KMeans, and SVM-style selector support where trained artifacts exist
`nlp-binding`	Rust lexical routing helpers	BM25, n-gram, and deterministic lexical classifiers for low-latency evidence
ONNX backend	Portable runtime path	Batched embedding in the current public capability contract
Stub backend	Minimal or unsupported build path	Explicit capability absence for fallback, tests, and non-native builds

The rule is not “everything must be native.” The rule is capability must be explicit. If a deployment lacks a native classifier, the router should know. If ONNX supports only part of the contract, policy should not pretend otherwise. If backend lifecycle needs reset boundaries, the router should expose that instead of hiding it. Intel/OpenVINO-oriented paths can be valuable deployment options, but they belong in the hardware/runtime discussion unless they are part of the same advertised native capability contract.

Hardware Is a Routing Variable

Hardware support is often described as a compatibility matrix. For semantic routing, it is more than that.

Different hardware paths imply different latency, cost, memory behavior, kernel availability, quantization support, context-window economics, privacy boundary, and energy curve. A router that ignores hardware is only doing half the job.

Router connected to latency, cost, context, privacy, and hardware paths CUDA, ROCm, XPU, CPU. — Hardware-aware routing connects semantic constraints to physical execution.

Platform path	What matters to routing	Example route
NVIDIA CUDA	Mature high-throughput vLLM serving, CUDA kernels, quantized paths, broad accelerator availability	Default high-performance pool, frontier or private data-center path
AMD ROCm	First-class non-CUDA vLLM platform direction, MI300/MI350-class deployments, AITER kernels and attention paths	Cost/performance diversification, ROCm production pool, AMD Developer Cloud validation path
Intel XPU	SYCL/DPC++, oneDNN, XPU kernels, OpenVINO-oriented portability and optimization paths	Enterprise accelerator lane, private infrastructure, CPU/XPU hybrid deployment
CPU / local	Intel/AMD x86, ARM AArch64, Apple silicon, edge and offline fallback	PII-sensitive, low-throughput, local-only, or cost-minimal workloads
KV / context	Prefix retention, warm state, prefill/decode balance, context-window pressure, bytes-per-token drift	Session-aware protection, long-context pools, token-budget-aware routing

The acceleration story is not one-size-fits-all. CUDA gives the broadest default serving path. ROCm/AITER makes AMD pools a serious production option rather than a compatibility afterthought. Intel XPU and OpenVINO-style paths matter for enterprises that already own Intel-heavy infrastructure or need CPU/XPU portability. CPU and local paths remain important because privacy, availability, and cost sometimes beat raw throughput.

WRP is the right mental model here. Workload is semantic. Pool is physical. Router translates between them.

flowchart LR W["Workload
intent, risk, context, modality"] --> R["Router
policy + selection + learning"] R --> P["Pool
models, queues, cache, hardware"] P --> R R --> O["Outcome
quality, latency, cost, safety"] O --> W O --> R

A future router should know when CUDA is overloaded, when ROCm is cost-effective, when XPU is sufficient, when CPU/local is preferable for privacy, and when preserving KV cache is worth more than reselecting a theoretically better model.

Observability: The Router Must Leave Evidence Behind

If the system makes semantic decisions, operators need to see those decisions.

Pipeline with Request, Signal Trace, Decision Trace, Route, Response, Headers, Replay, and Audit Record. — Traces and replay turn routing from hidden magic into debuggable infrastructure.

Trace surface	It explains
Signal trace	Which signals ran, what matched, and what raw values or confidence appeared
Projection trace	How evidence became derived route facts
Decision trace	Which policy matched and why it outranked alternatives
Selection trace	Which candidates were considered and which model won
Looper trace	Which panel, rounds, workers, judge, or synthesis path executed
Learning trace	Base model, proposal model, final model, protection action, adaptation reason, cache and switch evidence
Plugin trace	Which cache, RAG, memory, tool, safety, verification, and replay behaviors executed
Header trace	What `x-vsr-*` metadata went to the gateway or client
Replay record	The durable request/response/decision/model/path artifact for debugging and evaluation

Replay is not only for dashboards. It is the second control plane. The first control plane makes the live decision; replay preserves enough evidence to debug that decision, compare it against alternatives, attach outcomes, and produce offline recipe patches.

If the router records what it saw, what it decided, what happened, and how users, agents, verifiers, or evals responded, the next generation of selectors, thresholds, learning strategies, and recipes can improve.

Without traces, the router is another opaque model. With traces, it becomes a system component.

Evaluation: The Router Is a Frontier

The wrong way to evaluate a router is to count features. A router with many knobs can still be bad. The right question is whether it improves the frontier between quality, cost, latency, safety, privacy, reliability, and hardware efficiency.

Cost quality frontier with Small, RAG, Frontier, Router, and Waste points. — Routing intelligence should move workloads toward the efficient frontier.

Evaluation axis	What should improve
Quality-cost frontier	Same or better quality with lower model spend
Latency frontier	Better SLO compliance without blindly choosing weak models
Safety and privacy	Better handling of PII, jailbreak, tool exposure, and local/private paths
Factuality	More grounded answers through RAG and verification where needed
Collaboration value	Better output from Fusion, ReMoM, Flow, or Ratings than single-model baselines
Session stability	Fewer broken tool loops and fewer harmful model switches
Fleet efficiency	Better queue, cache, hardware, and pool utilization
Debuggability	More replayable and explainable decisions
Learning loop	Better calibration from traces, outcomes, and offline evals

Router evaluation, fleet simulation, replay traces, RouterArena-style comparison, Looper evals, and offline recipe learning matter because semantic routing is infrastructure. It has to be measured like infrastructure.

The Part I Care About Most

The most important design belief in vLLM SR is not any single signal, plugin, model, algorithm, or hardware platform. It is the separation of concerns.

Signals observe. Projections coordinate. Decisions express policy. Selection algorithms choose one model. Looper algorithms scale capability through model collaboration. Router Learning adapts and protects within recipe boundaries. Plugins mutate behavior. Bindings accelerate hot-path ML. Envoy integration places semantic policy at the traffic boundary. Model cards connect logical names to real capabilities. Hardware metadata connects semantic constraints to physical execution. Observability makes every decision accountable.

That separation is what lets the router evolve without turning every new idea into a fork of the hot path.

A new jailbreak detector can become a signal. A new risk formula can become a projection. A new compliance rule can become a decision. A new selector can become a selection algorithm. A new panel or workflow primitive can become a Looper algorithm. A new verifier can become a plugin. A new accelerator lane can become model-card metadata. A new benchmark can become an evaluation loop. A new fleet model can feed pool-aware routing. A new learning strategy can propose better candidates without rewriting policy.

That is why I like the phrase Intelligence Control Plane. Not because the router is always intelligent by itself, but because it gives the system a place to allocate intelligence deliberately.

The first stage of AI infrastructure made intelligence callable.

The next stage has to make intelligence allocatable, explainable, and optimizable.

Calling a model was the first abstraction. Allocating intelligence is the next one.

That is the philosophy of vLLM SR.

Source Trail

This article is based on the vLLM Semantic Router codebase, website research archive, vLLM project blog posts, my earlier essays on LLM routing, and infrastructure references around Envoy, vLLM, ROCm, Intel XPU, and Gateway API inference routing.

Source	Why it matters here
vLLM Semantic Router research archive	Paper table and system-design throughline
Canonical config and routing contract code	Recipe-as-contract, signal/projection/decision surfaces, supported algorithm and plugin catalog
Signal-Decision Driven Architecture	Shift from single classification to signal-decision routing
Iris / Athena / Themis release posts	Signals, model selection, plugins, memory, replay, AMD ROCm, and release progression
Fusion API and Looper tutorials	Multi-model collaboration path: Confidence, Ratings, Fusion, ReMoM, Router Flow
Router Learning docs and proposal	Adaptation, protection, memory, replay, outcomes, and offline recipe learning
Session-Aware Agentic Routing	Tool-loop continuity, provider state, prefix cache, and safe switch boundaries
Agentic Routing on AMD ROCm	AMD ROCm deployment, agentic recipe, dashboard, learning, and replay
ExtProc runtime pipeline notes	Header/body/model-routing/response phases, protocol normalization, replay and response-time checks
Envoy External Processing filter	Semantic policy path versus main traffic data plane
Native binding capability matrix	Candle, ML binding, NLP binding, ONNX, and Stub capability boundaries
vLLM platform documentation	CUDA, ROCm, XPU, CPU, and deployment background
AMD ROCm / AITER / vLLM ROCm attention backend	AMD acceleration path and ROCm serving ecosystem
Intel XPU kernels / IPEX XPU ecosystem	Intel accelerator serving path
Kubernetes Gateway API Inference Extension	Standardization direction for inference routing at the gateway layer