Semantic Routing as Energy Infrastructure

Not every token deserves a frontier model. Every semantic workload deserves the right intelligence budget.

Six months ago, I wrote “The Second Half of LLM Routing.” At the time, my argument was that LLM routing was reaching a turning point. The first half of routing made models look like manageable backend services: unified API entry points, token quotas, retries, fallbacks, observability, load balancing, and cost controls.

That work mattered. It made AI production systems possible.

But it also started to converge. Once every AI gateway begins to look like the same recipe, the next question is no longer how to forward requests more reliably. The question becomes what “routing” should mean when the backend is no longer a deterministic service, but a model that can reason, act, call tools, fail in subtle ways, and change behavior through feedback.

Back then, I described the second half this way:

In the first half, routing transports requests. In the second half, routing builds collective intelligence.

I still believe that.

After another half year of watching enterprise AI adoption, open source inference, model releases, agentic coding, and infrastructure teams trying to control real token bills, I would now make the argument sharper.

The second half of routing is not only about moving from endpoint selection to behavior control.

It is about intelligence resource allocation.

The next infrastructure problem is not just serving models. It is deciding how much intelligence each piece of semantic work deserves, where that intelligence should run, and what cost, latency, privacy exposure, hardware capacity, and energy footprint the system is allowed to spend.

That is why semantic routing now looks less like an API gateway problem and more like energy infrastructure.

1. The Bill Arrives Before the Architecture

Every computing era begins by making a scarce resource feel abundant.

The industrial age made mechanical force portable. The cloud era made compute elastic. The AI era has made something stranger callable: intelligence, exposed through APIs, metered by tokens, and inserted into millions of workflows.

The first visible effect is acceleration. A frontier model makes a product feel smarter. An agentic coding tool changes how developers work. A support assistant compresses repetitive labor. The new resource looks flexible, almost liquid.

Then the bill arrives, and the architecture becomes visible.

In May 2026, Business Insider reported that Salesforce CEO Marc Benioff said the company could spend around $300 million this year on Anthropic tokens. The number is eye-catching, but the more important part is the architecture question behind it.

Benioff pointed toward a future intermediary layer that can decide which inputs should go to a frontier model and which can be handled by smaller models.

That is the moment AI adoption stops being a product experiment and becomes an infrastructure problem.

When token spend is small, model choice feels like a developer preference. When token spend becomes material at company scale, model choice becomes infrastructure strategy.

Around the same time, The Verge reported that Microsoft was preparing to wind down many Claude Code licenses internally and move more developers toward GitHub Copilot CLI. That story is not the same as per-request routing. It is about tools, platform control, internal workflows, and operating expenses. But it points in the same direction: once AI moves from experimentation into enterprise-scale adoption, companies will not simply use the most capable external tool everywhere forever.

They will ask harder questions:

Who controls the workflow?
Which model or tool is allowed to touch which codebase?
Which provider dependency is acceptable?
Which costs are worth it?
Which capabilities should be internalized?
Which workloads need frontier intelligence, and which do not?

Red Hat framed a related tension as the “agentic paradox”: frontier models are often the fastest path to agentic adoption, but routing every agentic workload to them becomes difficult to sustain at enterprise scale because of cost, latency, confidentiality, sovereignty, and control.

There is also a supply-side version of the same pressure. NVIDIA recently described AI data centers as token factories, arguing that AI efficiency and revenue scale through better performance per watt. That is exactly what frontier hardware and inference engine teams are trying to do: squeeze more useful tokens out of every watt of power and every dollar of infrastructure.

These stories are not identical. One is about token bills. One is about developer tools and platform control. One is about hybrid AI. One is about performance per watt. But together they show the same transition:

AI is moving from capability discovery to resource discipline.

The supply side will keep improving tokens per watt per dollar. The demand side now needs to decide which semantic work deserves those tokens in the first place. A faster token factory still wastes power if every workload is sent through the most expensive lane.

Figure: Four pressure signals (token spend, tool control, hybrid deployment, and power efficiency) converge on a routing layer.

Frontier models remain the fastest way to discover new capabilities.

They are not, by themselves, the architecture of scale.

2. Tokens Are Not the Unit

The industry likes tokens because tokens are easy to count.

Tokens are convenient for billing. They are convenient for quotas. They are convenient for dashboards. They turn something messy, language, into something finance and infrastructure teams can track.

But tokens are a billing unit, not a value unit.

Figure: The same token count, sent to a cheap path, an open model, or a frontier model, can carry very different cost, latency, energy, and intelligence density.

A token generated by a small local model, a specialized open model, a frontier closed model, a reasoning model, or a model running on an edge device does not carry the same cost. It does not carry the same latency. It does not carry the same energy footprint. It does not carry the same privacy risk. It does not even carry the same kind of intelligence.

Counting tokens is like counting kilowatt-hours without asking what the electricity powered.

One kilowatt-hour used to keep a hospital operating is not the same as one kilowatt-hour wasted by an idle machine. The unit is the same. The value is not.

The same is true for AI systems.

One token may be cheap autocomplete. Another may be the deciding step in a high-risk reasoning chain. One request may be a low-stakes formatting task. Another may require retrieval, verification, policy checks, and a stronger model because the failure cost is high.
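
To make the accounting gap concrete, here is a minimal arithmetic sketch in Python. The tier names and per-million-token prices are hypothetical placeholders, not quotes from any provider; the only point is that an identical token count maps to very different spend depending on where it runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    usd_per_million_tokens: float  # hypothetical price, for illustration only


# Hypothetical tiers and prices; substitute numbers from your own bills.
TIERS = [
    Tier("small-local", 0.10),
    Tier("open-weight", 1.00),
    Tier("frontier", 15.00),
]


def cost_usd(tokens: int, tier: Tier) -> float:
    """Cost of running a given number of tokens on one tier."""
    return tokens / 1_000_000 * tier.usd_per_million_tokens


tokens = 2_000_000  # the same token count on every tier
for tier in TIERS:
    print(f"{tier.name:12s} {cost_usd(tokens, tier):8.2f} USD")
```

With these placeholder prices, the same two million tokens cost 0.20, 2.00, and 30.00 USD. A dashboard that reports only the token count hides a 150x spread, and that is before latency, energy, or privacy exposure enter the picture.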

So the real question is not:

How many tokens did we use?

The real question is:

Did we spend the right kind of intelligence in the right place?

That is a different kind of accounting.

It asks the system to understand the semantic workload before spending compute. Is the task simple or ambiguous? Private or public? Reversible or dangerous? Latency-sensitive or quality-sensitive? Cheap enough to answer locally, or important enough to escalate?

This is where token economics becomes system design.

3. Frontier-by-Default Is a Temporary Architecture

The simplest AI architecture is always:

Send everything to the strongest model.

It is easy to build. It is easy to explain. It gives good demos. It minimizes engineering effort in the short term because the model absorbs complexity that the system does not know how to model yet.

In the early stage of adoption, that is rational.

When a team is trying to prove AI works, frontier-by-default is often the fastest path. It reduces product risk. It gives users a better first experience. It lets teams focus on workflow and adoption instead of building a routing layer too early.

But architecture that works for discovery often breaks at scale.

When every workflow becomes AI-assisted, when agents run continuously, when coding tools expand across thousands of developers, when customer support, sales, analytics, compliance, and internal operations all begin to consume model calls, the question changes.

The company no longer asks:

Can AI do this?

It asks:

Can we afford this, control this, govern this, and scale this?

That is when frontier-by-default starts to look less like an architecture and more like a subsidy paid by the early phase of adoption.

Figure: Frontier-by-default works for discovery. Mature systems route demand across capability tiers, sending each piece of work to the right path.

The future is not one frontier model serving every request. The future is a heterogeneous fabric of intelligence:

frontier closed models for high-value reasoning and hard escalation paths
open-weight models for controllable, private, and cost-efficient workloads
small models for classification, extraction, intent detection, routing, and frequent low-risk tasks
domain models for specialized knowledge and enterprise context
verifiers for checking claims, actions, policies, and citations
retrieval systems and memory layers for grounding
edge models for local privacy, low latency, and near-zero marginal cost
different generations of hardware that still need to be used efficiently

This is not anti-frontier.

It is what mature infrastructure does: preserve the most capable resource for the places where it actually changes the outcome.

Electric grids do not serve every load from the most expensive generator. Networks do not send every packet through the same path. Cloud platforms do not run every job on the largest instance type.

Mature systems differentiate demand.

AI systems will have to do the same.

4. Hybrid AI Is Not Just a Deployment Topology

The phrase “hybrid AI” is becoming more common. Usually it means some mixture of cloud models, private deployments, open models, and edge inference.

That is true, but incomplete.

Hybrid AI is not just about where models are hosted.

It is about who decides where semantic work should go.

Cloud versus local is not a static preference. It depends on the workload. A privacy-sensitive task might stay local. A high-uncertainty task might escalate. A cheap repetitive task might use a small model. A regulated workflow might require audit, policy enforcement, and a self-managed model. A difficult reasoning task might deserve a frontier model, but only after cheaper signals fail to resolve uncertainty.

In other words, hybrid AI needs a control layer.

Not a proxy that only hides provider differences.

Not a gateway that only does auth, retries, and rate limiting.

Not a static rule table that says tenant A goes to model X and tenant B goes to model Y.

It needs a layer that understands semantic workload signals and turns them into capability paths.

A path may be:

answer directly with a small model
retrieve first, then answer with a mid-size model
classify risk, then apply a stricter policy
ask a clarification question before spending expensive reasoning
run a verifier before returning the answer
escalate to a frontier model only when uncertainty or failure cost justifies it
keep the task on-device because privacy matters more than marginal quality
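
The paths listed above can be sketched as an ordered list of steps rather than a single model name. This is a minimal illustration, not how any particular router implements it: the `Workload` fields, the step names, and the 0.7 thresholds are all assumptions chosen for readability.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Workload:
    # Hypothetical signals that an upstream classifier might compute.
    private: bool          # data must not leave the device or boundary
    ambiguous: bool        # the request is underspecified
    needs_grounding: bool  # the answer depends on external documents
    failure_cost: float    # 0.0 (harmless) to 1.0 (severe)
    uncertainty: float     # 0.0 (confident) to 1.0 (unsure)


def choose_path(w: Workload) -> List[str]:
    """Return an ordered list of steps: a capability path, not one model."""
    if w.private:
        # Privacy outranks marginal quality: keep the task on-device.
        return ["on_device_model"]
    if w.ambiguous:
        # One question can be cheaper and safer than expensive guessing.
        return ["ask_clarification"]
    path: List[str] = []
    if w.needs_grounding:
        path.append("retrieve")
    if w.failure_cost >= 0.7 or w.uncertainty >= 0.7:
        path.append("frontier_model")   # escalate only when justified
    elif w.needs_grounding:
        path.append("mid_size_model")
    else:
        path.append("small_model")
    if w.failure_cost >= 0.7:
        path.append("verify")           # check before returning the answer
    return path
```

The return type is the design choice worth noticing: because the result is a sequence, retrieval, clarification, and verification become first-class routing outcomes instead of afterthoughts bolted onto model selection.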

This is why “the right model for every request” is a useful slogan but not the whole idea.

The real unit is not always a model.

The real unit is a capability path.

Semantic routing is the layer that chooses that path.

Figure: A router does not only pick a model. It maps workload and policy into a capability path drawn from the model pool.

5. Semantic Routing as Energy Infrastructure

In the old internet, routing moved packets.

In the AI stack, what moves through the system is semantic work: intent, uncertainty, context, memory, policy, privacy risk, reasoning demand, tool use, and action.

Once traffic starts carrying meaning, routing cannot remain a thin forwarding layer.

It becomes a decision layer.

And once decisions have different cost, energy, and risk profiles, routing becomes an allocation layer.

This is why I use the phrase energy infrastructure.

I do not mean semantic routing replaces the power grid. I mean that, at scale, AI systems must treat intelligence like an energy-consuming resource that needs to be scheduled, conserved, escalated, and justified.

Different models are not just different endpoints. They are different intelligence densities with different energy prices.

A small model might be enough for a large percentage of routine tasks. An open model might be good enough for many production workflows where control and cost matter. A frontier model might be necessary for hard reasoning, ambiguous tasks, creative synthesis, or high-stakes escalation. A verifier might be more valuable than a stronger generator. Retrieval might beat reasoning. Asking the user one question might be cheaper and safer than guessing with a larger model.
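
One common way to act on this is a cascade: try the cheapest tier first and escalate only when confidence is too low. The sketch below assumes each model is a callable returning an answer plus a confidence score. How that confidence is produced (log-probabilities, a verifier, a self-check) is deliberately left open, and the two stand-in models are toys.

```python
from typing import Callable, List, Tuple

# A model is any callable returning (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def cascade(prompt: str, tiers: List[Tuple[str, Model]], threshold: float) -> Tuple[str, str]:
    """Try tiers cheapest-first and stop at the first confident answer.

    Returns (tier_name, answer). The last tier's answer is returned even if
    its confidence is low, because there is nothing left to escalate to.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    name, answer = "", ""
    for name, model in tiers:
        answer, confidence = model(prompt)
        if confidence >= threshold:
            break
    return name, answer


# Stand-in models for illustration; real ones would call an inference API.
def small_model(prompt: str) -> Tuple[str, float]:
    confident = len(prompt.split()) <= 5  # toy heuristic: short prompts are easy
    return "small:" + prompt, 0.9 if confident else 0.3


def frontier_model(prompt: str) -> Tuple[str, float]:
    return "frontier:" + prompt, 0.95


TIERS = [("small", small_model), ("frontier", frontier_model)]
print(cascade("format this date", TIERS, threshold=0.8)[0])
```

A cascade pays for the small model even on requests that end up escalating, so it only saves money when a large share of traffic stops at the first tier. That trade-off is workload-specific and has to be measured.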

The router’s job is not to send every problem to the strongest model.

The router’s job is to conserve scarce intelligence without making the system dumber.

That is an energy problem in the broadest sense:

energy as power consumed by hardware
energy as operational cost
energy as scarce GPU capacity
energy as latency budget
energy as human trust
energy as organizational attention

The AI industry is right to care about the supply side: bigger models, faster accelerators, larger clusters, longer contexts, more capable agents.

That supply side is important. Hardware teams and inference engine teams are doing exactly what they should do: maximize tokens per watt per dollar. Better accelerators, better kernels, better batching, better KV-cache systems, better speculative decoding, better serving engines, and better cluster scheduling all push in the same direction: more useful tokens from the same power envelope.

Figure: The tokens-per-watt-per-dollar efficiency stack. Hardware and inference engines improve token supply; routing decides where that improved intelligence should be spent.

But supply-side efficiency is only half the problem.

The other half is demand-side allocation. If every semantic workload is allowed to consume the most expensive path, the system will be powerful but wasteful. If every workload is forced onto the cheapest path, the system will be efficient but dumb. The hard problem is not choosing cheap or strong. The hard problem is knowing when each is appropriate.
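
Stated as a rule, demand-side allocation is: pick the cheapest path that is still good enough for this workload. A minimal sketch follows, assuming the router already has a per-workload quality estimate for each path. The path names, costs, and quality numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Path:
    name: str
    cost: float              # relative cost per request (hypothetical units)
    expected_quality: float  # estimated quality in [0, 1] for this workload


def allocate(paths: List[Path], required_quality: float) -> Optional[Path]:
    """Pick the cheapest path whose expected quality meets the requirement."""
    eligible = [p for p in paths if p.expected_quality >= required_quality]
    if not eligible:
        return None  # nothing is good enough: clarify, hand off, or refuse
    return min(eligible, key=lambda p: p.cost)


PATHS = [
    Path("small", cost=1.0, expected_quality=0.70),
    Path("open", cost=5.0, expected_quality=0.85),
    Path("frontier", cost=40.0, expected_quality=0.95),
]

for bar in (0.60, 0.80, 0.90):
    chosen = allocate(PATHS, bar)
    print(bar, chosen.name if chosen else None)
```

Always choosing the cheapest path and always choosing the strongest path are the two degenerate cases of this rule, with the quality bar set to zero or to the maximum. The hard part in practice is estimating `expected_quality` per workload, which this sketch takes as given.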

That is the first-principles case for semantic routing.

It is resource scheduling for intelligence.

6. The Layer We Are Missing

If semantic routing is going to become real infrastructure, it needs more than a model selector.

I think the missing layer has at least six pieces.

First, workload signals.

The system has to understand more than prompt length and tenant ID. It needs signals for intent, domain, difficulty, uncertainty, privacy, safety, tool requirements, expected output type, failure cost, and historical behavior. Without workload signals, routing collapses back into static rules.
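
As an illustration of what a signal record might contain, here is a deliberately crude sketch. The field names and keyword lists are assumptions, and a production router would more plausibly use trained classifiers than string matching. The shape of the output is what matters: a structured description of the work that is computed before any expensive model is called.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadSignals:
    domain: str         # e.g. "code" or "general"
    contains_pii: bool  # crude privacy signal
    needs_tools: bool   # the request implies an action, not just an answer
    prompt_tokens: int  # rough size, whitespace-split


EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CODE_HINTS = ("def ", "class ", "traceback", "stack trace")
TOOL_HINTS = ("send ", "book ", "deploy ", "delete ")


def extract_signals(prompt: str) -> WorkloadSignals:
    """Toy keyword-based extractor, standing in for real classifiers."""
    lowered = prompt.lower()
    return WorkloadSignals(
        domain="code" if any(h in lowered for h in CODE_HINTS) else "general",
        contains_pii=bool(EMAIL.search(prompt)),
        needs_tools=any(h in lowered for h in TOOL_HINTS),
        prompt_tokens=len(prompt.split()),
    )


print(extract_signals("Send the report to alice@example.com"))
```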

Second, routing memory.

A router should not repeat the same decision blindly. It should remember which paths worked, which failed, which users or tasks require stricter handling, and which policies caused friction. Routing without memory cannot improve; it can only re-run yesterday’s assumptions.
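
A minimal form of routing memory is a table of outcomes per task type and path. The sketch below is an assumption about one reasonable shape, not a description of an existing system: it keeps smoothed success rates and picks the cheapest path that has earned enough trust.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class RoutingMemory:
    """Per-(task_type, path) success counts with Laplace smoothing."""

    def __init__(self) -> None:
        # key -> [successes, attempts]
        self._stats: Dict[Tuple[str, str], List[int]] = defaultdict(lambda: [0, 0])

    def record(self, task_type: str, path: str, success: bool) -> None:
        stats = self._stats[(task_type, path)]
        stats[0] += int(success)
        stats[1] += 1

    def success_rate(self, task_type: str, path: str) -> float:
        successes, attempts = self._stats.get((task_type, path), (0, 0))
        # Smoothing keeps unseen paths at 0.5 rather than 0 or undefined.
        return (successes + 1) / (attempts + 2)

    def cheapest_reliable(self, task_type: str, paths_by_cost: List[str], min_rate: float) -> str:
        """First path, in cheapest-first order, whose rate clears min_rate."""
        for path in paths_by_cost:
            if self.success_rate(task_type, path) >= min_rate:
                return path
        return paths_by_cost[-1]  # fall back to the strongest path


memory = RoutingMemory()
for _ in range(8):
    memory.record("sql", "small", success=False)
    memory.record("sql", "open", success=True)
print(memory.cheapest_reliable("sql", ["small", "open", "frontier"], min_rate=0.8))
```

Note that an unseen task type falls through to the strongest path here, which is conservative. A real system would also need some exploration, otherwise cheap paths that were never tried can never earn trust.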

Third, policy languages.

As routing becomes more powerful, organizations need a way to express constraints: what can run where, which data can leave the boundary, when a verifier is mandatory, when escalation is allowed, and how cost-quality tradeoffs should be handled. Natural language prompts are not enough for infrastructure policy.
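
To show the difference between a prompt and a policy, here is a small sketch in which constraints are data that can be evaluated and audited. The rule schema, attribute names, and path names are hypothetical. This is not the syntax of any existing policy language.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Rule:
    name: str
    when: Dict[str, object]  # attribute -> required value (all must match)
    deny_paths: FrozenSet[str] = frozenset()
    require_steps: FrozenSet[str] = frozenset()


# Hypothetical rules; attribute and path names are illustrative only.
POLICY: List[Rule] = [
    Rule("pii-stays-inside", {"data_class": "pii"},
         deny_paths=frozenset({"external_api"})),
    Rule("finance-needs-verifier", {"domain": "finance"},
         require_steps=frozenset({"verify"})),
]


def evaluate(
    policy: List[Rule], request: Dict[str, object], candidates: List[str]
) -> Tuple[List[str], List[str], List[str]]:
    """Return (allowed_paths, required_steps, matched_rule_names)."""
    denied, required, matched = set(), set(), []
    for rule in policy:
        if all(request.get(k) == v for k, v in rule.when.items()):
            matched.append(rule.name)
            denied |= rule.deny_paths
            required |= rule.require_steps
    allowed = [p for p in candidates if p not in denied]
    return allowed, sorted(required), matched


print(evaluate(POLICY, {"data_class": "pii", "domain": "finance"},
               ["local_model", "external_api"]))
```

Returning the matched rule names alongside the decision is the part a natural language prompt cannot offer: every routing decision carries a record of which constraints produced it.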

Fourth, evaluation.

If a router claims to be intelligent, it has to be measurable. Accuracy alone is not enough. We need cost-quality frontiers, task completion rates, latency distributions, privacy guarantees, safety outcomes, stability across multi-turn sessions, and evidence that the system improves through feedback.

Figure: Routing intelligence should be evaluated as a cost-quality frontier, not a feature checklist. Good routes sit on the frontier; waste sits below it.
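
A cost-quality frontier can be computed directly from evaluation results. The sketch below keeps every routing configuration that no other configuration beats on both axes. The configuration names and numbers are invented for illustration and are not measurements.

```python
from typing import List, NamedTuple


class RouteResult(NamedTuple):
    name: str
    cost: float     # e.g. dollars per 1k requests (hypothetical numbers below)
    quality: float  # e.g. task completion rate in [0, 1]


def pareto_frontier(results: List[RouteResult]) -> List[RouteResult]:
    """Keep configurations that no other one beats on both cost and quality."""
    frontier: List[RouteResult] = []
    best_quality = float("-inf")
    # Sort by cost ascending; on cost ties, put higher quality first.
    for r in sorted(results, key=lambda r: (r.cost, -r.quality)):
        if r.quality > best_quality:
            frontier.append(r)
            best_quality = r.quality
    return frontier


RESULTS = [
    RouteResult("always-small", 2.0, 0.71),
    RouteResult("static-rules", 9.0, 0.80),
    RouteResult("semantic-router", 8.0, 0.90),
    RouteResult("always-frontier", 40.0, 0.93),
]

for r in pareto_frontier(RESULTS):
    print(r.name, r.cost, r.quality)
```

In this invented data, `static-rules` falls off the frontier because another configuration is both cheaper and better. That is the kind of claim a router should be able to support with its own evaluation data.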

Fifth, cross-layer scheduling.

Semantic routing cannot live completely separate from inference scheduling, caching, model serving, and hardware utilization. The right semantic decision depends on the state of the serving layer, and the serving layer can be more efficient if it understands semantic demand. A router that ignores the pool is incomplete; a pool that ignores demand is blind.
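
A small sketch of what that coupling could look like: the semantic side supplies a quality floor and a latency budget, and the serving side supplies current queue depth. The backend fields and the linear queueing estimate are simplifying assumptions; a real serving layer would expose richer state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Backend:
    name: str
    quality: float          # estimated quality for this workload, in [0, 1]
    base_latency_ms: float  # latency with an empty queue
    queue_depth: int        # requests waiting, as reported by the serving layer
    ms_per_queued: float    # rough added wait per queued request

    def estimated_latency_ms(self) -> float:
        return self.base_latency_ms + self.queue_depth * self.ms_per_queued


def pick_backend(
    backends: List[Backend], min_quality: float, latency_budget_ms: float
) -> Optional[Backend]:
    """Among good-enough backends, prefer one that fits the latency budget now."""
    good = [b for b in backends if b.quality >= min_quality]
    in_budget = [b for b in good if b.estimated_latency_ms() <= latency_budget_ms]
    pool = in_budget or good  # degrade gracefully if nothing fits the budget
    if not pool:
        return None
    return min(pool, key=lambda b: b.estimated_latency_ms())


BACKENDS = [
    Backend("open-gpu-a", 0.85, 300.0, 20, 50.0),  # slightly better, but congested
    Backend("open-gpu-b", 0.84, 400.0, 2, 50.0),
    Backend("small", 0.60, 50.0, 0, 10.0),
]
print(pick_backend(BACKENDS, min_quality=0.8, latency_budget_ms=800.0).name)
```

A router that looked only at quality would pick the congested backend. One that sees queue depth picks the nearly equivalent one that can answer within budget.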

Sixth, open interfaces.

If this layer becomes important, it should not only exist as a hidden optimization inside closed products. The policies, metrics, failure modes, and interfaces should be inspectable and composable.

That is where open source matters.

Not because every company will run the same router, or because one project should own the entire space. Open source matters because shared infrastructure needs shared language: common workloads, common metrics, common failure cases, common policy concepts, and enough transparency for people to argue about the right abstractions.

Figure: An open control plane over shared infrastructure makes routing signals, policies, memory, evaluation, and interfaces inspectable.

7. Why We Built vLLM Semantic Router

This is the problem space behind vLLM Semantic Router.

The project started from a simple belief: routing in LLM systems should be driven by semantic signals, not only backend health, load, and static rules.

Early AI gateways made LLMs callable and manageable. That foundation is necessary. But the next layer has to answer a harder question:

Given this workload, this policy, this budget, this model pool, and this risk profile, what capability path should the system choose?

vLLM Semantic Router is one attempt to make that question concrete in open source.

It is not the final answer. It should not be.

The field is too early. The right abstractions are still being discovered. Evaluation is still immature. Enterprises have different policies. Models are changing quickly. Hardware is changing more slowly. Edge, cloud, private deployment, and frontier APIs are all evolving at different speeds.

That gap is exactly why routing matters.

Models will keep improving.

Hardware will keep improving.

But the space between them will not organize itself.

There will be old GPUs, new GPUs, edge devices, private clusters, cloud APIs, open models, closed models, specialized models, verifiers, memory systems, and agentic tools all coexisting. The question is not whether one of them wins everything. The question is how to use all of them well.

The future of AI infrastructure is not one model.

It is a system that knows how much intelligence each task deserves.

8. The Next Control Plane

Every important infrastructure era eventually builds a control plane.

Cloud needed orchestration because individual machines were not the right abstraction. Kubernetes emerged because scheduling containers manually did not scale. Service meshes appeared because traffic between services needed policy, observability, and control. API gateways became necessary because external and internal services needed a stable boundary.

AI systems are entering the same phase.

The analogy is not perfect. Semantic workloads are not containers, and intelligence is not a CPU core. But the pattern is familiar: once a resource becomes heterogeneous, expensive, policy-sensitive, and widely used, manual decisions stop scaling. A control plane appears because the system needs a place to express intent, observe behavior, enforce policy, and schedule resources under constraints.

At first, it was enough to call a model. Then it became necessary to manage many models. Now it is becoming necessary to decide how intelligence itself should be allocated.

This control plane will not only optimize for cost. Cost is just the pressure that makes the problem visible. The deeper goal is to optimize system behavior under constraints:

quality without waste
privacy without isolation
latency without shallow answers
safety without paralysis
openness without fragmentation
frontier intelligence without frontier dependency

That is why this problem is bigger than model routing, AI gateways, or agent orchestration alone. It sits between semantic demand and infrastructure supply.

And if AI becomes a general-purpose layer of the economy, this routing layer will matter everywhere: cloud MaaS, private data centers, edge devices, enterprise agents, coding systems, personal AI, and eventually even communication infrastructure where intelligence is placed closer to where work happens.

This will take years.

The first versions will be imperfect. Some routers will be simple classifiers. Some will be policy engines. Some will be inference schedulers with semantic hints. Some will look like agents. Some will be embedded inside gateways. Some will be built into serving engines. Some will run on device.

That is fine.

Infrastructure is not born clean.

It becomes clear through pressure.

And the pressure is already here: token cost, latency, privacy, sovereignty, safety, control, energy, and the simple fact that not every task deserves the same intelligence budget.

9. Beyond the Frontier Default

I still believe in frontier models.

They expand the boundary of what is possible. They make new workflows legible. They let builders discover the future before the rest of the system catches up.

But infrastructure maturity is not measured by how often we use the strongest model. It is measured by whether the system knows when frontier intelligence is worth spending, when an open model is enough, when the task should stay local, when a verifier is more useful than another generation step, and when the safest answer is to ask for more context.

That is the difference between capability and infrastructure.

Capability says: use the strongest model and see what becomes possible.

Infrastructure says: understand the workload, understand the constraints, choose the right capability path, measure the outcome, and improve the system.

Six months ago, I wrote that the second half of LLM routing would build collective intelligence.

I would now add one more line:

The next stage of routing is where AI systems learn to budget intelligence.

Not budget in the narrow finance sense.

Budget in the deeper systems sense: how much capability, cost, latency, privacy exposure, hardware, and energy should this work consume?

That is the question every serious AI system will have to answer.

The future is not a single model behind every request. It is a system that can look at a piece of semantic work and decide how much intelligence, energy, privacy exposure, latency, hardware, and money it is allowed to consume.

That is the infrastructure problem.

Semantic routing is where that answer begins.