Semantic Routing as Energy Infrastructure
I keep coming back to one question: if intelligence becomes an infrastructure resource, who decides where it should run?
In the old network, routing mostly moved packets. In the AI stack, the thing being moved is semantic work: intent, uncertainty, privacy risk, reasoning demand, memory, tool use, and action. Once traffic starts carrying meaning, the routing layer cannot stay a thin forwarding layer.
This is why tokens feel like the wrong primitive to optimize alone. Tokens are easy to count, but they are not equal. Tokens produced by a small open model, a specialized model, a frontier closed model, or a local edge model differ widely in cost, latency, energy footprint, and risk profile. The deeper question is whether the system spent the right kind of intelligence in the right place.
- Tokens are not equal. The cost gap between model paths can be orders of magnitude, so token economics has to ask whether each token was spent in the right intelligence tier.
- Energy is the hidden unit of intelligence. A model is not only capability; it is hardware, power, latency, supply, and operating cost.
- The durable problem is coordination. The future is not one frontier model serving everything, but a heterogeneous fabric of closed models, open models, tools, verifiers, memory, edge devices, and different generations of hardware.
That is the lens behind Semantic Routing as Energy Infrastructure. To me, semantic routing is the control layer that decides where semantic work should live: when to stay cheap and local, when to escalate, when to retrieve, when to verify, and when to spend expensive intelligence. It is not just model selection. It is resource scheduling for intelligence.
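To make the idea of routing as resource scheduling concrete, here is a minimal sketch of a rule-based router. Everything in it is an assumption for illustration: the `Tier` names, the `Signals` fields, the thresholds, and the relative cost units are hypothetical placeholders, not measured numbers and not the API of vLLM Semantic Router or any other project.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    """Hypothetical intelligence tiers, ordered from cheapest to most expensive."""
    LOCAL_EDGE = "local_edge"
    SMALL_OPEN = "small_open"
    FRONTIER = "frontier"


# Illustrative relative cost units per 1K tokens. Placeholders, not real prices.
RELATIVE_COST = {Tier.LOCAL_EDGE: 1, Tier.SMALL_OPEN: 5, Tier.FRONTIER: 100}


@dataclass(frozen=True)
class Signals:
    """Hypothetical workload signals extracted from a request."""
    reasoning_demand: float   # 0.0 (trivial) to 1.0 (hard multi-step reasoning)
    privacy_risk: float       # 0.0 (public) to 1.0 (highly sensitive)
    needs_fresh_facts: bool   # True if the answer depends on external knowledge
    uncertainty: float        # 0.0 (confident) to 1.0 (unsure)


@dataclass(frozen=True)
class Decision:
    tier: Tier
    retrieve: bool
    verify: bool


def route(s: Signals) -> Decision:
    """Pick a tier and decide whether to retrieve and verify (assumed thresholds)."""
    if s.privacy_risk >= 0.8:
        # Privacy boundary comes first: sensitive work stays local.
        tier = Tier.LOCAL_EDGE
    elif s.reasoning_demand >= 0.7:
        # Escalate only when the work needs expensive intelligence.
        tier = Tier.FRONTIER
    elif s.reasoning_demand >= 0.3:
        tier = Tier.SMALL_OPEN
    else:
        # Default: stay cheap and local.
        tier = Tier.LOCAL_EDGE
    return Decision(tier=tier, retrieve=s.needs_fresh_facts, verify=s.uncertainty >= 0.5)


def estimated_cost(decision: Decision, kilo_tokens: float) -> float:
    """Relative cost of a decision, in the placeholder units defined above."""
    return RELATIVE_COST[decision.tier] * kilo_tokens


if __name__ == "__main__":
    easy = route(Signals(0.1, 0.0, False, 0.1))
    hard = route(Signals(0.9, 0.1, True, 0.6))
    print(easy, estimated_cost(easy, 2))
    print(hard, estimated_cost(hard, 2))
```

In this sketch the same two thousand tokens differ by two orders of magnitude in relative cost depending on the tier chosen, which is the sense in which tokens are not equal. A real control layer would presumably learn or configure these thresholds rather than hard-code them.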
My research direction is to make this layer real and measurable: workload signals, routing memory, policy languages, evaluation, cost-quality frontiers, privacy boundaries, and cross-layer scheduling. vLLM Semantic Router is one concrete step toward an open semantic control plane for AI systems: inspectable, composable, and shared by design.
vLLM Semantic Router
Co-Founder
Signal-driven decision routing for mixture-of-modality deployments.
Elephant Agent
Creator
Personal-model-first self-evolving AI agent that grows correctable understanding and gets curious at the user's pace.
Inferoa
Builder
Inference-native tokenmaxxing agent harness for loop engineering.
Envoy Gateway
Steering Committee and Maintainer
Manages Envoy Proxy as a standalone or Kubernetes-based application gateway.
Envoy AI Gateway
Maintainer
Manages unified access to generative AI services built on Envoy Gateway.
Fast and Faithful: Real-Time Verification for Long-Document Retrieval-Augmented Generation Systems
SIGIR 2026 Industry Track
Token-Budget-Aware Pool Routing for Cost-Efficient LLM Inference
arXiv Technical Report
vLLM Semantic Router: Signal Driven Decision Routing for Mixture-of-Modality Models
arXiv Technical Report
When to Reason: Semantic Router for vLLM
NeurIPS - MLForSys
Agentic Intelligence Lab
Chair
Chairing the lab's research and community work on agentic AI, personal AI agents, and system intelligence.
Kubernetes AI Gateway WorkGroup
Co-Chair
Leading the community effort to define standards for AI Gateway in the Kubernetes ecosystem.
CNCF Ambassador
Fall 2023 Ambassador
Representing and promoting Cloud Native Computing Foundation projects and values globally.
Linux Foundation APAC Open Source Evangelist
2024 Program
Advocating for open source adoption and best practices across the Asia-Pacific region.
KubeCon Program Committee
KubeCon 2024 Hong Kong
Reviewing and selecting talks for one of the largest cloud-native conferences.