Overview
Modelplane is the open source control plane for AI inference. It’s software you install and run in your own environment, and it orchestrates the models, serving stack, and infrastructure across cloud, neocloud, and on-premise. Modelplane supports running any model and any engine on any infrastructure, with the frontier-level serving topologies and performance the largest models demand, from a single GPU to disaggregated, multi-node deployments.
Modelplane operates across the whole fleet: provisioning inference clusters, scheduling model deployments on compatible clusters, autoscaling model replicas across clusters, caching model weights across clusters, and routing across clusters.
It’s an active system that is always reconciling the fleet toward the state you declare. You install Modelplane on a Kubernetes cluster, which becomes the control cluster for your inference fleet. It’s built on Crossplane and fully integrates with your existing platform systems.
Deploy a model
Modelplane’s API is declarative, designed for platform teams responsible for the inference infrastructure and developers deploying models on that infrastructure.
Once a platform team has provisioned inference clusters and declared the available GPUs and networking fabric, an ML development team deploys a model with a declarative manifest:
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
name: qwen-demo
namespace: ml-team
spec:
replicas: 1
engines:
- name: qwen
members:
- role: Standalone
nodeSelector:
devices:
- name: gpu
count: 1
selectors:
- cel: device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("20Gi")) >= 0
template:
spec:
containers:
- name: engine
image: vllm/vllm-openai:v0.23.0
args: ["--model=Qwen/Qwen2.5-0.5B-Instruct"]Modelplane schedules a model replica onto an inference cluster with free, compatible GPUs and memory, and deploys the serving engine. Exposing an OpenAI-compatible endpoint can be done by declaring a model service:
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
name: qwen
namespace: ml-team
spec:
endpoints:
- selector:
matchLabels:
modelplane.ai/deployment: qwen-demoA universal control plane for AI inference
Modelplane is designed to be a universal control plane for inference. It runs inference clusters on any cloud, neocloud, or on-premise environment, or any combination of them. Modelplane can provision the clusters for you, or you can bring your own.
It supports any serving engine that runs as a container, and can serve frontier-quality models using advanced topologies including tensor parallel, pipeline parallel, data and expert parallel, and prefill/decode disaggregation. Modelplane works across different accelerators and networking fabrics, and schedules each model’s replicas by matching the model’s hardware requirements to the hardware available across your clusters.
What Modelplane is not
Modelplane is not a serving engine like vLLM, SGLang, or TensorRT-LLM. Modelplane composes serving engines and orchestrates them fleet-wide across cloud, neocloud, and on-premise. Modelplane is not a managed inference service like Baseten, Together, or Fireworks. These offer cloud services, while Modelplane is self-hosted software.