# llm-d Operations Competence Center Switzerland

> llm-d consulting and operations in Switzerland. VSHN deploys and manages distributed LLM inference on Kubernetes with Swiss data residency. ISO 27001 certified.


Distributed LLM inference with llm-d, deployed and operated on Swiss infrastructure by VSHN. Our engineers configure disaggregated prefill/decode serving, intelligent request routing, and KV cache transfer on Kubernetes - for organizations that need to scale inference beyond single-node limits with full Swiss data residency. Part of VSHN's [LLM Operations practice](https://www.llmops.ch).


## Pages

- [Homepage](https://www.llm-d.ch/): llm-d Experts – Distributed LLM Inference Switzerland | VSHN
- [llm-d Sovereignty — Swiss Distributed Inference | VSHN](https://www.llm-d.ch/sovereignty.md)

## Features

- **llm-d Consulting and Architecture**: Design and deploy production-grade llm-d topologies tailored to your inference workloads. VSHN architects disaggregated serving stacks with separate prefill and decode phases, optimised KV cache transfer, and intelligent request routing. We help you choose the right model server backend, GPU allocation, and Kubernetes cluster layout for maximum throughput and minimum latency.

- **Intelligent Inference Scheduling**: Use llm-d's routing layer for prefix-cache-aware request scheduling and multi-tenant fairness. VSHN configures production-grade inference gateways with load-aware routing, priority queues, and session affinity so your applications achieve optimal time-to-first-token while sharing GPU resources fairly across teams and workloads.

- **Prefill and Decode Disaggregation**: Separate prefill and decode phases across dedicated GPU pools to optimise both time-to-first-token and inter-token latency independently. VSHN deploys llm-d's disaggregated architecture with KV cache transfer between nodes, allowing you to scale each phase independently based on your workload profile and latency targets on Kubernetes infrastructure.

- **Swiss Cloud and GPU Infrastructure**: LLM inference, model weights, and request logs stay in Swiss data centers. VSHN operates on Exoscale, Cloudscale, and other Swiss cloud providers, ensuring full GDPR compliance and data residency for organizations that cannot afford to send sensitive prompts and completions to hyperscaler regions outside Switzerland.

- **Kubernetes-Native Operations**: Run llm-d on production Kubernetes clusters with Helm charts, automated scaling, and GitOps workflows. VSHN deploys on APPUiO, Red Hat OpenShift, and enterprise Kubernetes platforms with NVIDIA device plugins, GPU resource quotas, and horizontal pod autoscaling based on inference queue depth and latency targets to optimise cost and performance.

- **24/7 Support and Incident Response**: Monitor llm-d inference latency, throughput, token generation rates, and GPU utilisation across your entire serving fleet. VSHN integrates Prometheus, Grafana, and custom dashboards into your platform with 24/7 operations support and SLA-backed incident response, so performance issues are caught and resolved before they affect your users and applications.


## llm-d FAQ

### What is llm-d and how does it differ from vLLM?

llm-d is an open-source, Kubernetes-native distributed inference framework, a joint initiative by Red Hat, Google, and other contributors, that sits above model servers like vLLM. While vLLM handles single-node model execution, llm-d adds intelligent request routing, prefill/decode disaggregation, and distributed KV cache management across multiple nodes. It supports models like Llama, Mistral, and Apertus (the Swiss AI foundation model, Apache 2.0 licensed). VSHN deploys both technologies and helps you choose the right architecture for your inference workloads based on scale and latency requirements.


### What platforms does VSHN support for llm-d workloads?

VSHN deploys and operates llm-d workloads on APPUiO (our managed Kubernetes platform), Red Hat OpenShift, enterprise private cloud infrastructure, and sovereign cloud partners. All platforms run on Swiss or European data centers and are backed by up to 99.99% uptime SLA. We help you choose the right platform based on your compliance, performance, and budget requirements.


### Which cloud providers are available for llm-d deployments?

VSHN operates on multiple Swiss cloud providers including Exoscale and Cloudscale, as well as European sovereign cloud partners. For organizations that need GPU-accelerated workloads, we work with providers offering GPU instances in Swiss data centers on public and private cloud. All infrastructure is managed under a single SLA with 24/7 support from our operations team.


### How does prefill/decode disaggregation improve performance?

LLM requests are non-uniform: a short question uses far fewer resources than a request that processes a long document. The prefill phase (processing the input prompt) is compute-bound, while the decode phase (generating tokens one by one) is memory-bound. Disaggregation separates these onto dedicated GPU pools so each can be scaled and optimised independently. VSHN configures KV cache transfer between nodes, achieving lower time-to-first-token and higher overall throughput compared to monolithic serving where a large request blocks smaller ones.


### How does VSHN scope and quote llm-d consulting engagements?

Every engagement starts with a free architecture consultation where we assess your model serving needs, GPU requirements, and compliance constraints. VSHN then delivers a written scope document with a fixed-price or time-and-materials quote in CHF. Typical engagements cover cluster design, llm-d deployment, observability setup with Prometheus and Grafana, and backup automation for model artefacts and configuration data. Distributed serving involves models of tens of GB or more, so we size storage accordingly. There is no commitment at the scoping stage.


### How does VSHN ensure data sovereignty for llm-d workloads?

All infrastructure runs in Swiss data centers operated by Swiss or European sovereign cloud providers. Model weights, input prompts, generated completions, and inference logs never leave the chosen jurisdiction. All operational access is from Switzerland-based engineers, and we provide audit trails for compliance reporting.


### Can llm-d integrate with existing AI pipelines?

Yes. llm-d exposes an OpenAI-compatible API through its gateway layer, so existing applications using OpenAI client libraries can switch to self-hosted models without code changes. VSHN also integrates llm-d with [LiteLLM](https://www.litellm.ch) gateways, retrieval-augmented generation pipelines, and managed PostgreSQL with pgvector for vector storage - with automated backups and up to 99.99% SLA as all our VSHN-operated databases.


### What monitoring and observability does VSHN provide for llm-d?

VSHN integrates Prometheus and Grafana into every managed platform, with custom dashboards for llm-d-specific metrics: inference latency (p50, p95, p99), tokens per second, GPU utilisation, queue depth, and estimated cost per request. Alerting rules notify your team and our 24/7 operations center when metrics breach thresholds, so performance issues are caught before they affect users.


### How do I get started with VSHN's llm-d consulting?

Contact us through the form below for a free initial consultation. We assess your current model serving needs, platform requirements, and compliance constraints, then propose an architecture running on APPUiO, OpenShift, or your preferred infrastructure. llm-d consulting is part of VSHN's broader LLM Operations practice -- see [llmops.ch](https://www.llmops.ch) for the full picture.


## Book an llm-d consultation

Tell us about your distributed inference requirements. VSHN provides a free initial consultation covering llm-d architecture, GPU sizing, and a scoped proposal for your deployment on Swiss infrastructure.

---

## llm-d Sovereignty — Swiss Distributed Inference | VSHN

# llm-d Sovereignty: Distributed Inference on Swiss Infrastructure

llm-d is an open-source, Kubernetes-native framework for distributed LLM inference. That matters for sovereignty: you can serve large models across multiple GPUs in Switzerland, with full visibility into the code that processes every prompt and completion.

When you use OpenAI's API, Azure OpenAI, or AWS Bedrock for inference, every prompt and every model output passes through US infrastructure, governed by US law, and accessible under the [CLOUD Act](https://en.wikipedia.org/wiki/CLOUD_Act) without Swiss judicial process. Your prompts and completions stay outside foreign jurisdiction only when the inference stack itself runs on Swiss soil.

Sovereignty is more than where GPUs are located. The EU Cloud Sovereignty Framework defines eight dimensions that determine whether your provider is truly sovereign.

## Why llm-d is a strong choice for sovereign inference

Unlike proprietary inference APIs from OpenAI, Google, or Amazon, llm-d gives you:

- **No vendor lock-in**: serve any compatible open-weight model (Llama, Mistral, Qwen, and others)
- **Full code auditability**: every component of llm-d is inspectable on GitHub
- **No data exfiltration**: prompts and outputs stay on your infrastructure, period
- **Kubernetes-native scaling**: distribute inference across GPU nodes without proprietary orchestration
- **Hardware flexibility**: run on the accelerators you choose, without API vendor approval

VSHN deploys and operates llm-d on Swiss Kubernetes clusters with GPU scheduling. Combined with VSHN's Swiss ownership and operations, this creates a fully sovereign inference platform.

## llm-d sovereignty compared

| Dimension | OpenAI API | Azure OpenAI | AWS Bedrock Inference | VSHN Managed llm-d |
|-----------|-----------|-------------|---------------------|------------------|
| **Ownership** | OpenAI (USA) | Microsoft (USA) | Amazon (USA) | VSHN AG (Switzerland) |
| **Governing law** | US law | US law | US law | Swiss law |
| **CLOUD Act** | Exposed | Exposed | Exposed | Not exposed |
| **Data location** | USA | Regional (US-controlled) | Regional (US-controlled) | Switzerland (Cloudscale, Exoscale, or your choice) |
| **Inference stack** | Proprietary | Proprietary | Proprietary | Open source (llm-d, Kubernetes) |
| **Prompt data access** | Provider has access, may use for training | Microsoft has access | Amazon has access | VSHN has operational access only for authorized support — never used for model training |
| **Operations team** | USA | USA | USA | Switzerland ([Swiss-only option](https://products.vshn.ch/support_plans.html#_option_switzerland_only_support)) |
| **Certifications** | SOC 2 | SOC 2, ISO 27001 | SOC 2, ISO 27001 | [ISO 27001](https://www.vshn.ch/wp-content/uploads/2025/12/ISO-27001-certificate-VSHN-2024.pdf), ISAE 3402 Type II |

## VSHN sovereignty self-assessment

We applied the EU's [Cloud Sovereignty Framework](https://commission.europa.eu/document/09579818-64a6-4dd5-9577-446ab6219113_en) (v1.2.1, October 2025) to our own services. This framework was used to score providers in the EU's [EUR 180M sovereign cloud tender](https://ec.europa.eu/commission/presscorner/detail/en/ip_26_833) in April 2026. Three pure-European providers achieved SEAL-3, while a consortium involving Google Cloud scored only SEAL-2.

*This is a self-assessment, not a formal SEAL certification. We publish it for transparency so customers can evaluate our sovereignty profile using the same structured criteria the EU uses.*

| # | Dimension | Weight | Assessment | Evidence |
|---|-----------|--------|-----------|----------|
| SOV-1 | Strategic | 15% | **Strong** | Swiss AG, no foreign parent, all shareholders Swiss citizens ([Commercial Register](https://zh.chregister.ch/cr-portal/auszug/auszug.xhtml?uid=CHE-275.566.226)) |
| SOV-2 | Legal | 10% | **Strong** | Swiss law ([GTC](https://products.vshn.ch/legal/gtc_en.html)), no CLOUD Act, [EU adequacy decision](https://commission.europa.eu/law/law-topic/data-protection/international-dimension-data-protection/adequacy-decisions_en) |
| SOV-3 | Data & AI | 10% | **Strong** | Swiss DCs by default. Sovereign key management via [Managed OpenBao](https://www.openbao.ch) + [Swiss HSM](https://cloud.securosys.com/cloudhsm) |
| SOV-4 | Operational | 15% | **Strong** | Swiss 24/7 ops, [Swiss-only support option](https://products.vshn.ch/support_plans.html#_option_switzerland_only_support). All services on vanilla Kubernetes |
| SOV-5 | Supply Chain | 20% | **Strong** | Infrastructure-agnostic — [customer chooses provider](https://servala.com/providers/). Open-source software |
| SOV-6 | Technology | 15% | **Strong** | 100% open source. VSHN contributes to [K8up](https://github.com/k8up-io) (CNCF), [Crossplane providers](https://github.com/vshn), [Project Syn](https://github.com/projectsyn) |
| SOV-7 | Security | 10% | **Strong** | [ISO 27001](https://www.vshn.ch/wp-content/uploads/2025/12/ISO-27001-certificate-VSHN-2024.pdf), ISAE 3402 Type II, Swiss SOC. [FINMA-regulated customers](https://www.vshn.ch/en/solutions/solutions-for-banks-and-financial-service-providers/) |
| SOV-8 | Environmental | 5% | **Moderate** | DC operators: Green Datacenter AG (ISO 22301/27001/27701), [Exoscale sustainability](https://www.exoscale.com/sustainability/). [VSHN CSR policy](https://handbook.vshn.ch/corporate_social_responsibility_policy.html) |

**Overall: SEAL-3 equivalent**, the same level achieved by the winners of the EU's own sovereignty tender. No provider worldwide achieved SEAL-4: it requires fully EU/EEA-sourced hardware supply chains and open-source foundations, structural gaps shared by every cloud provider.

Try Swiss infrastructure: [APPUiO](https://www.appuio.ch) (managed Kubernetes, free trial), [Exoscale]({{partner:exoscale.signup_url}}) (Swiss IaaS). Want help choosing? [Contact us](#contact).

## Get a sovereignty assessment for your inference setup

If you're running inference through US-hosted APIs or evaluating sovereign alternatives, we can assess your current setup against the EU framework and design an llm-d deployment that keeps your prompts and model outputs under Swiss jurisdiction.