The LLMOps stack: the four layers of production LLM tooling, and why they leak
Search "best LLMOps tools" and you get a dozen numbered lists, most written by a vendor that ranks itself first. They are not wrong about the tools; they are wrong about the shape of the problem. Running an LLM in production is not a single purchase. It is a stack of four jobs along the request path: a gateway in front of the models, observability to see what happened, evaluation to decide whether it was good, and guardrails to enforce what is allowed. The useful question is not "which tool wins" but "which layer do I need first, and how do the pieces fit together without locking me in." This guide maps the category as that stack, places 19 tools in it, and compares them on the two axes nobody else lines up: OpenTelemetry support (the portability dividing line) and pricing model (the thing that decides your bill at scale). Match a stack to your situation with our AI stack optimizer, or jump to the comparison matrix.
- LLMOps is a four-layer stack: Gateway and Routing, Observability and Tracing, Evaluation and Testing, Guardrails and Safety. The layers are listed in request-path order, but the build order differs: most teams need observability first and a gateway second.
- The layers leak on purpose. The strongest tools span two or three layers (Langfuse does observability plus eval; Maxim spans eval, observability, and a gateway; Portkey spans gateway, observability, and guardrails). Buying by layer and then checking the overlap beats buying by brand.
- OpenTelemetry is the dividing line. OTel-native tools let you swap backends and avoid lock-in; tools with a proprietary-only SDK trap your instrumentation. Treat OTel support as a first-class selection criterion, not a footnote.
- Pricing model matters more than price. Per-event, per-seat, usage, and flat models scale very differently. A per-trace tool that is cheap in a demo can dominate the bill in production.
- The honest open-source default for observability is Langfuse (MIT, self-hostable, OpenTelemetry-native since v3). For a gateway, Portkey or LiteLLM if you want to hold your own keys; an aggregator like AIMLAPI if you want one bill across hundreds of models.
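The pricing-model point can be made concrete with a little arithmetic. The sketch below is illustrative only: the `Workload` fields and every rate are made-up placeholders, not any vendor's price list. It compares how a per-event, a per-seat, and a flat model respond when request volume grows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    requests_per_month: int
    events_per_request: int  # traces, spans, scores, or logs emitted per request
    seats: int


def per_event_cost(w: Workload, rate_per_1k_events: float) -> float:
    """Cost grows with traffic multiplied by how chatty the instrumentation is."""
    events = w.requests_per_month * w.events_per_request
    return events / 1000 * rate_per_1k_events


def per_seat_cost(w: Workload, rate_per_seat: float) -> float:
    """Cost grows with team size and ignores traffic."""
    return w.seats * rate_per_seat


def flat_cost(monthly_fee: float) -> float:
    """Cost is constant regardless of traffic or team size."""
    return monthly_fee


# Hypothetical workloads and hypothetical rates, chosen only to show the shape.
demo = Workload(requests_per_month=10_000, events_per_request=5, seats=3)
prod = Workload(requests_per_month=20_000_000, events_per_request=5, seats=12)

for name, w in (("demo", demo), ("prod", prod)):
    print(
        name,
        round(per_event_cost(w, 0.10), 2),
        round(per_seat_cost(w, 40.0), 2),
        flat_cost(500.0),
    )
```

With these placeholder numbers the per-event model is the cheapest option for the demo workload and by far the most expensive for the production one, which is the pattern the bullet above warns about.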
The stack, and the thesis that the layers leak
The four layers are the four jobs a request passes through on its way to and from a model. A gateway stands in front: it gives you one API across many providers, routes and falls back between them, caches, and enforces keys and budgets. Observability records what happened: the prompt, the chain of spans, tokens, cost, and latency, so you can debug a bad answer after the fact. Evaluation decides whether the answer was good, offline against datasets in CI and online against live traffic, using LLM-as-judge, human review, and programmatic checks. Guardrails enforce what is allowed in real time: prompt-injection defense, PII redaction, toxicity and factuality checks on input and output.
Gateway / Routing
The control plane in front of the models: a unified API, routing and fallback, caching, and key and budget management. Split it further into true gateways (you hold your provider keys) and aggregators (you buy their credits, they hold the keys).
Portkey · LiteLLM · Cloudflare AI Gateway · Kong AI Gateway · OpenRouter · AIMLAPI · TrueFoundry
Observability / Tracing
Sees what happened: captures every prompt, span, token, cost, and latency for debugging and monitoring. This is where the OpenTelemetry dividing line is sharpest.
Langfuse · LangSmith · Helicone · Arize Phoenix · Traceloop / OpenLLMetry · Datadog LLM Observability
Evaluation / Testing
Decides whether it was good: scores quality offline (CI and datasets) and online (production) through LLM-judge, human, and programmatic evals.
Maxim · Braintrust · Promptfoo · Langfuse evals · DeepEval / Confident AI
Guardrails / Safety
Enforces what is allowed: real-time input and output checks for injection, PII, toxicity, and factuality.
Guardrails AI · NeMo Guardrails · Lakera · Prediction Guard
Why OpenTelemetry is the dividing line
The single axis practitioners argue about most is OpenTelemetry support, because it decides portability. OpenTelemetry (OTel) is the vendor-neutral standard for traces, metrics, and logs, and it now has a GenAI semantic convention for LLM spans. A tool that is OTel-native lets you instrument once and point the data at any compatible backend; if you outgrow the vendor, you keep your instrumentation. A tool with a proprietary-only SDK does the opposite: your traces are written in its dialect, and switching means re-instrumenting your whole application.
It is worth being precise, because the marketing blurs three tiers. Some tools are built on OTel as the native data model (Traceloop / OpenLLMetry is the reference implementation; Arize Phoenix uses OpenInference on top of OTel; Datadog LLM Observability and Langfuse v3 align to the GenAI convention). Some ingest OTLP but ship their own SDK as the primary path. And some are late or partial: LangSmith added OTel export well after launch and remains LangChain-centric; tools like Lunary use non-OTel SDKs that create switching risk. When a vendor claims "OpenTelemetry support," ask which tier. In the matrix below we only mark a tool OTel-native where we could confirm it from the tool's own documentation.
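The "instrument once, swap the backend" idea can be sketched without any vendor SDK. In the standard-library example below, the application emits one backend-neutral span record and the exporter is passed in, so changing backends never touches the application code. The attribute keys follow the names used by the OpenTelemetry GenAI semantic conventions as best we know them; that convention is still evolving, so check the current spec. The two exporters, the `answer` function, and the model name are hypothetical stand-ins.

```python
import json
from typing import Callable, Dict, List

Span = Dict[str, object]
Exporter = Callable[[Span], None]


def llm_span(model: str, input_tokens: int, output_tokens: int) -> Span:
    """Build one backend-neutral span record for a chat call."""
    return {
        "name": f"chat {model}",
        "attributes": {
            "gen_ai.operation.name": "chat",
            "gen_ai.request.model": model,
            "gen_ai.usage.input_tokens": input_tokens,
            "gen_ai.usage.output_tokens": output_tokens,
        },
    }


def make_memory_exporter(sink: List[Span]) -> Exporter:
    """Hypothetical backend A: keeps spans in a list."""
    return sink.append


def make_jsonl_exporter(lines: List[str]) -> Exporter:
    """Hypothetical backend B: serializes each span as one JSON line."""
    def export(span: Span) -> None:
        lines.append(json.dumps(span, sort_keys=True))
    return export


def answer(question: str, export: Exporter) -> str:
    """Application code: instrumented once, unaware of which backend is used."""
    reply = f"stub reply to: {question}"  # stands in for a real model call
    # Word counts stand in for real token counts in this sketch.
    export(llm_span("example-model", len(question.split()), len(reply.split())))
    return reply


backend_a: List[Span] = []
backend_b: List[str] = []
answer("what is OTLP", make_memory_exporter(backend_a))
answer("what is OTLP", make_jsonl_exporter(backend_b))
```

In a real deployment the OpenTelemetry SDK's OTLP exporter plays the role of these stand-ins, and the destination is usually configuration (for example the `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable) rather than code.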
The comparison matrix: OpenTelemetry, self-host, and pricing model
Nineteen tools across the four layers, on the five axes that decide interoperability and cost. "yes" and "no" are read from each vendor's own documentation; "partial" means tier-limited, proxy-based, or unconfirmed-native. Pricing model is the column to read at scale, not the headline price. This table is published as an open dataset under CC-BY (see methodology).
| Tool | Layer | OTel-native | Self-host | Open source | Pricing model | Eval / Guardrails built in |
|---|---|---|---|---|---|---|
| Portkey | Gateway (+obs, guardrails) | yes | yes | Apache-2.0 core | Usage (per log) | guardrails + obs |
| LiteLLM | Gateway | yes | yes | OSS + Enterprise | Free OSS / Enterprise | routing only |
| Cloudflare AI Gateway | Gateway | partial | cloud only | no | Free (with Workers) | no |
| Kong AI Gateway | Gateway | yes | yes | core OSS, AI plugins paid | Enterprise license | policy-level |
| OpenRouter | Gateway (aggregator) | partial | cloud only | no | Flat 5.5% on credits | no |
| AIMLAPI | Gateway (aggregator) | partial | cloud only | no | Usage (pay-as-you-go) | no |
| Langfuse | Observability (+eval) | yes (v3) | yes | MIT | Usage (per unit) | eval + prompt mgmt |
| LangSmith | Observability (+eval) | partial / late | enterprise | no | Seat + per-trace | eval |
| Helicone | Observability (gateway-style) | partial | yes | OSS | Usage (per request) | light eval |
| Arize Phoenix | Observability (+eval) | yes (OpenInference) | yes | OSS | Free OSS / usage | eval |
| Traceloop / OpenLLMetry | Observability | yes (reference) | yes (lib) | Apache-2.0 | Free lib / platform | platform only |
| Datadog LLM Obs | Observability | yes (GenAI conv) | cloud only | no | Usage (per span) | eval |
| Maxim | Eval (+obs, gateway) | yes (Bifrost) | enterprise | Bifrost Apache-2.0 | Usage / Enterprise | obs + guardrails |
| Braintrust | Eval (+obs) | ingests OTLP | enterprise | no | Usage / Enterprise | obs |
| Promptfoo | Eval / red-team | no (focus) | yes (local-first) | OSS | Free OSS / Enterprise | red-team |
| DeepEval / Confident AI | Eval (+obs) | yes | OSS lib only | DeepEval OSS | Usage (cloud) | obs (cloud) |
| Prediction Guard | Guardrails (+inference) | yes (events) | hosted | no | Usage / Enterprise | eval-style checks |
| Guardrails AI | Guardrails | yes (telemetry) | yes | Apache-2.0 | Free OSS / Pro | validation |
| NeMo Guardrails | Guardrails | no | yes | Apache-2.0 | Free OSS | rails only |
Rows whose Layer cell carries a "+" (for example Portkey, Langfuse, and Maxim) are tools that span multiple layers cleanly, the leakage the thesis describes. Pricing models and OTel status reflect each vendor's public documentation as of June 2026 and change often; verify on the vendor's own page before a purchase decision.
This comparison is published as an open dataset (CC-BY) with a permanent DOI: 10.5281/zenodo.20738671. Browse the full dataset landing page or download the machine-readable JSON.
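Because the matrix is tabular, a shortlist can be produced mechanically. The sketch below copies five rows from the table above into plain dictionaries and filters for tools that are both OTel-native and self-hostable. The field names (`tool`, `layer`, `otel`, `self_host`) are assumptions made for illustration and are not the schema of the published JSON.

```python
from typing import Dict, List

# Five rows copied from the comparison matrix; field names are hypothetical.
rows: List[Dict[str, str]] = [
    {"tool": "Langfuse", "layer": "Observability", "otel": "yes (v3)", "self_host": "yes"},
    {"tool": "LangSmith", "layer": "Observability", "otel": "partial / late", "self_host": "enterprise"},
    {"tool": "Arize Phoenix", "layer": "Observability", "otel": "yes (OpenInference)", "self_host": "yes"},
    {"tool": "Datadog LLM Obs", "layer": "Observability", "otel": "yes (GenAI conv)", "self_host": "cloud only"},
    {"tool": "NeMo Guardrails", "layer": "Guardrails", "otel": "no", "self_host": "yes"},
]


def shortlist(table: List[Dict[str, str]], layer: str) -> List[str]:
    """Return tools in a layer whose OTel and self-host cells both start with 'yes'."""
    return [
        row["tool"]
        for row in table
        if row["layer"] == layer
        and row["otel"].startswith("yes")
        and row["self_host"].startswith("yes")
    ]


print(shortlist(rows, "Observability"))
```

Matching on the "yes" prefix keeps qualifiers such as "(v3)" from hiding a row, while "partial", "enterprise", and "cloud only" cells are excluded.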
Layer 1: the gateway, and the question of who holds the key
The cleanest way to split this layer is the one most articles skip: who holds the provider API key. A true gateway sits in your request path and you keep your own keys, getting fallback, load-balancing, caching, and budget control. An aggregator resells model access; you buy their credits, they hold the keys, and you get one bill across many models. They solve different problems, and conflating them is how teams pick the wrong one.
Portkey is the strongest open-source true gateway: an Apache-2.0 core that routes across a very large model catalog, self-hostable, and OpenTelemetry-native, with guardrails and observability layered on top, which is why it shows up as a multi-layer tool in the matrix. Portkey open-sourced its gateway after operating it at large production scale (see Portkey's own statements for current figures). LiteLLM is the OSS SDK-and-proxy that normalizes 100-plus providers into the OpenAI format; it is the pragmatic default when you want full control and will carry the self-host operations yourself. Cloudflare AI Gateway and Kong AI Gateway are the right picks if you already live on those platforms (edge caching for Cloudflare, plugin governance for Kong) and a tax otherwise. On the aggregator side, OpenRouter resells 300-plus models on a flat percentage of credit purchases, and AIMLAPI has quietly assembled one of the broadest catalogs in the category: by its own account, 400-plus models across every modality, text, image, video, and audio, behind one OpenAI-compatible endpoint and one bill. If your problem is "I want to try many models, including non-text, without managing many keys," an aggregator is the honest answer; if your problem is "I need a production control plane I govern," a true gateway is.
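What a true gateway does on the request path (fallback, caching, and budget control) can be shown in a few lines. This is a toy sketch under stated assumptions, not how Portkey or LiteLLM are implemented: the `Gateway` class, the provider stubs, and the call-count budget are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Provider = Callable[[str], str]


class ProviderError(Exception):
    """Raised by a provider stub to simulate an outage or rate limit."""


class Gateway:
    def __init__(self, providers: Sequence[Tuple[str, Provider]], budget_calls: int) -> None:
        self._providers = list(providers)  # tried in order: primary first
        self._budget = budget_calls        # crude budget: upstream calls allowed
        self._cache: Dict[str, str] = {}
        self.log: List[str] = []

    def complete(self, prompt: str) -> str:
        # Cache hit: no provider call, no budget spent.
        if prompt in self._cache:
            self.log.append("cache")
            return self._cache[prompt]
        for name, call in self._providers:
            if self._budget <= 0:
                raise RuntimeError("budget exhausted")
            self._budget -= 1
            try:
                answer = call(prompt)
            except ProviderError:
                self.log.append(f"{name}:failed")
                continue  # fall back to the next provider
            self.log.append(f"{name}:ok")
            self._cache[prompt] = answer
            return answer
        raise RuntimeError("all providers failed")


def flaky_primary(prompt: str) -> str:
    raise ProviderError("simulated outage")


def steady_backup(prompt: str) -> str:
    return f"echo: {prompt}"


gw = Gateway([("primary", flaky_primary), ("backup", steady_backup)], budget_calls=3)
first = gw.complete("hi")
second = gw.complete("hi")
```

The first call fails over from the primary to the backup and the second is served from cache. In this sketch the provider callables would carry your own API keys; with an aggregator that whole loop, and the keys, live on the vendor's side.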
Layer 2: observability, where the OTel line is sharpest
This is the layer most teams should buy first, because you cannot fix what you cannot see, and it is where OpenTelemetry support most changes your future options. Langfuse is the open-source default: MIT-licensed, fully self-hostable for free, spanning tracing, evaluation, and prompt management, and OpenTelemetry-native since v3 (with an OTLP backend), so it interoperates instead of locking you in. The honest caveat is that its unit-based billing (a trace, an observation, and a score each count) can surprise high-emission teams, and the v3 self-host is heavier to run (ClickHouse, Redis, object storage). Arize Phoenix is the other strong open-source pick, built on OpenInference over OTel, excellent as a developer and eval tool, though production RBAC pushes you toward paid Arize. Traceloop / OpenLLMetry is the reference OTel implementation but is a library, not a turnkey platform: you bring your own backend and UI. LangSmith is polished and tightly integrated with LangChain, but it is closed, its OTel export arrived late and partial, and per-trace-plus-seat pricing can spike. Datadog LLM Observability is the natural choice only if you already run Datadog, with the usual cost and lock-in trade-off. Helicone takes an observability-first, gateway-style proxy approach that is fast to adopt but introduces a hop in your path.
Layer 3: evaluation, the layer that decides "is it good"
Evaluation is where the category is moving fastest, and where the leakage is most visible: most eval tools also do observability, because you cannot score production quality without the traces. Maxim is the most full-stack of the cohort: experimentation, agent simulation across thousands of scenarios, evaluation, observability, and even its own open-source gateway, Bifrost (Apache-2.0), which Maxim's published benchmarks put at roughly 20 microseconds of added latency at 5,000 requests per second. The simulation-across-scenarios approach to agent eval is a genuine differentiator few eval tools match; the trade-off is the usual breadth-versus-depth question when one tool competes on many fronts. Braintrust is the polished closed-source eval-plus-observability platform with strong CI ergonomics. Promptfoo is the open-source, local-first, pure-eval and red-team CLI: the right pick for pre-deploy testing in CI, with no live production monitoring of its own. DeepEval / Confident AI pairs an OSS eval framework with a commercial cloud for the production layer. The decision here is offline-versus-online: if you mainly need pre-deploy testing, Promptfoo; if you need a unified online-plus-offline platform, Maxim or Braintrust; if you already run Langfuse or Phoenix, start with their built-in evals before adding a fourth tool.
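For the offline, pre-deploy side of that decision, a programmatic eval reduces to a dataset, a check, and a threshold that fails the build. The sketch below is a minimal stand-in for that idea and not any listed tool's API: `Case`, `pass_rate`, `ci_gate`, and the stub model are hypothetical, and the substring check is the simplest possible programmatic scorer.

```python
from typing import Callable, List, NamedTuple


class Case(NamedTuple):
    prompt: str
    must_contain: str  # simplest programmatic check: expected substring


def pass_rate(model: Callable[[str], str], cases: List[Case]) -> float:
    """Fraction of cases whose output contains the expected substring."""
    passed = sum(
        1 for case in cases
        if case.must_contain.lower() in model(case.prompt).lower()
    )
    return passed / len(cases)


def ci_gate(model: Callable[[str], str], cases: List[Case], threshold: float) -> bool:
    """True when the model clears the quality bar; CI would fail the build otherwise."""
    return pass_rate(model, cases) >= threshold


def stub_model(prompt: str) -> str:
    """Stands in for a real model call; one answer is deliberately wrong."""
    answers = {
        "capital of France?": "Paris is the capital.",
        "2 + 2?": "The answer is 4.",
        "largest planet?": "Saturn",
    }
    return answers.get(prompt, "")


cases = [
    Case("capital of France?", "Paris"),
    Case("2 + 2?", "4"),
    Case("largest planet?", "Jupiter"),
]

rate = pass_rate(stub_model, cases)
print(f"pass rate {rate:.2f}, gate at 0.90: {ci_gate(stub_model, cases, 0.90)}")
```

An LLM-as-judge or human-review scorer would replace the substring check but leave the gate unchanged, which is why the same harness shape appears in both offline and online evaluation.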
Layer 4: guardrails, enforced in real time
Guardrails are the runtime enforcement layer: input and output checks for prompt injection, PII, toxicity, and factuality, applied on the live request rather than after the fact. Guardrails AI is the open-source validation framework with a Hub of community validators and OTel telemetry; its strength is output-quality validation more than adversarial defense. NeMo Guardrails (NVIDIA) is the programmable option, using its Colang language to script rails, which is powerful but carries a learning curve. Lakera is the API-first commercial guard focused on prompt-injection and PII defense, with cost that scales with traffic. Prediction Guard takes a distinct path: rather than bolting a guard library onto someone else's model, it offers privacy-conserving hosted inference with guardrails native to the path (PII, injection, toxicity, factuality). It runs on Intel Gaudi 2 accelerators and was an early adopter of that hardware for paying-customer LLM workloads. Its founder, Daniel Whitenack, hosts the long-running Practical AI podcast, which tells you the posture: opinionated, security-first, built for teams that cannot send data to a public API. First-party options exist too (AWS Bedrock Guardrails, Azure AI Content Safety, Vertex Model Armor), but they are cloud-locked and weaker on adversarial benchmarks.
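"Applied on the live request" means the check wraps the model call on both sides. The sketch below shows that shape for PII redaction only, using two deliberately simple regular expressions. It is an illustration under stated assumptions, not a production guard and not the API of any tool named above: the patterns will miss many real formats, and `redact` and `guarded_call` are hypothetical names.

```python
import re
from typing import Callable

# Deliberately simple patterns for illustration; real PII detection needs far more.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def guarded_call(model: Callable[[str], str], prompt: str) -> str:
    """Input rail before the model sees the prompt, output rail before the caller sees the reply."""
    safe_prompt = redact(prompt)
    reply = model(safe_prompt)
    return redact(reply)


def echo_model(prompt: str) -> str:
    """Stands in for a real model call."""
    return f"You said: {prompt}"


print(redact("Mail jane.doe@example.com or call +1 415 555 0100."))
print(guarded_call(echo_model, "I am bob@corp.io"))
```

Injection, toxicity, and factuality checks slot into the same two positions; what differs between the tools in this layer is how those checks are authored and how much latency they add.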
Which layer, and which tool, for your situation
Build down the request path, but rarely all at once. The order that works for most teams: observability first (you need to see), then a gateway (control and fallback), then evaluation (a quality bar before you scale), then guardrails (enforcement as exposure grows). Match the move to your situation:
Solo builder / pre-revenue
One app, one or two providers, no budget for seats.
Self-host Langfuse (obs+eval) + LiteLLM (gateway). Both free, both OTel-friendly.
Series-A product team
Real traffic, multiple providers, a quality bar to defend.
Portkey (gateway+guardrails) + Langfuse or Phoenix (obs) + Promptfoo in CI.
Privacy / regulated
Data cannot leave your perimeter or a trusted host.
Prediction Guard (private inference + guardrails) + self-hosted Langfuse.
Agent / eval-heavy team
Multi-step agents you need to test across many scenarios.
Maxim (simulation + eval + obs) as the spine; add a gateway when you go multi-provider.
Already on a platform
Standardized on Datadog, Cloudflare, or Kong.
Use the first-party AI layer (Datadog LLM Obs, CF/Kong AI Gateway) before adding a new vendor.
Want one bill, many models
Experimenting across many models and modalities.
An aggregator (AIMLAPI for breadth across modalities; OpenRouter for flat-fee text) + Langfuse for traces.
The honest per-layer verdicts
Best open-source observability
Layer 2: Langfuse, the rare platform you can fully self-host for free that also became OpenTelemetry-native, so it interoperates instead of trapping you. Watch unit-based billing on the cloud tier and the heavier v3 self-host.
Best self-governed gateway
Layer 1: Portkey for an OSS, OTel-native, production-proven control plane you hold the keys for; LiteLLM if you want a thinner proxy and will run the ops yourself.
Best breadth across models and modalities
Layer 1 (aggregator): AIMLAPI, with 400-plus models across text, image, video, and audio behind one OpenAI-compatible endpoint and one bill. The honest framing is that it is a marketplace, not a control plane you govern.
Best agent / full-stack evaluation
Layer 3: Maxim, with experimentation, simulation, eval, and observability in one, plus an OSS gateway (Bifrost). The trade is breadth versus depth. Promptfoo remains the cleanest OSS pick for pure pre-deploy CI testing.
Best privacy-first guardrails
Layer 4: Prediction Guard, with guardrails native to a private inference path rather than bolted onto a public API, for teams that cannot send data out. Guardrails AI is the OSS default for output validation.
Methodology and conflict disclosure
- Sample: 19 tools across four layers, selected for category relevance, not for who pays. Defunct and acquired products (for example HumanLoop, wound down after an August 2025 acqui-hire) are excluded from the live comparison.
- Criteria: OpenTelemetry-native support, self-host availability, open-source licensing, pricing model, and whether evaluation and guardrails are built in. Each cell is read from the vendor's own documentation, repository, or pricing page.
- OTel claims: Marked "native" only where confirmed from the tool's own docs; "partial" denotes tier-limited, proxy-based, ingest-only, or unconfirmed-native. Vendor performance claims (for example latency benchmarks) are attributed, not asserted as independent fact.
- Conflicts: Rankings and the stack model were fixed before any monetization check. Nesyona has no paid placement in this piece, no sponsorship from any tool listed, and no affiliate relationship that altered the order. Where a vendor runs a public affiliate program, an outbound link may be tagged; it does not change a placement.
- Last verified: June 2026. This layer reprices and ships fast; verify any pricing or OTel detail on the vendor's own page before deciding.
- Reviewed by: The Nesyona editorial team, against public docs, GitHub, and OpenTelemetry support status.
If you build one of these tools and want to check that we have represented it fairly, or that a pricing or OTel detail has moved, we would genuinely welcome the correction. The goal of this page is to be the one map of the category that is not written by a vendor ranking itself first.
Adjacent reading on Nesyona: the best AI agent frameworks that sit above this stack, the best AI coding assistants for the teams building it, and the best local and self-hosted AI for the privacy-first end of the spectrum.
The bottom line
LLMOps is not a leaderboard, it is a stack. Map your problem onto the four layers, buy the one you need first, and let OpenTelemetry connect the rest so you are never re-instrumenting to escape a vendor. Read the pricing model, not the headline number, because per-event and per-seat tools scale very differently at production volume. And remember the thesis: the layers leak, so the leading tool in one layer often saves you a purchase in the next. Start with observability (Langfuse is the open-source default), add a gateway you govern (Portkey or LiteLLM) or an aggregator for breadth (AIMLAPI), put a quality bar in CI (Promptfoo, or Maxim for agents), and enforce at the edge (Prediction Guard or Guardrails AI) as your exposure grows.
Sources
- OpenTelemetry, GenAI semantic conventions and OTLP specification, opentelemetry.io (accessed June 2026).
- Langfuse documentation and pricing, langfuse.com (v3 OpenTelemetry support, MIT license, self-host requirements).
- Portkey gateway repository and OpenTelemetry docs, portkey.ai and github.com/portkey-ai/gateway.
- Maxim and Bifrost gateway repository and published benchmarks, getmaxim.ai and github.com/maximhq/bifrost.
- AIMLAPI model catalog and documentation, aimlapi.com.
- Prediction Guard documentation and Intel developer write-up on privacy-conserving LLM inference on Intel Gaudi 2.
- Arize Phoenix (OpenInference) and Traceloop / OpenLLMetry repositories.
- Datadog LLM Observability and its GenAI OpenTelemetry semantic-convention blog.
- OpenRouter, LiteLLM, Kong AI Gateway, Cloudflare AI Gateway, Braintrust, Promptfoo, DeepEval / Confident AI, Guardrails AI, and NeMo Guardrails official documentation and pricing pages.
- TechCrunch, on Anthropic's August 2025 acqui-hire of the HumanLoop team.