Updated June 2026 ยท 18 min read ยท Reviewed by the Nesyona editorial team against each tool's public documentation, GitHub repository, pricing page, and OpenTelemetry support status

The LLMOps stack: the four layers of production LLM tooling, and why they leak

Search "best LLMOps tools" and you get a dozen numbered lists, most written by a vendor that ranks itself first. They are not wrong about the tools; they are wrong about the shape of the problem. Running an LLM in production is not a single purchase. It is a stack of four jobs along the request path: a gateway in front of the models, observability to see what happened, evaluation to decide whether it was good, and guardrails to enforce what is allowed. The useful question is not "which tool wins" but "which layer do I need first, and how do the pieces fit together without locking me in." This guide maps the category as that stack, places 19 tools in it, and compares them on the two axes nobody else lines up: OpenTelemetry support (the portability dividing line) and pricing model (the thing that decides your bill at scale). Match a stack to your situation with our AI stack optimizer, or jump to the comparison matrix.

Last reviewed: June 2026 Next review: December 2026
Bottom line up front
4
Layers in the LLMOps stack (gateway / observability / eval / guardrails)
19
Tools placed and compared across the four layers
OTel
The interoperability standard that decides portability vs lock-in
3
Layers the best tools typically span (the leakage is the point)
5
Comparison axes: OTel-native, self-host, OSS, pricing model, eval/guardrails built in
THE LLMOPS STACK -- ALONG THE REQUEST PATH your app 1 ยท Gateway / Routing unified API ยท routing ยท fallback ยท caching ยท key & budget control 2 ยท Observability / Tracing every prompt ยท span ยท token ยท cost ยท latency, captured for debugging 3 ยท Evaluation / Testing offline (CI / datasets) + online quality scoring: LLM-judge, human, code 4 ยท Guardrails / Safety real-time input/output checks: injection ยท PII ยท toxicity ยท factuality OpenTelemetry runs vertically through all four -- the connective tissue.

The stack, and the thesis that the layers leak

The four layers are the four jobs a request passes through on its way to and from a model. A gateway stands in front: it gives you one API across many providers, routes and falls back between them, caches, and enforces keys and budgets. Observability records what happened: the prompt, the chain of spans, tokens, cost, and latency, so you can debug a bad answer after the fact. Evaluation decides whether the answer was good, offline against datasets in CI and online against live traffic, using LLM-as-judge, human review, and programmatic checks. Guardrails enforce what is allowed in real time: prompt-injection defense, PII redaction, toxicity and factuality checks on input and output.

1

Gateway / Routing

The control plane in front of the models: a unified API, routing and fallback, caching, and key and budget management. Split it further into true gateways (you hold your provider keys) and aggregators (you buy their credits, they hold the keys).

Portkey ยท LiteLLM ยท Cloudflare AI Gateway ยท Kong AI Gateway ยท OpenRouter ยท AIMLAPI ยท TrueFoundry

2

Observability / Tracing

Sees what happened: captures every prompt, span, token, cost, and latency for debugging and monitoring. This is where the OpenTelemetry dividing line is sharpest.

Langfuse ยท LangSmith ยท Helicone ยท Arize Phoenix ยท Traceloop / OpenLLMetry ยท Datadog LLM Observability

3

Evaluation / Testing

Decides whether it was good: scores quality offline (CI and datasets) and online (production) through LLM-judge, human, and programmatic evals.

Maxim ยท Braintrust ยท Promptfoo ยท Langfuse evals ยท DeepEval / Confident AI

4

Guardrails / Safety

Enforces what is allowed: real-time input and output checks for injection, PII, toxicity, and factuality.

Guardrails AI ยท NeMo Guardrails ยท Lakera ยท Prediction Guard

The thesis: these layers leak, and that is the whole point. Treating them as clean silos is the mistake every listicle makes. Langfuse spans observability and evaluation. Maxim spans evaluation, observability, and ships its own open-source gateway. Portkey spans gateway, observability, and guardrails. Prediction Guard fuses guardrails with hosted inference. The practical consequence: pick the layer you need first, then check what the leading tool in that layer already covers in the layers next to it, because OpenTelemetry is the connective tissue that lets you assemble the rest without re-instrumenting.

Why OpenTelemetry is the dividing line

The single axis practitioners argue about most is OpenTelemetry support, because it decides portability. OpenTelemetry (OTel) is the vendor-neutral standard for traces, metrics, and logs, and it now has a GenAI semantic convention for LLM spans. A tool that is OTel-native lets you instrument once and point the data at any compatible backend; if you outgrow the vendor, you keep your instrumentation. A tool with a proprietary-only SDK does the opposite: your traces are written in its dialect, and switching means re-instrumenting your whole application.

It is worth being precise, because the marketing blurs three tiers. Some tools are built on OTel as the native data model (Traceloop / OpenLLMetry is the reference implementation; Arize Phoenix uses OpenInference on top of OTel; Datadog LLM Observability and Langfuse v3 align to the GenAI convention). Some ingest OTLP but ship their own SDK as the primary path. And some are late or partial: LangSmith added OTel export well after launch and remains LangChain-centric; tools like Lunary use non-OTel SDKs that create switching risk. When a vendor claims "OpenTelemetry support," ask which tier. In the matrix below we only mark a tool OTel-native where we could confirm it from the tool's own documentation.

The comparison matrix: OpenTelemetry, self-host, and pricing model

Nineteen tools across the four layers, on the five axes that decide interoperability and cost. yes and no are read from each vendor's own documentation; partial means tier-limited, proxy-based, or unconfirmed-native. Pricing model is the column to read at scale, not the headline price. This table is published as an open dataset under CC-BY (see methodology).

ToolLayerOTel-nativeSelf-hostOpen sourcePricing modelEval / Guardrails built in
PortkeyGateway (+obs, guardrails)yesyesApache-2.0 coreUsage (per log)guardrails + obs
LiteLLMGatewayyesyesOSS + EnterpriseFree OSS / Enterpriserouting only
Cloudflare AI GatewayGatewaypartialcloud onlynoFree (with Workers)no
Kong AI GatewayGatewayyesyescore OSS, AI plugins paidEnterprise licensepolicy-level
OpenRouterGateway (aggregator)partialcloud onlynoFlat 5.5% on creditsno
AIMLAPIGateway (aggregator)partialcloud onlynoUsage (pay-as-you-go)no
LangfuseObservability (+eval)yes (v3)yesMITUsage (per unit)eval + prompt mgmt
LangSmithObservability (+eval)partial / lateenterprisenoSeat + per-traceeval
HeliconeObservability (gateway-style)partialyesOSSUsage (per request)light eval
Arize PhoenixObservability (+eval)yes (OpenInference)yesOSSFree OSS / usageeval
Traceloop / OpenLLMetryObservabilityyes (reference)yes (lib)Apache-2.0Free lib / platformplatform only
Datadog LLM ObsObservabilityyes (GenAI conv)cloud onlynoUsage (per span)eval
MaximEval (+obs, gateway)yes (Bifrost)enterpriseBifrost Apache-2.0Usage / Enterpriseobs + guardrails
BraintrustEval (+obs)ingests OTLPenterprisenoUsage / Enterpriseobs
PromptfooEval / red-teamno (focus)yes (local-first)OSSFree OSS / Enterprisered-team
DeepEval / Confident AIEval (+obs)yesOSS lib onlyDeepEval OSSUsage (cloud)obs (cloud)
Prediction GuardGuardrails (+inference)yes (events)hostednoUsage / Enterpriseeval-style checks
Guardrails AIGuardrailsyes (telemetry)yesApache-2.0Free OSS / Provalidation
NeMo GuardrailsGuardrailsnoyesApache-2.0Free OSSrails only

Highlighted rows are tools that span multiple layers cleanly, the leakage the thesis describes. Pricing models and OTel status reflect each vendor's public documentation as of June 2026 and change often; verify on the vendor's own page before a purchase decision.

This comparison is published as an open dataset (CC-BY) with a permanent DOI: DOI 10.5281/zenodo.20738671. Browse the full dataset landing page or download the machine-readable JSON.

Layer 1: the gateway, and the question of who holds the key

The cleanest way to split this layer is the one most articles skip: who holds the provider API key. A true gateway sits in your request path and you keep your own keys, getting fallback, load-balancing, caching, and budget control. An aggregator resells model access; you buy their credits, they hold the keys, and you get one bill across many models. They solve different problems, and conflating them is how teams pick the wrong one.

Portkey is the strongest open-source true gateway: an Apache-2.0 core that routes across a very large model catalog, self-hostable, and OpenTelemetry-native, with guardrails and observability layered on top, which is why it shows up as a multi-layer tool in the matrix. Portkey open-sourced its gateway after operating it at large production scale (see Portkey's own statements for current figures). LiteLLM is the OSS SDK-and-proxy that normalizes 100-plus providers into the OpenAI format; it is the pragmatic default when you want full control and will carry the self-host operations yourself. Cloudflare AI Gateway and Kong AI Gateway are the right picks if you already live on those platforms (edge caching for Cloudflare, plugin governance for Kong) and a tax otherwise. On the aggregator side, OpenRouter resells 300-plus models on a flat percentage of credit purchases, and AIMLAPI has quietly assembled one of the broadest catalogs in the category: by its own account, 400-plus models across every modality, text, image, video, and audio, behind one OpenAI-compatible endpoint and one bill. If your problem is "I want to try many models, including non-text, without managing many keys," an aggregator is the honest answer; if your problem is "I need a production control plane I govern," a true gateway is.

Layer 2: observability, where the OTel line is sharpest

This is the layer most teams should buy first, because you cannot fix what you cannot see, and it is where OpenTelemetry support most changes your future options. Langfuse is the open-source default: MIT-licensed, fully self-hostable for free, spanning tracing, evaluation, and prompt management, and OpenTelemetry-native since v3 (with an OTLP backend), so it interoperates instead of locking you in. The honest caveat is that its unit-based billing (a trace, an observation, and a score each count) can surprise high-emission teams, and the v3 self-host is heavier to run (Clickhouse, Redis, object storage). Arize Phoenix is the other strong open-source pick, built on OpenInference over OTel, excellent as a developer and eval tool, though production RBAC pushes you toward paid Arize. Traceloop / OpenLLMetry is the reference OTel implementation but is a library, not a turnkey platform: you bring your own backend and UI. LangSmith is polished and tightly integrated with LangChain, but it is closed, its OTel export arrived late and partial, and per-trace-plus-seat pricing can spike. Datadog LLM Observability is the natural choice only if you already run Datadog, with the usual cost and lock-in trade-off. Helicone takes an observability-first, gateway-style proxy approach that is fast to adopt but introduces a hop in your path.

Layer 3: evaluation, the layer that decides "is it good"

Evaluation is where the category is moving fastest, and where the leakage is most visible: most eval tools also do observability, because you cannot score production quality without the traces. Maxim is the most full-stack of the cohort: experimentation, agent simulation across thousands of scenarios, evaluation, observability, and even its own open-source gateway, Bifrost (Apache-2.0), which Maxim's published benchmarks put at roughly 20 microseconds of added latency at 5,000 requests per second. The simulation-across-scenarios approach to agent eval is a genuine differentiator few eval tools match; the trade-off is the usual breadth-versus-depth question when one tool competes on many fronts. Braintrust is the polished closed-source eval-plus-observability platform with strong CI ergonomics. Promptfoo is the open-source, local-first, pure-eval and red-team CLI: the right pick for pre-deploy testing in CI, with no live production monitoring of its own. DeepEval / Confident AI pairs an OSS eval framework with a commercial cloud for the production layer. The decision here is offline-versus-online: if you mainly need pre-deploy testing, Promptfoo; if you need a unified online-plus-offline platform, Maxim or Braintrust; if you already run Langfuse or Phoenix, start with their built-in evals before adding a fourth tool.

Layer 4: guardrails, enforced in real time

Guardrails are the runtime enforcement layer: input and output checks for prompt injection, PII, toxicity, and factuality, applied on the live request rather than after the fact. Guardrails AI is the open-source validation framework with a Hub of community validators and OTel telemetry; its strength is output-quality validation more than adversarial defense. NeMo Guardrails (NVIDIA) is the programmable option, using its Colang language to script rails, which is powerful but carries a learning curve. Lakera is the API-first commercial guard focused on prompt-injection and PII defense, with cost that scales with traffic. Prediction Guard takes a distinct path: rather than bolting a guard library onto someone else's model, it offers privacy-conserving hosted inference with guardrails native to the path (PII, injection, toxicity, factuality), running on Intel Gaudi 2 accelerators, an early pioneer of running paying-customer LLM workloads on that hardware. Its founder, Daniel Whitenack, hosts the long-running Practical AI podcast, which tells you the posture: opinionated, security-first, built for teams that cannot send data to a public API. First-party options exist too (AWS Bedrock Guardrails, Azure AI Content Safety, Vertex Model Armor), but they are cloud-locked and weaker on adversarial benchmarks.

Which layer, and which tool, for your situation

Build down the request path, but rarely all at once. The order that works for most teams: observability first (you need to see), then a gateway (control and fallback), then evaluation (a quality bar before you scale), then guardrails (enforcement as exposure grows). Match the move to your situation:

Solo builder / pre-revenue

One app, one or two providers, no budget for seats.

Self-host Langfuse (obs+eval) + LiteLLM (gateway). Both free, both OTel-friendly.

Series-A product team

Real traffic, multiple providers, a quality bar to defend.

Portkey (gateway+guardrails) + Langfuse or Phoenix (obs) + Promptfoo in CI.

Privacy / regulated

Data cannot leave your perimeter or a trusted host.

Prediction Guard (private inference + guardrails) + self-hosted Langfuse.

Agent / eval-heavy team

Multi-step agents you need to test across many scenarios.

Maxim (simulation + eval + obs) as the spine; add a gateway when you go multi-provider.

Already on a platform

Standardized on Datadog, Cloudflare, or Kong.

Use the first-party AI layer (Datadog LLM Obs, CF/Kong AI Gateway) before adding a new vendor.

Want one bill, many models

Experimenting across many models and modalities.

An aggregator (AIMLAPI for breadth across modalities; OpenRouter for flat-fee text) + Langfuse for traces.

The honest per-layer verdicts

Methodology and conflict disclosure

How this comparison was built
Sample
19 tools across four layers, selected for category relevance, not for who pays. Defunct and acquired products (for example HumanLoop, wound down after an August 2025 acqui-hire) are excluded from the live comparison.
Criteria
OpenTelemetry-native support, self-host availability, open-source licensing, pricing model, and whether evaluation and guardrails are built in. Each cell is read from the vendor's own documentation, repository, or pricing page.
OTel claims
Marked "native" only where confirmed from the tool's own docs; "partial" denotes tier-limited, proxy-based, ingest-only, or unconfirmed-native. Vendor performance claims (for example latency benchmarks) are attributed, not asserted as independent fact.
Conflicts
Rankings and the stack model were fixed before any monetization check. Nesyona has no paid placement in this piece, no sponsorship from any tool listed, and no affiliate relationship that altered the order. Where a vendor runs a public affiliate program, an outbound link may be tagged; it does not change a placement.
Last verified
June 2026. This layer reprices and ships fast; verify any pricing or OTel detail on the vendor's own page before deciding.
Reviewed by
The Nesyona editorial team, against public docs, GitHub, and OpenTelemetry support status.

If you build one of these tools and want to check that we have represented it fairly, or that a pricing or OTel detail has moved, we would genuinely welcome the correction. The goal of this page is to be the one map of the category that is not written by a vendor ranking itself first.

Adjacent reading on Nesyona: the best AI agent frameworks that sit above this stack, the best AI coding assistants for the teams building it, and the best local and self-hosted AI for the privacy-first end of the spectrum.

Frequently asked questions

What is the LLMOps stack?
The LLMOps stack is a four-layer reference model for production large-language-model tooling, ordered along the request path: (1) Gateway and Routing, a unified API with routing, fallback, caching, and budget control in front of the models; (2) Observability and Tracing, capturing every prompt, span, token, cost, and latency; (3) Evaluation and Testing, scoring quality offline in CI and online in production; and (4) Guardrails and Safety, enforcing input and output rules in real time. The layers are interconnected rather than siloed: the strongest tools span two or three of them, and OpenTelemetry is the interoperability standard that connects them without lock-in.
What is the best LLM observability tool in 2026?
For most teams the best open-source option is Langfuse: MIT-licensed, fully self-hostable for free, spanning tracing, evaluation, and prompt management, and OpenTelemetry-native since v3 so your instrumentation stays portable. Arize Phoenix is the other strong open-source pick (built on OpenInference over OTel). LangSmith is polished and tightly integrated with LangChain but closed, with OTel support that arrived late, and per-trace-plus-seat pricing that can spike. Datadog LLM Observability is the right choice mainly if you already run Datadog. The honest answer depends on whether you need to self-host (Langfuse, Phoenix), whether you are already on a platform (Datadog), and how OpenTelemetry-portable you need to stay.
Why does OpenTelemetry matter for LLMOps tools?
OpenTelemetry decides portability. An OTel-native tool lets you instrument your application once and point the data at any compatible backend, so if you outgrow the vendor you keep your instrumentation. A tool with a proprietary-only SDK writes your traces in its own dialect, so switching means re-instrumenting your whole stack. Be precise about the three tiers: built on OTel as the native model (Traceloop, Arize Phoenix, Datadog GenAI convention, Langfuse v3), ingests OTLP but ships a separate primary SDK, or late and partial (LangSmith). When a vendor says "OpenTelemetry support," ask which tier it means.
What is the difference between an LLM gateway and an aggregator?
The tell is who holds the provider API key. A true gateway (Portkey, LiteLLM, Cloudflare AI Gateway, Kong) sits in your request path and you keep your own keys, getting routing, fallback, caching, and budget control. An aggregator or marketplace (OpenRouter, AIMLAPI) resells model access: you buy their credits, they hold the keys, and you get one bill across many models. Pick a true gateway when you need a production control plane you govern; pick an aggregator when you want to try many models, including non-text modalities, without managing many keys.
Do I need all four layers of the stack?
Rarely all at once, and rarely in the order people assume. Most teams should buy observability first, because you cannot fix what you cannot see, then add a gateway for control and fallback, then evaluation to set a quality bar before scaling, then guardrails as exposure grows. A solo builder can run self-hosted Langfuse plus LiteLLM and nothing else. The key move is to pick the layer you need now, then check what the leading tool in that layer already covers in the neighboring layers, because the best tools deliberately span two or three.
Which LLMOps tools are open source and self-hostable?
The strongest fully open-source, self-hostable options are Langfuse (MIT, observability and eval), Arize Phoenix (observability and eval), LiteLLM (gateway), Promptfoo (evaluation and red-team, local-first), Portkey (Apache-2.0 gateway core), Guardrails AI and NeMo Guardrails (guardrails), and Traceloop / OpenLLMetry (an OTel observability library). Several others are open-core: the SDK or framework is OSS while the production or governance layer is a paid cloud (for example DeepEval with Confident AI, Maxim with its OSS Bifrost gateway, Kong with paid AI plugins). Closed or cloud-only options include LangSmith, Datadog LLM Observability, Braintrust, OpenRouter, AIMLAPI, and Prediction Guard.

The bottom line

LLMOps is not a leaderboard, it is a stack. Map your problem onto the four layers, buy the one you need first, and let OpenTelemetry connect the rest so you are never re-instrumenting to escape a vendor. Read the pricing model, not the headline number, because per-event and per-seat tools scale very differently at production volume. And remember the thesis: the layers leak, so the leading tool in one layer often saves you a purchase in the next. Start with observability (Langfuse is the open-source default), add a gateway you govern (Portkey or LiteLLM) or an aggregator for breadth (AIMLAPI), put a quality bar in CI (Promptfoo, or Maxim for agents), and enforce at the edge (Prediction Guard or Guardrails AI) as your exposure grows.

Sources

  1. OpenTelemetry, GenAI semantic conventions and OTLP specification, opentelemetry.io (accessed June 2026).
  2. Langfuse documentation and pricing, langfuse.com (v3 OpenTelemetry support, MIT license, self-host requirements).
  3. Portkey gateway repository and OpenTelemetry docs, portkey.ai and github.com/portkey-ai/gateway.
  4. Maxim and Bifrost gateway repository and published benchmarks, getmaxim.ai and github.com/maximhq/bifrost.
  5. AIMLAPI model catalog and documentation, aimlapi.com.
  6. Prediction Guard documentation and Intel developer write-up on privacy-conserving LLM inference on Intel Gaudi 2.
  7. Arize Phoenix (OpenInference) and Traceloop / OpenLLMetry repositories.
  8. Datadog LLM Observability and its GenAI OpenTelemetry semantic-convention blog.
  9. OpenRouter, LiteLLM, Kong AI Gateway, Cloudflare AI Gateway, Braintrust, Promptfoo, DeepEval / Confident AI, Guardrails AI, and NeMo Guardrails official documentation and pricing pages.
  10. TechCrunch, on Anthropic's August 2025 acqui-hire of the HumanLoop team.
Featured in this analysis? Grab a free badge for your site → nesyona.com/badges
Save
Dashboard

From our network

Best AI Tools for Amazon Sellers - bagengine.comBest AI Courses 2026 - edubracket.comBest Accounting Software for Online Sellers - ceocult.com