Observability for AI Applications Guide

Your AI chatbot demo wowed the leadership team. Two weeks after launch, support tickets spike. Users say answers are wrong, slow, or oddly expensive. You open the server logs and see… HTTP 200 OK. Everything looks "fine."

That is the trap. Traditional monitoring tells you the server is up. It does not tell you whether the model gave a harmful answer, burned through your token budget, or took eight seconds because retrieval failed silently. That is why observability for AI apps is a different game — and one worth learning early.

What is observability?

Observability means you can understand what is happening inside a running system by looking at the data it produces — without redeploying code or guessing.

It rests on three pillars:

Logs — detailed event records ("model returned 412 tokens in 2.3s").
Metrics — numbers over time (average latency, error rate, cost per day).
Traces — end-to-end timelines showing each step of a request.

Think of a hospital patient monitor. A single beep tells you the heart is beating (uptime). Observability is the full dashboard — heart rate, oxygen, blood pressure — that explains why the patient feels unwell.

Why do AI apps need it?

Regular web apps are mostly predictable: same input, same output. LLM (Large Language Model) apps are not. The same question can produce different answers. Costs scale with tokens — pieces of text the model processes — not just server CPU.

You need AI observability because:

Wrong answers do not throw errors — they return HTTP 200 with confident nonsense.
Token usage can spike silently and inflate your Azure bill.
Slow retrieval or tool calls hide inside "the AI feels sluggish" complaints.
Compliance teams ask what was sent to the model and whether PII (personally identifiable information) leaked.

Without visibility, you are flying with an altimeter but no fuel gauge: you know the plane is up, not how long it can stay there.

How does it work?

Instrument your app at every step of the AI pipeline: user request in, retrieval, model call, tool execution, response out. Send that data to a monitoring service like Application Insights on Azure.

Structured logging means writing logs as consistent key-value fields (model name, token count, duration) instead of free-form sentences. That lets you query and chart them later.
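
As a rough illustration of the difference, the sketch below writes the same event once as a free-form sentence and once as key-value fields serialized to JSON. The field names (eventName, inputTokens, and so on) are made up for this example and are not a required schema. With ILogger, the named placeholders in a message template play the same role as these keys.

```csharp
using System;
using System.Text.Json;

// Free-form log line: readable, but hard to filter or chart.
string freeText = "model returned 412 tokens in 2.3s";

// Structured entry: the same event as consistent key-value fields.
// Field names are illustrative assumptions, not a standard.
var entry = new
{
    eventName = "ai.call.completed",
    model = "gpt-4o-mini",
    inputTokens = 1200,
    outputTokens = 412,
    durationMs = 2300
};

string json = JsonSerializer.Serialize(entry);

Console.WriteLine(freeText);
Console.WriteLine(json);
```

Once every AI call emits the same fields, a query such as "average durationMs per model" becomes a simple aggregation rather than a text-parsing exercise.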

Custom metrics track AI-specific numbers: tokens per request, retrieval hit rate, thumbs-down count. Distributed tracing links your API, database, search index, and Azure OpenAI call into one timeline.
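
One possible way to define such custom metrics in .NET is the built-in System.Diagnostics.Metrics API, sketched below. The meter name "MyApp.AI" and the instrument names are hypothetical. The MeterListener only stands in for whatever exporter your monitoring setup uses, so the example can run on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical meter and instrument names, chosen for this example only.
using var meter = new Meter("MyApp.AI");
Counter<long> tokenCounter = meter.CreateCounter<long>("ai.tokens", unit: "{token}");
Histogram<double> latencyMs = meter.CreateHistogram<double>("ai.latency", unit: "ms");

// The listener plays the role of an exporter: it receives every measurement.
long totalTokens = 0;
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "MyApp.AI")
    {
        l.EnableMeasurementEvents(instrument);
    }
};
listener.SetMeasurementEventCallback<long>(
    (instrument, value, tags, state) => totalTokens += value);
listener.Start();

// Record one AI request: token counts tagged by direction, plus latency.
tokenCounter.Add(1200, new KeyValuePair<string, object?>("direction", "input"));
tokenCounter.Add(412, new KeyValuePair<string, object?>("direction", "output"));
latencyMs.Record(2300);

Console.WriteLine($"Tokens recorded so far: {totalTokens}");
```

Tagging by direction keeps input and output tokens in one counter while still letting a dashboard split them, which matters because the two are usually priced differently.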

User question arrives
        ↓
   [Trace starts]
        ↓
Retrieve document chunks (log: chunk IDs, scores)
        ↓
Call Azure OpenAI (log: tokens, latency, model)
        ↓
Optional tool calls (log: tool name, success/fail)
        ↓
Return answer + record user feedback
        ↓
   [Trace ends] → Dashboards & alerts

Set alerts on what matters: p95 latency above 5 seconds, daily token spend 2× normal, error rate on tool calls above 1%. p95 latency is the time within which 95% of requests complete, which gives a better picture of slow experiences than a single average.
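
To see why p95 says more than an average, here is a small sketch that computes both from a made-up set of 20 response times, using the nearest-rank percentile method. The sample values are invented for illustration. In practice your monitoring backend would normally compute percentiles for you, so treat this as an explanation of the number rather than code to ship.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical response times in milliseconds for 20 requests.
double[] latenciesMs =
{
    800, 900, 950, 1000, 1100, 1200, 1250, 1300, 1400, 1500,
    1600, 1700, 1800, 1900, 2000, 2200, 2500, 3000, 6200, 9000
};

double average = latenciesMs.Average();
double p95 = Percentile(latenciesMs, 95);
bool alert = p95 > 5000; // threshold from the text: p95 above 5 seconds

Console.WriteLine($"average={average} ms, p95={p95} ms, alert={alert}");

// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
static double Percentile(IReadOnlyList<double> values, int p)
{
    if (values.Count == 0) throw new ArgumentException("No samples.", nameof(values));
    if (p < 1 || p > 100) throw new ArgumentOutOfRangeException(nameof(p));

    double[] sorted = values.OrderBy(v => v).ToArray();
    int rank = (int)Math.Ceiling(p * sorted.Length / 100.0);
    return sorted[rank - 1];
}
```

With these samples the average is 2165 ms, which looks healthy, while the p95 is 6200 ms and trips the 5-second alert. The two slow requests are exactly the ones users complain about.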

Note: Never log full prompts containing passwords, health data, or credit card numbers. Log hashes or redacted summaries instead, and keep retention policies aligned with privacy rules.
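
A minimal sketch of both options from the note, redaction and hashing, might look like the following. The two regular expressions are simplistic assumptions that only catch card-like digit runs and email addresses. Real PII detection needs a broader, tested rule set or a dedicated service.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;
using System.Text.RegularExpressions;

string prompt = "My card is 4111 1111 1111 1111, email jane@example.com. Where is my order?";

string redacted = Redact(prompt);
string promptHash = HashPrompt(prompt);

Console.WriteLine(redacted);
Console.WriteLine(promptHash);

// Replace card-like digit runs (13 to 16 digits, optional spaces or hyphens)
// and email addresses with placeholders. Illustrative patterns only.
static string Redact(string text)
{
    text = Regex.Replace(text, @"\b\d(?:[ -]?\d){12,15}\b", "[CARD]");
    text = Regex.Replace(text, @"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", "[EMAIL]");
    return text;
}

// SHA-256 of the full prompt: lets you spot repeated prompts
// without storing the text itself.
static string HashPrompt(string text)
{
    byte[] digest = SHA256.HashData(Encoding.UTF8.GetBytes(text));
    return Convert.ToHexString(digest);
}
```

For the sample prompt, the redacted line reads "My card is [CARD], email [EMAIL]. Where is my order?". Keep in mind that a plain hash of a short or guessable prompt can be reversed by brute force, so a keyed hash may be a better fit where that is a concern.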

Real-world example

A retail company launches a product Q&A bot on its shopping app. Week one looks great. Week three, the Azure bill doubles.

With observability, the team discovers that the prompt template has grown too large and that 40% of queries pull in ten retrieval chunks each. Together, these have tripled average input tokens. They trim the template, cap chunks at five, and add a daily token dashboard. Costs drop back, and they catch the next spike before finance does.

It is like checking your mobile data usage after streaming Netflix all month. The app "worked" — but the meter told the real story.
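
The arithmetic behind a bill like this can be sketched in a few lines. Every number below is a placeholder: the per-token prices, the request volume, and the average token counts are assumptions for illustration, so substitute your model's actual pricing and your own telemetry.

```csharp
using System;

// Placeholder prices in USD per 1 million tokens; check your model's real pricing.
const decimal InputPricePerMillion = 0.15m;
const decimal OutputPricePerMillion = 0.60m;

// Hypothetical traffic for the example.
const int RequestsPerDay = 50_000;

decimal costBefore = DailyCost(RequestsPerDay, avgInputTokens: 1_000, avgOutputTokens: 300);
decimal costAfter = DailyCost(RequestsPerDay, avgInputTokens: 3_000, avgOutputTokens: 300);

Console.WriteLine($"Daily cost before: {costBefore:F2} USD");
Console.WriteLine($"Daily cost after input tokens tripled: {costAfter:F2} USD");

// Cost of one request: tokens multiplied by the price per token.
decimal RequestCost(int inputTokens, int outputTokens) =>
    (inputTokens * InputPricePerMillion + outputTokens * OutputPricePerMillion) / 1_000_000m;

// Daily total: request volume multiplied by the average cost per request.
decimal DailyCost(int requests, int avgInputTokens, int avgOutputTokens) =>
    requests * RequestCost(avgInputTokens, avgOutputTokens);
```

Under these assumptions the daily cost goes from 16.50 to 31.50 USD. Tripling input tokens alone nearly doubles the bill even though the answers users see have not changed, which is why a daily token dashboard catches the problem before the invoice does.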

Step-by-step: instrument your AI app

Step 1: Enable Application Insights on your ASP.NET Core or Azure Functions host.

Step 2: Wrap each AI request in a trace span with a unique correlation ID.

Step 3: After every model call, log model name, input tokens, output tokens, and duration.

Step 4: Log retrieval chunk IDs so you can audit bad answers.

Step 5: Add a user feedback button (thumbs up/down) tied to the same correlation ID.

Step 6: Build dashboards for latency, token spend, and feedback ratio. Set alerts on thresholds.

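// Sketch of Steps 2 and 3.
// Requires: using System.Diagnostics; using Microsoft.Extensions.Logging;
// Assumes an ActivitySource field (the name "MyApp.AI" is a placeholder):
//   private static readonly ActivitySource Source = new("MyApp.AI");
// and injected chat (chat client), logger (ILogger), and telemetry (TelemetryClient).

// Step 2: wrap the AI request in a trace span.
using var activity = Source.StartActivity("ChatCompletion");
activity?.SetTag("model", "gpt-4o-mini");

// Step 3: time the model call.
var sw = Stopwatch.StartNew();
var completion = await chat.CompleteChatAsync(messages);
sw.Stop();

// Structured log: the named placeholders become queryable fields.
var usage = completion.Value.Usage;
logger.LogInformation(
    "AI call completed in {ElapsedMs}ms. Input={InputTokens} Output={OutputTokens}",
    sw.ElapsedMilliseconds,
    usage.InputTokenCount,
    usage.OutputTokenCount);

// Custom metrics for dashboards and alerts.
telemetry.TrackMetric("ai.tokens.input", usage.InputTokenCount);
telemetry.TrackMetric("ai.tokens.output", usage.OutputTokenCount);
telemetry.TrackMetric("ai.latency.ms", sw.ElapsedMilliseconds);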
Common misconceptions

"No errors means the AI is healthy." Wrong answers are not exceptions. Track quality signals, not just crashes.

"We will add monitoring later." Later usually means after an expensive incident. Instrument from day one of the pilot.

"Log everything for debugging." Logging full prompts with customer PII creates compliance risk. Log what you need, redact the rest.

Signal	What it tells you	Example alert
Token count per request	Cost and prompt bloat	Daily spend > budget threshold
p95 latency	Real user wait time	Latency > 5 seconds for 15 minutes
Retrieval score	Whether search found good chunks	Avg score drops below 0.7
Thumbs-down rate	Answer quality trend	Negative feedback > 10% daily

Quick recap

Observability = logs + metrics + traces that explain system behavior.
AI apps need token tracking, quality signals, and retrieval visibility — not just uptime.
Application Insights on Azure is a practical starting point for .NET teams.
Correlate every answer with IDs so you can debug and audit later.

Summary

Shipping an AI feature without observability is like opening a restaurant without reading customer reviews or checking the food cost sheet. The kitchen might still be running, but you cannot tell whether diners are happy or whether each plate loses money.

Log the AI-specific details — tokens, latency, retrieval, feedback — from the first pilot user. Dashboards turn mystery complaints into fixable patterns. That is how you keep answers trustworthy and bills predictable.

Frequently Asked Questions

What is observability?

Observability means you can understand what your running application is doing by looking at logs, metrics, and traces, without guessing or redeploying code.

Why is AI observability different from regular app monitoring?

LLM apps add non-deterministic outputs, token costs, prompt content, and quality concerns that traditional uptime monitoring does not cover.

What should I log for an LLM application?

Log latency, token counts, model name, retrieval chunk IDs, tool calls, errors, and user feedback signals. Avoid logging secrets or full PII in prompts.

What is Application Insights?

Application Insights is Azure's monitoring service that collects telemetry from your apps so you can see performance, failures, and custom metrics in dashboards.

How do I track AI costs?

Record input and output token counts per request and multiply by your model's price per token. Dashboard daily totals to catch spikes early.

What is p95 latency?

p95 latency means 95% of requests finish within that time. It shows typical slow experiences better than a single average number.

Can I monitor answer quality automatically?

Partially. Track thumbs up/down, escalation rates, and run periodic evals with a golden question set. Fully automatic quality scoring is still evolving.

Key Takeaways

HTTP 200 does not mean the AI answer was good — monitor quality and cost too.
Log tokens, latency, retrieval IDs, and tool outcomes on every request.
Application Insights gives Azure teams a ready-made place to start.
Redact sensitive data in logs and alert on budgets before bills surprise you.
