Your AI chatbot demo wowed the leadership team. Two weeks after launch, support tickets spike. Users say answers are wrong, slow, or oddly expensive. You open the server logs and see… HTTP 200 OK. Everything looks "fine."
That is the trap. Traditional monitoring tells you the server is up. It does not tell you whether the model gave a harmful answer, burned through your token budget, or took eight seconds because retrieval failed silently. That is why observability for AI apps is a different game — and one worth learning early.
What is observability?
Observability means you can understand what is happening inside a running system by looking at the data it produces — without redeploying code or guessing.
It rests on three pillars:
- Logs — detailed event records ("model returned 412 tokens in 2.3s").
- Metrics — numbers over time (average latency, error rate, cost per day).
- Traces — end-to-end timelines showing each step of a request.
Think of a hospital patient monitor. A single beep tells you the heart is beating (uptime). Observability is the full dashboard — heart rate, oxygen, blood pressure — that explains why the patient feels unwell.
Why do AI apps need it?
Regular web apps are mostly predictable: same input, same output. LLM (Large Language Model) apps are not. The same question can produce different answers. Costs scale with tokens — pieces of text the model processes — not just server CPU.
You need AI observability because:
- Wrong answers do not throw errors — they return HTTP 200 with confident nonsense.
- Token usage can spike silently and inflate your Azure bill.
- Slow retrieval or tool calls hide inside "the AI feels sluggish" complaints.
- Compliance teams ask what was sent to the model and whether PII leaked.
Without that visibility, you are flying with an altimeter but no fuel gauge: you know you are still in the air, not how long you can stay there.
How does it work?
Instrument your app at every step of the AI pipeline: user request in, retrieval, model call, tool execution, response out. Send that data to a monitoring service like Application Insights on Azure.
Structured logging means writing logs as consistent key-value fields (model name, token count, duration) instead of free-form sentences. That lets you query and chart them later.
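To make that concrete, here is a minimal sketch that emits one AI call as a single JSON line using only `System.Text.Json`. The `AiCallLog` record and its field names are assumptions made for illustration, not a schema that Application Insights requires.

```csharp
using System;
using System.Text.Json;

// Hypothetical record: the field names are illustrative, not a required schema.
public record AiCallLog(string CorrelationId, string Model, int InputTokens, int OutputTokens, long DurationMs);

public static class StructuredLog
{
    // One JSON object per event keeps every field individually queryable.
    public static string Format(AiCallLog entry) => JsonSerializer.Serialize(entry);
}

public static class Program
{
    public static void Main()
    {
        var entry = new AiCallLog("req-001", "gpt-4o-mini", 1200, 412, 2300);
        Console.WriteLine(StructuredLog.Format(entry));
    }
}
```

In an ASP.NET Core app you would usually get the same effect from `ILogger` message templates, where named placeholders such as `{InputTokens}` are captured as separate fields.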
Custom metrics track AI-specific numbers: tokens per request, retrieval hit rate, thumbs-down count. Distributed tracing links your API, database, search index, and Azure OpenAI call into one timeline.
User question arrives
↓
[Trace starts]
↓
Retrieve document chunks (log: chunk IDs, scores)
↓
Call Azure OpenAI (log: tokens, latency, model)
↓
Optional tool calls (log: tool name, success/fail)
↓
Return answer + record user feedback
↓
[Trace ends] → Dashboards & alerts
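The pipeline above can be sketched with the tracing and metrics types built into .NET (`System.Diagnostics` and `System.Diagnostics.Metrics`). This is an illustration under assumptions: the source name `Demo.AiPipeline`, the metric name, and the hard-coded values are invented, and the console listener stands in for a real exporter such as Application Insights.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

public static class AiTelemetry
{
    // Names are assumptions for this sketch; choose names that fit your own conventions.
    public static readonly ActivitySource Source = new("Demo.AiPipeline");
    public static readonly Meter AiMeter = new("Demo.AiPipeline");
    public static readonly Histogram<long> TokensPerRequest =
        AiMeter.CreateHistogram<long>("ai.tokens.per_request");

    // Simulates one request: a parent span with child spans for retrieval and the model call.
    public static void HandleRequest(int inputTokens, int outputTokens)
    {
        using var request = Source.StartActivity("ChatRequest");
        using (var retrieval = Source.StartActivity("Retrieval"))
        {
            retrieval?.SetTag("retrieval.chunk_count", 5);
        }
        using (var model = Source.StartActivity("ModelCall"))
        {
            model?.SetTag("ai.tokens.input", inputTokens);
            model?.SetTag("ai.tokens.output", outputTokens);
        }
        TokensPerRequest.Record(inputTokens + outputTokens);
    }
}

public static class Program
{
    public static void Main()
    {
        // Without a listener, StartActivity returns null. This one prints each finished span.
        using var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == "Demo.AiPipeline",
            Sample = (ref ActivityCreationOptions<ActivityContext> options) => ActivitySamplingResult.AllData,
            ActivityStopped = a => Console.WriteLine($"{a.OperationName} trace={a.TraceId} parent={a.ParentSpanId}")
        };
        ActivitySource.AddActivityListener(listener);

        AiTelemetry.HandleRequest(inputTokens: 1200, outputTokens: 412);
    }
}
```

All three spans share one trace ID, which is what lets a dashboard show retrieval and the model call as steps of the same request.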
Set alerts on what matters: p95 latency above 5 seconds, daily token spend at 2× normal, tool-call error rate above 1%. p95 latency is the value that 95% of requests finish within; it exposes the slow tail that a single average hides.
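A small worked example shows why p95 is the better alert input. The sketch below uses the nearest-rank method, one of several common percentile definitions, on made-up latencies; your monitoring service may interpolate differently.

```csharp
using System;
using System.Linq;

public static class LatencyStats
{
    // Nearest-rank percentile: the smallest sample with at least `percentile`% of samples at or below it.
    public static double Percentile(double[] samples, int percentile)
    {
        if (samples.Length == 0) throw new ArgumentException("No samples.", nameof(samples));
        var sorted = samples.OrderBy(x => x).ToArray();
        int rank = (int)Math.Ceiling(percentile * sorted.Length / 100.0);
        return sorted[Math.Max(rank, 1) - 1];
    }
}

public static class Program
{
    public static void Main()
    {
        // Hypothetical latencies in seconds: 18 fast requests and 2 slow ones.
        double[] latencies = Enumerable.Repeat(1.0, 18).Concat(new[] { 9.0, 9.0 }).ToArray();
        Console.WriteLine($"average = {latencies.Average()} s");
        Console.WriteLine($"p95 = {LatencyStats.Percentile(latencies, 95)} s");
    }
}
```

With these numbers the average is 1.8 seconds and p95 is 9 seconds, so a 5-second alert on the average would stay silent while one request in ten waits nine seconds.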
Real-world example
A retail company launches a product Q&A bot on its shopping app. Week one looks great. Week three, the Azure bill doubles.
With observability, the team discovers that 40% of queries pull ten retrieval chunks into a prompt template that has itself grown too large. Average input tokens have tripled. They trim the template, cap chunks at five, and add a daily token dashboard. Costs drop back, and they catch the next spike before finance does.
It is like checking your mobile data usage after streaming Netflix all month. The app "worked" — but the meter told the real story.
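The "cap chunks at five" fix from this story might look like the sketch below. The `Chunk` record and the scores are hypothetical; the idea is simply to keep the highest-scoring chunks and drop the rest before building the prompt.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of a retrieved chunk.
public record Chunk(string Id, double Score);

public static class ChunkSelector
{
    // Keep only the highest-scoring chunks so prompt size stays bounded.
    public static List<Chunk> TopChunks(IEnumerable<Chunk> chunks, int maxChunks) =>
        chunks.OrderByDescending(c => c.Score).Take(maxChunks).ToList();
}

public static class Program
{
    public static void Main()
    {
        // Ten made-up chunks with scores 0.50, 0.55, ... 0.95.
        var retrieved = Enumerable.Range(0, 10)
            .Select(i => new Chunk($"chunk-{i}", 0.50 + i * 0.05))
            .ToList();

        var kept = ChunkSelector.TopChunks(retrieved, maxChunks: 5);
        Console.WriteLine($"Sending {kept.Count} of {retrieved.Count} chunks: {string.Join(", ", kept.Select(c => c.Id))}");
    }
}
```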
Step-by-step: instrument your AI app
Step 1: Enable Application Insights on your ASP.NET Core or Azure Functions host.
Step 2: Wrap each AI request in a trace span with a unique correlation ID.
Step 3: After every model call, log model name, input tokens, output tokens, and duration.
Step 4: Log retrieval chunk IDs so you can audit bad answers.
Step 5: Add a user feedback button (thumbs up/down) tied to the same correlation ID.
Step 6: Build dashboards for latency, token spend, and feedback ratio. Set alerts on thresholds.
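Steps 2 and 3 look roughly like this in C#:

```csharp
// Assumes: ActivitySource is a static System.Diagnostics.ActivitySource field, chat is an
// OpenAI ChatClient, logger is an ILogger, and telemetry is an Application Insights TelemetryClient.
using var activity = ActivitySource.StartActivity("ChatCompletion");
activity?.SetTag("model", "gpt-4o-mini");

// Time only the model call so this metric excludes retrieval and tool time.
var sw = Stopwatch.StartNew();
var completion = await chat.CompleteChatAsync(messages);
sw.Stop();

var usage = completion.Value.Usage;

// Named placeholders become structured fields you can query later.
logger.LogInformation(
    "AI call completed in {ElapsedMs}ms. Input={InputTokens} Output={OutputTokens}",
    sw.ElapsedMilliseconds,
    usage.InputTokenCount,
    usage.OutputTokenCount);

// Custom metrics feed the dashboards and alerts from Step 6.
telemetry.TrackMetric("ai.tokens.input", usage.InputTokenCount);
telemetry.TrackMetric("ai.tokens.output", usage.OutputTokenCount);
telemetry.TrackMetric("ai.latency.ms", sw.ElapsedMilliseconds);
```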
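Steps 4 and 5 can be sketched in the same spirit. The types below are hypothetical, and the in-memory dictionary only stands in for wherever your telemetry is stored; the point is that a thumbs-down carrying the correlation ID leads straight back to the chunks behind the answer.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shapes for this sketch; adapt the fields to your own pipeline.
public record AnswerRecord(string CorrelationId, IReadOnlyList<string> ChunkIds, string Model);
public record Feedback(string CorrelationId, bool ThumbsUp);

public class AuditTrail
{
    private readonly Dictionary<string, AnswerRecord> _answers = new();

    // Step 4: remember which chunks fed each answer, keyed by a new correlation ID.
    public string RecordAnswer(IReadOnlyList<string> chunkIds, string model)
    {
        string correlationId = Guid.NewGuid().ToString("N");
        _answers[correlationId] = new AnswerRecord(correlationId, chunkIds, model);
        return correlationId;
    }

    // Step 5: resolve feedback back to the chunks behind the answer.
    public IReadOnlyList<string> ChunksBehind(Feedback feedback) =>
        _answers.TryGetValue(feedback.CorrelationId, out var answer)
            ? answer.ChunkIds
            : Array.Empty<string>();
}

public static class Program
{
    public static void Main()
    {
        var trail = new AuditTrail();
        string id = trail.RecordAnswer(new[] { "doc-12#3", "doc-40#1" }, "gpt-4o-mini");
        var suspects = trail.ChunksBehind(new Feedback(id, ThumbsUp: false));
        Console.WriteLine($"Thumbs-down on {id}: review chunks {string.Join(", ", suspects)}");
    }
}
```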
Common misconceptions
"No errors means the AI is healthy." Wrong answers are not exceptions. Track quality signals, not just crashes.
"We will add monitoring later." Later usually means after an expensive incident. Instrument from day one of the pilot.
"Log everything for debugging." Logging full prompts with customer PII creates compliance risk. Log what you need, redact the rest.
Signals worth tracking, with an example alert for each:

| Signal | What it tells you | Example alert |
|---|---|---|
| Token count per request | Cost and prompt bloat | Daily spend > budget threshold |
| p95 latency | Real user wait time | Latency > 5 seconds for 15 minutes |
| Retrieval score | Whether search found good chunks | Avg score drops below 0.7 |
| Thumbs-down rate | Answer quality trend | Negative feedback > 10% daily |
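For the "log everything" misconception, a redaction step before logging might look like the sketch below. The two patterns are deliberately simple assumptions (email addresses and long digit runs); real PII detection needs broader coverage and usually a dedicated service.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LogRedactor
{
    // Simple illustrative patterns; they will miss many kinds of PII.
    private static readonly Regex Email = new(@"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}");
    private static readonly Regex LongDigits = new(@"\b\d{9,}\b");

    // Replace matches with placeholders so the log keeps its shape without the sensitive values.
    public static string Redact(string text)
    {
        string withoutEmails = Email.Replace(text, "[EMAIL]");
        return LongDigits.Replace(withoutEmails, "[NUMBER]");
    }
}

public static class Program
{
    public static void Main()
    {
        string prompt = "Order 123456789012 for jane.doe@example.com is late.";
        Console.WriteLine(LogRedactor.Redact(prompt));
        // prints "Order [NUMBER] for [EMAIL] is late."
    }
}
```

Redacting before the log call, rather than filtering afterwards, means the sensitive values never reach the telemetry store in the first place.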
Quick recap
- Observability = logs + metrics + traces that explain system behavior.
- AI apps need token tracking, quality signals, and retrieval visibility — not just uptime.
- Application Insights on Azure is a practical starting point for .NET teams.
- Correlate every answer with IDs so you can debug and audit later.
Summary
Shipping an AI feature without observability is like opening a restaurant without reading customer reviews or checking the food cost sheet. The kitchen might still be running, but you cannot tell whether diners are happy or whether each plate loses money.
Log the AI-specific details — tokens, latency, retrieval, feedback — from the first pilot user. Dashboards turn mystery complaints into fixable patterns. That is how you keep answers trustworthy and bills predictable.
Key Takeaways
- HTTP 200 does not mean the AI answer was good — monitor quality and cost too.
- Log tokens, latency, retrieval IDs, and tool outcomes on every request.
- Application Insights gives Azure teams a ready-made place to start.
- Redact sensitive data in logs and alert on budgets before bills surprise you.