
Monitoring

LangGuard's monitoring views show the health and performance of your AI operations.

Dashboard Overview

The main dashboard displays key metrics and recent activity at a glance.

Key Metrics

The dashboard header shows aggregate metrics:

Metric             Description
Total Traces       Number of traces ingested
Success Rate       Percentage of successful traces
Avg Latency        Mean trace duration
Active Agents      Agents with recent activity
Policy Violations  Count of triggered policies

Time Range Selection

Use the time range selector to adjust the analysis period:

  • Last Hour
  • Last 24 Hours (default)
  • Last 7 Days
  • Last 30 Days
  • Custom Range

All metrics update automatically when you change the time range.
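
If you reproduce the same windows in your own scripts, each preset reduces to a start and end timestamp. The following is a minimal sketch, assuming presets are measured back from the current time in UTC; the names `PRESETS` and `resolve_range` are hypothetical and not part of LangGuard.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical preset keys mirroring the selector labels above.
PRESETS = {
    "last_hour": timedelta(hours=1),
    "last_24_hours": timedelta(hours=24),  # the dashboard default
    "last_7_days": timedelta(days=7),
    "last_30_days": timedelta(days=30),
}

def resolve_range(preset="last_24_hours", now=None):
    """Return (start, end) datetimes for a preset, ending at `now`."""
    end = now or datetime.now(timezone.utc)
    return end - PRESETS[preset], end
```

A custom range would skip the lookup and pass explicit start and end values instead.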

Metrics Deep Dive

Success Rate

The success rate shows the percentage of traces that completed without errors:

Success Rate: 94.2%
───────────────────────────────
███████████████████████░ 94.2%

Interpretation:

  • ≥ 95% - Excellent (green)
  • 85% to under 95% - Acceptable (yellow)
  • < 85% - Needs attention (red)
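
The calculation and colour bands can be sketched in a few lines. This assumes each trace is a dictionary with a `status` field whose value is `"success"` for error-free traces; that field name is an assumption for illustration, not a documented LangGuard schema.

```python
def success_rate(traces):
    """Percentage of traces whose (assumed) `status` field is "success"."""
    if not traces:
        return None  # no traces in the window: the rate is undefined
    ok = sum(1 for t in traces if t.get("status") == "success")
    return 100.0 * ok / len(traces)

def rate_band(rate_pct):
    """Map a success-rate percentage to the colour bands listed above."""
    if rate_pct >= 95:
        return "green"
    if rate_pct >= 85:
        return "yellow"
    return "red"
```

With 942 successful traces out of 1,000, `success_rate` returns 94.2 and `rate_band` places it in the yellow band.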

Latency

Track response times across your agents:

  • P50 (Median) - Half of requests complete within this time
  • P95 - 95% of requests complete within this time
  • P99 - 99% of requests complete within this time
  • Average - Mean duration

Latency Distribution (P95)
────────────────────────────────
< 100ms    ████████████████ 40%
100-500ms  ██████████████ 35%
500ms-1s   ██████ 15%
> 1s       ████ 10%
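
To make the percentile definitions concrete, here is a minimal sketch using the nearest-rank method. The document does not say which percentile method LangGuard uses, so treat this as one common definition rather than the product's exact calculation; the function names are illustrative.

```python
import math
from statistics import mean

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def latency_summary(durations_ms):
    """Summarize trace durations (milliseconds) with the four figures listed above."""
    return {
        "p50": percentile(durations_ms, 50),
        "p95": percentile(durations_ms, 95),
        "p99": percentile(durations_ms, 99),
        "avg": mean(durations_ms),
    }
```

Because the average is pulled up by a few slow requests, comparing it with P50 is a quick way to spot a long tail.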

Token Usage

Monitor token consumption for cost management:

  • Input Tokens - Tokens in requests
  • Output Tokens - Tokens in responses
  • Total Tokens - Combined usage
  • Cost Estimate - Based on model pricing
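
A cost estimate of this kind is typically token counts multiplied by per-token prices. The sketch below assumes separate input and output prices per 1,000 tokens; the model name and prices are placeholders, not real pricing and not LangGuard's pricing table.

```python
# Hypothetical per-1K-token prices in USD; substitute your provider's real rates.
PRICES_PER_1K = {"example-model": {"input": 0.50, "output": 1.50}}

def total_tokens(input_tokens, output_tokens):
    """Combined usage: input plus output tokens."""
    return input_tokens + output_tokens

def estimate_cost(model, input_tokens, output_tokens, prices=PRICES_PER_1K):
    """Estimated cost in USD for one request or an aggregate of requests."""
    rate = prices[model]
    return (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1000
```

For 2,000 input tokens and 1,000 output tokens at the placeholder rates, the estimate is 2.50 USD.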

Request Volume

See request patterns over time:

Requests per Hour (Last 24h)
100 |                    ╭────╮
 75 |              ╭─────╯    ╰───╮
 50 |         ╭────╯              ╰───╮
 25 | ────────╯                       ╰────
  0 └───────────────────────────────────────
     12am      6am      12pm      6pm    12am
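
An hourly request chart like this one comes from bucketing trace timestamps by hour. A minimal sketch, assuming you already have a list of `datetime` objects for your traces (how you obtain them is outside this example):

```python
from collections import Counter
from datetime import datetime

def requests_per_hour(timestamps):
    """Count traces per hour by truncating each timestamp to the start of its hour."""
    buckets = Counter(ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps)
    return dict(sorted(buckets.items()))

# Illustrative timestamps only.
sample = [datetime(2024, 1, 1, 9, 5), datetime(2024, 1, 1, 9, 50), datetime(2024, 1, 1, 10, 1)]
hourly = requests_per_hour(sample)
```

Hours with no traces are absent from the result, so fill them with zero before plotting if you need a continuous axis.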

Agent Health

Agent Status Cards

Each discovered agent has a health card showing:

┌──────────────────────────────────────┐
│ CustomerService Agent          [●●●] │
├──────────────────────────────────────┤
│ Status:       ● Warning              │
│ Success Rate: 94.2%  ▲ +2.3%         │
│ Avg Latency:  1.2s   ▼ -0.1s         │
│ Traces (24h): 1,234                  │
│ Last Active:  5 minutes ago          │
└──────────────────────────────────────┘

Health Status Indicators

Status       Meaning
🟢 Healthy   Success rate ≥ 95%, no recent errors
🟡 Warning   Success rate 85% to under 95%, or elevated latency
🔴 Critical  Success rate < 85% or many errors
⚪ Inactive  No activity in selected time range
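
The status rules can be read as a small decision function. This sketch fills in details the table leaves open, and each is an assumption: "many errors" is taken as 10 or more, "elevated latency" is passed in as a flag, and an agent with a high success rate but a few recent errors is treated as Warning.

```python
def agent_status(trace_count, success_rate_pct, error_count=0,
                 latency_elevated=False, many_errors=10):
    """Classify an agent using the status table; thresholds beyond 85/95 are assumptions."""
    if trace_count == 0:
        return "inactive"  # no activity in the selected time range
    if success_rate_pct < 85 or error_count >= many_errors:
        return "critical"
    if success_rate_pct < 95 or latency_elevated or error_count > 0:
        return "warning"
    return "healthy"
```

Checking inactivity first avoids classifying an idle agent by a success rate computed from zero traces.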

Integration Health

Connection Status

Monitor the health of your data source connections:

┌─────────────────────────────────────────┐
│ Integrations                            │
├─────────────────────────────────────────┤
│ Langfuse (Production)       ● Connected │
│   Last Sync: 2 minutes ago              │
│   Traces Synced: 12,456                 │
├─────────────────────────────────────────┤
│ Databricks MLflow           ● Connected │
│   Last Sync: 5 minutes ago              │
│   Traces Synced: 3,421                  │
└─────────────────────────────────────────┘

Sync History

View recent sync operations:

Time        Integration  Status     Items
2 min ago   Langfuse     ✓ Success  45 traces
7 min ago   Langfuse     ✓ Success  52 traces
12 min ago  Databricks   ✓ Success  23 traces
17 min ago  Langfuse     ⚠ Warning  0 traces (rate limited)

Policy Violations

Violation Summary

The dashboard shows recent policy violations:

Policy Violations (Last 24h)
────────────────────────────────
Critical:  3 ████
High:      8 ██████████
Medium:   15 ████████████████████
Low:       7 █████████
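
A summary like this is a count of violations grouped by severity. A minimal sketch, assuming each violation is a dictionary with a `severity` field holding one of the four labels shown; the record shape is an assumption for illustration.

```python
from collections import Counter

SEVERITIES = ("Critical", "High", "Medium", "Low")

def violations_by_severity(violations):
    """Count violations per severity, listing every severity even when its count is zero."""
    counts = Counter(v["severity"] for v in violations)
    return {severity: counts.get(severity, 0) for severity in SEVERITIES}
```

Returning all four severities in a fixed order keeps the summary stable when a level has no violations in the selected period.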

Recent Violations

Quick list of recent policy triggers:

Time        Policy         Agent      Severity
5 min ago   PII Detection  EmailBot   Critical
12 min ago  Token Limits   ChatBot    Medium
1 hour ago  SQL Injection  DataAgent  High

Click any violation to view details in the Trace Explorer.

Real-time Updates

Auto-Refresh

The dashboard automatically refreshes based on your sync interval:

  • Metrics update after each sync
  • Agent status refreshes in real-time
  • New violations appear immediately

Manual Refresh

Click the refresh button (↻) to force an update.

Exporting Data

Export Formats

Export monitoring data for external analysis:

  • JSON - Complete data with metadata
  • CSV - Spreadsheet-compatible format
  • PDF - Formatted report (coming soon)

API Access

Access metrics programmatically:

# Get dashboard metrics
curl -X GET "https://api.langguard.ai/v1/metrics" \
  -H "Authorization: Bearer $API_KEY"

Response:

{
  "timeRange": "24h",
  "metrics": {
    "totalTraces": 12456,
    "successRate": 0.942,
    "avgLatency": 1.23,
    "activeAgents": 8,
    "policyViolations": 33
  }
}
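
The same request can be made from a script. The sketch below uses only the Python standard library, with the endpoint and header taken from the curl example and the field names from the sample response. The helper names are hypothetical, and reading `avgLatency` as seconds is an assumption based on the dashboard showing latency in seconds.

```python
import json
import urllib.request

METRICS_URL = "https://api.langguard.ai/v1/metrics"  # endpoint from the curl example

def fetch_metrics(api_key, url=METRICS_URL, timeout=10):
    """GET the metrics endpoint with a bearer token and return the decoded JSON."""
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)

def summarize(payload):
    """One-line summary of a response shaped like the sample above."""
    m = payload["metrics"]
    return (f"{m['totalTraces']} traces, {m['successRate']:.1%} success, "
            f"{m['avgLatency']}s avg latency over {payload['timeRange']}")

# Usage, with the key read from the environment:
#   import os
#   print(summarize(fetch_metrics(os.environ["API_KEY"])))
```

Note that `successRate` is returned as a fraction (0.942), so it is formatted as a percentage before display.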

Best Practices

1. Set Up Baselines

Establish baseline metrics for comparison:

  • Document normal success rates
  • Record typical latency ranges
  • Track average daily volumes

2. Watch for Trends

Look for patterns rather than absolute values:

  • Gradual latency increases
  • Declining success rates
  • Volume anomalies
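
One simple way to combine a recorded baseline with trend watching is to compare the mean of recent daily values against it and flag large relative changes. This is a minimal sketch; the metric names, the 10% tolerance, and the function names are illustrative assumptions.

```python
from statistics import mean

def drift(baseline, current):
    """Relative change of `current` versus `baseline` (0.10 means +10%)."""
    return (current - baseline) / baseline

def trend_alerts(baseline, recent, tolerance=0.10):
    """Return (metric, relative_change) pairs whose recent mean drifts beyond the tolerance."""
    alerts = []
    for metric, base in baseline.items():
        change = drift(base, mean(recent[metric]))
        if abs(change) > tolerance:
            alerts.append((metric, round(change, 3)))
    return alerts
```

For example, a baseline average latency of 1.0s against recent daily values of 1.1s, 1.2s, and 1.3s yields a +20% drift and is flagged, while small day-to-day noise is not.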

3. Regular Reviews

Schedule periodic reviews:

  • Daily: Quick health check
  • Weekly: Trend analysis
  • Monthly: Deep dive and optimization

4. Configure Alerts (Coming Soon)

Once alerting is available, set up alerts for critical conditions:

  • Success rate drops below threshold
  • Latency exceeds limit
  • Policy violations spike
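
Until built-in alerts ship, you could run a comparable check yourself against the `metrics` object returned by the API. This is a client-side sketch, not a LangGuard feature. The threshold values are placeholders, and treating a "spike" as a fixed violation count is a simplification.

```python
def check_thresholds(metrics, min_success_rate=0.85, max_avg_latency=2.0, max_violations=50):
    """Return a list of triggered conditions for one metrics snapshot."""
    problems = []
    if metrics["successRate"] < min_success_rate:
        problems.append("success rate below threshold")
    if metrics["avgLatency"] > max_avg_latency:
        problems.append("latency exceeds limit")
    if metrics["policyViolations"] > max_violations:
        problems.append("policy violations spike")
    return problems
```

An empty list means no condition fired. A scheduled job could call this after each fetch and forward any messages to your notification channel.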

Next Steps