🧭 1. Latency & Performance Metrics
🔥 2. Error & Reliability Metrics
⚙️ 3. Throughput & Capacity Metrics
🧩 4. Resource Utilization Metrics
📦 5. Dependency / External Service Metrics
🧠 6. Business & Custom Metrics
🔍 7. Tracing & Distributed Observability Metrics
📊 8. SLO/SLI/SLA Metrics
🕰️ 9. Change & Deployment Metrics

Here's a structured overview of key observability metrics, grouped by the aspect of the system they reveal:


🧭 1. Latency & Performance Metrics

Measure responsiveness and user experience.

| Metric | Meaning | Typical Use |
| --- | --- | --- |
| p50 / p90 / p95 / p99 latency | 50th, 90th, 95th, 99th percentile of response times | Understand normal and tail latency behavior |
| Average latency (mean) | Simple average of all response times | Quick overview, but hides tail issues |
| Max latency | Highest recorded latency | Detect severe outliers |
| Request rate (RPS/QPS) | Requests per second | Load characterization |
| Saturation | % of resource utilization (CPU, I/O, thread pool) | Detect approaching bottlenecks |
| Queue length / wait time | Pending requests in queue | Backpressure visibility |
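
To make the difference between the mean and the percentiles concrete, here is a minimal sketch that computes them with the nearest-rank method. The sample latencies and the `percentile` helper are illustrative assumptions; monitoring systems typically estimate percentiles from histograms rather than raw samples.

```python
import statistics

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank position is ceil(p/100 * n); the negations implement ceil with integers.
    rank = max(1, -(-p * len(ordered) // 100))
    return ordered[rank - 1]

# Hypothetical response times in milliseconds: mostly fast, with two slow outliers.
latencies_ms = [20] * 90 + [40] * 8 + [900, 1500]

summary = {
    "mean": statistics.mean(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
    "max": max(latencies_ms),
}
print(summary)
```

With these made-up samples the mean is 45.2 ms, while p50 is 20 ms and p99 is 900 ms: the mean describes neither the typical request nor the slow tail.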

🔥 2. Error & Reliability Metrics

Quantify correctness and stability.

| Metric | Meaning | Typical Use |
| --- | --- | --- |
| Error rate | % of failed requests | Detect reliability drops |
| Error budget consumption | Share of the failures allowed by the SLO that has already been used up | SLO tracking |
| Availability (%) | Successful requests / total requests | SLI for uptime |
| Retry rate | % of retried requests | Catch flaky dependencies |
| Timeouts & circuit breaker opens | Defensive mechanisms triggering | Detect degraded dependencies |
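
The ratio metrics above all derive from a few raw counters. The sketch below shows one way to compute them; the function name, the counter values, and the 99.9% target are assumptions chosen for illustration.

```python
def reliability_summary(total, failed, retried, slo_target):
    """Derive ratio metrics from raw counters collected over one time window."""
    if total <= 0:
        raise ValueError("total must be positive")
    # The error budget is the number of failures the SLO still tolerates.
    allowed_failures = (1 - slo_target) * total
    budget_consumed = failed / allowed_failures if allowed_failures else float("inf")
    return {
        "error_rate": failed / total,
        "availability": (total - failed) / total,
        "retry_rate": retried / total,
        "budget_consumed": budget_consumed,
    }

# Hypothetical counters for one window, measured against a 99.9% availability SLO.
stats = reliability_summary(total=1_000_000, failed=400, retried=2_500, slo_target=0.999)
print(stats)
```

Here 400 failures against roughly 1,000 allowed means about 40% of the error budget is spent, even though availability still reads 99.96%.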

⚙️ 3. Throughput & Capacity Metrics

Measure how much work the system can handle.

| Metric | Meaning | Typical Use |
| --- | --- | --- |
| Requests per second (RPS) | Volume of incoming traffic | Capacity planning |
| Processed messages/sec | For queues, Kafka topics, etc. | Measure system throughput |
| Concurrent connections / sessions | Active clients | Detect overloads |
| Backlog depth | Pending jobs or queue size | Detect lag in consumers |
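
Throughput is usually derived from two readings of an ever-increasing counter, and backlog depth becomes more actionable once it is compared with the produce and consume rates. A rough sketch, with hypothetical helper names and counter values:

```python
def rate_per_second(count_start, count_end, seconds):
    """Average rate between two readings of a monotonically increasing counter."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (count_end - count_start) / seconds

def seconds_to_drain(backlog, produce_rate, consume_rate):
    """Estimated time to clear the backlog, or None if consumers are not catching up."""
    net_drain = consume_rate - produce_rate
    if net_drain <= 0:
        return None
    return backlog / net_drain

# Hypothetical counter readings taken 60 seconds apart.
produced = rate_per_second(120_000, 150_000, 60)   # messages/sec written
consumed = rate_per_second(100_000, 133_000, 60)   # messages/sec processed
print(produced, consumed, seconds_to_drain(6_000, produced, consumed))
```

A `None` result is the case worth alerting on: the backlog is growing or flat, so consumers will never catch up at current rates.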

🧩 4. Resource Utilization Metrics

Track underlying system health.

| Resource | Common Metrics |
| --- | --- |
| CPU | Usage %, load average, throttling |
| Memory | Heap usage, GC pauses, OOM events |
| Disk / I/O | IOPS, read/write latency, disk full % |
| Network | In/out throughput, packet loss, retransmits |
| Thread pools | Active vs idle threads, queue size |

📦 5. Dependency / External Service Metrics

Track downstream impact and SLA compliance.

| Metric | Example |
| --- | --- |
| Dependency latency | DB, cache, external API call duration |
| Cache hit ratio | Redis/memcached effectiveness |
| DB query time | Slow queries, lock contention |
| Third-party API availability | Detect external failures early |
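
Dependency latency is typically captured by timing each outbound call at the call site. The following is a minimal sketch using only the standard library; the `timed` helper, the `orders-db` label, and the in-memory `call_durations` store are assumptions, and a real service would hand the measurement to its metrics client instead.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Durations in seconds, keyed by dependency name (names here are hypothetical).
call_durations = defaultdict(list)

@contextmanager
def timed(dependency):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        call_durations[dependency].append(time.perf_counter() - start)

def cache_hit_ratio(hits, misses):
    """Fraction of lookups served from the cache; 0.0 when there were none."""
    lookups = hits + misses
    return hits / lookups if lookups else 0.0

with timed("orders-db"):
    time.sleep(0.01)  # stand-in for a real database query

print(len(call_durations["orders-db"]), cache_hit_ratio(hits=940, misses=60))
```

Recording in the `finally` clause matters: failed and timed-out calls are often the slowest ones, and dropping them would make the dependency look healthier than it is.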

🧠 6. Business & Custom Metrics

Tie observability to product outcomes.

| Metric | Example |
| --- | --- |
| Signups / logins per minute | Track core flows |
| Checkout success rate | Detect conversion issues |
| Order fulfillment lag | E2E pipeline latency |
| User-facing error % | From frontend telemetry |

🔍 7. Tracing & Distributed Observability Metrics

For microservices and async systems.

| Metric | Meaning |
| --- | --- |
| Trace latency breakdown | Time spent in each span |
| Span error counts | Failures per service/component |
| Service dependency graph | Identify slow hops in a request path |
| Cross-service correlation | End-to-end flow latency (via trace IDs) |
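
As a rough illustration of how these numbers fall out of trace data, the sketch below groups spans by trace ID and summarizes them. The span record shape (plain dicts with `trace_id`, `service`, `start_ms`, `end_ms`, `error`) is an assumption for this example and not the format of any particular tracer. Note that each span's duration here includes time spent in its child spans.

```python
from collections import Counter

# Hypothetical spans from one trace; real tracers export richer records.
spans = [
    {"trace_id": "t1", "service": "gateway", "start_ms": 0, "end_ms": 180, "error": False},
    {"trace_id": "t1", "service": "orders", "start_ms": 10, "end_ms": 170, "error": False},
    {"trace_id": "t1", "service": "payments", "start_ms": 30, "end_ms": 150, "error": True},
]

def summarize_trace(spans, trace_id):
    """Return end-to-end latency, per-service span time, and per-service error counts."""
    selected = [s for s in spans if s["trace_id"] == trace_id]
    if not selected:
        raise KeyError(trace_id)
    time_by_service = Counter()
    errors_by_service = Counter()
    for span in selected:
        time_by_service[span["service"]] += span["end_ms"] - span["start_ms"]
        if span["error"]:
            errors_by_service[span["service"]] += 1
    # End-to-end latency spans from the earliest start to the latest end in the trace.
    end_to_end = max(s["end_ms"] for s in selected) - min(s["start_ms"] for s in selected)
    return end_to_end, dict(time_by_service), dict(errors_by_service)

end_to_end_ms, time_by_service, errors_by_service = summarize_trace(spans, "t1")
print(end_to_end_ms, time_by_service, errors_by_service)
```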

📊 8. SLO/SLI/SLA Metrics

Used for setting reliability targets.

| Category | Example SLI |
| --- | --- |
| Availability | ≥99.9% of requests succeed |
| Latency | 95% of requests <200 ms |
| Freshness | Data updated within 1 min |
| Durability | Zero data loss after failure |
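
An SLI is usually computed as the fraction of "good" events, and the SLO is the target that fraction must meet. A minimal sketch using the availability and latency targets from the table; the request log and the helper name are made up, and counting only successful responses as "fast" is one possible design choice.

```python
def sli_fraction(events, is_good):
    """Share of events that count as good under the given predicate."""
    if not events:
        raise ValueError("no events")
    return sum(1 for e in events if is_good(e)) / len(events)

# Hypothetical request log entries: (status_code, latency_ms).
requests = [(200, 120)] * 960 + [(200, 450)] * 35 + [(500, 80)] * 5

# Availability SLI: any response that is not a server error counts as good.
availability = sli_fraction(requests, lambda r: r[0] < 500)
# Latency SLI: a request is good only if it succeeded and finished under 200 ms.
fast_enough = sli_fraction(requests, lambda r: r[0] < 500 and r[1] < 200)

print("availability SLO met:", availability >= 0.999)
print("latency SLO met:", fast_enough >= 0.95)
```

In this sample the latency target is met (96% of requests are fast and successful) while the availability target is missed (99.5% against 99.9%), which shows why each SLI is tracked against its own target.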

🕰️ 9. Change & Deployment Metrics

To correlate incidents with changes.

| Metric | Example |
| --- | --- |
| Deploy frequency | How often code is released |
| Change failure rate | % of deployments causing incidents |
| Mean time to recover (MTTR) | Speed of recovery from incidents |
| Mean time between failures (MTBF) | Stability over time |
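
These metrics can be derived from deployment counts and incident timestamps. The sketch below assumes incidents are recorded as `(started, resolved)` pairs and treats MTBF as the mean uptime between the end of one incident and the start of the next; definitions vary between teams, so take this as one reasonable reading. All data is invented for illustration.

```python
from datetime import datetime, timedelta

def change_failure_rate(deploys, failed_deploys):
    """Fraction of deployments that led to an incident."""
    if deploys <= 0:
        raise ValueError("deploys must be positive")
    return failed_deploys / deploys

def mttr(incidents):
    """Mean time from incident start to resolution."""
    if not incidents:
        raise ValueError("no incidents")
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)

def mtbf(incidents):
    """Mean uptime between the end of one incident and the start of the next."""
    if len(incidents) < 2:
        raise ValueError("need at least two incidents")
    ordered = sorted(incidents)
    gaps = [nxt[0] - prev[1] for prev, nxt in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)

# Hypothetical incident log: (started, resolved).
incidents = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 30)),
    (datetime(2024, 1, 10, 14, 0), datetime(2024, 1, 10, 15, 30)),
    (datetime(2024, 1, 20, 9, 0), datetime(2024, 1, 20, 10, 0)),
]

print(change_failure_rate(deploys=40, failed_deploys=3), mttr(incidents), mtbf(incidents))
```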