Observability

Target Audience: Software engineers with Java + AWS background


Table of Contents

  1. Foundations
  2. Java-Focused Observability
  3. AWS Observability
  4. Distributed Tracing & Context Propagation
  5. System Design Perspective
  6. Domain-Specific Scenarios
  7. Incident Response & Debugging
  8. Common Questions on Observability
  9. Architecture Diagrams (Textual)
  10. Production Checklists

Foundations

What Observability Really Means

Observability vs. Monitoring: The Critical Distinction

The two terms are often used interchangeably but serve fundamentally different purposes. Understanding this distinction is essential for Tech Lead credibility.

Monitoring answers the question: “What is wrong?” It uses predefined metrics, dashboards, and thresholds to detect known problems. You set up alerts for CPU > 80%, error rate > 1%, or response time > 500ms. When these thresholds trigger, you know something is broken. Monitoring works well for predictable failure modes.

Observability answers: “Why is it wrong?” It empowers engineers to ask arbitrary questions about system behavior without needing to anticipate those questions during instrumentation design. Rather than relying on predefined dashboards, observability enables exploratory analysis. For example, during an outage affecting 0.001% of requests from a specific geographic region using a particular mobile app version, observability lets you drill down into that slice of data without having pre-built a dashboard for every possible combination.

In practice: Monitoring is a subset of observability. Your monitoring alerts detect problems; observability tools provide the investigative power to diagnose them.

Why Observability Matters for Distributed Systems

In monolithic applications, a stack trace and a local debugger often suffice. In microservices:

  • A single user request traverses dozens of services across multiple regions
  • Failures can cascade silently (service A fails but returns cached data)
  • Performance degradation may be caused by dependencies you don’t directly control
  • Third-party APIs and edge functions introduce external variables
  • Timing issues only manifest under production load

40% of organizations report that distributed system complexity directly contributes to major outages. Without observability, you’re essentially flying blind—you know something crashed, but tracing the root cause across ten services and three cloud providers becomes a nightmare of manual log tailing and hypothesis-driven debugging.

The Three Pillars of Observability

Each pillar provides a different lens on system behavior. Together, they form a complete picture.

Pillar 1: Logs — Detailed Event Records

What they are: Timestamped records of discrete events, typically with structured context (request ID, user ID, operation, result, duration).

Strengths:

  • High cardinality: capture arbitrary data (user IDs, session details, query parameters)
  • Debuggability: exact error messages, stack traces, payload details
  • Compliance: audit trails with full context
  • Human-readable: engineers can quickly scan and search

Limitations:

  • Overwhelming at scale: millions of events per second become a wall of text
  • Cost: storing all logs can be expensive (though sampling/filtering helps)
  • Lack of structure: unstructured logs are nearly impossible to query
  • Incomplete context: without correlation IDs, linking logs across services is manual and slow

When to use: Root cause analysis, compliance auditing, understanding exactly what went wrong in a specific request.

Pillar 2: Metrics — Aggregated Quantitative Data

What they are: Time-series data points: counters (HTTP requests), gauges (CPU usage), histograms (latency percentiles).

Strengths:

  • Efficiency: minimal storage (1 data point per minute per metric)
  • Real-time visibility: dashboards show system health at a glance
  • Trend detection: identify patterns over hours/days
  • Cost-effective: scales to billions of events
  • Alerting: set thresholds and trigger automated responses

Limitations:

  • Low cardinality: hard to slice by (customer_id, region, app_version) without exploding cardinality
  • Loss of detail: histograms with percentiles lose individual request information
  • Blindness to outliers: aggregates smooth over anomalies; even a p99 line can hide a small set of pathological requests

When to use: Dashboards, alerts, capacity planning, understanding aggregate system behavior.

Pillar 3: Traces — Request Flow Across Services

What they are: Hierarchical records of a request’s journey, composed of spans (one span per operation). Each span captures latency, status, attributes.

Strengths:

  • End-to-end visibility: see exactly where latency is spent
  • Service dependency mapping: automatic architecture discovery
  • Causality: understand how services interact
  • Performance bottleneck isolation: find the slow hop in a chain

Limitations:

  • Cost: storing full traces for every request is expensive; sampling becomes mandatory
  • Cardinality explosion: tracing with high-cardinality baggage can overwhelm collectors
  • Sampling bias: if you sample 1% of traces, you might miss rare errors
  • Delay: tail-based sampling must buffer spans until a trace completes, which delays when traces become queryable and adds collector overhead

When to use: Debugging latency issues, understanding service dependencies, pinpointing bottlenecks.

How These Pillars Work Together

The most powerful observability practice is trace-first debugging: start with a trace to see the high-level request flow, identify the slow or failing span, then use that span’s ID to find all associated logs for deep investigation.

Example workflow:

  1. Alert fires: Error rate spike detected (monitoring)
  2. Metrics confirm: P99 latency up 10x in payment-service
  3. Trace shows: Request spends 8s in database query (slow span identified)
  4. Logs reveal: Query is SELECT * FROM orders (missing WHERE clause due to bad deployment)

Java-Focused Observability

Logging Best Practices in Java

Structured Logging (Not Just Text)

The Problem with Unstructured Logs:

ERROR: Payment processing failed for user 12345
java.lang.NullPointerException at PaymentService.charge(PaymentService.java:125)

This is human-readable but unsearchable. You can’t ask “show me all errors for payment-service grouped by error type” without parsing strings.

Structured Logging Solution:

{
  "timestamp": "2026-01-27T09:15:30Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "550e8400-e29b-41d4-a716-446655440000",
  "span_id": "7a085853-4b1e-4b8f-9c3e-5d2f1a8b9c7d",
  "correlation_id": "550e8400-e29b-41d4-a716-446655440000",
  "user_id": "12345",
  "event": "payment_charge_failed",
  "error_code": "PAYMENT_GATEWAY_TIMEOUT",
  "error_message": "NullPointerException at PaymentService.charge(PaymentService.java:125)",
  "operation": "charge",
  "amount_usd": 99.99,
  "payment_gateway": "stripe",
  "retry_count": 2,
  "duration_ms": 5000
}

This is queryable: service="payment-service" AND error_code="PAYMENT_GATEWAY_TIMEOUT" AND retry_count > 1.

Implementation with SLF4J

SLF4J is the de facto standard logging facade in Java (abstraction over implementations like Logback, Log4j2).

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

import java.math.BigDecimal;
import java.util.UUID;

// kv() comes from logstash-logback-encoder and emits key/value pairs as JSON fields
import static net.logstash.logback.argument.StructuredArguments.kv;

public class PaymentService {

    private static final Logger logger = LoggerFactory.getLogger(PaymentService.class);

    public PaymentResponse charge(String userId, BigDecimal amount) {
        String correlationId = UUID.randomUUID().toString();
        MDC.put("correlation_id", correlationId);
        MDC.put("user_id", userId);
        try {
            logger.info("payment_started",
                    kv("amount_usd", amount),
                    kv("gateway", "stripe"));

            PaymentResponse response = gateway.charge(amount);

            logger.info("payment_completed",
                    kv("response_code", response.getCode()),
                    kv("duration_ms", response.getDuration()));
            return response;
        } catch (PaymentGatewayException e) {
            // Structured fields plus the exception itself, so the stack trace is captured
            logger.error("payment_charge_failed",
                    kv("error_code", e.getCode()),
                    kv("error_message", e.getMessage()),
                    e);
            throw new PaymentFailedException(e);
        } finally {
            MDC.clear(); // Always clear to prevent leaking context across pooled threads
        }
    }
}

MDC (Mapped Diagnostic Context) is a thread-local map that automatically includes values in every log statement within that thread. When you set MDC.put("correlation_id", id), that value appears in all logs until you clear it.

Correlation IDs: Connecting the Dots

A correlation ID is a unique identifier that flows through every service involved in a request. It enables you to find all logs/traces for a single user request across the entire distributed system.

Generation & Propagation:

  • The API gateway generates an X-Correlation-ID, or inherits one from the incoming request header
  • The ID is returned in response headers (so clients can reference the request in support tickets)
  • It is passed to all downstream services via HTTP headers
  • For async hops (Kafka, SQS), it is included in message headers
  • Each service adds its own span_id but preserves the trace_id
// Filter/Interceptor at API boundary
@Component
public class CorrelationIdFilter implements Filter {

    private static final String CORRELATION_ID_HEADER = "X-Correlation-ID";

    @Override
    public void doFilter(ServletRequest request, ServletResponse response,
                         FilterChain chain) throws IOException, ServletException {
        HttpServletRequest httpRequest = (HttpServletRequest) request;
        HttpServletResponse httpResponse = (HttpServletResponse) response;

        String correlationId = httpRequest.getHeader(CORRELATION_ID_HEADER);
        if (correlationId == null) {
            correlationId = UUID.randomUUID().toString();
        }

        MDC.put("correlation_id", correlationId);
        httpResponse.setHeader(CORRELATION_ID_HEADER, correlationId);
        try {
            chain.doFilter(request, response);
        } finally {
            MDC.clear();
        }
    }
}

// When calling downstream service
RestTemplate restTemplate = new RestTemplate();
HttpHeaders headers = new HttpHeaders();
headers.set("X-Correlation-ID", MDC.get("correlation_id"));
HttpEntity<String> entity = new HttpEntity<>(headers);
restTemplate.exchange(url, HttpMethod.GET, entity, String.class);

Logging Configuration (Logback)

Logback is a solid default implementation (the successor to Log4j 1.x, far more capable than java.util.logging, and what Spring Boot ships with).

<!-- logback.xml -->
<configuration>
<!-- Property definitions for environment -->
<springProperty name="spring.application.name" source="spring.application.name"/>
<property name="LOG_PATTERN" value="%d{ISO8601} [%thread] %-5level %logger{36} [%X{correlation_id}] - %msg%n"/>
<!-- JSON structured logging -->
<appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
<encoder class="net.logstash.logback.encoder.LogstashEncoder">
<customFields>{"service":"${spring.application.name}"}</customFields>
<fieldNames>
<timestamp>timestamp</timestamp>
<version>[ignore]</version>
<level>level</level>
<loggerName>logger</loggerName>
<message>message</message>
<thread>thread</thread>
</fieldNames>
</encoder>
</appender>
<!-- Root logger -->
<root level="INFO">
<appender-ref ref="STDOUT"/>
</root>
<!-- Suppress noisy dependencies -->
<logger name="org.springframework" level="WARN"/>
<logger name="org.hibernate" level="WARN"/>
</configuration>

Anti-patterns to avoid:

  • Logging at DEBUG in production (costs without insight)
  • Including sensitive data (passwords, credit cards, PII)
  • Relying on logs alone for debugging (trace tells the full story)
  • Logging stack traces as strings (parse as structured exception data)
  • Missing context (correlation IDs, user IDs)

JVM-Level Observability

The JVM itself is a complex runtime with many moving parts. Understanding GC, heap, threads, and CPU is essential for debugging performance issues.

Memory Observability

Heap Metrics (from Micrometer):

jvm.memory.used{area="heap",id="G1 Survivor Space"} = 5MB
jvm.memory.committed{area="heap"} = 256MB
jvm.memory.max{area="heap"} = 4096MB
jvm.memory.usage{area="heap"} = 0.06 (6% of max)

What this tells you:

  • If used approaches max → memory pressure, likely to trigger GC
  • If committed < max → JVM has room to grow (not hitting ceiling yet)
  • If usage > 0.85 → risk of full GC soon
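
To turn the heap guidance above into code, here is a minimal sketch (class name illustrative) that reads the same numbers directly from the JVM via java.lang.management and flags the > 0.85 condition; Micrometer's JvmMemoryMetrics binder exposes this same data as the jvm.memory.* metrics shown above:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapPressureCheck {

    // Returns true when heap usage crosses the 85% threshold discussed above.
    public static boolean heapUnderPressure() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax();          // -1 when the max is undefined
        if (max <= 0) {
            return false;
        }
        double usage = (double) heap.getUsed() / max;
        return usage > 0.85;
    }
}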

Garbage Collection

GC Metrics (from Micrometer):

jvm.gc.pause{action="end of major GC",cause="G1 Evacuation Pause"} = histogram
- count: number of GC pauses
- sum: total pause time
- max: longest pause
jvm.gc.memory.allocated = counter (bytes allocated since start)
jvm.gc.memory.promoted = counter (bytes moved from young to old gen)

What this tells you:

  • GC pause time → latency spikes (user-visible delays)
  • Allocation rate → memory pressure (high allocation = frequent GC)
  • Promotion rate → objects living too long (tune young gen size)

Healthy patterns:

  • GC pauses < 100ms (minor), < 500ms (major)
  • Major GC every 10+ minutes (not constantly)
  • Allocation rate plateaus (not constantly increasing)

Thread Observability

Thread Metrics:

jvm.threads.live = 42
jvm.threads.peak = 100
jvm.threads.daemon = 40
jvm.threads.deadlocked = 0

What this tells you:

  • Thread creep: live threads growing over time = leak
  • Deadlocks: deadlocked > 0 = immediate action required
  • Daemon threads: should be mostly daemon (background)
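
The deadlock bullet is the one worth automating. A minimal sketch (metric name illustrative) that exposes the ThreadMXBean deadlock check as a Micrometer gauge so it can be alerted on:

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class DeadlockMetrics {

    // Exposes the number of currently deadlocked threads as a gauge;
    // any value > 0 should page someone.
    public static void register(MeterRegistry registry) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Gauge.builder("jvm.threads.deadlocked", threads, t -> {
                    long[] ids = t.findDeadlockedThreads(); // null when there is no deadlock
                    return ids == null ? 0 : ids.length;
                })
                .description("Number of threads currently in deadlock")
                .register(registry);
    }
}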

CPU & System

process.cpu.usage = 0.45 (45% of available CPU)
system.cpu.usage = 0.72 (72% of system CPU)
process.runtime.jvm.cpu.time = counter (nanoseconds)

What this tells you:

  • If process CPU stays high after traffic drops → memory leak or stuck thread
  • If system CPU >> process CPU → other processes consuming resources
  • CPU usage should scale with load; flat usage after traffic drop = problem

Implementing JVM Observability with Micrometer

Micrometer auto-instruments the JVM; you just enable it:

// Spring Boot (automatic)
@Configuration
public class ObservabilityConfig {
// Auto-configured if management.endpoints.web.exposure.include=prometheus
}
// Gradle dependency
implementation("io.micrometer:micrometer-registry-prometheus")
implementation("io.micrometer:micrometer-core")
// Actuator endpoint exposes metrics
// GET /actuator/prometheus → Prometheus format
// GET /actuator/metrics → JSON metadata
// GET /actuator/metrics/jvm.memory.used → specific metric

Dashboards to build:

  • GC pause time trend (alert if > 500ms)
  • Heap usage trend (alert if > 85%)
  • Thread count trend (alert on spike)
  • CPU usage (correlate with traffic)

Common Java Observability Libraries

Logback vs Log4j2

| Feature              | Logback | Log4j2         | Winner  |
|----------------------|---------|----------------|---------|
| Performance          | Good    | Better (async) | Log4j2  |
| Async support        | Limited | First-class    | Log4j2  |
| Garbage-free mode    | No      | Yes            | Log4j2  |
| Configuration reload | Yes     | Yes            | Tie     |
| Groovy config        | Yes     | No             | Logback |
| Lambda support       | No      | Yes (filters)  | Log4j2  |
| Spring Boot default  | Yes     | No             | Logback |

Recommendation: Use Logback for simplicity, Log4j2 for high-throughput/low-latency systems.

Micrometer (Metrics)

Micrometer is the abstraction layer for metrics (like SLF4J for logging). It auto-instruments JVM, HTTP, database, and custom metrics, then exports to Prometheus, CloudWatch, Datadog, New Relic, etc.

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

import java.math.BigDecimal;

public class PaymentService {

    private final MeterRegistry meterRegistry;

    public PaymentService(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    public void charge(BigDecimal amount) {
        Timer.Sample sample = Timer.start(meterRegistry);
        String status = "success";
        try {
            // charge logic
            meterRegistry.counter("payments.successful",
                    "currency", "USD").increment();
        } catch (Exception e) {
            status = "failed";
            meterRegistry.counter("payments.failed",
                    "error_type", e.getClass().getSimpleName()).increment();
            throw e;
        } finally {
            // Record latency tagged with the actual outcome, not a hard-coded "success"
            sample.stop(Timer.builder("payment.latency")
                    .tag("status", status)
                    .register(meterRegistry));
        }
    }
}

OpenTelemetry (Tracing & Metrics)

OpenTelemetry is the industry standard for observability instrumentation. It’s vendor-agnostic and auto-instruments common libraries.

// Auto-instrumentation agent (zero code changes)
//   java -javaagent:opentelemetry-javaagent.jar -jar app.jar

// Manual instrumentation
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class OrderService {

    private final Tracer tracer;

    public OrderService(Tracer tracer) {
        this.tracer = tracer;
    }

    public void processOrder(String orderId) {
        // Span is not AutoCloseable: start it, make it current, and end it explicitly
        Span span = tracer.spanBuilder("process_order")
                .setAttribute("order.id", orderId)
                .startSpan();
        try (Scope scope = span.makeCurrent()) {
            // Processing logic.
            // Nested spans are created automatically by instrumented libraries
            // while this span is current.
        } finally {
            span.end();
        }
    }
}

Anti-Patterns (What NOT to Do)

Over-logging: Logging every method entry/exit at INFO level drowns out signal. Use DEBUG for verbose tracing.

// BAD: Too noisy
logger.info("Entering charge method");
logger.info("Retrieved gateway");
logger.info("Calling gateway.charge()");
logger.info("Exiting charge method");
// GOOD: Log decisions and outcomes (kv() from logstash-logback-encoder emits structured fields)
logger.info("payment_initiated", kv("user_id", userId), kv("amount", amount));
logger.info("payment_completed", kv("duration_ms", duration));

Missing context: Logs without correlation IDs are isolated incidents, not connected stories.

Log-only debugging: Chasing logs instead of using traces. Traces show causality; logs show detail.

No sampling: Attempting to log/trace everything at high scale causes cost explosion and performance degradation. Sample early, sample often.


AWS Observability

CloudWatch (Logs, Metrics, Alarms)

AWS CloudWatch is the native observability service. While it has limitations compared to specialized tools (Datadog, New Relic), it has deep integration with AWS services.

CloudWatch Logs

Strengths:

  • Native to EC2, Lambda, ECS, RDS
  • Logs Insights for ad-hoc querying (SQL-like syntax)
  • Automatic parsing of JSON logs
  • Log retention policies and expiration

Limitations:

  • Cost: ~$0.50 per GB ingested (expensive at scale)
  • Limited querying speed (not real-time like Elasticsearch)
  • Cardinality explosion: too many high-cardinality fields = high cost
  • Weak full-text search (works best with structured fields)

CloudWatch Metrics

Built-in metrics (auto-collected):

  • Lambda: Invocations, Duration, Errors, ConcurrentExecutions
  • ECS: CPU, Memory, Network
  • RDS: DatabaseConnections, ReadLatency, WriteLatency
  • API Gateway: Count, Latency, 4xx/5xx Errors

Custom metrics (application-emitted):

PutMetricDataRequest request = new PutMetricDataRequest()
.withNamespace("MyApp/Payment")
.withMetricData(new MetricDatum()
.withMetricName("CheckoutLatency")
.withValue(latencyMs)
.withUnit(StandardUnit.Milliseconds)
.withDimensions(
new Dimension().withName("Environment").withValue("production"),
new Dimension().withName("Region").withValue("us-east-1")
));
cloudWatch.putMetricData(request);

Cost trap: High-cardinality dimensions (user_id, request_id) explode your bill. Use Embedded Metric Format (EMF) to log high-cardinality data instead.

CloudWatch Alarms

Alarms trigger on metric thresholds or log pattern matching.

// Create alarm in code (AWS SDK v1)
PutMetricAlarmRequest request = new PutMetricAlarmRequest()
        .withAlarmName("HighErrorRate")
        .withMetricName("Errors")
        .withNamespace("MyApp")
        .withStatistic(Statistic.Average)
        .withPeriod(60)                        // evaluation window: 60 seconds
        .withEvaluationPeriods(1)              // breach for one period before alarming
        .withThreshold(0.05)                   // 5% error rate
        .withComparisonOperator(ComparisonOperator.GreaterThanThreshold)
        .withTreatMissingData("notBreaching"); // don't alarm if no data
cloudWatch.putMetricAlarm(request);

X-Ray (Distributed Tracing)

X-Ray is AWS’s distributed tracing service. It’s simpler than Jaeger but has AWS-specific limitations.

How X-Ray Works

  1. Instrumentation: Java agent or SDK captures spans
  2. Sampling: by default, records the first request each second plus 5% of additional requests (configurable via sampling rules)
  3. Collection: Data sent to X-Ray service
  4. Service Map: Visual representation of dependencies
  5. Analysis: Drill into traces, see latency by service

X-Ray with Java

// Maven dependency
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-xray-recorder-sdk-core</artifactId>
</dependency>

// Auto-instrumentation via agent
//   java -javaagent:/path/to/xray-agent.jar -jar app.jar

// Manual instrumentation
import com.amazonaws.xray.AWSXRay;
import com.amazonaws.xray.entities.Subsegment;

public class PaymentService {
    public void charge(String userId, BigDecimal amount) {
        // The Lambda/servlet integrations create the segment; we add a subsegment for the gateway call
        Subsegment subsegment = AWSXRay.beginSubsegment("payment.gateway");
        try {
            subsegment.putAnnotation("user_id", userId); // indexed, queryable
            subsegment.putMetadata("amount", amount);    // stored with the trace, not indexed
            // Call payment gateway
            gateway.charge(amount);
        } catch (Exception e) {
            subsegment.addException(e);
            throw e;
        } finally {
            AWSXRay.endSubsegment();
        }
    }
}

X-Ray Strengths & Gaps

Strengths:

  • Tight integration with Lambda, API Gateway, ECS
  • Service map auto-discovery
  • Error rate and latency by service visible immediately
  • No extra infrastructure to manage

Gaps:

  • Sampling decision is head-based (can’t capture all errors)
  • Limited retention (30 days by default)
  • Querying less flexible than Jaeger
  • High cost per trace at scale (financial services often disable it)

Observability in ECS/Fargate

CloudWatch Container Insights

Enable Container Insights on your ECS cluster for automatic metric collection:

AWS console → ECS → Cluster → [cluster name] → Monitor → Enable Container Insights

Metrics collected:

  • Cluster CPU, memory utilization
  • Per-task CPU, memory
  • Container CPU, memory
  • Network I/O

Logging

All stdout/stderr from containers automatically goes to CloudWatch Logs if you configure the log driver:

{
"containerDefinitions": [
{
"name": "my-app",
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/my-app",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
}
}
]
}

Observability in Lambda

Native CloudWatch Integration

Lambda automatically logs to CloudWatch Logs (stdout/stderr). All invocations visible.
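
It is worth making those automatic logs structured from the start. A minimal sketch of a handler that writes one JSON line per invocation and reuses the Lambda request ID as the correlation ID (OrderEvent and its getter are placeholder types; the field names follow the schema used earlier in this guide):

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

public class OrderHandler implements RequestHandler<OrderEvent, String> {

    @Override
    public String handleRequest(OrderEvent event, Context context) {
        // Anything written to stdout lands in CloudWatch Logs for this invocation;
        // emitting JSON keeps it queryable in Logs Insights.
        System.out.println("{"
                + "\"level\":\"INFO\","
                + "\"event\":\"order_received\","
                + "\"correlation_id\":\"" + context.getAwsRequestId() + "\","
                + "\"order_id\":\"" + event.getOrderId() + "\""
                + "}");
        return "ok";
    }
}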

X-Ray

Enable X-Ray tracing in Lambda:

AWS console → Lambda → [function name] → Configuration → X-Ray → Active tracing

Or via code:

import com.amazonaws.xray.AWSXRay;
import com.amazonaws.xray.interceptors.TracingInterceptor;
import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

public class LambdaHandler implements RequestHandler<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent> {

    // AWS SDK v2 client instrumented for X-Ray via the aws-xray-recorder-sdk-aws-sdk-v2 interceptor
    private final S3Client s3 = S3Client.builder()
            .overrideConfiguration(ClientOverrideConfiguration.builder()
                    .addExecutionInterceptor(new TracingInterceptor())
                    .build())
            .build();

    @Override
    public APIGatewayProxyResponseEvent handleRequest(APIGatewayProxyRequestEvent event, Context context) {
        // Segment auto-created by Lambda active tracing; annotations are indexed and queryable
        AWSXRay.getCurrentSegment()
                .putAnnotation("request_id", event.getRequestContext().getRequestId());

        // Downstream call shows up as a subsegment because the client is instrumented
        s3.getObject(GetObjectRequest.builder().bucket("my-bucket").key("file").build());

        return new APIGatewayProxyResponseEvent()
                .withStatusCode(200)
                .withBody("{\"message\": \"success\"}");
    }
}

Embedded Metric Format (EMF)

EMF is a mechanism for emitting custom metrics without CloudWatch API calls: you log a specially formatted JSON object, and CloudWatch extracts the metrics from it asynchronously.

// Instead of this (a synchronous, priced API call per data point):
cloudWatch.putMetricData(new PutMetricDataRequest()
        .withMetricData(new MetricDatum().withMetricName("OrderProcessing").withValue(latency)));

// Do this (one log line; CloudWatch extracts the metric asynchronously):
System.out.println(
        "{\"_aws\":{\"Timestamp\":" + System.currentTimeMillis() + "," +
        "\"CloudWatchMetrics\":[{\"Namespace\":\"MyApp\"," +
        "\"Dimensions\":[[\"Service\"]]," +
        "\"Metrics\":[{\"Name\":\"OrderProcessing\",\"Unit\":\"Milliseconds\"}]}]}," +
        "\"Service\":\"checkout\"," +
        "\"OrderProcessing\":" + latency + "," +
        "\"order_id\":\"123\",\"user_id\":\"456\"}"
);

CloudWatch automatically parses the EMF wrapper and extracts the metric, but the high-cardinality fields (order_id, user_id) remain in logs, not metrics.
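
Hand-building that JSON is error-prone; a safer sketch is to assemble the EMF envelope with Jackson (AWS also ships an aws-embedded-metrics-java helper library, not shown here):

import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EmfEmitter {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void emitOrderProcessing(double latencyMs, String orderId, String userId) throws Exception {
        Map<String, Object> root = new LinkedHashMap<>();
        // EMF envelope: tells CloudWatch which top-level keys are metrics and dimensions
        root.put("_aws", Map.of(
                "Timestamp", System.currentTimeMillis(),
                "CloudWatchMetrics", List.of(Map.of(
                        "Namespace", "MyApp",
                        "Dimensions", List.of(List.of("Service")),
                        "Metrics", List.of(Map.of("Name", "OrderProcessing", "Unit", "Milliseconds"))))));
        root.put("Service", "checkout");         // dimension value
        root.put("OrderProcessing", latencyMs);  // metric value
        root.put("order_id", orderId);           // high-cardinality: stays in logs only
        root.put("user_id", userId);
        System.out.println(MAPPER.writeValueAsString(root));
    }
}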


Distributed Tracing & Context Propagation

How Traces Flow Across Services

A trace is a directed acyclic graph (DAG) of spans. Each span represents one operation.

User Request → API Gateway (span 1)
→ Payment Service (span 2) [child of span 1]
→ Card Validator (span 3) [child of span 2]
→ Payment Gateway (span 4) [child of span 2, parallel]
→ Order Service (span 5) [child of span 1]
→ Database Query (span 6) [child of span 5]
→ Notification Service (span 7) [child of span 1, async via SQS]

Each span has:

  • trace_id: shared across entire request flow (connects all spans)
  • span_id: unique to this span
  • parent_span_id: points to parent (creates hierarchy)
  • baggage: key-value pairs propagated downstream (tenant_id, user_id, etc.)

Trace IDs, Span IDs, Baggage

Trace ID (128-bit, often rendered as a UUID or 32 hex characters):

trace_id: 550e8400-e29b-41d4-a716-446655440000

Generated at entry point (API Gateway, message queue consumer) and propagated to all downstream services.

Span ID (64-bit, rendered as 16 hex characters):

span_id: 7a0858534b1e4b8f (this operation)
parent_span_id: 9c3e5d2f1a8b9c7d (parent operation)

Baggage (custom context):

baggage: {
"tenant_id": "acme-corp",
"user_id": "user-12345",
"request_source": "mobile-app",
"feature_flags": {"new_checkout": "enabled"}
}

Baggage is propagated to all downstream services but should be used sparingly (every field = extra header size).
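
In OpenTelemetry Java, baggage is written through the Baggage API and travels with the current Context; a minimal sketch (values illustrative; propagation over the wire assumes the W3C baggage propagator, which the Java agent enables by default):

import io.opentelemetry.api.baggage.Baggage;
import io.opentelemetry.context.Scope;

public class BaggageExample {

    public void handleRequest(Runnable downstreamCall) {
        Baggage baggage = Baggage.current().toBuilder()
                .put("tenant_id", "acme-corp")
                .put("request_source", "mobile-app")
                .build();

        // While this scope is open, instrumented HTTP/Kafka clients propagate
        // the baggage entries downstream alongside the trace context.
        try (Scope scope = baggage.makeCurrent()) {
            downstreamCall.run();
        }

        // A downstream service reads it back with:
        //   String tenant = Baggage.current().getEntryValue("tenant_id");
    }
}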

Asynchronous Flows (Kafka, SQS)

The challenge with queues: producer and consumer are decoupled in time. How do you maintain trace continuity?

Kafka Example

Producer (injects trace context into headers):

import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;
import io.opentelemetry.context.propagation.TextMapPropagator;

public class OrderProducer {

    private final Tracer tracer;
    private final TextMapPropagator propagator;
    private final KafkaProducer<String, String> producer;

    public void publishOrder(Order order) {
        Span span = tracer.spanBuilder("publish_order")
                .setAttribute("order.id", order.getId())
                .startSpan();
        try (Scope scope = span.makeCurrent()) {
            // Inject the current trace context into a carrier map...
            Map<String, String> carrier = new HashMap<>();
            propagator.inject(Context.current(), carrier,
                    (map, key, value) -> map.put(key, value));

            // ...and copy it into the Kafka record headers
            Headers kafkaHeaders = new RecordHeaders();
            carrier.forEach((k, v) -> kafkaHeaders.add(k, v.getBytes()));

            // Send with trace context attached
            producer.send(new ProducerRecord<>(
                    "orders",
                    null,               // let Kafka choose the partition
                    order.getId(),
                    JsonMapper.toJson(order),
                    kafkaHeaders));
        } finally {
            span.end();
        }
    }
}

Consumer (extracts and continues trace):

public class OrderConsumer {

    private final Tracer tracer;
    private final TextMapPropagator propagator;

    // TextMapGetter has two methods, so it cannot be a lambda
    private static final TextMapGetter<Headers> GETTER = new TextMapGetter<>() {
        @Override
        public Iterable<String> keys(Headers headers) {
            List<String> keys = new ArrayList<>();
            headers.forEach(header -> keys.add(header.key()));
            return keys;
        }

        @Override
        public String get(Headers headers, String key) {
            Header header = headers.lastHeader(key);
            return header == null ? null : new String(header.value());
        }
    };

    @KafkaListener(topics = "orders", groupId = "order-processor")
    public void consume(ConsumerRecord<String, String> record) {
        // Extract trace context from the message headers (continues the producer's trace)
        Context extractedContext = propagator.extract(Context.current(), record.headers(), GETTER);

        Span span = tracer.spanBuilder("process_order")
                .setParent(extractedContext)
                .setAttribute("messaging.system", "kafka")
                .setAttribute("messaging.destination", "orders")
                .startSpan();
        try (Scope scope = span.makeCurrent()) {
            Order order = JsonMapper.fromJson(record.value(), Order.class);
            processOrder(order);
        } finally {
            span.end();
        }
    }
}

SQS Example

SQS is similar, but inject/extract from message attributes:

// Producer: inject trace context into SQS message attributes
Map<String, String> carrier = new HashMap<>();
propagator.inject(Context.current(), carrier, Map::put);
Map<String, MessageAttributeValue> attributes = new HashMap<>();
carrier.forEach((k, v) -> attributes.put(k,
        new MessageAttributeValue().withDataType("String").withStringValue(v)));
sqs.sendMessage(new SendMessageRequest()
        .withQueueUrl(queueUrl)
        .withMessageBody(JsonMapper.toJson(order))
        .withMessageAttributes(attributes));

// Consumer: ask for the attributes, then extract the trace context and continue the trace
Message message = sqs.receiveMessage(new ReceiveMessageRequest(queueUrl)
        .withMessageAttributeNames("All"))
        .getMessages().get(0);
Context extractedContext = propagator.extract(Context.current(),
        message.getMessageAttributes(),
        /* TextMapGetter over the attribute map, same pattern as the Kafka consumer */);
Span span = tracer.spanBuilder("process_sqs_message")
        .setParent(extractedContext)
        .startSpan();
try (Scope scope = span.makeCurrent()) {
    Order order = JsonMapper.fromJson(message.getBody(), Order.class);
    processOrder(order);
} finally {
    span.end();
}

Sampling Strategies and Their Risks

Sampling reduces the volume of traces stored, keeping costs down. But sampling introduces blindspots.

Head-Based Sampling

Decision made at the start of the request (head of the trace).

// 10% sampling: keep roughly 1 in 10 traces (decision derived from the trace ID at the root)
SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
        .setSampler(Sampler.parentBased(Sampler.traceIdRatioBased(0.1)))
        .build();

// Spans created through this provider's tracers inherit the sampling decision
Span span = tracerProvider.get("my-service").spanBuilder("operation").startSpan();

Pros:

  • Simple to implement (no state needed)
  • Efficient (don’t collect data for dropped traces)
  • Predictable cost
  • Can be done at any point in pipeline

Cons:

  • Can’t sample based on outcome (can’t guarantee all errors are captured)
  • High-cardinality dimensions cause cost spikes anyway

Tail-Based Sampling

Decision made at the end of the trace (after all spans collected).

// Keep all traces with errors
// Keep all slow traces
// Keep 5% of successful fast traces
// Requires OpenTelemetry Collector

Pros:

  • Can ensure 100% of errors are sampled
  • Smart policies: expensive = always sample
  • Better observability for failures

Cons:

  • More complex (need collector)
  • Higher latency (must wait for full trace)
  • Higher compute cost in collector

Hybrid Approach (Recommended)

Combine both:

  1. Head sampling: Drop obviously unimportant traces (health checks) at 1% rate
  2. Tail sampling: Use collector to ensure all errors and slow traces (>1s latency) are kept
# opentelemetry-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  tail_sampling:
    decision_wait: 10s          # buffer spans this long before deciding per trace
    policies:
      - name: error-traces
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 1000
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

exporters:
  jaeger:
    endpoint: jaeger-collector:14250
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [jaeger]

System Design Perspective

Designing Systems For Observability

Observability isn’t an afterthought bolted onto a system. Great systems are designed with observability built in from day one.

Principle 1: Instrumenting the Critical Path

Don’t instrument everything. Focus on what users care about.

Example: E-commerce checkout

Critical path:

  1. User clicks “Checkout” (frontend)
  2. Cart validation (cart-service)
  3. Payment processing (payment-service)
  4. Order persistence (order-service)
  5. Notification (email-service via async)

Instrument these deeply. Instrument admin dashboards lightly.

Principle 2: Correlation IDs Everywhere

Every request, every message, every async job must have a correlation ID. This is non-negotiable.
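
One place correlation IDs silently disappear is thread pools: MDC is thread-local, so it does not follow work handed to an executor. A minimal Spring sketch (bean wiring omitted) that copies the MDC into async tasks:

import org.slf4j.MDC;
import org.springframework.core.task.TaskDecorator;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

import java.util.Map;

public class MdcPropagationConfig {

    // Copies the caller's MDC (correlation_id, user_id, ...) onto the worker thread
    // for the duration of each task, then clears it.
    static TaskDecorator mdcTaskDecorator() {
        return runnable -> {
            Map<String, String> context = MDC.getCopyOfContextMap();
            return () -> {
                if (context != null) {
                    MDC.setContextMap(context);
                }
                try {
                    runnable.run();
                } finally {
                    MDC.clear();
                }
            };
        };
    }

    static ThreadPoolTaskExecutor asyncExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setTaskDecorator(mdcTaskDecorator());
        executor.initialize();
        return executor;
    }
}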

Principle 3: Structured Data From the Start

Log as JSON, not free-form strings. Define a schema early and enforce it.

// Schema (team agreement)
{
"timestamp": "ISO8601",
"service": "string",
"trace_id": "UUID",
"span_id": "UUID",
"level": "FATAL|ERROR|WARN|INFO|DEBUG|TRACE",
"event": "snake_case_event_name",
"user_id": "string (optional)",
"duration_ms": "integer (optional)",
"error": {
"code": "string",
"message": "string",
"stack_trace": "string (optional)"
}
}

Principle 4: Observable State Representation

Design systems so you can ask questions about state without code changes.

Bad design:

  • State is private; must add logging to debug
  • Errors silently consumed and retried
  • Retry logic invisible

Good design:

  • State machine explicit and observable (CREATE → PENDING → PROCESSING → COMPLETED)
  • Failures recorded with reason codes
  • Retry attempts logged with backoff strategy
public enum OrderStatus {
CREATED,
PAYMENT_PENDING,
PAYMENT_COMPLETED,
PROCESSING,
SHIPPED,
DELIVERED,
FAILED,
CANCELLED
}
public class Order {
private OrderStatus status;
private LocalDateTime statusChangedAt;
private String statusChangeReason; // why did we change status?
private int retryCount;
private String lastError;
}

SLIs, SLOs, SLAs (With Concrete Examples)

Definitions

SLI (Service Level Indicator): The actual measurement.

  • Example: “the proportion of requests that complete in < 500ms, measured over a rolling 30 days”

SLO (Service Level Objective): Your internal target for an SLI (stricter than the SLA).

  • Example: “99.95% availability” (internal commitment)

SLA (Service Level Agreement): Your external commitment to customers (looser than the SLO).

  • Example: “99.9% availability” (customer-facing; includes buffer)

Golden Signals

Four metrics capture 90% of system health:

1. Latency (how fast?)

p50: 300ms (median request)
p95: 800ms (95th percentile)
p99: 2000ms (99th percentile)
Alert if p95 > 1 second

2. Traffic (how much load?)

Requests per second: 5000 RPS
Concurrent connections: 500
Database connections: 80/100
Alert if traffic unexpectedly drops (possible service failure)

3. Errors (what breaks?)

Error rate: 0.5% (5 errors per 1000 requests)
Error categories: TIMEOUT, INVALID_REQUEST, SERVER_ERROR
Alert if error rate > 1%

4. Saturation (how full?)

CPU: 60% (room to grow)
Memory: 70% (approaching limits)
Disk: 80% (urgent scaling needed)
Queue depth: 1000 messages (capacity limit is 10000)
Alert if CPU > 85%, Memory > 90%
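
On Spring Boot, Micrometer already records most of this automatically as http.server.requests; the sketch below just shows the idea by hand with a plain servlet filter (class and metric names illustrative; saturation comes from the JVM and system metrics covered earlier):

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import jakarta.servlet.*;
import jakarta.servlet.http.HttpServletResponse;

import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class GoldenSignalsFilter implements Filter {

    private final MeterRegistry registry;

    public GoldenSignalsFilter(MeterRegistry registry) {
        this.registry = registry;
    }

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        long start = System.nanoTime();
        try {
            chain.doFilter(request, response);   // traffic: every request passes through here
        } finally {
            int status = ((HttpServletResponse) response).getStatus();
            // Latency + traffic + errors in a single timer, sliced by status class
            Timer.builder("http.requests")
                    .tag("status_class", status >= 500 ? "5xx" : status >= 400 ? "4xx" : "2xx")
                    .publishPercentiles(0.5, 0.95, 0.99)  // p50 / p95 / p99
                    .register(registry)
                    .record(System.nanoTime() - start, TimeUnit.NANOSECONDS);
        }
    }
}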

E-Commerce Example

Checkout Service SLO:

Availability: 99.95%
- Measured as: % of requests that return 2xx or 4xx (not 5xx)
- During: All business hours
Latency:
- p50 < 200ms
- p95 < 500ms
- p99 < 1000ms
- Measured on successful requests
Error budget:
- SLA: 99.9% (customer promise, roughly 43 minutes of downtime per 30-day month)
- SLO: 99.95% (team target, roughly 22 minutes per month)
- Buffer between SLO and SLA: 0.05% (about 22 minutes per month)
- The SLO budget can be "spent" on deployments, experiments

Measuring:

-- SLI for availability
SELECT
(COUNT(*) FILTER (WHERE status_code < 500)) * 1.0 / COUNT(*) as availability
FROM http_requests
WHERE timestamp > now() - interval '30 days'
AND service = 'checkout'
-- SLI for latency
SELECT
service,
PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY latency_ms) as p50,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms) as p95,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY latency_ms) as p99
FROM http_requests
WHERE timestamp > now() - interval '30 days'
AND service = 'checkout'
AND status_code < 400
GROUP BY service

Investment Banking Example

Trade Processing SLO:

Correctness: 100% (no exceptions)
- Measured as: successful settlement / total trades
- Every trade must settle or be rejected, never lost
Latency:
- All trades settled within their settlement window (e.g., T+1 for equities, T+2 for FX spot)
- Exceptions logged for manual review
Auditability: 100%
- Every trade logged with: trader ID, timestamp, approvals, result
- Complete audit trail for regulatory compliance

Measuring:

-- Completeness
SELECT
(COUNT(*) FILTER (WHERE final_status IN ('SETTLED', 'REJECTED'))) * 1.0 / COUNT(*)
as completeness
FROM trades
WHERE settlement_date = current_date
-- Settlement time
SELECT
AVG(settlement_timestamp - trade_timestamp) as avg_settlement_latency
FROM trades
WHERE settlement_date = current_date
-- Exception rate
SELECT
COUNT(*) FILTER (WHERE exception IS NOT NULL) * 1.0 / COUNT(*) as exception_rate
FROM trades
WHERE settlement_date = current_date

Observability for Scalability and Resilience

Observability enables smart scaling and graceful degradation.

Example: Black Friday Traffic Surge

Observability enables:

  1. Early detection (metrics show traffic spike 30 min before peak)
  2. Rapid response (auto-scaling triggers based on observed load)
  3. Graceful degradation (disable non-critical services, keep checkout running)
  4. Post-incident learning (trace shows which service became the bottleneck)

Without observability:

  • The traffic spike overwhelms the system before anyone can respond
  • Cascading failures (payment timeout → order service timeout → frontend timeout)
  • No clue what broke

Domain-Specific Scenarios

E-Commerce: Checkout Latency & Inventory Mismatch

Checkout Latency

Critical issue: customers abandon carts if checkout takes > 3 seconds.

Observability instrumentation:

public class CheckoutController {

    private final Tracer tracer;
    private final CartService cartService;
    private final PaymentService paymentService;
    private final OrderService orderService;

    @PostMapping("/checkout")
    public CheckoutResponse checkout(@RequestBody CheckoutRequest req) {
        Span span = tracer.spanBuilder("checkout_process")
                .setAttribute("user_id", req.getUserId())
                .setAttribute("cart_total", req.getTotal())
                .setAttribute("item_count", req.getItems().size())
                .startSpan();
        try (Scope scope = span.makeCurrent()) {
            // Step 1: Validate cart
            Span cartSpan = tracer.spanBuilder("validate_cart").startSpan();
            try {
                cartService.validate(req.getCartId());
            } finally {
                cartSpan.end(); // latency is recorded when the span ends
            }

            // Step 2: Process payment
            Span paymentSpan = tracer.spanBuilder("process_payment")
                    .setAttribute("amount", req.getTotal())
                    .startSpan();
            PaymentResult payment;
            try {
                payment = paymentService.charge(req);
                paymentSpan.setAttribute("gateway", payment.getGateway());
            } finally {
                paymentSpan.end();
            }

            // Step 3: Create order
            Span orderSpan = tracer.spanBuilder("create_order").startSpan();
            Order order;
            try {
                order = orderService.create(req);
                orderSpan.setAttribute("order.id", order.getId());
            } finally {
                orderSpan.end();
            }

            return new CheckoutResponse(order, payment);
        } finally {
            span.end();
        }
    }
}

Dashboard metrics:

  • Checkout latency (p50, p95, p99) by region
  • Latency breakdown: % time in payment vs inventory vs database
  • Error rate by failure reason (payment_declined, inventory_unavailable, etc.)
  • Conversion funnel: started checkout → completed

Alert thresholds:

  • p95 latency > 1 second (means 5% of users experiencing >1s)
  • Payment timeout > 3 per minute

Inventory Mismatch

Critical issue: system shows “in stock,” customer orders, then “sorry, we’re out of stock.”

Root causes:

  1. Async inventory updates lag (warehouse system delayed)
  2. Race condition (two orders placed simultaneously for last item)
  3. Manual inventory adjustment not synced
  4. Returns not reflected

Observability:

public class InventoryService {

    private final Tracer tracer;
    private final InventoryRepository repo;
    private final EventPublisher events;

    public InventoryReservation reserve(String sku, int quantity) {
        Span span = tracer.spanBuilder("inventory_reserve")
                .setAttribute("sku", sku)
                .setAttribute("quantity", quantity)
                .startSpan();
        try (Scope scope = span.makeCurrent()) {
            InventoryRow row = repo.findBySku(sku);

            // Record the "before" state on the span
            span.addEvent("inventory_check", Attributes.of(
                    AttributeKey.longKey("available"), (long) row.getAvailable(),
                    AttributeKey.longKey("reserved"), (long) row.getReserved()));

            if (row.getAvailable() < quantity) {
                span.recordException(new OutOfStockException(sku));
                throw new OutOfStockException(sku);
            }

            // Update (atomic conditional update, see the sketch below)
            InventoryReservation reservation = repo.reserve(sku, quantity);

            // Record the "after" state
            span.addEvent("reservation_created", Attributes.of(
                    AttributeKey.stringKey("reservation_id"), reservation.getId(),
                    AttributeKey.longKey("remaining"), (long) (row.getAvailable() - quantity)));

            // Publish event (async sync to external systems)
            events.publish(new InventoryReservedEvent(sku, quantity));
            return reservation;
        } finally {
            span.end();
        }
    }
}
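
The repo.reserve(sku, quantity) call above is described as atomic. One way to get that guarantee, sketched here with Spring Data JPA and hypothetical entity/field names, is a conditional update that only succeeds while enough stock remains, which also closes the race-condition root cause listed earlier:

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Modifying;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;
import org.springframework.transaction.annotation.Transactional;

public interface InventoryRepository extends JpaRepository<InventoryRow, String> {

    // Succeeds (returns 1) only if enough stock is still available at commit time;
    // two concurrent orders for the last item cannot both pass the WHERE clause.
    @Modifying
    @Transactional
    @Query("UPDATE InventoryRow r "
         + "SET r.available = r.available - :qty, r.reserved = r.reserved + :qty "
         + "WHERE r.sku = :sku AND r.available >= :qty")
    int tryReserve(@Param("sku") String sku, @Param("qty") int qty);

    InventoryRow findBySku(String sku);
}

The service's reserve(...) can then treat a return value of 0 as an OutOfStockException instead of re-checking availability in application code.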

Reconciliation job (daily):

@Scheduled(cron = "0 0 2 * * *") // 02:00 daily (Spring cron uses six fields, seconds first)
public void reconcileInventory() {
    Span span = tracer.spanBuilder("inventory_reconciliation").startSpan();
    try (Scope scope = span.makeCurrent()) {
        List<String> skus = repo.getAllSkus();
        int discrepancies = 0;
        for (String sku : skus) {
            long systemCount = repo.getCount(sku);
            long actualCount = warehouseApi.getActualCount(sku); // source of truth
            if (systemCount != actualCount) {
                discrepancies++;
                logger.warn("inventory_mismatch",
                        kv("sku", sku),
                        kv("system_count", systemCount),
                        kv("actual_count", actualCount),
                        kv("variance", actualCount - systemCount));

                // Record metric for trending
                meterRegistry.counter("inventory.discrepancies", "sku", sku).increment();

                if (Math.abs(actualCount - systemCount) <= 5) {
                    // Auto-correct if the difference is small
                    repo.update(sku, actualCount);
                } else {
                    // Alert for manual review
                    slack.notify("Inventory discrepancy for " + sku);
                }
            }
        }
        span.setAttribute("discrepancies_found", discrepancies);
    } finally {
        span.end();
    }
}

Dashboard:

  • System inventory vs actual inventory (reconciliation variance)
  • Out-of-stock errors: rate, by SKU, by region
  • Reservation success rate
  • Mismatch detection latency (how long before we notice?)

Investment Banking: Trade Processing & Reconciliation

Trade Processing

Critical: every trade must be recorded, settled, and auditable.

Observability:

public class TradeProcessor {

    private final Tracer tracer;
    private final TradeRepository repo;
    private final SettlementService settlement;
    private final AuditLog auditLog;

    public Trade processIncomingTrade(IncomingTrade incoming) {
        String tradeId = incoming.getTradeId();
        Span span = tracer.spanBuilder("process_trade")
                .setAttribute("trade.id", tradeId)
                .setAttribute("instrument", incoming.getInstrument())
                .setAttribute("quantity", incoming.getQuantity())
                .setAttribute("price", incoming.getPrice())
                .setAttribute("counterparty", incoming.getCounterparty())
                .startSpan();
        try (Scope scope = span.makeCurrent()) {
            // Step 1: Validation
            Span validationSpan = tracer.spanBuilder("validate_trade").startSpan();
            try {
                validator.validate(incoming);
                validationSpan.addEvent("validation_passed");
            } finally {
                validationSpan.end();
            }

            // Step 2: Book trade
            Trade trade = new Trade(incoming);
            trade.setStatus(TradeStatus.BOOKED);
            trade.setBookingTime(Instant.now());
            repo.save(trade);

            // Step 3: Initiate settlement
            Span settlementSpan = tracer.spanBuilder("initiate_settlement").startSpan();
            try {
                SettlementInstruction instruction = settlement.initiate(trade);
                settlementSpan.setAttribute("settlement.instruction_id", instruction.getId());
            } finally {
                settlementSpan.end();
            }

            // Step 4: Audit logging (for regulatory compliance)
            auditLog.log(new AuditEntry()
                    .setAction("TRADE_BOOKED")
                    .setTradeId(tradeId)
                    .setTraderId(getCurrentTraderId()) // who booked it?
                    .setTimestamp(Instant.now())
                    .setDetails(incoming)
                    .setApprovals(Collections.emptyList())); // approvals if required

            return trade;
        } finally {
            span.end();
        }
    }

    public void reconcileSettlement() {
        Span span = tracer.spanBuilder("reconcile_settlement").startSpan();
        try (Scope scope = span.makeCurrent()) {
            List<Trade> unsettledTrades = repo.findByStatus(TradeStatus.PENDING_SETTLEMENT);
            for (Trade trade : unsettledTrades) {
                Span tradeSpan = tracer.spanBuilder("reconcile_trade")
                        .setAttribute("trade.id", trade.getId())
                        .startSpan();
                try {
                    // Check with the custodian (source of truth)
                    SettlementStatus custodianStatus = custodian.getStatus(trade.getId());
                    TradeStatus systemStatus = trade.getStatus();
                    if (!systemStatus.matches(custodianStatus)) {
                        // Reconciliation break!
                        tradeSpan.recordException(new ReconciliationBreakException(
                                trade.getId(), systemStatus, custodianStatus));
                        logger.error("reconciliation_break",
                                kv("trade_id", trade.getId()),
                                kv("system_status", systemStatus),
                                kv("custodian_status", custodianStatus));
                        auditLog.log(new AuditEntry()
                                .setAction("RECONCILIATION_BREAK")
                                .setTradeId(trade.getId())
                                .setTimestamp(Instant.now())
                                .setDetails(Map.of(
                                        "system_status", systemStatus.toString(),
                                        "custodian_status", custodianStatus.toString())));
                    } else {
                        tradeSpan.addEvent("reconciliation_passed");
                    }
                } finally {
                    tradeSpan.end();
                }
            }
        } finally {
            span.end();
        }
    }
}

Audit dashboard:

  • Trades booked per day (volume trending)
  • Settlement latency (T+1, T+2, etc.)
  • Reconciliation breaks (rate, by reason)
  • Audit trail queries (find all trades for trader X on date Y)

Incident Response & Debugging

How Observability Helps During Production Incidents

Timeline:

  • T+0: Alert fires (metric spike detected)
  • T+1 min: Dashboard shows error rate trending up
  • T+3 min: Trace sampling captures a failing trace
  • T+5 min: Engineer examines trace, identifies slow database query
  • T+8 min: Logs for that trace context confirm query timeout
  • T+10 min: Root cause identified: missing index on orders.user_id
  • T+15 min: Index created, latency returns to baseline
  • T+25 min: Postmortem started

Without observability:

  • T+0: Alert fires
  • T+5 min: “Something’s wrong, let’s check the application logs”
  • T+20 min: Scrolling through 10M log lines, can’t find the issue
  • T+45 min: Finally found relevant error, but which service caused it?
  • T+90 min: Shotgun debugging, restarted random services
  • T+120 min: Issue mysteriously resolved (probably not actually fixed)

Root Cause Analysis Using Logs, Metrics, and Traces

Scenario: Checkout conversion drops 50% on Black Friday at 3 PM.

Step 1: Metrics confirm issue

checkout_conversion_rate: 98% → 48% (drop detected)
payment_service_error_rate: 0.5% → 15%
payment_service_latency_p99: 500ms → 5000ms

Step 2: Trace shows bottleneck
Sampled trace shows:

  • API Gateway to Checkout Service: 50ms (normal)
  • Checkout to Payment Service: 4900ms (SLOW!)
  • Payment to Stripe: timeout after 5s

Step 3: Logs reveal root cause

{
"timestamp": "2026-01-27T15:03:45Z",
"service": "payment-service",
"event": "stripe_api_timeout",
"http_status": 504,
"duration_ms": 5000,
"error": "context deadline exceeded"
}

The payment service was making unoptimized calls to Stripe (no batching, no connection pooling).

Step 4: Identify fix

  1. Add exponential backoff (don’t hammer Stripe)
  2. Batch Stripe calls
  3. Increase the timeout slightly (temporary)
  4. Enable a circuit breaker to fail fast instead of hanging (see the sketch below)
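
Item 4 is usually the highest-leverage fix. A minimal Resilience4j sketch (names, thresholds, and the stripeGateway collaborator are illustrative) that wraps the Stripe call so it fails fast once error or slow-call rates climb:

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

import java.time.Duration;

public class StripeClientWrapper {

    private final CircuitBreaker circuitBreaker;
    private final StripeGateway stripeGateway; // hypothetical collaborator that calls Stripe

    public StripeClientWrapper(StripeGateway stripeGateway) {
        this.stripeGateway = stripeGateway;
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)                       // open after 50% failures
                .slowCallDurationThreshold(Duration.ofSeconds(2))
                .slowCallRateThreshold(50)                      // treat slow calls as failures too
                .waitDurationInOpenState(Duration.ofSeconds(30))
                .build();
        this.circuitBreaker = CircuitBreakerRegistry.of(config).circuitBreaker("stripe");
    }

    public PaymentResult charge(CheckoutRequest req) {
        // When the breaker is open this throws CallNotPermittedException immediately
        // instead of letting requests hang for the full timeout.
        return circuitBreaker.executeSupplier(() -> stripeGateway.charge(req));
    }
}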

Postmortems and Learning Loops

Good postmortem structure:

## Incident: Checkout Conversion Drop
**Timeline:**
- 15:03 UTC: Error rate spike detected
- 15:08 UTC: Team paged
- 15:15 UTC: Root cause identified (Stripe timeout)
- 15:28 UTC: Circuit breaker deployed, recovered
**Duration:** 25 minutes
**Root Cause:**
Payment service was synchronously calling Stripe's API without batching or connection pooling. Black Friday traffic (10x normal) exhausted payment service's HTTP connection pool, causing timeouts.
**Contributing Factors:**
1. Load test only simulated 2x normal traffic (not enough)
2. No circuit breaker between payment and Stripe
3. Stripe's API performance not monitored (external blind spot)
**Immediate Actions:**
1. Enable circuit breaker (fail fast)
2. Implement exponential backoff
**Follow-up Actions:**
1. Add load test for 10x traffic (within 2 weeks)
2. Add Stripe API latency to dashboards
3. Implement request batching for Stripe (within sprint)
**Learning:**
External service degradation can cascade. Need circuit breakers for all external calls.

Common Questions on Observability

Q1: “Design an observability solution for a new microservices application.”

Structure your answer:

  1. Understand requirements (throughput, latency, criticality)
  2. Choose instrumentation (OpenTelemetry, auto-instrumentation)
  3. Define metrics (Golden Signals)
  4. Logging strategy (structured JSON, correlation IDs)
  5. Tracing strategy (head/tail sampling)
  6. Alerting (SLO-based)
  7. Dashboards (user journey, operational)

Example answer:

“I’d start with understanding the business criticality and scale. For an e-commerce platform:

Instrumentation:

  • OpenTelemetry Java agent for auto-instrumentation (zero code changes initially)
  • Manual spans for critical business logic (checkout, payment)

Metrics:

  • Golden Signals: latency (p50/p95/p99), traffic (RPS), errors (rate by type), saturation (CPU, memory, queue depth)
  • Business metrics: conversion rate, revenue per minute

Logging:

  • Structured JSON with correlation IDs
  • All logs must have: timestamp, service, trace_id, span_id, level, event, user_id
  • Logback with JSON encoder

Tracing:

  • AWS X-Ray for AWS-native tracing
  • Sampling: head-based 10% + tail-based for all errors and slow traces (>1s)

Alerting:

  • SLO-based alerts, not metric thresholds
  • Example: alert if error budget for payment service burns > 2% per hour
  • PagerDuty for critical, Slack for warnings

Dashboards:

  • User journey dashboard (checkout start → completion)
  • Operational dashboard (latency, errors by service)
  • Resource utilization (CPU, memory, connections)

Scaling:

  • Use CloudWatch Logs Insights for ad-hoc queries
  • Plan to migrate to Datadog/New Relic as scale increases”

Q2: “How would you handle observability in a multi-cloud environment?”

Answer structure:

  1. Use vendor-agnostic standards (OpenTelemetry)
  2. Centralized collection (OpenTelemetry Collector)
  3. Dual-write to multiple backends for failover

Example:

“Multi-cloud observability requires decoupling from vendor APIs.

Instrumentation:

  • OpenTelemetry everywhere (same Java agent whether deployed on AWS, GCP, or on-premises)

Collection:

  • OpenTelemetry Collector deployed in each cloud (processes spans, metrics, logs)
  • Collector configured to export to multiple backends

Backends:

  • Primary: Datadog (multi-cloud support)
  • Fallback: Splunk (self-hosted option)
  • Cloud-native tools (CloudWatch, Google Cloud Monitoring) retained where needed for governance

Configuration:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:

exporters:
  datadog:
    api:
      key: ${DATADOG_API_KEY}
      site: datadoghq.com
  splunk_hec:
    token: ${SPLUNK_TOKEN}
    endpoint: https://splunk.example.com:8088
  awsxray:
    region: us-east-1

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog, splunk_hec, awsxray]

Advantages:

  • Applications are cloud-agnostic
  • Can switch backends without code changes
  • Failover if one backend is down”

Q3: “What are red flags in observability that you watch for?”

Good red flags to mention:

  1. Logs without context: if logs lack correlation IDs, requests cannot be traced across services
  2. Metric cardinality explosion: High cardinality dimensions (user_id, request_id) in metrics = cost spiral
  3. No sampling: Attempting to trace 100% of traffic = cost overrun and performance degradation
  4. Alerting fatigue: Team ignoring alerts (tuning has failed)
  5. Observability only in production: pre-production environments are blind, so instrumentation problems surface only on the first production deploy
  6. No SLOs: Team debates what “healthy” means; no objective targets
  7. Logs as debugging tool: Team adding logs to debug instead of using traces
  8. Silent failures: Errors consumed without logging; invisible to observability
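
Red flag 8 is the easiest one to catch in code review; the difference looks like this (sketch, reusing the kv() helper from earlier):

// Silent failure: the error never reaches logs, metrics, or traces
try {
    paymentGateway.charge(order);
} catch (Exception e) {
    // swallowed; the system looks healthy while payments quietly fail
}

// Observable failure: counted, logged with context, and re-thrown
try {
    paymentGateway.charge(order);
} catch (Exception e) {
    meterRegistry.counter("payments.failed", "error_type", e.getClass().getSimpleName()).increment();
    logger.error("payment_charge_failed", kv("order_id", order.getId()), e);
    throw e;
}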

Follow-up: “As a Tech Lead, I’d establish observability requirements in Definition of Done:

  • All services must emit logs, metrics, traces
  • Code review must check for correlation IDs and structured logging
  • Every service must have a dashboard before prod deployment”

How to Explain Observability Decisions Confidently

Technique: Use the “Why” Framework

Instead of: “We’re using OpenTelemetry and CloudWatch.”

Say: “We chose OpenTelemetry because it’s vendor-agnostic (we might switch to Datadog later), it auto-instruments most libraries (reducing engineering overhead), and it’s the CNCF standard (25+ companies contribute). For AWS, we’re using CloudWatch Logs and X-Ray because they’re native to Lambda (no extra agents), but we’re exporting to Datadog as our primary backend for richer querying.”

Structure:

  1. What we’re doing
  2. Why we chose it (requirements + trade-offs)
  3. How it works (brief architecture)
  4. When it’s the right choice (and when it’s not)

Architecture Diagrams (Textual)

Microservices with Full Observability

User Request Flow
─────────────────
[Load Balancer]  (assigns or propagates X-Correlation-ID)
        │
[API GW] [API GW] [API GW]   (multiple gateway instances)
        │
[Checkout Service]           (span: checkout_flow)
        ├─→ [Cart Service]      (span)
        ├─→ [Payment Service]   (span: payment) ──→ [Stripe API] (instrumented via OpenTelemetry)
        ├─→ [Order Service]     (span: order_create) ──→ [DB] (logs query latency)
        └─→ [Notification Service]  (async via Kafka, trace context in message headers)
                 └─→ [Email Service]

Observability Pipeline
──────────────────────
All Services
   └─→ [OpenTelemetry Collector]
          ├─→ Traces  ──→ [AWS X-Ray / Jaeger]
          ├─→ Metrics ──→ [Prometheus / CloudWatch]
          └─→ Logs    ──→ [CloudWatch Logs / Elasticsearch]

[CloudWatch]   [Datadog]   [PagerDuty]
      ├─→ [Dashboards] ──→ [Alerts] ──→ [On-call]
      └─→ [Error Budget Tracking]

Request Tracing Through Async Boundary

Synchronous Request:
────────────────────
User Browser
  └─→ [POST /checkout]  (X-Correlation-ID: abc-123)
        ├─→ API Gateway        (span: api_gw, trace_id: abc-123)
        ├─→ Checkout Service   (span: checkout, parent: api_gw, baggage: {user_id, tenant_id})
        ├─→ Payment Service    (span: payment, parent: checkout)
        │        └─→ [HTTP call]  (headers carry the trace context)
        └─→ Response [200 OK]  (includes X-Trace-ID: abc-123)

Asynchronous (Kafka):
─────────────────────
Checkout Service
  └─→ Span: publish_order
        [SEND to Kafka]
        Message headers:
          ├─ traceparent: 00-abc-123...
          ├─ tracestate: vendor=value
          └─ baggage: user_id=123
              │
        [Kafka Queue]
              │
Order Processing Service
  └─→ Span: process_order
        [EXTRACT from headers]
        parent: (from extracted context)
        links: [producer span context]

The trace is continuous across the sync/async boundary.
Complete request journey: api_gw → checkout → kafka → order_service
Production Checklists

Observability Checklist for a New Service

Use this before deploying any new service to production.

Logging ✓

  • [ ] All logs are structured JSON (not free-form text)
  • [ ] Logs include: timestamp, service name, log level, event type, trace_id, span_id
  • [ ] Correlation IDs are propagated via X-Correlation-ID header
  • [ ] MDC (Mapped Diagnostic Context) configured in logging framework
  • [ ] Sensitive data (passwords, tokens, PII) not logged
  • [ ] Log levels are appropriate (DEBUG for verbose, INFO for decisions, ERROR for problems)
  • [ ] Logging configuration is externalized (environment variables, ConfigMap)

Metrics ✓

  • [ ] Golden Signals defined and emitted: latency (p50/p95/p99), traffic (RPS), errors (rate, by type), saturation (CPU, memory)
  • [ ] Business metrics defined: signup rate, conversion rate, revenue
  • [ ] JVM metrics enabled (via Micrometer): heap usage, GC pauses, thread count
  • [ ] Metrics have consistent naming and tags (environment, service, version)
  • [ ] High-cardinality dimensions handled via EMF or separate logs (not metrics)
  • [ ] Metrics endpoint available (/actuator/prometheus for Spring Boot)

Tracing ✓

  • [ ] OpenTelemetry Java agent configured (or manual instrumentation for custom logic)
  • [ ] Trace context propagated to all downstream services
  • [ ] Async boundaries (Kafka, SQS) properly instrumented (context in headers)
  • [ ] Sampling strategy defined: head-based rate + tail-based for errors
  • [ ] Trace exporter configured (X-Ray, Jaeger, Tempo, Datadog)
  • [ ] Custom spans added for critical business operations (checkout, payment, etc.)

Alerting ✓

  • [ ] SLOs defined: availability, latency targets (use error budget)
  • [ ] Alerts based on SLO burn rate (not static thresholds)
  • [ ] All critical alerts have runbooks with investigation steps
  • [ ] Alert routing configured (severity → team → escalation)
  • [ ] Oncall rotation established
  • [ ] Alert fatigue handled (alert on symptoms, not causes)

Dashboards ✓

  • [ ] Service dashboard created: Golden Signals + business metrics
  • [ ] Architecture dashboard: service dependencies
  • [ ] Error dashboard: error types, rates, affected users
  • [ ] Resource dashboard: CPU, memory, disk, connections
  • [ ] Drill-down capabilities (time-series → traces → logs)

Pre-Production Testing ✓

  • [ ] Load test at 2x expected peak traffic
  • [ ] Chaos test: kill dependencies, verify graceful degradation
  • [ ] Observability test: verify traces, metrics, logs are correctly emitted
  • [ ] Verify no PII in logs/traces
  • [ ] Verify sampling rates are economical (estimate daily cost)

Documentation ✓

  • [ ] Service wiki page: observability setup, how to debug
  • [ ] Dashboard links in runbooks
  • [ ] Trace ID querying instructions (how to find traces for a request)
  • [ ] Common issues and how to debug them
  • [ ] On-call guide: what to do when this service alerts

Production Validation Checklist

Before go-live (24 hours before):

  • [ ] Dashboards loaded and displaying data correctly
  • [ ] Alerts firing and routing to correct teams
  • [ ] Logs searchable in CloudWatch Logs Insights (query latency < 1 minute)
  • [ ] Traces appearing in X-Ray / Jaeger
  • [ ] Sampling rate confirmed (estimate: X traces/second, Y cost/month)
  • [ ] Cost projections reviewed (logs, metrics, traces)
  • [ ] Oncall engineer trained on dashboards and runbooks
  • [ ] Rollback plan includes observability (how do we know rollback worked?)

First 24 hours post-launch (continuous monitoring):

  • [ ] Error rate normal (no unexplained spikes)
  • [ ] Latency baseline established (compare to pre-prod)
  • [ ] Trace sampling working (no gaps in coverage)
  • [ ] Alerts tuned (no false positives, no missed failures)
  • [ ] Team familiar with debugging tools (tracing, log querying)

Key Takeaways

Remember these talking points:

  1. Observability vs Monitoring: Observability answers “why,” monitoring answers “what.”
  2. Three Pillars: Logs, metrics, traces. Each has strengths. Together, they’re powerful. Don’t leave any out.
  3. Correlation IDs are non-negotiable: Every request must have a correlation ID flowing through all services. This is basic hygiene.
  4. Structured logging: JSON logs are queryable. Free-form text logs are noise.
  5. OpenTelemetry: Industry standard, vendor-agnostic, 75% adoption. This is the safe choice.
  6. SLOs not dashboards: Define what “healthy” means using SLOs. Build alerts around error budgets, not static thresholds.
  7. Sampling is mandatory: At scale, tracing everything is expensive. Head + tail sampling balances cost and visibility (see the sampler sketch below).
  8. Domain context matters: E-commerce cares about conversion and inventory. Banking cares about auditability. Know your domain.
  9. Incident response workflow: Alert → Metrics → Traces → Logs. This is the gold standard for RCA.
  10. Design for observability: Build systems with instrumentation from day one. Don’t bolt it on later.
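
As a quick illustration of takeaway 7, a minimal head-sampling configuration with the OpenTelemetry SDK; the 10% ratio is an assumption, and tail-based sampling for errors would be configured in the collector rather than in the application.

```java
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.samplers.Sampler;

// Sample 10% of new traces at the head, but always honour the parent's decision
// so a single trace is never half-recorded across services.
public class TracingConfig {
    public static SdkTracerProvider tracerProvider() {
        return SdkTracerProvider.builder()
                .setSampler(Sampler.parentBased(Sampler.traceIdRatioBased(0.10)))
                .build();
    }
}
```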

Final Reminders

As a Tech Lead, you’re responsible for observability culture on your team:

  • Establish standards: Logging format, metric naming, trace sampling policy
  • Code review: Ensure correlation IDs, no PII, structured logging
  • Team training: Teach juniors how to use dashboards and traces
  • On-call support: Ensure dashboards and runbooks are actually useful
  • Continuous improvement: Measure observability effectiveness (MTTR, alert quality)
  • Balance cost and visibility: Not “trace everything,” but “trace the right things”

Success looks like:

  • Team can diagnose issues in < 15 minutes using observability
  • New team members can debug unfamiliar services using traces
  • Incidents rarely recur (observability enables learning)
  • On-call engineers sleep better (actionable alerts, clear runbooks)

Datadog in Practice

1. How to explain monitoring vs observability (using Datadog)

  • Monitoring
    • Define key service and business KPIs: error rate, p95 latency, throughput, resource usage, checkout success, etc.
    • Build dashboards and SLOs in Datadog for these KPIs.
    • Configure alerts (monitors) on symptoms, not just infrastructure: spikes in 5xx, slow endpoints, SLO burn rate.
  • Observability
    • Ensure rich telemetry:
      • Metrics from infra, apps, DB, cache.
      • Structured logs with correlation IDs.
      • Traces across services (APM).
      • RUM and synthetics for user experience.
    • Standardize tagging (env, service, version, team, etc.) so everything can be sliced and correlated.

This distinction shows ownership of both operational health and deep debugging capability.


2. End‑to‑end setup for a typical web application in Datadog

For a modern web app (e.g., frontend + Java/Spring backend + DB + cache, on cloud or Kubernetes):

  1. Infrastructure layer
  • Install the Datadog Agent on nodes/VMs or as a DaemonSet in Kubernetes.
  • Enable cloud and DB integrations (AWS, RDS/Postgres/MySQL, Redis, NGINX, etc.).
  • Outcome: baseline metrics for CPU, memory, disk, network, DB health, queue depth, etc.
  2. Backend / API (APM + logs + metrics)
  • Attach dd-java-agent (for Java) and configure:
    • DD_ENV, DD_SERVICE, DD_VERSION, DD_LOGS_INJECTION.
  • Rely on auto‑instrumentation for HTTP, JDBC, Redis, HTTP clients, etc.
  • Configure application logs to be collected by the agent, using structured logging and trace/log correlation.
  • Emit domain metrics (e.g., checkout.completed, payment.failed) via DogStatsD for business visibility (a DogStatsD sketch follows this list).
  3. Frontend (RUM)
  • Add the Datadog RUM snippet to the web app.
  • Capture page loads, JS errors, user actions, and Core Web Vitals, tagged by env, service, and optionally app version.
  4. External reliability (Synthetics)
  • Set up HTTP and browser synthetics for critical endpoints and user flows (login, search, checkout).

This gives a full picture: infra → backend → frontend → external dependencies.
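
For the domain-metrics step in the backend layer, a hedged sketch assuming the java-dogstatsd-client library; the prefix, metric names, and tags are illustrative.

```java
import com.timgroup.statsd.NonBlockingStatsDClientBuilder;
import com.timgroup.statsd.StatsDClient;

// Counters are sent to the local Datadog Agent's DogStatsD listener (UDP 8125 by default).
public class CheckoutTelemetry {

    private static final StatsDClient STATSD = new NonBlockingStatsDClientBuilder()
            .prefix("shop")               // metrics appear as shop.checkout.completed, etc.
            .hostname("localhost")
            .port(8125)
            .build();

    public void onCheckoutCompleted(String paymentMethod) {
        STATSD.incrementCounter("checkout.completed", "payment_method:" + paymentMethod);
    }

    public void onPaymentFailed(String reason) {
        STATSD.incrementCounter("payment.failed", "reason:" + reason);
    }
}
```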


3. How this is used operationally (process + workflow)

For a senior tech lead, the value is not just in the plumbing but in how teams use it:

  • Dashboards
    • Service dashboards: latency, throughput, error rate, GC, DB timings for each microservice.
    • Business dashboards: orders/minute, cart conversions, payment success, broken down by region or channel.
    • Infra dashboards: node health, pod restarts, DB and cache performance.
  • Monitors and SLOs
    • Define SLOs for critical flows (e.g., “99.9% of checkouts complete successfully in < 1s over 30 days”).
    • Use Datadog SLOs and monitors to track error budgets and burn rate.
    • Configure alerts with enough context: involved service, env, recent deployment version, and links to relevant dashboards and runbooks.
  • Standardization and governance
    • Establish a tagging and naming convention across services.
    • Set guidelines for logging levels, structured fields (user ID, order ID, correlation ID), and sensitive data (a correlation-ID filter sketch follows this list).
    • Make observability part of the definition of done for new services and features.
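
For the structured-fields guideline, a minimal servlet-filter sketch that accepts or assigns a correlation ID and exposes it to every log line via SLF4J's MDC; the header name, Jakarta Servlet API, and MDC key are assumptions to adapt to your stack.

```java
import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import jakarta.servlet.http.HttpServletRequest;
import org.slf4j.MDC;

import java.io.IOException;
import java.util.UUID;

public class CorrelationIdFilter implements Filter {

    private static final String HEADER = "X-Correlation-Id";   // illustrative header name

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        String correlationId = ((HttpServletRequest) req).getHeader(HEADER);
        if (correlationId == null || correlationId.isBlank()) {
            correlationId = UUID.randomUUID().toString();        // minted at the edge if missing
        }
        MDC.put("correlationId", correlationId);                 // picked up by the JSON log encoder
        try {
            chain.doFilter(req, res);
        } finally {
            MDC.remove("correlationId");                         // avoid leaking across pooled threads
        }
    }
}
```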

4. Concrete incident example (shows practical application)

For example, if “checkout is slow and users see errors”:

  1. A Datadog monitor on checkout SLO / error rate fires.
  2. Open the checkout-service dashboard:
  • Spot that p95 latency and error rate increased after a specific deployment version.
  3. Jump into APM traces:
  • See slow spans for DB queries or external payment gateway calls.
  4. From a slow trace, drill into logs:
  • View the exact exception, SQL query, or external error code, along with user/order IDs.
  5. Correlate with RUM:
  • Check whether the issue affects a specific region, browser, or only mobile users.
  6. Take action:
  • Roll back, toggle feature flags, or adjust infra resources.
  7. After the incident:
  • Update monitors or add new domain metrics/log fields to prevent blind spots.

This demonstrates that Datadog is not just “graphs,” but the backbone for structured incident response and continuous improvement.


5. Key design principles a senior tech lead would drive

  • Treat telemetry as a first‑class feature:
    • Observability requirements planned alongside functional requirements.
  • Optimize for fast MTTR (mean time to recovery):
    • From alert → dashboard → trace → log → root cause in a few hops.
  • Align technical metrics with business outcomes:
    • Tie SLOs to user journeys and revenue‑critical flows (search, add‑to‑cart, checkout).
  • Promote self‑service:
    • Product teams own their dashboards, monitors, and SLOs, with shared standards and platform support around Datadog.

Datadog Incident Scenarios

1. DB query regression after deployment

Scenario
A new release goes out. Within 10–15 minutes, checkout latency starts increasing and some users abandon the flow.

What Datadog shows

  • A monitor on checkout-service p95 latency and HTTP 5xx rate fires.
  • The service dashboard shows:
    • p95 latency for POST /api/checkout doubled.
    • Error rate slightly increased.
    • The change started right after version 1.23.0 went live (via the version tag).
  • In APM traces:
    • Slow traces show most time spent in a specific DB span, e.g. SELECT * FROM orders ....
    • The span duration went from ~20 ms to ~300 ms.
  • DB integration metrics (Postgres/MySQL):
    • Increase in rows scanned and slow query count.
    • CPU on the DB is up, but not maxed.

How it’s resolved

  • Compare traces and SQL before vs. after the release.
  • Identify that a new filter was added without a supporting index.
  • Hotfix: add the missing index and redeploy, or temporarily roll back to the previous version.
  • After the fix:
    • Latency and error rate return to normal.
    • Close the incident and update the runbook: “DB changes must have index review + load test; add a dedicated dashboard widget tracking slow queries per table.”

2. Memory leak leading to container restarts

Scenario
Users intermittently see 502/503 errors from the API. Incidents seem random; no obvious traffic spike.

What Datadog shows

  • Infra dashboard:
    • Pods for catalog-service restarting frequently.
    • Container OOM kills visible in Kubernetes events.
  • JVM metrics on that service:
    • Heap usage climbs gradually over several hours, never fully reclaimed after GC.
    • Full GC frequency increasing.
  • APM:
    • Request latency spikes just before pod restarts.
  • Logs:
    • OutOfMemoryError or related GC errors shortly before the container dies.

How it’s resolved

  • Correlate:
    • Memory growth → more full GCs → latency spikes → container OOM → brief 5xx blips.
  • Use heap dumps / profiling (outside Datadog) to identify the root cause (e.g., caching large objects per request, or an unbounded in-memory cache).
  • Short-term: increase pod memory requests/limits and replica count to reduce user impact.
  • Long-term: fix the leak with proper cache eviction and by avoiding large in-memory collections (a bounded-cache sketch follows this list).
  • In Datadog:
    • Add a monitor on memory usage slope (or GC time) to detect future leaks earlier.
    • Add a widget that correlates pod restarts with heap usage.
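
As a minimal illustration of the long-term fix, one way to bound an in-memory cache so it cannot grow without limit; a production service would more likely use Caffeine or Guava with size- and time-based eviction.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LRU cache capped at maxEntries; the eldest entry is evicted on overflow.
// Wrap with Collections.synchronizedMap(...) if shared across threads.
public class BoundedCache<K, V> extends LinkedHashMap<K, V> {

    private final int maxEntries;

    public BoundedCache(int maxEntries) {
        super(16, 0.75f, true);               // access-order gives LRU behaviour
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```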

3. Third‑party payments API latency and errors

Scenario
Checkout starts timing out and payment failures spike, mostly in certain time windows. Internally, no major code changes happened.

What Datadog shows

  • SLO/monitor on “successful payment events” fires due to increased failures and timeouts.
  • Service map highlights payment-service as a hot node (error rate and latency high).
  • In APM traces for payment-service:
    • A downstream span for https://api.payment-gateway.com/charge has:
      • Latency jumping from ~200 ms to 2–3 seconds.
      • More errors with HTTP 5xx or timeouts.
  • Logs:
    • Timeouts and specific error codes from the payment provider, e.g. 504 Gateway Timeout or rate limit exceeded.
  • Internal infra (CPU, DB, network) looks normal.

How it’s resolved

  • Confirm the issue is external:
    • Other services are healthy, but all failures correlate with the payment provider spans.
  • Immediate mitigation:
    • Reduce timeouts and implement proper circuit-breaker behavior, failing fast instead of hanging (a circuit-breaker sketch follows this list).
    • Fallback strategies where possible (e.g., queueing payments for retry, better user messaging).
  • Medium-term:
    • Work with the provider (share Datadog metrics and timings).
    • Consider a multi-provider setup or regional routing to reduce the blast radius.
  • In Datadog:
    • Add a dedicated dashboard for external dependencies with:
      • Latency and error rate per third‑party.
      • Separate monitors so incidents are quickly classified as “internal vs third‑party”.
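
For the fail-fast mitigation, a hedged sketch using Resilience4j; the thresholds, names, and record types are illustrative, not recommended values.

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

import java.time.Duration;
import java.util.function.Supplier;

public class PaymentGatewayClient {

    private final CircuitBreaker circuitBreaker = CircuitBreaker.of(
            "payment-gateway",
            CircuitBreakerConfig.custom()
                    .failureRateThreshold(50)                          // open after 50% failed calls
                    .slowCallDurationThreshold(Duration.ofSeconds(2))  // treat calls > 2s as slow
                    .slowCallRateThreshold(50)
                    .waitDurationInOpenState(Duration.ofSeconds(30))   // fail fast while open
                    .build());

    public PaymentResult charge(PaymentRequest request) {
        Supplier<PaymentResult> decorated =
                CircuitBreaker.decorateSupplier(circuitBreaker, () -> callGateway(request));
        return decorated.get();   // throws CallNotPermittedException immediately when the breaker is open
    }

    private PaymentResult callGateway(PaymentRequest request) {
        // HTTP call to the provider with a short client-side timeout (omitted in this sketch)
        throw new UnsupportedOperationException("gateway call omitted");
    }

    record PaymentRequest(String orderId, long amountCents) {}
    record PaymentResult(boolean success, String providerCode) {}
}
```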

4. Region-specific degradation due to misconfigured load balancer

Scenario
Users in Europe complain about slow responses and sporadic errors, but global averages look fine.

What Datadog shows

  • Overall SLOs might still be green, but:
  • RUM dashboard:
    • Page load time and XHR latency high for region:EU.
  • APM:
    • Filtering traces by region:eu-west-1 shows p95 latency 3–4x higher versus us-east-1.
  • Infra / LB metrics:
    • EU load balancer has higher 5xx and connection errors.
    • One of the EU backend target groups shows more unhealthy instances.
  • Logs (filtered by region tag / host):
    • Increased connection reset or upstream timeout messages from NGINX / LB in EU only.

How it’s resolved

  • Trace the symptom:
    • Start from RUM (user experience) → backend traces → infra in that region.
  • Identify that:
    • A new autoscaling rule or target group configuration in EU was incorrect (e.g., health check path broken, fewer healthy instances, or mis-routed to an overloaded node pool).
  • Fix:
    • Correct LB target group configuration and health checks.
    • Redistribute traffic evenly across healthy instances.
  • Post-incident:
    • Add region-specific monitors:
      • Latency, error rate, and RUM performance per region.
    • Ensure deployment pipelines update configs consistently across regions.

5. Frontend JS error breaking a key flow after a feature rollout

Scenario
No big backend changes, but suddenly users on certain browsers cannot complete checkout. Backend metrics look normal, but conversion drops.

What Datadog shows

  • Business dashboard:
    • Drop in successful checkouts / increase in cart abandonment.
  • Backend:
    • APM metrics for POST /api/checkout look fine: no spike in errors or latency.
  • RUM:
    • Spike in JS errors on the checkout page, mainly on a specific browser/version (e.g., Safari 15).
    • Error stack: TypeError: undefined is not a function in a new piece of JS added in the latest frontend release.
    • Correlation with the new RUM version or build tag.
  • RUM session replays (if enabled):
    • Show users stuck at a specific step, with a button not responding or a form not submitting.

How it’s resolved

  • Link symptom to cause entirely from the frontend:
    • Users see an error in JS → the action is never sent to the backend → backend metrics stay “green” but the business metric falls.
  • Roll back the offending frontend bundle or quickly hotfix the JavaScript.
  • Improve QA/testing:
    • Add automated cross‑browser tests for critical flows.
  • In Datadog:
    • Add monitors on:
      • RUM JS error rate for critical pages.
      • Conversion rate from cart_view → checkout_complete.
    • Ensure frontend releases are tagged and visible in dashboards for quick correlation.
