Target Audience: Software engineers with Java + AWS background
Table of Contents
- Foundations
- Java-Focused Observability
- AWS Observability
- Distributed Tracing & Context Propagation
- System Design Perspective
- Domain-Specific Scenarios
- Incident Response & Debugging
- Common Questions on Observability
- Architecture Diagrams (Textual)
- Production Checklists
Foundations
What Observability Really Means
Observability vs. Monitoring: The Critical Distinction
The two terms are often used interchangeably but serve fundamentally different purposes. Understanding this distinction is essential for Tech Lead credibility.
Monitoring answers the question: “What is wrong?” It uses predefined metrics, dashboards, and thresholds to detect known problems. You set up alerts for CPU > 80%, error rate > 1%, or response time > 500ms. When these thresholds trigger, you know something is broken. Monitoring works well for predictable failure modes.
Observability answers: “Why is it wrong?” It empowers engineers to ask arbitrary questions about system behavior without needing to anticipate those questions during instrumentation design. Rather than relying on predefined dashboards, observability enables exploratory analysis. For example, during an outage affecting 0.001% of requests from a specific geographic region using a particular mobile app version, observability lets you drill down into that slice of data without having pre-built a dashboard for every possible combination.
In practice: Monitoring is a subset of observability. Your monitoring alerts detect problems; observability tools provide the investigative power to diagnose them.
Why Observability Matters for Distributed Systems
In monolithic applications, a stack trace and a local debugger often suffice. In microservices:
- A single user request traverses dozens of services across multiple regions
- Failures can cascade silently (service A fails but returns cached data)
- Performance degradation may be caused by dependencies you don’t directly control
- Third-party APIs and edge functions introduce external variables
- Timing issues only manifest under production load
40% of organizations report that distributed system complexity directly contributes to major outages. Without observability, you’re essentially flying blind—you know something crashed, but tracing the root cause across ten services and three cloud providers becomes a nightmare of manual log tailing and hypothesis-driven debugging.
The Three Pillars of Observability
Each pillar provides a different lens on system behavior. Together, they form a complete picture.
Pillar 1: Logs — Detailed Event Records
What they are: Timestamped records of discrete events, typically with structured context (request ID, user ID, operation, result, duration).
Strengths:
- High cardinality: capture arbitrary data (user IDs, session details, query parameters)
- Debuggability: exact error messages, stack traces, payload details
- Compliance: audit trails with full context
- Human-readable: engineers can quickly scan and search
Limitations:
- Overwhelming at scale: millions of events per second become a wall of text
- Cost: storing all logs can be expensive (though sampling/filtering helps)
- Lack of structure: unstructured logs are nearly impossible to query
- Incomplete context: without correlation IDs, linking logs across services is manual and slow
When to use: Root cause analysis, compliance auditing, understanding exactly what went wrong in a specific request.
Pillar 2: Metrics — Aggregated Quantitative Data
What they are: Time-series data points: counters (HTTP requests), gauges (CPU usage), histograms (latency percentiles).
Strengths:
- Efficiency: minimal storage (1 data point per minute per metric)
- Real-time visibility: dashboards show system health at a glance
- Trend detection: identify patterns over hours/days
- Cost-effective: scales to billions of events
- Alerting: set thresholds and trigger automated responses
Limitations:
- Low cardinality: hard to slice by (customer_id, region, app_version) without exploding cardinality
- Loss of detail: histograms with percentiles lose individual request information
- Blindness to outliers: aggregates smooth over anomalies, so even a healthy-looking 99th percentile can hide a small set of badly affected requests
When to use: Dashboards, alerts, capacity planning, understanding aggregate system behavior.
Pillar 3: Traces — Request Flow Across Services
What they are: Hierarchical records of a request’s journey, composed of spans (one span per operation). Each span captures latency, status, attributes.
Strengths:
- End-to-end visibility: see exactly where latency is spent
- Service dependency mapping: automatic architecture discovery
- Causality: understand how services interact
- Performance bottleneck isolation: find the slow hop in a chain
Limitations:
- Cost: storing full traces for every request is expensive; sampling becomes mandatory
- Cardinality explosion: tracing with high-cardinality baggage can overwhelm collectors
- Sampling bias: if you sample 1% of traces, you might miss rare errors
- Delayed decisions: tail-based sampling must buffer spans until the trace completes, adding memory and processing overhead in the collector
When to use: Debugging latency issues, understanding service dependencies, pinpointing bottlenecks.
How These Pillars Work Together
The most powerful observability practice is trace-first debugging: start with a trace to see the high-level request flow, identify the slow or failing span, then use that span’s ID to find all associated logs for deep investigation.
Example workflow:
- Alert fires: Error rate spike detected (monitoring)
- Metrics confirm: P99 latency up 10x in payment-service
- Trace shows: Request spends 8s in database query (slow span identified)
- Logs reveal: the query is `SELECT * FROM orders` — a missing WHERE clause introduced by a bad deployment
Java-Focused Observability
Logging Best Practices in Java
Structured Logging (Not Just Text)
The Problem with Unstructured Logs:
```
ERROR: Payment processing failed for user 12345
java.lang.NullPointerException
    at PaymentService.charge(PaymentService.java:125)
```
This is human-readable but unsearchable. You can’t ask “show me all errors for payment-service grouped by error type” without parsing strings.
Structured Logging Solution:
{ "timestamp": "2026-01-27T09:15:30Z", "level": "ERROR", "service": "payment-service", "trace_id": "550e8400-e29b-41d4-a716-446655440000", "span_id": "7a085853-4b1e-4b8f-9c3e-5d2f1a8b9c7d", "correlation_id": "550e8400-e29b-41d4-a716-446655440000", "user_id": "12345", "event": "payment_charge_failed", "error_code": "PAYMENT_GATEWAY_TIMEOUT", "error_message": "NullPointerException at PaymentService.charge(PaymentService.java:125)", "operation": "charge", "amount_usd": 99.99, "payment_gateway": "stripe", "retry_count": 2, "duration_ms": 5000}
This is queryable: service="payment-service" AND error_code="PAYMENT_GATEWAY_TIMEOUT" AND retry_count > 1.
Implementation with SLF4J
SLF4J is the de facto standard logging facade in Java (abstraction over implementations like Logback, Log4j2).
```java
import java.math.BigDecimal;
import java.util.UUID;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

import static net.logstash.logback.argument.StructuredArguments.kv;

public class PaymentService {

    private static final Logger logger = LoggerFactory.getLogger(PaymentService.class);

    private final PaymentGateway gateway; // injected

    public PaymentService(PaymentGateway gateway) {
        this.gateway = gateway;
    }

    public PaymentResponse charge(String userId, BigDecimal amount) {
        String correlationId = UUID.randomUUID().toString();
        MDC.put("correlation_id", correlationId);
        MDC.put("user_id", userId);
        try {
            logger.info("payment_started {} {}", kv("amount_usd", amount), kv("gateway", "stripe"));

            PaymentResponse response = gateway.charge(amount);

            logger.info("payment_completed {} {}",
                    kv("response_code", response.getCode()),
                    kv("duration_ms", response.getDuration()));
            return response;
        } catch (PaymentGatewayException e) {
            // Pass the Throwable as the last argument so the stack trace is
            // captured as structured exception data, not a concatenated string
            logger.error("payment_charge_failed {} {}",
                    kv("error_code", e.getCode()),
                    kv("error_message", e.getMessage()),
                    e);
            throw new PaymentFailedException(e);
        } finally {
            MDC.clear(); // Always clear to avoid leaking context to other requests on this thread
        }
    }
}
```
MDC (Mapped Diagnostic Context) is a thread-local map that automatically includes values in every log statement within that thread. When you set MDC.put("correlation_id", id), that value appears in all logs until you clear it.
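Because MDC is thread-local, its values are lost when work hops to another thread (an `ExecutorService`, `CompletableFuture`, etc.). A minimal sketch of carrying the MDC across that boundary with the standard SLF4J copy methods (the executor and helper names here are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.slf4j.MDC;

public class MdcPropagation {

    private static final ExecutorService EXECUTOR = Executors.newFixedThreadPool(4);

    public static void submitWithMdc(Runnable task) {
        // Snapshot the caller's MDC (correlation_id, user_id, ...)
        Map<String, String> context = MDC.getCopyOfContextMap();
        EXECUTOR.submit(() -> {
            if (context != null) {
                MDC.setContextMap(context); // restore on the worker thread
            }
            try {
                task.run();
            } finally {
                MDC.clear(); // don't leak context into the pooled thread
            }
        });
    }
}
```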
Correlation IDs: Connecting the Dots
A correlation ID is a unique identifier that flows through every service involved in a request. It enables you to find all logs/traces for a single user request across the entire distributed system.
Generation & Propagation:
- The API Gateway generates `X-Correlation-ID` or inherits it from the request header
- The gateway includes it in response headers (so clients can reference the request in support tickets)
- It is passed to all downstream services via HTTP headers
- For async messaging (Kafka, SQS), it is included in message headers
- Each service adds its own `span_id` but preserves the `trace_id`
```java
// Filter/Interceptor at the API boundary
@Component
public class CorrelationIdFilter implements Filter {

    private static final String CORRELATION_ID_HEADER = "X-Correlation-ID";

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest httpRequest = (HttpServletRequest) request;
        HttpServletResponse httpResponse = (HttpServletResponse) response;

        String correlationId = httpRequest.getHeader(CORRELATION_ID_HEADER);
        if (correlationId == null) {
            correlationId = UUID.randomUUID().toString();
        }

        MDC.put("correlation_id", correlationId);
        httpResponse.setHeader(CORRELATION_ID_HEADER, correlationId);
        try {
            chain.doFilter(request, response);
        } finally {
            MDC.clear();
        }
    }
}

// When calling a downstream service, forward the header
RestTemplate restTemplate = new RestTemplate();
HttpHeaders headers = new HttpHeaders();
headers.set("X-Correlation-ID", MDC.get("correlation_id"));
HttpEntity<String> entity = new HttpEntity<>(headers);
restTemplate.exchange(url, HttpMethod.GET, entity, String.class);
```
Logging Configuration (Logback)
Logback is the recommended implementation (faster, more flexible than Log4j, better than JUL).
```xml
<!-- logback.xml -->
<configuration>
  <!-- Property definitions for the environment -->
  <springProperty name="spring.application.name" source="spring.application.name"/>
  <property name="LOG_PATTERN"
            value="%d{ISO8601} [%thread] %-5level %logger{36} [%X{correlation_id}] - %msg%n"/>

  <!-- JSON structured logging -->
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder">
      <customFields>{"service":"${spring.application.name}"}</customFields>
      <fieldNames>
        <timestamp>timestamp</timestamp>
        <version>[ignore]</version>
        <level>level</level>
        <loggerName>logger</loggerName>
        <message>message</message>
        <thread>thread</thread>
      </fieldNames>
    </encoder>
  </appender>

  <!-- Root logger -->
  <root level="INFO">
    <appender-ref ref="STDOUT"/>
  </root>

  <!-- Suppress noisy dependencies -->
  <logger name="org.springframework" level="WARN"/>
  <logger name="org.hibernate" level="WARN"/>
</configuration>
```
Anti-patterns to avoid:
- Logging at `DEBUG` in production (cost without insight)
- Including sensitive data (passwords, credit cards, PII)
- Relying on logs alone for debugging (trace tells the full story)
- Logging stack traces as concatenated strings — pass the Throwable to the logger so it is captured as structured exception data (see the example below)
- Missing context (correlation IDs, user IDs)
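For the stack-trace anti-pattern above, the fix is to hand the Throwable itself to the logger instead of folding it into the message; a minimal sketch (the exception variable is illustrative):

```java
// BAD: the stack trace becomes an unparseable string inside the message
logger.error("payment_charge_failed: " + e.toString());

// GOOD: SLF4J treats a trailing Throwable as structured exception data,
// and a JSON encoder emits it as a dedicated stack_trace field
logger.error("payment_charge_failed error_code={}", e.getCode(), e);
```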
JVM-Level Observability
The JVM itself is a complex distributed system. Understanding GC, heap, threads, and CPU is essential for debugging performance issues.
Memory Observability
Heap Metrics (from Micrometer):
```
jvm.memory.used{area="heap",id="G1 Survivor Space"} = 5MB
jvm.memory.committed{area="heap"} = 256MB
jvm.memory.max{area="heap"} = 4096MB
jvm.memory.usage{area="heap"} = 0.06   (6% of max)
```
What this tells you:
- If `used` approaches `max` → memory pressure, likely to trigger GC
- If `committed < max` → the JVM still has room to grow (not hitting the ceiling yet)
- If `usage` > 0.85 → risk of a full GC soon (see the heap read-out sketch below)
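If Micrometer isn't wired up yet, the same heap numbers can be read directly from the JVM; a minimal sketch using the standard `MemoryMXBean`:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapCheck {
    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();

        // Same semantics as jvm.memory.used / committed / max for the heap area
        long usedMb = heap.getUsed() / (1024 * 1024);
        long committedMb = heap.getCommitted() / (1024 * 1024);
        long maxMb = heap.getMax() / (1024 * 1024);       // getMax() can be -1 if undefined
        double usage = (double) heap.getUsed() / heap.getMax();

        System.out.printf("heap used=%dMB committed=%dMB max=%dMB usage=%.2f%n",
                usedMb, committedMb, maxMb, usage);
    }
}
```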
Garbage Collection
GC Metrics (from Micrometer):
```
jvm.gc.pause{action="end of major GC",cause="G1 Evacuation Pause"} = histogram
  - count: number of GC pauses
  - sum:   total pause time
  - max:   longest pause
jvm.gc.memory.allocated = counter (bytes allocated since start)
jvm.gc.memory.promoted  = counter (bytes moved from young to old gen)
```
What this tells you:
- GC pause time → latency spikes (user-visible delays)
- Allocation rate → memory pressure (high allocation = frequent GC)
- Promotion rate → objects living too long (tune young gen size)
Healthy patterns:
- GC pauses < 100ms (minor), < 500ms (major)
- Major GC every 10+ minutes (not constantly)
- Allocation rate plateaus (not constantly increasing)
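These GC numbers can also be sampled directly from the JVM when a metrics backend isn't available; a minimal sketch using the standard `GarbageCollectorMXBean` (collector names vary by GC algorithm):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcCheck {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();   // number of collections so far
            long timeMs = gc.getCollectionTime();   // total pause time in ms
            double avgPauseMs = count > 0 ? (double) timeMs / count : 0.0;

            // e.g. "G1 Young Generation" / "G1 Old Generation" for the G1 collector
            System.out.printf("%s: collections=%d totalPause=%dms avgPause=%.1fms%n",
                    gc.getName(), count, timeMs, avgPauseMs);
        }
    }
}
```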
Thread Observability
Thread Metrics:
```
jvm.threads.live = 42
jvm.threads.peak = 100
jvm.threads.daemon = 40
jvm.threads.deadlocked = 0
```
What this tells you:
- Thread creep: `live` threads growing over time = leak
- Deadlocks: `deadlocked > 0` = immediate action required (see the deadlock-check sketch below)
- Daemon threads: most threads should be daemon (background) threads
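The `deadlocked` figure maps to what the JVM's `ThreadMXBean` reports; a minimal deadlock-check sketch that could run on a schedule or behind a health endpoint:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockCheck {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        System.out.println("live=" + threads.getThreadCount()
                + " peak=" + threads.getPeakThreadCount()
                + " daemon=" + threads.getDaemonThreadCount());

        // Returns null when no threads are deadlocked
        long[] deadlocked = threads.findDeadlockedThreads();
        if (deadlocked != null) {
            for (ThreadInfo info : threads.getThreadInfo(deadlocked)) {
                System.err.println("DEADLOCKED: " + info.getThreadName()
                        + " waiting on " + info.getLockName());
            }
        }
    }
}
```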
CPU & System
```
process.cpu.usage = 0.45   (45% of available CPU)
system.cpu.usage  = 0.72   (72% of system CPU)
process.runtime.jvm.cpu.time = counter (nanoseconds)
```
What this tells you:
- If process CPU stays high after traffic drops → memory leak or stuck thread
- If system CPU >> process CPU → other processes consuming resources
- CPU usage should scale with load; flat usage after traffic drop = problem
Implementing JVM Observability with Micrometer
Micrometer auto-instruments the JVM; you just enable it:
```java
// Spring Boot (automatic)
@Configuration
public class ObservabilityConfig {
    // Auto-configured when management.endpoints.web.exposure.include=prometheus
}

// Gradle dependencies
// implementation("io.micrometer:micrometer-registry-prometheus")
// implementation("io.micrometer:micrometer-core")

// Actuator endpoints expose metrics:
//   GET /actuator/prometheus              → Prometheus format
//   GET /actuator/metrics                 → JSON metadata
//   GET /actuator/metrics/jvm.memory.used → a specific metric
```
Dashboards to build:
- GC pause time trend (alert if > 500ms)
- Heap usage trend (alert if > 85%)
- Thread count trend (alert on spike)
- CPU usage (correlate with traffic)
Common Java Observability Libraries
Logback vs Log4j2
| Feature | Logback | Log4j2 | Winner |
|---|---|---|---|
| Performance | Good | Better (async) | Log4j2 |
| Async support | Limited | First-class | Log4j2 |
| Garbage-free mode | No | Yes | Log4j2 |
| Configuration reload | Yes | Yes | Tie |
| Groovy config | Yes | No | Logback |
| Lambda support | No | Yes (filters) | Log4j2 |
| Spring Boot default | Yes | No | Logback |
Recommendation: Use Logback for simplicity, Log4j2 for high-throughput/low-latency systems.
Micrometer (Metrics)
Micrometer is the abstraction layer for metrics (like SLF4J for logging). It auto-instruments JVM, HTTP, database, and custom metrics, then exports to Prometheus, CloudWatch, Datadog, New Relic, etc.
```java
public class PaymentService {

    private final MeterRegistry meterRegistry;

    public PaymentService(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    public void charge(BigDecimal amount) {
        Timer.Sample sample = Timer.start(meterRegistry);
        String status = "success";
        try {
            // charge logic
            meterRegistry.counter("payments.successful", "currency", "USD").increment();
        } catch (Exception e) {
            status = "failure";
            meterRegistry.counter("payments.failed",
                    "error_type", e.getClass().getSimpleName()).increment();
            throw e;
        } finally {
            // Record latency with the actual outcome, not a hard-coded "success" tag
            sample.stop(Timer.builder("payment.latency")
                    .tag("status", status)
                    .register(meterRegistry));
        }
    }
}
```
OpenTelemetry (Tracing & Metrics)
OpenTelemetry is the industry standard for observability instrumentation. It’s vendor-agnostic and auto-instruments common libraries.
```java
// Auto-instrumentation agent (zero code changes):
//   java -javaagent:opentelemetry-javaagent.jar -jar app.jar

// Manual instrumentation
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class OrderService {

    private final Tracer tracer;

    public OrderService(Tracer tracer) {
        this.tracer = tracer;
    }

    public void processOrder(String orderId) {
        // Span is not AutoCloseable: make it current via a Scope and end it explicitly
        Span span = tracer.spanBuilder("process_order")
                .setAttribute("order.id", orderId)
                .startSpan();
        try (Scope scope = span.makeCurrent()) {
            // Processing logic
            // Nested spans are created automatically by instrumented libraries
        } finally {
            span.end();
        }
    }
}
```
Anti-Patterns (What NOT to Do)
Over-logging: Logging every method entry/exit at INFO level drowns out signal. Use DEBUG for verbose tracing.
```java
// BAD: too noisy
logger.info("Entering charge method");
logger.info("Retrieved gateway");
logger.info("Calling gateway.charge()");
logger.info("Exiting charge method");

// GOOD: log decisions and outcomes
logger.info("payment_initiated user_id={} amount={}", userId, amount);
logger.info("payment_completed duration_ms={}", duration);
```
Missing context: Logs without correlation IDs are isolated incidents, not connected stories.
Log-only debugging: Chasing logs instead of using traces. Traces show causality; logs show detail.
No sampling: Attempting to log/trace everything at high scale causes cost explosion and performance degradation. Sample early, sample often.
AWS Observability
CloudWatch (Logs, Metrics, Alarms)
AWS CloudWatch is the native observability service. While it has limitations compared to specialized tools (Datadog, New Relic), it has deep integration with AWS services.
CloudWatch Logs
Strengths:
- Native integration with EC2, Lambda, ECS, RDS
- Logs Insights for ad-hoc querying (SQL-like syntax)
- Automatic parsing of JSON logs
- Log retention policies and expiration
Limitations:
- Cost: ~$0.50 per GB ingested (expensive at scale)
- Query speed is limited (not near-real-time like Elasticsearch)
- Cardinality explosion: too many high-cardinality fields drive up cost
- Weak full-text search compared to dedicated log stores (rely on structured fields)
CloudWatch Metrics
Built-in metrics (auto-collected):
- Lambda: Invocations, Duration, Errors, ConcurrentExecutions
- ECS: CPU, Memory, Network
- RDS: DatabaseConnections, ReadLatency, WriteLatency
- API Gateway: Count, Latency, 4xx/5xx Errors
Custom metrics (application-emitted):
```java
PutMetricDataRequest request = new PutMetricDataRequest()
    .withNamespace("MyApp/Payment")
    .withMetricData(new MetricDatum()
        .withMetricName("CheckoutLatency")
        .withValue(latencyMs)
        .withUnit(StandardUnit.Milliseconds)
        .withDimensions(
            new Dimension().withName("Environment").withValue("production"),
            new Dimension().withName("Region").withValue("us-east-1")
        ));

cloudWatch.putMetricData(request);
```
Cost trap: High-cardinality dimensions (user_id, request_id) explode your bill. Use Embedded Metric Format (EMF) to log high-cardinality data instead.
CloudWatch Alarms
Alarms trigger on metric thresholds or log pattern matching.
```java
// Create an alarm in code
PutMetricAlarmRequest request = new PutMetricAlarmRequest()
    .withAlarmName("HighErrorRate")
    .withMetricName("Errors")
    .withNamespace("MyApp")
    .withStatistic(Statistic.Average)
    .withPeriod(60)                       // evaluation window: 60 seconds
    .withEvaluationPeriods(1)             // number of periods that must breach
    .withThreshold(0.05)                  // 5% error rate
    .withComparisonOperator(ComparisonOperator.GreaterThanThreshold)
    .withTreatMissingData("notBreaching"); // don't alarm if no data

cloudWatch.putMetricAlarm(request);
```
X-Ray (Distributed Tracing)
X-Ray is AWS’s distributed tracing service. It’s simpler than Jaeger but has AWS-specific limitations.
How X-Ray Works
- Instrumentation: Java agent or SDK captures spans
- Sampling: the default rule records the first request each second plus 5% of additional requests (configurable)
- Collection: Data sent to X-Ray service
- Service Map: Visual representation of dependencies
- Analysis: Drill into traces, see latency by service
X-Ray with Java
```xml
<!-- Maven dependency -->
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-xray-sdk-java</artifactId>
</dependency>
```

```java
// Auto-instrumentation via agent:
//   java -javaagent:/path/to/xray-agent.jar -jar app.jar

// Manual instrumentation
import com.amazonaws.xray.AWSXRay;
import com.amazonaws.xray.entities.Subsegment;

public class PaymentService {

    public void charge(String userId, BigDecimal amount) {
        // For Lambda, the segment is created automatically; add a subsegment for the gateway call
        Subsegment subsegment = AWSXRay.beginSubsegment("payment.gateway");
        try {
            subsegment.putAnnotation("user_id", userId);   // indexed, queryable
            subsegment.putMetadata("amount", amount);      // stored with the trace, but not indexed

            // Call payment gateway
            gateway.charge(amount);
        } catch (Exception e) {
            subsegment.addException(e);
            throw e;
        } finally {
            AWSXRay.endSubsegment();
        }
    }
}
```
X-Ray Strengths & Gaps
Strengths:
- Tight integration with Lambda, API Gateway, ECS
- Service map auto-discovery
- Error rate and latency by service visible immediately
- No extra infrastructure to manage
Gaps:
- Sampling decision is head-based (can’t capture all errors)
- Limited retention (30 days by default)
- Querying less flexible than Jaeger
- High cost per trace at scale (financial services often disable it)
Observability in ECS/Fargate
CloudWatch Container Insights
Enable Container Insights on your ECS cluster for automatic metric collection:
AWS console → ECS → Cluster → [cluster name] → Monitor → Enable Container Insights
Metrics collected:
- Cluster CPU, memory utilization
- Per-task CPU, memory
- Container CPU, memory
- Network I/O
Logging
All stdout/stderr from containers automatically goes to CloudWatch Logs if you configure the log driver:
{ "containerDefinitions": [ { "name": "my-app", "logConfiguration": { "logDriver": "awslogs", "options": { "awslogs-group": "/ecs/my-app", "awslogs-region": "us-east-1", "awslogs-stream-prefix": "ecs" } } } ]}
Observability in Lambda
Native CloudWatch Integration
Lambda automatically logs to CloudWatch Logs (stdout/stderr). All invocations visible.
X-Ray
Enable X-Ray tracing in Lambda:
AWS console → Lambda → [function name] → Configuration → X-Ray → Active tracing
Or via code:
```java
import com.amazonaws.xray.AWSXRay;
import com.amazonaws.xray.interceptors.TracingInterceptor;
import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

public class LambdaHandler implements RequestHandler<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent> {

    // SDK v2 clients are traced by registering the X-Ray TracingInterceptor
    private final S3Client s3 = S3Client.builder()
            .overrideConfiguration(ClientOverrideConfiguration.builder()
                    .addExecutionInterceptor(new TracingInterceptor())
                    .build())
            .build();

    @Override
    public APIGatewayProxyResponseEvent handleRequest(APIGatewayProxyRequestEvent event, Context context) {
        // Segment is auto-created by Lambda; subsegments are added for downstream calls
        AWSXRay.getCurrentSegment()
                .putAnnotation("request_id", event.getRequestContext().getRequestId());

        // Downstream call is auto-traced via the interceptor above
        s3.getObject(GetObjectRequest.builder().bucket("my-bucket").key("file").build());

        return new APIGatewayProxyResponseEvent()
                .withStatusCode(200)
                .withBody("{\"message\": \"success\"}");
    }
}
```
Embedded Metric Format (EMF)
EMF is a trick to emit custom metrics without CloudWatch API calls. You log JSON, CloudWatch automatically extracts metrics.
```java
// Instead of this (an API call per data point — costs money and adds latency):
cloudWatch.putMetricData(new PutMetricDataRequest()
    .withNamespace("MyApp")
    .withMetricData(new MetricDatum()
        .withMetricName("OrderProcessing")
        .withValue(latency)));

// Do this (cheaper): print one EMF-formatted JSON line; CloudWatch extracts the metric from it
System.out.println(
    "{\"_aws\":{\"Timestamp\":" + System.currentTimeMillis() + ","
        + "\"CloudWatchMetrics\":[{\"Namespace\":\"MyApp\",\"Dimensions\":[[\"Service\"]],"
        + "\"Metrics\":[{\"Name\":\"OrderProcessing\",\"Unit\":\"Milliseconds\"}]}]},"
        + "\"Service\":\"order-service\","
        + "\"OrderProcessing\":" + latency + ","
        + "\"order_id\":\"123\",\"user_id\":\"456\"}");
```
CloudWatch automatically parses the EMF wrapper and extracts the metric, but the high-cardinality fields (order_id, user_id) remain in logs, not metrics.
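Rather than hand-building the EMF JSON, the aws-embedded-metrics library for Java can emit it for you; a minimal sketch, assuming that library is on the classpath (metric, dimension, and property names here are illustrative):

```java
import software.amazon.cloudwatchlogs.emf.logger.MetricsLogger;
import software.amazon.cloudwatchlogs.emf.model.DimensionSet;
import software.amazon.cloudwatchlogs.emf.model.Unit;

public class OrderMetrics {

    public void recordOrderProcessed(String orderId, String userId, long latencyMs) {
        MetricsLogger metrics = new MetricsLogger();
        metrics.setNamespace("MyApp");
        metrics.putDimensions(DimensionSet.of("Service", "order-service"));

        // Low-cardinality value extracted as a CloudWatch metric
        metrics.putMetric("OrderProcessing", latencyMs, Unit.MILLISECONDS);

        // High-cardinality values stay as log properties (searchable, not metrics)
        metrics.putProperty("order_id", orderId);
        metrics.putProperty("user_id", userId);

        metrics.flush(); // writes one EMF-formatted JSON line
    }
}
```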
Distributed Tracing & Context Propagation
How Traces Flow Across Services
A trace is a directed acyclic graph (DAG) of spans. Each span represents one operation.
```
User Request
└─ API Gateway                (span 1)
   ├─ Payment Service         (span 2, child of span 1)
   │  ├─ Card Validator       (span 3, child of span 2)
   │  └─ Payment Gateway      (span 4, child of span 2, parallel)
   ├─ Order Service           (span 5, child of span 1)
   │  └─ Database Query       (span 6, child of span 5)
   └─ Notification Service    (span 7, child of span 1, async via SQS)
```
Each span has:
- trace_id: shared across entire request flow (connects all spans)
- span_id: unique to this span
- parent_span_id: points to parent (creates hierarchy)
- baggage: key-value pairs propagated downstream (tenant_id, user_id, etc.)
Trace IDs, Span IDs, Baggage
Trace ID (128-bit UUID):
trace_id: 550e8400-e29b-41d4-a716-446655440000
Generated at entry point (API Gateway, message queue consumer) and propagated to all downstream services.
Span ID (64-bit random):
```
span_id:        7a0858534b1e4b8f   (this operation)
parent_span_id: 9c3e5d2f1a8b9c7d   (the parent operation)
```
Baggage (custom context):
```json
{
  "tenant_id": "acme-corp",
  "user_id": "user-12345",
  "request_source": "mobile-app",
  "feature_flags": {"new_checkout": "enabled"}
}
```
Baggage is propagated to all downstream services but should be used sparingly (every field = extra header size).
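With OpenTelemetry, baggage is attached and read through the Baggage API; a minimal sketch (keys, values, and the `process()` helper are illustrative):

```java
import io.opentelemetry.api.baggage.Baggage;
import io.opentelemetry.context.Scope;

public class BaggageExample {

    public void handleRequest(String tenantId, String userId) {
        // Attach baggage to the current context; configured propagators send it downstream
        Baggage baggage = Baggage.current().toBuilder()
                .put("tenant_id", tenantId)
                .put("user_id", userId)
                .build();

        try (Scope scope = baggage.makeCurrent()) {
            process();
        }
    }

    private void process() {
        // Any downstream code (or service, after propagation) can read it back
        String tenantId = Baggage.current().getEntryValue("tenant_id");
        System.out.println("processing for tenant " + tenantId);
    }
}
```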
Asynchronous Flows (Kafka, SQS)
The challenge with queues: producer and consumer are decoupled in time. How do you maintain trace continuity?
Kafka Example
Producer (injects trace context into headers):
```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;
import io.opentelemetry.context.propagation.TextMapPropagator;

public class OrderProducer {

    private final Tracer tracer;
    private final TextMapPropagator propagator;
    private final KafkaProducer<String, String> producer;

    public void publishOrder(Order order) {
        Span span = tracer.spanBuilder("publish_order")
                .setAttribute("order.id", order.getId())
                .startSpan();
        try (Scope scope = span.makeCurrent()) {
            // Inject the current trace context into a plain map...
            Map<String, String> headers = new HashMap<>();
            propagator.inject(Context.current(), headers,
                    (carrier, key, value) -> carrier.put(key, value));

            // ...then copy it into Kafka record headers
            Headers kafkaHeaders = new RecordHeaders();
            headers.forEach((k, v) -> kafkaHeaders.add(k, v.getBytes(StandardCharsets.UTF_8)));

            // Send with trace context attached
            producer.send(new ProducerRecord<>(
                    "orders", 0, order.getId(), JsonMapper.toJson(order), kafkaHeaders));
        } finally {
            span.end();
        }
    }
}
```
Consumer (extracts and continues trace):
```java
public class OrderConsumer {

    private final Tracer tracer;
    private final TextMapPropagator propagator;

    // TextMapGetter has two methods, so it cannot be a lambda
    private static final TextMapGetter<Headers> GETTER = new TextMapGetter<>() {
        @Override
        public Iterable<String> keys(Headers carrier) {
            List<String> keys = new ArrayList<>();
            carrier.forEach(h -> keys.add(h.key()));
            return keys;
        }

        @Override
        public String get(Headers carrier, String key) {
            Header header = carrier.lastHeader(key);
            return header == null ? null : new String(header.value(), StandardCharsets.UTF_8);
        }
    };

    @KafkaListener(topics = "orders", groupId = "order-processor")
    public void consume(ConsumerRecord<String, String> record) {
        // Extract the trace context injected by the producer
        Context extractedContext = propagator.extract(Context.current(), record.headers(), GETTER);

        // Create a consumer span that continues the producer's trace
        Span span = tracer.spanBuilder("process_order")
                .setParent(extractedContext)
                .setAttribute("messaging.system", "kafka")
                .setAttribute("messaging.destination", "orders")
                .startSpan();
        try (Scope scope = span.makeCurrent()) {
            Order order = JsonMapper.fromJson(record.value(), Order.class);
            processOrder(order);
        } finally {
            span.end();
        }
    }
}
```
SQS Example
SQS is similar, but inject/extract from message attributes:
```java
// Producer: inject the trace context into SQS message attributes
Map<String, String> traceContext = new HashMap<>();
propagator.inject(Context.current(), traceContext, Map::put);

Map<String, MessageAttributeValue> attributes = new HashMap<>();
traceContext.forEach((k, v) -> attributes.put(k,
        new MessageAttributeValue().withDataType("String").withStringValue(v)));

SendMessageRequest request = new SendMessageRequest()
        .withQueueUrl(queueUrl)
        .withMessageBody(JsonMapper.toJson(order))
        .withMessageAttributes(attributes);   // trace context travels here
sqs.sendMessage(request);

// Consumer: extract the context and continue the trace
Message message = sqs.receiveMessage(new ReceiveMessageRequest(queueUrl)
        .withMessageAttributeNames("All")).getMessages().get(0);
Context extractedContext = propagator.extract(Context.current(),
        message.getMessageAttributes(), ATTRIBUTE_GETTER /* TextMapGetter over the attribute map */);

Span span = tracer.spanBuilder("process_sqs_message")
        .setParent(extractedContext)
        .startSpan();
try (Scope scope = span.makeCurrent()) {
    Order order = JsonMapper.fromJson(message.getBody(), Order.class);
    processOrder(order);
} finally {
    span.end();
}
```
Sampling Strategies and Their Risks
Sampling reduces the volume of traces stored, keeping costs down. But sampling introduces blindspots.
Head-Based Sampling
Decision made at the start of the request (head of the trace).
```java
// Keep ~10% of traces. Sampling is configured on the SDK's tracer provider,
// not per span; ParentBased keeps the parent's decision for child spans.
SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
        .setSampler(Sampler.parentBased(Sampler.traceIdRatioBased(0.1)))
        .build();
```
Pros:
- Simple to implement (no state needed)
- Efficient (don’t collect data for dropped traces)
- Predictable cost
- Can be done at any point in pipeline
Cons:
- Can’t sample based on outcome (can’t guarantee all errors are captured)
- High-cardinality dimensions cause cost spikes anyway
Tail-Based Sampling
Decision made at the end of the trace (after all spans collected).
```
// Typical tail-sampling policies (requires the OpenTelemetry Collector):
//   - keep all traces with errors
//   - keep all slow traces
//   - keep 5% of successful fast traces
```
Pros:
- Can ensure 100% of errors are sampled
- Smart policies: expensive = always sample
- Better observability for failures
Cons:
- More complex (need collector)
- Higher latency (must wait for full trace)
- Higher compute cost in collector
Hybrid Approach (Recommended)
Combine both:
- Head sampling: Drop obviously unimportant traces (health checks) at 1% rate
- Tail sampling: Use collector to ensure all errors and slow traces (>1s latency) are kept
```yaml
# opentelemetry-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: error-traces
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 1000
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

exporters:
  jaeger:
    endpoint: jaeger-collector:14250

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [jaeger]
```
System Design Perspective
Designing Systems For Observability
Observability isn’t an afterthought bolted onto a system. Great systems are designed with observability built in from day one.
Principle 1: Instrumenting the Critical Path
Don’t instrument everything. Focus on what users care about.
Example: E-commerce checkout
Critical path:
- User clicks “Checkout” (frontend)
- Cart validation (cart-service)
- Payment processing (payment-service)
- Order persistence (order-service)
- Notification (email-service via async)
Instrument these deeply. Instrument admin dashboards lightly.
Principle 2: Correlation IDs Everywhere
Every request, every message, every async job must have a correlation ID. This is non-negotiable.
Principle 3: Structured Data From the Start
Log as JSON, not free-form strings. Define a schema early and enforce it.
```json
// Schema (team agreement)
{
  "timestamp": "ISO8601",
  "service": "string",
  "trace_id": "UUID",
  "span_id": "UUID",
  "level": "FATAL|ERROR|WARN|INFO|DEBUG|TRACE",
  "event": "snake_case_event_name",
  "user_id": "string (optional)",
  "duration_ms": "integer (optional)",
  "error": {
    "code": "string",
    "message": "string",
    "stack_trace": "string (optional)"
  }
}
```
Principle 4: Observable State Representation
Design systems so you can ask questions about state without code changes.
Bad design:
- State is private; must add logging to debug
- Errors silently consumed and retried
- Retry logic invisible
Good design:
- State machine explicit and observable (CREATE → PENDING → PROCESSING → COMPLETED)
- Failures recorded with reason codes
- Retry attempts logged with backoff strategy
```java
public enum OrderStatus {
    CREATED, PAYMENT_PENDING, PAYMENT_COMPLETED, PROCESSING,
    SHIPPED, DELIVERED, FAILED, CANCELLED
}

public class Order {
    private OrderStatus status;
    private LocalDateTime statusChangedAt;
    private String statusChangeReason;  // why did we change status?
    private int retryCount;
    private String lastError;
}
```
SLIs, SLOs, SLAs (With Concrete Examples)
Definitions
SLI (Service Level Indicator): The actual measurement.
- Example: “the proportion of requests that complete in under 500ms, measured over the last 30 days (currently 99.95%)”
SLO (Service Level Objective): Your internal goal (stricter than SLA).
- Example: “99.95% availability” (internal commitment)
SLA (Service Level Agreement): Your external commitment to customers (looser than SLO).
- Example: “99.9% availability” (customer-facing; includes buffer)
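To make these numbers concrete: a 30-day month has 43,200 minutes, so a 99.9% SLA allows roughly 0.001 × 43,200 ≈ 43 minutes of downtime, while a 99.95% internal SLO allows about 22 minutes; the difference of roughly 21 minutes is the buffer the team can spend before the customer-facing promise is at risk.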
Golden Signals
Four metrics capture 90% of system health:
1. Latency (how fast?)
```
p50: 300ms    (median request)
p95: 800ms    (95th percentile)
p99: 2000ms   (99th percentile)

Alert if p95 > 1 second
```
2. Traffic (how much load?)
```
Requests per second: 5000 RPS
Concurrent connections: 500
Database connections: 80/100

Alert if traffic unexpectedly drops (possible service failure)
```
3. Errors (what breaks?)
```
Error rate: 0.5% (5 errors per 1000 requests)
Error categories: TIMEOUT, INVALID_REQUEST, SERVER_ERROR

Alert if error rate > 1%
```
4. Saturation (how full?)
```
CPU: 60%            (room to grow)
Memory: 70%         (approaching limits)
Disk: 80%           (urgent scaling needed)
Queue depth: 1000 messages (capacity limit is 10000)

Alert if CPU > 85% or Memory > 90%
```
E-Commerce Example
Checkout Service SLO:
```
Availability: 99.95%
  - Measured as: % of requests that return 2xx or 4xx (not 5xx)
  - During: all business hours

Latency:
  - p50 < 200ms
  - p95 < 500ms
  - p99 < 1000ms
  - Measured on successful requests

Error budget:
  - SLA: 99.9%  (customer promise)
  - SLO: 99.95% (team target)
  - Buffer: 0.05% (~22 minutes per month)
  - Can "spend" this on deployments and experiments
```
Measuring:
```sql
-- SLI for availability
SELECT
  (COUNT(*) FILTER (WHERE status_code < 500)) * 1.0 / COUNT(*) AS availability
FROM http_requests
WHERE timestamp > now() - interval '30 days'
  AND service = 'checkout';

-- SLI for latency
SELECT
  service,
  PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY latency_ms) AS p50,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95,
  PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99
FROM http_requests
WHERE timestamp > now() - interval '30 days'
  AND service = 'checkout'
  AND status_code < 400
GROUP BY service;
```
Investment Banking Example
Trade Processing SLO:
```
Correctness: 100% (no exceptions)
  - Measured as: successful settlements / total trades
  - Every trade must settle or be rejected, never lost

Latency:
  - All trades settled by end of day (T+1 for equities, same-day for FX)
  - Exceptions logged for manual review

Auditability: 100%
  - Every trade logged with: trader ID, timestamp, approvals, result
  - Complete audit trail for regulatory compliance
```
Measuring:
```sql
-- Completeness
SELECT
  (COUNT(*) FILTER (WHERE final_status IN ('SETTLED', 'REJECTED'))) * 1.0 / COUNT(*) AS completeness
FROM trades
WHERE settlement_date = current_date;

-- Settlement time
SELECT AVG(settlement_timestamp - trade_timestamp) AS avg_settlement_latency
FROM trades
WHERE settlement_date = current_date;

-- Exception rate
SELECT COUNT(*) FILTER (WHERE exception IS NOT NULL) * 1.0 / COUNT(*) AS exception_rate
FROM trades
WHERE settlement_date = current_date;
```
Observability for Scalability and Resilience
Observability enables smart scaling and graceful degradation.
Example: Black Friday Traffic Surge
Observability enables:
- Early detection (metrics show traffic spike 30 min before peak)
- Rapid response (auto-scaling triggers based on observed load)
- Graceful degradation (disable non-critical services, keep checkout running)
- Post-incident learning (trace shows which service became the bottleneck)
Without observability:
- The traffic spike overwhelms the system before the team can respond
- Cascading failures (payment timeout → order service timeout → frontend timeout)
- No clue what broke
Domain-Specific Scenarios
E-Commerce: Checkout Latency & Inventory Mismatch
Checkout Latency
Critical issue: customers abandon carts if checkout takes > 3 seconds.
Observability instrumentation:
```java
public class CheckoutController {

    private final Tracer tracer;
    private final CartService cartService;
    private final PaymentService paymentService;
    private final OrderService orderService;

    @PostMapping("/checkout")
    public CheckoutResponse checkout(@RequestBody CheckoutRequest req) {
        Span span = tracer.spanBuilder("checkout_process")
                .setAttribute("user_id", req.getUserId())
                .setAttribute("cart_total", req.getTotal().doubleValue())
                .setAttribute("item_count", req.getItems().size())
                .startSpan();
        try (Scope scope = span.makeCurrent()) {
            // Step 1: Validate cart
            Span cartSpan = tracer.spanBuilder("validate_cart").startSpan();
            try {
                cartService.validate(req.getCartId());
            } finally {
                cartSpan.end(); // latency is recorded when the span ends
            }

            // Step 2: Process payment
            PaymentResult payment;
            Span paymentSpan = tracer.spanBuilder("process_payment")
                    .setAttribute("amount", req.getTotal().doubleValue())
                    .startSpan();
            try {
                payment = paymentService.charge(req);
                paymentSpan.setAttribute("gateway", payment.getGateway());
            } finally {
                paymentSpan.end();
            }

            // Step 3: Create order
            Order order;
            Span orderSpan = tracer.spanBuilder("create_order").startSpan();
            try {
                order = orderService.create(req);
                orderSpan.setAttribute("order.id", order.getId());
            } finally {
                orderSpan.end();
            }

            return new CheckoutResponse(order, payment);
        } finally {
            span.end();
        }
    }
}
```
Dashboard metrics:
- Checkout latency (p50, p95, p99) by region
- Latency breakdown: % time in payment vs inventory vs database
- Error rate by failure reason (payment_declined, inventory_unavailable, etc.)
- Conversion funnel: started checkout → completed
Alert thresholds:
- p95 latency > 1 second (means 5% of users experiencing >1s)
- Payment timeout > 3 per minute
Inventory Mismatch
Critical issue: system shows “in stock,” customer orders, then “sorry, we’re out of stock.”
Root causes:
- Async inventory updates lag (warehouse system delayed)
- Race condition (two orders placed simultaneously for last item)
- Manual inventory adjustment not synced
- Returns not reflected
Observability:
```java
public class InventoryService {

    private final Tracer tracer;
    private final InventoryRepository repo;
    private final EventPublisher events;

    public InventoryReservation reserve(String sku, int quantity) {
        Span span = tracer.spanBuilder("inventory_reserve")
                .setAttribute("sku", sku)
                .setAttribute("quantity", quantity)
                .startSpan();
        try (Scope scope = span.makeCurrent()) {
            InventoryRow row = repo.findBySku(sku);

            // Record the "before" state on the span
            span.addEvent("inventory_check", Attributes.of(
                    AttributeKey.longKey("available"), row.getAvailable(),
                    AttributeKey.longKey("reserved"), row.getReserved()
            ));

            if (row.getAvailable() < quantity) {
                span.recordException(new OutOfStockException(sku));
                throw new OutOfStockException(sku);
            }

            // Update (atomic)
            InventoryReservation reservation = repo.reserve(sku, quantity);

            // Record the "after" state
            span.addEvent("reservation_created", Attributes.of(
                    AttributeKey.stringKey("reservation_id"), reservation.getId(),
                    AttributeKey.longKey("remaining"), row.getAvailable() - quantity
            ));

            // Publish event (async sync to external systems)
            events.publish(new InventoryReservedEvent(sku, quantity));

            return reservation;
        } finally {
            span.end();
        }
    }
}
```
Reconciliation job (daily):
```java
@Scheduled(cron = "0 0 2 * * *")  // 02:00 every day (Spring cron uses six fields, seconds first)
public void reconcileInventory() {
    Span span = tracer.spanBuilder("inventory_reconciliation").startSpan();
    try (Scope scope = span.makeCurrent()) {
        List<String> skus = repo.getAllSkus();
        int discrepancies = 0;

        for (String sku : skus) {
            long systemCount = repo.getCount(sku);
            long actualCount = warehouseApi.getActualCount(sku);  // source of truth

            if (systemCount != actualCount) {
                discrepancies++;
                logger.warn("inventory_mismatch sku={} system_count={} actual_count={} variance={}",
                        sku, systemCount, actualCount, actualCount - systemCount);

                // Record a metric for trending (note: a per-sku tag is itself high-cardinality)
                meterRegistry.counter("inventory.discrepancies", "sku", sku).increment();

                if (Math.abs(actualCount - systemCount) <= 5) {
                    // Auto-correct small differences
                    repo.update(sku, actualCount);
                } else {
                    // Alert for manual review
                    slack.notify("Inventory discrepancy for " + sku);
                }
            }
        }

        span.setAttribute("discrepancies_found", discrepancies);
    } finally {
        span.end();
    }
}
```
Dashboard:
- System inventory vs actual inventory (reconciliation variance)
- Out-of-stock errors: rate, by SKU, by region
- Reservation success rate
- Mismatch detection latency (how long before we notice?)
Investment Banking: Trade Processing & Reconciliation
Trade Processing
Critical: every trade must be recorded, settled, and auditable.
Observability:
```java
public class TradeProcessor {

    private final Tracer tracer;
    private final TradeRepository repo;
    private final SettlementService settlement;
    private final AuditLog auditLog;

    public Trade processIncomingTrade(IncomingTrade incoming) {
        String tradeId = incoming.getTradeId();

        Span span = tracer.spanBuilder("process_trade")
                .setAttribute("trade.id", tradeId)
                .setAttribute("instrument", incoming.getInstrument())
                .setAttribute("quantity", incoming.getQuantity())
                .setAttribute("price", incoming.getPrice())
                .setAttribute("counterparty", incoming.getCounterparty())
                .startSpan();
        try (Scope scope = span.makeCurrent()) {
            // Step 1: Validation
            Span validationSpan = tracer.spanBuilder("validate_trade").startSpan();
            try {
                validator.validate(incoming);
                validationSpan.addEvent("validation_passed");
            } finally {
                validationSpan.end();
            }

            // Step 2: Book trade
            Trade trade = new Trade(incoming);
            trade.setStatus(TradeStatus.BOOKED);
            trade.setBookingTime(Instant.now());
            repo.save(trade);

            // Step 3: Initiate settlement
            Span settlementSpan = tracer.spanBuilder("initiate_settlement").startSpan();
            try {
                SettlementInstruction instruction = settlement.initiate(trade);
                settlementSpan.setAttribute("settlement.instruction_id", instruction.getId());
            } finally {
                settlementSpan.end();
            }

            // Step 4: Audit logging (for regulatory compliance)
            auditLog.log(new AuditEntry()
                    .setAction("TRADE_BOOKED")
                    .setTradeId(tradeId)
                    .setTraderId(getCurrentTraderId())      // who booked it?
                    .setTimestamp(Instant.now())
                    .setDetails(incoming)
                    .setApprovals(Collections.emptyList())); // approvals if required

            return trade;
        } finally {
            span.end();
        }
    }

    public void reconcileSettlement() {
        Span span = tracer.spanBuilder("reconcile_settlement").startSpan();
        try (Scope scope = span.makeCurrent()) {
            List<Trade> unsettledTrades = repo.findByStatus(TradeStatus.PENDING_SETTLEMENT);

            for (Trade trade : unsettledTrades) {
                Span tradeSpan = tracer.spanBuilder("reconcile_trade")
                        .setAttribute("trade.id", trade.getId())
                        .startSpan();
                try {
                    // Check with the custodian (source of truth)
                    SettlementStatus custodianStatus = custodian.getStatus(trade.getId());
                    TradeStatus systemStatus = trade.getStatus();

                    if (!systemStatus.matches(custodianStatus)) {
                        // Reconciliation break!
                        tradeSpan.recordException(new ReconciliationBreakException(
                                trade.getId(), systemStatus, custodianStatus));

                        logger.error("reconciliation_break trade_id={} system_status={} custodian_status={}",
                                trade.getId(), systemStatus, custodianStatus);

                        auditLog.log(new AuditEntry()
                                .setAction("RECONCILIATION_BREAK")
                                .setTradeId(trade.getId())
                                .setTimestamp(Instant.now())
                                .setDetails(Map.of(
                                        "system_status", systemStatus.toString(),
                                        "custodian_status", custodianStatus.toString())));
                    } else {
                        tradeSpan.addEvent("reconciliation_passed");
                    }
                } finally {
                    tradeSpan.end();
                }
            }
        } finally {
            span.end();
        }
    }
}
```
Audit dashboard:
- Trades booked per day (volume trending)
- Settlement latency (T+1, T+2, etc.)
- Reconciliation breaks (rate, by reason)
- Audit trail queries (find all trades for trader X on date Y)
Incident Response & Debugging
How Observability Helps During Production Incidents
Timeline:
- T+0: Alert fires (metric spike detected)
- T+1 min: Dashboard shows error rate trending up
- T+3 min: Trace sampling captures a failing trace
- T+5 min: Engineer examines trace, identifies slow database query
- T+8 min: Logs for that trace context confirm query timeout
- T+10 min: Root cause identified: missing index on `orders.user_id`
- T+15 min: Index created, latency returns to baseline
- T+25 min: Postmortem started
Without observability:
- T+0: Alert fires
- T+5 min: “Something’s wrong, let’s check the application logs”
- T+20 min: Scrolling through 10M log lines, can’t find the issue
- T+45 min: Finally found relevant error, but which service caused it?
- T+90 min: Shotgun debugging, restarted random services
- T+120 min: Issue mysteriously resolved (probably not actually fixed)
Root Cause Analysis Using Logs, Metrics, and Traces
Scenario: Checkout conversion drops 50% on Black Friday at 3 PM.
Step 1: Metrics confirm issue
```
checkout_conversion_rate:     98%   → 48%     (drop detected)
payment_service_error_rate:   0.5%  → 15%
payment_service_latency_p99:  500ms → 5000ms
```
Step 2: Trace shows bottleneck
Sampled trace shows:
- API Gateway to Checkout Service: 50ms (normal)
- Checkout to Payment Service: 4900ms (SLOW!)
- Payment to Stripe: timeout after 5s
Step 3: Logs reveal root cause
{ "timestamp": "2026-01-27T15:03:45Z", "service": "payment-service", "event": "stripe_api_timeout", "http_status": 504, "duration_ms": 5000, "error": "context deadline exceeded"}
The payment service was making unoptimized calls to Stripe (no batching, no connection pooling).
Step 4: Identify fix
- Add exponential backoff (don’t hammer Stripe)
- Batch stripe calls
- Increase timeout slightly (temporary)
- Enable circuit breaker (fail fast instead of hanging)
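As a sketch of the circuit-breaker fix, here is what it might look like with Resilience4j (a common choice on the JVM; the class name, thresholds, and the Stripe call supplier are illustrative, not taken from the incident above):

```java
import java.time.Duration;
import java.util.function.Supplier;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

public class StripeCircuitBreaker {

    private final CircuitBreaker circuitBreaker;

    public StripeCircuitBreaker() {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)                         // open when >50% of calls fail
                .slowCallDurationThreshold(Duration.ofSeconds(2)) // treat calls over 2s as slow
                .waitDurationInOpenState(Duration.ofSeconds(30))  // probe again after 30s
                .build();
        this.circuitBreaker = CircuitBreakerRegistry.of(config).circuitBreaker("stripe");
    }

    public PaymentResult charge(Supplier<PaymentResult> stripeCall) {
        // When the breaker is open, this fails fast instead of hanging on a timeout
        return circuitBreaker.executeSupplier(stripeCall);
    }
}
```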
Postmortems and Learning Loops
Good postmortem structure:
```markdown
## Incident: Checkout Conversion Drop

**Timeline:**
- 15:03 UTC: Error rate spike detected
- 15:08 UTC: Team paged
- 15:15 UTC: Root cause identified (Stripe timeout)
- 15:28 UTC: Circuit breaker deployed, recovered

**Duration:** 25 minutes

**Root Cause:**
The payment service was synchronously calling Stripe's API without batching or
connection pooling. Black Friday traffic (10x normal) exhausted the payment
service's HTTP connection pool, causing timeouts.

**Contributing Factors:**
1. Load test only simulated 2x normal traffic (not enough)
2. No circuit breaker between payment and Stripe
3. Stripe's API performance not monitored (external blind spot)

**Immediate Actions:**
1. Enable circuit breaker (fail fast)
2. Implement exponential backoff

**Follow-up Actions:**
1. Add load test for 10x traffic (within 2 weeks)
2. Add Stripe API latency to dashboards
3. Implement request batching for Stripe (within sprint)

**Learning:**
External service degradation can cascade. Need circuit breakers for all external calls.
```
Common Questions on Observability
Q1: “Design an observability solution for a new microservices application.”
Structure your answer:
- Understand requirements (throughput, latency, criticality)
- Choose instrumentation (OpenTelemetry, auto-instrumentation)
- Define metrics (Golden Signals)
- Logging strategy (structured JSON, correlation IDs)
- Tracing strategy (head/tail sampling)
- Alerting (SLO-based)
- Dashboards (user journey, operational)
Example answer:
“I’d start with understanding the business criticality and scale. For an e-commerce platform:
Instrumentation:
- OpenTelemetry Java agent for auto-instrumentation (zero code changes initially)
- Manual spans for critical business logic (checkout, payment)
Metrics:
- Golden Signals: latency (p50/p95/p99), traffic (RPS), errors (rate by type), saturation (CPU, memory, queue depth)
- Business metrics: conversion rate, revenue per minute
Logging:
- Structured JSON with correlation IDs
- All logs must have: timestamp, service, trace_id, span_id, level, event, user_id
- Logback with JSON encoder
Tracing:
- AWS X-Ray for AWS-native tracing
- Sampling: head-based 10% + tail-based for all errors and slow traces (>1s)
Alerting:
- SLO-based alerts, not metric thresholds
- Example: alert if error budget for payment service burns > 2% per hour
- PagerDuty for critical, Slack for warnings
Dashboards:
- User journey dashboard (checkout start → completion)
- Operational dashboard (latency, errors by service)
- Resource utilization (CPU, memory, connections)
Scaling:
- Use CloudWatch Logs Insights for ad-hoc queries
- Plan to migrate to Datadog/New Relic as scale increases”
Q2: “How would you handle observability in a multi-cloud environment?”
Answer structure:
- Use vendor-agnostic standards (OpenTelemetry)
- Centralized collection (OpenTelemetry Collector)
- Dual-write to multiple backends for failover
Example:
“Multi-cloud observability requires decoupling from vendor APIs.
Instrumentation:
- OpenTelemetry everywhere (same Java agent whether deployed on AWS, GCP, or on-premises)
Collection:
- OpenTelemetry Collector deployed in each cloud (processes spans, metrics, logs)
- Collector configured to export to multiple backends
Backends:
- Primary: Datadog (multi-cloud support)
- Fallback: Splunk (self-hosted option)
- Each cloud-native tool (CloudWatch, GCP Monitoring) for governance
Configuration:
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:

exporters:
  datadog:
    api:
      key: ${DATADOG_API_KEY}
      site: datadoghq.com
  splunk_hec:
    token: ${SPLUNK_TOKEN}
    endpoint: https://splunk.example.com:8088
  awsxray:
    region: us-east-1

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog, splunk_hec, awsxray]
```
Advantages:
- Applications are cloud-agnostic
- Can switch backends without code changes
- Failover if one backend is down”
Q3: “What are red flags in observability that you watch for?”
Good red flags to mention:
- Logs without context: if logs lack correlation IDs, requests can't be traced across services
- Metric cardinality explosion: High cardinality dimensions (user_id, request_id) in metrics = cost spiral
- No sampling: Attempting to trace 100% of traffic = cost overrun and performance degradation
- Alerting fatigue: Team ignoring alerts (tuning has failed)
- Observability only in production: lower environments are blind, so instrumentation gets its first real test during a production incident
- No SLOs: Team debates what “healthy” means; no objective targets
- Logs as debugging tool: Team adding logs to debug instead of using traces
- Silent failures: Errors consumed without logging; invisible to observability
Follow-up: “As a Tech Lead, I’d establish observability requirements in Definition of Done:
- All services must emit logs, metrics, traces
- Code review must check for correlation IDs and structured logging
- Every service must have a dashboard before prod deployment”
How to Explain Observability Decisions Confidently
Technique: Use the “Why” Framework
Instead of: “We’re using OpenTelemetry and CloudWatch.”
Say: “We chose OpenTelemetry because it’s vendor-agnostic (we might switch to Datadog later), it auto-instruments most libraries (reducing engineering overhead), and it’s the CNCF standard (25+ companies contribute). For AWS, we’re using CloudWatch Logs and X-Ray because they’re native to Lambda (no extra agents), but we’re exporting to Datadog as our primary backend for richer querying.”
Structure:
- What we’re doing
- Why we chose it (requirements + trade-offs)
- How it works (brief architecture)
- When it’s the right choice (and when it’s not)
Architecture Diagrams (Textual)
Microservices with Full Observability
```
┌─────────────────────────────────────────────────────────────┐
│                      User Request Flow                      │
└─────────────────────────────────────────────────────────────┘

          [Load Balancer]   (X-Correlation-ID)
                 │
       ┌─────────┼─────────┐
       │         │         │
   [API GW]  [API GW]  [API GW]
       └─────────┼─────────┘
                 │
        [Checkout Service]             span: checkout_flow
                 │
       ┌─────────┼──────────────────────────┐
       │         │                          │
 [Cart Service] [Payment Service]     [Order Service]
   (span)       (span: payment)       (span: order_create)
                     │                      │
               [Stripe API]               [DB]
               (instrumented via          (logs query latency)
                OpenTelemetry)
                 │
     [Notification Service]   (async via Kafka, trace context in headers)
                 │
          [Email Service]

┌─────────────────────────────────────────────────────────────┐
│                    Observability Pipeline                   │
└─────────────────────────────────────────────────────────────┘

All Services
     │
     └─→ [OpenTelemetry Collector]
              │
              ├─→ Traces  ──→ [X-Ray / Jaeger]
              ├─→ Metrics ──→ [Prometheus / CloudWatch]
              └─→ Logs    ──→ [CloudWatch Logs / Elasticsearch]

  [CloudWatch]        [Datadog]         [PagerDuty]
       │                  │                  │
       └─→ [Dashboard] ←──┴──→ [Alerts] ──→ [On-call]
                                   │
                                   └─→ [Error Budget Tracking]
```
Request Tracing Through Async Boundary
```
Synchronous Request:
────────────────────
User Browser
  │
  └─→ [POST /checkout]  (X-Correlation-ID: abc-123)
        │
        ├─→ API Gateway        (span: api_gw, trace_id: abc-123)
        │
        ├─→ Checkout Service   (span: checkout, parent: api_gw,
        │                       baggage: {user_id, tenant_id})
        │
        ├─→ Payment Service    (span: payment, parent: checkout)
        │     │
        │     └─→ [HTTP call]  (headers include trace context)
        │
        └─→ Response [200 OK]  (includes X-Trace-ID: abc-123)

Asynchronous (Kafka):
─────────────────────
Checkout Service                        Order Processing Service
  │                                          │
  ├─→ Span: publish_order                    │
  │   [SEND to Kafka]                        │
  │   Message headers:                       │
  │     ├─ traceparent: 00-abc-123...        │
  │     ├─ tracestate: vendor=value          │
  │     └─ baggage: user_id=123              │
  │                                          │
  │               [Kafka Queue]              │
  │                     │                    │
  └─────────────────────┴────────────────────┤
                                             │
                               Span: process_order
                               [EXTRACT from headers]
                               [Link to producer span]
                               parent: (from extracted context)
                               links:  [producer span context]
                                             │
                                             └─ Trace is continuous across
                                                the sync/async boundary!

Complete request journey: api_gw → checkout → kafka → order_service
```
Production Checklists
Observability Checklist for a New Service
Use this before deploying any new service to production.
Logging ✓
- [ ] All logs are structured JSON (not free-form text)
- [ ] Logs include: timestamp, service name, log level, event type, trace_id, span_id
- [ ] Correlation IDs are propagated via X-Correlation-ID header
- [ ] MDC (Mapped Diagnostic Context) configured in logging framework
- [ ] Sensitive data (passwords, tokens, PII) not logged
- [ ] Log levels are appropriate (DEBUG for verbose, INFO for decisions, ERROR for problems)
- [ ] Logging configuration is externalized (environment variables, ConfigMap)
Metrics ✓
- [ ] Golden Signals defined and emitted: latency (p50/p95/p99), traffic (RPS), errors (rate, by type), saturation (CPU, memory)
- [ ] Business metrics defined: signup rate, conversion rate, revenue
- [ ] JVM metrics enabled (via Micrometer): heap usage, GC pauses, thread count
- [ ] Metrics have consistent naming and tags (environment, service, version)
- [ ] High-cardinality dimensions handled via EMF or separate logs (not metrics)
- [ ] Metrics endpoint available (/actuator/prometheus for Spring Boot)
Tracing ✓
- [ ] OpenTelemetry Java agent configured (or manual instrumentation for custom logic)
- [ ] Trace context propagated to all downstream services
- [ ] Async boundaries (Kafka, SQS) properly instrumented (context in headers)
- [ ] Sampling strategy defined: head-based rate + tail-based for errors
- [ ] Trace exporter configured (X-Ray, Jaeger, Tempo, Datadog)
- [ ] Custom spans added for critical business operations (checkout, payment, etc.)
Alerting ✓
- [ ] SLOs defined: availability, latency targets (use error budget)
- [ ] Alerts based on SLO burn rate (not static thresholds)
- [ ] All critical alerts have runbooks with investigation steps
- [ ] Alert routing configured (severity → team → escalation)
- [ ] Oncall rotation established
- [ ] Alert fatigue handled (alert on symptoms, not causes)
Dashboards ✓
- [ ] Service dashboard created: Golden Signals + business metrics
- [ ] Architecture dashboard: service dependencies
- [ ] Error dashboard: error types, rates, affected users
- [ ] Resource dashboard: CPU, memory, disk, connections
- [ ] Drill-down capabilities (time-series → traces → logs)
Pre-Production Testing ✓
- [ ] Load test at 2x expected peak traffic
- [ ] Chaos test: kill dependencies, verify graceful degradation
- [ ] Observability test: verify traces, metrics, logs are correctly emitted
- [ ] Verify no PII in logs/traces
- [ ] Verify sampling rates are economical (estimate daily cost)
Documentation ✓
- [ ] Service wiki page: observability setup, how to debug
- [ ] Dashboard links in runbooks
- [ ] Trace ID querying instructions (how to find traces for a request)
- [ ] Common issues and how to debug them
- [ ] On-call guide: what to do when this service alerts
Production Validation Checklist
Before go-live (24 hours before):
- [ ] Dashboards loaded and displaying data correctly
- [ ] Alerts firing and routing to correct teams
- [ ] Logs searchable in CloudWatch Logs Insights (query latency < 1 minute)
- [ ] Traces appearing in X-Ray / Jaeger
- [ ] Sampling rate confirmed (estimate: X traces/second, Y cost/month)
- [ ] Cost projections reviewed (logs, metrics, traces)
- [ ] Oncall engineer trained on dashboards and runbooks
- [ ] Rollback plan includes observability (how do we know rollback worked?)
First 24 hours post-launch (continuous monitoring):
- [ ] Error rate normal (no unexplained spikes)
- [ ] Latency baseline established (compare to pre-prod)
- [ ] Trace sampling working (no gaps in coverage)
- [ ] Alerts tuned (no false positives, no missed failures)
- [ ] Team familiar with debugging tools (tracing, log querying)
Key Takeaways
Remember these talking points:
- Observability vs Monitoring: Observability answers “why,” monitoring answers “what.”
- Three Pillars: Logs, metrics, traces. Each has strengths. Together, they’re powerful. Don’t leave any out.
- Correlation IDs are non-negotiable: Every request must have a correlation ID flowing through all services. This is basic hygiene.
- Structured logging: JSON logs are queryable. Free-form text logs are noise.
- OpenTelemetry: Industry standard, vendor-agnostic, broadly adopted. This is the safe choice.
- SLOs not dashboards: Define what “healthy” means using SLOs. Build alerts around error budgets, not static thresholds.
- Sampling is mandatory: At scale, tracing everything is expensive. Head + tail sampling balances cost and visibility.
- Domain context matters: E-commerce cares about conversion and inventory. Banking cares about auditability. Know your domain.
- Incident response workflow: Alert → Metrics → Traces → Logs. This is the gold standard for RCA.
- Design for observability: Build systems with instrumentation from day one. Don’t bolt it on later.
Final Reminders
As a Tech Lead, you’re responsible for observability culture on your team:
- Establish standards: Logging format, metric naming, trace sampling policy
- Code review: Ensure correlation IDs, no PII, structured logging
- Team training: Teach juniors how to use dashboards and traces
- On-call support: Ensure dashboards and runbooks are actually useful
- Continuous improvement: Measure observability effectiveness (MTTR, alert quality)
- Balance cost and visibility: Not “trace everything,” but “trace the right things”
Success looks like:
- Team can diagnose issues in < 15 minutes using observability
- New team members can debug unfamiliar services using traces
- Incidents rarely recur (observability enables learning)
- On-call engineers sleep better (actionable alerts, clear runbooks)
1. How to explain monitoring vs observability (using Datadog)
- Monitoring
- Define key service and business KPIs: error rate, p95 latency, throughput, resource usage, checkout success, etc.
- Build dashboards and SLOs in Datadog for these KPIs.
- Configure alerts (monitors) on symptoms, not just infrastructure: spikes in 5xx, slow endpoints, SLO burn rate.
- Observability
- Ensure rich telemetry:
- Metrics from infra, apps, DB, cache.
- Structured logs with correlation IDs.
- Traces across services (APM).
- RUM and synthetics for user experience.
- Standardize tagging (`env`, `service`, `version`, `team`, etc.) so everything can be sliced and correlated.
This distinction shows ownership of both operational health and deep debugging capability.
2. End‑to‑end setup for a typical web application in Datadog
For a modern web app (e.g., frontend + Java/Spring backend + DB + cache, on cloud or Kubernetes):
- Infrastructure layer
- Install the Datadog Agent on nodes/VMs or as a DaemonSet in Kubernetes.
- Enable cloud and DB integrations (AWS, RDS/Postgres/MySQL, Redis, NGINX, etc.).
- Outcome: baseline metrics for CPU, memory, disk, network, DB health, queue depth, etc.
- Backend / API (APM + logs + metrics)
- Attach `dd-java-agent` (for Java) and configure `DD_ENV`, `DD_SERVICE`, `DD_VERSION`, `DD_LOGS_INJECTION`.
- Rely on auto‑instrumentation for HTTP, JDBC, Redis, HTTP clients, etc.
- Configure application logs to be collected by the agent, using structured logging and trace/log correlation.
- Emit domain metrics (e.g., `checkout.completed`, `payment.failed`) via DogStatsD for business visibility (see the sketch after this list).
- Frontend (RUM)
- Add Datadog RUM snippet to the web app.
- Capture page loads, JS errors, user actions, and Core Web Vitals, tagged by `env`, `service`, and optionally app version.
- External reliability (Synthetics)
- Set up HTTP and browser synthetics for critical endpoints and user flows (login, search, checkout).
This gives a full picture: infra → backend → frontend → external dependencies.
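As a concrete illustration of the business-metric bullet above, here is a minimal sketch of DogStatsD emission from a Java service, assuming the `com.datadoghq:java-dogstatsd-client` dependency and a Datadog Agent listening on `localhost:8125`; the class, metric names, and tag values are illustrative:

```java
// Minimal sketch: emitting domain/business metrics via DogStatsD.
// Assumes com.datadoghq:java-dogstatsd-client and a local Datadog Agent on UDP 8125.
// Metric and tag names ("checkout.completed", "payment_provider", ...) are illustrative.
import com.timgroup.statsd.NonBlockingStatsDClientBuilder;
import com.timgroup.statsd.StatsDClient;

public class CheckoutMetrics {

    private static final StatsDClient STATSD = new NonBlockingStatsDClientBuilder()
            .prefix("shop")                        // metrics appear as shop.checkout.completed, ...
            .hostname("localhost")
            .port(8125)
            .constantTags("env:prod", "service:checkout-service", "version:1.23.0")
            .build();

    public void onCheckoutCompleted(String paymentProvider) {
        STATSD.incrementCounter("checkout.completed", "payment_provider:" + paymentProvider);
    }

    public void onPaymentFailed(String reason) {
        STATSD.incrementCounter("payment.failed", "reason:" + reason);
    }

    public void recordCheckoutLatency(long millis) {
        // Distributions are aggregated server-side, which suits latency-style measurements.
        STATSD.distribution("checkout.duration_ms", millis);
    }
}
```

The `prefix` and `constantTags` keep custom metrics aligned with the same `env`/`service`/`version` tagging convention used for traces and logs, so business metrics can be sliced the same way.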
3. How this is used operationally (process + workflow)
As a senior tech lead, the value is not just in plumbing, but in how teams use it:
- Dashboards
- Service dashboards: latency, throughput, error rate, GC, DB timings for each microservice.
- Business dashboards: orders/minute, cart conversions, payment success, broken down by region or channel.
- Infra dashboards: node health, pod restarts, DB and cache performance.
- Monitors and SLOs
- Define SLOs for critical flows (e.g., “99.9% of checkouts complete successfully in < 1s over 30 days”).
- Use Datadog SLOs and monitors to track error budgets and burn rate (a worked burn-rate example follows this list).
- Configure alerts with enough context: involved service, env, recent deployment version, and links to relevant dashboards and runbooks.
- Standardization and governance
- Establish a tagging and naming convention across services.
- Set guidelines for logging levels, structured fields (user ID, order ID, correlation ID), and sensitive data.
- Make observability part of the definition of done for new services and features.
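To make error budgets and burn rate concrete (numbers below are illustrative): a 99.9% SLO leaves an error budget of 0.1% of requests over the 30-day window. If the service is currently failing 0.5% of requests, it is burning budget at 5x the sustainable rate and would exhaust the whole 30-day budget in roughly 6 days. A minimal sketch of that arithmetic:

```java
// Burn-rate arithmetic sketch (illustrative numbers; not a Datadog API call).
public class BurnRateExample {
    public static void main(String[] args) {
        double sloTarget = 0.999;                   // "99.9% of checkouts complete successfully"
        double allowedErrorRate = 1.0 - sloTarget;  // 0.001 -> the error budget, as a rate
        double observedErrorRate = 0.005;           // e.g., 0.5% of requests currently failing

        double burnRate = observedErrorRate / allowedErrorRate; // 5.0x
        double daysToExhaust = 30.0 / burnRate;                 // ~6 days of a 30-day window

        System.out.printf("burn rate = %.1fx, 30-day budget exhausted in ~%.1f days%n",
                burnRate, daysToExhaust);
    }
}
```

Burn-rate monitors typically evaluate this ratio over a long and a short window together, so fast burns page quickly without alerting on short-lived noise.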
4. Concrete incident example (shows practical application)
For example, if “checkout is slow and users see errors”:
- A Datadog monitor on checkout SLO / error rate fires.
- Open the `checkout-service` dashboard:
- Spot that p95 latency and error rate increased after a specific deployment version.
- Jump into APM traces:
- See slow spans for DB queries or external payment gateway calls.
- From a slow trace, drill into logs:
- View the exact exception, SQL query, or external error code, along with user/order IDs.
- Correlate with RUM:
- Check whether the issue affects a specific region, browser, or only mobile users.
- Take action:
- Roll back, toggle feature flags, or adjust infra resources.
- After the incident:
- Update monitors or add new domain metrics/log fields to prevent blind spots.
This demonstrates that Datadog is not just “graphs,” but the backbone for structured incident response and continuous improvement.
5. Key design principles a senior tech lead would drive
- Treat telemetry as a first‑class feature:
- Observability requirements planned alongside functional requirements.
- Optimize for fast MTTR (mean time to recovery):
- From alert → dashboard → trace → log → root cause in a few hops.
- Align technical metrics with business outcomes:
- Tie SLOs to user journeys and revenue‑critical flows (search, add‑to‑cart, checkout).
- Promote self‑service:
- Product teams own their dashboards, monitors, and SLOs, with shared standards and platform support around Datadog.
1. DB query regression after deployment
Scenario
A new release goes out. Within 10–15 minutes, checkout latency starts increasing and some users abandon the flow.
What Datadog shows
- A monitor on `checkout-service` p95 latency and HTTP 5xx rate fires.
- The service dashboard shows:
- p95 latency for `POST /api/checkout` doubled.
- Error rate slightly increased.
- This started right after version `1.23.0` went live (via the `version` tag).
- In APM traces:
- Slow traces show most time spent in a specific DB span, e.g. `SELECT * FROM orders ...`.
- The span duration went from ~20 ms to ~300 ms.
- DB integration metrics (Postgres/MySQL):
- Increase in rows scanned and in the slow query count.
- CPU on DB up, but not maxed.
How it’s resolved
- Compare traces and SQL before vs after release.
- Identify that a new filter was added but no proper index.
- Hotfix: add the missing index (see the sketch after this list), redeploy, or temporarily roll back to the previous version.
- After fix:
- Latency and error rate return to normal.
- Close the incident and update runbook: “DB changes must have index review + load test; add a dedicated dashboard widget tracking slow queries per table.”
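To illustrate the fix, here is a hedged sketch of declaring the missing index on the JPA entity; the entity, column names, and index name are hypothetical, and in production the index itself would normally be created by an SQL migration (Flyway/Liquibase) rather than by schema generation:

```java
// Hypothetical JPA entity for the orders table, declaring the index the new
// filter needed so code and schema stay in sync. Names are illustrative; the
// actual index would typically be created by a DB migration script.
import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import jakarta.persistence.Index;
import jakarta.persistence.Table;

@Entity
@Table(
        name = "orders",
        indexes = @Index(name = "idx_orders_customer_status", columnList = "customer_id, status")
)
public class Order {

    @Id
    private Long id;

    @Column(name = "customer_id")
    private Long customerId;

    @Column(name = "status")
    private String status;

    // getters/setters omitted for brevity
}
```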
2. Memory leak leading to container restarts
Scenario
Users intermittently see 502/503 errors from the API. Incidents seem random; no obvious traffic spike.
What Datadog shows
- Infra dashboard:
- Pods for `catalog-service` restarting frequently.
- Container OOM kills visible in Kubernetes events.
- JVM metrics on that service:
- Heap usage climbs gradually over several hours, never fully reclaimed after GC.
- Full GC frequency increasing.
- APM:
- Request latency spikes just before pod restarts.
- Logs:
- `OutOfMemoryError` or related GC errors shortly before the container dies.
How it’s resolved
- Correlate:
- Memory growth → more full GCs → latency spikes → container OOM → brief 5xx blips.
- Use heap dumps / profiling (outside Datadog) to identify root cause (e.g., caching large objects per request, or unbounded in-memory cache).
- Short-term: increase pod memory requests/limits and replica count to reduce user impact.
- Long-term: fix the leak (proper cache eviction / avoiding large in-memory collections); see the bounded-cache sketch after this list.
- In Datadog:
- Add a monitor on memory usage slope (or GC time) to detect future leaks earlier.
- Add a widget that correlates pod restarts with heap usage.
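A minimal sketch of the long-term fix, assuming the leak came from an unbounded in-memory map used as a cache and that Caffeine (`com.github.ben-manes.caffeine:caffeine`) is an acceptable dependency; the cache, types, and limits are illustrative:

```java
// Sketch: replacing an unbounded HashMap "cache" with a size- and time-bounded
// Caffeine cache so old entries are evicted instead of accumulating on the heap.
// Assumes the com.github.ben-manes.caffeine:caffeine dependency; limits are illustrative.
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

import java.time.Duration;

public class ProductCache {

    // Before: private final Map<String, Product> cache = new HashMap<>(); // grows forever
    private final Cache<String, Product> cache = Caffeine.newBuilder()
            .maximumSize(10_000)                      // hard cap on entries
            .expireAfterWrite(Duration.ofMinutes(10)) // stale entries are dropped
            .build();

    private final ProductRepository repository;

    public ProductCache(ProductRepository repository) {
        this.repository = repository;
    }

    public Product byId(String productId) {
        // Loads on miss; never retains more than maximumSize entries.
        return cache.get(productId, repository::findById);
    }
}

// Minimal hypothetical collaborators so the sketch is self-contained.
record Product(String id, String name) {}

interface ProductRepository {
    Product findById(String id);
}
```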
3. Third‑party payments API latency and errors
Scenario
Checkout starts timing out and payment failures spike, mostly in certain time windows. Internally, no major code changes happened.
What Datadog shows
- SLO/monitor on “successful payment events” fires due to increased failures and timeouts.
- Service map highlights `payment-service` as a hot node (error rate and latency high).
- In APM traces for `payment-service`:
- A downstream span for `https://api.payment-gateway.com/charge` has:
- Latency jumping from 200 ms → 2–3 seconds.
- More errors with HTTP 5xx or timeouts.
- Logs:
- Timeouts and specific error codes from the payment provider: e.g., `504 Gateway Timeout`, `rate limit exceeded`.
- Internal infra (CPU, DB, network) looks normal.
How it’s resolved
- Confirm issue is external:
- Other services are healthy, but all failures correlate with the payment provider spans.
- Immediate mitigation:
- Reduce timeouts and implement proper circuit breaker behavior (failing fast instead of hanging); see the sketch after this list.
- Fallback strategies where possible (e.g., queueing payments for retry, better user messaging).
- Medium-term:
- Work with the provider (share Datadog metrics and timings).
- Consider multi-provider setup or regional routing to reduce blast radius.
- In Datadog:
- Add a dedicated dashboard for external dependencies with:
- Latency, error rate per third‑party.
- Separate monitors so incidents are quickly classified as “internal vs third‑party”.
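A minimal sketch of the fail-fast mitigation, assuming Resilience4j (`io.github.resilience4j:resilience4j-circuitbreaker`) is on the classpath; the gateway client, thresholds, and helper types are illustrative:

```java
// Sketch of "fail fast" protection around the external payment gateway.
// Assumes io.github.resilience4j:resilience4j-circuitbreaker; names and
// thresholds are illustrative, not tuned values.
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

import java.time.Duration;
import java.util.function.Supplier;

public class PaymentClient {

    private final CircuitBreaker circuitBreaker;
    private final PaymentGateway gateway; // hypothetical HTTP client with a short (~2s) timeout

    public PaymentClient(PaymentGateway gateway) {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)                          // open after 50% failures...
                .slowCallDurationThreshold(Duration.ofSeconds(2))  // ...counting calls > 2s as slow
                .slowCallRateThreshold(50)
                .slidingWindowSize(50)
                .waitDurationInOpenState(Duration.ofSeconds(30))   // probe again after 30s
                .build();
        this.circuitBreaker = CircuitBreakerRegistry.of(config).circuitBreaker("payment-gateway");
        this.gateway = gateway;
    }

    public ChargeResult charge(ChargeRequest request) {
        Supplier<ChargeResult> guarded =
                CircuitBreaker.decorateSupplier(circuitBreaker, () -> gateway.charge(request));
        try {
            return guarded.get();
        } catch (CallNotPermittedException open) {
            // Circuit is open: fail fast instead of hanging checkout threads,
            // e.g. queue the payment for retry and show a friendly message.
            return ChargeResult.deferred();
        }
    }
}

// Minimal hypothetical collaborators so the sketch compiles on its own.
interface PaymentGateway { ChargeResult charge(ChargeRequest request); }
record ChargeRequest(String orderId, long amountCents) {}
record ChargeResult(boolean deferred) {
    static ChargeResult deferred() { return new ChargeResult(true); }
}
```

Pairing the breaker with a short client timeout keeps checkout threads from piling up behind a slow provider; the `CallNotPermittedException` branch is where retry queuing or user messaging hooks in.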
4. Region-specific degradation due to misconfigured load balancer
Scenario
Users in Europe complain about slow responses and sporadic errors, but global averages look fine.
What Datadog shows
- Overall SLOs might still be green, but:
- RUM dashboard:
- Page load time and XHR latency high for `region:EU`.
- APM:
- Filtering traces by `region:eu-west-1` shows p95 latency 3–4x higher versus `us-east-1`.
- Infra / LB metrics:
- EU load balancer has higher 5xx and connection errors.
- One of the EU backend target groups shows more unhealthy instances.
- Logs (filtered by region tag / host):
- Increased `connection reset` or `upstream timeout` messages from NGINX / LB in EU only.
How it’s resolved
- Trace the symptom:
- Start from RUM (user experience) → backend traces → infra in that region.
- Identify that:
- A new autoscaling rule or target group configuration in EU was incorrect (e.g., health check path broken, fewer healthy instances, or mis-routed to an overloaded node pool).
- Fix:
- Correct LB target group configuration and health checks.
- Redistribute traffic evenly across healthy instances.
- Post-incident:
- Add region-specific monitors:
- Latency, error rate, and RUM performance per region.
- Ensure deployment pipelines update configs consistently across regions.
5. Frontend JS error breaking a key flow after a feature rollout
Scenario
No big backend changes, but suddenly users on certain browsers cannot complete checkout. Backend metrics look normal, but conversion drops.
What Datadog shows
- Business dashboard:
- Drop in successful checkouts / increase in cart abandonment.
- Backend:
- APM metrics for `POST /api/checkout` look fine (no spike in errors or latency).
- RUM:
- Spike in JS errors on the checkout page, mainly on a specific browser/version (e.g., Safari 15).
- Error stack: `TypeError: undefined is not a function` in a new piece of JS added in the latest frontend release.
- Correlation with the new RUM `version` or `build` tag.
- RUM session replays (if enabled):
- Show users stuck at a specific step, with a button not responding or a form not submitting.
How it’s resolved
- Link symptom to cause entirely from the frontend:
- Users see error in JS → action not sent to backend → backend metrics stay “green” but business metric falls.
- Roll back the offending frontend bundle or quickly hotfix the JavaScript.
- Improve QA/testing:
- Add automated cross‑browser tests for critical flows.
- In Datadog:
- Add monitors on:
- RUM JS error rate for critical pages.
- Conversion rate from `cart_view` → `checkout_complete`.
- Ensure frontend releases are tagged and visible in dashboards for quick correlation.