Distributed tracing in government cloud environments presents a specific challenge: you need the same observability depth as commercial systems, but every component must satisfy FedRAMP boundary requirements, FIPS 140-2 encryption mandates, and audit logging standards. Standard observability stacks designed for commercial AWS often use services or exporters that don't exist in GovCloud — or exist in limited forms.
OpenTelemetry (OTEL) solves the vendor lock-in problem and provides a standardized instrumentation layer that maps cleanly to what federal systems actually need. This article covers how we architect OpenTelemetry distributed tracing for federal programs running on AWS GovCloud.
Why OpenTelemetry for Government Systems
OpenTelemetry has become the industry standard for instrumentation because it separates signal collection from signal backend. Your application code uses the OTEL SDK to emit spans, metrics, and logs. The OTEL Collector receives those signals and routes them to whatever backend you choose — X-Ray, Grafana Tempo, Jaeger, or a commercial APM.
For federal programs, this separation matters because:
- Backend flexibility: You can start with X-Ray (GovCloud-native) and migrate to a compliant commercial solution without re-instrumenting your application code
- Reduced scope: The OTEL Collector can scrub sensitive data before export, narrowing what enters your observability backend and what you need to audit
- Standardization: CNCF-backed open standard means assessors can find documentation, reducing ATO friction
- Multi-signal: Traces, metrics, and logs share the same pipeline — single boundary to authorize and monitor
GovCloud-Specific Architecture
A typical OpenTelemetry architecture in GovCloud follows this pattern:
Application (OTEL SDK) → OTEL Collector (sidecar or DaemonSet) → Backend (X-Ray / Grafana)
The Collector Deployment Model
For Kubernetes workloads on GovCloud EKS, deploy the OTEL Collector as a DaemonSet — one Collector per node rather than one per pod. This reduces resource overhead and consolidates network egress to a single path per node, which simplifies security group rules and VPC flow log analysis.
```yaml
# collector-daemonset.yaml (simplified)
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  namespace: observability
spec:
  mode: daemonset
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      memory_limiter:
        check_interval: 1s
        limit_mib: 400
      batch:
        send_batch_size: 10000
        timeout: 10s
      # Scrub PII before export — required for systems handling CUI
      attributes:
        actions:
          - key: user.email
            action: delete
          - key: user.id
            action: hash
    exporters:
      awsxray:
        region: us-gov-west-1
      logging:
        verbosity: detailed
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, attributes, batch]
          exporters: [awsxray]
```

FIPS 140-2 Compliance
GovCloud mandates FIPS 140-2 validated encryption for data in transit. The OTEL Collector's gRPC and HTTP endpoints must use TLS with FIPS-validated cipher suites. AWS X-Ray's SDK and daemon handle FIPS automatically when running in GovCloud. If you're exporting to a non-AWS backend, verify the client library uses a FIPS-validated TLS implementation — in Go, this typically means building with GOEXPERIMENT=boringcrypto.
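As a sketch, TLS on the Collector's OTLP receiver is enabled through the receiver's `tls` block; the certificate paths below are placeholders, and note that the FIPS requirement is ultimately satisfied by the crypto module the Collector binary was built with, not by this configuration alone:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          # Placeholder paths — mount from a Kubernetes secret in practice
          cert_file: /etc/otel/certs/tls.crt
          key_file: /etc/otel/certs/tls.key
          min_version: "1.2"
```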
Instrumentation Patterns
Automatic Instrumentation (Agents)
For Java and Node.js services, use OTEL's auto-instrumentation agents. These agents attach to the JVM or Node process and instrument HTTP clients, database calls, and framework-level operations without code changes.
```bash
# Java service startup with OTEL agent
java -javaagent:/otel/opentelemetry-javaagent.jar \
  -Dotel.service.name=mission-api \
  -Dotel.exporter.otlp.endpoint=http://otel-collector:4317 \
  -Dotel.resource.attributes=environment=prod,cloud.region=us-gov-west-1 \
  -jar app.jar
```

Manual Instrumentation for Business Logic
Auto-instrumentation covers framework boundaries but not application logic. For high-value business operations — particularly anything that touches authorization decisions or data access in a federal system — add manual spans to create a complete audit-quality trace:
```python
# Python example: manual span around a business operation
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_document(doc_id: str, user_context: dict) -> dict:
    with tracer.start_as_current_span("process_document") as span:
        span.set_attribute("document.id", doc_id)
        span.set_attribute("user.role", user_context.get("role"))
        span.set_attribute("classification.level", "CUI")
        # Business logic here — every branch is traceable
        result = fetch_and_validate(doc_id)  # application-specific helper
        span.set_attribute("document.pages", result.get("page_count"))
        return result
```

The span attributes above create a queryable audit trail: you can answer "what did this user access and when" from traces alone, supplementing CloudTrail for application-layer audit requirements.
Sampling Strategy for Government Systems
Sampling is required for high-throughput services — sending 100% of traces is cost-prohibitive and creates storage compliance burdens. But federal systems often have an audit requirement: certain transaction types must always be traced.
Use composite sampling with the OTEL Collector's tail sampler. Because tail sampling must see every span of a trace before making a decision, run it in a centralized gateway tier of Collectors rather than in the per-node DaemonSet agents:
```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Always trace errors
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Always trace privileged operations
      - name: privileged-ops-policy
        type: string_attribute
        string_attribute:
          key: operation.privilege_level
          values: [admin, system, elevated]
      # Sample 10% of normal traffic
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
```

This approach ensures auditability of high-risk operations (100% trace retention) while controlling costs for routine traffic.
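The keep/drop logic these policies encode can be read as a simple decision procedure. A stdlib-only sketch of that logic, not the Collector's actual implementation, with attribute keys matching the config above:

```python
import random

def keep_trace(spans: list[dict], sample_rate: float = 0.10, rng=random.random) -> bool:
    """Mimic the composite tail-sampling decision: errors and privileged
    operations are always kept; everything else is sampled at sample_rate."""
    for span in spans:
        if span.get("status_code") == "ERROR":
            return True
        if span.get("operation.privilege_level") in {"admin", "system", "elevated"}:
            return True
    # Routine traffic: probabilistic decision
    return rng() < sample_rate

# Usage: a trace containing any error span is always retained
keep_trace([{"status_code": "OK"}, {"status_code": "ERROR"}])  # → True
```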
Connecting to the Broader Observability Stack
Distributed tracing is one leg of observability. Correlate trace IDs with structured logs and metrics for full context during incidents:
- Logs: Inject OTEL trace ID into every log line (most logging frameworks support MDC or similar). In AWS CloudWatch Logs Insights, you can pivot from a trace to all logs sharing that trace ID
- Metrics: OpenTelemetry metrics shipped via the same Collector provide latency histograms and error rates correlated with your trace data
- Alerts: Set SLO-based alerts on trace-derived metrics (p99 latency, error rate) rather than infrastructure metrics alone
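In Python, for instance, trace-ID injection into logs is typically done with a logging filter. The sketch below uses only the standard library, with a contextvar standing in for the OTEL SDK's active span context (the real SDK exposes the ID via `trace.get_current_span().get_span_context()`):

```python
import contextvars
import logging

# Stand-in for the OTEL SDK's current trace context (illustration only)
current_trace_id = contextvars.ContextVar("trace_id", default="0" * 32)

class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only enrich it

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
logger = logging.getLogger("mission-api")
logger.addHandler(handler)
logger.addFilter(TraceIdFilter())
logger.setLevel(logging.INFO)

current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("document processed")  # log line now carries the trace ID
```

With the trace ID in every line, a CloudWatch Logs Insights query filtered on that value returns all logs for one request path.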
This architecture pairs well with continuous monitoring pipelines covered in our NIST RMF continuous monitoring guide and integrates into DevSecOps pipelines we build for government programs.
Frequently Asked Questions
Does OpenTelemetry work with AWS X-Ray in GovCloud?
Yes. AWS X-Ray is available in both us-gov-west-1 and us-gov-east-1. The OTEL Collector's awsxray exporter translates OTEL spans to X-Ray format and sends them via the X-Ray daemon or directly via the AWS SDK. The X-Ray SDK is not required on the application side when using OTEL — the Collector handles conversion.
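Part of that conversion is rewriting the 32-hex-character W3C trace ID into X-Ray's `1-{epoch}-{random}` format. A simplified sketch of the reshaping (the exporter itself also validates that the leading bytes form a plausible timestamp):

```python
def otel_to_xray_trace_id(otel_trace_id: str) -> str:
    """Reshape a 32-hex-char W3C/OTEL trace ID into X-Ray's
    '1-<8 hex epoch seconds>-<24 hex random>' format."""
    if len(otel_trace_id) != 32:
        raise ValueError("expected a 32-hex-character trace ID")
    return f"1-{otel_trace_id[:8]}-{otel_trace_id[8:]}"

# e.g. "5759e988bd862e3fe1be46a994272793"
#    → "1-5759e988-bd862e3fe1be46a994272793"
```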
How do we handle trace data that might contain CUI?
Use the OTEL Collector's attributes processor to remove or hash sensitive fields before export. Define a sanitization layer as a required step in every pipeline — before the exporter stage. Establish which span attributes are safe to export (operation names, HTTP status codes, latency) and which require scrubbing (user identifiers, document IDs, query parameters). Document this policy in your SSP as part of your audit logging and data handling controls.
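Conceptually, the `hash` action replaces a value with a one-way digest so traces remain correlatable without exposing the raw identifier. A stdlib sketch of the idea (illustrative only; the Collector's processor uses its own digest algorithm):

```python
import hashlib

def scrub_attributes(attrs: dict, delete: set, hash_keys: set) -> dict:
    """Illustrative equivalent of the attributes processor's
    delete/hash actions, applied before export."""
    out = {}
    for key, value in attrs.items():
        if key in delete:
            continue  # dropped entirely
        if key in hash_keys:
            out[key] = hashlib.sha256(str(value).encode()).hexdigest()
        else:
            out[key] = value
    return out

scrubbed = scrub_attributes(
    {"user.email": "a@example.gov", "user.id": "u-123", "http.status_code": 200},
    delete={"user.email"},
    hash_keys={"user.id"},
)
# scrubbed contains no "user.email"; "user.id" is now a hex digest
```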
Is OpenTelemetry approved for FedRAMP-authorized systems?
OpenTelemetry is open-source instrumentation — it is not itself a FedRAMP-authorized service. What matters for FedRAMP is where trace data is stored and who can access it. X-Ray on GovCloud is within the FedRAMP authorization boundary. If you export to a commercial APM, that service must also be FedRAMP authorized or operate within your system boundary. The OTEL Collector acts as a boundary control point — data that doesn't leave GovCloud doesn't require a separate authorization assessment.
What's the performance overhead of OTEL instrumentation?
In production systems, OTEL instrumentation adds approximately 2-5% CPU overhead with default settings. Use batch processing in the Collector (not synchronous span export), set memory limits, and configure appropriate sampling rates for your throughput. For high-throughput services processing thousands of requests per second, tail-based sampling at the Collector level is significantly more efficient than head-based sampling in the SDK.
Can we use OpenTelemetry for metrics collection alongside traces?
Yes — OpenTelemetry Metrics is a production-stable component of the OTEL specification. You can replace Prometheus instrumentation libraries with OTEL Metrics SDKs and route metrics through the same Collector. In GovCloud, metrics can be exported to Amazon Managed Service for Prometheus (available in GovCloud) or to CloudWatch Metrics via the awsemf exporter.
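A sketch of a metrics pipeline added to the same Collector, exporting to CloudWatch via `awsemf` (the namespace value is a placeholder):

```yaml
exporters:
  awsemf:
    region: us-gov-west-1
    namespace: MissionApi   # placeholder CloudWatch namespace
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [awsemf]
```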