Distributed tracing in government cloud environments presents a specific challenge: you need the same observability depth as commercial systems, but every component must satisfy FedRAMP boundary requirements, FIPS 140-2 encryption mandates, and audit logging standards. Standard observability stacks designed for commercial AWS often use services or exporters that don't exist in GovCloud — or exist in limited forms.
OpenTelemetry (OTEL) solves the vendor lock-in problem and provides a standardized instrumentation layer that maps cleanly to what federal systems actually need. This article covers how we architect OpenTelemetry distributed tracing for federal programs running on AWS GovCloud.
Why OpenTelemetry for Government Systems
OpenTelemetry has become the industry standard for instrumentation because it separates signal collection from signal backend. Your application code uses the OTEL SDK to emit spans, metrics, and logs. The OTEL Collector receives those signals and routes them to whatever backend you choose — X-Ray, Grafana Tempo, Jaeger, or a commercial APM.
For federal programs, this separation matters because:
- Backend flexibility: You can start with X-Ray (GovCloud-native) and migrate to a compliant commercial solution without re-instrumenting your application code
- Reduced scope: The OTEL Collector can scrub sensitive data before export, narrowing what enters your observability backend and what you need to audit
- Standardization: CNCF-backed open standard means assessors can find documentation, reducing ATO friction
- Multi-signal: Traces, metrics, and logs share the same pipeline — single boundary to authorize and monitor
GovCloud-Specific Architecture
A typical OpenTelemetry architecture in GovCloud follows this pattern:
Application (OTEL SDK) → OTEL Collector (sidecar or DaemonSet) → Backend (X-Ray / Grafana)
The Collector Deployment Model
For Kubernetes workloads on GovCloud EKS, deploy the OTEL Collector as a DaemonSet — one Collector per node rather than one per pod. This reduces resource overhead and consolidates network egress to a single path per node, which simplifies security group rules and VPC flow log analysis.
```yaml
# collector-daemonset.yaml (simplified)
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  namespace: observability
spec:
  mode: daemonset
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      memory_limiter:
        check_interval: 1s
        limit_mib: 400
      batch:
        send_batch_size: 10000
        timeout: 10s
      # Scrub PII before export — required for systems handling CUI
      attributes:
        actions:
          - key: user.email
            action: delete
          - key: user.id
            action: hash
    exporters:
      awsxray:
        region: us-gov-west-1
      logging:
        verbosity: detailed
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, attributes, batch]
          exporters: [awsxray]
```

FIPS 140-2 Compliance
GovCloud mandates FIPS 140-2 validated encryption for data in transit. The OTEL Collector's gRPC and HTTP endpoints must use TLS with FIPS-validated cipher suites. AWS X-Ray's SDK and daemon handle FIPS automatically when running in GovCloud. If you're exporting to a non-AWS backend, verify the client library uses a FIPS-validated TLS implementation — in Go, this typically means building with GOEXPERIMENT=boringcrypto.
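As a sketch, TLS on the Collector's OTLP receiver is enabled through the receiver's `tls` block; the certificate paths below are placeholders, and note that the FIPS requirement is ultimately satisfied by the crypto module the Collector binary was built with, not by this configuration alone:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          # Placeholder paths — mount from a Kubernetes secret in practice
          cert_file: /etc/otel/certs/tls.crt
          key_file: /etc/otel/certs/tls.key
          min_version: "1.2"
```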
Instrumentation Patterns
Automatic Instrumentation (Agents)
For Java and Node.js services, use OTEL's auto-instrumentation agents. These agents attach to the JVM or Node process and instrument HTTP clients, database calls, and framework-level operations without code changes.
```bash
# Java service startup with OTEL agent
java -javaagent:/otel/opentelemetry-javaagent.jar \
  -Dotel.service.name=mission-api \
  -Dotel.exporter.otlp.endpoint=http://otel-collector:4317 \
  -Dotel.resource.attributes=environment=prod,cloud.region=us-gov-west-1 \
  -jar app.jar
```

Manual Instrumentation for Business Logic
Auto-instrumentation covers framework boundaries but not application logic. For high-value business operations — particularly anything that touches authorization decisions or data access in a federal system — add manual spans to create a complete audit-quality trace:
```python
# Python example: manual span around a business operation
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_document(doc_id: str, user_context: dict) -> dict:
    with tracer.start_as_current_span("process_document") as span:
        span.set_attribute("document.id", doc_id)
        span.set_attribute("user.role", user_context.get("role"))
        span.set_attribute("classification.level", "CUI")
        # Business logic here — every branch is traceable
        result = fetch_and_validate(doc_id)  # application-specific helper
        span.set_attribute("document.pages", result.get("page_count"))
        return result
```

The span attributes above create a queryable audit trail: you can answer "what did this user access and when" from traces alone, supplementing CloudTrail for application-layer audit requirements.
Sampling Strategy for Government Systems
Sampling is required for high-throughput services — sending 100% of traces is cost-prohibitive and creates storage compliance burdens. But federal systems often have an audit requirement: certain transaction types must always be traced.
Use composite sampling with the OTEL Collector's tail sampler. Because tail sampling must see every span of a trace before making a decision, run it in a centralized gateway tier of Collectors rather than in the per-node DaemonSet agents:
```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Always trace errors
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Always trace privileged operations
      - name: privileged-ops-policy
        type: string_attribute
        string_attribute:
          key: operation.privilege_level
          values: [admin, system, elevated]
      # Sample 10% of normal traffic
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
```

This approach ensures auditability of high-risk operations (100% trace retention) while controlling costs for routine traffic.
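The keep/drop logic these policies encode can be read as a simple decision procedure. A stdlib-only sketch of that logic, not the Collector's actual implementation, with attribute keys matching the config above:

```python
import random

def keep_trace(spans: list[dict], sample_rate: float = 0.10, rng=random.random) -> bool:
    """Mimic the composite tail-sampling decision: errors and privileged
    operations are always kept; everything else is sampled at sample_rate."""
    for span in spans:
        if span.get("status_code") == "ERROR":
            return True
        if span.get("operation.privilege_level") in {"admin", "system", "elevated"}:
            return True
    # Routine traffic: probabilistic decision
    return rng() < sample_rate

# Usage: a trace containing any error span is always retained
keep_trace([{"status_code": "OK"}, {"status_code": "ERROR"}])  # → True
```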
Connecting to the Broader Observability Stack
Distributed tracing is one leg of observability. Correlate trace IDs with structured logs and metrics for full context during incidents:
- Logs: Inject OTEL trace ID into every log line (most logging frameworks support MDC or similar). In AWS CloudWatch Logs Insights, you can pivot from a trace to all logs sharing that trace ID
- Metrics: OpenTelemetry metrics shipped via the same Collector provide latency histograms and error rates correlated with your trace data
- Alerts: Set SLO-based alerts on trace-derived metrics (p99 latency, error rate) rather than infrastructure metrics alone
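In Python, for instance, trace-ID injection into logs is typically done with a logging filter. The sketch below uses only the standard library, with a contextvar standing in for the OTEL SDK's active span context (the real SDK exposes the ID via `trace.get_current_span().get_span_context()`):

```python
import contextvars
import logging

# Stand-in for the OTEL SDK's current trace context (illustration only)
current_trace_id = contextvars.ContextVar("trace_id", default="0" * 32)

class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only enrich it

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
logger = logging.getLogger("mission-api")
logger.addHandler(handler)
logger.addFilter(TraceIdFilter())
logger.setLevel(logging.INFO)

current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("document processed")  # log line now carries the trace ID
```

With the trace ID in every line, a CloudWatch Logs Insights query filtered on that value returns all logs for one request path.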
This architecture pairs well with continuous monitoring pipelines covered in our NIST RMF continuous monitoring guide and integrates into DevSecOps pipelines we build for government programs.
Frequently Asked Questions
Does OpenTelemetry work with AWS X-Ray in GovCloud?
Yes. AWS X-Ray is available in both us-gov-west-1 and us-gov-east-1. The OTEL Collector's awsxray exporter translates OTEL spans to X-Ray format and sends them via the X-Ray daemon or directly via the AWS SDK. The X-Ray SDK is not required on the application side when using OTEL — the Collector handles conversion.
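Part of that conversion is rewriting the 32-hex-character W3C trace ID into X-Ray's `1-{epoch}-{random}` format. A simplified sketch of the reshaping (the exporter itself also validates that the leading bytes form a plausible timestamp):

```python
def otel_to_xray_trace_id(otel_trace_id: str) -> str:
    """Reshape a 32-hex-char W3C/OTEL trace ID into X-Ray's
    '1-<8 hex epoch seconds>-<24 hex random>' format."""
    if len(otel_trace_id) != 32:
        raise ValueError("expected a 32-hex-character trace ID")
    return f"1-{otel_trace_id[:8]}-{otel_trace_id[8:]}"

# e.g. "5759e988bd862e3fe1be46a994272793"
#    → "1-5759e988-bd862e3fe1be46a994272793"
```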
How do we handle trace data that might contain CUI?
Use the OTEL Collector's attributes processor to remove or hash sensitive fields before export. Define a sanitization layer as a required step in every pipeline — before the exporter stage. Establish which span attributes are safe to export (operation names, HTTP status codes, latency) and which require scrubbing (user identifiers, document IDs, query parameters). Document this policy in your SSP as part of your audit logging and data handling controls.
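Conceptually, the `hash` action replaces a value with a one-way digest so traces remain correlatable without exposing the raw identifier. A stdlib sketch of the idea (illustrative only; the Collector's processor uses its own digest algorithm):

```python
import hashlib

def scrub_attributes(attrs: dict, delete: set, hash_keys: set) -> dict:
    """Illustrative equivalent of the attributes processor's
    delete/hash actions, applied before export."""
    out = {}
    for key, value in attrs.items():
        if key in delete:
            continue  # dropped entirely
        if key in hash_keys:
            out[key] = hashlib.sha256(str(value).encode()).hexdigest()
        else:
            out[key] = value
    return out

scrubbed = scrub_attributes(
    {"user.email": "a@example.gov", "user.id": "u-123", "http.status_code": 200},
    delete={"user.email"},
    hash_keys={"user.id"},
)
# scrubbed contains no "user.email"; "user.id" is now a hex digest
```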
Is OpenTelemetry approved for FedRAMP-authorized systems?
OpenTelemetry is open-source instrumentation — it is not itself a FedRAMP-authorized service. What matters for FedRAMP is where trace data is stored and who can access it. X-Ray on GovCloud is within the FedRAMP authorization boundary. If you export to a commercial APM, that service must also be FedRAMP authorized or operate within your system boundary. The OTEL Collector acts as a boundary control point — data that doesn't leave GovCloud doesn't require a separate authorization assessment.
What's the performance overhead of OTEL instrumentation?
In production systems, OTEL instrumentation adds approximately 2-5% CPU overhead with default settings. Use batch processing in the Collector (not synchronous span export), set memory limits, and configure appropriate sampling rates for your throughput. For high-throughput services processing thousands of requests per second, tail-based sampling at the Collector level is significantly more efficient than head-based sampling in the SDK.
Can we use OpenTelemetry for metrics collection alongside traces?
Yes — OpenTelemetry Metrics is a production-stable component of the OTEL specification. You can replace Prometheus instrumentation libraries with OTEL Metrics SDKs and route metrics through the same Collector. In GovCloud, metrics can be exported to Amazon Managed Service for Prometheus (available in GovCloud) or to CloudWatch Metrics via the awsemf exporter.
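A sketch of a metrics pipeline added to the same Collector, exporting to CloudWatch via `awsemf` (the namespace value is a placeholder):

```yaml
exporters:
  awsemf:
    region: us-gov-west-1
    namespace: MissionApi   # placeholder CloudWatch namespace
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [awsemf]
```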