Defense AI programs operate at the intersection of machine learning engineering and classified cloud compliance — two disciplines that rarely overlap in the commercial ML literature. MLOps practices developed for consumer AI don't translate directly to Impact Level 5 (IL5) environments: the tools, the data handling procedures, and the audit requirements all differ.
This article covers how Rutagon architects MLOps pipelines for defense programs operating in classified and near-classified cloud environments — using publicly available services and patterns that have established DISA authorization paths.
Why MLOps Matters for Defense AI
Defense AI programs often produce good models in development that fail in production. The gap: no structured pipeline for training, validation, deployment, and monitoring. Models trained on one time slice perform differently on current data. Sensor-fed models drift as sensor calibration changes. Decision support models require audit trails that ad-hoc scripts can't provide.
MLOps provides the engineering structure:
- Reproducible model training (same code, same data, same model — every time)
- Automated model validation before deployment
- Canary deployment for model updates without operational disruption
- Drift detection to catch performance degradation before it affects mission
- Audit trail for model decisions — required for any program where AI assists human decision-making
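Canary deployment, for instance, reduces to a weighted routing decision between the current model and a candidate. A minimal sketch of that decision (the function and version names are illustrative, not from any specific platform):

```python
import random

def choose_model_version(canary_weight: float,
                         stable: str = "model-v1",
                         canary: str = "model-v2") -> str:
    """Route one inference request to the stable or canary model.

    canary_weight is the fraction of traffic (0.0-1.0) sent to the
    candidate; the version names here are placeholders.
    """
    if not 0.0 <= canary_weight <= 1.0:
        raise ValueError("canary_weight must be between 0.0 and 1.0")
    return canary if random.random() < canary_weight else stable
```

In practice the weight is raised in stages (e.g. 5%, 25%, 100%) while the monitoring described later in this article watches the canary's metrics.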
IL5 Architecture: What Changes from Commercial MLOps
Commercial MLOps platforms (MLflow, Weights & Biases, SageMaker in commercial regions) are not authorized for IL5. In GovCloud, the equivalent services are:
| Commercial | GovCloud IL5 |
|---|---|
| SageMaker (us-east-1) | SageMaker (us-gov-west-1) with DISA PA |
| MLflow (SaaS) | Self-hosted MLflow on EKS |
| W&B tracking | Custom tracking in DynamoDB |
| GitHub Actions ML | GitLab CI/CD on-prem or GovCloud |
| Docker Hub | ECR (private, GovCloud) or Iron Bank |
Data handling at IL5: Training data classified at IL5 cannot leave the GovCloud authorization boundary. All model training must run within the GovCloud VPC with no internet egress. Datasets stored in S3 GovCloud must use customer-managed key (CMK) encryption (SC-28), with access restricted to training roles via IAM Roles for Service Accounts (IRSA).
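One way to keep the CMK requirement from drifting is to centralize the `put_object` parameters in a helper, so every upload carries the KMS settings. A sketch under assumed names (the bucket, key, and ARN values are placeholders):

```python
def encrypted_put_args(bucket: str, key: str, body: bytes, kms_key_arn: str) -> dict:
    """Assemble S3 put_object arguments that force CMK encryption (SC-28).

    Intended usage with boto3: s3.put_object(**encrypted_put_args(...)).
    Bucket, key, and KMS ARN values here are illustrative.
    """
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ServerSideEncryption": "aws:kms",  # KMS, never the SSE-S3 default
        "SSEKMSKeyId": kms_key_arn,         # the customer-managed key
    }
```

Pairing this with a bucket policy that denies `s3:PutObject` requests lacking the `aws:kms` encryption header enforces the same control server-side, so a misconfigured client cannot bypass it.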
Model Training Pipeline Architecture
A production-grade IL5 training pipeline on GovCloud EKS:
```python
# Training job orchestration with Argo Workflows or native K8s Jobs
import json

from kubernetes import client, config

# Environment-specific configuration (values supplied per deployment)
ECR_REGISTRY = "example.dkr.ecr.us-gov-west-1.amazonaws.com"
TRAINING_VERSION = "1.0.0"

def submit_training_job(
    model_name: str,
    dataset_s3_uri: str,
    hyperparameters: dict,
    experiment_id: str
) -> str:
    """
    Submit model training job as a Kubernetes Job.
    All data stays within the GovCloud VPC — no internet egress.
    """
    config.load_incluster_config()
    batch_v1 = client.BatchV1Api()

    job = client.V1Job(
        metadata=client.V1ObjectMeta(
            name=f"training-{model_name}-{experiment_id}",
            labels={
                "app": "ml-training",
                "experiment": experiment_id,
                "model": model_name
            }
        ),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    service_account_name="ml-training-sa",  # IRSA role
                    containers=[
                        client.V1Container(
                            name="trainer",
                            # Iron Bank base image or STIG-hardened custom image
                            image=f"{ECR_REGISTRY}/ml-training:{TRAINING_VERSION}",
                            env=[
                                client.V1EnvVar(name="DATASET_URI", value=dataset_s3_uri),
                                client.V1EnvVar(name="MODEL_NAME", value=model_name),
                                client.V1EnvVar(name="EXPERIMENT_ID", value=experiment_id),
                                client.V1EnvVar(name="HYPERPARAMS", value=json.dumps(hyperparameters))
                            ],
                            resources=client.V1ResourceRequirements(
                                requests={"cpu": "4", "memory": "16Gi"},
                                limits={"cpu": "8", "memory": "32Gi"}
                            )
                        )
                    ]
                )
            )
        )
    )

    batch_v1.create_namespaced_job(namespace="ml-production", body=job)
    return job.metadata.name
```

Training Script Pattern
```python
import boto3
import mlflow  # Self-hosted MLflow tracking server
import logging
import json
from datetime import datetime, timezone

def train_model(dataset_uri: str, model_name: str, hyperparams: dict):
    """
    Training function with full audit trail.
    Logs to self-hosted MLflow and CloudWatch for ConMon audit.
    """
    logger = logging.getLogger(__name__)

    # Audit log: training started (AU-2, AU-12)
    logger.info(json.dumps({
        "event": "MODEL_TRAINING_STARTED",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_name": model_name,
        "dataset_uri": dataset_uri,
        "hyperparams": hyperparams,
        "log_type": "AUDIT"
    }))

    # Set MLflow experiment
    mlflow.set_experiment(model_name)

    with mlflow.start_run() as run:
        # Log hyperparameters for reproducibility
        mlflow.log_params(hyperparams)

        # Load data from S3 GovCloud (stays within VPC via VPC endpoint)
        s3 = boto3.client('s3', region_name='us-gov-west-1')
        # ... load and preprocess dataset into X_train, X_test, y_train, y_test ...

        # Train model (build_and_train_model is program-specific, not shown)
        model = build_and_train_model(X_train, y_train, hyperparams)

        # Evaluate (evaluate_model is program-specific, not shown)
        metrics = evaluate_model(model, X_test, y_test)
        mlflow.log_metrics(metrics)

        # Save model artifact to S3 with CMK encryption, then record
        # the S3 location on the run (log_artifact expects a local path,
        # so the remote URI is attached as a tag instead)
        model_uri = save_model_artifact(model, model_name, run.info.run_id)
        mlflow.set_tag("model_artifact_uri", model_uri)

        logger.info(json.dumps({
            "event": "MODEL_TRAINING_COMPLETED",
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "model_name": model_name,
            "run_id": run.info.run_id,
            "metrics": metrics,
            "log_type": "AUDIT"
        }))

    return run.info.run_id, metrics
```

Every training run generates an audit log (AU-2), tracks provenance (what data, what code, what hyperparameters), and produces a model artifact with a unique run ID. This is the reproducibility and auditability baseline that defense AI programs require.
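The `save_model_artifact` helper referenced in the training script is program-specific. A minimal sketch, assuming pickle serialization and an SSE-KMS upload (the bucket name and key layout are illustrative; the boto3 call is shown as a comment because the program's CMK ARN is deployment-specific):

```python
import pickle

def save_model_artifact(model, model_name: str, run_id: str,
                        bucket: str = "ml-artifacts-il5") -> str:
    """Serialize a trained model and return its S3 URI.

    The bucket name is a placeholder. The actual upload (elided) would
    call put_object with ServerSideEncryption='aws:kms' and the
    program's customer-managed key, per the data handling rules above.
    """
    key = f"models/{model_name}/{run_id}/model.pkl"
    payload = pickle.dumps(model)  # serialized bytes for the S3 Body
    # boto3.client('s3').put_object(Bucket=bucket, Key=key, Body=payload,
    #     ServerSideEncryption='aws:kms', SSEKMSKeyId=<program CMK ARN>)
    return f"s3://{bucket}/{key}"
```

Keying artifacts by model name and run ID means any deployed model can be traced back to the exact training run, and MLflow run that produced it.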
Model Validation Gates
No model deploys without passing validation gates. These automated checks run before any model reaches production:
```python
import json
import logging
from datetime import datetime, timezone

import mlflow

class ModelValidationPipeline:
    def __init__(self, model_registry_uri: str):
        self.registry = mlflow.tracking.MlflowClient(tracking_uri=model_registry_uri)

    def validate_for_deployment(self, run_id: str, validation_config: dict) -> bool:
        """
        Automated validation before deployment.
        Gates: accuracy threshold, bias metrics, prediction latency.
        """
        model = mlflow.pyfunc.load_model(f"runs:/{run_id}/model")
        # load_validation_dataset and the _check_* helpers are
        # program-specific (not shown)
        validation_data = load_validation_dataset(validation_config['validation_dataset_uri'])

        gates = {
            'accuracy': self._check_accuracy(model, validation_data, validation_config['min_accuracy']),
            'latency_p99': self._check_latency(model, validation_data, validation_config['max_latency_ms']),
            'bias_check': self._check_bias_metrics(model, validation_data, validation_config['bias_config'])
        }
        all_passed = all(gates.values())

        # Log validation outcome as audit event
        logging.info(json.dumps({
            "event": "MODEL_VALIDATION_RESULT",
            "run_id": run_id,
            "gates": gates,
            "approved_for_deployment": all_passed,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "log_type": "AUDIT"
        }))
        return all_passed
```

If any gate fails, the deployment pipeline stops and the team is notified. The validation outcome is logged as an audit event — for defense AI programs where model performance is mission-critical, this audit trail demonstrates responsible AI deployment.
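The notification step can be as simple as summarizing which gates failed and publishing that to the team's alert channel. A sketch under stated assumptions (the message shape and the SNS delivery path are illustrative, not a platform requirement):

```python
import json

def build_gate_failure_alert(run_id: str, gates: dict) -> str:
    """Summarize failed validation gates as a JSON alert message.

    The returned string would be published to the team's topic, e.g.
    boto3.client('sns').publish(TopicArn=..., Message=alert); the SNS
    wiring is an assumption for this sketch.
    """
    failed = sorted(name for name, passed in gates.items() if not passed)
    return json.dumps({
        "event": "MODEL_VALIDATION_FAILED",
        "run_id": run_id,
        "failed_gates": failed,
    })
```

Naming the specific failed gates in the alert lets the on-call engineer triage (accuracy regression vs. latency vs. bias) without digging through logs.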
Model Monitoring and Drift Detection
Post-deployment monitoring detects when a model's performance degrades — a common failure mode for sensor-fed models as operational conditions change:
```python
import boto3
from scipy import stats

def calculate_prediction_distribution_shift(
    baseline_predictions: list,
    current_predictions: list,
    threshold: float = 0.1
) -> dict:
    """
    Two-sample Kolmogorov-Smirnov test for distribution shift detection.
    Alert when current predictions diverge from the baseline distribution.
    """
    ks_stat, p_value = stats.ks_2samp(baseline_predictions, current_predictions)
    drift_detected = ks_stat > threshold

    if drift_detected:
        # Trigger ConMon alert (SI-4)
        boto3.client('cloudwatch').put_metric_data(
            Namespace='MLOps/ModelMonitoring',
            MetricData=[{
                'MetricName': 'PredictionDriftDetected',
                'Value': ks_stat,
                'Unit': 'None'
            }]
        )

    return {'ks_statistic': ks_stat, 'p_value': p_value, 'drift_detected': drift_detected}
```

Drift detection alerts feed into CloudWatch Alarms (SI-4 System Monitoring), which notify the ML engineering team for investigation. A model showing drift triggers a retraining pipeline with current data.
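A single noisy KS statistic shouldn't retrain a production model. One common guard is to require several consecutive drift detections before triggering the retraining pipeline — a sketch (the window size and the idea of resetting after firing are design assumptions, not part of the monitoring code above):

```python
class DriftDebouncer:
    """Fire a retraining trigger only after N consecutive drift detections."""

    def __init__(self, consecutive_required: int = 3):
        self.consecutive_required = consecutive_required
        self._streak = 0  # consecutive drift observations so far

    def observe(self, drift_detected: bool) -> bool:
        """Record one monitoring result; return True when retraining should fire."""
        self._streak = self._streak + 1 if drift_detected else 0
        if self._streak >= self.consecutive_required:
            self._streak = 0  # reset so one drift episode fires once
            return True
        return False
```

Each monitoring cycle feeds its `drift_detected` flag into `observe()`; only a sustained run of detections launches the (comparatively expensive) retraining job.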
ATO Considerations for Defense AI Systems
AI/ML systems in government require specific System Security Plan (SSP) documentation beyond what standard software systems need:
- AI model as a software component: Document the model in the SSP as a software component with its own configuration baseline (CM-3, CM-6)
- Training data handling: Document data provenance, access controls, and retention — especially for models trained on sensitive operational data
- Algorithmic transparency: For decision-support systems, document what the model outputs and what human review is required before action (this is an increasing DoD requirement)
- Model change control: Model updates are system changes — they need change control documentation under CM-3
Related: continuous ATO automation for DevSecOps programs, CSPM for government cloud ConMon.
Rutagon's engineering approach brings commercial MLOps rigor to defense cloud environments where compliance cannot be an afterthought. Contact Rutagon to discuss AI/ML architecture for your defense program.
Frequently Asked Questions
Is AWS SageMaker available in GovCloud for defense AI programs?
Yes. AWS SageMaker is available in us-gov-west-1 GovCloud and has DISA PA for IL4. For IL5 programs, the authorization path depends on your specific system boundary and impact level determination. Many programs use SageMaker for non-classified training with classified data remaining in a separate IL5-authorized data store. Confirm the specific authorization status with your ISSO before designing the system boundary.
What's the difference between MLOps and DataOps in the government context?
MLOps focuses on the model lifecycle: training, validation, deployment, monitoring, and retraining. DataOps focuses on the data pipeline: ingestion, transformation, quality, and governance. Government AI programs need both — DataOps to ensure the training data is accurate and properly handled, MLOps to ensure the trained model performs consistently and is deployed safely. Both require audit trails that standard commercial tools don't always provide.
How do we handle model explainability requirements for government AI programs?
DoD AI ethics principles and emerging acquisition requirements increasingly ask for explainable AI — systems where a human can understand why a model produced a given output. For tabular/structured data models, SHAP (SHapley Additive exPlanations) provides feature importance explanations that can be logged alongside predictions. For deep learning models, explainability is harder and may require simpler model selection. Document your explainability approach in the SSP and in any AI risk assessment required by the program.
What audit logging is required for AI/ML systems in federal programs?
At minimum, log: model version deployed, prediction inputs (or a hash for PII protection), prediction outputs, timestamp, and user or system requesting the prediction. For decision-support systems, also log whether the human operator accepted or overrode the model recommendation. This satisfies AU-2 (Event Logging) and creates the accountability record for AI-assisted decisions. Store logs with the same retention and protection requirements as other system audit logs.
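A minimal sketch of that audit record, hashing the raw input for PII protection (the field names are illustrative, not a mandated schema):

```python
import hashlib
from datetime import datetime, timezone
from typing import Optional

def prediction_audit_record(model_version: str, raw_input: bytes, output,
                            requester: str,
                            operator_action: Optional[str] = None) -> dict:
    """Build one AU-2 style audit entry for a model prediction.

    The input is stored as a SHA-256 hash rather than raw (possibly PII)
    content; operator_action captures whether a human accepted or
    overrode the recommendation on decision-support systems.
    """
    return {
        "event": "MODEL_PREDICTION",
        "model_version": model_version,
        "input_sha256": hashlib.sha256(raw_input).hexdigest(),
        "output": output,
        "requester": requester,
        "operator_action": operator_action,  # e.g. "accepted" / "overridden"
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "log_type": "AUDIT",
    }
```

The hash still lets auditors confirm which input produced a given output (by re-hashing the retained source record) without the log itself carrying sensitive content.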
Can Rutagon support model training on classified datasets?
Rutagon can architect the infrastructure for classified model training within a GovCloud IL5 boundary — the Kubernetes training jobs, encrypted S3 data stores, IRSA-based access controls, and audit logging. Access to actual classified datasets requires program-specific clearance sponsorship and security procedures. We work within the program's data handling requirements and design the MLOps infrastructure to meet the classification handling standards defined by the program ISSO.
Discuss your project with Rutagon
Contact Us →