Data Lake Architecture on AWS GovCloud

Federal agencies sit on vast stores of mission data that remain largely inaccessible for analysis due to architectural fragmentation, classification complexity, and the operational cost of maintaining custom data pipelines for every consumer. A well-designed data lake on AWS GovCloud changes that equation by providing a centralized, classification-aware repository that satisfies FISMA and FedRAMP requirements while making mission data actionable.

Rutagon's data lake implementations on AWS GovCloud serve both analytics workloads (internal mission analysis) and data sharing missions (providing authorized access to mission partners). This architectural guide covers the components that make government data lakes work in practice.

GovCloud Data Lake Architecture Overview

A federal data lake on GovCloud organizes around three zones:

┌─────────────────────────────────────────────────────────────────┐
│                        AWS GovCloud (US-West)                    │
│                                                                   │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────────────┐ │
│  │   Raw Zone   │   │  Curated Zone │   │   Consumption Zone   │ │
│  │  (Bronze)    │──▶│   (Silver)   │──▶│      (Gold)          │ │
│  │              │   │              │   │                       │ │
│  │ s3://raw/    │   │ s3://curated/│   │  Athena / Redshift /  │ │
│  │ Immutable    │   │ Transformed  │   │  SageMaker / QuickSight│ │
│  │ Encrypted    │   │ Validated    │   │                       │ │
│  └──────────────┘   └──────────────┘   └──────────────────────┘ │
│                                                                   │
│  ┌──────────────────────────────────────────────────────────────┐│
│  │               AWS Lake Formation — Access Control Layer       ││
│  │    ABAC policies · Column-level security · Row-level filters  ││
│  └──────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘

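Each zone has distinct access controls, retention policies, and encryption configurations. Raw zone data is immutable (S3 Object Lock): once written, an object version cannot be overwritten or deleted until its retention period expires, which supports NIST AU-9 and SI-7. In GOVERNANCE mode, principals holding s3:BypassGovernanceRetention can still remove the lock, so that permission must be tightly restricted; COMPLIANCE mode allows no override.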
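The difference between the two Object Lock modes is easiest to see as a decision rule. The following is a minimal sketch that models the rule in plain Python; it is not an AWS API, the function name and parameters are hypothetical, and legal holds are deliberately left out.

```python
from datetime import datetime, timezone


def can_delete_version(
    mode: str,
    retain_until: datetime,
    now: datetime,
    has_bypass_permission: bool = False,
) -> bool:
    """Model whether Object Lock would allow deleting a locked object version."""
    if mode not in ("GOVERNANCE", "COMPLIANCE"):
        raise ValueError(f"unknown Object Lock mode: {mode}")
    if now >= retain_until:
        return True  # retention period has expired
    if mode == "GOVERNANCE":
        # Only callers with s3:BypassGovernanceRetention may override the lock
        return has_bypass_permission
    return False  # COMPLIANCE: no principal can delete before expiry


retain_until = datetime(2033, 1, 1, tzinfo=timezone.utc)
today = datetime(2026, 1, 1, tzinfo=timezone.utc)
print(can_delete_version("GOVERNANCE", retain_until, today))
print(can_delete_version("COMPLIANCE", retain_until, today, has_bypass_permission=True))
```

Both calls report that deletion is refused. Only the GOVERNANCE case would change if the caller held the bypass permission.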

S3 Foundation: Encryption and Immutability

# Terraform — Raw zone bucket with Object Lock and KMS encryption
resource "aws_s3_bucket" "raw_zone" {
  bucket = "${var.program_name}-data-lake-raw"
  
  object_lock_enabled = true  # Immutable storage for AU-9 compliance
}

resource "aws_s3_bucket_object_lock_configuration" "raw_lock" {
  bucket = aws_s3_bucket.raw_zone.id
  
  rule {
    default_retention {
      mode  = "GOVERNANCE"  # Privileged principals can bypass; use "COMPLIANCE" for a lock nobody can shorten
      years = 7             # Match program retention requirement
    }
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "raw_encryption" {
  bucket = aws_s3_bucket.raw_zone.id
  
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.data_lake_key.id
    }
    bucket_key_enabled = true  # Reduces KMS API costs for high-throughput ingestion
  }
}

resource "aws_kms_key" "data_lake_key" {
  description              = "Data lake encryption key — FIPS 140-2 validated"
  enable_key_rotation      = true  # Annual rotation — NIST SC-12
  deletion_window_in_days  = 30
  multi_region             = false
  
  policy = jsonencode({
    # Key policy limits access to program boundary
    # ...
  })
}

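All GovCloud KMS keys are protected by FIPS 140-validated HSMs. AWS has reported FIPS 140-2 Security Level 3 validation for its KMS HSMs since 2023, so confirm the current certificate level before citing it in an ATO package. With enable_key_rotation set, KMS rotates the key material automatically (once a year by default), supporting NIST SC-12 (Cryptographic Key Establishment and Management).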
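Default bucket encryption covers objects written without encryption headers, but a bucket policy can additionally reject unencrypted transport and uploads that do not request SSE-KMS. The sketch below builds such a policy as a plain dictionary; the bucket name is a hypothetical placeholder, and the `aws-us-gov` partition in the ARN is the detail that most often trips up policies copied from commercial regions.

```python
import json


def raw_zone_bucket_policy(bucket_name: str) -> dict:
    """Build a bucket policy that denies non-TLS access and non-KMS uploads."""
    bucket_arn = f"arn:aws-us-gov:s3:::{bucket_name}"  # GovCloud partition
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                "Sid": "DenyNonKmsUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"{bucket_arn}/*",
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "aws:kms"
                    }
                },
            },
        ],
    }


# "demo-data-lake-raw" is an illustrative bucket name, not a real resource
print(json.dumps(raw_zone_bucket_policy("demo-data-lake-raw"), indent=2))
```

With the second statement in place, writers must send the SSE-KMS header explicitly, because a request that omits the header also fails the StringNotEquals test. Whether that strictness is wanted depends on the ingestion clients in use.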

Data Classification with Amazon Macie

CUI and other sensitive data in a federal data lake must be identified and handled appropriately. Amazon Macie provides automated discovery of PII and other sensitive data:

# Lambda: Macie findings processor for classification tagging
import boto3

macie = boto3.client('macie2', region_name='us-gov-west-1')
s3 = boto3.client('s3', region_name='us-gov-west-1')

def process_macie_finding(event: dict, context) -> None:
    """
    Process Macie finding and apply classification tag to S3 object.
    Called by EventBridge rule on Macie finding events.
    """
    # EventBridge delivers the finding itself in event['detail']
    finding_id = event['detail']['id']
    
    # Get full finding details
    response = macie.get_findings(findingIds=[finding_id])
    finding = response['findings'][0]
    
    # Policy findings describe the bucket only; there is no object to tag
    s3_object = finding['resourcesAffected'].get('s3Object')
    if s3_object is None:
        print(f"Skipping finding {finding_id}: no S3 object affected")
        return
    
    bucket = finding['resourcesAffected']['s3Bucket']['name']
    object_key = s3_object['key']
    
    # Map Macie finding type to classification level.
    # Sensitive data finding types look like 'SensitiveData:S3Object/Personal'.
    finding_type = finding['type']
    
    if finding_type.endswith('/Personal'):
        classification = 'CUI-PRIVACY'
    elif finding_type.endswith('/Financial'):
        classification = 'CUI-FINANCIAL'
    else:
        classification = 'CUI'
    
    # Apply classification tag.
    # Note: put_object_tagging replaces the object's entire existing tag set.
    s3.put_object_tagging(
        Bucket=bucket,
        Key=object_key,
        Tagging={
            'TagSet': [
                {'Key': 'DataClassification', 'Value': classification},
                {'Key': 'MacieFindingId', 'Value': finding_id},
                # boto3 returns createdAt as a datetime; tag values must be strings
                {'Key': 'ClassifiedAt', 'Value': finding['createdAt'].isoformat()},
            ]
        }
    )
    
    print(f"Tagged {bucket}/{object_key} as {classification}")

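Macie findings trigger automatic tagging of the affected objects. S3 object tags and Lake Formation LF-Tags are separate mechanisms, however: object tags can drive IAM and bucket policy conditions (for example s3:ExistingObjectTag/DataClassification), while Lake Formation evaluates LF-Tags assigned to catalog databases, tables, and columns. To bring CUI-tagged data under the CUI access control policies, the classification must also be applied as an LF-Tag on the corresponding catalog resource, ideally before the data is promoted to a zone that consumers can query.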
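One detail worth handling in the tagging step: S3's PutObjectTagging call replaces an object's whole tag set, so tags written by ingestion jobs would be lost unless they are merged first. The sketch below shows the merge as a pure function; the function name and the sample tags are illustrative, and in a Lambda the `existing` list would come from `get_object_tagging`.

```python
MAX_TAGS_PER_OBJECT = 10  # S3 limit on tags per object


def merge_tag_sets(existing: list[dict], updates: dict[str, str]) -> list[dict]:
    """Merge new tag values into an existing S3 TagSet, overwriting by key."""
    merged = {tag["Key"]: tag["Value"] for tag in existing}
    merged.update(updates)
    if len(merged) > MAX_TAGS_PER_OBJECT:
        raise ValueError(f"{len(merged)} tags exceeds the S3 per-object limit")
    # Sort for a stable, reviewable result
    return [{"Key": key, "Value": value} for key, value in sorted(merged.items())]


existing_tags = [
    {"Key": "Program", "Value": "demo"},
    {"Key": "DataClassification", "Value": "CUI"},
]
print(merge_tag_sets(existing_tags, {"DataClassification": "CUI-PRIVACY"}))
```

The returned list can be passed as the `TagSet` value, so the ingestion tag survives and the classification is upgraded in place.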

Lake Formation: Attribute-Based Access Control

AWS Lake Formation provides the access control layer that makes data lake access manageable at scale. Instead of managing S3 bucket policies for every user/role combination, Lake Formation evaluates table- and column-level permissions centrally.

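# Lake Formation: column-level grant with optional row-level filter
import hashlib
from typing import Optional

import boto3

lakeformation = boto3.client('lakeformation', region_name='us-gov-west-1')

def grant_data_lake_access(
    principal_arn: str,
    catalog_id: str,
    database: str,
    table: str,
    columns: list[str],
    row_filter: Optional[str] = None,
) -> dict:
    """
    Grant column-level and optional row-level access in Lake Formation.
    Used when provisioning analyst access to mission data.
    catalog_id is the AWS account ID that owns the Glue Data Catalog.
    """
    if row_filter:
        # Row-level security: a data cells filter pairs a row filter
        # expression (e.g., "mission_area = 'ALASKA'") with a column list.
        # Derive the filter name so each principal/filter pair gets its own.
        digest = hashlib.sha256(
            f"{principal_arn}|{row_filter}|{','.join(columns)}".encode()
        ).hexdigest()[:12]
        filter_name = f"{table}-{digest}"
        lakeformation.create_data_cells_filter(
            TableData={
                "TableCatalogId": catalog_id,
                "DatabaseName": database,
                "TableName": table,
                "Name": filter_name,
                "RowFilter": {"FilterExpression": row_filter},
                "ColumnNames": columns,
            }
        )
        resource = {
            "DataCellsFilter": {
                "TableCatalogId": catalog_id,
                "DatabaseName": database,
                "TableName": table,
                "Name": filter_name,
            }
        }
    else:
        resource = {
            "TableWithColumns": {
                "CatalogId": catalog_id,
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": columns,  # Only grant specific columns
            }
        }
    
    # Integrated engines such as Athena enforce the grant at query time
    return lakeformation.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": principal_arn},
        Resource=resource,
        Permissions=["SELECT"],
        PermissionsWithGrantOption=[],
    )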

Column-level security ensures analysts can query mission tables without accessing columns containing PII or classification-incompatible data. Row-level filters limit analysts to data within their authorized mission scope — without modifying the underlying data or application code.
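To make the combined effect concrete, the following sketch models what an analyst's query would return once a column list and a row filter are in force. It is a plain-Python illustration of the outcome, not of how Lake Formation implements it; the table contents, column names, and the `visible_rows` helper are all hypothetical.

```python
from typing import Callable

Row = dict[str, object]


def visible_rows(
    rows: list[Row],
    allowed_columns: list[str],
    row_predicate: Callable[[Row], bool],
) -> list[Row]:
    """Return only permitted rows, projected onto permitted columns."""
    return [
        {column: row[column] for column in allowed_columns}
        for row in rows
        if row_predicate(row)
    ]


TABLE = [
    {"sensor_id": "s1", "mission_area": "ALASKA", "operator_ssn": "000-00-0000", "temperature_celsius": -12.5},
    {"sensor_id": "s2", "mission_area": "PACIFIC", "operator_ssn": "000-00-0001", "temperature_celsius": 24.0},
]

analyst_view = visible_rows(
    TABLE,
    allowed_columns=["sensor_id", "temperature_celsius"],
    row_predicate=lambda row: row["mission_area"] == "ALASKA",
)
print(analyst_view)
```

The underlying table is unchanged. The analyst simply never receives the `operator_ssn` column or the PACIFIC row.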

Glue Data Catalog: Schema Management

AWS Glue catalogs all data lake tables and manages schema evolution:

# Terraform — Glue database and crawler for raw zone cataloging
resource "aws_glue_catalog_database" "mission_raw" {
  name = "${var.program_name}_raw"
  
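  # Note: grants to IAM_ALLOWED_PRINCIPALS let IAM policies alone authorize access
  # to new tables. Review this block where Lake Formation grants must be the only path.
  create_table_default_permission {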
    permissions = ["SELECT"]
    principal {
      data_lake_principal_identifier = "IAM_ALLOWED_PRINCIPALS"
    }
  }
}

resource "aws_glue_crawler" "raw_zone_crawler" {
  name          = "${var.program_name}-raw-crawler"
  database_name = aws_glue_catalog_database.mission_raw.name
  role          = aws_iam_role.glue_crawler.arn
  
  s3_target {
    path = "s3://${aws_s3_bucket.raw_zone.id}/telemetry/"
    
    exclusions = [
      "*.tmp",
      "*.temp",
    ]
  }
  
  configuration = jsonencode({
    Version = 1.0
    Grouping = {
      TableGroupingPolicy = "CombineCompatibleSchemas"
    }
  })
  
  schedule = "cron(0 2 * * ? *)"  # Nightly schema discovery
}

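Schema evolution is handled automatically: when upstream data producers add fields, the crawler detects them and updates the catalog schema on its next run. New columns inherit the LF-Tags assigned to their table, so tag-based Lake Formation grants extend to them without a new grant; a column that needs a stricter classification than its table must be tagged explicitly.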
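Because a newly discovered column can carry sensitive data, it may be worth reviewing what a crawler run changed before consumers rely on it. A minimal sketch of that comparison follows, assuming schemas are represented as simple column-name-to-type dictionaries (in practice they would be read from the Glue catalog); the function names and sample schemas are hypothetical.

```python
def added_columns(old_schema: dict[str, str], new_schema: dict[str, str]) -> dict[str, str]:
    """Columns present in the new schema but not in the old one."""
    return {name: dtype for name, dtype in new_schema.items() if name not in old_schema}


def changed_types(old_schema: dict[str, str], new_schema: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Columns whose type differs between the two schema versions."""
    return {
        name: (old_schema[name], new_schema[name])
        for name in old_schema
        if name in new_schema and old_schema[name] != new_schema[name]
    }


before = {"sensor_id": "string", "temperature_celsius": "double"}
after = {"sensor_id": "string", "temperature_celsius": "string", "operator_id": "string"}

print(added_columns(before, after))
print(changed_types(before, after))
```

Here `operator_id` would be flagged for classification review, and the type change on `temperature_celsius` would be flagged as a likely upstream regression.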

Analytics Query Layer with Athena

Athena provides serverless SQL queries against the data lake — no cluster management, pay-per-query pricing:

-- Athena query example: mission telemetry analysis
-- Lake Formation transparently applies row and column filters
-- User only sees data they're authorized for

SELECT 
    DATE_TRUNC('hour', "timestamp") AS hour,
    sensor_id,
    AVG(temperature_celsius) AS avg_temp,
    MAX(temperature_celsius) AS peak_temp,
    COUNT(*) AS reading_count
FROM mission_telemetry.curated_sensor_data
WHERE 
    -- Half-open range covers all of January, including readings on the 31st
    "timestamp" >= TIMESTAMP '2026-01-01 00:00:00'
    AND "timestamp" < TIMESTAMP '2026-02-01 00:00:00'
    AND sensor_status = 'NOMINAL'
GROUP BY 1, 2
ORDER BY 1, 2;

Athena query results are written to a results bucket that carries the same encryption and access controls as the data lake, so the query results path does not become an unguarded copy of the data.
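Result encryption can also be requested per query. The sketch below assembles the arguments for Athena's `start_query_execution` call with SSE-KMS applied to the output location; the bucket name, key ARN, account ID, and workgroup are hypothetical placeholders, and the dictionary is built separately so it can be inspected without AWS credentials.

```python
def build_athena_request(
    sql: str,
    database: str,
    results_bucket: str,
    kms_key_arn: str,
    workgroup: str = "primary",
) -> dict:
    """Assemble start_query_execution arguments with KMS-encrypted results."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {
            "OutputLocation": f"s3://{results_bucket}/athena-results/",
            "EncryptionConfiguration": {
                "EncryptionOption": "SSE_KMS",
                "KmsKey": kms_key_arn,
            },
        },
        "WorkGroup": workgroup,
    }


request = build_athena_request(
    sql="SELECT COUNT(*) FROM curated_sensor_data",
    database="mission_telemetry",
    results_bucket="demo-athena-results",  # illustrative name
    kms_key_arn="arn:aws-us-gov:kms:us-gov-west-1:111122223333:key/example",  # placeholder
)
print(request["ResultConfiguration"])

# With credentials available, the request would be submitted as:
#   boto3.client("athena", region_name="us-gov-west-1").start_query_execution(**request)
```

A workgroup that enforces its own configuration overrides these client-side settings, which is usually the safer arrangement because analysts cannot redirect results to a different bucket.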

NIST Control Coverage

| Control | Implementation |
|---|---|
| AU-9 | S3 Object Lock: immutable raw zone |
| SC-12 | KMS key rotation, FIPS 140-2 HSMs |
| SC-28 | KMS server-side encryption on all zones |
| AC-3 | Lake Formation ABAC + column-level security |
| AC-16 | Macie classification tagging |
| SI-7 | Object Lock prevents modification of raw data |
| AU-12 | S3 access logging + Lake Formation audit logs |

Federal data lake implementations require careful attention to classification boundaries, access control granularity, and audit logging from the start — retrofitting these controls after data is in the lake is significantly more expensive. Rutagon builds these controls in at the architecture level, ensuring the lake is compliant at launch rather than after the ATO audit.

Frequently Asked Questions

What is the difference between a data warehouse and a data lake on GovCloud?

A data warehouse (Redshift) stores structured, transformed data optimized for analytical queries — schema is defined before data is loaded (schema-on-write). A data lake (S3 + Glue + Athena) stores data in its raw or minimally transformed format and applies schema when queried (schema-on-read). For federal programs, data lakes are preferred as the primary store because they preserve raw data fidelity, support diverse data types (telemetry, documents, imagery), and defer schema decisions until query time. Redshift is often used as a performance layer on top of the data lake for high-frequency analytical workloads.
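Schema-on-read can be illustrated in a few lines: the raw records are stored untouched, and a schema is applied only when they are read. This is a minimal sketch with hypothetical records and a hypothetical `read_with_schema` helper, not a model of how Athena or Glue parse data.

```python
import json

# Raw zone content: stored exactly as produced, including fields nobody queries yet
RAW_LINES = [
    '{"sensor_id": "s1", "temperature_celsius": "21.5"}',
    '{"sensor_id": "s2", "temperature_celsius": "19.0", "firmware": "1.2"}',
]


def read_with_schema(lines: list[str], schema: dict) -> list[dict]:
    """Apply a column -> cast mapping at read time; missing fields become None."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append({
            column: cast(record[column]) if record.get(column) is not None else None
            for column, cast in schema.items()
        })
    return rows


# Today's schema ignores "firmware"; a later schema can pick it up without reloading data
print(read_with_schema(RAW_LINES, {"sensor_id": str, "temperature_celsius": float}))
print(read_with_schema(RAW_LINES, {"sensor_id": str, "firmware": str}))
```

The second call reads a field the first schema never declared, which is the property that lets schema decisions wait until query time.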

How does Lake Formation ABAC differ from S3 bucket policies?

S3 bucket policies operate at the object level — they're powerful but require managing large numbers of policies as users and data multiply. Lake Formation ABAC operates at the table and column level, using IAM session tags to match user attributes (clearance level, mission area, role) to data attributes (classification tag, mission scope). This attribute matching scales to thousands of users without proportional policy management overhead, and policies are evaluated centrally rather than distributed across hundreds of S3 bucket policies.
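The scaling argument rests on matching attributes rather than enumerating resources. The sketch below is a simplified model of that matching, in which a grant is an expression of tag keys with allowed values and a table qualifies only if it satisfies every key. The tag names, table names, and helper functions are hypothetical, and the real evaluation in Lake Formation has more rules than this.

```python
def matches_expression(resource_tags: dict[str, str], expression: dict[str, set[str]]) -> bool:
    """True when the resource carries an allowed value for every key in the expression."""
    return all(resource_tags.get(key) in allowed for key, allowed in expression.items())


def accessible_tables(expression: dict[str, set[str]], table_tags: dict[str, dict[str, str]]) -> list[str]:
    """List tables whose tags satisfy the principal's grant expression."""
    return sorted(
        name for name, tags in table_tags.items() if matches_expression(tags, expression)
    )


TABLE_TAGS = {
    "alaska_telemetry": {"classification": "CUI", "mission_area": "ALASKA"},
    "pacific_telemetry": {"classification": "CUI", "mission_area": "PACIFIC"},
    "alaska_personnel": {"classification": "CUI-PRIVACY", "mission_area": "ALASKA"},
}

analyst_grant = {"classification": {"CUI"}, "mission_area": {"ALASKA"}}
print(accessible_tables(analyst_grant, TABLE_TAGS))
```

Adding a fourth ALASKA table tagged CUI would make it visible to this analyst with no new grant, which is the property that keeps policy counts flat as data grows.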

Is AWS Macie available in GovCloud?

Yes, Amazon Macie is available in AWS GovCloud (US-West) as a FedRAMP-authorized service. It supports PII detection, financial data classification, and custom sensitive data types using regular expressions and keyword matching. Macie findings in GovCloud are not transmitted outside the GovCloud region — all finding data stays within the program boundary, satisfying data residency requirements.

What retention periods are typical for federal data lake raw zones?

Retention periods are determined by the records schedule applicable to the data type, as established by NARA (National Archives and Records Administration) and agency records management policies. Common retention periods include 3 years (routine operational data), 75 years (personnel records), and permanent retention (significant historical mission data). S3 Object Lock COMPLIANCE mode enforces these retention periods at the storage level, preventing accidental or intentional deletion before the retention period expires.
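When a retention date is set explicitly per object rather than through a bucket default, the retain-until timestamp has to be computed by the writer. The following sketch uses calendar-year arithmetic; the `retain_until` helper is hypothetical, the leap-day handling is one possible convention that should be checked against records guidance, and S3's own calculation for bucket default retention may differ.

```python
from datetime import datetime, timezone


def retain_until(written_at: datetime, years: int) -> datetime:
    """Add whole calendar years to the write time to get a retain-until date."""
    try:
        return written_at.replace(year=written_at.year + years)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap year; roll forward to Mar 1
        # so the object is never released early (assumed convention).
        return written_at.replace(year=written_at.year + years, month=3, day=1)


written = datetime(2026, 1, 15, 12, 0, tzinfo=timezone.utc)
print(retain_until(written, 7).isoformat())
```

The resulting timestamp is the kind of value passed as `ObjectLockRetainUntilDate` when writing an object with an explicit lock.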

How does the data lake integrate with mission analytics tools?

The consumption zone is designed for integration: Athena provides standard JDBC/ODBC connectivity for BI tools (Tableau, Power BI); SageMaker connects directly to the data lake for ML training workloads; custom analytics services access data through the Glue catalog with Lake Formation authorization. All these integrations use the same IAM/Lake Formation identity — there are no separate user stores or access control systems for individual tools.