Cloud cost optimization

Find waste without breaking SLOs: rightsizing, autoscaling, cold storage tiers, and egress—useful when FinOps wants agents to suggest cost views by account and tag.

The SKILL should break spend down by resource type (compute, storage, network, managed services) and tie each line item to business tags (team, env, product). Avoid blunt “turn everything off” moves that hit production.

Common levers: idle volumes/snapshots, over-provisioned K8s requests, unused elastic IPs, cross-region replication traffic, log/metric retention—each with a validation method and expected savings band.

Separate one-off optimizations from continuous governance (budget alerts, anomaly detection, naming/tag standards), and call out cash-flow and commitment risk for RIs / Savings Plans.
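
To make the continuous-governance side concrete, a daily-spend anomaly check can compare each day against a trailing baseline. The sketch below is only an illustration and not a substitute for a managed anomaly service; the function name `find_cost_anomalies`, the 7-day window, and the 3-sigma threshold are assumptions, not values from this guide.

```python
from statistics import mean, stdev

def find_cost_anomalies(daily_costs, window=7, threshold=3.0):
    """Return indexes of days whose cost exceeds the trailing mean by more
    than `threshold` standard deviations (hypothetical defaults)."""
    anomalies = []
    for i in range(window, len(daily_costs)):
        baseline = daily_costs[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        # Guard against a perfectly flat baseline: use at least 1% of the mean
        limit = mu + threshold * max(sigma, 0.01 * mu)
        if daily_costs[i] > limit:
            anomalies.append(i)
    return anomalies

if __name__ == "__main__":
    costs = [100, 102, 98, 101, 99, 103, 100, 180, 101]
    print(find_cost_anomalies(costs))
```

A spike raises the baseline for the following days, so a real pipeline would likely exclude flagged days from later windows.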

Cost review flow

  [ Roll up bills by account / tag / service → Top N ]
        │
        ▼
  ┌─────────────┐     Idle: unmounted volumes, empty node pools, zombie snapshots
  │ Waste scan   │──── Over-provision: CPU/mem P95, K8s request vs actual
  └─────────────┘
        │
        ▼
  ┌─────────────┐     Next-size families / storage tiers / egress paths vs SLO history
  │ Rightsizing  │──── Change window, rollback, watch p99 latency & errors
  └─────────────┘
        │
        ▼
  ┌─────────────┐     Interruptible → Spot / preemptible; mix & buffer capacity
  │ Spot & scale │──── Baseline → on-demand / reserved; steady long-term → RI / SP
  └─────────────┘
        │
        ▼
  ┌─────────────┐     Budgets, anomalies, tag policy; ticket owners & compliance exceptions
  │ Governance   │──── One-time savings vs recurring FinOps cadence
  └─────────────┘

Rightsizing & utilization evidence

Before shrinking anything, compare p99 latency and error rate history; the SKILL should state the measurement window (e.g. 14d/30d) and whether peak events (sales, batch) are included.
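
One way to make the measurement window explicit is to emit it alongside the utilization numbers. The following is a minimal sketch under stated assumptions: the input shape (`samples_by_day` as a date-to-samples mapping), the function names, and the nearest-rank percentile method are illustrative choices, not part of any monitoring API.

```python
import math
from datetime import date, timedelta

def percentile(samples, pct):
    """Nearest-rank percentile for an integer `pct`; samples must be non-empty."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def utilization_evidence(samples_by_day, window_days, peak_days, today):
    """Summarize CPU evidence for the last `window_days` days.

    samples_by_day: {date: [cpu_percent, ...]} (hypothetical input shape)
    peak_days: dates of known peak events such as sales or batch runs
    """
    start = today - timedelta(days=window_days)
    in_window = {d: v for d, v in samples_by_day.items() if start < d <= today}
    all_samples = [x for v in in_window.values() for x in v]
    return {
        "window": f"{window_days}d",
        "p95_cpu": percentile(all_samples, 95),
        "max_cpu": max(all_samples),
        # Peak events that actually fall inside the measured window
        "peaks_included": sorted(d for d in peak_days if d in in_window),
    }
```

If `peaks_included` comes back empty, the evidence does not cover a peak and the recommendation should say so.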

Compute & orchestration

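VMs: compare average and P95 CPU, memory, and disk throughput; check whether an ARM or smaller SKU would fit.
Kubernetes: when requests far exceed actual usage, trim requests/limits and watch HPA behavior, since HPA utilization targets are computed against requests.
Prefer autoscaling for stateless workloads; stateful workloads need a migration/replication cost analysis first.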
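
The Kubernetes request check above can be sketched as a small rule: size the request from P95 usage plus headroom, and only recommend a change when the saving is worth a rollout. The function name, the 20% headroom, the 50m floor, and the 20% minimum-saving cutoff are all assumptions for illustration.

```python
import math

def suggest_cpu_request(current_request_m, p95_usage_m,
                        headroom_pct=20, min_request_m=50, min_saving_pct=20):
    """Suggest a CPU request in millicores, or None if no change is warranted.

    target = P95 usage plus headroom, never below `min_request_m`.
    """
    target = max(min_request_m, math.ceil(p95_usage_m * (100 + headroom_pct) / 100))
    # Recommend only when the new request frees at least `min_saving_pct`
    if target * 100 <= current_request_m * (100 - min_saving_pct):
        return target
    return None
```

Because HPA CPU utilization is a percentage of the request, lowering the request raises reported utilization for the same load, so scaling thresholds may need a second look after the change.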

Storage & network

Object/block: hot/warm/cold tiers, lifecycle rules, dedupe snapshots.
Network: cross-region/public egress, NAT, CDN vs private paths on the bill.
Managed: read replicas, serverless tiers, log/metric retention days.

Cloud cost analysis: AWS Cost Explorer API example (Python boto3):

import boto3
from datetime import date, timedelta

# Cost Explorer is served from the us-east-1 endpoint
ce = boto3.client('ce', region_name='us-east-1')

def _time_period(days):
    # 'End' is exclusive, so the window covers the `days` days before today
    end = date.today()
    start = end - timedelta(days=days)
    return {'Start': start.isoformat(), 'End': end.isoformat()}

def _sum_by_group(**params):
    # A 30-day window usually spans two calendar months, so MONTHLY granularity
    # returns several ResultsByTime entries. Sum all of them and follow pagination.
    totals = {}
    while True:
        resp = ce.get_cost_and_usage(**params)
        for period in resp['ResultsByTime']:
            for group in period['Groups']:
                key = group['Keys'][0]
                amount = float(group['Metrics']['UnblendedCost']['Amount'])
                totals[key] = totals.get(key, 0.0) + amount
        token = resp.get('NextPageToken')
        if not token:
            return totals
        params['NextPageToken'] = token

# Aggregate the last 30 days of usage costs by service to find top spenders
def get_cost_by_service(days=30):
    totals = _sum_by_group(
        TimePeriod=_time_period(days),
        Granularity='MONTHLY',
        Filter={'Dimensions': {'Key': 'RECORD_TYPE', 'Values': ['Usage']}},
        GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}],
        Metrics=['UnblendedCost'],
    )
    results = [{'service': k, 'cost_usd': round(v, 2)} for k, v in totals.items()]
    return sorted(results, key=lambda x: x['cost_usd'], reverse=True)

# Cost allocation by tag (analyze by team/environment).
# Group keys come back as "<tag_key>$<value>"; an empty value means untagged spend.
def get_cost_by_tag(tag_key='team', days=30):
    totals = _sum_by_group(
        TimePeriod=_time_period(days),
        Granularity='MONTHLY',
        GroupBy=[{'Type': 'TAG', 'Key': tag_key}],
        Metrics=['UnblendedCost'],
    )
    results = [{'tag': k, 'cost_usd': round(v, 2)} for k, v in totals.items()]
    return sorted(results, key=lambda x: x['cost_usd'], reverse=True)

if __name__ == '__main__':
    top_services = get_cost_by_service()
    print("Top 5 service costs:")
    for s in top_services[:5]:
        print(f"  {s['service']}: ${s['cost_usd']}")

Spot / preemptible & interruption tolerance

Spot (or cloud “preemptible”) fits restartable, queueable, checkpointable work; the SKILL must spell out data consistency and SLA degradation when interruptions happen.

Mix with on-demand / RI: fixed baseline, Spot absorbs burst.
Good fits: batch, render farms, elastic CI workers, stateless scale groups.
Interruption handling: draining, job retries, queue visibility timeouts aligned.
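
The "fixed baseline, Spot absorbs burst" split above can be sketched from an hourly demand series. This is a simplified illustration: the function name, the 20th-percentile baseline, and the 10% Spot buffer are assumptions, and real sizing would also weigh interruption rates per instance pool.

```python
def split_capacity(hourly_demand, baseline_percentile=20, spot_buffer_pct=10):
    """Split instance demand into an on-demand/RI baseline and a Spot burst target."""
    ordered = sorted(hourly_demand)
    # Baseline: a low percentile of demand, i.e. capacity needed nearly all the time
    idx = (len(ordered) - 1) * baseline_percentile // 100
    baseline = ordered[idx]
    peak_burst = max(hourly_demand) - baseline
    # Over-request Spot slightly to absorb interruptions (integer ceiling division)
    spot_target = -(-peak_burst * (100 + spot_buffer_pct) // 100)
    return {"baseline_on_demand": baseline, "peak_spot": spot_target}

if __name__ == "__main__":
    print(split_capacity([4, 4, 5, 6, 10, 14, 9, 5]))
```

A higher `baseline_percentile` shifts more capacity onto committed or on-demand instances and reduces exposure to interruptions at a higher steady cost.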

Spot instance interruption handler (Node.js, monitoring EC2 metadata service interruption notices):

// spot-interrupt-handler.js: poll for EC2 Spot interruption notices and exit gracefully
const http = require('http');

const METADATA_HOST = '169.254.169.254';
// Returns 404 until an interruption is scheduled, then 200 with a JSON body
const INSTANCE_ACTION_PATH = '/latest/meta-data/spot/instance-action';
const CHECK_INTERVAL_MS = 5000;   // poll every 5 seconds

let isShuttingDown = false;

// Note: this sends IMDSv1-style requests. If the instance enforces IMDSv2, first
// PUT /latest/api/token and pass the result in the X-aws-ec2-metadata-token header.
function checkSpotInterruption() {
  return new Promise((resolve) => {
    const req = http.get(
      { hostname: METADATA_HOST, path: INSTANCE_ACTION_PATH, timeout: 2000 },
      (res) => {
        res.resume();                      // drain the body so the socket is released
        resolve(res.statusCode === 200);   // 200 means an interruption notice exists
      }
    );
    req.on('error', () => resolve(false));
    req.on('timeout', () => { req.destroy(); resolve(false); });
  });
}

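// `server` (an http.Server) and `saveCheckpoint()` are placeholders that the
// application is assumed to define; they are not provided by this snippet.
async function gracefulShutdown() {
  if (isShuttingDown) return;
  isShuttingDown = true;

  console.log('Shutdown signal received (Spot interruption or SIGTERM); draining...');

  // 1. Stop accepting new connections; resolves once in-flight requests finish
  const drained = new Promise((resolve) => server.close(resolve));

  // 2. Save current task state to SQS / DynamoDB (checkpoint)
  try {
    await saveCheckpoint();
  } catch (err) {
    console.error('Checkpoint failed:', err);
  }

  // 3. Wait for in-flight requests, capped at 25s (the Spot notice arrives about 2 min ahead)
  await Promise.race([drained, new Promise((resolve) => setTimeout(resolve, 25000))]);

  process.exit(0);
}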

// Poll for interruption notices
setInterval(async () => {
  if (!isShuttingDown && await checkSpotInterruption()) {
    gracefulShutdown();
  }
}, CHECK_INTERVAL_MS);

// Also listen for SIGTERM (from ECS/K8s draining)
process.on('SIGTERM', gracefulShutdown);

Lambda scheduled shutdown of non-production environments (Terraform configuration):

# Lambda scheduled stop for non-prod RDS / ECS instances
# terraform/modules/cost-scheduler/main.tf

resource "aws_lambda_function" "stop_nonprod" {
  function_name = "stop-nonprod-resources"
  runtime       = "python3.12"
  handler       = "index.handler"
  role          = aws_iam_role.lambda_exec.arn
  filename      = "${path.module}/stop_nonprod.zip"

  environment {
    variables = {
      RDS_INSTANCES  = "dev-db,staging-db"
      ECS_CLUSTERS   = "dev-cluster,staging-cluster"
      TARGET_ENV_TAG = "non-production"
    }
  }
}

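# EventBridge cron expressions are evaluated in UTC; the local times below assume a UTC+8 team.
# Stop at 10 PM local time on weekdays (14:00 UTC)
resource "aws_cloudwatch_event_rule" "stop_schedule" {
  name                = "stop-nonprod-nightly"
  schedule_expression = "cron(0 14 ? * MON-FRI *)"  # Mon-Fri 14:00 UTC
}

# Start at 8 AM local time on weekdays (00:00 UTC)
resource "aws_cloudwatch_event_rule" "start_schedule" {
  name                = "start-nonprod-morning"
  schedule_expression = "cron(0 0 ? * MON-FRI *)"
}

resource "aws_cloudwatch_event_target" "stop_target" {
  rule  = aws_cloudwatch_event_rule.stop_schedule.name
  arn   = aws_lambda_function.stop_nonprod.arn
  input = jsonencode({ action = "stop" })
}

# The same function is assumed to branch on `action` and handle "start" as well
resource "aws_cloudwatch_event_target" "start_target" {
  rule  = aws_cloudwatch_event_rule.start_schedule.name
  arn   = aws_lambda_function.stop_nonprod.arn
  input = jsonencode({ action = "start" })
}

# EventBridge needs permission to invoke the function from each rule
resource "aws_lambda_permission" "allow_stop_schedule" {
  statement_id  = "AllowStopSchedule"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.stop_nonprod.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.stop_schedule.arn
}

resource "aws_lambda_permission" "allow_start_schedule" {
  statement_id  = "AllowStartSchedule"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.stop_nonprod.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.start_schedule.arn
}

# Estimated savings: resources run 14 h x 5 weekdays = 70 of 168 h/week, so they are
# off about 58% of the time (storage and other non-hourly charges still apply)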
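
The savings estimate for this schedule is plain arithmetic and is easy to check. The sketch below counts stopped hours per week for an 8 AM start and 10 PM stop on five weekdays; the function name is illustrative, and the result is a fraction of running hours, not of the bill, since storage and other non-hourly charges continue while resources are stopped.

```python
HOURS_PER_WEEK = 7 * 24  # 168

def weekly_off_fraction(start_hour=8, stop_hour=22, running_days=5):
    """Fraction of the week a resource is stopped under a weekday-only schedule."""
    running_hours = (stop_hour - start_hour) * running_days
    return (HOURS_PER_WEEK - running_hours) / HOURS_PER_WEEK

if __name__ == "__main__":
    # 14 h x 5 days = 70 running hours, so 98 of 168 hours are off
    print(f"{weekly_off_fraction():.1%}")
```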

Ongoing governance & commitments

When security/compliance conflicts (e.g. audit logs must be kept), compliance wins.
RIs / Savings Plans: coverage, transferability, cash flow, commitment term—document in the output.
Attach “owning team + likely ticket” not just numbers; bake budget alerts and anomaly subscriptions into routine.
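
Commitment risk can be illustrated with a simplified break-even calculation: a commitment is paid for whether or not it is used, so it only pays off above a certain utilization. The function names and the flat-discount model are assumptions; real RI and Savings Plan pricing varies by term, payment option, and instance family.

```python
def commitment_net_saving(on_demand_hourly, discount_pct, committed_hours, used_hours):
    """Net saving (positive) or loss (negative) versus paying on-demand only.

    The commitment is billed for all `committed_hours`, used or not.
    """
    commit_cost = on_demand_hourly * (100 - discount_pct) / 100 * committed_hours
    on_demand_cost = on_demand_hourly * used_hours
    return on_demand_cost - commit_cost

def break_even_utilization(discount_pct):
    """Minimum fraction of committed hours that must be used to break even."""
    return (100 - discount_pct) / 100
```

With a hypothetical 30% discount, the commitment loses money whenever less than 70% of the committed hours are actually used, which is why the baseline should come from steady, long-lived workloads.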

S3 lifecycle policy configuration (hot/warm/cold tiering) and mandatory resource tag IAM Policy:

# S3 lifecycle policy (Terraform aws_s3_bucket_lifecycle_configuration)
resource "aws_s3_bucket_lifecycle_configuration" "app_logs" {
  bucket = aws_s3_bucket.app_logs.id

  rule {
    id     = "log-tiering"
    status = "Enabled"

    filter {
      prefix = "logs/"
    }

    transition {
      days          = 30
      storage_class = "STANDARD_IA"   # move to infrequent access after 30 days (~46% savings)
    }

    transition {
      days          = 90
      storage_class = "GLACIER_IR"    # move to Glacier Instant Retrieval after 90 days (AWS cites up to 68% lower storage cost than STANDARD_IA)
    }

    transition {
      days          = 365
      storage_class = "DEEP_ARCHIVE"  # move to deep archive after 1 year (~95% savings)
    }

    expiration {
      days = 2555   # delete after 7 years (compliance requirement)
    }

    noncurrent_version_expiration {
      noncurrent_days = 30   # delete old versions after 30 days
    }
  }
}

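# IAM Policy: deny creation of new resources that lack the team, environment, or cost-center tag.
# Keys inside one condition operator are ANDed, so each required tag gets its own Deny
# statement; a single combined block would only fire when every tag is missing at once.
resource "aws_iam_policy" "require_tags" {
  name = "RequireResourceTags"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      for tag in ["team", "environment", "cost-center"] : {
        Effect = "Deny"
        Action = [
          "ec2:RunInstances",
          "rds:CreateDBInstance",
          "elasticloadbalancing:CreateLoadBalancer",
        ]
        # Scope to the resources that carry the tags; RunInstances also touches
        # subnets, images, and similar resources that never have request tags
        Resource = [
          "arn:aws:ec2:*:*:instance/*",
          "arn:aws:rds:*:*:db:*",
          "arn:aws:elasticloadbalancing:*:*:loadbalancer/*",
        ]
        Condition = {
          "Null" = {
            "aws:RequestTag/${tag}" = "true"   # deny when this tag is absent from the request
          }
        }
      }
    ]
  })
}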
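
A deny policy like this only affects newly created resources, so existing resources still need an audit. The sketch below checks an inventory for missing mandatory tags; the resource shape (`id` plus a `tags` dict) and the function names are hypothetical, and the inventory would come from whatever export or API the team already uses.

```python
REQUIRED_TAGS = ("team", "environment", "cost-center")

def missing_tags(resource_tags, required=REQUIRED_TAGS):
    """Return required tag keys that are absent or blank on a resource."""
    return [k for k in required if not str(resource_tags.get(k, "")).strip()]

def audit(resources):
    """Map resource id -> missing tags, listing only non-compliant resources."""
    report = {}
    for res in resources:
        gaps = missing_tags(res.get("tags", {}))
        if gaps:
            report[res["id"]] = gaps
    return report

if __name__ == "__main__":
    inventory = [
        {"id": "i-1", "tags": {"team": "a", "environment": "dev", "cost-center": "42"}},
        {"id": "vol-2", "tags": {"team": "a", "environment": " "}},
    ]
    print(audit(inventory))
```

The report keys can be attached to tickets for the owning team, in line with the "owning team + likely ticket" guidance.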

Focus checklist generator

Select review dimensions to emit Markdown for tickets or SKILL appendices—aligned with the flow, rightsizing, and Spot sections above.

Review dimensions

Billing & tag rollups
Idle & zombie resources
Compute rightsizing
K8s request/limit
Storage tiers & lifecycle
Egress & replication
Spot / preemptible candidates
RI / Savings Plan
Budgets & tag governance

Unchecked items are omitted; after copying, add concrete accounts, resource IDs, and owners in your PR or FinOps board.

---
name: cost-optimization
description: Cloud cost analysis, Spot strategy, automated scheduling, and storage tiering
tags: [finops, aws, cost, cloud]
---
# Cost analysis
1. AWS Cost Explorer API: aggregate by service/tag to identify top spenders
2. Cost allocation: team + environment + cost-center as three mandatory tags
3. Measurement window: 14d/30d including peak events (sales, batch) before rightsizing decisions

# Compute optimization
4. Rightsizing: compare P95/P99 CPU/memory utilization; validate SLO history before downscaling
5. ARM migration (Graviton): roughly 20% lower list price on comparable sizes, with AWS citing up to 40% better price-performance; confirm dependency compatibility
6. Spot use cases: batch jobs, CI workers, stateless services with no local state
7. Spot interruption handling: listen for EC2 metadata notice (2-min warning) + SIGTERM graceful exit
8. Mixed strategy: baseline on On-Demand/RI, elastic burst uses Spot

# Storage optimization
9. S3 lifecycle: 30d→STANDARD_IA, 90d→GLACIER_IR, 365d→DEEP_ARCHIVE
10. Delete old versions after 30 days; retain logs for 7 years for compliance, then delete
11. Idle scan: unattached EBS volumes, stale snapshots, unassigned elastic IPs

# Governance & automation
12. Lambda scheduled shutdown: stop non-prod at 10 PM weekdays, start at 8 AM (about 58% fewer running hours)
13. IAM Policy enforces tags: Deny resource creation without team/environment/cost-center tags
14. Budget alerts: AWS Budgets thresholds by account/tag; alert at 80% consumed
15. RI/Savings Plan: 60-70% coverage (stable baseline); don't over-commit to retain flexibility
16. When compliance conflicts, compliance wins (audit logs must not be shortened or deleted)
