Operational Runbook¶
Common Tasks & Procedures¶
Table of Contents¶
- Scaling Operations
- Database Operations
- Application Deployment
- Monitoring & Debugging
- Maintenance Windows
- Emergency Procedures
Scaling Operations¶
1. Increase Application Capacity (Production)¶
Scenario: Traffic spike detected, need to handle 3000 concurrent users
Steps:
# 1. Check current capacity
aws autoscaling describe-auto-scaling-groups \
--auto-scaling-group-names basecoat-portal-asg \
--query 'AutoScalingGroups[0].[MinSize,DesiredCapacity,MaxSize,Instances[*].[InstanceId,LifecycleState]]'
# Output example:
# [3, 5, 20, [["i-001...", "InService"], ["i-002...", "InService"], ...]]
# 2. Set new desired capacity
aws autoscaling set-desired-capacity \
--auto-scaling-group-name basecoat-portal-asg \
--desired-capacity 12
# 3. Monitor scale-up progress
watch -n 30 'aws autoscaling describe-auto-scaling-groups \
--auto-scaling-group-names basecoat-portal-asg \
--query "AutoScalingGroups[0].Instances[*].[InstanceId,LifecycleState]" | grep -c InService'
# 4. Verify ALB target health
aws elbv2 describe-target-health \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:...:targetgroup/basecoat-portal/...
# 5. Monitor application metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/ApplicationELB \
--metric-name TargetResponseTime \
--start-time $(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 60 \
--statistics Average,Maximum
# Expected: Response time remains < 500ms with new capacity
Success Criteria:
- All new instances show "InService" in the ALB target group
- Response time p99 < 500ms
- Error rate < 0.1%
- CPU utilization across the fleet < 60%
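The p99 criterion can be checked directly with CloudWatch extended statistics; a sketch, with the LoadBalancer dimension value elided like the ARNs above:
aws cloudwatch get-metric-statistics \
--namespace AWS/ApplicationELB \
--metric-name TargetResponseTime \
--dimensions Name=LoadBalancer,Value=app/basecoat-portal/... \
--start-time $(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 60 \
--extended-statistics p99
# Expected: all datapoints < 0.5 (TargetResponseTime is reported in seconds)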
2. Scale Down Application (Post-Traffic Spike)¶
# 1. Set reduced capacity
aws autoscaling set-desired-capacity \
--auto-scaling-group-name basecoat-portal-asg \
--desired-capacity 3
# 2. Monitor connection draining (bounded by the target group's deregistration delay; 90 seconds here)
watch -n 10 'aws elbv2 describe-target-health \
--target-group-arn ... | grep -c "Deregistering\|InService"'
# 3. Verify no dropped connections
# Check application logs for incomplete requests
aws logs tail /aws/basecoat-portal/application --follow --filter-pattern "ERROR"
3. Database Connection Scaling¶
Scenario: Database connection pool exhausted (> 800 connections)
# 1. Check connection usage
psql -h basecoat-portal-db.cxxx.us-east-1.rds.amazonaws.com \
-U postgres -d basecoat_db \
-c "SELECT datname, count(*) FROM pg_stat_activity GROUP BY datname;"
# 2. Increase RDS Proxy max connections
terraform apply -var-file=environments/prod/terraform.tfvars \
-var='rds_proxy_max_connections=200'
# 3. Update application connection pool size
# Edit application config or Secrets Manager parameter
aws secretsmanager update-secret \
--secret-id basecoat-portal/db/pool \
--secret-string '{"pool_size":50,"max_overflow":20}'
# 4. Verify connection health
psql -h proxy-endpoint ... \
-c "SELECT count(*) FROM pg_stat_activity WHERE state='active';"
Database Operations¶
1. Create Manual Backup¶
# Create snapshot
aws rds create-db-snapshot \
--db-instance-identifier basecoat-portal-db \
--db-snapshot-identifier basecoat-backup-$(date +%Y%m%d-%H%M%S)
# Monitor snapshot creation
watch -n 30 'aws rds describe-db-snapshots \
--db-snapshot-identifier basecoat-backup-... \
--query "DBSnapshots[0].[PercentProgress,Status]"'
# Expected: 100% Complete
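Alternatively, the CLI's built-in waiter blocks until the snapshot is ready, which is easier to script than a watch loop:
# Polls every 30s until the snapshot status is "available"
aws rds wait db-snapshot-available \
--db-snapshot-identifier basecoat-backup-...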
2. Point-in-Time Recovery¶
# List recovery points
aws rds describe-db-instances \
--db-instance-identifier basecoat-portal-db \
--query 'DBInstances[0].[LatestRestorableTime,AvailabilityZone]'
# Restore to specific point (e.g., 1 hour ago)
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier basecoat-portal-db \
--target-db-instance-identifier basecoat-portal-db-restored \
--restore-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--availability-zone us-east-1a
# Wait for restore (~10 minutes)
watch -n 30 'aws rds describe-db-instances \
--db-instance-identifier basecoat-portal-db-restored \
--query "DBInstances[0].DBInstanceStatus"'
# Test data
psql -h basecoat-portal-db-restored.cxxx.us-east-1.rds.amazonaws.com \
-U postgres -d basecoat_db \
-c "SELECT COUNT(*) FROM users; SELECT MAX(updated_at) FROM audit_log;"
# Rename to take over the production endpoint (after validation).
# The old identifier must be freed first:
# aws rds modify-db-instance --db-instance-identifier basecoat-portal-db \
# --new-db-instance-identifier basecoat-portal-db-old --apply-immediately
# aws rds modify-db-instance --db-instance-identifier basecoat-portal-db-restored \
# --new-db-instance-identifier basecoat-portal-db --apply-immediately
3. Database Parameter Tuning¶
# Create custom parameter group
aws rds create-db-parameter-group \
--db-parameter-group-name basecoat-prod-params \
--db-parameter-group-family postgres15 \
--description "Production tuned parameters"
# Modify parameters (shared_buffers is given in 8 KB pages; the formula below
# targets ~50% of instance memory and is an example, not a recommendation)
aws rds modify-db-parameter-group \
--db-parameter-group-name basecoat-prod-params \
--parameters '[{"ParameterName":"shared_buffers","ParameterValue":"{DBInstanceClassMemory/16384}","ApplyMethod":"pending-reboot"}]'
# Apply to database (static parameters such as shared_buffers take effect only after a reboot)
aws rds modify-db-instance \
--db-instance-identifier basecoat-portal-db \
--db-parameter-group-name basecoat-prod-params \
--apply-immediately
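To confirm the parameter registered and see whether it is still pending a reboot:
aws rds describe-db-parameters \
--db-parameter-group-name basecoat-prod-params \
--query "Parameters[?ParameterName=='shared_buffers'].[ParameterValue,ApplyMethod]"
# Instance-level apply status ("in-sync" vs "pending-reboot"):
aws rds describe-db-instances \
--db-instance-identifier basecoat-portal-db \
--query 'DBInstances[0].DBParameterGroups'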
4. Read Replica Promotion¶
# Promote read replica to standalone
aws rds promote-read-replica \
--db-instance-identifier basecoat-portal-db-read-replica \
--backup-retention-period 30
# Monitor promotion
watch -n 10 'aws rds describe-db-instances \
--db-instance-identifier basecoat-portal-db-read-replica \
--query "DBInstances[0].DBInstanceStatus"'
# After promotion, update application connection
aws secretsmanager update-secret \
--secret-id basecoat-portal/db/endpoint \
--secret-string "basecoat-portal-db-read-replica.cxxx.us-east-1.rds.amazonaws.com"
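A quick sanity check that the promoted instance accepts writes (replicas report "on" until promotion completes):
psql -h basecoat-portal-db-read-replica.cxxx.us-east-1.rds.amazonaws.com \
-U postgres -d basecoat_db \
-c "SHOW transaction_read_only;"
# Expected: off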
Application Deployment¶
1. Blue/Green Deployment¶
# 1. Create new launch template version
aws ec2 create-launch-template-version \
--launch-template-name basecoat-lt \
--source-version '$Latest' \
--launch-template-data '{
"ImageId":"ami-0c123456789abcdef",
"UserData":"base64_encoded_script"
}'
# Get the new version number (as a bare number, not quoted JSON)
NEW_VERSION=$(aws ec2 describe-launch-template-versions \
--launch-template-name basecoat-lt \
--versions '$Latest' \
--query 'LaunchTemplateVersions[0].VersionNumber' --output text)
# 2. Create new ASG with new template (blue), registered to a pre-created blue target group
aws autoscaling create-auto-scaling-group \
--auto-scaling-group-name basecoat-portal-asg-blue \
--launch-template LaunchTemplateName=basecoat-lt,Version=$NEW_VERSION \
--min-size 3 \
--max-size 20 \
--desired-capacity 3 \
--target-group-arns arn:...:targetgroup/basecoat-portal-blue/... \
--vpc-zone-identifier subnet-xxx,subnet-yyy,subnet-zzz
# 3. Wait for new instances to become healthy
watch -n 20 'aws elbv2 describe-target-health \
--target-group-arn ... | grep -c InService'
# 4. Shift the listener's default action to the blue target group
aws elbv2 modify-listener \
--listener-arn arn:aws:elasticloadbalancing:... \
--default-actions Type=forward,TargetGroupArn=arn:...:targetgroup/basecoat-portal-blue/...
# 5. Verify traffic (monitor error rate)
watch -n 10 'aws cloudwatch get-metric-statistics \
--namespace AWS/ApplicationELB \
--metric-name HTTPCode_Target_5XX_Count \
--start-time $(date -u -d "10 minutes ago" +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 60 \
--statistics Sum'
# 6. Delete old ASG (after 5-10 min monitoring)
aws autoscaling delete-auto-scaling-group \
--auto-scaling-group-name basecoat-portal-asg \
--force-delete
2. Canary Deployment (10% traffic)¶
# 1. Create new ASG with the updated version, registered to a canary target group
aws autoscaling create-auto-scaling-group \
--auto-scaling-group-name basecoat-portal-asg-canary \
--launch-template LaunchTemplateName=basecoat-lt,Version=$NEW_VERSION \
--min-size 1 \
--max-size 1 \
--desired-capacity 1 \
--target-group-arns arn:...:targetgroup/basecoat-portal-canary/...
# 2. Split traffic 90/10 between the existing and canary target groups
# (ALB does not support per-target weights; weighting is done on the listener forward action)
aws elbv2 modify-listener \
--listener-arn arn:aws:elasticloadbalancing:... \
--default-actions '[{"Type":"forward","ForwardConfig":{"TargetGroups":[{"TargetGroupArn":"arn:...:targetgroup/basecoat-portal/...","Weight":90},{"TargetGroupArn":"arn:...:targetgroup/basecoat-portal-canary/...","Weight":10}]}}]'
# 3. Monitor canary metrics (1 hour)
watch -n 60 'aws cloudwatch get-metric-statistics \
--namespace AWS/ApplicationELB \
--metric-name HTTPCode_Target_5XX_Count \
--start-time $(date -u -d "1 hour ago" +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Sum'
# Expected: Error rate same as existing traffic
# 4a. If successful: raise the canary weight to 50%, then 100% (see the example below)
# 4b. If issues: detach and terminate the canary instance, then delete the ASG
aws autoscaling detach-instances \
--instance-ids i-canary-instance-id \
--auto-scaling-group-name basecoat-portal-asg-canary \
--should-decrement-desired-capacity
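Promoting the canary (step 4a) reuses the same modify-listener call with adjusted weights, e.g. 50/50, then 0/100:
aws elbv2 modify-listener \
--listener-arn arn:aws:elasticloadbalancing:... \
--default-actions '[{"Type":"forward","ForwardConfig":{"TargetGroups":[{"TargetGroupArn":"arn:...:targetgroup/basecoat-portal/...","Weight":50},{"TargetGroupArn":"arn:...:targetgroup/basecoat-portal-canary/...","Weight":50}]}}]'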
3. Rollback Deployment¶
# 1. Identify previous working versions (newest first; describe-launch-template-versions has no --sort-by flag)
aws ec2 describe-launch-template-versions \
--launch-template-name basecoat-lt \
--query 'reverse(sort_by(LaunchTemplateVersions, &CreateTime))[:5].[VersionNumber,CreateTime]' \
--output table
# 2. Update ASG to use previous template
aws autoscaling update-auto-scaling-group \
--auto-scaling-group-name basecoat-portal-asg \
--launch-template LaunchTemplateName=basecoat-lt,Version=2
# 3. Replace running instances
aws autoscaling start-instance-refresh \
--auto-scaling-group-name basecoat-portal-asg \
--preferences '{"MinHealthyPercentage":50,"InstanceWarmup":300}'
# 4. Monitor instance refresh
watch -n 20 'aws autoscaling describe-instance-refreshes \
--auto-scaling-group-name basecoat-portal-asg \
--query "InstanceRefreshes[0].[PercentageComplete,Status]"'
Monitoring & Debugging¶
1. High CPU Alert Response¶
# 1. Identify affected instances
aws ec2 describe-instances \
--filters "Name=tag:aws:autoscaling:groupName,Values=basecoat-portal-asg" \
--query 'Reservations[].Instances[].[InstanceId,PrivateIpAddress,State.Name]'
# 2. Check CloudWatch metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--start-time $(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average,Maximum \
--dimensions Name=AutoScalingGroupName,Value=basecoat-portal-asg
# 3. Check application logs from the last 30 minutes
aws logs tail /aws/basecoat-portal/application --since 30m | head -50
# 4. Options:
# a) Scale up (covered in the scaling section)
# b) Investigate a memory leak (see the SSM sketch below)
# c) Check for runaway queries
# d) Deploy a code fix
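For (b) and (c), top processes can be inspected without SSH, assuming the SSM agent is installed and the instance profile allows ssm:SendCommand (both assumptions about this environment); the instance ID is a placeholder:
aws ssm send-command \
--document-name "AWS-RunShellScript" \
--instance-ids i-xxx \
--parameters 'commands=["top -b -n 1 | head -20"]'
# Fetch the output once the command completes
aws ssm list-command-invocations \
--instance-id i-xxx --details \
--query 'CommandInvocations[0].CommandPlugins[0].Output' --output text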
2. High Memory Usage (Database)¶
# 1. Connect to database
psql -h basecoat-portal-db.cxxx.us-east-1.rds.amazonaws.com \
-U postgres -d basecoat_db
-- 2. Check active queries (steps 2-5 run inside the psql session, so comments are SQL-style)
SELECT pid, query, query_start, wait_event FROM pg_stat_activity
WHERE state = 'active'
ORDER BY query_start DESC;
-- 3. Kill long-running queries (if needed)
SELECT pg_terminate_backend(pid) FROM pg_stat_activity
WHERE query_start < now() - interval '1 hour'
AND state = 'active';
-- 4. Check the largest tables (a rough proxy for bloat)
SELECT schemaname, tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
-- 5. Run ANALYZE to update planner statistics
ANALYZE;
# 6. Upgrade RDS instance class if needed
aws rds modify-db-instance \
--db-instance-identifier basecoat-portal-db \
--db-instance-class db.r6i.2xlarge \
--apply-immediately
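If the pg_stat_statements extension is enabled (an assumption; it is not on by default), the heaviest statements can be listed directly (total_exec_time is the PostgreSQL 13+ column name):
psql -h basecoat-portal-db.cxxx.us-east-1.rds.amazonaws.com \
-U postgres -d basecoat_db \
-c "SELECT calls, round(total_exec_time) AS total_ms, left(query, 60) AS query FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;"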
3. Cache Eviction Issues¶
# 1. Check ElastiCache metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/ElastiCache \
--metric-name Evictions \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Sum \
--dimensions Name=CacheClusterId,Value=basecoat-portal-cache-001
# (ElastiCache publishes metrics per cache cluster/node, e.g. <group>-001, not per replication group)
# 2. If evictions > 0, check memory usage
aws elasticache describe-replication-groups \
--replication-group-id basecoat-portal-cache \
--query 'ReplicationGroups[0].[CacheNodeType,AutomaticFailover,EngineVersion]'
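describe-replication-groups reports the node type but not memory pressure; for that, check the DatabaseMemoryUsagePercentage metric (Redis engines; the -001 node suffix below is an assumption):
aws cloudwatch get-metric-statistics \
--namespace AWS/ElastiCache \
--metric-name DatabaseMemoryUsagePercentage \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average,Maximum \
--dimensions Name=CacheClusterId,Value=basecoat-portal-cache-001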
# 3. Options:
# a) Scale up the node type (step 4 below)
# b) Add more nodes/shards
# c) Tune the eviction policy (maxmemory-policy parameter)
# d) Optimize application cache usage (TTLs, value sizes)
# 4. Scale up cache
aws elasticache modify-replication-group \
--replication-group-id basecoat-portal-cache \
--cache-node-type cache.r6g.xlarge \
--apply-immediately
4. Database Connection Pool Exhaustion¶
# 1. Check connection count
psql -h basecoat-portal-db.cxxx.us-east-1.rds.amazonaws.com \
-U postgres -d basecoat_db \
-c "SELECT datname, count(*) FROM pg_stat_activity GROUP BY datname ORDER BY count DESC;"
-- 2. Identify connection hogs (steps 2-3 run inside the psql session)
SELECT application_name, client_addr, state, query_start, query
FROM pg_stat_activity
ORDER BY query_start DESC LIMIT 20;
-- 3. Kill idle connections (if safe); state_change marks when the session went idle
SELECT pg_terminate_backend(pid) FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < now() - interval '30 minutes';
# 4. Increase connection pool
terraform apply -var-file=environments/prod/terraform.tfvars \
-var='rds_proxy_max_connections=300'
# 5. Check RDS Proxy status and connection pool configuration
# (pool settings live on the proxy target group, not the proxy itself)
aws rds describe-db-proxies \
--db-proxy-name basecoat-portal-proxy \
--query 'DBProxies[0].Status'
aws rds describe-db-proxy-target-groups \
--db-proxy-name basecoat-portal-proxy \
--query 'TargetGroups[0].ConnectionPoolConfig.[MaxConnectionsPercent,SessionPinningFilters]'
Maintenance Windows¶
1. Scheduled Database Maintenance¶
Window: Sunday 2-4 AM UTC
# 1. Update maintenance window
aws rds modify-db-instance \
--db-instance-identifier basecoat-portal-db \
--preferred-maintenance-window sun:02:00-sun:04:00
# 2. Notify stakeholders via status page
# 3. Monitor during window
aws rds describe-db-instances \
--db-instance-identifier basecoat-portal-db \
--query 'DBInstances[0].[DBInstanceStatus,PendingModifiedValues]'
# 4. After completion, verify
psql -h endpoint -U postgres -d basecoat_db -c "SELECT version();"
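Pending OS and engine maintenance can also be listed ahead of the window:
aws rds describe-pending-maintenance-actions \
--query 'PendingMaintenanceActions[*].[ResourceIdentifier,PendingMaintenanceActionDetails]'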
2. Certificate Renewal¶
# 1. Check certificate expiration and renewal eligibility
aws acm describe-certificate \
--certificate-arn arn:aws:acm:us-east-1:...:certificate/xxx \
--query 'Certificate.[NotAfter,Status,RenewalEligibility]'
# 2. Request a new certificate if fewer than 30 days remain
# (certificates issued through ACM and managed by Terraform renew automatically)
# 3. Update ALB listener
aws elbv2 modify-listener \
--listener-arn arn:aws:elasticloadbalancing:... \
--certificates CertificateArn=arn:aws:acm:...
# 4. Verify SSL handshake
openssl s_client -connect basecoat-portal.prod.example.com:443 -showcerts
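To read just the expiry date from the certificate actually being served:
echo | openssl s_client -connect basecoat-portal.prod.example.com:443 \
-servername basecoat-portal.prod.example.com 2>/dev/null | openssl x509 -noout -enddate
# Expected: notAfter well beyond 30 days out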
3. Security Patch Application¶
# 1. Create new AMI with patches
# Build process outside this script
# Deploy new version using Blue/Green deployment (see section above)
# 2. Verify instances are running the patched AMI
aws ec2 describe-instances \
--instance-ids i-xxx \
--query 'Reservations[0].Instances[0].ImageId'
# 3. Check OS version
# SSH to instance and run: cat /etc/os-release
Emergency Procedures¶
SEV-1: Complete Service Down¶
# 1. Declare incident
# Notify: VP Eng, on-call manager, customer success
# 2. Assess what's down
curl -v https://basecoat-portal.prod.example.com/health
aws elbv2 describe-target-health --target-group-arn ...
aws rds describe-db-instances --db-instance-identifier ...
# 3. Check AWS status page
# https://phd.aws.amazon.com
# 4. If regional outage, initiate DR failover
# See: docs/DISASTER_RECOVERY.md - Scenario 4
# 5. If application code issue, rollback
# See: Application Deployment - Rollback section
# 6. Status updates every 15 minutes
# Update status page, notify stakeholders
SEV-2: Data Inconsistency¶
# 1. Isolate database
# Stop application writes (temporarily)
# 2. Check replication lag
aws rds describe-db-instances \
--db-instance-identifier basecoat-portal-db-read-replica \
--query 'DBInstances[0].StatusInfos'
# 3. Take snapshot for forensics
aws rds create-db-snapshot \
--db-instance-identifier basecoat-portal-db \
--db-snapshot-identifier forensics-$(date +%s)
# 4. Query affected data
psql -h endpoint -U postgres -d basecoat_db \
-c "SELECT * FROM audit_log WHERE action='DELETE' ORDER BY created_at DESC LIMIT 20;"
# 5. Restore from point-in-time backup
# See: Database Operations - Point-in-Time Recovery
# 6. Validate data consistency
# Compare record counts, checksums across replicas
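A minimal consistency probe, assuming the users table has an integer id column; run it against both the primary and the restored endpoint and compare the output:
psql -h endpoint -U postgres -d basecoat_db \
-c "SELECT count(*) AS rows, md5(string_agg(id::text, ',' ORDER BY id)) AS checksum FROM users;"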
Last Updated: May 2024 | Document Version: 1.0 | Owner: Infrastructure Team