07 — AWS & Infrastructure Interview Guide
Priority: MEDIUM — You own AWS infra at Intensel. Know the services you use well. Startups care less about certifications, more about practical experience.
Table of Contents
- Core AWS Services (Must Know)
- Networking & Security
- Docker & Containers
- CI/CD & Deployment
- Infrastructure as Code
- Monitoring & Logging
- Cost Optimization
- Common Interview Questions
- Resources
Core AWS Services (Must Know)
Compute
EC2 (Elastic Compute Cloud):
- Virtual machines in the cloud
- Instance types: general purpose (t3, m5), compute-optimized (c5), memory-optimized (r5), GPU (p3)
- Pricing: on-demand, reserved (1-3 year commitment, 40-60% savings),
spot (up to 90% savings, can be interrupted)
- Auto Scaling Groups: scale EC2 count based on metrics (CPU, queue depth)
- Launch Templates: define AMI, instance type, security groups, user data
Lambda (Serverless):
- Run code without managing servers
- Pay per invocation + duration (billed in 1ms increments)
- Max 15 min execution, 10GB memory
- Cold starts: first invocation has higher latency
- Triggers: API Gateway, SQS, S3 events, EventBridge, etc.
- Good for: event-driven processing, webhooks, scheduled tasks
- Not good for: long-running tasks, sustained high traffic (cost)
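The event-driven model above can be sketched as a minimal handler. The event shape follows S3's notification format; the handler body is a bare illustration (a real function would call boto3 here), and the bucket/key values are whatever the trigger delivers:

```python
import json
import urllib.parse

def handler(event, context):
    """Minimal S3-triggered Lambda sketch: extract bucket/key from each
    notification record and return a summary. S3 URL-encodes object keys,
    hence the unquote_plus."""
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        results.append({"bucket": bucket, "key": key})
    return {"statusCode": 200, "body": json.dumps(results)}
```

Locally you can invoke it with a fake event dict, which is also how unit tests for Lambda handlers usually work.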
ECS (Elastic Container Service):
- Run Docker containers on AWS
- Fargate: serverless containers (no EC2 management)
- EC2 launch type: you manage EC2 instances
- Task definitions: container image, CPU/memory, ports, env vars
- Services: desired count, load balancer integration, auto-scaling
EKS (Elastic Kubernetes Service):
- Managed Kubernetes
- More complex than ECS, but industry standard
- Good for: multi-cloud strategy, complex orchestration needs
Storage
S3 (Simple Storage Service):
- Object storage: key-value (key = path, value = file)
- Virtually unlimited storage
- Durability: 99.999999999% (11 nines)
- Storage classes: Standard, Infrequent Access, Glacier (archival)
- Features: versioning, lifecycle rules, event notifications
- Presigned URLs: time-limited access to private objects
- Multipart upload: for large files
Your experience: "Raw climate data and backups stored in S3."
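To make the presigned-URL idea concrete, here is a toy signer: an expiry plus an HMAC over path and expiry. This is not S3's actual SigV4 scheme (boto3's generate_presigned_url handles that for you); it only illustrates why the URL is both time-limited and tamper-proof. The secret and domain are hypothetical:

```python
import hashlib
import hmac
import time

SECRET = b"demo-signing-key"  # hypothetical; real S3 signs with your AWS credentials

def presign(path, expires_in=3600, now=None):
    """Toy presigned URL: embed an expiry timestamp and an HMAC over
    path + expiry, so neither can be changed without invalidating sig."""
    exp = (now if now is not None else int(time.time())) + expires_in
    sig = hmac.new(SECRET, f"{path}:{exp}".encode(), hashlib.sha256).hexdigest()
    return f"https://example.com{path}?expires={exp}&sig={sig}"

def verify(path, exp, sig, now=None):
    """Server-side check: signature must match and expiry must be in the future."""
    current = now if now is not None else int(time.time())
    expected = hmac.new(SECRET, f"{path}:{exp}".encode(), hashlib.sha256).hexdigest()
    return current < exp and hmac.compare_digest(sig, expected)
```

The same two properties (expiry baked into the signature, constant-time comparison) are what SigV4 gives you at scale.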
EBS (Elastic Block Store):
- Block storage for EC2 (like a hard drive)
- Types: gp3 (general), io2 (high IOPS), st1 (throughput)
- Snapshots for backup (stored in S3)
EFS (Elastic File System):
- Managed NFS, accessible from multiple EC2 instances
- Good for: shared file storage across instances
Database
RDS (Relational Database Service):
- Managed PostgreSQL, MySQL, MariaDB, Oracle, SQL Server
- Handles: backups, patching, failover, read replicas
- Multi-AZ: synchronous standby for high availability
- Read replicas: async replication for read scaling
- Performance Insights: query performance monitoring
Your experience: "PostgreSQL on RDS for the climate platform."
ElastiCache:
- Managed Redis or Memcached
- Cluster mode: Redis cluster for horizontal scaling
- Replication: read replicas for Redis
DynamoDB:
- Managed NoSQL (key-value + document)
- Single-digit millisecond latency at any scale
- Provisioned or on-demand capacity
- Global tables: multi-region replication
- Streams: change data capture (CDC)
Messaging
SQS (Simple Queue Service):
- Managed message queue
- Standard: best-effort ordering, at-least-once delivery
- FIFO: exactly-once processing, strict ordering
- Dead Letter Queue support
- Long polling: efficient message retrieval
- Visibility timeout: message hidden while being processed
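Visibility timeout is the SQS detail interviewers probe most, so a minimal in-memory sketch of the semantics helps (this mimics the behavior, not the SQS API):

```python
class VisibilityQueue:
    """In-memory sketch of SQS visibility-timeout semantics: a received
    message is hidden for `timeout` seconds; if the consumer does not
    delete it within that window, it becomes receivable again (which is
    why processing must be idempotent)."""
    def __init__(self, timeout=30.0):
        self.timeout = timeout
        self.messages = {}  # id -> (body, visible_at)
        self._next_id = 0

    def send(self, body):
        self._next_id += 1
        self.messages[self._next_id] = (body, 0.0)
        return self._next_id

    def receive(self, now):
        for msg_id, (body, visible_at) in self.messages.items():
            if now >= visible_at:
                # Hide the message for the visibility window
                self.messages[msg_id] = (body, now + self.timeout)
                return msg_id, body
        return None

    def delete(self, msg_id):
        self.messages.pop(msg_id, None)
```

The redelivery-on-timeout path is exactly what a Dead Letter Queue counts: after N failed receives, SQS moves the message to the DLQ.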
SNS (Simple Notification Service):
- Pub/sub messaging
- Topics: publish once, deliver to multiple subscribers
- Subscribers: SQS, Lambda, HTTP, email, SMS
- Fan-out pattern: SNS → multiple SQS queues
EventBridge:
- Serverless event bus
- Rules: route events based on patterns
- Scheduled events (cron replacement)
- Good for: decoupling services, event-driven architectures
Networking
VPC (Virtual Private Cloud):
- Isolated network in AWS
- Subnets: public (internet-facing), private (internal)
- Internet Gateway: connects VPC to internet
- NAT Gateway: lets private subnets access internet (outbound only)
- Route tables: control traffic routing
ALB (Application Load Balancer):
- Layer 7 (HTTP/HTTPS) load balancing
- Path-based routing: /api → service A, /images → service B
- Host-based routing: api.example.com → service A
- Health checks: automatically removes unhealthy targets
- SSL/TLS termination
NLB (Network Load Balancer):
- Layer 4 (TCP/UDP) load balancing
- Ultra-low latency
- Static IP address
- Good for: non-HTTP protocols, extreme performance
Route 53:
- DNS service
- Routing policies: simple, weighted, latency, failover, geolocation
- Health checks: automatic DNS failover
CloudFront:
- CDN (Content Delivery Network)
- Edge locations worldwide
- Origins: S3, ALB, custom HTTP
- SSL/TLS certificates (ACM)
- Cache behaviors per path pattern
Networking & Security
VPC Architecture
Typical production VPC:
VPC (10.0.0.0/16)
├── Public Subnet A (10.0.1.0/24) - AZ a
│ ├── ALB
│ └── NAT Gateway
├── Public Subnet B (10.0.2.0/24) - AZ b
│ └── ALB
├── Private Subnet A (10.0.3.0/24) - AZ a
│ ├── EC2 / ECS instances
│ └── Application servers
├── Private Subnet B (10.0.4.0/24) - AZ b
│ ├── EC2 / ECS instances
│ └── Application servers
├── Database Subnet A (10.0.5.0/24) - AZ a
│ └── RDS Primary
└── Database Subnet B (10.0.6.0/24) - AZ b
└── RDS Standby
Security Groups (stateful firewall):
- ALB SG: inbound 80/443 from 0.0.0.0/0
- App SG: inbound 8000 from ALB SG only
- DB SG: inbound 5432 from App SG only
IAM (Identity & Access Management)
Key Concepts:
- Users: human identities
- Roles: for services (EC2, Lambda) — no credentials, assumed temporarily
- Policies: JSON documents defining permissions
- Principle of least privilege: only grant needed permissions
Best Practices:
- Never use root account for daily work
- Use IAM roles for services (not access keys)
- Enable MFA for all users
- Use IAM policies for fine-grained access control
- Cross-account access via assume-role
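Least privilege in practice means a narrowly scoped policy document. A sketch for read/write access to the raw-data bucket mentioned earlier (bucket name hypothetical):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::climate-raw-data/*"
    }
  ]
}
```

Attached to an ECS task role or EC2 instance profile, this grants object access to one bucket and nothing else — no `s3:*`, no `Resource: "*"`.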
Docker & Containers
Docker Fundamentals
# Multi-stage build (production-ready)
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt
FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /root/.local /root/.local
COPY . .
ENV PATH=/root/.local/bin:$PATH
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Key Concepts:
- Image: immutable blueprint (layers, cached)
- Container: running instance of an image
- Layer caching: order Dockerfile instructions from least → most changing
- Multi-stage builds: smaller final images (don't include build tools)
- .dockerignore: exclude files from build context
Best Practices:
- Use specific base image tags (python:3.12-slim, not python:latest)
- Run as non-root user
- COPY requirements.txt before COPY . (layer caching for deps)
- Use multi-stage builds
- One process per container
- Health checks: HEALTHCHECK CMD curl -f http://localhost:8000/health
- Don't store secrets in images (use environment variables, secrets managers)
Docker Compose (Development)
# docker-compose.yml
services:
  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      DATABASE_URL: postgresql://user:pass@db:5432/mydb
      REDIS_URL: redis://redis:6379
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_started
  worker:
    build: .
    command: celery -A tasks worker
    environment:
      DATABASE_URL: postgresql://user:pass@db:5432/mydb
      REDIS_URL: redis://redis:6379
  db:
    image: postgis/postgis:16-3.4
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: pg_isready -U user
      interval: 5s
      retries: 5
  redis:
    image: redis:7-alpine
volumes:
  pgdata:
CI/CD & Deployment
Deployment Strategies
1. Rolling Deployment:
- Replace instances one by one
- Zero downtime
- Rollback: redeploy previous version
- Risk: mixed versions during deployment
2. Blue-Green Deployment:
- Two identical environments (Blue = current, Green = new)
- Switch routing from Blue to Green
- Instant rollback: switch back to Blue
- Cost: need double infrastructure temporarily
3. Canary Deployment:
- Route small % of traffic to new version (1% → 5% → 25% → 100%)
- Monitor metrics at each step
- If errors spike, route back to old version
- Best for: reducing risk, critical services
4. Feature Flags:
- Deploy code but hide features behind flags
- Enable for specific users, %, or regions
- Decouple deployment from release
- Tools: LaunchDarkly, Unleash, custom Redis-based
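A custom flag service usually reduces to deterministic bucketing: hash the flag and user into a 0–99 bucket and compare against the rollout percentage. A minimal stdlib sketch (flag names and thresholds hypothetical; a Redis-backed version would just store the percentage there):

```python
import hashlib

def flag_enabled(flag, user_id, rollout_pct):
    """Deterministic percentage rollout: hash flag+user into 0-99 and
    compare to the rollout percentage. The same user always gets the
    same answer for a given flag, so their experience stays stable as
    the rollout grows from 1% toward 100%."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_pct
```

Hashing on flag+user (rather than user alone) keeps different flags' rollouts uncorrelated.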
CI/CD Pipeline (Typical)
Code Push → Build → Test → Deploy
1. Build:
- Install dependencies
- Build Docker image
- Push to container registry (ECR)
2. Test:
- Unit tests (pytest)
- Integration tests (testcontainers)
- Linting (ruff, mypy)
- Security scan (bandit, trivy)
3. Deploy:
- Staging → smoke tests → production
- Rolling or blue-green deployment
- Post-deploy health checks
- Automatic rollback on failure
Tools: GitHub Actions, GitLab CI, AWS CodePipeline, ArgoCD
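The three stages above map directly onto a workflow file. A hedged GitHub Actions sketch — the ECR_REPO variable and the ECS cluster/service names are hypothetical placeholders:

```yaml
# Illustrative pipeline; env var ECR_REPO and the cluster/service names are made up
name: deploy
on:
  push:
    branches: [main]
jobs:
  build-test-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Test
        run: |
          pip install -r requirements.txt
          ruff check .
          pytest
      - name: Build and push image
        run: |
          docker build -t "$ECR_REPO:$GITHUB_SHA" .
          docker push "$ECR_REPO:$GITHUB_SHA"
      - name: Deploy
        run: aws ecs update-service --cluster prod --service api --force-new-deployment
```

Tagging images with the commit SHA (rather than `latest`) is what makes "redeploy previous version" rollbacks trivial.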
Infrastructure as Code
Terraform (most common):
- Declarative: describe desired state
- Provider ecosystem (AWS, GCP, Azure, etc.)
- State management: track current infrastructure
- Plan → Apply workflow (preview changes before applying)
CloudFormation (AWS-specific):
- Native AWS IaC
- YAML/JSON templates
- Stack management: create/update/delete resources as a unit
Your answer: "I use Docker for containerization, with Terraform/CloudFormation
for infrastructure. Our deployment pipeline: GitHub Actions runs tests,
builds Docker image, pushes to ECR, and deploys to ECS with rolling updates."
Monitoring & Logging
CloudWatch:
- Metrics: CPU, memory, network, custom metrics
- Logs: centralized log storage from all services
- Alarms: trigger on metric thresholds → SNS notification
- Dashboards: visualization
Key Metrics to Monitor:
- Application: request rate, error rate, latency (p50, p95, p99)
- Database: connections, query latency, replication lag, disk usage
- Cache: hit rate, memory usage, evictions
- Queue: depth, age of oldest message, consumer lag
- Infrastructure: CPU, memory, disk, network
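Percentile latency comes up constantly in interviews, and it is worth being able to compute one by hand. A nearest-rank sketch (monitoring systems differ in interpolation details, so treat this as one common definition, not the only one):

```python
def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it. Averages hide tail latency; p95/p99
    expose the slow requests users actually notice."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [12, 15, 14, 13, 250, 16, 14, 15, 13, 900]
```

Here the mean is ~127ms, but p50 is 14ms and p99 is 900ms — one slow outlier dominates the tail while barely moving the median.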
Alerting Rules:
- Error rate > 1% for 5 minutes → page on-call
- p99 latency > 2s for 5 minutes → warn
- Queue depth growing for 15 minutes → warn
- Disk > 80% → warn, > 90% → critical
- Health check failure → page immediately
Cost Optimization
Common Strategies:
1. Right-sizing: match instance types to actual usage
2. Reserved Instances/Savings Plans: commit for 1-3 years (40-60% savings)
3. Spot Instances: for fault-tolerant workloads (up to 90% savings)
4. Auto-scaling: scale down during low traffic
5. S3 lifecycle rules: move old data to cheaper storage classes
6. Delete unused resources: EBS volumes, old snapshots, unused EIPs
7. Use managed services where appropriate (RDS vs self-managed DB)
Your answer: "I monitor AWS costs and optimize by right-sizing instances,
using spot instances for Dask workers (fault-tolerant), S3 lifecycle rules
for archival data, and auto-scaling for the API tier."
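The lifecycle rules in point 5 might look like this (prefix and day thresholds hypothetical; the shape follows the S3 lifecycle configuration API):

```json
{
  "Rules": [
    {
      "ID": "archive-raw-climate-data",
      "Status": "Enabled",
      "Filter": {"Prefix": "raw/"},
      "Transitions": [
        {"Days": 90, "StorageClass": "STANDARD_IA"},
        {"Days": 365, "StorageClass": "GLACIER"}
      ]
    }
  ]
}
```

Data moves to Infrequent Access after 90 days and to Glacier after a year — no application changes, just cheaper bytes.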
Common Interview Questions
Q: Walk me through deploying a Python API to AWS.
A: 1. Containerize with Docker (multi-stage build)
2. Push image to ECR
3. Define ECS task definition (image, CPU, memory, env vars)
4. Create ECS service behind ALB
5. Configure auto-scaling (CPU/request-based)
6. Set up RDS PostgreSQL in private subnet
7. ElastiCache Redis for caching
8. CloudWatch for monitoring + alerts
9. Route 53 for DNS
10. ACM for SSL certificate
Q: How do you handle secrets in production?
A: AWS Secrets Manager or SSM Parameter Store (encrypted).
ECS injects secrets as environment variables at runtime.
Never in code, Docker images, or git.
Rotate secrets periodically.
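On the application side this reduces to reading the injected env vars and failing fast at startup. A minimal sketch (variable names hypothetical):

```python
import os

def get_secret(name):
    """Read a secret the runtime injected as an env var (e.g. ECS with a
    'secrets' mapping in the task definition pulling from Secrets Manager).
    Failing fast at startup beats a None leaking into a connection string."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"missing required secret: {name}")
    return value
```

Calling this for every required secret at boot turns a misconfigured deploy into an immediate, obvious crash instead of a confusing runtime error.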
Q: What happens when an EC2 instance fails?
A: Auto Scaling Group detects the failed health check, terminates
the unhealthy instance, and launches a replacement. The ALB stops
routing traffic to the instance once its own health checks fail.
If using ECS, the service scheduler replaces the task.
Q: How would you set up a database for high availability?
A: RDS Multi-AZ: synchronous standby in another AZ, automatic
failover (typically 1-2 minutes). Read replicas in the same or a
different region. Regular automated backups with point-in-time
recovery enabled. Connections pooled through PgBouncer.
Q: How do you debug a production issue?
A: 1. Check alerts/dashboards for anomalies
2. Check CloudWatch logs filtered by time window
3. Look for error patterns (status codes, exceptions)
4. Trace specific requests using request IDs
5. Check dependent services (DB, cache, queues)
6. If needed, increase logging temporarily
7. Fix → deploy → verify → postmortem
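Step 4 — tracing by request ID — can be sketched against structured logs. The `request_id` and `ts` field names are a common convention, not a standard; CloudWatch Logs Insights does the same filtering server-side:

```python
import json

def trace_request(log_lines, request_id):
    """Collect every structured log line for one request, in time order.
    Non-JSON lines (stack traces, startup banners) are skipped rather
    than crashing the trace."""
    events = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if record.get("request_id") == request_id:
            events.append(record)
    return sorted(events, key=lambda r: r["ts"])
```

This only works if every service propagates the request ID into its logs — which is why structured logging is worth setting up before the incident, not during it.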
Resources
- AWS Well-Architected Framework: https://aws.amazon.com/architecture/well-architected/
- AWS Documentation: https://docs.aws.amazon.com/
- Docker Documentation: https://docs.docker.com/
- Terraform Documentation: https://developer.hashicorp.com/terraform/docs
- AWS Free Tier: https://aws.amazon.com/free/ — practice with real services
My Notes
AWS services I use daily:
-
Infrastructure I've set up:
-
Things I need to learn better:
-