07 — AWS & Infrastructure Interview Guide
Priority: MEDIUM — You own AWS infra at Intensel. Know the services you use well. Startups care less about certifications, more about practical experience.
Table of Contents
- Core AWS Services (Must Know)
- Networking & Security
- Docker & Containers
- CI/CD & Deployment
- Infrastructure as Code
- Monitoring & Logging
- Cost Optimization
- Common Interview Questions
- Resources
Core AWS Services (Must Know)
Compute
EC2 (Elastic Compute Cloud):
- Virtual machines in the cloud
- Instance types: general purpose (t3, m5), compute-optimized (c5), memory-optimized (r5), GPU (p3)
- Pricing: on-demand, reserved (1-3 year commitment, 40-60% savings),
spot (up to 90% savings, can be interrupted)
- Auto Scaling Groups: scale EC2 count based on metrics (CPU, queue depth)
- Launch Templates: define AMI, instance type, security groups, user data
Lambda (Serverless):
- Run code without managing servers
- Pay per invocation + duration (billed in 1ms increments)
- Max 15 min execution, 10GB memory
- Cold starts: first invocation has higher latency
- Triggers: API Gateway, SQS, S3 events, EventBridge, etc.
- Good for: event-driven processing, webhooks, scheduled tasks
- Not good for: long-running tasks, sustained high traffic (cost)
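The event-driven model above can be sketched as a minimal handler. The event shape follows S3's notification format; the handler body is a bare illustration (a real function would call boto3 here), and the bucket/key values are whatever the trigger delivers:

```python
import json
import urllib.parse

def handler(event, context):
    """Minimal S3-triggered Lambda sketch: extract bucket/key from each
    notification record and return a summary. S3 URL-encodes object keys,
    hence the unquote_plus."""
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        results.append({"bucket": bucket, "key": key})
    return {"statusCode": 200, "body": json.dumps(results)}
```

Locally you can invoke it with a fake event dict, which is also how unit tests for Lambda handlers usually work.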
ECS (Elastic Container Service):
- Run Docker containers on AWS
- Fargate: serverless containers (no EC2 management)
- EC2 launch type: you manage EC2 instances
- Task definitions: container image, CPU/memory, ports, env vars
- Services: desired count, load balancer integration, auto-scaling
EKS (Elastic Kubernetes Service):
- Managed Kubernetes
- More complex than ECS, but industry standard
- Good for: multi-cloud strategy, complex orchestration needs
Storage
S3 (Simple Storage Service):
- Object storage: key-value (key = path, value = file)
- Virtually unlimited storage
- Durability: 99.999999999% (11 nines)
- Storage classes: Standard, Infrequent Access, Glacier (archival)
- Features: versioning, lifecycle rules, event notifications
- Presigned URLs: time-limited access to private objects
- Multipart upload: for large files
Your experience: "Raw climate data and backups stored in S3."
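To make the presigned-URL idea concrete, here is a toy signer: an expiry plus an HMAC over path and expiry. This is not S3's actual SigV4 scheme (boto3's generate_presigned_url handles that for you); it only illustrates why the URL is both time-limited and tamper-proof. The secret and domain are hypothetical:

```python
import hashlib
import hmac
import time

SECRET = b"demo-signing-key"  # hypothetical; real S3 signs with your AWS credentials

def presign(path, expires_in=3600, now=None):
    """Toy presigned URL: embed an expiry timestamp and an HMAC over
    path + expiry, so neither can be changed without invalidating sig."""
    exp = (now if now is not None else int(time.time())) + expires_in
    sig = hmac.new(SECRET, f"{path}:{exp}".encode(), hashlib.sha256).hexdigest()
    return f"https://example.com{path}?expires={exp}&sig={sig}"

def verify(path, exp, sig, now=None):
    """Server-side check: signature must match and expiry must be in the future."""
    current = now if now is not None else int(time.time())
    expected = hmac.new(SECRET, f"{path}:{exp}".encode(), hashlib.sha256).hexdigest()
    return current < exp and hmac.compare_digest(sig, expected)
```

The same two properties (expiry baked into the signature, constant-time comparison) are what SigV4 gives you at scale.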
EBS (Elastic Block Store):
- Block storage for EC2 (like a hard drive)
- Types: gp3 (general), io2 (high IOPS), st1 (throughput)
- Snapshots for backup (stored in S3)
EFS (Elastic File System):
- Managed NFS, accessible from multiple EC2 instances
- Good for: shared file storage across instances
Database
RDS (Relational Database Service):
- Managed PostgreSQL, MySQL, MariaDB, Oracle, SQL Server
- Handles: backups, patching, failover, read replicas
- Multi-AZ: synchronous standby for high availability
- Read replicas: async replication for read scaling
- Performance Insights: query performance monitoring
Your experience: "PostgreSQL on RDS for the climate platform."
ElastiCache:
- Managed Redis or Memcached
- Cluster mode: Redis cluster for horizontal scaling
- Replication: read replicas for Redis
DynamoDB:
- Managed NoSQL (key-value + document)
- Single-digit millisecond latency at any scale
- Provisioned or on-demand capacity
- Global tables: multi-region replication
- Streams: change data capture (CDC)
Messaging
SQS (Simple Queue Service):
- Managed message queue
- Standard: best-effort ordering, at-least-once delivery
- FIFO: exactly-once processing, strict ordering
- Dead Letter Queue support
- Long polling: efficient message retrieval
- Visibility timeout: message hidden while being processed
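Visibility timeout is the SQS detail interviewers probe most, so a minimal in-memory sketch of the semantics helps (this mimics the behavior, not the SQS API):

```python
class VisibilityQueue:
    """In-memory sketch of SQS visibility-timeout semantics: a received
    message is hidden for `timeout` seconds; if the consumer does not
    delete it within that window, it becomes receivable again (which is
    why processing must be idempotent)."""
    def __init__(self, timeout=30.0):
        self.timeout = timeout
        self.messages = {}  # id -> (body, visible_at)
        self._next_id = 0

    def send(self, body):
        self._next_id += 1
        self.messages[self._next_id] = (body, 0.0)
        return self._next_id

    def receive(self, now):
        for msg_id, (body, visible_at) in self.messages.items():
            if now >= visible_at:
                # Hide the message for the visibility window
                self.messages[msg_id] = (body, now + self.timeout)
                return msg_id, body
        return None

    def delete(self, msg_id):
        self.messages.pop(msg_id, None)
```

The redelivery-on-timeout path is exactly what a Dead Letter Queue counts: after N failed receives, SQS moves the message to the DLQ.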
SNS (Simple Notification Service):
- Pub/sub messaging
- Topics: publish once, deliver to multiple subscribers
- Subscribers: SQS, Lambda, HTTP, email, SMS
- Fan-out pattern: SNS → multiple SQS queues
EventBridge:
- Serverless event bus
- Rules: route events based on patterns
- Scheduled events (cron replacement)
- Good for: decoupling services, event-driven architectures
Networking
VPC (Virtual Private Cloud):
- Isolated network in AWS
- Subnets: public (internet-facing), private (internal)
- Internet Gateway: connects VPC to internet
- NAT Gateway: lets private subnets access internet (outbound only)
- Route tables: control traffic routing
ALB (Application Load Balancer):
- Layer 7 (HTTP/HTTPS) load balancing
- Path-based routing: /api → service A, /images → service B
- Host-based routing: api.example.com → service A
- Health checks: automatically removes unhealthy targets
- SSL/TLS termination
NLB (Network Load Balancer):
- Layer 4 (TCP/UDP) load balancing
- Ultra-low latency
- Static IP address
- Good for: non-HTTP protocols, extreme performance
Route 53:
- DNS service
- Routing policies: simple, weighted, latency, failover, geolocation
- Health checks: automatic DNS failover
CloudFront:
- CDN (Content Delivery Network)
- Edge locations worldwide
- Origins: S3, ALB, custom HTTP
- SSL/TLS certificates (ACM)
- Cache behaviors per path pattern
Networking & Security
VPC Architecture
Typical production VPC:
VPC (10.0.0.0/16)
├── Public Subnet A (10.0.1.0/24) - AZ a
│ ├── ALB
│ └── NAT Gateway
├── Public Subnet B (10.0.2.0/24) - AZ b
│ └── ALB
├── Private Subnet A (10.0.3.0/24) - AZ a
│ ├── EC2 / ECS instances
│ └── Application servers
├── Private Subnet B (10.0.4.0/24) - AZ b
│ ├── EC2 / ECS instances
│ └── Application servers
├── Database Subnet A (10.0.5.0/24) - AZ a
│ └── RDS Primary
└── Database Subnet B (10.0.6.0/24) - AZ b
└── RDS Standby
Security Groups (stateful firewall):
- ALB SG: inbound 80/443 from 0.0.0.0/0
- App SG: inbound 8000 from ALB SG only
- DB SG: inbound 5432 from App SG only
IAM (Identity & Access Management)
Key Concepts:
- Users: human identities
- Roles: for services (EC2, Lambda) — no credentials, assumed temporarily
- Policies: JSON documents defining permissions
- Principle of least privilege: only grant needed permissions
Best Practices:
- Never use root account for daily work
- Use IAM roles for services (not access keys)
- Enable MFA for all users
- Use IAM policies for fine-grained access control
- Cross-account access via assume-role
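Least privilege in practice means a narrowly scoped policy document. A sketch for read/write access to the raw-data bucket mentioned earlier (bucket name hypothetical):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::climate-raw-data/*"
    }
  ]
}
```

Attached to an ECS task role or EC2 instance profile, this grants object access to one bucket and nothing else — no `s3:*`, no `Resource: "*"`.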
Docker & Containers
Docker Fundamentals
# Multi-stage build (production-ready)
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt
FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /root/.local /root/.local
COPY . .
ENV PATH=/root/.local/bin:$PATH
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Key Concepts:
- Image: immutable blueprint (layers, cached)
- Container: running instance of an image
- Layer caching: order Dockerfile instructions from least → most changing
- Multi-stage builds: smaller final images (don't include build tools)
- .dockerignore: exclude files from build context
Best Practices:
- Use specific base image tags (python:3.12-slim, not python:latest)
- Run as non-root user
- COPY requirements.txt before COPY . (layer caching for deps)
- Use multi-stage builds
- One process per container
- Health checks: HEALTHCHECK CMD curl -f http://localhost:8000/health
- Don't store secrets in images (use environment variables, secrets managers)
Docker Compose (Development)
# docker-compose.yml
services:
  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      DATABASE_URL: postgresql://user:pass@db:5432/mydb
      REDIS_URL: redis://redis:6379
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_started
  worker:
    build: .
    command: celery -A tasks worker
    environment:
      DATABASE_URL: postgresql://user:pass@db:5432/mydb
      REDIS_URL: redis://redis:6379
  db:
    image: postgis/postgis:16-3.4
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: pg_isready -U user
      interval: 5s
      retries: 5
  redis:
    image: redis:7-alpine
volumes:
  pgdata:
CI/CD & Deployment
Deployment Strategies
1. Rolling Deployment:
- Replace instances one by one
- Zero downtime
- Rollback: redeploy previous version
- Risk: mixed versions during deployment
2. Blue-Green Deployment:
- Two identical environments (Blue = current, Green = new)
- Switch routing from Blue to Green
- Instant rollback: switch back to Blue
- Cost: need double infrastructure temporarily
3. Canary Deployment:
- Route small % of traffic to new version (1% → 5% → 25% → 100%)
- Monitor metrics at each step
- If errors spike, route back to old version
- Best for: reducing risk, critical services
4. Feature Flags:
- Deploy code but hide features behind flags
- Enable for specific users, %, or regions
- Decouple deployment from release
- Tools: LaunchDarkly, Unleash, custom Redis-based
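A custom flag service usually reduces to deterministic bucketing: hash the flag and user into a 0–99 bucket and compare against the rollout percentage. A minimal stdlib sketch (flag names and thresholds hypothetical; a Redis-backed version would just store the percentage there):

```python
import hashlib

def flag_enabled(flag, user_id, rollout_pct):
    """Deterministic percentage rollout: hash flag+user into 0-99 and
    compare to the rollout percentage. The same user always gets the
    same answer for a given flag, so their experience stays stable as
    the rollout grows from 1% toward 100%."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_pct
```

Hashing on flag+user (rather than user alone) keeps different flags' rollouts uncorrelated.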
CI/CD Pipeline (Typical)
Code Push → Build → Test → Deploy
1. Build:
- Install dependencies
- Build Docker image
- Push to container registry (ECR)
2. Test:
- Unit tests (pytest)
- Integration tests (testcontainers)
- Linting (ruff, mypy)
- Security scan (bandit, trivy)
3. Deploy:
- Staging → smoke tests → production
- Rolling or blue-green deployment
- Post-deploy health checks
- Automatic rollback on failure
Tools: GitHub Actions, GitLab CI, AWS CodePipeline, ArgoCD
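The three stages above map directly onto a workflow file. A hedged GitHub Actions sketch — the ECR_REPO variable and the ECS cluster/service names are hypothetical placeholders:

```yaml
# Illustrative pipeline; env var ECR_REPO and the cluster/service names are made up
name: deploy
on:
  push:
    branches: [main]
jobs:
  build-test-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Test
        run: |
          pip install -r requirements.txt
          ruff check .
          pytest
      - name: Build and push image
        run: |
          docker build -t "$ECR_REPO:$GITHUB_SHA" .
          docker push "$ECR_REPO:$GITHUB_SHA"
      - name: Deploy
        run: aws ecs update-service --cluster prod --service api --force-new-deployment
```

Tagging images with the commit SHA (rather than `latest`) is what makes "redeploy previous version" rollbacks trivial.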
Infrastructure as Code
Terraform (most common):
- Declarative: describe desired state
- Provider ecosystem (AWS, GCP, Azure, etc.)
- State management: track current infrastructure
- Plan → Apply workflow (preview changes before applying)
CloudFormation (AWS-specific):
- Native AWS IaC
- YAML/JSON templates
- Stack management: create/update/delete resources as a unit
Your answer: "I use Docker for containerization, with Terraform/CloudFormation
for infrastructure. Our deployment pipeline: GitHub Actions runs tests,
builds Docker image, pushes to ECR, and deploys to ECS with rolling updates."
Monitoring & Logging
CloudWatch:
- Metrics: CPU, memory, network, custom metrics
- Logs: centralized log storage from all services
- Alarms: trigger on metric thresholds → SNS notification
- Dashboards: visualization
Key Metrics to Monitor:
- Application: request rate, error rate, latency (p50, p95, p99)
- Database: connections, query latency, replication lag, disk usage
- Cache: hit rate, memory usage, evictions
- Queue: depth, age of oldest message, consumer lag
- Infrastructure: CPU, memory, disk, network
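Percentile latency comes up constantly in interviews, and it is worth being able to compute one by hand. A nearest-rank sketch (monitoring systems differ in interpolation details, so treat this as one common definition, not the only one):

```python
def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it. Averages hide tail latency; p95/p99
    expose the slow requests users actually notice."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [12, 15, 14, 13, 250, 16, 14, 15, 13, 900]
```

Here the mean is ~127ms, but p50 is 14ms and p99 is 900ms — one slow outlier dominates the tail while barely moving the median.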
Alerting Rules:
- Error rate > 1% for 5 minutes → page on-call
- p99 latency > 2s for 5 minutes → warn
- Queue depth growing for 15 minutes → warn
- Disk > 80% → warn, > 90% → critical
- Health check failure → page immediately
Cost Optimization
Common Strategies:
1. Right-sizing: match instance types to actual usage
2. Reserved Instances/Savings Plans: commit for 1-3 years (40-60% savings)
3. Spot Instances: for fault-tolerant workloads (up to 90% savings)
4. Auto-scaling: scale down during low traffic
5. S3 lifecycle rules: move old data to cheaper storage classes
6. Delete unused resources: EBS volumes, old snapshots, unused EIPs
7. Use managed services where appropriate (RDS vs self-managed DB)
Your answer: "I monitor AWS costs and optimize by right-sizing instances,
using spot instances for Dask workers (fault-tolerant), S3 lifecycle rules
for archival data, and auto-scaling for the API tier."
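The lifecycle rules in point 5 might look like this (prefix and day thresholds hypothetical; the shape follows the S3 lifecycle configuration API):

```json
{
  "Rules": [
    {
      "ID": "archive-raw-climate-data",
      "Status": "Enabled",
      "Filter": {"Prefix": "raw/"},
      "Transitions": [
        {"Days": 90, "StorageClass": "STANDARD_IA"},
        {"Days": 365, "StorageClass": "GLACIER"}
      ]
    }
  ]
}
```

Data moves to Infrequent Access after 90 days and to Glacier after a year — no application changes, just cheaper bytes.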
Common Interview Questions
Q: Walk me through deploying a Python API to AWS.
A: 1. Containerize with Docker (multi-stage build)
2. Push image to ECR
3. Define ECS task definition (image, CPU, memory, env vars)
4. Create ECS service behind ALB
5. Configure auto-scaling (CPU/request-based)
6. Set up RDS PostgreSQL in private subnet
7. ElastiCache Redis for caching
8. CloudWatch for monitoring + alerts
9. Route 53 for DNS
10. ACM for SSL certificate
Q: How do you handle secrets in production?
A: AWS Secrets Manager or SSM Parameter Store (encrypted).
ECS injects secrets as environment variables at runtime.
Never in code, Docker images, or git.
Rotate secrets periodically.
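On the application side this reduces to reading the injected env vars and failing fast at startup. A minimal sketch (variable names hypothetical):

```python
import os

def get_secret(name):
    """Read a secret the runtime injected as an env var (e.g. ECS with a
    'secrets' mapping in the task definition pulling from Secrets Manager).
    Failing fast at startup beats a None leaking into a connection string."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"missing required secret: {name}")
    return value
```

Calling this for every required secret at boot turns a misconfigured deploy into an immediate, obvious crash instead of a confusing runtime error.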
Q: What happens when an EC2 instance fails?
A: Auto Scaling Group detects the failed health check, terminates
the unhealthy instance, and launches a replacement. The ALB stops
routing traffic to the instance once its own health checks fail.
If using ECS, the service scheduler replaces the task.
Q: How would you set up a database for high availability?
A: RDS Multi-AZ: synchronous standby in another AZ, automatic
failover (typically 1-2 minutes). Read replicas in the same or a
different region. Regular automated backups with point-in-time
recovery enabled. Connections pooled through PgBouncer.
Q: How do you debug a production issue?
A: 1. Check alerts/dashboards for anomalies
2. Check CloudWatch logs filtered by time window
3. Look for error patterns (status codes, exceptions)
4. Trace specific requests using request IDs
5. Check dependent services (DB, cache, queues)
6. If needed, increase logging temporarily
7. Fix → deploy → verify → postmortem
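Step 4 — tracing by request ID — can be sketched against structured logs. The `request_id` and `ts` field names are a common convention, not a standard; CloudWatch Logs Insights does the same filtering server-side:

```python
import json

def trace_request(log_lines, request_id):
    """Collect every structured log line for one request, in time order.
    Non-JSON lines (stack traces, startup banners) are skipped rather
    than crashing the trace."""
    events = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if record.get("request_id") == request_id:
            events.append(record)
    return sorted(events, key=lambda r: r["ts"])
```

This only works if every service propagates the request ID into its logs — which is why structured logging is worth setting up before the incident, not during it.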
Resources
- AWS Well-Architected Framework: https://aws.amazon.com/architecture/well-architected/
- AWS Documentation: https://docs.aws.amazon.com/
- Docker Documentation: https://docs.docker.com/
- Terraform Documentation: https://developer.hashicorp.com/terraform/docs
- AWS Free Tier: https://aws.amazon.com/free/ — practice with real services
My Notes
AWS services I use daily:
-
Infrastructure I've set up:
-
Things I need to learn better:
-