06 — Distributed Systems Interview Guide

Priority: MEDIUM-HIGH — You’re studying DDIA. This overlaps with system design but focuses on the theory that interviewers expect at a 5+ YOE level.


Table of Contents

  1. Fundamental Concepts
  2. Consistency Models
  3. Consensus Protocols
  4. Data Partitioning & Replication
  5. Distributed Transactions
  6. Clocks & Ordering
  7. Failure Modes
  8. Message Queues & Event-Driven Architecture
  9. Key Theorems & Results
  10. Common Interview Questions
  11. Resources

Fundamental Concepts

Why Distributed Systems?

1. Scalability: handle more load than a single machine can
2. Reliability: survive hardware failures (no single point of failure)
3. Latency: serve users from geographically closer nodes
4. Cost: many commodity machines cheaper than one supercomputer

Challenges:
  - Network is unreliable (partitions, delays, packet loss)
  - Clocks are not synchronized
  - Processes can crash at any time
  - Partial failures: some nodes work, some don't
  - No shared memory between nodes

The 8 Fallacies of Distributed Computing

1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn't change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous

→ Assume ALL of these will fail. Design for it.

Consistency Models

Strong Consistency

Definition: After a write completes, all subsequent reads return that write's value.
"Linearizability" — looks like a single copy of data, even with replicas.

How achieved:
  - Single leader with synchronous replication
  - Consensus protocol (Raft, Paxos)

Trade-offs:
  - Higher latency (wait for acknowledgment from replicas)
  - Lower availability under network partitions

When to use:
  - Financial transactions (account balances)
  - Inventory management (stock counts)
  - Leader election

Eventual Consistency

Definition: If no new writes, all replicas will EVENTUALLY converge.
No guarantee on WHEN convergence happens.

How achieved:
  - Async replication
  - Anti-entropy protocols (Merkle trees)
  - Read repair, hinted handoff

Trade-offs:
  - Low latency (write locally, replicate later)
  - High availability
  - Stale reads possible

When to use:
  - Social media (likes, follower counts)
  - DNS
  - Search indexes
  - Analytics data

Other Consistency Models

Causal Consistency:
  - Respects causal order (if A caused B, everyone sees A before B)
  - Concurrent operations can be seen in different orders
  - Weaker than strong, stronger than eventual

Read-your-writes:
  - A user always sees their own writes
  - Others may see stale data
  - Common in web apps (user edits profile → sees update immediately)

Monotonic Reads:
  - Once you've seen a value, you won't see an older value
  - Prevents "time travel" between replicas

Session Consistency:
  - Within a session, reads are consistent with that session's writes
  - Different sessions may see different things

Consensus Protocols

Raft (Most Interview-Relevant)

Goal: Multiple nodes agree on a sequence of values, even if some nodes fail.

Roles:
  - Leader: handles all client writes, replicates to followers
  - Follower: accepts log entries from leader
  - Candidate: trying to become leader

Key Mechanisms:
  1. Leader Election:
     - If follower doesn't hear from leader (timeout), becomes candidate
     - Sends RequestVote RPCs to all nodes
     - Wins if gets majority votes
     - Random timeouts prevent split votes

  2. Log Replication:
     - Leader appends entry to its log
     - Sends AppendEntries RPC to followers
     - Waits for majority acknowledgment
     - Then commits (applies to state machine)
     - Tells followers to commit

  3. Safety:
     - Only the most up-to-date candidate can win election
     - Committed entries are never lost (even if leader crashes)

Key properties:
  - Tolerates f failures with 2f+1 nodes (e.g., 2 failures with 5 nodes)
  - Requires majority quorum for all operations
  - Strong consistency guaranteed

Interview tip: "Raft is a consensus protocol that ensures a replicated log
is consistent across nodes. It uses a single leader for writes, majority
quorum for commits, and term-based leader election. It tolerates minority
node failures."
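
The failure-tolerance numbers above reduce to simple arithmetic. A minimal sketch (function names are my own, just for illustration): a cluster of n nodes needs floor(n/2) + 1 votes to elect a leader or commit an entry, so it tolerates floor((n-1)/2) failures.

```python
# Raft quorum arithmetic sketch: majority size and failure tolerance
# for an n-node cluster. Helper names are illustrative.

def majority(n: int) -> int:
    """Votes needed to win an election or commit a log entry."""
    return n // 2 + 1

def max_failures(n: int) -> int:
    """Node failures an n-node cluster tolerates while staying available."""
    return (n - 1) // 2

print(majority(5))      # 3
print(max_failures(5))  # 2
print(max_failures(4))  # 1 — an even-sized cluster buys no extra tolerance
```

This is why clusters are sized at odd numbers: going from 4 to 5 nodes adds fault tolerance, but going from 3 to 4 does not.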

Paxos

Similar goal to Raft but more complex. High-level understanding sufficient:
  - Proposers propose values
  - Acceptors accept/reject proposals
  - Learners learn the chosen value
  - Prepare → Promise → Accept → Accepted

Raft was designed to be easier to understand than Paxos.
Most interviewers accept "I'd use Raft because..." and don't expect
Paxos implementation details.

Data Partitioning & Replication

Partitioning (Sharding)

Strategies:

1. Range Partitioning:
   user_id 0-999999 → Shard A
   user_id 1000000-1999999 → Shard B
   
   Pros: Range queries efficient within a shard
   Cons: Hot spots (if IDs are sequential, latest shard gets all writes)

2. Hash Partitioning:
   hash(user_id) % num_shards → shard number
   
   Pros: Even distribution
   Cons: Range queries span all shards, adding shards requires rehashing

3. Consistent Hashing:
   Hash ring with virtual nodes
   
   Pros: Minimal data movement when adding/removing nodes
   Cons: Complexity, still need to handle hotspots

4. Directory-based:
   Lookup service maps key → shard
   
   Pros: Flexible, can rebalance easily
   Cons: Lookup service is a bottleneck/SPOF

Challenges:
  - Cross-shard queries (joins, aggregations)
  - Rebalancing when adding/removing shards
  - Shard key selection (affects all query patterns)
  - Hot spots (celebrity problem: one user gets all the traffic)
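
Consistent hashing (strategy 3) is worth being able to sketch. Below is a minimal hash ring with virtual nodes, not production code; node and key names are invented. Each physical node is hashed many times onto the ring, and a key is owned by the first virtual node clockwise from its hash.

```python
# Minimal consistent-hash ring with virtual nodes (illustrative sketch).
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.ring = []  # sorted list of (hash, node) — the "ring"
        for node in nodes:
            for i in range(vnodes):
                h = self._hash(f"{node}#{i}")       # one point per virtual node
                bisect.insort(self.ring, (h, node))

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def lookup(self, key: str) -> str:
        """Walk clockwise from the key's hash to the next virtual node."""
        h = self._hash(key)
        i = bisect.bisect(self.ring, (h, ""))
        return self.ring[i % len(self.ring)][1]     # wrap around the ring

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.lookup("user:42")   # deterministic; one of the three nodes
```

Adding a node only claims the ring segments adjacent to its virtual nodes, so only a fraction of keys move — the property the "Pros" line above refers to.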

Replication

1. Single-Leader (Primary-Replica):
   - One leader handles writes, replicates to followers
   - Followers serve reads
   - Sync vs Async replication
   - Failover: promote a replica to leader
   - PostgreSQL streaming replication is this model
   
2. Multi-Leader:
   - Multiple nodes accept writes
   - Need conflict resolution (last-writer-wins, custom merge)
   - Use case: multi-datacenter (each DC has a leader)
   - Complex, hard to reason about

3. Leaderless:
   - Any node accepts reads and writes
   - Quorum: W + R > N (write to W nodes, read from R nodes, N total replicas)
   - Example: W=2, R=2, N=3 — guarantees overlap, so reads see latest write
   - Cassandra and Amazon's Dynamo (the design DynamoDB descends from) use this model
   - Cassandra and Amazon's Dynamo (the design DynamoDB descends from) use this model
   - Conflict resolution: vector clocks, last-writer-wins

Quorum

N = number of replicas
W = number of write acknowledgments required
R = number of replicas to read from

If W + R > N → guaranteed to read latest write (overlap)

Common configurations:
  N=3, W=2, R=2: strong consistency, tolerates 1 failure
  N=3, W=1, R=3: fast writes, slow but consistent reads (must read all replicas)
  N=3, W=3, R=1: slow writes (all must acknowledge), fast reads

Sloppy quorum: write to any W nodes (not necessarily the "home" nodes)
  → Higher availability, weaker consistency
  → Hinted handoff delivers writes to correct nodes later
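
The overlap condition and the failure tolerance of a given (N, W, R) choice can be checked mechanically. A small sketch (the function name is my own):

```python
# Quorum configuration check: does W + R > N guarantee read/write
# quorum overlap, and how many failures do reads and writes survive?

def quorum_properties(n: int, w: int, r: int) -> dict:
    return {
        "overlap": w + r > n,      # every read quorum intersects every write quorum
        "write_tolerates": n - w,  # failures before writes start failing
        "read_tolerates": n - r,   # failures before reads start failing
    }

print(quorum_properties(3, 2, 2))
# {'overlap': True, 'write_tolerates': 1, 'read_tolerates': 1}
```

Note that W=1, R=1 on N=3 gives no overlap: reads may miss the latest write entirely.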

Distributed Transactions

Two-Phase Commit (2PC)

Phase 1 — Prepare:
  Coordinator asks all participants: "Can you commit?"
  Participants: prepare (lock resources) → vote YES or NO

Phase 2 — Commit/Abort:
  If all voted YES → Coordinator sends COMMIT
  If any voted NO → Coordinator sends ABORT

Problems:
  - Blocking: if coordinator crashes after prepare, participants are stuck
  - Single point of failure: coordinator
  - Latency: two round trips
  - Not widely used across services (too slow, too fragile)
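
To make the two phases concrete, here is a toy in-memory 2PC run; class and method names are invented for this sketch, and a real implementation would add write-ahead logging, timeouts, and coordinator recovery.

```python
# Toy two-phase commit with in-memory participants (illustrative only).

class Participant:
    def __init__(self, name, will_commit=True):
        self.name, self.will_commit = name, will_commit
        self.state = "init"

    def prepare(self) -> bool:
        """Phase 1: lock resources and vote YES/NO."""
        self.state = "prepared" if self.will_commit else "aborted"
        return self.will_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants) -> str:
    votes = [p.prepare() for p in participants]   # phase 1: collect votes
    if all(votes):
        for p in participants:                    # phase 2: COMMIT everywhere
            p.commit()
        return "committed"
    for p in participants:                        # phase 2: ABORT everywhere
        p.abort()
    return "aborted"

ps = [Participant("orders"), Participant("inventory", will_commit=False)]
print(two_phase_commit(ps))   # aborted — a single NO vote aborts everyone
```

The blocking problem is visible in the structure: a participant that voted YES sits in "prepared" (holding locks) until the coordinator tells it what to do.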

Saga Pattern (Preferred for Microservices)

A sequence of local transactions. Each step has a compensating action.

Example: Order Processing
  1. Create Order → (compensate: Cancel Order)
  2. Reserve Inventory → (compensate: Release Inventory)
  3. Charge Payment → (compensate: Refund Payment)
  4. Ship Order

If step 3 fails:
  → Compensate step 2 (Release Inventory)
  → Compensate step 1 (Cancel Order)

Choreography: each service publishes events, next service listens
Orchestration: central orchestrator coordinates the steps

Trade-offs:
  - No atomicity (partial states are visible)
  - Compensating actions must be idempotent
  - Eventual consistency
  - More complex error handling
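
The orchestration variant can be sketched in a few lines. The step names mirror the order-processing example above; the tiny "framework" itself is an invented illustration, not a real library.

```python
# Orchestration-style saga sketch: run local transactions in order;
# on failure, run compensations for the completed steps in reverse.

def run_saga(steps):
    """steps: list of (action, compensation) pairs of callables."""
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            for comp in reversed(done):   # unwind completed steps in reverse
                comp()
            return "rolled back"
    return "completed"

log = []

def charge_payment():
    raise RuntimeError("payment declined")   # simulated failure at step 3

steps = [
    (lambda: log.append("create order"),  lambda: log.append("cancel order")),
    (lambda: log.append("reserve stock"), lambda: log.append("release stock")),
    (charge_payment,                      lambda: log.append("refund")),
]
print(run_saga(steps), log)
# rolled back ['create order', 'reserve stock', 'release stock', 'cancel order']
```

Note the reverse-order unwind: inventory is released before the order is cancelled, matching the "compensate step 2, then step 1" flow above.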

Event Sourcing & CQRS

Event Sourcing:
  - Store EVENTS, not current state
  - Current state = replay all events from beginning
  - Events are immutable (append-only log)
  - Full audit trail
  - Can rebuild state at any point in time

CQRS (Command Query Responsibility Segregation):
  - Separate read model from write model
  - Write model: optimized for writes (event store)
  - Read model: optimized for queries (materialized views, denormalized)
  - Sync read model from events (eventual consistency)

When to use:
  - Complex domain logic
  - Audit requirements
  - Different read/write patterns
  - Event-driven architecture

When NOT to use:
  - Simple CRUD
  - Small scale
  - Strong consistency required everywhere
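
The core event-sourcing idea — state as a fold over an immutable event log — fits in a few lines. A minimal sketch; the event names and account domain are invented for illustration:

```python
# Event-sourcing sketch: current state is derived by replaying events,
# and any historical state can be rebuilt by replaying a prefix.

events = [
    {"type": "Deposited", "amount": 100},
    {"type": "Withdrawn", "amount": 30},
    {"type": "Deposited", "amount": 5},
]

def apply(balance: int, event: dict) -> int:
    if event["type"] == "Deposited":
        return balance + event["amount"]
    if event["type"] == "Withdrawn":
        return balance - event["amount"]
    return balance          # unknown events are ignored

def replay(events, upto=None) -> int:
    """Fold events into state; `upto` rebuilds a past point in time."""
    balance = 0
    for e in events[:upto]:
        balance = apply(balance, e)
    return balance

print(replay(events))          # 75 — current state
print(replay(events, upto=1))  # 100 — state after the first event only
```

A CQRS read model is just such a fold maintained incrementally (a materialized view updated as events arrive) rather than recomputed from scratch.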

Clocks & Ordering

Physical Clocks

Problem: clocks on different machines drift.
NTP syncs clocks, but only to within milliseconds at best (worse over the public internet).
Never rely on timestamps for ordering events across machines.

Logical Clocks

Lamport Clocks:
  - Each node maintains a counter
  - On local event: counter++
  - On send: include counter in message
  - On receive: counter = max(local, received) + 1
  - If event A happened-before B, then timestamp(A) < timestamp(B)
  - BUT: timestamp(A) < timestamp(B) does NOT mean A happened-before B
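
The four rules above translate directly into a tiny class (a sketch, with invented names):

```python
# Minimal Lamport clock implementing the rules above.

class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self) -> int:
        self.time += 1                          # local event: counter++
        return self.time

    def send(self) -> int:
        self.time += 1                          # sending counts as an event
        return self.time                        # timestamp attached to message

    def receive(self, msg_time: int) -> int:
        self.time = max(self.time, msg_time) + 1  # max(local, received) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t = a.send()       # a.time = 1
b.receive(t)       # b.time = max(0, 1) + 1 = 2
b.local_event()    # b.time = 3 — b's events are now "after" a's send
```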

Vector Clocks:
  - Each node maintains a vector of counters (one per node)
  - Can detect concurrent events (unlike Lamport clocks)
  - Used in Amazon's Dynamo and Riak for conflict detection
  - Trade-off: vector size grows with number of nodes
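
Concurrency detection — the thing Lamport clocks cannot do — reduces to comparing vectors component-wise. A sketch with invented node names:

```python
# Vector-clock comparison: causal order vs. concurrency.

def happened_before(v1: dict, v2: dict) -> bool:
    """True if v1 causally precedes v2: v1 <= v2 everywhere, < somewhere."""
    nodes = set(v1) | set(v2)
    le = all(v1.get(n, 0) <= v2.get(n, 0) for n in nodes)
    lt = any(v1.get(n, 0) < v2.get(n, 0) for n in nodes)
    return le and lt

def concurrent(v1: dict, v2: dict) -> bool:
    """Neither precedes the other — a genuine conflict to resolve."""
    return not happened_before(v1, v2) and not happened_before(v2, v1)

print(happened_before({"a": 1}, {"a": 2, "b": 1}))    # True — causal order
print(concurrent({"a": 2, "b": 0}, {"a": 1, "b": 1})) # True — conflict
```

When a store like Riak sees two concurrent versions, it keeps both as siblings and pushes conflict resolution to the reader.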

Failure Modes

1. Crash failure: node stops and never comes back
   → Detection: heartbeats, timeouts
   → Handling: failover, replication

2. Omission failure: node fails to send/receive messages
   → Send omission: message never sent
   → Receive omission: message never processed

3. Timing failure: node responds too slowly
   → Timeout detection
   → Mark as suspect, not failed (might just be slow)

4. Byzantine failure: node behaves arbitrarily (buggy or malicious)
   → Hardest to handle, requires BFT protocols (3f+1 nodes for f failures)
   → Most internal systems assume non-Byzantine (crash-only)

5. Network partition: network splits into groups that can't communicate
   → CAP theorem: choose between consistency and availability
   → Most systems: detect partition, degrade gracefully

Split Brain

When: network partition causes two groups, each thinking it's the leader.
Example: primary and replica can't communicate, replica promotes itself.
Now TWO nodes accept writes → data divergence.

Prevention:
  - Quorum-based leader election (need majority)
  - Fencing tokens (old leader's token is invalidated)
  - STONITH (Shoot The Other Node In The Head) — force-kill old leader

Message Queues & Event-Driven Architecture

Queue vs Log

Queue (RabbitMQ, SQS):
  - Message consumed once (removed from queue)
  - Multiple consumers → each gets different messages
  - Good for: task distribution, work queues

Log (Kafka, Kinesis):
  - Messages retained for a period (configurable)
  - Multiple consumers can read at different offsets (replay)
  - Good for: event sourcing, data pipelines, multiple consumers

Choosing:
  - Need replay? → Log
  - Need complex routing? → Queue (RabbitMQ exchanges)
  - Need massive throughput? → Log (Kafka)
  - Simple job queue? → Queue (SQS/RabbitMQ)

Exactly-Once Delivery

In distributed systems, "exactly-once" is extremely hard.

At-most-once: send and forget (may lose messages)
At-least-once: retry until acknowledged (may duplicate)
Exactly-once: not achievable at the delivery layer; approximated via processing semantics

Practical "exactly-once":
  - At-least-once delivery + idempotent processing
  - Deduplication at consumer (track processed message IDs)
  - Transactional outbox pattern (atomically write to DB + outbox table)

Transactional Outbox:
  1. Business logic + outbox message written in same DB transaction
  2. Separate poller reads outbox, publishes to queue
  3. On successful publish, marks outbox entry as sent
  → Guarantees at-least-once, consumer handles dedup
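
The consumer-side dedup step can be sketched in a few lines. In practice the seen-IDs set lives in a database or Redis alongside the side effect; here it is in memory for illustration, and the message shape is invented:

```python
# At-least-once delivery + idempotent consumer: a set of processed
# message IDs turns a redelivered duplicate into a no-op.

processed_ids = set()
results = []

def handle(message: dict):
    if message["id"] in processed_ids:
        return                              # duplicate redelivery: skip
    results.append(message["payload"])      # the actual side effect
    processed_ids.add(message["id"])        # record AFTER the effect succeeds

handle({"id": "m1", "payload": "charge $10"})
handle({"id": "m1", "payload": "charge $10"})   # redelivered duplicate
print(results)   # ['charge $10'] — charged exactly once
```

For true safety the side effect and the ID recording must be atomic (same DB transaction), otherwise a crash between the two reintroduces duplicates.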

Key Theorems & Results

CAP Theorem:
  Can't have all three: Consistency, Availability, Partition Tolerance.
  In practice: P is mandatory, choose C or A.

PACELC Theorem:
  Extension of CAP: even when no Partition,
  you must trade-off between Latency and Consistency.
  PAC: during partition, choose A or C
  ELC: else (no partition), choose L or C

FLP Impossibility:
  In an asynchronous system with even one faulty process,
  there's no deterministic algorithm that always achieves consensus.
  → Real systems use timeouts and randomization to work around this.

Consistency Hierarchy (strongest to weakest):
  Linearizability > Sequential > Causal > Eventual

Common Interview Questions

Q: Explain CAP theorem with an example.
A: "In a distributed database with replicas across two data centers, if the
network between them fails (partition): we can either reject writes to
maintain consistency (CP), or accept writes on both sides and reconcile
later with eventual consistency (AP). PostgreSQL streaming replication is
CP — if the primary is unreachable, replicas won't accept writes. Cassandra
is AP — any node accepts writes, conflicts resolved later."

Q: How would you design a system that's both highly available and consistent?
A: "You can't have both during a partition (CAP), but you can:
1. Use strong consistency for critical data (payments, inventory)
2. Use eventual consistency for non-critical data (analytics, likes)
3. Minimize partition impact with multi-AZ deployment
4. Use Raft-based consensus for leader election (CP within a cluster)
5. Cache aggressively for read availability"

Q: What's the difference between Kafka and RabbitMQ?
A: "RabbitMQ is a message queue — messages are consumed and removed, good for
task distribution with complex routing. Kafka is a distributed log —
messages are retained, consumers track offsets, supports replay. Kafka
excels at high-throughput event streaming with multiple consumers. RabbitMQ
excels at traditional work queues with routing and priority."

Q: How do you handle a hot spot in a sharded database?
A: "1. Hash-based partitioning instead of range-based for better distribution.
2. Sub-sharding: further split the hot shard.
3. Caching: cache hot data in Redis to reduce DB load.
4. Read replicas for the hot shard.
5. Application-level logic to distribute writes (e.g., random suffix on keys)."

Q: Design a distributed lock.
A: "Use Redis SET with the NX and PX options (atomic set-if-absent plus TTL). For stronger guarantees,
use Redlock (quorum across multiple Redis instances). Key considerations:
lock TTL (prevents deadlock if holder crashes), fencing tokens (prevents
stale lock holders from acting), and clock drift. For strongest guarantees,
use ZooKeeper or etcd with consensus-based locks."
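
The lock semantics in that answer — acquire-if-absent, TTL expiry, monotonically increasing fencing tokens — can be sketched in memory. This is an invented illustration of the concepts, not a real client; a production version would live in Redis, ZooKeeper, or etcd.

```python
# In-memory sketch of TTL-based locking with fencing tokens.
import time

class LockService:
    def __init__(self):
        self.holder = None      # (owner, expiry, token) or None
        self.next_token = 0

    def acquire(self, owner: str, ttl: float, now=None):
        """Return a fencing token on success, None if the lock is held."""
        now = time.monotonic() if now is None else now
        if self.holder and self.holder[1] > now:
            return None                       # held and not yet expired
        self.next_token += 1                  # fencing tokens only grow
        self.holder = (owner, now + ttl, self.next_token)
        return self.next_token

svc = LockService()
t1 = svc.acquire("worker-1", ttl=10, now=0)   # token 1
t2 = svc.acquire("worker-2", ttl=10, now=5)   # None — still held
t3 = svc.acquire("worker-2", ttl=10, now=11)  # token 2 — TTL expired
# Downstream storage should reject any write carrying a fencing token
# lower than the highest one it has seen — this fences out worker-1
# if it wakes up after its TTL expired and still thinks it holds the lock.
```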

Resources

Books

Online

Videos


My Notes

DDIA chapters I've finished:
-

Distributed concepts I use at work:
-

Concepts I find confusing:
-

Next: 07-aws-and-infrastructure.md