🌐 Distributed Systems

CAP, consensus, replication, partitioning

CAP Theorem

Consistency (C)

Every read sees the most recent write

Zookeeper, HBase, Spanner

Availability (A)

Every request gets a (possibly stale) response

Cassandra, DynamoDB, CouchDB

Partition Tolerance (P)

The system keeps working despite network splits

Required in any real distributed system

During a network partition, you must choose either CP (consistent, but some requests fail) or AP (available, but reads may be stale).

Core Concepts

Consensus Algorithms

Raft

Leader election + log replication. Designed for understandability. etcd, CockroachDB

Paxos

Theoretical foundation. Complex. Google Chubby

ZAB

Zookeeper Atomic Broadcast. Leader-based. Zookeeper
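The vote rule shared by these leader-based protocols can be sketched in a few lines: each node grants at most one vote per term, and a candidate needs a strict majority. A minimal Raft-flavored illustration only (no log comparison, timeouts, or RPC); all names are made up, not etcd's API.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    """Minimal Raft-style voter state (illustrative, not a full Raft node)."""
    node_id: int
    current_term: int = 0
    voted_for: Optional[int] = None

    def request_vote(self, term: int, candidate_id: int) -> bool:
        # A newer term resets any previous vote.
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
        # Grant at most one vote per term.
        if term == self.current_term and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False

def run_election(candidate_id: int, term: int, cluster: List[Node]) -> bool:
    """Candidate wins with votes from a strict majority of the cluster."""
    votes = sum(n.request_vote(term, candidate_id) for n in cluster)
    return votes > len(cluster) // 2

cluster = [Node(i) for i in range(5)]
print(run_election(0, term=1, cluster=cluster))  # True: all 5 grant their vote
print(run_election(1, term=1, cluster=cluster))  # False: term-1 votes already spent
```

The one-vote-per-term rule is what makes two leaders in the same term impossible: two majorities must share at least one node, and that node only voted once.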

Replication Strategies

Leader-Follower

One writer, N readers. Simple. PostgreSQL streaming replication

Multi-Leader

Multiple writers. Conflict resolution needed. CRDTs

Leaderless

Quorum reads/writes (R+W>N). Cassandra, DynamoDB
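The R+W>N overlap can be shown concretely. A toy sketch, assuming each replica stores a (version, value) pair and the reader keeps the highest version seen; real systems (Cassandra, Dynamo) use timestamps or vector clocks for this.

```python
# N = 3 replicas, writes wait for W = 2 acks, reads query R = 2 replicas.
# Because R + W > N, every read quorum overlaps every write quorum, so at
# least one contacted replica holds the latest version.
N, W, R = 3, 2, 2
assert R + W > N  # the quorum-overlap condition

def read_quorum(responses, r):
    """Take the first r replica responses and keep the highest version."""
    version, value = max(responses[:r])
    return value

# A write of version 2 succeeded on W = 2 of the 3 replicas; one is stale.
replicas = [(2, "new"), (1, "old"), (2, "new")]
print(read_quorum(replicas, R))  # "new": any 2 replicas include a fresh copy
```

With R=1, W=1 (and N=3) the same read could return "old": lowering quorums trades consistency for latency.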

Partitioning / Sharding

Hash-based

Consistent hashing. Even distribution. No range queries
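A minimal consistent-hash ring with virtual nodes, sketching the hash-based strategy (node names and vnode count are illustrative):

```python
import bisect
import hashlib

def h(key: str) -> int:
    """Stable hash mapping a key to a point on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Consistent hashing sketch: each node owns many virtual points,
    so adding/removing a node only remaps ~1/N of the keys."""
    def __init__(self, nodes, vnodes=100):
        self.ring = sorted((h(f"{n}#{i}"), n)
                           for n in nodes for i in range(vnodes))

    def lookup(self, key: str) -> str:
        # First virtual node clockwise from the key's hash; wrap at the end.
        i = bisect.bisect(self.ring, (h(key),)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.lookup("user:42"))  # deterministic; keys spread evenly across nodes
```

Note the trade-off from the card: hashing destroys key ordering, so range scans must hit every shard.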

Range-based

Sorted ranges. Good for scans. Hotspot risk

Directory

Lookup table for routing. Flexible. Single point of failure

Consistency Models

Strong

Linearizable. All reads see latest write. Spanner

Eventual

Converges over time. DynamoDB, Cassandra

Causal

Respects causality. Vector clocks. COPS

Read-your-writes

User sees their own writes. Session consistency

Distributed Clocks

Lamport Clock

Logical counter. Incremented on local events; receivers jump to max(local, received) + 1. Gives a total order (with node-ID tiebreak) but cannot detect concurrency.
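A Lamport clock fits in a few lines; this sketch is illustrative:

```python
class LamportClock:
    """Logical clock: tick on local events/sends, merge on receive."""
    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        # Local event or message send.
        self.time += 1
        return self.time

    def receive(self, msg_time: int) -> int:
        # Jump past the sender's timestamp, then count the receive itself.
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.tick()           # a sends at logical time 1
t_recv = b.receive(t_send)  # b receives at logical time 2
print(t_send, t_recv)       # 1 2  (send always ordered before receive)
```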

Vector Clock

Array of counters per node. Detects causality and conflicts. [A:2, B:3, C:1]
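A dict-based vector clock sketch, reusing the [A:2, B:3, C:1] example above:

```python
def vc_merge(a: dict, b: dict) -> dict:
    """Pointwise max: the state after observing both histories."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def vc_happens_before(a: dict, b: dict) -> bool:
    """a -> b iff a <= b in every component and a < b in at least one."""
    keys = a.keys() | b.keys()
    le = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    lt = any(a.get(k, 0) < b.get(k, 0) for k in keys)
    return le and lt

x = {"A": 2, "B": 3, "C": 1}
y = {"A": 2, "B": 4, "C": 1}  # B took one more step after x
z = {"A": 3, "B": 3, "C": 1}  # A took a step without seeing y

print(vc_happens_before(x, y))  # True: x causally precedes y
print(vc_happens_before(y, z) or vc_happens_before(z, y))  # False: concurrent
```

When neither clock happens-before the other (as with y and z), the writes conflict and must be resolved, e.g. by merging or last-writer-wins.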

Hybrid Logical Clock

Physical time + logical counter. CockroachDB. Bounded clock skew.
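A simplified HLC sketch. Wall-clock time is passed in explicitly so the example is deterministic; real implementations (e.g. CockroachDB's) read the system clock and enforce a skew bound.

```python
class HLC:
    """Hybrid logical clock: (physical_ms, logical) pairs, totally ordered.
    Tracks wall time when it advances, falls back to a logical counter when
    it does not (or when a remote clock is ahead)."""
    def __init__(self):
        self.pt, self.lc = 0, 0

    def now(self, wall_ms: int):
        if wall_ms > self.pt:
            self.pt, self.lc = wall_ms, 0
        else:
            self.lc += 1  # wall clock didn't advance: bump the logical part
        return (self.pt, self.lc)

    def update(self, wall_ms: int, remote):
        # On receive: take the max of local wall time and both HLCs.
        m = max(self.pt, remote[0], wall_ms)
        if m == self.pt == remote[0]:
            self.lc = max(self.lc, remote[1]) + 1
        elif m == self.pt:
            self.lc += 1
        elif m == remote[0]:
            self.lc = remote[1] + 1
        else:
            self.lc = 0
        self.pt = m
        return (self.pt, self.lc)

clk = HLC()
print(clk.now(100))              # (100, 0)
print(clk.now(100))              # (100, 1): same millisecond, logical bump
print(clk.update(100, (120, 0))) # (120, 1): remote wall clock is ahead
```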

Key Patterns

Saga Pattern

Distributed transactions via compensating actions. Choreography vs Orchestration.

Cross-service transactions
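An orchestrated saga can be sketched as a loop that records a compensation for each completed step and runs them in reverse on failure. Step names here are made up.

```python
def run_saga(steps):
    """steps: list of (action, compensation) callables.
    Returns True on success; on failure, undoes completed steps in reverse."""
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            for undo in reversed(done):  # compensate in reverse order
                undo()
            return False
    return True

log = []

def ship():
    raise RuntimeError("ship failed")  # simulated downstream failure

ok = run_saga([
    (lambda: log.append("reserve-stock"), lambda: log.append("release-stock")),
    (lambda: log.append("charge-card"),   lambda: log.append("refund-card")),
    (ship,                                lambda: None),
])
print(ok, log)
# False ['reserve-stock', 'charge-card', 'refund-card', 'release-stock']
```

This is the orchestration variant; in choreography each service would react to the previous service's events instead of a central loop.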

Event Sourcing

Store events, not state. Rebuild state by replaying. Append-only log.

Audit trails, undo
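A toy append-only log with state rebuilt by replay, using made-up event types:

```python
def apply(balance: int, event: dict) -> int:
    """Fold one event into the current state."""
    if event["type"] == "deposited":
        return balance + event["amount"]
    if event["type"] == "withdrawn":
        return balance - event["amount"]
    return balance

# The log is the source of truth; it is only ever appended to.
log = [
    {"type": "deposited", "amount": 100},
    {"type": "withdrawn", "amount": 30},
    {"type": "deposited", "amount": 5},
]

balance = 0
for e in log:           # rebuild state by replaying from the beginning
    balance = apply(balance, e)
print(balance)          # 75
```

Because past events are never mutated, the full history doubles as an audit trail, and "undo" is just appending a compensating event.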

CQRS

Separate read/write models. Optimized independently.

Complex domains, high read load
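A minimal in-process CQRS sketch: one model takes commands, a separate denormalized model serves queries, wired by a callback standing in for a message bus. All names are illustrative.

```python
class OrderWriteModel:
    """Command side: validates and stores orders, then publishes events."""
    def __init__(self, publish):
        self.orders, self.publish = {}, publish

    def place_order(self, order_id: str, items: list):
        self.orders[order_id] = items  # normalized write model
        self.publish({"type": "order_placed", "items": len(items)})

class OrderStatsReadModel:
    """Query side: a denormalized view optimized for one dashboard query."""
    def __init__(self):
        self.total_orders, self.total_items = 0, 0

    def on_event(self, event):
        if event["type"] == "order_placed":
            self.total_orders += 1
            self.total_items += event["items"]

read = OrderStatsReadModel()
write = OrderWriteModel(publish=read.on_event)  # in production: a message bus
write.place_order("o-1", ["book", "pen"])
write.place_order("o-2", ["mug"])
print(read.total_orders, read.total_items)  # 2 3
```

With an asynchronous bus the read model lags the write model slightly, so CQRS usually implies eventual consistency between the two sides.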

Circuit Breaker

Fail fast when downstream is unhealthy. Closed → Open → Half-Open.

Fault tolerance
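The Closed → Open → Half-Open cycle in a sketch (threshold and cooldown values are illustrative):

```python
import time

class CircuitBreaker:
    """Closed -> Open after `threshold` consecutive failures; after
    `cooldown` seconds one trial call is allowed (Half-Open).
    The clock is injectable for testing."""
    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold, self.cooldown, self.clock = threshold, cooldown, clock
        self.failures, self.opened_at = 0, None

    def call(self, fn):
        if self.opened_at is not None:                    # Open
            if self.clock() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None                         # Half-Open: one trial
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()             # trip to Open
            raise
        self.failures = 0                                 # success re-closes
        return result

breaker = CircuitBreaker(threshold=2)

def flaky():
    raise IOError("downstream down")

for _ in range(2):
    try:
        breaker.call(flaky)
    except IOError:
        pass
try:
    breaker.call(flaky)   # now fails fast, never touching the downstream
except RuntimeError as e:
    print(e)              # circuit open: failing fast
```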

Sidecar

Proxy alongside each service. mTLS, observability, retry.

Service mesh (Istio, Envoy)

Bulkhead

Isolate failures. Separate thread pools / connection pools per dependency.

Resilience
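A bulkhead can be sketched with per-dependency semaphores standing in for dedicated thread/connection pools, so one slow dependency cannot exhaust capacity needed by the others. Dependency names and limits are illustrative.

```python
import threading

# One small pool per downstream dependency.
pools = {
    "payments":  threading.BoundedSemaphore(4),
    "inventory": threading.BoundedSemaphore(8),
}

def call_with_bulkhead(dependency: str, fn):
    sem = pools[dependency]
    if not sem.acquire(blocking=False):   # pool exhausted: shed load fast
        raise RuntimeError(f"{dependency} bulkhead full")
    try:
        return fn()
    finally:
        sem.release()

print(call_with_bulkhead("payments", lambda: "ok"))  # ok
```

If all 4 "payments" permits are held by calls stuck on a slow payment service, "inventory" calls still proceed; the failure stays in its compartment.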

Fundamental Theorems

CAP: at most 2 of 3 during a partition: Consistency, Availability, Partition Tolerance
PACELC: if Partitioned, trade Availability vs Consistency; Else, trade Latency vs Consistency
FLP: deterministic consensus is impossible in an asynchronous system with even one crash failure
Two Generals: no protocol guarantees agreement over an unreliable channel
Byzantine Fault Tolerance: up to ⌊(n-1)/3⌋ malicious nodes tolerable