04 — Databases & SQL Interview Guide

Priority: HIGH — PostgreSQL is one of your core strengths. You’ve optimized queries from minutes to milliseconds. Own this topic.


Table of Contents

  1. SQL Fundamentals
  2. PostgreSQL Specifics
  3. Indexing Deep Dive
  4. Query Optimization
  5. Transactions & Concurrency
  6. Schema Design & Normalization
  7. Scaling PostgreSQL
  8. PostGIS & Geospatial
  9. Redis
  10. NoSQL Concepts
  11. Common Interview Questions
  12. Resources

SQL Fundamentals

Query Execution Order

Actual execution order (not syntax order!):

1. FROM / JOIN       — determine working set of rows
2. WHERE             — filter rows
3. GROUP BY          — group rows
4. HAVING            — filter groups
5. SELECT            — choose columns, compute expressions
6. DISTINCT          — remove duplicates
7. ORDER BY          — sort results
8. LIMIT / OFFSET    — page results

This order explains why:
- You can't use column aliases in WHERE (SELECT hasn't run yet)
- You CAN use column aliases in ORDER BY (runs after SELECT)
- HAVING filters groups, WHERE filters rows (GROUP BY hasn't happened yet for WHERE)
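A quick illustration of the alias rules (table and column names are hypothetical):

```sql
-- Fails: WHERE runs before SELECT, so the alias doesn't exist yet
SELECT salary * 12 AS annual FROM employees WHERE annual > 100000;
-- ERROR: column "annual" does not exist

-- Works: repeat the expression (or wrap it in a subquery/CTE)
SELECT salary * 12 AS annual FROM employees WHERE salary * 12 > 100000;

-- Works: ORDER BY runs after SELECT, so the alias is visible
SELECT salary * 12 AS annual FROM employees ORDER BY annual DESC;
```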

Essential SQL Patterns

-- Window functions (frequently asked!)
SELECT
    name,
    department,
    salary,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank,
    AVG(salary) OVER (PARTITION BY department) as dept_avg,
    salary - LAG(salary) OVER (ORDER BY hire_date) as salary_diff_from_prev
FROM employees;

-- Common window functions: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD,
--   SUM, AVG, COUNT, FIRST_VALUE, LAST_VALUE, NTH_VALUE, NTILE

-- CTEs (Common Table Expressions)
WITH ranked_employees AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary DESC) as rn
    FROM employees
)
SELECT * FROM ranked_employees WHERE rn <= 3;  -- top 3 per department

-- Recursive CTE (tree/hierarchy traversal)
WITH RECURSIVE org_tree AS (
    -- Base case: root nodes (no manager)
    SELECT id, name, manager_id, 1 as depth
    FROM employees WHERE manager_id IS NULL
    UNION ALL
    -- Recursive case: children
    SELECT e.id, e.name, e.manager_id, ot.depth + 1
    FROM employees e
    JOIN org_tree ot ON e.manager_id = ot.id
)
SELECT * FROM org_tree ORDER BY depth, name;

-- LATERAL JOIN (like a for-each loop: run the subquery once per department
-- — here, the 3 most recent hires in each)
SELECT d.name, latest.*
FROM departments d
CROSS JOIN LATERAL (
    SELECT * FROM employees e
    WHERE e.dept_id = d.id
    ORDER BY hire_date DESC
    LIMIT 3
) AS latest;

-- UPSERT (INSERT ... ON CONFLICT)
INSERT INTO users (email, name, login_count)
VALUES ('karan@jadhav.dev', 'Karan', 1)
ON CONFLICT (email) DO UPDATE SET
    login_count = users.login_count + 1,
    last_login = NOW();

-- JSON operations
SELECT
    data->>'name' as name,                    -- extract text
    data->'address'->>'city' as city,         -- nested extract
    jsonb_array_elements(data->'tags') as tag  -- unnest array
FROM json_table
WHERE data @> '{"status": "active"}';          -- containment check

Joins — Know Them Cold

-- INNER JOIN: only matching rows from both tables
SELECT * FROM orders o INNER JOIN customers c ON o.customer_id = c.id;

-- LEFT JOIN: all rows from left + matching from right (NULL if no match)
SELECT * FROM customers c LEFT JOIN orders o ON c.id = o.customer_id;
-- Shows all customers, even those with no orders

-- RIGHT JOIN: all rows from right + matching from left
-- (Less common — just swap table order and use LEFT JOIN)

-- FULL OUTER JOIN: all rows from both tables
SELECT * FROM table_a a FULL OUTER JOIN table_b b ON a.id = b.id;

-- CROSS JOIN: cartesian product (every row × every row)
SELECT * FROM sizes CROSS JOIN colors;

-- SELF JOIN: join table to itself
SELECT e.name as employee, m.name as manager
FROM employees e LEFT JOIN employees m ON e.manager_id = m.id;

-- Anti-join: rows in A that have NO match in B
SELECT * FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
WHERE o.id IS NULL;  -- customers with no orders

-- Semi-join: rows in A that HAVE a match in B (without duplicating)
SELECT * FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);
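The anti-join can also be written with NOT EXISTS, which is the safer form when the join column is nullable:

```sql
-- Anti-join via NOT EXISTS — safe even if o.customer_id can be NULL.
-- (NOT IN returns zero rows if the subquery yields a single NULL.)
SELECT * FROM customers c
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);
```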

PostgreSQL Specifics

Architecture

Client → Connection (via PgBouncer / direct)
       → Backend Process (one per connection)
       → Shared Buffers (buffer cache in memory)
       → WAL (Write-Ahead Log) → Disk
       → Data Files → Disk

Key Components:
  - Shared Buffers: PostgreSQL's cache (typically 25% of RAM)
  - WAL: Write-ahead log for durability. Writes go to WAL first.
  - Autovacuum: Cleans up dead tuples from MVCC
  - Statistics Collector: Tracks table/index usage for query planner
  - Background Writer: Flushes dirty buffers to disk
  - Checkpointer: Periodic WAL checkpoint (recovery point)

MVCC (Multi-Version Concurrency Control)

How PostgreSQL handles concurrent access:
  - Each row version has xmin (creating transaction) and xmax (deleting transaction)
  - Readers don't block writers, writers don't block readers
  - Each transaction sees a consistent snapshot of data
  - Old row versions ("dead tuples") cleaned by VACUUM

Why it matters:
  - Explains why UPDATE creates new tuple (not in-place)
  - Explains table bloat if VACUUM doesn't run
  - Explains why long-running transactions are problematic
  - Explains why you see "dead tuples" in pg_stat_user_tables

Interview answer: "PostgreSQL uses MVCC for concurrency. Each transaction
sees a snapshot. UPDATEs create new row versions, old ones are cleaned by
VACUUM. This means reads never block writes, but requires regular
vacuuming to prevent table bloat."
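You can see MVCC directly — xmin/xmax are real hidden system columns, and pg_stat_user_tables is a real catalog view (the employees table is illustrative):

```sql
-- Hidden system columns on every row: xmin = creating txid, xmax = deleting txid
SELECT xmin, xmax, * FROM employees LIMIT 5;

-- Dead tuples awaiting VACUUM, per table
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
```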

Key PostgreSQL Features

- Partitioning: range, list, hash partitioning for large tables
- Materialized Views: precomputed query results (REFRESH MATERIALIZED VIEW)
- JSONB: binary JSON with indexing (GIN index)
- Full-text search: tsvector, tsquery, GIN indexes
- PostGIS: geospatial data types and functions
- Foreign Data Wrappers: query external databases as local tables
- Logical Replication: selective table replication
- Row-Level Security: fine-grained access control per row
- pg_stat_statements: query performance statistics
- EXPLAIN ANALYZE: query execution plan analysis

Indexing Deep Dive

Index Types in PostgreSQL

B-tree (default):
  - Balanced tree structure
  - Good for: equality (=), range (<, >, BETWEEN), ORDER BY, LIKE 'prefix%'
  - Most common index type
  - Supports: unique, multi-column, covering (INCLUDE)
  CREATE INDEX idx_users_email ON users (email);
  CREATE UNIQUE INDEX idx_users_email_unique ON users (email);

Hash:
  - Only equality comparisons (=)
  - Slightly faster than B-tree for equality, but limited
  - Rarely used (B-tree is almost always better)
  CREATE INDEX idx_users_id_hash ON users USING hash (id);

GiST (Generalized Search Tree):
  - For geometric, full-text, range types
  - PostGIS spatial indexes
  - Supports: containment, overlap, nearest-neighbor
  CREATE INDEX idx_geom ON buildings USING gist (geom);

GIN (Generalized Inverted Index):
  - For composite types: arrays, JSONB, full-text search
  - Stores: value → list of rows containing that value
  - Good for: @>, ?, ?&, ?| operators on JSONB
  CREATE INDEX idx_tags ON articles USING gin (tags);
  CREATE INDEX idx_data ON events USING gin (data jsonb_path_ops);
  -- jsonb_path_ops: smaller and faster, but supports only @> (not ?, ?&, ?|)

BRIN (Block Range Index):
  - For naturally ordered data (timestamps, sequential IDs)
  - Tiny index size — stores min/max per block range
  - Perfect for time-series data or append-only tables
  CREATE INDEX idx_created ON events USING brin (created_at);

Partial Index:
  - Index only rows matching a condition
  - Smaller, more efficient
  CREATE INDEX idx_active_users ON users (email) WHERE active = true;

Covering Index (INCLUDE):
  - Include extra columns in index for index-only scans
  CREATE INDEX idx_orders ON orders (customer_id) INCLUDE (total, status);

When to Create Indexes

DO index:
  - Columns in WHERE clauses (especially with equality/range)
  - Columns in JOIN conditions
  - Columns in ORDER BY / GROUP BY
  - Foreign key columns (avoid sequential scans on JOINs)
  - Columns with high selectivity (many unique values)

DON'T index:
  - Small tables (sequential scan is faster)
  - Columns with low selectivity (boolean, status with few values)
  - Tables with heavy write load and rare reads
  - Columns rarely used in queries

Remember:
  - Indexes speed up reads but slow down writes (must update index too)
  - Indexes consume storage
  - Multi-column index: leftmost prefix is usable
    INDEX (a, b, c) helps: WHERE a=1, WHERE a=1 AND b=2
    Does NOT help: WHERE b=2, WHERE c=3

Query Optimization

EXPLAIN ANALYZE

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT ...;

-- Key things to look for:
-- 1. Seq Scan on large tables (might need an index)
-- 2. Nested Loop joins with large tables (might need Hash/Merge join)
-- 3. Sort operations (might need index for ORDER BY)
-- 4. Large gap between "actual rows" and "estimated rows" (stale statistics → ANALYZE)
-- 5. Buffers: shared hit vs read (cache hit rate)

Common Optimization Techniques

1. Add appropriate indexes (B-tree, GiST, GIN, BRIN)

2. Rewrite queries:
   - Prefer EXISTS/NOT EXISTS over IN/NOT IN (NOT IN matches nothing if the
     subquery returns any NULL)
   - Avoid SELECT * — select only needed columns
   - Use LIMIT early when possible
   - Avoid functions on indexed columns in WHERE: WHERE LOWER(name) = 'karan'
     can't use a plain index on name
     → Fix with an expression index: CREATE INDEX ON users (LOWER(name));

3. Connection pooling (PgBouncer):
   - PostgreSQL forks a new process per connection (expensive)
   - PgBouncer maintains a pool of connections
   - Modes: session, transaction, statement

4. Partitioning large tables:
   - Range partitioning by date (most common)
   - PostgreSQL prunes partitions not matching WHERE clause
   - Reduces index size per partition

5. Materialized views for complex aggregations:
   CREATE MATERIALIZED VIEW monthly_stats AS
   SELECT date_trunc('month', created_at) AS month,
          COUNT(*) AS order_count, AVG(amount) AS avg_amount
   FROM orders GROUP BY 1;

   -- CONCURRENTLY doesn't block readers, but requires a UNIQUE index
   CREATE UNIQUE INDEX ON monthly_stats (month);
   REFRESH MATERIALIZED VIEW CONCURRENTLY monthly_stats;

6. Use pg_stat_statements to find slow queries:
   SELECT query, mean_exec_time, calls
   FROM pg_stat_statements
   ORDER BY total_exec_time DESC
   LIMIT 20;

7. Batch operations:
   - COPY instead of INSERT for bulk loading
   - Batch UPDATEs with CTEs
   - Use UNNEST for bulk operations
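A sketch of the UNNEST batching technique — one statement instead of N round-trips (table, column names, and values are assumptions):

```sql
-- Pass parallel arrays from the application, join them as a virtual table
UPDATE products p
SET price = v.price
FROM UNNEST(ARRAY[101, 102, 103],
            ARRAY[9.99, 19.99, 4.99]) AS v(id, price)
WHERE p.id = v.id;
```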

8. VACUUM and ANALYZE:
   - VACUUM: reclaims dead tuple space
   - ANALYZE: updates table statistics for query planner
   - Autovacuum should handle this, but check settings for large tables

Your story: "I optimized PostgreSQL for 1.8TB of building footprint data.
Key techniques: spatial indexing with GiST, BRIN indexes for timestamp
columns, table partitioning by geography, query plan analysis with EXPLAIN
ANALYZE, and connection pooling with PgBouncer. This reduced query times
from seconds to milliseconds."

Transactions & Concurrency

ACID Properties

Atomicity:   All or nothing. Transaction fully commits or fully rolls back.
Consistency: Database goes from one valid state to another.
Isolation:   Concurrent transactions don't interfere with each other.
Durability:  Once committed, data survives crashes (WAL ensures this).

Isolation Levels

| Level            | Dirty Read | Non-Repeatable Read | Phantom Read | In PostgreSQL                |
|------------------|------------|---------------------|--------------|------------------------------|
| Read Uncommitted | Possible   | Possible            | Possible     | Treated as Read Committed    |
| Read Committed   | No         | Possible            | Possible     | ← DEFAULT                    |
| Repeatable Read  | No         | No                  | Possible*    | Snapshot isolation           |
| Serializable     | No         | No                  | No           | True serializability (SSI)   |

* The SQL standard allows phantoms at Repeatable Read, but PostgreSQL's
  implementation (snapshot isolation) actually prevents them too.

Dirty Read: Reading uncommitted data from another transaction
Non-Repeatable Read: Same query returns different results within a transaction
Phantom Read: New rows appear in a repeated range query
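Setting the level per transaction (real syntax; the orders table is illustrative):

```sql
BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- Every statement now sees the snapshot taken at the first query;
-- rows committed by concurrent transactions are invisible here
SELECT COUNT(*) FROM orders;
COMMIT;

-- At SERIALIZABLE, a transaction may abort with serialization_failure
-- (SQLSTATE 40001) — the application must be prepared to retry it
```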

Locking

Row-level locks:
  - SELECT ... FOR UPDATE: exclusive lock on selected rows
  - SELECT ... FOR SHARE: shared lock (allows other reads, blocks writes)
  - FOR UPDATE SKIP LOCKED: skip already-locked rows (great for job queues!)

Advisory locks:
  - Application-level locks managed by PostgreSQL
  - pg_advisory_lock(key): blocking
  - pg_try_advisory_lock(key): non-blocking

Deadlocks:
  - Two transactions each waiting for the other's locked resource
  - PostgreSQL detects and aborts one transaction
  - Prevention: always lock resources in the same order
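The SKIP LOCKED job-queue pattern mentioned above, sketched against a hypothetical jobs table:

```sql
-- Each worker atomically claims one pending job; rows locked by other
-- workers are skipped, so workers never block each other
WITH next_job AS (
    SELECT id FROM jobs
    WHERE status = 'pending'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
UPDATE jobs SET status = 'running', started_at = NOW()
FROM next_job
WHERE jobs.id = next_job.id
RETURNING jobs.*;
```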

Schema Design & Normalization

Normal Forms

1NF: Atomic values (no arrays or nested tables in columns)
2NF: 1NF + no partial dependencies (every non-key column depends on whole PK)
3NF: 2NF + no transitive dependencies (non-key columns don't depend on each other)

In practice:
  - Normalize to 3NF by default
  - Denormalize strategically for read performance
  - Document why you denormalized (informed decision, not laziness)

Schema Design Best Practices

1. Use appropriate data types:
   - UUID vs SERIAL for primary keys
   - TIMESTAMPTZ (not TIMESTAMP) for timestamps
   - NUMERIC for money (not FLOAT)
   - TEXT over VARCHAR (PostgreSQL treats them the same internally)
   - JSONB for semi-structured data (not JSON — JSONB is binary, indexable)

2. Constraints:
   - NOT NULL by default (add NULL only when truly optional)
   - Foreign keys for referential integrity
   - CHECK constraints for business rules
   - UNIQUE constraints/indexes for natural keys

3. Naming conventions:
   - Plural table names: users, orders, products
   - Snake_case: created_at, first_name
   - FK column matches referenced table: user_id → users.id

4. Common patterns:
   - Soft delete: deleted_at TIMESTAMPTZ (not boolean)
   - Audit columns: created_at, updated_at on every table
   - Polymorphic associations: type + id columns
   - ENUM types for fixed sets: CREATE TYPE status AS ENUM ('active', 'inactive')
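The patterns above in DDL form — a minimal sketch with a hypothetical users table:

```sql
CREATE TYPE user_status AS ENUM ('active', 'inactive');

CREATE TABLE users (
    id          BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    email       TEXT NOT NULL UNIQUE,
    status      user_status NOT NULL DEFAULT 'active',
    created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),   -- audit columns
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    deleted_at  TIMESTAMPTZ                            -- soft delete: NULL = live
);

-- Partial index keeps queries over live rows small and fast
CREATE INDEX idx_users_live ON users (email) WHERE deleted_at IS NULL;
```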

Scaling PostgreSQL

Vertical Scaling:
  - More RAM (bigger shared_buffers, effective_cache_size)
  - Faster storage (NVMe SSDs)
  - More CPU cores (parallel query execution)

Read Replicas:
  - Streaming replication: async or sync
  - Route reads to replicas, writes to primary
  - Replication lag: typically milliseconds (async); sync replication removes
    data-loss risk but adds write latency
  - Tools: Patroni for HA, PgBouncer for connection routing

Partitioning:
  - Declarative partitioning (PostgreSQL 10+)
  - Range: by date (most common), by ID range
  - List: by category, region
  - Hash: for even distribution
  - Partition pruning: planner skips irrelevant partitions
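Declarative range partitioning, sketched (table and partition names assumed):

```sql
CREATE TABLE events (
    created_at TIMESTAMPTZ NOT NULL,
    payload    JSONB
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2025_01 PARTITION OF events
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE events_2025_02 PARTITION OF events
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');

-- WHERE created_at >= '2025-02-10' is pruned to events_2025_02 only
```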

Sharding:
  - PostgreSQL doesn't natively shard (unlike Cassandra)
  - Options: Citus extension, application-level sharding
  - Foreign Data Wrappers for cross-shard queries

Connection Pooling:
  - PgBouncer: external connection pooler
  - Modes: session (safest), transaction (most efficient), statement
  - Why: PostgreSQL process-per-connection model doesn't scale to thousands

PostGIS & Geospatial

This is your superpower. Few candidates know this.

-- Geometry vs Geography types
-- Geometry: planar (flat earth), faster, for local areas
-- Geography: spherical (real earth), accurate for global, slower

-- Spatial indexing (GiST)
CREATE INDEX idx_buildings_geom ON buildings USING gist (geom);

-- Common spatial queries
-- Find buildings within a bounding box
SELECT * FROM buildings
WHERE geom && ST_MakeEnvelope(xmin, ymin, xmax, ymax, 4326);

-- Find buildings within 1km of a point
SELECT * FROM buildings
WHERE ST_DWithin(
    geom::geography,
    ST_SetSRID(ST_MakePoint(72.8777, 19.0760), 4326)::geography,
    1000  -- meters
);

-- Spatial join: find buildings in each district
SELECT d.name, COUNT(b.id)
FROM districts d
JOIN buildings b ON ST_Contains(d.geom, b.geom)
GROUP BY d.name;

-- Area calculation
SELECT name, ST_Area(geom::geography) as area_sq_meters
FROM parcels;

Key concepts for interviews:
  - SRID 4326: WGS84 (GPS coordinates, lat/long)
  - GiST index: R-tree based, essential for spatial queries
  - ST_Contains, ST_Intersects, ST_DWithin: most used functions
  - Bounding box operators (&&) for fast filtering before precise calculations
  - Index-optimized: ST_DWithin, && (bounding box), ST_Intersects
  - NOT index-optimized: ST_Distance (use ST_DWithin instead)

Redis

What: In-memory key-value store. Used for caching, sessions, queues, real-time.

Data Structures:
  - String: simple key-value, counters (INCR/DECR), TTL
  - Hash: object-like storage (HSET, HGET, HGETALL)
  - List: ordered collection, queues (LPUSH, RPOP), bounded lists
  - Set: unique members, intersections/unions
  - Sorted Set: ranked members (leaderboards, priority queues)
  - Stream: append-only log (like Kafka lite)

Common Patterns:
  - Cache-aside: check Redis → miss → query DB → store in Redis
  - Rate limiting: INCR + EXPIRE (simple fixed-window counter; token bucket
    needs a Lua script for atomicity)
  - Session storage: hash with TTL
  - Pub/Sub: real-time notifications
  - Distributed lock: SET key value NX PX milliseconds (single node; the
    Redlock pattern extends this across several Redis nodes)
  - Job queue: LPUSH + BRPOP (blocking pop)

Eviction Policies:
  - noeviction: return errors when memory full
  - allkeys-lru: evict least recently used keys
  - volatile-lru: evict LRU among keys with TTL
  - allkeys-random: evict random keys

Persistence:
  - RDB: periodic snapshots (point-in-time backup)
  - AOF: append every write (more durable, larger files)
  - Both: RDB for backup, AOF for durability

Your experience: "I use Redis extensively — for caching API responses with
TTL, as a message broker for background tasks, for rate limiting, and as a
session store. My open-source project RediServe provides HTTP API access
to Redis operations."

NoSQL Concepts

When to use NoSQL over SQL:
  - Need horizontal scalability for simple lookups
  - Schema is truly flexible/evolving
  - Don't need complex joins or transactions
  - Very high write throughput needed

Types:
  Key-Value (Redis, DynamoDB):
    Use for: caching, sessions, simple lookups
    Access: by key only

  Document (MongoDB, CouchDB):
    Use for: flexible schema, nested data, content management
    Access: by key or queries on document fields

  Column-Family (Cassandra, HBase):
    Use for: time-series, write-heavy, wide rows
    Access: by partition key + clustering columns

  Graph (Neo4j, Amazon Neptune):
    Use for: social networks, recommendations, knowledge graphs
    Access: graph traversal queries

Interview answer for "SQL vs NoSQL":
  "It depends on the use case. I default to PostgreSQL because ACID
  guarantees, complex queries, and strong ecosystem matter for most
  applications. I use Redis for caching and real-time features.
  I'd consider a document store if the schema is truly dynamic,
  or a column-family store for append-heavy time-series at massive scale."

Common Interview Questions

Q: How would you design a database for a social media app?
A: Tables: users, posts, follows, likes, comments
   Denormalize: follower_count, like_count on posts (with triggers or app logic)
   Index: follows(follower_id), follows(following_id), posts(user_id, created_at)
   Partitioning: posts by date range
   Read replicas for timeline queries

Q: Your query is slow. How do you debug it?
A: 1. EXPLAIN ANALYZE to see execution plan
   2. Check for Seq Scans on large tables → add indexes
   3. Check row estimates vs actual (stale stats → ANALYZE)
   4. Check for N+1 queries in application code
   5. Check connection count (PgBouncer)
   6. Check table bloat (pg_stat_user_tables → dead tuples)
   7. Check for lock contention (pg_stat_activity)
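Handy diagnostic for step 7 — pg_stat_activity and pg_blocking_pids() are real; this lists sessions currently waiting on someone else's lock:

```sql
-- Sessions blocked on a lock, and the PIDs blocking them
SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```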

Q: What's the difference between clustered and non-clustered index?
A: PostgreSQL doesn't have clustered indexes in the SQL Server sense.
   But CLUSTER command physically reorders table data by an index (one-time).
   All PostgreSQL indexes are "non-clustered" — they point to heap tuples.
   The closest thing: BRIN index on naturally ordered data.

Q: How do you handle database migrations safely?
A: 1. Always forward-only migrations (no destructive changes in prod)
   2. Add new columns as nullable → backfill → add NOT NULL constraint
   3. Never rename/drop columns directly — add new, migrate, then drop old
   4. Use CREATE INDEX CONCURRENTLY (doesn't lock the table)
   5. Test migrations on a copy of production data
   6. Have a rollback plan
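Steps 2 and 4 above as migration SQL — a sketch using a hypothetical orders.currency column:

```sql
-- 1) Add the column nullable (metadata-only change, no table rewrite)
ALTER TABLE orders ADD COLUMN currency TEXT;

-- 2) Backfill in batches to avoid one long, lock-holding transaction
UPDATE orders SET currency = 'USD'
WHERE currency IS NULL AND id BETWEEN 1 AND 100000;
-- ... repeat per id range ...

-- 3) Enforce NOT NULL without a full-table scan under an exclusive lock:
--    add the CHECK as NOT VALID, validate it separately, then (PG12+)
--    SET NOT NULL reuses the validated CHECK and skips the scan
ALTER TABLE orders ADD CONSTRAINT currency_not_null
    CHECK (currency IS NOT NULL) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT currency_not_null;
ALTER TABLE orders ALTER COLUMN currency SET NOT NULL;
ALTER TABLE orders DROP CONSTRAINT currency_not_null;  -- optional cleanup

-- 4) Build the index without blocking writes
CREATE INDEX CONCURRENTLY idx_orders_currency ON orders (currency);
```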

Resources


My Notes

Queries I've optimized (for interview stories):
-

PostgreSQL features I've used:
-

Things I need to review:
-

Next: 05-backend-and-api-design.md