04 — Databases & SQL Interview Guide
Priority: HIGH — PostgreSQL is one of your core strengths. You’ve optimized queries from minutes to milliseconds. Own this topic.
Table of Contents
- SQL Fundamentals
- PostgreSQL Specifics
- Indexing Deep Dive
- Query Optimization
- Transactions & Concurrency
- Schema Design & Normalization
- Scaling PostgreSQL
- PostGIS & Geospatial
- Redis
- NoSQL Concepts
- Common Interview Questions
- Resources
SQL Fundamentals
Query Execution Order
Actual execution order (not syntax order!):
1. FROM / JOIN — determine working set of rows
2. WHERE — filter rows
3. GROUP BY — group rows
4. HAVING — filter groups
5. SELECT — choose columns, compute expressions
6. DISTINCT — remove duplicates
7. ORDER BY — sort results
8. LIMIT / OFFSET — page results
This order explains why:
- You can't use column aliases in WHERE (SELECT hasn't run yet)
- You CAN use column aliases in ORDER BY (runs after SELECT)
- HAVING filters groups, WHERE filters rows (WHERE runs before GROUP BY, so it can't reference aggregates)
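A quick way to see this order in action is to run a grouped query where the SELECT alias is used in ORDER BY. This is a minimal sketch using SQLite through Python (the logical execution order is the same as PostgreSQL's); the employees data is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
    ("Ana", "eng", 100), ("Bo", "eng", 90), ("Cy", "ops", 80),
])

rows = conn.execute("""
    SELECT department, AVG(salary) AS avg_sal
    FROM employees
    GROUP BY department
    HAVING AVG(salary) > 50   -- filters groups, runs after GROUP BY
    ORDER BY avg_sal DESC     -- the SELECT alias works here (runs after SELECT)
""").fetchall()
print(rows)  # [('eng', 95.0), ('ops', 80.0)]
```

The same alias in a WHERE clause would be an error, because WHERE is evaluated before SELECT computes it.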
Essential SQL Patterns
-- Window functions (frequently asked!)
SELECT
name,
department,
salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank,
AVG(salary) OVER (PARTITION BY department) as dept_avg,
salary - LAG(salary) OVER (ORDER BY hire_date) as salary_diff_from_prev
FROM employees;
-- Common window functions: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD,
-- SUM, AVG, COUNT, FIRST_VALUE, LAST_VALUE, NTH_VALUE, NTILE
-- CTEs (Common Table Expressions)
WITH ranked_employees AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary DESC) as rn
FROM employees
)
SELECT * FROM ranked_employees WHERE rn <= 3; -- top 3 per department
-- Recursive CTE (tree/hierarchy traversal)
WITH RECURSIVE org_tree AS (
-- Base case: root nodes (no manager)
SELECT id, name, manager_id, 1 as depth
FROM employees WHERE manager_id IS NULL
UNION ALL
-- Recursive case: children
SELECT e.id, e.name, e.manager_id, ot.depth + 1
FROM employees e
JOIN org_tree ot ON e.manager_id = ot.id
)
SELECT * FROM org_tree ORDER BY depth, name;
-- LATERAL JOIN (like a for-each loop)
SELECT d.name, latest.*
FROM departments d
CROSS JOIN LATERAL (
SELECT * FROM employees e
WHERE e.dept_id = d.id
ORDER BY hire_date DESC
LIMIT 3
) AS latest;
-- UPSERT (INSERT ... ON CONFLICT)
INSERT INTO users (email, name, login_count)
VALUES ('karan@jadhav.dev', 'Karan', 1)
ON CONFLICT (email) DO UPDATE SET
login_count = users.login_count + 1,
last_login = NOW();
-- JSON operations
SELECT
data->>'name' as name, -- extract text
data->'address'->>'city' as city, -- nested extract
jsonb_array_elements(data->'tags') as tag -- unnest array
FROM json_table
WHERE data @> '{"status": "active"}'; -- containment check
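As a portable sketch of the same extraction idea, SQLite's `json_extract` (available when SQLite is built with JSON support, the default in recent versions) plays the role of `->>`; PostgreSQL's operators above are richer (`@>` containment, GIN indexing). The row data is made up.

```python
import sqlite3, json

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE json_table (data TEXT)")
conn.execute("INSERT INTO json_table VALUES (?)", (json.dumps(
    {"name": "Karan", "address": {"city": "Mumbai"}, "status": "active"}
),))

row = conn.execute("""
    SELECT json_extract(data, '$.name'),          -- like data->>'name'
           json_extract(data, '$.address.city')   -- like data->'address'->>'city'
    FROM json_table
    WHERE json_extract(data, '$.status') = 'active'
""").fetchone()
print(row)  # ('Karan', 'Mumbai')
```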
Joins — Know Them Cold
-- INNER JOIN: only matching rows from both tables
SELECT * FROM orders o INNER JOIN customers c ON o.customer_id = c.id;
-- LEFT JOIN: all rows from left + matching from right (NULL if no match)
SELECT * FROM customers c LEFT JOIN orders o ON c.id = o.customer_id;
-- Shows all customers, even those with no orders
-- RIGHT JOIN: all rows from right + matching from left
-- (Less common — just swap table order and use LEFT JOIN)
-- FULL OUTER JOIN: all rows from both tables
SELECT * FROM table_a a FULL OUTER JOIN table_b b ON a.id = b.id;
-- CROSS JOIN: cartesian product (every row × every row)
SELECT * FROM sizes CROSS JOIN colors;
-- SELF JOIN: join table to itself
SELECT e.name as employee, m.name as manager
FROM employees e LEFT JOIN employees m ON e.manager_id = m.id;
-- Anti-join: rows in A that have NO match in B
SELECT * FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
WHERE o.id IS NULL; -- customers with no orders
-- Semi-join: rows in A that HAVE a match in B (without duplicating)
SELECT * FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);
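The anti-join and semi-join patterns above can be checked end to end with a tiny SQLite session (same SQL shape as PostgreSQL); tables and rows here are invented for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Bo');
    INSERT INTO orders VALUES (10, 1), (11, 1);   -- only Ana has orders
""")

# Anti-join: customers with NO orders
no_orders = conn.execute("""
    SELECT c.name FROM customers c
    LEFT JOIN orders o ON c.id = o.customer_id
    WHERE o.id IS NULL
""").fetchall()

# Semi-join: customers with at least one order, no duplicate rows
with_orders = conn.execute("""
    SELECT c.name FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
""").fetchall()

print(no_orders, with_orders)  # [('Bo',)] [('Ana',)]
```

Note the semi-join returns Ana once even though she has two orders; a plain INNER JOIN would duplicate her row.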
PostgreSQL Specifics
Architecture
Client → Connection (via PgBouncer / direct)
→ Backend Process (one per connection)
→ Shared Buffers (buffer cache in memory)
→ WAL (Write-Ahead Log) → Disk
→ Data Files → Disk
Key Components:
- Shared Buffers: PostgreSQL's cache (typically 25% of RAM)
- WAL: Write-ahead log for durability. Writes go to WAL first.
- Autovacuum: Cleans up dead tuples from MVCC
- Statistics Collector: Tracks table/index usage for query planner
- Background Writer: Flushes dirty buffers to disk
- Checkpointer: Periodic WAL checkpoint (recovery point)
MVCC (Multi-Version Concurrency Control)
How PostgreSQL handles concurrent access:
- Each row version has xmin (creating transaction) and xmax (deleting transaction)
- Readers don't block writers, writers don't block readers
- Each transaction sees a consistent snapshot of data
- Old row versions ("dead tuples") cleaned by VACUUM
Why it matters:
- Explains why UPDATE creates new tuple (not in-place)
- Explains table bloat if VACUUM doesn't run
- Explains why long-running transactions are problematic
- Explains why you see "dead tuples" in pg_stat_user_tables
Interview answer: "PostgreSQL uses MVCC for concurrency. Each transaction
sees a snapshot. UPDATEs create new row versions, old ones are cleaned by
VACUUM. This means reads never block writes, but requires regular
vacuuming to prevent table bloat."
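The xmin/xmax visibility rule can be sketched as a toy function. This is heavily simplified (it ignores in-progress and aborted transactions, hint bits, and subtransactions), but it captures why an UPDATE makes a new tuple and why old snapshots still see the old one.

```python
# Toy model of MVCC snapshot visibility (NOT PostgreSQL's real algorithm).
def visible(xmin, xmax, snapshot_xid):
    """A row version is visible if created at or before the snapshot's
    transaction, and not deleted by then (xmax unset or later)."""
    created_before = xmin <= snapshot_xid
    not_deleted = xmax is None or xmax > snapshot_xid
    return created_before and not_deleted

# UPDATE by txn 205 = new version: old tuple gets xmax=205, new gets xmin=205
old_version = dict(xmin=100, xmax=205, value="v1")
new_version = dict(xmin=205, xmax=None, value="v2")

# A snapshot at txn 200 still sees the old version (readers aren't blocked)
assert visible(old_version["xmin"], old_version["xmax"], 200)
assert not visible(new_version["xmin"], new_version["xmax"], 200)
# A snapshot at txn 210 sees only the new version
assert not visible(old_version["xmin"], old_version["xmax"], 210)
assert visible(new_version["xmin"], new_version["xmax"], 210)
```

The old version becomes a dead tuple once no snapshot can see it; that is exactly what VACUUM reclaims.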
Key PostgreSQL Features
- Partitioning: range, list, hash partitioning for large tables
- Materialized Views: precomputed query results (REFRESH MATERIALIZED VIEW)
- JSONB: binary JSON with indexing (GIN index)
- Full-text search: tsvector, tsquery, GIN indexes
- PostGIS: geospatial data types and functions
- Foreign Data Wrappers: query external databases as local tables
- Logical Replication: selective table replication
- Row-Level Security: fine-grained access control per row
- pg_stat_statements: query performance statistics
- EXPLAIN ANALYZE: query execution plan analysis
Indexing Deep Dive
Index Types in PostgreSQL
B-tree (default):
- Balanced tree structure
- Good for: equality (=), range (<, >, BETWEEN), ORDER BY, LIKE 'prefix%' (prefix LIKE needs C collation or a text_pattern_ops index)
- Most common index type
- Supports: unique, multi-column, covering (INCLUDE)
CREATE INDEX idx_users_email ON users (email);
CREATE UNIQUE INDEX idx_users_email_unique ON users (email);
Hash:
- Only equality comparisons (=)
- Slightly faster than B-tree for equality, but limited
- Rarely used (B-tree is almost always better)
CREATE INDEX idx_users_id_hash ON users USING hash (id);
GiST (Generalized Search Tree):
- For geometric, full-text, range types
- PostGIS spatial indexes
- Supports: containment, overlap, nearest-neighbor
CREATE INDEX idx_geom ON buildings USING gist (geom);
GIN (Generalized Inverted Index):
- For composite types: arrays, JSONB, full-text search
- Stores: value → list of rows containing that value
- Good for: @>, ?, ?&, ?| operators on JSONB
CREATE INDEX idx_tags ON articles USING gin (tags);
CREATE INDEX idx_data ON events USING gin (data jsonb_path_ops);
BRIN (Block Range Index):
- For naturally ordered data (timestamps, sequential IDs)
- Tiny index size — stores min/max per block range
- Perfect for time-series data or append-only tables
CREATE INDEX idx_created ON events USING brin (created_at);
Partial Index:
- Index only rows matching a condition
- Smaller, more efficient
CREATE INDEX idx_active_users ON users (email) WHERE active = true;
Covering Index (INCLUDE):
- Include extra columns in index for index-only scans
CREATE INDEX idx_orders ON orders (customer_id) INCLUDE (total, status);
When to Create Indexes
DO index:
- Columns in WHERE clauses (especially with equality/range)
- Columns in JOIN conditions
- Columns in ORDER BY / GROUP BY
- Foreign key columns (avoid sequential scans on JOINs)
- Columns with high selectivity (many unique values)
DON'T index:
- Small tables (sequential scan is faster)
- Columns with low selectivity (boolean, status with few values)
- Tables with heavy write load and rare reads
- Columns rarely used in queries
Remember:
- Indexes speed up reads but slow down writes (must update index too)
- Indexes consume storage
- Multi-column index: leftmost prefix is usable
INDEX (a, b, c) helps: WHERE a=1, WHERE a=1 AND b=2
Does NOT help: WHERE b=2, WHERE c=3
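The leftmost-prefix rule can be demonstrated with SQLite's EXPLAIN QUERY PLAN (PostgreSQL's B-tree behaves analogously); the table and index names below are invented for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (a INT, b INT, c INT, payload TEXT);
    CREATE INDEX idx_abc ON t (a, b, c);
""")

def plan(where):
    # Last column of each EXPLAIN QUERY PLAN row is the human-readable detail
    rows = conn.execute(f"EXPLAIN QUERY PLAN SELECT payload FROM t WHERE {where}")
    return " ".join(r[-1] for r in rows)

print(plan("a = 1"))            # index search: leading column present
print(plan("a = 1 AND b = 2"))  # index search: prefix (a, b) usable
print(plan("b = 2"))            # full table scan: leading column missing
```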
Query Optimization
EXPLAIN ANALYZE
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT ...;
-- Key things to look for:
-- 1. Seq Scan on large tables (might need an index)
-- 2. Nested Loop joins with large tables (might need Hash/Merge join)
-- 3. Sort operations (might need index for ORDER BY)
-- 4. Large gap between "estimated rows" and "actual rows" (stale statistics → ANALYZE)
-- 5. Buffers: shared hit vs read (cache hit rate)
Common Optimization Techniques
1. Add appropriate indexes (B-tree, GiST, GIN, BRIN)
2. Rewrite queries:
- Prefer EXISTS / NOT EXISTS over IN / NOT IN for subqueries (NOT IN with NULLs silently matches nothing)
- Avoid SELECT * — select only needed columns
- Use LIMIT early when possible
- Avoid functions on indexed columns in WHERE: WHERE LOWER(name) = 'karan' can't use a plain index on name
→ Use an expression index: CREATE INDEX ON users (LOWER(name)); or the citext type for case-insensitive columns
3. Connection pooling (PgBouncer):
- PostgreSQL forks a new process per connection (expensive)
- PgBouncer maintains a pool of connections
- Modes: session, transaction, statement
4. Partitioning large tables:
- Range partitioning by date (most common)
- PostgreSQL prunes partitions not matching WHERE clause
- Reduces index size per partition
5. Materialized views for complex aggregations:
CREATE MATERIALIZED VIEW monthly_stats AS
SELECT date_trunc('month', created_at) as month, COUNT(*), AVG(amount)
FROM orders GROUP BY 1;
REFRESH MATERIALIZED VIEW CONCURRENTLY monthly_stats;
6. Use pg_stat_statements to find slow queries:
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
7. Batch operations:
- COPY instead of INSERT for bulk loading
- Batch UPDATEs with CTEs
- Use UNNEST for bulk operations
8. VACUUM and ANALYZE:
- VACUUM: reclaims dead tuple space
- ANALYZE: updates table statistics for query planner
- Autovacuum should handle this, but check settings for large tables
Your story: "I optimized PostgreSQL for 1.8TB of building footprint data.
Key techniques: spatial indexing with GiST, BRIN indexes for timestamp
columns, table partitioning by geography, query plan analysis with EXPLAIN
ANALYZE, and connection pooling with PgBouncer. This reduced query times
from seconds to milliseconds."
Transactions & Concurrency
ACID Properties
Atomicity: All or nothing. Transaction fully commits or fully rolls back.
Consistency: Database goes from one valid state to another.
Isolation: Concurrent transactions don't interfere with each other.
Durability: Once committed, data survives crashes (WAL ensures this).
Isolation Levels
| Level | Dirty Read | Non-Repeatable Read | Phantom Read | PostgreSQL Default |
|--------------------|-----------|--------------------|--------------|--------------------|
| Read Uncommitted | Possible | Possible | Possible | No (treated as Read Committed) |
| Read Committed | No | Possible | Possible | ← DEFAULT |
| Repeatable Read    | No        | No                 | Possible*    |                    |
| Serializable       | No        | No                 | No           |                    |
* Per the SQL standard; PostgreSQL's Repeatable Read (snapshot isolation) actually prevents phantoms too
Dirty Read: Reading uncommitted data from another transaction
Non-Repeatable Read: Same query returns different results within a transaction
Phantom Read: New rows appear in a repeated range query
Locking
Row-level locks:
- SELECT ... FOR UPDATE: exclusive lock on selected rows
- SELECT ... FOR SHARE: shared lock (allows other reads, blocks writes)
- FOR UPDATE SKIP LOCKED: skip already-locked rows (great for job queues!)
Advisory locks:
- Application-level locks managed by PostgreSQL
- pg_advisory_lock(key): blocking
- pg_try_advisory_lock(key): non-blocking
Deadlocks:
- Two transactions each waiting for the other's locked resource
- PostgreSQL detects and aborts one transaction
- Prevention: always lock resources in the same order
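The "always lock resources in the same order" rule can be sketched with Python threads standing in for transactions (illustrative only; a real database deadlock involves row or table locks). Both workers sort the locks into one global order before acquiring, so neither can hold one lock while waiting on the other.

```python
import threading

lock_a, lock_b = threading.Lock(), threading.Lock()
done = []

def transfer(first, second, name):
    # Acquire in a consistent global order (here: by object id),
    # regardless of the order the caller passed them in.
    ordered = sorted((first, second), key=id)
    for lock in ordered:
        lock.acquire()
    done.append(name)  # critical section
    for lock in reversed(ordered):
        lock.release()

# The two "transactions" name the locks in opposite orders, the classic
# deadlock setup, but the sort makes the acquisition order identical.
t1 = threading.Thread(target=transfer, args=(lock_a, lock_b, "t1"))
t2 = threading.Thread(target=transfer, args=(lock_b, lock_a, "t2"))
t1.start(); t2.start(); t1.join(); t2.join()
print(sorted(done))  # ['t1', 't2']: both finish, no deadlock
```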
Schema Design & Normalization
Normal Forms
1NF: Atomic values (no arrays or nested tables in columns)
2NF: 1NF + no partial dependencies (every non-key column depends on whole PK)
3NF: 2NF + no transitive dependencies (non-key columns don't depend on each other)
In practice:
- Normalize to 3NF by default
- Denormalize strategically for read performance
- Document why you denormalized (informed decision, not laziness)
Schema Design Best Practices
1. Use appropriate data types:
- UUID vs SERIAL/IDENTITY for primary keys (random UUIDs hurt B-tree locality; integers don't merge across shards)
- TIMESTAMPTZ (not TIMESTAMP) for timestamps
- NUMERIC for money (not FLOAT)
- TEXT over VARCHAR (PostgreSQL treats them the same internally)
- JSONB for semi-structured data (not JSON — JSONB is binary, indexable)
2. Constraints:
- NOT NULL by default (add NULL only when truly optional)
- Foreign keys for referential integrity
- CHECK constraints for business rules
- UNIQUE constraints/indexes for natural keys
3. Naming conventions:
- Plural table names: users, orders, products
- Snake_case: created_at, first_name
- FK column matches referenced table: user_id → users.id
4. Common patterns:
- Soft delete: deleted_at TIMESTAMPTZ (not boolean)
- Audit columns: created_at, updated_at on every table
- Polymorphic associations: type + id columns
- ENUM types for fixed sets: CREATE TYPE status AS ENUM ('active', 'inactive')
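The soft-delete pattern above, a timestamp rather than a boolean, keeps *when* the row was deleted for free. A minimal sketch in SQLite (PostgreSQL would use TIMESTAMPTZ and NOW()); table and rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, deleted_at TEXT);
    INSERT INTO users (id, email) VALUES (1, 'a@example.com'), (2, 'b@example.com');
""")

# "Delete" user 2 by stamping deleted_at instead of removing the row
conn.execute("UPDATE users SET deleted_at = datetime('now') WHERE id = 2")

# Live rows are simply WHERE deleted_at IS NULL
live = conn.execute("SELECT email FROM users WHERE deleted_at IS NULL").fetchall()
print(live)  # [('a@example.com',)]
```

A partial index (WHERE deleted_at IS NULL) keeps lookups on live rows as fast as a hard delete would.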
Scaling PostgreSQL
Vertical Scaling:
- More RAM (bigger shared_buffers, effective_cache_size)
- Faster storage (NVMe SSDs)
- More CPU cores (parallel query execution)
Read Replicas:
- Streaming replication: async or sync
- Route reads to replicas, writes to primary
- Replication lag: milliseconds (async) to zero (sync, but slower writes)
- Tools: Patroni for HA, PgBouncer for connection routing
Partitioning:
- Declarative partitioning (PostgreSQL 10+)
- Range: by date (most common), by ID range
- List: by category, region
- Hash: for even distribution
- Partition pruning: planner skips irrelevant partitions
Sharding:
- PostgreSQL doesn't natively shard (unlike Cassandra)
- Options: Citus extension, application-level sharding
- Foreign Data Wrappers for cross-shard queries
Connection Pooling:
- PgBouncer: external connection pooler
- Modes: session (safest), transaction (most efficient), statement
- Why: PostgreSQL process-per-connection model doesn't scale to thousands
PostGIS & Geospatial
This is your superpower. Few candidates know this.
-- Geometry vs Geography types
-- Geometry: planar (flat earth), faster, for local areas
-- Geography: spherical (real earth), accurate for global, slower
-- Spatial indexing (GiST)
CREATE INDEX idx_buildings_geom ON buildings USING gist (geom);
-- Common spatial queries
-- Find buildings within a bounding box
SELECT * FROM buildings
WHERE geom && ST_MakeEnvelope(xmin, ymin, xmax, ymax, 4326);
-- Find buildings within 1km of a point
SELECT * FROM buildings
WHERE ST_DWithin(
geom::geography,
ST_SetSRID(ST_MakePoint(72.8777, 19.0760), 4326)::geography,
1000 -- meters
);
-- Spatial join: find buildings in each district
SELECT d.name, COUNT(b.id)
FROM districts d
JOIN buildings b ON ST_Contains(d.geom, b.geom)
GROUP BY d.name;
-- Area calculation
SELECT name, ST_Area(geom::geography) as area_sq_meters
FROM parcels;
Key concepts for interviews:
- SRID 4326: WGS84 (GPS coordinates, lat/long)
- GiST index: R-tree based, essential for spatial queries
- ST_Contains, ST_Intersects, ST_DWithin: most used functions
- Bounding box operators (&&) for fast filtering before precise calculations
- Index-optimized: ST_DWithin, && (bounding box), ST_Intersects
- NOT index-optimized: ST_Distance (use ST_DWithin instead)
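A back-of-envelope sketch of what the geography type is computing: great-circle distance in meters. This uses the simple haversine formula on a sphere (PostGIS is more precise; it works on a spheroid), with the Mumbai point from the ST_DWithin example and a second point 0.01° east.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lon1, lat1, lon2, lat2, r=6_371_000):
    """Approximate great-circle distance in meters on a spherical Earth."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    h = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * r * asin(sqrt(h))

# 0.01° of longitude at Mumbai's latitude is roughly 1 km, not 1.11 km:
# longitude degrees shrink by cos(latitude), which is why geometry-type
# degree math is wrong for distances and geography (meters) is right.
d = haversine_m(72.8777, 19.0760, 72.8877, 19.0760)
print(round(d), "meters")
```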
Redis
What: In-memory key-value store. Used for caching, sessions, queues, real-time.
Data Structures:
- String: simple key-value, counters (INCR/DECR), TTL
- Hash: object-like storage (HSET, HGET, HGETALL)
- List: ordered collection, queues (LPUSH, RPOP), bounded lists
- Set: unique members, intersections/unions
- Sorted Set: ranked members (leaderboards, priority queues)
- Stream: append-only log (like Kafka lite)
Common Patterns:
- Cache-aside: check Redis → miss → query DB → store in Redis
- Rate limiting: INCR + EXPIRE (fixed-window counter; token bucket needs a Lua script or sorted set)
- Session storage: hash with TTL
- Pub/Sub: real-time notifications
- Distributed lock: SET key value NX PX milliseconds (single node; Redlock extends this across multiple nodes)
- Job queue: LPUSH + BRPOP (blocking pop)
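The INCR + EXPIRE pattern can be sketched in memory without a Redis server. This toy class mirrors the fixed-window logic (in real Redis: INCR the key, set EXPIRE on the first increment of the window, deny when the count exceeds the limit); the class and its API are invented for illustration.

```python
import time

class FixedWindowLimiter:
    def __init__(self, limit, window_s):
        self.limit, self.window_s = limit, window_s
        self.counts = {}  # key -> (window_start, count), stands in for Redis

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        start, count = self.counts.get(key, (now, 0))
        if now - start >= self.window_s:   # window elapsed: like EXPIRE firing
            start, count = now, 0
        count += 1                         # like INCR
        self.counts[key] = (start, count)
        return count <= self.limit

limiter = FixedWindowLimiter(limit=3, window_s=60)
results = [limiter.allow("user:1", now=100.0) for _ in range(5)]
print(results)  # [True, True, True, False, False]
```

One known weakness of fixed windows: a client can burst 2× the limit across a window boundary, which is why sliding-window or token-bucket variants exist.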
Eviction Policies:
- noeviction: return errors when memory full
- allkeys-lru: evict least recently used keys
- volatile-lru: evict LRU among keys with TTL
- allkeys-random: evict random keys
Persistence:
- RDB: periodic snapshots (point-in-time backup)
- AOF: append every write (more durable, larger files)
- Both: RDB for backup, AOF for durability
Your experience: "I use Redis extensively — for caching API responses with
TTL, as a message broker for background tasks, for rate limiting, and as a
session store. My open-source project RediServe provides HTTP API access
to Redis operations."
NoSQL Concepts
When to use NoSQL over SQL:
- Need horizontal scalability for simple lookups
- Schema is truly flexible/evolving
- Don't need complex joins or transactions
- Very high write throughput needed
Types:
Key-Value (Redis, DynamoDB):
Use for: caching, sessions, simple lookups
Access: by key only
Document (MongoDB, CouchDB):
Use for: flexible schema, nested data, content management
Access: by key or queries on document fields
Column-Family (Cassandra, HBase):
Use for: time-series, write-heavy, wide rows
Access: by partition key + clustering columns
Graph (Neo4j, Amazon Neptune):
Use for: social networks, recommendations, knowledge graphs
Access: graph traversal queries
Interview answer for "SQL vs NoSQL":
"It depends on the use case. I default to PostgreSQL because ACID
guarantees, complex queries, and strong ecosystem matter for most
applications. I use Redis for caching and real-time features.
I'd consider a document store if the schema is truly dynamic,
or a column-family store for append-heavy time-series at massive scale."
Common Interview Questions
Q: How would you design a database for a social media app?
A: Tables: users, posts, follows, likes, comments
Denormalize: follower_count, like_count on posts (with triggers or app logic)
Index: follows(follower_id), follows(following_id), posts(user_id, created_at)
Partitioning: posts by date range
Read replicas for timeline queries
Q: Your query is slow. How do you debug it?
A: 1. EXPLAIN ANALYZE to see execution plan
2. Check for Seq Scans on large tables → add indexes
3. Check row estimates vs actual (stale stats → ANALYZE)
4. Check for N+1 queries in application code
5. Check connection count (PgBouncer)
6. Check table bloat (pg_stat_user_tables → dead tuples)
7. Check for lock contention (pg_stat_activity)
Q: What's the difference between clustered and non-clustered index?
A: PostgreSQL doesn't have clustered indexes in the SQL Server sense.
But CLUSTER command physically reorders table data by an index (one-time).
All PostgreSQL indexes are "non-clustered" — they point to heap tuples.
The closest thing: BRIN index on naturally ordered data.
Q: How do you handle database migrations safely?
A: 1. Always forward-only migrations (no destructive changes in prod)
2. Add new columns as nullable → backfill → add NOT NULL constraint
3. Never rename/drop columns directly — add new, migrate, then drop old
4. Use CREATE INDEX CONCURRENTLY (doesn't lock the table)
5. Test migrations on a copy of production data
6. Have a rollback plan
Resources
- PostgreSQL Official Docs: https://www.postgresql.org/docs/current/
- Use The Index, Luke: https://use-the-index-luke.com/ — SQL indexing tutorial
- PostgreSQL Wiki — Performance: https://wiki.postgresql.org/wiki/Performance_Optimization
- pganalyze blog: https://pganalyze.com/blog — PostgreSQL performance articles
- PostGIS docs: https://postgis.net/documentation/
- Redis University: https://university.redis.io/ — free courses
- SQL practice: https://sqlbolt.com/, https://pgexercises.com/
My Notes
Queries I've optimized (for interview stories):
-
PostgreSQL features I've used:
-
Things I need to review:
-