Context
I needed to migrate some data from MongoDB to PostgreSQL for a thing I was building. Even though I’d used Postgres at work, we never really got to run complex queries (in code) or work on features that tested the limits of my knowledge. To cut a long story short, I decided to take the time to better understand PostgreSQL and its key features.
After some digging through subreddits and Hacker News, I decided on studying 3 books:
- The Art of PostgreSQL by Dimitri Fontaine ✅
- SQL Performance Explained by Markus Winand
- A Curious Moon by Rob Conery
key concepts and some best practices
JOIN Syntax and Best Practices
- Use `USING(column)` when joining tables with matching column names (more concise):

```sql
SELECT * FROM orders
JOIN customers USING(customer_id); -- when both tables have a customer_id column
```

- Use `ON table_a.column = table_b.column` when:
  - joining tables with different column names
  - you need more complex join conditions

```sql
SELECT * FROM orders o
JOIN customers c ON o.cust_id = c.customer_id; -- different column names
```
Query Testing and Validation
- `RegreSQL` can be used to create regression tests for your queries
- Regression tests help ensure query behavior remains consistent across database changes
Indexing Strategy and Trade-offs
- Different types of indexes serve different query patterns
- Indexes have a cost for DML (Data Manipulation Language) operations:
  - `INSERT` operations need to update indexes
  - `UPDATE` operations may need to modify multiple index entries
  - `DELETE` operations need to remove index entries
- Some indexes are mandatory for data consistency: `UNIQUE`, `PRIMARY KEY`, and `EXCLUDE USING` constraints all require an index (see the sketch below)
- Benefits of proper indexing:
  - Faster data access
  - Improved query performance
  - Better data retrieval patterns
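To make the constraint-index link concrete, here is a minimal exclusion-constraint sketch using a hypothetical bookings table; the `EXCLUDE USING` clause builds its own GiST index to enforce "no overlapping reservations":

```sql
-- btree_gist is needed for the = operator inside a GiST index
CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE bookings (
    room_id integer,
    during  tsrange,
    -- no two rows may share a room_id with overlapping time ranges
    EXCLUDE USING gist (room_id WITH =, during WITH &&)
);
```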
Query Optimization and Analysis
- Single Query vs Multiple Queries
- Complex business logic can often be handled more efficiently with a single PostgreSQL query rather than multiple queries with application-level processing
- Network latency typically forms the bulk of request completion time, making single queries more efficient
- Databases are optimized to handle complex queries efficiently, leveraging built-in query planning and optimization
- Use `pg_stat_statements` to understand query usage statistics. This extension helps with query optimization by showing which indexes to add and how to structure them
- Avoid `SELECT *` as it:
  - Retrieves more columns than needed
  - Uses unnecessary network bandwidth
  - Can break application code during deserialization if column names change
- `LATERAL JOIN`: functions similarly to a foreach loop (see the sketch after the DISTINCT ON example below)
- `DISTINCT ON`: equivalent to using GROUP BY but picks the first result of each group. Important to use ORDER BY when needed for your use case
For example, these two queries return the same results (first name and last name of drivers who have ever won an F1 race):
```sql
select distinct on(driverid) forename, surname from results
join drivers using(driverid)
where position = 1;

select min(forename), min(surname) from results
join drivers using(driverid)
where position = 1
group by driverid;
```
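And here is the promised `LATERAL` sketch, against a hypothetical customers/orders schema: the subquery runs once per customer row, like a foreach loop fetching each customer's three most recent orders:

```sql
SELECT c.name, recent.order_id, recent.order_date
FROM customers c
CROSS JOIN LATERAL (
    -- executed for each row of customers, with access to c
    SELECT order_id, order_date
    FROM orders o
    WHERE o.customer_id = c.customer_id
    ORDER BY order_date DESC
    LIMIT 3
) AS recent;
```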
Pagination and Performance
- Avoid using `OFFSET` as it scans through and discards all skipped rows before returning results
- Standard SQL offers `FETCH FIRST` as an alternative to the non-standard `LIMIT`
- Better pagination can be achieved using index lookups (`row()` comparisons) on properly indexed columns (see the sketch below)
- Use `EXPLAIN ANALYZE` to understand query performance (helpful visualization tool: https://pev2.pages.dev/)
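A minimal keyset-pagination sketch, assuming a hypothetical orders table with a composite index on (order_date, order_id):

```sql
-- The row() comparison can use the composite index, so the scan
-- starts right after the last row of the previous page instead of
-- re-reading and discarding everything before it.
SELECT order_id, order_date, amount
FROM orders
WHERE row(order_date, order_id) > row(date '2024-01-13', 42) -- last row of the previous page
ORDER BY order_date, order_id
LIMIT 20;
```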
Grouping Operations
- `HAVING`: acts like WHERE but applies to group results instead of individual rows
- `GROUPING SETS`: concise way to merge results of multiple groupings in one query
- `ROLLUP`: generates grouping sets from most granular to least granular. For example:

```
GROUP BY ROLLUP(A, B, C) = GROUP BY GROUPING SETS((A, B, C), (A, B), (A), ())
```

- `CUBE`: similar to `ROLLUP` but creates all combinations of the provided columns:

```
GROUP BY CUBE(A, B, C) = GROUP BY GROUPING SETS((A, B, C), (A, B), (A, C), (B, C), (A), (B), (C), ())
```
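As a quick sketch against the hypothetical orders table from earlier, `ROLLUP` yields per-group subtotals plus a grand total in a single query:

```sql
-- One row per (customer_id, year), plus a subtotal row per customer
-- (year is NULL) and a final grand-total row (both NULL)
SELECT customer_id,
       extract(YEAR FROM order_date) AS year,
       sum(amount) AS total
FROM orders
GROUP BY ROLLUP(customer_id, extract(YEAR FROM order_date));
```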
Set Operations
- `UNION ALL`: concatenates result sets (must have matching column types and count)
- `UNION`: same as `UNION ALL` but removes duplicates (slower)
- `EXCEPT`: excludes results of the second query from the first query
- `INTERSECT`: returns results present in both queries
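For instance, a small `EXCEPT` sketch with hypothetical tables:

```sql
-- Customers who placed an order but never filed a support ticket
SELECT customer_id FROM orders
EXCEPT
SELECT customer_id FROM support_tickets;
```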
Window Functions and Rankings
Structure for window functions:
```
[function you want to apply] OVER (
    [PARTITION BY column_list] -- optional
    [ORDER BY column_list]     -- optional
    [frame_clause]             -- optional
)
```
```sql
-- e.g. calculating running totals within groups
SELECT
    customer_id,
    order_date,
    amount,
    SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) as running_total
FROM orders;
```
- Popular aggregate functions (e.g. `sum`, `avg`, `count`) can be used over a window frame definition
- There are also general-purpose built-in window functions (e.g. `rank`, `lag`, `lead`)
Use window functions whenever you want to compute values for each row of the result set and those computations depend on other rows within the same result set.
- `RANK()`: assigns rankings within partitioned data

```sql
SELECT product_name,
       category,
       price,
       RANK() OVER (PARTITION BY category ORDER BY price DESC) as price_rank
FROM products;
```
NULL Handling
- Use `IS DISTINCT FROM` or `IS NOT DISTINCT FROM` for null comparisons instead of `=` or `<>`
- Remember three-valued logic with nulls:
  - `null = true` evaluates to `null`
  - `x IS NULL` is not the same as `x = null`
- Three-valued logic truth table for `NULL` comparisons:

| left  | right | `=`   | `<>`  | `IS DISTINCT FROM` | `IS NOT DISTINCT FROM` |
|-------|-------|-------|-------|--------------------|------------------------|
| true  | true  | true  | false | false              | true                   |
| true  | false | false | true  | true               | false                  |
| true  | null  | null  | null  | true               | false                  |
| false | true  | false | true  | true               | false                  |
| false | false | true  | false | false              | true                   |
| false | null  | null  | null  | true               | false                  |
| null  | true  | null  | null  | true               | false                  |
| null  | false | null  | null  | true               | false                  |
| null  | null  | null  | null  | false              | true                   |
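A quick sketch of why this matters, assuming a hypothetical users table with a nullable boolean archived column:

```sql
-- '<> true' silently drops rows where archived is null,
-- because null <> true evaluates to null, not true
SELECT count(*) FROM users WHERE archived <> true;
SELECT count(*) FROM users WHERE archived IS DISTINCT FROM true; -- also counts null rows
```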
Useful Built-in Functions
- `generate_series()`: creates a series of values

```sql
-- Generate dates for each day in January
SELECT generate_series(
    '2024-01-01'::date,
    '2024-01-31'::date,
    '1 day'::interval
);
```
- Date/Time functions:
  - `extract()`: extracts specific parts of a date/time value
  - `to_char()`: formats dates and numbers as strings
  - `date_trunc()`: truncates date/time values to a specified precision

```sql
SELECT
    extract(YEAR FROM timestamp '2024-01-13'),
    to_char(timestamp '2024-01-13', 'Month DD, YYYY'),
    date_trunc('month', timestamp '2024-01-13');
```
- `CASE`: adds conditional logic in queries

```sql
SELECT
    product_name,
    price,
    CASE
        WHEN price < 100 THEN 'Budget'
        WHEN price < 500 THEN 'Mid-range'
        ELSE 'Premium'
    END as price_category
FROM products;
```
data types
Arrays
- PostgreSQL supports arrays of any built-in or user-defined type
- Useful for denormalized data structures where appropriate
- Key array operations:
  - `unnest(column)`: expands an array into a set of rows

```sql
-- Example: converting an array of tags into separate rows
SELECT id, unnest(tags) as tag
FROM posts
WHERE tags IS NOT NULL;
```
- Particularly useful for:
- Querying individual array elements
- Joining array elements with other tables
- Aggregate operations on array elements
- Can help avoid junction tables in some cases, though normalization is still preferred for most relationships
Boolean
- Use `IS` to test against literal true, false, and null rather than `=`
- Can be aggregated with `bool_and` and `bool_or`
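A minimal aggregation sketch, assuming a hypothetical tasks table with a boolean done column:

```sql
-- bool_and is true only if every row in the group is true;
-- bool_or is true if at least one row is
SELECT project_id,
       bool_and(done) AS all_done,
       bool_or(done)  AS any_done
FROM tasks
GROUP BY project_id;
```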
Character and Text
- `text` and `varchar` are identical in PostgreSQL
- `character varying` is an alias for `varchar`
- `varchar(X)` is a text column with a check constraint of X characters
- Useful functions:
  - `regexp_split_to_table()` and `regexp_split_to_array()`
  - `regexp_matches()`: extracts pattern matches from strings

```sql
-- Example: extract all email addresses from a text column
SELECT regexp_matches(description, '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}', 'g')
FROM documents;
```
Numeric Types
- Available types:
  - `integer`: 32-bit signed numbers
  - `bigint`: 64-bit signed numbers
  - `smallint`: 16-bit signed numbers
  - `numeric`: arbitrary-precision numbers
  - `real`: 32-bit floating point (6 decimal digits of precision)
  - `double precision`: 64-bit floating point (15 decimal digits of precision)
- Given how floating-point numbers are stored in computers, never use `real` or `double precision` for monetary values
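A one-line sketch of the rounding problem (output shown as of PostgreSQL 12's shortest-exact float formatting):

```sql
SELECT 0.1::double precision + 0.2::double precision; -- 0.30000000000000004
SELECT 0.1::numeric + 0.2::numeric;                   -- 0.3, exact
```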
Auto-incrementing IDs and UUIDs
- Prefer `bigserial` over `serial` for new applications to avoid overflow issues
- `serial` is a pseudo-type backed by an `integer`
- When using `serial`, PostgreSQL automatically creates a backing sequence and picks values from it:

```sql
CREATE TABLE tablename (colname SERIAL);

-- This is equivalent to:
CREATE SEQUENCE tablename_colname_seq; -- backing sequence is created
CREATE TABLE tablename (
    colname integer NOT NULL DEFAULT nextval('tablename_colname_seq')
); -- column values are picked from the sequence
ALTER SEQUENCE tablename_colname_seq OWNED BY tablename.colname;
```
- Be cautious: the backing sequence uses `bigint`, but `serial` uses `integer`, which can lead to overflow issues
UUIDs
- Use the UUID data type instead of text for storing UUIDs
- More memory efficient than storing UUIDs as text
- Provides better type safety and validation
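A minimal sketch; `gen_random_uuid()` is built into PostgreSQL 13+ (older versions need the pgcrypto extension):

```sql
CREATE TABLE sessions (
    -- stored as 16 bytes, versus 36+ bytes for the text representation
    id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    created_at timestamptz NOT NULL DEFAULT now()
);
```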
Binary Data Storage (bytea)
- PostgreSQL’s bytea type can store binary data (like images)
- Not recommended for:
- Large files
- High volume of binary data
- Limitations:
- No native chunking API
- Can impact database performance
- Use case: When you specifically need transactional properties for your binary data
Date/Time
- Always use timestamps with time zones
- PostgreSQL uses `bigint` internally for timestamp storage
- ISO format is preferred for timestamp input:

```sql
select timestamptz '2017-01-08 04:05:06',
       timestamptz '2017-01-08 04:05:06+02';
```
- Interval arithmetic is straightforward: `interval '[numeric value] [time period]'`, e.g. `interval '1 month'`

```sql
select d::date as month,
       (d + interval '1 month' - interval '1 day')::date as month_end,
       (d + interval '1 month')::date as next_month,
       (d + interval '1 month')::date - d::date as days
from generate_series(
       date '2017-01-01',
       date '2017-12-01',
       interval '1 month')
     as t(d);
-- the query above shows how many days are in each month of 2017
```
JSON and JSONB
- JSON is stored as text with format validation
- JSONB offers binary storage with:
- Full indexing capabilities
- Advanced searching and processing
- Single value per key restriction
- Best practice: Use traditional columns for static data and JSONB for flexible, occasional-use data
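A minimal sketch of that pattern, with hypothetical table and column names:

```sql
-- Static columns for core data, JSONB for flexible attributes
CREATE TABLE products (
    id    bigserial PRIMARY KEY,
    name  text NOT NULL,
    attrs jsonb
);

-- A GIN index makes containment searches on the JSONB column fast
CREATE INDEX products_attrs_idx ON products USING gin (attrs);

-- Find products whose attrs contain {"color": "red"}
SELECT name FROM products WHERE attrs @> '{"color": "red"}';
```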
Enums
- PostgreSQL supports enum types
- Best practice: Use a reference table instead of enum types
- Creates a table with ID and values
- More flexible for future modifications
- Can be referenced from other tables
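A minimal reference-table sketch (hypothetical names):

```sql
-- Reference table instead of CREATE TYPE ... AS ENUM
CREATE TABLE order_status (
    id   serial PRIMARY KEY,
    name text NOT NULL UNIQUE
);
INSERT INTO order_status (name) VALUES ('pending'), ('shipped'), ('delivered');

-- Adding a new status later is just an INSERT, not an ALTER TYPE
CREATE TABLE orders (
    id        bigserial PRIMARY KEY,
    status_id integer NOT NULL REFERENCES order_status(id)
);
```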
business logic in db
Check Constraints
- Add business rules directly in the database schema
- Examples:
```sql
-- Simple check constraint
CREATE TABLE products (
    product_no integer,
    name text,
    price numeric CHECK (price > 0)
);

-- Multiple check constraints
CREATE TABLE products (
    product_no integer,
    name text,
    price numeric CHECK (price > 0),
    discounted_price numeric,
    CHECK (discounted_price > 0 AND price > discounted_price)
);
```
Using Domains
- Create reusable data types with built-in constraints
- Useful for consistent validation across tables
```sql
-- Create a domain for positive integers
CREATE DOMAIN PositiveInteger AS INTEGER
    CHECK (VALUE > 0);

-- Use the domain in a table
CREATE TABLE People (
    id SERIAL PRIMARY KEY,
    name TEXT,
    age PositiveInteger -- using the domain instead of a plain integer
);
```
Custom Types and Constraints
```sql
-- Define a composite type
CREATE TYPE address AS (
    street TEXT,
    city TEXT,
    zip_code VARCHAR(10)
);

-- Use the composite type in a table
CREATE TABLE customer (
    id SERIAL PRIMARY KEY,
    name TEXT,
    billing_address address
);

INSERT INTO customer (name, billing_address)
VALUES ('John Doe', ROW('123 Main St', 'Anytown', '12345'));
```
antipatterns to avoid
Entity-Attribute-Value (EAV)
Problems include:
- No data type constraints (all values stored as text)
- Susceptible to typos in entity and parameter fields
- Difficult to query and interpret results

Solution: use proper modeling, with JSONB for extension points.
Multiple Values per Column
Instead of storing multiple values in a text field (e.g., hashtags), use PostgreSQL arrays for better searching and sorting capabilities.
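For example (hypothetical posts table with a text[] tags column), an array column supports indexed containment searches directly:

```sql
-- A GIN index on the array makes containment queries fast
CREATE INDEX posts_tags_idx ON posts USING gin (tags);

-- Find posts tagged 'postgres'
SELECT id, title FROM posts WHERE tags @> ARRAY['postgres'];
```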
other features and considerations
Partitioning
Trade-offs include:
- Need for per-partition indexes
- Foreign key relationship limitations
- Benefits from partition pruning often outweigh limitations
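A minimal range-partitioning sketch (hypothetical events table):

```sql
CREATE TABLE events (
    id         bigserial,
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

-- One partition per month; queries filtered on created_at
-- only scan the relevant partitions (partition pruning)
CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
```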
Concurrency and Transactions
- Default isolation level is “read committed”
- Consider concurrency behavior in data modeling
- Avoid designs that create competition for single shared resources
- Batch updates need careful handling to prevent lost updates
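One hedged sketch of avoiding lost updates in a batch job: lock the rows you are about to modify (hypothetical accounts table):

```sql
BEGIN;
-- Concurrent writers block on the row lock instead of overwriting
SELECT balance FROM accounts WHERE id = 42 FOR UPDATE;
UPDATE accounts SET balance = balance + 100 WHERE id = 42;
COMMIT;
```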
Views and Materialized Views
- Views hide query complexity
- Materialized Views cache results but need periodic refreshing
- Consider setting up CRON jobs for materialized view refreshes
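A minimal materialized-view sketch (hypothetical names):

```sql
CREATE MATERIALIZED VIEW daily_sales AS
SELECT order_date, sum(amount) AS total
FROM orders
GROUP BY order_date;

-- Re-run the underlying query periodically, e.g. from a cron job
-- (REFRESH ... CONCURRENTLY also exists but needs a unique index)
REFRESH MATERIALIZED VIEW daily_sales;
```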
Event Handling
- Triggers run within transactions but can create bottlenecks
- `LISTEN`/`NOTIFY` provides event-driven capabilities but:
  - Requires active connections
  - Not suitable for event accumulation
  - Works well for notification-driven query refreshes
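A minimal sketch of the mechanism (the channel name and payload are made up):

```sql
-- Session A: subscribe to a channel
LISTEN orders_changed;

-- Session B: fire a notification, e.g. from a trigger after a write
NOTIFY orders_changed, 'order 42 updated';
-- equivalently: SELECT pg_notify('orders_changed', 'order 42 updated');
```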
final thoughts
This book transformed my view of databases. I now see them as full-fledged services that are primarily optimized for data storage and retrieval but capable of much more. I’ll definitely be spending more time deepening my knowledge of databases.
things I didn’t like about this book
- It contained numerous typos
- Required frequent context switching across different datasets (Twitter, F1, geonames, pubnames, etc.)
- Some key dataset links were broken