TIL: Logging and analysing slow PostgreSQL queries
SQL is a wonderful, declarative abstraction over the details of fetching information from a database. It allows us to express what we want, letting a query engine like that found in PostgreSQL deal with the how of retrieving the data from the underlying store.
Web APIs are often just relatively simple wrappers around some SQL queries, which means that if API response times are measurably slow then an SQL query is often the culprit. But how do you identify which queries are causing the slow-down in a large application? And how do you figure out how to speed those queries up once you find them?
In this post we’ll cover PostgreSQL’s log_min_duration_statement configuration for automatic logging of slow queries, and we’ll also touch on PostgreSQL’s EXPLAIN statement for understanding what the query engine is doing when trying to optimise queries.
This is part of a series of ‘TIL’, or ‘today I learned’ posts; shorter articles designed to offer quick tips and to encourage me to write more often. My link in the inspiration chain connects back to my friend Tim Head and his TIL series; thanks Tim!
Logging
PostgreSQL is highly configurable, and while the number of options can be daunting it is worthwhile skimming over what’s available to see what might be useful for you.
I was doing just that when I came across the log_min_duration_statement parameter. When set to a value greater than or equal to zero (the default is -1), PostgreSQL will emit a log entry for any query which takes at least as long as that value. For example, setting log_min_duration_statement = 100ms will result in logs for all queries whose total execution time (parsing, planning, and evaluation) takes 100 milliseconds or more.1
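You can make this change in postgresql.conf and reload PostgreSQL, or, as a sketch of an alternative that assumes you have superuser access, apply the same setting from a SQL session:
ALTER SYSTEM SET log_min_duration_statement = '100ms';
-- Ask the running server to pick up the new setting without a restart
SELECT pg_reload_conf();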
A query taking longer than the log_min_duration_statement value produces a log entry like this:
2022-05-12 15:40:13.566 UTC [186488] postgres_user@app_db LOG: duration: 301.129 ms execute __asyncpg_stmt_162ac__: SELECT resource.id, count(task.id) AS count_1
FROM resource JOIN task ON resource.id = task.resource_id
WHERE task.state = $1 GROUP BY resource.id
2022-05-12 15:40:13.566 UTC [186488] postgres_user@app_db DETAIL: parameters: $1 = 'sent'
This is packed full of useful information!
- The time the query was executed, which you could correlate with your application logs to determine the source.
- The user and database executing the query (postgres_user and app_db above).
- The duration of the query.
- The query itself, typically parameterised (with variables $1, $2, and so on).
- The parameter values, if any, under the separate DETAIL line.
If you have a high query volume and you’re worried about the performance impact of logging every slow query, you can use the similar log_min_duration_sample parameter, combined with log_statement_sample_rate, to log only a sample of the queries exceeding the threshold.
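As a rough sketch (these two parameters are available from PostgreSQL 13 onwards, and the values here are purely illustrative), the configuration might look like this:
# postgresql.conf: log roughly a quarter of the queries taking 100 ms or more
log_min_duration_sample = 100ms
log_statement_sample_rate = 0.25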
Monitoring
After deploying this change and waiting a while to collect the data, you’ll likely have a lot of logs to sift through.
In production you might have access to log aggregation and filtering tools, such as Grafana Loki or the ELK stack. You can combine these with alerting systems to automatically warn you of a high number or high rate of slow queries, so that you don’t have to keep an eye on the log file yourself.
Wrangling the logs by hand isn’t a terrible way to get started though! Assuming a standard Ubuntu installation we can find slow query log lines like this:
# Select all slow-query lines
$ grep 'LOG: duration: ' /var/log/postgresql/postgresql-12-main.log
We could then use our favourite plain-text filtering and sorting tools to see, for example, the counts of each unique parameterised query:
# Select all slow-query lines and filter out everything except the actual query statement
$ grep 'LOG: duration: ' /var/log/postgresql/postgresql-12-main.log \
| sed -E 's/.*(statement|execute [^:]*): //' \
| sort \
| uniq -c
This might return something like this:
2 SELECT job.id, count(task.id) AS count_1
65 SELECT resource.id, count(task.id) AS count_1
Use this information to figure out which queries are worth tackling first, for example the most common ones or those taking the most cumulative time.
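If you want a rough ranking by total time rather than by count, a sketch like the following works on the log format above (it assumes GNU sed and awk, and only considers the first line of each multi-line statement):
# Sum the logged durations per statement and list the worst offenders first
$ grep 'LOG: duration: ' /var/log/postgresql/postgresql-12-main.log \
| sed -E 's/.*duration: ([0-9.]+) ms[^:]*: /\1 /' \
| awk '{d = $1; $1 = ""; total[$0] += d} END {for (q in total) printf "%10.1f ms %s\n", total[q], q}' \
| sort -rn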
Analysis
So you have some slow queries: how do you speed them up?
Ideally we’d apply a variation of Richard Feynman’s problem solving algorithm:
- Find the slow query.
- Think real hard.
- Implement the solution.
Unfortunately, just like theoretical physics, the world of SQL optimisation is deep and full of oft-obtuse tricks of the trade. A combination of reading the documentation and using your own experience is often the best way to start figuring out how to optimise a slow query.
Luckily there is one tool that will almost certainly be valuable to you in understanding why the query is slow, and that boils down to getting a better understanding of how the query engine is executing the SQL: PostgreSQL’s EXPLAIN statement.
EXPLAIN will return the query engine’s plan for how to process the query, including what scan techniques to use (such as sequential or index scans) and what join techniques to use. The plan includes the engine’s estimated cost for each step, which can be compared with the other steps in the plan to see which parts are the most expensive.
My usual analysis pattern is to copy-paste the slow query from the logs into an interactive database tool such as Postico, substitute in the parameter values, and prepend EXPLAIN ANALYZE to the query.
The query we saw in the example log above would then look like this for analysis:
EXPLAIN ANALYZE
SELECT resource.id, count(task.id) AS count_1
FROM resource JOIN task ON resource.id = task.resource_id
WHERE task.state = 'sent' GROUP BY resource.id
This executes the query, so be careful when trying to understand slow queries which perform modifications (such as UPDATE and DELETE)! You can drop the ANALYZE option so that the query isn’t executed at all, leaving only the engine’s estimates.
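If you do want real timings for a modifying query, one common approach is to run it inside a transaction that you then roll back. A sketch (the UPDATE here is purely illustrative):
BEGIN;
EXPLAIN ANALYZE
UPDATE task SET state = 'sent' WHERE resource_id = 42;
-- The changes are undone, but note that locks are held until the rollback
ROLLBACK;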
With the execution plan in front of you, you can see which steps are taking the bulk of the time and what those steps are doing. Query optimisation is a broad topic with many tips and tricks to be found, and with experience you’ll quickly spot the low-hanging fruit (if any! 🤞).
In my work so far the most common red flag is seeing that a sequential table scan is taking the majority of the execution time. When the engine performs a sequential scan it iterates over the rows of the table in question, an operation which scales linearly with the size of the table. The typical solution is to add some form of index to the table being queried, which trades disk space and slightly slower writes for much cheaper lookups.
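For the example query above, a plausible fix would be an index covering the filter and join columns. A sketch (the index name is my own invention, and whether it actually helps depends on your data, so re-run EXPLAIN ANALYZE afterwards):
-- Support the WHERE task.state = ... filter and the join on resource_id
CREATE INDEX task_state_resource_id_idx ON task (state, resource_id);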
As with any optimisation problem, there are other considerations worth weighing before you start tuning what’s in front of you:
- Does your application really need to perform the query? Perhaps it only uses a subset of the information, which may be cheaper to retrieve.
- Are there other ways of constructing the same query? It could be possible to query a table directly rather than using joins, for example.
- Is the execution time truly ‘slow’ for your needs? If the query runs within a task whose main computation takes ten times longer, it might be worth focusing on optimising that computation first.
- Could the PostgreSQL server be tuned more appropriately? It could be that a faster storage medium will improve access times enough, or perhaps the PostgreSQL cache size is too low, leading to more cache misses and slower queries.
All that being said, I find myself revisiting the slow-query logs and using EXPLAIN more and more often to figure out why my API is slower than I’d like. One of PostgreSQL’s many strengths is providing tools like these!
Footnotes
- The location of your PostgreSQL configuration and log files will depend on your method of installation. On Ubuntu, for example, the configuration is at /etc/postgresql/12/main/postgresql.conf and the current log file is at /var/log/postgresql/postgresql-12-main.log. ↩