In my previous post, I promised to write about the different characteristics of database sessions that one might measure by sampling the session state. However, the post had become so long that I had to wrap it up having only discussed a single quantity: the total session time spent on a given query or wait state. It turned out that sampling is extremely effective in that case.

In this post, I address the question “can sampling session activity tell me query latency?”. The answer is surprisingly nuanced.

What is query latency?

Query latency is just the execution time of a query. It’s commonly referred to as latency because executing a query to retrieve or store data is often on the critical path for an application, meaning the time the database takes to perform this task is experienced by the application and the end user as dead time, a lag between action and reaction: latency. It’s common for DBAs and application teams to monitor the 99th percentile latency to make sure the vast majority of queries are executed in good time.

On the face of it, sampling is a terrible way to identify or measure individual events that are shorter than the sampling interval. However, when you take into account some of the additional information we can collect when sampling a Postgres session, it’s actually somewhat viable to use sampling to measure latency. The reality, though, is that most sampling-based tools don’t bother trying to measure latency because it’s complicated to implement, prone to mathematical weirdness, and difficult for users to understand. Most importantly, there are other ways to do it much more precisely for only a small overhead.

Here on my personal blog, I’m not interested in offering practical advice; I’m interested in seeing how far we can push an idea. So let’s start by exploring why sampling is so bad, and then figure out how we can make it work.

Why is sampling so terrible?

It’s not necessarily intuitive that sampling can be extremely effective at one task and terrible at a seemingly adjacent task. Let’s do what all academics do when faced with a tricky problem: have a cup of tea.

A nice cup of resampled tea

Will you have a cup of tea, father?

Imagine you are interested in my tea-drinking habits. You decide to devote a week of your life to undertaking a study. Every hour, you check on me to see whether or not I have a cup of tea. Of your 168 observations, I had a cup of tea in 84, and you conclude that I spend 50% of my life drinking tea. But how many cups did I drink, and how long do I typically spend on each cup? Do some cups take longer than others, or do I drink them all at the same pace?

Tally

Let’s start with the first question. How many cups did I drink? If you recorded your data as shown above, then you are out of luck. I could have drunk a single cup of tea over the course of 84 hours, or a hundred thousand cups in a few seconds each, and your results would look the same. If instead you recorded your data including the time of each sample, as shown below, then you have a fighting chance.

Time series

Every time you observed me with a cup of tea and then without a cup of tea, I must have finished a cup at some point between those observations. So perhaps if you just count up the number of times this happened, you would know how many cups I’d drunk?

Well, kind of. What you actually have is a lower bound. Say you observed three transitions from ’tea’ to ’no tea’ on a particular day: I must have drunk at least three cups of tea to produce those transitions, but I could have drunk far more. If I finished a cup and started another one without you observing me in between, you’d never know. It’s quite conceivable that I drank thirty cups of tea, or even three hundred, but no fewer than three.

Biased estimators

What we have here is a biased estimator. Our algorithm “count the number of tea/no-tea transitions in the sample” gives us an estimate of the number of cups drunk, but it’s a biased estimate. It’s always either correct or lower than the truth, never higher. With an unbiased estimator, the error can be positive or negative and, crucially, averages out to zero. This means that as we collect more data, our error tends to zero. This is not the case for biased estimators. You could watch me drink tea for a hundred years and you’d still be underestimating.

You may be wondering at this point why we are talking about how many cups of tea I drink and not about how long I take to drink them. Wasn’t this analogy supposed to tell us something about query latency? Well, if you know how long I spend drinking tea and how many cups I drink, you can at least calculate the average time per cup. That’s really about the best we can do with the data we have available. And since so far we only have a lower bound on the number of cups, the best we can manage is an upper bound on the time per cup.

If there were some minimum time required to drink a cup of tea and some minimum time I must leave between cups, you could solve the problem by sampling at a period shorter than these minima. But at this point you’re not really sampling in the statistical sense, you’re just recording every event. You no longer have any error in your count because you have counted every cup individually. To achieve this for queries in a database session isn’t at all practical, because queries can be extremely short and the gap between queries can be essentially zero.

So there we have it, counting queries by sampling is impossible and therefore so is measuring latency. Or is it? Perhaps we took a wrong turn somewhere with all that nonsense about tea?

Reality to the rescue!

You may recall from the dim and distant past of Part 1 that this is supposed to be about Postgres. When you sample the session state in Postgres you get a little bit more information than just the current query state. Crucially for us, you also get the start time of the current query. This dramatically boosts our sampling power because we effectively get a ‘free’ sample at the start of any query we have sampled. This means any time we sample any query, we can infer something about how long it has been running. We can also tell definitively whether two consecutive observations of the same query are from the same execution (rather than two separate executions of the same query) because they will share the same start time. These two things unlock a totally different approach to the problem. Rather than trying to count the queries, we can jump directly to estimating the duration of the executions we observe.
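
As a concrete illustration, here is a minimal sketch of such a sampling loop against pg_stat_activity, which exposes query_start alongside the query text. It assumes the psycopg2 driver and a locally accessible database; any client library would do.

import time
import psycopg2  # assumed driver; any Postgres client library would work

# A minimal sampling loop (a sketch, not production code): once a second, record the
# active backends along with query_start, the 'free' sample at the start of each query.
conn = psycopg2.connect("dbname=postgres")
conn.autocommit = True
while True:
    with conn.cursor() as cur:
        cur.execute("""
            SELECT pid, state, query_start, query
            FROM pg_stat_activity
            WHERE state = 'active' AND pid <> pg_backend_pid()
        """)
        for pid, state, query_start, query in cur.fetchall():
            print(pid, query_start, (query or "")[:60])
    time.sleep(1.0)  # 1 second sampling period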

Let’s get down to (maths) business

The timeline below shows the scenario where just one sample falls in a particular query execution. If we’re sampling infrequently compared to the typical duration of a query, this is likely to be the case most of the time.


Figure 1. Timeline showing a query execution which has been sampled a single time.

We know that this particular execution ran for more than \(a\) and less than \(a+b\), but what is our best estimate of the true duration?

If we postulate that our sample is equally likely to fall anywhere within the query, on average it must fall in the centre. This leads us to estimate the true duration to be \(2a\).

For the more general case where one or more samples fall within a particular query execution, we can use the same logic to justify an estimate of \(2a+c\), where \(a\) is the time from the query start to the first sample and \(c\) is the time between the first and last samples.


Figure 2. Timeline showing a query execution which has been sampled multiple times.
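
In code, the estimator itself is tiny. A sketch, assuming we have the reported query_start and the timestamps of the samples that landed in this particular execution:

from datetime import datetime

# A minimal sketch of the estimator: query_start as reported by the session, plus the
# timestamps of the samples that observed this particular execution. a is the gap from
# the query start to the first sample; c is the span covered by the samples themselves.
def estimate_duration_s(query_start: datetime, sample_times: list[datetime]) -> float:
    first, last = min(sample_times), max(sample_times)
    a = (first - query_start).total_seconds()
    c = (last - first).total_seconds()
    return 2 * a + c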

We can test this estimator with a simple ‘brute force’ simulation. Here we take a query of 1000ms duration and evaluate \(2a+c\) for every possible position of the query relative to the samples, for a range of sampling intervals. For example, a sampling period of 1000ms means we will only land one sample in our 1000ms query, so we calculate our estimate assuming it landed 1ms into the query, then 2ms, then 3ms… and so on.
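
A sketch of that brute-force check, assuming 1ms resolution and samples on a fixed grid:

import numpy as np

# Brute-force check of the 2a + c estimator for a single query of fixed duration.
def estimate_errors(duration_ms, period_ms):
    errors = []
    for start in range(period_ms):                  # query start relative to the grid
        end = start + duration_ms
        grid = np.arange(0, end + period_ms, period_ms)
        hits = grid[(grid >= start) & (grid < end)] # samples landing in the execution
        if hits.size == 0:
            continue                                # execution missed entirely
        a = hits[0] - start                         # query start to first sample
        c = hits[-1] - hits[0]                      # first sample to last sample
        errors.append(2 * a + c - duration_ms)
    return np.array(errors)

for period in (200, 500, 1000, 2000):
    errs = estimate_errors(1000, period)
    print(period, errs.mean(), np.abs(errs).mean()) # mean error close to zero; spread grows with period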


Figure 3. Chart of the absolute error in the estimated duration of an observed query as a function of sampling period.

This looks promising. As you would expect, the error is smaller for sampling periods shorter than the duration of the query but, importantly, the average error is zero across the whole range, meaning we have an unbiased estimator.

The elephant in the room (and for once it’s not Postgres)

If you’re reading this and thinking “This \(2a+c\) thing was a bit of a leap, I’m not convinced.”, then hold tight. Hopefully the chart above has shown you that it is an unbiased estimator, but if you’re not sure about the logic, you’re not alone. I haven’t found any rigorous way to justify this estimator theoretically.

So, without further ado, let’s take this estimator and apply it to some simulated sessions. Specifically, we’re going to hypothesise that the true mean latency is estimated by the mean value of \(2a+c\) over the sampled queries. Because these are simulated sessions, we know exactly how long every query execution really took, so we can easily calculate the accuracy of our estimates.
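
For the curious, here is a sketch of the kind of simulation involved; the latency distributions in it are made-up examples rather than the ones behind the tables below.

import random
from collections import defaultdict

# A sketch of a simulated session: back-to-back executions, sampled once per second,
# with the naive mean of the 2a + c estimates taken per query ID.
def simulate_session(latency_ms, duration_s=3600, period_ms=1000, seed=0):
    rng = random.Random(seed)
    estimates = defaultdict(list)                    # query_id -> 2a + c estimates
    t, next_sample = 0.0, 0.0
    while t < duration_s * 1000:
        qid, (mean, sd) = rng.choice(list(latency_ms.items()))
        dur = max(1.0, rng.gauss(mean, sd))          # one execution of this query
        samples = []
        while next_sample < t + dur:                 # samples landing in this execution
            samples.append(next_sample)
            next_sample += period_ms
        if samples:
            a = samples[0] - t
            c = samples[-1] - samples[0]
            estimates[qid].append(2 * a + c)
        t += dur
    return estimates

est = simulate_session({1: (50, 30), 2: (300, 150), 3: (1000, 300)})
for qid, vals in sorted(est.items()):
    print(qid, sum(vals) / len(vals))                # naive (biased) mean per query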

Table 1. Comparing the true mean latency (± standard deviation of the mean) with our estimator over 1000 sessions of 1 hour duration sampled every second.

Query ID | True Mean Latency / ms | Estimated Mean Latency / ms
1        | 56 ± 3                 | 520 ± 70
2        | 300 ± 10               | 520 ± 40
3        | 550 ± 20               | 680 ± 40
4        | 800 ± 20               | 870 ± 40
5        | 1050 ± 20              | 1080 ± 40

Oh, that doesn’t look right. Our estimates aren’t too bad when the mean query duration approaches 1000ms, but below this they are way out.

Where did we go wrong?

There are two sources of error in our estimate: our imperfect estimate of how long the observed queries ran for, and the sampling error that comes from observing only a subset of the queries. We established in Figure 3 that our algorithm for the former is unbiased, so suspicion falls on the latter.

Systematic error in sampling generally occurs when the sample is not representative of the population, and this is exactly what has happened here. The query executions included in our sample are the ones that were running when we sampled the session state. In the case where query executions are typically much shorter than our sampling interval, longer executions are more likely to be observed than shorter ones. This creates a bias towards longer executions and explains why we are overestimating. To convince ourselves of this, we can calculate the true mean of only the queries that were included in our sample, and sure enough it matches our estimates extremely well.

Table 2. The true mean latency per query (± standard deviation of the mean) for the same data as Table 1, but only including sampled query executions.

Query ID | True Mean Latency (sampled queries only) / ms
1        | 520 ± 60
2        | 520 ± 30
3        | 680 ± 30
4        | 870 ± 20
5        | 1080 ± 20

All is not lost. Because we understand the source of our bias, we can attempt to compensate for it. Assuming we are sampling at a fixed period \(T\), the probability of sampling a query execution of duration \(x_{i}\) is given by the formula below.

\[ \require{mathtools} P_{i} = \begin{cases} 1 & \text{if } x_{i} \geq T \\[5pt] \dfrac{{x}_{i}}{T} & \text{if } x_{i} < T \end{cases} \]

Of course we don’t know the true duration, but we do have our \(2a+c\) estimate. We can use that estimate to calculate a weighted average where the weight of each observation is the inverse of the probability of sampling it, assuming its estimated duration is correct. We’ll denote the estimated average latency of query \(q\) as \(\hat{\mu}_q\).

\[ \require{mathtools} \begin{align*} \hat{d}_{iq} &= 2a_{iq} + c_{iq} \\[10pt] \omega_{iq} &= \begin{cases} 1 & \text{if } \hat{d}_{iq} \geq T \\[5pt] \dfrac{T}{\hat{d}_{iq}} & \text{if } \hat{d}_{iq} < T \end{cases} \\[10pt] \hat{\mu}_{q} &= \frac{\sum_{i}\hat{d}_{iq}\cdot\omega_{iq}}{\sum_{i}{\omega_{iq}}} \end{align*} \]
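
A sketch of this weighted estimator in code:

# d_hats holds the 2a + c estimates for the sampled executions of one query;
# T is the sampling period, in the same units as the estimates.
def weighted_mean_latency(d_hats, T):
    # Inverse sampling probability: executions shorter than T are less likely to be
    # sampled, so they get proportionally larger weights. Note that tiny estimates
    # get huge weights, which foreshadows the problem discussed below.
    weights = [1.0 if d >= T else T / max(d, 1e-9) for d in d_hats]
    return sum(d * w for d, w in zip(d_hats, weights)) / sum(weights)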

The world in harmony

If you consider only the \(\hat{d}_{iq} < T\) case, the formulae above can be simplified and you end up with the harmonic mean of \(\hat{d}_{iq}\). This is not an especially surprising result as the harmonic mean is commonly used in situations somewhat analogous to ours, for example estimating the average speed of vehicles over some distance from observations of speed at a single point.
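
To see this, substitute the \(T/\hat{d}_{iq}\) weights into the weighted mean; the factors of \(T\) cancel, leaving the harmonic mean of the \(n\) sampled estimates:

\[ \require{mathtools} \hat{\mu}_{q} = \frac{\sum_{i}\hat{d}_{iq}\cdot\dfrac{T}{\hat{d}_{iq}}}{\sum_{i}\dfrac{T}{\hat{d}_{iq}}} = \frac{nT}{T\sum_{i}\dfrac{1}{\hat{d}_{iq}}} = \frac{n}{\sum_{i}\dfrac{1}{\hat{d}_{iq}}} \]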

Using this new weighted version should sort everything out and remove that horrible bias.

Table 3. Comparing the true mean latency (± standard deviation of the mean) with our new estimator over 1000 sessions.

Query ID | True Mean Latency / ms | Estimated Mean Latency / ms
1        | 56 ± 3                 | 19 ± 4
2        | 300 ± 10               | 80 ± 20
3        | 550 ± 20               | 150 ± 50
4        | 800 ± 20               | 270 ± 100
5        | 1050 ± 20              | 500 ± 200

Or not.

Out of the selection bias frying pan and into the fire

So, we did some lovely maths to correct our bias and ended up almost as biased, but in the opposite direction. Our problem stems from the fact that we used the estimated duration of the query execution to calculate the weight. But earlier we established that our estimate of execution duration is unbiased, so how is it causing a bias now?

The answer is that the bias in our new estimator doesn’t come from some systematic error in the \(2a+c\) estimator, it comes from the random error.

Consider a query that always takes exactly 100ms to execute and that runs frequently during our session. If we sample that session with a sampling period of 1000ms, over time plenty of our samples will fall within an execution of the query. Naturally, our sample could fall anywhere from the first millisecond to the hundredth, so applying our \(2a+c\) estimator (\(c\) being zero here) we’ll get a range of estimated query durations between 0 and 200ms, just like we saw in Figure 1. The problem comes when we calculate the weights. The shorter estimates get a massively higher weight than the longer ones, even though they should all have had the same weight. The nice, tame random error in our first estimator has been transformed into scary systematic error.
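
A quick sketch makes the distortion concrete (estimates are floored at 1ms, mirroring the millisecond resolution of the simulation):

import random

# A query that always takes exactly 100 ms, sampled with a 1000 ms period, so every
# estimate is 2a with a uniform on (0, 100) ms.
random.seed(0)
T = 1000.0
estimates = [max(2 * random.uniform(0, 100), 1.0) for _ in range(100_000)]
weights = [T / d for d in estimates]        # every estimate is below T here
plain = sum(estimates) / len(estimates)     # ~100 ms: the unbiased estimate
weighted = sum(d * w for d, w in zip(estimates, weights)) / sum(weights)
print(plain, weighted)                      # the weighted mean is dragged far below 100 ms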

Fractals, dark matter and Russian dolls

As well as being biased, our latest attempt is flawed in a more philosophical way; it lacks any sense of scale. As \(2a+c\) tends to zero, the weight tends to infinity. We could sample a query execution one nanosecond after its start time and it would be given an extremely high weight. One way to interpret this is that our algorithm is assuming that the most probable explanation for observing a single execution one nanosecond after its start is that our session is in fact populated by billions of two-nanosecond executions that we rarely observe - a kind of database dark matter. In reality, we know that there is some ’natural scale’ to query executions. Just as we hypothesised that there is some minimum time that I take to drink a cup of tea, there is probably a minimum feasible time a single query execution could last. To put it another way, if you keep zooming in on one region of our session (or one cup of tea) eventually you end up looking at just one execution (or one cup of tea) rather than finding it’s comprised of an endless Russian doll of smaller executions (or cups of tea… or dolls).

Zooming in!

This extreme behaviour hasn’t manifested in our test data because my simulation has a resolution of one millisecond. In fact, the more I ponder on it, the more I think this is just a different way of rationalizing why our weighting approach failed.

Shut up and inspect my paradox!

Our whack-a-mole approach of fixing each source of bias as we spot it seems to have led to a dead end, so it’s time to take a step back. Reading around, I learnt that the class of problem we are dealing with is known as an ‘inspection’ problem - because it’s the manner in which we inspect the data that gives rise to the bias. I found a few articles about the ‘inspection paradox’, which is not so much a paradox as a surprising result whereby the information you observe corresponds to the true data in an unintuitive way.

The most commonly cited example of this paradox involves waiting for a bus. Imagine a bus service that runs, on average, once every ten minutes. If you turn up at a bus stop at a random time and wait for a bus to arrive, how long would you expect to wait? I think most people would say five minutes. And those people would be right if the buses arrive exactly every ten minutes, that is to say, if the elapsed time between one bus and the next is always exactly ten minutes. However, I said that the bus ran on average once every ten minutes. It turns out that if the elapsed time between buses is exponentially distributed with a mean of ten minutes (roughly equivalent to saying that in any given minute there is a 10% chance of a bus arriving), then the average waiting time a person will experience is also ten minutes, not five.
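
If you don’t believe it, a quick simulation bears it out (a sketch, with made-up numbers of buses and passengers):

import bisect
import itertools
import random

# Exponentially distributed gaps with a ten minute mean, and passengers arriving at
# uniformly random times.
random.seed(0)
gaps = [random.expovariate(1 / 10) for _ in range(1_000_000)]
buses = list(itertools.accumulate(gaps))            # bus arrival times, in minutes
waits = []
for _ in range(200_000):
    t = random.uniform(0, buses[-1])
    nxt = bisect.bisect_left(buses, t)              # index of the next bus
    waits.append(buses[nxt] - t)
print(sum(waits) / len(waits))                      # ~10 minutes, not 5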

You could argue it’s hardly surprising that our intuition is incorrect given such a perverse bus timetable (or lack thereof). However, it’s easy to draw some parallels between this example and our sampling problem. Statistically, turning up at a bus stop and waiting until the next bus arrives is no different than turning up at the bus stop and being told how long ago the previous bus left; the maths doesn’t care which direction time runs in.

The latter example is quite clearly analogous to our sampling of a query execution and learning when the execution started. So any interesting maths relating to the arrival of buses is directly applicable to query latency. And it turns out there is some interesting maths.

There’s a fairly simple equation that links the expected observed duration of an event (\(d_{observed}\), equivalent to our \(a\)) with the mean and standard deviation of the true event duration (\(\mu_{true}\) and \(\sigma_{true}\)).

\[ \require{mathtools} \mathop{\mathbb{E}}[d_{observed}] = \frac{\mu_{true}^{2}+\sigma_{true}^{2}}{2\mu_{true}} \]
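
As a sanity check on this relationship, here is a sketch that lays log-normally distributed executions back to back and observes them at uniformly random instants:

import bisect
import itertools
import random

random.seed(0)
durations = [random.lognormvariate(0, 0.75) for _ in range(500_000)]
ends = list(itertools.accumulate(durations))
observed = []
for _ in range(200_000):
    t = random.uniform(0, ends[-1])
    i = bisect.bisect_left(ends, t)                 # the execution containing t
    observed.append(t - (ends[i] - durations[i]))   # elapsed time since its start: our a
mu = sum(durations) / len(durations)
var = sum((d - mu) ** 2 for d in durations) / len(durations)
print(sum(observed) / len(observed), (mu ** 2 + var) / (2 * mu))   # the two agree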

This formula hints at why we were finding it so difficult to ‘undo’ the inspection bias. It depends critically on how the true data is distributed, something which is very difficult to infer from our observed data.

We can make a couple of interesting observations from this formula. Firstly, if the standard deviation is zero (i.e. all query executions take exactly the same time), then it reduces to the form below, which seems like further vindication of the \(2a\) part of the estimator.

\[ \require{mathtools} 2\cdot\mathop{\mathbb{E}}[d_{observed}] = \mu_{true} \]

Secondly, if the standard deviation is equal to the mean - as is the case in the exponential distribution - it reduces to this trivial form:

\[ \require{mathtools} \mathop{\mathbb{E}}[d_{observed}] = \mu_{true} \]

I tried running my simulation with exponentially distributed query durations (all much less than the sampling period \(T\)) and can confirm that the mean of \(a\) was a very effective estimator of the true mean.

Table 4. Comparing the true mean latency (± standard deviation of the mean) with the mean of \(a\) over 1000 sessions with exponentially distributed query execution times.

Query ID | True Mean Latency / ms | Mean \(a\) / ms
1        | 20.5 ± 0.1             | 20 ± 1
2        | 40.5 ± 0.4             | 40 ± 2
3        | 60.5 ± 0.8             | 60 ± 3
4        | 81 ± 1                 | 80 ± 4
5        | 100 ± 1                | 100 ± 5

Resolution by distribution

Since the distribution of the data is clearly important, let’s plot the distribution of our \(2a+c\) estimate next to the true distribution of execution duration, \(x\). As you can see below, where the true duration is constant and less than the sampling interval, the distribution of our estimate is just a uniform distribution from zero to twice the true duration. When the true duration has some variance we get a slightly smoothed-out version of this. So far, so predictable.


Figure 4. Distributions for a query with true latency uniformly distributed between 190 and 210ms.


Figure 5. Distributions for a query with true latency log-normally distributed with mean 300ms and SD 100ms.

But when the true duration starts to exceed the sampling interval, we get some really funky shaped charts.


Figure 6. Distributions for a query with true latency log-normally distributed with mean 1000ms and SD 500ms.

These shapes were so odd that I wanted to check they were correct, so I wrote a function to calculate the expected distribution of \(2a+c\) for a given true distribution and sampling interval.
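
One way to approximate such a distribution is a quick Monte Carlo, sketched below under the assumption that query starts land uniformly relative to the sampling grid:

import numpy as np

# Given draws from the true duration distribution and a sampling period, return the
# corresponding 2a + c estimates for the executions that would have been observed.
def estimate_distribution(true_durations_ms, period_ms, rng=np.random.default_rng(0)):
    x = np.asarray(true_durations_ms, dtype=float)
    u = rng.uniform(0, period_ms, size=x.shape)     # query start to first grid point
    sampled = u < x                                 # was the execution observed at all?
    a = u[sampled]
    c = np.floor((x[sampled] - a) / period_ms) * period_ms
    return 2 * a + c                                # histogram this to get the distribution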


Figure 7. Distributions for the same query as Figure 6 with the expected distribution shown in grey.

It turned out that the distributions were correct; in fact, they matched so well that it gave me a new idea. Perhaps if I precalculated the expected distribution of \(2a+c\) for a range of different true distributions, I could just check which one most closely matched my observed data. It was then that it hit me.

It’s AI, isn’t it?

Okay, not exactly AI. But what I found myself implementing was a vector similarity search, which is a central part of AI techniques such as Retrieval Augmented Generation. I created a library of histograms, each represented as a vector, and tried to find the one that most closely matched the measured distribution of estimates.

I experimented with a couple of different ways of implementing this search. Inevitably my first attempt was a brute force scan through my whole library, which was actually surprisingly fast. But for the purposes of the narrative, there was one method I had to try… pgvector.

I encoded my histograms using the halfvec type, which gave me 4000 elements to play with. I didn’t do anything fancy, just used one vector element per 1ms of the x-axis. I created an index and used Euclidean distance as my distance metric.
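
As a sketch, building the query-side vector might look something like this (whether to normalise the histogram is a detail; what matters is that the library histograms and the measured histogram are encoded the same way):

import numpy as np

# Encode a set of 2a + c estimates as a 4000-element histogram, one bin per millisecond,
# ready to pass as the %s parameter in the query below.
def to_halfvec_literal(estimates_ms):
    hist, _ = np.histogram(estimates_ms, bins=4000, range=(0, 4000))
    hist = hist / max(hist.sum(), 1)                           # normalise to sum to 1
    return "[" + ",".join(f"{v:.6f}" for v in hist) + "]"      # pgvector text format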

-- Library of precalculated 2a + c distributions, one row per candidate true
-- distribution (mean, sd, shape), encoded as a 4000-element halfvec.
CREATE TABLE dists_of_estimates (
 id serial primary key,
 mean int,
 sd int,
 dist_type_id int,
 dist halfvec(4000)
);

-- HNSW index using Euclidean (L2) distance on the halfvec column.
CREATE INDEX ON dists_of_estimates
USING hnsw (dist halfvec_l2_ops);

-- Nearest-neighbour lookup: find the library histogram closest to the observed
-- distribution of estimates (%s is a client-side query parameter).
SELECT mean, sd
FROM dists_of_estimates
ORDER BY dist <-> %s::halfvec
LIMIT 1;

And… it worked.

Table 5. The results of our vector search method for the same simulation as Tables 1 and 2.

Query ID | True Mean Latency / ms | Estimated Mean Latency / ms | Mean Percentage Error
1        | 56 ± 3                 | 60 ± 20                     | 10 ± 30
2        | 300 ± 10               | 300 ± 40                    | 0 ± 10
3        | 550 ± 20               | 550 ± 50                    | 0 ± 9
4        | 800 ± 20               | 800 ± 60                    | 0 ± 7
5        | 1050 ± 20              | 1060 ± 60                   | 1 ± 5

There are loads of caveats to this. For one, I know for certain that my query latencies were generated from a distribution that matches one of the ones in my library. I did experiment a little with weirder distributions and the results weren’t terrible, but I’m sure there are some circumstances where they would be. And, of course, this whole approach is based on the premise that we have to use session sampling, whereas in reality there are plenty of ways to get query latency measurements from Postgres that don’t apply this artificial constraint.

But hey, we had fun. At least, I had fun. I think.