Is sampling session activity a good way to understand Postgres workload? Part 1
That depends on what you mean by ‘good’, ‘understand’ and ‘workload’.
This is ostensibly a post about Postgres. However, in my mind at least, this post is more of a meditation on sampling in general, and one that I plan to continue over several posts. My focus is on exploring the mathematics of sampling, the techniques we can use to extract information from sampled data, and how far we can push those techniques. Inevitably at some point I’m going to pass the ‘silliness threshold’ at which point you should really be using a more appropriate tool. But let’s get started and see where we end up.
Introducing Postgres sessions and sampling
If you want to learn Postgres internals there are many great resources out there, this post is not one of them. I’ll explain the Postgres stuff as succinctly as I can and then proceed to geek out about sampling.
What is a session?
When a client connects to a Postgres database, it starts a session. The session lasts until the client disconnects. For this reason ‘session’ and ‘connection’ are often used interchangeably. A session is a somewhat abstract concept, but it’s sufficient to think of it as the computational ‘space’ on the Postgres server reserved solely for handling transactions received on a specific connection. Indeed, in Postgres, each session corresponds to a server process that is spawned when the connection is opened and ceases to exist when the connection is closed.
In a session, only one thing happens at a time. If you could glance into a session, you would find it doing exactly one thing at any given moment. It could be idling waiting for the client to start a transaction, or it could be performing a transaction. In the latter case it could be in a number of different states within that transaction: performing computation, waiting on I/O, waiting on a table lock, etc. You can visualise a session as a continuous timeline something like this:
Figure 1. Representing a session as a timeline
A Postgres database will typically have many sessions open at the same time. The default maximum is 100, but I’ve seen it set far higher. Picture this as multiple timelines running in parallel, each doing its own thing. You can find out exactly what all of them are doing right now simply by running this one weird query that DBAs don’t want you to know.
SELECT * FROM pg_stat_activity;
pg_stat_activity is a view in Postgres that contains one row for each server process spawned by Postgres. It is what the PostgreSQL docs call a “dynamic statistics view”, which gives information on “exactly what is going on in the system right now”.
What is sampling?
The key words in the previous sentence are ‘right now’. pg_stat_activity is just a snapshot, a frozen moment.
If you want to assemble a timeline like that shown in our diagram, you need to run this query repeatedly and build up a picture by piecing together the results.
This process of repeatedly glancing at an ongoing process is called sampling.
There’s a whole field of statistics dedicated to sampling and what you can learn from sample data.
Unsurprisingly enough, this is generally called sampling theory.
Some practical points
There are tools out there that continually sample the state of every session and save it for later analysis.
Some do it by repeatedly querying pg_stat_activity, others hook into the internals of Postgres to collect the data.
In either case, the general shape of the information is the same.
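To make that concrete, here is roughly what the pg_stat_activity flavour looks like: a minimal sketch in Python, assuming the psycopg2 driver and a placeholder connection string (real tools take far more care over overhead, error handling and storage).

```python
import time
import psycopg2

SAMPLE_PERIOD_S = 1.0   # sample once per second (1 Hz)
NUM_SAMPLES = 60        # stop after a minute, for the sake of the example

# Placeholder connection string: point this at your own instance.
conn = psycopg2.connect("dbname=mydb user=me host=localhost")
conn.autocommit = True

samples = []
with conn.cursor() as cur:
    for _ in range(NUM_SAMPLES):
        cur.execute("""
            SELECT now(), pid, state, wait_event_type, wait_event, query
            FROM pg_stat_activity
            WHERE backend_type = 'client backend'
              AND pid <> pg_backend_pid()   -- ignore the sampler itself
        """)
        samples.extend(cur.fetchall())      # one row per session, per sample
        time.sleep(SAMPLE_PERIOD_S)
```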
For the purposes of this post, we’re going to be dealing with a single session. This simplifies the language, diagrams and formulae without taking away any of the essential parts of the problem.
Resampling and Occam’s Razor
An intuitive approach to working with sampled data is to take the samples and attempt to reconstruct the underlying ’truth’ from which they were taken. This is often referred to as ‘resampling’. So, let’s give resampling a try. The simplest rule is to assume that whatever was happening when a sample was taken was happening for the whole of the sample period. This is an example of Occam’s Razor: we assume the simplest possible ’truth’ that explains the data we’ve observed.
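In code, the rule is almost trivial. Here is a minimal sketch, assuming each sample has been recorded as a (timestamp, state) pair and samples are taken at regular intervals; all the names here are mine, not from any real tool.

```python
def resample_timeline(samples, sample_period):
    """Reconstruct a session timeline from samples using the Occam's
    Razor rule: whatever state was observed at a sample is assumed to
    have lasted for the whole sample period.

    `samples` is a list of (timestamp, state) pairs, in time order.
    Returns a list of (start, end, state) segments.
    """
    segments = []
    for t, state in samples:
        if segments and segments[-1][2] == state and segments[-1][1] >= t:
            # Same state as the previous, adjacent sample: extend it.
            start, _, _ = segments[-1]
            segments[-1] = (start, t + sample_period, state)
        else:
            segments.append((t, t + sample_period, state))
    return segments

# e.g. samples taken every 1000 ms
print(resample_timeline([(0, "idle"), (1000, "q1"), (2000, "q1"), (3000, "idle")], 1000))
# [(0, 1000, 'idle'), (1000, 3000, 'q1'), (3000, 4000, 'idle')]
```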
The diagram below shows a ’true’ session timeline and one constructed by resampling. The tick marks between the two timelines show the moments when a sample was taken. These are the only moments at which we know the state of the true session, and it is this knowledge that we use to create the resampled one. Note that for simplicity I’ve not shown any internal state for these two queries.
Figure 2. Resampling a session
We can see that, at least in this tiny example, our approach seems to work pretty well. We’ve fractionally overestimated the duration of the first query and underestimated the duration of the second. It seems plausible that these errors will average out in the long run, but more on that later. First, I want to try a different example. This time our sampling period is rather longer compared to the duration of our queries.
Figure 3. Resampling with a longer sampling period.
Now the resampled session doesn’t look very much like the true one at all. But does that mean the information we’ve collected is worthless? We can’t answer that question until we’ve properly defined what it is we are trying to learn from the sampled data. Let’s do that now.
Estimating total query time
I’m going to cheat slightly here and pick a problem that I already know gives an interesting and useful result. Most of the tools I avoided mentioning above offer to show you something like ‘your top ten queries by total execution time’ or some variation. So let’s assume that’s a tractable problem and start there.
It’s important at this stage to clearly define what we mean by a query: here, a ‘query’ is a distinct statement that may be executed many times during the session, and the quantity we want to estimate is the total time spent across all of those executions.
For this problem, we’re going to stick with our ‘Occam’s Razor’ assumption above that whatever we observe when we sample happened for the whole sample period, but we’re going to forget about trying to construct a timeline and instead just add up the total amount of time for each query.
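As a hedged sketch of what “add up the total amount of time” means in practice (the names are mine, not from any particular tool): count how many samples observed each query, and multiply by the sample period.

```python
from collections import Counter

def estimate_total_runtime(sampled_queries, sample_period_ms):
    """Estimate total runtime per query from sampled session activity.

    `sampled_queries` holds the query observed at each sample, or None
    when the session was idle. Each observation is credited with one
    full sample period, so the estimate is simply count * period.
    """
    counts = Counter(q for q in sampled_queries if q is not None)
    return {query: n * sample_period_ms for query, n in counts.items()}

# e.g. six samples taken 1000 ms apart
print(estimate_total_runtime(["q1", None, "q2", "q2", None, "q1"], 1000))
# {'q1': 2000, 'q2': 2000}
```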
To implement and test our ideas, I’m using a simulator that will be the subject of a future post. Suffice to say it generates random sessions based on some specified characteristics, then samples them. In all my examples I’m simulating a single session of one hour duration. Let’s start with a scenario whereby the queries are very long compared to the sampling period, like our first image above. You can see from the results in the table that we get quite an impressive level of precision—most of our results are within 1% of the true value.
Table 1. Results of sampling a session with five queries, all of which run on average once every 50,000 ms lasting between 1500 and 3500 ms. The sampling period was 1000ms
| Query ID | Runtime True (ms) | Runtime Resampled (ms) | Relative Error (%) |
|---|---|---|---|
| 1 | 165129 | 166000 | 0.53 |
| 2 | 181758 | 181000 | -0.42 |
| 3 | 177868 | 178000 | 0.07 |
| 4 | 179565 | 175000 | -2.54 |
| 5 | 146395 | 146000 | -0.27 |
Now let’s try a much more extreme version of our second image. I’ve reduced the length of the queries by a factor of one hundred, so they are far shorter than the sampling period. To keep the overall runtimes approximately the same, I’ve increased the frequency at which the queries run by the same factor. This time we get a larger error, but not that much larger: it’s still under 10% in all cases.
Table 2. Results of sampling a session with five queries, all of which run on average once every 500 ms lasting between 15 and 35 ms
| Query ID | Runtime True (ms) | Runtime Resampled (ms) | Relative Error (%) |
|---|---|---|---|
| 1 | 172169 | 187000 | 8.61 |
| 2 | 169184 | 158000 | -6.61 |
| 3 | 173251 | 188000 | 8.51 |
| 4 | 175873 | 159000 | -9.59 |
| 5 | 174545 | 185000 | 5.99 |
You may find this result surprising. What’s interesting is that you may find it surprising for one of two opposite reasons.
If you’re unfamiliar with sampling theory, you may well be surprised that this approach can give an accurate measurement of the total runtime of a query when the sample period is so much longer than the duration of the individual query executions. How can we possibly get the total near-correct when most executions go completely unnoticed (that is to say, we do not sample the session activity even once during the execution)?
If you’re somewhat more familiar with the unreasonable effectiveness of sampling, you may be surprised that there is any difference at all between the two.
To put it another way, you may be asking “Why are the results so similar?” or “Why are the results so different?” and both are totally valid questions that I will attempt to answer now.
Why are the results so similar?
Here are a couple of ways to convince yourself that sampling should still be effective, even when the thing we are sampling is much shorter-lived than our sampling period.
Consider that when we do sample a query execution we are assuming that it took the full sample period. In our second example this means we could be observing an execution that really took 25ms and assuming it took a full second — a huge over-estimate. To be precise, we’ve overestimated by a factor of 1000/25, also known as 40. However, because each execution only lasts 25ms and we are sampling every 1000ms, the probability that we actually observe any given execution is 25/1000, or one in forty. So, in the long run, these two factors cancel each other out exactly. More on this ’long run’ shortly.
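To make the cancellation explicit, here is the same argument as a one-line expected value, writing d for the duration of a single execution and Δ for the sampling period (with d ≤ Δ, so an execution can be caught by at most one sample):

$$
E[\text{estimate per execution}] = \underbrace{\frac{d}{\Delta}}_{\text{chance of being sampled}} \times \underbrace{\Delta}_{\text{credit if sampled}} = d
$$

In our example that works out as 25/1000 × 1000 ms = 25 ms per execution, so summing over all executions the expected estimate equals the true total runtime.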
Another way to think about this is that our samples are just a random subset of the many moments that make up the true timeline. It is perhaps intuitive that the proportion of our samples in any given state should be the same as the proportion of all the moments in this state. This is the basis on which elections and consumer surveys work. If, therefore, we accept that we can estimate the proportion of the session spent in a given state, it is a trivial jump to say that we must also be able to estimate the total time spent in that state simply by multiplying by the duration of the session.
Why are the results so different?
I started writing the example above genuinely expecting the two results to come out the same. I was surprised when they didn’t and had to rewrite most of this blog as a consequence.
To be honest, I don’t have a satisfactory explanation for this. I stopped writing the blog for a while, until one of my colleagues suggested that the fastest way to figure out what’s going on is to publish and wait for people to tell me why I’m wrong. So here are a couple of ideas that spring to mind; I’ll revisit them in a future post.
Longer queries mean the session is effectively organised into contiguous regions, so when I sample a single millisecond there’s a good chance all the surrounding milliseconds are in the same state as the one I sampled. It seems plausible that when the sample period is comparable to, or smaller than, the smallest ‘structure’ in the session, we get more accurate results than when that structure is too small to be of consequence.
This isn’t wildly implausible. The Nyquist–Shannon sampling theorem states that to accurately reconstruct a signal you need to sample at more than twice the highest frequency present in it. So I could convince myself that what we’re seeing here is a transition from a probabilistic “sampling from a population” regime to a more deterministic “resampling a signal” regime. However, until I can find some way of quantifying this I shall remain skeptical: just because something makes a nice narrative doesn’t mean it’s actually true.
For now, let’s just try to answer the question we posed at the start: is this useful?
Quantifying error in total query duration
I would sample 30,000 sessions, and I would sample 30,000 more
We saw a couple of example runs from my simulation above, but to say anything more concrete about the error we really need to measure it over many simulations. The chart below is the result of simulating 30,000 sessions. For simplicity these sessions only run a single query which is an average of 40ms in duration and runs on average every 100ms, meaning we would expect the session to be idle for around 60% of the time and running our query for the other 40%. For each session, we measure the true total duration of the query and the estimated total duration and calculate the relative error. We repeat this 1000 times for 30 different sample periods.
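For the curious, here is a rough sketch of that kind of experiment in Python. The simulator itself is the subject of a future post, so the distributions used below (exponential idle gaps, uniform query durations) and every name in the code are my assumptions, not a description of the real thing.

```python
import random

def simulate_relative_error(session_ms=3_600_000, sample_period_ms=1000,
                            mean_gap_ms=60, min_dur_ms=30, max_dur_ms=50):
    """Simulate one session running a single query and return the
    relative error of the sampled estimate of its total runtime.

    Assumed distributions (not necessarily those of the real simulator):
    exponentially distributed idle gaps (mean 60 ms) and uniformly
    distributed durations (mean 40 ms), giving roughly 40% busy time.
    """
    executions = []  # (start, end) of each query execution
    t = 0.0
    while True:
        t += random.expovariate(1.0 / mean_gap_ms)          # idle gap
        duration = random.uniform(min_dur_ms, max_dur_ms)   # one execution
        if t + duration > session_ms:
            break
        executions.append((t, t + duration))
        t += duration
    true_total = sum(end - start for start, end in executions)

    # Sample at regular intervals (with a random phase) and credit a
    # full sample period whenever a sample lands inside an execution.
    hits, i = 0, 0
    sample_t = random.uniform(0, sample_period_ms)
    while sample_t < session_ms:
        while i < len(executions) and executions[i][1] < sample_t:
            i += 1
        if i < len(executions) and executions[i][0] <= sample_t:
            hits += 1
        sample_t += sample_period_ms
    estimated_total = hits * sample_period_ms

    return (estimated_total - true_total) / true_total

errors = sorted(simulate_relative_error() for _ in range(1000))
print("median relative error:", errors[500])
print("90% of errors between:", errors[50], "and", errors[950])
```

Running it with different sample_period_ms values gives the kind of spread summarised by each bar in the chart below.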
The bars in the chart show the distribution of this error. Each bar consists of a central ‘box’ with ‘whiskers’ extending from it. In each case, fifty percent of our error measurements lie within the range shown by the box and ninety percent lie within the range denoted by the whiskers. The median error is shown by the line in the middle of the bar. The dotted line shows the error we would predict based on the mathematics of sampling from a population—more on this in a future post.
Until now, we’ve been talking about the sampling period. However, for these charts, I’ve chosen to use the inverse: sampling frequency. I’ve done this because the cost of sampling, in terms of storage space and CPU cycles, scales linearly with the frequency. For example, sampling five times per second (5Hz) costs five times as much as sampling once per second (1Hz).
Figure 4. Sampling error as a function of sampling frequency for a query with mean duration 40ms and mean periodicity 100ms over a one-hour session.
Two things are striking to me about this chart. Firstly, the shape shows a clear pattern of ‘diminishing returns’: whilst the error does diminish with increasing frequency, the rate at which it diminishes drops off rapidly. Secondly, the error is remarkably small even at very low sampling frequencies. Even at the lowest frequency, corresponding to a sampling period of 5 seconds, we’re still able to make an estimate that’s accurate to within eight percent in 90% of cases.
If that sounds a bit like the Anchorman quote at the start of this section, that’s to be expected. We’re dealing with sampling; uncertainty comes with the territory. Before we look at some more examples, let’s define what we consider to be an acceptable error. This is always going to be highly subjective so let’s just say a 90th percentile error of ±10% is our target. I’ll show this target as a blue dotted line in subsequent charts.
A couple of asides
The way the bars gradually diverge from the pink prediction line is another demonstration of the effect I failed to explain above. As the sampling period gets closer to the typical query duration, we start to achieve systematically better error than we would predict by modelling this process as a sample from a population.
Another notable thing about the chart is that it is symmetrical about the x axis and the median error is approximately zero in all cases. This is because our estimate of the total duration is unbiased: whilst it may not always get the duration exactly right, we don’t expect it to systematically under- or over-estimate it. This is important because biased estimators are very awkward to work with.
Varying query duration and frequency
Let’s repeat the analysis we did above with one alteration. Instead of running every 100ms, our 40ms query now runs every 1000ms on average, occupying only 4% of the session.
Figure 5. Sampling error as a function of sampling frequency for a query with mean duration 40ms and mean periodicity 1000ms over a one-hour session.
The shape of the chart is similar, but this time the errors are much larger. To achieve our target we need to sample at approximately 2Hz—twice a second. Compared with the query above that occupied ten times as much of the session, we have to sample well over ten times as often to achieve the same precision.
Lower-level info—query states
Returning to Postgres for a moment, you may recall several thousand words ago we said that one of the reasons for looking at session activity is to discover what state a given query is in.
So far we’ve just looked at the total duration for which each query ran, which is something you can get more simply from extensions such as pg_stat_statements. If we really want to consider whether sampling is useful, we need to look one layer deeper.
In this example, we again look at just one query that, on average, runs for 40ms every 1000ms. However, instead of estimating the total duration of the query, we now assume that there are three internal states that we’re interested in. These states could be anything: using CPU, waiting on I/O, waiting on locks, etc. We’ll just refer to them by number. State 1 lasts for half of each query execution, state 2 for one-third and state 3 for one-sixth; since the query itself occupies only around 4% of the session, the three states occupy roughly 2%, 1.3% and 0.7% of it respectively. The chart below now shows groups of three bars, one for each of the three states.
Figure 6. Sampling error as a function of sampling frequency for ‘wait states’ of a query with mean duration 40ms and mean periodicity 1000ms over a one-hour session.
Instantly we can see that the error bars are much bigger across the board. The sampling rate required to achieve our target error of 10% for all the wait types is now somewhere around 10Hz, an increase of five times on the 2Hz we needed when we were looking at query duration. We’ve paid a heavy price for the ability to drill one layer deeper; we suddenly need to collect, transmit and store five times as much data.
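Mechanically, drilling into states barely changes the estimator sketched earlier: the samples are keyed by (query, state) rather than by query alone, and everything else stays the same. Again a hedged sketch, with made-up state names.

```python
from collections import Counter

def estimate_state_times(sampled_states, sample_period_ms):
    """Estimate time per (query, state) from sampled session activity.

    `sampled_states` holds a (query, state) pair for each sample, or
    None when the session was idle.
    """
    counts = Counter(s for s in sampled_states if s is not None)
    return {key: n * sample_period_ms for key, n in counts.items()}

samples = [("q1", "state 1"), None, ("q1", "state 2"), ("q1", "state 1"), None]
print(estimate_state_times(samples, 1000))
# {('q1', 'state 1'): 2000, ('q1', 'state 2'): 1000}
```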
It’s about time
All my examples so far have used a one-hour session. Whether this is realistic depends on exactly what question you are trying to answer in the first place. If you have a fairly consistent load on your Postgres instance and you want to characterise that load in more detail it may be quite reasonable to look at 24 hours or even more. If you’re investigating an outage and need to know what was going on in the minutes before everything went dark, an hour is probably too long.
Let’s go back to our 10% error target and our two scenarios from above: a query that runs on average for 40ms and occurs on average every 100 or 1000 milliseconds. The charts below show how we perform against this target in both scenarios for a range of session durations and sampling frequencies.
Figure 7. Sampling error as a function of session duration for a query with mean duration 40ms and mean periodicity 100ms.
Figure 8. Sampling error as a function of session duration for a query with mean duration 40ms and mean periodicity 1000ms.
We can see that the session duration has a significant impact. With a five-minute session, we need to sample at 10Hz to hit our target even for the much more frequent query. Conversely, if you’re happy to look over a duration of three days, you can achieve an acceptable error even for the lower frequency query with a sampling frequency of 0.02 Hz. That’s one sample every 50 seconds!
Conclusion
Time to wrap this post up. Is sampling session activity a good way to understand Postgres workload?
Just answer the question!
Yes, sampling session activity is definitely a useful technique. As long as the thing you’re interested in (a certain query or state) occupies a significant proportion of the session duration, sampling is an extremely cost-effective way to estimate exactly how much of the session it occupies. Crucially, we saw that even if those queries or states are individually very short-lived, sampling is still extremely effective. This makes it very well suited to answering questions such as “What query spent longest on CPU?”, “What were the top five queries by total execution time?” or “What query was responsible for the large amount of I/O wait?”. In such examples, the thing you are interested in is by definition something that occurred a lot and will therefore have a relatively low sampling error. Furthermore, if you are content to get answers at a relatively crude temporal resolution (“What were my top queries in the last 24 hours?”), even a very low frequency—and therefore low cost—sampling regime can yield highly accurate insights.
The flip side is that the efficacy of sampling drops away as you look at states that occupy less of the session. This is perhaps most evident when you look first at a query, and then at the states within that query: inevitably the error for the states within the query will be greater than for the query as a whole. This is compounded if you narrow your focus to a shorter period of time. That can be frustrating, because it is natural to want to ‘drill into’ problems, and the fact that error increases rapidly as you drill in quickly erodes any insight. Of course, you can mitigate this by sampling at a higher rate, but that means collecting vastly more data all of the time just so you can drill into a few specific moments.
More questions than answers
Our analysis above threw up some interesting questions. Can we write an equation for the error? Why do longer queries have lower error even if they occupy the same percentage of the session?
There are also some questions that we haven’t even touched on yet. Can we estimate quantities other than the total duration? What about the average duration of a query or the number of times it has run?
I shall return to these in a future post.