How BigQuery Actually Executes a Query (and why most optimization advice misses half the picture)
Many BigQuery users have never opened the Execution Details panel.
This isn’t a criticism. The panel sits behind a tab most people don’t click, full of terminology that isn’t intuitive, displaying numbers that don’t have an obvious meaning. You can use BigQuery productively for years without ever looking at it. Most people do.
But that panel tells you exactly what the engine is doing with your query — physically, stage by stage, worker by worker. And the gap between users who read it and users who don’t is the gap between optimization-as-pattern-matching (filter early, aggregate early) and optimization-as-reasoning (I can see why this stage is slow, and I know what to do about it).
This article is about closing that gap.
By the end, you should be able to look at a BigQuery execution plan and tell a story about what the engine is doing in physical terms.
There’s also a second motivation, which I’ll come back to at the end. Most BigQuery optimization advice on the internet is written from the perspective of on-demand pricing, where you’re billed on bytes scanned. That advice is correct, but it’s incomplete. The shift to capacity pricing — where you reserve slots and pay for compute capacity — changes what “expensive” means. Without understanding the execution model, you can’t think about capacity pricing well. With it, you can.
Let’s start at the bottom.
BigQuery has no servers (sort of)
When you run a query in Postgres or MySQL, you’re sending it to a server. There’s a machine somewhere with CPU cores, memory, and disks. Your query runs on that machine, sharing resources with every other query on the same machine. If you want it to run faster, you make the machine bigger.
BigQuery doesn’t work this way. There’s no server you provisioned. There’s no node count to choose. When you run a query, you’re not running it on anything — you’re submitting work to Google’s infrastructure, which finds available compute and runs your query on it.
The unit of compute is called a slot. A slot is roughly equivalent to a CPU core with some attached memory. When BigQuery runs your query, it allocates some number of slots to it, and the query runs in parallel across those slots.
The unit of work is called a slot-second. One slot working for one second is one slot-second. A query that uses 100 slots for 5 seconds consumes 500 slot-seconds, regardless of wall-clock time. This is the fundamental measure of computational cost in BigQuery.
When you look at the Execution Details panel and see “Total slot ms: 8,432,000,” that’s 8,432 slot-seconds — the cumulative work performed by all parallel workers across all stages. A small query might consume a few hundred slot-seconds. A large analytical query might consume hundreds of thousands.
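If you want to pull these numbers outside the console, the INFORMATION_SCHEMA jobs views expose them per job. A minimal sketch, assuming your jobs run in the US multi-region (swap the region qualifier for your own):

-- Slot consumption of recent queries in this project.
-- total_slot_ms / 1000 is slot-seconds, the unit of work described above.
SELECT
  job_id,
  user_email,
  total_slot_ms / 1000 AS slot_seconds,
  total_bytes_processed / POW(10, 9) AS gb_scanned,
  TIMESTAMP_DIFF(end_time, start_time, SECOND) AS wall_seconds
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
ORDER BY total_slot_ms DESC
LIMIT 20;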
This abstraction has consequences worth noting.
You don’t size BigQuery for peak capacity. There’s no “we need a bigger cluster for the year-end run.” You submit a query, BigQuery decides how many slots to throw at it, and the query runs.
You don’t pay for idle time (in on-demand). The slot pool is shared across Google’s customers; when you’re not running queries, you’re not consuming anything.
You don’t control parallelism directly. Unlike Spark or traditional databases, you can’t tell BigQuery “run this with 200 parallel workers.” The engine decides, based on the query’s needs and the available pool.
This last point matters a lot. Optimization in BigQuery isn’t about telling the engine what to do. It’s about giving the engine a query that maps cleanly to its execution model — so the optimizer’s defaults work in your favor.
A query is a DAG of stages
When you submit a query, BigQuery doesn’t run it as a single monolithic computation. It decomposes the SQL into a directed acyclic graph of stages.
Each stage is a unit of work — read these rows from storage, hash these keys, aggregate by this column, join these two inputs. Stages have dependencies: stage S03 might need the output of stages S01 and S02 before it can begin.
Inside each stage, work happens in parallel across many workers. Across stages, work happens in the partial order defined by the DAG — independent stages can run in parallel, dependent stages run after their inputs are ready.
You can see all of this directly in the BigQuery console. Run any non-trivial query, click “Execution Details,” and you see the full breakdown. Each stage shows:
How many workers ran it (“parallel inputs”)
How much data went in and out
Cumulative compute time across all workers
Average and maximum compute time per worker
Wait time, read time, write time
Here’s a simple example:
SELECT region, SUM(order_total) AS revenue
FROM fct_orders
WHERE order_date >= '2026-01-01'
GROUP BY region;

Behind the scenes, BigQuery decomposes this into something like:

Stage S00: Read partitions from fct_orders matching the date filter. Project the columns we need (region, order_total). Output: filtered, projected rows.
Stage S01: Hash by region for the aggregation. This is a shuffle (more on this in a moment). Output: rows grouped by region, distributed across workers.
Stage S02: Compute SUM(order_total) per region locally on each worker, then combine across workers. Output: one row per region.
Stage S03: Final consolidation and write to the result table.
Each stage runs in parallel across multiple workers. Stages run in dependency order. The execution details panel shows you exactly how many workers each stage used, how long each stage took, and what it produced.
The cost of moving data — shuffles
This is the part most BigQuery users have heard of but don't really understand.
Between stages, BigQuery often needs to redistribute data. If stage S00 reads 10 billion rows and stage S01 needs to group them by customer_id, all the rows for a given customer need to end up on the same worker so they can be aggregated together. That redistribution is called a shuffle.
A shuffle is exactly what it sounds like. Every worker in the upstream stage takes its output rows and hashes the partitioning key (often the join key or GROUP BY column). Based on the hash, it sends each row to a specific worker in the downstream stage. If 1,000 workers produced output and 500 workers will consume it, every upstream worker sends some of its data to every downstream worker.
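If it helps to see the idea in code, the routing step is nothing more exotic than a hash and a modulo. This is only an illustration of the mechanism, not what BigQuery literally runs; the bucket count and the use of FARM_FINGERPRINT here are my assumptions:

-- Illustration only: route each row to one of 500 hypothetical downstream workers
-- by hashing the partitioning key, the way a shuffle assigns rows to buckets.
SELECT
  customer_id,
  MOD(ABS(FARM_FINGERPRINT(CAST(customer_id AS STRING))), 500) AS target_bucket
FROM fct_orders;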
Shuffles are the dominant cost in distributed analytical queries. Not because the computation is hard — hashing and bucketing are cheap operations — but because moving data across the network is fundamentally slower than processing it in memory or on local disk.
This is why almost every piece of BigQuery optimization advice traces back to the same root: make shuffles smaller.
Filter early — fewer rows to shuffle. Aggregate early — fewer rows to shuffle. Avoid joins at fine grains — fewer rows produced and shuffled. Denormalize when joins are repeated — eliminate shuffles entirely. Choose partition and clustering keys to match common query patterns — reduce data before it’s read, which reduces what gets shuffled.
Once you see this, “BigQuery optimization” stops feeling like a list of unrelated tactics. It’s all the same principle, applied at different points in the query lifecycle.
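To make the last of those tactics concrete, here is what matching table layout to query patterns looks like in DDL. The dataset names and the specific partition and clustering choices are illustrative, not prescriptive:

-- Partition on the column queries filter by; cluster on the column they group or join by.
-- A filter like WHERE order_date >= '2026-01-01' then prunes partitions before any
-- rows are read, and clustering by region narrows the scan within each partition.
CREATE TABLE analytics.fct_orders
PARTITION BY order_date
CLUSTER BY region AS
SELECT * FROM staging.raw_orders;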
There’s one more dimension to shuffles that’s worth knowing about: skew.
If the partitioning key has uneven distribution — one customer has 50 million events, the next-largest has 50 thousand — the worker assigned that customer’s bucket gets dramatically more data than every other worker. The shuffle still happens. The receiving worker is just busy, while the other workers finish quickly and wait.
This is why the Execution Details panel shows you both average and maximum compute time per stage. If max is much higher than average — say, 100x — you have skew. One worker is doing most of the work, every other worker is idle, and the stage’s wall time equals the slow worker’s time. The fast workers are sitting there spending slot-time waiting.
In capacity pricing, this is doubly painful. Your reserved slots are tied up doing nothing while one worker churns through a hot key.
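Skew is also easy to check for before you shuffle on a key. A quick distribution query on the running example (where the threshold that counts as “hot” depends entirely on your data):

-- How lopsided is the key we're about to group or join on?
SELECT
  customer_id,
  COUNT(*) AS row_count
FROM fct_orders
GROUP BY customer_id
ORDER BY row_count DESC
LIMIT 10;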
Joins, in physical terms
Now we can talk about joins properly.
When BigQuery joins two tables, it has two main strategies. The choice depends on the size of the smaller side.
Broadcast join. One side is small enough to send to every worker handling the other side. No shuffle of the big side — only the small side moves. Cheap. Fast. The preferred strategy when feasible.
In the execution plan, this appears as JOIN EACH WITH ALL. The “each” is the big side; the “all” is the small side, broadcast to all workers.
Hash join with shuffle. Both sides are too large to broadcast. BigQuery hashes the join keys on both sides and shuffles all the rows so matching keys land on the same worker. Then each worker does the local join on its slice of data. Expensive — both sides move across the network.
In the execution plan, this appears as JOIN EACH WITH EACH.
The threshold for broadcast isn’t fixed. It depends on the memory available per worker and other factors, but practically:
Under ~100 MB: almost always broadcast.
100 MB to ~1 GB: depends on optimizer estimates and available memory.
Over ~1 GB: usually hash join with shuffle.
Here’s the part that really matters: the optimizer’s estimate is based on what it can predict before running the query. If you have a CTE that filters a large table down to a small result, the optimizer might not know the result is small — it estimates based on the unfiltered size. So you might write a query that should qualify for broadcast but doesn’t, because the pre-execution estimate is wrong.
This is why some queries get better plans when you materialize an intermediate result to a temporary table. You’re giving the optimizer better information about how small the data really is. The next query can broadcast it cleanly because BigQuery now knows its actual size.
This is a tradeoff — you pay for the materialization — but on repeated joins to the same filtered set, it pays back.
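A minimal sketch of that pattern, using invented table and filter names; CREATE TEMP TABLE assumes you run both statements together as a multi-statement query:

-- Materialize the filtered set once, so its true (small) size is known.
CREATE TEMP TABLE recent_eu_customers AS
SELECT customer_id, region
FROM dim_customers
WHERE region = 'EU' AND signup_date >= '2026-01-01';

-- The join against the small materialized table can now qualify for a broadcast.
SELECT o.order_id, c.region, o.order_total
FROM fct_orders AS o
JOIN recent_eu_customers AS c USING (customer_id);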
Reading the plan like a story
You now have the vocabulary: stages, slots, shuffles, broadcast vs hash join, fan-out, skew.
Six concepts. That’s most of what you need to read a BigQuery execution plan as a coherent story rather than a wall of numbers.
Here are the patterns to look for:
High slot-ms with low wall time. The query parallelized well. The engine threw lots of slots at it. Usually fine.
Low slot-ms with high wall time. Poor parallelism. Something in the query — a single-worker bottleneck, a sequential dependency, a sort with limited parallelism — kept the engine from spreading the work out. Worth investigating.
High max compute time vs avg compute time within a stage. Skew. One worker is doing much more work than the rest. Look at the upstream stage — what was the partitioning key? Is one value of that key over-represented?
A stage producing far more output rows than input rows. Fan-out, almost always from a join at the wrong grain. The classic case is joining fct_orders to fct_order_items and aggregating later: the order rows multiply by the average item count. Aggregate first, join second (there's a sketch of the rewrite after this list).
A JOIN EACH WITH EACH on a stage with high data volume. Expensive shuffle join. Could the smaller side be filtered or aggregated to broadcast scale? Could the join be eliminated by denormalization?
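Here is the fan-out rewrite mentioned above. Aggregating the item table to the order grain before the join means the order rows are never multiplied; the column names are illustrative:

-- Aggregate order items to the order grain first, then join at matching grains.
WITH item_totals AS (
  SELECT order_id, SUM(quantity * unit_price) AS items_total
  FROM fct_order_items
  GROUP BY order_id
)
SELECT o.region, SUM(t.items_total) AS revenue
FROM fct_orders AS o
JOIN item_totals AS t USING (order_id)
GROUP BY o.region;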
This is what people mean when they say “read the execution plan.” It’s not magic. It’s pattern recognition on a vocabulary of about six concepts. Once you have the vocabulary, the plan tells you what’s happening — and usually what to do about it.
The pricing model changes everything
Here’s where most BigQuery optimization content stops being useful.
In on-demand pricing, BigQuery bills you on bytes scanned during the read stage. Not slot consumption. Not shuffle volume. Not wall time. Just bytes scanned from storage.
This means a poorly-written query that scans 10TB and a well-written query that scans 10TB cost exactly the same. One might run in 30 seconds and the other in 30 minutes — the bill is identical. You’re paying for the read, not the work.
This is why almost all BigQuery optimization advice on the internet is about reducing bytes scanned. Partition filtering, column pruning, avoiding SELECT *, using clustering. All of these reduce the number of bytes scanned, which lowers your bill under on-demand pricing.
In capacity pricing, you reserve a fixed number of slots — say, 500. Those slots are yours, full-time, for the duration of the contract. You pay a flat rate regardless of usage. The slot pool is shared across all queries in your reservation.
Now, the number of bytes scanned doesn’t directly affect your bill. What matters is slot consumption — how many slot-seconds your query uses. A query that consumes 400 slots for 10 minutes is affecting every other query in your reservation during that window. The other queries either queue up behind it or run with fewer slots and take longer.
This is the part most teams switching to capacity pricing don’t fully appreciate. A poorly-written query in on-demand costs only its initiator. A poorly-written query in capacity pricing costs everyone sharing the slot pool.
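If you're on a reservation, the same INFORMATION_SCHEMA view from earlier will tell you who is consuming the pool. A sketch, again assuming the US multi-region:

-- Slot-seconds by user over the past 7 days: who is eating the shared reservation?
SELECT
  user_email,
  ROUND(SUM(total_slot_ms) / 1000) AS slot_seconds,
  COUNT(*) AS query_count
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
GROUP BY user_email
ORDER BY slot_seconds DESC;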
The advice that follows from this is different. Filtering early still matters, but for a different reason — it reduces shuffle volume and slot consumption, not just the number of bytes scanned. Avoiding fan-out still matters, but because the inflated row count consumes slots from your reservation. Joining at the right grain matters because the wrong grain produces shuffle volume that ties up your team’s compute.
The senior framing: every BigQuery operation has a bytes-scanned cost and a slot-consumption cost. In on-demand pricing, you only directly pay for the first. In capacity pricing, you pay for the second, and your team feels it. Most optimization advice you read addresses one axis. The execution model gives you the other.
Closing
Most BigQuery users will never read this far into how the engine works. They’ll continue to write SQL, look at the bytes-scanned estimate before running it, and trust that the rest will be fine. For most queries, that’s enough.
But once you’re past the basics — once you’re building production pipelines, running queries that affect other people’s work, making cost-related decisions about pricing models — the surface-level mental model runs out. You start needing the layer underneath.
The execution model is that layer. Slots, stages, shuffles, broadcast vs hash joins, skew, fan-out. It’s standard distributed-systems vocabulary applied to BigQuery’s specific implementation. Most BigQuery users don’t have it because they were never given a reason to learn it.
The next time you run a non-trivial query, open the Execution Details panel. Look at the stage breakdown. Find the stage with the highest slot-ms. Check whether the max compute time is much higher than the average. Trace the data volumes through the stages. See if you can tell the story of what the engine did.
That habit — reading the plan instead of just trusting the result — is what separates people who use BigQuery from people who actually understand it.