How Columnar Storage Actually Works
By the end of this article you will be able to explain why columnar storage is a standard for analytical workloads.
Most data professionals have an intuition that columnar storage is “fast for analytics” without ever fully understanding why. Columnar = fast. Move on.
This is a missed opportunity. A surprising amount of the optimization advice you’ve encountered — filter early, select specific columns, partition by date, cluster on common filters, avoid SELECT * — works because of the specific physical properties of how columnar data is stored. Without understanding those properties, the advice is just pattern-matching. With them, the advice becomes something you can reason about and extend.
This article is about the storage layer underneath modern analytical systems. Parquet is the most widely used columnar format in the data engineering ecosystem, and we’ll use it as the primary reference. But the principles apply equally to ORC, to BigQuery’s Capacitor, to Snowflake’s micropartitions, and to ClickHouse’s MergeTree storage. The implementations differ in interesting details. The architecture is the same.
By the end, you should be able to explain why selecting fewer columns reduces I/O, why some filters get pruned at the file level and others don’t, what SELECT * actually costs in physical terms, and why columnar storage is great for analytics and terrible for OLTP. Most of this transfers across engines.
Why columnar storage exists
A traditional database stores data row by row. A row in an employees table — Alice, 28, Engineering, 95000, "123 Main St" — sits together on disk. To read Alice’s record, the database performs one I/O operation and gets all her fields. This is excellent for transactional workloads, where the application typically wants complete records.
It’s terrible for analytics.
When an analytical query asks for SELECT AVG(salary) FROM employees, the engine has to read every row’s complete data — including the name, the address, the department, all the columns it doesn’t care about — just to extract the salary. On a 200-column table, you’re reading 199 columns of data you’ll throw away.
Columnar storage does the opposite. Instead of grouping a row’s fields together, it groups all values of one column together. Salaries sit with salaries. Names sit with names. Departments sit with departments. To compute average salary, the engine reads only the salary column. The other 199 columns are never touched.
Three concrete physical reasons this is faster for analytics.
You read less data. A query that touches three columns of a fifty-column table reads roughly 6% of what it would in row storage. The savings compound at scale — on a 10TB table, that’s 600GB scanned instead of 10TB. Cloud warehouses that bill on bytes scanned (BigQuery’s on-demand pricing being the most famous example) make these savings directly visible in the bill.
You compress better. When you store all department values together — Engineering, Marketing, Sales, Engineering, Engineering, Sales, ... — they compress beautifully because they have low cardinality and high repetition. Compression algorithms like Snappy, ZSTD, and dictionary encoding can shrink such columns by 10-100x. When you store all values of a row together, the data types and value patterns are mixed, and compression works far less effectively. Columnar files are typically 5-20x smaller than row-based files on the same data.
You execute faster. Modern CPUs can process arrays of a single type in vectorized operations using SIMD instructions — for example, adding 8 integers in parallel with a single CPU instruction. Columnar storage delivers data in exactly this shape: large contiguous arrays of one type. Query engines exploit this with vectorized execution, processing thousands of values at a time. Row storage can’t easily do this because the values for one column are scattered throughout the file, separated by all the other columns.
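If you want to see the vectorization effect in miniature, here is a sketch using NumPy as a stand-in for a vectorized engine. The column name and sizes are invented; the point is the shape of the data, one contiguous array of a single type.

```python
# A minimal sketch of why contiguous, single-type arrays matter.
# NumPy stands in for a vectorized query engine; the column name and
# row counts are made up for illustration.
import numpy as np

n = 10_000_000
salaries = np.random.default_rng(0).integers(40_000, 200_000, size=n)

# Columnar-style aggregation: one tight pass over a contiguous int64 array,
# the shape that SIMD hardware and vectorized engines are built for.
avg_vectorized = salaries.mean()

# Row-style aggregation: pulling the value out of a per-row container
# (here a list of dicts) defeats vectorization entirely.
rows = [{"name": "x", "salary": int(s)} for s in salaries[:100_000]]
avg_rowwise = sum(r["salary"] for r in rows) / len(rows)
```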
The same properties that make columnar storage great for analytics make it bad for OLTP. Inserting one row requires touching every column block. Updating a single field requires rewriting a compressed column block. Reading one complete record requires touching every column. This is why we have two database worlds: row-oriented systems for applications, columnar systems for analytics. Companies typically maintain both, replicating data between them.
Inside a columnar file
Parquet files have a hierarchical structure that’s worth understanding because it shapes everything else.
```
File
├── Row Group 1 (typically 128MB - 1GB of rows)
│   ├── Column Chunk: name
│   │   ├── Page 1 (compressed values)
│   │   ├── Page 2
│   │   └── Statistics: min, max, null count
│   ├── Column Chunk: age
│   │   └── ... (same structure)
│   └── Column Chunk: department
├── Row Group 2
│   └── ...
└── Footer (file-level metadata)
    ├── Schema definition
    └── Row group statistics (min, max per column per row group)
```

A few things are worth noticing about this structure.
Row groups are batches of rows. A 10GB Parquet file might be broken into 10-50 row groups. Within each row group, data is columnar — but the file as a whole is divided into row-group-sized chunks. This is a deliberate compromise. Pure columnar storage would mean every column in one giant block, which would prevent parallel reads (different workers couldn’t read different parts independently) and prevent statistics-based skipping (you couldn’t skip portions of a column). Row groups solve both problems: each is a self-contained slice, readable independently, with its own statistics.
Each column chunk has min/max statistics in the footer. When the query engine reads the file, it reads the footer first. The footer tells it: row group 5 contains name values where the minimum is “Alpha” and the maximum is “Charlie.” If your query has WHERE name = 'Zoe', the engine knows row group 5 cannot contain matches and skips it entirely — without reading any of its data pages.
This is called predicate pushdown, and it’s the foundation of how columnar formats achieve sub-linear scan performance for selective queries. Most of the optimization work isn’t about reading data faster. It’s about not reading data at all. The footer guides which row groups to skip; columnar storage lets you skip irrelevant columns; only after both filters do any actual data pages get decompressed.
Pages are the unit of compression. Within a column chunk, data is split into pages, typically around 1MB each (the exact size is configurable). Each page is compressed independently with its own statistics. This enables fine-grained skipping during query execution and allows different compression codecs for different columns based on their data characteristics.
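You can inspect all of this yourself. The sketch below uses pyarrow to read only the footer of a hypothetical employees.parquet (any Parquet file will do) and print per-row-group, per-column statistics. No data pages are touched.

```python
import pyarrow.parquet as pq

# Open the file lazily: only the footer is read at this point.
pf = pq.ParquetFile("employees.parquet")  # hypothetical local file
meta = pf.metadata

print(meta.num_rows, "rows in", meta.num_row_groups, "row groups")

for rg_index in range(meta.num_row_groups):
    rg = meta.row_group(rg_index)
    for col_index in range(rg.num_columns):
        col = rg.column(col_index)
        stats = col.statistics  # min, max, null count per column chunk
        print(
            rg_index,
            col.path_in_schema,
            col.compression,
            stats.min if stats else None,
            stats.max if stats else None,
        )
```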
The framing worth internalizing: a Parquet file isn’t just a file. It’s a self-describing, compressed, statistics-rich, parallel-readable analytical artifact. The file format is a query-optimized data structure on disk.
Other formats handle this differently in detail, but with the same principles. ORC organizes data into “stripes” with footer-level statistics. BigQuery’s Capacitor stores data in columnar blocks with per-block metadata. Snowflake’s micropartitions are similarly self-contained columnar units with statistics. The vocabulary differs; the architecture is shared.
How rows get reconstructed from columns
This is the question that sounds harder than it is.
When you store data column by column, you’ve physically separated the values that originally belonged together. The name, age, and department for a single logical row now live in three different places on disk. How does the engine know which age value belongs with which name?
The answer is simpler than you’d expect: position-based correspondence.
Within a row group, every column chunk has exactly the same number of values, in exactly the same order. The Nth value in the name column corresponds to the Nth value in the age column corresponds to the Nth value in the department column. There’s no row ID, no foreign key, no join key. The correspondence is implicit in the position.
If row group 1 contains 100,000 rows, then:
The name column chunk has exactly 100,000 values.
The age column chunk has exactly 100,000 values.
The department column chunk has exactly 100,000 values.
Position 47,293 across all three columns refers to the same logical row.
Reading “row 47,293” means reading the 47,293rd value from each column you care about and stitching them together in memory. The engine never needed an explicit row ID — the position is the row ID.
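In pyarrow terms, a sketch of that lookup might read row group 1 of the same hypothetical file, restricted to the columns of interest, and slice out position 47,293. That slice is the row.

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("employees.parquet")  # hypothetical file from earlier

# Read only the columns we care about from row group 1. Every column chunk
# in a row group holds the same number of values in the same order.
rg = pf.read_row_group(1, columns=["name", "age", "department"])

# "Row 47,293" is just position 47,293 in each of those arrays
# (assumes, as in the text, that this row group holds 100,000 rows).
row = rg.slice(47_293, 1)
print(row.to_pydict())
```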
This is why row groups exist as a unit. Without them, you couldn’t easily parallelize reading. With them, the file is broken into self-contained sections — each one independently readable, each one with its own internal position-based correspondence. The footer tells you where each row group starts in the file via byte offsets per column chunk, so jumping to a specific row group is a direct seek rather than a sequential scan.
The clever part of columnar storage isn’t how rows get reconstructed. That’s trivial. The clever part is that you almost never have to reconstruct full rows. Most analytical queries only need a few columns, so most columns never get touched. And metadata-based pruning means most row groups never get touched either. Reconstruction happens only on the small slice of data that survives both filters.
That’s the real efficiency. Not faster row assembly, but avoiding row assembly entirely for most of the data.
Compression strategies that exploit the layout
Columnar storage’s compression advantage isn’t accidental. The format enables specific compression techniques that aren’t practical in row storage.
Dictionary encoding. When a column has low cardinality — say, a country column with roughly 200 distinct values across millions of rows — the engine builds a dictionary mapping each unique value to a small integer ID. The actual column data becomes a sequence of integer IDs instead of repeated string values. The dictionary is stored once. The savings are dramatic for any column with repeated values: a 10-character country name takes 10 bytes per row in raw form and 1-2 bytes per row when dictionary-encoded.
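A toy sketch of the idea (a real writer builds the dictionary per column chunk and stores it inside the file, but the mechanism is the same):

```python
# Toy dictionary encoding: unique values become small integer IDs.
values = ["US", "DE", "US", "FR", "US", "DE"]

dictionary = {}  # unique value -> integer ID, stored once
encoded = []     # the column itself becomes a sequence of IDs
for v in values:
    if v not in dictionary:
        dictionary[v] = len(dictionary)
    encoded.append(dictionary[v])

# dictionary == {'US': 0, 'DE': 1, 'FR': 2}
# encoded    == [0, 1, 0, 2, 0, 1]
# The strings are stored once; the column data is just small integers.
```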
Run-length encoding (RLE). When a column has long runs of identical values — common in sorted or clustered data — RLE replaces sequences with a single value plus a count. “Engineering, Engineering, Engineering, Engineering, Engineering” becomes “Engineering × 5.” For sorted boolean or low-cardinality columns, this can compress data by orders of magnitude.
Delta encoding. When values change slowly between consecutive rows — timestamps, sequential IDs, sorted numeric data — the engine stores the difference between consecutive values rather than the full values themselves. A column of timestamps spaced one second apart compresses to a series of “1, 1, 1, 1, ...” which then compresses further with RLE.
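Another toy sketch, this time for delta encoding on a timestamp-like column:

```python
# Toy delta encoding: store the first value, then differences.
timestamps = [1700000000, 1700000001, 1700000002, 1700000003, 1700000004]

deltas = [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]
# deltas == [1700000000, 1, 1, 1, 1]
# The long run of 1s then compresses further with run-length encoding.
```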
Bit-packing. When values fit in fewer bits than their declared type — a NUMERIC column where every value is between 0 and 100 needs only 7 bits per value, not a full 64-bit representation — the engine packs values into the minimum bit width required.
These techniques compose. A column might be dictionary-encoded, then run-length-encoded on the dictionary IDs, then bit-packed on the run-length counts. The result is data that takes a fraction of its uncompressed size while remaining queryable without full decompression.
This is why pre-sorting data by columns you query frequently — what most warehouses call “clustering” — has such an outsized impact. Sorted data dramatically improves compression ratios for run-length and delta encoding, which means smaller files, less I/O, and faster queries. The performance gain isn’t just about query-time pruning. It’s also about the underlying compression working better on organized data.
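A rough way to see this for yourself: write the same low-cardinality column twice, once shuffled and once sorted, and compare the resulting file sizes. The data, file names, and codec below are invented for illustration; the size gap is the point.

```python
import os
import random

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical low-cardinality column, one million values.
departments = [random.choice(["Eng", "Sales", "Marketing", "HR"]) for _ in range(1_000_000)]

shuffled = pa.table({"department": departments})
ordered = pa.table({"department": sorted(departments)})

pq.write_table(shuffled, "shuffled.parquet", compression="zstd")
pq.write_table(ordered, "sorted.parquet", compression="zstd")

# The sorted file is typically much smaller: long runs of identical values
# are exactly what run-length and dictionary encoding exploit.
print(os.path.getsize("shuffled.parquet"), os.path.getsize("sorted.parquet"))
```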
What predicate pushdown actually does
Predicate pushdown is the term for the engine’s ability to filter data before reading it, using the metadata stored in the file footer.
Take a concrete example. You have a 100GB Parquet file containing a year’s worth of order data. You run:
```sql
SELECT customer_id, order_total
FROM orders
WHERE order_date >= '2026-01-01' AND order_date < '2026-02-01';
```

A naive engine would read the entire file — all 100GB — and filter rows after reading. The pushdown-aware engine does something different.
First, it reads the file’s footer (small, fast). The footer contains, for each row group, the minimum and maximum order_date. The engine looks at row group 1: min order_date is 2025-04-15, max is 2025-04-22. Outside the query’s date range. Skip the entire row group.
Row group 2: min 2025-04-23, max 2025-05-01. Skip.
Row group 47: min 2025-12-29, max 2026-01-04. Possible match. Open this row group.
Row group 48: min 2026-01-04, max 2026-01-12. Possible match. Open this row group.
The engine ends up opening maybe 4-5 row groups out of 200. Within those, it reads only the customer_id, order_total, and order_date columns — ignoring the other 47 columns the table contains. The order_date column is read to verify each row’s date matches the filter; the other two columns are read for the rows that pass.
Total data read: maybe 200MB out of the original 100GB. The other 99.8% never moved off disk.
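Expressed through pyarrow, the same query might look like the sketch below. The file name is invented and order_date is assumed to be stored as a date column; columns= is the column projection, and filters= is the predicate that gets checked against the row-group statistics.

```python
import datetime

import pyarrow.parquet as pq

# Column projection plus a pushed-down predicate: the reader consults the
# footer statistics and skips row groups whose order_date range cannot match.
table = pq.read_table(
    "orders.parquet",                        # hypothetical file
    columns=["customer_id", "order_total"],  # only these column chunks are returned
    filters=[
        # Assumes order_date is stored as a DATE column.
        ("order_date", ">=", datetime.date(2026, 1, 1)),
        ("order_date", "<", datetime.date(2026, 2, 1)),
    ],
)
```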
This is where the optimization advice you’ve heard your whole career — filter early, select specific columns, avoid SELECT * — has its physical foundation. Each piece of advice maps directly to a layer of the pushdown system:
Filter on partition keys — fewer files to even open.
Filter on clustered/sorted columns — fewer row groups to read within each file.
Select specific columns — fewer column chunks to decompress within each row group.
Avoid SELECT * — otherwise you read every column chunk in every row group that survives the other filters.
The advice isn’t arbitrary. It’s a direct reflection of how the storage layer is organized.
The boundaries of pruning
A natural question follows: if every column has min/max statistics, why don’t filters on every column reduce data scanned the way partition filters do?
The answer involves a distinction that’s easy to miss. There are two layers where pruning can happen.
File-level pruning happens before any file is opened. It uses metadata that’s available without reading anything from the files themselves. Partitioning is the canonical example: a partitioned table is split into separate physical files (or directories), one per partition value. The query engine knows the partitioning scheme from table-level metadata — typically just a directory listing or a manifest file — and can skip entire files without touching them.
Block-level pruning happens after a file is opened. It uses the per-block statistics in the file’s footer. The engine has to read the footer to do this pruning, which means the file is at least partially accessed.
These are different operations with different costs and different scaling properties.
For a small table — a few files, each with a few row groups — both kinds of pruning are cheap. The footer is small. Reading it to make pruning decisions is fast. Modern columnar engines do block-level pruning routinely.
For a very large table — say, billions of rows across millions of blocks — the metadata itself becomes substantial. The minimum and maximum values for every column in every block, multiplied across millions of blocks and hundreds of columns, add up to metadata that is itself expensive to read. Scanning all of it to make pruning decisions starts becoming its own performance problem. Most engines have historically handled this by being selective about which columns get used for block-level pruning at scale: typically, partition columns and sort/cluster keys.
This is why clustering matters more than people realize. Clustering sorts rows within partitions on specific columns, which makes the per-block min/max statistics for those columns tight — each block covers a narrow range of values. Tight statistics make block-level pruning effective: a query filtering on a clustered column can skip many blocks because their min/max ranges definitely don’t overlap with the filter.
For non-clustered columns, the statistics still exist, but they tend to span wide ranges — every block might have a min customer_id near “cust_001” and a max near “cust_999,” because the data is randomly distributed across that column. Wide statistics make pruning ineffective. The filter passes through to the actual data reads.
The principle that emerges: columnar storage’s pruning power depends on how the data is physically organized. Random distribution defeats pruning even when the statistics exist. A clustered or sorted distribution makes pruning effective. The choice of which columns to cluster on is, therefore, a choice about which filters will be cheap and which will be expensive.
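A sketch that makes the difference visible: write the same customer_id column once in random order and once sorted, then print each row group’s min and max. The file names, row counts, and value format below are invented.

```python
import random

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical identifier column, one million values like "cust_042".
ids = [f"cust_{random.randint(1, 999):03d}" for _ in range(1_000_000)]

for path, values in [("random.parquet", ids), ("clustered.parquet", sorted(ids))]:
    pq.write_table(pa.table({"customer_id": values}), path, row_group_size=100_000)
    meta = pq.ParquetFile(path).metadata
    for i in range(meta.num_row_groups):
        stats = meta.row_group(i).column(0).statistics
        print(path, i, stats.min, stats.max)

# In the random file every row group spans roughly cust_001..cust_999, so no
# filter on customer_id can skip anything. In the clustered file each row
# group covers a narrow range, and min/max pruning becomes effective.
```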
What this means in practice
The optimization advice that’s been correct for years is still correct, but understanding the storage layer gives it different weight.
SELECT * is expensive because columnar storage makes selectivity meaningful. Every column chunk you don’t select is data you don’t read. Every column chunk you do select is decompressed and processed. A SELECT * on a 200-column table is reading 200 column chunks per row group. A SELECT col1, col2 is reading two. The cost difference is roughly proportional to the column count.
Filter on partitioning columns when you can. This is the cheapest filter — it doesn’t even open files. On systems where you control partitioning, the choice of partition column should be driven by which time-range or category filter dominates your queries.
Cluster on the columns you frequently filter on, especially after the partition filter has reduced the working set. Clustering tightens block-level statistics, which makes block-level pruning effective. Without clustering, block-level filters often degrade to full scans of whatever survived partition pruning.
Filters on random non-clustered columns are not free, but they’re not as expensive as you might fear. Even when block-level pruning fails, the engine still only reads the columns referenced in the query — not the full row width. The cost is proportional to how many column chunks you touch and how much of each must be decompressed.
Compression matters as much as pruning. A well-clustered column compresses dramatically better than a randomly distributed one, which means less data has to be moved and decompressed even before pruning enters the picture. The two effects compound.
These rules apply across columnar systems. The specific names differ — partitioning vs partition keys, clustering vs sort keys, row groups vs stripes vs micropartitions — but the underlying mechanics are remarkably consistent.
What’s worth taking away
The principle running through all of this: columnar storage is faster for analytics because it lets the engine skip reading most of the data. The actual computation — averaging salaries, summing revenue, counting rows — is rarely the bottleneck. The bottleneck is moving bytes from disk into memory. Every architectural decision in modern columnar formats serves the goal of moving fewer bytes.
That goal manifests at multiple layers. Column-level skipping, so you only read the columns you query. Row-group-level skipping, so you only read the chunks of those columns that could match your filters. Compression, so the bytes you do read are as small as possible. Vectorized execution, so the bytes you do process get processed efficiently.
These layers aren’t independent optimizations. They’re a coordinated system. Partitioning and clustering make skipping effective. Compression makes the bytes you do read smaller. Position-based correspondence makes row reconstruction cheap for the small slice of data that still needs it. Take any one piece away, and the others work less well.
The concrete optimization advice that follows from this understanding is: Choose partition keys that match how queries filter by time. Choose clustering keys that match how queries filter by category or identifier. Select only the columns you need. Filter early. None of this is new advice. But after you understand the storage layer, it stops being rules to follow and starts being the natural consequence of how the system works.
Most of this transfers between systems. Parquet, ORC, Capacitor, Snowflake’s micropartitions, ClickHouse’s MergeTree storage — they all implement the same architecture with different details. Understanding the architecture once means understanding the family. The specific tool you use today may not be the tool you use in five years. The architecture probably will be.