Massive Tables¶
If you require tables that contain more than 1,048,576 rows -- up to billions of rows -- Codcel can handle them using external CSV or Parquet files backed by the high-performance Parquet columnar storage engine.
Overview¶
Excel has a limit of 1,048,576 rows per sheet. For datasets that exceed this limit, Codcel allows you to provide the data as a separate CSV or Parquet file. During code generation, the data is converted to Parquet format (if not already) and queried using the Apache DataFusion SQL engine, enabling efficient lookups and filtering across billions of rows.
Massive tables are read-only. You can use lookup and filtering functions on them, but CRUD operations (add, update, delete) are not supported. For tables that need CRUD operations, see CRUD Tables.
How It Works¶
1. Create a CSV File¶
Prepare your data as a CSV file. The file must be UTF-8 encoded.
Example T_Measurements.csv:
1b. Or Provide Parquet Files¶
If your data is already in Parquet format (e.g., from a Spark or Hive pipeline), you can provide it directly without converting to CSV first.
Parquet files are self-describing -- no header or delimiter configuration is needed. Columns are automatically mapped by position to match the Excel sheet's column headers.
A single table can be split across multiple Parquet files using prefix naming. For example, for a sheet named T_Sales:
T_Sales.parquet-- single fileT_Sales.parquet,T_Sales_part1.parquet,T_Sales_part2.parquet-- split across files
All files whose name starts with the sheet name (without the T_ prefix) followed by a separator (_, -, or digit) are automatically grouped together.
Note
If both Parquet and CSV files exist for the same table, the Parquet files take priority.
2. Name the CSV File to Match Your Excel Sheet¶
The data file name must match an Excel sheet name that uses the T_ prefix. For example:
| Excel Sheet Name | CSV File Name | Parquet File Name(s) |
|---|---|---|
T_Measurements |
T_Measurements.csv |
T_Measurements.parquet |
T_Addresses |
T_Addresses.csv |
T_Addresses.parquet, T_Addresses_part2.parquet |
T_Transactions |
T_Transactions.csv |
T_Transactions.parquet |
The Excel sheet defines the table's schema (column headers in row 1). The CSV file provides the actual data. The same rules as Large Tables apply to the Excel sheet: the table must start at column A, row 1 contains column headers, and one table per sheet.
3. Configure CSV Settings¶
In the desktop app's Settings view (or in codcel.toml), configure how your CSV files are parsed:
| Setting | Default | Description |
|---|---|---|
| CSV Has Header | false |
Whether the first row of the CSV is a header row |
| CSV Delimiter | , |
The character that separates fields (e.g., ,, ;, \t) |
4. Reference the Table in Your Formulas¶
Use standard Excel lookup and filtering functions to query the massive table. Codcel supports the following functions on massive tables:
- VLOOKUP -- vertical lookup by key
- HLOOKUP -- horizontal lookup
- XLOOKUP -- advanced lookup with flexible match and search modes
- LOOKUP -- simplified lookup
- INDEX -- direct cell or row access
- MATCH -- find position of a value
- XMATCH -- extended match with advanced modes
- FILTER -- conditional row filtering
These functions work the same way as on regular tables, but are executed as optimised SQL queries against the Parquet data rather than scanning rows in memory.
Match and Search Modes¶
When using XLOOKUP or XMATCH on massive tables, the following match and search modes are supported:
Match Modes¶
| Mode | Description |
|---|---|
0 |
Exact match only |
-1 |
Exact match, or next smallest value |
1 |
Exact match, or next largest value |
2 |
Wildcard matching (* and ?) |
Search Modes¶
| Mode | Description |
|---|---|
1 |
Search from first to last (default) |
-1 |
Search from last to first |
2 |
Binary search (data must be sorted ascending) |
-2 |
Binary search (data must be sorted descending) |
Performance¶
Query Caching¶
To avoid re-executing expensive queries, the Parquet engine caches query results with a 5-minute time-to-live (TTL). Repeated lookups with the same parameters return instantly from cache.
Request Coalescing¶
When multiple concurrent requests execute the same query (common in server deployments), only one query is actually executed. Other identical requests wait for and share the result, significantly reducing load on large datasets.
Tips for Best Performance¶
- Use exact match mode (
0) when possible -- it generates the most efficient queries - Structure your data so that frequently queried columns appear early in the table
- For repeated queries with varying parameters, the caching system handles optimisation automatically
Supported Data Types¶
The Parquet engine supports the following column types:
| Type | Description |
|---|---|
| String | Text values |
| Integer | 32-bit integer values |
| Double | 64-bit floating-point values |
| Date | Date values |
| Timestamp | Date and time values |
| Boolean | True/false values |
Limitations¶
- Read-only: Massive tables do not support CRUD operations (
ADD_ROW,UPDATE_ROW,DELETE_ROW,READ_ROW). For mutable tables, use CRUD Tables with PostgreSQL. - No formulas in data: Like large tables, CSV and Parquet data is static. Formulas in the Excel sheet are computed at import time and cannot reference input parameters.
- Range parameters ignored: When calling VLOOKUP or similar functions, the cell range parameter in the formula is ignored -- the query always operates on the entire table.
Worked Example¶
For a complete worked example using massive tables with 1 billion rows of data, see the One Billion Row Challenge.
See Also¶
- Tables Overview -- comparison of all table types
- Large Tables -- tables up to 1,048,576 rows within Excel
- Small Tables -- in-memory tables with formula support
- CRUD Tables -- mutable tables backed by PostgreSQL
- Big Data -- overview of Codcel's data scaling strategy