Skip to content

Massive Tables

If you require tables that contain more than 1,048,576 rows -- up to billions of rows -- Codcel can handle them using external CSV or Parquet files backed by the high-performance Parquet columnar storage engine.


Overview

Excel has a limit of 1,048,576 rows per sheet. For datasets that exceed this limit, Codcel allows you to provide the data as a separate CSV or Parquet file. During code generation, the data is converted to Parquet format (if not already) and queried using the Apache DataFusion SQL engine, enabling efficient lookups and filtering across billions of rows.

Massive tables are read-only. You can use lookup and filtering functions on them, but CRUD operations (add, update, delete) are not supported. For tables that need CRUD operations, see CRUD Tables.


How It Works

1. Create a CSV File

Prepare your data as a CSV file. The file must be UTF-8 encoded.

Example T_Measurements.csv:

Hamburg;12.0
Berlin;8.5
Munich;15.3
Hamburg;11.2
Berlin;7.8

1b. Or Provide Parquet Files

If your data is already in Parquet format (e.g., from a Spark or Hive pipeline), you can provide it directly without converting to CSV first.

Parquet files are self-describing -- no header or delimiter configuration is needed. Columns are automatically mapped by position to match the Excel sheet's column headers.

A single table can be split across multiple Parquet files using prefix naming. For example, for a sheet named T_Sales:

  • T_Sales.parquet -- single file
  • T_Sales.parquet, T_Sales_part1.parquet, T_Sales_part2.parquet -- split across files

All files whose name starts with the sheet name (without the T_ prefix) followed by a separator (_, -, or digit) are automatically grouped together.

Note

If both Parquet and CSV files exist for the same table, the Parquet files take priority.

2. Name the CSV File to Match Your Excel Sheet

The data file name must match an Excel sheet name that uses the T_ prefix. For example:

Excel Sheet Name CSV File Name Parquet File Name(s)
T_Measurements T_Measurements.csv T_Measurements.parquet
T_Addresses T_Addresses.csv T_Addresses.parquet, T_Addresses_part2.parquet
T_Transactions T_Transactions.csv T_Transactions.parquet

The Excel sheet defines the table's schema (column headers in row 1). The CSV file provides the actual data. The same rules as Large Tables apply to the Excel sheet: the table must start at column A, row 1 contains column headers, and one table per sheet.

3. Configure CSV Settings

In the desktop app's Settings view (or in codcel.toml), configure how your CSV files are parsed:

Setting Default Description
CSV Has Header false Whether the first row of the CSV is a header row
CSV Delimiter , The character that separates fields (e.g., ,, ;, \t)

4. Reference the Table in Your Formulas

Use standard Excel lookup and filtering functions to query the massive table. Codcel supports the following functions on massive tables:

  • VLOOKUP -- vertical lookup by key
  • HLOOKUP -- horizontal lookup
  • XLOOKUP -- advanced lookup with flexible match and search modes
  • LOOKUP -- simplified lookup
  • INDEX -- direct cell or row access
  • MATCH -- find position of a value
  • XMATCH -- extended match with advanced modes
  • FILTER -- conditional row filtering

These functions work the same way as on regular tables, but are executed as optimised SQL queries against the Parquet data rather than scanning rows in memory.


Match and Search Modes

When using XLOOKUP or XMATCH on massive tables, the following match and search modes are supported:

Match Modes

Mode Description
0 Exact match only
-1 Exact match, or next smallest value
1 Exact match, or next largest value
2 Wildcard matching (* and ?)

Search Modes

Mode Description
1 Search from first to last (default)
-1 Search from last to first
2 Binary search (data must be sorted ascending)
-2 Binary search (data must be sorted descending)

Performance

Query Caching

To avoid re-executing expensive queries, the Parquet engine caches query results with a 5-minute time-to-live (TTL). Repeated lookups with the same parameters return instantly from cache.

Request Coalescing

When multiple concurrent requests execute the same query (common in server deployments), only one query is actually executed. Other identical requests wait for and share the result, significantly reducing load on large datasets.

Tips for Best Performance

  • Use exact match mode (0) when possible -- it generates the most efficient queries
  • Structure your data so that frequently queried columns appear early in the table
  • For repeated queries with varying parameters, the caching system handles optimisation automatically

Supported Data Types

The Parquet engine supports the following column types:

Type Description
String Text values
Integer 32-bit integer values
Double 64-bit floating-point values
Date Date values
Timestamp Date and time values
Boolean True/false values

Limitations

  • Read-only: Massive tables do not support CRUD operations (ADD_ROW, UPDATE_ROW, DELETE_ROW, READ_ROW). For mutable tables, use CRUD Tables with PostgreSQL.
  • No formulas in data: Like large tables, CSV and Parquet data is static. Formulas in the Excel sheet are computed at import time and cannot reference input parameters.
  • Range parameters ignored: When calling VLOOKUP or similar functions, the cell range parameter in the formula is ignored -- the query always operates on the entire table.

Worked Example

For a complete worked example using massive tables with 1 billion rows of data, see the One Billion Row Challenge.


See Also