LATTEArena — An Evaluation Framework for LLM-powered Tabular Feature Engineering

Abstract

Distilling a combinatorial design space into systematic, cost-aware insight.

Feature engineering remains a cornerstone of tabular data analysis, and LLMs have emerged as a promising paradigm for its automation — giving rise to LLM-powered Automated Tabular Feature Engineering (LATTE). Yet the field lacks standardized, cost-aware evaluation, and a combinatorial explosion of design choices obscures true algorithmic progress.

We systematically deconstruct 15 representative LATTE methods into a unified 6-dimensional taxonomy, then introduce LATTEArena: a standardized, modular, extensible framework that decouples monolithic pipelines into reusable execution blocks. We evaluate 24 core configurations across 7 research questions, going beyond accuracy to quantify token efficiency and execution robustness.

The result: 17 empirical findings on cost-effectiveness trade-offs and 3 concrete recommendations for deployment. All code, datasets, and 4,000+ execution logs are public.

LLM for feature engineering · benchmarking · context engineering · tabular data · AutoML

Why a benchmark?

Two bottlenecks hold the field back

Despite rapid emergence, LATTE significantly lags behind other LLM-driven tabular tasks. The community confronts two critical, intertwined obstacles.

Challenge 1

Combinatorial explosion obscures attribution

Existing methods are proposed as monolithic pipelines, arbitrarily bundling prompting paradigms, search strategies, and output formats. This entanglement makes it impossible to isolate which components actually drive gains versus which merely add overhead.

Challenge 2

Absence of standardized, cost-aware evaluation

Current literature relies on fragmented setups and exclusively reports predictive accuracy. Crucial deployment metrics — token consumption, latency, and stability — are systematically neglected, leaving true cost-effectiveness obscured.

Fig 1. The LATTEArena taxonomy and challenges — Method B may beat Method A, but what actually drives the gain? And how do we compare cost fairly across inconsistent LLMs and downstream models?

The LATTE pipeline

An LLM iteratively proposes feature transformations

Guided by validation feedback, the optimizer progressively improves downstream performance through three main stages — the conceptual backbone every LATTE method shares.

S1

Prompt Construction

High-quality prompts guide the LLM toward effective feature generation, composed of six key components.

Role · Task · Metadata · Instances · Demonstrations · Instructions
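As a rough illustration of how the six components could be assembled into a single prompt, here is a minimal sketch. The `PromptParts` container, its field names, and the section layout are assumptions made for this example; they are not LATTEArena's template library.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PromptParts:
    """Hypothetical container for the six prompt components."""
    role: str
    task: str
    metadata: dict[str, str]                  # column name -> description
    instances: list[dict[str, object]]        # a few sampled rows
    demonstrations: list[str] = field(default_factory=list)  # notes on past attempts
    instructions: str = "Return one new feature per line."


def build_prompt(parts: PromptParts) -> str:
    """Concatenate the components in a fixed order, skipping empty ones."""
    sections = [
        ("Role", parts.role),
        ("Task", parts.task),
        ("Metadata", "\n".join(f"- {k}: {v}" for k, v in parts.metadata.items())),
        ("Instances", "\n".join(str(row) for row in parts.instances)),
        ("Demonstrations", "\n".join(parts.demonstrations)),
        ("Instructions", parts.instructions),
    ]
    return "\n\n".join(f"## {name}\n{body}" for name, body in sections if body)


prompt = build_prompt(PromptParts(
    role="You are a feature engineering assistant.",
    task="Predict loan default (binary classification).",
    metadata={"income": "annual income in USD", "debt": "total debt in USD"},
    instances=[{"income": 52000, "debt": 13000, "default": 0}],
))
```

With no demonstrations supplied, the Demonstrations section is simply omitted, which corresponds to a zero-shot prompt.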

S2

LLM-powered FE

The optimizer 𝒫_ℳ determines candidate transformations through iterative querying under a search strategy.

Greedy · Evolutionary · UCB / MCTS · CoT · SC · RAG
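To make the UCB option concrete, the sketch below scores candidate transformations with the standard UCB1 rule. The page does not say which UCB variant individual methods use, so the formula, the exploration constant `c`, and the candidate names are assumptions for illustration.

```python
import math


def ucb1(mean_reward: float, visits: int, total_visits: int, c: float = math.sqrt(2)) -> float:
    """Standard UCB1 score: exploitation term plus exploration bonus."""
    if visits == 0:
        return math.inf  # always try unvisited candidates first
    return mean_reward + c * math.sqrt(math.log(total_visits) / visits)


def select(stats: dict[str, tuple[float, int]]) -> str:
    """Pick the candidate with the highest UCB1 score.

    stats maps a candidate name to (mean validation gain, visit count).
    """
    total = sum(visits for _, visits in stats.values())
    return max(stats, key=lambda name: ucb1(stats[name][0], stats[name][1], total))


# Hypothetical candidates with (mean gain, visits).
stats = {"log(income)": (0.020, 10), "income/debt": (0.015, 2), "age*income": (0.0, 0)}
choice = select(stats)
```

The unvisited candidate is selected first; once every candidate has been tried, rarely visited ones keep a larger exploration bonus than frequently visited ones.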

S3

Post-processing

A parser translates textual output ϕ into executable programs, applies transforms, and logs (𝒞, ϕ) pairs for future rounds.

NL · Rule · Code · RPN
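For the RPN format, the parser can be a small stack machine. The following is a minimal sketch under stated assumptions: tokens are whitespace-separated, columns are plain lists of floats, and the operator vocabulary is invented for the example (real methods define their own).

```python
import math
from typing import Callable

Column = list[float]

# Assumed operator set, for illustration only.
BINARY: dict[str, Callable[[float, float], float]] = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b if b != 0 else math.nan,
}
UNARY: dict[str, Callable[[float], float]] = {
    "log": lambda a: math.log(a) if a > 0 else math.nan,
    "sqrt": lambda a: math.sqrt(a) if a >= 0 else math.nan,
}


def eval_rpn(expr: str, table: dict[str, Column]) -> Column:
    """Evaluate a whitespace-separated RPN expression column-wise."""
    stack: list[Column] = []
    for tok in expr.split():
        if tok in BINARY:
            if len(stack) < 2:
                raise ValueError(f"operator {tok!r} needs two operands")
            right, left = stack.pop(), stack.pop()
            stack.append([BINARY[tok](a, b) for a, b in zip(left, right)])
        elif tok in UNARY:
            if not stack:
                raise ValueError(f"operator {tok!r} needs one operand")
            stack.append([UNARY[tok](a) for a in stack.pop()])
        elif tok in table:
            stack.append(list(table[tok]))
        else:
            raise ValueError(f"unknown token {tok!r}")  # e.g. a fabricated operator
    if len(stack) != 1:
        raise ValueError("malformed expression")
    return stack[0]


table = {"income": [100.0, 400.0], "debt": [50.0, 100.0]}
ratio = eval_rpn("income debt /", table)  # income / debt for each row
```

Because every token must be a known column or operator, a fabricated operator fails at parse time instead of producing a silent runtime error.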

Fig 2. The LATTE pipeline and its six-dimensional taxonomy. Letters Ⓐ–Ⓕ mark where each design dimension enters the three-stage flow.

The 6-dimensional taxonomy

Every method is a point in an orthogonal design space

Despite apparent diversity, LATTE methods share a compact set of orthogonal axes. Prompting and FE Strategy primarily dictate behavior; the rest modulate cost and robustness.

A

Prompting

Search policy over feature space

Chain of Thought (baseline)
Tree of Thought (branching)
EvoPrompt (population)
OPRO (feedback)
Generator-Critic (closed-loop)

B

FE Strategy

Exploration / exploitation

Expand-Reduce (decoupled)
Greedy Incremental (8/15 methods)
MCTS Exploration (UCB)
Best-of-N (select top)
Select-Expand-Ensemble (data-space)
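Greedy Incremental, the most common strategy in the list above, can be sketched as follows: a candidate feature is kept only if it strictly improves the validation score. The `score` callable stands in for a downstream-model evaluation and the candidate list stands in for successive LLM proposals; both, along with the toy effect sizes, are assumptions for illustration.

```python
from typing import Callable, Iterable


def greedy_incremental(
    candidates: Iterable[str],
    score: Callable[[frozenset[str]], float],
    base: frozenset[str] = frozenset(),
) -> tuple[frozenset[str], float]:
    """Keep a candidate feature only if it raises the validation score."""
    kept, best = base, score(base)
    for feat in candidates:          # in a real loop, each one comes from an LLM call
        trial = kept | {feat}
        trial_score = score(trial)
        if trial_score > best:       # strict improvement required
            kept, best = trial, trial_score
    return kept, best


# Toy validation scorer (an assumption): "ratio" helps, "noise" hurts, "log" helps slightly.
EFFECT = {"ratio": 0.03, "noise": -0.01, "log": 0.005}


def toy_score(features: frozenset[str]) -> float:
    return 0.80 + sum(EFFECT[f] for f in features)


kept, best = greedy_incremental(["noise", "ratio", "log"], toy_score)
```

Since each decision is conditioned only on the current feature set, the loop is cheap but can stall in a local optimum.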

C

Demonstration

Inter-iteration memory

Ranked / Top-k (by score)
Positive-Negative (contrast)
Full Context (scales poorly)
Textual Gradient (distilled)
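Two of these memory mechanisms are simple to express over a list of past attempts. The sketch below is one plausible reading: Top-k keeps the best-scoring records, and Positive-Negative contrasts helpful records with harmful ones. The `Record` type and the sign-based split are assumptions for illustration.

```python
from typing import NamedTuple


class Record(NamedTuple):
    feature: str   # textual form of the transformation
    gain: float    # validation gain observed when it was tried


def top_k(history: list[Record], k: int) -> list[Record]:
    """Ranked / Top-k: the k highest-gain records, best first."""
    return sorted(history, key=lambda r: r.gain, reverse=True)[:k]


def positive_negative(history: list[Record], k: int) -> tuple[list[Record], list[Record]]:
    """Positive-Negative: the k most helpful and the k most harmful records."""
    ranked = sorted(history, key=lambda r: r.gain, reverse=True)
    positives = [r for r in ranked if r.gain > 0][:k]
    negatives = [r for r in reversed(ranked) if r.gain < 0][:k]
    return positives, negatives


history = [
    Record("a", 0.02), Record("b", -0.01), Record("c", 0.05),
    Record("d", 0.0), Record("e", -0.03),
]
best_two = top_k(history, 2)
pos, neg = positive_negative(history, 1)
```

Either selection keeps the prompt size bounded by `k`, unlike Full Context, which grows with every iteration.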

D

Output Format

Reachable space & failure mode

Natural Language (expressive)
Rule (discrete)
Code (executable)
RPN (compact)

E

Metadata

Channel for domain knowledge

Native (6/15 methods)
Human-Written (hi-fidelity)
LLM-Generated (augment)
Calculated Value (PyMFE)
RAG-Enhanced (retrieval)

F

Data Sampling

Context compression

Random-Selected (uniform)
Cluster-based (by label)
Human-Selected (expert)
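The difference between the first two sampling options can be sketched with the standard library. Here the "by label" tag is read as grouping rows by label value and drawing a few rows from each group; that reading, the function names, and the toy data are assumptions, and a real cluster-based sampler may group rows differently.

```python
import random
from collections import defaultdict

Row = dict[str, object]


def random_sample(rows: list[Row], n: int, seed: int = 0) -> list[Row]:
    """Random-Selected: draw n rows uniformly without replacement."""
    return random.Random(seed).sample(rows, min(n, len(rows)))


def per_label_sample(rows: list[Row], label: str, per_group: int, seed: int = 0) -> list[Row]:
    """Group rows by label value and draw up to per_group rows from each group."""
    rng = random.Random(seed)
    groups: dict[object, list[Row]] = defaultdict(list)
    for row in rows:
        groups[row[label]].append(row)
    picked: list[Row] = []
    for members in groups.values():
        picked.extend(rng.sample(members, min(per_group, len(members))))
    return picked


rows = [{"x": i, "y": int(i >= 8)} for i in range(10)]  # imbalanced toy data: 8 vs 2
balanced = per_label_sample(rows, "y", per_group=2)
uniform = random_sample(rows, 3)
```

On imbalanced data, per-group sampling guarantees that the minority class appears in the prompt, which uniform sampling does not.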

Takeaway — The prompting families form a spectrum from single-trajectory refinement (CoT) through branching exploration (ToT), population search (EvoPrompt) and feedback optimization (OPRO) to closed-loop collaboration (Generator-Critic). Moving along it increases exploration and feedback quality — but with diminishing returns against escalating query cost.

Fig 3. The five FE strategies: (a) Expand-Reduce · (b) Greedy Incremental · (c) MCTS-based Exploration · (d) Best-of-N · (e) Select-Expand-Ensemble.

15 representative methods

Every published LATTE method, decomposed

Compiled since CAAFE (2023) to span the current technique space, grouped by core prompting paradigm. Highlighted columns are the performance-driving dimensions.

Family	Method	Venue	Ⓐ Prompting	Ⓒ Demonstration	Ⓔ Metadata	Ⓕ Sampling	Ⓑ FE Strategy	Ⓓ Output
CoT	CAAFE	NeurIPS'23	Vanilla CoT	Full Context	Human-Written	Random	Greedy Incremental	Code
CoT	FEBias	NeurIPSW'24	Vanilla CoT	Full Context	Calculated	Random	Greedy Incremental	NL
CoT	GPT-Signal	FINNLP'24	Vanilla CoT	/	LLM-Generated	Human	Greedy Incremental	NL
CoT	RAFG	ICDM'25	Vanilla CoT	/	RAG-Enhanced	/	Greedy Incremental	Code
CoT	SMARTFEAT	CIDR'24	Operator-based CoT	/	Native	/	Greedy Incremental	NL
CoT	FeatLLM	ICML'24	CoT + SC	/	Native	Cluster-based	Select-Expand-Ensemble	Rule
CoT	FREEFORM	AMIA'25	CoT + SC	Human-Written NL	/	Random	Select-Expand-Ensemble	NL
ToT	LFG	IJCAI'25	ToT	Positive-Negative	Native	/	MCTS Exploration	NL
ToT	Adda	SIGMOD'25	ToT	Top-k Code Snippets	Calculated	Random	MCTS Exploration	Code
OPRO	OCTree	NeurIPS'24	CART-based OPRO	Top-k (Code,CART,Score)	Native	/	Greedy Incremental	Code
OPRO	FEBP	Preprint'25	Vanilla OPRO	Top-k (RPN,Score)	Native	/	Expand-Reduce	RPN
Evo	ELLM-FT	AAAI'25	EvoPrompt	Ranked (RPNSet,Score)	/	/	Best-of-N	RPN
Evo	LLM-FE	Preprint'25	EvoPrompt	Top-k Code Snippets	Native	Random	Best-of-N	Code
Critic	LPFG	IJCAI'25	Generator-critic	Textual Gradient	Calculated	/	Greedy Incremental	RPN
Critic	Rouge One	Preprint'25	Generator-critic	Textual Gradient	RAG-Enhanced	/	Greedy Incremental	Code

Legend: highlighted columns = core performance-driving dimensions · bold = benchmarked option · plain = excluded variant · / = absent

LATTEArena: design & usage

A modular, execution-safe arena

The three-stage paradigm is realized through seven core modules with strict I/O specifications — any technique from the taxonomy can be plugged in or adaptively routed without touching the backbone.

Fig 4. The LATTEArena pipeline. Blue dashed boxes are optional modules adaptively routed by configuration; numbers ❶–❼ trace the iterative workflow.

① Serializer

Fuses task specs, metadata & tabular data via a unified template library, abstracting away prompting & format idiosyncrasies.

② Retriever (routable)

Constructs in-context demonstrations from the History Database, operationalizing the memory mechanisms.

③ FE Agent

The cognitive engine: policy optimizer 𝒫_ℳ selects exploration strategies and interacts with the LLM to generate ϕ and context 𝒞.

④ Post-processor

Translates raw outputs into executable code with strict format constraints & LLM-based error recovery — the execution safety net.

⑤ Evaluator

Scores features with downstream models, integrating NAS & HPO for realistic full-AutoML evaluation.

⑥ History Database

Archives generated code, metadata & scores — with the Retriever, a structured framework for context management.

⑦ Warm-up Module (optional)

Pre-populates the database via RL-based algorithms, mitigating the cold-start bottleneck with high-quality few-shot guidance.
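One way to picture how the modules hand data to one another is the single-iteration sketch below. It is a hypothetical wiring, not the framework's interface: every module is reduced to a plain callable, the class and field names are invented for the example, and the Warm-up module is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Arena:
    """Hypothetical wiring of the modules; each module is just a callable here."""
    serialize: Callable[[list[str]], str]         # demonstrations -> prompt
    agent: Callable[[str], str]                   # prompt -> raw LLM output
    postprocess: Callable[[str], Optional[str]]   # raw output -> feature, or None on failure
    evaluate: Callable[[str], float]              # feature -> validation score
    retrieve: Optional[Callable[[list[tuple[str, float]]], list[str]]] = None  # optional module
    history: list[tuple[str, float]] = field(default_factory=list)

    def step(self) -> Optional[float]:
        """Run one iteration; return the score, or None if the output was unusable."""
        demos = self.retrieve(self.history) if self.retrieve else []
        raw = self.agent(self.serialize(demos))
        feature = self.postprocess(raw)
        if feature is None:
            return None                           # counted as a failed execution
        score = self.evaluate(feature)
        self.history.append((feature, score))
        return score


def strip_or_none(raw: str) -> Optional[str]:
    """Stub post-processor: reject empty output."""
    return raw.strip() or None


arena = Arena(
    serialize=lambda demos: f"prompt with {len(demos)} demonstrations",
    agent=lambda prompt: " income / debt ",       # stub in place of an LLM call
    postprocess=strip_or_none,
    evaluate=lambda feature: 0.81,                # stub in place of a downstream model
)
score = arena.step()
```

Swapping a technique then means swapping one callable, for example passing a `retrieve` function to turn a zero-shot configuration into a history-based one.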

Success Rate

Fraction of LLM outputs parsed & executed without runtime error — maximized by strict constraints & error recovery.
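The metric itself is a simple ratio. The sketch below counts an output as successful when a supplied executor runs it without raising; the toy executor built on `eval` is only a stand-in for a real parser and is not a safe sandbox.

```python
from typing import Callable


def success_rate(outputs: list[str], run: Callable[[str], object]) -> float:
    """Fraction of outputs for which run() finishes without raising."""
    if not outputs:
        return 0.0
    ok = 0
    for out in outputs:
        try:
            run(out)
            ok += 1
        except Exception:
            pass  # parse failure or runtime failure
    return ok / len(outputs)


def run_expression(src: str) -> object:
    """Toy executor (assumption): evaluate an arithmetic expression over two fixed values."""
    return eval(src, {"__builtins__": {}}, {"income": 100.0, "debt": 0.0})


outputs = ["income + debt", "income / debt", "income +", "ratio(income)"]
rate = success_rate(outputs, run_expression)
```

In this toy run only the first output succeeds: the others fail with a division by zero, a syntax error, and an undefined function, respectively.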

High-level Abstraction

A unified interface decouples the algorithmic search strategy from execution logic across prompts, formats, and LLM backends.

Seamless Extensibility

Strict abstraction barriers let any novel technique from the six dimensions plug in or be adaptively routed.

Execution Safety

Inherently sanitizes and robustifies LLM outputs, preventing the runtime failures that plague code-generation methods.

Distilling the design space

From 500 pipelines down to 24 core configs

Naively crossing every option yields an intractable space — 5 prompting × 5 FE × 5 demonstration × 4 output formats = 500 unique pipelines. Three configuration principles filter confounds and enforce rigor, distilling this down to just 24 core configurations.
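The size of the naive space can be checked by enumeration. In the sketch below only the option counts come from the text; the dictionary keys and integer placeholders are illustrative.

```python
from itertools import product

# Option counts as quoted in the text; the values are placeholders, not real option names.
dims = {
    "prompting": range(5),
    "fe_strategy": range(5),
    "demonstration": range(5),
    "output_format": range(4),
}
pipelines = list(product(*dims.values()))
kept_fraction = 24 / len(pipelines)  # share of the space covered by the core configurations
```

The 24 core configurations therefore cover 4.8% of the naive cross product.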

Alias example: CGN = CoT prompting + Greedy search + NL output.

Aliases concatenate dimension initials.
Subscript h = Positive-Negative history · t = top-k · w = warm-up · c = CART.
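The naming scheme is mechanical enough to decode in a few lines. The letter tables below are read off the aliases in the configuration table; the `decode` helper and its output keys are illustrative, not part of the framework.

```python
PROMPTING = {"C": "CoT", "T": "ToT", "G": "Generator-critic", "O": "OPRO", "E": "EvoPrompt"}
STRATEGY = {"G": "Greedy", "M": "MCTS", "B": "Best-of-N"}
OUTPUT = {"N": "NL", "C": "Code", "R": "RPN"}
SUBSCRIPT = {"h": "Positive-Negative history", "t": "top-k", "w": "warm-up", "c": "CART"}


def decode(alias: str) -> dict[str, str]:
    """Decode an alias such as 'CGN_h' into its design dimensions."""
    letters, _, sub = alias.partition("_")
    if len(letters) != 3:
        raise ValueError(f"expected three initials, got {letters!r}")
    p, s, o = letters
    decoded = {"prompting": PROMPTING[p], "strategy": STRATEGY[s], "output": OUTPUT[o]}
    if sub:
        decoded["variant"] = SUBSCRIPT[sub]
    return decoded


example = decode("EBR_w")
```

Note that the same letter can mean different things by position: G is Generator-critic in the first slot and Greedy in the second, and C is CoT in the first slot and Code in the third.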

P1

Focus on core methodologies

Prioritize methodologically distinct approaches; strictly exclude stylistic variations or external dependencies (human-in-the-loop, RAG) that skew comparisons.

P2

Respect component constraints

Enforce inherent design dependencies — e.g. pairing MCTS exclusively with ToT, or restricting the Warm-up module to the RPN format.

P3

Prioritize cost-effectiveness

Exhaust foundational configurations before escalating to complex reasoning strategies, ensuring performance gains are cost-justified.

Alias	Prompting	Strategy	Output	Demonstration	Maps to original methods
CGN / CGC / CGR	CoT	Greedy	NL · Code · RPN	/	SMARTFEAT, GPT-Signal, RAFG, FeatLLM, FREEFORM
CGN_h / CGC_h / CGR_h	CoT	Greedy	NL · Code · RPN	Positive-Negative (h)	FEBias, CAAFE
CGN_t / CGC_t / CGR_t	CoT	Greedy	NL · Code · RPN	top-k	New variant
TMN_h / TMC_h / TMR_h	ToT	MCTS	NL · Code · RPN	Positive-Negative (h)	LFG, Adda
TMN / TMC / TMR	ToT	MCTS	NL · Code · RPN	/	New variant
GGN / GGC / GGR	Generator-critic	Greedy	NL · Code · RPN	Positive-Negative (h)	LPFG, Rouge One
OGC_c	CART-based OPRO	Greedy	Code	top-k + CART	OCTree
OGC / OGR	OPRO	Greedy	Code · RPN	top-k	FEBP (RPN) · Base
EBR_w / EBR / EBC	EvoPrompt	Best-of-N	RPN · Code	Ranked (+ warm-up)	ELLM-FT, LLM-FE

Interactive leaderboard

Rank every configuration by gain

Head-to-head performance gains from our main benchmark (Table 4). Switch the task, aggregation, and metric, or isolate a method family — the board re-ranks live. Higher is always better.

VG = validation gain · TG = test gain · AG = AutoML test gain

Benchmarking & findings

17 findings across 7 research questions

Large-scale controlled benchmarking on 16 datasets (294 to 1M+ instances) and four LLM backbones, going beyond accuracy to token efficiency and execution robustness.

RQ1

Performance gain across tasks

Finding 1

Task complexity dictates optimal search strategies. A sharp divergence: classification favors exploratory paradigms (Evo / OPRO reach the highest test gains), while regression favors greedy incremental approaches (CoT / ToT secure the highest AutoML gains).

Finding 2

Structured formats overcome NL's expressive bottleneck. RPN excels in classification via concise high-throughput exploration; Code strictly dominates regression where row-wise transforms exceed RPN's structural limits.

Finding 3

The evaluators in LATTE methods exhibit severe overfitting. Validation gains are drastically inflated relative to true downstream gains (e.g. OGR reports +3.55% VG but −0.32% AG). Best-of-N sampling partially mitigates this collapse.

Finding 4

Naïve history demonstrations degrade trajectories. Equipping CoT/ToT with history (h variants) often hurts — context-independent marginal scores isolate features from interactions, misguiding the evaluator.

RQ2 · RQ3

Time, token cost & efficiency

Finding 5

Architectural complexity drives exponential cost. CoT→ToT/Critic incurs moderate growth (~1.3× / 2×), but evolutionary / optimization architectures trigger a near 10× surge — a steep scalability cliff.

Finding 6

LATTE is a new time-efficient paradigm. Most variants finish in ~10% of Autofeat/GRFG runtime; CGR concludes in just 5%, thanks to directed, semantic-driven exploration.

Finding 7

Demonstrations inflate input; format governs latency. Adding demonstrations raises total tokens by 19.6%; switching NL→RPN slashes output tokens 81.3% and cuts runtime 36.1%, while Code bloats output by over 150%.

Finding 8

CoT wins low budgets; ToT scales. A distinct crossover: CoT shows fast initial gains then hits greedy local optima, while ToT sustains scaling and surpasses CoT under high budgets.

Finding 9

Complex architectures need massive token investment. Critic, OPRO and Evo show poor token efficiency under restricted budgets — their advantages are strictly conditional on relaxed cost constraints.

Finding 10

Demonstrations dilute token efficiency. Positive-Negative gains in fixed-round settings are entirely offset by compounded input overhead per call; zero-shot structured prompts stay more cost-effective.

Cost. Average time/token cost across configs. Note the exponential surge from CoT to iterative OPRO/Evo, and demonstration overhead inflating input context.

Efficiency. Accuracy vs token cost on 5 classification datasets — the CoT/ToT crossover and the prohibitive thresholds of complex methods.

High-budget scaling. Accuracy vs token cost for simple CoT methods and OGR at matched high budgets (140k+ tokens) — granted an equal token budget, even zero-shot CoT overtakes OGR's iterative single-output refinement.

RQ4 · RQ5

Component & module analysis

Finding 11

CART delivers near-free cost savings; the OPRO loop monopolizes overhead. Replacing CART reasoning with raw metadata inflates tokens ~280% while slightly degrading VG — yet the OPRO feedback loop alone drives a 19× token surge, rendering the full pipeline impractical for standard tabular tasks unless budget constraints are entirely relaxed.

Finding 12

Evolutionary mutation synthesizes generalizable features, but warm-up quality sets the ceiling. The mutation phase predominantly boosts Test Gain and AutoML Gain, evidencing robust feature synthesis rather than validation overfitting. However, degrading the warm-up to a random collector collapses AG by up to 358%, confirming that collector quality — not the mutation operator — is the decisive bottleneck.

Finding 13

System bottlenecks dictate design priorities. Feature selection is the linchpin for temporal efficiency (removing it halves latency but collapses VG by 39.5%); metadata generation & compression govern the token-performance trade-off.

LLM backbones. DeepSeek-V3.1 achieves Pareto optimality comparable to GPT-4o; o4-mini secures the highest VG at an ~80% token premium; Llama-3.1-8B struggles with instruction-following formatting.

RQ6 · RQ7

Scalability & robustness

Finding 14

Data scale regulates generalization. On large datasets (100k–1M) the validation-test gap virtually vanishes — sheer scale neutralizes overfitting risk without algorithmic intervention.

Finding 15

Task logic rigidly constrains format. When a task demands strict rule-based reasoning (e.g. poker-hand), Code's structural rigor achieves absolute dominance, superseding sophisticated prompting.

Finding 16

Expressiveness trades off with stability. Success rates degrade as formats grow more expressive, with success rate ordered NL > RPN > Code; Code's unbounded flexibility triggers fabricated operators and hallucinated features.

Finding 17

Iterative bottlenecks regularize generation. By constraining the LLM to refine exactly one feature per iteration, OGR and OGC_c maintain a 99% success rate — a highly effective structural regularizer.

Robustness. Success rates across methods. The inverse relationship between format expressiveness and stability — and the near-perfect robustness of iterative OPRO.

For practitioners

Three recommendations for real-world deployment

Distilled from extensive evaluation of the sprawling design space, organized along overall performance, component design, and scalability.

1

Match the search strategy to the budget and the format to the task

Default to RPN zero-shot prompting (CGR, TMR) under tight budgets, tree-based planning (ToT/MCTS) at moderate budgets, and OPRO with Best-of-N only when budgets are ample — its gains carry super-linear cost. Orthogonally, pick Code for regression and RPN for classification.
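Read as a lookup, this recommendation has two independent inputs: the budget picks the search approach and the task picks the output format. The sketch below encodes that reading; the coarse budget labels and the function itself are assumptions, since the text gives no numeric thresholds.

```python
def recommend(budget: str, task: str) -> dict[str, str]:
    """Map a coarse budget level and a task type to a starting configuration."""
    search = {
        "tight": "zero-shot prompting without demonstrations (e.g. CGR, TMR)",
        "moderate": "tree-based planning (ToT/MCTS)",
        "ample": "OPRO with Best-of-N",
    }
    fmt = {"classification": "RPN", "regression": "Code"}
    if budget not in search or task not in fmt:
        raise ValueError(
            "budget must be tight/moderate/ample; task must be classification/regression"
        )
    return {"search": search[budget], "format": fmt[task]}


plan = recommend("moderate", "regression")
```

Treat the result as a starting point rather than a rule: the findings above report no single configuration that wins across all tasks and budgets.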

2

Spend tokens on cheap structural context, not algorithmic complexity

Retain lightweight structural priors (e.g., CART-style reasoning) over bulky metadata, and keep the feature selector — the linchpin of both latency and validation gain. Trim metadata, instances, and calculated values to cut tokens cheaply, and skip history demonstrations whose input overhead outweighs their benefit.

3

Scale safeguards to data size and task logic

On small datasets, counter LLM-evaluator overfitting with Best-of-N. Use Code for logic-bound tasks, and apply rule-based error-correction (or OPRO's single-feature refinement) to secure success rates.

Beyond

Three bottlenecks point the way forward

No universal winner

No single method dominates across all data scales — the optimal choice shifts with task type and budget.

Diminishing returns

Algorithmic complexification yields diminishing returns relative to escalating query cost.

Costly memory

Existing demonstration forms remain cost-ineffective — future work must pivot to better tabular context management.

Citation

Cite LATTEArena

All code, datasets, and over 4,000 execution logs are publicly released to foster a dynamic, community-driven benchmark. Please cite the extended version on arXiv.

@misc{hao2026lattearenaevaluationframeworkllmpowered,
      title={LATTEArena: An Evaluation Framework for LLM-powered Tabular Feature Engineering (Extended Version)},
      author={Ankai Hao and Ke Chen and Huan Li and Lidan Shou},
      year={2026},
      eprint={2606.09004},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2606.09004},
}