Experiment · Analysis · Benchmark

LATTEArena

An Evaluation Framework for LLM-powered Tabular Feature Engineering

The first standardized, modular, and execution-safe benchmarking platform that deconstructs monolithic LATTE pipelines into reusable execution blocks — shifting the field from ad-hoc prompt tuning to systematic context engineering.

Ankai Hao, Ke Chen, Huan Li, Lidan Shou
Zhejiang University · Hangzhou, China
15 Methods Surveyed
6 Taxonomy Dimensions
24 Core Configurations
7 Research Questions
17 Empirical Findings
500 Possible Pipelines
Abstract

Distilling a combinatorial design space into systematic, cost-aware insight.

Feature engineering remains a cornerstone of tabular data analysis, and LLMs have emerged as a promising paradigm for its automation — giving rise to LLM-powered Automated Tabular Feature Engineering (LATTE). Yet the field lacks standardized, cost-aware evaluation, and a combinatorial explosion of design choices obscures true algorithmic progress.

We systematically deconstruct 15 representative LATTE methods into a unified 6-dimensional taxonomy, then introduce LATTEArena: a standardized, modular, extensible framework that decouples monolithic pipelines into reusable execution blocks. We evaluate 24 core configurations across 7 research questions, going beyond accuracy to quantify token efficiency and execution robustness.

The result: 17 empirical findings on cost-effectiveness trade-offs and 3 concrete recommendations for deployment. All code, datasets, and 4,000+ execution logs are public.

LLM for feature engineering · benchmarking · context engineering · tabular data · AutoML
Why a benchmark?

Two bottlenecks hold the field back

Despite rapid emergence, LATTE significantly lags behind other LLM-driven tabular tasks. The community confronts two critical, intertwined obstacles.

01
Challenge 1

Combinatorial explosion obscures attribution

Existing methods are proposed as monolithic pipelines, arbitrarily bundling prompting paradigms, search strategies, and output formats. This entanglement makes it impossible to isolate which components actually drive gains versus which merely add overhead.

02
Challenge 2

Absence of standardized, cost-aware evaluation

Current literature relies on fragmented setups and exclusively reports predictive accuracy. Crucial deployment metrics — token consumption, latency, and stability — are systematically neglected, leaving true cost-effectiveness obscured.


Fig 1. The LATTEArena taxonomy and challenges — Method B may beat Method A, but what actually drives the gain? And how do we compare cost fairly across inconsistent LLMs and downstream models?

The LATTE pipeline

An LLM iteratively proposes feature transformations

Guided by validation feedback, the optimizer progressively improves downstream performance through three main stages — the conceptual backbone every LATTE method shares.

S1

Prompt Construction

High-quality prompts, each composed of six key components, guide the LLM toward effective feature generation.

Role · Task · Metadata · Instances · Demonstrations · Instructions
S2

LLM-powered FE

The optimizer 𝒫 determines candidate transformations through iterative querying under a search strategy.

Greedy · Evolutionary · UCB / MCTS · CoT · SC · RAG
S3

Post-processing

A parser translates textual output ϕ into executable programs, applies transforms, and logs (𝒞, ϕ) pairs for future rounds.

NL · Rule · Code · RPN
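To make the post-processing stage concrete for the RPN format, here is a minimal sketch of a parser that turns an RPN string into a new column. The function name `eval_rpn`, the three-operator set, and the whitespace-separated token convention are assumptions for illustration, not LATTEArena's actual grammar.

```python
import operator
from typing import Dict, List

# Hypothetical binary operator set; the real framework may support a different one.
BINARY_OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def eval_rpn(expr: str, table: Dict[str, List[float]]) -> List[float]:
    """Evaluate a whitespace-separated RPN expression row by row."""
    n_rows = len(next(iter(table.values())))
    stack: List[List[float]] = []
    for token in expr.split():
        if token in BINARY_OPS:
            if len(stack) < 2:
                raise ValueError(f"operator {token!r} needs two operands")
            right, left = stack.pop(), stack.pop()
            stack.append([BINARY_OPS[token](a, b) for a, b in zip(left, right)])
        elif token in table:
            stack.append(list(table[token]))
        else:
            # Anything else must be a numeric constant; float() raises ValueError otherwise.
            stack.append([float(token)] * n_rows)
    if len(stack) != 1:
        raise ValueError("malformed RPN expression")
    return stack[0]


table = {"height": [2.0, 3.0], "width": [4.0, 5.0]}
area_plus_one = eval_rpn("height width * 1 +", table)
```

Because every token is either a known column, a known operator, or a number, an unrecognized token fails at parse time instead of being executed.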

Fig 2. The LATTE pipeline and its six-dimensional taxonomy. Letters Ⓐ–Ⓕ mark where each design dimension enters the three-stage flow.
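The three stages can be sketched as a single loop. This is a simplified illustration under stated assumptions: `run_latte`, `build_prompt`, the stand-in LLM, and the stand-in evaluator are all hypothetical, the search strategy shown is Greedy Incremental, and the history stores (ϕ, score) pairs rather than the full (𝒞, ϕ) context.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]  # (candidate feature phi, validation score)


def build_prompt(task: str, history: History) -> str:
    """S1: serialize the task and past (phi, score) pairs into a prompt."""
    demos = "\n".join(f"{phi} -> {score:.3f}" for phi, score in history)
    return f"Task: {task}\nHistory:\n{demos}\nPropose one new feature."


def run_latte(
    task: str,
    llm: Callable[[str], str],
    evaluate: Callable[[List[str]], float],
    rounds: int,
) -> Tuple[List[str], History]:
    """Greedy incremental loop: keep a feature only if validation improves."""
    kept: List[str] = []
    history: History = []
    best = evaluate(kept)
    for _ in range(rounds):
        prompt = build_prompt(task, history)   # S1: prompt construction
        phi = llm(prompt).strip()              # S2: LLM-powered FE
        score = evaluate(kept + [phi])         # S3: apply the feature and score it
        history.append((phi, score))           # log the result for future rounds
        if score > best:
            kept, best = kept + [phi], score
    return kept, history


# Stand-ins so the sketch runs without a real LLM or dataset.
proposals = iter(["a*b", "a+b", "a/b"])
fake_llm = lambda prompt: next(proposals)
useful = {"a*b": 0.05, "a/b": 0.02}
fake_eval = lambda feats: 0.70 + sum(useful.get(f, -0.01) for f in feats)

kept, history = run_latte("predict y", fake_llm, fake_eval, rounds=3)
```

In this toy run the second proposal lowers the validation score and is discarded, while the first and third are kept.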

The 6-dimensional taxonomy

Every method is a point in an orthogonal design space

Despite apparent diversity, LATTE methods share a compact set of orthogonal axes. Prompting and FE Strategy primarily dictate behavior; the rest modulate cost and robustness.

A

Prompting

Search policy over feature space
  • Chain of Thought (baseline)
  • Tree of Thought (branching)
  • EvoPrompt (population)
  • OPRO (feedback)
  • Generator-Critic (closed-loop)
B

FE Strategy

Exploration / exploitation
  • Expand-Reduce (decoupled)
  • Greedy Incremental (8/15)
  • MCTS Exploration (UCB)
  • Best-of-N (select top)
  • Select-Expand-Ensemble (data-space)
C

Demonstration

Inter-iteration memory
  • Ranked / Top-k (by score)
  • Positive-Negative (contrast)
  • Full Context (scales poorly)
  • Textual Gradient (distilled)
D

Output Format

Reachable space & failure mode
  • Natural Language (expressive)
  • Rule (discrete)
  • Code (executable)
  • RPN (compact)
E

Metadata

Channel for domain knowledge
  • Native (6/15)
  • Human-Written (hi-fidelity)
  • LLM-Generated (augment)
  • Calculated Value (PyMFE)
  • RAG-Enhanced (retrieval)
F

Data Sampling

Context compression
  • Random-Selected (uniform)
  • Cluster-based (by label)
  • Human-Selected (expert)
Takeaway — The prompting families form a spectrum from single-trajectory refinement (CoT) through branching exploration (ToT), population search (EvoPrompt) and feedback optimization (OPRO) to closed-loop collaboration (Generator-Critic). Moving along it increases exploration and feedback quality — but with diminishing returns against escalating query cost.

Fig 3. The five FE strategies: (a) Expand-Reduce · (b) Greedy Incremental · (c) MCTS-based Exploration · (d) Best-of-N · (e) Select-Expand-Ensemble.
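For the MCTS-based strategy, the taxonomy names UCB as the selection rule. Below is a minimal sketch of the standard UCB1 score; which UCB variant and exploration constant each surveyed method uses is not stated here, so `c = sqrt(2)` and the dictionary layout of a child node are assumptions.

```python
import math
from typing import Dict, List


def ucb1(total_reward: float, visits: int, parent_visits: int,
         c: float = math.sqrt(2)) -> float:
    """Standard UCB1 score: mean reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # always try unvisited children first
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children: List[Dict[str, float]]) -> int:
    """Return the index of the child with the highest UCB1 score."""
    parent_visits = sum(int(ch["visits"]) for ch in children)
    scores = [ucb1(ch["reward"], int(ch["visits"]), parent_visits) for ch in children]
    return scores.index(max(scores))


children = [
    {"reward": 4.0, "visits": 10},  # well explored, mean reward 0.40
    {"reward": 0.9, "visits": 2},   # barely explored, mean reward 0.45
]
chosen = select_child(children)
```

The rarely visited child wins here because its exploration bonus is larger, which is the mechanism that lets tree search keep exploring where a greedy strategy would commit early.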

15 representative methods

Every published LATTE method, decomposed

Compiled since CAAFE (2023) to span the current technique space, grouped by core prompting paradigm. Highlighted columns are the performance-driving dimensions.

Family | Method | Venue | Ⓐ Prompting | Ⓒ Demonstration | Ⓔ Metadata | Ⓕ Sampling | Ⓑ FE Strategy | Ⓓ Output
--- | --- | --- | --- | --- | --- | --- | --- | ---
CoT | CAAFE | NeurIPS'24 | Vanilla CoT | Full Context | Human-Written | Random | Greedy Incremental | Code
CoT | FEBias | NeurIPSW'24 | Vanilla CoT | Full Context | Calculated | Random | Greedy Incremental | NL
CoT | GPT-Signal | FINNLP'24 | Vanilla CoT | / | LLM-Generated | Human | Greedy Incremental | NL
CoT | RAFG | ICDM'25 | Vanilla CoT | / | RAG-Enhanced | / | Greedy Incremental | Code
CoT | SMARTFEAT | CIDR'24 | Operator-based CoT | / | Native | / | Greedy Incremental | NL
CoT | FeatLLM | ICML'24 | CoT + SC | / | Native | Cluster-based | Select-Expand-Ensemble | Rule
CoT | FREEFORM | AMIA'25 | CoT + SC | Human-Written NL | / | Random | Select-Expand-Ensemble | NL
ToT | LFG | IJCAI'25 | ToT | Positive-Negative | Native | / | MCTS Exploration | NL
ToT | Adda | SIGMOD'25 | ToT | Top-k Code Snippets | Calculated | Random | MCTS Exploration | Code
OPRO | OCTree | NeurIPS'24 | CART-based OPRO | Top-k (Code, CART, Score) | Native | / | Greedy Incremental | Code
OPRO | FEBP | Preprint'25 | Vanilla OPRO | Top-k (RPN, Score) | Native | / | Expand-Reduce | RPN
Evo | ELLM-FT | AAAI'25 | EvoPrompt | Ranked (RPNSet, Score) | / | / | Best-of-N | RPN
Evo | LLM-FE | Preprint'25 | EvoPrompt | Top-k Code Snippets | Native | Random | Best-of-N | Code
Critic | LPFG | IJCAI'25 | Generator-critic | Textual Gradient | Calculated | / | Greedy Incremental | RPN
Critic | Rouge One | Preprint'25 | Generator-critic | Textual Gradient | RAG-Enhanced | / | Greedy Incremental | Code

Core performance-driving dimensions · Bold = benchmarked option · plain = excluded variant · / = absent
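Treating each method as a point in the design space makes the table easy to query. The sketch below transcribes four of the table's columns into a hypothetical `Method` record (the class and field names are illustrative) and recounts two of the tallies quoted in the taxonomy cards.

```python
from collections import Counter
from typing import NamedTuple


class Method(NamedTuple):
    name: str
    prompting: str
    metadata: str      # "/" means the dimension is absent
    fe_strategy: str
    output: str


METHODS = [
    Method("CAAFE", "Vanilla CoT", "Human-Written", "Greedy Incremental", "Code"),
    Method("FEBias", "Vanilla CoT", "Calculated", "Greedy Incremental", "NL"),
    Method("GPT-Signal", "Vanilla CoT", "LLM-Generated", "Greedy Incremental", "NL"),
    Method("RAFG", "Vanilla CoT", "RAG-Enhanced", "Greedy Incremental", "Code"),
    Method("SMARTFEAT", "Operator-based CoT", "Native", "Greedy Incremental", "NL"),
    Method("FeatLLM", "CoT + SC", "Native", "Select-Expand-Ensemble", "Rule"),
    Method("FREEFORM", "CoT + SC", "/", "Select-Expand-Ensemble", "NL"),
    Method("LFG", "ToT", "Native", "MCTS Exploration", "NL"),
    Method("Adda", "ToT", "Calculated", "MCTS Exploration", "Code"),
    Method("OCTree", "CART-based OPRO", "Native", "Greedy Incremental", "Code"),
    Method("FEBP", "Vanilla OPRO", "Native", "Expand-Reduce", "RPN"),
    Method("ELLM-FT", "EvoPrompt", "/", "Best-of-N", "RPN"),
    Method("LLM-FE", "EvoPrompt", "Native", "Best-of-N", "Code"),
    Method("LPFG", "Generator-critic", "Calculated", "Greedy Incremental", "RPN"),
    Method("Rouge One", "Generator-critic", "RAG-Enhanced", "Greedy Incremental", "Code"),
]

strategy_counts = Counter(m.fe_strategy for m in METHODS)
metadata_counts = Counter(m.metadata for m in METHODS)
greedy_share = f"{strategy_counts['Greedy Incremental']}/{len(METHODS)}"
native_share = f"{metadata_counts['Native']}/{len(METHODS)}"
```

The two shares agree with the "8/15" tag on Greedy Incremental and the "6/15" tag on Native metadata.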
LATTEArena: design & usage

A modular, execution-safe arena

The three-stage paradigm is realized through seven core modules with strict I/O specifications — any technique from the taxonomy can be plugged in or adaptively routed without touching the backbone.


Fig 4. The LATTEArena pipeline. Blue dashed boxes are optional modules adaptively routed by configuration; numbers ❶–❼ trace the iterative workflow.

① Serializer

Fuses task specs, metadata & tabular data via a unified template library, abstracting away prompting & format idiosyncrasies.

routable

② Retriever

Constructs in-context demonstrations from the History Database, operationalizing the memory mechanisms.

③ FE Agent

The cognitive engine: policy optimizer 𝒫 selects exploration strategies and interacts with the LLM to generate ϕ and context 𝒞.

④ Post-processor

Translates raw outputs into executable code with strict format constraints & LLM-based error recovery — the execution safety net.

⑤ Evaluator

Scores features with downstream models, integrating NAS & HPO for realistic full-AutoML evaluation.

⑥ History Database

Archives generated code, metadata & scores — with the Retriever, a structured framework for context management.

optional

⑦ Warm-up Module

Pre-populates the database via RL-based algorithms, mitigating the cold-start bottleneck with high-quality few-shot guidance.
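How the History Database and Retriever could cooperate is sketched below. The class and method names are hypothetical, and the rule that a "positive" example is one scoring above a baseline is an assumption; the framework's real I/O specification may differ.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Record:
    feature: str  # generated feature expression or code
    score: float  # validation score observed for it


@dataclass
class HistoryDB:
    records: List[Record] = field(default_factory=list)

    def add(self, feature: str, score: float) -> None:
        self.records.append(Record(feature, score))

    def top_k(self, k: int) -> List[Record]:
        """Ranked / Top-k demonstrations: the k best-scoring records."""
        return sorted(self.records, key=lambda r: r.score, reverse=True)[:k]

    def positive_negative(self, baseline: float) -> Tuple[List[Record], List[Record]]:
        """Positive-Negative demonstrations: split records around a baseline score."""
        pos = [r for r in self.records if r.score > baseline]
        neg = [r for r in self.records if r.score <= baseline]
        return pos, neg


db = HistoryDB()
db.add("a*b", 0.75)
db.add("a+b", 0.69)
db.add("a/b", 0.72)

best_two = [r.feature for r in db.top_k(2)]
pos, neg = db.positive_negative(baseline=0.70)
```

Keeping retrieval behind two small methods means a different demonstration policy can be swapped in without touching the code that writes to the database.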

Success Rate

Fraction of LLM outputs parsed & executed without runtime error — maximized by strict constraints & error recovery.
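The metric itself reduces to a few lines. In this sketch the `success_rate` helper and the toy executor are illustrative assumptions; in practice `run` would parse and execute a generated feature.

```python
from typing import Callable, Iterable


def success_rate(outputs: Iterable[str], run: Callable[[str], object]) -> float:
    """Fraction of outputs that `run` executes without raising."""
    outputs = list(outputs)
    if not outputs:
        return 0.0
    ok = 0
    for out in outputs:
        try:
            run(out)
            ok += 1
        except Exception:
            pass  # a parse or runtime failure counts against the rate
    return ok / len(outputs)


# Toy executor: treats each output as a float literal, so "abc" fails.
rate = success_rate(["1.5", "2", "abc", "4e2"], float)
```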

High-level Abstraction

A unified interface decouples the algorithmic search strategy from execution logic across prompts, formats, and LLM backends.

Seamless Extensibility

Strict abstraction barriers let any novel technique from the six dimensions plug in or be adaptively routed.

Execution Safety

Inherently sanitizes and robustifies LLM outputs, preventing the runtime failures that plague code-generation methods.

Distilling the design space

From 500 pipelines down to 24 core configs

Naively crossing every option yields an intractable space — 5 prompting × 5 FE × 5 demonstration × 4 output formats = 500 unique pipelines. Three configuration principles filter confounds and enforce rigor, distilling this down to just 24 core configurations.
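The 500 figure is a plain Cartesian product, which a short enumeration confirms. The option names follow the taxonomy cards; the fifth demonstration choice ("None") is an assumption made here to reach five options, since the cards name only four.

```python
from itertools import product

PROMPTING = ["CoT", "ToT", "EvoPrompt", "OPRO", "Generator-Critic"]
FE_STRATEGY = [
    "Expand-Reduce",
    "Greedy Incremental",
    "MCTS Exploration",
    "Best-of-N",
    "Select-Expand-Ensemble",
]
# Assumption: "None" (no demonstrations) is counted as the fifth option.
DEMONSTRATION = ["None", "Ranked / Top-k", "Positive-Negative", "Full Context", "Textual Gradient"]
OUTPUT = ["NL", "Rule", "Code", "RPN"]

pipelines = list(product(PROMPTING, FE_STRATEGY, DEMONSTRATION, OUTPUT))
n_pipelines = len(pipelines)  # 5 * 5 * 5 * 4
```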

CGN = CoT prompting + Greedy search + NL output
Aliases concatenate dimension initials.
Subscript h = Positive-Negative history · t = top-k · w = warm-up · c = CART.
P1

Focus on core methodologies

Prioritize methodologically distinct approaches; strictly exclude stylistic variations or external dependencies (human-in-the-loop, RAG) that skew comparisons.

P2

Respect component constraints

Enforce inherent design dependencies — e.g. pairing MCTS exclusively with ToT, or restricting the Warm-up module to the RPN format.

P3

Prioritize cost-effectiveness

Exhaust foundational configurations before escalating to complex reasoning strategies, ensuring performance gains are cost-justified.

Alias | Prompting | Strategy | Output | Demonstration | Maps to original methods
--- | --- | --- | --- | --- | ---
CGN / CGC / CGR | CoT | Greedy | NL · Code · RPN | / | SMARTFEAT, GPT-Signal, RAFG, FeatLLM, FREEFORM
CGNh / CGCh / CGRh | CoT | Greedy | NL · Code · RPN | Positive-Negative (h) | FEBias, CAAFE
CGNt / CGCt / CGRt | CoT | Greedy | NL · Code · RPN | top-k | New variant
TMNh / TMCh / TMRh | ToT | MCTS | NL · Code · RPN | Positive-Negative (h) | LFG, Adda
TMN / TMC / TMR | ToT | MCTS | NL · Code · RPN | / | New variant
GGN / GGC / GGR | Generator-critic | Greedy | NL · Code · RPN | Positive-Negative (h) | LPFG, Rouge One
OGCc | CART-based OPRO | Greedy | Code | top-k + CART | OCTree
OGC / OGR | OPRO | Greedy | Code · RPN | top-k | FEBP (RPN) · Base
EBRw / EBR / EBC | EvoPrompt | Best-of-N | RPN · Code | Ranked (+ warm-up) | ELLM-FT, LLM-FE
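The alias scheme is regular enough to generate. The sketch below rebuilds every alias in the table from dimension initials plus an optional subscript letter and checks that the total is 24; the `ROWS` encoding is an illustrative assumption, not the framework's configuration format.

```python
# One tuple per alias group:
# (prompting initial, strategy initial, output initials, subscript letters).
ROWS = [
    ("C", "G", "NCR", [""]),      # CGN / CGC / CGR
    ("C", "G", "NCR", ["h"]),     # CGNh / CGCh / CGRh
    ("C", "G", "NCR", ["t"]),     # CGNt / CGCt / CGRt
    ("T", "M", "NCR", ["h"]),     # TMNh / TMCh / TMRh
    ("T", "M", "NCR", [""]),      # TMN / TMC / TMR
    ("G", "G", "NCR", [""]),      # GGN / GGC / GGR
    ("O", "G", "C", ["c"]),       # OGCc
    ("O", "G", "CR", [""]),       # OGC / OGR
    ("E", "B", "R", ["w", ""]),   # EBRw / EBR
    ("E", "B", "C", [""]),        # EBC
]

aliases = [
    prompting + strategy + output + subscript
    for prompting, strategy, outputs, subscripts in ROWS
    for output in outputs
    for subscript in subscripts
]
n_configs = len(aliases)
```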
Interactive leaderboard

Rank every configuration by gain

Head-to-head performance gains from our main benchmark (Table 4). Switch the task, aggregation, and metric, or isolate a method family — the board re-ranks live. Higher is always better.

VG = validation gain · TG = test gain · AG = AutoML test gain
Benchmarking & findings

17 findings across 7 research questions

Massive controlled benchmarking on 16 datasets (294 → 1M+ instances) and four LLM backbones — going beyond accuracy to token efficiency and execution robustness.

RQ1

Performance gain across tasks

Finding 1

Task complexity dictates optimal search strategies. A sharp divergence: classification favors exploratory paradigms (Evo / OPRO reach the highest test gains), while regression favors greedy incremental approaches (CoT / ToT secure the highest AutoML gains).

Finding 2

Structured formats overcome NL's expressive bottleneck. RPN excels in classification via concise high-throughput exploration; Code strictly dominates regression where row-wise transforms exceed RPN's structural limits.

Finding 3

The evaluators in LATTE methods exhibit severe overfitting. Validation gains are drastically inflated relative to true downstream gains (e.g. OGR reports +3.55% VG but −0.32% AG). Best-of-N sampling partially mitigates this collapse.

Finding 4

Naïve history demonstrations degrade trajectories. Equipping CoT/ToT with history (h variants) often hurts — context-independent marginal scores isolate features from interactions, misguiding the evaluator.

RQ2 · RQ3

Time, token cost & efficiency

Finding 5

Architectural complexity drives exponential cost. CoT→ToT/Critic incurs moderate growth (~1.3× / 2×), but evolutionary / optimization architectures trigger a near 10× surge — a steep scalability cliff.

Finding 6

LATTE is a new time-efficient paradigm. Most variants finish in ~10% of Autofeat/GRFG runtime; CGR concludes in just 5%, thanks to directed, semantic-driven exploration.

Finding 7

Demonstrations inflate input; format governs latency. Adding demonstrations raises total tokens by 19.6%; switching NL→RPN slashes output tokens 81.3% and cuts runtime 36.1%, while Code bloats output by over 150%.

Finding 8

CoT wins low budgets; ToT scales. A distinct crossover: CoT shows fast initial gains then hits greedy local optima, while ToT sustains scaling and surpasses CoT under high budgets.

Finding 9

Complex architectures need massive token investment. Critic, OPRO and Evo show poor token efficiency under restricted budgets — their advantages are strictly conditional on relaxed cost constraints.

Finding 10

Demonstrations dilute token efficiency. Positive-Negative gains in fixed-round settings are entirely offset by compounded input overhead per call; zero-shot structured prompts stay more cost-effective.

Time and token cost comparison

Cost. Average time/token cost across configs. Note the exponential surge from CoT to iterative OPRO/Evo, and demonstration overhead inflating input context.

Accuracy vs token cost

Efficiency. Accuracy vs token cost on 5 classification datasets — the CoT/ToT crossover and the prohibitive thresholds of complex methods.

Accuracy vs token cost under matched high budgets

High-budget scaling. Accuracy vs token cost for simple CoT methods and OGR at matched high budgets (140k+ tokens) — granted an equal token budget, even zero-shot CoT overtakes OGR's iterative single-output refinement.

RQ4 · RQ5

Component & module analysis

Finding 11

CART delivers near-free cost savings; the OPRO loop monopolizes overhead. Replacing CART reasoning with raw metadata inflates tokens ~280% while slightly degrading VG — yet the OPRO feedback loop alone drives a 19× token surge, rendering the full pipeline impractical for standard tabular tasks unless budget constraints are entirely relaxed.

Finding 12

Evolutionary mutation synthesizes generalizable features, but warm-up quality sets the ceiling. The mutation phase predominantly boosts Test Gain and AutoML Gain, evidencing robust feature synthesis rather than validation overfitting. However, degrading the warm-up to a random collector collapses AG by up to 358%, confirming that collector quality — not the mutation operator — is the decisive bottleneck.

Finding 13

System bottlenecks dictate design priorities. Feature selection is the linchpin for temporal efficiency (removing it halves latency but collapses VG by 39.5%); metadata generation & compression govern the token-performance trade-off.

Performance and cost of different LLMs

LLM backbones. DeepSeek-V3.1 achieves Pareto optimality comparable to GPT-4o; o4-mini secures the highest VG at an ~80% token premium; Llama-3.1-8B struggles with instruction-following formatting.

RQ6 · RQ7

Scalability & robustness

Finding 14

Data scale regulates generalization. On large datasets (100k–1M) the validation-test gap virtually vanishes — sheer scale neutralizes overfitting risk without algorithmic intervention.

Finding 15

Task logic rigidly constrains format. When a task demands strict rule-based reasoning (e.g. poker-hand), Code's structural rigor achieves absolute dominance, superseding sophisticated prompting.

Finding 16

Expressiveness trades off with stability. Success rates degrade as formats grow more expressive (success rate: NL > RPN > Code). Code's unbounded flexibility triggers fabricated operators and hallucinated features.

Finding 17

Iterative bottlenecks regularize generation. By constraining the LLM to refine exactly one feature per iteration, OGR and OGCc maintain a 99% success rate — a highly effective structural regularizer.

Success rates across methods

Robustness. Success rates across methods. The inverse relationship between format expressiveness and stability — and the near-perfect robustness of iterative OPRO.

For practitioners

Three recommendations for real-world deployment

Distilled from extensive evaluation of the sprawling design space, organized along overall performance, component design, and scalability.

1

Match the search strategy to the budget and the format to the task

Default to RPN zero-shot prompting (CGR, TMR) under tight budgets, tree-based planning (ToT/MCTS) at moderate budgets, and OPRO with Best-of-N only when budgets are ample — its gains carry super-linear cost. Orthogonally, pick Code for regression and RPN for classification.
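This recommendation can be read as two independent lookups. The sketch below is a simplification: the `recommend` helper and the tier labels are hypothetical, the source gives no numeric budget thresholds, and the format is decided by task type alone because the text treats the two choices as orthogonal.

```python
from typing import Dict

# Budget tiers are qualitative labels; no numeric thresholds are implied.
SEARCH_BY_BUDGET: Dict[str, str] = {
    "tight": "zero-shot RPN prompting (CGR, TMR)",
    "moderate": "tree-based planning (ToT/MCTS)",
    "ample": "OPRO with Best-of-N",
}
FORMAT_BY_TASK: Dict[str, str] = {
    "regression": "Code",
    "classification": "RPN",
}


def recommend(budget: str, task: str) -> Dict[str, str]:
    """Look up a search strategy by budget tier and an output format by task type."""
    if budget not in SEARCH_BY_BUDGET:
        raise ValueError(f"unknown budget tier: {budget!r}")
    if task not in FORMAT_BY_TASK:
        raise ValueError(f"unknown task type: {task!r}")
    return {"search": SEARCH_BY_BUDGET[budget], "format": FORMAT_BY_TASK[task]}


choice = recommend("moderate", "regression")
```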

2

Spend tokens on cheap structural context, not algorithmic complexity

Retain lightweight structural priors (e.g., CART-style reasoning) over bulky metadata, and keep the feature selector — the linchpin of both latency and validation gain. Trim metadata, instances, and calculated values to cut tokens cheaply, and skip history demonstrations whose input overhead outweighs their benefit.

3

Scale safeguards to data size and task logic

On small datasets, counter LLM-evaluator overfitting with Best-of-N. Use Code for logic-bound tasks, and apply rule-based error-correction (or OPRO's single-feature refinement) to secure success rates.

Beyond

Three bottlenecks point the way forward

No universal winner

No single method dominates across all data scales — the optimal choice shifts with task type and budget.

Diminishing returns

Algorithmic complexification yields diminishing returns relative to escalating query cost.

Costly memory

Existing demonstration forms remain cost-ineffective — future work must pivot to better tabular context management.

Citation

Cite LATTEArena

All code, datasets, and over 4,000 execution logs are publicly released to foster a dynamic, community-driven benchmark. Please cite the extended version on arXiv.

@misc{hao2026lattearenaevaluationframeworkllmpowered,
      title={LATTEArena: An Evaluation Framework for LLM-powered Tabular Feature Engineering (Extended Version)},
      author={Ankai Hao and Ke Chen and Huan Li and Lidan Shou},
      year={2026},
      eprint={2606.09004},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2606.09004},
}