ProofGrid: A New Benchmark for LLM Reasoning

ProofGrid is a benchmark for evaluating large language model (LLM) reasoning through machine-checkable proofs. By scoring the proof process rather than only the final answer, it aims to raise the bar for LLM evaluation.

A Suite of Diverse Tasks

ProofGrid presents 15 distinct tasks that range from proof writing and checking to proof masking and gap-filling. Unlike previous benchmarks that often rely on human judgment, ProofGrid uses a minimal formal notation called NDL (natural-deduction language). The notation keeps prompts concise and makes verification precise and auditable.

The paper's key contribution is its calibrated difficulty spectrum. ProofGrid includes foundational reasoning tests alongside more challenging tasks that no current model can solve entirely. This gives a clear picture of where current models stand and what remains unsolved.

Mechanical Evaluation and Stability

Because evaluation is mechanical, ProofGrid's results are reproducible and can be assessed at a fine grain. This is a significant shift from previous methods that depended on subjective human judgment or on LLM outputs.

A notable innovation is the instrumented proof-checking pipeline. It tolerates minor deviations but pinpoints the first substantive reasoning failure in a proof. This improves measurement precision and separates failures of proof planning from execution noise.
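
The idea of tolerating cosmetic deviations while locating the first substantive failure can be illustrated with a small sketch. This is not ProofGrid's pipeline or its NDL notation. The step format, the rule names ("premise", "mp", "and_elim"), and the helper functions below are all hypothetical, and the formulas are restricted to atomic propositions joined by "->" or "&".

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Step:
    formula: str                  # e.g. "q" or "p -> q" (hypothetical toy syntax)
    rule: str                     # "premise", "mp" (modus ponens), or "and_elim"
    refs: Tuple[int, ...] = ()    # indices of earlier steps this step cites


def normalize(formula: str) -> str:
    """Drop whitespace so 'p -> q' and 'p->q' compare equal (cosmetic tolerance)."""
    return "".join(formula.split())


def step_is_valid(step: Step, earlier: Sequence[Step]) -> bool:
    """Check one step against the steps that precede it."""
    rule = step.rule.strip().lower()   # tolerate 'MP' vs 'mp'
    target = normalize(step.formula)
    if rule == "premise":
        return True
    # A step may only cite steps that already exist.
    if any(r < 0 or r >= len(earlier) for r in step.refs):
        return False
    cited = [normalize(earlier[r].formula) for r in step.refs]
    if rule == "mp" and len(cited) == 2:
        a, b = cited
        # Accept the antecedent and the implication in either order.
        return b == f"{a}->{target}" or a == f"{b}->{target}"
    if rule == "and_elim" and len(cited) == 1:
        left, sep, right = cited[0].partition("&")
        return sep == "&" and target in (left, right)
    return False


def first_failure(proof: Sequence[Step]) -> Optional[int]:
    """Return the index of the first invalid step, or None if every step checks."""
    for i, step in enumerate(proof):
        if not step_is_valid(step, proof[:i]):
            return i
    return None


proof = [
    Step("p", "premise"),
    Step("p -> q", "premise"),
    Step("q", "MP", (0, 1)),     # valid despite upper-case rule name and spacing
    Step("r", "mp", (0, 2)),     # substantive failure: r does not follow from p and q
    Step("s", "mp", (3, 1)),     # never reached by the checker
]
print(first_failure(proof))      # 3
```

Stopping at the first failing step, rather than counting every downstream error, is what lets a score of this kind attribute a failed proof to one specific inference.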

ProofGrid also identifies a phenomenon the authors call epistemic instability: models generate proofs that contain flawed inferences, yet correctly reject those same inferences when they are presented in isolation. The study formalizes this phenomenon with an Epistemic Stability Index, adding another layer to model evaluation.
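
The article does not give the formula for the Epistemic Stability Index, so the sketch below does not attempt to reproduce it. As an illustration of the underlying measurement only, it computes a hypothetical "self-rejection rate": among the steps a model generated that a mechanical checker marked invalid, the share the same model rejects when shown each step alone. The record fields and the function name are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class StepRecord:
    checker_valid: bool        # mechanical checker's verdict on the generated step
    model_accepts_alone: bool  # model's own verdict when shown the step in isolation


def self_rejection_rate(records: Sequence[StepRecord]) -> Optional[float]:
    """Share of the model's own flawed steps that it rejects in isolation.

    Hypothetical illustration, not the paper's Epistemic Stability Index.
    Returns None when the model produced no flawed steps.
    """
    flawed = [r for r in records if not r.checker_valid]
    if not flawed:
        return None
    rejected = sum(1 for r in flawed if not r.model_accepts_alone)
    return rejected / len(flawed)


records = [
    StepRecord(checker_valid=True, model_accepts_alone=True),
    StepRecord(checker_valid=False, model_accepts_alone=False),  # wrote it, then rejects it
    StepRecord(checker_valid=False, model_accepts_alone=True),   # consistently wrong
    StepRecord(checker_valid=False, model_accepts_alone=False),  # wrote it, then rejects it
]
rate = self_rejection_rate(records)
```

A high rate under this toy definition means the model often writes inferences it would itself flag as invalid, which is the mismatch between generation and judgment described above.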

Current Models: Progress and Limits

Tested on a broad range of open and proprietary models, ProofGrid reveals rapid progress on foundational tasks. It also exposes substantial limitations: tasks requiring global combinatorial reasoning or low-level proof synthesis are still far from solved.

While frontier models show promise, the real question is: How long until they can tackle more complex challenges? Current models must evolve significantly to address tasks that demand deep reasoning and synthesis skills.

The ablation study reveals that while models are improving, there remains a long road ahead. The challenges posed by ProofGrid might just be the catalyst needed to push the boundaries of what's possible in LLM reasoning.
