Evaluating models on adaptive reasoning, SAT questions & real-world classification tasks
Evaluating whether SOTA models can really reason
Written by Anita Kirkovska
Evaluation Methodology
We evaluated model performance across three datasets:
Adaptive Reasoning (28 examples): Tests how well models adapt to new contexts based on logic puzzles they've seen before.
Hardest SAT Problems (50 examples): Measures reasoning ability on difficult academic-style questions.
Real-World Customer Tickets (100 examples): Assesses classification accuracy on de-identified support tickets (names, phone numbers, and URLs removed).
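For reference, the dataset mix can be captured in a small config like the sketch below. The file paths and key names are hypothetical, purely illustrative of how the three sets were organized.

import json

# Hypothetical layout of the three evaluation sets (paths are illustrative).
DATASETS = {
    "adaptive_reasoning": {"path": "data/adaptive_reasoning.jsonl", "n_examples": 28},
    "hardest_sat":        {"path": "data/hardest_sat.jsonl",        "n_examples": 50},
    "customer_tickets":   {"path": "data/customer_tickets.jsonl",   "n_examples": 100},
}

def load_dataset(name: str) -> list[dict]:
    """Load one dataset as a list of {question, correct_answer} records."""
    with open(DATASETS[name]["path"]) as f:
        return [json.loads(line) for line in f]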
Models Tested
We included both open-source and proprietary models. If a model had a specific reasoning or "thinking" mode, we used that variant (e.g., Claude 3.7 Sonnet Thinking instead of the standard Claude 3.7 Sonnet).
Process
All prompts ran with temperature = 0. If a model failed to return an answer in <final answer> brackets, we marked it as incorrect (scored 0). Each dataset was evaluated once, with no reruns or prompt variation.
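A minimal sketch of this extraction step, assuming answers are wrapped in <final answer>...</final answer> tags (the exact closing-tag format is an assumption; the helper names are hypothetical):

import re

def extract_final_answer(model_output: str) -> str | None:
    """Pull the text inside <final answer>...</final answer>, if present."""
    match = re.search(r"<final answer>(.*?)</final answer>",
                      model_output, flags=re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else None

def prescore(model_output: str) -> int | None:
    """Per the protocol above: no <final answer> block means an automatic 0.
    Returns None when the output should be passed on to the LLM judge."""
    return 0 if extract_final_answer(model_output) is None else None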
Scoring
We used GPT-4o as an automatic judge with this prompt:
Your job is to determine whether an answer to a question is correct given the correct answer. If there's anything incorrect about the answer, the answer should be marked as completely incorrect.

Question: {{ question }}
Correct answer: {{ correct_answer }}
Answer to evaluate: the <final answer> part in {{ answer }}

Return only the number "1" if the answer is essentially correct and contains no major errors, and the number "0" if the answer contains any significant errors.
Outputs were scored as binary (1 = correct, 0 = incorrect).
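As a rough sketch of how this judge pass could be automated: the prompt text is the one above (the doubled braces are template placeholders; here they become Python format fields), the call assumes the standard OpenAI Python SDK with an API key in the environment, and the function names are hypothetical.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """Your job is to determine whether an answer to a question is correct given the correct answer. If there's anything incorrect about the answer, the answer should be marked as completely incorrect.

Question: {question}
Correct answer: {correct_answer}
Answer to evaluate: the <final answer> part in {answer}

Return only the number "1" if the answer is essentially correct and contains no major errors, and the number "0" if the answer contains any significant errors."""

def judge(question: str, correct_answer: str, answer: str) -> int:
    """Ask GPT-4o to grade one model output; returns 1 (correct) or 0 (incorrect)."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question,
                                           correct_answer=correct_answer,
                                           answer=answer),
        }],
    )
    verdict = resp.choices[0].message.content.strip()
    return 1 if verdict.startswith("1") else 0

Binary verdicts like this are easy to aggregate into per-dataset accuracy, which is why anything other than a clean "1" is treated as a 0.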
Human Review
After the LLM scoring pass, we manually reviewed the results. This helped catch errors where models overfit or where the auto-judge was too lenient or inconsistent.

