Deep Research: Comparing Model Performance

Multi-step AI workflow for comparing models head-to-head on a real dataset.

Overview

Run multiple models in parallel against the same inputs from your log set, compare their outputs side by side, and pick the best one. The same pattern works on any input column: agent traces, chat prompts, support tickets, or document corpora. This walkthrough uses a public dataset of financial news articles (financial-news-articles) as a concrete example, but you can substitute the user-message column from your own chat logs or the prompt column from your agent traces.

Demo showing deep research in Hyperparam

Steps

1. Load the dataset: Open financial-news-articles.
2. Give the model a research prompt: Use chat to request "Summarize each news article using claude-haiku, claude-sonnet, and gpt-5-mini. Then compare their summaries and explain how the models perform differently. Which model do you recommend for best quality?"
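
For readers who want to see the shape of the pattern outside the chat interface, here is a minimal Python sketch of the fan-out: every model runs on every input in parallel, and the results are collected into one row per input so the outputs sit side by side. It is an illustration under stated assumptions, not how Hyperparam implements the workflow. The names compare_models, first_sentence, and first_words are hypothetical, and the two "models" are toy local functions standing in for real calls to claude-haiku, claude-sonnet, or gpt-5-mini.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# A "model" here is any function from input text to output text.
# Assumption: real model API calls would be wrapped to fit this shape.
ModelFn = Callable[[str], str]


def first_sentence(text: str) -> str:
    """Toy stand-in model: keep only the first sentence."""
    return text.split(". ")[0].rstrip(".") + "."


def first_words(text: str, n: int = 5) -> str:
    """Toy stand-in model: keep only the first n words."""
    return " ".join(text.split()[:n])


def compare_models(
    inputs: List[str],
    models: Dict[str, ModelFn],
    max_workers: int = 4,
) -> List[Dict[str, str]]:
    """Run every model on every input in parallel.

    Returns one row per input, with the input text plus one
    column per model name, so outputs can be read side by side.
    """
    rows: List[Dict[str, str]] = [{"input": text} for text in inputs]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one task per (input, model) pair.
        futures = {
            (i, name): pool.submit(fn, text)
            for i, text in enumerate(inputs)
            for name, fn in models.items()
        }
        # Place each result in the row of its input, under its model name.
        for (i, name), future in futures.items():
            rows[i][name] = future.result()
    return rows


if __name__ == "__main__":
    # Made-up example inputs, not rows from financial-news-articles.
    articles = [
        "Shares rose after the earnings report. Analysts had expected a decline.",
        "The central bank held rates steady. Markets barely moved.",
    ]
    models = {"model_a": first_sentence, "model_b": first_words}
    for row in compare_models(articles, models):
        print(row)
```

Threads suit this sketch because real model calls spend most of their time waiting on the network. The comparison and recommendation step is left out: in the workflow above, the chat model performs it from the side-by-side outputs.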

Expected Results

Summary Comparison: An analysis of the strengths and weaknesses of each model's summaries
Model Recommendation: A recommendation for which model to use for best summary results

Other Use Cases

Dataset Discovery: Use natural language to find public datasets
Classifying Prompt Patterns: Categorize unstructured prompts to see your real traffic mix
Patient Data Workflow: Extract, filter, and export structured medical data
Quality Filtering: Score and filter low-quality, sycophantic responses in chat logs