Use Cases

This guide provides practical examples of using Hyperparam for common machine learning data workflows.

Dataset Discovery: Finding Datasets via Chat

Use natural language to search and discover datasets within Hyperparam.

Overview

Leverage the Hyperparam chat interface to find relevant datasets without browsing through file listings.

Steps

  1. Open Hyperparam Chat

    Access the chat interface from the main navigation

  2. Search using natural language

    Example query: “find me anonymized patient data with medical charting”

    Hyperparam Chat will return datasets from Hugging Face that match the criteria (a rough programmatic analogue is sketched after these steps)

  3. Open dataset from results

    Click on a result (e.g., chunked-ehr/0000) to open in the data viewer

    Dataset loads with all columns and metadata
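
Chat-based search is a Hyperparam UI feature, but the Hugging Face Hub API offers a rough programmatic analogue via keyword search over public datasets. A minimal sketch of that idea (the query string and result handling are illustrative, not Hyperparam's implementation):

```python
# Rough programmatic analogue of the chat search above, using the
# public Hugging Face Hub API (not Hyperparam's internal search).
from huggingface_hub import HfApi

api = HfApi()

# Keyword search over public datasets; the query mirrors the intent
# of the example chat prompt.
for ds in api.list_datasets(search="medical charting", limit=10):
    print(ds.id)  # the repo id you would open in the data viewer
```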

Expected Results

  • Quick discovery: Natural language search returns relevant datasets
  • Direct access: One-click opening into data viewer
  • Context preserved: Chat understands domain-specific terminology (medical, ML, etc.)

Quality Filtering: Removing Sycophantic Responses

Filter out low-quality, overly agreeable responses from a chat log dataset using LLM-generated quality scores.

Overview

Starting with a 200k-row chat log dataset (ultrachat_200k), generate a sycophancy score for each conversation, then filter out highly sycophantic responses.
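
Hyperparam generates the score column through its chat interface; the general technique underneath is per-row LLM scoring. A minimal sketch of that idea, assuming an OpenAI-style client (the model name and prompt are illustrative, not Hyperparam's implementation):

```python
# Per-row LLM scoring -- a sketch of the general technique, not
# Hyperparam's implementation. Assumes OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Rate the following conversation for sycophancy on a scale from "
    "0.0 (authentic) to 1.0 (highly sycophantic). "
    "Reply with only the number.\n\n{conversation}"
)

def sycophancy_score(conversation: str) -> float:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": PROMPT.format(conversation=conversation)}],
    )
    # The prompt asks for a bare number; a robust pipeline would
    # validate the reply before converting.
    return float(resp.choices[0].message.content.strip())
```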

Steps

  1. Load the dataset

    Open ultrachat_200k/train_gen/0000

  2. Generate sycophancy scores

    Open the chat panel on the right-hand side

    Use chat to request: “add a sycophancy score for each row”

    Hyperparam analyzes each conversation and creates a sycophancy_score column (0.0 = authentic, 1.0 = highly sycophantic)

  3. Create workspace and review scores

    Edit data in workspace to create a 100-row sample

    High sycophancy scores indicate overly agreeable responses; low scores indicate authentic engagement

  4. Apply filter

    Add filter: sycophancy_score < 0.2

    Keeps only responses with low sycophancy (authentic, non-pandering); the same filter is sketched in code after these steps

  5. Export filtered dataset

    Click export

    Enable “Apply current table filters”

    Set output filename (e.g., ultrachat_200k_filtered.parquet)

    Export processes full dataset with filter applied
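
For reference, steps 4–5 expressed in plain pandas on a local copy of the scored data (the input filename is hypothetical; the output name matches this example):

```python
# Equivalent of steps 4-5 in pandas, assuming a local parquet file
# that already contains the generated sycophancy_score column.
import pandas as pd

df = pd.read_parquet("ultrachat_200k_scored.parquet")  # hypothetical filename

# Keep only low-sycophancy (authentic, non-pandering) rows.
filtered = df[df["sycophancy_score"] < 0.2]

filtered.to_parquet("ultrachat_200k_filtered.parquet", index=False)
print(f"kept {len(filtered)} of {len(df)} rows")
```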

Expected Results

  • Generated column: sycophancy_score rating each response’s authenticity
  • Filtered dataset: Only rows with sycophancy score < 0.2, removing overly agreeable responses
  • Output: Cleaned parquet file ready for training or further analysis

Data Transformation: Categorizing System Prompts

Derive categorical data from unstructured text fields to better understand dataset composition.

Overview

Starting with a large conversation dataset (OpenOrca), analyze system prompts and create a categorical column to classify prompt types.

Steps

  1. Find a conversation dataset

    Open Hyperparam chat

    Query: “look for conversation datasets with at least 100,000 examples”

    Select OpenOrca/partial-train/0000 from results

  2. Examine the data

    Review the system_prompt column

    Note the variety of system prompts in the dataset

  3. Create categorical transformation

    In chat, request: “create a new column that categorizes the system prompt”

    Hyperparam analyzes system prompts and generates categories (e.g., “Education/Tutor”, “General Assistant”, “Information retrieval/QA”)

    New column system_prompt_category appears with classifications

  4. Review categorization

    Scroll through rows to verify category assignments

    Categories help identify prompt patterns across the dataset; a quick way to tabulate them is sketched after these steps

  5. Export categorized dataset

    Click export

    Select columns including system_prompt_category

    Export processes full 100k+ dataset
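
Once exported, a quick way to see the distribution of prompt types is to tabulate the new column (the input filename here is hypothetical):

```python
# Inspect the distribution of generated categories in the export.
import pandas as pd

df = pd.read_parquet("orca_categorized.parquet")  # hypothetical filename

# Row counts per category, largest first.
print(df["system_prompt_category"].value_counts())

# The same distribution as a share of the dataset.
print(df["system_prompt_category"].value_counts(normalize=True).round(3))
```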

Expected Results

  • New categorical column: system_prompt_category classifying each system prompt
  • Pattern insights: Understand distribution of prompt types in the dataset
  • Enhanced metadata: Enables filtering and analysis by prompt category

Complete Workflow: Patient Data Extraction and Filtering

Extract structured fields from unstructured medical records, filter by criteria, and export a refined dataset.

Overview

Starting with a 53,000+ row parquet file containing unstructured patient records, use LLM-based extraction to create structured columns, filter the dataset by age, diagnosis, and symptom criteria, and export a subset with selected columns.
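
The extraction itself happens through Hyperparam's chat, but the underlying technique is per-row structured extraction with an LLM returning one JSON object per record. A minimal sketch of that idea (not Hyperparam's implementation; the model name and prompt are assumptions):

```python
# Per-row structured extraction -- a sketch of the general technique,
# not Hyperparam's implementation. Assumes OPENAI_API_KEY is set.
import json
from openai import OpenAI

client = OpenAI()

FIELDS = ["age", "diagnosis", "symptoms", "comorbidities", "treatments", "outcome"]

def extract_fields(input_text: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        response_format={"type": "json_object"},  # request valid JSON output
        messages=[{
            "role": "user",
            "content": (
                f"Extract {', '.join(FIELDS)} from this patient record. "
                f"Reply with a JSON object using exactly those keys.\n\n{input_text}"
            ),
        }],
    )
    return json.loads(resp.choices[0].message.content)
```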

Steps

  1. Search for Dataset

    First, open Hyperparam Chat and ask: “find me anonymized patient data with medical charting”

    The first result, ‘chunked-ehr/0000’, will work for this example; click it to open it in the data viewer

  2. Extract structured fields using chat

    Open any cell from the ‘input_text’ column containing patient information

    Scroll down to view the full unstructured text for an individual chart

    Use chat to request extraction: “extract age, diagnosis, symptoms, comorbidities, treatments, outcome from input_text”

    Hyperparam will create 6 new columns and populate them with the extractions

    Columns appear as: extracted_age, diagnosis, symptoms, comorbidities, treatments, outcomes

    Scroll down and you will see Hyperparam filling out all 53k+ rows

  3. Edit data in a workspace sample

    Open the file in a workspace (creates a 100-row random sample)

    This forces all generated columns to load for faster iteration

  4. Apply filters

    Add filter by age: age > 50

    Add filter by diagnosis: diagnosis contains “respiratory”

    Add filter by symptoms: symptoms contains “fever”

    Applied filters reduce the sample from 100 rows to 2 matching patients (the same filters are sketched in code after these steps)

  5. Export filtered dataset

    Click export

    Select specific columns: subject_id, age, diagnosis, symptoms, comorbidities, treatments, outcome

    Enable “Apply current table filters”

    Set output filename: filtered_patients.parquet

    Click export to process the full 53,000+ row dataset with filters applied
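
For reference, the same filters and column selection expressed in pandas (the input filename is hypothetical, and the column names follow this example's extraction step):

```python
# Equivalent of steps 4-5 in pandas, assuming the extracted columns
# exist in a local parquet file with the names used in this example.
import pandas as pd

df = pd.read_parquet("chunked-ehr-extracted.parquet")  # hypothetical filename

mask = (
    (pd.to_numeric(df["age"], errors="coerce") > 50)
    & df["diagnosis"].str.contains("respiratory", case=False, na=False)
    & df["symptoms"].str.contains("fever", case=False, na=False)
)

columns = ["subject_id", "age", "diagnosis", "symptoms",
           "comorbidities", "treatments", "outcome"]

df.loc[mask, columns].to_parquet("filtered_patients.parquet", index=False)
```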

Expected Results

  • Extracted columns: Structured fields parsed from unstructured patient text
  • Filtered sample: 2 patients in the 100-row sample matching the criteria (age > 50, a respiratory diagnosis, fever symptom)
  • Final export: ~655 rows (roughly 1.2% of the 53k+ rows) with the selected columns only