Use Cases
This guide provides practical examples of using Hyperparam for common machine learning data workflows.
Dataset Discovery: Finding Datasets via Chat
Use natural language to search and discover datasets within Hyperparam.
Overview
Leverage the Hyperparam chat interface to find relevant datasets without browsing through file listings.
Steps
- Open Hyperparam Chat
Access the chat interface from the main navigation
- Search using natural language
Example query: “find me anonymized patient data with medical charting”
Hyperparam Chat returns datasets from Hugging Face that match the criteria
- Open dataset from results
Click on a result (e.g., chunked-ehr/0000) to open it in the data viewer
The dataset loads with all columns and metadata
Expected Results
- Quick discovery: Natural language search returns relevant datasets
- Direct access: One-click opening into data viewer
- Context preserved: Chat understands domain-specific terminology (medical, ML, etc.)
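If you want to script this discovery step outside of chat, the Hugging Face Hub client supports keyword search. A minimal sketch, which approximates rather than reproduces Hyperparam’s natural-language search (the query string is illustrative):

```python
# Sketch: keyword dataset search on the Hugging Face Hub, approximating
# the chat-based discovery above. Requires `pip install huggingface_hub`.
from huggingface_hub import list_datasets

# The Hub API matches keywords rather than free-form natural language,
# so results will be rougher than Hyperparam's chat search.
for ds in list_datasets(search="medical charting", limit=10):
    print(ds.id)
```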
Quality Filtering: Removing Sycophantic Responses
Filter out low-quality, overly agreeable responses from a chat log dataset using LLM-generated quality scores.
Overview
Starting with a 200k-row chat log dataset (ultrachat_200k), generate a sycophancy score for each conversation, then filter out highly sycophantic responses.
Steps
- Load the dataset
- Generate sycophancy scores
Open the chat panel on the right-hand side
Use chat to request: “add a sycophancy score for each row”
Hyperparam analyzes each conversation and creates a sycophancy_score column (0.0 = authentic, 1.0 = highly sycophantic)
- Create workspace and review scores
Edit data in workspace to create a 100-row sample
High sycophancy scores indicate overly agreeable responses; low scores indicate authentic engagement
- Apply filter
Add filter: sycophancy_score < 0.2
Keeps only low-sycophancy (authentic, non-pandering) responses; a pandas equivalent appears after these steps
- Export filtered dataset
Click export
Enable “Apply current table filters”
Set output filename (e.g., ultrachat_200k_filtered.parquet)
Export processes the full dataset with the filter applied
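For reference, the filter-and-export step maps onto a simple pandas operation. A minimal sketch, assuming the scored dataset is available locally as a parquet file (the input filename is illustrative):

```python
# Sketch: the same filter + export done locally with pandas.
# Assumes the scored dataset was saved as a parquet file and that
# pandas + pyarrow are installed; the input filename is illustrative.
import pandas as pd

df = pd.read_parquet("ultrachat_200k_scored.parquet")  # illustrative name

# Keep only low-sycophancy (authentic, non-pandering) responses.
filtered = df[df["sycophancy_score"] < 0.2]

filtered.to_parquet("ultrachat_200k_filtered.parquet", index=False)
print(f"Kept {len(filtered)} of {len(df)} rows")
```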
Expected Results
- Generated column: sycophancy_score rating each response’s authenticity (see the scoring sketch after this list)
- Filtered dataset: Only rows with sycophancy_score < 0.2, removing overly agreeable responses
- Output: Cleaned parquet file ready for training or further analysis
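Hyperparam generates the sycophancy_score column for you; as intuition for what such an LLM-generated score involves, here is a rough sketch of producing one by hand. It assumes an OpenAI-compatible client, and the model name and prompt wording are illustrative assumptions, not Hyperparam’s internals:

```python
# Rough sketch of LLM-based sycophancy scoring outside Hyperparam.
# Assumes `pip install openai` and OPENAI_API_KEY in the environment;
# the model name and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def sycophancy_score(conversation: str) -> float:
    """Ask the model for a 0.0-1.0 sycophancy rating of one conversation."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice
        messages=[
            {"role": "system",
             "content": "Rate how sycophantic the assistant is in this "
                        "conversation from 0.0 (authentic) to 1.0 (highly "
                        "sycophantic). Reply with the number only."},
            {"role": "user", "content": conversation},
        ],
    )
    return float(resp.choices[0].message.content.strip())
```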
Data Transformation: Categorizing System Prompts
Derive categorical data from unstructured text fields to better understand dataset composition.
Overview
Starting with a large conversation dataset (Orca), analyze system prompts and create a categorical column to classify prompt types.
Steps
- Find a conversation dataset
Open Hyperparam chat
Query: “look for conversation datasets with at least 100,000 examples”
Select OpenOrca/partial-train/0000 from results
- Examine the data
Review the system_prompt column
Note the different kinds of system prompts it contains
- Create categorical transformation
In chat, request: “create a new column that categorizes the system prompt”
Hyperparam analyzes system prompts and generates categories (e.g., “Education/Tutor”, “General Assistant”, “Information retrieval/QA”)
New column system_prompt_category appears with classifications
- Review categorization
Scroll through rows to verify category assignments
Categories help identify prompt patterns across the dataset
- Export categorized dataset
Click export
Select columns including system_prompt_category
Export processes the full 100k+ row dataset
Expected Results
- New categorical column: system_prompt_category classifying each system prompt
- Pattern insights: Understand the distribution of prompt types in the dataset (see the sketch after this list)
- Enhanced metadata: Enables filtering and analysis by prompt category
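Once the categorized dataset is exported, the distribution of prompt types is easy to inspect with pandas. A minimal sketch (the exported filename is illustrative):

```python
# Sketch: inspect the distribution of generated prompt categories.
# Assumes pandas + pyarrow; the exported filename is illustrative.
import pandas as pd

df = pd.read_parquet("openorca_categorized.parquet")
print(df["system_prompt_category"].value_counts(normalize=True))
```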
Complete Workflow: Patient Data Extraction and Filtering
Extract structured fields from unstructured medical records, filter by criteria, and export a refined dataset.
Overview
Starting with a 53,000+ row parquet file containing unstructured patient records, use LLM-based extraction to create structured columns, filter the dataset by age and diagnosis criteria, and export a subset with selected columns.
Steps
- Search for Dataset
Open Hyperparam Chat and ask: ‘find me anonymized patient data with medical charting’
The first result, ‘chunked-ehr/0000’, works for this example; click it to open it in the data viewer
- Extract structured fields using chat
Open any cell from the ‘input_text’ column containing patient information
Scroll down to view the full unstructured text of an individual chart
Use chat to request extraction: “extract age, diagnosis, symptoms, comorbidities, treatments, outcome from input_text”
Hyperparam will create 6 new columns and populate them with the extractions (a rough sketch of this kind of extraction appears at the end of this use case)
Columns appear as: extracted_age, diagnosis, symptoms, comorbidities, treatments, outcomes
Scroll down and you will see Hyperparam filling in all 53k+ rows
- Edit data in workspace sample
Open the file in a workspace (creates a 100-row random sample)
This forces all generated columns to load for faster iteration
- Apply filters
Add Filter by age: age > 50
Add Filter by diagnosis: diagnosis contains respiratory
Add Filter by symptoms: symptoms contains “fever”
Applied filters reduce the sample from 100 rows to 2 matching patients (a pandas equivalent of these filters follows these steps)
- Export filtered dataset
Click export
Select specific columns: subject_id, age, diagnosis, symptoms, comorbidities, treatments, outcome
Enable “Apply current table filters”
Set output filename: filtered_patients.parquet
Click export to process the full 53,000-row dataset with filters applied
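For reference, the three filters and the column selection map onto the following pandas operations. A minimal sketch, assuming the extracted dataset is available locally as a parquet file (filenames are illustrative, and column names follow the extraction step above, so match them to your actual export):

```python
# Sketch: the same three filters plus column selection, done locally
# with pandas. Filenames are illustrative; column names follow the
# extraction step above, so match them to your actual export.
import pandas as pd

df = pd.read_parquet("chunked_ehr_extracted.parquet")

# LLM-extracted ages may be strings, so coerce before comparing.
age = pd.to_numeric(df["extracted_age"], errors="coerce")
mask = (
    (age > 50)
    & df["diagnosis"].str.contains("respiratory", case=False, na=False)
    & df["symptoms"].str.contains("fever", case=False, na=False)
)
columns = ["subject_id", "extracted_age", "diagnosis", "symptoms",
           "comorbidities", "treatments", "outcomes"]

df.loc[mask, columns].to_parquet("filtered_patients.parquet", index=False)
print(f"Kept {int(mask.sum())} of {len(df)} rows")
```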
Expected Results
- Extracted columns: Structured fields parsed from unstructured patient text
- Filtered sample: 2 patients matching the criteria (age > 50, a respiratory diagnosis, fever symptom)
- Final export: ~655 rows (~1.2% of the 53k+ rows) with selected columns only
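Finally, as intuition for the extraction step itself: Hyperparam performs it inside the tool, but the same idea can be sketched by hand with an LLM that returns JSON. A rough sketch, assuming an OpenAI-compatible client (the model name, prompt wording, and field list are illustrative assumptions, not Hyperparam’s internals):

```python
# Rough sketch of LLM-based field extraction outside Hyperparam.
# Assumes `pip install openai` and OPENAI_API_KEY; the model name,
# prompt wording, and field list are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

FIELDS = ["age", "diagnosis", "symptoms", "comorbidities",
          "treatments", "outcome"]

def extract_fields(input_text: str) -> dict:
    """Extract structured fields from one unstructured patient record."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Extract the following fields from the patient "
                        f"record as a JSON object: {', '.join(FIELDS)}. "
                        "Use null for any field that is not present."},
            {"role": "user", "content": input_text},
        ],
    )
    return json.loads(resp.choices[0].message.content)
```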