Use Cases

This guide provides practical examples of using Hyperparam for common machine learning data workflows.

Dataset Discovery: Finding Datasets via Chat

Use natural language to search and discover datasets within Hyperparam.

Overview

Leverage the Hyperparam chat interface to find relevant datasets without browsing through file listings.

Steps

  1. Open Hyperparam Chat

    Access the chat interface from the main navigation

  2. Search using natural language

    Example query: “find me anonymized patient data with medical charting”

    Hyperparam Chat will return datasets from Hugging Face that match the criteria (a rough programmatic analogue is sketched after these steps)

  3. Open dataset from results

    Click on a result (e.g., chunked-ehr/0000) to open in the data viewer

    Dataset loads with all columns and metadata
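
Chat-based search is a Hyperparam UI feature, but the Hugging Face Hub API offers a rough programmatic analogue via keyword search over public datasets. A minimal sketch of that idea (the query string and result handling are illustrative, not Hyperparam's implementation):

```python
# Rough programmatic analogue of the chat search above, using the
# public Hugging Face Hub API (not Hyperparam's internal search).
from huggingface_hub import HfApi

api = HfApi()

# Keyword search over public datasets; the query mirrors the intent
# of the example chat prompt.
for ds in api.list_datasets(search="medical charting", limit=10):
    print(ds.id)  # the repo id you would open in the data viewer
```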

Expected Results

  • Quick discovery: Natural language search returns relevant datasets
  • Direct access: One-click opening into data viewer
  • Context preserved: Chat understands domain-specific terminology (medical, ML, etc.)

Quality Filtering: Removing Sycophantic Responses

Filter out low-quality, overly agreeable responses from a chat log dataset using LLM-generated quality scores.

Overview

Starting with a 200k-row chat log dataset (ultrachat_200k), generate a sycophancy score for each conversation, then filter out highly sycophantic responses.
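
Hyperparam generates the score column through its chat interface; the general technique underneath is per-row LLM scoring. A minimal sketch of that idea, assuming an OpenAI-style client (the model name and prompt are illustrative, not Hyperparam's implementation):

```python
# Per-row LLM scoring -- a sketch of the general technique, not
# Hyperparam's implementation. Assumes OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Rate the following conversation for sycophancy on a scale from "
    "0.0 (authentic) to 1.0 (highly sycophantic). "
    "Reply with only the number.\n\n{conversation}"
)

def sycophancy_score(conversation: str) -> float:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": PROMPT.format(conversation=conversation)}],
    )
    # The prompt asks for a bare number; a robust pipeline would
    # validate the reply before converting.
    return float(resp.choices[0].message.content.strip())
```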

Steps

  1. Load the dataset

    Open ultrachat_200k/train_gen/0000

  2. Generate sycophancy scores

    Open the chat panel on the right-hand side

    Use chat to request: “add a sycophancy score for each row”

    Hyperparam analyzes each conversation and creates a sycophancy_score column (0.0 = authentic, 1.0 = highly sycophantic)

  3. Create workspace and review scores

    Edit data in workspace to create a 100-row sample

    High sycophancy scores indicate overly agreeable responses; low scores indicate authentic engagement

  4. Apply filter

    Add filter: sycophancy_score < 0.2

    Keeps only responses with low sycophancy (authentic, non-pandering); the same filter is sketched in code after these steps

  5. Export filtered dataset

    Click export

    Enable “Apply current table filters”

    Set output filename (e.g., ultrachat_200k_filtered.parquet)

    Export processes full dataset with filter applied
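
For reference, steps 4–5 expressed in plain pandas on a local copy of the scored data (the input filename is hypothetical; the output name matches this example):

```python
# Equivalent of steps 4-5 in pandas, assuming a local parquet file
# that already contains the generated sycophancy_score column.
import pandas as pd

df = pd.read_parquet("ultrachat_200k_scored.parquet")  # hypothetical filename

# Keep only low-sycophancy (authentic, non-pandering) rows.
filtered = df[df["sycophancy_score"] < 0.2]

filtered.to_parquet("ultrachat_200k_filtered.parquet", index=False)
print(f"kept {len(filtered)} of {len(df)} rows")
```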

Expected Results

  • Generated column: sycophancy_score rating each response’s authenticity
  • Filtered dataset: Only rows with sycophancy score < 0.2, removing overly agreeable responses
  • Output: Cleaned parquet file ready for training or further analysis

Data Transformation: Categorizing System Prompts

Derive categorical data from unstructured text fields to better understand dataset composition.

Overview

Starting with a large conversation dataset (OpenOrca), analyze system prompts and create a categorical column to classify prompt types.

Steps

  1. Find a conversation dataset

    Open Hyperparam chat

    Query: “look for conversation datasets with at least 100,000 examples”

    Select OpenOrca/partial-train/0000 from results

  2. Examine the data

    Review the system_prompt column

    Note the variety of system prompts in the dataset

  3. Create categorical transformation

    In chat, request: “create a new column that categorizes the system prompt”

    Hyperparam analyzes system prompts and generates categories (e.g., “Education/Tutor”, “General Assistant”, “Information retrieval/QA”)

    New column system_prompt_category appears with classifications

  4. Review categorization

    Scroll through rows to verify category assignments

    Categories help identify prompt patterns across the dataset; a quick way to tabulate them is sketched after these steps

  5. Export categorized dataset

    Click export

    Select columns including system_prompt_category

    Export processes full 100k+ dataset
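
Once exported, a quick way to see the distribution of prompt types is to tabulate the new column (the input filename here is hypothetical):

```python
# Inspect the distribution of generated categories in the export.
import pandas as pd

df = pd.read_parquet("orca_categorized.parquet")  # hypothetical filename

# Row counts per category, largest first.
print(df["system_prompt_category"].value_counts())

# The same distribution as a share of the dataset.
print(df["system_prompt_category"].value_counts(normalize=True).round(3))
```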

Expected Results

  • New categorical column: system_prompt_category classifying each system prompt
  • Pattern insights: Understand distribution of prompt types in the dataset
  • Enhanced metadata: Enables filtering and analysis by prompt category

Complete Workflow: Patient Data Extraction and Filtering

Extract structured fields from unstructured medical records, filter by criteria, and export a refined dataset.

Overview

Starting with a 53,000+ row parquet file containing unstructured patient records, use LLM-based extraction to create structured columns, filter the dataset by age, diagnosis, and symptom criteria, and export a subset with selected columns.
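
The extraction itself happens through Hyperparam's chat, but the underlying technique is per-row structured extraction with an LLM returning one JSON object per record. A minimal sketch of that idea (not Hyperparam's implementation; the model name and prompt are assumptions):

```python
# Per-row structured extraction -- a sketch of the general technique,
# not Hyperparam's implementation. Assumes OPENAI_API_KEY is set.
import json
from openai import OpenAI

client = OpenAI()

FIELDS = ["age", "diagnosis", "symptoms", "comorbidities", "treatments", "outcome"]

def extract_fields(input_text: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        response_format={"type": "json_object"},  # request valid JSON output
        messages=[{
            "role": "user",
            "content": (
                f"Extract {', '.join(FIELDS)} from this patient record. "
                f"Reply with a JSON object using exactly those keys.\n\n{input_text}"
            ),
        }],
    )
    return json.loads(resp.choices[0].message.content)
```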

Steps

  1. Search for Dataset

    First, open Hyperparam Chat and ask: “find me anonymized patient data with medical charting”

    The first result, ‘chunked-ehr/0000’, will work for this example; click it to open it in the data viewer

  2. Extract structured fields using chat

    Open any cell from the ‘input_text’ column containing patient information

    Scroll down to view the full unstructured text for an individual chart

    Use chat to request extraction: “extract age, diagnosis, symptoms, comorbidities, treatments, outcome from input_text”

    Hyperparam will create 6 new columns and populate them with the extractions

    Columns appear as: extracted_age, diagnosis, symptoms, comorbidities, treatments, outcomes

    Scroll down and you will see Hyperparam filling out all 53k+ rows

  3. Edit data in a workspace sample

    Open the file in a workspace (creates a 100-row random sample)

    This forces all generated columns to load for faster iteration

  4. Apply filters

    Add filter by age: age > 50

    Add filter by diagnosis: diagnosis contains “respiratory”

    Add filter by symptoms: symptoms contains “fever”

    Applied filters reduce the sample from 100 rows to 2 matching patients (the same filters are sketched in code after these steps)

  5. Export filtered dataset

    Click export

    Select specific columns: subject_id, age, diagnosis, symptoms, comorbidities, treatments, outcome

    Enable “Apply current table filters”

    Set output filename: filtered_patients.parquet

    Click export to process the full 53,000+ row dataset with filters applied
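
For reference, the same filters and column selection expressed in pandas (the input filename is hypothetical, and the column names follow this example's extraction step):

```python
# Equivalent of steps 4-5 in pandas, assuming the extracted columns
# exist in a local parquet file with the names used in this example.
import pandas as pd

df = pd.read_parquet("chunked-ehr-extracted.parquet")  # hypothetical filename

mask = (
    (pd.to_numeric(df["age"], errors="coerce") > 50)
    & df["diagnosis"].str.contains("respiratory", case=False, na=False)
    & df["symptoms"].str.contains("fever", case=False, na=False)
)

columns = ["subject_id", "age", "diagnosis", "symptoms",
           "comorbidities", "treatments", "outcome"]

df.loc[mask, columns].to_parquet("filtered_patients.parquet", index=False)
```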

Expected Results

  • Extracted columns: Structured fields parsed from unstructured patient text
  • Filtered sample: 2 patients in the 100-row sample matching the criteria (age > 50, a respiratory diagnosis, fever symptom)
  • Final export: ~655 rows (roughly 1.2% of the 53k+ rows) with the selected columns only