Classifying Prompt Patterns in LLM Logs
Use AI-generated columns to categorize unstructured fields in your logs and understand what your system is actually doing.
Overview
LLM logs often contain free-text system prompts that are hard to analyze at scale. In this example, we load a large conversation dataset (Orca), use Hyperparam's AI agent to classify each system prompt into a category, and review the distribution of prompt types across the dataset.
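To make the classification step concrete, here is a minimal sketch of what "categorize each system prompt" means as a function from free text to a label. The category names and keyword rules are illustrative assumptions, not the agent's actual taxonomy; in the real workflow an LLM performs this mapping.

```python
def classify_system_prompt(prompt: str) -> str:
    """Map a free-text system prompt to a coarse category.

    Keyword rules are a stand-in for the AI agent's judgment
    (assumed labels, not Hyperparam's actual output).
    """
    text = prompt.lower()
    if "teacher" in text or "explain" in text:
        return "Education/Tutor"
    if "question" in text or "answer" in text:
        return "Information Retrieval/QA"
    return "General Assistant"
```

Applied to every row, a function like this turns an unanalyzable free-text column into a categorical one that can be counted, sorted, and filtered.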

Steps
- Load the log dataset
- Examine the system prompts: Review the `system_prompt` column. These are the instructions each conversation was given, but as free text they're hard to analyze in aggregate.
- Classify with AI: In chat, request: "create a new column that categorizes the system prompt". The agent analyzes each system prompt and generates categories (e.g., "Education/Tutor", "General Assistant", "Information Retrieval/QA"), and a new `system_prompt_category` column appears.
- Review the distribution: Scroll through rows to verify category assignments. Sort or create a view to see which prompt patterns dominate your dataset.
- Export the results: Export the dataset with the new `system_prompt_category` column included. Export processes the full 100k+ row dataset.
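Outside the Hyperparam UI, the same load-classify-export pipeline could be sketched in pandas. The column names match this example; the classifier and the tiny inline table are placeholders for the AI agent and the 100k+ row Orca dataset.

```python
import pandas as pd

# Tiny stand-in for the Orca log table; the real dataset has 100k+ rows.
df = pd.DataFrame({
    "system_prompt": [
        "You are a teacher. Explain step by step.",
        "You are a helpful assistant.",
        "Answer the user's question concisely.",
    ]
})

def classify(prompt: str) -> str:
    # Placeholder for the AI agent's per-row classification.
    text = prompt.lower()
    if "teacher" in text or "explain" in text:
        return "Education/Tutor"
    if "question" in text or "answer" in text:
        return "Information Retrieval/QA"
    return "General Assistant"

# Add the categorical column, then export with it included.
df["system_prompt_category"] = df["system_prompt"].map(classify)
df.to_csv("classified_logs.csv", index=False)
```

The export keeps both the original `system_prompt` and the new `system_prompt_category`, so downstream tools can group or filter on the category while retaining the raw text.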
Expected Results
- Categorical column: `system_prompt_category` classifying each system prompt
- Pattern visibility: see which prompt types dominate and which are rare
- Actionable metadata: filter to specific prompt categories, compare behavior across categories, or identify prompt patterns that correlate with failures
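Once the categorical column exists, the distribution and filtering steps above reduce to standard dataframe operations. A sketch with hypothetical classified rows (labels mirror the example categories):

```python
import pandas as pd

# Hypothetical classified rows; labels mirror the example categories.
df = pd.DataFrame({
    "system_prompt_category": [
        "General Assistant", "Education/Tutor", "General Assistant",
        "Information Retrieval/QA", "General Assistant",
    ]
})

# Pattern visibility: which prompt types dominate.
counts = df["system_prompt_category"].value_counts()

# Actionable metadata: filter to one category for closer review.
tutors = df[df["system_prompt_category"] == "Education/Tutor"]
```

The same filter, joined against an outcome column such as a failure flag, is how prompt patterns that correlate with failures would be surfaced.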
Other Use Cases
- Dataset Discovery - Use natural language to search and discover datasets
- Patient Data Workflow - Extract, filter, and export structured medical data
- Quality Filtering - Remove low-quality responses from datasets
- Deep Research - Multi-step AI workflow for dataset research and model comparison