Data Transformation: Categorizing System Prompts
Derive categorical data from unstructured text fields to better understand dataset composition.
Overview
Starting with a large conversation dataset (Orca), analyze system prompts and create a categorical column to classify prompt types.

Steps
- Load the conversation dataset
- Examine the data
Review the system_prompt column containing system prompts
> Note: Observe the different kinds of system prompts
- Create categorical transformation
In chat, request: "create a new column that categorizes the system prompt"
> Note: Hyperparam analyzes system prompts and generates categories (e.g., "Education/Tutor", "General Assistant", "Information retrieval/QA")
New column system_prompt_category appears with classifications
- Review categorization
Scroll through rows to verify category assignments
> Note: Categories help identify prompt patterns across the dataset
- Export categorized dataset
Click export
Select columns including system_prompt_category
Export processes full 100k+ dataset
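To make the transformation concrete, the categorization step can be approximated offline with a few keyword rules. This is a minimal Python sketch and not Hyperparam's method, which is model-driven and not specified here. The function names, the keyword lists, and the sample rows are all hypothetical; only the category labels and the column names come from the steps above.

```python
# Hypothetical keyword rules for illustration only.
# Hyperparam generates categories with a model; this sketch uses substring matching.
CATEGORY_KEYWORDS = [
    ("Education/Tutor", ("teacher", "tutor", "explain", "five year old")),
    ("Information retrieval/QA", ("answer the question", "search", "find information")),
]
DEFAULT_CATEGORY = "General Assistant"


def categorize_system_prompt(prompt: str) -> str:
    """Return the first category whose keywords appear in the prompt."""
    text = prompt.lower()
    for category, keywords in CATEGORY_KEYWORDS:
        if any(keyword in text for keyword in keywords):
            return category
    return DEFAULT_CATEGORY


def add_category_column(rows):
    """Return copies of the rows with a system_prompt_category field added."""
    return [
        {
            **row,
            "system_prompt_category": categorize_system_prompt(
                row.get("system_prompt", "")
            ),
        }
        for row in rows
    ]


# Made-up sample rows standing in for the loaded dataset.
rows = [
    {"system_prompt": "You are a teacher. Explain the answer step by step."},
    {"system_prompt": "You are an AI assistant. Help the user as much as you can."},
    {"system_prompt": "Answer the question using the provided passage."},
]

categorized = add_category_column(rows)
for row in categorized:
    print(row["system_prompt_category"], "|", row["system_prompt"])
```

Rules are checked in order, so a prompt that matches several categories receives the first one listed. A rule-based pass like this is mainly useful as a sanity check against the generated column during the review step.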
Expected Results
- New categorical column: system_prompt_category classifying each system prompt
- Pattern insights: Understand distribution of prompt types in the dataset
- Enhanced metadata: Enables filtering and analysis by prompt category
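Once the exported file contains the new column, the distribution and filtering described above can be checked with standard-library Python. The sketch below assumes the export has already been loaded as a list of dictionaries. The helper names and the sample rows are hypothetical, and the rows are made up rather than taken from the dataset.

```python
from collections import Counter


def category_distribution(rows):
    """Return each category's share of the rows, most common first."""
    counts = Counter(row["system_prompt_category"] for row in rows)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.most_common()}


def filter_by_category(rows, category):
    """Keep only the rows assigned to the given category."""
    return [row for row in rows if row["system_prompt_category"] == category]


# Made-up rows standing in for an exported, categorized dataset.
exported_rows = [
    {"system_prompt": "You are an AI assistant.", "system_prompt_category": "General Assistant"},
    {"system_prompt": "You are a helpful assistant.", "system_prompt_category": "General Assistant"},
    {"system_prompt": "You are a teacher.", "system_prompt_category": "Education/Tutor"},
    {"system_prompt": "Answer using the passage.", "system_prompt_category": "Information retrieval/QA"},
]

distribution = category_distribution(exported_rows)
tutor_rows = filter_by_category(exported_rows, "Education/Tutor")

for category, share in distribution.items():
    print(f"{category}: {share:.0%}")
```

Working with shares rather than raw counts makes it easier to compare the composition of samples of different sizes.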
Other Use Cases
- Dataset Discovery - Use natural language to search and discover datasets
- Patient Data Workflow - Extract, filter, and export structured medical data
- Quality Filtering - Remove low-quality responses from datasets
- Deep Research - Multi-step AI workflow for dataset research and model comparison