Data Transformation: Categorizing System Prompts

Derive categorical data from unstructured text fields to better understand dataset composition.

Overview

Starting with a large conversation dataset (OpenOrca), analyze system prompts and create a categorical column to classify prompt types.

Demo showing data transformation in Hyperparam

Steps

  1. Load the conversation dataset

    Open OpenOrca/partial-train/0000

  2. Examine the data

    Review the system_prompt column, which holds the system prompt for each row

    > Note: Notice the different kinds of system prompts

  3. Create categorical transformation

    In chat, request: "create a new column that categorizes the system prompt"

    > Note: Hyperparam analyzes system prompts and generates categories (e.g., "Education/Tutor", "General Assistant", "Information retrieval/QA")

    A new column, system_prompt_category, appears with a classification for each row

  4. Review categorization

    Scroll through rows to verify category assignments

    > Note: Categories help identify prompt patterns across the dataset

  5. Export categorized dataset

    Click export

    Select columns including system_prompt_category

    The export processes the full dataset (100k+ rows), not only the rows previewed on screen
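
For readers who want to see the shape of this transformation outside the chat interface, here is a minimal sketch in Python. It does not reproduce how Hyperparam derives its categories, which this page does not describe. The keyword rules, the `categorize_system_prompt` and `add_category_column` helpers, and the sample rows are all hypothetical and exist only to illustrate deriving a system_prompt_category value from system_prompt text.

```python
# Illustrative only: simple keyword rules stand in for whatever
# classification Hyperparam performs. Rules and sample rows are assumptions.

# (category, keywords) pairs checked in order; the first match wins.
CATEGORY_RULES = [
    ("Education/Tutor", ("teacher", "explain", "five year old", "step-by-step")),
    ("Information retrieval/QA", ("answer the question", "find information", "search")),
]
DEFAULT_CATEGORY = "General Assistant"


def categorize_system_prompt(prompt: str) -> str:
    """Return a category label for one system prompt (hypothetical helper)."""
    text = prompt.lower()
    for category, keywords in CATEGORY_RULES:
        if any(keyword in text for keyword in keywords):
            return category
    return DEFAULT_CATEGORY


def add_category_column(rows: list[dict]) -> list[dict]:
    """Return copies of the rows with a system_prompt_category field added."""
    return [
        {
            **row,
            # Treat a missing or None prompt as an empty string.
            "system_prompt_category": categorize_system_prompt(row.get("system_prompt") or ""),
        }
        for row in rows
    ]


# Made-up rows; a real dataset would be read from the source file instead.
sample_rows = [
    {"id": "a", "system_prompt": "You are a teacher. Explain things simply."},
    {"id": "b", "system_prompt": "You are an AI assistant that helps people find information."},
    {"id": "c", "system_prompt": "You are a helpful assistant."},
    {"id": "d", "system_prompt": ""},
]

categorized = add_category_column(sample_rows)
for row in categorized:
    print(row["id"], row["system_prompt_category"])
```

Keyword rules are brittle compared with a model-based classifier, so spot-checking the assigned labels (step 4) matters with either approach.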

Expected Results

  • New categorical column: system_prompt_category classifying each system prompt
  • Pattern insights: Understand distribution of prompt types in the dataset
  • Enhanced metadata: Enables filtering and analysis by prompt category
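
Once the category column exists, the distribution and filtering described above take only a few lines of code. The sketch below uses Python's standard library on made-up rows that stand in for an exported dataset. The row contents and counts are assumptions for illustration, not results from OpenOrca.

```python
from collections import Counter

# Made-up rows standing in for an exported dataset that already contains
# the system_prompt_category column; the values are assumptions.
rows = [
    {"id": "a", "system_prompt_category": "Education/Tutor"},
    {"id": "b", "system_prompt_category": "General Assistant"},
    {"id": "c", "system_prompt_category": "General Assistant"},
    {"id": "d", "system_prompt_category": "Information retrieval/QA"},
]

# Distribution of prompt types: count rows per category.
category_counts = Counter(row["system_prompt_category"] for row in rows)
total = len(rows)
for category, count in category_counts.most_common():
    print(f"{category}: {count} ({count / total:.0%})")

# Filtering by category: keep only the rows with one label.
tutor_rows = [row for row in rows if row["system_prompt_category"] == "Education/Tutor"]
print([row["id"] for row in tutor_rows])
```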
