Quality Filtering: Removing Sycophantic Responses

Filter out low-quality, overly agreeable responses from a chat log dataset using LLM-generated quality scores.

Overview

Starting with a 200k-row chat log dataset (ultrachat_200k), generate a sycophancy score for each conversation, then filter out highly sycophantic responses.

[Demo: removing sycophantic responses in Hyperparam]

Steps

  1. Load the dataset

    Open ultrachat_200k/train_gen/0000

  2. Generate sycophancy scores

    Open the chat panel on the right-hand side

    In the chat, request: "add a 0-1 sycophancy score for each row"

    > Note: Hyperparam analyzes each conversation and creates a sycophancy_score column (0.0 = authentic, 1.0 = highly sycophantic).

  3. View sorted sample

    Select either the first rows or a random sample to view 100 rows of the dataset; the sample can then be sorted by the generated column

    Click the column header to sort the table by ascending or descending sycophancy

  4. Apply filter

    Add filter: sycophancy_score < 0.2

    > Note: Keeps only responses with low sycophancy (authentic, non-pandering).

  5. Export filtered dataset

    Click export

    Enable "Apply current table filters"

    Set output filename (e.g., ultrachat_200k_filtered.parquet)

    > Note: The export processes the full dataset with the filter applied.
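
For readers who want to reproduce steps 4 and 5 outside the Hyperparam UI, the filter and export can be approximated with pandas. This is a minimal sketch, not Hyperparam's implementation: it assumes the scored data is already in a DataFrame with a numeric sycophancy_score column, and the function names filter_low_sycophancy and export_filtered are hypothetical.

```python
import pandas as pd


def filter_low_sycophancy(
    df: pd.DataFrame,
    threshold: float = 0.2,
    column: str = "sycophancy_score",
) -> pd.DataFrame:
    """Keep rows whose score is strictly below the threshold (step 4)."""
    if column not in df.columns:
        raise KeyError(f"missing column: {column}")
    # Rows with a missing score compare as False and are therefore dropped.
    return df[df[column] < threshold].reset_index(drop=True)


def export_filtered(df: pd.DataFrame, path: str) -> None:
    """Write the filtered rows to a Parquet file (step 5).

    Requires a Parquet engine such as pyarrow or fastparquet.
    """
    df.to_parquet(path, index=False)
```

A typical call would be export_filtered(filter_low_sycophancy(df), "ultrachat_200k_filtered.parquet"). The comparison is strict, so a row scored exactly 0.2 is removed, matching the filter sycophancy_score < 0.2.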

Expected Results

  • Generated column: sycophancy_score, rating each conversation from 0.0 (authentic) to 1.0 (highly sycophantic)
  • Filtered dataset: Only rows with sycophancy_score < 0.2, removing overly agreeable responses
  • Output: Cleaned Parquet file ready for training or further analysis
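
Before relying on the exported file, it can help to check how much of the dataset a given threshold retains. The following is a small sketch in plain Python, assuming the scores have been read into a list with None for any unscored rows; retention_summary is a hypothetical helper, not a Hyperparam feature.

```python
from typing import Iterable, Optional


def retention_summary(
    scores: Iterable[Optional[float]], threshold: float = 0.2
) -> dict:
    """Count how many rows would survive the filter `score < threshold`."""
    total = kept = missing = 0
    for score in scores:
        total += 1
        if score is None:
            missing += 1  # unscored rows never pass the filter
        elif score < threshold:
            kept += 1
    return {
        "total": total,
        "kept": kept,
        "dropped": total - kept,
        "missing": missing,
        "kept_fraction": kept / total if total else 0.0,
    }
```

If kept_fraction turns out very small or very large, the 0.2 cutoff may be worth revisiting against the sorted sample from step 3.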

