Data Sources

This guide covers all supported data sources in Hyperparam, how to connect to them, and best practices for each source type.

Overview

Supported Sources

| Source       | Access Method | Authentication | Streaming |
|--------------|---------------|----------------|-----------|
| Local Files  | Drag & Drop   | None           | No        |
| Direct URLs  | Paste/Click   | None/Token     | Yes       |
| Hugging Face | Search/Browse | None/Token     | Yes       |
| AWS S3       | S3 URLs       | Public/Signed  | Yes       |
| Google Cloud | GCS URLs      | Public/Signed  | Yes       |
| Azure Blob   | Blob URLs     | Public/SAS     | Yes       |

Local Files

Drag and Drop

Dragging and dropping a file into Hyperparam is the simplest method for small files.

Supported formats:
- .parquet (recommended)
- .txt (limited support)
- .csv (coming soon)
- .json (coming soon)
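Before dropping a file, it can be worth confirming it really is Parquet. A minimal local sanity check (the helper name is illustrative, not part of Hyperparam): Parquet files begin and end with the 4-byte magic PAR1.

```python
from pathlib import Path

def looks_like_parquet(path: str) -> bool:
    """Heuristic check: a valid Parquet file starts and ends with the
    4-byte magic b'PAR1', and is at least 12 bytes long (magic,
    4-byte footer length, magic)."""
    data = Path(path).read_bytes()
    return len(data) >= 12 and data[:4] == b"PAR1" and data[-4:] == b"PAR1"
```

This only checks the file framing, not schema validity, but it catches the common case of a misnamed CSV or JSON file.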

Behavior Differences

| User State | Action    | Result                |
|------------|-----------|-----------------------|
| Signed Out | Drop file | Process locally only  |
| Signed In  | Drop file | Upload to storage     |

Direct URLs

Public URLs

Any publicly accessible URL:

https://example.com/data.parquet
https://cdn.example.com/dataset.parquet
http://public-bucket.s3.amazonaws.com/file.parquet

URL Requirements

  • Must be a direct link to the file (not a landing page)
  • No authentication required
  • Server should support HTTP range requests (enables streaming)
  • CORS headers must permit cross-origin reads

Testing URL Accessibility

# Test if URL supports range requests
curl -I -H "Range: bytes=0-1000" https://example.com/data.parquet

# Look for:
# Accept-Ranges: bytes
# Content-Range: bytes 0-1000/...
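The same check can be scripted. A small sketch (function name is illustrative) that interprets the response headers from a probe like the curl call above:

```python
def supports_range_requests(headers: dict) -> bool:
    """Given response headers from a HEAD or Range probe, decide whether
    the server honors byte-range requests. Either an explicit
    'Accept-Ranges: bytes' header or a 'Content-Range: bytes ...' reply
    (from a 206 Partial Content response) is a positive signal."""
    h = {k.lower(): v for k, v in headers.items()}
    if h.get("accept-ranges", "").lower() == "bytes":
        return True
    return h.get("content-range", "").startswith("bytes ")
```

Absence of both headers usually means the server will only serve full downloads.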

Hugging Face Datasets

Discovery via Chat

The most powerful discovery method is asking in chat:

"Find conversation datasets with quality metrics"
"Show me code generation datasets over 1M examples"
"Search for multilingual instruction datasets"

Direct URLs

Format:

https://huggingface.co/datasets/{org}/{dataset}/resolve/main/{file}

Example:
https://huggingface.co/datasets/wikipedia/resolve/main/data/train-00000-of-00001.parquet
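Assembling resolve URLs by hand is error-prone; a small helper (illustrative, not part of any SDK) that builds the format above:

```python
def hf_resolve_url(repo: str, filename: str, revision: str = "main") -> str:
    """Build a Hugging Face dataset 'resolve' URL.
    repo is '{org}/{dataset}' (or just '{dataset}' for unnamespaced repos);
    filename is the path within the repo."""
    return f"https://huggingface.co/datasets/{repo}/resolve/{revision}/{filename}"
```

Passing a commit hash or tag as revision pins the URL to a specific dataset version.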

Hugging Face Features

  • Automatic dataset discovery
  • Preview before loading
  • Dataset cards and metadata
  • Version control
  • Community ratings

Authentication (Optional)

For private datasets:

// Coming soon: HF token support
const url = "https://huggingface.co/datasets/private/dataset";
const token = "hf_xxxxxxxxxxxx";

AWS S3

Public S3 Buckets

Direct access to public data:

https://s3.amazonaws.com/bucket-name/path/to/file.parquet
https://bucket-name.s3.amazonaws.com/path/to/file.parquet
https://s3.{region}.amazonaws.com/bucket-name/file.parquet
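When you need to move between these URL forms and a bucket/key pair, parsing is straightforward. A sketch (helper name is illustrative) covering both path-style and virtual-hosted-style URLs:

```python
from urllib.parse import urlparse

def parse_s3_url(url: str) -> tuple[str, str]:
    """Split an S3 HTTP(S) URL into (bucket, key)."""
    parsed = urlparse(url)
    host = parsed.netloc
    path = parsed.path.lstrip("/")
    if host.startswith("s3.") or host.startswith("s3-"):
        # Path-style: https://s3[.region].amazonaws.com/bucket/key
        bucket, _, key = path.partition("/")
        return bucket, key
    # Virtual-hosted-style: https://bucket.s3[.region].amazonaws.com/key
    bucket = host.split(".s3")[0]
    return bucket, path
```

Note that bucket names containing ".s3" would confuse this heuristic; a production parser should be stricter.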

S3 Signed URLs

For private buckets:

import boto3
from botocore.exceptions import NoCredentialsError

def create_presigned_url(bucket, key, expiration=3600):
    s3_client = boto3.client('s3')
    try:
        response = s3_client.generate_presigned_url(
            'get_object',
            Params={'Bucket': bucket, 'Key': key},
            ExpiresIn=expiration
        )
        return response
    except NoCredentialsError:
        return None

# Generate URL valid for 1 hour
url = create_presigned_url('my-bucket', 'data/file.parquet')

S3 Best Practices

  1. Region Selection: Use closest region
  2. Bucket Settings: Enable transfer acceleration
  3. CORS Configuration:
{
    "CORSRules": [{
        "AllowedOrigins": ["https://hyperparam.app"],
        "AllowedMethods": ["GET", "HEAD"],
        "AllowedHeaders": ["*"],
        "ExposeHeaders": ["Content-Range", "Accept-Ranges"]
    }]
}

Google Cloud Storage

Public GCS URLs

Format:

https://storage.googleapis.com/bucket-name/path/to/file.parquet
https://storage.cloud.google.com/bucket-name/file.parquet

Signed URLs for GCS

from google.cloud import storage
import datetime

def generate_signed_url(bucket_name, blob_name):
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(blob_name)

    url = blob.generate_signed_url(
        version="v4",
        expiration=datetime.timedelta(hours=1),
        method="GET",
    )
    return url

# Generate a URL valid for 1 hour
url = generate_signed_url('my-bucket', 'data/file.parquet')

GCS Configuration

Enable CORS:

[
  {
    "origin": ["https://hyperparam.app"],
    "method": ["GET", "HEAD"],
    "responseHeader": ["Content-Range", "Accept-Ranges"],
    "maxAgeSeconds": 3600
  }
]

Azure Blob Storage

CORS Setup

<CorsRule>
    <AllowedOrigins>https://hyperparam.app</AllowedOrigins>
    <AllowedMethods>GET,HEAD</AllowedMethods>
    <MaxAgeInSeconds>3600</MaxAgeInSeconds>
    <ExposedHeaders>Content-Range,Accept-Ranges</ExposedHeaders>
    <AllowedHeaders>*</AllowedHeaders>
</CorsRule>

Private Data Sources

Upload Strategy

For sensitive data:

  1. Never use public URLs
  2. Generate time-limited signed URLs
  3. Restrict CORS to Hyperparam origin
  4. Monitor access logs
  5. Rotate credentials regularly

Troubleshooting

Common Issues

| Issue                 | Cause                      | Solution                |
|-----------------------|----------------------------|-------------------------|
| "Access Denied"       | Private bucket             | Use a signed URL        |
| "CORS Error"          | Missing headers            | Configure CORS          |
| "Slow Loading"        | Distant region             | Use a closer source     |
| "Range Not Supported" | Server lacks range support | Download the file fully |

Testing Connectivity

// Browser console test
// Browser console test
fetch('https://your-data-url.com/file.parquet', {
    method: 'HEAD',
    headers: {
        'Range': 'bytes=0-1000'
    }
}).then(response => {
    console.log('Status:', response.status);
    console.log('Accept-Ranges:', response.headers.get('accept-ranges'));
    console.log('Content-Range:', response.headers.get('content-range'));
});

Summary

Key points:

  • Multiple sources supported with streaming
  • URLs preferred over local files for large data
  • Cloud storage works seamlessly
  • Security via signed URLs
  • Performance varies by source location
  • CORS configuration required for some sources

Choose the right source for your use case and optimize for streaming performance!
