Data Sources

This guide covers all supported data sources in Hyperparam, how to connect to them, and best practices for each source type.

Overview

Supported Sources

| Source       | Access Method | Authentication | Streaming |
|--------------|---------------|----------------|-----------|
| Local Files  | Drag & Drop   | None           | No        |
| Direct URLs  | Paste/Click   | None/Token     | Yes       |
| Hugging Face | Search/Browse | None/Token     | Yes       |
| AWS S3       | S3 URLs       | Public/Signed  | Yes       |
| Google Cloud | GCS URLs      | Public/Signed  | Yes       |
| Azure Blob   | Blob URLs     | Public/SAS     | Yes       |

Local Files

Drag and Drop

Dragging files into the app is the simplest method for small files.

Supported formats:
- .parquet (recommended)
- .txt (limited support)
- .csv (coming soon)
- .json (coming soon)

Behavior Differences

| User State | Action    | Result               |
|------------|-----------|----------------------|
| Signed Out | Drop file | Process locally only |
| Signed In  | Drop file | Upload to storage    |

Direct URLs

Public URLs

Any publicly accessible URL:

https://example.com/data.parquet
https://cdn.example.com/dataset.parquet
http://public-bucket.s3.amazonaws.com/file.parquet

URL Requirements

  • Must be a direct link to the file (not a landing page)
  • No authentication required
  • Server must support HTTP range requests
  • CORS headers must be set correctly

Testing URL Accessibility

# Test if URL supports range requests
curl -I -H "Range: bytes=0-1000" https://example.com/data.parquet

# Look for:
# Accept-Ranges: bytes
# Content-Range: bytes 0-1000/...
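The same check can be scripted. A minimal sketch of interpreting the response to a ranged request (the header names are standard HTTP; exact server behavior varies):

```python
def supports_range_requests(status, headers):
    """Interpret a response to a GET request sent with a Range header.

    status: HTTP status code of the response.
    headers: dict of response headers (any key casing).
    """
    headers = {k.lower(): v for k, v in headers.items()}
    # 206 Partial Content with a Content-Range header is the definitive signal
    if status == 206 and "content-range" in headers:
        return True
    # Some servers advertise support via Accept-Ranges even on a full response
    return headers.get("accept-ranges", "").lower() == "bytes"

print(supports_range_requests(206, {"Content-Range": "bytes 0-1000/52428800"}))  # True
print(supports_range_requests(200, {}))  # False
```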

Hugging Face Datasets

Discovery via Chat

The most powerful method is to describe what you need in chat:

"Find conversation datasets with quality metrics"
"Show me code generation datasets over 1M examples"
"Search for multilingual instruction datasets"

Direct URLs

Format:

https://huggingface.co/datasets/{org}/{dataset}/resolve/main/{file}

Example:
https://huggingface.co/datasets/wikipedia/resolve/main/data/train-00000-of-00001.parquet
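The format above is plain string substitution, so it can be built programmatically. A sketch (the org and dataset names in the example are hypothetical):

```python
def hf_resolve_url(org, dataset, filename, revision="main"):
    """Build a dataset file URL in the resolve format shown above:
    /datasets/{org}/{dataset}/resolve/{revision}/{file}"""
    return f"https://huggingface.co/datasets/{org}/{dataset}/resolve/{revision}/{filename}"

# Hypothetical names, for illustration only
print(hf_resolve_url("my-org", "my-dataset", "data/train-00000-of-00001.parquet"))
```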

Hugging Face Features

  • Automatic dataset discovery
  • Preview before loading
  • Dataset cards and metadata
  • Version control
  • Community ratings

Authentication (Optional)

For private datasets:

// Coming soon: HF token support
const url = "https://huggingface.co/datasets/private/dataset";
const token = "hf_xxxxxxxxxxxx";

AWS S3

Public S3 Buckets

Direct access to public data:

https://s3.amazonaws.com/bucket-name/path/to/file.parquet
https://bucket-name.s3.amazonaws.com/path/to/file.parquet
https://s3.region.amazonaws.com/bucket-name/file.parquet
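These three layouts (path-style, virtual-hosted, and regional) can be generated from the bucket, key, and region. A minimal sketch, not tied to any AWS SDK:

```python
def s3_public_url(bucket, key, region=None, virtual_hosted=True):
    """Build a public S3 object URL in one of the formats shown above."""
    if virtual_hosted:
        # e.g. https://bucket-name.s3.amazonaws.com/path/to/file.parquet
        host = f"{bucket}.s3.{region}.amazonaws.com" if region else f"{bucket}.s3.amazonaws.com"
        return f"https://{host}/{key}"
    # Path-style, e.g. https://s3.region.amazonaws.com/bucket-name/file.parquet
    host = f"s3.{region}.amazonaws.com" if region else "s3.amazonaws.com"
    return f"https://{host}/{bucket}/{key}"

print(s3_public_url("bucket-name", "path/to/file.parquet"))
print(s3_public_url("bucket-name", "file.parquet", region="us-west-2", virtual_hosted=False))
```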

S3 Signed URLs

For private buckets:

import boto3
from botocore.exceptions import NoCredentialsError

def create_presigned_url(bucket, key, expiration=3600):
    s3_client = boto3.client('s3')
    try:
        response = s3_client.generate_presigned_url(
            'get_object',
            Params={'Bucket': bucket, 'Key': key},
            ExpiresIn=expiration
        )
        return response
    except NoCredentialsError:
        return None

# Generate URL valid for 1 hour
url = create_presigned_url('my-bucket', 'data/file.parquet')

S3 Best Practices

  1. Region Selection: Use the region closest to your users
  2. Bucket Settings: Enable S3 Transfer Acceleration for cross-region access
  3. CORS Configuration:
{
    "CORSRules": [{
        "AllowedOrigins": ["https://hyperparam.app"],
        "AllowedMethods": ["GET", "HEAD"],
        "AllowedHeaders": ["*"],
        "ExposeHeaders": ["Content-Range", "Accept-Ranges"]
    }]
}
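One way to apply this policy is the AWS CLI (assuming the JSON above is saved as cors.json, the bucket name is yours, and credentials are configured):

```shell
# Apply the CORS policy above to the bucket
aws s3api put-bucket-cors \
    --bucket my-bucket \
    --cors-configuration file://cors.json

# Verify the policy was applied
aws s3api get-bucket-cors --bucket my-bucket
```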

Google Cloud Storage

Public GCS URLs

Format:

https://storage.googleapis.com/bucket-name/path/to/file.parquet
https://storage.cloud.google.com/bucket-name/file.parquet

Signed URLs for GCS

from google.cloud import storage
import datetime

def generate_signed_url(bucket_name, blob_name):
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(blob_name)

    url = blob.generate_signed_url(
        version="v4",
        expiration=datetime.timedelta(hours=1),
        method="GET",
    )
    return url

# Generate URL valid for 1 hour
url = generate_signed_url('my-bucket', 'data/file.parquet')

GCS Configuration

Enable CORS:

[
  {
    "origin": ["https://hyperparam.app"],
    "method": ["GET", "HEAD"],
    "responseHeader": ["Content-Range", "Accept-Ranges"],
    "maxAgeSeconds": 3600
  }
]

Azure Blob Storage

Azure CORS Setup

<CorsRule>
    <AllowedOrigins>https://hyperparam.app</AllowedOrigins>
    <AllowedMethods>GET,HEAD</AllowedMethods>
    <MaxAgeInSeconds>3600</MaxAgeInSeconds>
    <ExposedHeaders>Content-Range,Accept-Ranges</ExposedHeaders>
    <AllowedHeaders>*</AllowedHeaders>
</CorsRule>
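For private blobs, the overview above lists SAS-token authentication. A minimal sketch of assembling the URL (the account and container names are hypothetical, and the SAS token itself would come from the Azure SDK or portal):

```python
def azure_blob_url(account, container, blob, sas_token=None):
    """Build an Azure Blob Storage URL; append a SAS token for private blobs."""
    url = f"https://{account}.blob.core.windows.net/{container}/{blob}"
    if sas_token:
        # The SAS token is a query string; tolerate a leading '?'
        url += "?" + sas_token.lstrip("?")
    return url

# Hypothetical account/container names for illustration
print(azure_blob_url("myaccount", "data", "file.parquet"))
```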

Private Data Sources

Upload Strategy

For sensitive data:

  1. Never use public URLs
  2. Generate time-limited signed URLs
  3. Restrict to Hyperparam origin
  4. Monitor access logs
  5. Rotate credentials regularly

Security Checklist

  • [ ] Use HTTPS only
  • [ ] Set short expiration (1-24 hours)
  • [ ] Limit to GET operations
  • [ ] Restrict by IP if possible
  • [ ] Enable access logging
  • [ ] Use encryption at rest

Troubleshooting

Common Issues

| Issue                 | Cause           | Solution         |
|-----------------------|-----------------|------------------|
| "Access Denied"       | Private bucket  | Use signed URL   |
| "CORS Error"          | Missing headers | Configure CORS   |
| "Slow Loading"        | Far region      | Use closer source|
| "Range Not Supported" | Old server      | Download fully   |
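The table above can be folded into a small diagnostic helper. A sketch (the mapping mirrors the table; note that browser CORS errors surface as opaque network failures, not status codes):

```python
def diagnose(status, headers=None):
    """Map an HTTP response to the likely issue from the table above."""
    headers = {k.lower(): v for k, v in (headers or {}).items()}
    if status in (401, 403):
        return "Access denied: private bucket? Use a signed URL."
    if status == 200 and headers.get("accept-ranges", "").lower() != "bytes":
        return "Range requests may be unsupported: file must be downloaded fully."
    if status in (200, 206):
        return "OK"
    return f"Unexpected status {status}: check the URL and CORS configuration."

print(diagnose(403))
print(diagnose(206, {"Accept-Ranges": "bytes"}))  # OK
```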

Testing Connectivity

// Browser console test
fetch('https://your-data-url.com/file.parquet', {
    headers: {
        'Range': 'bytes=0-0'
    }
}).then(response => {
    // 206 Partial Content means range requests are supported
    console.log('Status:', response.status);
    console.log('Content-Range:', response.headers.get('content-range'));
    console.log('Accept-Ranges:', response.headers.get('accept-ranges'));
});

Summary

Key points:

  • Multiple sources supported with streaming
  • URLs preferred over local files for large data
  • Cloud storage works seamlessly
  • Security via signed URLs
  • Performance varies by source location
  • CORS configuration required for some sources

Choose the right source for your use case and optimize for streaming performance!
