Data Sources
This guide covers all supported data sources in Hyperparam, how to connect to them, and best practices for each source type.
Overview
Supported Sources
| Source | Access Method | Authentication | Streaming |
|---|---|---|---|
| Local Files | Drag & Drop | None | No |
| Direct URLs | Paste/Click | None/Token | Yes |
| Hugging Face | Search/Browse | None/Token | Yes |
| AWS S3 | S3 URLs | Public/Signed | Yes |
| Google Cloud | GCS URLs | Public/Signed | Yes |
| Azure Blob | Blob URLs | Public/SAS | Yes |
Local Files
Drag and Drop
Dragging a file into the app is the simplest way to load small files.
Supported formats:
- .parquet (recommended)
- .txt (limited support)
- .csv (coming soon)
- .json (coming soon)
Behavior Differences
| User State | Action | Result |
|---|---|---|
| Signed Out | Drop file | Processed locally only |
| Signed In | Drop file | Uploaded to storage |
Direct URLs
Public URLs
Any publicly accessible URL:
https://example.com/data.parquet
https://cdn.example.com/dataset.parquet
http://public-bucket.s3.amazonaws.com/file.parquet
URL Requirements
- Must be a direct link to the file
- Must not require authentication
- Should support range requests (needed for streaming)
- Must send CORS headers that allow the Hyperparam origin
Testing URL Accessibility
# Test if URL supports range requests
curl -I -H "Range: bytes=0-1000" https://example.com/data.parquet
# Look for:
# Accept-Ranges: bytes
# Content-Range: bytes 0-1000/...
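For a scripted check, here is a rough Python equivalent of the curl test above (assuming the requests package; the URL is a placeholder):
import requests

# Ask for only the first 1000 bytes; a server that honors range
# requests responds with 206 Partial Content.
resp = requests.get(
    "https://example.com/data.parquet",
    headers={"Range": "bytes=0-1000"},
)
print(resp.status_code)                    # 206 if ranges are honored, 200 if the whole file came back
print(resp.headers.get("Accept-Ranges"))   # expect "bytes"
print(resp.headers.get("Content-Range"))   # expect "bytes 0-1000/..."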
Hugging Face Datasets
Discovery via Chat
The most powerful method is to describe what you need in chat:
"Find conversation datasets with quality metrics"
"Show me code generation datasets over 1M examples"
"Search for multilingual instruction datasets"Direct URLs
Format:
https://huggingface.co/datasets/{org}/{dataset}/resolve/main/{file}
Example:
https://huggingface.co/datasets/wikipedia/resolve/main/data/train-00000-of-00001.parquet
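As a quick sketch, a public resolve URL like this one can also be read directly in Python (assuming pandas with fsspec installed; note this downloads the full file rather than streaming it):
import pandas as pd

# Read a public Hugging Face parquet file straight from its resolve URL.
url = (
    "https://huggingface.co/datasets/wikipedia/resolve/main/"
    "data/train-00000-of-00001.parquet"
)
df = pd.read_parquet(url)
print(df.shape)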
Hugging Face Features
- Automatic dataset discovery
- Preview before loading
- Dataset cards and metadata
- Version control
- Community ratings
Authentication (Optional)
For private datasets:
// Coming soon: HF token support
const url = "https://huggingface.co/datasets/private/dataset";
const token = "hf_xxxxxxxxxxxx";
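Until in-app token support ships, private files can still be fetched outside Hyperparam with a standard bearer token; a minimal sketch (URL and token are placeholders):
import requests

# Hugging Face resolve URLs accept a bearer token for private datasets.
url = "https://huggingface.co/datasets/private/dataset/resolve/main/data.parquet"
headers = {"Authorization": "Bearer hf_xxxxxxxxxxxx"}
resp = requests.get(url, headers=headers)
resp.raise_for_status()
with open("data.parquet", "wb") as f:
    f.write(resp.content)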
AWS S3
Public S3 Buckets
Direct access to public data:
https://s3.amazonaws.com/bucket-name/path/to/file.parquet
https://bucket-name.s3.amazonaws.com/path/to/file.parquet
https://s3.region.amazonaws.com/bucket-name/file.parquet
S3 Signed URLs
For private buckets:
import boto3
from botocore.exceptions import NoCredentialsError

def create_presigned_url(bucket, key, expiration=3600):
    s3_client = boto3.client('s3')
    try:
        response = s3_client.generate_presigned_url(
            'get_object',
            Params={'Bucket': bucket, 'Key': key},
            ExpiresIn=expiration
        )
        return response
    except NoCredentialsError:
        return None

# Generate URL valid for 1 hour
url = create_presigned_url('my-bucket', 'data/file.parquet')
S3 Best Practices
- Region Selection: Use the region closest to you to minimize latency
- Bucket Settings: Enable transfer acceleration
- CORS Configuration:
{
  "CORSRules": [{
    "AllowedOrigins": ["https://hyperparam.app"],
    "AllowedMethods": ["GET", "HEAD"],
    "AllowedHeaders": ["*"],
    "ExposeHeaders": ["Content-Range", "Accept-Ranges"]
  }]
}
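The same rules can be applied programmatically; a sketch using boto3 (the bucket name is a placeholder):
import boto3

# Apply the CORS rules shown above to the bucket.
s3 = boto3.client("s3")
s3.put_bucket_cors(
    Bucket="my-bucket",
    CORSConfiguration={
        "CORSRules": [{
            "AllowedOrigins": ["https://hyperparam.app"],
            "AllowedMethods": ["GET", "HEAD"],
            "AllowedHeaders": ["*"],
            "ExposeHeaders": ["Content-Range", "Accept-Ranges"],
        }]
    },
)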
Google Cloud Storage
Public GCS URLs
Format:
https://storage.googleapis.com/bucket-name/path/to/file.parquet
https://storage.cloud.google.com/bucket-name/file.parquet
Signed URLs for GCS
from google.cloud import storage
import datetime

def generate_signed_url(bucket_name, blob_name):
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    url = blob.generate_signed_url(
        version="v4",
        expiration=datetime.timedelta(hours=1),
        method="GET",
    )
    return url
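Usage mirrors the S3 helper above (bucket and blob names are placeholders):
# Generate URL valid for 1 hour
url = generate_signed_url("my-bucket", "data/file.parquet")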
GCS Configuration
Enable CORS:
[
  {
    "origin": ["https://hyperparam.app"],
    "method": ["GET", "HEAD"],
    "responseHeader": ["Content-Range", "Accept-Ranges"],
    "maxAgeSeconds": 3600
  }
]
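The same configuration can be applied with the google-cloud-storage client; a sketch with a placeholder bucket name:
from google.cloud import storage

# Set the CORS entries shown above on the bucket.
client = storage.Client()
bucket = client.get_bucket("my-bucket")
bucket.cors = [
    {
        "origin": ["https://hyperparam.app"],
        "method": ["GET", "HEAD"],
        "responseHeader": ["Content-Range", "Accept-Ranges"],
        "maxAgeSeconds": 3600,
    }
]
bucket.patch()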
Azure CORS Setup
<CorsRule>
  <AllowedOrigins>https://hyperparam.app</AllowedOrigins>
  <AllowedMethods>GET,HEAD</AllowedMethods>
  <MaxAgeInSeconds>3600</MaxAgeInSeconds>
  <ExposedHeaders>Content-Range,Accept-Ranges</ExposedHeaders>
  <AllowedHeaders>*</AllowedHeaders>
</CorsRule>
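Azure's counterpart to a signed URL is a SAS token; a minimal sketch using the azure-storage-blob package (account, container, blob, and key are placeholders):
import datetime
from azure.storage.blob import BlobSasPermissions, generate_blob_sas

# Build a read-only SAS URL valid for one hour.
sas = generate_blob_sas(
    account_name="myaccount",
    container_name="my-container",
    blob_name="data/file.parquet",
    account_key="ACCOUNT_KEY",
    permission=BlobSasPermissions(read=True),
    expiry=datetime.datetime.utcnow() + datetime.timedelta(hours=1),
)
url = f"https://myaccount.blob.core.windows.net/my-container/data/file.parquet?{sas}"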
Private Data Sources
Upload Strategy
For sensitive data:
- Never use public URLs
- Generate time-limited signed URLs
- Restrict to Hyperparam origin
- Monitor access logs
- Rotate credentials regularly
Security Checklist
- [ ] Use HTTPS only
- [ ] Set short expiration (1-24 hours)
- [ ] Limit to GET operations
- [ ] Restrict by IP if possible
- [ ] Enable access logging
- [ ] Use encryption at rest
Troubleshooting
Common Issues
| Issue | Cause | Solution |
|---|---|---|
| "Access Denied" | Private bucket | Use a signed URL |
| "CORS Error" | Missing CORS headers | Configure CORS on the bucket |
| "Slow Loading" | Distant region | Use a source closer to you |
| "Range Not Supported" | Server doesn't support range requests | Download the file and load it locally |
Testing Connectivity
// Browser console test
fetch('https://your-data-url.com/file.parquet', {
  method: 'HEAD',
  headers: {
    'Range': 'bytes=0-1000'
  }
}).then(response => {
  console.log('Status:', response.status);
  console.log('Accept-Ranges:', response.headers.get('accept-ranges'));
  console.log('Content-Range:', response.headers.get('content-range'));
});
Summary
Key points:
- Multiple sources supported with streaming
- URLs preferred over local files for large data
- Cloud storage (S3, GCS, Azure Blob) works via public or signed URLs
- Secure private data with short-lived signed URLs
- Performance varies by source location
- CORS configuration required for some sources
Choose the right source for your use case and optimize for streaming performance!