Datasets are the foundation of successful AI training. TensorOne’s dataset management system provides secure upload, validation, preprocessing, and versioning capabilities for all types of training data.
Upload Dataset
Create a new dataset for training. Supported data types include text, image, audio, video, structured, and multimodal data; the actual files are uploaded in a second step using the returned upload URL.
Required Parameters
- name: Human-readable name for the dataset (1-100 characters)
- type: Dataset type (text, image, audio, video, structured, multimodal)
- format: Data format (jsonl, csv, parquet, hdf5, zip, tar)
Optional Parameters
- description: Description of the dataset
- tags: Array of tags for organization
- validation: Validation configuration object
- preprocessing: Preprocessing configuration object
- metadata: Additional metadata object
Example Usage
Upload Text Dataset for Language Model Training
curl -X POST "https://api.tensorone.ai/v2/training/datasets" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "custom-instruction-dataset",
"type": "text",
"format": "jsonl",
"description": "Custom instruction-response pairs for fine-tuning",
"validation": {
"required_fields": ["instruction", "response"],
"max_sequence_length": 2048,
"min_examples": 100
},
"preprocessing": {
"tokenizer": "meta-llama/Llama-2-7b-hf",
"add_special_tokens": true,
"truncation": true,
"padding": "max_length"
},
"metadata": {
"source": "customer_conversations",
"language": "en",
"domain": "customer_support"
}
}'
Upload Image Dataset for Computer Vision
curl -X POST "https://api.tensorone.ai/v2/training/datasets" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "product-classification-images",
"type": "image",
"format": "zip",
"description": "Product images with category labels",
"validation": {
"supported_formats": ["jpg", "png", "webp"],
"min_resolution": [224, 224],
"max_file_size": "10MB",
"min_examples_per_class": 50
},
"preprocessing": {
"resize": [512, 512],
"normalize": true,
"augmentation": {
"horizontal_flip": 0.5,
"rotation": 15,
"brightness": 0.2,
"contrast": 0.2
}
},
"metadata": {
"num_classes": 25,
"image_source": "product_catalog",
"annotation_format": "directory_structure"
}
}'
Upload Multimodal Dataset
curl -X POST "https://api.tensorone.ai/v2/training/datasets" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "vision-language-pairs",
"type": "multimodal",
"format": "jsonl",
"description": "Image-caption pairs for vision-language model training",
"validation": {
"required_fields": ["image_path", "caption"],
"image_formats": ["jpg", "png"],
"min_caption_length": 5,
"max_caption_length": 200
},
"preprocessing": {
"vision": {
"resize": [336, 336],
"normalize": true,
"center_crop": true
},
"text": {
"tokenizer": "openai/clip-vit-base-patch32",
"max_length": 77,
"truncation": true
}
}
}'
Response
Returns the created dataset object:
{
"id": "ds_1234567890abcdef",
"name": "custom-instruction-dataset",
"type": "text",
"format": "jsonl",
"status": "uploading",
"uploadUrl": "https://upload.tensorone.ai/datasets/ds_1234567890abcdef",
"uploadToken": "tok_upload_1234567890abcdef",
"validation": {
"required_fields": ["instruction", "response"],
"max_sequence_length": 2048,
"min_examples": 100
},
"size": {
"bytes": 0,
"examples": 0
},
"createdAt": "2024-01-15T10:30:00Z",
"updatedAt": "2024-01-15T10:30:00Z"
}
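To drive this endpoint from code, the sketch below uses Python's requests library together with the documented response fields (uploadUrl, uploadToken). It is a minimal illustration, not the official SDK (covered later under SDK Examples).

import requests

API_KEY = "YOUR_API_KEY"

# Create the dataset record; the response carries the upload URL and token
resp = requests.post(
    "https://api.tensorone.ai/v2/training/datasets",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"name": "custom-instruction-dataset", "type": "text", "format": "jsonl"},
)
resp.raise_for_status()
dataset = resp.json()
upload_url = dataset["uploadUrl"]      # where to PUT the data file
upload_token = dataset["uploadToken"]  # bearer token for the upload service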
Upload Data to Dataset
After creating a dataset, upload your data files using the provided upload URL and token.
Upload via Direct HTTP
curl -X PUT "https://upload.tensorone.ai/datasets/ds_1234567890abcdef" \
-H "Authorization: Bearer tok_upload_1234567890abcdef" \
-H "Content-Type: application/octet-stream" \
--data-binary @training_data.jsonl
Upload Large Files (Multipart)
For files larger than 100MB, use multipart upload:
# Initiate multipart upload
curl -X POST "https://upload.tensorone.ai/datasets/ds_1234567890abcdef/multipart" \
-H "Authorization: Bearer tok_upload_1234567890abcdef" \
-H "Content-Type: application/json" \
-d '{
"filename": "large_dataset.zip",
"size": 2147483648
}'
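After initiation, each part is uploaded and the upload is finalized. The per-part and completion endpoints in this sketch (.../multipart/parts/{n} and .../multipart/complete) are assumptions for illustration, not documented paths; check the upload service reference for the exact contract.

import requests

CHUNK_SIZE = 100 * 1024 * 1024  # split into 100MB parts
UPLOAD_URL = "https://upload.tensorone.ai/datasets/ds_1234567890abcdef"
HEADERS = {"Authorization": "Bearer tok_upload_1234567890abcdef"}

with open("large_dataset.zip", "rb") as f:
    part = 1
    while chunk := f.read(CHUNK_SIZE):
        # NOTE: the per-part path is an assumed endpoint, shown for illustration
        requests.put(
            f"{UPLOAD_URL}/multipart/parts/{part}", headers=HEADERS, data=chunk
        ).raise_for_status()
        part += 1

# NOTE: the completion endpoint is likewise an assumption
requests.post(f"{UPLOAD_URL}/multipart/complete", headers=HEADERS).raise_for_status()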
Get Dataset Details
Retrieve detailed information about a specific dataset.
curl -X GET "https://api.tensorone.ai/v2/training/datasets/ds_1234567890abcdef" \
-H "Authorization: Bearer YOUR_API_KEY"
Response
{
"id": "ds_1234567890abcdef",
"name": "custom-instruction-dataset",
"type": "text",
"format": "jsonl",
"status": "ready",
"description": "Custom instruction-response pairs for fine-tuning",
"size": {
"bytes": 52428800,
"examples": 10000,
"compressed": 15728640
},
"schema": {
"fields": [
{"name": "instruction", "type": "string", "nullable": false},
{"name": "response", "type": "string", "nullable": false},
{"name": "context", "type": "string", "nullable": true}
]
},
"statistics": {
"avg_instruction_length": 45.2,
"avg_response_length": 128.7,
"unique_instructions": 9847,
"language_distribution": {
"en": 0.92,
"es": 0.05,
"fr": 0.03
}
},
"validation": {
"status": "passed",
"checks": [
{"name": "required_fields", "status": "passed"},
{"name": "sequence_length", "status": "passed"},
{"name": "minimum_examples", "status": "passed"}
],
"warnings": [
"2 examples exceed recommended response length"
]
},
"preprocessing": {
"tokenizer": "meta-llama/Llama-2-7b-hf",
"total_tokens": 1567890,
"vocab_size": 32000
},
"versions": [
{
"id": "v1",
"createdAt": "2024-01-15T10:30:00Z",
"size": 52428800,
"checksum": "sha256:a1b2c3d4e5f6..."
}
],
"createdAt": "2024-01-15T10:30:00Z",
"updatedAt": "2024-01-15T11:45:00Z"
}
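The validation block in this payload is worth checking programmatically before launching a training job. A short sketch using requests and the field names shown above:

import requests

resp = requests.get(
    "https://api.tensorone.ai/v2/training/datasets/ds_1234567890abcdef",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
resp.raise_for_status()
details = resp.json()

# Print each validation check and surface any warnings
for check in details["validation"]["checks"]:
    print(f"{check['name']}: {check['status']}")
for warning in details["validation"].get("warnings", []):
    print(f"warning: {warning}")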
List Datasets
Retrieve a list of datasets for your account.
curl -X GET "https://api.tensorone.ai/v2/training/datasets" \
-H "Authorization: Bearer YOUR_API_KEY"
Query Parameters
- type: Filter by dataset type (text, image, audio, video, structured, multimodal)
- status: Filter by status (uploading, processing, ready, error)
- limit: Number of datasets to return (1-100, default: 50)
- offset: Number of datasets to skip for pagination
- sort: Sort field (created_at, updated_at, name, size)
- order: Sort direction (asc or desc, default: desc)
Response
{
"datasets": [
{
"id": "ds_1234567890abcdef",
"name": "custom-instruction-dataset",
"type": "text",
"status": "ready",
"size": {
"bytes": 52428800,
"examples": 10000
},
"createdAt": "2024-01-15T10:30:00Z",
"updatedAt": "2024-01-15T11:45:00Z"
}
],
"pagination": {
"total": 15,
"limit": 50,
"offset": 0,
"hasMore": false
}
}
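To walk the full collection, combine limit and offset with the hasMore flag from the pagination block. A minimal sketch:

import requests

HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}
offset, limit = 0, 50

while True:
    resp = requests.get(
        "https://api.tensorone.ai/v2/training/datasets",
        headers=HEADERS,
        params={"limit": limit, "offset": offset},
    )
    resp.raise_for_status()
    page = resp.json()
    for ds in page["datasets"]:
        print(ds["id"], ds["name"], ds["status"])
    if not page["pagination"]["hasMore"]:
        break
    offset += limit  # advance to the next page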
Update Dataset
Update dataset metadata and configuration.
curl -X PATCH "https://api.tensorone.ai/v2/training/datasets/ds_1234567890abcdef" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "updated-instruction-dataset",
"description": "Updated description with more context",
"tags": ["instruction-tuning", "customer-support", "v2"],
"metadata": {
"source": "customer_conversations",
"language": "en",
"domain": "customer_support",
"quality_score": 8.5
}
}'
Delete Dataset
Delete a dataset and all associated data.
curl -X DELETE "https://api.tensorone.ai/v2/training/datasets/ds_1234567890abcdef" \
-H "Authorization: Bearer YOUR_API_KEY"
Dataset Validation
Text Dataset Validation
{
"validation": {
"required_fields": ["instruction", "response"],
"optional_fields": ["context", "category"],
"max_sequence_length": 2048,
"min_sequence_length": 10,
"min_examples": 100,
"max_examples": 1000000,
"encoding": "utf-8",
"language_detection": true
}
}
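Running the same checks locally before uploading saves a round trip. The sketch below mirrors the configuration above on the client side; note it approximates max_sequence_length with a character count, since exact token counts depend on the tokenizer.

import json

REQUIRED_FIELDS = ["instruction", "response"]
MIN_LEN, MAX_LEN = 10, 2048  # character-count stand-in for sequence length

errors = []
with open("training_data.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f, 1):
        example = json.loads(line)
        for field in REQUIRED_FIELDS:
            if field not in example:
                errors.append(f"line {i}: missing required field '{field}'")
        text = example.get("instruction", "") + example.get("response", "")
        if not MIN_LEN <= len(text) <= MAX_LEN:
            errors.append(f"line {i}: length {len(text)} out of range")

print(f"{len(errors)} problems found")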
Image Dataset Validation
{
"validation": {
"supported_formats": ["jpg", "jpeg", "png", "webp"],
"min_resolution": [224, 224],
"max_resolution": [4096, 4096],
"max_file_size": "10MB",
"min_examples_per_class": 10,
"check_corruption": true,
"color_space": "RGB"
}
}
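An equivalent pre-upload pass for images can be done locally with Pillow; the thresholds below mirror the configuration above (a client-side illustration only):

from pathlib import Path
from PIL import Image

MIN_W, MIN_H = 224, 224
MAX_BYTES = 10 * 1024 * 1024  # 10MB
ALLOWED = {".jpg", ".jpeg", ".png", ".webp"}

for path in Path("dataset").rglob("*"):
    if not path.is_file() or path.suffix.lower() not in ALLOWED:
        continue
    if path.stat().st_size > MAX_BYTES:
        print(f"{path}: exceeds max file size")
        continue
    try:
        with Image.open(path) as img:
            width, height = img.size  # header is parsed on open
            img.verify()  # detects truncated or corrupt files
        if width < MIN_W or height < MIN_H:
            print(f"{path}: below minimum resolution {MIN_W}x{MIN_H}")
    except Exception as exc:
        print(f"{path}: unreadable ({exc})")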
SDK Examples
Python SDK
from tensorone import TensorOneClient
import time

client = TensorOneClient(api_key="YOUR_API_KEY")

# Create dataset
dataset = client.training.datasets.create(
    name="custom-instruction-dataset",
    type="text",
    format="jsonl",
    description="Custom instruction-response pairs",
    validation={
        "required_fields": ["instruction", "response"],
        "max_sequence_length": 2048,
        "min_examples": 100
    }
)

# Upload data
with open("training_data.jsonl", "rb") as f:
    client.training.datasets.upload(dataset.id, f)

# Poll until validation and preprocessing finish
# (a new dataset starts in "uploading", then moves to "processing")
while dataset.status in ("uploading", "processing"):
    time.sleep(10)
    dataset = client.training.datasets.get(dataset.id)
    print(f"Processing... {dataset.status}")

print(f"Dataset ready: {dataset.size.examples} examples")

# List datasets
datasets = client.training.datasets.list(type="text", status="ready")
for ds in datasets:
    print(f"{ds.name}: {ds.size.examples} examples")
JavaScript SDK
import { TensorOneClient } from '@tensorone/sdk';
import fs from 'fs';
const client = new TensorOneClient({ apiKey: 'YOUR_API_KEY' });
// Create dataset
const dataset = await client.training.datasets.create({
name: 'custom-instruction-dataset',
type: 'text',
format: 'jsonl',
description: 'Custom instruction-response pairs',
validation: {
requiredFields: ['instruction', 'response'],
maxSequenceLength: 2048,
minExamples: 100
}
});
// Upload data
const fileStream = fs.createReadStream('training_data.jsonl');
await client.training.datasets.upload(dataset.id, fileStream);
// Poll until the dataset leaves the uploading/processing states
const waitForReady = async (datasetId) => {
  const ds = await client.training.datasets.get(datasetId);
  if (ds.status === 'uploading' || ds.status === 'processing') {
    setTimeout(() => waitForReady(datasetId), 10000);
  } else if (ds.status === 'ready') {
    console.log(`Dataset ready: ${ds.size.examples} examples`);
  } else {
    console.error(`Dataset failed with status: ${ds.status}`);
  }
};
waitForReady(dataset.id);
Text Datasets
{"instruction": "What is machine learning?", "response": "Machine learning is a subset of artificial intelligence..."}
{"instruction": "Explain neural networks", "response": "Neural networks are computational models inspired by..."}
text,label
"This product is amazing!",positive
"Poor quality, not recommended",negative
Image Datasets
Directory Structure
dataset/
├── class1/
│ ├── image1.jpg
│ ├── image2.jpg
│ └── ...
├── class2/
│ ├── image1.jpg
│ ├── image2.jpg
│ └── ...
JSONL with Annotations
{"image_path": "images/001.jpg", "label": "cat", "bbox": [10, 20, 100, 150]}
{"image_path": "images/002.jpg", "label": "dog", "bbox": [15, 25, 120, 180]}
Error Handling
Common Errors
{
"error": "VALIDATION_FAILED",
"message": "Dataset validation failed",
"details": {
"field": "instruction",
"reason": "Required field missing in 15 examples",
"examples": [45, 67, 89, 123, 156]
}
}
{
"error": "UNSUPPORTED_FORMAT",
"message": "File format not supported",
"details": {
"providedFormat": "xlsx",
"supportedFormats": ["jsonl", "csv", "parquet"]
}
}
{
"error": "QUOTA_EXCEEDED",
"message": "Storage quota exceeded",
"details": {
"currentUsage": "50GB",
"quotaLimit": "50GB",
"requestedSize": "5GB"
}
}
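Client code can branch on the error field of these payloads. A sketch with requests, using the detail keys shown above (the handling policy itself is an illustration, not a documented contract):

import requests

resp = requests.post(
    "https://api.tensorone.ai/v2/training/datasets",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"name": "my-dataset", "type": "text", "format": "jsonl"},
)

if not resp.ok:
    err = resp.json()
    code = err.get("error")
    if code == "VALIDATION_FAILED":
        print("fix these examples:", err["details"].get("examples"))
    elif code == "UNSUPPORTED_FORMAT":
        print("convert to one of:", err["details"]["supportedFormats"])
    elif code == "QUOTA_EXCEEDED":
        print("free up storage or request a higher quota")
    else:
        resp.raise_for_status()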
Best Practices
Data Quality
- Ensure consistent formatting across all examples
- Remove duplicates and low-quality samples (see the deduplication sketch after this list)
- Balance your dataset across different classes or categories
- Validate data integrity before uploading
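For exact duplicates in a JSONL dataset, hashing each normalized line is usually enough. A minimal sketch:

import hashlib

seen, kept = set(), []
with open("training_data.jsonl", encoding="utf-8") as f:
    for line in f:
        digest = hashlib.sha256(line.strip().encode("utf-8")).hexdigest()
        if digest not in seen:  # keep first occurrence, drop exact repeats
            seen.add(digest)
            kept.append(line)

with open("training_data.dedup.jsonl", "w", encoding="utf-8") as f:
    f.writelines(kept)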
Storage Optimization
- Use compressed formats like Parquet for structured data (see the conversion example after this list)
- Optimize image sizes while maintaining quality
- Remove unnecessary metadata from files
- Consider data deduplication for large datasets
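Converting a CSV to Parquet with pandas illustrates the first point (pyarrow or fastparquet must be installed):

import pandas as pd

df = pd.read_csv("structured_data.csv")
# Snappy-compressed Parquet is typically far smaller than the raw CSV
df.to_parquet("structured_data.parquet", compression="snappy", index=False)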
Security
- Never include sensitive information in training data
- Use proper access controls for private datasets
- Implement data lineage tracking for compliance
- Regularly audit dataset contents
Dataset processing time varies based on size and complexity. Text datasets typically process within minutes, while large image datasets may take several hours.
Once a dataset is used in a training job, it cannot be deleted. Create a new version if you need to make changes.