Upload Training Dataset
curl --request POST \
  --url https://api.tensorone.ai/v2/training/datasets \
  --header 'Authorization: Bearer <api-key>' \
  --header 'Content-Type: application/json' \
  --data '{
  "name": "<string>",
  "description": "<string>",
  "format": "json",
  "dataUrl": "<string>",
  "splitRatio": {
    "train": 0.5,
    "validation": 0.5,
    "test": 0.5
  }
}'

{
  "datasetId": "<string>",
  "name": "<string>",
  "status": "uploading",
  "createdAt": "2023-11-07T05:31:56Z"
}
Datasets are the foundation of successful AI training. TensorOne’s dataset management system provides secure upload, validation, preprocessing, and versioning capabilities for all types of training data.

Upload Dataset

Upload a new dataset for training. Supports text, image, audio, video, structured, and multimodal data in the formats listed below.

Required Parameters

  • name: Human-readable name for the dataset (1-100 characters)
  • type: Dataset type (text, image, audio, video, structured, multimodal)
  • format: Data format (jsonl, csv, parquet, hdf5, zip, tar)

Optional Parameters

  • description: Description of the dataset
  • tags: Array of tags for organization
  • validation: Validation configuration object
  • preprocessing: Preprocessing configuration object
  • metadata: Additional metadata object

Example Usage

Upload Text Dataset for Language Model Training

curl -X POST "https://api.tensorone.ai/v2/training/datasets" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "custom-instruction-dataset",
    "type": "text",
    "format": "jsonl",
    "description": "Custom instruction-response pairs for fine-tuning",
    "validation": {
      "required_fields": ["instruction", "response"],
      "max_sequence_length": 2048,
      "min_examples": 100
    },
    "preprocessing": {
      "tokenizer": "meta-llama/Llama-2-7b-hf",
      "add_special_tokens": true,
      "truncation": true,
      "padding": "max_length"
    },
    "metadata": {
      "source": "customer_conversations",
      "language": "en",
      "domain": "customer_support"
    }
  }'

Upload Image Dataset for Computer Vision

curl -X POST "https://api.tensorone.ai/v2/training/datasets" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "product-classification-images",
    "type": "image",
    "format": "zip",
    "description": "Product images with category labels",
    "validation": {
      "supported_formats": ["jpg", "png", "webp"],
      "min_resolution": [224, 224],
      "max_file_size": "10MB",
      "min_examples_per_class": 50
    },
    "preprocessing": {
      "resize": [512, 512],
      "normalize": true,
      "augmentation": {
        "horizontal_flip": 0.5,
        "rotation": 15,
        "brightness": 0.2,
        "contrast": 0.2
      }
    },
    "metadata": {
      "num_classes": 25,
      "image_source": "product_catalog",
      "annotation_format": "directory_structure"
    }
  }'

Upload Multimodal Dataset

curl -X POST "https://api.tensorone.ai/v2/training/datasets" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "vision-language-pairs",
    "type": "multimodal",
    "format": "jsonl",
    "description": "Image-caption pairs for vision-language model training",
    "validation": {
      "required_fields": ["image_path", "caption"],
      "image_formats": ["jpg", "png"],
      "min_caption_length": 5,
      "max_caption_length": 200
    },
    "preprocessing": {
      "vision": {
        "resize": [336, 336],
        "normalize": true,
        "center_crop": true
      },
      "text": {
        "tokenizer": "openai/clip-vit-base-patch32",
        "max_length": 77,
        "truncation": true
      }
    }
  }'

Response

Returns the created dataset object:
{
  "id": "ds_1234567890abcdef",
  "name": "custom-instruction-dataset",
  "type": "text",
  "format": "jsonl",
  "status": "uploading",
  "uploadUrl": "https://upload.tensorone.ai/datasets/ds_1234567890abcdef",
  "uploadToken": "tok_upload_1234567890abcdef",
  "validation": {
    "required_fields": ["instruction", "response"],
    "max_sequence_length": 2048,
    "min_examples": 100
  },
  "size": {
    "bytes": 0,
    "examples": 0
  },
  "createdAt": "2024-01-15T10:30:00Z",
  "updatedAt": "2024-01-15T10:30:00Z"
}

Upload Data to Dataset

After creating a dataset, upload your data files using the provided upload URL and token.

Upload via Direct HTTP

curl -X PUT "https://upload.tensorone.ai/datasets/ds_1234567890abcdef" \
  -H "Authorization: Bearer tok_upload_1234567890abcdef" \
  -H "Content-Type: application/octet-stream" \
  --data-binary @training_data.jsonl
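
The same direct upload can be issued from Python using the uploadUrl and uploadToken returned when the dataset was created; a minimal sketch (assumes the requests package is installed):

import requests

# Values returned by the dataset-creation response
upload_url = "https://upload.tensorone.ai/datasets/ds_1234567890abcdef"
upload_token = "tok_upload_1234567890abcdef"

# Pass the file object so requests streams it instead of loading it into memory
with open("training_data.jsonl", "rb") as f:
    resp = requests.put(
        upload_url,
        headers={
            "Authorization": f"Bearer {upload_token}",
            "Content-Type": "application/octet-stream",
        },
        data=f,
    )
resp.raise_for_status()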

Upload Large Files (Multipart)

For files larger than 100MB, use multipart upload:
# Initiate multipart upload
curl -X POST "https://upload.tensorone.ai/datasets/ds_1234567890abcdef/multipart" \
  -H "Authorization: Bearer tok_upload_1234567890abcdef" \
  -H "Content-Type: application/json" \
  -d '{
    "filename": "large_dataset.zip",
    "size": 2147483648
  }'
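
The initiation response supplies the coordination details for the remaining parts. The part-level calls are not documented in this reference, so the Python sketch below is illustrative only: the parts and complete paths are hypothetical placeholders, not confirmed endpoints.

import requests

upload_base = "https://upload.tensorone.ai/datasets/ds_1234567890abcdef/multipart"
headers = {"Authorization": "Bearer tok_upload_1234567890abcdef"}
part_size = 100 * 1024 * 1024  # assumed 100MB parts

with open("large_dataset.zip", "rb") as f:
    part_number = 1
    while chunk := f.read(part_size):
        # Hypothetical part endpoint; substitute the URLs returned by the initiation call
        requests.put(f"{upload_base}/parts/{part_number}", headers=headers, data=chunk).raise_for_status()
        part_number += 1

# Hypothetical completion call to finalize the upload
requests.post(f"{upload_base}/complete", headers=headers).raise_for_status()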

Get Dataset Details

Retrieve detailed information about a specific dataset.
curl -X GET "https://api.tensorone.ai/v2/training/datasets/ds_1234567890abcdef" \
  -H "Authorization: Bearer YOUR_API_KEY"

Response

{
  "id": "ds_1234567890abcdef",
  "name": "custom-instruction-dataset",
  "type": "text",
  "format": "jsonl",
  "status": "ready",
  "description": "Custom instruction-response pairs for fine-tuning",
  "size": {
    "bytes": 52428800,
    "examples": 10000,
    "compressed": 15728640
  },
  "schema": {
    "fields": [
      {"name": "instruction", "type": "string", "nullable": false},
      {"name": "response", "type": "string", "nullable": false},
      {"name": "context", "type": "string", "nullable": true}
    ]
  },
  "statistics": {
    "avg_instruction_length": 45.2,
    "avg_response_length": 128.7,
    "unique_instructions": 9847,
    "language_distribution": {
      "en": 0.92,
      "es": 0.05,
      "fr": 0.03
    }
  },
  "validation": {
    "status": "passed",
    "checks": [
      {"name": "required_fields", "status": "passed"},
      {"name": "sequence_length", "status": "passed"},
      {"name": "minimum_examples", "status": "passed"}
    ],
    "warnings": [
      "2 examples exceed recommended response length"
    ]
  },
  "preprocessing": {
    "tokenizer": "meta-llama/Llama-2-7b-hf",
    "total_tokens": 1567890,
    "vocab_size": 32000
  },
  "versions": [
    {
      "id": "v1",
      "createdAt": "2024-01-15T10:30:00Z",
      "size": 52428800,
      "checksum": "sha256:a1b2c3d4e5f6..."
    }
  ],
  "createdAt": "2024-01-15T10:30:00Z",
  "updatedAt": "2024-01-15T11:45:00Z"
}

List Datasets

Retrieve a list of datasets for your account.
curl -X GET "https://api.tensorone.ai/v2/training/datasets" \
  -H "Authorization: Bearer YOUR_API_KEY"

Query Parameters

  • type: Filter by dataset type (text, image, audio, video, structured, multimodal)
  • status: Filter by status (uploading, processing, ready, error)
  • limit: Number of datasets to return (1-100, default: 50)
  • offset: Number of datasets to skip for pagination
  • sort: Sort order (created_at, updated_at, name, size)
  • order: Sort direction (asc, desc, default: desc)

Response

{
  "datasets": [
    {
      "id": "ds_1234567890abcdef",
      "name": "custom-instruction-dataset",
      "type": "text",
      "status": "ready",
      "size": {
        "bytes": 52428800,
        "examples": 10000
      },
      "createdAt": "2024-01-15T10:30:00Z",
      "updatedAt": "2024-01-15T11:45:00Z"
    }
  ],
  "pagination": {
    "total": 15,
    "limit": 50,
    "offset": 0,
    "hasMore": false
  }
}
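
To walk the full collection, page with limit and offset until hasMore is false. A minimal Python sketch against the endpoint above:

import requests

headers = {"Authorization": "Bearer YOUR_API_KEY"}
url = "https://api.tensorone.ai/v2/training/datasets"

datasets, offset = [], 0
while True:
    resp = requests.get(url, headers=headers, params={"status": "ready", "limit": 50, "offset": offset})
    resp.raise_for_status()
    page = resp.json()
    datasets.extend(page["datasets"])
    if not page["pagination"]["hasMore"]:
        break
    offset += page["pagination"]["limit"]

print(f"Fetched {len(datasets)} datasets")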

Update Dataset

Update dataset metadata and configuration.
curl -X PATCH "https://api.tensorone.ai/v2/training/datasets/ds_1234567890abcdef" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "updated-instruction-dataset",
    "description": "Updated description with more context",
    "tags": ["instruction-tuning", "customer-support", "v2"],
    "metadata": {
      "source": "customer_conversations",
      "language": "en",
      "domain": "customer_support",
      "quality_score": 8.5
    }
  }'

Delete Dataset

Delete a dataset and all associated data.
curl -X DELETE "https://api.tensorone.ai/v2/training/datasets/ds_1234567890abcdef" \
  -H "Authorization: Bearer YOUR_API_KEY"

Dataset Validation

Text Dataset Validation

{
  "validation": {
    "required_fields": ["instruction", "response"],
    "optional_fields": ["context", "category"],
    "max_sequence_length": 2048,
    "min_sequence_length": 10,
    "min_examples": 100,
    "max_examples": 1000000,
    "encoding": "utf-8",
    "language_detection": true
  }
}
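
The same checks can be run locally before upload to catch problems early. A stdlib-only sketch mirroring the configuration above; note that sequence length is approximated here by whitespace tokens, while the platform counts tokenizer tokens:

import json

REQUIRED_FIELDS = ["instruction", "response"]
MAX_SEQ_LEN = 2048  # tokenizer tokens on the platform; whitespace split is an approximation
MIN_SEQ_LEN = 10
MIN_EXAMPLES = 100

errors, count = [], 0
with open("training_data.jsonl", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        record = json.loads(line)
        count += 1
        for field in REQUIRED_FIELDS:
            if field not in record:
                errors.append(f"line {lineno}: missing required field '{field}'")
        n_tokens = sum(len(str(v).split()) for v in record.values())
        if not MIN_SEQ_LEN <= n_tokens <= MAX_SEQ_LEN:
            errors.append(f"line {lineno}: approx length {n_tokens} outside [{MIN_SEQ_LEN}, {MAX_SEQ_LEN}]")

if count < MIN_EXAMPLES:
    errors.append(f"only {count} examples; at least {MIN_EXAMPLES} required")
print("\n".join(errors) if errors else "local validation passed")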

Image Dataset Validation

{
  "validation": {
    "supported_formats": ["jpg", "jpeg", "png", "webp"],
    "min_resolution": [224, 224],
    "max_resolution": [4096, 4096],
    "max_file_size": "10MB",
    "min_examples_per_class": 10,
    "check_corruption": true,
    "color_space": "RGB"
  }
}

SDK Examples

Python SDK

import time

from tensorone import TensorOneClient

client = TensorOneClient(api_key="YOUR_API_KEY")

# Create dataset
dataset = client.training.datasets.create(
    name="custom-instruction-dataset",
    type="text",
    format="jsonl",
    description="Custom instruction-response pairs",
    validation={
        "required_fields": ["instruction", "response"],
        "max_sequence_length": 2048,
        "min_examples": 100
    }
)

# Upload data
with open("training_data.jsonl", "rb") as f:
    client.training.datasets.upload(dataset.id, f)

# Poll until processing completes
dataset = client.training.datasets.get(dataset.id)
while dataset.status in ("uploading", "processing"):
    print(f"Processing... {dataset.status}")
    time.sleep(10)
    dataset = client.training.datasets.get(dataset.id)

print(f"Dataset ready: {dataset.size.examples} examples")

# List datasets
datasets = client.training.datasets.list(type="text", status="ready")
for ds in datasets:
    print(f"{ds.name}: {ds.size.examples} examples")

JavaScript SDK

import { TensorOneClient } from '@tensorone/sdk';
import fs from 'fs';

const client = new TensorOneClient({ apiKey: 'YOUR_API_KEY' });

// Create dataset
const dataset = await client.training.datasets.create({
  name: 'custom-instruction-dataset',
  type: 'text',
  format: 'jsonl',
  description: 'Custom instruction-response pairs',
  validation: {
    requiredFields: ['instruction', 'response'],
    maxSequenceLength: 2048,
    minExamples: 100
  }
});

// Upload data
const fileStream = fs.createReadStream('training_data.jsonl');
await client.training.datasets.upload(dataset.id, fileStream);

// Poll until processing completes
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const waitForReady = async (datasetId) => {
  let ds = await client.training.datasets.get(datasetId);
  while (ds.status === 'uploading' || ds.status === 'processing') {
    await sleep(10000);
    ds = await client.training.datasets.get(datasetId);
  }
  console.log(`Dataset ready: ${ds.size.examples} examples`);
};

await waitForReady(dataset.id);

Data Formats

Text Datasets

JSONL Format for Instruction Tuning

{"instruction": "What is machine learning?", "response": "Machine learning is a subset of artificial intelligence..."}
{"instruction": "Explain neural networks", "response": "Neural networks are computational models inspired by..."}

CSV Format for Classification

text,label
"This product is amazing!",positive
"Poor quality, not recommended",negative

Image Datasets

Directory Structure

dataset/
├── class1/
│   ├── image1.jpg
│   ├── image2.jpg
│   └── ...
├── class2/
│   ├── image1.jpg
│   ├── image2.jpg
│   └── ...
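
Before uploading, it is worth confirming the tree meets the per-class minimum; a short stdlib check:

from pathlib import Path

MIN_PER_CLASS = 50  # threshold used in the upload example above
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}

for class_dir in sorted(Path("dataset").iterdir()):
    if not class_dir.is_dir():
        continue
    n = sum(1 for p in class_dir.iterdir() if p.suffix.lower() in IMAGE_EXTS)
    marker = "" if n >= MIN_PER_CLASS else "  <-- below minimum"
    print(f"{class_dir.name}: {n} images{marker}")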

JSONL with Annotations

{"image_path": "images/001.jpg", "label": "cat", "bbox": [10, 20, 100, 150]}
{"image_path": "images/002.jpg", "label": "dog", "bbox": [15, 25, 120, 180]}

Error Handling

Common Errors

{
  "error": "VALIDATION_FAILED",
  "message": "Dataset validation failed",
  "details": {
    "field": "instruction",
    "reason": "Required field missing in 15 examples",
    "examples": [45, 67, 89, 123, 156]
  }
}
{
  "error": "UNSUPPORTED_FORMAT",
  "message": "File format not supported",
  "details": {
    "providedFormat": "xlsx",
    "supportedFormats": ["jsonl", "csv", "parquet"]
  }
}
{
  "error": "QUOTA_EXCEEDED",
  "message": "Storage quota exceeded",
  "details": {
    "currentUsage": "50GB",
    "quotaLimit": "50GB",
    "requestedSize": "5GB"
  }
}
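
Branch on the error code rather than the message text, since messages may change. A sketch of handling the shapes above (assumes the requests package):

import requests

resp = requests.post(
    "https://api.tensorone.ai/v2/training/datasets",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"name": "my-dataset", "type": "text", "format": "jsonl"},
)

if not resp.ok:
    err = resp.json()
    code = err.get("error")
    if code == "VALIDATION_FAILED":
        # details lists the offending field and example indices
        print("Fix data:", err["details"]["reason"])
    elif code == "UNSUPPORTED_FORMAT":
        print("Convert to one of:", err["details"]["supportedFormats"])
    elif code == "QUOTA_EXCEEDED":
        print("Free storage or request a quota increase")
    else:
        resp.raise_for_status()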

Best Practices

Data Quality

  • Ensure consistent formatting across all examples
  • Remove duplicates and low-quality samples (a quick deduplication sketch follows this list)
  • Balance your dataset across different classes or categories
  • Validate data integrity before uploading
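
For exact duplicates in JSONL data, hashing each record is usually sufficient; a minimal stdlib sketch:

import hashlib

seen, kept = set(), []
with open("training_data.jsonl", encoding="utf-8") as f:
    for line in f:
        digest = hashlib.sha256(line.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(line)

with open("training_data.dedup.jsonl", "w", encoding="utf-8") as f:
    f.writelines(kept)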

Storage Optimization

  • Use compressed formats like Parquet for structured data (see the conversion sketch after this list)
  • Optimize image sizes while maintaining quality
  • Remove unnecessary metadata from files
  • Consider data deduplication for large datasets
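
Converting line-delimited JSON to Parquet usually shrinks structured data substantially; a sketch with pandas (Parquet support requires pyarrow or fastparquet):

import pandas as pd

# Read line-delimited JSON and write compressed Parquet
df = pd.read_json("training_data.jsonl", lines=True)
df.to_parquet("training_data.parquet", compression="snappy")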

Security

  • Never include sensitive information in training data
  • Use proper access controls for private datasets
  • Implement data lineage tracking for compliance
  • Regularly audit dataset contents

Dataset processing time varies based on size and complexity. Text datasets typically process within minutes, while large image datasets may take several hours.

Once a dataset has been used in a training job, it cannot be deleted. Create a new version if you need to make changes.

Authorizations

Authorization (string, header, required): API key authentication. Use 'Bearer YOUR_API_KEY' format.

Body

application/json

Response

201 - application/json: Dataset uploaded successfully. The response body is an object.