Training jobs are the core of TensorOne’s managed training platform. They handle the complete lifecycle of model training, from resource allocation to checkpoint management.
Create Training Job
Create a new training job with specified model architecture, dataset, and training configuration.
Required Parameters
name: Human-readable name for the training job (1-100 characters)
modelType: Type of model to train (llm, vision, multimodal, custom)
datasetId: ID of the dataset to use for training
config: Training configuration object
Optional Parameters
description: Description of the training job
tags: Array of tags for organization
gpuType: Preferred GPU type (a100, h100, v100, rtx4090)
maxWorkers: Maximum number of workers (1-32, default: 1)
priority: Job priority (high, normal, low; default: normal)
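Only the four required fields need to be supplied; the optional parameters fall back to the defaults listed above. A minimal sketch using the Python SDK introduced later on this page (the dataset ID and config values are illustrative):

from tensorone import TensorOneClient

client = TensorOneClient(api_key="YOUR_API_KEY")

# Only the required fields: name, modelType, datasetId, and config.
# gpuType, maxWorkers, and priority fall back to their documented defaults.
job = client.training.jobs.create(
    name="minimal-llm-finetune",
    model_type="llm",
    dataset_id="ds_1234567890abcdef",  # illustrative dataset ID
    config={
        "base_model": "meta-llama/Llama-2-7b-hf",
        "strategy": "lora",
    },
)
print(job.id, job.status)  # newly created jobs start in "pending"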
Example Usage
Fine-tune a Language Model
curl -X POST "https://api.tensorone.ai/v2/training/jobs" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "llama-7b-finetune",
"modelType": "llm",
"datasetId": "ds_1234567890abcdef",
"config": {
"baseModel": "meta-llama/Llama-2-7b-hf",
"strategy": "lora",
"parameters": {
"rank": 16,
"alpha": 32,
"dropout": 0.1,
"target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"]
},
"training": {
"epochs": 3,
"batch_size": 4,
"learning_rate": 2e-4,
"weight_decay": 0.01,
"warmup_steps": 100,
"gradient_accumulation_steps": 8
},
"optimization": {
"mixed_precision": "fp16",
"gradient_checkpointing": true,
"max_grad_norm": 1.0
}
},
"gpuType": "a100",
"maxWorkers": 2,
"priority": "high"
}'
Train a Custom Vision Model
curl -X POST "https://api.tensorone.ai/v2/training/jobs" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "custom-object-detection",
"modelType": "vision",
"datasetId": "ds_vision_objects_001",
"config": {
"architecture": "yolov8",
"input_size": [640, 640],
"num_classes": 80,
"training": {
"epochs": 100,
"batch_size": 16,
"learning_rate": 0.01,
"momentum": 0.937,
"weight_decay": 0.0005
},
"augmentation": {
"mixup": 0.1,
"copy_paste": 0.5,
"hsv_h": 0.015,
"hsv_s": 0.7,
"hsv_v": 0.4
}
},
"gpuType": "v100",
"maxWorkers": 4
}'
Response
Returns the created training job object:
{
"id": "job_1234567890abcdef",
"name": "llama-7b-finetune",
"status": "pending",
"modelType": "llm",
"datasetId": "ds_1234567890abcdef",
"config": {
"baseModel": "meta-llama/Llama-2-7b-hf",
"strategy": "lora",
"parameters": {
"rank": 16,
"alpha": 32,
"dropout": 0.1,
"target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"]
}
},
"resourceAllocation": {
"gpuType": "NVIDIA A100",
"workers": 2,
"memoryPerWorker": "40GB"
},
"estimatedCost": {
"hourly": 8.50,
"total": 25.50
},
"createdAt": "2024-01-15T10:30:00Z",
"updatedAt": "2024-01-15T10:30:00Z"
}
List Training Jobs
Retrieve a list of training jobs for your account.
curl -X GET "https://api.tensorone.ai/v2/training/jobs" \
-H "Authorization: Bearer YOUR_API_KEY"
Query Parameters
status: Filter by job status (pending, running, completed, failed, cancelled)
modelType: Filter by model type (llm, vision, multimodal, custom)
limit: Number of jobs to return (1-100, default: 50)
offset: Number of jobs to skip for pagination
sort: Sort field (created_at, updated_at, name)
order: Sort direction (asc, desc; default: desc)
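For example, to page through only the running LLM jobs, newest first, you can pass these parameters from the Python SDK. The list method and keyword arguments below are assumed to mirror the query parameters above; adjust them if your SDK version differs:

from tensorone import TensorOneClient

client = TensorOneClient(api_key="YOUR_API_KEY")

offset = 0
while True:
    # Assumed SDK wrapper for GET /v2/training/jobs with query parameters.
    page = client.training.jobs.list(
        status="running",
        model_type="llm",
        sort="created_at",
        order="desc",
        limit=50,
        offset=offset,
    )
    for job in page.jobs:
        print(job.id, job.name, job.status)
    if not page.pagination.has_more:
        break
    offset += page.pagination.limit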
Response
{
"jobs": [
{
"id": "job_1234567890abcdef",
"name": "llama-7b-finetune",
"status": "running",
"modelType": "llm",
"progress": {
"currentEpoch": 2,
"totalEpochs": 3,
"currentStep": 1247,
"totalSteps": 1875,
"percentage": 66.5
},
"metrics": {
"loss": 0.342,
"learningRate": 1.8e-4,
"throughput": "1250 tokens/sec"
},
"createdAt": "2024-01-15T10:30:00Z",
"startedAt": "2024-01-15T10:35:00Z",
"estimatedCompletion": "2024-01-15T12:45:00Z"
}
],
"pagination": {
"total": 25,
"limit": 50,
"offset": 0,
"hasMore": false
}
}
Get Training Job Details
Retrieve detailed information about a specific training job.
curl -X GET "https://api.tensorone.ai/v2/training/jobs/job_1234567890abcdef" \
-H "Authorization: Bearer YOUR_API_KEY"
Response
{
"id": "job_1234567890abcdef",
"name": "llama-7b-finetune",
"status": "running",
"modelType": "llm",
"datasetId": "ds_1234567890abcdef",
"config": {
"baseModel": "meta-llama/Llama-2-7b-hf",
"strategy": "lora",
"parameters": {
"rank": 16,
"alpha": 32,
"dropout": 0.1
},
"training": {
"epochs": 3,
"batch_size": 4,
"learning_rate": 2e-4
}
},
"progress": {
"currentEpoch": 2,
"totalEpochs": 3,
"currentStep": 1247,
"totalSteps": 1875,
"percentage": 66.5,
"elapsedTime": 7200,
"remainingTime": 3600
},
"metrics": {
"currentLoss": 0.342,
"bestLoss": 0.298,
"learningRate": 1.8e-4,
"throughput": "1250 tokens/sec",
"memoryUsage": "38.2GB"
},
"resourceUsage": {
"gpuHours": 4.2,
"cost": 35.70,
"efficiency": 0.94
},
"checkpoints": [
{
"id": "ckpt_epoch_1",
"epoch": 1,
"loss": 0.456,
"createdAt": "2024-01-15T11:15:00Z",
"size": "2.3GB"
}
],
"createdAt": "2024-01-15T10:30:00Z",
"startedAt": "2024-01-15T10:35:00Z",
"estimatedCompletion": "2024-01-15T12:45:00Z"
}
Stop Training Job
Stop a running training job and save the current checkpoint.
curl -X POST "https://api.tensorone.ai/v2/training/jobs/job_1234567890abcdef/stop" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"saveCheckpoint": true,
"reason": "User requested stop"
}'
Resume Training Job
Resume a stopped training job from the latest checkpoint.
curl -X POST "https://api.tensorone.ai/v2/training/jobs/job_1234567890abcdef/resume" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"checkpointId": "ckpt_epoch_1"
}'
SDK Examples
Python SDK
import time

from tensorone import TensorOneClient
client = TensorOneClient(api_key="YOUR_API_KEY")
# Create a fine-tuning job
job = client.training.jobs.create(
name="llama-7b-finetune",
model_type="llm",
dataset_id="ds_1234567890abcdef",
config={
"base_model": "meta-llama/Llama-2-7b-hf",
"strategy": "lora",
"parameters": {
"rank": 16,
"alpha": 32,
"dropout": 0.1,
"target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"]
},
"training": {
"epochs": 3,
"batch_size": 4,
"learning_rate": 2e-4,
"weight_decay": 0.01
}
},
gpu_type="a100",
max_workers=2
)
print(f"Created job: {job.id}")
# Monitor training progress
while job.status in ["pending", "running"]:
job = client.training.jobs.get(job.id)
print(f"Progress: {job.progress.percentage}% - Loss: {job.metrics.current_loss}")
time.sleep(30)
print(f"Training completed with status: {job.status}")
JavaScript SDK
import { TensorOneClient } from '@tensorone/sdk';
const client = new TensorOneClient({ apiKey: 'YOUR_API_KEY' });
// Create a training job
const job = await client.training.jobs.create({
name: 'llama-7b-finetune',
modelType: 'llm',
datasetId: 'ds_1234567890abcdef',
config: {
baseModel: 'meta-llama/Llama-2-7b-hf',
strategy: 'lora',
parameters: {
rank: 16,
alpha: 32,
dropout: 0.1,
targetModules: ['q_proj', 'v_proj', 'k_proj', 'o_proj']
},
training: {
epochs: 3,
batchSize: 4,
learningRate: 2e-4,
weightDecay: 0.01
}
},
gpuType: 'a100',
maxWorkers: 2
});
console.log(`Created job: ${job.id}`);
// Monitor progress
const monitorJob = async (jobId) => {
const job = await client.training.jobs.get(jobId);
console.log(`Progress: ${job.progress.percentage}% - Loss: ${job.metrics.currentLoss}`);
if (job.status === 'running' || job.status === 'pending') {
setTimeout(() => monitorJob(jobId), 30000);
} else {
console.log(`Training completed with status: ${job.status}`);
}
};
monitorJob(job.id);
Error Handling
Common Errors
{
"error": "INSUFFICIENT_RESOURCES",
"message": "Requested GPU type not available",
"details": {
"requestedGpuType": "h100",
"availableGpuTypes": ["a100", "v100", "rtx4090"]
}
}
{
"error": "DATASET_NOT_FOUND",
"message": "Dataset with specified ID does not exist",
"details": {
"datasetId": "ds_invalid_id"
}
}
{
"error": "CONFIGURATION_ERROR",
"message": "Invalid training configuration",
"details": {
"field": "config.training.learning_rate",
"reason": "Learning rate must be between 1e-6 and 1e-1"
}
}
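When calling the REST endpoint directly, these error bodies are returned with a non-2xx status code, so client code can branch on the error field. A minimal sketch using requests; the fallback-to-available-GPU behavior is an illustrative choice, not something the API requires:

import requests

API_URL = "https://api.tensorone.ai/v2/training/jobs"
HEADERS = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
}

def create_job(payload: dict, allow_gpu_fallback: bool = True) -> dict:
    resp = requests.post(API_URL, json=payload, headers=HEADERS, timeout=30)
    if resp.ok:
        return resp.json()

    body = resp.json()
    error = body.get("error")
    if error == "INSUFFICIENT_RESOURCES" and allow_gpu_fallback:
        # Retry once with a GPU type the API reports as available.
        fallback = body["details"]["availableGpuTypes"][0]
        return create_job({**payload, "gpuType": fallback}, allow_gpu_fallback=False)
    if error in ("DATASET_NOT_FOUND", "CONFIGURATION_ERROR"):
        raise ValueError(f"{error}: {body['message']} ({body['details']})")
    resp.raise_for_status()
    return body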
Best Practices
Resource Optimization
- Choose appropriate GPU types based on model size and memory requirements
- Use gradient accumulation to effectively increase batch size on limited memory
- Enable mixed precision training for faster convergence and memory efficiency
- Monitor resource utilization and adjust worker count accordingly
Cost Management
- Use spot instances for non-critical training jobs to reduce costs
- Implement early stopping to prevent unnecessary training iterations (see the polling sketch after this list)
- Set cost alerts to monitor training expenses
- Consider using smaller models or LoRA fine-tuning for cost efficiency
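One way to combine early stopping with a spending cap is to poll the job and stop it once the loss stops improving or the accrued cost crosses a budget. This is a sketch that assumes the SDK exposes a stop method mirroring the /stop endpoint and that the metric and cost fields match the job detail response shown earlier:

import time

from tensorone import TensorOneClient

client = TensorOneClient(api_key="YOUR_API_KEY")

BUDGET_USD = 50.0   # illustrative spending cap
PATIENCE = 5        # polls without improvement before stopping
MIN_DELTA = 0.005   # smallest loss improvement that counts

def train_with_guardrails(job_id: str) -> None:
    best_loss = float("inf")
    stale_polls = 0
    while True:
        job = client.training.jobs.get(job_id)
        if job.status not in ("pending", "running"):
            break
        loss = job.metrics.current_loss
        if loss is not None and loss < best_loss - MIN_DELTA:
            best_loss, stale_polls = loss, 0
        else:
            stale_polls += 1
        over_budget = job.resource_usage.cost >= BUDGET_USD
        if stale_polls >= PATIENCE or over_budget:
            reason = "budget exceeded" if over_budget else "loss plateaued"
            # Assumed SDK wrapper for POST /training/jobs/{id}/stop.
            client.training.jobs.stop(job_id, save_checkpoint=True, reason=reason)
            break
        time.sleep(60)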
Training Stability
- Start with conservative learning rates and gradually increase
- Use gradient clipping to prevent exploding gradients
- Implement proper validation splits to monitor overfitting
- Save checkpoints frequently to recover from interruptions
Training jobs are billed per second of GPU usage. Jobs in pending status are not billed until they transition to running.
Large training jobs may take several minutes to provision resources. Monitor the job status and be patient during the initial setup phase.