Create Training Job

Example request:

curl --request POST \
  --url https://api.tensorone.ai/v2/training/jobs \
  --header 'Authorization: Bearer <api-key>' \
  --header 'Content-Type: application/json' \
  --data '{
  "name": "<string>",
  "description": "<string>",
  "framework": "pytorch",
  "modelConfig": {
    "modelType": "language_model",
    "baseModel": "<string>",
    "customCode": "<string>"
  },
  "datasetConfig": {
    "datasetId": "<string>",
    "datasetUrl": "<string>",
    "format": "json"
  },
  "hyperparameters": {
    "learningRate": 0.001,
    "batchSize": 32,
    "epochs": 10,
    "optimizer": "adam"
  },
  "infrastructure": {
    "gpuType": "rtx-4090",
    "gpuCount": 1,
    "memory": "32GB",
    "storage": "100GB"
  }
}'
Example response (201):

{
  "jobId": "<string>",
  "name": "<string>",
  "status": "queued",
  "estimatedStartTime": "2023-11-07T05:31:56Z",
  "estimatedDuration": "<string>",
  "estimatedCost": {
    "gpuCost": 123,
    "storageCost": 123,
    "totalCost": 123
  },
  "createdAt": "2023-11-07T05:31:56Z"
}
Training jobs are the core of TensorOne’s managed training platform. They handle the complete lifecycle of model training, from resource allocation to checkpoint management.

Create Training Job

Create a new training job with the specified model architecture, dataset, and training configuration.

Required Parameters

  • name: Human-readable name for the training job (1-100 characters)
  • modelType: Type of model to train (llm, vision, multimodal, custom)
  • datasetId: ID of the dataset to use for training
  • config: Training configuration object (a minimal request using only the required parameters follows these lists)

Optional Parameters

  • description: Description of the training job
  • tags: Array of tags for organization
  • gpuType: Preferred GPU type (a100, h100, v100, rtx4090)
  • maxWorkers: Maximum number of workers (1-32, default: 1)
  • priority: Job priority (high, normal, low, default: normal)
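
A minimal request needs only the required parameters; everything else falls back to defaults. The sketch below uses the Python SDK shown later on this page; the exact config keys required for each modelType are not enumerated here, so the config body is illustrative:

from tensorone import TensorOneClient

client = TensorOneClient(api_key="YOUR_API_KEY")

# Only the required parameters: name, model_type, dataset_id, and config.
# Optional settings (gpu_type, max_workers, priority) use their defaults.
job = client.training.jobs.create(
    name="minimal-llm-job",
    model_type="llm",
    dataset_id="ds_1234567890abcdef",
    config={
        "base_model": "meta-llama/Llama-2-7b-hf",  # illustrative; required keys vary by modelType
        "strategy": "lora"
    }
)

print(job.id, job.status)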

Example Usage

Fine-tune a Language Model

curl -X POST "https://api.tensorone.ai/v2/training/jobs" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "llama-7b-finetune",
    "modelType": "llm",
    "datasetId": "ds_1234567890abcdef",
    "config": {
      "baseModel": "meta-llama/Llama-2-7b-hf",
      "strategy": "lora",
      "parameters": {
        "rank": 16,
        "alpha": 32,
        "dropout": 0.1,
        "target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"]
      },
      "training": {
        "epochs": 3,
        "batch_size": 4,
        "learning_rate": 2e-4,
        "weight_decay": 0.01,
        "warmup_steps": 100,
        "gradient_accumulation_steps": 8
      },
      "optimization": {
        "mixed_precision": "fp16",
        "gradient_checkpointing": true,
        "max_grad_norm": 1.0
      }
    },
    "gpuType": "a100",
    "maxWorkers": 2,
    "priority": "high"
  }'
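
One figure worth checking before submitting this configuration is the effective batch size. Assuming the per-device batch is replicated across workers with data parallelism (an assumption about how maxWorkers is applied), the batch seen by each optimizer step works out as follows:

# Effective batch size for the LoRA example above, assuming
# data-parallel replication across the two workers.
per_device_batch = 4   # config.training.batch_size
grad_accum_steps = 8   # config.training.gradient_accumulation_steps
workers = 2            # maxWorkers

effective_batch = per_device_batch * grad_accum_steps * workers
print(effective_batch)  # 64 sequences per optimizer step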

Train a Custom Vision Model

curl -X POST "https://api.tensorone.ai/v2/training/jobs" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "custom-object-detection",
    "modelType": "vision",
    "datasetId": "ds_vision_objects_001",
    "config": {
      "architecture": "yolov8",
      "input_size": [640, 640],
      "num_classes": 80,
      "training": {
        "epochs": 100,
        "batch_size": 16,
        "learning_rate": 0.01,
        "momentum": 0.937,
        "weight_decay": 0.0005
      },
      "augmentation": {
        "mixup": 0.1,
        "copy_paste": 0.5,
        "hsv_h": 0.015,
        "hsv_s": 0.7,
        "hsv_v": 0.4
      }
    },
    "gpuType": "v100",
    "maxWorkers": 4
  }'

Response

Returns the created training job object:
{
  "id": "job_1234567890abcdef",
  "name": "llama-7b-finetune",
  "status": "pending",
  "modelType": "llm",
  "datasetId": "ds_1234567890abcdef",
  "config": {
    "baseModel": "meta-llama/Llama-2-7b-hf",
    "strategy": "lora",
    "parameters": {
      "rank": 16,
      "alpha": 32,
      "dropout": 0.1,
      "target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"]
    }
  },
  "resourceAllocation": {
    "gpuType": "NVIDIA A100",
    "workers": 2,
    "memoryPerWorker": "40GB"
  },
  "estimatedCost": {
    "hourly": 8.50,
    "total": 25.50
  },
  "createdAt": "2024-01-15T10:30:00Z",
  "updatedAt": "2024-01-15T10:30:00Z"
}

List Training Jobs

Retrieve a list of training jobs for your account.
curl -X GET "https://api.tensorone.ai/v2/training/jobs" \
  -H "Authorization: Bearer YOUR_API_KEY"

Query Parameters

  • status: Filter by job status (pending, running, completed, failed, cancelled)
  • modelType: Filter by model type (llm, vision, multimodal, custom)
  • limit: Number of jobs to return (1-100, default: 50)
  • offset: Number of jobs to skip for pagination
  • sort: Field to sort by (created_at, updated_at, name)
  • order: Sort direction (asc, desc, default: desc)

Response

{
  "jobs": [
    {
      "id": "job_1234567890abcdef",
      "name": "llama-7b-finetune",
      "status": "running",
      "modelType": "llm",
      "progress": {
        "currentEpoch": 2,
        "totalEpochs": 3,
        "currentStep": 1247,
        "totalSteps": 1875,
        "percentage": 66.5
      },
      "metrics": {
        "loss": 0.342,
        "learningRate": 1.8e-4,
        "throughput": "1250 tokens/sec"
      },
      "createdAt": "2024-01-15T10:30:00Z",
      "startedAt": "2024-01-15T10:35:00Z",
      "estimatedCompletion": "2024-01-15T12:45:00Z"
    }
  ],
  "pagination": {
    "total": 25,
    "limit": 50,
    "offset": 0,
    "hasMore": false
  }
}
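
For accounts with a long job history, page through results with limit and offset until hasMore is false. A minimal sketch; the list() method and the shape of its return value are assumptions modeled on the query parameters and response above:

from tensorone import TensorOneClient

client = TensorOneClient(api_key="YOUR_API_KEY")

# Walk every completed job, 50 at a time.
offset = 0
while True:
    page = client.training.jobs.list(status="completed", limit=50, offset=offset)
    for job in page.jobs:
        print(job.id, job.name, job.status)
    if not page.pagination.has_more:
        break
    offset += page.pagination.limit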

Get Training Job Details

Retrieve detailed information about a specific training job. The elapsedTime and remainingTime fields in the progress object are reported in seconds.
curl -X GET "https://api.tensorone.ai/v2/training/jobs/job_1234567890abcdef" \
  -H "Authorization: Bearer YOUR_API_KEY"

Response

{
  "id": "job_1234567890abcdef",
  "name": "llama-7b-finetune",
  "status": "running",
  "modelType": "llm",
  "datasetId": "ds_1234567890abcdef",
  "config": {
    "baseModel": "meta-llama/Llama-2-7b-hf",
    "strategy": "lora",
    "parameters": {
      "rank": 16,
      "alpha": 32,
      "dropout": 0.1
    },
    "training": {
      "epochs": 3,
      "batch_size": 4,
      "learning_rate": 2e-4
    }
  },
  "progress": {
    "currentEpoch": 2,
    "totalEpochs": 3,
    "currentStep": 1247,
    "totalSteps": 1875,
    "percentage": 66.5,
    "elapsedTime": 7200,
    "remainingTime": 3600
  },
  "metrics": {
    "currentLoss": 0.342,
    "bestLoss": 0.298,
    "learningRate": 1.8e-4,
    "throughput": "1250 tokens/sec",
    "memoryUsage": "38.2GB"
  },
  "resourceUsage": {
    "gpuHours": 4.2,
    "cost": 35.70,
    "efficiency": 0.94
  },
  "checkpoints": [
    {
      "id": "ckpt_epoch_1",
      "epoch": 1,
      "loss": 0.456,
      "createdAt": "2024-01-15T11:15:00Z",
      "size": "2.3GB"
    }
  ],
  "createdAt": "2024-01-15T10:30:00Z",
  "startedAt": "2024-01-15T10:35:00Z",
  "estimatedCompletion": "2024-01-15T12:45:00Z"
}

Stop Training Job

Stop a running training job, optionally saving a final checkpoint before it shuts down.
curl -X POST "https://api.tensorone.ai/v2/training/jobs/job_1234567890abcdef/stop" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "saveCheckpoint": true,
    "reason": "User requested stop"
  }'

Resume Training Job

Resume a stopped training job from a saved checkpoint.
curl -X POST "https://api.tensorone.ai/v2/training/jobs/job_1234567890abcdef/resume" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "checkpointId": "ckpt_epoch_1"
  }'
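
In the Python SDK, a stop-and-resume round trip might look like the sketch below; the stop() and resume() method names and arguments are assumptions that mirror the REST endpoints above:

from tensorone import TensorOneClient

client = TensorOneClient(api_key="YOUR_API_KEY")
job_id = "job_1234567890abcdef"

# Stop the job, asking the platform to persist a final checkpoint first.
client.training.jobs.stop(job_id, save_checkpoint=True, reason="User requested stop")

# Later: pick a checkpoint from the job details and resume from it.
job = client.training.jobs.get(job_id)
latest = job.checkpoints[-1]
client.training.jobs.resume(job_id, checkpoint_id=latest.id)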

SDK Examples

Python SDK

import time

from tensorone import TensorOneClient

client = TensorOneClient(api_key="YOUR_API_KEY")

# Create a fine-tuning job
job = client.training.jobs.create(
    name="llama-7b-finetune",
    model_type="llm",
    dataset_id="ds_1234567890abcdef",
    config={
        "base_model": "meta-llama/Llama-2-7b-hf",
        "strategy": "lora",
        "parameters": {
            "rank": 16,
            "alpha": 32,
            "dropout": 0.1,
            "target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"]
        },
        "training": {
            "epochs": 3,
            "batch_size": 4,
            "learning_rate": 2e-4,
            "weight_decay": 0.01
        }
    },
    gpu_type="a100",
    max_workers=2
)

print(f"Created job: {job.id}")

# Monitor training progress (poll every 30 seconds until the job finishes)
while job.status in ["pending", "running"]:
    job = client.training.jobs.get(job.id)
    if job.progress:  # progress may be empty while the job is still queued
        print(f"Progress: {job.progress.percentage}% - Loss: {job.metrics.current_loss}")
    time.sleep(30)

print(f"Training completed with status: {job.status}")

JavaScript SDK

import { TensorOneClient } from '@tensorone/sdk';

const client = new TensorOneClient({ apiKey: 'YOUR_API_KEY' });

// Create a training job
const job = await client.training.jobs.create({
  name: 'llama-7b-finetune',
  modelType: 'llm',
  datasetId: 'ds_1234567890abcdef',
  config: {
    baseModel: 'meta-llama/Llama-2-7b-hf',
    strategy: 'lora',
    parameters: {
      rank: 16,
      alpha: 32,
      dropout: 0.1,
      targetModules: ['q_proj', 'v_proj', 'k_proj', 'o_proj']
    },
    training: {
      epochs: 3,
      batchSize: 4,
      learningRate: 2e-4,
      weightDecay: 0.01
    }
  },
  gpuType: 'a100',
  maxWorkers: 2
});

console.log(`Created job: ${job.id}`);

// Monitor progress
const monitorJob = async (jobId) => {
  const job = await client.training.jobs.get(jobId);
  if (job.progress) {
    // progress may be absent while the job is still queued
    console.log(`Progress: ${job.progress.percentage}% - Loss: ${job.metrics.currentLoss}`);
  }

  if (job.status === 'running' || job.status === 'pending') {
    setTimeout(() => monitorJob(jobId), 30000);
  } else {
    console.log(`Training completed with status: ${job.status}`);
  }
};

monitorJob(job.id);

Error Handling

Common Errors

Insufficient resources (requested GPU type not available):

{
  "error": "INSUFFICIENT_RESOURCES",
  "message": "Requested GPU type not available",
  "details": {
    "requestedGpuType": "h100",
    "availableGpuTypes": ["a100", "v100", "rtx4090"]
  }
}

Dataset not found:

{
  "error": "DATASET_NOT_FOUND",
  "message": "Dataset with specified ID does not exist",
  "details": {
    "datasetId": "ds_invalid_id"
  }
}

Configuration error (a config value failed validation):

{
  "error": "CONFIGURATION_ERROR",
  "message": "Invalid training configuration",
  "details": {
    "field": "config.training.learning_rate",
    "reason": "Learning rate must be between 1e-6 and 1e-1"
  }
}
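
Client code should branch on the error code rather than the human-readable message. Below is a sketch of handling the first error with the Python SDK; the TensorOneError exception class and its code/details attributes are assumptions, so adapt the except clause to whatever your SDK version actually raises:

from tensorone import TensorOneClient, TensorOneError  # TensorOneError is an assumed exception type

client = TensorOneClient(api_key="YOUR_API_KEY")

request = dict(
    name="llama-7b-finetune",
    model_type="llm",
    dataset_id="ds_1234567890abcdef",
    config={"base_model": "meta-llama/Llama-2-7b-hf", "strategy": "lora"},
)

try:
    job = client.training.jobs.create(gpu_type="h100", **request)
except TensorOneError as err:
    if err.code == "INSUFFICIENT_RESOURCES":
        # Retry with one of the GPU types the error reports as available.
        fallback = err.details["availableGpuTypes"][0]
        job = client.training.jobs.create(gpu_type=fallback, **request)
    else:
        raise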

Best Practices

Resource Optimization

  • Choose appropriate GPU types based on model size and memory requirements
  • Use gradient accumulation to increase the effective batch size when GPU memory is limited
  • Enable mixed precision training for faster training steps and lower memory use (see the sketch after this list)
  • Monitor resource utilization and adjust worker count accordingly
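
These levers map directly onto the config fields used earlier on this page. A sketch of a memory-conscious configuration, assuming the same schema as the fine-tuning example above:

# Memory-conscious settings, reusing the config schema from the
# fine-tuning example above.
config = {
    "base_model": "meta-llama/Llama-2-7b-hf",
    "strategy": "lora",
    "training": {
        "epochs": 3,
        "batch_size": 2,                    # small per-device batch to fit in memory
        "gradient_accumulation_steps": 16,  # keeps the effective batch at 32
        "learning_rate": 2e-4
    },
    "optimization": {
        "mixed_precision": "fp16",          # lower-precision math: faster steps, less memory
        "gradient_checkpointing": True,     # trades recompute for activation memory
        "max_grad_norm": 1.0
    }
}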

Cost Management

  • Use spot instances for non-critical training jobs to reduce costs
  • Implement early stopping to prevent unnecessary training iterations
  • Set cost alerts to monitor training expenses (a client-side budget guard is sketched after this list)
  • Consider using smaller models or LoRA fine-tuning for cost efficiency
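
Running spend is exposed in resourceUsage.cost (see Get Training Job Details above), which makes a simple client-side budget guard possible. A sketch; no server-side cost-alert API is documented here, so this polls the job and stops it itself. The attribute and method names are assumptions mirroring the REST responses and endpoints:

import time

from tensorone import TensorOneClient

client = TensorOneClient(api_key="YOUR_API_KEY")
job_id = "job_1234567890abcdef"
budget_usd = 50.0

# Poll the job and stop it (keeping a checkpoint) once spend crosses the budget.
while True:
    job = client.training.jobs.get(job_id)
    if job.status not in ("pending", "running"):
        break
    if job.resource_usage.cost >= budget_usd:
        client.training.jobs.stop(job_id, save_checkpoint=True, reason="Budget exceeded")
        break
    time.sleep(60)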

Training Stability

  • Start with conservative learning rates and gradually increase
  • Use gradient clipping to prevent exploding gradients
  • Implement proper validation splits to monitor overfitting
  • Save checkpoints frequently to recover from interruptions
Training jobs are billed per second of GPU usage. Jobs in pending status are not billed until they transition to running.
Large training jobs may take several minutes to provision resources. Monitor the job status and be patient during the initial setup phase.

Authorizations

Authorization (string, header, required)

API key authentication. Use the 'Bearer YOUR_API_KEY' format.

Body

application/json

Response

201 - application/json

Training job created successfully. The response body is the created training job object.