Create and launch training jobs for machine learning models on TensorOne's distributed GPU infrastructure by sending a POST request to /v2/training/jobs. Supported frameworks include PyTorch, TensorFlow, Hugging Face Transformers, and custom Docker images.

Request Body

  • name (string, required): Human-readable name for your training job
  • description (string): Optional description of what the training job does
  • framework (string, required): ML framework to use:
      • pytorch - PyTorch with CUDA support
      • tensorflow - TensorFlow 2.x with GPU support
      • huggingface - Hugging Face Transformers
      • custom - Custom Docker image
  • modelConfig (object, required): Model configuration parameters, such as the base model and a repository of custom training code
  • datasetConfig (object, required): Dataset configuration, including the dataset ID, format, and train/validation split
  • hyperparameters (object): Training hyperparameters such as learning rate, batch size, epochs, and optimizer
  • infrastructure (object): Infrastructure requirements such as GPU type and count, memory, and storage
  • checkpointing (object): Model checkpointing configuration (a sketch follows this list)
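
The fields of the checkpointing object are not detailed in this reference. As a rough illustration only, a configuration might look like the following Python dict; every field name here is an assumption, not a confirmed API field:

# Illustrative checkpointing configuration. All field names below are
# assumptions for illustration; this reference does not document the object.
checkpointing_config = {
    "enabled": True,
    "intervalEpochs": 1,   # assumed: save a checkpoint after every epoch
    "keepLast": 3          # assumed: retain only the three most recent checkpoints
}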

Response

  • jobId (string): Unique identifier for the created training job
  • status (string): Initial job status: one of queued, pending, or initializing
  • estimatedStartTime (string): ISO 8601 timestamp of the estimated job start time
  • estimatedDuration (string): Estimated training duration (e.g., "2h 30m")
  • estimatedCost (object): Cost estimate for the training job, broken down into gpuCost, storageCost, and totalCost
  • queuePosition (integer): Current position in the job queue
  • createdAt (string): ISO 8601 timestamp of job creation

Example

curl -X POST "https://api.tensorone.ai/v2/training/jobs" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Llama-2-7B Fine-tuning",
    "description": "Fine-tune Llama-2-7B on custom dialogue dataset",
    "framework": "huggingface",
    "modelConfig": {
      "modelType": "language_model",
      "baseModel": "llama-2-7b-hf",
      "customCode": "https://github.com/your-org/llama-finetuning"
    },
    "datasetConfig": {
      "datasetId": "ds_abc123",
      "format": "json",
      "splitRatio": {"train": 0.8, "validation": 0.2}
    },
    "hyperparameters": {
      "learningRate": 0.0001,
      "batchSize": 16,
      "epochs": 3,
      "optimizer": "adamw"
    },
    "infrastructure": {
      "gpuType": "a100-40gb",
      "gpuCount": 2,
      "memory": "64GB",
      "storage": "500GB"
    }
  }'

The API responds with the job ID, queue position, and a cost estimate:

{
  "jobId": "job_train_abc123",
  "status": "queued",
  "estimatedStartTime": "2024-01-16T10:15:00Z",
  "estimatedDuration": "2h 45m",
  "estimatedCost": {
    "gpuCost": 12.50,
    "storageCost": 2.00,
    "totalCost": 14.50
  },
  "queuePosition": 3,
  "createdAt": "2024-01-15T14:30:00Z"
}
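
The same request can be issued from Python with the requests library; this mirrors the curl example above, with YOUR_API_KEY as the only placeholder:

# Python equivalent of the curl example above, using the requests library.
import requests

API_KEY = "YOUR_API_KEY"

payload = {
    "name": "Llama-2-7B Fine-tuning",
    "description": "Fine-tune Llama-2-7B on custom dialogue dataset",
    "framework": "huggingface",
    "modelConfig": {
        "modelType": "language_model",
        "baseModel": "llama-2-7b-hf",
        "customCode": "https://github.com/your-org/llama-finetuning",
    },
    "datasetConfig": {
        "datasetId": "ds_abc123",
        "format": "json",
        "splitRatio": {"train": 0.8, "validation": 0.2},
    },
    "hyperparameters": {
        "learningRate": 0.0001,
        "batchSize": 16,
        "epochs": 3,
        "optimizer": "adamw",
    },
    "infrastructure": {
        "gpuType": "a100-40gb",
        "gpuCount": 2,
        "memory": "64GB",
        "storage": "500GB",
    },
}

response = requests.post(
    "https://api.tensorone.ai/v2/training/jobs",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,  # requests sets Content-Type: application/json
)
response.raise_for_status()
job = response.json()
print(job["jobId"], job["status"])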

Pre-built Training Templates

Use pre-configured templates for common training scenarios:

Language Model Fine-tuning

template = "llm-finetuning"
config = {
    "baseModel": "llama-2-13b",
    "taskType": "instruction_following",
    "datasetFormat": "alpaca"
}

Computer Vision

template = "image-classification"
config = {
    "baseModel": "efficientnet-b3", 
    "numClasses": 10,
    "imageSize": 224
}

Custom Training

template = "custom-pytorch"
config = {
    "dockerImage": "your-registry/training-image:latest",
    "entrypoint": "python train.py"
}
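
The snippets above name a template and its config but do not show how they are submitted. Below is a minimal sketch, assuming templates are referenced through a top-level "template" field with its settings in "config" when creating a job; these field names are assumptions, not confirmed parts of the request schema:

# Hypothetical sketch of launching a templated job. The "template" and
# "config" request fields are assumptions, not confirmed API fields.
import requests

API_KEY = "YOUR_API_KEY"

def create_job_from_template(name, template, config):
    response = requests.post(
        "https://api.tensorone.ai/v2/training/jobs",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"name": name, "template": template, "config": config},
    )
    response.raise_for_status()
    return response.json()

job = create_job_from_template(
    name="Llama-2-13B instruction tuning",
    template="llm-finetuning",
    config={
        "baseModel": "llama-2-13b",
        "taskType": "instruction_following",
        "datasetFormat": "alpaca",
    },
)
print(job["jobId"], job["status"])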

Monitoring & Callbacks

Set up monitoring and callbacks during training by attaching a monitoring configuration:

# Add monitoring webhooks
monitoring_config = {
    "webhooks": [
        {
            "url": "https://your-app.com/training-webhook",
            "events": ["job_started", "epoch_completed", "job_completed", "job_failed"]
        }
    ],
    "metrics": ["loss", "accuracy", "learning_rate"],
    "logFrequency": "step"  # Log metrics every step
}
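
On the receiving end, a webhook endpoint only needs to accept a JSON POST for each subscribed event. Below is a minimal Flask sketch; the payload field names ("event", "jobId", "metrics") are assumptions for illustration, not a documented schema:

# Minimal webhook receiver sketch (Flask). Payload field names are assumed.
from flask import Flask, request

app = Flask(__name__)

@app.route("/training-webhook", methods=["POST"])
def training_webhook():
    payload = request.get_json(force=True)
    event = payload.get("event")    # e.g. "epoch_completed" (assumed field)
    job_id = payload.get("jobId")   # assumed field
    if event == "job_failed":
        # Surface failures so long-running jobs are not silently lost.
        print(f"ALERT: training job {job_id} failed")
    elif event == "epoch_completed":
        print(f"job {job_id} metrics: {payload.get('metrics')}")
    return "", 204  # acknowledge quickly; defer heavy work to a queue

if __name__ == "__main__":
    app.run(port=8000)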

Best Practices

  • Dataset Preparation: Ensure your dataset is properly formatted and accessible
  • Hyperparameter Tuning: Start with conservative learning rates and batch sizes
  • Checkpointing: Enable checkpointing to prevent data loss from interruptions
  • Resource Estimation: Use smaller experiments to estimate resource requirements
  • Monitoring: Set up proper logging and monitoring for long-running jobs
  • Cost Control: Set maximum training time and cost limits to prevent overruns (see the cost-guard sketch below)

Training jobs are billed by the minute based on GPU type and resources used. Jobs can be paused, resumed, or cancelled at any time through the API or web interface.
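
Since jobs can be cancelled through the API, a simple cost guard can poll a job and cancel it once spend exceeds a budget. The sketch below assumes a GET status endpoint at /v2/training/jobs/{jobId} returning a "currentCost" field and a POST .../cancel endpoint; none of these are confirmed by this reference:

# Hypothetical cost guard. The status endpoint, "currentCost" field, and
# cancel endpoint below are assumptions, not confirmed API surface.
import time
import requests

API_KEY = "YOUR_API_KEY"
BASE = "https://api.tensorone.ai/v2/training/jobs"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def enforce_budget(job_id, max_cost, poll_seconds=60):
    while True:
        job = requests.get(f"{BASE}/{job_id}", headers=HEADERS).json()
        if job.get("status") in ("completed", "failed", "cancelled"):
            return
        if job.get("currentCost", 0.0) > max_cost:
            requests.post(f"{BASE}/{job_id}/cancel", headers=HEADERS)
            print(f"cancelled {job_id}: cost exceeded ${max_cost:.2f}")
            return
        time.sleep(poll_seconds)

enforce_budget("job_train_abc123", max_cost=20.00)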