Get comprehensive information about a training job, including real-time status, training metrics, resource usage, cost tracking, and, when requested, recent log entries.

Path Parameters

  • jobId (string, required): Unique identifier of the training job to retrieve.

Query Parameters

  • includeMetrics (boolean, default: true): Whether to include training metrics in the response.
  • includeLogs (boolean, default: false): Whether to include recent log entries in the response.
  • logLines (integer, default: 100): Number of recent log lines to include (maximum 1000).
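The query parameters can be validated client-side before issuing the request. A minimal sketch in Python (the helper name `job_details_params` and its validation behavior are illustrative, not part of the API):

```python
def job_details_params(include_metrics=True, include_logs=False, log_lines=100):
    """Build the query-parameter dict for the job details endpoint.

    Mirrors the documented defaults and rejects logLines values
    outside the documented maximum of 1000.
    """
    if not 1 <= log_lines <= 1000:
        raise ValueError("logLines must be between 1 and 1000")
    return {
        "includeMetrics": str(include_metrics).lower(),
        "includeLogs": str(include_logs).lower(),
        "logLines": log_lines,
    }
```

Passing the result as `params=` to `requests.get` lets the library URL-encode the values for you.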

Response

  • jobId (string): Unique identifier of the training job.
  • name (string): Human-readable name of the training job.
  • status (string): Current job status: queued, initializing, running, completed, failed, cancelled, or paused.
  • progress (object): Training progress information.
  • metrics (object): Latest training metrics.
  • resourceUsage (object): Current resource utilization.
  • configuration (object): Job configuration details.
  • checkpoints (array): Available model checkpoints.
  • costs (object): Job cost information.
  • timestamps (object): Important timestamps.
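Once the response is parsed, the progress object condenses naturally into a one-line summary. A small sketch (field names follow the response schema above; the helper itself is illustrative):

```python
def summarize_progress(job):
    """Render the progress object from a job-details response as one line."""
    p = job["progress"]
    return (f"epoch {p['currentEpoch']}/{p['totalEpochs']}, "
            f"step {p['currentStep']}/{p['totalSteps']} "
            f"({p['percentComplete']:.1f}%, ~{p['estimatedTimeRemaining']} left)")
```

Fed the example response below, this yields "epoch 2/3, step 450/600 (75.0%, ~32m left)".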

Example

curl -G "https://api.tensorone.ai/v2/training/jobs/job_train_abc123" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d "includeMetrics=true"
{
  "jobId": "job_train_abc123",
  "name": "Llama-2-7B Fine-tuning",
  "status": "running",
  "progress": {
    "currentEpoch": 2,
    "totalEpochs": 3,
    "currentStep": 450,
    "totalSteps": 600,
    "percentComplete": 75.0,
    "estimatedTimeRemaining": "32m"
  },
  "metrics": {
    "loss": 0.8234,
    "accuracy": 0.847,
    "validationLoss": 0.9123,
    "validationAccuracy": 0.821,
    "learningRate": 0.00008,
    "throughput": {
      "samplesPerSecond": 45.2,
      "tokensPerSecond": 2048,
      "stepTime": 1.8
    }
  },
  "resourceUsage": {
    "gpuUtilization": [98.5, 97.8],
    "memoryUsage": {
      "gpuMemory": [38.2, 37.9],
      "systemMemory": 45.6
    },
    "storageUsage": 127.3
  },
  "configuration": {
    "framework": "huggingface",
    "modelConfig": {
      "modelType": "language_model",
      "baseModel": "llama-2-7b-hf"
    },
    "hyperparameters": {
      "learningRate": 0.0001,
      "batchSize": 16,
      "epochs": 3,
      "optimizer": "adamw"
    },
    "infrastructure": {
      "gpuType": "a100-40gb",
      "gpuCount": 2,
      "memory": "64GB"
    }
  },
  "checkpoints": [
    {
      "checkpointId": "ckpt_epoch_1",
      "epoch": 1,
      "step": 600,
      "metrics": {
        "loss": 1.234,
        "validationLoss": 1.345
      },
      "size": "2.3GB",
      "createdAt": "2024-01-15T16:30:00Z"
    }
  ],
  "costs": {
    "currentCost": 18.75,
    "estimatedTotalCost": 25.00,
    "costBreakdown": {
      "gpuCost": 16.25,
      "storageCost": 2.50
    }
  },
  "timestamps": {
    "createdAt": "2024-01-15T14:30:00Z",
    "startedAt": "2024-01-15T15:15:00Z",
    "lastUpdated": "2024-01-15T17:45:00Z"
  }
}

Status Descriptions

  • queued: Job is waiting for available resources
  • initializing: Setting up environment and downloading dependencies
  • running: Active training in progress
  • completed: Training finished successfully
  • failed: Training encountered an error and stopped
  • cancelled: Job was manually cancelled by user
  • paused: Training is temporarily paused (can be resumed)
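The statuses split into two groups: terminal states, after which polling can stop, and active states worth re-checking. A small helper, assuming only the seven status strings listed above:

```python
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}
ACTIVE_STATUSES = {"queued", "initializing", "running", "paused"}

def is_terminal(status):
    """True when the job can no longer change state, so polling can stop."""
    if status not in TERMINAL_STATUSES | ACTIVE_STATUSES:
        raise ValueError(f"unknown status: {status}")
    return status in TERMINAL_STATUSES
```

Note that paused counts as active: a paused job can be resumed, so a monitor should keep polling it.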

Monitoring Training Progress

Use the job details endpoint to build real-time monitoring dashboards:
import time
import requests
import matplotlib.pyplot as plt

def monitor_training(job_id):
    losses = []
    accuracies = []
    
    while True:
        response = requests.get(
            f"https://api.tensorone.ai/v2/training/jobs/{job_id}",
            headers={"Authorization": "Bearer YOUR_API_KEY"}
        )
        
        job = response.json()
        
        if job['status'] == 'running':
            losses.append(job['metrics']['loss'])
            accuracies.append(job['metrics']['accuracy'])
            
            print(f"Epoch {job['progress']['currentEpoch']}: "
                  f"Loss={job['metrics']['loss']:.4f}, "
                  f"Acc={job['metrics']['accuracy']:.3f}")
        
        elif job['status'] in ['completed', 'failed', 'cancelled']:
            print(f"Training {job['status']}")
            break
            
        time.sleep(30)  # Check every 30 seconds
    
    # Plot training curves
    plt.plot(losses, label='Training Loss')
    plt.plot(accuracies, label='Training Accuracy')
    plt.legend()
    plt.show()

monitor_training("job_train_abc123")

Best Practices

  • Regular Monitoring: Check job status every 30-60 seconds for active jobs
  • Cost Tracking: Monitor current costs to avoid budget overruns
  • Checkpoint Management: Regularly review and download important checkpoints
  • Resource Optimization: Use GPU utilization metrics to optimize batch sizes
  • Log Analysis: Enable log inclusion for debugging failed or slow jobs
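The cost-tracking practice can be automated from the costs object in the response. A sketch of a simple budget check (the helper and its return values are illustrative; only the documented currentCost and estimatedTotalCost fields are assumed):

```python
def check_budget(job, budget_usd):
    """Compare a job's reported costs against a budget.

    Returns "ok", "projected_overrun" (the estimated total exceeds the
    budget), or "over_budget" (spend already exceeds the budget).
    """
    costs = job["costs"]
    if costs["currentCost"] > budget_usd:
        return "over_budget"
    if costs["estimatedTotalCost"] > budget_usd:
        return "projected_overrun"
    return "ok"
```

Running this inside a polling loop surfaces a projected overrun early, while there is still time to cancel or adjust the job.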