Get comprehensive information about a training job, including real-time status, training metrics, resource usage, cost tracking, and, when requested, recent log entries.

Path Parameters

  • jobId (string, required): Unique identifier of the training job to retrieve.

Query Parameters

  • includeMetrics (boolean, default: true): Whether to include training metrics in the response.
  • includeLogs (boolean, default: false): Whether to include recent log entries in the response.
  • logLines (integer, default: 100): Number of recent log lines to include (maximum 1000).
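The query parameters can be validated client-side before issuing the request. A minimal sketch in Python (the helper name `job_details_params` and its validation behavior are illustrative, not part of the API):

```python
def job_details_params(include_metrics=True, include_logs=False, log_lines=100):
    """Build the query-parameter dict for the job details endpoint.

    Mirrors the documented defaults and rejects logLines values
    outside the documented maximum of 1000.
    """
    if not 1 <= log_lines <= 1000:
        raise ValueError("logLines must be between 1 and 1000")
    return {
        "includeMetrics": str(include_metrics).lower(),
        "includeLogs": str(include_logs).lower(),
        "logLines": log_lines,
    }
```

Passing the result as `params=` to `requests.get` lets the library URL-encode the values for you.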

Response

  • jobId (string): Unique identifier of the training job.
  • name (string): Human-readable name of the training job.
  • status (string): Current job status: queued, initializing, running, completed, failed, cancelled, or paused.
  • progress (object): Training progress information.
  • metrics (object): Latest training metrics.
  • resourceUsage (object): Current resource utilization.
  • configuration (object): Job configuration details.
  • checkpoints (array): Available model checkpoints.
  • costs (object): Job cost information.
  • timestamps (object): Important timestamps.
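Once the response is parsed, the progress object condenses naturally into a one-line summary. A small sketch (field names follow the response schema above; the helper itself is illustrative):

```python
def summarize_progress(job):
    """Render the progress object from a job-details response as one line."""
    p = job["progress"]
    return (f"epoch {p['currentEpoch']}/{p['totalEpochs']}, "
            f"step {p['currentStep']}/{p['totalSteps']} "
            f"({p['percentComplete']:.1f}%, ~{p['estimatedTimeRemaining']} left)")
```

Fed the example response below, this yields "epoch 2/3, step 450/600 (75.0%, ~32m left)".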

Example

curl -G "https://api.tensorone.ai/v2/training/jobs/job_train_abc123" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d "includeMetrics=true"
{
  "jobId": "job_train_abc123",
  "name": "Llama-2-7B Fine-tuning",
  "status": "running",
  "progress": {
    "currentEpoch": 2,
    "totalEpochs": 3,
    "currentStep": 450,
    "totalSteps": 600,
    "percentComplete": 75.0,
    "estimatedTimeRemaining": "32m"
  },
  "metrics": {
    "loss": 0.8234,
    "accuracy": 0.847,
    "validationLoss": 0.9123,
    "validationAccuracy": 0.821,
    "learningRate": 0.00008,
    "throughput": {
      "samplesPerSecond": 45.2,
      "tokensPerSecond": 2048,
      "stepTime": 1.8
    }
  },
  "resourceUsage": {
    "gpuUtilization": [98.5, 97.8],
    "memoryUsage": {
      "gpuMemory": [38.2, 37.9],
      "systemMemory": 45.6
    },
    "storageUsage": 127.3
  },
  "configuration": {
    "framework": "huggingface",
    "modelConfig": {
      "modelType": "language_model",
      "baseModel": "llama-2-7b-hf"
    },
    "hyperparameters": {
      "learningRate": 0.0001,
      "batchSize": 16,
      "epochs": 3,
      "optimizer": "adamw"
    },
    "infrastructure": {
      "gpuType": "a100-40gb",
      "gpuCount": 2,
      "memory": "64GB"
    }
  },
  "checkpoints": [
    {
      "checkpointId": "ckpt_epoch_1",
      "epoch": 1,
      "step": 600,
      "metrics": {
        "loss": 1.234,
        "validationLoss": 1.345
      },
      "size": "2.3GB",
      "createdAt": "2024-01-15T16:30:00Z"
    }
  ],
  "costs": {
    "currentCost": 18.75,
    "estimatedTotalCost": 25.00,
    "costBreakdown": {
      "gpuCost": 16.25,
      "storageCost": 2.50
    }
  },
  "timestamps": {
    "createdAt": "2024-01-15T14:30:00Z",
    "startedAt": "2024-01-15T15:15:00Z",
    "lastUpdated": "2024-01-15T17:45:00Z"
  }
}

Status Descriptions

  • queued: Job is waiting for available resources
  • initializing: Setting up environment and downloading dependencies
  • running: Active training in progress
  • completed: Training finished successfully
  • failed: Training encountered an error and stopped
  • cancelled: Job was manually cancelled by user
  • paused: Training is temporarily paused (can be resumed)
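The statuses split into two groups: terminal states, after which polling can stop, and active states worth re-checking. A small helper, assuming only the seven status strings listed above:

```python
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}
ACTIVE_STATUSES = {"queued", "initializing", "running", "paused"}

def is_terminal(status):
    """True when the job can no longer change state, so polling can stop."""
    if status not in TERMINAL_STATUSES | ACTIVE_STATUSES:
        raise ValueError(f"unknown status: {status}")
    return status in TERMINAL_STATUSES
```

Note that paused counts as active: a paused job can be resumed, so a monitor should keep polling it.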

Monitoring Training Progress

Use the job details endpoint to build real-time monitoring dashboards:
import time
import requests
import matplotlib.pyplot as plt

def monitor_training(job_id):
    losses = []
    accuracies = []
    
    while True:
        response = requests.get(
            f"https://api.tensorone.ai/v2/training/jobs/{job_id}",
            headers={"Authorization": "Bearer YOUR_API_KEY"}
        )
        
        job = response.json()
        
        if job['status'] == 'running':
            losses.append(job['metrics']['loss'])
            accuracies.append(job['metrics']['accuracy'])
            
            print(f"Epoch {job['progress']['currentEpoch']}: "
                  f"Loss={job['metrics']['loss']:.4f}, "
                  f"Acc={job['metrics']['accuracy']:.3f}")
        
        elif job['status'] in ['completed', 'failed', 'cancelled']:
            print(f"Training {job['status']}")
            break
            
        time.sleep(30)  # Check every 30 seconds
    
    # Plot training curves
    plt.plot(losses, label='Training Loss')
    plt.plot(accuracies, label='Training Accuracy')
    plt.legend()
    plt.show()

monitor_training("job_train_abc123")

Best Practices

  • Regular Monitoring: Check job status every 30-60 seconds for active jobs
  • Cost Tracking: Monitor current costs to avoid budget overruns
  • Checkpoint Management: Regularly review and download important checkpoints
  • Resource Optimization: Use GPU utilization metrics to optimize batch sizes
  • Log Analysis: Enable log inclusion for debugging failed or slow jobs
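The cost-tracking practice can be automated from the costs object in the response. A sketch of a simple budget check (the helper and its return values are illustrative; only the documented currentCost and estimatedTotalCost fields are assumed):

```python
def check_budget(job, budget_usd):
    """Compare a job's reported costs against a budget.

    Returns "ok", "projected_overrun" (the estimated total exceeds the
    budget), or "over_budget" (spend already exceeds the budget).
    """
    costs = job["costs"]
    if costs["currentCost"] > budget_usd:
        return "over_budget"
    if costs["estimatedTotalCost"] > budget_usd:
        return "projected_overrun"
    return "ok"
```

Running this inside a polling loop surfaces a projected overrun early, while there is still time to cancel or adjust the job.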