Path Parameters
Unique identifier of the training job to retrieve
Query Parameters
Whether to include training metrics in the response
Whether to include recent log entries in the response
Number of recent log lines to include (max 1000)
Response
Unique identifier of the training job
Human-readable name of the training job
Current job status:
queued
, initializing
, running
, completed
, failed
, cancelled
, paused
Training progress information
Latest training metrics
Current resource utilization
Job configuration details
Available model checkpoints
Job cost information
Important timestamps
Example
Status Descriptions
queued
: Job is waiting for available resourcesinitializing
: Setting up environment and downloading dependenciesrunning
: Active training in progresscompleted
: Training finished successfullyfailed
: Training encountered an error and stoppedcancelled
: Job was manually cancelled by userpaused
: Training is temporarily paused (can be resumed)
Monitoring Training Progress
Use the job details endpoint to build real-time monitoring dashboards:Best Practices
- Regular Monitoring: Check job status every 30-60 seconds for active jobs
- Cost Tracking: Monitor current costs to avoid budget overruns
- Checkpoint Management: Regularly review and download important checkpoints
- Resource Optimization: Use GPU utilization metrics to optimize batch sizes
- Log Analysis: Enable log inclusion for debugging failed or slow jobs