Manage and interact with model checkpoints created during training. Checkpoints allow you to save model state at specific points and restore or deploy models from those states.

List Checkpoints

Path Parameters

jobId
string
required
Unique identifier of the training job

Query Parameters

includeMetrics
boolean
default:"true"
Whether to include training metrics for each checkpoint
sortBy
string
default:"createdAt"
Field to sort checkpoints by: createdAt, epoch, step, loss, or accuracy. The example request in this section also passes validationLoss.
sortOrder
string
default:"desc"
Sort order: asc or desc
limit
integer
default:"50"
Maximum number of checkpoints to return (1-100)
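
The parameters above can be assembled and checked on the client before a request is sent. The following is a minimal sketch, not part of any official SDK: `build_list_params` is a hypothetical helper, and it assumes the API expects lowercase `true`/`false` for booleans, as in the cURL example. `sortBy` is passed through unchecked because the example request uses `validationLoss`, which the list above does not mention.

```python
from typing import Dict


def build_list_params(
    include_metrics: bool = True,
    sort_by: str = "createdAt",
    sort_order: str = "desc",
    limit: int = 50,
) -> Dict[str, str]:
    """Validate list options and return them as query-string parameters.

    Defaults mirror the documented defaults. This helper is hypothetical.
    """
    if sort_order not in ("asc", "desc"):
        raise ValueError("sortOrder must be 'asc' or 'desc'")
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    return {
        # Assumption: booleans are sent as lowercase strings, as in the cURL example
        "includeMetrics": "true" if include_metrics else "false",
        "sortBy": sort_by,
        "sortOrder": sort_order,
        "limit": str(limit),
    }


params = build_list_params(sort_by="loss", sort_order="asc", limit=10)
print(params)
```

The returned dictionary can be passed as the `params` argument of `requests.get`.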

Response

checkpoints
array
Array of checkpoint objects
totalCount
integer
Total number of checkpoints for this job
summary
object
Summary statistics about checkpoints

Examples

curl -X GET "https://api.tensorone.ai/v2/training/jobs/job_train_abc123/checkpoints" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -G \
  -d "includeMetrics=true" \
  -d "sortBy=validationLoss" \
  -d "sortOrder=asc" \
  -d "limit=10"
{
  "checkpoints": [
    {
      "checkpointId": "ckpt_best_abc123",
      "name": "Best Validation - Epoch 15",
      "epoch": 15,
      "step": 9000,
      "metrics": {
        "loss": 0.6234,
        "accuracy": 0.891,
        "validationLoss": 0.7123,
        "validationAccuracy": 0.876
      },
      "size": "2.3GB",
      "type": "best",
      "downloadUrl": "https://checkpoints.tensorone.ai/signed/ckpt_best_abc123?expires=...",
      "deployable": true,
      "createdAt": "2024-01-15T20:30:00Z"
    },
    {
      "checkpointId": "ckpt_epoch_20",
      "name": "Epoch 20 Checkpoint",
      "epoch": 20,
      "step": 12000,
      "metrics": {
        "loss": 0.5876,
        "accuracy": 0.903,
        "validationLoss": 0.7456,
        "validationAccuracy": 0.864
      },
      "size": "2.3GB",
      "type": "automatic",
      "downloadUrl": "https://checkpoints.tensorone.ai/signed/ckpt_epoch_20?expires=...",
      "deployable": true,
      "createdAt": "2024-01-15T22:15:00Z"
    }
  ],
  "totalCount": 12,
  "summary": {
    "bestCheckpoint": {
      "checkpointId": "ckpt_best_abc123",
      "metrics": {
        "validationLoss": 0.7123,
        "validationAccuracy": 0.876
      }
    },
    "latestCheckpoint": {
      "checkpointId": "ckpt_epoch_20",
      "createdAt": "2024-01-15T22:15:00Z"
    },
    "totalSize": "27.6GB"
  }
}
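
Once the list response is parsed, selecting a checkpoint is a plain dictionary operation. The sketch below picks the checkpoint with the lowest `validationLoss` from a trimmed copy of the example response. `best_by_validation_loss` is a hypothetical helper name, and it skips entries without metrics, which is assumed to be what a request with `includeMetrics=false` would return.

```python
from typing import Any, Dict, Optional


def best_by_validation_loss(listing: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the checkpoint with the lowest validationLoss, or None if none has one."""
    candidates = [
        ckpt
        for ckpt in listing.get("checkpoints", [])
        if "validationLoss" in (ckpt.get("metrics") or {})
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda ckpt: ckpt["metrics"]["validationLoss"])


# Trimmed copy of the example response above
sample = {
    "checkpoints": [
        {
            "checkpointId": "ckpt_best_abc123",
            "type": "best",
            "metrics": {"validationLoss": 0.7123, "validationAccuracy": 0.876},
        },
        {
            "checkpointId": "ckpt_epoch_20",
            "type": "automatic",
            "metrics": {"validationLoss": 0.7456, "validationAccuracy": 0.864},
        },
    ],
    "totalCount": 12,
}

best = best_by_validation_loss(sample)
print(best["checkpointId"])  # ckpt_best_abc123
```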

Get Specific Checkpoint

Get detailed information about a specific checkpoint:
cURL
curl -X GET "https://api.tensorone.ai/v2/training/checkpoints/ckpt_best_abc123" \
  -H "Authorization: Bearer YOUR_API_KEY"
Python
import requests

# Get specific checkpoint details
response = requests.get(
    "https://api.tensorone.ai/v2/training/checkpoints/ckpt_best_abc123",
    headers={"Authorization": "Bearer YOUR_API_KEY"}
)
response.raise_for_status()  # Raise on HTTP errors instead of parsing an error body

checkpoint = response.json()
print(f"Checkpoint: {checkpoint['name']}")
print(f"Created: {checkpoint['createdAt']}")
print(f"Validation Accuracy: {checkpoint['metrics']['validationAccuracy']:.3f}")

Create Manual Checkpoint

Create a checkpoint manually during training:
cURL
curl -X POST "https://api.tensorone.ai/v2/training/jobs/job_train_abc123/checkpoints" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Manual Checkpoint - Good Performance",
    "description": "Saving model state after observing good validation metrics"
  }'
Python
import requests

# Create manual checkpoint for running job
response = requests.post(
    "https://api.tensorone.ai/v2/training/jobs/job_train_abc123/checkpoints",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json"
    },
    json={
        "name": "Experiment Milestone",
        "description": "Checkpoint before hyperparameter adjustment"
    }
)
response.raise_for_status()  # Raise on HTTP errors instead of parsing an error body

checkpoint = response.json()
print(f"Created checkpoint: {checkpoint['checkpointId']}")

Deploy Checkpoint as Endpoint

Deploy a checkpoint directly as a serverless endpoint:
Python
import requests

# Deploy best checkpoint as endpoint
response = requests.post(
    "https://api.tensorone.ai/v2/endpoints",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json"
    },
    json={
        "name": "My Fine-tuned Model",
        "checkpointId": "ckpt_best_abc123",
        "gpuIds": ["rtx-4090"],
        "workerCount": 1
    }
)
response.raise_for_status()  # Raise on HTTP errors instead of parsing an error body

endpoint = response.json()
print(f"Deployed endpoint: {endpoint['id']}")
print(f"Endpoint URL: {endpoint['url']}")
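
Checkpoint objects in the list response carry a `deployable` flag, so the deployment request body can be built from a checkpoint object and checked first. This is a minimal sketch: `build_deploy_payload` is a hypothetical helper, and it assumes that `deployable: false` means the checkpoint cannot be deployed. The source only shows the field, not its exact semantics.

```python
from typing import Any, Dict, List


def build_deploy_payload(
    checkpoint: Dict[str, Any],
    name: str,
    gpu_ids: List[str],
    worker_count: int = 1,
) -> Dict[str, Any]:
    """Build the JSON body for POST /v2/endpoints from a checkpoint object."""
    # Assumption: a checkpoint without deployable=true should not be deployed
    if not checkpoint.get("deployable", False):
        raise ValueError(
            f"checkpoint {checkpoint.get('checkpointId')!r} is not marked deployable"
        )
    return {
        "name": name,
        "checkpointId": checkpoint["checkpointId"],
        "gpuIds": list(gpu_ids),
        "workerCount": worker_count,
    }


payload = build_deploy_payload(
    {"checkpointId": "ckpt_best_abc123", "deployable": True},
    name="My Fine-tuned Model",
    gpu_ids=["rtx-4090"],
)
print(payload)
```

The resulting dictionary can be passed as the `json` argument of `requests.post`, as in the example above.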

Delete Checkpoint

Delete a checkpoint to free up storage:
cURL
curl -X DELETE "https://api.tensorone.ai/v2/training/checkpoints/ckpt_old_abc123" \
  -H "Authorization: Bearer YOUR_API_KEY"
Python
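import requests

# Delete old or unnecessary checkpoints
response = requests.delete(
    "https://api.tensorone.ai/v2/training/checkpoints/ckpt_old_abc123",
    headers={"Authorization": "Bearer YOUR_API_KEY"}
)

# 204 No Content indicates the checkpoint was removed
if response.status_code == 204:
    print("Checkpoint deleted successfully")
else:
    print(f"Delete failed with status {response.status_code}")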

Checkpoint Types

Automatic Checkpoints

  • Created automatically based on your training configuration
  • Typically saved at the end of each epoch
  • Named with epoch number (e.g., “Epoch 5 Checkpoint”)

Best Checkpoints

  • Automatically saved when validation metrics improve
  • Only the best performing checkpoint is kept
  • Overwritten when a better checkpoint is found

Manual Checkpoints

  • Created on-demand via API or web interface
  • Useful for saving state at specific experimental milestones
  • Custom names and descriptions

Final Checkpoints

  • Created when training completes or is cancelled
  • Represents the final state of the model
  • Always preserved unless explicitly deleted
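
Each checkpoint object carries a `type` field, so a listing can be grouped by the categories above. The sketch below assumes the type strings are `automatic`, `best`, `manual`, and `final`. Only `best` and `automatic` appear in the example response and `final` in the cleanup code on this page; `manual` is a guess. `group_by_type` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def group_by_type(checkpoints: Iterable[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map each checkpoint type to the IDs of the checkpoints of that type."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for ckpt in checkpoints:
        # Entries without a type field are collected under "unknown"
        groups[ckpt.get("type", "unknown")].append(ckpt["checkpointId"])
    return dict(groups)


checkpoints = [
    {"checkpointId": "ckpt_best_abc123", "type": "best"},
    {"checkpointId": "ckpt_epoch_19", "type": "automatic"},
    {"checkpointId": "ckpt_epoch_20", "type": "automatic"},
]
print(group_by_type(checkpoints))
```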

Checkpoint Management Best Practices

Storage Optimization

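import requests

API_BASE = "https://api.tensorone.ai/v2/training"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Regularly clean up old automatic checkpoints
def cleanup_old_checkpoints(job_id, keep_count=5):
    # List newest first. The list call needs the Authorization header too,
    # and limit=100 (the documented maximum) avoids stopping at the default of 50.
    response = requests.get(
        f"{API_BASE}/jobs/{job_id}/checkpoints",
        headers=HEADERS,
        params={"sortBy": "createdAt", "sortOrder": "desc", "limit": 100}
    )
    response.raise_for_status()

    checkpoints = response.json()['checkpoints']

    # Keep best and final checkpoints, plus the keep_count most recent
    # automatic ones. Other types (e.g. manual) are never deleted here.
    to_delete = []
    automatic_count = 0

    for ckpt in checkpoints:
        if ckpt['type'] in ['best', 'final']:
            continue
        elif ckpt['type'] == 'automatic':
            automatic_count += 1
            if automatic_count > keep_count:
                to_delete.append(ckpt['checkpointId'])

    # Delete old checkpoints, stopping at the first failed request
    for ckpt_id in to_delete:
        requests.delete(
            f"{API_BASE}/checkpoints/{ckpt_id}",
            headers=HEADERS
        ).raise_for_status()

    return to_delete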
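
Before deleting anything, it can help to estimate how much storage a cleanup would reclaim. The sketch below sums the `size` field of the checkpoints selected for deletion. It assumes sizes are always formatted like `"2.3GB"` with decimal units, which is inferred from the example response rather than documented, and both helper names are hypothetical. Treat the result as an upper bound, since stored checkpoints may share data.

```python
import re
from decimal import Decimal
from typing import Any, Dict, Iterable, Set

# Assumption: decimal (power-of-ten) units, as suggested by "2.3GB" in the example
_UNIT_BYTES = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def parse_size(size: str) -> int:
    """Convert a size string such as '2.3GB' to a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(KB|MB|GB|TB)\s*", size)
    if match is None:
        raise ValueError(f"unrecognized size string: {size!r}")
    value, unit = match.groups()
    # Decimal avoids float rounding error when scaling values like 2.3
    return int(Decimal(value) * _UNIT_BYTES[unit])


def estimate_reclaimed_bytes(
    checkpoints: Iterable[Dict[str, Any]], ids_to_delete: Set[str]
) -> int:
    """Sum the reported sizes of the checkpoints whose IDs are in ids_to_delete."""
    return sum(
        parse_size(ckpt["size"])
        for ckpt in checkpoints
        if ckpt["checkpointId"] in ids_to_delete
    )


listing = [
    {"checkpointId": "ckpt_epoch_19", "size": "2.3GB"},
    {"checkpointId": "ckpt_epoch_20", "size": "2.3GB"},
]
print(estimate_reclaimed_bytes(listing, {"ckpt_epoch_19"}))  # 2300000000
```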

Backup Important Checkpoints

import os
import requests

# Download and backup critical checkpoints
def backup_checkpoint(checkpoint_id, local_path):
    # Get checkpoint details
    response = requests.get(
        f"https://api.tensorone.ai/v2/training/checkpoints/{checkpoint_id}",
        headers={"Authorization": "Bearer YOUR_API_KEY"}
    )
    response.raise_for_status()

    checkpoint = response.json()
    download_url = checkpoint['downloadUrl']

    # Make sure the target directory exists before writing
    os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)

    # Stream the file to disk in chunks so it is never held fully in memory
    with requests.get(download_url, stream=True) as checkpoint_response:
        checkpoint_response.raise_for_status()
        with open(local_path, 'wb') as f:
            for chunk in checkpoint_response.iter_content(chunk_size=8192):
                f.write(chunk)

    print(f"Backed up checkpoint to {local_path}")

# Backup best checkpoint
backup_checkpoint("ckpt_best_abc123", "./backups/best_model.ckpt")

Checkpoints are automatically compressed and deduplicated to minimize storage costs. Similar model states share common data blocks to reduce overall storage usage.

Checkpoint download URLs expire after 1 hour for security. Generate new URLs as needed or download immediately after getting the URL.
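
Because download URLs expire after 1 hour, a client that stores a URL for later use may want to check its age before downloading. The following is a rough sketch under two assumptions: that the hour is counted from the moment the checkpoint details were fetched, and that fetching the details again yields a fresh URL. The `SAFETY_MARGIN` value and the function name are illustrative only.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

URL_LIFETIME = timedelta(hours=1)      # From the note above
SAFETY_MARGIN = timedelta(minutes=5)   # Hypothetical buffer for clock skew and slow starts


def url_is_usable(fetched_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if a URL fetched at fetched_at should still be valid at now."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - fetched_at < URL_LIFETIME - SAFETY_MARGIN


fetched = datetime(2024, 1, 15, 20, 30, tzinfo=timezone.utc)
print(url_is_usable(fetched, now=fetched + timedelta(minutes=30)))  # True
print(url_is_usable(fetched, now=fetched + timedelta(minutes=58)))  # False
```

When the check fails, request the checkpoint details again and use the `downloadUrl` from the new response.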