Cancel a training job that is queued, initializing, running, or paused. Cancellation shuts the training process down gracefully and then releases the allocated resources. Until that finishes, the job reports the status cancelling.

Path Parameters

jobId
string
required
Unique identifier of the training job to cancel

Request Body

reason
string
Optional reason for cancelling the job (for logging and analysis)
preserveCheckpoints
boolean
default: true
Whether to keep existing checkpoints after cancellation
createFinalCheckpoint
boolean
default: false
Whether to create a final checkpoint before cancelling (if job is running)
force
boolean
default: false
Force cancellation even if the job is in a transitional state
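
As a rough sketch of how these parameters fit together, the helper below assembles the URL, headers, and JSON body for a cancel call without sending anything. The function name `build_cancel_request` and the `BASE_URL` constant are illustrative assumptions, not part of the API. The defaults mirror the ones listed above.

```python
import json

# Base URL taken from the example request in this page.
BASE_URL = "https://api.tensorone.ai/v2"


def build_cancel_request(job_id, api_key, reason=None, preserve_checkpoints=True,
                         create_final_checkpoint=False, force=False):
    """Return (url, headers, body) for a cancel call. Hypothetical helper; sends nothing."""
    payload = {
        "preserveCheckpoints": preserve_checkpoints,
        "createFinalCheckpoint": create_final_checkpoint,
        "force": force,
    }
    if reason is not None:  # reason is optional, so leave it out when not given
        payload["reason"] = reason
    url = f"{BASE_URL}/training/jobs/{job_id}/cancel"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    return url, headers, json.dumps(payload)


url, headers, body = build_cancel_request(
    "job_train_abc123", "YOUR_API_KEY",
    reason="Early stopping due to satisfactory results",
    create_final_checkpoint=True,
)
print(url)
print(body)
```

The returned values can be passed to any HTTP client, for example as the `url`, `headers`, and `data` arguments of `requests.post`.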

Response

jobId
string
ID of the cancelled training job
status
string
New job status after cancellation: cancelling or cancelled
cancellationDetails
object
Details about the cancellation process
finalMetrics
object
Final training metrics at the time of cancellation
costSummary
object
Final cost information
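
The sketch below shows one way to pull the commonly needed values out of a parsed response. The nested field names (`resourcesReleased`, `finalCheckpointId`, `finalLoss`, `totalCost`) are taken from the example response on this page rather than from a documented schema, so treat them as assumptions. `summarize_cancellation` is a hypothetical helper name.

```python
def summarize_cancellation(response):
    """Flatten a cancel response dict into the fields a caller usually checks."""
    details = response.get("cancellationDetails", {})
    metrics = response.get("finalMetrics", {})
    cost = response.get("costSummary", {})
    return {
        "job_id": response["jobId"],
        # "cancelling" means shutdown is still in progress
        "finished": response["status"] == "cancelled",
        "resources_released": details.get("resourcesReleased", False),
        "checkpoint_id": details.get("finalCheckpointId"),
        "final_loss": metrics.get("finalLoss"),
        "total_cost": cost.get("totalCost"),
    }


# Trimmed copy of the example response shown on this page.
example_response = {
    "jobId": "job_train_abc123",
    "status": "cancelling",
    "cancellationDetails": {
        "finalCheckpointCreated": True,
        "finalCheckpointId": "ckpt_final_abc123",
        "resourcesReleased": False,
    },
    "finalMetrics": {"finalLoss": 0.7823},
    "costSummary": {"totalCost": 18.75},
}

summary = summarize_cancellation(example_response)
print(summary["finished"], summary["checkpoint_id"])
```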

Example

curl -X POST "https://api.tensorone.ai/v2/training/jobs/job_train_abc123/cancel" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "reason": "Early stopping due to satisfactory results",
    "preserveCheckpoints": true,
    "createFinalCheckpoint": true
  }'

{
  "jobId": "job_train_abc123",
  "status": "cancelling",
  "cancellationDetails": {
    "cancelledAt": "2024-01-15T18:30:00Z",
    "reason": "Early stopping due to satisfactory results",
    "finalCheckpointCreated": true,
    "finalCheckpointId": "ckpt_final_abc123",
    "resourcesReleased": false
  },
  "finalMetrics": {
    "finalEpoch": 2,
    "finalStep": 1250,
    "finalLoss": 0.7823,
    "finalAccuracy": 0.856,
    "trainingDuration": "2h 15m"
  },
  "costSummary": {
    "totalCost": 18.75,
    "costBreakdown": {
      "gpuCost": 16.25,
      "storageCost": 2.50
    },
    "estimatedSavings": 6.25
  }
}

Cancellation Process

The cancellation process follows these steps:
  1. Validation: Check that the job can be cancelled (queued, initializing, running, or paused jobs only)
  2. Final Checkpoint: Create final checkpoint if requested and job is running
  3. Graceful Shutdown: Stop training process gracefully to preserve data integrity
  4. Resource Release: Release allocated GPUs, memory, and storage
  5. Status Update: Update job status to cancelled
  6. Cleanup: Remove temporary files (checkpoints are preserved if requested)
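
Because the steps above take time, a client may want to wait until the status moves from cancelling to cancelled. The following is a minimal polling sketch. It takes a caller-supplied `fetch_status` callable because this page does not document a status endpoint; the function name, the timeout, and the polling interval are all assumptions.

```python
import time


def wait_until_cancelled(fetch_status, timeout_s=300.0, interval_s=5.0, sleep=time.sleep):
    """Poll fetch_status() until it returns "cancelled".

    Returns True once the job is cancelled, or False if timeout_s elapses first.
    fetch_status is any zero-argument callable returning the job's status string.
    """
    waited = 0.0
    while True:
        if fetch_status() == "cancelled":
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s


# Demonstration with a fake status source and a no-op sleep.
fake_statuses = iter(["cancelling", "cancelling", "cancelled"])
done = wait_until_cancelled(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(done)
```

Passing `sleep` as a parameter keeps the loop testable without real delays. In real use the default `time.sleep` applies.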

Job States and Cancellation

Current Status   Can Cancel   Behavior
queued           ✅ Yes       Immediate cancellation, no resources to release
initializing     ✅ Yes       Stop initialization, release resources
running          ✅ Yes       Graceful shutdown, optional final checkpoint
paused           ✅ Yes       Cancel from paused state
completed        ❌ No        Job already finished
failed           ❌ No        Job already terminated
cancelled        ❌ No        Job already cancelled
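
A client can check the table above locally before calling the endpoint. This is a small sketch; the set name and `can_cancel` are illustrative, and any status not listed in the table is treated as not cancellable.

```python
# Statuses the table above marks as cancellable.
CANCELLABLE_STATUSES = frozenset({"queued", "initializing", "running", "paused"})


def can_cancel(status):
    """Return True if a job in this status accepts a normal cancel request."""
    return status in CANCELLABLE_STATUSES


for status in ("queued", "running", "completed", "cancelled"):
    print(status, can_cancel(status))
```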

Force Cancellation

Use the force parameter for jobs that are stuck in transitional states:

import requests

# Force cancel a stuck job; job_id holds the ID of the job to cancel
response = requests.post(
    f"https://api.tensorone.ai/v2/training/jobs/{job_id}/cancel",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"force": True, "reason": "Job stuck in transitional state"},
)

Force cancellation may result in data loss and should only be used when normal cancellation fails.
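
One way to follow that advice is to try a normal cancellation first and send force only when it fails. The sketch below assumes a caller-supplied `send_cancel(payload)` callable that raises an exception when the request is rejected. How the API actually signals a failed cancellation is not documented on this page, so that behavior is an assumption.

```python
def cancel_with_fallback(send_cancel, reason):
    """Try a normal cancel; retry once with force=True only if the first attempt raises."""
    try:
        return send_cancel({"reason": reason})
    except Exception as exc:  # assumption: send_cancel raises on a rejected request
        forced_reason = f"{reason} (forced after: {exc})"
        return send_cancel({"reason": forced_reason, "force": True})


# Demonstration with a fake sender that rejects every non-forced request.
calls = []


def fake_send(payload):
    calls.append(payload)
    if not payload.get("force"):
        raise RuntimeError("job is in a transitional state")
    return {"status": "cancelling"}


result = cancel_with_fallback(fake_send, "Job stuck")
print(result["status"], len(calls))
```

Recording the original failure in the reason keeps the forced cancellation traceable in later analysis.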

Checkpoint Management

When cancelling a job, you have several checkpoint options:

Preserve Existing Checkpoints

{
    "preserveCheckpoints": True,  # Keep all existing checkpoints
    "createFinalCheckpoint": False  # Don't create new checkpoint
}

Create Final Checkpoint

{
    "preserveCheckpoints": True,   # Keep existing + create final
    "createFinalCheckpoint": True  # Save current model state
}

Clean Cancellation

{
    "preserveCheckpoints": False,  # Remove all checkpoints
    "createFinalCheckpoint": False # No final checkpoint
}
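
The three options can be kept in one lookup table so that callers pick one by name. This is a sketch; the labels "preserve", "final", and "clean" and the helper `checkpoint_payload` are hypothetical and only mirror the headings above.

```python
# One entry per option described above; the keys are illustrative labels.
CHECKPOINT_OPTIONS = {
    "preserve": {"preserveCheckpoints": True, "createFinalCheckpoint": False},
    "final": {"preserveCheckpoints": True, "createFinalCheckpoint": True},
    "clean": {"preserveCheckpoints": False, "createFinalCheckpoint": False},
}


def checkpoint_payload(option, reason=None):
    """Build the request body for the chosen checkpoint option."""
    payload = dict(CHECKPOINT_OPTIONS[option])  # copy so the table is never mutated
    if reason is not None:
        payload["reason"] = reason
    return payload


print(checkpoint_payload("final", reason="Early stopping"))
```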

Cost Implications

Cancelling a job has the following cost implications:
  • Incurred Costs: You pay for resources used up to the cancellation point
  • No Future Charges: No additional charges after successful cancellation
  • Checkpoint Storage: Preserved checkpoints continue to incur storage costs
  • Early Termination: No penalties for early cancellation
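
In the example response, the cost breakdown adds up to the total: 16.25 (GPU) + 2.50 (storage) = 18.75. A client could run the same check on any response, as sketched below. That the breakdown always sums to the total is an assumption drawn from that one example, and `breakdown_matches_total` is a hypothetical helper.

```python
import math


def breakdown_matches_total(cost_summary, tolerance=0.005):
    """Check that the costBreakdown entries sum to totalCost, within half a cent."""
    parts = cost_summary["costBreakdown"].values()
    return math.isclose(sum(parts), cost_summary["totalCost"], abs_tol=tolerance)


# costSummary copied from the example response on this page.
example_cost = {
    "totalCost": 18.75,
    "costBreakdown": {"gpuCost": 16.25, "storageCost": 2.50},
    "estimatedSavings": 6.25,
}
print(breakdown_matches_total(example_cost))
```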

Common Use Cases

Early Stopping

# Cancel when validation metrics plateau
if validation_loss_improvement < threshold:
    cancel_response = requests.post(
        f"https://api.tensorone.ai/v2/training/jobs/{job_id}/cancel",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={
            "reason": "Early stopping - validation loss plateaued",
            "createFinalCheckpoint": True
        }
    )
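
The snippet above leaves open how `validation_loss_improvement` is computed. One possible definition, sketched here as an assumption, compares the best loss in the most recent `patience` epochs with the best loss seen before them. The function name, the `patience` window, and the sample values are all illustrative.

```python
def loss_improvement(val_losses, patience=3):
    """Return how much the best recent loss improved on the best earlier loss.

    Compares the minimum of the last `patience` values with the minimum of all
    earlier values. Returns infinity when there is too little history to judge.
    """
    if len(val_losses) <= patience:
        return float("inf")
    best_before = min(val_losses[:-patience])
    best_recent = min(val_losses[-patience:])
    return best_before - best_recent


# Made-up validation losses, one per epoch.
history = [1.20, 0.95, 0.80, 0.79, 0.795, 0.792]
validation_loss_improvement = loss_improvement(history, patience=3)
threshold = 0.02
print(validation_loss_improvement < threshold)
```

A small or negative value means recent epochs brought little or no improvement, which is the condition the early-stopping check tests for.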

Resource Reallocation

# Cancel lower priority job to free resources
cancel_response = requests.post(
    f"https://api.tensorone.ai/v2/training/jobs/{low_priority_job_id}/cancel",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "reason": "Reallocating resources to higher priority job",
        "preserveCheckpoints": True
    }
)

Hyperparameter Adjustment

# Cancel to restart with better hyperparameters
cancel_response = requests.post(
    f"https://api.tensorone.ai/v2/training/jobs/{job_id}/cancel",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "reason": "Restarting with optimized hyperparameters",
        "createFinalCheckpoint": True
    }
)

Best Practices

  • Always provide a reason for cancellation to help with analysis and debugging
  • Create final checkpoints for running jobs to preserve training progress
  • Monitor the cancellation process as it may take a few minutes to complete
  • Clean up unused checkpoints periodically to manage storage costs
  • Use force cancellation sparingly and only when normal cancellation fails

Cancelled jobs remain in your job history for 30 days before being permanently removed. Checkpoints are preserved according to your retention settings.