Cancel Training Job

Cancel a training job that is queued, initializing, running, or paused. Cancellation shuts the training process down gracefully and then releases allocated resources, so the job can report `cancelling` for a few minutes before it reaches `cancelled`.

Path Parameters

Parameter	Type	Required	Description
`jobId`	string	Yes	Unique identifier of the training job to cancel

Request Body

Field	Type	Default	Description
`reason`	string	(none)	Optional reason for cancelling the job (for logging and analysis)
`preserveCheckpoints`	boolean	`true`	Whether to keep existing checkpoints after cancellation
`createFinalCheckpoint`	boolean	`false`	Whether to create a final checkpoint before cancelling (if the job is running)
`force`	boolean	`false`	Force cancellation even if the job is in a transitional state

Response

Field	Type	Description
`jobId`	string	ID of the cancelled training job
`status`	string	New job status after cancellation: `cancelling` or `cancelled`
`cancellationDetails`	object	Details about the cancellation process
`cancellationDetails.cancelledAt`	string	ISO 8601 timestamp when cancellation was initiated
`cancellationDetails.reason`	string	User-provided reason for cancellation
`cancellationDetails.finalCheckpointCreated`	boolean	Whether a final checkpoint was created
`cancellationDetails.finalCheckpointId`	string	ID of the final checkpoint (if created)
`cancellationDetails.resourcesReleased`	boolean	Whether allocated resources have been released
`finalMetrics`	object	Final training metrics at the time of cancellation
`finalMetrics.finalEpoch`	integer	Last completed epoch
`finalMetrics.finalStep`	integer	Last completed training step
`finalMetrics.finalLoss`	number	Training loss at cancellation
`finalMetrics.finalAccuracy`	number	Training accuracy at cancellation (if applicable)
`finalMetrics.trainingDuration`	string	Total training time before cancellation
`costSummary`	object	Final cost information
`costSummary.totalCost`	number	Total cost incurred in USD
`costSummary.costBreakdown`	object	Detailed cost breakdown by resource type
`costSummary.estimatedSavings`	number	Estimated cost savings from early cancellation in USD
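
The response fields above can be condensed into a short status summary. The following is a minimal sketch, not part of the API: `summarize_cancellation` and the keys of the dictionary it returns are hypothetical names chosen for illustration, and only the response field names come from this page.

```python
def summarize_cancellation(response):
    """Pick out the fields most callers need from a cancel response (a dict)."""
    details = response.get("cancellationDetails", {})
    cost = response.get("costSummary", {})
    # finalCheckpointId is only meaningful when a final checkpoint was created.
    checkpoint_id = (
        details.get("finalCheckpointId")
        if details.get("finalCheckpointCreated")
        else None
    )
    return {
        "job_id": response["jobId"],
        # "cancelling" means shutdown is still in progress.
        "finished": response["status"] == "cancelled",
        "resources_released": details.get("resourcesReleased", False),
        "final_checkpoint_id": checkpoint_id,
        "total_cost_usd": cost.get("totalCost"),
    }
```

A caller might keep polling the job while `finished` is false and only reuse the GPU quota once `resources_released` is true.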

Example

curl -X POST "https://api.tensorone.ai/v2/training/jobs/job_train_abc123/cancel" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "reason": "Early stopping due to satisfactory results",
    "preserveCheckpoints": true,
    "createFinalCheckpoint": true
  }'

{
  "jobId": "job_train_abc123",
  "status": "cancelling",
  "cancellationDetails": {
    "cancelledAt": "2024-01-15T18:30:00Z",
    "reason": "Early stopping due to satisfactory results",
    "finalCheckpointCreated": true,
    "finalCheckpointId": "ckpt_final_abc123",
    "resourcesReleased": false
  },
  "finalMetrics": {
    "finalEpoch": 2,
    "finalStep": 1250,
    "finalLoss": 0.7823,
    "finalAccuracy": 0.856,
    "trainingDuration": "2h 15m"
  },
  "costSummary": {
    "totalCost": 18.75,
    "costBreakdown": {
      "gpuCost": 16.25,
      "storageCost": 2.50
    },
    "estimatedSavings": 6.25
  }
}
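
For readers who prefer Python to curl, the same request can be assembled with the standard library. This is a sketch under stated assumptions: `build_cancel_request` and `cancel_job` are hypothetical helper names, and the URL, headers, and body fields are copied from the curl example above.

```python
import json
import urllib.request

API_BASE = "https://api.tensorone.ai/v2"  # base URL taken from the curl example


def build_cancel_request(job_id, api_key, reason=None, preserve_checkpoints=True,
                         create_final_checkpoint=False, force=False):
    """Build (but do not send) the POST request for the cancel endpoint."""
    payload = {
        "preserveCheckpoints": preserve_checkpoints,
        "createFinalCheckpoint": create_final_checkpoint,
        "force": force,
    }
    if reason is not None:  # reason is optional, so omit it when not given
        payload["reason"] = reason
    return urllib.request.Request(
        url=f"{API_BASE}/training/jobs/{job_id}/cancel",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def cancel_job(request, timeout_s=30):
    """Send the prepared request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Separating construction from sending keeps the payload easy to inspect or log before the job is actually cancelled.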

Cancellation Process

The cancellation process follows these steps:

1. Validation: Check that the job can be cancelled (queued, initializing, running, or paused jobs only)
2. Final Checkpoint: Create a final checkpoint if requested and the job is running
3. Graceful Shutdown: Stop the training process gracefully to preserve data integrity
4. Resource Release: Release allocated GPUs, memory, and storage
5. Status Update: Update the job status to `cancelled`
6. Cleanup: Remove temporary files (checkpoints are preserved if requested)
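
Because these steps run asynchronously, a client usually waits for the status to move from `cancelling` to `cancelled`. The sketch below shows one way to poll. It assumes a caller-supplied `fetch_status` function (hypothetical, for example a wrapper around whatever job-status lookup you use) that returns the current status string; the timeout and interval defaults are illustrative, not values from this API.

```python
import time


def wait_for_cancellation(fetch_status, timeout_s=600.0, interval_s=10.0,
                          sleep=time.sleep, clock=time.monotonic):
    """Poll until the job reports "cancelled".

    Returns True once cancelled, False if timeout_s elapses first.
    Raises RuntimeError if the job reports any status other than
    "cancelling" or "cancelled".
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status == "cancelled":
            return True
        if status != "cancelling":
            raise RuntimeError(f"unexpected job status: {status!r}")
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Passing `sleep` and `clock` as parameters is a design choice that lets the loop be exercised in tests without real delays.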

Job States and Cancellation

Current Status	Can Cancel	Behavior
`queued`	✅ Yes	Immediate cancellation, no resources to release
`initializing`	✅ Yes	Stop initialization, release resources
`running`	✅ Yes	Graceful shutdown, optional final checkpoint
`paused`	✅ Yes	Cancel from paused state
`completed`	❌ No	Job already finished
`failed`	❌ No	Job already terminated
`cancelled`	❌ No	Job already cancelled
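
A client can mirror this table to avoid sending requests that are certain to be rejected. This is a small sketch; `can_cancel` is a hypothetical helper name, and the status strings are the ones listed in the table.

```python
# Statuses the table above marks as cancellable.
CANCELLABLE_STATUSES = frozenset({"queued", "initializing", "running", "paused"})


def can_cancel(status):
    """Return True if a job in this status accepts a normal cancel request."""
    return status in CANCELLABLE_STATUSES
```

For example, `can_cancel("paused")` is true, while `can_cancel("completed")` is false.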

Force Cancellation

Use the force parameter for jobs that are stuck in transitional states:

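import requests

job_id = "job_train_abc123"  # ID of the stuck job

# Force cancel a stuck job
response = requests.post(
    f"https://api.tensorone.ai/v2/training/jobs/{job_id}/cancel",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"force": True, "reason": "Job stuck in transitional state"},
)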

Force cancellation may result in data loss and should only be used when normal cancellation fails.

Checkpoint Management

When cancelling a job, you have several checkpoint options:

Preserve Existing Checkpoints

{
    "preserveCheckpoints": True,  # Keep all existing checkpoints
    "createFinalCheckpoint": False  # Don't create new checkpoint
}

Create Final Checkpoint

{
    "preserveCheckpoints": True,   # Keep existing + create final
    "createFinalCheckpoint": True  # Save current model state
}

Clean Cancellation

{
    "preserveCheckpoints": False,  # Remove all checkpoints
    "createFinalCheckpoint": False # No final checkpoint
}

Cost Implications

Cancelling a job has the following cost implications:

Incurred Costs: You pay for resources used up to the cancellation point
No Future Charges: No additional charges after successful cancellation
Checkpoint Storage: Preserved checkpoints continue to incur storage costs
Early Termination: No penalties for early cancellation
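
The `costSummary` object in the response can be sanity-checked with simple arithmetic. The sketch below is illustrative only: `check_cost_summary` is a hypothetical helper, and treating `totalCost + estimatedSavings` as the projected cost of the uncancelled run is an assumption about how `estimatedSavings` is defined, not something this page states.

```python
import math


def check_cost_summary(cost_summary):
    """Cross-check a costSummary object and derive a savings fraction."""
    breakdown_total = sum(cost_summary["costBreakdown"].values())
    total = cost_summary["totalCost"]
    savings = cost_summary.get("estimatedSavings", 0.0)
    # Assumption: the full run would have cost what was paid plus what was saved.
    projected = total + savings
    return {
        # Allow half a cent of rounding difference between breakdown and total.
        "breakdown_matches_total": math.isclose(breakdown_total, total, abs_tol=0.005),
        "projected_full_run_cost": projected,
        "savings_fraction": savings / projected if projected else 0.0,
    }
```

With the example response on this page (GPU 16.25 plus storage 2.50, total 18.75, savings 6.25), the breakdown matches the total and, under the assumption above, the savings fraction is 6.25 / 25.00 = 0.25.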

Common Use Cases

Early Stopping

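import requests

# Cancel when validation metrics plateau.
# validation_loss_improvement, threshold, and job_id come from your own monitoring code.
if validation_loss_improvement < threshold:
    cancel_response = requests.post(
        f"https://api.tensorone.ai/v2/training/jobs/{job_id}/cancel",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={
            "reason": "Early stopping - validation loss plateaued",
            "createFinalCheckpoint": True,
        },
    )
    cancel_response.raise_for_status()  # surface HTTP errors instead of ignoring them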
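
The snippet above leaves open how the validation-loss improvement is measured. One common approach is a patience window, sketched below. `should_stop_early`, `patience`, and `min_delta` are hypothetical names and defaults chosen for illustration; the API itself does not define an early-stopping rule.

```python
def should_stop_early(val_losses, patience=3, min_delta=1e-3):
    """Return True when the last `patience` evaluations failed to improve the
    best earlier validation loss by at least `min_delta`."""
    if len(val_losses) <= patience:
        return False  # not enough history to compare against
    best_before = min(val_losses[:-patience])
    best_recent = min(val_losses[-patience:])
    return (best_before - best_recent) < min_delta
```

For instance, with losses `[1.0, 0.9, 0.85, 0.849, 0.8495]`, `patience=2`, and `min_delta=0.01`, the best recent loss improves on the earlier best by only about 0.001, so the function signals a stop.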

Resource Reallocation

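import requests

# Cancel a lower-priority job to free resources.
# low_priority_job_id is the ID of the job you are giving up.
cancel_response = requests.post(
    f"https://api.tensorone.ai/v2/training/jobs/{low_priority_job_id}/cancel",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "reason": "Reallocating resources to higher priority job",
        "preserveCheckpoints": True,
    },
)
cancel_response.raise_for_status()  # surface HTTP errors instead of ignoring them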

Hyperparameter Adjustment

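import requests

# Cancel to restart with better hyperparameters.
# job_id is the ID of the job being replaced.
cancel_response = requests.post(
    f"https://api.tensorone.ai/v2/training/jobs/{job_id}/cancel",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "reason": "Restarting with optimized hyperparameters",
        "createFinalCheckpoint": True,
    },
)
cancel_response.raise_for_status()  # surface HTTP errors instead of ignoring them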

Best Practices

Always provide a reason for cancellation to help with analysis and debugging
Create final checkpoints for running jobs to preserve training progress
Monitor the cancellation process as it may take a few minutes to complete
Clean up unused checkpoints periodically to manage storage costs
Use force cancellation sparingly and only when normal cancellation fails

Cancelled jobs remain in your job history for 30 days before being permanently removed. Checkpoints are preserved according to your retention settings.
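
To plan exports before a job leaves the history, the removal time can be estimated from the response. This sketch assumes the 30-day window starts at `cancellationDetails.cancelledAt`, which this page does not state explicitly; `history_removal_time` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

HISTORY_RETENTION = timedelta(days=30)  # retention period stated above


def history_removal_time(cancelled_at):
    """Estimate when a cancelled job leaves the job history.

    cancelled_at is an ISO 8601 string such as "2024-01-15T18:30:00Z".
    Assumes the retention window starts at the cancellation timestamp.
    """
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    started = datetime.fromisoformat(cancelled_at.replace("Z", "+00:00"))
    return started + HISTORY_RETENTION
```

Under that assumption, the example job cancelled at 2024-01-15T18:30:00Z would be removed from the history around 2024-02-14T18:30:00Z.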
