Create and launch training jobs for machine learning models using TensorOne’s distributed GPU infrastructure. Supported frameworks include PyTorch, TensorFlow, Hugging Face Transformers, and custom Docker images.
Request Body
name - Human-readable name for your training job
description - Optional description of what the training job does
framework - ML framework to use: pytorch (PyTorch with CUDA support), tensorflow (TensorFlow 2.x with GPU support), huggingface (Hugging Face Transformers), custom (custom Docker image)
modelConfig - Model configuration parameters:
  modelType - Type of model: language_model, image_classifier, object_detector, custom
  baseModel - Pre-trained model to fine-tune (e.g., llama-2-7b, bert-base, resnet50)
  customCode - Git repository URL or Docker image containing your training code
datasetConfig - Dataset configuration:
  datasetId - ID of a dataset uploaded to the TensorOne platform
  Public URL to the dataset (S3, GCS, etc.)
  format - Dataset format: json, csv, parquet, hdf5, custom
  splitRatio - Training/validation split ratios (default: 80/20)
hyperparameters - Training hyperparameters:
  epochs - Number of training epochs
  optimizer - Optimizer: adam, sgd, adamw, rmsprop
  Learning rate scheduler: cosine, linear, exponential
infrastructure - Infrastructure requirements:
  gpuType - GPU type: rtx-4090, a100-40gb, a100-80gb, h100-80gb
  gpuCount - Number of GPUs (1-8 for multi-GPU training)
  memory - RAM requirement: 16GB, 32GB, 64GB, 128GB
  storage - Storage requirement for datasets and checkpoints
Model checkpointing configuration:
  Enable automatic checkpointing
  Checkpoint frequency: epoch, hour, step
  Maximum number of checkpoints to keep
Response
jobId - Unique identifier for the created training job
status - Initial job status: queued, pending, initializing
estimatedStartTime - ISO 8601 timestamp of the estimated job start time
estimatedDuration - Estimated training duration (e.g., "2h 30m")
estimatedCost - Cost estimation for the training job:
  gpuCost - Estimated GPU compute cost in USD
  storageCost - Estimated storage cost in USD
  totalCost - Total estimated cost in USD
Example
curl -X POST "https://api.tensorone.ai/v2/training/jobs" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Llama-2-7B Fine-tuning",
"description": "Fine-tune Llama-2-7B on custom dialogue dataset",
"framework": "huggingface",
"modelConfig": {
"modelType": "language_model",
"baseModel": "llama-2-7b-hf",
"customCode": "https://github.com/your-org/llama-finetuning"
},
"datasetConfig": {
"datasetId": "ds_abc123",
"format": "json",
"splitRatio": {"train": 0.8, "validation": 0.2}
},
"hyperparameters": {
"learningRate": 0.0001,
"batchSize": 16,
"epochs": 3,
"optimizer": "adamw"
},
"infrastructure": {
"gpuType": "a100-40gb",
"gpuCount": 2,
"memory": "64GB",
"storage": "500GB"
}
}'
{
  "jobId": "job_train_abc123",
  "status": "queued",
  "estimatedStartTime": "2024-01-16T10:15:00Z",
  "estimatedDuration": "2h 45m",
  "estimatedCost": {
    "gpuCost": 12.50,
    "storageCost": 2.00,
    "totalCost": 14.50
  },
  "queuePosition": 3,
  "createdAt": "2024-01-15T14:30:00Z"
}
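The same request can be sent from Python. The sketch below mirrors the curl example above using the standard requests library; only the endpoint and fields shown in this section are assumed, and it is not an official TensorOne SDK call.

import os
import requests

# Assumes your API key is exported as TENSORONE_API_KEY (the variable name is illustrative)
API_KEY = os.environ["TENSORONE_API_KEY"]

payload = {
    "name": "Llama-2-7B Fine-tuning",
    "framework": "huggingface",
    "modelConfig": {"modelType": "language_model", "baseModel": "llama-2-7b-hf"},
    "datasetConfig": {"datasetId": "ds_abc123", "format": "json"},
    "hyperparameters": {"learningRate": 0.0001, "batchSize": 16, "epochs": 3, "optimizer": "adamw"},
    "infrastructure": {"gpuType": "a100-40gb", "gpuCount": 2, "memory": "64GB", "storage": "500GB"},
}

resp = requests.post(
    "https://api.tensorone.ai/v2/training/jobs",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
job = resp.json()
print(job["jobId"], job["status"], job["estimatedCost"]["totalCost"])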
Pre-built Training Templates
Use pre-configured templates for common training scenarios:
Language Model Fine-tuning
template = "llm-finetuning"
config = {
"baseModel" : "llama-2-13b" ,
"taskType" : "instruction_following" ,
"datasetFormat" : "alpaca"
}
Computer Vision
template = "image-classification"
config = {
"baseModel" : "efficientnet-b3" ,
"numClasses" : 10 ,
"imageSize" : 224
}
Custom Training
template = "custom-pytorch"
config = {
"dockerImage" : "your-registry/training-image:latest" ,
"entrypoint" : "python train.py"
}
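The snippets above only define a template name and its config; this section does not show how they attach to a job creation request. A minimal sketch, assuming (hypothetically) that they are passed as template and templateConfig fields alongside the normal request body; check the templates reference for the authoritative shape.

import requests

# "template" and "templateConfig" are assumed field names, not confirmed by this section
payload = {
    "name": "Llama-2-13B instruction tuning",
    "template": "llm-finetuning",
    "templateConfig": {
        "baseModel": "llama-2-13b",
        "taskType": "instruction_following",
        "datasetFormat": "alpaca",
    },
    "datasetConfig": {"datasetId": "ds_abc123", "format": "json"},
    "infrastructure": {"gpuType": "a100-80gb", "gpuCount": 4, "memory": "128GB", "storage": "500GB"},
}

resp = requests.post(
    "https://api.tensorone.ai/v2/training/jobs",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json=payload,
    timeout=30,
)
print(resp.json())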
Monitoring & Callbacks
Set up monitoring and callbacks during training:
# Add monitoring webhooks
monitoring_config = {
    "webhooks": [
        {
            "url": "https://your-app.com/training-webhook",
            "events": ["job_started", "epoch_completed", "job_completed", "job_failed"]
        }
    ],
    "metrics": ["loss", "accuracy", "learning_rate"],
    "logFrequency": "step"  # Log metrics every step
}
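On the receiving end, the webhook URL must accept HTTP POSTs for the listed events. A minimal Flask receiver is sketched below; the payload field names (event, jobId, metrics) are assumptions about the webhook body, not documented in this section.

from flask import Flask, request

app = Flask(__name__)

@app.route("/training-webhook", methods=["POST"])
def training_webhook():
    payload = request.get_json(force=True)
    # "event", "jobId", and "metrics" are assumed payload fields, not documented here
    event = payload.get("event")
    job_id = payload.get("jobId")
    if event == "epoch_completed":
        print(f"{job_id}: epoch finished, metrics={payload.get('metrics')}")
    elif event in ("job_completed", "job_failed"):
        print(f"{job_id}: terminal event {event}")
    return "", 204

if __name__ == "__main__":
    app.run(port=8000)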
Best Practices
Dataset Preparation: Ensure your dataset is properly formatted and accessible
Hyperparameter Tuning: Start with conservative learning rates and batch sizes
Checkpointing: Enable checkpointing to prevent data loss from interruptions
Resource Estimation: Use smaller experiments to estimate resource requirements
Monitoring: Set up proper logging and monitoring for long-running jobs (see the polling sketch after this list)
Cost Control: Set maximum training time and cost limits to prevent overruns
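For the monitoring practice above, a simple status-polling loop is often enough for long-running jobs. The sketch below assumes a GET https://api.tensorone.ai/v2/training/jobs/{jobId} endpoint that returns a status field; that endpoint and the terminal status values are not documented in this section, so treat them as hypothetical placeholders.

import time
import requests

JOB_ID = "job_train_abc123"
# Hypothetical status endpoint; not documented in this section
STATUS_URL = f"https://api.tensorone.ai/v2/training/jobs/{JOB_ID}"

while True:
    resp = requests.get(STATUS_URL, headers={"Authorization": "Bearer YOUR_API_KEY"}, timeout=30)
    resp.raise_for_status()
    status = resp.json().get("status")
    print("status:", status)
    # Terminal statuses below are assumptions; only queued/pending/initializing are listed above
    if status in ("completed", "failed", "cancelled"):
        break
    time.sleep(60)  # poll once a minute to avoid hammering the API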
Training jobs are billed by the minute based on GPU type and resources used. Jobs can be paused, resumed, or cancelled at any time through the API or web interface.