Announcement
Run your most ambitious AI workloads in production with Tractorun
Build, deploy, and scale training, batch inference, reinforcement learning, and other GPU-accelerated AI tasks

Maksim Zuev
Tractorun core developer
We are excited to introduce Tractorun, a powerful open-source tool for running distributed machine learning workloads.
Tractorun allows you to submit discrete jobs such as batch inference, model training, and reinforcement learning for execution on a cluster of GPUs.
What can you do with Tractorun?
Tractorun is ideal for launching any computational tasks that require or benefit from GPU acceleration. AI workloads in particular benefit from distributed computing to speed up their execution:
Training and fine-tuning models. Use Tractorun to easily train models across multiple compute nodes.
Offline batch inference. Run distributed inference to process millions of prompts.
Reinforcement learning tasks. Use Tractorun to scale and optimize reinforcement learning model training and evaluation.
For additional practical examples, refer to the GitHub repository with solution notebooks.
How does it work?
Submit a job using tractorun main.py and Tractorun will upload your job and run it on the TractoAI cloud. Built for production, Tractorun ensures scalable and reliable performance for your workloads. You can get started quickly without touching Kubernetes, Slurm, or configuring cloud service providers.
Tractorun offers two developer-friendly interfaces:
Python SDK for Jupyter Notebooks and fast prototyping.
CLI with YAML configuration files for production-ready workflows.
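For the CLI path, a job is described in a YAML file and submitted with the `tractorun` command. The sketch below is an illustration only: the key names (`yt_path`, `mesh`, `command`) are assumptions about the configuration schema, so check the Tractorun documentation for the exact format.

```yaml
# Hypothetical Tractorun YAML config (key names are assumptions):
yt_path: "//tmp/tractorun_demo"   # YTSaurus path for job artifacts
mesh:
  node_count: 2                   # number of compute nodes
  process_per_node: 1             # worker processes per node
  gpu_per_process: 1              # GPUs allocated to each worker
command:
  - python3
  - main.py
```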

Why Tractorun?
Tractorun increases development velocity and offers production-grade enterprise features:
Fully managed service: Tractorun coordinates the execution of distributed workloads, enabling AI and data engineers to focus on coding scalable applications rather than infrastructure.
Scalability: Easily scale workloads across multiple nodes.
Integrations: Supports a growing ecosystem of ML libraries (PyTorch, JAX, Hugging Face Transformers) and inference frameworks (vLLM, SGLang).
Monitoring and observability: Real-time visual dashboard to monitor and manage running jobs.
Part of Tracto.ai: store your data and run MapReduce, Spark, ClickHouse, and Tractorun jobs in unified pipelines.
Tractorun in practice
In this article, we’ll explore how to use Tractorun to fine-tune a TinyStories-3M model, enabling it to generate fairy tales about Tracto AI. This step-by-step guide will cover:
Bulk offline inference. Using vLLM and the Llama-3.2-3B-Instruct model to generate a dataset and store it as a structured table in YTSaurus.
Defining an effective PyTorch Dataset to interact with the YTSaurus table.
Fine-Tuning the roneneldan/TinyStories-3M model. Running fine-tuning with the Hugging Face transformers library.
Evaluating the fine-tuned model.
We are going to use the Tractorun Python SDK in this example, but you can also use the CLI with YAML configuration files for production-ready workflows; a simple CLI example can be found on GitHub.
The completed example is available on GitHub.
Offline inference
In the first step, we'll use vLLM to run batch inference with Llama and generate a dataset of fairy tales. To do this, we'll:
Implement a prepare_dataset function that will run on the Tracto.ai platform to perform model inference.
Call this function using tractorun.run.
Here's how to set up the inference process:
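The following sketch shows the shape of this step, based on the article's description. The prompt template, the output table path, the `toolbox.yt_client` handle, and the exact `run()` and vLLM arguments are illustrative assumptions rather than the article's verbatim code; heavy imports happen inside the functions because they are executed on the remote workers.

```python
# Prompt template for generating fairy tales (illustrative).
PROMPT_TEMPLATE = "Write a short fairy tale about Tracto AI. Theme: {theme}."


def build_prompts(themes):
    """Turn a list of themes into model prompts (pure Python)."""
    return [PROMPT_TEMPLATE.format(theme=t) for t in themes]


def prepare_dataset(toolbox):
    """Runs on the Tracto.ai platform: generates tales with vLLM and
    writes them into a structured YTSaurus table."""
    from vllm import LLM, SamplingParams  # imported on the GPU worker

    llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct")
    params = SamplingParams(temperature=0.8, max_tokens=512)
    prompts = build_prompts(["dragons", "robots", "a lost star"])
    outputs = llm.generate(prompts, params)
    rows = [
        {"prompt": p, "tale": o.outputs[0].text}
        for p, o in zip(prompts, outputs)
    ]
    # Hypothetical helper: write rows as a YTSaurus table via the toolbox.
    toolbox.yt_client.write_table("//tmp/fairy_tales", rows)


def submit():
    """Submit prepare_dataset for execution on a GPU node."""
    from tractorun.mesh import Mesh
    from tractorun.run import run

    run(
        prepare_dataset,
        yt_path="//tmp/tractorun_demo",  # assumed artifact path
        mesh=Mesh(node_count=1, process_per_node=1, gpu_per_process=1),
    )
```

Note that `run` takes the function itself: Tractorun uploads it together with its dependencies and invokes it on the cluster, so nothing here executes locally except the submission call.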
Define YTDataset for PyTorch
Now that we have generated a dataset of fairy tales using bulk inference, we need to define a PyTorch dataset to efficiently load and process this data during training. This allows us to stream data from YTSaurus and apply the necessary transformations for model training.
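The streaming logic can be sketched as follows. In real code the class would subclass `torch.utils.data.IterableDataset`; it is kept dependency-free here so the logic is easy to follow, and the `read_rows` callable (e.g. wrapping a YTSaurus table reader) and the `text_column` name are assumptions.

```python
class YTDataset:
    """Streams rows from a YTSaurus table and tokenizes them on the fly.

    A sketch of an iterable-style dataset: each iteration re-opens the
    row stream, so the dataset can be consumed once per training epoch.
    """

    def __init__(self, read_rows, tokenize, text_column="tale"):
        self._read_rows = read_rows      # callable returning an iterable of dict rows
        self._tokenize = tokenize        # e.g. a Hugging Face tokenizer
        self._text_column = text_column  # column holding the training text

    def __iter__(self):
        for row in self._read_rows():
            yield self._tokenize(row[self._text_column])
```

With a real YTSaurus client, `read_rows` would be something like `lambda: client.read_table("//tmp/fairy_tales")`, and `tokenize` the model's tokenizer; the class itself stays unchanged.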
Fine-tuning
Now that we have the dataset defined, we can fine-tune the roneneldan/TinyStories-3M model using the Hugging Face transformers library. To do this, we'll:
Implement a training function that will run on the Tracto.ai platform to train the model.
Call this function using tractorun.run.
Here's how to set up the fine-tuning process on 2 nodes with data parallelism:
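A sketch of that setup is below. The model name comes from the article; the `Tractorch` backend, the `run()`/`TrainingArguments` parameters, and the dataset wiring are assumptions drawn from Tractorun's public examples, so treat them as a starting point rather than the exact code.

```python
def global_batch_size(per_device, node_count, process_per_node):
    """Effective batch size under data parallelism: each worker
    processes its own per-device batch in parallel."""
    return per_device * node_count * process_per_node


def train(toolbox):
    """Runs on each node; the Tractorch backend is assumed to set up
    torch.distributed so the Trainer does data parallelism."""
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained("roneneldan/TinyStories-3M")
    model = AutoModelForCausalLM.from_pretrained("roneneldan/TinyStories-3M")
    # Dataset over the generated fairy tales (e.g. the YTDataset above);
    # omitted in this sketch.
    train_dataset = None

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="/tmp/tinystories-tracto",  # assumed path
            per_device_train_batch_size=8,
            num_train_epochs=1,
        ),
        train_dataset=train_dataset,
    )
    trainer.train()


def submit():
    """Submit the training function to run on 2 GPU nodes."""
    from tractorun.backend.tractorch import Tractorch
    from tractorun.mesh import Mesh
    from tractorun.run import run

    # 2 nodes x 1 process x 1 GPU -> 2 data-parallel workers.
    run(
        train,
        backend=Tractorch(),
        yt_path="//tmp/tractorun_demo",
        mesh=Mesh(node_count=2, process_per_node=1, gpu_per_process=1),
    )
```

With 2 workers and a per-device batch of 8, the effective global batch size is 16, which is the main knob data parallelism turns.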
Conclusions
To learn more about the Tracto.ai platform and get access to it, visit our website. You can find the example from this article and more on GitHub, such as:
DeepSeek R1 batch offline inference.
Training a model using MNIST dataset.
and more.