Python 3.8+ toolbox for submitting jobs to Slurm
Submitit is a lightweight tool for submitting Python functions for computation within a Slurm cluster. It basically wraps submission and provides access to results, logs and more. Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Submitit lets you switch seamlessly between executing on Slurm and executing locally.
From inside an environment with submitit installed:
import submitit
def add(a, b):
    return a + b
# executor is the submission interface (logs are dumped in the folder)
executor = submitit.AutoExecutor(folder="log_test")
# set timeout in min, and partition for running the job
executor.update_parameters(timeout_min=1, slurm_partition="dev")
job = executor.submit(add, 5, 7) # will compute add(5, 7)
print(job.job_id) # ID of your job
output = job.result() # waits for completion and returns output
assert output == 12 # 5 + 7 = 12... your addition was computed in the cluster
The Job class also provides tools for reading the log files (job.stdout() and job.stderr()).
If what you want to run is a command, turn it into a Python function using submitit.helpers.CommandFunction, then submit it. By default stdout is silenced in CommandFunction, but it can be unsilenced with verbose=True.
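The idea behind CommandFunction is to turn a command into a plain Python callable that any executor can submit. The sketch below is a minimal stand-in built on the standard library's subprocess module, not submitit's implementation: the class name ShellCommand is hypothetical, and it assumes the callable should return the command's stdout and print it only when verbose=True. It runs through a local concurrent.futures executor so no cluster is needed.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import List


class ShellCommand:
    """Hypothetical stand-in for a command wrapped as a Python callable.

    Not submitit's CommandFunction: a minimal sketch assuming the wrapper
    returns the command's stdout and prints it only when verbose=True.
    """

    def __init__(self, command: List[str], verbose: bool = False) -> None:
        self.command = command
        self.verbose = verbose

    def __call__(self) -> str:
        # Run the command and capture its output; raise CalledProcessError on failure.
        completed = subprocess.run(
            self.command, capture_output=True, text=True, check=True
        )
        if self.verbose:
            print(completed.stdout, end="")  # stdout is silent unless verbose=True
        return completed.stdout


# A command that prints one line, using the current interpreter for portability.
command = ShellCommand([sys.executable, "-c", "print('hello from a command')"])

# Any executor with a concurrent.futures-style submit() can run the callable.
with ThreadPoolExecutor(max_workers=1) as executor:
    output = executor.submit(command).result()

print(output.strip())  # hello from a command
```

With submitit itself, the callable is built with submitit.helpers.CommandFunction instead and passed to executor.submit in the same way.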
More examples are available in the submitit documentation.
Submitit is a Python 3.8+ toolbox for submitting jobs to Slurm. It aims at running Python functions from Python code.
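Quick install, in a virtualenv/conda environment where pip is installed (check which pip), using one of the following (latest release from PyPI, latest release from conda-forge, or the main branch from GitHub):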
pip install submitit
conda install -c conda-forge submitit
pip install git+https://github.com/facebookincubator/submitit@main#egg=submitit
You can try running the MNIST example to check that everything is working as expected (requires sklearn).
The submitit documentation gives more detailed information on:
- how submitit works, which files are created for each job, and the main objects you will interact with;
- nevergrad usage and how it interfaces with submitit.
The aim of this Python3 package is to be able to launch jobs on Slurm painlessly from inside Python, using the same submission and job patterns as the standard library package concurrent.futures.
Among the benefits of using this lightweight package, you can switch between a submitit executor and one of the concurrent.futures executors in a single line, so it is easy to run your code either on Slurm or locally (with multithreading, for instance).
Submitit is used by FAIR researchers on the FAIR cluster. The defaults are chosen to make their life easier, and might not be ideal for every cluster.
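The single-line executor swap can be sketched as follows, assuming only what the text states: both kinds of executor share a submit(fn, *args) call that returns an object with a .result() method. The helper compute is hypothetical. The commented-out lines reuse the AutoExecutor call from the first example and need submitit plus access to a Slurm cluster, so only the local ThreadPoolExecutor path runs here.

```python
from concurrent.futures import ThreadPoolExecutor


def add(a, b):
    return a + b


def compute(executor):
    """Hypothetical helper: submit a few additions and wait for all results.

    It only relies on executor.submit(fn, *args) returning an object with a
    .result() method, the pattern shared by concurrent.futures executors and
    submitit executors.
    """
    jobs = [executor.submit(add, i, i + 1) for i in range(4)]
    return [job.result() for job in jobs]


# Local run with multithreading:
with ThreadPoolExecutor(max_workers=4) as executor:
    results = compute(executor)
print(results)  # [1, 3, 5, 7]

# On Slurm, only the executor construction would change (requires submitit
# and a cluster; parameters copied from the first example):
# import submitit
# executor = submitit.AutoExecutor(folder="log_test")
# executor.update_parameters(timeout_min=1, slurm_partition="dev")
# results = compute(executor)
```

Keeping the submission logic in a function that receives the executor is what makes the swap a one-line change.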
dask is a nice framework for distributed computing. dask.distributed provides the same concurrent.futures executor API as submitit:
from distributed import Client
from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(processes=1, cores=2, memory="2GB")
cluster.scale(2) # this may take a few seconds to launch
executor = Client(cluster)
executor.submit(...)
The key difference with submitit is that dask.distributed distributes the jobs to a pool of workers (see the cluster variable above), while each submitit job is submitted directly as a job on the cluster. In that sense submitit is a lower-level interface than dask.distributed, and you get more direct control over your jobs, including individual stdout and stderr, and possibly checkpointing in case of preemption and timeout. On the other hand, you should avoid submitting many small tasks with submitit, since that would create many independent jobs and possibly overload the cluster, whereas dask.distributed handles this without any problem.
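One way to follow this advice, assuming the small tasks are independent, is to group them into a few chunks and submit one job per chunk. This chunking is a general technique, not a submitit feature, and the helpers chunked and run_chunk are hypothetical. A ThreadPoolExecutor stands in for the cluster executor so the sketch runs locally; with a submitit executor, each submit() call would create one Slurm job.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(items: Sequence[T], n_chunks: int) -> List[Sequence[T]]:
    """Hypothetical helper: split items into at most n_chunks contiguous slices."""
    if not items:
        return []
    size = -(-len(items) // n_chunks)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_chunk(fn: Callable[[T], R], chunk: Sequence[T]) -> List[R]:
    """Hypothetical helper: process a whole chunk inside a single job."""
    return [fn(x) for x in chunk]


def square(x: int) -> int:
    return x * x


tasks = list(range(100))  # 100 small tasks...
with ThreadPoolExecutor(max_workers=4) as executor:
    # ...submitted as 4 jobs instead of 100.
    jobs = [executor.submit(run_chunk, square, chunk) for chunk in chunked(tasks, 4)]
    results = [y for job in jobs for y in job.result()]

print(len(jobs), results[:5])  # 4 [0, 1, 4, 9, 16]
```

The number of chunks trades scheduling overhead against parallelism: fewer chunks mean fewer jobs on the cluster, but each job runs longer.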
In chronological order: Jérémy Rapin, Louis Martin, Lowik Chanussot, Lucas Hosseini, Fabio Petroni, Francisco Massa, Guillaume Wenzek, Thibaut Lavril, Vinayak Tantia, Andrea Vedaldi, Max Nickel, Quentin Duval (feel free to contribute and add your name ;) )
Submitit is released under the MIT License.