Usage¶

Configuration¶

The first thing you’ll want to do is configure drmr for your local workload manager, by running drmrc. As long as you only have one workload manager installed, it should be able to detect it and create a reasonable configuration.

If you have a default account or destination (Slurm partition or PBS queue) you want to use for jobs that don’t specify one, you can add that to the configuration. See drmrc in the Command reference section below for details.

Writing and submitting scripts¶

Once you’ve configured drmr, you’re ready to write and submit a drmr script. Try putting this script in a file called hello:

echo "hello world"

Then run drmr hello. Almost nothing will happen. You should just see a number printed before the shell prompt comes back. This is the job ID of a success job, which drmr submits at the end of every script. You can monitor that job ID to see when your jobs finish.

The hello script has probably finished by the time you’ve read this far. The only indication will be a new directory named .drmr. In there you’ll find a surprising number of files for a simple “hello world” example. The signal-to-noise ratio does improve as your scripts grow in size. The contents of .drmr should look something like this:

$ ls -1 .drmr
hello.1_174.out
hello.1.slurm
hello.cancel
hello.finish_176.out
hello.finished
hello.finish.slurm
hello.success
hello.success_175.out
hello.success.slurm

All of them start with hello, the job name derived automatically from your drmr script’s filename. We could have explicitly set the job name instead, by submitting the script with drmr’s --job-name option.

The file hello.1.slurm contains the actual job script. The name consists of the job prefix hello, the job number of the command in the drmr script (1), and a suffix indicating the DRM in use (.slurm, because I’m using Slurm for this example). The job script looks like this:

#!/bin/bash

####  Slurm preamble

#SBATCH --export=ALL
#SBATCH --job-name=hello.1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4000
#SBATCH --output "/home/john/tmp/.drmr/hello.1_%j.out"
#SBATCH --workdir=/home/john/tmp


####  End Slurm preamble


####  Environment setup
. /home/john/.virtualenvs/drmr/bin/activate

####  Commands

echo "hello world"

It’s pretty much what you’d write to submit any Slurm job. If you’re using PBS, it will have a .pbs extension, and contain a PBS-specific preamble.

You’ll notice that the last line is the command from your hello script.

In between is a nicety for Python users: if you have a virtual environment active when you submit a script, it will be activated before running your commands.

Each job script’s standard output and error will be in a file named after the job, containing its DRM job ID. Here it’s hello.1_174.out, and it contains:

hello world

Usually, your drmr scripts will contain commands explicitly redirecting their standard output to files, and you’ll only refer to these default output files when jobs fail.

The rest of the files are drmr housekeeping: there’s a script to cancel all the jobs (hello.cancel), completion jobs (hello.finish.slurm and hello.success.slurm) and their output files, and finally a couple of marker files: hello.finished and hello.success. That last one is what you want to see: if the .success file exists, all of your drmr script’s jobs completed successfully. If you see the .finished file, but not .success, something went wrong.

A more complete example is included in the output of drmr --help, which you can read under drmr below. See also the real-world scripts under Examples.

Command reference¶

drmrc¶

Creates the drmr configuration file, .drmrc.

Help is available by running drmrc --help:

usage: drmrc [-h] [-a ACCOUNT] [-d DESTINATION] [-o]
                      [-r RESOURCE_MANAGER]

Generate a drmr configuration file for your local environment.

optional arguments:
  -h, --help            show this help message and exit
  -a ACCOUNT, --account ACCOUNT
                        The account to which jobs will be charged by default.
  -d DESTINATION, --destination DESTINATION
                        The default queue/partition in which to run jobs.
  -o, --overwrite       Overwrite any existing configuration file.
  -r RESOURCE_MANAGER, --resource-manager RESOURCE_MANAGER
                        If you have more than one resource manager available,
                        you can specify it.

drmr¶

Submits a pipeline script to a distributed resource manager. By default all the pipeline’s commands are run concurrently, but you can indicate dependencies by adding # drmr:wait directives between jobs. Whenever a wait directive is encountered, the pipeline will wait for all prior jobs to complete before continuing.

You may also specify job parameters, like CPU or memory requirements, time limits, etc. in # drmr:job directives.

You can get help, including a full example, by running drmr --help:

usage: drmr [-h] [-a ACCOUNT] [-d DESTINATION] [--debug] [-j JOB_NAME]
            [-f FROM_LABEL] [--mail-at-finish] [--mail-on-error]
            [--start-held] [-t TO_LABEL] [-w WAIT_LIST]
            input

Submit a drmr script to a distributed resource manager.

positional arguments:
  input                 The file containing commands to submit. Use "-" for
                        stdin.

optional arguments:
  -h, --help            show this help message and exit
  -a ACCOUNT, --account ACCOUNT
                        The account to be billed for the jobs.
  -d DESTINATION, --destination DESTINATION
                        The queue/partition in which to run the jobs.
  --debug               Turn on debug-level logging.
  -j JOB_NAME, --job-name JOB_NAME
                        The job name.
  -f FROM_LABEL, --from-label FROM_LABEL
                        Ignore script lines before the given label.
  --mail-at-finish      Send mail when all jobs are finished.
  --mail-on-error       Send mail if any job fails.
  --start-held          Submit a held job at the start of the pipeline, which
                        must be released to start execution.
  -t TO_LABEL, --to-label TO_LABEL
                        Ignore script lines after the given label.
  -w WAIT_LIST, --wait-list WAIT_LIST
                        A colon-separated list of job IDs that must complete
                        before any of this script's jobs are started.

Supported resource managers are:

  Slurm
  PBS

drmr will read configuration from your ~/.drmrc, which must be
valid JSON. You can specify your resource manager and default
values for any job parameters listed below.

Directives
==========

Your script can specify job control directives in special
comments starting with "drmr:".

# drmr:wait

  Drmr by default runs all the script's commands
  concurrently. The wait directive tells drmr to wait for
  any jobs started since the last wait directive, or the
  beginning of the script, to complete successfully.

# drmr:label

  Labels let you selectively run sections of your script: you can
  restart from a label with --from-label, running everything after
  it, or just the commands before the label given with --to-label.

# drmr:job

  You can customize the following job parameters:

  time_limit: The maximum amount of time the DRM should allow the job: "12:30:00" or "12h30m".
  working_directory: The directory where the job should be run.
  processor_memory: The amount of memory required per processor.
  node_properties: A comma-separated list of properties each node must have.
  account: The account to which the job will be billed.
  processors: The number of cores required on each node.
  default: Use the resource manager's default job parameters.
  destination: The execution environment (queue, partition, etc.) for the job.
  job_name: A name for the job.
  memory: The amount of memory required on any one node.
  nodes: The number of nodes required for the job.
  email: The submitter's email address, for notifications.

  Whatever you specify will apply to all jobs after the directive.

  To revert to default parameters, use:

  # drmr:job default

  To request 4 CPUs, 8GB of memory per processor, and a
  limit of 12 hours of execution time on one node:

  # drmr:job nodes=1 processors=4 processor_memory=8000 time_limit=12:00:00

Example
=======

A complete example script follows:

#!/bin/bash

#
# Example drmr script. It can be run as a normal shell script, or
# submitted to a resource manager with the drmr command.
#

#
# You can just write commands as you would in any script. Their output
# will be captured in files by the resource manager.
#
echo thing1

#
# You can only use flow control within a command; drmr's parser is not
# smart enough to deal with conditionals, or create jobs for each
# iteration of a for loop, or anything like that.
#
# You can do this, but it will just all happen in a single job:
#
for i in $(seq 1 4); do echo thing${i}; done

#
# Comments are OK.
#
echo thing2  # even trailing comments

#
# Line continuations are OK.
#
echo thing1 \
     thing2 \
     thing3

#
# Pipes are OK.
#
echo funicular berry harvester | wc -w

#
# The drmr wait directive makes subsequent tasks depend on the
# successful completion of all jobs since the last wait directive or
# the start of the script.
#

# drmr:wait
echo "And proud we are of all of them."

#
# You can specify job parameters:
#

# drmr:job nodes=1 processors=4 processor_memory=8000 time_limit=12:00:00
echo "I got mine but I want more."

#
# And revert to the defaults defined by drmr or the resource manager.
#

# drmr:job default
echo "This job feels so normal."

# drmr:wait
# drmr:job time_limit=00:15:00
echo "All done!"

# And finally, a job is automatically submitted to wait on all the
# other jobs and report success or failure of the entire script.
# Its job ID will be printed.

drmrarray¶

If you have hundreds or thousands of tasks that don’t depend on each other, you can make life easier for yourself and your DRM by submitting them in a job array with drmrarray. Both Slurm and PBS cope better with lots of jobs if they’re part of an array, and it’s definitely easier to make sense of the DRM’s status output when it doesn’t contain hundreds or thousands of lines.

With drmrarray, job parameters can only be specified once, at the top of the script, and will apply to all jobs in the array. And of course you cannot define dependencies. You can, however, run whatever program you like on each line of the script you feed to drmrarray.

You can get help, including a full example, by running drmrarray --help:

usage: drmrarray [-h] [-a ACCOUNT] [-d DESTINATION] [--debug] [-f]
                 [-j JOB_NAME] [--mail-at-finish] [--mail-on-error]
                 [-s SLOT_LIMIT] [-w WAIT_LIST]
                 input

Submit a drmr script to a distributed resource manager as a job array.

positional arguments:
  input                 The file containing commands to submit. Use "-" for
                        stdin.

optional arguments:
  -h, --help            show this help message and exit
  -a ACCOUNT, --account ACCOUNT
                        The account to be billed for the jobs.
  -d DESTINATION, --destination DESTINATION
                        The queue/partition in which to run the jobs.
  --debug               Turn on debug-level logging.
  -f, --finish-jobs     If specified, two extra jobs will be queued after the
                        main array, to indicate success and completion.
  -j JOB_NAME, --job-name JOB_NAME
                        The job name.
  --mail-at-finish      Send mail when all jobs are finished.
  --mail-on-error       Send mail if any job fails.
  -s SLOT_LIMIT, --slot-limit SLOT_LIMIT
                        The number of jobs that will be run concurrently when
                        the job is started, or 'all' (the default).
  -w WAIT_LIST, --wait-list WAIT_LIST
                        A colon-separated list of job IDs that must complete
                        before any of this script's jobs are started.

Supported resource managers are:

  Slurm
  PBS

drmrarray will read configuration from your ~/.drmrc, which must be valid
JSON. You can specify your resource manager and default values for any job
parameters listed below.

Directives
==========

Your script can specify job parameters in special comments starting
with "drmr:job".

# drmr:job

  You can customize the following job parameters:

  time_limit: The maximum amount of time the DRM should allow the job: "12:30:00" or "12h30m".
  working_directory: The directory where the job should be run.
  processor_memory: The amount of memory required per processor.
  node_properties: A comma-separated list of properties each node must have.
  account: The account to which the job will be billed.
  processors: The number of cores required on each node.
  default: Use the resource manager's default job parameters.
  destination: The execution environment (queue, partition, etc.) for the job.
  job_name: A name for the job.
  memory: The amount of memory required on any one node.
  nodes: The number of nodes required for the job.
  email: The submitter's email address, for notifications.

  Whatever you specify will apply to all jobs after the directive.

  To revert to default parameters, use:

  # drmr:job default

  To request 4 CPUs, 8GB of memory per processor, and a
  limit of 12 hours of execution time on one node:

  # drmr:job nodes=1 processors=4 processor_memory=8000 time_limit=12:00:00

drmrm¶

A drmr script can generate a lot of jobs. Deleting them with the DRM tools (e.g. qdel, scancel) can be cumbersome, so drmrm tries to make it easier. Help is available by running drmrm --help

usage: drmrm [-h] [--debug] [-n] [-j JOB_NAME] [-u USER] [job_id [job_id ...]]

Remove jobs from a distributed resource manager.

positional arguments:
  job_id                A job ID to remove.

optional arguments:
  -h, --help            show this help message and exit
  --debug               Turn on debug-level logging.
  -n, --dry-run         Just print jobs that would be removed, without
                        actually removing them.
  -j JOB_NAME, --job-name JOB_NAME
                        Remove only jobs whose names contain this string.
  -u USER, --user USER  Remove only jobs belonging to this user.