Extending the SLURM pool to other servers

Hello,

As SLURM is a manager for computer clusters, we are wondering if it is possible to extend the pool to other servers that have the executables installed. These would not necessarily be CentOS servers, so that the system could be used on existing clusters.

Some prerequisites come to mind:

  • SLURM: add the pool to the SLURM config file (obviously)
  • OTB apps: compile and install on other nodes
  • Python scripts: copy them to make them available on the other nodes
  • custom paths: as clusters are often used by a variety of applications and users, it is likely that the libraries are not installed in the standard paths, in order to allow different builds to coexist. Would it be possible to add a “source” script at the beginning of the sbatch script, as an option? That would give access to the Python files (via PYTHONPATH), the libraries (LD_LIBRARY_PATH) and the applications (OTB_APPLICATION_PATH); see the sketch after this list.
  • orchestrator: is there anything to do? I do not understand where the success/failure status updates to the DB are triggered. Is it file-based, does the sbatch script send a message, or is it something else?
  • other prerequisites?
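
To illustrate the custom paths point, here is roughly what we have in mind: an environment script sourced at the top of the sbatch script (all paths below are hypothetical examples for a non-standard install prefix):

    #!/bin/bash
    # env-sen2agri.sh -- example environment script; /opt/sen2agri
    # is a hypothetical, non-standard install location
    export PYTHONPATH=/opt/sen2agri/lib/python:$PYTHONPATH
    export LD_LIBRARY_PATH=/opt/sen2agri/lib:$LD_LIBRARY_PATH
    export OTB_APPLICATION_PATH=/opt/sen2agri/lib/otb/applications

and then, at the top of the sbatch script:

    source /opt/sen2agri/env-sen2agri.sh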

Thanks

Hi,

The system does use SLURM to run the processing jobs, so in theory it’s possible to set it up on multiple servers. Point by point:

SLURM: add the pool to the SLURM config file (obviously)

Yes, the new nodes need to be added to the SLURM configuration so that they become part of the cluster.
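
For example, a minimal slurm.conf fragment could look like this (node names, resources and the partition name are placeholders, not the actual values):

    # Declare the new nodes and group them into a partition
    NodeName=node01 CPUs=16 RealMemory=64000 State=UNKNOWN
    NodeName=node02 CPUs=16 RealMemory=64000 State=UNKNOWN
    PartitionName=sen2agri Nodes=node01,node02 Default=YES MaxTime=INFINITE State=UP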

OTB apps: compile and install on other nodes

Yes, or you can install the sen2agri-processors RPM if you’re using our binary distribution.
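
Assuming the package repository is already configured on the node, that boils down to something like:

    # Install the processors package on each compute node
    sudo yum install sen2agri-processors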

Python scripts: copy them to make them available on the other nodes

The relevant ones should be in that package. The downloaders should only run on a single node.
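
If you want to check which scripts a node actually received, one way (assuming an RPM-based install) is:

    # List the Python scripts shipped in the package
    rpm -ql sen2agri-processors | grep '\.py$'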

custom paths: as clusters are often used by a variety of applications and users

Not right now; the applications assume a system-wide install.

orchestrator: is there anything to do?

There should be a single instance of the orchestrator and executor daemons. The jobs try to report their completion to the executor, using the IP address configured in the database (the executor.listen-ip key in the config table). This defaults to 127.0.0.1 and should be changed to the address of the node running the executor.
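
Assuming the database is PostgreSQL and the config table has key/value columns (the database name and IP address below are placeholders), the change could be done with something like:

    # Point the jobs at the executor node; 10.0.0.5 is a placeholder address
    sudo -u postgres psql sen2agri -c \
      "UPDATE config SET value = '10.0.0.5' WHERE key = 'executor.listen-ip';"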

other prerequisites?

Not as far as I know, but please note that this scenario hasn’t really been tested properly. Sorry for the bad news.

My preference would be to provide a couple of Docker images to simplify installation on shared nodes, but this has yet to happen.