Requested and Scheduled Jobs Not Running After a Computer Restart

Hello,

I decided to restart my machine since it had been running for days and was becoming unresponsive. However, I did this while sen2agri jobs/tasks were still running (more than 10 tasks, L4A and L4B).

After the restart, I ran systemctl restart sen2agri-services to re-enable my site on the web interface. The monitoring tab showed the jobs/tasks as running, but in htop there were no sen2agri processes consuming any CPU.
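
For reference, these are the checks that can be run after such a reboot (a rough sketch, assuming the stock Slurm tools that sen2agri builds on and the usual systemd unit names; adjust to your install):

# confirm the services came back up
systemctl status sen2agri-services slurmctld slurmd

# list the jobs Slurm still knows about (R = running, PD = pending)
squeue --long

# check the state of the compute node itself
sinfo -N -l

# look for actual sen2agri worker processes
ps aux | grep -i "[s]en2agri"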

How long does it take for the jobs/tasks to resume? Does a simple machine restart really affect the continuation of my jobs/tasks?

Regards.

Hello,

Is this issue related to Slurm? I’m sharing a screenshot of the jobs that are not working.

I’m also attaching the logs for slurmctld and slurmd.
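
(In case it helps anyone reproduce this, logs like the ones below can be pulled with journalctl; the file paths in the second command are only the common defaults, not verified on my install:)

# daemon logs since the day of the restart
journalctl -u slurmd -u slurmctld --since "2019-11-22"

# or, if Slurm logs straight to files
tail -n 200 /var/log/slurm/slurmd.log /var/log/slurm/slurmctld.log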

slurmd:

[2019-11-22T07:43:57.248] [53287.0] done with job
[2019-11-22T07:43:57.518] [53287] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
[2019-11-22T07:43:57.765] [53287] done with job
[2019-11-22T07:43:59.187] _run_prolog: run job script took usec=11
[2019-11-22T07:43:59.187] _run_prolog: prolog with lock for job 53288 ran for 0 seconds
[2019-11-22T07:43:59.187] Launching batch job 53288 for UID 1002
[2019-11-22T07:43:59.941] launch task 53288.0 request from 1002.1002@127.0.0.1 (port 27789)
[2019-11-22T10:00:29.686] Slurmd shutdown completing
[2019-11-22T10:05:25.079] Node configuration differs from hardware: CPUs=8:8(hw) Boards=1:1(hw) SocketsPerBoard=8:1(hw) CoresPerSocket=1:4(hw) ThreadsPerCore=1:2(hw)
[2019-11-22T10:05:25.410] Message aggregation disabled
[2019-11-22T10:05:25.673] error: _cpu_freq_cpu_avail: Could not open /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
[2019-11-22T10:05:25.674] Resource spec: Reserved system memory limit not configured for this node
[2019-11-22T10:05:25.974] slurmd version 15.08.7 started
[2019-11-22T10:05:26.425] slurmd started on Fri, 22 Nov 2019 10:05:26 +0800
[2019-11-22T10:05:26.426] CPUs=8 Boards=1 Sockets=8 Cores=1 Threads=1 Memory=48075 TmpDisk=51175 Uptime=45 CPUSpecList=(null)
[2019-11-22T10:05:26.443] _handle_stray_script: Purging vestigial job script /var/spool/slurm/slurmd/job53285/slurm_script
[2019-11-22T10:05:26.541] _handle_stray_script: Purging vestigial job script /var/spool/slurm/slurmd/job53288/slurm_script
[2019-11-22T14:55:34.205] Slurmd shutdown completing
[2019-11-22T14:56:36.116] Node configuration differs from hardware: CPUs=8:8(hw) Boards=1:1(hw) SocketsPerBoard=8:1(hw) CoresPerSocket=1:4(hw) ThreadsPerCore=1:2(hw)
[2019-11-22T14:56:36.587] Message aggregation disabled
[2019-11-22T14:56:36.589] error: _cpu_freq_cpu_avail: Could not open /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
[2019-11-22T14:56:36.588] Node configuration differs from hardware: CPUs=8:8(hw) Boards=1:1(hw) SocketsPerBoard=8:1(hw) CoresPerSocket=1:4(hw) ThreadsPerCore=1:2(hw)
[2019-11-22T14:56:36.589] Resource spec: Reserved system memory limit not configured for this node
[2019-11-22T14:56:36.589] Message aggregation disabled
[2019-11-22T14:56:36.591] error: _cpu_freq_cpu_avail: Could not open /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
[2019-11-22T14:56:36.591] Resource spec: Reserved system memory limit not configured for this node
[2019-11-22T14:56:36.593] slurmd version 15.08.7 started
[2019-11-22T14:56:36.594] slurmd version 15.08.7 started
[2019-11-22T14:56:36.758] slurmd started on Fri, 22 Nov 2019 14:56:36 +0800
[2019-11-22T14:56:36.759] CPUs=8 Boards=1 Sockets=8 Cores=1 Threads=1 Memory=48075 TmpDisk=51175 Uptime=29 CPUSpecList=(null)
[2019-11-22T14:56:36.987] error: Error binding slurm stream socket: Address already in use
[2019-11-22T14:56:36.987] error: Unable to bind listen port (*:6818): Address already in use
[2019-11-22T14:57:39.427] _run_prolog: run job script took usec=11
[2019-11-22T14:57:39.427] _run_prolog: prolog with lock for job 53289 ran for 0 seconds
[2019-11-22T14:57:39.428] _run_prolog: run job script took usec=11
[2019-11-22T14:57:39.428] _run_prolog: prolog with lock for job 53298 ran for 0 seconds
[2019-11-22T14:57:39.428] Launching batch job 53289 for UID 1002
[2019-11-22T14:57:39.986] Launching batch job 53298 for UID 1002
[2019-11-22T14:57:40.436] launch task 53289.0 request from 1002.1002@127.0.0.1 (port 43730)
[2019-11-22T14:57:40.493] launch task 53298.0 request from 1002.1002@127.0.0.1 (port 44242)
[2019-11-22T15:18:10.934] launch task 53310.0 request from 1002.1002@127.0.0.1 (port 46734)
[2019-11-22T15:18:10.935] _run_prolog: run job script took usec=4
[2019-11-22T15:18:10.935] _run_prolog: prolog with lock for job 53310 ran for 0 seconds
[2019-11-22T15:18:10.944] [53310.0] couldn't chdir to `/home/brentf': Permission denied: going to /tmp instead
[2019-11-22T15:18:11.035] [53310.0] done with job
[2019-11-22T15:18:31.833] launch task 53311.0 request from 1002.1002@127.0.0.1 (port 33423)
[2019-11-22T15:18:31.834] _run_prolog: run job script took usec=7
[2019-11-22T15:18:31.834] _run_prolog: prolog with lock for job 53311 ran for 0 seconds
[2019-11-22T15:18:31.842] [53311.0] couldn't chdir to `/home/brentf': Permission denied: going to /tmp instead
[2019-11-22T15:18:31.863] [53311.0] done with job
[2019-11-22T15:28:21.768] _run_prolog: run job script took usec=10
[2019-11-22T15:28:21.768] _run_prolog: prolog with lock for job 53290 ran for 0 seconds
[2019-11-22T15:28:21.768] Launching batch job 53290 for UID 1002
[2019-11-22T15:28:21.807] _run_prolog: run job script took usec=6
[2019-11-22T15:28:21.807] _run_prolog: prolog with lock for job 53299 ran for 0 seconds
[2019-11-22T15:28:21.807] Launching batch job 53299 for UID 1002
[2019-11-22T15:28:21.828] launch task 53290.0 request from 1002.1002@127.0.0.1 (port 23203)
[2019-11-22T15:28:21.834] launch task 53299.0 request from 1002.1002@127.0.0.1 (port 25251)
[2019-11-22T15:32:10.426] launch task 53312.0 request from 1002.1002@127.0.0.1 (port 20146)
[2019-11-22T15:32:10.426] _run_prolog: run job script took usec=3
[2019-11-22T15:32:10.426] _run_prolog: prolog with lock for job 53312 ran for 0 seconds
[2019-11-22T15:32:10.432] [53312.0] couldn't chdir to `/home/brentf': Permission denied: going to /tmp instead
[2019-11-22T15:32:10.449] [53312.0] done with job
[2019-11-22T15:40:49.622] launch task 53313.0 request from 1002.1002@127.0.0.1 (port 59606)
[2019-11-22T15:40:49.622] _run_prolog: run job script took usec=2
[2019-11-22T15:40:49.622] _run_prolog: prolog with lock for job 53313 ran for 0 seconds
[2019-11-22T15:40:49.628] [53313.0] couldn't chdir to `/home/brentf': Permission denied: going to /tmp instead
[2019-11-22T15:40:49.647] [53313.0] done with job
[2019-11-22T15:44:12.074] launch task 53314.0 request from 1002.1002@127.0.0.1 (port 4325)
[2019-11-22T15:44:12.074] _run_prolog: run job script took usec=4
[2019-11-22T15:44:12.074] _run_prolog: prolog with lock for job 53314 ran for 0 seconds
[2019-11-22T15:44:12.083] [53314.0] couldn't chdir to `/home/brentf': Permission denied: going to /tmp instead
[2019-11-22T15:44:12.100] [53314.0] done with job
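
Two things stand out to me in the slurmd log above: the "Address already in use" errors (which look like a second slurmd instance was started while the first was still holding port 6818) and the repeated "couldn't chdir to `/home/brentf': Permission denied" messages. These standard Linux checks should cover both (nothing sen2agri-specific):

# is more than one slurmd running, and what is holding port 6818?
pgrep -a slurmd
ss -tlnp | grep 6818

# can the job's user actually enter the working directory?
ls -ld /home/brentf
namei -l /home/brentf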

slurmctld:

[2019-11-21T18:52:08.264] sched: Allocate JobID=53287 NodeList=localhost #CPUs=1 Partition=sen2agri
[2019-11-21T18:53:05.881] _slurm_rpc_submit_batch_job JobId=53288 usec=482
[2019-11-21T18:54:06.474] _slurm_rpc_submit_batch_job JobId=53289 usec=1001
[2019-11-21T18:55:06.194] _slurm_rpc_submit_batch_job JobId=53290 usec=789
[2019-11-21T18:56:06.092] _slurm_rpc_submit_batch_job JobId=53291 usec=613
[2019-11-21T18:57:06.238] _slurm_rpc_submit_batch_job JobId=53292 usec=73914
[2019-11-21T18:58:06.068] _slurm_rpc_submit_batch_job JobId=53293 usec=1252
[2019-11-21T18:59:05.824] _slurm_rpc_submit_batch_job JobId=53294 usec=459
[2019-11-21T19:00:06.310] _slurm_rpc_submit_batch_job JobId=53295 usec=576
[2019-11-21T19:01:06.566] _slurm_rpc_submit_batch_job JobId=53296 usec=477
[2019-11-21T19:02:06.256] _slurm_rpc_submit_batch_job JobId=53297 usec=487
[2019-11-22T07:43:57.579] job_complete: JobID=53287 State=0x1 NodeCnt=1 WEXITSTATUS 0
[2019-11-22T07:43:57.611] job_complete: JobID=53287 State=0x8003 NodeCnt=1 done
[2019-11-22T07:43:58.944] sched: Allocate JobID=53288 NodeList=localhost #CPUs=1 Partition=sen2agri
[2019-11-22T07:44:07.547] _slurm_rpc_submit_batch_job JobId=53298 usec=83220
[2019-11-22T07:45:06.756] _slurm_rpc_submit_batch_job JobId=53299 usec=19891
[2019-11-22T07:46:08.041] _slurm_rpc_submit_batch_job JobId=53300 usec=444
[2019-11-22T07:47:07.727] _slurm_rpc_submit_batch_job JobId=53301 usec=588
[2019-11-22T07:48:08.469] _slurm_rpc_submit_batch_job JobId=53302 usec=442
[2019-11-22T07:49:08.440] _slurm_rpc_submit_batch_job JobId=53303 usec=78795
[2019-11-22T07:50:08.580] _slurm_rpc_submit_batch_job JobId=53304 usec=246849
[2019-11-22T07:51:08.242] _slurm_rpc_submit_batch_job JobId=53305 usec=44657
[2019-11-22T07:52:07.282] _slurm_rpc_submit_batch_job JobId=53306 usec=563
[2019-11-22T07:53:07.774] _slurm_rpc_submit_batch_job JobId=53307 usec=600
[2019-11-22T07:54:08.056] _slurm_rpc_submit_batch_job JobId=53308 usec=1033
[2019-11-22T10:00:25.364] Terminate signal (SIGINT or SIGTERM) received
[2019-11-22T10:00:25.545] Saving all slurm state
[2019-11-22T10:00:34.456] layouts: all layouts are now unloaded.
[2019-11-22T10:05:26.558] error: Unable to lock pidfile `/var/run/slurmctld.pid': Resource temporarily unavailable
[2019-11-22T10:05:26.757] slurmctld version 15.08.7 started on cluster sen2agri
[2019-11-22T10:05:26.757] slurmctld version 15.08.7 started on cluster sen2agri
[2019-11-22T10:05:26.980] error: Association database appears down, reading from state file.
[2019-11-22T10:05:26.980] error: Association database appears down, reading from state file.
[2019-11-22T10:05:27.345] layouts: no layout to initialize
[2019-11-22T10:05:27.345] layouts: no layout to initialize
[2019-11-22T10:05:27.852] layouts: loading entities/relations information
[2019-11-22T10:05:27.852] layouts: loading entities/relations information
[2019-11-22T10:05:27.862] Recovered state of 1 nodes
[2019-11-22T10:05:27.862] Recovered state of 1 nodes
[2019-11-22T10:05:27.972] recovered job step 53285.0
[2019-11-22T10:05:27.972] recovered job step 53285.0
[2019-11-22T10:05:27.973] Recovered JobID=53285 State=0x1 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.973] Recovered JobID=53285 State=0x1 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.973] recovered job step 53288.0
[2019-11-22T10:05:27.973] recovered job step 53288.0
[2019-11-22T10:05:27.973] Recovered JobID=53288 State=0x1 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.973] Recovered JobID=53288 State=0x1 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.973] Recovered JobID=53289 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.973] Recovered JobID=53289 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.973] Recovered JobID=53290 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.973] Recovered JobID=53290 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.973] Recovered JobID=53291 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53292 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53291 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53293 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53292 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53294 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53293 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53295 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53294 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53295 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53296 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53297 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53296 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53298 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53297 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53299 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53298 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53300 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53299 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53301 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53300 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53302 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53301 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53303 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53302 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53304 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53303 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53305 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53304 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53306 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53305 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53307 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53306 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53308 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered information about 22 jobs
[2019-11-22T10:05:27.974] Recovered JobID=53307 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered JobID=53308 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T10:05:27.974] Recovered information about 22 jobs
[2019-11-22T10:05:27.976] cons_res: select_p_node_init
[2019-11-22T10:05:27.976] cons_res: select_p_node_init
[2019-11-22T10:05:27.976] cons_res: preparing for 2 partitions
[2019-11-22T10:05:27.976] cons_res: preparing for 2 partitions
[2019-11-22T10:05:28.588] Recovered state of 0 reservations
[2019-11-22T10:05:28.588] Recovered state of 0 reservations
[2019-11-22T10:05:28.605] read_slurm_conf: backup_controller not specified.
[2019-11-22T10:05:28.605] read_slurm_conf: backup_controller not specified.
[2019-11-22T10:05:28.605] cons_res: select_p_reconfigure
[2019-11-22T10:05:28.605] cons_res: select_p_reconfigure
[2019-11-22T10:05:28.605] cons_res: select_p_node_init
[2019-11-22T10:05:28.605] cons_res: select_p_node_init
[2019-11-22T10:05:28.605] cons_res: preparing for 2 partitions
[2019-11-22T10:05:28.605] cons_res: preparing for 2 partitions
[2019-11-22T10:05:28.605] Running as primary controller
[2019-11-22T10:05:28.605] Running as primary controller
[2019-11-22T10:05:28.719] Registering slurmctld at port 6817 with slurmdbd.
[2019-11-22T10:05:28.719] Registering slurmctld at port 6817 with slurmdbd.
[2019-11-22T10:05:29.213] Recovered information about 0 sicp jobs
[2019-11-22T10:05:29.213] Recovered information about 0 sicp jobs
[2019-11-22T10:05:29.215] error: Error binding slurm stream socket: Address already in use
[2019-11-22T10:05:29.215] fatal: slurm_init_msg_engine_addrname_port error Address already in use
[2019-11-22T10:05:32.500] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=0
[2019-11-22T10:06:07.905] Registering slurmctld at port 6817 with slurmdbd.
[2019-11-22T10:06:50.255] Terminate signal (SIGINT or SIGTERM) received
[2019-11-22T10:06:50.285] Saving all slurm state
[2019-11-22T10:06:55.681] layouts: all layouts are now unloaded.
[2019-11-22T14:56:35.826] slurmctld version 15.08.7 started on cluster sen2agri
[2019-11-22T14:56:35.972] error: Association database appears down, reading from state file.
[2019-11-22T14:56:36.245] layouts: no layout to initialize
[2019-11-22T14:56:37.034] layouts: loading entities/relations information
[2019-11-22T14:56:37.157] Recovered state of 1 nodes
[2019-11-22T14:56:37.385] recovered job step 53285.0
[2019-11-22T14:56:37.385] Recovered JobID=53285 State=0x1 NodeCnt=0 Assoc=4
[2019-11-22T14:56:37.385] recovered job step 53288.0
[2019-11-22T14:56:37.385] Recovered JobID=53288 State=0x1 NodeCnt=0 Assoc=4
[2019-11-22T14:56:37.385] Recovered JobID=53289 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T14:56:37.385] Recovered JobID=53290 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T14:56:37.385] Recovered JobID=53291 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T14:56:37.385] Recovered JobID=53292 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T14:56:37.385] Recovered JobID=53293 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T14:56:37.385] Recovered JobID=53294 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T14:56:37.385] Recovered JobID=53295 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T14:56:37.385] Recovered JobID=53296 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T14:56:37.385] Recovered JobID=53297 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T14:56:37.385] Recovered JobID=53298 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T14:56:37.385] Recovered JobID=53299 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T14:56:37.385] Recovered JobID=53300 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T14:56:37.385] Recovered JobID=53301 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T14:56:37.386] Recovered JobID=53302 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T14:56:37.386] Recovered JobID=53303 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T14:56:37.386] Recovered JobID=53304 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T14:56:37.386] Recovered JobID=53305 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T14:56:37.386] Recovered JobID=53306 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T14:56:37.386] Recovered JobID=53307 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T14:56:37.386] Recovered JobID=53308 State=0x0 NodeCnt=0 Assoc=4
[2019-11-22T14:56:37.386] Recovered information about 22 jobs
[2019-11-22T14:56:37.386] cons_res: select_p_node_init
[2019-11-22T14:56:37.386] cons_res: preparing for 2 partitions
[2019-11-22T14:56:38.114] Recovered state of 0 reservations
[2019-11-22T14:56:38.168] read_slurm_conf: backup_controller not specified.
[2019-11-22T14:56:38.168] cons_res: select_p_reconfigure
[2019-11-22T14:56:38.168] cons_res: select_p_node_init
[2019-11-22T14:56:38.168] cons_res: preparing for 2 partitions
[2019-11-22T14:56:38.168] Running as primary controller
[2019-11-22T14:56:38.168] Registering slurmctld at port 6817 with slurmdbd.
[2019-11-22T14:56:38.170] Recovered information about 0 sicp jobs
[2019-11-22T14:56:41.174] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=0
[2019-11-22T14:57:38.958] sched: Allocate JobID=53289 NodeList=localhost #CPUs=1 Partition=sen2agri
[2019-11-22T14:57:39.076] sched: Allocate JobID=53298 NodeList=localhost #CPUs=1 Partition=sen2agri
[2019-11-22T15:18:10.929] sched: _slurm_rpc_allocate_resources JobId=53310 NodeList=localhost usec=25117
[2019-11-22T15:18:10.974] job_complete: JobID=53310 State=0x1 NodeCnt=1 WEXITSTATUS 0
[2019-11-22T15:18:11.034] job_complete: JobID=53310 State=0x8003 NodeCnt=1 done
[2019-11-22T15:18:31.824] sched: _slurm_rpc_allocate_resources JobId=53311 NodeList=localhost usec=370
[2019-11-22T15:18:31.863] job_complete: JobID=53311 State=0x1 NodeCnt=1 WEXITSTATUS 0
[2019-11-22T15:18:31.863] job_complete: JobID=53311 State=0x8003 NodeCnt=1 done
[2019-11-22T15:28:21.595] Batch JobId=53285 missing from node 0 (not found BatchStartTime after startup), Requeuing job
[2019-11-22T15:28:21.595] job_complete: JobID=53285 State=0x1 NodeCnt=1 WTERMSIG 126
[2019-11-22T15:28:21.595] job_complete: JobID=53285 State=0x1 NodeCnt=1 cancelled by node failure
[2019-11-22T15:28:21.653] job_complete: requeue JobID=53285 State=0x8000 NodeCnt=1 due to node failure
[2019-11-22T15:28:21.653] job_complete: JobID=53285 State=0x8000 NodeCnt=1 done
[2019-11-22T15:28:21.653] Batch JobId=53288 missing from node 0 (not found BatchStartTime after startup), Requeuing job
[2019-11-22T15:28:21.653] job_complete: JobID=53288 State=0x1 NodeCnt=1 WTERMSIG 126
[2019-11-22T15:28:21.653] job_complete: JobID=53288 State=0x1 NodeCnt=1 cancelled by node failure
[2019-11-22T15:28:21.710] job_complete: requeue JobID=53288 State=0x8000 NodeCnt=1 due to node failure
[2019-11-22T15:28:21.710] job_complete: JobID=53288 State=0x8000 NodeCnt=1 done
[2019-11-22T15:28:21.710] Requeuing JobID=53285 State=0x0 NodeCnt=0
[2019-11-22T15:28:21.711] Requeuing JobID=53288 State=0x0 NodeCnt=0
[2019-11-22T15:28:21.716] sched: Allocate JobID=53290 NodeList=localhost #CPUs=1 Partition=sen2agri
[2019-11-22T15:28:21.765] sched: Allocate JobID=53299 NodeList=localhost #CPUs=1 Partition=sen2agri
[2019-11-22T15:31:49.001] error: User 1000 not found
[2019-11-22T15:31:49.001] _job_create: invalid account or partition for user 1000, account '(null)', and partition 'sen2agri'
[2019-11-22T15:31:49.001] _slurm_rpc_allocate_resources: Invalid account or account/partition combination specified 
[2019-11-22T15:32:10.422] sched: _slurm_rpc_allocate_resources JobId=53312 NodeList=localhost usec=246
[2019-11-22T15:32:10.449] job_complete: JobID=53312 State=0x1 NodeCnt=1 WEXITSTATUS 0
[2019-11-22T15:32:10.449] job_complete: JobID=53312 State=0x8003 NodeCnt=1 done
[2019-11-22T15:40:49.618] sched: _slurm_rpc_allocate_resources JobId=53313 NodeList=localhost usec=244
[2019-11-22T15:40:49.647] job_complete: JobID=53313 State=0x1 NodeCnt=1 WEXITSTATUS 0
[2019-11-22T15:40:49.647] job_complete: JobID=53313 State=0x8003 NodeCnt=1 done
[2019-11-22T15:44:12.067] sched: _slurm_rpc_allocate_resources JobId=53314 NodeList=localhost usec=369
[2019-11-22T15:44:12.100] job_complete: JobID=53314 State=0x1 NodeCnt=1 WEXITSTATUS 0
[2019-11-22T15:44:12.100] job_complete: JobID=53314 State=0x8003 NodeCnt=1 done
[2019-11-22T15:52:42.609] Invalid node state transition requested for node localhost from=ALLOCATED to=RESUME
[2019-11-22T15:52:42.610] _slurm_rpc_update_node for localhost: Invalid node state specified
[2019-11-22T15:53:55.049] Invalid node state transition requested for node localhost from=ALLOCATED to=RESUME
[2019-11-22T15:53:55.049] _slurm_rpc_update_node for localhost: Invalid node state specified
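
The last four lines look like a RESUME request issued while the node was still ALLOCATED; as far as I understand, RESUME is only accepted for nodes that are DOWN or DRAINED, hence the errors. For completeness, the commands involved (standard scontrol/squeue):

# inspect the node and the queue first
scontrol show node localhost
squeue --states=PD,R --long

# only valid when the node is DOWN or DRAINED, hence the errors above
scontrol update NodeName=localhost State=RESUME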

Regards.

It’s already running. I just had to delete the redundant, less important jobs unrelated to the L4B processing.
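
For anyone hitting the same situation, the cleanup itself is plain scancel (double-check the IDs with squeue first; the job IDs below are only examples taken from the queue above):

# see what is still queued on the partition
squeue --partition=sen2agri --long

# cancel specific jobs by ID...
scancel 53291 53292

# ...or everything still pending for a given user on the partition
scancel --state=PENDING --partition=sen2agri --user=<user>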

Cheers.