L3-Processing


#22

Hmm, did sinfo -R mention any reason?


#23

[root@vm-sen2agri ~]# sinfo -R
REASON USER TIMESTAMP NODELIST

The output is empty now.

But from time to time Slurm disables the node again and the processors stop running…
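In the meantime a drained node can be inspected and put back into service by hand with scontrol (a sketch using standard Slurm commands, guarded so it is harmless on a machine without Slurm installed):

```shell
# Inspect the full node record, including the Reason= field Slurm
# set when it drained the node (guard: skip if Slurm is not installed):
if command -v scontrol >/dev/null 2>&1; then
    scontrol show node localhost
    # Once the underlying cause is fixed, return the node to service:
    scontrol update NodeName=localhost State=RESUME
fi
```

This only clears the current DRAIN/DOWN state; if the root cause is still present, Slurm will drain the node again.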


#24

Maybe / is full? Can you check?


#25

There is still 2.5T of free space :)

  [root@vm-sen2agri ~]# df -h
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sda4       7.9T  5.4T  2.5T  70% /
    devtmpfs         99G     0   99G   0% /dev
    tmpfs            99G   29M   99G   1% /dev/shm
    tmpfs            99G  4.3G   95G   5% /run
    tmpfs            99G     0   99G   0% /sys/fs/cgroup
    /dev/sda2      1014M  196M  819M  20% /boot
    tmpfs            20G   48K   20G   1% /run/user/1000
    /dev/sr0        8.1G  8.1G     0 100% /run/media/admin/CentOS 7 x86_64

#26

Now I get a new message when running sinfo -R:

REASON               USER      TIMESTAMP           NODELIST
Low socket*core*thre slurm     2018-05-10T16:17:58 localhost
**You have new mail in /var/spool/mail/root**
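The REASON column in the default sinfo -R layout is only 20 characters wide, which is why the message is cut off at "Low socket*core*thre". A wider format string shows the full text (a sketch using standard sinfo format specifiers, guarded so it is safe on a machine without Slurm):

```shell
# %E = full reason, %u = user who set it, %H = timestamp, %N = nodelist;
# the default 20-character reason field is widened to 60 characters here.
if command -v sinfo >/dev/null 2>&1; then
    sinfo -R -o '%60E %9u %19H %N'
fi
```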

/var/spool/mail/root contains:

From user@localhost.localdomain  Thu May 10 17:30:43 2018
Return-Path: <user@localhost.localdomain>
X-Original-To: root@localhost
Delivered-To: root@localhost.localdomain
Received: by vm-sen2agri.localdomain (Postfix, from userid 0)
        id 1EDE8357E5A4F; Thu, 10 May 2018 17:30:42 +0200 (CEST)
Date: Thu, 10 May 2018 17:30:42 +0200
From: user@localhost.localdomain
To: root@localhost.localdomain
Subject: [abrt] setroubleshoot-server:
 server.py:699:RunFaultServer:ValueError: unable to open
 /sys/fs/selinux/policy:  Device or resource busy
Message-ID: <5af465a2.Gv7yHwSyobfwC3AS%user@localhost>
User-Agent: Heirloom mailx 12.5 7/5/10
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

reason:         server.py:699:RunFaultServer:ValueError: unable to open /sys/fs/selinux/policy:  Device or resource busy
cmdline:        /usr/bin/python -Es /usr/sbin/setroubleshootd -f ''
executable:     /usr/sbin/setroubleshootd
package:        setroubleshoot-server-3.2.28-3.el7
component:	setroubleshoot
pid:            7895
hostname:	vm-sen2agri
count:          1
abrt_version:   2.1.11
analyzer:	Python
architecture:   x86_64
duphash:        33bc329d11dc6adc518b385827c2f8af9b150646
event_log:
kernel:         3.10.0-693.21.1.el7.x86_64
last_occurrence: 1525966239
os_release:     CentOS Linux release 7.4.1708 (Core)
pkg_arch:	x86_64
pkg_epoch:	0
pkg_fingerprint: 24C6 A8A7 F4A8 0EB5
pkg_name:	setroubleshoot-server
pkg_release:    3.el7
pkg_vendor:     CentOS
pkg_version:    3.2.28
runlevel:	N 5
time:           Thu 10 May 2018 05:30:39 PM CEST
type:           Python
uid:            992
username:	setroubleshoot
uuid:           33bc329d11dc6adc518b385827c2f8af9b150646

backtrace:
:server.py:699:RunFaultServer:ValueError: unable to open /sys/fs/selinux/policy:  Device or resource busy
:
:
:Traceback (most recent call last):
:  File "/usr/sbin/setroubleshootd", line 102, in <module>
:    RunFaultServer(timeout)
:  File "/usr/lib64/python2.7/site-packages/setroubleshoot/server.py", line 699, in RunFaultServer
:    audit2why.init()
:ValueError: unable to open /sys/fs/selinux/policy:  Device or resource busy
:
:
:Local variables in innermost frame:
:timeout: 10
environ:
:DBUS_STARTER_BUS_TYPE=system
:DBUS_STARTER_ADDRESS=unix:path=/var/run/dbus/system_bus_socket
machineid:
:systemd=c1f2476d7ed14011acf5514cdf336eec
:sosreport_uploader-dmidecode=54d39d3f7cb7ffcc5d86e2347a1256acdd7e6401628cf6cc15c755fad51b5486
os_info:
:NAME="CentOS Linux"
:VERSION="7 (Core)"
:ID="centos"
:ID_LIKE="rhel fedora"
:VERSION_ID="7"
:PRETTY_NAME="CentOS Linux 7 (Core)"
:ANSI_COLOR="0;31"
:CPE_NAME="cpe:/o:centos:centos:7"
:HOME_URL="https://www.centos.org/"
:BUG_REPORT_URL="https://bugs.centos.org/"
:
:CENTOS_MANTISBT_PROJECT="CentOS-7"
:CENTOS_MANTISBT_PROJECT_VERSION="7"
:REDHAT_SUPPORT_PRODUCT="centos"
:REDHAT_SUPPORT_PRODUCT_VERSION="7"

#27

Ah, the reason was also in one of your previous posts:

Reason=Low socket*core*thread count, Low CPUs

In /etc/slurm/slurm.conf you have this line:

NodeName=localhost CPUs=64

Does your system really have 64 logical threads? If not, you should reduce the CPUs value to the actual count.
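A quick way to check is slurmd -C, which prints the hardware Slurm itself detects as a ready-to-paste NodeName line, and to compare that with the kernel's view (a sketch, guarded so it also runs on a box without Slurm):

```shell
# What Slurm detects on this node (prints a NodeName=... line with
# CPUs, Sockets, CoresPerSocket, ThreadsPerCore, RealMemory):
if command -v slurmd >/dev/null 2>&1; then
    slurmd -C
fi

# The kernel's view of the CPU topology, for comparison with the
# CPUs= value in /etc/slurm/slurm.conf:
nproc
lscpu | grep -E '^(CPU\(s\)|Socket\(s\)|Core\(s\)|Thread\(s\))'
```

After changing CPUs= in slurm.conf, slurmctld and slurmd need to be restarted for the new value to take effect.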