Hmm, did `sinfo -R` mention any reason?
[root@vm-sen2agri ~]# sinfo -R
REASON USER TIMESTAMP NODELIST
The output is empty now.
But from time to time Slurm disables the node again and the processors stop running…
Maybe `/` is full? Can you check?
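For a quick check, something like the sketch below prints how full the root filesystem is as a bare number. `usage_pct` is just a helper name made up for this post, and the 90% threshold is an arbitrary assumption, not a Slurm limit.

```shell
# Print how full a filesystem is, as a bare percentage (defaults to /).
# usage_pct is a hypothetical helper name, not a standard command.
usage_pct() {
    # -P forces the POSIX layout (one line per filesystem), so column 5 is the usage.
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

pct=$(usage_pct /)
if [ "$pct" -ge 90 ]; then   # 90 is an arbitrary warning threshold
    echo "/ is ${pct}% full"
else
    echo "/ is fine (${pct}% used)"
fi
```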
There is still 2.5T of free space :)
[root@vm-sen2agri ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda4 7.9T 5.4T 2.5T 70% /
devtmpfs 99G 0 99G 0% /dev
tmpfs 99G 29M 99G 1% /dev/shm
tmpfs 99G 4.3G 95G 5% /run
tmpfs 99G 0 99G 0% /sys/fs/cgroup
/dev/sda2 1014M 196M 819M 20% /boot
tmpfs 20G 48K 20G 1% /run/user/1000
/dev/sr0 8.1G 8.1G 0 100% /run/media/admin/CentOS 7 x86_64
Now I get a new message when running `sinfo -R`:
REASON USER TIMESTAMP NODELIST
Low socket*core*thre slurm 2018-05-10T16:17:58 localhost
**You have new mail in /var/spool/mail/root**
/var/spool/mail/root contains:
From user@localhost.localdomain Thu May 10 17:30:43 2018
Return-Path: <user@localhost.localdomain>
X-Original-To: root@localhost
Delivered-To: root@localhost.localdomain
Received: by vm-sen2agri.localdomain (Postfix, from userid 0)
id 1EDE8357E5A4F; Thu, 10 May 2018 17:30:42 +0200 (CEST)
Date: Thu, 10 May 2018 17:30:42 +0200
From: user@localhost.localdomain
To: root@localhost.localdomain
Subject: [abrt] setroubleshoot-server:
server.py:699:RunFaultServer:ValueError: unable to open
/sys/fs/selinux/policy: Device or resource busy
Message-ID: <5af465a2.Gv7yHwSyobfwC3AS%user@localhost>
User-Agent: Heirloom mailx 12.5 7/5/10
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
reason: server.py:699:RunFaultServer:ValueError: unable to open /sys/fs/selinux/policy: Device or resource busy
cmdline: /usr/bin/python -Es /usr/sbin/setroubleshootd -f ''
executable: /usr/sbin/setroubleshootd
package: setroubleshoot-server-3.2.28-3.el7
component: setroubleshoot
pid: 7895
hostname: vm-sen2agri
count: 1
abrt_version: 2.1.11
analyzer: Python
architecture: x86_64
duphash: 33bc329d11dc6adc518b385827c2f8af9b150646
event_log:
kernel: 3.10.0-693.21.1.el7.x86_64
last_occurrence: 1525966239
os_release: CentOS Linux release 7.4.1708 (Core)
pkg_arch: x86_64
pkg_epoch: 0
pkg_fingerprint: 24C6 A8A7 F4A8 0EB5
pkg_name: setroubleshoot-server
pkg_release: 3.el7
pkg_vendor: CentOS
pkg_version: 3.2.28
runlevel: N 5
time: Thu 10 May 2018 05:30:39 PM CEST
type: Python
uid: 992
username: setroubleshoot
uuid: 33bc329d11dc6adc518b385827c2f8af9b150646
backtrace:
:server.py:699:RunFaultServer:ValueError: unable to open /sys/fs/selinux/policy: Device or resource busy
:
:
:Traceback (most recent call last):
: File "/usr/sbin/setroubleshootd", line 102, in <module>
: RunFaultServer(timeout)
: File "/usr/lib64/python2.7/site-packages/setroubleshoot/server.py", line 699, in RunFaultServer
: audit2why.init()
:ValueError: unable to open /sys/fs/selinux/policy: Device or resource busy
:
:
:Local variables in innermost frame:
:timeout: 10
environ:
:DBUS_STARTER_BUS_TYPE=system
:DBUS_STARTER_ADDRESS=unix:path=/var/run/dbus/system_bus_socket
machineid:
:systemd=c1f2476d7ed14011acf5514cdf336eec
:sosreport_uploader-dmidecode=54d39d3f7cb7ffcc5d86e2347a1256acdd7e6401628cf6cc15c755fad51b5486
os_info:
:NAME="CentOS Linux"
:VERSION="7 (Core)"
:ID="centos"
:ID_LIKE="rhel fedora"
:VERSION_ID="7"
:PRETTY_NAME="CentOS Linux 7 (Core)"
:ANSI_COLOR="0;31"
:CPE_NAME="cpe:/o:centos:centos:7"
:HOME_URL="https://www.centos.org/"
:BUG_REPORT_URL="https://bugs.centos.org/"
:
:CENTOS_MANTISBT_PROJECT="CentOS-7"
:CENTOS_MANTISBT_PROJECT_VERSION="7"
:REDHAT_SUPPORT_PRODUCT="centos"
:REDHAT_SUPPORT_PRODUCT_VERSION="7"
Ah, the reason was also in one of your previous posts:
Reason=Low socket*core*thread count, Low CPUs
In `/etc/slurm/slurm.conf` you have this line:
NodeName=localhost CPUs=64
Does your system really have 64 logical threads? If not, you should reduce the `CPUs` value to the real one.
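To compare the two numbers, a rough sketch along these lines might help. `configured_cpus` is a helper name invented for this post. It assumes the node is declared on a line that starts with exactly `NodeName=localhost` and uses the `CPUs=` spelling shown above, so it will miss node ranges or differently cased keys.

```shell
# configured_cpus FILE NODE
# Hypothetical helper: print the CPUs= value from the NodeName line of NODE
# in a slurm.conf-style file. Prints nothing if the node or the key is absent.
configured_cpus() {
    awk -v node="$2" '
        $1 == ("NodeName=" node) {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^CPUs=/) { sub(/^CPUs=/, "", $i); print $i; exit }
        }' "$1"
}

conf=/etc/slurm/slurm.conf
if [ -r "$conf" ]; then
    want=$(configured_cpus "$conf" localhost)
    have=$(nproc --all)   # logical CPUs installed, as seen by the kernel
    echo "slurm.conf says CPUs=${want:-unset}, the machine has ${have}"
fi
```

If the numbers differ, `slurmd -C` prints the hardware as slurmd itself detects it, in slurm.conf syntax, which can serve as a starting point for the corrected line. After fixing the file and restarting the Slurm daemons, the node probably still needs to be returned to service, for example with `scontrol update NodeName=localhost State=RESUME`.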