Irwin recently experienced one of those random crashes and is currently in a bad state. One symptom:
[root@tempest ~] ssh irwin
Last login: Tue Dec 11 14:31:20 2012 from puma.wellesley.edu
[root@irwin ~] getent passwd anderson
The lookup returns nothing, so somehow one of the essential authentication services has failed. What could it be? Thrush is working fine, so I did the following on both irwin and thrush:
[root@irwin ~] service --status-all > /usr/network/tmp/services-irwin
JAVA_EXECUTABLE or HSQLDB_JAR_PATH in '/etc/sysconfig/hsqldb' is set to a non-file.
No sensors found!
Make sure you loaded all the kernel drivers you need.
Try sensors-detect to find out which these are.
Now, compare:
[root@irwin tmp] diff services-irwin services-thrush | grep -v pid
1,7c1,7
< auditd is stopped
---
9,11c9
< Running
< cgred is stopped
---
13,17c11,15
< 2010
---
> dnsmasq is stopped
> 2014
Hmm. That’s interesting. I would have expected something involving sssd or ldap. Let’s look specifically at the failing command:
[root@irwin tmp] strace getent passwd anderson
close(3) = 0
munmap(0xb7484000, 99604) = 0
getpid() = 15639
fstat64(-1, 0xbff2f278) = -1 EBADF (Bad file descriptor)
time(NULL) = 1355255435
socket(PF_FILE, SOCK_STREAM, 0) = 3
fcntl64(3, F_GETFL) = 0x2 (flags O_RDWR)
fcntl64(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
fcntl64(3, F_GETFD) = 0
fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
connect(3, {sa_family=AF_FILE, path="/var/lib/sss/pipes/nss"}, 110) = -1 ECONNREFUSED (Connection refused)
close(3) = 0
exit_group(2)
Connection refused? By whom? Why?
[root@irwin tmp] cd /var/log/sssd/
[root@irwin sssd] grep 'connection refused' *
[root@irwin sssd] grep 'connection' *
[root@irwin sssd] grep 'refused' *
[root@irwin sssd] cd ..
[root@irwin log] grep 'refused' *
[root@irwin log] grep 'connection' *
anaconda.syslog:16:24:25,558 NOTICE NetworkManager: ifcfg-rh: read connection 'System eth0'
anaconda.syslog:16:24:25,558 NOTICE NetworkManager: ifcfg-rh: Ignoring connection 'System eth0' and its device due to NM_CONTROLLED/BRIDGE/VLAN.
anaconda.syslog:16:24:26,555 WARNING NetworkManager: <warn> error requesting auth for org.freedesktop.NetworkManager.use-user-connections: (26) Remote Exception invoking org.freedesktop.PolicyKit1.Authority.CheckAuthorization() on /org/freedesktop/PolicyKit1/Authority at name org.freedesktop.PolicyKit1: org.freedesktop.DBus.Error.Spawn.ExecFailed: Cannot launch daemon, file not found or permissions invalid
anaconda.syslog:16:25:05,936 NOTICE NetworkManager: ifcfg-rh: Managing connection 'System eth0' and its device because NM_CONTROLLED was true.
[root@irwin log]
So, not logged by the client, it seems. What’s with this pipe?
[root@irwin log] lsof | grep pipes
sssd 1851 root 13u unix 0xdf043500 0t0 10659 /var/lib/sss/pipes/private/sbus-monitor
sssd 1851 root 15u unix 0xf56a0ac0 0t0 10779 /var/lib/sss/pipes/private/sbus-monitor
sssd 1851 root 17u unix 0xdf040580 0t0 10842 /var/lib/sss/pipes/private/sbus-monitor
sssd_be 1862 root 22u unix 0xf56a04c0 0t0 10780 /var/lib/sss/pipes/private/sbus-dp_LDAP.1862
sssd_be 1862 root 30u unix 0xf5a5f480 0t0 10844 /var/lib/sss/pipes/private/sbus-dp_LDAP.1862
sssd_pam 1893 root 23u unix 0xdf040780 0t0 10845 /var/lib/sss/pipes/pam
sssd_pam 1893 root 24u unix 0xdf040980 0t0 10847 /var/lib/sss/pipes/private/pam
[root@irwin log] service sssd status
sssd (pid 1851) is running...
[root@irwin log]
Hmm. I expected nss to be there. What happens if we just restart sssd?
[root@irwin log] service sssd restart
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
[root@irwin log] getent passwd anderson
anderson:*:716:501:Scott D. Anderson,E114:/home/anderson:/bin/bash
[root@irwin log]
Well, that’s good, I guess. At least we can do this remotely. We could also set up a cron job that just restarts sssd every hour or so. Can anyone think of anything better? Can we tell when sssd gets in trouble?
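One variation on the cron idea, as a rough sketch rather than a tested fix: restart sssd only when a lookup actually fails, instead of every hour regardless. The function name check_sssd and the variables PROBE_USER, LOOKUP_CMD and RESTART_CMD are made up for illustration. The sketch assumes the probe account exists only in LDAP (an account also present in /etc/passwd would resolve even with sssd down), and that a restart is the right response (it would not help if the LDAP server itself were unreachable).

```shell
#!/bin/sh
# Hypothetical watchdog: restart sssd only when an LDAP-backed lookup fails.
# PROBE_USER is assumed to exist only in LDAP; "anderson" comes from the
# transcript above.
PROBE_USER="${PROBE_USER:-anderson}"
# Both commands are overridable so the logic can be dry-run without
# touching the real sssd.
LOOKUP_CMD="${LOOKUP_CMD:-getent passwd}"
RESTART_CMD="${RESTART_CMD:-service sssd restart}"

check_sssd() {
    # getent exits non-zero when the key cannot be found, which is what
    # the strace above shows (exit_group(2)).
    if $LOOKUP_CMD "$PROBE_USER" >/dev/null 2>&1; then
        return 0
    fi
    # Anything a cron job prints is normally mailed to the job's owner,
    # so this line doubles as a notification.
    echo "sssd-watchdog: lookup of $PROBE_USER failed; restarting sssd" >&2
    $RESTART_CMD
}
```

Saved as a script that ends with a call to check_sssd, this could be run from cron every few minutes. It only papers over the crash, but it would at least leave a record of how often sssd needs the kick.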
[root@irwin log] grep sssd /var/log/messages
Dec 10 22:42:37 irwin abrt[9800]: Saved core dump of pid 1892 (/usr/libexec/sssd/sssd_nss) to /var/spool/abrt/ccpp-2012-12-10-22:42:37-1892 (819200 bytes)
Dec 10 22:42:38 irwin sssd[nss]: Starting up
Dec 10 22:43:09 irwin abrt[9979]: Saved core dump of pid 9801 (/usr/libexec/sssd/sssd_nss) to /var/spool/abrt/ccpp-2012-12-10-22:43:09-9801 (712704 bytes)
Dec 10 22:43:11 irwin sssd[nss]: Starting up
Dec 10 22:43:34 irwin abrt[10156]: Saved core dump of pid 9980 (/usr/libexec/sssd/sssd_nss) to /var/spool/abrt/ccpp-2012-12-10-22:43:34-9980 (712704 bytes)
Dec 10 22:43:36 irwin sssd[nss]: Starting up
Dec 10 22:43:58 irwin abrt[10333]: Saved core dump of pid 10157 (/usr/libexec/sssd/sssd_nss) to /var/spool/abrt/ccpp-2012-12-10-22:43:58-10157 (712704 bytes)
Dec 10 22:44:00 irwin sssd[nss]: Starting up
Dec 10 22:44:23 irwin abrt[10512]: Saved core dump of pid 10334 (/usr/libexec/sssd/sssd_nss) to /var/spool/abrt/ccpp-2012-12-10-22:44:23-10334 (712704 bytes)
Dec 11 14:56:06 irwin sssd[pam]: Shutting down
Dec 11 14:56:06 irwin sssd[be[LDAP]]: Shutting down
Dec 11 14:56:06 irwin sssd: Starting up
Dec 11 14:56:06 irwin sssd[be[LDAP]]: Starting up
Dec 11 14:56:06 irwin sssd[nss]: Starting up
Dec 11 14:56:06 irwin sssd[pam]: Starting up
[root@irwin log]
Oh, that’s interesting. And we see:
[root@irwin log] cd /var/spool/abrt
[root@irwin abrt] ls
abrt-db ccpp-2012-10-02-21:46:07-1899 ccpp-2012-12-10-22:42:37-1892
ccpp-2012-10-02-21:45:13-3727 ccpp-2012-12-10-22:42:28-17063 last-ccpp
[root@irwin abrt] ls -lt
total 20
drwxr-x--- 2 abrt root 4096 Dec 10 22:45 ccpp-2012-12-10-22:42:37-1892
drwxr-xr-x 3 root root 4096 Dec 10 22:45 ccpp-2012-10-02-21:45:13-3727
drwxr-x--- 2 abrt root 4096 Dec 10 22:45 ccpp-2012-10-02-21:46:07-1899
drwxr-x--- 2 abrt gdm 4096 Dec 10 22:45 ccpp-2012-12-10-22:42:28-17063
-rw------- 1 root root 26 Dec 10 22:44 last-ccpp
-rw-r--r--. 1 root root 0 Aug 29 12:53 abrt-db
[root@irwin abrt] cd ccpp-2012-12-10-22\:42\:37-1892/
[root@irwin ccpp-2012-12-10-22:42:37-1892] ls
abrt_version component environ limits package sosreport.tar.xz uuid
analyzer coredump executable maps pid time var_log_messages
architecture count hostname open_fds pwd uid
cmdline dso_list kernel os_release reason username
[root@irwin ccpp-2012-12-10-22:42:37-1892] more cmdline
/usr/libexec/sssd/sssd_nss --debug-to-files
[root@irwin ccpp-2012-12-10-22:42:37-1892] more reason
Process /usr/libexec/sssd/sssd_nss was killed by signal 6 (SIGABRT)
[root@irwin ccpp-2012-12-10-22:42:37-1892] more var_log_messages
Dec 10 22:42:37 irwin abrt[9800]: Saved core dump of pid 1892 (/usr/libexec/sssd/sssd_nss) to /var/spool/abrt/ccpp-2012-12-10-22:42:37-1892 (819200 bytes)
Dec 10 22:43:09 irwin abrt[9979]: Saved core dump of pid 9801 (/usr/libexec/sssd/sssd_nss) to /var/spool/abrt/ccpp-2012-12-10-22:43:09-9801 (712704 bytes)
Okay, so nothing diagnostic in here, but interesting nevertheless. Googling for “sssd_nss sigabrt” yields some results like:
https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id...
Nov 13, 2011 – Summary: sssd_nss crashes when passed invalid UTF-8
for the .../usr/libexec/sssd/sssd_nss was killed by signal 6 (SIGABRT)
time: Sun Nov ...
[root@irwin ccpp-2012-12-10-22:42:37-1892] rpm -q sssd
sssd-1.8.0-32.el6.i686
Still, sssd 1.9 is out, so maybe we should try upgrading?
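On the earlier question of whether we can tell when sssd gets in trouble: the abrt entries in /var/log/messages look like a usable signal, since sssd_nss dumped core five times in two minutes before giving up. A minimal sketch that counts those entries follows. The function names count_nss_crashes and sssd_nss_crash_looping are invented here, the default threshold of 3 is arbitrary, and the pattern is copied from the log lines above, so other abrt versions may word the message differently.

```shell
#!/bin/sh
# Count abrt "Saved core dump" entries for sssd_nss in a syslog-style file.
# Note: this counts every matching line in the file, not only recent ones.
count_nss_crashes() {
    # $1: log file to scan (defaults to /var/log/messages)
    # grep -c exits non-zero on zero matches or a missing file, so the
    # status is ignored and an empty result is reported as 0.
    _crashes=$(grep -c 'abrt\[[0-9]*]: Saved core dump of pid [0-9]* (/usr/libexec/sssd/sssd_nss)' \
        "${1:-/var/log/messages}" 2>/dev/null) || true
    echo "${_crashes:-0}"
}

# Succeed (exit 0) when the crash count reaches the threshold.
sssd_nss_crash_looping() {
    # $1: threshold (default 3, an arbitrary choice), $2: log file
    [ "$(count_nss_crashes "${2:-/var/log/messages}")" -ge "${1:-3}" ]
}
```

For example, `sssd_nss_crash_looping 3 && echo "sssd_nss keeps crashing on $(hostname)"` could go in a cron job, so that we hear about the problem before someone fails to log in.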