Irwin recently experienced one of those random crashes and is currently in a bad state. One symptom:
[root@tempest ~] ssh irwin
Last login: Tue Dec 11 14:31:20 2012 from puma.wellesley.edu
[root@irwin ~] getent passwd anderson
So, somehow one of the essential authentication services has failed. What could it be? Thrush is working fine, so I ran the following on both irwin and thrush:
[root@irwin ~] service --status-all > /usr/network/tmp/services-irwin
JAVA_EXECUTABLE or HSQLDB_JAR_PATH in '/etc/sysconfig/hsqldb' is set to a non-file.
No sensors found!
Make sure you loaded all the kernel drivers you need.
Try sensors-detect to find out which these are.
Now, compare:
[root@irwin tmp] diff services-irwin services-thrush | grep -v pid
1,7c1,7
< auditd is stopped
---
9,11c9
< Running
< cgred is stopped
---
13,17c11,15
< 2010
---
> dnsmasq is stopped
> 2014
Hmm. That’s interesting. I would have expected something involving sssd or ldap. Let’s look specifically at the failing command:
[root@irwin tmp] strace getent passwd anderson
close(3) = 0
munmap(0xb7484000, 99604) = 0
getpid() = 15639
fstat64(-1, 0xbff2f278) = -1 EBADF (Bad file descriptor)
time(NULL) = 1355255435
socket(PF_FILE, SOCK_STREAM, 0) = 3
fcntl64(3, F_GETFL) = 0x2 (flags O_RDWR)
fcntl64(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
fcntl64(3, F_GETFD) = 0
fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
connect(3, {sa_family=AF_FILE, path="/var/lib/sss/pipes/nss"}, 110) = -1 ECONNREFUSED (Connection refused)
close(3) = 0
exit_group(2)
Connection refused? By whom? Why?
[root@irwin tmp] cd /var/log/sssd/
[root@irwin sssd] grep 'connection refused' *
[root@irwin sssd] grep 'connection' *
[root@irwin sssd] grep 'refused' *
[root@irwin sssd] cd ..
[root@irwin log] grep 'refused' *
[root@irwin log] grep 'connection' *
anaconda.syslog:16:24:25,558 NOTICE NetworkManager: ifcfg-rh: read connection 'System eth0'
anaconda.syslog:16:24:25,558 NOTICE NetworkManager: ifcfg-rh: Ignoring connection 'System eth0' and its device due to NM_CONTROLLED/BRIDGE/VLAN.
anaconda.syslog:16:24:26,555 WARNING NetworkManager: <warn> error requesting auth for org.freedesktop.NetworkManager.use-user-connections: (26) Remote Exception invoking org.freedesktop.PolicyKit1.Authority.CheckAuthorization() on /org/freedesktop/PolicyKit1/Authority at name org.freedesktop.PolicyKit1: org.freedesktop.DBus.Error.Spawn.ExecFailed: Cannot launch daemon, file not found or permissions invalid
anaconda.syslog:16:25:05,936 NOTICE NetworkManager: ifcfg-rh: Managing connection 'System eth0' and its device because NM_CONTROLLED was true.
[root@irwin log]
So, not logged by the client, it seems. What’s with this pipe?
[root@irwin log] lsof | grep pipes
sssd 1851 root 13u unix 0xdf043500 0t0 10659 /var/lib/sss/pipes/private/sbus-monitor
sssd 1851 root 15u unix 0xf56a0ac0 0t0 10779 /var/lib/sss/pipes/private/sbus-monitor
sssd 1851 root 17u unix 0xdf040580 0t0 10842 /var/lib/sss/pipes/private/sbus-monitor
sssd_be 1862 root 22u unix 0xf56a04c0 0t0 10780 /var/lib/sss/pipes/private/sbus-dp_LDAP.1862
sssd_be 1862 root 30u unix 0xf5a5f480 0t0 10844 /var/lib/sss/pipes/private/sbus-dp_LDAP.1862
sssd_pam 1893 root 23u unix 0xdf040780 0t0 10845 /var/lib/sss/pipes/pam
sssd_pam 1893 root 24u unix 0xdf040980 0t0 10847 /var/lib/sss/pipes/private/pam
[root@irwin log] service sssd status
sssd (pid 1851) is running...
[root@irwin log]
Hmm. I expected nss to be there. What happens if we just restart sssd?
[root@irwin log] service sssd restart
Stopping sssd: [ OK ]
Starting sssd: [ OK ]
[root@irwin log] getent passwd anderson
anderson:*:716:501:Scott D. Anderson,E114:/home/anderson:/bin/bash
[root@irwin log]
Well, that’s good, I guess. At least we can do this remotely. We could also set up a cron job that just restarts sssd every hour or so. Can anyone think of anything better? Can we tell when sssd gets in trouble?
[root@irwin log] grep sssd /var/log/messages
Dec 10 22:42:37 irwin abrt[9800]: Saved core dump of pid 1892 (/usr/libexec/sssd/sssd_nss) to /var/spool/abrt/ccpp-2012-12-10-22:42:37-1892 (819200 bytes)
Dec 10 22:42:38 irwin sssd[nss]: Starting up
Dec 10 22:43:09 irwin abrt[9979]: Saved core dump of pid 9801 (/usr/libexec/sssd/sssd_nss) to /var/spool/abrt/ccpp-2012-12-10-22:43:09-9801 (712704 bytes)
Dec 10 22:43:11 irwin sssd[nss]: Starting up
Dec 10 22:43:34 irwin abrt[10156]: Saved core dump of pid 9980 (/usr/libexec/sssd/sssd_nss) to /var/spool/abrt/ccpp-2012-12-10-22:43:34-9980 (712704 bytes)
Dec 10 22:43:36 irwin sssd[nss]: Starting up
Dec 10 22:43:58 irwin abrt[10333]: Saved core dump of pid 10157 (/usr/libexec/sssd/sssd_nss) to /var/spool/abrt/ccpp-2012-12-10-22:43:58-10157 (712704 bytes)
Dec 10 22:44:00 irwin sssd[nss]: Starting up
Dec 10 22:44:23 irwin abrt[10512]: Saved core dump of pid 10334 (/usr/libexec/sssd/sssd_nss) to /var/spool/abrt/ccpp-2012-12-10-22:44:23-10334 (712704 bytes)
Dec 11 14:56:06 irwin sssd[pam]: Shutting down
Dec 11 14:56:06 irwin sssd[be[LDAP]]: Shutting down
Dec 11 14:56:06 irwin sssd: Starting up
Dec 11 14:56:06 irwin sssd[be[LDAP]]: Starting up
Dec 11 14:56:06 irwin sssd[nss]: Starting up
Dec 11 14:56:06 irwin sssd[pam]: Starting up
[root@irwin log]
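To answer "can we tell when sssd gets in trouble?" without reading the log by eye, the abrt lines can be tallied per executable. This is only a rough sketch: the function name count_core_dumps is made up for illustration, and the pattern assumes the "Saved core dump of pid N (PATH)" wording seen in the log lines above.

```shell
#!/bin/bash
# Tally abrt core-dump records per executable from syslog text on stdin.
# count_core_dumps is a hypothetical helper name, not part of abrt or sssd.
# The pattern assumes log lines of the form:
#   ... abrt[PID]: Saved core dump of pid N (/path/to/executable) to ...
count_core_dumps() {
    # Keep only the executable path from each matching line, then count.
    sed -n 's/.*abrt\[[0-9]*]: Saved core dump of pid [0-9]* (\([^)]*\)).*/\1/p' |
        sort | uniq -c | sort -rn
}
```

It would be run as `count_core_dumps < /var/log/messages`. For the lines shown above it should report a count of 5 for /usr/libexec/sssd/sssd_nss.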
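Oh, that’s interesting: on Dec 10, sssd_nss dumped core five times in under two minutes. It was started again after each of the first four crashes, but not after the fifth, and it stayed down until today’s restart. In the abrt spool directory we see: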
[root@irwin log] cd /var/spool/abrt
[root@irwin abrt] ls
abrt-db ccpp-2012-10-02-21:46:07-1899 ccpp-2012-12-10-22:42:37-1892
ccpp-2012-10-02-21:45:13-3727 ccpp-2012-12-10-22:42:28-17063 last-ccpp
[root@irwin abrt] ls -lt
total 20
drwxr-x--- 2 abrt root 4096 Dec 10 22:45 ccpp-2012-12-10-22:42:37-1892
drwxr-xr-x 3 root root 4096 Dec 10 22:45 ccpp-2012-10-02-21:45:13-3727
drwxr-x--- 2 abrt root 4096 Dec 10 22:45 ccpp-2012-10-02-21:46:07-1899
drwxr-x--- 2 abrt gdm 4096 Dec 10 22:45 ccpp-2012-12-10-22:42:28-17063
-rw------- 1 root root 26 Dec 10 22:44 last-ccpp
-rw-r--r--. 1 root root 0 Aug 29 12:53 abrt-db
[root@irwin abrt] cd ccpp-2012-12-10-22\:42\:37-1892/
[root@irwin ccpp-2012-12-10-22:42:37-1892] ls
abrt_version component environ limits package sosreport.tar.xz uuid
analyzer coredump executable maps pid time var_log_messages
architecture count hostname open_fds pwd uid
cmdline dso_list kernel os_release reason username
[root@irwin ccpp-2012-12-10-22:42:37-1892] more cmdline
/usr/libexec/sssd/sssd_nss --debug-to-files
[root@irwin ccpp-2012-12-10-22:42:37-1892] more reason
Process /usr/libexec/sssd/sssd_nss was killed by signal 6 (SIGABRT)
[root@irwin ccpp-2012-12-10-22:42:37-1892] more var_log_messages
Dec 10 22:42:37 irwin abrt[9800]: Saved core dump of pid 1892 (/usr/libexec/sssd/sssd_nss) to /var/spool/abrt/ccpp-2012-12-10-22:42:37-1892 (819200 bytes)
Dec 10 22:43:09 irwin abrt[9979]: Saved core dump of pid 9801 (/usr/libexec/sssd/sssd_nss) to /var/spool/abrt/ccpp-2012-12-10-22:43:09-9801 (712704 bytes)
Okay, so nothing diagnostic in here, but interesting nevertheless. Googling for “sssd_nss sigabrt” yields some results like:
https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id... Nov 13, 2011 – Summary: sssd_nss crashes when passed invalid UTF-8 for the .../usr/libexec/sssd/sssd_nss was killed by signal 6 (SIGABRT) time: Sun Nov ...
[root@irwin ccpp-2012-12-10-22:42:37-1892] rpm -q sssd
sssd-1.8.0-32.el6.i686
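As for something better than restarting sssd blindly every hour, one possible refinement is to have cron run a check that looks up a known LDAP account and restarts sssd only when that lookup fails. The following is a minimal, untested sketch under stated assumptions: the function names, the `--run` flag, and the CANARY_USER variable are all invented for illustration. The chosen account must exist only in LDAP, not in /etc/passwd; otherwise the lookup would succeed even with sssd_nss down.

```shell
#!/bin/bash
# Sketch of an sssd watchdog (all names here are illustrative assumptions).
# It restarts sssd only when an NSS lookup of a known LDAP account fails.

# Account that exists only in LDAP; override via the environment.
CANARY_USER="${CANARY_USER:-anderson}"

# Succeeds if NSS can resolve the given account.
lookup_ok() {
    getent passwd "$1" > /dev/null 2>&1
}

# Print "ok" if the lookup works; otherwise log the failure,
# print "restarting", and restart sssd.
check_and_restart() {
    if lookup_ok "$CANARY_USER"; then
        echo "ok"
    else
        logger -t sssd-watchdog "lookup of $CANARY_USER failed, restarting sssd"
        echo "restarting"
        service sssd restart
    fi
}

# Act only when called with --run, so the file can be sourced safely.
if [ "${1:-}" = "--run" ]; then
    check_and_restart
fi
```

Saved under some path such as /usr/local/sbin/sssd-watchdog (hypothetical), it could be run from a file in /etc/cron.d every five minutes with a line like `*/5 * * * * root /usr/local/sbin/sssd-watchdog --run`. This only papers over the crash rather than explaining it, and an LDAP outage that makes the lookup fail could also trigger a needless restart.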