intermittent crashes


Irwin recently experienced one of those random crashes and is currently in a bad state.  One symptom is that looking up a known user returns nothing:

[root@tempest ~] ssh irwin
Last login: Tue Dec 11 14:31:20 2012 from puma.wellesley.edu
[root@irwin ~] getent passwd anderson

So, somehow one of the essential authentication services has failed.  What could it be?  Thrush is working fine, so I ran the following on both irwin and thrush:

[root@irwin ~] service --status-all > /usr/network/tmp/services-irwin 
JAVA_EXECUTABLE or HSQLDB_JAR_PATH in '/etc/sysconfig/hsqldb' is set to a non-file.
No sensors found!
Make sure you loaded all the kernel drivers you need.
Try sensors-detect to find out which these are.

Now, compare:

[root@irwin tmp] diff services-irwin services-thrush | grep -v pid
1,7c1,7
< auditd is stopped
---
9,11c9
< Running
< cgred is stopped
---
13,17c11,15
< 2010
---
> dnsmasq is stopped
> 2014
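Filtering the diff with grep -v pid also throws away the "(pid N) is running" half of any hunk, which makes the result harder to read. One alternative, sketched here as an assumption rather than what was actually run, is to replace the pid numbers with a placeholder in both dumps before diffing. The helper name normalize_status and the /tmp file names are hypothetical.

```shell
#!/bin/sh
# Sketch: replace concrete pid numbers with a placeholder so that two
# "service --status-all" dumps from different hosts can be diffed directly.
# normalize_status is a hypothetical helper name, not an existing tool.
normalize_status() {
    # "sssd (pid  1851) is running..." -> "sssd (pid N) is running..."
    sed 's/(pid  *[0-9][0-9 ]*)/(pid N)/'
}

# Possible usage, with the file names from the transcript above:
#   normalize_status < services-irwin  > /tmp/irwin.norm
#   normalize_status < services-thrush > /tmp/thrush.norm
#   diff /tmp/irwin.norm /tmp/thrush.norm
```

With the pids normalized, a service that is running on both hosts produces identical lines, so only genuine running/stopped differences remain in the diff.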

Hmm. That’s interesting. I would have expected something involving sssd or ldap.  Let’s look specifically at the failing command:

[root@irwin tmp] strace getent passwd anderson
close(3)                                = 0
munmap(0xb7484000, 99604)               = 0
getpid()                                = 15639
fstat64(-1, 0xbff2f278)                 = -1 EBADF (Bad file descriptor)
time(NULL)                              = 1355255435
socket(PF_FILE, SOCK_STREAM, 0)         = 3
fcntl64(3, F_GETFL)                     = 0x2 (flags O_RDWR)
fcntl64(3, F_SETFL, O_RDWR|O_NONBLOCK)  = 0
fcntl64(3, F_GETFD)                     = 0
fcntl64(3, F_SETFD, FD_CLOEXEC)         = 0
connect(3, {sa_family=AF_FILE, path="/var/lib/sss/pipes/nss"}, 110) = -1 ECONNREFUSED (Connection refused)
close(3)                                = 0
exit_group(2)

Connection refused?  By whom?  Why?

[root@irwin tmp] cd /var/log/sssd/
[root@irwin sssd] grep 'connection refused' *
[root@irwin sssd] grep 'connection' *
[root@irwin sssd] grep 'refused' *
[root@irwin sssd] cd ..
[root@irwin log] grep 'refused' *
[root@irwin log] grep 'connection' *
anaconda.syslog:16:24:25,558 NOTICE NetworkManager:    ifcfg-rh:     read connection 'System eth0'
anaconda.syslog:16:24:25,558 NOTICE NetworkManager:    ifcfg-rh: Ignoring connection 'System eth0' and its device due to NM_CONTROLLED/BRIDGE/VLAN.
anaconda.syslog:16:24:26,555 WARNING NetworkManager: <warn> error requesting auth for org.freedesktop.NetworkManager.use-user-connections: (26) Remote Exception invoking org.freedesktop.PolicyKit1.Authority.CheckAuthorization() on /org/freedesktop/PolicyKit1/Authority at name org.freedesktop.PolicyKit1: org.freedesktop.DBus.Error.Spawn.ExecFailed: Cannot launch daemon, file not found or permissions invalid
anaconda.syslog:16:25:05,936 NOTICE NetworkManager:    ifcfg-rh: Managing connection 'System eth0' and its device because NM_CONTROLLED was true.
[root@irwin log]

So the refusal is not logged anywhere obvious, it seems.  What’s with this pipe?

[root@irwin log] lsof | grep pipes
sssd       1851      root   13u     unix 0xdf043500      0t0      10659 /var/lib/sss/pipes/private/sbus-monitor
sssd       1851      root   15u     unix 0xf56a0ac0      0t0      10779 /var/lib/sss/pipes/private/sbus-monitor
sssd       1851      root   17u     unix 0xdf040580      0t0      10842 /var/lib/sss/pipes/private/sbus-monitor
sssd_be    1862      root   22u     unix 0xf56a04c0      0t0      10780 /var/lib/sss/pipes/private/sbus-dp_LDAP.1862
sssd_be    1862      root   30u     unix 0xf5a5f480      0t0      10844 /var/lib/sss/pipes/private/sbus-dp_LDAP.1862
sssd_pam   1893      root   23u     unix 0xdf040780      0t0      10845 /var/lib/sss/pipes/pam
sssd_pam   1893      root   24u     unix 0xdf040980      0t0      10847 /var/lib/sss/pipes/private/pam
[root@irwin log] service sssd status
sssd (pid  1851) is running...
[root@irwin log]
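The lsof output lists a holder for /var/lib/sss/pipes/pam but none for /var/lib/sss/pipes/nss, the path that strace showed being refused. A minimal sketch of that check as a reusable filter follows; the function name has_listener is hypothetical, and it only matches on the last column of lsof-style output.

```shell
#!/bin/sh
# Sketch: succeed if any line of lsof output on stdin names the given
# socket path in its last column. has_listener is a hypothetical name.
has_listener() {
    # $1: socket path to look for, compared exactly against the last field
    awk -v path="$1" '$NF == path { found = 1 } END { exit !found }'
}

# Possible usage on the affected host (lsof -U lists UNIX domain sockets):
#   lsof -U | has_listener /var/lib/sss/pipes/nss || echo "nobody is listening on nss"
```

The exact match on the last field keeps /var/lib/sss/pipes/private/pam from being mistaken for /var/lib/sss/pipes/pam.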

Hmm.  I expected an nss socket to be there alongside pam.  What happens if we just restart sssd?

[root@irwin log] service sssd restart
Stopping sssd:                                             [  OK  ]
Starting sssd:                                             [  OK  ]
[root@irwin log] getent passwd anderson
anderson:*:716:501:Scott D. Anderson,E114:/home/anderson:/bin/bash
[root@irwin log]

Well, that’s good, I guess.  At least we can do this remotely.  We could also set up a cron job that just restarts sssd every hour or so.  Can anyone think of anything better?  Can we tell when sssd gets in trouble?

[root@irwin log] grep sssd /var/log/messages
Dec 10 22:42:37 irwin abrt[9800]: Saved core dump of pid 1892 (/usr/libexec/sssd/sssd_nss) to /var/spool/abrt/ccpp-2012-12-10-22:42:37-1892 (819200 bytes)
Dec 10 22:42:38 irwin sssd[nss]: Starting up
Dec 10 22:43:09 irwin abrt[9979]: Saved core dump of pid 9801 (/usr/libexec/sssd/sssd_nss) to /var/spool/abrt/ccpp-2012-12-10-22:43:09-9801 (712704 bytes)
Dec 10 22:43:11 irwin sssd[nss]: Starting up
Dec 10 22:43:34 irwin abrt[10156]: Saved core dump of pid 9980 (/usr/libexec/sssd/sssd_nss) to /var/spool/abrt/ccpp-2012-12-10-22:43:34-9980 (712704 bytes)
Dec 10 22:43:36 irwin sssd[nss]: Starting up
Dec 10 22:43:58 irwin abrt[10333]: Saved core dump of pid 10157 (/usr/libexec/sssd/sssd_nss) to /var/spool/abrt/ccpp-2012-12-10-22:43:58-10157 (712704 bytes)
Dec 10 22:44:00 irwin sssd[nss]: Starting up
Dec 10 22:44:23 irwin abrt[10512]: Saved core dump of pid 10334 (/usr/libexec/sssd/sssd_nss) to /var/spool/abrt/ccpp-2012-12-10-22:44:23-10334 (712704 bytes)
Dec 11 14:56:06 irwin sssd[pam]: Shutting down
Dec 11 14:56:06 irwin sssd[be[LDAP]]: Shutting down
Dec 11 14:56:06 irwin sssd: Starting up
Dec 11 14:56:06 irwin sssd[be[LDAP]]: Starting up
Dec 11 14:56:06 irwin sssd[nss]: Starting up
Dec 11 14:56:06 irwin sssd[pam]: Starting up
[root@irwin log]
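The log shows sssd_nss dumping core five times between 22:42:37 and 22:44:23 on Dec 10. Each of the first four dumps is followed by a "Starting up" line, but nothing follows the fifth until the manual restart on Dec 11. To see how often this happens over time, the dumps could be counted per day with something like the following sketch. The function name count_nss_crashes is hypothetical, and it assumes the syslog line format shown above.

```shell
#!/bin/sh
# Sketch: count abrt core dumps of sssd_nss per day in a syslog-style file.
# count_nss_crashes is a hypothetical helper name.
count_nss_crashes() {
    # $1: log file, e.g. /var/log/messages
    # Fields 1 and 2 of a syslog line are the month and the day.
    awk '/Saved core dump/ && /sssd_nss/ { n[$1 " " $2]++ }
         END { for (d in n) print d, n[d] }' "$1" | sort
}

# Possible usage:
#   count_nss_crashes /var/log/messages
```

Each output line has the form "month day count", so a day with no crashes simply does not appear.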

Oh, that’s interesting.  And we see:

[root@irwin log] cd /var/spool/abrt
[root@irwin abrt] ls
abrt-db                        ccpp-2012-10-02-21:46:07-1899   ccpp-2012-12-10-22:42:37-1892
ccpp-2012-10-02-21:45:13-3727  ccpp-2012-12-10-22:42:28-17063  last-ccpp
[root@irwin abrt] ls -lt
total 20
drwxr-x---  2 abrt root 4096 Dec 10 22:45 ccpp-2012-12-10-22:42:37-1892
drwxr-xr-x  3 root root 4096 Dec 10 22:45 ccpp-2012-10-02-21:45:13-3727
drwxr-x---  2 abrt root 4096 Dec 10 22:45 ccpp-2012-10-02-21:46:07-1899
drwxr-x---  2 abrt gdm  4096 Dec 10 22:45 ccpp-2012-12-10-22:42:28-17063
-rw-------  1 root root   26 Dec 10 22:44 last-ccpp
-rw-r--r--. 1 root root    0 Aug 29 12:53 abrt-db
[root@irwin abrt] cd ccpp-2012-12-10-22\:42\:37-1892/
[root@irwin ccpp-2012-12-10-22:42:37-1892] ls
abrt_version  component  environ     limits      package  sosreport.tar.xz  uuid
analyzer      coredump   executable  maps        pid      time              var_log_messages
architecture  count      hostname    open_fds    pwd      uid
cmdline       dso_list   kernel      os_release  reason   username
[root@irwin ccpp-2012-12-10-22:42:37-1892] more cmdline 
/usr/libexec/sssd/sssd_nss --debug-to-files
[root@irwin ccpp-2012-12-10-22:42:37-1892] more reason 
Process /usr/libexec/sssd/sssd_nss was killed by signal 6 (SIGABRT)
[root@irwin ccpp-2012-12-10-22:42:37-1892] more var_log_messages 
Dec 10 22:42:37 irwin abrt[9800]: Saved core dump of pid 1892 (/usr/libexec/sssd/sssd_nss) to /var/spool
/abrt/ccpp-2012-12-10-22:42:37-1892 (819200 bytes)
Dec 10 22:43:09 irwin abrt[9979]: Saved core dump of pid 9801 (/usr/libexec/sssd/sssd_nss) to /var/spool
/abrt/ccpp-2012-12-10-22:43:09-9801 (712704 bytes)

Okay, so nothing diagnostic in here, but interesting nevertheless.  Googling for “sssd_nss sigabrt” yields some results like:

https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id...
Nov 13, 2011 – Summary: sssd_nss crashes when passed invalid UTF-8 for the ... /usr/libexec/sssd/sssd_nss was killed by signal 6 (SIGABRT) time: Sun Nov ...

See https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=753639, but that seems to have been fixed before our version of sssd:

[root@irwin ccpp-2012-12-10-22:42:37-1892] rpm -q sssd
sssd-1.8.0-32.el6.i686

Still, there is an sssd-1.9 out there, so maybe we should try to upgrade?
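Until an upgrade happens, the cron idea floated earlier could be made a little smarter than a blind hourly restart: probe with the same getent lookup that failed, and restart sssd only when the probe fails. The following is a minimal sketch under assumptions, not a tested fix. The function name sssd_watchdog, the PROBE_USER variable, and the logger tag are hypothetical, and the probe user must exist only in LDAP (not in /etc/passwd) for the check to mean anything.

```shell
#!/bin/sh
# Sketch of a watchdog for cron. Assumptions: PROBE_USER exists only in
# LDAP (not in /etc/passwd), and restarting sssd is an acceptable remedy.
PROBE_USER="${PROBE_USER:-anderson}"

sssd_watchdog() {
    # getent exits non-zero when the lookup fails, as it did on irwin.
    if getent passwd "$PROBE_USER" >/dev/null 2>&1; then
        return 0
    fi
    # Leave a trace in syslog so the failures can be counted later.
    logger -t sssd-watchdog "lookup of $PROBE_USER failed; restarting sssd"
    service sssd restart
    return 1
}

# To run from cron, end the script with a call to the function:
#   sssd_watchdog
```

Run every few minutes, this restarts sssd only when lookups are actually broken, and the syslog entries show how often that is.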

 

 

 

 

About CS SysAdmins

The CS Department System Administrators