iapi socket hangs until it crashes with too many files open
Due to either connectivity issues on a client side, on the road to the iApi or on the iApi itself, it can happen that the iapi-socket hangs and then crashes.
It manifests the following way:
Jun 08 04:38:19 immer13-21.glei.ch exim[29232]: 2020-06-08 04:38:19 login_server authenticator failed for (aaaa) [xxxx]: 435 Unable to authenticate at present (set_id=foo): socket read timed out
=> The iApi socket starts to hang.
When discovering the situation I started to strace the iapi socket process:
strace -p 12406
strace: Process 12406 attached
ppoll([{fd=7, events=POLLIN}], 1, NULL, NULL, 8) = 1 ([{fd=7, revents=POLLIN}])
accept4(7, {sa_family=AF_UNIX}, [110->2], SOCK_CLOEXEC) = 1009
fstat(1009, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
fstat(1009, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
fstat(1009, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
ppoll([{fd=7, events=POLLIN}], 1, NULL, NULL, 8) = 1 ([{fd=7, revents=POLLIN}])
accept4(7, {sa_family=AF_UNIX}, [110->2], SOCK_CLOEXEC) = 1010
fstat(1010, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
fstat(1010, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
fstat(1010, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
ppoll([{fd=7, events=POLLIN}], 1, NULL, NULL, 8) = 1 ([{fd=7, revents=POLLIN}])
accept4(7, {sa_family=AF_UNIX}, [110->2], SOCK_CLOEXEC) = 1011
fstat(1011, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
fstat(1011, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
fstat(1011, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
ppoll([{fd=7, events=POLLIN}], 1, NULL, NULL, 8) = 1 ([{fd=7, revents=POLLIN}])
accept4(7, {sa_family=AF_UNIX}, [110->2], SOCK_CLOEXEC) = 1012
fstat(1012, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
fstat(1012, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
fstat(1012, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
ppoll([{fd=7, events=POLLIN}], 1, NULL, NULL, 8) = 1 ([{fd=7, revents=POLLIN}])
And it just kept going in that loop.
I then tried to write and read from the socket myself:
-bash-4.2$ echo 'mail_domains' | nc -U /run/iapi/exim
And it just hang there.
I then looked at the process itself and it had waaay too many open FDs
ls -l /proc/12406/fd/
[...]
lrwx------. 1 root root 64 Jun 8 11:27 998 -> socket:[85708105]
lrwx------. 1 root root 64 Jun 8 11:27 999 -> socket:[85709353]
All of these sockets somehow seemed to be established.
# ss -na | grep 85709353
u_str ESTAB 67 0 /run/iapi/exim 85709353 * 0
I digged further, but suddenly my open nc "returned":
-bash-4.2$ echo 'mail_domains' | nc -U /run/iapi/exim
Ncat: Connection reset by peer.
And also the strace shutdown:
ppoll([{fd=7, events=POLLIN}], 1, NULL, NULL, 8) = 1 ([{fd=7, revents=POLLIN}])
accept4(7, {sa_family=AF_UNIX}, [110->2], SOCK_CLOEXEC) = 1023
fstat(1023, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
fstat(1023, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
fstat(1023, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
ppoll([{fd=7, events=POLLIN}], 1, NULL, NULL, 8) = 1 ([{fd=7, revents=POLLIN}])
accept4(7, 0x7ffca37364d0, [110], SOCK_CLOEXEC) = -1 EMFILE (Too many open files)
close(15) = 0
close(16) = 0
close(17) = 0
ppoll([{fd=7, events=POLLIN}], 1, NULL, NULL, 8) = 1 ([{fd=7, revents=POLLIN}])
accept4(7, {sa_family=AF_UNIX}, [110->2], SOCK_CLOEXEC) = 15
fstat(15, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
fstat(15, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
fstat(15, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
ppoll([{fd=7, events=POLLIN}], 1, NULL, NULL, 8) = 1 ([{fd=7, revents=POLLIN}])
accept4(7, {sa_family=AF_UNIX}, [110->2], SOCK_CLOEXEC) = 16
fstat(16, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
fstat(16, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
fstat(16, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
ppoll([{fd=7, events=POLLIN}], 1, NULL, NULL, 8) = 1 ([{fd=7, revents=POLLIN}])
accept4(7, {sa_family=AF_UNIX}, [110->2], SOCK_CLOEXEC) = 17
fstat(17, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
fstat(17, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
fstat(17, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
ppoll([{fd=7, events=POLLIN}], 1, NULL, NULL, 8) = 1 ([{fd=7, revents=POLLIN}])
accept4(7, 0x7ffca37364d0, [110], SOCK_CLOEXEC) = -1 EMFILE (Too many open files)
ppoll([{fd=7, events=POLLIN}], 1, NULL, NULL, 8) = 1 ([{fd=7, revents=POLLIN}])
accept4(7, 0x7ffca37364d0, [110], SOCK_CLOEXEC) = -1 EMFILE (Too many open files)
open("/proc/self/maps", O_RDONLY|O_CLOEXEC) = -1 EMFILE (Too many open files)
write(2, "/usr/local/lib/iapi/iapi-socket-"..., 56) = 56
write(2, ": ", 2) = 2
write(2, "Too many open files", 19) = 19
write(2, " (", 2) = 2
write(2, "Errno::EMFILE", 13) = 13
write(2, ")\n", 2) = 2
write(2, "\tfrom /usr/local/lib/iapi/iapi-s"..., 60) = 60
write(2, "\tfrom /usr/local/lib/iapi/iapi-s"..., 72) = 72
write(2, "\tfrom /usr/local/lib/iapi/iapi-s"..., 62) = 62
write(2, "\tfrom /usr/local/lib/iapi/iapi-s"..., 63) = 63
rt_sigaction(SIGINT, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7fbab3d67630}, {sa_handler=0x7fbab4085970, sa_mask=[], sa_flags=SA_RESTORER|SA_SI
GINFO, sa_restorer=0x7fbab3d67630}, 8) = 0
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7fbab3d67630}, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO,
sa_restorer=0x7fbab3d67630}, 8) = 0
close(1005) = 0
close(1006) = 0
close(1007) = 0
close(1008) = 0
close(1009) = 0
close(1010) = 0
[...]
close(73) = 0
madvise(0x7fba740ae000, 4096, MADV_DONTNEED) = 0
close(8) = 0
close(9) = 0
close(10) = 0
close(20) = 0
close(21) = 0
close(7) = 0
close(17) = 0
close(16) = 0
close(15) = 0
write(4, "!", 1) = 1
munmap(0x7fbab44e4000, 1052672) = 0
exit_group(1) = ?
+++ exited with 1 +++
And if you look at the output of the service:
Jun 8 11:31:39 immer13-21 iapi: /usr/local/lib/iapi/iapi-socket-proxy.rb:141:in `accept': Too many open files (Errno::EMFILE)
Jun 8 11:31:39 immer13-21 iapi: from /usr/local/lib/iapi/iapi-socket-proxy.rb:141:in `run'
Jun 8 11:31:39 immer13-21 iapi: from /usr/local/lib/iapi/iapi-socket-proxy.rb:557:in `block in <main>'
Jun 8 11:31:39 immer13-21 iapi: from /usr/local/lib/iapi/iapi-socket-proxy.rb:556:in `catch'
Jun 8 11:31:39 immer13-21 iapi: from /usr/local/lib/iapi/iapi-socket-proxy.rb:556:in `<main>'
Jun 8 11:31:39 immer13-21 systemd: iapi-socket-proxy@exim.service: main process exited, code=exited, status=1/FAILURE
=> While I investigated the process ran out of FDs.
Since then the service got restarted and seems to work properly again.
I think we just pile up open sockets and somehow never time out, this eventually crashes.
Imho we should detect that much earlier and gracefully handle it.
The issue is a bit hard to reproduce, though it might be an option to just DROP (not reject) outgoing packets to 9442 to simulate connectivity issues.