The Case of the Delinquent Node.js Service and How We Solved It

tl;dr

This is the story of how we tracked down and fixed an unstable Node.js service in our infrastructure: a WebSocket-heavy app that kept hitting a 1024 file descriptor limit because Monit restarted it with the kernel's default rlimits.

Background on the Node.js issue

A few months ago, we scheduled a downtime to migrate all our existing services from EC2 Classic to EC2 VPC for security, stability and control. We had been systematically planning this migration for two months before D-day; with over 100 services making the switch from Classic to VPC, we didn't want to leave any room for glitches. We migrated each service to VPC, tested each of them for stability and selectively routed some of our production traffic to these new instances, monitoring system stability and metrics throughout.

But after a few hours, we observed that a specific feature, one that depended on a service handling long-lived WebSocket connections hosted behind an ELB and nginx, was intermittently slow or unavailable.
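Quick context on the setup: nginx sat in front of the app and proxied the WebSocket upgrades through to it. A minimal sketch of the kind of location block involved (the upstream name, port and paths here are illustrative, not our actual config) looks like this:

upstream node_web_app {
    server 172.16.150.139:81;
}
server {
    listen 80;
    location / {
        proxy_pass http://node_web_app;
        proxy_http_version 1.1;                  # WebSocket upgrades need HTTP/1.1
        proxy_set_header Upgrade $http_upgrade;  # pass the Upgrade header through
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 3600s;                # keep long-lived connections from idling out
    }
}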


We grappled with the misbehaving service; restarting it seemed to fix things, only for more issues to crop up later.

Digging deeper

The ELB CloudWatch metrics were our starting point.

Clearly, there were spikes in the number of TCP connections each time the service failed. These plateaued each time we restarted the service, only to work up a frenzy of new TCP connections again a short while later.

It had to be the network, right? So we scoured all our network settings and group policies but couldn't build a case against them. Everything seemed standard and nothing looked out of place.

We ran the tcpaccept tool from the BCC/eBPF suite on one of our Node.js instances to see the rate at which connections were being accepted.

[root@nodejs-app5 ~]# /usr/share/bcc/tools/tcpaccept -t
…….
TIME(s) PID COMM IP RADDR LADDR LPORT
4194.519 24848 node 6 ::ffff:172.16.151.146 ::ffff:172.16.150.139 81
4194.528 24848 node 6 ::ffff:172.16.151.146 ::ffff:172.16.150.139 81
4194.529 24848 node 6 ::ffff:127.0.0.1 ::ffff:127.0.0.1 81
4194.535 24848 node 6 ::ffff:172.16.151.146 ::ffff:172.16.150.139 8
4194.539 24848 node 6 ::ffff:172.16.151.146 ::ffff:172.16.150.139 81
4194.541 24848 node 6 ::ffff:172.16.151.146 ::ffff:172.16.150.139 81
4194.548 24848 node 6 ::ffff:172.16.151.146 ::ffff:172.16.150.139 81
4194.553 24848 node 6 ::ffff:172.16.151.146 ::ffff:172.16.150.139 81
4194.625 24848 node 6 ::ffff:172.16.151.146 ::ffff:172.16.150.139 81
4194.625 24848 node 6 ::ffff:172.16.151.146 ::ffff:172.16.150.139 81
4196.074 9706 node 6 ::ffff:172.16.151.146 ::ffff:172.16.150.139 81
4196.075 9706 node 6 ::ffff:172.16.151.146 ::ffff:172.16.150.139 81
4196.499 9706 node 6 ::ffff:172.16.151.146 ::ffff:172.16.150.139 81
4196.511 9706 node 6 ::ffff:172.16.151.146 ::ffff:172.16.150.139 81
4196.514 9706 node 6 ::ffff:172.16.151.146 ::ffff:172.16.150.139 81
4196.515 9706 node 6 ::ffff:172.16.151.146 ::ffff:172.16.150.139 81
4196.516 9706 node 6 ::ffff:172.16.151.146 ::ffff:172.16.150.139 81
4196.785 9706 node 6 ::ffff:172.16.151.146 ::ffff:172.16.150.139 81
4197.837 1763 sshd 4 172.16.3.17 172.16.150.139 22
4198.073 9706 node 6 ::ffff:172.16.151.146 ::ffff:172.16.150.139 81
…….

The connection acceptance rates looked normal (remember, we were only redirecting a small subset of our traffic) and didn't show anything unusual. Sigh.

Still unsure where to look next, we started with the obvious, a packet capture, since the connection count kept climbing. We waited for the issue to resurface (after a restart), connected to one of the machines hosting the nginx service and ran tshark to capture packets.

[root@nginx1 ~]# tshark -f 'host 172.16.150.139 and (tcp[tcpflags] & (tcp-rst|tcp-fin) != 0)'
……….
36.348790 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10954 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899165 TSecr=906255810
36.369605 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10955 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899170 TSecr=906255815
36.372122 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10956 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899171 TSecr=906255816
36.399220 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10957 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899178 TSecr=906255822
36.408761 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10958 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899180 TSecr=906255825
36.416402 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10959 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899182 TSecr=906255827
36.419439 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10960 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899183 TSecr=906255827
36.457502 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10961 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899192 TSecr=906255837
36.458672 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10962 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899192 TSecr=906255837
36.459733 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10963 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899193 TSecr=906255837
36.463407 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10964 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899194 TSecr=906255838
36.485216 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10965 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899199 TSecr=906255844
36.487000 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10966 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899199 TSecr=906255844
36.491632 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10967 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899201 TSecr=906255845
36.499982 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10941 [FIN, ACK] Seq=1 Ack=1 Win=16 Len=0 TSval=3899203 TSecr=906255847
36.500013 172.16.151.146 -> 172.16.150.139 TCP 66 10941 > 81 [FIN, ACK] Seq=1 Ack=2 Win=16 Len=0 TSval=906255848 TSecr=3899203
36.526724 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10969 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899209 TSecr=906255854
36.555598 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10970 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899217 TSecr=906255860
36.559707 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10971 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899218 TSecr=906255862
……….

Wading through the captured packets, we figured out that the TCP connections were being reset by the remote app machine and new connections were being established. Repeatedly.

So it looked like there was something wrong with our machine setup and/or our app (as it always turns out to be).

To see why these supposedly long-lived TCP (WebSocket) connections were being closed, we ran strace on the Node process.

[root@nodejs-app5 ~]# strace -ff -p `pidof node`
…….
clock_gettime(CLOCK_MONOTONIC, {8666, 663543136}) = 0
clock_gettime(CLOCK_MONOTONIC, {8666, 663607713}) = 0
epoll_wait(5, 7ffd00abd8d0, 1024, 11) = -1 EINTR (Interrupted system call)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=11784, si_uid=0} ---
tgkill(2100, 2100, SIGTERM) = 0
rt_sigreturn() = -1 EINTR (Interrupted system call)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_TKILL, si_pid=2100, si_uid=0} ---
+++ killed by SIGTERM +++
…….

Surprisingly, the Node.js app was actually being killed and then restarted automatically, since we had Monit watching over it. Now we knew why the TCP connections were being reset.

A closer look at the tcpaccept output shows that the PID of the process accepting connections kept changing, something we'd definitely overlooked earlier.

The hunch: An ominous SIGTERM

At this point, we had a lot of questions!

What was killing the Node.js process?

We looked at dmesg, and our memory usage seemed normal (besides, we have Nagios with check_oomkiller to notify us should the OOM killer go on a rampage), so it couldn't be the OOM killer.
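If you want to double-check the OOM killer by hand, its kills leave a clear signature in the kernel log; a grep along these lines (the pattern is just one common way to phrase it) surfaces them:

[root@nodejs-app5 ~]# dmesg | grep -iE 'out of memory|killed process'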

Was the app doing something unexpected?

To find out, we attached GDB to the delinquent process, set catchpoints on the exit syscalls (and a breakpoint on exit) and waited.

[root@nodejs-app5 ~]# gdb -ex 'catch syscall exit exit_group' -ex 'break exit' -ex 'continue' -p `pidof node`
…….
Loaded symbols for /lib64/libresolv.so.2
0x00007fbc8c8f6419 in syscall () from /lib64/libc.so.6
Catchpoint 1 (syscalls 'exit' [60] 'exit_group' [231])
Breakpoint 2 at 0x7fbc8c83dea0
Continuing.
Program received signal SIGTERM, Terminated.
0x00007fbc8c8f6419 in syscall () from /lib64/libc.so.6
…….

Our hunch was confirmed: the app itself didn't crash or call exit; it was killed by a SIGTERM.

To see where the signal was coming from, we started stracing the Monit process, since we knew Monit was watching over the Node.js process and could kill or restart it should anything go wrong with it.

[root@nodejs-app5 ~]# strace -ff -p `pidof monit`
…….
connect(4, {sa_family=AF_INET, sin_port=htons(81), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=4, events=POLLIN|POLLOUT}], 1, 10000) = 1 ([{fd=4, revents=POLLIN|POLLOUT}])
getsockopt(4, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
poll([{fd=4, events=POLLIN}], 1, 0) = 1 ([{fd=4, revents=POLLIN}])
read(4, "", 1024) = 0
write(4, "GET / HTTP/1.1\r\nHost: 127.0.0.1:"..., 95) = 95
read(4, "", 1024) = 0
write(2, "HTTP: error receiving data -- Su"..., 38) = 38
sendto(3, "<11>Nov 15 10:08:01 monit[8186]:"..., 71, MSG_NOSIGNAL, NULL, 0) = 71
shutdown(4, SHUT_RDWR) = -1 ENOTCONN (Transport endpoint is not connected)
close(4) = 0
…….

From the trace, we understood that Monit tries to connect to the Node.js process but fails; it then sends a SIGTERM and restarts the app. Here is our Monit check for the service:

check host node_web_app_0 with address 127.0.0.1
GROUP node_web_app
start program = "/bin/sh -c 'cmd_to_start_the_service'"
stop program = "cmd_to_stop_the_service"
if failed port 81 protocol HTTP
request /
with timeout 10 seconds
then restart

So Monit, upon failing its health check of the Node.js app (an HTTP GET of /), was killing and restarting it.
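You can reproduce Monit's check by hand with curl, using the same 10-second budget as the check (assuming curl is installed on the box); once the app stops accepting connections, this times out or errors out the same way:

[root@nodejs-app5 ~]# curl -v -m 10 http://127.0.0.1:81/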

Here's a closer look at the Node process, this time stracing only the network-related syscalls:

[root@nodejs-app5 ~]# strace -ff -e trace=network -p `pidof node`
…….
[pid 15505] accept4(14, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EMFILE (Too many open files)
[pid 15505] accept4(14, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = 9
[pid 15505] accept4(14, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
[pid 15505] accept4(14, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EMFILE (Too many open files)
[pid 15505] accept4(14, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = 9
[pid 15505] accept4(14, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
…….

With EMFILE errors popping up so often, we wondered whether the app kept hitting its maximum file descriptor limit, leaving Monit unable to reach it and prompting the restarts.
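EMFILE is exactly what a process gets when it runs out of file descriptors. A throwaway one-liner (ours, purely for illustration) makes that easy to see: lower the limit, then open /dev/null in a loop from Node until it fails.

[root@nodejs-app5 ~]# sh -c 'ulimit -n 1024; node -e "const fs = require(\"fs\"); const fds = []; try { for (;;) fds.push(fs.openSync(\"/dev/null\", \"r\")); } catch (e) { console.log(fds.length, e.code); }"'

It opens a little under 1024 descriptors (Node already holds a few) and then fails with EMFILE. Our limits.conf, though, had the ceiling set far higher: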

[root@nodejs-app5 ~]# cat /etc/security/limits.conf | grep -i nofile
# - nofile - max number of open files
root hard nofile 500000
root soft nofile 500000
* hard nofile 500000
* soft nofile 500000

So we decided to get to the bottom of things and counted the file descriptors the process had open:

[root@nodejs-app5 ~]# ls -l /proc/`pidof node`/fd | wc -l
1025

It looked like the limit kept falling back to 1024 despite our setting it to a much higher value:

[root@nodejs-app5 ~]# grep 'open files' /proc/`pidof node`/limits
Max open files 1024 4096 files

Connecting to another machine that had just been started, we verified that its Node.js process did have its limits set to the values in limits.conf.

Why, then, did the limits fall back to these low values? We knew for sure that rlimit values are inherited from the parent process.

Each time we connected to a freshly started machine, our Chef OpsWorks-based infrastructure had set the right limits for the Node.js process. It was when Monit took over and restarted the process that something went wrong.

Sure enough, the Monit process's own limits matched the low values we saw in the Node.js app. We wondered how.
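rlimit inheritance is easy to demonstrate: lower the soft limit in a throwaway shell and any command started from it reports the lower value (the 1024 below is ours, just for illustration).

[root@nodejs-app5 ~]# sh -c 'ulimit -Sn 1024; grep "open files" /proc/self/limits'

The grep inherits the lowered limit from the shell that set it, and /proc/self shows it running with a soft limit of 1024, which is exactly the situation a Node.js process restarted by Monit ends up in.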

Started from the bottom, now we’re here

Once the kernel is done initializing, it starts the first user-space program, init. At that point the kernel's hard-coded defaults for the various limits are still in effect, with the open file descriptor limit at soft=1024 and hard=4096. On cue, all the programs configured to start during system startup come up under those defaults.

So all these system services and daemons, Monit included, inherited those default rlimit values. The higher values in limits.conf, on the other hand, are applied only to login sessions; in effect, only programs started from an interactive shell would reflect them.
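That's because limits.conf isn't read at boot at all: it's applied by pam_limits during PAM session setup, typically wired in with a line like the one below (the exact file, /etc/pam.d/sshd, login or system-auth, varies by distro).

session    required     pam_limits.so

Daemons started by init never go through a PAM session, so they keep the kernel defaults and pass them on to everything they spawn, Monit and its restarted children included.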

In case you lost us somewhere, here's a quick rundown:

  • Our app started with high rlimit values (as set by OpsWorks/the Chef script).
  • A few hours in, our app crashed due to an unrelated bug (which we later narrowed down to an internal call introduced only in this new setup).
  • Monit, on detecting the crash, would rush in and restart the app, which then ran with lower rlimit values (the same as Monit's own).
  • Once the app reached 1024 open file descriptors, it started quashing incoming connections.
  • Monit, on its patrol (an HTTP GET of /), would fail to get the app's health status and restart it, resetting all connections.
  • On the browser side, since the connections (which use socket.io) were closed, clients would kick off their reconnection logic. With every client reconnecting at once, only a few would succeed before the 1024 limit was hit again and another restart wiped them all out (see the sketch after this list).
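For reference, the reconnection behaviour in that last point is socket.io's client-side default; a sketch of the relevant client options, with illustrative values and URL, looks like this.

var socket = io('https://our-app.example.com', {
  reconnection: true,            // on by default
  reconnectionAttempts: Infinity,
  reconnectionDelay: 1000,       // initial delay between attempts (ms)
  reconnectionDelayMax: 5000     // back-off cap (ms)
});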

Of the several ways to fix this, we ended up updating our Monit script to raise the file descriptor limit in the start command. 🙂

check host node_web_app_0 with address 127.0.0.1
GROUP node_web_app
start program = "/bin/sh -c 'ulimit -n 100000; cmd_to_start_the_service'"
stop program = "cmd_to_stop_the_service"
if failed port 81 protocol HTTP
request /
with timeout 10 seconds
then restart
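
After rolling this out, a quick way to confirm the fix is to check the restarted process the same way as before; with the ulimit in the start command, it should now report the raised limit instead of falling back to 1024.

[root@nodejs-app5 ~]# grep 'open files' /proc/`pidof node`/limits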




About The Authors

Suman Kumar Dey is Technical Lead — DevOps, Freshdesk and Suresh Kumar is a Technical Architect with Freshdesk.

Alina Benny is a Corporate Marketing Associate with Freshdesk. Say hello to her on Twitter and LinkedIn.


Originally published on Hacker Noon.