Products
Product Portfolio

Cribl puts your IT and Security data at the center of your data management strategy and provides a one-stop shop for analyzing, collecting, processing, and routing it all at any scale. Try the Cribl suite of products and start building your data engine today!
Learn more ›

Evolving demands placed on IT and Security teams are driving a new architecture for how observability data is captured, curated, and queried. This new architecture provides flexibility and control while managing the costs of increasing data volumes.
Read white paper ›

Cribl Stream

Cribl Stream is a vendor-agnostic observability pipeline that gives you the flexibility to collect, reduce, enrich, normalize, and route data from any source to any destination within your existing data infrastructure.
Learn more ›

Vodafone Case Study

Vodafone Dials up Business Insights with Cribl Stream
Read Case Study ›

Cribl Edge

Cribl Edge provides an intelligent, highly scalable edge-based data collection system for logs, metrics, and application data.
Learn more ›

SpyCloud Edge Story

Listen to how SpyCloud uses Cribl Edge at scale.
Watch Video ›

Cribl Search

Cribl Search turns the traditional search process on its head, allowing users to search data in place without having to collect/store first.
Learn more ›

How Cribl Search Can Save You From Drowning in a Deluge of Data
Read Blog ›

Cribl Lake

Cribl Lake is a turnkey data lake solution that takes just minutes to get up and running — no data expertise needed. Leverage open formats, unified security with rich access controls, and central access to all IT and security data.
Learn more ›

Navigating the future of IT and Security Data management white paper
Read white paper ›

Cribl.Cloud

The Cribl.Cloud platform gets you up and running fast without the hassle of running infrastructure.
Learn more ›

Cribl.Cloud Solution Brief

The fastest and easiest way to realize the value of an observability ecosystem.
Read Solution Brief ›

Cribl Copilot

Cribl Copilot gets your deployments up and running in minutes, not weeks or months.
Learn more ›

Cribl Copilot

Your Trusted AI Advisor for Deploying, Configuring & Troubleshooting.
Read blog ›

AppScope

AppScope gives operators the visibility they need into application behavior, metrics and events with no configuration and no agent required.
Learn more ›

Sandbox

Launch an AppScope Sandbox today!
Launch Now ›
Solutions
Use Cases

Explore Cribl’s Solutions by Use Cases:

Supercharge Security Insights ›

Accelerate Cloud Migration ›

Avoid Vendor Lock-in ›

Agent Consolidation ›

Slash Storage Costs ›

Free Up Space for High-Value Data ›

Route From Any Source To Any Destination ›

Immediate Access to Archived Data ›

Replay Data from Low-Cost Storage ›

Reduce Log Volume & Pay Less for Infrastructure ›
Integration

Explore Cribl’s Solutions by Integrations:

Amazon ›

CrowdStrike ›

Elastic ›

Exabeam ›

Google ›

Microsoft ›

Splunk ›

Wiz ›

View All Integrations ›

Seamless Integrations for Your Observability Data
Learn More ›
Industries

Explore Cribl’s Solutions by Industry:

AIOps ›

Financial Services ›

Healthcare ›

Managed Security Services ›

Manufacturing and Logistics ›

Media and Entertainment ›

Public Sector ›

Retail ›
Resources
Resources

Resource Library ›

Documentation ›

Guides ›

AppScope Docs ›

Blog ›

Glossary ›

Podcasts ›

Telemetry 101

Understanding the Basics of Telemetry and Its Benefits
Learn More ›
Events & Webinars

Events ›

Webinars ›

CriblCon24
Watch On-Demand ›

July 31 | 10am PT / 1pm ET

Navigating the Data Current Report: Transforming IT & Security Operations in 2024
Register ›
Learning

Try the Sandboxes ›

Self Guided Trials ›

Cribl University ›

Cribl Community ›

Cribl Curious Forum ›

What is Observability? ›

Try Your Own Cribl Sandbox

Experience a full version of Cribl Stream and Cribl Edge in the cloud.
Launch Now ›
Tools & Pricing

Download Library ›

Past Releases ›

Pricing Plans ›

Stream ROI Calculator ›

Download Library

Download Cribl’s suite of products for free to get started.
Download ›
Customers
Customer Stories

Get inspired by how our customers are innovating IT, security and observability. They inspire us daily!
Read Customer Stories ›

Sally Beauty Holdings

Sally Beauty Swaps LogStash and Syslog-ng with Cribl.Cloud for a Resilient Security and Observability Pipeline
Read Case Study ›
Customer Experience

Support & Success ›

Professional Services ›

Service Delivery Partners ›

Documentation ›

AppScope Docs ›

Professional Services

Check out our new Professional Services offering.
Learn More ›
Learning

Try the Sandboxes ›

Self Guided Trials ›

Cribl University ›

Cribl Community ›

Cribl Curious Forum ›

Try Your Own Cribl Sandbox

Experience a full version of Cribl Stream and Cribl Edge in the cloud.
Launch Now ›
Company
About Cribl

Transform data management with Cribl, the Data Engine for IT and Security
Learn More ›

Cribl Corporate Overview

Cribl makes open observability a reality, giving you the freedom and flexibility to make choices instead of compromises.
Get the Guide ›

Cribl Newsroom

Stay up to date on all things Cribl and observability.
Visit the Newsroom ›

Press Releases

Read our most recent press releases.
Recent Press Releases ›

Leadership

Cribl’s leadership team has built and launched category-defining products for some of the most innovative companies in the technology sector, and is supported by the world’s most elite investors.
Meet our Leaders ›

Careers

Join the Cribl herd! The smartest, funniest, most passionate goats you’ll ever meet.
Learn More ›

Cribl Named to the Inc. 5000 List of Fastest Growing Private Companies
Learn More ›

Cribl for Startups

Whether you’re just getting started or scaling up, the Cribl for Startups program gives you the tools and resources your company needs to be successful at every stage.
Learn More ›

Contact Us

Want to learn more about Cribl from our sales experts? Send us your contact information and we’ll be in touch.
Talk to an Expert ›

Try Cribl Talk to an expert

Beware DoSing the Node.js worker pool with DNS query timeouts

Written by Clint Sharp

October 2, 2019

TL;DR:

Node.js sends synchronous requests like DNS lookups to a pool of worker threads, which by default is number of cores / 2 which can be overloaded by many requests to a DNS name which does not resolve, indefinitely blocking connections to legitimate names which will resolve. Beware of your application making many requests to names which do not resolve. This is the story of how I found this condition and diagnosing the root cause using common tools like tcpdump, strace, and gdb. I learned quite a bit during this process, and hopefully this troubleshooting process will help someone in the future.

I maintain our docker based demo environment. I encourage you to check it out; it’s super easy to get up and running and it allows you to interact with Cribl LogStream in a real environment with Splunk, Elasticsearch, Kibana, Graphite, Grafana, Kafka, and more. Unfortunately, running all 6 of those systems on my laptop can sometimes be a bit taxing. Most of the time I’m just in LogStream, or maybe Splunk for some queries. To trim down the environment, I used the approach in this Stack Overflow answer to turn the other containers off, and then set our outputs to those destinations to drop on backpressure. This worked fine initially, but I noticed sometimes we couldn’t send data to Splunk and we’d start hitting the persistent queue.

In Cribl’s UI, you can easily navigate to our internal logs. These logs also go out stdout in our Docker container, so I ran docker-compose logs cribl | grep splunk and saw us attempting to connect and failing:

{"time":"2019-09-17T21:21:02.226Z","channel":"output:splunk","level":"info","message":"attempting to connect"}
{"time":"2019-09-17T21:21:03.227Z","channel":"output:splunk","level":"warn","message":"sending is blocked","since":1568755066340,"elapsed":196887,"endpoint":{"host":"splunk","port":9997,"tls":false}}
{"time":"2019-09-17T21:21:05.227Z","channel":"output:splunk","level":"warn","message":"sending is blocked","since":1568755066340,"elapsed":198887,"endpoint":{"host":"splunk","port":9997,"tls":false}}
{"time":"2019-09-17T21:21:07.227Z","channel":"output:splunk","level":"error","message":"connection error","error":{"message":"connection timeout, dst=splunk:9997","stack":"Error: connection timeout, dst=splunk:9997\n    at Socket.<anonymous> (/opt/cribl/bin/cribl.js:12:373813)\n    at Socket.emit (events.js:198:13)\n    at Socket.EventEmitter.emit (domain.js:448:20)\n    at Socket._onTimeout (net.js:442:8)\n    at ontimeout (timers.js:436:11)\n    at tryOnTimeout (timers.js:300:5)\n    at listOnTimeout (timers.js:263:5)\n    at Timer.processTimers (timers.js:223:10)"}}

We’ve clearly timed out. At this point, I needed to start digging around in the container and see what was going on.

Interrogating the environment: `host`, `nc`, `tcpdump`

As Ledion will tell you, my favorite tool is tcpdump. It’s always my first goto tool, especially for this type of issue. First things first though, I needed to inspect the environment and try to reproduce the connectivity issue I was seeing from LogStream. I used docker-compose exec cribl bash to get a shell in the container. From my shell, I installed a bunch of CLI packages, then used host and nc to ensure I could actually connect to the host splunk which was configured.

apt-get update && apt-get install -y net-tools inetutils-ping lsof tcpdump netcat strace dnsutils

root@0f1f2b815321:/# host splunk
splunk has address 192.168.144.12

root@0f1f2b815321:/# nc splunk 9997
asdf
root@0f1f2b815321:/#

Note, on nc I’m just trying to validate it doesn’t timeout or get a connection reset. asdf is just me sending some input and having the remote side hang up because it’s not valid input for that port.

Next, I used tcpdump to see if we were trying to connect and why the connection wasn’t succeeding. But, I never saw LogStream even try to connect.

root@0f1f2b815321:/# tcpdump port 9997
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel

My next step was perhaps DNS wasn’t resolving, even though it worked from host. I ran tcpdump again to look for DNS queries.

root@0f1f2b815321:/# tcpdump -tttt port 53
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
2019-09-17 22:08:41.447362 IP 0f1f2b815321.46567 > ip-192-168-65-1.us-west-2.compute.internal.domain: 6387+ A? elasticsearch. (31)
2019-09-17 22:08:41.448535 IP 0f1f2b815321.42650 > ip-192-168-65-1.us-west-2.compute.internal.domain: 34028+ PTR? 1.65.168.192.in-addr.arpa. (43)
2019-09-17 22:08:41.487990 IP ip-192-168-65-1.us-west-2.compute.internal.domain > 0f1f2b815321.42650: 34028 1/0/0 PTR ip-192-168-65-1.us-west-2.compute.internal. (99)
2019-09-17 22:08:41.583949 IP 0f1f2b815321.34799 > ip-192-168-65-1.us-west-2.compute.internal.domain: 12064+ A? graphite. (26)
2019-09-17 22:08:46.588461 IP 0f1f2b815321.56956 > ip-192-168-65-1.us-west-2.compute.internal.domain: 12064+ A? graphite. (26)
2019-09-17 22:08:46.623660 IP 0f1f2b815321.40947 > ip-192-168-65-1.us-west-2.compute.internal.domain: 58456+ A? elasticsearch. (31)
2019-09-17 22:08:51.627423 IP 0f1f2b815321.51155 > ip-192-168-65-1.us-west-2.compute.internal.domain: 58456+ A? elasticsearch. (31)
2019-09-17 22:08:51.839217 IP 0f1f2b815321.55180 > ip-192-168-65-1.us-west-2.compute.internal.domain: 26796+ A? elasticsearch. (31)
2019-09-17 22:08:56.841580 IP 0f1f2b815321.36587 > ip-192-168-65-1.us-west-2.compute.internal.domain: 26796+ A? elasticsearch. (31)
2019-09-17 22:08:56.918732 IP 0f1f2b815321.43796 > ip-192-168-65-1.us-west-2.compute.internal.domain: 42488+ A? elasticsearch. (31)
2019-09-17 22:09:01.924076 IP 0f1f2b815321.45636 > ip-192-168-65-1.us-west-2.compute.internal.domain: 42488+ A? elasticsearch. (31)
2019-09-17 22:09:02.089549 IP 0f1f2b815321.32877 > ip-192-168-65-1.us-west-2.compute.internal.domain: 53533+ A? elasticsearch. (31)
2019-09-17 22:09:07.096617 IP 0f1f2b815321.60797 > ip-192-168-65-1.us-west-2.compute.internal.domain: 53533+ A? elasticsearch. (31)
2019-09-17 22:09:07.200726 IP 0f1f2b815321.38639 > ip-192-168-65-1.us-west-2.compute.internal.domain: 13457+ A? elasticsearch. (31)
2019-09-17 22:09:12.202478 IP 0f1f2b815321.49121 > ip-192-168-65-1.us-west-2.compute.internal.domain: 13457+ A? elasticsearch. (31)
2019-09-17 22:09:12.417174 IP 0f1f2b815321.49136 > ip-192-168-65-1.us-west-2.compute.internal.domain: 58907+ A? elasticsearch. (31)
^C
18 packets captured
18 packets received by filter
0 packets dropped by kernel

I know we’re going to be trying to resolve elasticsearch and graphite and failing because I had disabled those containers. I also knew that without shutting down those containers it worked, so my next theory was that something might be breaking with name resolution that wasn’t happening on the wire. Recent versions of Linux have really changed the way name resolution works, and I was guessing perhaps it was happening over a unix socket or somewhere I couldn’t see.

Fun with `strace`

The best way I could think to see if something weird might be happening before the network would be to see if it was even trying to resolve the name or make the network connection through any mechanism. I ran strace against the running process. Note, from docker, this needs to be a privileged container to run strace. Another thing to note, the -f flag to strace is super important, as Node.js spawns new threads, called Light Weight Processes or LWP in strace and gdb, to handle DNS resolution and without strace -f then it won’t by default follow new threads or processes. (I had never heard of a Light Weight Process or that threads actually get a pid you don’t see from ps).

root@0f1f2b815321:/# strace -f -p 6 -e connect > strace.log 2>&1
^C
root@0f1f2b815321:/# grep splunk strace.log
root@0f1f2b815321:/# grep 9997 strace.log
root@0f1f2b815321:/# cat strace.log | sort | uniq -c
...
      1 [pid    13] connect(41, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.11")}, 16) = 0
...   
      2 [pid 14] connect(42, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.11")}, 16) = 0
...
      6 [pid    14] connect(32, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
      7 [pid    14] connect(41, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
      1 [pid    14] connect(41, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
   1431 [pid    14] connect(42, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.1")}, 16) = 0

     14 [pid    14] connect(42, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
    235 [pid    14] connect(46, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
      2 [pid    14] connect(46, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
    413 [pid    14] connect(47, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
      4 [pid    14] connect(47, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
    453 [pid    14] connect(48, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
      4 [pid    14] connect(48, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
     67 [pid    14] connect(51, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
      1 [pid    14] connect(51, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
...

I piped the output of strace -f -p 6 -e connect to a log, where I grep for splunk and 9997 which shows me it’s not connecting at all. Lastly, I took the output of strace and piped it through sort and uniq -c to get a view on popular messages in the file. I can see from the output that we are connecting to 127.0.0.11 on port 53, which matches with my tcpdump. (Note: I also learned DNS gets proxied locally to systemd-resolved. Someone please convince me modern Linux isn’t NT 4.0). I see some unix socket stuff failing to /var/run/nscd/socket, which is super interesting but not particularly relevant.

What’s most interesting in this output, is that one LWP, pid 13, and another, pid 14, are the only things connecting on port 53 and there aren’t particularly many of these queries. However, I didn’t learn particularly much from strace. At this point, I was pretty stumped, so I reached out Ledion, who advised that I try to see what Node.js was doing in gdb.

`gdb` is super cool

I have not historically spent much time in gdb. Most languages I have spent much time in have poor debuggers (JavaScript/Typescript being the exceptions). But, nothing compares to gdb. I’ve used it a lot in the last few weeks helping our engineers diagnose problems in my test environments. In this case, the use case is pretty simple for gdb, can we just poke into a stack trace of every running thread in Node.js to see what’s doing? In gdb, I attached to the running process and then ran thread apply all backtrace. The command is pretty straightforward to read. It’s telling gdb to run backtrace against all threads.

root@0f1f2b815321:/# gdb -p 6
GNU gdb (Ubuntu 8.1-0ubuntu3) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 6
[New LWP 7]
[New LWP 8]
[New LWP 9]
[New LWP 10]
[New LWP 11]
[New LWP 12]
[New LWP 13]
[New LWP 14]
[New LWP 15]
[New LWP 16]
[New LWP 32]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007ffccadbdd61 in gettimeofday ()
(gdb) thread apply all bt

Thread 10 (Thread 0x7f4cd17fa700 (LWP 15)):
#0  0x00007f4cd9aafbf9 in __GI___poll (fds=fds@entry=0x7f4cd17f8038, nfds=nfds@entry=1, timeout=5000, timeout@entry=<error reading variable: That operation is not available on integers of more than 8 bytes.>)
    at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f4cd05e7bf8 in send_dg (ansp2_malloced=0x0, resplen2=0x0, anssizp2=0x0, ansp2=0x0, anscp=0x7f4cd17f90e8, gotsomewhere=<synthetic pointer>, v_circuit=<synthetic pointer>, ns=0, terrno=0x7f4cd17f8018, anssizp=0x7f4cd17f8150,
    ansp=0x7f4cd17f8008, buflen2=0, buf2=0x0, buflen=31, buf=0x7f4cd17f8180 "d\212\001", statp=<optimized out>) at res_send.c:1110
#2  __res_context_send (ctx=ctx@entry=0x7f4cc0009ec0, buf=buf@entry=0x7f4cd17f8180 "d\212\001", buflen=buflen@entry=31, buf2=buf2@entry=0x0, buflen2=buflen2@entry=0, ans=<optimized out>, ans@entry=0x7f4cd17f8c90 "0\217\177\320L\177",
    anssiz=<optimized out>, ansp=<optimized out>, ansp2=<optimized out>, nansp2=<optimized out>, resplen2=<optimized out>, ansp2_malloced=<optimized out>) at res_send.c:517
#3  0x00007f4cd05e54d1 in __GI___res_context_query (ctx=ctx@entry=0x7f4cc0009ec0, name=name@entry=0x48f4530 "elasticsearch", class=class@entry=1, type=type@entry=1, answer=answer@entry=0x7f4cd17f8c90 "0\217\177\320L\177",
    anslen=anslen@entry=1024, answerp=0x7f4cd17f90e8, answerp2=0x0, nanswerp2=0x0, resplen2=0x0, answerp2_malloced=0x0) at res_query.c:216
#4  0x00007f4cd05e628d in __res_context_querydomain (domain=0x0, answerp2_malloced=0x0, resplen2=0x0, nanswerp2=0x0, answerp2=0x0, answerp=0x7f4cd17f90e8, anslen=1024, answer=0x7f4cd17f8c90 "0\217\177\320L\177", type=1, class=1,
    name=0x48f4530 "elasticsearch", ctx=0x7f4cc0009ec0) at res_query.c:601
#5  __GI___res_context_search (ctx=ctx@entry=0x7f4cc0009ec0, name=name@entry=0x48f4530 "elasticsearch", class=class@entry=1, type=type@entry=1, answer=answer@entry=0x7f4cd17f8c90 "0\217\177\320L\177", anslen=anslen@entry=1024,
    answerp=0x7f4cd17f90e8, answerp2=0x0, nanswerp2=0x0, resplen2=0x0, answerp2_malloced=0x0) at res_query.c:370
#6  0x00007f4cd8173f0c in gethostbyname3_context (ctx=ctx@entry=0x7f4cc0009ec0, name=name@entry=0x48f4530 "elasticsearch", af=af@entry=2, result=result@entry=0x7f4cd17f97c0, buffer=buffer@entry=0x7f4cd17f9a30 "\300\250\220\b",
    buflen=buflen@entry=1024, errnop=0x7f4cd17fa658, h_errnop=0x7f4cd17fa6bc, ttlp=0x0, canonp=0x0) at nss_dns/dns-host.c:218
#7  0x00007f4cd8174928 in _nss_dns_gethostbyname3_r (name=0x48f4530 "elasticsearch", af=2, result=0x7f4cd17f97c0, buffer=0x7f4cd17f9a30 "\300\250\220\b", buflen=1024, errnop=0x7f4cd17fa658, h_errnop=0x7f4cd17fa6bc, ttlp=0x0, canonp=0x0)
    at nss_dns/dns-host.c:164
#8  0x00007f4cd8174981 in _nss_dns_gethostbyname2_r (name=<optimized out>, af=<optimized out>, result=<optimized out>, buffer=<optimized out>, buflen=<optimized out>, errnop=<optimized out>, h_errnop=0x7f4cd17fa6bc)
    at nss_dns/dns-host.c:282
#9  0x00007f4cd9ad0e3d in __gethostbyname2_r (name=name@entry=0x48f4530 "elasticsearch", af=af@entry=2, resbuf=resbuf@entry=0x7f4cd17f97c0, buffer=<optimized out>, buflen=1024, result=result@entry=0x7f4cd17f97b8, h_errnop=0x7f4cd17fa6bc)
    at ../nss/getXXbyYY_r.c:315
#10 0x00007f4cd9aa1054 in gaih_inet (name=name@entry=0x48f4530 "elasticsearch", service=<optimized out>, req=req@entry=0x7f4cd17f99f0, pai=pai@entry=0x7f4cd17f99b8, naddrs=naddrs@entry=0x7f4cd17f99b4, tmpbuf=tmpbuf@entry=0x7f4cd17f9a20)
    at ../sysdeps/posix/getaddrinfo.c:591
#11 0x00007f4cd9aa2ce4 in __GI_getaddrinfo (name=<optimized out>, service=<optimized out>, hints=0x7f4cd17f99f0, pai=0x4eb8ff8) at ../sysdeps/posix/getaddrinfo.c:2300
#12 0x00007f4cdad0cb3d in getaddrinfo (node=0x48f4530 "elasticsearch", service=0x0, hints=0x48f4500, res=0x4eb8ff8) at src/wrap.c:3458
#13 0x0000000000a876a1 in uv__getaddrinfo_work (w=0x4eb8fb0) at ../deps/uv/src/unix/getaddrinfo.c:106
#14 0x0000000000a7ba7e in worker (arg=0x0) at ../deps/uv/src/threadpool.c:122
#15 0x00007f4cd9d936db in start_thread (arg=0x7f4cd17fa700) at pthread_create.c:463
#16 0x00007f4cd9abc88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 9 (Thread 0x7f4cd1ffb700 (LWP 14)):
#0  0x00007f4cd9aafbf9 in __GI___poll (fds=fds@entry=0x7f4cd1ff9038, nfds=nfds@entry=1, timeout=5000, timeout@entry=<error reading variable: That operation is not available on integers of more than 8 bytes.>)
    at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f4cd05e7bf8 in send_dg (ansp2_malloced=0x0, resplen2=0x0, anssizp2=0x0, ansp2=0x0, anscp=0x7f4cd1ffa0e8, gotsomewhere=<synthetic pointer>, v_circuit=<synthetic pointer>, ns=0, terrno=0x7f4cd1ff9018, anssizp=0x7f4cd1ff9150,
    ansp=0x7f4cd1ff9008, buflen2=0, buf2=0x0, buflen=31, buf=0x7f4cd1ff9180 "\210\305\001", statp=<optimized out>) at res_send.c:1110
#2  __res_context_send (ctx=ctx@entry=0x7f4cb8009b40, buf=buf@entry=0x7f4cd1ff9180 "\210\305\001", buflen=buflen@entry=31, buf2=buf2@entry=0x0, buflen2=buflen2@entry=0, ans=<optimized out>, ans@entry=0x7f4cd1ff9c90 "0\237\377\320L\177",
    anssiz=<optimized out>, ansp=<optimized out>, ansp2=<optimized out>, nansp2=<optimized out>, resplen2=<optimized out>, ansp2_malloced=<optimized out>) at res_send.c:517
#3  0x00007f4cd05e54d1 in __GI___res_context_query (ctx=ctx@entry=0x7f4cb8009b40, name=name@entry=0x9a14290 "elasticsearch", class=class@entry=1, type=type@entry=1, answer=answer@entry=0x7f4cd1ff9c90 "0\237\377\320L\177",
    anslen=anslen@entry=1024, answerp=0x7f4cd1ffa0e8, answerp2=0x0, nanswerp2=0x0, resplen2=0x0, answerp2_malloced=0x0) at res_query.c:216
#4  0x00007f4cd05e628d in __res_context_querydomain (domain=0x0, answerp2_malloced=0x0, resplen2=0x0, nanswerp2=0x0, answerp2=0x0, answerp=0x7f4cd1ffa0e8, anslen=1024, answer=0x7f4cd1ff9c90 "0\237\377\320L\177", type=1, class=1,
    name=0x9a14290 "elasticsearch", ctx=0x7f4cb8009b40) at res_query.c:601
#5  __GI___res_context_search (ctx=ctx@entry=0x7f4cb8009b40, name=name@entry=0x9a14290 "elasticsearch", class=class@entry=1, type=type@entry=1, answer=answer@entry=0x7f4cd1ff9c90 "0\237\377\320L\177", anslen=anslen@entry=1024,
    answerp=0x7f4cd1ffa0e8, answerp2=0x0, nanswerp2=0x0, resplen2=0x0, answerp2_malloced=0x0) at res_query.c:370
#6  0x00007f4cd8173f0c in gethostbyname3_context (ctx=ctx@entry=0x7f4cb8009b40, name=name@entry=0x9a14290 "elasticsearch", af=af@entry=2, result=result@entry=0x7f4cd1ffa7c0, buffer=buffer@entry=0x7f4cd1ffaa30 "\300\250\220\b",
    buflen=buflen@entry=1024, errnop=0x7f4cd1ffb658, h_errnop=0x7f4cd1ffb6bc, ttlp=0x0, canonp=0x0) at nss_dns/dns-host.c:218
#7  0x00007f4cd8174928 in _nss_dns_gethostbyname3_r (name=0x9a14290 "elasticsearch", af=2, result=0x7f4cd1ffa7c0, buffer=0x7f4cd1ffaa30 "\300\250\220\b", buflen=1024, errnop=0x7f4cd1ffb658, h_errnop=0x7f4cd1ffb6bc, ttlp=0x0, canonp=0x0)
    at nss_dns/dns-host.c:164
#8  0x00007f4cd8174981 in _nss_dns_gethostbyname2_r (name=<optimized out>, af=<optimized out>, result=<optimized out>, buffer=<optimized out>, buflen=<optimized out>, errnop=<optimized out>, h_errnop=0x7f4cd1ffb6bc)
---Type <return> to continue, or q <return> to quit---
    at nss_dns/dns-host.c:282
#9  0x00007f4cd9ad0e3d in __gethostbyname2_r (name=name@entry=0x9a14290 "elasticsearch", af=af@entry=2, resbuf=resbuf@entry=0x7f4cd1ffa7c0, buffer=<optimized out>, buflen=1024, result=result@entry=0x7f4cd1ffa7b8, h_errnop=0x7f4cd1ffb6bc)
    at ../nss/getXXbyYY_r.c:315
#10 0x00007f4cd9aa1054 in gaih_inet (name=name@entry=0x9a14290 "elasticsearch", service=<optimized out>, req=req@entry=0x7f4cd1ffa9f0, pai=pai@entry=0x7f4cd1ffa9b8, naddrs=naddrs@entry=0x7f4cd1ffa9b4, tmpbuf=tmpbuf@entry=0x7f4cd1ffaa20)
    at ../sysdeps/posix/getaddrinfo.c:591
#11 0x00007f4cd9aa2ce4 in __GI_getaddrinfo (name=<optimized out>, service=<optimized out>, hints=0x7f4cd1ffa9f0, pai=0x5c223c8) at ../sysdeps/posix/getaddrinfo.c:2300
#12 0x00007f4cdad0cb3d in getaddrinfo (node=0x9a14290 "elasticsearch", service=0x0, hints=0x9a14260, res=0x5c223c8) at src/wrap.c:3458
#13 0x0000000000a876a1 in uv__getaddrinfo_work (w=0x5c22380) at ../deps/uv/src/unix/getaddrinfo.c:106
#14 0x0000000000a7ba7e in worker (arg=0x0) at ../deps/uv/src/threadpool.c:122
#15 0x00007f4cd9d936db in start_thread (arg=0x7f4cd1ffb700) at pthread_create.c:463
#16 0x00007f4cd9abc88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

This produces quite a bit of output, but I’ve kept only the two threads that were particularly relevant. From this purview, it’s pretty easy to see what’s going on. Node has two threads looking up the name elasticsearch, which matches with what we saw with tcpdump. At this point, I started googling around trying to find why Node.js might timeout, when I stumbled on this StackOverflow post, which lead to this more important post on dns.lookup() versus dns.resolve*() .

Root cause and workaround

From my readings, root cause was pretty clear. Because I had shut down Elasticsearch and Graphite, attempting to connect to those destinations was exhausting the capacity of the thread pool waiting on DNS requests for those names. With this theory, I added some extra hosts to docker-compose.yml with fake RFC1918 IPs which allowed DNS to succeed but the connection attempts fail which resolved my issue. Key finding from this was to be careful in your Node.js application, as many connection attempts to names which will not resolve can easily starve connection attempts to legitimate working names.

I didn’t have time to look into Node.js and see what happens in the case of this congestion scenario with requests. Based on the evidence I saw, it will never attempt to connect properly to the hostname that will resolve, even after many minutes, so my guess is we are initiating so many connection attempts that the backlog never clears. It would be nice to know if they eventually drop requests out of the queue or what happens with backpressure.

For future versions of Cribl LogStream, we’ll make some changes how we resolve names to guard against this potential DoS scenario.

Return to Cribl Blog

Additional Reading

Preventing Friction With an Impactful Security Champions Program

Liam McGovern Jul 24, 2024

Securing the Foundation of Cribl Copilot

Zach Rayburn Jul 22, 2024

Product Portfolio

Cribl Stream

Cribl Edge

Cribl Search

Cribl Lake

Cribl.Cloud

Cribl Copilot

AppScope

Use Cases

Integration

Industries

Resources

Events & Webinars

Learning

Tools & Pricing

Download Library

Customer Stories

Customer Experience

Learning

Try Your Own Cribl Sandbox

About Cribl

Cribl Newsroom

Leadership

Careers

Cribl for Startups

Contact Us

Beware DoSing the Node.js worker pool with DNS query timeouts

Written by Clint Sharp

TL;DR:

Interrogating the environment: `host`, `nc`, `tcpdump`

Fun with `strace`

`gdb` is super cool

Root cause and workaround

Additional Reading

Preventing Friction With an Impactful Security Champions Program

Securing the Foundation of Cribl Copilot

Questions about our technology? We’d love to chat with you.

So you're rockin' Internet Explorer!

Product Portfolio

Cribl Stream

Cribl Edge

Cribl Search

Cribl Lake

Cribl.Cloud

Cribl Copilot

AppScope

Use Cases

Integration

Industries

Resources

Events & Webinars

Learning

Tools & Pricing

Download Library

Customer Stories

Customer Experience

Learning

Try Your Own Cribl Sandbox

About Cribl

Cribl Newsroom

Leadership

Careers

Cribl for Startups

Contact Us

Beware DoSing the Node.js worker pool with DNS query timeouts

Written by Clint Sharp

TL;DR:

Interrogating the environment: host, nc, tcpdump

Fun with strace

gdb is super cool

Root cause and workaround

Additional Reading

Preventing Friction With an Impactful Security Champions Program

Securing the Foundation of Cribl Copilot

Questions about our technology? We’d love to chat with you.

So you're rockin' Internet Explorer!

Interrogating the environment: `host`, `nc`, `tcpdump`

Fun with `strace`

`gdb` is super cool