Subject: select() returns 1, timeout hasn't expired, but all FD bits are zero

From: Daniel Hardman <daniel.hardman_at_gmail.com>
Date: Fri, 23 Jan 2015 16:33:21 -0700

I have been getting occasional, mysterious crashes inside ares_process() on
3 separate machines. Typically the crashes happen under steady, moderate
load (tens to hundreds of DNS queries per second); I can make them happen
faster if I stress the system. Today I finally captured a useful core and
did some detective work.

I have a classic select loop (appended below, FWIW). The backtrace shows
that when I crash, I'm inside ares_process => process_fds =>
read_udp_packets (line 481 in c-ares 1.10) => handle_error. The problem is
line 693, which asserts that query->server == whichserver. This crashes
because query is null, so dereferencing query->(anything) is invalid.

I was flummoxed. How could I have arrived in this weird state? Moving back
up the stack, I saw that at the time I called ares_process(), select() had
returned 1 (so, theoretically, 1 FD was ready for read/write), and timeout
still had about 95 milliseconds remaining, yet every single bit in both my
readers and writers fd_sets was zero.

Questions:

1. Am I correct that I should have seen at least 1 bit set in my readers or
writers FD arrays if a socket was truly ready?
2. Is ares_process() supposed to be smart enough not to crash if select()
lies to it?
3. Does this sound like any bug reports anybody's familiar with?
4. Anybody have advice about a workaround? Do I just double-check the bits
to make sure something's not zero if select() returns a positive number?
5. It would be super easy to patch handle_error() to test for query !=
null. Would that be worthwhile, or am I just masking something more serious?
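
For question 4, the double-check I have in mind could be sketched roughly
like this. It assumes select() reports the total number of bits left set
across the sets passed to it (I pass NULL for the exception set, so only
readers and writers count). The names count_set_fds and
select_result_is_consistent are hypothetical helpers of my own, not part of
c-ares.

```cpp
#include <sys/select.h>
#include <cassert>

// Hypothetical helper (not part of c-ares): count how many descriptors
// below nfds are still set in either set after select() returns.
static int count_set_fds(int nfds, fd_set *readers, fd_set *writers)
{
    assert(nfds >= 0 && nfds <= FD_SETSIZE);
    int count = 0;
    for (int fd = 0; fd < nfds; ++fd) {
        if (FD_ISSET(fd, readers)) ++count;
        if (FD_ISSET(fd, writers)) ++count;
    }
    return count;
}

// Hypothetical check: true when select()'s return value n agrees with
// the number of bits actually left set in the two sets.
static bool select_result_is_consistent(int n, int nfds,
                                        fd_set *readers, fd_set *writers)
{
    return n >= 0 && n == count_set_fds(nfds, readers, writers);
}
```

In the loop below, the idea would be to test
select_result_is_consistent(n, pending, &readers, &writers) before calling
ares_process() and to log the mismatch when it fails. Whether skipping
ares_process() in that case is actually safe is exactly what I'm unsure
about.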

--Daniel

    while (true)
    {
        // Get file descriptors for any DNS tasks that are currently running.
        FD_ZERO(&readers);
        FD_ZERO(&writers);
        auto pending = ares_fds(channel, &readers, &writers);

        if (pending) {
            ares_timeout(channel, NULL, &timeout);
            auto n = select(pending, &readers, &writers, NULL, &timeout);

            if (n < 0) {
                char buf[256];
                strerror_r(errno, buf, sizeof(buf));
                fprintf(stderr, "Got error %d from select(); %s.\n", errno, buf);
            } else {
                // Invoke appropriate callback(s) now that something meaningful
                // needs to be done.
                ares_process(channel, &readers, &writers);
            }
        } else {
            break;
        }
    }

Received on 2015-01-24