Daniel Brahneborg

April 22, 2025

Two lessons from using libresolv

If you want to do DNS lookups in your Linux or BSD application, you've got lots of useful functions in libresolv. The basics involve several steps, but there are examples everywhere. It goes like this:

  1. Call res_query() on the domain name.
  2. Call ns_initparse() to parse the response. This returns an "ns_msg" struct.
  3. Call ns_msg_count() to get the number of results in the ns_msg.
  4. Call ns_parserr() on each of the results, giving you an "ns_rr" struct.
  5. This struct contains the TTL and the actual result data.

Now the fun begins. If you did a lookup on types A or AAAA, the data is simply a struct in_addr or struct in6_addr, respectively. For a CNAME record you have to call dn_expand() to unpack the result into a usable string. Easy peasy.

But for type MX, the result also has a priority. Where is that value? There is no such field in ns_rr, and there is no helper function. Most examples just call ns_sprintrr(), which returns a string such as "10 smtp.example.se". However, building that string and then parsing it back apart is just stupid.

The source code for ns_sprintrr() is open, and there you can see that for MX, the result starts with a 16-bit integer, which is exactly the priority. Afterwards comes the hostname, which you again have to unpack with dn_expand(). But why is this integer here? The manual pages provide no clues. The source code just picks these two bytes, without explaining why.

It turns out to be due to RFC 1035, which is where the format of this data is specified. There are several other record types you can ask for (SOA, CAA, TXT, and whatnot) and their formats are all specified here.

Lesson 1: Document where your data formats are specified.

To make things thread safe, res_query() has a newer variant, called res_nquery(). Its first parameter is a state parameter, which you initialize using res_ninit(). Afterwards, you release any resources it used by calling res_nclose(). This state parameter has a field "options" which you can use to modify how res_nquery() behaves (use TCP instead of UDP, for example). It also has a bit RES_INIT which is set by res_ninit(). To make things easy, just check this bit before doing a query, and if it is not already set, call res_ninit().
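That check can be sketched like this (the helper name is hypothetical; res_nquery() itself is only shown in a comment, since it needs a live resolver):

```c
#include <sys/types.h>
#include <netinet/in.h>
#include <arpa/nameser.h>
#include <resolv.h>

/* One state per thread; zeroed like any static or calloc()ed data. */
static struct __res_state state;

/* Hypothetical helper: initialize the state on first use only. */
static int ensure_res_init(res_state statp)
{
    if (!(statp->options & RES_INIT))
        return res_ninit(statp);
    return 0;
}

/* Typical use:
 *   if (ensure_res_init(&state) == 0) {
 *       res_nquery(&state, "example.se", ns_c_in, ns_t_mx,
 *                  buf, sizeof buf);
 *       ...
 *       res_nclose(&state);
 *   }
 */
```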

Then, suddenly, connections made by my application were closed, and new connections got descriptor 0. That is a valid value, but a new connection should only get it if somebody actually closed it first. And then that connection was closed, and another got descriptor 0. So, data sent over the first connection ended up being sent over the second. Not good.

I added log statements all over the place, and ran everything within Valgrind. Nothing. Finally I ran the app within strace, and found the culprit: My app called res_nclose() in a thread which hadn't called res_ninit(). The res_nclose() function begins like this:

if (statp->_vcsock >= 0) { close(statp->_vcsock); }

Does it check the RES_INIT bit first? No. As the data was on the heap, this _vcsock field was 0. Oops. So now my application has to check this bit. Afterwards, _vcsock is set to -1, so it would be safe to call res_nclose() twice. But it is never initialized to -1 unless res_ninit() is called, so this check only solves one part of the problem.

Lesson 2: Allow users to call your functions in any order. When doing the query, this is automatic as one function returns the data used as input by the next one. So you cannot really call them in the wrong order. Here, you would get the same effect by letting res_ninit() return the state used by the other functions. This would make it impossible to call res_nclose() without calling res_ninit() first, as you otherwise wouldn't have any state parameter.
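A thin wrapper can enforce that ordering by making the state only obtainable from the init call (all names here are hypothetical, a sketch of the idea rather than anything libresolv provides):

```c
#include <stdlib.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <arpa/nameser.h>
#include <resolv.h>

/* Hypothetical wrapper: the only way to get a state is to open one. */
typedef struct dns_ctx { struct __res_state res; } dns_ctx;

static dns_ctx *dns_open(void)
{
    dns_ctx *ctx = calloc(1, sizeof *ctx);
    if (ctx == NULL)
        return NULL;
    if (res_ninit(&ctx->res) < 0) {
        free(ctx);
        return NULL;
    }
    return ctx;
}

/* Closing without opening is now impossible: without dns_open(),
 * there is no ctx to pass in. */
static void dns_close(dns_ctx *ctx)
{
    if (ctx == NULL)
        return;
    res_nclose(&ctx->res);
    free(ctx);
}
```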