Last week I briefly mentioned the nightmares behind Linux and GNU Standard C++ ABI compatibility. This manifested itself at a very bad time — at ESEA, we were getting ready to launch our European servers. Unlike their USA counterparts, two of the physical machines were 32-bit. I made 32-bit builds of our software and uploaded them to both machines. I checked one box and everything was working. Then we launched.
Guess what? The machine I didn’t check? Nothing was working. A critical Metamod plugin was failing to load with the message:
libstdc++.so.6: cannot handle TLS data
I had seen this before, and quickly panicked; in past cases I’d never managed to fix it. It is virtually undocumented (try a Google search), and the problem is occurring in the ELF loader, not exactly a trivial area of Linux to start dissecting at product launch.
The first thing I did was download the libc source code for the OS – 2.3.4. A grep for “handle TLS data” revealed this in elf/dl-load.c:
case PT_TLS:
#ifdef USE_TLS
  ...
#endif
  /* Uh-oh, the binary expects TLS support but we cannot
     provide it. */
  errval = 0;
  errstring = N_("cannot handle TLS data");
  goto call_lose;
  break;
It looked like libstdc++.so.6, which was required by our binary, had a PT_TLS entry in its program headers. TLS is thread-local storage. So, now I needed to find out why the TLS block couldn’t be created.
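A quick way to confirm that suspicion is to look at the library's program headers, either with `readelf -l libstdc++.so.6` or with a few lines of C. The sketch below is only an illustration: `elf_has_pt_tls` is a made-up helper name, it expects the whole file to be in memory already, and it assumes the file uses the same byte order as the machine running the check.

```c
#include <elf.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of glibc): returns 1 if the ELF image has a
   PT_TLS program header, 0 if it has none, and -1 if the buffer does not
   look like an ELF image. Assumes the image is in host byte order. */
int elf_has_pt_tls(const unsigned char *image, size_t len)
{
    if (len < EI_NIDENT || memcmp(image, ELFMAG, SELFMAG) != 0)
        return -1;

    if (image[EI_CLASS] == ELFCLASS32) {
        Elf32_Ehdr eh;
        if (len < sizeof eh)
            return -1;
        memcpy(&eh, image, sizeof eh);
        for (size_t i = 0; i < eh.e_phnum; i++) {
            Elf32_Phdr ph;
            size_t off = (size_t)eh.e_phoff + i * (size_t)eh.e_phentsize;
            if (off > len || len - off < sizeof ph)
                return -1;              /* header table runs past the buffer */
            memcpy(&ph, image + off, sizeof ph);
            if (ph.p_type == PT_TLS)    /* PT_TLS has the constant value 7 */
                return 1;
        }
        return 0;
    }

    if (image[EI_CLASS] == ELFCLASS64) {
        Elf64_Ehdr eh;
        if (len < sizeof eh)
            return -1;
        memcpy(&eh, image, sizeof eh);
        for (size_t i = 0; i < eh.e_phnum; i++) {
            Elf64_Phdr ph;
            size_t off = (size_t)eh.e_phoff + i * (size_t)eh.e_phentsize;
            if (off > len || len - off < sizeof ph)
                return -1;
            memcpy(&ph, image + off, sizeof ph);
            if (ph.p_type == PT_TLS)
                return 1;
        }
        return 0;
    }

    return -1;                          /* unknown ELF class */
}
```

To use it, read the shared object into a buffer (for example with `fread`) and pass the buffer and its length. A result of 1 means the dynamic loader will need working TLS support to map that file.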
First, I opened up libc-2.3.4.so in a disassembler. The function in question was _dl_map_object_from_fd. I then had to find where the PT_TLS case was handled (its constant value is 7). According to the source code, _dl_next_tls_modid() is called shortly after. I found that call and traced back the jumps using IDA’s cross-reference feature. I found this:
.text:4AA03A8C    cmp   eax, 7
.text:4AA03A8F    nop
.text:4AA03A90    jnz   short loc_4AA03A50
.text:4AA03A92    mov   eax, [esi+14h]
.text:4AA03A95    test  eax, eax
.text:4AA03A97    jz    short loc_4AA03A50
Bingo! Clearly, USE_TLS is defined, otherwise it wouldn’t bother with a jump. So the problem is definitely in the logic somewhere, rather than a lack of TLS support. It was time for some debugging with gdb:
(gdb) set disas intel
(gdb) display/i $pc
(gdb) br _dl_map_object_from_fd
Breakpoint 1 at 0x4aa03866
I got lucky with the matching address and put a direct breakpoint:
(gdb) del 1
(gdb) br *0x4AA03A8C
Breakpoint 2 at 0x4aa03a8c
I stepped through the assembly, following along in my source code. I narrowed the failing condition down to this code:
/* If GL(dl_tls_dtv_slotinfo_list) == NULL, then rtld.c did
   not set up TLS data structures, so don't use them now. */
|| __builtin_expect (GL(dl_tls_dtv_slotinfo_list) != NULL, 1))
  {
I opened up rtld.c and searched for dl_tls_dtv_slotinfo_list — and the answer was immediately apparent:
/* We do not initialize any of the TLS functionality unless any of the
   initial modules uses TLS. This makes dynamic loading of modules with
   TLS impossible, but to support it requires either eagerly doing setup
   now or lazily doing it later. Doing it now makes us incompatible with
   an old kernel that can't perform TLS_INIT_TP, even if no TLS is ever
   used. Trying to do it lazily is too hairy to try when there could be
   multiple threads (from a non-TLS-using libpthread). */
if (!TLS_INIT_TP_EXPENSIVE || GL(dl_tls_max_dtv_idx) > 0)
And there it was. That version of glibc refused to late-load dynamic libraries that had TLS requirements unless something loaded at startup already used TLS. When I checked the working server, it had a much newer libc (2.5 from CentOS 5, versus 2.3.4 from CentOS 4.4).
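A difference like this is easy to catch up front by having the program report which glibc it is actually running against. Here is a minimal sketch using `gnu_get_libc_version()` from `<gnu/libc-version.h>`. The helper names are invented for this example, and the 2.5 threshold in the usage note is only the version that happened to work here, not necessarily the release where the behaviour changed.

```c
#include <gnu/libc-version.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if a dotted version string such as "2.3.4"
   is at least major.minor, 0 if it is older, and -1 if it cannot be parsed.
   Only the first two components are compared. */
int version_at_least(const char *version, int major, int minor)
{
    int have_major = 0, have_minor = 0;

    if (sscanf(version, "%d.%d", &have_major, &have_minor) != 2)
        return -1;
    if (have_major != major)
        return have_major > major;
    return have_minor >= minor;
}

/* Applies the same check to the glibc this process is running against. */
int running_glibc_at_least(int major, int minor)
{
    return version_at_least(gnu_get_libc_version(), major, minor);
}
```

A startup check such as `if (running_glibc_at_least(2, 5) != 1)` could then log a warning on the older box before any plugin is loaded, instead of failing later with an opaque loader error.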
I am hardly worthy of nit-picking the likes of glibc maintainers, but I find it lame that the error message was completely undocumented, as was the (lack of) functionality therein. While researching this, I also looked through the glibc CVS – the bug was first fixed here. The big comment explaining the bug remains, even though it appears that as of this revision, it is no longer applicable. Whether that’s true or not, I don’t know. The actual revision comments are effectively useless for determining what the changes mean. I may never really know.
How did I end up solving this? Rather than do an entire system upgrade, I removed our libstdc++ dependency. It just so happened it was there by accident. Oops. Note that earlier versions of libstdc++ had no PT_TLS references — which is why this is a subtle ABI issue.
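For anyone wondering how a library ends up with a PT_TLS entry in the first place: with GCC on ELF targets, a single thread-local variable is enough. The snippet below is a generic illustration of that mechanism and not the code from libstdc++ (which internals introduced its TLS use is not something this post establishes). The names are made up for the example.

```c
/* One copy of this counter exists per thread. Because it is thread-local,
   the toolchain places it in a TLS section (.tbss here, since it is
   zero-initialized) and the linker emits a PT_TLS program header. */
static __thread int calls_on_this_thread;

/* Hypothetical exported function: counts calls made by the calling thread. */
int count_call(void)
{
    return ++calls_on_this_thread;
}
```

Building this as a shared object (for example `gcc -shared -fPIC`) and running `readelf -l` on the result should show a TLS entry among the program headers. Any loader that cannot set up TLS at dlopen() time will then reject the file.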
In the end, the moral of the story is: binary compatibility on Linux is a nightmare. It’s not the fault of just the kernel, or just GNU; it’s the fault of everyone picking and enforcing their own standards.
As a final tirade, Glibc needs to get dlerror() message documentation. “Use the source, Luke,” is not an acceptable API reference.
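Until that documentation exists, the best a plugin host can do is capture the dlerror() string at the moment of failure and log it verbatim, so there is at least something to grep the glibc source for. A minimal sketch, assuming the plugin is loaded with dlopen(); `try_load` is a hypothetical helper, and on older glibc versions the program needs to be linked with -ldl.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper: tries to load a shared object. On failure it copies
   the dlerror() text into err and returns -1; on success it unloads the
   object again and returns 0. */
int try_load(const char *path, char *err, size_t errlen)
{
    void *handle = dlopen(path, RTLD_NOW);

    if (handle == NULL) {
        /* dlerror() returns the message for the most recent failure. */
        const char *msg = dlerror();
        snprintf(err, errlen, "%s", msg ? msg : "unknown dlopen failure");
        return -1;
    }

    dlclose(handle);
    return 0;
}
```

On the broken machine, a call like this against the failing plugin would be expected to produce the "cannot handle TLS data" string quoted at the top of the post.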
Ouch, well, thanks for the information BAILOPAN. I hate it when I debug for a day and a half and it turns out I linked something that wasn’t needed.
To avoid unexpected ABI breaks, you can use tools that statically compare an old version of a library with a new one, such as the free ABI-compliance-checker from http://ispras.linuxfoundation.org/index.php/ABI_compliance_checker
Thanks for posting all of the examples. Also, thanks for posting the source code, very helpful! :-)