This article is mirrored on my blog
My linker in development was crashing on `free`. It was calling one `malloc` but then freeing with a `free` associated with a different `malloc`. This subsequently caused a segmentation fault because the `free` expected a metadata structure that didn't exist in the other `malloc` (at least, not at the same size).
This took a lot of sleuthing. I knew it was a problem with my linker/loader, but couldn't step-debug it because my loader doesn't create a program image the way GDB expects.[^1]
This article goes over how I found the issue. It might help you if you are encountering strange things with your work-in-progress linker (ha ha!), or if you like reading about someone debugging something.[^2]
Finding the problem
The problem manifested in my attempt to statically link[^3] the following simple Cakelisp program:
```
(add-c-search-directory-module "/home/macoy/musl/include")
(c-import "stdio.h" "stdlib.h")

(defun main (&return int)
  (fprintf stderr "Hello, C runtime!\n")
  (var data (* char) (type-cast (malloc (* (sizeof (type (* char))) 10)) (* char)))
  (fprintf stderr "Allocated and got %p!\n" data)
  (set (at 0 data) 0)
  (fprintf stderr "Accessed %p, it's now %d\n" data (at 0 data))
  (free data)
  (fprintf stderr "Freed!\n")
  (return 0))
```
This program first proved that musl libc was at least partially functional by successfully printing to `stderr`. However, the program segfaulted in `free`.
I used a combination of a signal handler and rudimentary stack printing via `backtrace.h` to find I was calling the incorrect `malloc` function relative to the later `free`.
I discovered this by noticing that I could successfully set the data returned by `malloc` without encountering a segmentation fault, so the memory was at least valid.
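For anyone who wants to try the same kind of setup, here is a minimal C sketch of a `SIGSEGV` handler that dumps the call stack. It is an illustration, not the handler from this project: it assumes a glibc host and uses `backtrace` and `backtrace_symbols_fd` from `<execinfo.h>` rather than `backtrace.h`, and the names `print_stack`, `on_segfault`, and `install_segfault_handler` are made up for the example.

```c
#define _GNU_SOURCE
#include <execinfo.h> /* backtrace, backtrace_symbols_fd (glibc extension) */
#include <signal.h>
#include <string.h>
#include <unistd.h>

#define MAX_FRAMES 64

/* Print the current call stack to stderr. backtrace_symbols_fd writes
 * straight to a file descriptor instead of returning allocated strings.
 * Note: backtrace is not strictly async-signal-safe; acceptable for a
 * throwaway debugging aid, not for shipping code. */
static void print_stack(void)
{
	void *frames[MAX_FRAMES];
	int count = backtrace(frames, MAX_FRAMES);
	backtrace_symbols_fd(frames, count, STDERR_FILENO);
}

/* SIGSEGV handler: report, dump the stack, then exit rather than
 * returning to the faulting instruction. */
static void on_segfault(int signal_number)
{
	static const char message[] = "Caught SIGSEGV, stack follows:\n";
	(void)signal_number;
	ssize_t ignored = write(STDERR_FILENO, message, sizeof(message) - 1);
	(void)ignored;
	print_stack();
	_exit(1);
}

/* Returns 0 on success and -1 on failure, like sigaction itself. */
int install_segfault_handler(void)
{
	struct sigaction action;
	memset(&action, 0, sizeof(action));
	action.sa_handler = on_segfault;
	sigemptyset(&action.sa_mask);
	return sigaction(SIGSEGV, &action, NULL);
}
```

Call `install_segfault_handler()` once at startup, before jumping into the loaded program. The frames printed are raw addresses unless the symbols are known to the host's dynamic linker, so for code placed in memory by a custom loader you would still need your own address-to-symbol lookup.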
I then hacked together a damn simple "interactive debugger", which gets triggered when `SIGSEGV` is caught by my signal handler[^4]:
```
;; Very minimal!
(defun-local interactive-debugger ()
  (fprintf stderr "Commands:\n \tquit\n \tprint-symbol [symbol-name]\n")
  (var print-symbol-tag-length (const int) (strlen "print-symbol"))
  (while 1
    (fprintf stderr "> ")
    (var input ([] 256 char) (array 0))
    ;; Note: We need to request stdin before running this in a signal handler!
    (fgets input (sizeof input) stdin)
    (cond
      ((or (= 0 (strcmp "quit\n" input)) (= 0 (strcmp "q\n" input)))
       (break))
      ((= 0 (strncmp "print-symbol " input print-symbol-tag-length))
       (var symbol-name-buffer ([] 128 char) (array 0))
       (strcpy symbol-name-buffer (+ input print-symbol-tag-length 1))
       (set (at (- (strlen symbol-name-buffer) 1) symbol-name-buffer) 0)
       (fprintf stderr "Searching for '%s'\n" symbol-name-buffer)
       ;; This prints where it finds it for us
       (var symbol (* void) (find-symbol-address-in-allocated-sections symbol-name-buffer))))))
```
The `print-symbol` command alerted me to the fact that `malloc` was resolving to the `lite_malloc.c` implementation, but `free` was resolving to the `mallocng` implementation.
I then started looking at `lite_malloc.c` and untangling the mess.
Let's walk through the issue.
musl's malloc implementation
musl has a "lite" or simple `malloc` that is defined as a fallback when e.g. `mallocng` isn't included in your musl build. It is defined like so:
```c
static void *__simple_malloc(size_t n)
{
	// [Implementation omitted by article author]
}

weak_alias(__simple_malloc, __libc_malloc_impl);

void *__libc_malloc(size_t n)
{
	return __libc_malloc_impl(n);
}

static void *default_malloc(size_t n)
{
	return __libc_malloc_impl(n);
}

weak_alias(default_malloc, malloc);
```
After reading several pages on this relatively obscure feature, I came to understand what these `weak_alias` macros accomplish.
If the user defines their own `malloc`, that creates a strong definition, thereby overriding musl libc's `malloc`.
If the user does not define their own `malloc`, the linker resolves `malloc` to the definition of `default_malloc`.
This, I deduce, is accomplished like so:
- GCC sees `__attribute__((weak, alias("default_malloc")))` attached to a declaration of `malloc` (either via `#pragma` or a direct attribute on the declaration)
- GCC generates the code for `default_malloc` and puts it in the object file under both ELF symbol names, `malloc` and `default_malloc`, which we can confirm with `nm`:
```
~/Repositories/linker-loader $ nm --defined /home/macoy/Downloads/musl-1.2.3/obj/src/malloc/lite_malloc.lo
0000000000000000 b brk.2119
0000000000000000 D __bump_lockptr
0000000000000000 b cur.2120
0000000000000000 t default_malloc
0000000000000000 b end.2121
0000000000000000 T __libc_malloc
0000000000000000 W __libc_malloc_impl
0000000000000000 b lock
0000000000000000 W malloc
0000000000000000 b mmap_step.2122
0000000000000000 t __simple_malloc
```
The `W` denotes a weak symbol. Note that you would be able to reference `default_malloc` directly, i.e. the alias isn't an override, but in this case `default_malloc` is marked `static`, so it will not be exposed to the linker by its true name.
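To see this kind of alias in isolation, here is a small sketch, assuming GCC or Clang targeting ELF. `WEAK_ALIAS`, `default_answer`, `answer`, and `call_answer` are hypothetical names for illustration; the macro follows the pattern described above and is not copied from musl.

```c
/* Hypothetical macro for this sketch: declare `new_name` as a weak symbol
 * that refers to the same code as `old_name`. GCC/Clang on ELF only. */
#define WEAK_ALIAS(old_name, new_name) \
	extern __typeof(old_name) new_name __attribute__((weak, alias(#old_name)))

/* The library's fallback. It is static, so other object files can only
 * reach it through the alias. */
static int default_answer(void)
{
	return 42;
}

/* `answer` is now a weak definition that points at default_answer. */
WEAK_ALIAS(default_answer, answer);

/* Calls through the public name, the way a user program would. */
int call_answer(void)
{
	return answer();
}
```

Running `nm` on the resulting object file should list `answer` as `W` and `default_answer` as `t`, matching the pattern in the listing above. If another object file in the link provides a non-weak `answer`, a conforming linker uses that definition instead.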
The other `weak_alias` on `__simple_malloc` is the one that broke my loader. This alias accomplishes a different goal. In case the user has not defined their own `malloc`, `default_malloc` will be called, which references `__libc_malloc_impl`. The weak alias on `__simple_malloc` says "If `__libc_malloc_impl` is not defined, then use `__simple_malloc` instead."
The intent of this alias, I believe, is to let users building musl switch which `malloc` implementation musl uses internally and by default. This is corroborated by the `configure --help` prompt:
```
Optional packages:
  --with-malloc=... choose malloc implementation [mallocng]
```
The problem with my linker
I end up finding a weak definition of `malloc`, which resolves to calling `default_malloc`. This is actually the right behavior: there is no strong definition of `malloc` because I never override it in the user program.
However, `__libc_malloc_impl` resolves to the weak `__simple_malloc`, when it should instead resolve to the strong `__libc_malloc_impl` provided by musl libc's `mallocng` implementation:
```
~/Repositories/linker-loader $ nm --defined /home/macoy/Downloads/musl-1.2.3/obj/src/malloc/mallocng/malloc.lo
0000000000000000 t alloc_slot
0000000000000000 r debruijn32.3106
0000000000000000 t enframe
0000000000000000 t get_stride
0000000000000000 T __libc_malloc_impl
0000000000000000 T __malloc_alloc_meta
0000000000000000 T __malloc_allzerop
0000000000000000 T __malloc_atfork
0000000000000000 B __malloc_context
0000000000000004 C __malloc_lock
0000000000000000 R __malloc_size_classes
0000000000000000 r med_cnt_tab
0000000000000000 t queue
0000000000000000 t rdlock
0000000000000000 t size_to_class
0000000000000000 r small_cnt_tab
0000000000000000 t step_seq
0000000000000000 t wrlock
```
Here are the relevant lines adjacent to each other, for comparison:
```
# lite_malloc.lo:
0000000000000000 W __libc_malloc_impl
# malloc.lo:
0000000000000000 T __libc_malloc_impl
```
`W` denotes a weak symbol definition while `T` denotes a strong public/global symbol defined in the text section of the object file.
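In loader code, the same distinction can be read straight from a symbol table entry. A minimal sketch, assuming a 64-bit ELF file and the Linux `<elf.h>` header; the two function names are hypothetical:

```c
#include <elf.h>     /* Elf64_Sym, ELF64_ST_BIND, STB_*, SHN_* (Linux) */
#include <stdbool.h>

/* True when the entry is a weak *definition*: weak binding and a real
 * section index rather than SHN_UNDEF. */
bool symbol_is_weak_definition(const Elf64_Sym *symbol)
{
	return ELF64_ST_BIND(symbol->st_info) == STB_WEAK
	       && symbol->st_shndx != SHN_UNDEF;
}

/* True when the entry is a strong global definition. Common symbols are
 * excluded here because the specification treats them separately. */
bool symbol_is_strong_definition(const Elf64_Sym *symbol)
{
	return ELF64_ST_BIND(symbol->st_info) == STB_GLOBAL
	       && symbol->st_shndx != SHN_UNDEF
	       && symbol->st_shndx != SHN_COMMON;
}
```

The binding lives in the upper four bits of `st_info`, which is what `ELF64_ST_BIND` extracts. A loader that never looks at those bits treats weak and global definitions as interchangeable.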
In the ELF specification (PDF), the issue becomes quite clear (emphasis mine):
> When the link editor combines several relocatable object files, it does not allow multiple definitions of `STB_GLOBAL` symbols with the same name. On the other hand, if a defined global symbol exists, the appearance of a weak symbol with the same name will not cause an error. **The link editor honors the global definition and ignores the weak ones.** Similarly, if a common symbol exists (i.e., a symbol whose `st_shndx` field holds `SHN_COMMON`), the appearance of a weak symbol with the same name will not cause an error. The link editor honors the common definition and ignores the weak ones.
I was resolving `__libc_malloc_impl` to the weak definition in `lite_malloc.lo` instead of the strong definition in `malloc.lo`.
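The rule from the specification can be sketched as a small resolver. This is an illustration under simplifying assumptions, not the code from this linker: the `binding` enum, `definition` struct, and `resolve_symbol` function are hypothetical, common symbols are left out, and keeping the first weak definition when no global one exists is a choice made for this sketch.

```c
#include <stddef.h>
#include <string.h>

enum binding { BINDING_WEAK, BINDING_GLOBAL };

struct definition {
	const char *name;        /* symbol name */
	const char *object_file; /* object file that defines it */
	enum binding binding;
};

/* Pick the definition to honor for `name`. A global definition wins over
 * any weak one. Returns NULL if the symbol is not defined anywhere.
 * *duplicate_global is set to 1 when two global definitions collide,
 * which a real link editor would report as an error. */
const struct definition *resolve_symbol(const struct definition *definitions,
                                        size_t count, const char *name,
                                        int *duplicate_global)
{
	const struct definition *chosen = NULL;
	*duplicate_global = 0;
	for (size_t i = 0; i < count; ++i) {
		const struct definition *candidate = &definitions[i];
		if (strcmp(candidate->name, name) != 0)
			continue;
		if (!chosen) {
			chosen = candidate;
		} else if (candidate->binding == BINDING_GLOBAL) {
			if (chosen->binding == BINDING_GLOBAL)
				*duplicate_global = 1; /* multiple definition */
			else
				chosen = candidate;    /* global overrides weak */
		}
		/* A later weak definition never replaces an earlier choice. */
	}
	return chosen;
}
```

Fed the two `__libc_malloc_impl` entries from the `nm` comparison, this picks the one from `malloc.lo` regardless of the order in which the object files were scanned. The important design point is that resolution cannot stop at the first match: every definition has to be seen before a weak one can be trusted.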
Later, when I try to `free`, I end up referencing the `free` I find in `malloc/free.lo`, which just calls `__libc_free`, which is only defined in `malloc/mallocng/free.c`. If there had instead been a corresponding `__simple_free`, I never would have realized that I was calling `__simple_malloc`.
Of course, I had the TODO item to implement this all properly before I began debugging:
*** TODO Symbol resolution needs to be addressed, especially once I load programs that override "weak" functions
I had put it off not knowing whether it would become an issue. It isn't a straightforward implementation, which is why I didn't do it immediately.
Now, after not doing it and seeing why it is important, I have gained a better understanding of how it is supposed to work.
Takeaways
If you do not implement the specification to a T[^5], you may end up debugging tricky things like this without a debugger or a good idea of what's going wrong.
The upside is that if you persist, you can learn new tools for understanding the problem and investigating the data.
If you're interested in other linker adventures, read about my "linker-loader" project, which talks about why I even bother with all this work.
You can also read the much simpler Know What Your Linker Knows article, where I explain how `objdump` can be useful when debugging link errors.
Around the time I wrote that article was when I started learning more about linkers, which are something you don't really have to think too hard about during regular program development.
[^1]: It does not meet the assumptions required by GDB's `add-symbol-file`, so I couldn't use GDB even if I manually added objects one-by-one, with specified offsets in memory.
[^2]: There must be like, a dozen people in the world who would meet those criteria, right? Right?!
[^3]: I wanted to statically link to musl libc because dynamic linking to e.g. glibc seemed much more complicated, especially because I did not have any dynamic linking support yet in my linker/loader.
[^4]: Yes, I am aware I am calling functions which aren't safe to call in signal handlers. In my case I am not expecting to "ship" this debugger, it's only a means to an end, so it ended up being fine.
[^5]: Why hadn't I? Well, I find implementing things piece-by-piece and testing as I go to result in higher success rates than all-or-nothing pushes. Sometimes it bites you when the missing pieces are essential to the next test. Also, sometimes you're not really sure how to make things until you're halfway down the road making it, because it's unique/experimental and/or you don't fully understand the purpose of the specification.