[Linux Kernel eBPF] Analyzing Negative Index Handling in BPF CO-RE

Background

To study Linux kernel eBPF, I took a look at a related mailing list post and analyzed it.

The Email

You can read the original email here. -> lore.kernel.org

From: Weiming Shi <[email protected]>
Subject: [PATCH bpf] bpf: reject negative CO-RE accessor indices in bpf_core_parse_spec()

...

CO-RE accessor strings are colon-separated indices that describe a path
from a root BTF type to a target field, e.g. "0:1:2" walks through
nested struct members. bpf_core_parse_spec() parses each component with
sscanf("%d"), so negative values like -1 are silently accepted.  The
subsequent bounds checks (access_idx >= btf_vlen(t)) only guard the
upper bound and always pass for negative values. When -1 reaches
btf_member_bit_offset() it gets cast to u32 0xffffffff, producing an
out-of-bounds read far past the members array.

A crafted BPF program with a negative CO-RE accessor on any struct that
exists in vmlinux BTF (e.g. task_struct) crashes the kernel during
BPF_PROG_LOAD:

 BUG: unable to handle page fault for address: ffffed11818b6626
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 7f74e067 P4D 7f74e067 PUD 0
 Oops: Oops: 0000 [#1] SMP KASAN NOPTI
 CPU: 0 UID: 0 PID: 85 Comm: poc Not tainted 7.0.0-rc6 #18 PREEMPT(full)
 Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 RIP: 0010:bpf_core_parse_spec (tools/lib/bpf/relo_core.c:348)
 RAX: 00000000ffffffff RBX: ffff88800c5b3128 RCX: 0000000000000000
 Call Trace:
  <TASK>
  bpf_core_calc_relo_insn (tools/lib/bpf/relo_core.c:1319)
  bpf_core_apply (kernel/bpf/btf.c:9507)
  bpf_check (kernel/bpf/verifier.c:26031)
  bpf_prog_load (kernel/bpf/syscall.c:3089)
  __sys_bpf (kernel/bpf/syscall.c:6228)
  __x64_sys_bpf (kernel/bpf/syscall.c:6339)
  do_syscall_64 (arch/x86/entry/syscall_64.c:94)
  </TASK>

CO-RE accessor indices are inherently non-negative (field index, array
index, or enumerator index), so reject them after parsing.

Fixes: ddc7c3042614 ("libbpf: implement BPF CO-RE offset relocation algorithm")
Reported-by: Xiang Mei <[email protected]>
igned-off-by: Weiming Shi <[email protected]>

They sent the email with the S missing from Signed-off-by at the end.

Analysis of the Email

First, let’s go over what CO-RE accessor strings are.

CO-RE accessor strings

A CO-RE accessor string is a path that records which member of which root type a field access refers to.

To put it simply, it is the access path that records how something like

task->mm->exe_file->f_inode->i_ino

moves from task all the way down to i_ino.

CO-RE (Compile Once - Run Everywhere): a mechanism that lets a compiled BPF program keep working even when the struct layout it was built against differs from the struct layout in the target kernel

These strings are generated by Clang when compiling __builtin_preserve_access_index().

They are later used by libbpf when it compares against the target kernel’s BTF to locate the actual field.

According to the email, they look like "0:1:2".

bpf_core_parse_spec()

This function reads the accessor string contained in a CO-RE relocation and converts it into an internal path representation that libbpf can actually follow.

For example:

"0:1:2" -> bpf_core_parse_spec() -> fills in the data inside bpf_core_spec

The problem is that this function parses each number with sscanf("%d"), which means values like -1 are accepted as well. Since the later bounds checks do not reject negative values, they slip through and eventually cause an out-of-bounds read.
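
The parsing behaviour is easy to reproduce outside libbpf. The following is a minimal standalone sketch, not the libbpf code: parse_accessor is a hypothetical helper that uses the same sscanf("%d%n") pattern to split an accessor string into an int array.

```c
#include <stdio.h>

/* Hypothetical helper (not part of libbpf): split "0:1:2" into out[].
 * Returns the number of parsed components, or -1 on malformed input
 * or when more than max components are present. */
static int parse_accessor(const char *s, int *out, int max)
{
    int n = 0, val, len;

    while (*s) {
        if (*s == ':')
            ++s;
        /* %d accepts a leading '-', so "-1" parses without complaint */
        if (sscanf(s, "%d%n", &val, &len) != 1)
            return -1;
        if (n == max)
            return -1;
        s += len;       /* %n stored how many characters %d consumed */
        out[n++] = val;
    }
    return n;
}
```

Calling parse_accessor("0:-1", out, 8) succeeds with two components, and the second one is -1. Nothing in the parsing step itself treats the minus sign as an error.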

It looks like the author actually tested this with a real negative CO-RE accessor string.

Experiment - Building a BPF Program with a Negative CO-RE accessor string

Plan: build a normal BPF program -> carefully patch the binary -> test it

Building a normal BPF program

First, generate vmlinux.h with bpftool.

bpftool btf dump file /path/to/vmlinux format c > vmlinux.h

I rebuilt the kernel with the BPF- and BTF-related options enabled.

After generating vmlinux.h, I compiled the following C code.

/*
 * task_core.bpf.c
 *
 * Build:
 *   clang -O2 -g -target bpf -D__TARGET_ARCH_x86 \
 *     -I./libbpf/src \
 *     -I. \
 *     -c task_core.bpf.c -o task_core.bpf.o
 */

#include "vmlinux.h"
#include "bpf_helpers.h"
#include "bpf_core_read.h"
#include "bpf_tracing.h"

char LICENSE[] SEC("license") = "GPL";

struct event {
    __u32 pid;
    __u32 tgid;
    __u32 ppid;
    char comm[16];
};


struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 24);
} events SEC(".maps");

SEC("tp_btf/sched_switch")
int BPF_PROG(handle_switch)
{
    struct task_struct *task;
    struct task_struct *parent;
    struct event *e;

    task = (struct task_struct *)bpf_get_current_task_btf();

    e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0;

    e->pid  = BPF_CORE_READ(task, pid);
    e->tgid = BPF_CORE_READ(task, tgid);
    bpf_core_read_str(&e->comm, sizeof(e->comm), task->comm);

    parent = BPF_CORE_READ(task, real_parent);
    e->ppid = parent ? BPF_CORE_READ(parent, tgid) : 0;

    bpf_ringbuf_submit(e, 0);
    return 0;
}

I ran into header issues, so I fixed that by pulling the BPF-related headers from the libbpf Git repository.

The problem here was that in order to patch the binary, I had to figure out exactly which part needed to be turned into a negative value...

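At first I was about to give up because I thought I had no way of knowing that. Then, after some back-and-forth with GPT, I found out that I could simply take an accessor string such as 0:81 in a normal BPF object and change it to 0:-1. Both strings are four characters long, so the patch can overwrite the bytes in place without shifting anything else in the string table.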

# patch.py
from elftools.elf.elffile import ELFFile

with open("task_core.bpf.o", "rb") as f:
    data = bytearray(f.read())

with open("task_core.bpf.o", "rb") as f:
    elf = ELFFile(f)
    btf = elf.get_section_by_name(".BTF")
    if not btf:
        raise RuntimeError(".BTF not found")

    blob = btf.data()
    needle = b"0:81\x00"
    off = blob.find(needle)
    if off < 0:
        raise RuntimeError("accessor string not found")

    file_off = btf['sh_offset'] + off
    repl = b"0:-1\x00"
    data[file_off:file_off+len(repl)] = repl

with open("repro-neg.bpf.o", "wb") as f:
    f.write(data)

// loader.c
// cc -O2 -g loader.c -o loader -lbpf -lelf -lz
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>
#include <bpf/libbpf.h>

static volatile sig_atomic_t stop;

static void on_sigint(int sig)
{
    stop = 1;
}

struct event {
    __u32 pid;
    __u32 tgid;
    __u32 ppid;
    char comm[16];
};


static int libbpf_print_fn(enum libbpf_print_level level,
                           const char *format, va_list args)
{
    return vfprintf(stderr, format, args);
}

static int handle_event(void *ctx, void *data, size_t len)
{
    const struct event *e = data;

    printf("pid=%u tgid=%u ppid=%u comm=%s\n",
           e->pid, e->tgid, e->ppid, e->comm);
    return 0;
}


int main(void)
{
    struct bpf_object *obj = NULL;
    struct bpf_program *prog;
    struct bpf_link *link = NULL;
    struct ring_buffer *rb = NULL;
    int map_fd;
    int err;

    signal(SIGINT, on_sigint);
    signal(SIGTERM, on_sigint);

    libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
    libbpf_set_print(libbpf_print_fn);

    /* To reproduce the crash, open the patched object (repro-neg.bpf.o) here instead. */
    obj = bpf_object__open_file("task_core.bpf.o", NULL);
    if (!obj) {
        fprintf(stderr, "failed to open BPF object\n");
        return 1;
    }

    err = bpf_object__load(obj);
    if (err) {
        fprintf(stderr, "failed to load BPF object: %d\n", err);
        goto cleanup;
    }

    prog = bpf_object__find_program_by_name(obj, "handle_switch");
    if (!prog) {
        fprintf(stderr, "failed to find program: handle_switch\n");
        err = -ENOENT;
        goto cleanup;
    }

    link = bpf_program__attach(prog);
    if (!link) {
        err = -errno;
        fprintf(stderr, "failed to attach program: %d\n", err);
        goto cleanup;
    }

    map_fd = bpf_object__find_map_fd_by_name(obj, "events");
    if (map_fd < 0) {
        err = map_fd;
        fprintf(stderr, "failed to find map fd: %d\n", err);
        goto cleanup;
    }

    /* Consuming the ring buffer is optional, but if you don't read from it, events will just pile up. */
    // rb = ring_buffer__new(map_fd, NULL, NULL, NULL);
    rb = ring_buffer__new(map_fd, handle_event, NULL, NULL);
    if (!rb) {
        err = -errno;
        fprintf(stderr, "failed to create ring buffer: %d\n", err);
        goto cleanup;
    }

    printf("program loaded and attached\n");

    while (!stop) {
        err = ring_buffer__poll(rb, 100 /* ms */);
        if (err == -EINTR)
            break;
        if (err < 0) {
            fprintf(stderr, "ring_buffer__poll failed: %d\n", err);
            goto cleanup;
        }
    }

    err = 0;

cleanup:
    ring_buffer__free(rb);
    bpf_link__destroy(link);
    bpf_object__close(obj);
    return err != 0;
}

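What I got was a crash in libbpf, not in the kernel. At first I assumed that was what the email was about, but when I went back and reread it, it was pretty obvious that the email describes a kernel crash. :( The call trace in the email does go through the same tools/lib/bpf/relo_core.c code, reached from bpf_core_apply() inside the kernel, so the libbpf crash appears to hit the same parsing bug from user space.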

libbpf: sec 'tp_btf/sched_switch': found 4 CO-RE relocations
libbpf: CO-RE relocating [13] struct task_struct: found target candidate [136253] struct task_struct in [vmlinux]
libbpf: prog 'handle_switch': relo #0: <byte_off> [13] struct task_struct.pid (0:80 @ offset 1456)
libbpf: prog 'handle_switch': relo #0: matching candidate #0 <byte_off> [136253] struct task_struct.pid (0:80 @ offset 1456)
libbpf: prog 'handle_switch': relo #0: patched insn #9 (ALU/ALU64) imm 1456 -> 1456
loader[70]: segfault at 55f4b45f4e68 ip 00007f320e693a6d sp 00007ffc5e7211c0 error 4 in libbpf.so.1[42a6d,7f320e65a000+45000]
Segmentation fault

At this point I needed to confirm which function it was actually crashing in, why that code path was the problem, and how it had been fixed. But I couldn’t get gdbserver inside QEMU to connect properly to gdb on the host. :(

After wrestling with GPT for a while, I finally got it connected!!!

gef> bt
#0  0x00007f83fa2ee665 in btf_member_bit_offset (t=0x556b18f35ec0, member_idx=0xffffffff)
    at /home/rand/nomads/kernel-dev/kernel-lab/bpf/libbpf/src/btf.h:627
#1  0x00007f83fa2ef09b in bpf_core_parse_spec (prog_name=0x556b18f2db30 "handle_switch", btf=0x556b18f35d50, relo=0x556b18f3e65c,
    spec=0x7fff5da89e90) at relo_core.c:348
#2  0x00007f83fa2f164b in bpf_core_calc_relo_insn (prog_name=0x556b18f2db30 "handle_switch", relo=0x556b18f3e65c, relo_idx=0x1,
    local_btf=0x556b18f35d50, cands=0x556b18f2dd10, specs_scratch=0x7fff5da89e90, targ_res=0x7fff5da8ae90) at relo_core.c:1319
#3  0x00007f83fa2bc1c0 in bpf_core_resolve_relo (prog=0x556b18f2da10, relo=0x556b18f3e65c, relo_idx=0x1, local_btf=0x556b18f35d50,
    cand_cache=0x556b18f2b5b0, targ_res=0x7fff5da8ae90) at libbpf.c:6022
#4  0x00007f83fa2bc53c in bpf_object__relocate_core (obj=0x556b18f2b310, targ_btf_path=0x0) at libbpf.c:6115
#5  0x00007f83fa2bef77 in bpf_object__relocate (obj=0x556b18f2b310, targ_btf_path=0x0) at libbpf.c:7379
#6  0x00007f83fa2c33c7 in bpf_object_prepare (obj=0x556b18f2b310, target_btf_path=0x0) at libbpf.c:8921
#7  0x00007f83fa2c355b in bpf_object_load (obj=0x556b18f2b310, extra_log_level=0x0, target_btf_path=0x0) at libbpf.c:8960
#8  0x00007f83fa2c36db in bpf_object__load (obj=0x556b18f2b310) at libbpf.c:8996
#9  0x0000556b149de1c4 in main () at repro-loader.c:61
#10 0x00007f83f9fb0b57 in ?? () from target:/lib64/libc.so.6
#11 0x00007f83f9fb0c15 in __libc_start_main () from target:/lib64/libc.so.6
#12 0x0000556b149de3c5 in _start ()

This is the backtrace right after libbpf crashed.

You can see bpf_core_calc_relo_insn there, which also showed up in the email. 😀

Now let’s slowly walk through the code.

Let’s look at the code and see how the bug works

First, let’s follow the backtrace in reverse order.

// repro-loader.c
    err = bpf_object__load(obj);
    if (err) {
        fprintf(stderr, "failed to load BPF object: %d\n", err);
        goto cleanup;
    }

This is the loader code for the manipulated BPF program that I wrote.

It directly calls libbpf’s bpf_object__load.

// libbpf.c
int bpf_object__load(struct bpf_object *obj)
{
	return bpf_object_load(obj, 0, NULL);
}

This simply calls bpf_object_load.

// libbpf.c
static int bpf_object_load(struct bpf_object *obj, int extra_log_level, const char *target_btf_path)
{
    ...

	if (obj->state < OBJ_PREPARED) {
		err = bpf_object_prepare(obj, target_btf_path);
		if (err)
			return libbpf_err(err);
	}
	err = bpf_object__load_progs(obj, extra_log_level);
	err = err ? : bpf_object_init_prog_arrays(obj);
	err = err ? : bpf_object_prepare_struct_ops(obj);

    ...
}

Here, let’s look at bpf_object_prepare.

// libbpf.c
static int bpf_object_prepare(struct bpf_object *obj, const char *target_btf_path)
{   
    ...

	err = err ? : bpf_object__relocate(obj, obj->btf_custom_path ? : target_btf_path);
	err = err ? : bpf_object__sanitize_and_load_btf(obj);

    ...
	
    if (err) {
		bpf_object_unpin(obj);
		bpf_object_unload(obj);
		obj->state = OBJ_LOADED;
		return err;
	}
}

This function preprocesses the BPF object read from ELF and gets it into a state where it is ready to be loaded into the kernel.

One of the key steps here is bpf_object__relocate, which fixes up symbols, offsets, and similar information from compile time so they match the currently running kernel.

// libbpf.c
static int bpf_object__relocate(struct bpf_object *obj, const char *targ_btf_path)
{
    ...

	if (obj->btf_ext) {
		err = bpf_object__relocate_core(obj, targ_btf_path);
		if (err) {
			pr_warn("failed to perform CO-RE relocations: %s\n",
				errstr(err));
			return err;
		}
		bpf_object__sort_relos(obj);
	}

    ...
}

The first thing this function does is perform CO-RE relocation if the object has a .BTF.ext section.

// libbpf.c
static int
bpf_object__relocate_core(struct bpf_object *obj, const char *targ_btf_path)
{
    ...
    
    err = bpf_core_resolve_relo(prog, rec, i, obj->btf, cand_cache, &targ_res);
    if (err) {
        pr_warn("prog '%s': relo #%d: failed to relocate: %s\n",
            prog->name, i, errstr(err));
        goto out;
    }

    err = bpf_core_patch_insn(prog->name, insn, insn_idx, rec, i, &targ_res);
    if (err) {
        pr_warn("prog '%s': relo #%d: failed to patch insn #%u: %s\n",
            prog->name, i, insn_idx, errstr(err));
        goto out;
    }

    ...
}

For each relocation, this function resolves the kernel type, field, or enum that the BPF program references against the target kernel's BTF, and then patches the BPF instruction with the resolved value.

// libbpf.c
static int bpf_core_resolve_relo(...)
{
    ...

	return bpf_core_calc_relo_insn(prog_name, relo, relo_idx, local_btf, cands, specs_scratch,
				       targ_res);
}

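This function looks up the local type for the relocation, gathers matching candidate types from the target BTF, and then hands everything to bpf_core_calc_relo_insn to compute the final relocation result.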

// relo_core.c
int bpf_core_calc_relo_insn(...)
{
    ...

	local_id = relo->type_id;
	local_type = btf_type_by_id(local_btf, local_id);
	local_name = btf__name_by_offset(local_btf, local_type->name_off);
	if (!local_name)
		return -EINVAL;

	err = bpf_core_parse_spec(prog_name, local_btf, relo, local_spec);
	if (err) {
		const char *spec_str;

		spec_str = btf__name_by_offset(local_btf, relo->access_str_off);
		pr_warn("prog '%s': relo #%d: parsing [%d] %s %s + %s failed: %d\n",
			prog_name, relo_idx, local_id, btf_kind_str(local_type),
			str_is_empty(local_name) ? "<anon>" : local_name,
			spec_str ?: "<?>", err);
		return -EINVAL;
	}

    ...
}

This function is the one that determines the final relocation value.

It parses the relocation access string into a spec through bpf_core_parse_spec.

// relo_core.c
/*
 * Turn bpf_core_relo into a low- and high-level spec representation,
 * validating correctness along the way, as well as calculating resulting
 * field bit offset, specified by accessor string. Low-level spec captures
 * every single level of nestedness, including traversing anonymous
 * struct/union members. High-level one only captures semantically meaningful
 * "turning points": named fields and array indicies.
 * E.g., for this case:
 *
 *   struct sample {
 *       int __unimportant;
 *       struct {
 *           int __1;
 *           int __2;
 *           int a[7];
 *       };
 *   };
 *
 *   struct sample *s = ...;
 *
 *   int x = &s->a[3]; // access string = '0:1:2:3'
 *
 * Low-level spec has 1:1 mapping with each element of access string (it's
 * just a parsed access string representation): [0, 1, 2, 3].
 *
 * High-level spec will capture only 3 points:
 *   - initial zero-index access by pointer (&s->... is the same as &s[0]...);
 *   - field 'a' access (corresponds to '2' in low-level spec);
 *   - array element #3 access (corresponds to '3' in low-level spec).
 *
 * Type-based relocations (TYPE_EXISTS/TYPE_MATCHES/TYPE_SIZE,
 * TYPE_ID_LOCAL/TYPE_ID_TARGET) don't capture any field information. Their
 * spec and raw_spec are kept empty.
 *
 * Enum value-based relocations (ENUMVAL_EXISTS/ENUMVAL_VALUE) use access
 * string to specify enumerator's value index that need to be relocated.
 */
int bpf_core_parse_spec(const char *prog_name, const struct btf *btf,
			const struct bpf_core_relo *relo,
			struct bpf_core_spec *spec)
{
	int access_idx, parsed_len, i;
	struct bpf_core_accessor *acc;
	const struct btf_type *t;
	const char *name, *spec_str;
	__u32 id, name_off;
	__s64 sz;

	spec_str = btf__name_by_offset(btf, relo->access_str_off);
	if (str_is_empty(spec_str) || *spec_str == ':')
		return -EINVAL;

	memset(spec, 0, sizeof(*spec));
	spec->btf = btf;
	spec->root_type_id = relo->type_id;
	spec->relo_kind = relo->kind;

	/* type-based relocations don't have a field access string */
	if (core_relo_is_type_based(relo->kind)) {
		if (strcmp(spec_str, "0"))
			return -EINVAL;
		return 0;
	}

	/* parse spec_str="0:1:2:3:4" into array raw_spec=[0, 1, 2, 3, 4] */
	while (*spec_str) {
		if (*spec_str == ':')
			++spec_str;
		if (sscanf(spec_str, "%d%n", &access_idx, &parsed_len) != 1)
			return -EINVAL;
		if (spec->raw_len == BPF_CORE_SPEC_MAX_LEN)
			return -E2BIG;
		spec_str += parsed_len;
		spec->raw_spec[spec->raw_len++] = access_idx;
	}

	if (spec->raw_len == 0)
		return -EINVAL;

	t = skip_mods_and_typedefs(btf, relo->type_id, &id);
	if (!t)
		return -EINVAL;

	access_idx = spec->raw_spec[0];
	acc = &spec->spec[0];
	acc->type_id = id;
	acc->idx = access_idx;
	spec->len++;

	if (core_relo_is_enumval_based(relo->kind)) {
		if (!btf_is_any_enum(t) || spec->raw_len > 1 || access_idx >= btf_vlen(t))
			return -EINVAL;

		/* record enumerator name in a first accessor */
		name_off = btf_is_enum(t) ? btf_enum(t)[access_idx].name_off
					  : btf_enum64(t)[access_idx].name_off;
		acc->name = btf__name_by_offset(btf, name_off);
		return 0;
	}

	if (!core_relo_is_field_based(relo->kind))
		return -EINVAL;

	sz = btf__resolve_size(btf, id);
	if (sz < 0)
		return sz;
	spec->bit_offset = access_idx * sz * 8;

	for (i = 1; i < spec->raw_len; i++) {
		t = skip_mods_and_typedefs(btf, id, &id);
		if (!t)
			return -EINVAL;

		access_idx = spec->raw_spec[i];
		acc = &spec->spec[spec->len];

		if (btf_is_composite(t)) {
			const struct btf_member *m;
			__u32 bit_offset;

			if (access_idx >= btf_vlen(t))
				return -EINVAL;

			bit_offset = btf_member_bit_offset(t, access_idx);
			spec->bit_offset += bit_offset;

			m = btf_members(t) + access_idx;
			if (m->name_off) {
				name = btf__name_by_offset(btf, m->name_off);
				if (str_is_empty(name))
					return -EINVAL;

				acc->type_id = id;
				acc->idx = access_idx;
				acc->name = name;
				spec->len++;
			}

			id = m->type;
		} else if (btf_is_array(t)) {
			const struct btf_array *a = btf_array(t);
			bool flex;

			t = skip_mods_and_typedefs(btf, a->type, &id);
			if (!t)
				return -EINVAL;

			flex = is_flex_arr(btf, acc - 1, a);
			if (!flex && access_idx >= a->nelems)
				return -EINVAL;

			spec->spec[spec->len].type_id = id;
			spec->spec[spec->len].idx = access_idx;
			spec->len++;

			sz = btf__resolve_size(btf, id);
			if (sz < 0)
				return sz;
			spec->bit_offset += access_idx * sz * 8;
		} else {
			pr_warn("prog '%s': relo for [%u] %s (at idx %d) captures type [%d] of unexpected kind %s\n",
				prog_name, relo->type_id, spec_str, i, id, btf_kind_str(t));
			return -EINVAL;
		}
	}

	return 0;
}

This is the key function.

Let’s start with the big picture.

bpf_core_relo is a relocation record that says which type, field, or enum information referenced by a BPF instruction needs to be recalculated at load time so it matches the target kernel.

In other words, it looks at .BTF.ext, checks the current kernel’s BTF, and recalculates the value.


There are three kinds of bpf_core_relo.

  1. field-based relocation: does this field exist, where is it, and how many bytes does it occupy?
  2. type-based relocation: questions about the type itself (does it exist, what is its size, what is its ID?)
  3. enum value-based relocation: does this enumerator exist, and what is its value?

Next, let’s look at the input and output of this function.

Input

  1. btf: type information
  2. relo: the relocation source
  3. spec: the structure where the result is stored

Output

  1. raw_spec: a low-level representation where the access string is parsed into a numeric array
  2. spec[]: a high-level representation that keeps only the meaningful points
  3. bit_offset: the final bit offset

Let’s also look at it visually.

struct sample {
    int __unimportant;
    struct {
        int __1;
        int __2;
        int a[7];
    };
};

struct sample *s = ...;
int x = &s->a[3]; // access string = '0:1:2:3'

That part is the core idea.

If you draw it out, it looks like this:

+---------------------------+   struct sample
| __unimportant             |   field #0
+---------------------------+
| anonymous struct          |   field #1
|   +-------------------+   |
|   | __1               |   |   field #0
|   +-------------------+   |
|   | __2               |   |   field #1
|   +-------------------+   |
|   | a[0]              |   |   field #2
|   | a[1]              |   |
|   | a[2]              |   |
|   | a[3]   <--- here  |   |
|   | ...               |   |
|   +-------------------+   |
+---------------------------+

The access string "0:1:2:3" means:

  • 0 : s[0] (s->... is effectively the same as s[0]...)
  • 1 : field #1 of struct sample -> anonymous struct
  • 2 : field #2 of that anonymous struct -> a
  • 3 : array element a[3]

So:

0:1:2:3
│ │ │ └─ array index 3
│ │ └─── field index 2 (a)
│ └───── field index 1 (anonymous struct)
└─────── root pointer first element

At the low level, that becomes:

raw_spec = [0, 1, 2, 3]

And the high-level form keeps only the meaningful points, like this:

high-level spec = [
  root[0],
  field a,
  array[3]
]
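
The bit offset that the spec should end up with for "0:1:2:3" can be cross-checked with plain C. This is a small sketch under two assumptions: int is 4 bytes, and the compiler supports C11 anonymous structs. The member names are shortened (unimportant, m1, m2) because identifiers starting with "__" are reserved in ordinary C code, and sample_a_bit_offset is a hypothetical helper name.

```c
#include <stddef.h>

/* Same layout as the example struct, with non-reserved member names. */
struct sample {
    int unimportant;
    struct {
        int m1;
        int m2;
        int a[7];
    };                  /* anonymous struct (C11) */
};

/* Bit offset of s->a[idx] relative to the start of *s. */
static size_t sample_a_bit_offset(size_t idx)
{
    return (offsetof(struct sample, a) + idx * sizeof(int)) * 8;
}
```

With 4-byte ints, a starts 12 bytes into struct sample and a[3] is another 12 bytes in, so sample_a_bit_offset(3) gives 24 bytes, or 192 bits.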

To summarize what the function does:

  1. take out the access string

  2. initialize the spec

  3. if it is a type-based relocation, handle it specially

  4. parse "0:1:2:3" into raw_spec[]

  5. normalize the root type by removing typedef/modifier layers

  6. if it is an enum-based relocation, handle it specially

  7. if it is a field-based relocation, walk the types downward

    • access struct/union members
    • access array indices
    • accumulate bit offsets
    • record meaningful accessors
  8. finish the spec
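
The steps above can be sketched on a toy type table. This is not the libbpf implementation: every toy_ name is hypothetical, the table hard-codes the struct sample example with 4-byte ints, and only struct and array steps are modelled. Unlike the code under analysis, the sketch rejects negative indices up front.

```c
#include <stddef.h>

enum toy_kind { TOY_INT, TOY_STRUCT, TOY_ARRAY };

struct toy_member {
    const char *name;   /* NULL for an anonymous member */
    int type;           /* index into toy_types[] */
    int bit_off;        /* offset inside the parent struct, in bits */
};

struct toy_type {
    enum toy_kind kind;
    int size;                           /* size in bytes */
    int nr_members;                     /* TOY_STRUCT only */
    const struct toy_member *members;   /* TOY_STRUCT only */
    int elem_type, nelems;              /* TOY_ARRAY only */
};

static const struct toy_member anon_members[] = {
    { "__1", 0, 0 }, { "__2", 0, 32 }, { "a", 1, 64 },
};
static const struct toy_member sample_members[] = {
    { "__unimportant", 0, 0 }, { NULL, 2, 32 },
};
static const struct toy_type toy_types[] = {
    /* [0] int           */ { TOY_INT,     4, 0, NULL,           0, 0 },
    /* [1] int[7]        */ { TOY_ARRAY,  28, 0, NULL,           0, 7 },
    /* [2] anon struct   */ { TOY_STRUCT, 36, 3, anon_members,   0, 0 },
    /* [3] struct sample */ { TOY_STRUCT, 40, 2, sample_members, 0, 0 },
};

/* Walk raw[] from the root type. Returns the accumulated bit offset, or -1
 * for an invalid path. *high_len receives the number of high-level accessors. */
static int toy_walk(int root, const int *raw, int raw_len, int *high_len)
{
    const struct toy_type *t = &toy_types[root];
    int bit_off, i;

    if (raw_len < 1 || raw[0] < 0)
        return -1;
    bit_off = raw[0] * t->size * 8;     /* root[idx] */
    *high_len = 1;

    for (i = 1; i < raw_len; i++) {
        int idx = raw[i];

        if (idx < 0)                    /* reject negatives before any lookup */
            return -1;
        if (t->kind == TOY_STRUCT) {
            const struct toy_member *m;

            if (idx >= t->nr_members)
                return -1;
            m = &t->members[idx];
            bit_off += m->bit_off;
            if (m->name)                /* anonymous members add no accessor */
                (*high_len)++;
            t = &toy_types[m->type];
        } else if (t->kind == TOY_ARRAY) {
            if (idx >= t->nelems)
                return -1;
            t = &toy_types[t->elem_type];
            bit_off += idx * t->size * 8;
            (*high_len)++;
        } else {
            return -1;                  /* cannot index into a plain int */
        }
    }
    return bit_off;
}
```

Walking {0, 1, 2, 3} from toy type 3 accumulates 0 + 32 + 64 + 96 = 192 bits and records three high-level accessors, matching the root[0], field a, array[3] breakdown above.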


The problem happens at step 4.

Here is the code for step 4:

/* parse spec_str="0:1:2:3:4" into array raw_spec=[0, 1, 2, 3, 4] */
while (*spec_str) {
    if (*spec_str == ':')
        ++spec_str;
    if (sscanf(spec_str, "%d%n", &access_idx, &parsed_len) != 1)
        return -EINVAL;
    if (spec->raw_len == BPF_CORE_SPEC_MAX_LEN)
        return -E2BIG;
    spec_str += parsed_len;
    spec->raw_spec[spec->raw_len++] = access_idx;
}

Variables:

  • spec_str: the string holding something like "0:1:2:3:4"
  • access_idx: temporary storage for each parsed index
  • parsed_len: number of characters consumed by sscanf
  • spec->raw_len: number of entries stored in raw_spec so far
  • spec->raw_spec: the low-level array, such as [0, 1, 2, 3, 4]
  • BPF_CORE_SPEC_MAX_LEN: the maximum number of entries, defined as 64

The issue is that because it uses %d, if spec_str contains something like "0:-1", then a negative value gets stored in spec->raw_spec. (spec->raw_spec is an int array.)

Once a negative value gets in there, you end up with an out-of-bounds access.

Here is why.

  • For enum value relocations, the check only rejects indices that are greater than or equal to btf_vlen(t), the number of enumerators. access_idx is a signed int, so -1 compares as smaller than any count and slips through:

if (!btf_is_any_enum(t) || spec->raw_len > 1 || access_idx >= btf_vlen(t))
    return -EINVAL;

  • For struct/union member accesses, the same upper-bound-only check is used, so the negative value passes again and goes straight into btf_member_bit_offset():

if (access_idx >= btf_vlen(t))
    return -EINVAL;

bit_offset = btf_member_bit_offset(t, access_idx);
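
Why an upper-bound check lets -1 through comes down to C's integer conversion rules. The sketch below uses two hypothetical stand-in functions. It assumes btf_vlen() returns a 16-bit unsigned value, as in libbpf's btf.h. The second function is only there for contrast and shows what the same comparison does against a 32-bit unsigned count.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for "access_idx >= btf_vlen(t)" with a 16-bit count. */
static bool rejected_by_vlen_check(int access_idx, uint16_t vlen)
{
    /* vlen is promoted to int, so this is a signed comparison */
    return access_idx >= vlen;
}

/* Contrast: the same comparison against a 32-bit unsigned count. */
static bool rejected_by_u32_check(int access_idx, uint32_t count)
{
    /* access_idx is converted to unsigned here, so -1 becomes 0xffffffff */
    return access_idx >= count;
}
```

With a 16-bit count, rejected_by_vlen_check(-1, 3) is false, so the index is accepted. Against a 32-bit unsigned count the implicit conversion would reject it. This means that whether a negative index is caught depends entirely on the type on the other side of the comparison, which is a fragile thing to rely on.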

Then in btf_member_bit_offset, the argument type is __u32, so the negative value is converted into a huge unsigned value.

// btf.h
static inline __u32 btf_member_bit_offset(const struct btf_type *t,
					  __u32 member_idx)
{
	const struct btf_member *m = btf_members(t) + member_idx;
	bool kflag = btf_kflag(t);

	return kflag ? BTF_MEMBER_BIT_OFFSET(m->offset) : m->offset;
}
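
  • member_idx is a __u32, so the -1 passed in arrives as 0xffffffff, and btf_members(t) + member_idx points far beyond the end of the members array.
  • Dereferencing m->offset through that pointer reads memory that is not mapped, which is the segmentation fault. The gdb listing below is stopped on that line:
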
624         const struct btf_member *m = btf_members(t) + member_idx;
625         bool kflag = btf_kflag(t);
626   
        // m = 0x00007ffd8d4792b8  ->  0x0000563b994b5ec0
-> 627         return kflag ? BTF_MEMBER_BIT_OFFSET(m->offset) : m->offset;
628   }
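
To get a sense of how far out of bounds that pointer lands, the distance can be computed directly. This is a sketch with hypothetical names (toy_btf_member, member_byte_distance), assuming struct btf_member consists of three 32-bit fields as in the BTF format, which makes it 12 bytes.

```c
#include <stdint.h>

/* Same three 32-bit fields as struct btf_member. */
struct toy_btf_member {
    uint32_t name_off;
    uint32_t type;
    uint32_t offset;
};

/* Byte distance from the start of the members array to members[member_idx]. */
static uint64_t member_byte_distance(uint32_t member_idx)
{
    return (uint64_t)member_idx * sizeof(struct toy_btf_member);
}
```

For the legitimate index 81 the distance is 972 bytes. For (uint32_t)-1 it is 51539607540 bytes, just under 48 GiB past the start of the array, so the read has essentially no chance of landing inside the BTF data.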

The Patch

The patch simply adds a check right after parsing access_idx to reject negative values.

        ++spec_str;
    if (sscanf(spec_str, "%d%n", &access_idx, &parsed_len) != 1)
        return -EINVAL;
+	if (access_idx < 0)
+		return -EINVAL;
    if (spec->raw_len == BPF_CORE_SPEC_MAX_LEN)
        return -E2BIG;
    spec_str += parsed_len;
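
The effect of the added check can be tried in isolation. The following is a standalone sketch rather than the patched libbpf function: parse_accessor_checked and TOY_SPEC_MAX_LEN are hypothetical names, and only the parsing loop is reproduced.

```c
#include <errno.h>
#include <stdio.h>

#define TOY_SPEC_MAX_LEN 64     /* mirrors BPF_CORE_SPEC_MAX_LEN */

/* Parse "0:1:2" into out[] (room for TOY_SPEC_MAX_LEN entries).
 * Returns the number of components or a negative errno value. */
static int parse_accessor_checked(const char *s, int *out)
{
    int n = 0, idx, len;

    while (*s) {
        if (*s == ':')
            ++s;
        if (sscanf(s, "%d%n", &idx, &len) != 1)
            return -EINVAL;
        if (idx < 0)            /* the check the patch adds */
            return -EINVAL;
        if (n == TOY_SPEC_MAX_LEN)
            return -E2BIG;
        s += len;
        out[n++] = idx;
    }
    return n;
}
```

With this version "0:81" still parses into two components, while "0:-1" is rejected with -EINVAL before any index reaches the bounds checks or btf_member_bit_offset().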

Final Thoughts

I think I learned quite a lot about CO-RE while analyzing this bug.

Setting up the environment was not easy.

I also looked through other code to see whether similar bugs existed elsewhere, but I could not find any.

Even if someone managed to exploit this by abusing libbpf’s parsing and somehow getting a shell out of it, in most cases anyone using libbpf is already root anyway, so... it does not seem especially practical as an exploit.