Where Threads Sleep — epoll and the C10K Problem Explained

What no one teaches you about the JavaScript Event Loop — Part 3 of 8

A different puzzle

In Part 2 we left a question hanging. An idle browser tab uses zero CPU. Open your task manager — a tab with nothing happening on its page shows 0% CPU usage. Not 0.1%. Not 0.01%. Zero.

But the tab's main thread still exists. It's in kernel memory as a task_struct. It's not running while(true) — we'd see that in the CPU usage. So what is it doing?

The same question, in a different costume, is the question that defined an entire era of server engineering. It's called the C10K problem: how do you build a server that handles 10,000 concurrent connections?

Most of those 10,000 connections are idle most of the time. They're chat users between messages, browsers waiting for the next API request, IoT devices reporting every few minutes. The server needs to be ready for any of them to send data at any moment.

The naive solution: spawn 10,000 threads, one per connection. Each thread sits in a loop calling read() on its connection, waiting for the user to send something. On Linux, each thread gets 8 MB of stack space by default. 10,000 threads × 8 MB = 80 GB of stack memory. Your server has 16 GB. The math doesn't work.

But there's a deeper question hiding inside the failure. If those 10,000 threads existed, they'd be doing nothing 99% of the time — just waiting. Yet they'd cost 80 GB. Why does "waiting" cost so much memory? What is a waiting thread actually using?

The answer, once you see it, is the same as the answer to "what does the idle tab's main thread do." Both threads are in the same state, in kernel memory, using zero CPU, waiting for something to happen. The C10K problem and the idle-tab problem have the same shape and the same solution.

Once you see what the solution is, you'll see the engineering trick on which every modern server, every browser tab, and every JavaScript runtime is built. The unanswered click from Part 1 — the one that sat somewhere during the freeze — sits in the same place. Same mechanism, different problem.

What you'll hold by the end

After Part 3:

What "sleeping" means mechanically: paused data, not paused activity.
How wait queues work: linked lists attached to resources; waking is code walking the list.
What syscalls are and how user code crosses into the kernel.
What file descriptors are: integer handles to kernel objects, per process.
What blocking I/O looks like at the kernel level — the read() that sleeps.
The C10K problem and why one-thread-per-connection fails.
The multiplexing trick: one thread waiting on N resources at once via epoll.
Why every JavaScript async API (callbacks, Promises, async/await) is a consequence of the architecture this part describes.
The full chain from mouse click to JavaScript handler, finally end-to-end.

What's out of scope:

The event loop algorithm itself (Part 5).
V8 and where JS actually runs in Chrome (Part 4).
Microtasks and Promises (Part 6).
TCP/IP protocol details (mentioned, not deep-dived).
Linux IRQ subsystem internals beyond what we need.

What "sleeping" actually means

The word "sleeping" is misleading. It sounds active — like the thread is doing something, the way a person sleeps. The mental model that comes with the word: the thread is "in" sleep, "being" asleep, waiting.

This is upside down, exactly the way "thread is running" was upside down in Part 2. Let me state the right framing now, before introducing the mechanism:

A sleeping thread is doing nothing. Not "pausing." Not "waiting." Nothing.

It's not on any CPU. It's not executing any instruction. It's not consuming energy beyond the static memory holding its data. Its task_struct sits in kernel memory with its state field set to SLEEPING, its registers frozen at whatever they held when it last ran.

A sleeping thread is the same as a paused movie. The pixels of the paused frame are in memory. The movie is not doing anything. The frame is not "waiting" — it sits.

What makes the thread "wake up" is other code running, finding the thread's struct in a list, and changing its state. The thread doesn't wake itself. Nothing inside the thread does anything during sleep. External code — usually a kernel interrupt handler running on a different CPU core, or kernel code triggered by another thread — sets the sleeping thread's state back to READY and adds it to the scheduler's run queue.

So:

"The thread sleeps" actually means: external code (the kernel) set its state field to SLEEPING and removed it from the run queue.
"The thread wakes up" actually means: external code (an interrupt handler, the kernel itself) set its state field back to READY and put it back in the run queue.

The thread is data. The CPU is the agent. Sleep means "we set this data aside." Wake means "we put this data back in line for the CPU to use again."

This is the same reframe as Part 2, applied to a new state. Hold it; it matters for everything in this part.

Wait queues, mechanically

When a thread sleeps, where does it go? Kernel memory, yes — but specifically, where in kernel memory? How does the wake-up code know which threads to wake when something happens?

The answer is the wait queue.

A wait queue is a linked list of waiter entries. Each waiter entry holds:

A pointer to a task_struct (the sleeping thread)
A wake function (what to do when the resource becomes ready — typically "set this task to READY and add it to the run queue")

Wait queue heads are embedded in kernel objects. Every "thing you might wait for" has its own wait queue:

Sockets have wait queues (for threads waiting on incoming data)
Pipes have wait queues
Mutexes have wait queues
Timers have wait queues
File reads (when data isn't yet in the page cache) park threads in wait queues

From the Linux kernel source, the wait queue head is defined in include/linux/wait.h, roughly:

// Simplified from include/linux/wait.h
struct wait_queue_head {
    spinlock_t            lock;
    struct list_head      head;   // the linked list of waiters
};

struct wait_queue_entry {
    unsigned int          flags;
    void                 *private;       // typically points to task_struct
    wait_queue_func_t     func;          // wake function
    struct list_head      entry;         // linked list node
};

(I'm showing the structural fields. Real definitions have a few more for lock ordering, debugging, etc. The shape is what matters.)

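So a socket with three sleeping threads waiting on it has a wait queue head whose list links three entries, each pointing at one sleeping task_struct:

   socket wait queue head
     → entry 1: private = task_struct of thread A, func = default wake function
     → entry 2: private = task_struct of thread B, func = default wake function
     → entry 3: private = task_struct of thread C, func = default wake function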

When data arrives at the socket, the kernel network code walks this list, calls each waiter's wake function in turn. The default wake function sets the target task_struct's state to READY and adds it to the scheduler's run queue. The scheduler picks them up on subsequent ticks.
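If it helps to see the walk as code, here is a toy user-space model of it. Nothing here is kernel code: the toy_ names and the two-state enum are invented for illustration, and the real wake function also puts the task on the run queue, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>

enum toy_state { TOY_SLEEPING, TOY_READY };

struct toy_task { enum toy_state state; };        /* stand-in for task_struct */

struct toy_waiter {
    void *priv;                                   /* here: a struct toy_task */
    void (*func)(struct toy_waiter *self);        /* wake function */
    struct toy_waiter *next;                      /* linked list node */
};

struct toy_wait_queue { struct toy_waiter *head; };

/* Default wake function: flip the target task's state to READY. */
void toy_default_wake(struct toy_waiter *w) {
    struct toy_task *t = w->priv;
    t->state = TOY_READY;
}

/* "Sleep": mark the task SLEEPING and link a waiter into the queue. */
void toy_add_waiter(struct toy_wait_queue *q, struct toy_waiter *w,
                    struct toy_task *t) {
    assert(q != NULL && w != NULL && t != NULL);
    t->state = TOY_SLEEPING;
    w->priv = t;
    w->func = toy_default_wake;
    w->next = q->head;
    q->head = w;
}

/* "Wake": other code walks the list and calls each waiter's func.
 * Returns how many waiters were visited. */
int toy_wake_up_all(struct toy_wait_queue *q) {
    int visited = 0;
    for (struct toy_waiter *w = q->head; w != NULL; w = w->next) {
        w->func(w);
        visited++;
    }
    return visited;
}
```

Notice that no toy_task ever runs any code of its own. Sleeping and waking are both things done to it by whoever calls toy_add_waiter and toy_wake_up_all.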

Now, hold on. Before you build the wrong picture: in the JavaScript case, you'll almost never see wait queues with more than one entry. A browser tab has one main thread. When that thread blocks on a socket (or, much more commonly, on the tab's main epoll instance — coming soon), it's almost always the only entry in that wait queue. The "N waiters" design exists because this data structure is used by everything in the kernel — multi-threaded C servers, kernel internals, device drivers, mutexes — not just JavaScript runtimes. For our story, mentally simplify: one waiter per queue, almost always.

There's one important exception you'll meet soon: when we get to epoll, the waiter installed on each watched fd's wait queue isn't a thread waiter at all. It's a hook — a waiter whose wake function isn't "wake some thread" but "tell the epoll instance that this fd is ready." You'll see why this matters when we get there. Same data structure, different use case. Just don't be confused when it shows up.

Etymology note. "Wait queue" is a slightly misleading name. The data structure is a linked list with custom wake functions, not strictly FIFO. Whether wake-up walks the list in order, wakes everyone at once, or wakes only one waiter, depends on the wake function and the wake-up call site. The "queue" naming comes from 1970s OS textbook tradition where any list of waiting things was called a queue regardless of access pattern. The name is conventional, not literal.

Source paths for the curious:

include/linux/wait.h — definitions of wait_queue_head and wait_queue_entry
kernel/sched/wait.c — the wake-up logic (__wake_up_common is the workhorse)

Syscalls — how user code talks to the kernel

Before we trace through a real blocking read(), we need to talk briefly about how user code reaches kernel functionality at all.

A syscall (system call) is the mechanism by which user-mode code requests kernel services. The user code can't just call kernel functions — kernel functions live in privileged memory and need the CPU in kernel mode to execute (recall Part 2's discussion of ring 0 vs ring 3). So there's a controlled gateway.

On x86-64, the gateway is a single CPU instruction called syscall. When user code executes this instruction:

The CPU switches from user mode to kernel mode automatically (the privilege check is done in hardware).
The CPU jumps to a fixed address — the syscall entry point — that the kernel registered during boot via a special instruction (wrmsr to set the MSR_LSTAR register).
The kernel reads register RAX to learn which syscall was requested. (Each syscall has a number — read is 0, write is 1, open is 2, etc.)
The kernel reads other registers for arguments (RDI, RSI, RDX, R10, R8, R9 for the first six arguments on Linux).
The kernel dispatches to the appropriate handler function.
When the handler returns, it puts the result in RAX, then executes sysret (the partner instruction). The CPU switches back to user mode, and the user code resumes after the syscall instruction with the return value in RAX.
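You can get close to the bare gateway from C without writing assembly. The sketch below assumes Linux with glibc, whose syscall() function takes a syscall number plus arguments and issues the trap for you. The names raw_getpid and raw_write are mine, not part of any API.

```c
#include <assert.h>
#include <sys/syscall.h>   /* SYS_getpid, SYS_write */
#include <sys/types.h>
#include <unistd.h>        /* syscall(), getpid() */

/* Ask the kernel for our PID by syscall number instead of the getpid() wrapper. */
long raw_getpid(void) {
    return syscall(SYS_getpid);
}

/* Write bytes to an fd by syscall number; returns bytes written or -1. */
long raw_write(int fd, const void *buf, size_t len) {
    assert(buf != NULL);
    return syscall(SYS_write, fd, buf, len);
}
```

The usual write() and getpid() in libc are thin wrappers over the same path, so raw_getpid() should agree with getpid() in the same process.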

There are about 350 syscalls on x86-64 Linux. The full list is in arch/x86/entry/syscalls/syscall_64.tbl in the kernel source. The handful you've heard of: read, write, open, close, socket, connect, accept, epoll_create, epoll_wait, mmap, fork, execve, exit. There are 340 more, most of which you'll never directly use.

The takeaway: every time your code wants to do anything the kernel controls — read a file, send data to a network, allocate memory, sleep — it goes through a syscall. The syscall is the controlled gateway between user and kernel modes.

In Node.js, when you do fs.readFile(path, callback), somewhere deep inside libuv that eventually becomes a read syscall. When you do fetch(url), somewhere deep inside Chrome's network stack that becomes a sendmsg syscall (for TCP send). Every async I/O operation in JavaScript bottoms out in one or more syscalls. You don't see them — they're abstracted six layers deep — but they're there.

Source paths:

arch/x86/entry/entry_64.S — the assembly entry point where syscalls land
include/uapi/asm-generic/unistd.h — generic syscall number definitions
arch/x86/entry/syscalls/syscall_64.tbl — x86-64 specific syscall table

File descriptors — the integer-as-handle pattern

Every syscall that operates on a kernel object (file, socket, pipe, epoll instance) needs to identify which object. The mechanism is the file descriptor.

A file descriptor is just a small integer. That's it — a number like 3, 5, 12, 73. Your user code holds these integers and passes them to syscalls. The kernel uses them as indices.

Each process has a file descriptor table in kernel memory. It's an array of pointers. Each slot either is empty or holds a pointer to a kernel object:

   Process: chrome-tab (PID 5432)
   File descriptor table:

     [0] → pointer to terminal struct        (stdin)
     [1] → pointer to terminal struct        (stdout)
     [2] → pointer to terminal struct        (stderr)
     [3] → pointer to file struct            (opened a config file)
     [4] → pointer to socket struct          (connection to gmail.com)
     [5] → pointer to socket struct          (IPC channel to browser process)
     [6] → pointer to pipe struct
     [7] → pointer to eventpoll struct       (this tab's epoll instance!)
     [8] → NULL                              (free slot)
     ...

When you call socket(), the kernel allocates a struct socket, finds the lowest free slot in your process's fd table, puts the pointer there, and returns the slot number. From then on, your user code uses that number to refer to the socket. Every read(fd, ...), write(fd, ...), close(fd) uses the number to look up the actual struct.
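The "lowest free slot" rule is easy to observe from user space. This is a small sketch assuming a single-threaded process on a Unix-like system with /dev/null available. The function name fd_slot_reuse is made up for the example.

```c
#include <assert.h>
#include <fcntl.h>    /* open, O_RDONLY */
#include <unistd.h>   /* close */

/* Opens /dev/null, closes it, opens it again, and reports both fd numbers.
 * Returns 0 on success, -1 if either open fails. */
int fd_slot_reuse(int *first, int *second) {
    assert(first != NULL && second != NULL);
    *first = open("/dev/null", O_RDONLY);
    if (*first < 0) return -1;
    close(*first);                           /* the slot is free again */
    *second = open("/dev/null", O_RDONLY);   /* kernel hands out the lowest free slot */
    if (*second < 0) return -1;
    close(*second);
    return 0;
}
```

With no other thread opening files in between, both calls should report the same number: the integer is just a slot index, and a closed slot is immediately reusable.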

Three reasons this design exists:

1. Safety. User-mode code must never get a raw pointer into kernel memory. If it did, a malicious or buggy program could corrupt the kernel by stomping that pointer. With file descriptors, user code holds an integer — meaningless without the kernel's translation. The fd table itself lives in kernel memory, unreachable from user space.

2. Uniformity. The same fd table holds pointers to different kinds of kernel objects: files, sockets, pipes, epoll instances, timers, eventfds, signalfds, even some kinds of memory. To user code, they all look the same — an integer. You call read(fd, ...) without caring whether fd points at a regular file or a TCP socket. The kernel dispatches internally to the right read implementation. This is what "everything is a file" in Unix actually means — not literally files, but a uniform integer-handle interface for all kernel objects.

3. Per-process numbering. Each process has its own fd table. Process A's fd 5 and Process B's fd 5 are completely unrelated, pointing to different (or no) kernel structs. The numbers have meaning only within the process.

Limits. The fd table per process is bounded. The default Linux soft limit is 1024 file descriptors per process. You can raise it (ulimit -n in a shell) up to the hard limit, which varies by distribution and service manager; values such as 524,288 or 1,048,576 are common on current systems, and the kernel-wide ceiling (fs.nr_open) defaults to 1,048,576. When a process exceeds its limit, the next syscall that would allocate a new fd (open, socket, accept, etc.) returns -1 with errno = EMFILE (Too many open files). This is one of the most common production errors in long-running services: leak fds, hit the limit, and the server stops accepting connections.

You can see your shell's current limit with:

ulimit -n          # soft limit
ulimit -Hn         # hard limit

And inspect what fds a running process has open:

ls -l /proc/<PID>/fd

Etymology. The name "file descriptor" is historical. Unix originally was a file-oriented system; everything kernel-side was a file. When networking was added in 1980s BSD, sockets reused the same fd mechanism instead of inventing a parallel one. The name stuck even though half of what fds reference now isn't files. Mentally translate "file descriptor" to "kernel object handle."

Sockets — a worked example

We've named sockets a few times. Let's open one up, because the rest of this part traces through what happens during a socket operation.

A socket is a kernel struct. It represents one endpoint of a connection. It holds:

// Heavily simplified — real struct socket plus struct sock has 100+ fields
struct socket {
    /* protocol-agnostic socket fields */
    socket_state           state;
    struct file           *file;        // back-pointer to the fd's file struct
    struct sock           *sk;          // protocol-specific lower half
    wait_queue_head_t      wq;          // threads waiting for socket events
};

struct sock {
    /* protocol-specific stuff (TCP, UDP, etc.) */
    struct sk_buff_head    receive_queue;  // bytes received, waiting to be read
    struct sk_buff_head    write_queue;    // bytes queued for sending
    int                    sk_state;       // TCP state machine (LISTEN, ESTABLISHED, ...)
    /* ... lots more ... */
};

(The actual definitions are in include/linux/net.h and include/net/sock.h.)

The fields that matter for us:

Receive buffer: a list of byte buffers (sk_buffs) holding data that has arrived from the network but hasn't been read yet by user code.
Send buffer: bytes user code has written but the network hasn't yet transmitted.
Wait queue: threads currently blocked on this socket (waiting for incoming data, or for the send buffer to drain, etc.).
TCP state: the protocol state machine — whether we're in LISTEN, ESTABLISHED, CLOSE_WAIT, etc.

When user code calls socket(AF_INET, SOCK_STREAM, 0), the kernel allocates one of these structs, plus an sk (the protocol-specific lower half), wires up the function-pointer tables for TCP, and returns an fd.

When user code calls connect(fd, address), the kernel uses the fd to find the socket, initiates a TCP handshake (sending SYN, waiting for SYN-ACK, sending ACK), and updates sk_state along the way.

When user code calls write(fd, "hello", 5), the bytes are queued in write_queue, and the network code is woken to send them out the network interface.

When user code calls read(fd, buf, 100), the kernel uses the fd to find the socket, then looks in receive_queue. If there are bytes, copy them to buf and return. If not, park the thread (we'll trace this in the next section).

Concrete numbers: a typical laptop has 50-200 active sockets at any moment. You can see them:

ss -t     # TCP sockets, Linux
ss -tu    # TCP + UDP
lsof -i   # all internet sockets, Mac/Linux

Run those on your machine. Chrome alone often holds 30-50 sockets: one or more per tab for HTTP connections, plus connections to Google services, plus IPC channels (which look like Unix domain sockets) between Chrome's various processes.

Etymology. "Socket" is from 1980s BSD. The metaphor was an electrical socket — a named endpoint you "plug into" with a network cable (conceptually). The metaphor is shallow; the mechanism is "a buffered kernel struct with protocol state and a wait queue." When you read "socket" in any networking context, mentally substitute "kernel-side endpoint struct."

A blocking `read()`, step by step

Now we trace what happens when user code calls read(fd, buf, 100) and the socket has no data. This is the load-bearing example. Once you see what happens here, every other "thread sleeps and gets woken" scenario follows the same pattern.

Setup: a user program has opened a TCP socket connected to some server. The socket has fd 5 in our process. We want to read up to 100 bytes from it. The server hasn't sent anything yet, so the receive buffer is empty.

char buf[100];
ssize_t n = read(5, buf, 100);

What the CPU does:

Step 1. User code executes the syscall instruction (with RAX=0 for read, RDI=5 for fd, RSI=&buf, RDX=100). CPU switches to kernel mode, jumps to the syscall entry point.

Step 2. Kernel dispatches based on RAX=0 → sys_read. The function is in fs/read_write.c. Roughly:

// Heavily simplified from fs/read_write.c
ssize_t sys_read(unsigned int fd, char __user *buf, size_t count) {
    struct file *file = fdget(fd);    // look up fd in current process's fd table
    if (!file) return -EBADF;
    
    ssize_t ret = file->f_op->read(file, buf, count, ...);
    
    fdput(file);
    return ret;
}

The kernel uses fd 5 to look up the file struct in our process's fd table. The file struct has a function-pointer table called f_op. For a socket, f_op->read points to the socket-specific read function.

Step 3. For a TCP socket, the read function chains down to tcp_recvmsg in net/ipv4/tcp.c. This is where the interesting thing happens.

tcp_recvmsg checks the socket's receive queue. There's no data. Now the function must do one of two things:

Return -EAGAIN (user code sees read() return -1 with errno = EAGAIN, meaning "try again later") — if the socket is set to non-blocking mode.
Park the thread until data arrives — if the socket is blocking (the default).

We're in blocking mode. So we park.
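Before we park, it is worth seeing the other branch once from user space. The sketch below uses a Unix-domain socketpair as a stand-in for a TCP connection (an assumption made so the example needs no network), flips one end to non-blocking, and reads from it while it is empty. The helper names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>        /* fcntl, F_GETFL, F_SETFL, O_NONBLOCK */
#include <sys/socket.h>   /* socketpair */
#include <unistd.h>       /* read, close */

/* Switch an fd to non-blocking mode. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd) {
    assert(fd >= 0);
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == 0 ? 0 : -1;
}

/* Reads from an empty non-blocking socket and returns the errno it produced.
 * Returns -1 on setup failure, 0 if the read unexpectedly did not fail. */
int errno_from_empty_read(void) {
    int sv[2];
    char buf[100];
    int result = -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return -1;
    if (set_nonblocking(sv[0]) == 0) {
        ssize_t n = read(sv[0], buf, sizeof buf);   /* nothing queued yet */
        result = (n < 0) ? errno : 0;               /* capture before close() */
    }
    close(sv[0]);
    close(sv[1]);
    return result;
}
```

The read comes back immediately with EAGAIN (EWOULDBLOCK is the same value on Linux) instead of sleeping. That is the whole difference between the two modes.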

Step 4. The parking sequence. This is the part you should see in concrete form:

// Simplified from net/ipv4/tcp.c around tcp_recvmsg
DEFINE_WAIT(wait_entry);
add_wait_queue(&sk->sk_wq->wait, &wait_entry);

set_current_state(TASK_INTERRUPTIBLE);       // mark current thread as SLEEPING before checking,
                                             // so a wake-up that lands after the check isn't lost
while (!data_available()) {
    if (signal_pending(current)) {           // check for signals (Ctrl-C, etc.)
        ret = -EINTR;
        break;
    }

    schedule();                              // give up the CPU

    // ⋯ time passes ⋯
    // ⋯ we are not running ⋯
    // ⋯ another thread is on this CPU ⋯
    // ⋯ eventually, someone wakes us up ⋯
    // ⋯ schedule() returns here ⋯

    set_current_state(TASK_INTERRUPTIBLE);   // re-arm before the loop re-checks for data
}

set_current_state(TASK_RUNNING);
remove_wait_queue(&sk->sk_wq->wait, &wait_entry);

// Now copy received bytes from receive_queue to user's buf
// ... and return from the syscall

Walk through each line:

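DEFINE_WAIT(wait_entry): allocate a wait_queue_entry on the kernel stack. Its private field is set to current (the macro for "the currently running task_struct"). Its func is set to autoremove_wake_function, a thin wrapper that calls default_wake_function and then unlinks the entry from the queue once the wake succeeds.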
add_wait_queue(&sk->sk_wq->wait, &wait_entry): link the wait entry into the socket's wait queue. Now if the socket gets data, our entry will be in the list to be walked.
set_current_state(TASK_INTERRUPTIBLE): set our thread's state to SLEEPING (interruptible means signals can wake it; uninterruptible — for things like disk reads — means only the resource being ready wakes it).
schedule(): this is the magic line. We call the scheduler. The scheduler sees that our thread's state is SLEEPING, removes us from the run queue, picks another READY thread, and switches the CPU to that thread. Our thread is now paused, mid-syscall, with everything saved in our task_struct.

At this point, our thread is paused inside sys_read, which is paused inside tcp_recvmsg, which is paused inside schedule(). The CPU is running someone else. Time passes — could be 1 millisecond, could be 10 minutes — and our thread consumes zero CPU during this entire time.

Step 5. Eventually, data arrives at the socket. The wake-up happens (next section traces this). Our task_struct's state is set back to TASK_RUNNING, we're added to the scheduler's run queue, the scheduler eventually picks us, our CPU registers are restored, and schedule() returns — right where we left off. From our perspective, no time passed.

Step 6. The while loop re-checks data_available(). This time, yes, there's data. We exit the loop. We remove_wait_queue to clean up. We copy the bytes to the user's buf. We return from tcp_recvmsg, then from sys_read, then sysret puts us back in user mode at the user code that called read(). User code now has the data.

This is the entire mechanism of a blocking syscall. The thread inserts itself into a wait queue, sets its state to sleeping, calls schedule(), and pauses. Something else wakes it, the scheduler resumes it, schedule() returns, the syscall completes. Same pattern for every blocking I/O operation in every program on the system.
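You can watch the user-space side of this with two processes. In this sketch the parent calls a blocking read() on an empty socket and the child plays the remote peer, writing five bytes about 50 ms later. The socketpair, the delay, and the name blocking_read_demo are all choices made for the example.

```c
#include <assert.h>
#include <sys/socket.h>   /* socketpair */
#include <sys/types.h>
#include <sys/wait.h>     /* waitpid */
#include <unistd.h>       /* fork, read, write, usleep, _exit */

/* Parent blocks in read() until the child writes "hello".
 * Returns the number of bytes read, or -1 on setup failure. */
ssize_t blocking_read_demo(char *out, size_t cap) {
    int sv[2];
    assert(out != NULL && cap >= 5);
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return -1;

    pid_t child = fork();
    if (child < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (child == 0) {                      /* child: the "remote peer" */
        close(sv[0]);
        usleep(50 * 1000);                 /* give the parent time to park */
        ssize_t w = write(sv[1], "hello", 5);
        _exit(w == 5 ? 0 : 1);
    }

    close(sv[1]);
    ssize_t n = read(sv[0], out, cap);     /* parks here: receive queue is empty */
    close(sv[0]);
    waitpid(child, NULL, 0);               /* reap the child */
    return n;
}
```

While the parent sits in read(), a tool like top should show it using no CPU. From the parent's point of view, read() simply took a while to return.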

Who wakes the thread?

We left a gap. "Eventually data arrives at the socket." How does that become "our thread is in state TASK_RUNNING"?

The chain, traced backwards from the wake:

At the bottom: a network packet arrives at the network card (NIC). The NIC asserts an interrupt — its IRQ line goes high. Recall Part 2: the CPU drops what it's doing, looks up the IDT entry for the network IRQ, switches to kernel mode, runs the network driver's interrupt handler.

In the network driver: the handler reads the packet bytes from the NIC's hardware buffer into a kernel sk_buff (a kernel structure representing one packet). It hands the sk_buff to the kernel's networking stack.

In the networking stack: the IP/TCP code identifies which socket this packet belongs to (by matching source/destination IP and port against open sockets). It finds our struct socket. It appends the packet's data to sk->sk_receive_queue. Then — this is the key step — it calls sk->sk_data_ready(sk).

sk_data_ready: by default this calls wake_up_interruptible_sync_poll(&sk->sk_wq->wait, ...). Which walks the socket's wait queue, calling each waiter's func (wake function).

The default wake function: sets the target task_struct's state from TASK_INTERRUPTIBLE to TASK_RUNNING, and adds it to the scheduler's run queue. Done.

Back at the scheduler: on the next scheduling decision, our thread is READY again. The scheduler picks it. Loads its saved registers. schedule() returns. Our tcp_recvmsg while-loop continues from where it paused. We see that data_available() is now true. We exit the loop. We read the data and return up the syscall stack.

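The complete chain, in order:

   packet arrives at the NIC
     → NIC raises its IRQ; the CPU runs the driver's interrupt handler
     → driver copies the packet into an sk_buff and hands it to the networking stack
     → TCP/IP code finds the socket and appends the data to sk->sk_receive_queue
     → sk->sk_data_ready(sk) walks the socket's wait queue
     → the wake function sets our task_struct to TASK_RUNNING and puts it on the run queue
     → the scheduler picks our thread; schedule() returns inside tcp_recvmsg
     → read() copies the bytes to buf and returns to user mode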

This is the full mechanism of "thread sleeps, gets woken by an event." Every blocking I/O in every program follows this pattern.

For a long time I thought "sleeping" was something the thread did — like it pressed pause on itself. The truer picture is that the thread is set aside as data, by other code, and other code wakes it. The thread is no more "sleeping" than a paused movie is — it's just stored. The wake-up is just another piece of code running, finding the stored thread's pointer in a list, and flipping a state bit. There's no agency in the sleep. The agency is in the code that puts threads to sleep and the code that wakes them up.

One thread, many things to wait for — the problem

Now we return to the C10K opening.

The naive solution to handling 10,000 connections: one thread per connection. Each thread sits in read() on its socket. This works — the mechanism we just traced handles it fine. Each thread parks itself in its socket's wait queue. When data arrives, that specific thread wakes.

The cost:

10,000 threads × 8 MB stack each = 80 GB of memory
Context switching overhead: thousands of switches per second as connections become active
Kernel data structure bloat: 10,000 task_structs

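You can buy 80 GB of RAM, but you've spent your budget before doing any actual work. And the 8 MB stacks are mostly empty: each thread is just sitting in read(), not doing recursive work. (Strictly, the 8 MB is reserved address space, and Linux only backs the pages a thread actually touches, so the resident cost is smaller than the headline number. You still pay for a stack, a task_struct, and scheduler work per connection.) We're paying for memory we don't use, just because each thread needs some stack.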
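The arithmetic is worth doing once with real units. This sketch multiplies it out and also looks up the default stack size on your machine, assuming Linux with glibc, where the soft RLIMIT_STACK at program start determines the default stack size of new threads. Both function names are made up.

```c
#include <assert.h>
#include <limits.h>         /* ULLONG_MAX */
#include <sys/resource.h>   /* getrlimit, RLIMIT_STACK */

/* Bytes of stack reserved for `threads` threads of `bytes_per_thread` each. */
unsigned long long stack_reservation(unsigned long long threads,
                                     unsigned long long bytes_per_thread) {
    /* refuse inputs whose product would overflow */
    assert(bytes_per_thread == 0 || threads <= ULLONG_MAX / bytes_per_thread);
    return threads * bytes_per_thread;
}

/* Soft stack limit in bytes; 0 if it is unlimited or cannot be read. */
unsigned long long default_stack_bytes(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0 || rl.rlim_cur == RLIM_INFINITY)
        return 0;
    return (unsigned long long)rl.rlim_cur;
}
```

With 8 MiB stacks, stack_reservation(10000, 8ULL * 1024 * 1024) comes to 83,886,080,000 bytes, which is the "80 GB" above in round numbers.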

Is there a way to use one thread to wait on all 10,000 sockets?

The naive attempt: have one thread loop through 10,000 sockets, calling read() on each in turn. This fails for two reasons:

If you call read() blocking, the thread parks on socket 1's wait queue. It won't even check sockets 2-10,000 until socket 1 has data.
If you call read() non-blocking, the thread is in a busy loop — burning 100% CPU asking each socket "got anything? got anything? got anything?" Battery dies, fan screams.

Neither works.

What you actually want: park the thread somehow such that it sleeps (zero CPU), but wakes when any of the 10,000 sockets has data. Not one socket. Any.

Look back at the wait queue mechanism. A thread is parked by linking a wait entry into a wait queue. There's nothing in the data structure that says you can't link one thread into many wait queues simultaneously. If you did, the first resource to have data would wake the thread.

That's the trick. One thread, many wait queues. The OS gives you a syscall family to do exactly this.

The multiplexing syscalls — a brief history

Several syscalls do "wait for any of these." They were invented at different times, solving the same problem with different APIs:

select (1983, BSD Unix) — the oldest. You pass three bitmaps of fds (read-ready, write-ready, exceptional conditions). The kernel walks all of them, links your thread into each fd's wait queue, and puts you to sleep. When any one fd becomes ready, its wait queue wakes you, and the kernel unlinks you from all the others. You then walk the bitmaps to figure out which fds are ready. O(N) per call. Hardcoded maximum of 1024 fds (FD_SETSIZE) on most systems.
poll (1986, System V Unix) — like select but uses an array of struct pollfd instead of bitmaps. Removes the 1024 fd limit. Still O(N) per call — kernel walks the whole array every time, even if you're watching 10,000 fds and only one is ready.
epoll (2002, Linux) — solves the O(N) problem. You register fds once with epoll_ctl. The kernel installs persistent hooks on each fd's wait queue. When an fd becomes ready, the hook adds the fd to a "ready list" inside the epoll instance. Your epoll_wait call sleeps until the ready list has entries, then returns just those entries. O(1) per ready fd, regardless of how many you're watching.
kqueue (2000, FreeBSD/macOS) — appeared in FreeBSD 4.1, about two years before epoll. Equivalent in spirit; different API. It fills the role of epoll on macOS and the BSDs.
IOCP (Windows) — uses a "completion" model rather than "readiness" — you initiate the operation, then later get notified when it has completed (with the data already filled in). Different mental model, same goal.

Etymology of "epoll": "event poll", the event-driven, scalable successor to poll. The "e" stands for "event", not for a version. The name is conventional rather than illuminating.

For our story, epoll is the one to understand mechanically. Every JavaScript runtime on Linux uses it (or libuv, which wraps it). Node.js's event loop sleeps in epoll_wait. Chrome's renderer process main thread sleeps in epoll_wait (wrapped by Chromium's MessagePump). Everywhere JavaScript runs on Linux, somewhere underneath there is an epoll_wait.

epoll, mechanically

epoll has three syscalls.

epoll_create() — creates an "epoll instance" in the kernel. The kernel allocates an eventpoll struct, puts a pointer to it in the current process's fd table, and returns the fd. Think of the epoll instance as the thread's "mailbox in the kernel": a place where the kernel will deposit information about which fds are ready.

The eventpoll struct, simplified from fs/eventpoll.c:

// Simplified from fs/eventpoll.c
struct eventpoll {
    spinlock_t            lock;
    wait_queue_head_t     wq;       // threads blocked in epoll_wait on this instance
    struct list_head      rdllist;  // the "ready list" — fds currently ready
    struct rb_root_cached rbr;      // the watch set — fds being monitored
    /* ... lots more for poll/race handling ... */
};

Three load-bearing fields:

rbr — the watch set. A red-black tree of fds being monitored, with associated event masks ("watch this fd for read-ready," etc.). Populated by epoll_ctl(ADD).
rdllist — the ready list. A linked list of fds that have become ready since the last epoll_wait. Populated by hooks (next paragraph).
wq — the wait queue of this epoll instance itself. Threads currently blocked in epoll_wait on this instance are parked here.

epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event) — register an fd with this epoll instance. The kernel:

Looks up the epoll instance from epfd.
Adds an entry to the epoll's rbr tree: "watching fd, interested in these events."
Installs a hook on fd's wait queue. This is the crucial mechanism.

The "hook" is a wait_queue_entry whose func is not the default wake function. Instead, it's ep_poll_callback (defined in fs/eventpoll.c). When this hook is "woken" (because the fd becomes ready, and the resource's wake-up code walks its wait queue), the callback doesn't wake any thread directly. Instead, it does:

ep_poll_callback (simplified):
    add this fd to the epoll instance's rdllist
    if any threads are sleeping in the epoll instance's wq:
        wake one (or all, depending on flags)

So the hook captures the fd-ready event and accumulates it in the epoll instance, while optionally also waking any thread blocked in epoll_wait.

epoll_wait(epfd, events_out, max, timeout) — wait for events.

Look up the epoll instance.
If rdllist is non-empty: copy up to max entries into events_out, remove them from rdllist, return the count. Thread never sleeps.
If rdllist is empty: park the current thread in the epoll instance's wq. State TASK_INTERRUPTIBLE. schedule(). Sleep.
(Later, woken by a hook firing because some fd became ready.) Remove from wq. Recheck rdllist. Copy entries to events_out. Return.
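Here are the three calls working together in a small user-space sketch. It assumes Linux, uses epoll_create1 (the current form of epoll_create), and stands in two Unix-domain socketpairs for real connections. The function name epoll_which_ready and the choice to tag each fd with an index in data.u32 are mine.

```c
#include <assert.h>
#include <sys/epoll.h>    /* epoll_create1, epoll_ctl, epoll_wait */
#include <sys/socket.h>   /* socketpair */
#include <unistd.h>       /* write, close */

/* Watches two sockets, makes only one of them readable, and returns the
 * index (0 or 1) that epoll_wait reports as ready. Returns -1 on failure. */
int epoll_which_ready(int make_ready) {
    assert(make_ready == 0 || make_ready == 1);
    int pair[2][2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, pair[0]) != 0) return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, pair[1]) != 0) {
        close(pair[0][0]);
        close(pair[0][1]);
        return -1;
    }

    int result = -1;
    int epfd = epoll_create1(0);                     /* the epoll instance */
    if (epfd >= 0) {
        int ok = 1;
        for (unsigned i = 0; i < 2; i++) {           /* register both read ends */
            struct epoll_event ev = { .events = EPOLLIN };
            ev.data.u32 = i;                         /* our tag, returned on wake */
            if (epoll_ctl(epfd, EPOLL_CTL_ADD, pair[i][0], &ev) != 0) ok = 0;
        }
        if (ok && write(pair[make_ready][1], "x", 1) == 1) {
            struct epoll_event out[2];
            int n = epoll_wait(epfd, out, 2, 1000);  /* returns only ready entries */
            if (n == 1) result = (int)out[0].data.u32;
        }
        close(epfd);
    }
    for (int i = 0; i < 2; i++) {
        close(pair[i][0]);
        close(pair[i][1]);
    }
    return result;
}
```

Only the socket that received a byte shows up in the output array. The idle one costs nothing at wake time, which is the point of the ready list.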

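Putting the pieces together: the epoll instance holds the watch set (rbr), the ready list (rdllist), and its own wait queue (wq), which is where our thread parks. Each watched fd's wait queue holds one epoll hook that points back at the instance.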

When data arrives at, say, the gmail socket (fd 5): the socket's sk_data_ready walks its wait queue. It finds the hook. It calls the hook's wake function. The wake function adds fd 5 to epoll instance #9's rdllist and tries to wake any thread in epoll's wq. If our thread is parked in epoll_wait, it wakes up.

When our thread wakes from epoll_wait, it returns the contents of rdllist: "fd 5 is ready, with these events." Our code then knows: I should read(5). We read. We process. We loop back to epoll_wait.

This is the trick. One thread. Three fds (or three thousand fds, doesn't matter). The thread sleeps in epoll_wait, costing no CPU. When any registered fd is ready, the thread wakes with information about which. No busy-polling. No 10,000 threads. One sleep, many possible wakes.
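Here is a small sketch of "one sleep, many possible wakes" using Python's selectors module (epoll-backed on Linux). The three labels are hypothetical stand-ins for three connections; only "gmail" echoes the example above, the other two are made up.

```python
import selectors
import socket

sel = selectors.DefaultSelector()
names = ["gmail", "chat", "news"]          # hypothetical connection labels

pairs = {}
for name in names:
    local, remote = socket.socketpair()
    local.setblocking(False)
    sel.register(local, selectors.EVENT_READ, data=name)
    pairs[name] = (local, remote)

# Only one of the three peers sends anything.
pairs["gmail"][1].send(b"new mail")

# One wait covers all three fds; the result says which one is ready.
ready = [key.data for key, _mask in sel.select(timeout=1)]
payload = pairs["gmail"][0].recv(1024)

for local, remote in pairs.values():
    sel.unregister(local)
    local.close()
    remote.close()
sel.close()
print(ready, payload)
```

The loop over names could just as well register thousands of sockets. The waiting code would not change.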

The wait-queue exception I foreshadowed. Recall I said earlier that in the JavaScript case, wait queues almost always have one waiter — but with an exception in epoll. Look at the diagram: each socket's wait queue contains a hook, not our thread directly. The hook is a waiter — it's a wait_queue_entry like any other — but its private field doesn't point at our task_struct. Its func isn't the default wake function. It's the epoll callback. Our thread, meanwhile, is parked in the epoll instance's own wq, not in any socket's wq. So when you look at a socket's wait queue in a JavaScript process, you see one entry — the epoll hook. That's the exception. The data structure is the same; the use case is different.

Source: fs/eventpoll.c — the entire epoll implementation. Reading the actual file is informative if you're comfortable with Linux kernel C. ep_poll_callback is the wake function I described.

Why this is the architecture every event loop is built on

We can now state the architectural pattern cleanly.

The pattern:

The thread owns an epoll instance.
Whenever the thread needs to "wait on something," it registers an fd with the epoll instance via epoll_ctl.
When the thread has nothing to do, it calls epoll_wait.
The thread sleeps. Zero CPU. Until any registered fd is ready.
The thread wakes with a list of ready fds. For each, it does whatever it needs to: read data, call a callback, dispatch an event.
After handling, the thread loops back to epoll_wait.
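The six steps above fit in a few lines. What follows is a minimal sketch, not how libuv or Chromium actually structure their loops: TinyLoop and its method names are invented here, and Python's selectors module stands in for direct epoll calls.

```python
import selectors
import socket


class TinyLoop:
    """A minimal event loop: register fds with callbacks, sleep, dispatch."""

    def __init__(self):
        self.sel = selectors.DefaultSelector()   # the loop owns one selector

    def add_reader(self, sock, callback):
        sock.setblocking(False)
        self.sel.register(sock, selectors.EVENT_READ, data=callback)

    def remove_reader(self, sock):
        self.sel.unregister(sock)

    def run(self):
        # Keep going while anything is still registered.
        while self.sel.get_map():
            for key, _mask in self.sel.select():  # sleep until something is ready
                key.data(key.fileobj)             # dispatch the callback


loop = TinyLoop()
received = []
a, b = socket.socketpair()


def on_readable(sock):
    data = sock.recv(1024)
    received.append(data)
    if data == b"first":
        b.send(b"second")            # more work arrives; the loop wakes again
    else:
        loop.remove_reader(sock)     # nothing left to watch, so run() returns


loop.add_reader(a, on_readable)
b.send(b"first")
loop.run()
a.close()
b.close()
print(received)
```

Everything a real runtime adds (timers, task queues, microtasks) is layered on top of this sleep-then-dispatch core.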

This is the architectural spine of every modern event-driven program:

Node.js's main thread does this (via libuv, which wraps epoll/kqueue/IOCP).
Chrome's renderer process main thread does this (via Chromium's MessagePump).
nginx, redis, memcached, every high-performance server — same pattern.
Tokio (Rust), asyncio (Python), Vert.x (Java) — same pattern.

The JavaScript event loop is one specific implementation of this pattern, with one specific thread, one specific epoll instance, and JavaScript-level queues layered on top. We'll build the JavaScript layer in Parts 5 and 6.

But — and this is the cost — this pattern has serious trade-offs. They're the reason JavaScript looks the way it does as a language.

The trade-offs

Trade-off 1: No CPU parallelism. One thread, one core max. A callback that runs for 500ms freezes everything else for 500ms. There's no other thread to handle the next click during that time. This is why "never block the event loop" is the single most important performance rule in Node.js and the browser.

Trade-off 2: Forced asynchronous programming. You cannot write blocking code. If you call read() directly (blocking), the entire event loop freezes for the duration. So all I/O must be asynchronous. The thread initiates the operation, registers the fd with epoll, returns to the event loop. Later, when the fd is ready, a callback fires.

This is why JavaScript has callbacks, Promises, and async/await. They're not language features chosen for elegance. They are the required adaptation to a runtime where the main thread cannot block. If JavaScript's runtime model used "one thread per request" (the way classic PHP-on-Apache did), you wouldn't need any of these patterns — you'd just write let x = read(fd) and the thread would block, and other threads would handle other requests.

But every JavaScript host chose the single-threaded multiplexed event-loop model, because the alternative (multiple threads touching the DOM, or shared memory between request threads) brings race conditions. The architectural choice forces the language to provide async patterns. Callbacks → Promises → async/await is the evolution of how to make this constraint less painful for developers.
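The same constraint shows up in any runtime built on this pattern. As an illustration, here is a sketch in Python's asyncio (one of the runtimes listed earlier), where await plays the same role as in JavaScript: it hands control back to the event loop instead of blocking the thread. The coroutine name wait_for_message and the socket pairs are made up for the demo.

```python
import asyncio
import socket


async def wait_for_message(loop, sock, name, log):
    log.append(f"{name}: waiting")
    data = await loop.sock_recv(sock, 1024)   # yields to the loop, does not block
    log.append(f"{name}: got {data.decode()}")


async def main():
    loop = asyncio.get_running_loop()
    log = []
    a_local, a_remote = socket.socketpair()
    b_local, b_remote = socket.socketpair()
    socks = (a_local, a_remote, b_local, b_remote)
    for s in socks:
        s.setblocking(False)

    # Both waits are in flight at once, on one thread.
    waiter_a = asyncio.create_task(wait_for_message(loop, a_local, "a", log))
    waiter_b = asyncio.create_task(wait_for_message(loop, b_local, "b", log))
    await asyncio.sleep(0)                    # let both tasks start and suspend

    # Answer the second waiter first: neither wait blocks the other.
    await loop.sock_sendall(b_remote, b"world")
    await waiter_b
    await loop.sock_sendall(a_remote, b"hello")
    await waiter_a

    for s in socks:
        s.close()
    return log


log = asyncio.run(main())
print(log)
```

With a blocking recv in place of the await, the first waiter would hold the thread and the second message could never be handled out of order.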

Trade-off 3: Head-of-line blocking. Callbacks run serially on the one thread. If you have 1000 fds become ready at once, your code processes them in some order. The 1000th callback waits for the previous 999 to finish.
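The arithmetic is easy to check. The sketch below assumes, purely for illustration, that each callback takes 1 ms and that nothing else interrupts the thread; start_times_ms is a hypothetical helper, not part of any runtime.

```python
from itertools import accumulate


def start_times_ms(costs_ms):
    """When each callback starts if they run back to back on one thread."""
    return list(accumulate([0, *costs_ms]))[:-1]


# 1000 fds become ready at once; assume each callback takes 1 ms.
uniform = start_times_ms([1] * 1000)

# One slow callback at the front delays everyone behind it.
slow_first = start_times_ms([500, 1, 1])

print(uniform[-1], slow_first)
```

The last of the 1000 callbacks starts 999 ms after the first, and a single 500 ms callback pushes every later one back by that amount.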

Trade-off 4: Harder debugging. A stack trace shows only the current callback, not the chain of "what triggered this." Modern runtimes have added async stack traces to mitigate this, but they're imperfect.

Trade-off 5: One bug can kill the whole process. An uncaught error in a callback can take down the event loop and the entire process unless explicitly handled at the right boundary.

Trade-off 6: No help for CPU-bound work. Multiplexing helps when threads are waiting (which they are, most of the time). It doesn't help when they're computing. If your bottleneck is CPU rather than I/O, you need actual parallelism — Web Workers in the browser, worker_threads in Node. These spawn real OS threads.

This is the moment I finally understood why JavaScript has so many async patterns. Once I saw the constraint (a main thread that cannot block), every async API in JavaScript started making sense.

The mouse-click chain, finally

We have all the machinery now. We can trace the click puzzle from Part 1 end to end.

Setup: a browser tab has an event listener registered for clicks. The tab is idle — no JavaScript is running. The main thread is parked in epoll_wait. The user clicks.

T0:  User physically clicks the mouse.
     The mouse hardware sends a signal to the mouse controller (USB or Bluetooth).

T1:  Mouse controller asserts an IRQ.
     CPU jumps to IDT[mouse-IRQ] → kernel mode.
     Kernel runs the mouse driver's interrupt handler.

T2:  Driver reads click coordinates from the mouse controller's hardware.
     Hands them to the OS's input subsystem.

T3:  OS input subsystem forwards to the window manager
     (X11 / Wayland on Linux, WindowServer on macOS,
      Win32 input system on Windows).
     Window manager identifies which window is focused or under the cursor.
     If it's Chrome: forward the click event to Chrome's browser process.

T4:  Chrome browser process (the main process, not a tab):
     Receives the click. Determines which tab it belongs to.
     Sends an IPC message: "click at (x, y) in your viewport"
     to the target tab's renderer process.

T5:  Tab process. The IPC channel is a Unix domain socket (Linux) or 
     equivalent. The IPC fd is registered with the tab's epoll instance.
     When the browser process wrote to the IPC, bytes arrived at the
     IPC socket's receive buffer. The socket's wait queue was walked.
     It contained an epoll hook. The hook fired:
     fd added to rdllist; wake any thread in epoll's wq.

T6:  The tab's main thread was sleeping in epoll_wait.
     Its state was TASK_INTERRUPTIBLE.
     The hook's wake-up changes it to TASK_RUNNING and adds it to the run queue.

T7:  Scheduler picks the main thread on the next opportunity.
     Loads its saved registers. schedule() returns inside epoll_wait.
     epoll_wait returns with rdllist contents: "IPC fd is ready."

T8:  Tab's main thread is now back in Chrome's C++ event loop code.
     It reads from the IPC channel. The bytes describe the click event.

T9:  Chrome's C++ event loop:
     Walks the tab's DOM tree to find the element at (x, y).
     Looks up event listeners on that element (and ancestors, for bubbling).
     Finds our JavaScript onClick handler.
     Constructs a click Event object.

T10: Queue a task on the user-interaction task source:
     "Dispatch click event with this handler chain."

T11: Event loop picks this task on the next iteration.
     V8 is invoked. V8 builds a call stack. JS handler runs.
     User sees the alert (or whatever the handler does).

That's the full chain. Every step is a piece of code running on the CPU. Nothing is magic. The click physically arrives at the computer (T0-T1), traverses kernel and userland chains (T2-T5), eventually causes a wait queue to be walked (T5), which moves the main thread from sleeping to ready (T6), which lets the scheduler resume it (T7), which lets epoll_wait return (T7), which lets Chrome's C++ process the event and queue a task (T8-T10), which the event loop finally picks up to run the JS callback (T11).

Compressed into one sentence: a click is a chain of code triggered by a hardware interrupt, that eventually appends an entry to a list in the kernel, that causes a thread to be marked READY, that lets the scheduler resume it, that lets it return from a syscall, that lets it queue a JavaScript task, that the event loop will pick up on its next iteration.

Now go back to Part 1's puzzle. The original while(true){} froze the page. The click physically happened during the freeze — the chain above ran, all the way through step T5 (the IPC arrived, the wait queue was walked). But our main thread wasn't sleeping in epoll_wait. It was in step T11 of an earlier iteration — running JavaScript, executing while(true). The wait queue's wake-up looked for sleeping threads to wake. There were none. The fd went into rdllist. And sat there.

That's where the click was. In the rdllist of the tab's epoll instance. Waiting for the main thread to finish whatever it was doing and call epoll_wait again. Since while(true) never finishes, that call never came. The click sat in rdllist forever — or until you killed the tab and the kernel freed the epoll instance, taking the rdllist with it.

In our second puzzle (the 5-second freeze with setTimeout at the end), the click sat in rdllist for the duration of the freeze. The moment the while-loop exited and the script ended, the event loop returned to epoll_wait, which immediately returned because rdllist already had the IPC event. Chrome processed the event, queued the JS task, picked it up, ran the listener, set clicked = true. By the time setTimeout fired its callback, the click had been fully processed — that's why the second console.log showed true.

This is what "the click was somewhere all along" means. It was an entry in the rdllist linked list inside the kernel's epoll struct for our tab, pointing at an IPC socket whose receive buffer held the few bytes describing the click. Nothing more. Nothing magic.
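The "event waits until the thread comes back" behaviour can be reproduced in miniature. In this sketch a socket pair stands in for the tab's IPC channel and a 50 ms busy loop stands in for the frozen script; both are assumptions made for the demo, and the "click" is sent just before the busy period to keep the example deterministic.

```python
import selectors
import socket
import time

sel = selectors.DefaultSelector()              # epoll-backed on Linux
tab_end, browser_end = socket.socketpair()     # stands in for the IPC channel
tab_end.setblocking(False)
sel.register(tab_end, selectors.EVENT_READ)

browser_end.send(b"click")                     # the "click" arrives

deadline = time.monotonic() + 0.05
while time.monotonic() < deadline:             # stand-in for the long-running script
    pass                                       # the thread never asks what is ready

# Back at the wait call: the event is already pending,
# so even a zero timeout reports it immediately.
events = sel.select(timeout=0)
message = tab_end.recv(1024)

sel.unregister(tab_end)
sel.close()
tab_end.close()
browser_end.close()
print(len(events), message)
```

Nothing was lost during the busy period. The event was simply not looked at until the thread returned to the wait call.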

What you should hold from Part 3

Three things:

Sleeping is passive. A sleeping thread is data in a wait queue, with its state field set to a sleeping state (TASK_INTERRUPTIBLE in the cases we traced). Nothing inside the thread runs. External code (interrupt handlers, kernel networking code, etc.) wakes the thread by changing its state and putting it back in the run queue.
Multiplexing is the engineering trick. One thread can be linked into many wait queues — directly or via an epoll hook. epoll_wait lets one thread sleep on N resources at once, costing one thread and one struct. This is the foundation of every event-driven runtime: the JavaScript event loop, Node.js's libuv, every modern server.
JavaScript's async patterns are consequences. Callbacks, Promises, async/await exist because the JavaScript runtime uses the multiplexed event-loop architecture. The main thread cannot block. Every async API is a way to write code that returns control to the event loop while waiting for something. The patterns evolved (callbacks → Promises → async/await) but the constraint they're solving has been the same since 1995.

The click puzzle is now fully resolved. But we still don't have a complete picture of the JavaScript event loop itself. We know the main thread sleeps in epoll_wait. We know events get queued. What we don't know yet is:

Where exactly does JavaScript run — is it directly on the main thread, or somewhere else?
What's V8's role, and how does it relate to Chrome's C++ event loop code?
What are these "tasks" we keep mentioning, structurally?

Part 4 maps the territory inside a Chrome tab. V8 inside the renderer process. The call stack as V8 frames. The relationship between Chrome's C++ event loop and JavaScript's execution. After that, Part 5 builds the event loop algorithm proper.

Next: Part 4 — Where JavaScript Actually Runs.
