I'm reminded of Raymond Chen's many, many blog posts[1][2][3] (there are a lot more) on why TerminateThread is a bad idea. Not surprised at all that the same is true elsewhere. I will say that in my own code this is why I tend to prefer cancellable system calls that are alertable. That way the thread can wake up, check if it needs to die, and then GTFO.
[1] https://devblogs.microsoft.com/oldnewthing/20150814-00/?p=91...
[2] https://devblogs.microsoft.com/oldnewthing/20191101-00/?p=10...
[3] https://devblogs.microsoft.com/oldnewthing/20140808-00/?p=29...
there are a lot more, I'm not linking them all here.
One of my more annoying gotchas on Windows is that despite this advice sounding very reasonable, the runtime itself (I believe it actually happens in the kernel) essentially calls TerminateThread on all child threads before running global destructors and atexit hooks. Good luck following this advice when the kernel actively fights you when it comes time to shut down.
So there is a reason that, per the C++ spec, if a std::thread is still joinable when its destructor is called, the destructor calls std::terminate[1]. That reason is exactly this case. If the house is being torn down, it's not safe to try to save the curtains[2]. Just let the house get torn down as quickly as possible. If you wanted to save the curtains (e.g. do things on the threads before they exit), you need to do it before the end of main, and thus before global destructors start getting called.
[1] https://en.cppreference.com/w/cpp/thread/thread/~thread.html
[2] https://devblogs.microsoft.com/oldnewthing/20120105-00/?p=86...
When you return from main(), there shouldn't be any child threads running in the first place. Join your threads and you will be fine.
Global destructors and atexit handlers are called by the C/C++ runtime; Windows has nothing to do with that. The C and C++ specs require that returning from main() has the same effect of ending the process as exit() does, meaning they can't allow any still-running threads to continue. Given these constraints, would you prefer the threads to keep running until after global destructors and atexit handlers have run? That would be at least as likely to wreak havoc. No, in C/C++, you need to make sure that other threads are no longer running before returning from main().
> Well, since thread cancellation is implemented using exceptions, and thread cancellation can happen in arbitrary places
No, thread cancelation cannot happen in arbitrary places. Or at least, it doesn't have to.
There are two kinds of cancelation: asynchronous and deferred.
POSIX provides an API to configure this for a thread, dynamically: pthread_setcanceltype.
Furthermore, cancelation can be enabled and disabled also:

    int pthread_setcancelstate(int state, int *oldstate); // PTHREAD_CANCEL_ENABLE, PTHREAD_CANCEL_DISABLE
    int pthread_setcanceltype(int type, int *oldtype); // PTHREAD_CANCEL_DEFERRED, PTHREAD_CANCEL_ASYNCHRONOUS

Needless to say, a thread would only turn on asynchronous cancelation over some code where it is safe to do so, where it won't be caught in the middle of allocating resources, or manipulating data structures that will be in a bad state, and such.

I talk about the cancelability state and how it can help us shortly after that statement: https://mazzo.li/posts/stopping-linux-threads.html#controlle... . In hindsight I should have made a forward reference to that section when talking about C++. My broad point was that combining C++ exceptions and thread cancellation is fraught with danger and imo best avoided.
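Mechanically, the guard described a couple of comments up looks something like this (a sketch; error checking omitted):

    #include <pthread.h>

    void update_shared_state(void) {
        int old;
        pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &old);
        /* allocate resources, take locks, mutate shared data... */
        pthread_setcancelstate(old, NULL);
        /* a cancel requested meanwhile takes effect at the next cancellation point */
    }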
I regret to be informed that they still haven't figured this out. I was very active in threading on Linux over 20 years ago, working on glibc and whatnot. That was before C++ had threads, needless to say. There was a time when cancelation didn't do C++ unwinding, only the PTHREAD_CLEANUP_PUSH handlers. So of course cancellation and C++ exceptions were a bad cocktail then.
This article does a nice job of explaining why pthread cancellation is hopeless.
> If we could know that no signal handler is ran between the flag check and the syscall, then we’d be safe.
If you're willing to write assembly, you can accomplish this without rseq. I got it working many years ago on a bunch of platforms. [1] It's similar to what they did in this article: define a "critical region" between the initial flag check and the actual syscall. If the signal happens here, ensure the instruction pointer gets adjusted in such a way that the syscall is bypassed and EINTR returned immediately. But it doesn't need any special kernel support that's Linux-only and didn't exist at the time, just async signal handlers.
(rseq is a very cool facility, btw, just not necessary for this.)
[1] Here's the Linux/x86_64 syscall wrapper: https://github.com/scottlamb/sigsafe/blob/master/src/x86_64-... and the signal handler: https://github.com/scottlamb/sigsafe/blob/master/src/x86_64-...
For interrupting long-running syscalls there is another solution:
Install an empty SIGINT signal handler (without SA_RESTART), then run the loop.
When the thread should stop:
* Set stop flag
* Send a SIGINT to the thread, using pthread_kill or tgkill
* Syscalls will fail with EINTR
* Check for EINTR & the stop flag, then we know we have to clean up and stop (sketched below)
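A minimal sketch of the above (fd, worker_tid, and the function names are placeholders):

    #include <errno.h>
    #include <pthread.h>
    #include <signal.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <string.h>
    #include <unistd.h>

    static atomic_bool stop;
    static void on_sigint(int sig) { (void)sig; }  /* empty on purpose: its only job is to
                                                      make blocking syscalls return EINTR */

    static void *worker(void *arg) {
        int fd = *(int *)arg;
        char buf[4096];
        for (;;) {
            ssize_t n = read(fd, buf, sizeof buf);
            if (n < 0 && errno == EINTR && atomic_load(&stop))
                break;                             /* stop requested: clean up and exit */
            /* ... handle data and real errors ... */
        }
        return NULL;
    }

    void install_handler(void) {                   /* once, before starting the worker */
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_sigint;                 /* deliberately no SA_RESTART */
        sigemptyset(&sa.sa_mask);
        sigaction(SIGINT, &sa, NULL);
    }

    void request_stop(pthread_t worker_tid) {
        atomic_store(&stop, true);                 /* set the flag first... */
        pthread_kill(worker_tid, SIGINT);          /* ...then interrupt the syscall */
    }

(This still has the flag-check-vs-syscall window the post discusses.)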
Of course a lot of code will just retry on EINTR, so that requires having control over all the code that does syscalls, which isn't really feasible when using any libraries.
EDIT: The post describes exactly this method, and what the problem with it is, I just missed it.
This option is described in detail in the blog posts, with its associated problems, see this section: https://mazzo.li/posts/stopping-linux-threads.html#homegrown... .
Ah, fair, I missed it when reading the post because the approach seemed more complicated.
I mean that's the reason EINTR exists at all.
Hi Francesco :)
I think a good shorthand for this stuff is
> One can either preemptively or cooperatively schedule threads, and one can also either preemptively or cooperatively cancel processes, but one can only cooperatively cancel threads.
If you can swing it (don't need to block on IO indefinitely), I'd suggest just the simple coordination model.
* Some atomic bool controls if the thread should stop or not;
* The thread doesn't make any unbounded wait syscalls;
* And the thread uses pthread_cond_wait (or equivalent C++ std wrappers) in place of sleeping while idle.

To kill the thread, set the stop flag and cond_signal the condvar. (Under the hood on Linux, this uses futex.)

> To kill the thread, set the stop flag and cond_signal the condvar
This is a race condition. When you "spin" on a condition variable, the stop flag you check must be guarded by the same mutex you give to cond_wait.
See this article for a thorough explanation:
https://zeux.io/2024/03/23/condvars-atomic/
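Concretely, the race-free shape is (a sketch; the flag needn't be atomic here because the mutex already orders it):

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
    static bool stop = false;            /* guarded by mu */

    static void *worker(void *arg) {
        (void)arg;
        pthread_mutex_lock(&mu);
        while (!stop)                    /* real code: while (!stop && !work_available) */
            pthread_cond_wait(&cv, &mu);
        pthread_mutex_unlock(&mu);
        return NULL;                     /* clean up and exit */
    }

    void request_stop(void) {
        pthread_mutex_lock(&mu);         /* holding mu closes the "checked the flag but
                                            not yet waiting" window */
        stop = true;
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&mu);
    }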
The tricky part is really point 2 there; that can be harder than it looks (e.g. even simple file I/O can hit network drives). Async IO can really shine here, though it's not exactly trivial designing async cancellation either.
Relying heavily on a check for an atomic bool is prone to race conditions. I think it's cleaner to structure the event loop as a message queue and have a queued message that indicates it's time to stop.
Queuing a stop means you have to process the queue before stopping. Which certainly is stopping cleanly, but if you wanted to stop the thread because its queue was too long and the work requests were stale, it doesn't help much.
You could maybe allow a queue skipping feature to be used for stop messages... But if it's only for stop messages, set an atomic bool stop, then send a stop message. If the thread just misses the stop bool and waits for messages, you'll get the stop message; if the queue is large, you'll get the stop bool.
ps, hi
> Relying heavily on a check for an atomic bool is prone to race conditions.
It is not, actually. This extremely simple protocol is race-free.
Calling pthread_cond_signal without acquiring the mutex can lead to a lost wakeup. And of course you can't really acquire a mutex in an async-signal-safe context like a signal handler.
Without the signaling thread acquiring a mutex, you might end up signaling after T2 has checked the boolean, but before it has called cond_wait.
But this can be solved by processing the async signal in a deferred manner from some other watcher thread.
Every event loop is subject to the blocked-due-to-long-running-computation issue. It bites ...
The same is true if you're repeatedly polling an atomic boolean in an event loop.
How so? It takes only a couple of machine cycles to poll a boolean.
(And what other kind of boolean is there, besides atomic? It's either true or it's false, and if nothing can set it back to false once it goes true, I don't see the hazard. It's a CPU, not an FPGA.)
The type is named atomic, but atomicity is not its only useful property. The atomic types also give control over the memory ordering, defaulting to sequentially consistent (seq_cst, the strongest).
Without memory order guarantees enforced by memory barriers, a write to the boolean in thread A is not guaranteed to be observed by thread B. That matters both after initialization--where thread A sets the boolean to false but thread B may observe true, false, or invalid--and also after the transition--where thread B may fail to observe that the boolean has flipped from false to true.
[edit: I'm not sure the above reasoning actually matters; as stated already by parent, "It's a CPU, not an FPGA"; modern multicore shared-memory CPUs have coherent caches]
> Without memory order guarantees enforced by memory barriers, a write to the boolean in thread A is not guaranteed to be observed by thread B.
No, that's not correct. Memory ordering doesn't influence how fast a write is propagated to other cores, that's what cache coherency is for. Memory ordering of an access only matters in relation to accesses on other memory locations. There's a great introduction to this topic by Mara Bos: https://marabos.nl/atomics/memory-ordering.html
Indeed. I started to figure this out, hence my edit. Thanks for the link.
There are hypothetical, historical, and special-purpose architectures which don't have cache coherency (or implement it differently enough to matter here), but for all practical purposes, it seems that all modern, general-purpose architectures implement it.
Well, it'll be observed the next time through the loop. If that matters, then it's true that this technique isn't desirable.
Without atomic, the compiler won't bother with there being a next time and will just loop forever (in the old days, you'd mark it volatile instead).
True enough, but volatile still works, of course.
It’s free on x86 for relaxed ordering, which is sufficient for this use case.
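In C11 terms, the loop being discussed is roughly this; the relaxed load compiles to a plain load on x86:

    #include <stdatomic.h>

    static atomic_bool stop;             /* set (once) by the stopping thread */

    void worker_loop(void) {
        while (!atomic_load_explicit(&stop, memory_order_relaxed)) {
            /* a bounded chunk of work */
        }
    }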
disagree. i think then it's too tempting down the line for someone to add a message with blocking processing.
a simple clear loop that looks for a requested stop flag with a confirmed stop flag works pretty well. this can be built into a synchronous "stop" function for the caller that sets the flag and then does a timed wait on the confirmation (using condition variables and pthread_cond_timedwait or waitforxxxobject if you're on windows).
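a sketch of that synchronous stop (both flags are guarded by mu; the 2-second bound is arbitrary):

    #include <errno.h>
    #include <pthread.h>
    #include <stdbool.h>
    #include <time.h>

    static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
    static bool stop_requested = false;
    static bool stop_confirmed = false;  /* the worker sets this and signals cv
                                            just before it exits its loop */

    int stop_and_wait(void) {            /* 0 on confirmed stop, ETIMEDOUT otherwise */
        struct timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 2;

        pthread_mutex_lock(&mu);
        stop_requested = true;
        pthread_cond_broadcast(&cv);
        int err = 0;
        while (!stop_confirmed && err == 0)
            err = pthread_cond_timedwait(&cv, &mu, &deadline);
        pthread_mutex_unlock(&mu);
        return err;
    }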
Making your check less stable doesn't prevent this.
The examples in this article IIRC were something like this.
You're still going to be arbitrarily delayed if do_stuff() (or one of its callees, maybe deep inside the stack) delays, or if the sleep call does.

If you can't accept this, maybe don't play with threads, they are dangerous.
that's the point. use nonblocking io and an event polling mechanism with a timeout to keep an eye on an exit flag- that's all you need to handle clean shutdowns.
i think on windows you can wait on both the sockets/file descriptors and condition variables with the same waitforxxxobject blocking mechanism. on linux you can do libevent, epoll, select or pthread_cond_timedwait. all of these have "release on event or after timeout" semantics. you can use eventfd to combine them.
i would not ever recommend relying on signals and writing custom cleanup handlers for them (!).
unless they're blocked waiting for an external event, most system calls tend to return in a reasonable amount of time. handle the external event blocking scenario (stuff that select waits for) and you're basically there. moreover, if you're looking to exit cleanly, you probably don't want to take your chances interrupting syscalls with signals (!) anyway.
> If you can't accept this, maybe don't play with threads, they are dangerous.
too late. when i first started playing with threads, linux didn't really support them.
> use nonblocking io and an event polling mechanism
Not incompatible with what I said.
> with a timeout to keep an eye on an exit flag
This is the stupid part. You will burn CPU cycles waking up spuriously for timeouts with no work to do. Setting the flag won't wake up the event loop until the timeout hits, adding pointless delay.
You want to make signalling an exit to actually wake up your event loop. Then you also don't need a timeout.
I.e. you should make your "ask to exit" code use the same wakeup mechanism as the work queue, which is what I said at the beginning. Not burning CPU polling a volatile bool in memory on the side.
> This is the stupid part. You will burn CPU cycles waking up spuriously for timeouts with no work to do. Setting the flag won't wake up the event loop until the timeout hits, adding pointless delay.
it's the smart part. waking up at 50hz or 100hz is essentially free and if there's an os bug or other race that causes the one time "wake up to exit" event to get lost, the system will still be able to shut down cleanly with largely imperceptible delay. it also means that it can be ported to systems that don't support combined condition variable/fd semantics.
> You want to make signalling an exit to actually wake up your event loop.
This is exactly what condwait + condsignal do.
libcurl dealt with this a few months ago, and the sentiment is about the same: thread cancellation in glibc is hairy. The short summary (which I think is accurate) is that a hostname query via libnss ultimately had to read a config file, and glibc's `open` is a thread cancellation point, so if the thread is canceled there, it won't free memory that was allocated before the `open`.
The write-up on how they're dealing with it starts at https://eissing.org/icing/posts/pthread_cancel/.
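The POSIX-level mitigation for that specific leak is a cancellation cleanup handler around the allocation, something like:

    #include <fcntl.h>
    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>

    void read_config(const char *path) {
        char *buf = malloc(4096);
        pthread_cleanup_push(free, buf);   /* runs if we get cancelled below */
        int fd = open(path, O_RDONLY);     /* open() is a cancellation point */
        /* ... read and parse ... */
        if (fd >= 0) close(fd);            /* close() is a cancellation point too */
        pthread_cleanup_pop(1);            /* 1 = also free(buf) on the normal path */
    }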
Note that the situation with libcurl is very specific: lookup with libnss is only available as a synchronous call. All other syscalls they make can be done with async APIs, which can easily be cancelled without any of the trickery discussed here.
This was a fun read, I didn't know about rseq until today! And before this I reasonably assumed that the naive busy-wait thing would typically be what you'd do in a thread in most circumstances. Or that at least most threads do loop in that manner. I knew that signals and such were a problem but I didn't think just wanting to stop a thread would be so hard! :)
Hopefully this improves eventually? Who knows?
IIRC rseq was originally proposed by Google to support their pure-userspace read-copy-update (RCU) implementation, which relied on per-CPU not per-thread data.
Definitely fascinating to me, like I said I didn't even know rseq was a thing until today.
Previously: https://news.ycombinator.com/item?id=38908556
And somehow just a day ago: https://news.ycombinator.com/item?id=45589156
When you are on Linux the easiest way is to use signalfd. No unsafe async signal handling, just handling signals by reading from a fd.
This is just doubling down on the wrong approach.
The right approach is to avoid simple syscalls like sleep() or recv(), and instead use multiplexing calls like epoll() or io_uring(). These natively support being interrupted by some other thread, because you can pass, at minimum, two things for them to wait for: the thing you're actually interested in, and some token that can be signalled from another thread. For example, you could set up a unix socket pair which you read-wait on alongside the real work, then write to it from another thread to signal cancellation. Of course, by the time you're doing that, you really could multiplex useful IO too.
You also need to manually check this mechanism from time to time even if you're doing CPU bound work.
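A sketch of that multiplexed wait, with an eventfd as the token (a socket pair or pipe works the same way; sock_fd stands in for the real work):

    #include <poll.h>
    #include <stdint.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    static int cancel_fd;                       /* created once: eventfd(0, EFD_CLOEXEC) */

    int wait_readable(int sock_fd) {            /* worker side */
        struct pollfd fds[2] = {
            { .fd = sock_fd,   .events = POLLIN },
            { .fd = cancel_fd, .events = POLLIN },
        };
        poll(fds, 2, -1);                       /* blocks; no timeout needed */
        if (fds[1].revents & POLLIN)
            return -1;                          /* cancelled: unwind and exit */
        return 0;                               /* sock_fd has data */
    }

    void cancel(void) {                         /* any other thread */
        uint64_t one = 1;
        write(cancel_fd, &one, sizeof one);     /* wakes every poller on cancel_fd */
    }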
If you're using an async framework like asyncio/Trio in Python or ASIO in C++, you can request a callback to be run from another thread (this is the real foothold, because it's effectively interrupting a long sleep/recv/whatever to do other work in the thread), at which point you can cancel whatever IO is still outstanding (e.g. call task.cancel() in asyncio). Then you're effectively allowing this cancellation to happen at every await point.
(In C# you can pass around a CancellationToken, which you can cancel directly from another thread to save that extra bit of indirection.)
This is noted in the blog post, but the problem is that sometimes you don't have the freedom to do so. See this sidenote and the section next to it: https://mazzo.li/posts/stopping-linux-threads.html#fn3 .
I'll admit: I didn't see that.
But I also disagree with it. Yes, the logical conclusion of starting down that path is that you end up with full-on use of coroutines and some IO framework (though I don't see the problem with that). But a simple wrapper for individual calls that does recv+cancel rather than just recv, etc., is better than any solution mentioned in the blog post.
The fact is, if you want to wait for more than one thing at once at the syscall level (in this case, IO + inter thread cancellation), then the way to do that is to use select or poll or something else actually designed for that.
I had this problem, and I solved it by farming the known-blocking syscalls out to a separate thread pool. Then the calling thread can just abandon the wait. To make it a bit better, you can also use bounded timeouts (~1-2 seconds) with retries for some calls like recvfrom() via SO_RCVTIMEO, so that the termination time becomes bounded.
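For the timeout part, a sketch:

    #include <sys/socket.h>
    #include <sys/time.h>

    void bound_blocking_recv(int fd) {
        struct timeval tv = { .tv_sec = 2, .tv_usec = 0 };
        setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);
        /* recvfrom() on fd now fails with EAGAIN/EWOULDBLOCK after ~2s;
           the pool thread checks the stop flag and either retries or exits */
    }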
This is probably the cleanest solution that is portable.
Then those thread pool threads have to be careful not to take locks and other scarce resources that aren't cleaned up then, no?
Off-Topic: I surprised myself by liking the web site design. Especially the font.
Me too. It is pretty rare for anyone to take so much care. The only other person I can think of now is Gwern Branwen.
This seems like a lot of work to do when you have signalfd, no? That + async and non blocking I/O should create the basis of a simple thread cancellation mechanism that exits pretty immediately, no?
As I note in the blog post in various places, if one can organize the code so that cancellation is explicit, things are indeed easier. I also cite eventfd as one way of doing so. What I meant to convey is that there's no easy way to cancel arbitrary code safely.
I have been using signalfd + epoll where it looks like I could use eventfd instead (or just epoll_pwait). Is there a significant benefit to one approach over another? I suspect eventfd might be more efficient (and doesn't use up a signal handler... when are we going to get SIGUSR3 ?!?).
I don't really get it. There are two possibilities here:
* You control all the IO, and then you can use some cooperative mechanism to signal cancellation to the thread.
* You don't control IO code at the syscall level (e.g. you're using some library that uses sockets under the hood, such as a database client library)... But then it's just obvious you're screwed. If you could somehow terminate the thread abruptly then you'll leak resources (possibly leaving mutexes locked, as you said), or if you interrupt syscalls with an error code then the library won't understand it. That's too trivial to warrant a blog post fussing about signals.
The only useful discussion to have on the topic of thread cancellation is what happens when you can do a cooperative cancel, so I don't think it's fair to shoot that discussion down.
You can still put the code in another process and send SIGKILL.
If you just want to stop and/or kill all child threads, you can read the list of thread IDs from /proc/pid/task, and send a signal to them with tgkill().
Yeah, and leave mutexes locked indefinitely.
Sometimes that doesn't matter - maybe you are just trying to get the process to exit without core dumping due to running threads accessing things that are disappearing.
I'm not sure there's any better solution if you are dealing with a library that creates threads and doesn't provide an API to shut them down.
Ah, the eternal problem of asynch unwind!
... and to cancel: I think what is really needed is just an exception (unwind cleanup) mechanism and a cheap way to mask interrupts. A signal deferral mechanism does exactly that -- so that with(out)-interrupts just sets a variable and doesn't need to go through a syscall.

this stuff always seemed a mess. in practice i've always just used async io (non-blocking) and condition variables with shutdown flags.
trying to preemptively terminate a thread in a reliable fashion under linux always seemed like a fool's errand.
fwiw. it's not all that important, they get cleaned up at exit anyway. (and one should not be relying on operating system thread termination facilities for this sort of thing.)
I claim that this is a solved problem, without rseq.
1. Any given thread in an application waits for "events of interest", then performs computations based on those events (= keeps the CPU busy for a while), then goes back to waiting for more events.
2. There are generally two kinds of events: one kind that you can wait for, possibly indefinitely, with ppoll/pselect (those cover signals, file descriptors, and timing), and another kind you can wait for, possibly indefinitely, with pthread_cond_wait (or even pthread_cond_timedwait). pthread_cond_wait cannot be interrupted by signals (by design), and that's a good thing. The first kind is generally used for interacting with the environment through non-blocking syscalls (you can even notice SIGCHLD when a child process exits, and reap it with a WNOHANG waitpid()), while the second kind is used for distributing computation between cores.
3. The two kinds of waits are generally not employed together in any given thread, because while you're blocked on one kind, you cannot wait for the other kind (e.g., while you're blocked in ppoll(), you can't be blocked in pthread_cond_wait()). Put differently, you design your application in the first place such that threads wait like this.
4. The fact that pthread_mutex_lock in particular is not interruptible by signals (by design!) is no problem, because no thread should block on any mutex indefinitely (or more strongly: mutex contention should be low).
5. In a thread that waits for events via ppoll/pselect, use a signal to indicate a need to stop. If the CPU processing done in this kind of thread may take long, break it up into chunks, and check sigpending() every once in a while, during the CPU-intensive computation (or even unblock the signal for the thread every once in a while, to let the signal be delivered -- you can act on that too).
6. In a thread that waits for events via pthread_cond_wait, relax the logical condition "C" that is associated with the condvar to ((C) || stop), where "stop" is a new variable protected by the mutex that is associated with the condvar. If the CPU processing done in this kind of thread may take long, then break it up into chunks, and check "stop" (bracketed by acquiring and releasing the mutex) every once in a while.
7. For interrupting the ppoll/pselect type of thread, send it a signal with pthread_kill (EDIT: or send it a single byte via a pipe that the thread monitors just for this purpose; but then the periodic checking in that thread has to use a nonblocking read or a distinct ppoll, for that pipe). For interrupting the other type of thread, grab the mutex, set "stop", call pthread_cond_signal or pthread_cond_broadcast, then release the mutex. (A sketch follows after the list.)
8. (edited to add:) with both kinds, you can hierarchically reap the stopped threads with pthread_join.
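To make items 5 and 7 concrete, a sketch of the ppoll side (SIGUSR1 picked arbitrarily; the condvar side is sketched further up the thread):

    #define _GNU_SOURCE                    /* for ppoll on older glibc */
    #include <errno.h>
    #include <poll.h>
    #include <pthread.h>
    #include <signal.h>

    static void on_stop(int sig) { (void)sig; }   /* without a handler, SIGUSR1
                                                     would kill the whole process */

    static void *worker(void *arg) {
        int fd = *(int *)arg;

        sigset_t blocked, during_wait;
        sigemptyset(&blocked);
        sigaddset(&blocked, SIGUSR1);
        pthread_sigmask(SIG_BLOCK, &blocked, NULL);    /* blocked while computing... */
        pthread_sigmask(SIG_SETMASK, NULL, &during_wait);
        sigdelset(&during_wait, SIGUSR1);              /* ...deliverable only inside ppoll */

        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        for (;;) {
            int r = ppoll(&pfd, 1, NULL, &during_wait);
            if (r < 0 && errno == EINTR)
                break;                                 /* stop requested: clean up, return */
            /* ... handle fd events in bounded chunks ... */
        }
        return NULL;
    }

    /* stopping side: install on_stop with sigaction(), pthread_kill(tid, SIGUSR1),
       then pthread_join(tid, NULL) -- item 8. A signal sent while the worker is
       computing stays pending and is delivered atomically when ppoll swaps the mask. */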
One does not simply stop a thread…
pthread cancelation ends up not being the greatest, but it's important to represent it accurately. It has two modes: asynchronous and deferred. In asynchronous mode, a thread can be canceled at any time, even in the middle of a critical section with a lock held. In deferred mode, however, a thread's cancelation is delayed to the next cancelation point (basically a subset of POSIX function calls), so it's possible to make that do-stuff-under-lock flow safe with cancelation after all.
That's not to say people do or that it's a good idea to try.
Cancellation points and cancellability state are discussed in the post. In a C codebase that you fully control pthread cancellation _can_ be made to work, but if you control the whole codebase I'd argue you're better off just structuring your program so that you yield cooperatively frequently enough to ensure prompt termination.
Not arguing there. I'm just pointing out that the post's claim that "Thread cancellation is incompatible with modern C++" needs more nuance.
> How to stop Linux threads cleanly
kill -HUP ?
while (true) { if (stop) { break; } }
If only there were a way to stop a while loop without having to use an extra conditional with break...
Feel free to read the article before commenting.
I’ve read it, and I found nothing to justify that piece of code. Can you please explain?
The while loop surrounds the whole thread, which does multiple tasks. The conditional is there to surround some work that completes in a reasonable time. That's how I understood it, at least.
Does not seem so clear to me. If so, it could be stated with more pseudo code. Also the eventual need for multiple exit points…
Should this code:

    while (true) {
        if (stop) { break; }
        // Perform some work completing in a reasonable time
    }

Be just:

    while (!stop) {
        do-the-thing;
    }

Anyway, the last part:

>> It’s quite frustrating that there’s no agreed upon way to interrupt and stack unwind a Linux thread and to protect critical sections from such unwinding. There are no technical obstacles to such facilities existing, but clean teardown is often a neglected part of software.
I think it is a “design feature”. In C everything is low level, so I have no expectation of a high level feature like “stop this thread and cleanup the mess” IMHO asking that is similar to asking for GC in C.
yes, maybe, except if you don't have a single tight loop and stop checks are not just done once in the loop body but manually sprinkled through various places of your code. E.g. think of a long-running compute task split into parts 1, 2 (tight loop), 3 (loop), 4: then you probably want a stop check between each of them, and in each inner iteration of 3, but probably not in each inner iteration of 2 (as each check is an atomic load).
Maybe. But it seems to me there should be better ways to organize the code. In the case you mention there will be many places where you have to clean up (that is what the article is about), so the code will be hell to debug: multithreaded, with multiple exit points in each thread… I have done really tons and tons of multithreading and never once needed such a complicated thing. Typically the code which gets run in parallel is either for managing one resource type OR number crunching w/o resource allocation… if you are spawning threads that do lots of resource allocation, maybe you have architecture problems, or you are solving a very niche problem.
If your threads run "cooperative multi threading" tasks (e.g. the Rust tokio runtime, JS in general, etc.) then this kinda is a non-problem.
Due to tasks frequently returning to the scheduler, the scheduler can do a "should stop" check there (also, as it might be possible to squeeze it into other atomic state bitmaps, it might have zero relevant performance overhead (a single is-bit-set check)). And then properly shut down tasks. Now "properly shut down tasks" isn't as trivial as the "cleaning up local resources" part normally is, but for graceful shutdown you normally also want to allow cleaning up remote resources, e.g. transaction state. This comes from the difference between "somewhat forced shutdown" and "graceful shutdown". And in very many cases you want "graceful shutdown", and only if that doesn't work, force it. Another reason not to use a "naive" forced-only shutdown...
Interpreted languages can do something similar in a very transparent manner (if they want to). But they run into similar issues wrt. locking and forced unwinding/panics from arbitrary places as C.
Sure, a very broken task might block long-term. But in that case you often are better off killing it as part of process termination instead, and if that doesn't seem an option for "resilience" reasons, then you are already better off in "multiple processes for resilience" (potentially across different servers) territory IMHO.
So as much as forced thread termination looks tempting, I found that any time I thought I needed it, it was because I did something very wrong elsewhere.
user-space threads have entirely different semantics from kernel threads. both have their uses, but should generally not be conflated.
Concepts of cooperative multithreading, coroutines, etc. aren't limited to user space.
Actually, they predate the whole "async" movement, or whatever you want to call it.
Also, the article is about user-space threads, i.e. OS threads, not kernel-space threads (which use kthread_*, not pthread_*), and stopping a kthread does work by setting a flag to indicate it's supposed to stop, waking the thread, and then waiting for it to exit. I.e. it works much closer to the `if(stop) exit` example than to any signal usage.
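(For contrast, the in-kernel side looks roughly like this; kernel API, shown only to illustrate the flag-and-wake design:)

    /* #include <linux/kthread.h> -- kernel code, not userspace */

    static int worker(void *data)
    {
        while (!kthread_should_stop()) {
            /* do a bounded chunk of work, or sleep interruptibly */
        }
        return 0;
    }

    /* elsewhere: kthread_stop(task) sets the stop flag, wakes the
       thread, and blocks until worker() returns */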
The pthread API, defined by POSIX, and referred to in TFA's title as "linux threads", concerns kernel threads. Pthreads does not provide user-space threads in any implementation I am aware of, and is not intended to (though it likely could be done). The API is intended to allow a process to have multiple execution contexts that can be scheduled by the kernel independently.
yes POSIX threads, i.e. user-space threads
I think you have a very strange definition of "user-space", "kernel-space".
kernel space is what runs _in_ the kernel; it doesn't involve pthreads (on any OS) and uses kthreads (on Linux).
POSIX threads are user-space threads. It doesn't matter that they are scheduled by the kernel; that is the norm for threads in user space. Also known as OS threads.
What you probably mean by user-space threads are green threads, which are built on one or more OS threads but have an additional scheduling layer which can schedule multiple green threads on one OS thread using some form of multiplexing scheme.
I'm unfamiliar with this, is "coop multithreading" basically equivalent to Windows 3.1 style coop multitasking, where a thread can hang the whole application by ... not cooperating?
Does anyone here remember Windows 3.1?
yesn't
both are forms of cooperative multi tasking/threading
but there are many differences in what you expect and how exactly it is implemented
and most relevantly, you run the cooperative tasks on a pool of OS threads which are preempted, so a single thread hanging won't hang your whole application. and dev tooling has gotten much better since then; that helps a lot too.
also Windows 3.1 kinda predates me, so no not really
PS: And JS in the browser still uses cooperative multitasking, and the whole website hanging isn't exactly the norm ;) Partially because you can opt for preempted threads (workers), which also happen to not be good at handling termination. And partially because a lot of the overhead logic is moved outside of web apps and into the browser itself (e.g. layout, rendering, etc.). It's pretty common to set up one such worker and then use message passing to send work to it, but this means that between concurrent tasks there is no preemption, and one way to handle arbitrary numbers of concurrent requests with a very limited number of workers is to make each worker internally use cooperative multitasking. So we are kinda back at cooperative multitasking on top of preempted OS threads.