Bug 1603 - Windows: sp_[blocking|nonblocking]_read may fail when last call was from thread that has no exited
Summary: Windows: sp_[blocking|nonblocking]_read may fail when last call was from thre...
Status: IN_PROGRESS
Alias: None
Product: libserialport
Classification: Unclassified
Component: Portability (show other bugs)
Version: unreleased development snapshot
Hardware: All Windows
: Normal normal
Target Milestone: ---
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-09-08 10:21 CEST by Christian Seiler
Modified: 2021-07-03 15:45 CEST (History)
1 user (show)



Attachments
Multi-threaded "client" that reproduces the issue (3.12 KB, text/plain)
2020-09-08 10:21 CEST, Christian Seiler
Details
"Server" for use in conjunction with the client and a virtual NULL modem (2.46 KB, text/plain)
2020-09-08 10:21 CEST, Christian Seiler
Details
Windows: don't fail if WaitCommEvent was aborted due to thread exit (555 bytes, patch)
2020-09-08 12:38 CEST, Christian Seiler
Details
[v2] Windows: don't fail if WaitCommEvent was aborted due to thread exit (2.18 KB, patch)
2021-07-02 15:35 CEST, Christian Seiler
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Christian Seiler 2020-09-08 10:21:10 CEST
Created attachment 671 [details]
Multi-threaded "client" that reproduces the issue

On Windows systems sp_*_read may fail spuriously with SP_ERR_FAIL and an operating system error code ERROR_OPERATION_ABORTED (995, 0x3E3) when the last call to sp_*_read was from a thread that has since been terminated. This is due to the fact that Windows ties OVERLAPPED I/O operations to the currently running thread, and if that thread exits, the I/O operation is aborted.

This does not always happen, because the OVERLAPPED I/O operation could complete before the thread has exited.

This is NOT a race condition, this assumes that the threads do NOT access the serial port at the same time.

I've created a sample reproducer that just starts a thread that does some serial I/O, waits for the thread to terminate (hence the threads being synchronized and no race condition), and then triggers a fail. To use the reproducer I've installed the "com0com" NULL modem driver from <https://sourceforge.net/projects/com0com/files/com0com/3.0.0.0/>. The reproducer consists of two test programs: one that just replies to some dummy requests, while the other triggers the condition with a thread.

Steps to reproduce:

1. Compile the C source files into separate executables, link them against libserialport
2. Install the com0com NULL modem driver and create a virtual serial port pair.
3. In one Terminal run mirror.exe and supply it one of the ports, keep it running in the background (can be stopped via Ctrl+C)
4. In another shell run thread.exe and supply it the other virtual COM port

Example Output (COM5/COM6 is the pair on my system):

Shell 1 (start first, output will appear only when the thread.exe is started):
C:\Test> mirror.exe COM6
Sending 4 bytes
Sending 2 bytes

Shell 2 (start second):
C:\Test> thread.exe COM5
main thread sp_blocking_read() failed: -2
C:\Test>
Comment 1 Christian Seiler 2020-09-08 10:21:43 CEST
Created attachment 672 [details]
"Server" for use in conjunction with the client and a virtual NULL modem
Comment 2 Christian Seiler 2020-09-08 12:38:34 CEST
Created attachment 673 [details]
Windows: don't fail if WaitCommEvent was aborted due to thread exit

I've looked into this a bit more and the actual issue is a bit different: the OVERLAPPED I/O of ReadFile is actually not the culprit here, because that I/O has already completed at that point. Instead, the asynchronous WaitCommEvent wait is what is aborted here, resulting in the error message.

I've attached a patch to libserialport that fixes this issue for the test case at least. I'll perform some additional checks in the actual application to verify this does indeed fix this.
Comment 3 Martin Ling 2021-07-01 13:01:30 CEST
Did you ever confirm further whether this patch fixes the issue for your full application?

In our current thread safety rules we require that only one thread be making read calls on any given port at a time:
https://sigrok.org/api/libserialport/unstable/index.html#autotoc_md8

There's no requirement that it always be the same thread, so it should be okay to do what your test case does: end one thread and then start reading from another, in this case the main thread. For that scenario your fix makes sense.

However, I think there'd be an issue if you were to have a second thread take over read operations when the previous thread hasn't actually terminated.

Consider a plausible example:

Say you had a thread pool system with several worker threads doing arbitrary tasks that get scheduled onto them. One task is to read data from a serial port.

You protect the port with a lock so that only one thread can be doing read operations at a time, as per the libserialport thread safety rules.

So thread 1 takes the lock, reads from the port, and releases the lock. It then goes on to doing something else - it doesn't terminate.

Now thread 2 gets scheduled to read from the port, takes the lock and does so. Thread 1 is still running. The overlapped wait operation is still open, and is associated with thread 1.

What happens in that case?
Comment 4 Christian Seiler 2021-07-01 21:08:48 CEST
> Did you ever confirm further whether this patch fixes the issue for your full application?

I did some tests and it appeared to fix it, though I haven't done any extensive tests in that regard.

In our application we worked around the issue by simply keeping the thread alive doing nothing (just waiting for a termination signal when the program exits), because we only wanted to update libserialport in our build environment once the patch got in officially.

> Now thread 2 gets scheduled to read from the port, takes the lock and does so. Thread 1 is still running. The overlapped wait operation is still open, and is associated with thread 1.
> 
> What happens in that case?

Well, that's kind of what I'm doing right now in my workaround, by keeping the second thread around doing nothing (not calling any other libserialport functions), but moving the logic back to the main thread. That has been used extensively by now, and that didn't cause us any issues.

My guess is that Windows ties the WaitCommEvent() asynchronous (overlapped) action to the thread that called it, and when that thread exits, it aborts that operation -- but querying the result can still be done from any other thread.

What I can't say is if there could be any information loss if my fix is applied: I'm not deep enough into the Windows API to know whether restarting WaitCommEvent will still deliver the missing comm events. It didn't appear to cause any issues in my case, but I call libserialport in a synchronous fashion, so I don't actually use the event logic.

In fact, thinking about it more, and looking at the source code, from what I can see WaitCommEvent is used to determine when new data has reached the read buffer of the operating system for use in the asynchronous API -- and if the thread that last started WaitCommEvent exits, that event won't ever be signalled anymore, because that case isn't actually handled by my "fix".

Maybe the better option would be for libserialport to always create its own thread tied to the port handle that runs WaitCommEvent alternating with waiting for a restart signal in a loop? In that case we can guarantee the thread never dies (it would exit on a port close) and the event signalling always works, and we completely work around that issue. Has the disadvantage of creating an additional thread on Windows, but from my own experience with the Windows API, that is often the easiest thing to do there, and threads are not too expensive on Windows, as long as you're not constantly starting/stopping them, which we wouldn't be here.

I could try to implement something like that and provide a patch, if you agree with that direction.
Comment 5 Martin Ling 2021-07-02 03:11:59 CEST
Just to be clear, there's no "asynchronous API". The libserialport API is entirely synchronous. You call a function, it does things until it returns and that's it. There's no callbacks getting called outside of the main program flow, code running in the background or anything like that which I would call asynchronous.

All of the event-related code is only there to support a single function, sp_wait(), which is synchronous too. It's basically just a libserialport-specific version of select(), poll() or WaitForMultipleObjects(), letting you wait for many possible events across multiple ports at once.

The problem is that there's a mismatch here between how Unix works and how Windows works. In Unix, you can call select() or poll() on a bunch of file descriptors, and have that that call return when any of them have become readable. If that's already happened, the call returns immediately.

On Windows however, there's no direct equivalent to that when it comes to serial ports. You can call WaitForMultipleObjects(), but it operates on event handles, not port handles. To have an event handle which will trigger when data arrives on a port, you need to first call WaitCommEvent() as an overlapped operation.

And if you want that WaitForMultipleObjects() call to return immediately if data has already arrived, then you need to have *already* made that WaitCommEvent() call in the past, such that the event handle for it is already sitting there waiting to be checked.

Which is why we have the code we do. Any time we empty the RX buffer, Windows will clear the RX event if it is pending. So whenever we cause the buffer to be emptied, we start a new WaitCommEvent() call. That way, we can always check whether new data has arrived, along with other events, just by calling WaitForMultipleObjects().

This works in the single threaded case, and it also works in the multithreaded case too so long as you have all your read calls in one thread. As you've discovered though, it falls over when your read thread terminates and another thread then takes over reading, because it will see ERROR_OPERATION_ABORTED.

Actually, I believe your fix is correct for sp_*_read(). In each of those functions, restart_wait() is called via restart_wait_if_needed(). That function checks whether the RX buffer is empty, and only calls restart_wait() if so. And if the RX buffer is empty, then the RX event could not have been pending. So no event is lost. Your fix sets wait_running to FALSE, so the following code block will start a new WaitCommEvent() call, and everything is then working correctly on the new read thread.

Your fix cannot cause an event to be lost. If you get ERROR_OPERATION_ABORTED, then any pending event was *already* lost. So although there may still be things to fix elsewhere, your fix prevents the sp_*_read() functions from breaking as a result of that, which is an improvement. So I think I'd be happy to merge it.

The place where a problem can potentially occur is in sp_wait(), because that's what actually tries to use the pending event handle.

If thread 1 calls sp_*_read() and then terminates, and then thread 2 takes over and calls sp_wait() to check for read events on the same port, then WaitForMultipleObjects() will be called from sp_wait() with an event handle that is in an aborted state.

I'm not sure what WaitForMultipleObjects() does in that case, and the documentation isn't immediately clear.

If it detects the problem and returns immediately, with or without an error, that's easy to deal with. We don't need to tell the caller *which* events happened when sp_wait() returns, and the documentation already cautions that spurious wakeups are possible. So the calling program should act as if each event in the set has *possibly* happened, which means it will proceed to read from the port in question, expecting some data to be available. That read call will then start a new overlapped WaitCommEvent() associated with the new thread, and all will be well.

The difficult scenario would be if WaitForMultipleObjects() ignores the aborted event handle and continues waiting. If it does, the program could hang at that point.

We'll need to test what actually happens and modify sp_wait() accordingly.

But I think your fix is already an improvement for the read functions.
Comment 6 Christian Seiler 2021-07-02 09:33:29 CEST
> The difficult scenario would be if WaitForMultipleObjects() ignores the
> aborted event handle and continues waiting. If it does, the program could
> hang at that point.
> 
> We'll need to test what actually happens and modify sp_wait() accordingly.

Yes, I think your reasoning makes a lot of sense. Many thanks for taking the time to think about this issue carefully.

I'll try to test what behavior sp_wait() has when the WaitCommEvent() was aborted (with my virtual NULL modem test), and if that's not already the desired behavior, update it so that it at least generates a wakeup that the user must handle anyway.

I'll report back with what I've found.
Comment 7 Martin Ling 2021-07-02 14:01:10 CEST
I should add that if we can't solve this nicely within the current approach, there is another way to deal with it: we could move all the WaitCommEvent related code into sp_wait() and stop having to deal with it in the read functions.

The way to do it would be to change what's stored in the event_set. Rather than storing event handles so that WaitForMultipleObjects() can be called immediately, we could store an sp_port pointer and an event mask for each event in the set.

The sp_wait() function would then need to iterate through the events. If it sees an SP_EVENT_RX_READY event, it would check the state of the RX buffer. If the buffer has characters in it, sp_wait() could return immediately. If not, it would start an overlapped WaitCommEvent().

Once all the events have been checked, and overlapped operations started as needed, sp_wait() would proceed to calling WaitForMultipleObjects() as it currently does.

There are two reasons I didn't go with this approach before.

One is that in the scenario where you're using sp_wait(), you're usually in an event loop where you're potentially spinning on the wait call a lot. It's really nice if everything is already set up so that all sp_wait() has to do is call WaitForMultipleObjects().

The other reason was that users might want to integrate libserialport into an existing event loop, so that a program can wait on both serial ports and other possible events in one system call.

To do that, libserialport would need to be able to provide a set of event handles on request which are already set up and can be used as part of a larger set of events to wait on. This is similar to the approach offered by libusb with its libusb_get_pollfds() function.

The current architecture makes that possible. However, so far we've never added an API call for it.

Thinking about it some more, moving the WaitCommEvent() code out of the read functions wouldn't preclude adding that event loop integration support. The call to obtain the event handles would just need to do the WaitCommEvent() setup as I described above before returning the handles.

So it's an option. But I'd still prefer if we could keep sp_wait() as simple as possible.
Comment 8 Christian Seiler 2021-07-02 15:35:22 CEST
Created attachment 739 [details]
[v2] Windows: don't fail if WaitCommEvent was aborted due to thread exit

I've experimented with this a bit, and WaitForMultipleObjects() would return with WAIT_OBJECT_0 + index when an abort happened (tested on Windows 10 Professional with the aforementioned virtual null-model setup), so that was the correct.

However, there was an additional issue that if there was no actual data available on the serial port in the mean time, the next call to sp_*_read() would not actually restart the wait, because it assumed it was still running. To avoid that I added a new BOOL variable to the port that tracks a thread exit and ensures that the second call to restart_wait_if_needed() after the thread exit still calls WaitCommEvent() again.

If I didn't do that this would cause sp_wait() to always return immediately, even if no data was available, causing any program with an event loop to use 100% CPU until at least some data was received on the serial port, which isn't desirable.

The attached patch should fix all of this and provide the correct behavior for both sp_*_read() and sp_wait() on Windows.
Comment 9 Martin Ling 2021-07-03 15:06:58 CEST
Great. Thanks for working on this!

The logic in your latest patch makes sense to me. I may do some more work on the comments to try to make sure the reasoning is fully clear to future readers.

Do you have this patch as a git commit somewhere that we could pull it in from so you are credited in the history?
Comment 10 Martin Ling 2021-07-03 15:45:06 CEST
I've just realised that it would be possible to avoid the spurious wakeup from sp_wait() when switching threads, without having it need to do any setup before calling WaitForMultipleObjects().

It could be done by having sp_wait() do the following:

1. Look at which WAIT_OBJECT_N index is returned.

2. Use some additional bookkeeping added in the sp_event_set structure, to find what port the handle was for and whether it is for TX or RX.

3. If an RX event, use GetOverlappedResult() to check the result of the overlapped WaitCommEvent() call.

4. If the result is ERROR_OPERATION_ABORTED, then the wakeup may be spurious. Otherwise the wakeup is valid, so return.

5. Check the state of the RX buffer. If there are bytes present, then the wakeup is valid. If not, the wakeup was spurious.

6. Either way, call restart_wait() to start a new overlapped call from the current thread.

7. If the wakeup was spurious, loop back to the start of sp_wait() and call WaitForMultipleObjects() again. Otherwise, return.

So I may work on that next. But it's a bit more involved and will need some other changes, so would like to get your fix in as-is first.