Bug 1686 - sp_blocking_read* doesn't handle device removal well on Linux
Summary: sp_blocking_read* doesn't handle device removal well on Linux
Status: CONFIRMED
Alias: None
Product: libserialport
Classification: Unclassified
Component: Portability (show other bugs)
Version: unreleased development snapshot
Hardware: All All
: Normal normal
Target Milestone: ---
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-05-25 19:59 CEST by Phil
Modified: 2023-08-04 11:17 CEST (History)
3 users (show)



Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Phil 2021-05-25 19:59:31 CEST
I'm using libserialport with a USB serial device.

On Windows, if my device is unplugged then sp_blocking_read_next() returns an error. When the device is unplugged on Linux sp_blocking_read_next() spins on the select() + read() calls (select returns 1, read returns 0) and then finally times out and returns 0.

I don't see any documentation around this use case so maybe it's considered undefined behavior, but I think it would be nice if the unix implementation handled this scenario better and returned an error (or at least doesn't spin)
Comment 1 Martin Ling 2021-07-01 12:06:50 CEST
In that case, I would expect either the select() or the read() to return negative, and to set errno to EIO or some other error.

If they don't, how could libserialport know that the device was removed?

Do you have debug output? You can get some by exporting LIBSERIALPORT_DEBUG=1 and running your example.

Just based on your description of the select() and read() results, I'm afraid this seems like there must be a bug outside the library.

It could be a bug in the kernel driver for that port causing it not to report an error on removal. It could potentially be a bug in your program if you have multiple threads using the same port?
Comment 2 Phil 2021-07-01 16:26:24 CEST
I also would have expected slect() or read() to return error. 

My program has two threads, one with reads on a loop:
```
int ah_threadsafeSerialPortRead(ah_threadsafePort *port, uint8_t *buffer, uint32_t maxLen)
{
    int rv = sp_blocking_read_next(port->hw, buffer, maxLen, 100);
#ifndef _WIN32
    if (rv == 0)
    {
        // Reading 0 bytes means the read timed out. This could be because the other end stopped
        // sending data, but it could also have been disconnected. Check to see if the port still
        // exists, and if not raise an error
        int fd;
        sp_get_port_handle(port->hw, &fd);

        struct stat s;
        fstat(fd, &s);
        if (s.st_nlink == 0)
        {
            return -1;
        }
    }
#endif

    return rv;
}
```

And one which writes, althoguh this one isn't really active and sits idle 99% of the time:
```
int ah_threadsafeSerialPortWrite(ah_threadsafePort *port, uint8_t *buffer, uint32_t len)
{
    if (sp_blocking_write(port->hw, buffer, len, 100) != (int)len)
    {
        return 0;
    }
    if (sp_drain(port->hw) != SP_OK)
    {
        return 0;
    }
    return len;
}
```


Without my workaround (the code in the `#ifndef _WIN32`) the debug output around port removal just looks like:
```
sp: sp_blocking_read_next(0x557317164300, 0x557316b72160, 1024, 100) called.
sp: Reading next max 1024 bytes from port /dev/ttyACM0, timeout 100 ms.
sp: sp_blocking_read_next returning 53.
<port disconnected here>
sp: sp_blocking_read_next(0x557317164300, 0x557316b72160, 1024, 100) called.
sp: Reading next max 1024 bytes from port /dev/ttyACM0, timeout 100 ms.
sp: Read timed out.
sp: sp_blocking_read_next returning 0.
sp: sp_blocking_read_next(0x557317164300, 0x557316b72160, 1024, 100) called.
sp: Reading next max 1024 bytes from port /dev/ttyACM0, timeout 100 ms.
sp: Read timed out.
sp: sp_blocking_read_next returning 0.
sp: sp_blocking_read_next(0x557317164300, 0x557316b72160, 1024, 100) called.
sp: Reading next max 1024 bytes from port /dev/ttyACM0, timeout 100 ms.
sp: Read timed out.
sp: sp_blocking_read_next returning 0.
...
```
Comment 3 Martin Ling 2021-07-01 17:00:57 CEST
Your code looks fine to me; it does seem like Linux just fails to signal an error here.

I am inclined to consider this a kernel bug. The read() calls really ought to be reporting ENODEV at that point.

If we were to try to solve this by integrating your workaround into libserialport, it would mean making an fstat() call after any read() that returned zero. That's not a reasonable or efficient thing to expect userspace to do.

I'm also not certain that fstat() can be relied upon for this anyway. Presumably udev is automatically removing the /dev/ttyUSBx device node for you when the device is removed, which causes a file descriptor opened from that node to report that st_nlink == 0.

But in the general case - particularly on embedded systems with a simpler userspace - you might just have a node created directly by mknod somewhere on the filesystem.

If the device connected to that node goes away at the kernel level, the kernel shouldn't make it necessary to rely on some other program detecting that and removing the device node for you.

I would suggest you file this as a kernel bug and link to this report.
Comment 4 Phil 2021-07-01 17:07:36 CEST
I was able to reproduce the issue with this basic example:
```
#include "libserialport.h"
#include <stdint.h>

int main(int argc, char **argv)
{
    (void) argc;
    (void) argv;

    struct sp_port *hw;
    sp_get_port_by_name("/dev/ttyACM0", &hw);
    sp_open(hw, SP_MODE_READ_WRITE);
    // sp_set_baudrate(hw, 650195);
    // sp_set_bits(hw, 8);
    // sp_set_parity(hw, SP_PARITY_NONE);
    // sp_set_stopbits(hw, 1);
    // sp_set_flowcontrol(hw, SP_FLOWCONTROL_NONE);

    uint8_t buffer[1024];
    int bytesRead = 0;
    while (bytesRead >= 0)
    {
        bytesRead = sp_blocking_read_next(hw, buffer, sizeof(buffer), 100);
    }

    sp_close(hw);
    sp_free_port(hw);
    return 0;
}```

I agree my workaround isn't ideal, and isn't something which should be integrated into libserialport.

I'm running Ubuntu 20.04 and my kernel version is `5.4.0-74-generic`. Will report upstream
Comment 5 Martin Ling 2021-07-01 17:16:50 CEST
Another issue with trying to solve this at the libserialport level is that we actually don't really have a good way to signal the error.

We deliberately have four error codes in the API (SP_ERR_ARG, SP_ERR_FAIL, SP_ERR_SUPP and SP_ERR_MEM), as documented here:
https://sigrok.org/api/libserialport/unstable/index.html#autotoc_md7

When something fails, the assumption is that there's an OS-provided error which we can make available by via sp_last_error_message().

This approach is important because it means:

- Programs using the library are simpler: They only need to handle four errors,
  and can do so consistently over every OS with the same code.

- The library is simpler: Because we don't provide our own error codes/messages,
  we don't have to map every possible OS error to an appropriate library error.

- Internationalisation is simpler: Because we don't provide error messages, we
  don't have to translate anything to every possible user's language. We just
  pass on OS error messages, which already have translations.

However, in this scenario we don't have an error from the OS to pass on.

I suppose we could set errno to ENODEV and return -SP_ERR_FAIL, but that feels rather misleading to me, because the OS didn't actually signal that error.
Comment 6 Phil 2021-07-01 21:29:57 CEST
I took a quick look at miniterm.py (https://github.com/pyserial/pyserial/blob/0e7634747568547b8a7f9fd0c48ed74f16af4b23/serial/serialposix.py), and noticed that they have the following comments/logic:

```
# read should always return some data as select reported it was
# ready to read when we get to this point.
if not buf:
    # Disconnected devices, at least on Linux, show the
    # behavior that they are always ready to read immediately
    # but reading returns nothing.
    raise SerialException(
        'device reports readiness to read but returned no data '
        '(device disconnected or multiple access on port?)')
```

This matches the behavior I'm seeing (select returns 1, but read retuns 0). It would make sense to me to do a `RETURN_FAIL("Device disconnected")` or something like that in this case
Comment 7 Martin Ling 2021-07-02 01:30:22 CEST
The approach in pyserial looks a lot better than the fstat workaround, at least.

However, I'm not sure it's strictly correct. The specification for select() has always been pretty clear that spurious wakeups are possible. I believe it is valid behaviour for select() to return 1 and read() to return zero without the device having disappeared.

I wonder if using poll() rather than select() might help, as that would in theory allow the kernel to provide a POLLERR event if the device is disconnected.

Either way, there is the question of how to report the problem.

RETURN_FAIL("Device disconnected") would not be valid. That signals that there is an OS-reported error pending, which there isn't. I still think that the read() should be reporting an error.

Interestingly, I just found this thread, where someone proposed patching the kernel usb serial code to return ENODEV when a device is disconnected:

https://www.mail-archive.com/linux-usb@vger.kernel.org/msg68542.html

The response was that this is unnecessary because subsequent calls should already return EIO!
Comment 8 Phil 2021-07-04 15:32:32 CEST
You are right, "Device disconnected" coming from libserialport is a bad idea. I was thinking of raising a read error which my app would (potentially incorrectly) interpret as disconnected, like I do in the windows case.

I'm not sure what the correct solution is, but my workaround seems to be good enough for me for now.
Comment 9 Rock12 2023-08-04 11:17:52 CEST
"Your acoustic guitar tutorials are fantastic! I've been following them religiously and seeing real progress in my playing." https://theacousticguitar.com/best-left-handed-acoustic-guitar/