Bug 343

Summary: USB issues on Windows with e.g. fx2lafw, causes frontend hangs
Product: libsigrok Reporter: Uwe Hermann <uwe>
Component: Common: USB handling codeAssignee: Nobody <nobody>
Status: RESOLVED FIXED    
Severity: blocker CC: boris.gjenero, cimba, martin-sigrokbugs, vlad-mbx
Priority: Highest    
Version: unreleased development snapshot   
Target Milestone: ---   
Hardware: All   
OS: Windows   
Attachments: libusb-fx2lafw-win-capture-log
Acquire event waiters lock while calling libusb_wait_for_event().
libusb-fx2lafw-win-capture-log-2
Revise interaction between main and USB waiting threads.
Revise interaction between main and USB waiting threads

Description Uwe Hermann 2014-04-14 18:50:50 CEST
There seem to be issues on Windows when using (e.g.) fx2lafw based logic analyzers (Saleae Logic, others). When running PulseView for example, multiple runs will enventually lead to a hang of the software and apparently it will use up all memory and/or CPU and cause the system to hang, you cannot close the application by clicking the "Close" window icon etc.

It is not yet clear whether this is a libsigrok, USB backend/thread code in libsigrok, libusb, libusb support (or lack thereof or bugs related to it) for FX2 renumeration, fx2lafw firmware, or frontend issue.

It is also not yet clear whether this happens with sigrok-cli as well, or whether other devices than fx2lafw-based ones are affected.

The issue has been observed on Windows XP and Windows 7, several boxes.

More info and traces will be added as they become available.
Comment 1 Boris Gjenero 2014-11-28 05:43:30 CET
There is an Event handle leak somewhere which may be related to this. You can see the handle count in Task Manager if you add the Handles column via View -> Select Columns. Sysinternals Process Explorer can also show Handles but it behaves badly when a process has many handles, and it won't show unnamed handles. Sysinternals Handle http://technet.microsoft.com/en-ca/sysinternals/bb896655.aspx can show the type of handle via "handle -a -p pulseview.exe".

When I first start up PulseView and select CWAV USBee AX, the number of handles is 287. If I run once, for the default 1 M samples at 20 kHz, I end up with 37369 handles. If I repeat this, I get 59309 handles.
Comment 2 Marc 2015-02-20 22:35:36 CET
Is there any progress? This bug renders the windows-version unusable.
Comment 3 Martin Ling 2015-02-25 11:18:21 CET
I've just seen the info here about the event handle leak.

I had a look at the code to see how this might be happening, but at first glance I can't see an obvious reason.

We only call CreateEvent in usb_source_add() in src/usb.c, which is called by dev_acquisition_start() in src/hardware/fx2lafw/api.c.

The corresponding CloseHandle() call is in usb_source_remove(), which is called by finish_acquisition() in src/hardware/fx2lafw/protocol.c.

We can't be calling dev_acquisition_start() that many times, can we? It should be called only once to start the capture.
Comment 4 Vlad Ivanov 2015-02-27 19:03:16 CET
>It is also not yet clear whether this happens with sigrok-cli 
>as well, or whether other devices than fx2lafw-based ones are 
>affected.

I was able to reproduce this bug with sigrok-cli as well (device: lcsoft mini board).
Comment 5 Uwe Hermann 2015-02-27 20:26:37 CET
OK, that's good. That means we can leave Boost and C++11 threads (as used in PulseView) out of the picture.

What remains is MinGW-w64 (gcc) via MXE (which uses "posix" threads via MinGW-w64's "winpthreads" lib) and glib and libusb.
Comment 6 Vlad Ivanov 2015-03-07 14:26:45 CET
Created attachment 122 [details]
libusb-fx2lafw-win-capture-log

I have tried compiling libsigrok fx2lafw driver with libusb_set_debug() call. This log file was produced by capturing 100 samples at 20 kHz rate. I'm not familiar with libusb enough — is this normal that libusb appears to create a file descriptor for every transfer? Are these transfers supposed to be cancelled?
Comment 7 Vlad Ivanov 2015-03-14 12:52:01 CET
Looks like the handle leak is produced due to use of libusb_wait_for_event from a separate thread.

The following call sequence fails:

libsigrok:
	usb.c:187: libusb_wait_for_event(ctx->libusb_ctx, NULL)

libusb:
	io.c: libusb_wait_for_event
	threads_windows.c:160: usbi_mutex_unlock(mutex)
	threads_windows.c:67: ReleaseMutex(*mutex) 
		returns non-zero value, and GetLastError is ERROR_NOT_OWNER (0x120)

This happens because libsigrok tries to ReleaseMutex from usb_thread(), but the mutex was created in the main thread. As a result, usbi_cond_intwait returns at threads_windows.c:160 and creates a new event every time being called (threads_windows.c:148). Therefore, either libusb must be initialized from the usb_thread (will require too many changes I think) or libusb_wait_for_event called from the main thread.
Comment 8 Martin Ling 2015-03-16 13:49:06 CET
> Therefore, either libusb must be initialized from the usb_thread (will require
> too many changes I think) or libusb_wait_for_event called from the main thread.

Argh.

The whole point of that USB thread was to run libusb_wait_for_event().

The issue was that the main libsigrok session loop uses g_poll(), but on Windows libusb_get_pollfds() isn't available, so I tried to fudge it by having a separate USB thread running libusb_wait_for_event(), and triggering an event that could be used with g_poll().

So aving libusb_wait_for_event() called from the main thread is a non starter.

Maybe we will have to move the creation of this USB thread into sr_init(), put the libusb init in there, and have it sat waiting until needed. But what are the other implications of that? Will other libusb calls in the main thread then break?
Comment 9 Martin Ling 2015-03-16 15:28:17 CET
Ah - I think the issue is the lack of a call to libusb_lock_event_waiters() before calling libusb_wait_for_event().
Comment 10 Martin Ling 2015-03-16 15:32:00 CET
Created attachment 126 [details]
Acquire event waiters lock while calling libusb_wait_for_event().

Please try with this patch.
Comment 11 Vlad Ivanov 2015-03-16 17:36:29 CET
>libusb_lock_event_waiters(ctx->libusb_ctx);
>libusb_wait_for_event(ctx->libusb_ctx, NULL);
>libusb_unlock_event_waiters(ctx->libusb_ctx);

Just tried this patch, didn't work (acquisition just hangs, but without hanging the frontend). My guess is the thread just stays at WaitForSingleObject(INFINITE) (libusb:thread_windows.c:59).
Comment 12 Vlad Ivanov 2015-03-16 19:38:11 CET
>Will other libusb calls in the main thread then break?

At least mutex- and event-related ones will.
Comment 13 Martin Ling 2015-03-16 20:50:14 CET
(In reply to comment #11)
> >libusb_lock_event_waiters(ctx->libusb_ctx);
> >libusb_wait_for_event(ctx->libusb_ctx, NULL);
> >libusb_unlock_event_waiters(ctx->libusb_ctx);
> 
> Just tried this patch, didn't work (acquisition just hangs, but without
> hanging the frontend). My guess is the thread just stays at
> WaitForSingleObject(INFINITE) (libusb:thread_windows.c:59).

Can you confirm if that's actually what's happening? I can't see why it should block there.
Comment 14 Vlad Ivanov 2015-03-16 22:00:29 CET
Created attachment 127 [details]
libusb-fx2lafw-win-capture-log-2

Looks like I made a wrong assumption, sorry. usb_thread is blocked by g_mutex_lock in usb_callback, and if that helps, GetLastError is ERROR_INVALID_HANDLE. At this point I can't say if this error code is caused by usb_callback (that's just what I see in ollydbg). I have also attached a log file.
Comment 15 Martin Ling 2015-03-17 01:39:13 CET
(In reply to comment #14)
> Looks like I made a wrong assumption, sorry. usb_thread is blocked by
> g_mutex_lock in usb_callback, and if that helps, GetLastError is
> ERROR_INVALID_HANDLE. At this point I can't say if this error code is caused
> by usb_callback (that's just what I see in ollydbg). I have also attached a
> log file.

Sorry, it's not quite clear what you mean - do you mean that usb_callback() is blocked at g_mutex_lock() because usb_thread() has the lock?

That would make sense - I can see now that there's a deadlock condition where usb_thread() gets an event, sets ctx->usb_event and yields, but then gets run again and reacquires ctx->usb_mutex before usb_callback() can do so. If there's no further libusb events, usb_thread() then waits forever while holding the lock.

I'll see about a fix for that.

That doesn't explain the ERROR_INVALID_HANDLE though - can you work out where that's originating?
Comment 16 Martin Ling 2015-03-17 13:20:29 CET
Created attachment 128 [details]
Revise interaction between main and USB waiting threads.

Please give this patch a try, on top of the previous change.

There might be some stupid mistakes in it - I don't even have a working Windows cross-compile setup just now, let alone system to test on. May be able to set that up later. But hopefully at least the idea is clear: using two events for the interlock between threads rather than racing for a mutex. I've also added commentary to all the windows-specific USB code.
Comment 17 Vlad Ivanov 2015-03-17 18:27:55 CET
Created attachment 129 [details]
Revise interaction between main and USB waiting threads

Tested this patch, and, finally, the acquisition works! I haven't noticed any major handle or memory leak. However, there are still problems regarding the capture:

— for some reason only first 163840 us are captured (screenshot: http://i.imgur.com/O0HFbvu.png, capture log: http://pastebin.com/cGNKG3nf)
— capture triggers don't work, they just fire immediately

However, a separate bugs probably should be opened for these issues, especially for the former, since this issue can be reproduced without the patch too.

I'm attaching two combined patches with minor changes.
Comment 18 Uwe Hermann 2015-03-18 17:03:09 CET
I did a quick test with the patch, according to the Task Manager the number of handles no longer seems to explode indeed.

However, the fact that it only gathers a certain number of sample is strange; without the patch I can (sometimes) get larger amounts of samples just fine, it's just that at some point it'll eventually "hang" due to way too many handles etc.

The triggering issue is probably unrelated, yeah, that should be another bug once this one is fully understood and fixed.
Comment 19 Martin Ling 2015-03-18 17:08:52 CET
I think there's probably still some issues with the code - it looks like some events or timers are ending up not being handled, leading to the capture stopping due to timeout errors.

I finally got a Windows native build/test setup working yesterday so I'll see about doing some debugging soon.

Getting closer though.
Comment 20 Uwe Hermann 2015-03-18 17:46:21 CET
Maybe this is still (partially?) related to threading differences. For C++11 support (PulseView) we need to use winpthreads (if using MXE cross-compiles, but I think the same applies to native builds as well), since the C++11 support is currently apparently tied to POSIX threads (and win32 threads are not supported for that in MinGW-w64).

Which means it *might* be possible that some code chunks (in sr, libusb, or MinGW-w64 itself) are using POSIX threads, and others use win32 threads which could result in all kinds of strange issues (just an assumption on my part, though).

Specifically, C++11 has a std::thread get() method IIRC, which can return the "native" thread handle (which would be a POSIX thread in our case), and if that happens to be used directly or indirectly somewhere where a win32 thread is expected...
Now, this specific bug also happens with just sigrok-cli (no C++11 std::thread), but the builds I was testing on are still built with winpthreads (POSIX threads) and the mixup with win32 may or may not also happen in some place there as well, no idea.

I have no idea whether e.g. libusb uses POSIX threads internally or not (on Windows).

Just some ideas for further places to check.
Comment 21 Uwe Hermann 2015-03-24 20:21:32 CET
The latest patch is now merged as f6c30de4b031ab9a8d838c1ad6ff4118b03c0dc4.

It apparently also fixed #328 and it doesn't make the situation worse for fx2lafw on Windows. Neither does it affect any other OSes negatively, so this is OK for mainline.

I'm leaving the bug open for now though, since this does not yet appear to be the "full" fix.
Comment 22 Martin Ling 2015-03-25 13:13:32 CET
OK, the actual fix is here:

https://github.com/martinling/libsigrok/commit/dfb6023bbfb9fb5f8e4fbcc8ca2471542e89523b

When doing your own poll()/select() loop on libusb fds, you are supposed to call libusb_get_next_timeout() to obtain a minimum timeout for the poll. The libsigrok session loop wasn't implementing this. We got away with that on Linux, where libusb doesn't happen to need to use that mechanism. We also get away with it in lots of cases where drivers set short timeouts of their own.

I ran into an API issue while implementing this though - sr_session_iteration() needs access to the libusb context, which is stored in the libsigrok context. But sr_session_new() and sr_session_load() don't take a context pointer so there's nowhere to associate a session with a context.

So the patch is stacked on top of an API change for that, along with two patches I needed to build natively on MinGW64.

We actually have lots of functions that should really be requiring a context pointer if having multiple contexts is ever going to work. But that's a separate issue so I've just dealt with these ones for now.
Comment 23 Martin Ling 2015-03-25 23:19:49 CET
NB sigrok-cli change to accompany the API changes is here:
https://github.com/martinling/sigrok-cli/commit/3d24ca2da53f03585db396e67252a7dcd9e67e6e
Comment 24 Uwe Hermann 2015-03-26 01:08:16 CET
Thanks a lot! Merged/fixed in bb5f61105bac2d7826416472e2168d8f59300aea.

Seems to work fine for me after a quick test, will do some more testing later on (also with various other USB devices on Windows).
Comment 25 Vlad Ivanov 2015-03-26 11:37:17 CET
Works stable with up to 2 MSPS sample rate, doesn't work with higher rates, unfortunately. Maybe it's just me, so we need to wait for other test results.
Comment 26 Martin Ling 2015-03-26 11:58:34 CET
Because there is no significant memory buffer on FX2 devices, the sample rate achievable with fx2lafw has always been heavily dependent on the system, and the efficiency of the whole USB hardware->driver->OS->libusb->libsigrok stack. If it can't consistently maintain a smoothly flowing stream then the capture will fail. It may just be that 2MSPS is all we can easily achieve on your system at the moment.

Feel free to open a separate bug for the performance issue though if you think it's something that could be solved at the libsigrok level.
Comment 27 Vlad Ivanov 2015-03-26 13:48:43 CET
(In reply to comment #26)
> Because there is no significant memory buffer on FX2 devices, the sample
> rate achievable with fx2lafw has always been heavily dependent on the
> system, and the efficiency of the whole USB
> hardware->driver->OS->libusb->libsigrok stack. If it can't consistently
> maintain a smoothly flowing stream then the capture will fail. It may just
> be that 2MSPS is all we can easily achieve on your system at the moment.

I have mentioned this issue only because the same Saleae Logic device works on this system with 24 MSPS (with Saleae software), and, AFAIK, Saleae uses libusb too, at least for the older models like Logic and Logic 16. Also, just tried replacing USB cable and it helped a bit — 6 MSPS speed was achieved. I will test LCSoft board later too, but both devices use almost the same firmware, as far as I understand.
Comment 28 Martin Ling 2015-03-26 14:40:07 CET
(In reply to comment #27)
> I have mentioned this issue only because the same Saleae Logic device works
> on this system with 24 MSPS (with Saleae software), and, AFAIK, Saleae uses
> libusb too, at least for the older models like Logic and Logic 16. Also,
> just tried replacing USB cable and it helped a bit — 6 MSPS speed was
> achieved. I will test LCSoft board later too, but both devices use almost
> the same firmware, as far as I understand.

OK, please open a new bug with the performance issue, including any results you gather.
Comment 29 Uwe Hermann 2015-03-26 15:05:56 CET
I'm seeing issues with certain samplerates too in fx2lafw, also on Linux. I don't think this specific instance of a problem is related to the USB performance.

It is possible that my attempt of forward-porting and merging the relatively old DSLogic code based on eightdot's patchset has gone wrong and I broke something in the fx2lafw driver itself (when used with non-DSLogic devices).

If anyone wants to do a review of those changes and/or fix up any regressions introduced there that would be great. Anyway, I think these issues are unrelated to #343.