655 – fx2lafw and saleae-logic16 fail to capture on subsequent runs, at high sample rates

Bug 655 - fx2lafw and saleae-logic16 fail to capture on subsequent runs, at high sample rates

Summary: fx2lafw and saleae-logic16 fail to capture on subsequent runs, at high sample...

Status:	RESOLVED FIXED

Alias:	None

Product:	libsigrok
Classification:	Unclassified
Component:	Common: FX2(LP) handling code (show other bugs)
Version:	unreleased development snapshot
Hardware:	All All

Importance:	Normal normal
Target Milestone:	---
Assignee:	Nobody

URL:
Keywords:

Duplicates (1):	904 (view as bug list)
Depends on:
Blocks:

Reported:	2015-09-14 17:42 CEST by Jim Paris
Modified:	2018-01-17 18:00 CET (History)
CC List:	6 users (show)

Attachments
Log of running sigrok-cli 9 times with different sample rates (34.04 KB, application/octet-stream) 2015-09-14 17:42 CEST, Jim Paris	Details
log of running sigrok-cli multiple times on fx2lafw (40.73 KB, text/plain) 2015-11-02 10:57 CET, joefitz	Details
log of running sigrok-cli multiple times on saleae-logic16 (42.75 KB, text/x-log) 2015-11-02 11:07 CET, joefitz	Details
Don't release NACK ALL until back in AUTOIN mode. (302 bytes, patch) 2018-01-17 01:11 CET, Braiden Kindt	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Jim Paris 2015-09-14 17:42:28 CEST

Created attachment 157 [details]
Log of running sigrok-cli 9 times with different sample rates

I have a generic FX2LP-based Saleae clone logic analyzer.  It works with sigrok and the fx2lafw firmware, but at higher sample rates, it will work on the first run, but not on subsequent runs.

For example, if I capture at 1 MHz it works every time, but at 24 MHz it will work the first time but not capture any data the next time.  Furthermore, if I capture at 24 MHz and try 1 MHz next, it will fail, but then work for later invocations.

Debug output is attached.  I ran sigrok-cli 9 times in a row:

capture @ 1 MHz -> ok
capture @ 1 MHz -> ok
capture @ 24 MHz -> ok
capture @ 24 MHz -> failed
capture @ 24 MHz -> failed
capture @ 24 MHz -> failed
capture @ 1 MHz -> failed
capture @ 1 MHz -> ok
capture @ 1 MHz -> ok

So a workaround seems to be to just run a dummy capture at a low sample rate before running the real capture at a high sample rate.

Comment 1 joefitz 2015-11-02 10:57:02 CET

Created attachment 178 [details]
log of running sigrok-cli multiple times on fx2lafw

First attachment, confirming the same behaviour Jim reported

Comment 2 joefitz 2015-11-02 11:07:58 CET

Created attachment 179 [details]
log of running sigrok-cli multiple times on saleae-logic16

Second attachment - showing the exact same behavior on the Logic 16 (12.5MHz is max rate with 16 channels enabled)

Although they both use the fx2, i think this indicates that the problem is not in sigrok-firmware-fx2law but further up the stack.

It's also not isolated to sigrok-cli, since the same problem affects pulseview.

sigrok-cli 0.6.0-git-1ef118a

Using libsigrok 0.4.0-git-211cc97 (lib version 2:0:0).
Using libsigrokdecode 0.4.0-git-76a4498 (lib version 2:0:0).

Comment 3 Wolfgang Ocker 2015-11-13 16:48:28 CET

I see these problems, too.

latest sources from the sigrok repos, fedora 22, Saleae clone and original USBeeSX

Comment 4 Stefan Brüns 2017-12-29 04:51:48 CET

*** Bug 904 has been marked as a duplicate of this bug. ***

Comment 5 Braiden Kindt 2018-01-16 02:42:30 CET

I spent some time looking into this, so I'll dump what I learned:

1) The bug appears to be related to the number of simulations bulk transfers created by libsigrok. At higher frequencies, more transfers are used, and that make it more likely to trigger the bug (in firmware?).

As a work-around, but not a fix, I reduced the number of NUM_SIMUL_TRANSFERS, and the issue is mostly gone. Still maybe 1/20 transfers at 24mhz fail with no data.

diff --git a/src/hardware/fx2lafw/protocol.h b/src/hardware/fx2lafw/protocol.h
index 1403c8ab..688df617 100644
--- a/src/hardware/fx2lafw/protocol.h
+++ b/src/hardware/fx2lafw/protocol.h
@@ -36,8 +36,8 @@
 #define NUM_TRIGGER_STAGES     4
 
 #define MAX_RENUM_DELAY_MS     3000
-#define NUM_SIMUL_TRANSFERS    32
-#define MAX_EMPTY_TRANSFERS    (NUM_SIMUL_TRANSFERS * 2)
+#define NUM_SIMUL_TRANSFERS    4
+#define MAX_EMPTY_TRANSFERS    (NUM_SIMUL_TRANSFERS * 4)
 
 #define NUM_CHANNELS           16
 
2) It appears that the high number of simulations transfers triggers a bug in sigrok-firmware-fx2lafw that leaves the device in a bad state when a transfer ends.

I'm mostly sure that bug at then of of a transfer, as opposed to triggering a bug at the start of transfers. Verified by repeated testing transfers alternating between cable unplugs plus patched and unpatched libsigrok. e.g. running 1 capture with patched binary followed by 1 from unpatched binary failed at same 1/20 rate as patch did in pulseview. The opposite did not.

3) I've tired to dig through sigrok-firmware-fx2lafw, but there's a lot of datasheet to get through; here's what I ended up with:

diff --git a/gpif-acquisition.c b/gpif-acquisition.c
index 7d3dcb6..f63b8b8 100644
--- a/gpif-acquisition.c
+++ b/gpif-acquisition.c
@@ -263,7 +263,7 @@ void gpif_poll(void)
                SYNCDELAY();
 
                /* Reset EP2. */
-               FIFORESET = 0x02;
+               FIFORESET = 0x82;
                SYNCDELAY();
 
                /* Return to auto mode. */

Post this match I've seen zero instances of no data. But I do still see some buffer overruns in firmware stop gpif workflow in first second or so, if it gets past the first few seconds it works great.

On this patch;

TRM Ref F: has an example for resetting auto-out endpoints, and it matches exactly gpif-acquisition.c before my patch. But... could it be a typo? Why would you write 0x02 which would release NAK ALL, and then two lines later release NAK ALL again? 

Presumably its not safe to release NAK ALL until the endpoint is back in auto-in?

Comment 6 Braiden Kindt 2018-01-17 01:11:20 CET

Created attachment 378 [details]
Don't release NACK ALL until back in AUTOIN mode.

I _think_ that writing 0x02 releases the NACK ALL before putting the device back in auto mode. This could result in one or more bulk in from host stealing buffer from endpoint being reset?

Comment 7 Uwe Hermann 2018-01-17 18:00:25 CET

Hi, as far as we can tell/test this issue should be fixed in cfeb1a3602f587eb981c4ce1f9ff368997f4e0e0, thanks a lot to Stefan Bruens, Piotr Esden-Tempski and others for investigating, testing and fixing the issue!

Closing this issue for now, but please everyone test this new version on as many different PCs and laptops as possible, since previous iterations of the patches have shown issues on some boxes but worked fine on others.

If this specific issue is still happening for some systems, please feel free to reopen the bug.