parallel: unbreak and improve "word" de-multiplexing
The previous implementation prepared but never fully enabled the
accumulation of several multi-bit items into words that span multiple
bus cycles (think: address or data de-multiplexing on memory busses).
Complete the accumulation, and fixup the end samplenumber for word
annotations. Fixup the endianess logic (the condition was inverted).
Rephrase calculation to be more Python idiomatic.
Default to word size zero, and only emit word annotations for non-zero
word size specs. This keeps the implementation backwards compatible and
still passes the test suite. Default behaviour is most appropriate for
interactive use in GUI environments, while automated processing will
find consistent behaviour across all setups (non-multiplexed busses, and
multiplexed busses with "words" that span one or multiple cycles).