Difference between revisions of "File format:Csv"

From sigrok

Revision as of 07:06, 26 October 2019

csv
Name: CSV
Status: supported
Source code (in): csv.c
Source code (out): csv.c
Common extension(s): .csv
MIME type: text/csv, text/csv;header
ASCII format: yes
Compression: none
Website: [1]

File format:CSV

CSV is the abbreviation for comma-separated values. It is a text file format where data is arranged in a tabular representation. CSV files were traditionally used with spreadsheet software, but have also been used as an import and export format for signal analysis software.

In the context of signal analysis, text rows typically relate to time (one text line per point in time, so the line count equals the number of sample sets taken), while columns contain the values within a sample set (one value per column). By convention, column captions can be kept in a so-called header line.
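
As a minimal sketch of this layout (Python, standard library only; the function name `read_columns` and the sample text are made up for illustration and are not part of sigrok), the following reads a small table with a header line and collects one list of values per column:

```python
import csv
import io

def read_columns(text):
    """Return {caption: [values...]} for CSV text which starts with a header line."""
    rows = csv.reader(io.StringIO(text))
    captions = next(rows)                     # header line: column captions
    columns = {name: [] for name in captions}
    for row in rows:                          # one text line per point in time
        for name, cell in zip(captions, row):
            columns[name].append(float(cell)) # one value per column
    return columns

# Hypothetical sample data: a time column and two value columns.
sample = "time,ch1,ch2\r\n0.000,25.0,50.0\r\n0.001,26.0,51.0\r\n"
print(read_columns(sample))
```

Each dictionary entry then holds the values of one column over time, which is the shape that signal analysis software typically wants.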

See RFC 4180 for a formal specification, and the Wikipedia article on comma-separated values.

Format

The text files usually contain all-printable characters, and use CR/LF for text line termination. Some software versions also accept alternative line termination sequences, like LF only, or CR only.
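
A reader which wants to accept all three line termination variants could look at the first line break in the input. The following is only a sketch of that idea in Python, not the sigrok implementation; the function names and the CR/LF fallback for input without any line break are assumptions:

```python
def detect_line_ending(data: bytes) -> bytes:
    """Guess the line terminator from the first line break in the data."""
    for i, byte in enumerate(data):
        if byte == 0x0D:   # CR: either CR/LF or a lone CR
            return b"\r\n" if data[i + 1:i + 2] == b"\n" else b"\r"
        if byte == 0x0A:   # LF only
            return b"\n"
    return b"\r\n"         # assumption: fall back to CR/LF when no break is seen

def split_lines(data: bytes):
    """Split data into text lines, tolerating an unterminated last line."""
    lines = data.split(detect_line_ending(data))
    if lines and lines[-1] == b"":
        lines.pop()        # drop the empty remainder after the final terminator
    return lines

print(split_lines(b"1,0\r\n0,1\r\n"))
print(split_lines(b"1,0\n0,1"))
```

The sketch assumes that a file uses one termination sequence consistently. Files with mixed line endings would need a more careful approach.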

Columns traditionally are separated by commas. Over time other separators came into use as well, and application software may optionally support them.

Numbers are communicated without specific marking (example: just "123.45" or "-12"). Floating point numbers may use exponential representation (example: "100.0E-3"). Text may or may not be enclosed in a pair of double quotes. There can be other data types for cells like dates, currency, etc., which are not discussed here.
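
To illustrate the number and text conventions, here is a small Python sketch which classifies a single cell. The helper name `parse_cell` is hypothetical, and treating every quoted cell as text is an assumption made for this example:

```python
def parse_cell(cell: str):
    """Return a float for numeric cells, otherwise the (unquoted) text."""
    text = cell.strip()
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        return text[1:-1]      # quoted cells are kept as text, even if numeric
    try:
        return float(text)     # accepts "123.45", "-12", "100.0E-3"
    except ValueError:
        return text            # anything else is unquoted text

for cell in ['123.45', '-12', '100.0E-3', '"label"', 'label']:
    print(repr(cell), '->', repr(parse_cell(cell)))
```

Note that Python's `float()` also accepts spellings like "inf" or "nan", which a stricter reader may want to reject.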

Some applications support the notion of comments, which span from a comment leader (usually semi-colon) up to the end of the line. Applications may or may not accept "trailing" comments after data, or require that comments span a complete line on their own.
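
The two comment policies can be sketched as follows (Python; the function name and the `allow_trailing` switch are assumptions for illustration). The sketch deliberately ignores the case of a comment leader inside a quoted field:

```python
def strip_comment(line: str, leader: str = ";", allow_trailing: bool = True):
    """Return the data part of a line, or None for a comment-only line."""
    pos = line.find(leader)
    if pos < 0:
        return line.rstrip()          # no comment in this line
    head = line[:pos].rstrip()
    if head == "":
        return None                   # the comment spans the complete line
    if not allow_trailing:
        raise ValueError("comment after data is not accepted")
    return head                       # data with the trailing comment removed

print(strip_comment("1,0,1 ; trailing note"))
print(strip_comment("; full-line comment"))
```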

Since there is a great variety of software and platforms, it's best to generate output files conservatively. When reading, accept the minimum feature set, plus optionally a few variants which are popular in the field.
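
One possible reading of "conservative output" is: comma separators, CR/LF line termination, and quotes only where a cell requires them. A sketch using Python's standard `csv` module (the function name `write_conservative` is made up for this example):

```python
import csv
import io

def write_conservative(rows):
    """Render rows as CSV text with commas, CR/LF, and minimal quoting."""
    buffer = io.StringIO(newline="")      # keep line endings exactly as written
    writer = csv.writer(buffer, delimiter=",", lineterminator="\r\n",
                        quoting=csv.QUOTE_MINIMAL)
    writer.writerows(rows)
    return buffer.getvalue()

print(repr(write_conservative([["a", "b"], [1, 0], [0, 1]])))
```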

Properties

  • The file size is not limited, neither is the line length, nor the acceptable set of used characters. (Specific applications may be limited, but the file format itself is not.)
  • Line termination sequences, column separators, and comment leaders have standard values (CR/LF, comma, semi-colon), but users may want to override these.
  • There is no reliable condition to determine that a file contains CSV data. The .csv file extension is a convention but need not be used. Application software need not have access to "a filename" either. This weakens the support for automatic detection of the file format.

Implementation

It's important to note that the sigrok project implements support for CSV formatted data as an input format as well as an output format. These are separate code paths, and need not be symmetric nor identical in their feature set or default behaviour. The input path needs to accept text and translate it to the internal .sr representation, while the output path dumps the internally available .sr representation to a text file, so not all concepts translate equally well on either side. Aside from that conceptual difference, other factors like available time and manpower may influence the completeness or reliability of the individual code paths: not all of them are touched at the same moment by the same persons addressing the same requirements.

CSV input module

  • There is the current implementation in mainline with these features:
    • Only logic data is supported.
    • Either a single column contains all logic data. A multi-bit value in either binary, octal, or hexadecimal format spans the corresponding number of logic channels. Users need to specify the column number and channel count, the number format is user adjustable and defaults to binary.
    • Or a set of adjacent columns contains all logic data (so-called "multi-column mode"). Each column holds a single-bit value for one logic channel, respectively. The first column number must be specified, the channel count is optional and defaults to the remaining number of columns in the text line, the number format does not apply to single-bit data (binary fits as an internal implementation detail).
    • Multi-column mode is the default, spanning all columns of an input file.
    • A number of text lines at the top of the file can get skipped when necessary. The default is to process the complete file.
    • An optional header line can be used to determine channel names (an exclusive feature to multi-column mode, off by default, simple channel names get assigned when the feature is off or the column lacks input text).
    • The line termination gets auto-detected. The column separator and comment leader are user adjustable.
    • Blank lines and comment-only lines get skipped (but are counted where line numbers get referenced).
    • Users can specify a samplerate (by means of options, outside of the file), by default no samplerate can get derived.
    • There are workarounds for odd platforms/applications which don't terminate the last text line in the file, or which prepend a byte order mark (BOM) to specify which one of one bytes goes first (useless use of BOM award).
  • A candidate implementation for future integration in mainline amends the above feature set:
    • Single- and multi-column modes are available as described above, multi-column is the default.
    • A more generic "column formats" feature was added which can express the above modes in a backwards compatible way. It also allows the use of single- and multi-bit data in any number format and any order of text columns, analog input data (so-called "mixed signal" sources), and optional timestamps which allow the automatic determination of the input data's samplerate in simple cases.
    • User visible options were renamed for consistency with other parts of the project. Examples and existing external scripts need adjustment (interactive use in GUI applications should not be affected).
    • Automatic file format detection was added, which enables the use of simple multi-column data files without header lines, without the need for user intervention or option specification.
    • "Funnies" like double double quotes (escapes), embedded CRLF in fields enclosed in double quotes, etc are not supported.

List of options and builtin help text for the CSV input module:

NOTE! This screen capture is for the new, amended implementation.

 $ sigrok-cli -I csv --show
 ID: csv
 Name: CSV
 Description: Comma-separated values  
 Options:
   column_formats: Text columns data types. A comma separated list of [<cols>]<fmt>[<bits>] items. * for all remaining columns. - ignores columns, x/o/b/l logic data, a (and digits) analog data, t timestamps. (default )
   single_column: Simple single-column mode, exclusively use text from the specified column (number starting at 1). Obsoleted by 'column_formats=4-,x16'. (default 0)
   first_column: First column with logic data in simple multi-column mode (number starting at 1, default 1). Obsoleted by 'column_formats=4-,*l'. (default 1)
   logic_channels: Logic channel count, required in simple single-column mode, defaults to "all remaining columns" in simple multi-column mode. Obsoleted by 'column_formats=8l'. (default 0)
   single_format: The input text number format of simple single-column mode: bin, hex, oct. Obsoleted by 'column_formats=x8'. (default 'bin', possible values 'bin', 'hex', 'oct')
   start_line: The line number at which to start processing input text (default: 1). (default 1)
   header: Use the first processed line's column captions (when available) as channel names. Off by default (default false)
   samplerate: The input data's sample rate in Hz. No default value. (default 0)
   column_separator: The sequence which separates text columns. Non-empty text, comma by default. (default ',')
   comment_leader: The text which starts comments at the end of text lines, semicolon by default. (default ';')
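
The `[<cols>]<fmt>[<bits>]` item syntax from the `column_formats` help text can be illustrated with a small parser. This is a Python sketch written from the help text alone; it is not the libsigrok parser, and the function name, the tuple layout, and the use of `None` for an unspecified width are assumptions:

```python
import re

# Optional column count (digits or "*"), one format letter, optional width.
ITEM = re.compile(r"(\d+|\*)?([-xoblat])(\d+)?")

def parse_column_formats(spec: str):
    """Split a spec like '-,2a,l,x4' into (columns, format, width) tuples."""
    items = []
    for text in spec.split(","):
        match = ITEM.fullmatch(text.strip())
        if match is None:
            raise ValueError(f"unsupported column format item: {text!r}")
        cols, fmt, width = match.groups()
        if cols is None:
            count = 1                 # no count given: a single column
        elif cols == "*":
            count = "*"               # all remaining columns
        else:
            count = int(cols)
        items.append((count, fmt, int(width) if width else None))
    return items

print(parse_column_formats("t,2a,l,a,x4,a,-,b3"))
```

For the mixed signal spec shown here, the sketch yields one timestamp column, two analog columns, a single-bit logic column, and so on, in text column order.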

CSV output module

  • TODO

List of options and builtin help text for the CSV output module:

 $ sigrok-cli -O csv --show
 ID: csv
 Name: CSV
 Description: Comma-separated values  
 Options:
   gnuplot: gnuplot script file name (default )
   scale: Scale gnuplot graphs (default true)
   value: Character to print between values (default ',')
   record: String to print between records (default '\n')
   frame: String to print between frames (default '\n')
   comment: String used at start of comment lines (default ';')
   header: Output header comment with capture metdata (default true)
   label: Type of column labels (default 'units', possible values 'units', 'channel', 'off')
   time: Output sample time as column 1 (default true)
   trigger: Output trigger indicator as last column  (default false)
   dedup: Set to false to output duplicate rows (default false)

Examples

CSV input module

NOTE! All examples are for the "new, to become integrated" implementation. I was just too lazy to create examples for the "old, to become obsolete" version.

TODO Check for completeness, and correctness.

Simple multi-column data without header (the default format of the input module).

 $ cat simple-multi-column-no-header.csv
 1,0,1,0,1,1
 0,1,0,1,0,1
 1,0,1,0,1,1
 $ sigrok-cli -i simple-multi-column-no-header.csv
 libsigrok 0.6.0-git-b99c8ecdec45
 Acquisition with 6/6 channels
 0:101
 1:010
 2:101
 3:010
 4:101
 5:111
 (also works when the input module is selected while no options get specified)
 $ sigrok-cli -I csv -i simple-multi-column-no-header.csv
 libsigrok 0.6.0-git-b99c8ecdec45
 Acquisition with 6/6 channels
 0:101
 1:010
 2:101
 3:010
 4:101
 5:111
 (being explicit this time, and using a more general format spec, the default format in that case)
 $ sigrok-cli -I csv:column_formats='*l' -i simple-multi-column-no-header.csv
 libsigrok 0.6.0-git-b99c8ecdec45
 Acquisition with 6/6 channels
 0:101
 1:010
 2:101
 3:010
 4:101
 5:111
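
The relation between the text rows and the per-channel output above is a plain transposition: every row is one point in time, every column is one channel. A Python sketch which reproduces the bit strings from the sample file (the function name is made up; this mimics the display only and says nothing about how sigrok-cli is implemented):

```python
def channel_traces(text: str, separator: str = ","):
    """Return one bit string per column, in the style of the output above."""
    rows = [line.split(separator) for line in text.splitlines() if line.strip()]
    # zip(*rows) turns rows (points in time) into columns (channels).
    return ["".join(bits) for bits in zip(*rows)]

data = "1,0,1,0,1,1\n0,1,0,1,0,1\n1,0,1,0,1,1\n"
for index, trace in enumerate(channel_traces(data)):
    print(f"{index}:{trace}")
```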

Simple multi-column data with header (users need to specify options, which involves selecting an input module).

 $ cat simple-multi-column-with-header.csv
 a,b,c,d,e,f
 1,0,1,0,1,1
 0,1,0,1,0,1
 1,0,1,0,1,1
 $ sigrok-cli -I csv:header=yes -i simple-multi-column-with-header.csv
 libsigrok 0.6.0-git-b99c8ecdec45
 Acquisition with 6/6 channels
 a:101
 b:010
 c:101
 d:010
 e:101
 f:111
 (using the more general format spec) 
 $ sigrok-cli -I csv:header=yes:column_formats="*l" -i simple-multi-column-with-header.csv
 libsigrok 0.6.0-git-b99c8ecdec45
 Acquisition with 6/6 channels
 a:101
 b:010
 c:101
 d:010
 e:101
 f:111

Simple single-column data with hex numbers (needs module and options specs):

 $ cat simple-single-column-number-formats.csv
 x,y,8,z,1000
 x,y,d,z,1101
 x,y,7,z,0111
 x,y,2,z,0010
 $ sigrok-cli -I csv:single_column=3:single_format=hex:logic_channels=4 -i simple-single-column-number-formats.csv
 libsigrok 0.6.0-git-b99c8ecdec45
 Acquisition with 4/4 channels
 0:0110
 1:0011
 2:0110
 3:1100
 (using the more general format spec) 
 $ sigrok-cli -I csv:column_formats=2-,x4 -i simple-single-column-number-formats.csv
 libsigrok 0.6.0-git-b99c8ecdec45
 Acquisition with 4/4 channels
 0:0110
 1:0011
 2:0110
 3:1100
 (also works with other number formats)
 $ sigrok-cli -I csv:column_formats=4-,b4 -i simple-single-column-number-formats.csv
 libsigrok 0.6.0-git-b99c8ecdec45
 Acquisition with 4/4 channels
 0:0110
 1:0011
 2:0110
 3:1100
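
How a multi-bit value spans several logic channels can be checked by hand. The Python sketch below expands each cell into bits, with channel 0 taking the least significant bit, which matches the output shown above. The function names are hypothetical and the code is an illustration, not the input module's implementation:

```python
def expand_value(cell: str, base: int, channels: int):
    """Return the bits of one cell, least significant bit first."""
    value = int(cell, base)
    return [(value >> bit) & 1 for bit in range(channels)]

def single_column_traces(cells, base, channels):
    """Return one bit string per channel for a column of multi-bit values."""
    samples = [expand_value(cell, base, channels) for cell in cells]
    return ["".join(str(sample[ch]) for sample in samples)
            for ch in range(channels)]

# Column 3 (hex) and column 5 (binary) of the sample file carry the same values.
print(single_column_traces(["8", "d", "7", "2"], 16, 4))
print(single_column_traces(["1000", "1101", "0111", "0010"], 2, 4))
```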

Mixed signal data in arbitrary order. Timestamps and automatic samplerate.

 $ cat mixed-signal-data.csv
 time,ch1,ch2,logic,ch3,gray4,ch4,ignore,bits3
 0.000,25.00,50.00,0,75.00,0,0.00,0,000
 0.001,26.00,51.00,1,76.00,1,1.00,1,001
 0.002,27.00,52.00,0,77.00,3,2.00,2,010
 0.003,28.00,53.00,1,78.00,2,3.00,3,011
 0.004,29.00,54.00,0,79.00,6,4.00,4,100
 0.005,30.00,55.00,1,80.00,7,5.00,5,101
 0.006,31.00,56.00,0,81.00,5,6.00,6,110
 0.007,32.00,57.00,1,82.00,4,7.00,7,111
 0.008,33.00,58.00,0,83.00,c,8.00,8,000
 0.009,34.00,59.00,1,84.00,d,9.00,9,001
 (grab the data, ignore the first column)
 $ sigrok-cli -I csv:header=yes:column_formats=-,2a,l,a,x4,a,-,b3 -i mixed-signal-data.csv | head -n 8
 libsigrok 0.6.0-git-b99c8ecdec45
 Acquisition with 8/12 channels
 ch1: 25.000
 ch1: 26.000
 ch1: 27.000
 ch1: 28.000
 ch1: 29.000
 ch1: 30.000
 (user specified samplerate)
 $ sigrok-cli -I csv:header=yes:column_formats=-,2a,l,a,x4,a,-,b3:samplerate=8000 -i mixed-signal-data.csv | head -n 8
 META samplerate: 8000
 libsigrok 0.6.0-git-b99c8ecdec45
 Acquisition with 8/12 channels at 8 kHz
 ch1: 25.000
 ch1: 26.000
 ch1: 27.000
 ch1: 28.000
 ch1: 29.000
 (automatic samplerate)
 $ sigrok-cli -I csv:header=yes:column_formats=t,2a,l,a,x4,a,-,b3 -i mixed-signal-data.csv
 META samplerate: 1000
 libsigrok 0.6.0-git-b99c8ecdec45
 Acquisition with 8/12 channels at 1 kHz
 ch1: 25.000
 ch1: 26.000
 ...
 ch1: 33.000
 ch1: 34.000
 ch2: 50.000
 ch2: 51.000
 ...
 ch2: 58.000
 ch2: 59.000
 ch3: 75.000
 ch3: 76.000
 ...
 ch3: 83.000
 ch3: 84.000
 ch4: 0.000
 ch4: 1.000
 ...
 ch4: 8.000
 ch4: 9.000
 logic:01010101 01
 gray4[0]:01100110 01
 gray4[1]:00111100 00
 gray4[2]:00001111 11
 gray4[3]:00000000 11
 bits3[0]:01010101 01
 bits3[1]:00110011 00
 bits3[2]:00001111 00
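
For the simple case of evenly spaced timestamps, the samplerate is the reciprocal of the time step: a step of 0.001 s corresponds to 1000 Hz, as reported above. A Python sketch of such a derivation follows. The function name, the tolerance value, and the choice to give up on uneven spacing are assumptions and may differ from what the input module does:

```python
def samplerate_from_timestamps(timestamps, tolerance=1e-6):
    """Return the rate in Hz for evenly spaced timestamps, else None."""
    if len(timestamps) < 2:
        return None
    steps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    step = (timestamps[-1] - timestamps[0]) / len(steps)   # average interval
    if step <= 0:
        return None
    # Reject data where any interval deviates from the average.
    if any(abs(s - step) > tolerance * step for s in steps):
        return None
    return round(1.0 / step)

times = [0.000, 0.001, 0.002, 0.003, 0.004,
         0.005, 0.006, 0.007, 0.008, 0.009]
print(samplerate_from_timestamps(times))
```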

Comments, empty lines, skipped lines.

 $ cat -n comments-empty-skipped.csv  
  1  These lines neither are comments 
  2  nor are they header nor data lines.
  3  It's some introductory text, captions,
  4  or whatever -- let's not process that.
  5
  6  ; comments get trimmed and skipped out of the box
  7  ; as are empty lines like above and below
  8
  9  ; yet another comment
 10  1,0,1,0
 11  0,1,0,1
 12  1,0,1,0
 $ sigrok-cli -I csv:start_line=5 -i comments-empty-skipped.csv
 libsigrok 0.6.0-git-b99c8ecdec45
 Acquisition with 4/4 channels
 0:101
 1:010
 2:101
 3:010
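
The interplay of `start_line`, blank lines, and comments can be sketched as follows (Python; the function name is made up, and the logic follows the description on this page rather than the module's source code). Skipped lines still count, so reported line numbers match what `cat -n` shows:

```python
def data_lines(text: str, start_line: int = 1, comment_leader: str = ";"):
    """Yield (line_number, data) for lines which carry data."""
    for number, line in enumerate(text.splitlines(), start=1):
        if number < start_line:
            continue                       # skipped, but still counted
        data = line.split(comment_leader, 1)[0].strip()
        if data:                           # drop blank and comment-only lines
            yield number, data

sample = "\n".join([
    "These lines neither are comments",
    "nor are they header nor data lines.",
    "It's some introductory text, captions,",
    "or whatever -- let's not process that.",
    "",
    "; comments get trimmed and skipped out of the box",
    "; as are empty lines like above and below",
    "",
    "; yet another comment",
    "1,0,1,0",
    "0,1,0,1",
    "1,0,1,0",
])
for number, data in data_lines(sample, start_line=5):
    print(number, data)
```

Without `start_line=5` the introductory text in lines 1 to 4 would be treated as data, which is why the option is needed for this file.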

CSV output module

TODO

Resources