Difference between revisions of "File format:Csv"

From sigrok
m (Gsi moved page File format:CSV to File format:Csv without leaving a redirect: move CSV to csv since that's the file format page's link's name)
(update CSV input discussion, the "new" version has gone mainline)
 
(One intermediate revision by one other user not shown)
Line 9: Line 9:
| is_ascii        = yes
| is_ascii        = yes
| compression      = none
| compression      = none
| website          = [https://tools.ietf.org/rfc/rfc4180.txt]
| website          = [https://tools.ietf.org/rfc/rfc4180.txt tools.ietf.org]
}}
}}


Line 16: Line 16:
In the context of signal analysis, text rows typically relate to time (the number of sample sets taken, one text line per point in time), columns contain the values within a sample set (one value per column). There is a convention that column captions can be kept in a so-called header line.
In the context of signal analysis, text rows typically relate to time (the number of sample sets taken, one text line per point in time), columns contain the values within a sample set (one value per column). There is a convention that column captions can be kept in a so-called header line.


See [https://tools.ietf.org/rfc/rfc4180.txt RFC4180] for a formal specification, and the [https://en.wikipedia.org/wiki/Comma-separated_values Wikipedia Comma-separated values].
See [https://tools.ietf.org/rfc/rfc4180.txt RFC4180] for a formal specification, and the [https://en.wikipedia.org/wiki/Comma-separated_values Wikipedia Comma-separated values] article.


== Format ==
== Format ==
Line 32: Line 32:
== Properties ==
== Properties ==


* The file size is not limited, neither is the line length, nor the acceptable set of used characters. (Specific applications may be limited, but the file format itself is not.)
* The file size is not limited, neither is the line length, nor the acceptable set of used characters (specific applications may be limited, but the file format itself is not).
* Line termination sequences, column separators, and comment leaders have standard values (CR/LF, comma, semi-colon), but users may want to override these.
* Line termination sequences, column separators, and comment leaders have standard values (CR/LF, comma, semi-colon), but users may want to override these.
* There is no reliable condition to determine that a file contains CSV data. The '''.csv''' file extension is a convention but need not be used. Application software need not have access to "a filename" either. This weakens the support for automatic detection of the file format.
* There is no reliable condition to determine that a file contains CSV data. The '''.csv''' file extension is a convention but need not be used. Application software need not have access to "a filename" either. This weakens the support for automatic detection of the file format.
Line 42: Line 42:
=== CSV input module ===
=== CSV input module ===


* There is the current implementation in mainline with these features:
The input module for the CSV file format implements these features:
** Only logic data is supported.
* Supports logic data, analog data, and timestamps.
** Either a single column contains all logic data. A multi-bit value in either binary, octal, or hexadecimal format spans the corresponding number of logic channels. Users need to specify the column number and channel count, the number format is user adjustable and defaults to binary.
* Logic data can use individual columns per data bit (the default), or can combine multiple bits in a column. For multi-bit columns the user can specify which number representation applies (binary, octal, hexadecimal) and how many bits the column contains.
** Or a set of adjacent columns contains all logic data (so-called "multi-column mode"). Each column holds a single-bit value for one logic channel, respectively. The first column number must be specified, the channel count is optional and defaults to the remaining number of columns in the text line, the number format does not apply to single-bit data (binary fits as an internal implementation detail).
* Analog values always occupy one column per value.
** Multi-column mode is the default, spanning all columns of an input file.
* The order of columns and their data types are up to the user, but must be specified when they differ from the default case. Even for the default layout it is a good idea to specify the format for the purpose of being explicit to avoid surprises, and to become self documenting.
** A number of text lines at the top of the file can get skipped when necessary. The default is to process the complete file.
** The default case is all-logic data with one bit per column in all columns of a text line ("multi-column mode") for backwards compatibility.
** An optional header line can be used to determine channel names (an exclusive feature to multi-column mode, off by default, simple channel names get assigned when the feature is off or the column lacks input text).
** "Simple" layouts can be described in a backwards compatible style with specific keywords, but the generic format specifier is most flexible and feature complete, and is considered to be as accessible to users and readable, so it should be preferred.
** The line termination gets auto-detected. The column separator and comment leader are user adjustable.
** For the single column case where a multi-bit value spans the corresponding number of logic channels, users need to specify the column number and channel count, and the number format if it is not binary.
** Blank lines and comment-only lines get skipped (but are counted where line numbers get referenced).
** For multiple adjacent columns which each contain one bit of logic data (so-called "multi-column mode") the first column number must be specified. The channel count is optional and defaults to the remaining number of columns in the text line. The number format does not apply to single-bit data (binary fits as an implementation detail).
** Users can specify a samplerate (by means of options, outside of the file), by default no samplerate can get derived.
** A more generic "column formats" feature can express the above backwards compatible modes. But also allows the use of single- and multi-bit data in any number format and order of text columns, as well as analog input data (so-called "mixed signal" source), as well as optional timestamps which allow the automatic determination of the input data's samplerate for simple cases.
** There are workarounds for odd platforms/applications which don't terminate the last text line in the file, or specify which one of one bytes goes first (useless use of BOM award).
* A number of text lines at the top of the file can get skipped when necessary. The default is to process the complete file.
* An optional header line can be used to determine channel names (an exclusive feature to multi-column mode, off by default, simple channel names get assigned when the feature is off or the column lacks input text).
* The line termination gets auto-detected. The column separator and comment leader are user adjustable.
* Blank lines and comment-only lines get skipped (but are counted where line numbers get referenced).
* Users can specify a samplerate (by means of options, outside of the file). In the absence of a user spec but the presence of timestamps in the file the samplerate will get determined from input data.
* There are workarounds for odd platforms/applications which don't terminate the last text line in the file, or specify which one of one bytes goes first (useless use of BOM award).
* Automatic file format detection is supported, which enables the use of simple multi-column data files without header lines, without the need for user intervention or option specification.
* "Funnies" like double double quotes (escapes), embedded CRLF in fields enclosed in double quotes, etc. are not supported.


* A candidate implementation for future integration in mainline amends the above feature set:
Notice that in 2019-12 user visible option names were renamed for consistency with other parts of the project, examples and existing external scripts need adjustment (interactive use in GUI applications should not notice).
** Single- and multi-column modes are available as described above, multi-column is the default.
** A more generic "column formats" feature was added which can express the above backwards compatible modes. But also allows the use of single- and multi-bit data in any number format and order of text columns, as well as analog input data (so-called "mixed signal" source), as well as optional timestamps which allow the automatic determination of the input data's samplerate for simple cases.
** User visible option names were renamed for consistency with other parts of the project, examples and existing external scripts need adjustment (interactive use in GUI applications should not notice).
** Automatic file format detection was added, which enables the use of simple multi-column data files without header lines, without the need for user intervention or option specification.
** "Funnies" like double double quotes (escapes), embedded CRLF in fields enclosed in double quotes, etc are not supported.


List of options and builtin help text for the CSV input module:
List of options and builtin help text for the CSV input module:


NOTE! This screen capture is for the new, amended implementation.
<small>
 
   $ '''sigrok-cli -I csv --show'''
   $ sigrok-cli -I csv --show
   ID: csv
   ID: csv
   Name: CSV
   Name: CSV
Line 80: Line 81:
     column_separator: The sequence which separates text columns. Non-empty text, comma by default. (default ',')
     column_separator: The sequence which separates text columns. Non-empty text, comma by default. (default ',')
     comment_leader: The text which starts comments at the end of text lines, semicolon by default. (default ';')
     comment_leader: The text which starts comments at the end of text lines, semicolon by default. (default ';')
</small>


=== CSV output module ===
=== CSV output module ===
Line 85: Line 87:
* TODO
* TODO


   $ sigrok-cli -O csv --show
<small>
   $ '''sigrok-cli -O csv --show'''
   ID: csv
   ID: csv
   Name: CSV
   Name: CSV
Line 101: Line 104:
     trigger: Output trigger indicator as last column  (default false)
     trigger: Output trigger indicator as last column  (default false)
     dedup: Set to false to output duplicate rows (default false)
     dedup: Set to false to output duplicate rows (default false)
</small>


== Examples ==
== Examples ==


=== CSV input module ===
=== CSV input module ===
NOTE! All examples are for the "new, to become integrated" implementation. I was just too lazy to create examples for the "old, to become obsolete" version.
TODO Check for completeness, and correctness.


Simple multi-column data without header (the default format of the input module).
Simple multi-column data without header (the default format of the input module).


<small>
   $ cat simple-multi-column-no-header.csv
   $ cat simple-multi-column-no-header.csv
   1,0,1,0,1,1
   1,0,1,0,1,1
Line 148: Line 149:
   4:101
   4:101
   5:111
   5:111
</small>


Simple multi-column data with header (users need to specify options, which involves selecting an input module).
Simple multi-column data with header (users need to specify options, which involves selecting an input module).


<small>
   $ cat simple-multi-column-with-header.csv
   $ cat simple-multi-column-with-header.csv
   a,b,c,d,e,f
   a,b,c,d,e,f
Line 177: Line 180:
   e:101
   e:101
   f:111
   f:111
</small>


Simple single-column data with hex numbers (needs module and options specs):
Simple single-column data with hex numbers (needs module and options specs):


<small>
   $ cat simple-single-column-number-formats.csv
   $ cat simple-single-column-number-formats.csv
   x,y,8,z,1000
   x,y,8,z,1000
Line 211: Line 216:
   2:0110
   2:0110
   3:1100
   3:1100
</small>


Mixed signal data in arbitrary order. Timestamps and automatic samplerate.
Mixed signal data in arbitrary order. Timestamps and automatic samplerate.


<small>
   $ cat mixed-signal-data.csv
   $ cat mixed-signal-data.csv
   time,ch1,ch2,logic,ch3,gray4,ch4,ignore,bits3
   time,ch1,ch2,logic,ch3,gray4,ch4,ignore,bits3
Line 282: Line 289:
   bits3[1]:00110011 00
   bits3[1]:00110011 00
   bits3[2]:00001111 00
   bits3[2]:00001111 00
</small>


Comments, empty lines, skipped lines.   
Comments, empty lines, skipped lines.   


<small>
   $ cat -n comments-empty-skipped.csv   
   $ cat -n comments-empty-skipped.csv   
   1  These lines neither are comments  
   1  These lines neither are comments  
Line 306: Line 315:
   2:101
   2:101
   3:010
   3:010
</small>


=== CSV output module ===
=== CSV output module ===
Line 314: Line 324:


* [https://tools.ietf.org/rfc/rfc4180.txt RFC4180]
* [https://tools.ietf.org/rfc/rfc4180.txt RFC4180]
* [https://en.wikipedia.org/wiki/Comma-separated_values Wikipedia Comma-separated values]
* [https://en.wikipedia.org/wiki/Comma-separated_values Wikipedia: Comma-separated values]


__NOTOC__
__NOTOC__


[[Category:File format]]
[[Category:File format]]

Latest revision as of 20:16, 22 December 2019

csv
Name CSV
Status supported
Source code (in) csv.c
Source code (out) csv.c
Common extension(s) .csv
MIME type text/csv, text/csv;header
ASCII format yes
Compression none
Website tools.ietf.org

CSV is the abbreviation for comma-separated values. It is a text file format where data is arranged in a tabular representation. CSV files were traditionally used with spreadsheet calculation software, but have also been used as an import and export format for signal analysis software.

In the context of signal analysis, text rows typically relate to time (the number of sample sets taken, one text line per point in time), columns contain the values within a sample set (one value per column). There is a convention that column captions can be kept in a so-called header line.

See RFC4180 for a formal specification, and the Wikipedia Comma-separated values article.

Format

The text files usually contain all-printable characters, and use CR/LF for text line termination. Some software versions also accept alternative line termination sequences, like LF only, or CR only.

Columns are traditionally separated by commas, but over time other separators came into use, and these may optionally be supported by application software.

Numbers are communicated without specific marking (example: just "123.45" or "-12"). Floating point numbers may use exponential representation (example: "100.0E-3"). Text may or may not be enclosed in a pair of double quotes. There can be other data types for cells like dates, currency, etc, which shall not be discussed here.

Some applications support the notion of comments, which span from a comment leader (usually semi-colon) up to the end of the line. Applications may or may not accept "trailing" comments after data, or require that comments span a complete line on their own.

Since there is a great variety of software and platforms, it's best to generate output files conservatively, and accept the minimum feature set, and optionally a few variants which are rather popular in the field.

Properties

  • The file size is not limited, neither is the line length, nor the acceptable set of used characters (specific applications may be limited, but the file format itself is not).
  • Line termination sequences, column separators, and comment leaders have standard values (CR/LF, comma, semi-colon), but users may want to override these.
  • There is no reliable condition to determine that a file contains CSV data. The .csv file extension is a convention but need not be used. Application software need not have access to "a filename" either. This weakens the support for automatic detection of the file format.

Implementation

The sigrok project supports CSV formatted data both as an input format and as an output format. These are separate code paths, and need not be symmetric or identical in their feature set or default behaviour. The input path has to accept text and translate it to the internal .sr representation, while the output path dumps the internally available .sr representation to a text file, so not all concepts translate equally well on either side. Apart from that conceptual difference, factors like available time and manpower may influence the completeness or reliability of each code path: not all of them are touched at the same moment, by the same persons, addressing the same requirements.

CSV input module

The input module for the CSV file format implements these features:

  • Supports logic data, analog data, and timestamps.
  • Logic data can use individual columns per data bit (the default), or can combine multiple bits in a column. For multi-bit columns the user can specify which number representation applies (binary, octal, hexadecimal) and how many bits the column contains.
  • Analog values always occupy one column per value.
  • The order of columns and their data types are up to the user, but must be specified when they differ from the default case. Even for the default layout it is a good idea to specify the format explicitly, which avoids surprises and makes the command self-documenting.
    • The default case is all-logic data with one bit per column in all columns of a text line ("multi-column mode") for backwards compatibility.
    • "Simple" layouts can be described in a backwards compatible style with specific keywords. The generic format specifier is the most flexible and feature complete option, and is considered equally accessible and readable, so it should be preferred.
    • For the single column case where a multi-bit value spans the corresponding number of logic channels, users need to specify the column number and channel count, and the number format if it is not binary.
    • For multiple adjacent columns which each contain one bit of logic data (so-called "multi-column mode") the first column number must be specified. The channel count is optional and defaults to the remaining number of columns in the text line. The number format does not apply to single-bit data (binary fits as an implementation detail).
    • A more generic "column formats" feature can express the above backwards compatible modes. It also allows single- and multi-bit data in any number format and any order of text columns, analog input data (so-called "mixed signal" source), and optional timestamps which allow the automatic determination of the input data's samplerate for simple cases.
  • A number of text lines at the top of the file can get skipped when necessary. The default is to process the complete file.
  • An optional header line can be used to determine channel names (a feature exclusive to multi-column mode, off by default; simple channel names get assigned when the feature is off or the column lacks input text).
  • The line termination gets auto-detected. The column separator and comment leader are user adjustable.
  • Blank lines and comment-only lines get skipped (but are counted where line numbers get referenced).
  • Users can specify a samplerate (by means of options, outside of the file). When the user does not specify a samplerate but the file contains timestamps, the samplerate gets determined from the input data.
  • There are workarounds for odd platforms/applications which don't terminate the last text line in the file, or specify which one of one bytes goes first (useless use of BOM award).
  • Automatic file format detection is supported, which enables the use of simple multi-column data files without header lines, without the need for user intervention or option specification.
  • "Funnies" like double double quotes (escapes), embedded CRLF in fields enclosed in double quotes, etc. are not supported.

Notice that in 2019-12 user visible option names were renamed for consistency with other parts of the project. Examples and existing external scripts need adjustment (interactive use in GUI applications should not be affected).

List of options and builtin help text for the CSV input module:

 $ sigrok-cli -I csv --show
 ID: csv
 Name: CSV
 Description: Comma-separated values  
 Options:
   column_formats: Text columns data types. A comma separated list of [<cols>]<fmt>[<bits>] items. * for all remaining columns. - ignores columns, x/o/b/l logic data, a (and digits) analog data, t timestamps. (default )
   single_column: Simple single-column mode, exclusively use text from the specified column (number starting at 1). Obsoleted by 'column_formats=4-,x16'. (default 0)
   first_column: First column with logic data in simple multi-column mode (number starting at 1, default 1). Obsoleted by 'column_formats=4-,*l'. (default 1)
   logic_channels: Logic channel count, required in simple single-column mode, defaults to "all remaining columns" in simple multi-column mode. Obsoleted by 'column_formats=8l'. (default 0)
   single_format: The input text number format of simple single-column mode: bin, hex, oct. Obsoleted by 'column_formats=x8'. (default 'bin', possible values 'bin', 'hex', 'oct')
   start_line: The line number at which to start processing input text (default: 1). (default 1)
   header: Use the first processed line's column captions (when available) as channel names. Off by default (default false)
   samplerate: The input data's sample rate in Hz. No default value. (default 0)
   column_separator: The sequence which separates text columns. Non-empty text, comma by default. (default ',')
   comment_leader: The text which starts comments at the end of text lines, semicolon by default. (default ';')
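
To make the [<cols>]<fmt>[<bits>] notation from the column_formats help text more tangible, here is a minimal Python sketch that splits such a spec into its items. It is only a reading of the help text above, not the libsigrok implementation. The function names are hypothetical, and treating a missing bit count as 1 is an assumption.

```python
import re

# One item: optional column count (digits or '*'), a format letter,
# and an optional trailing number (bit count, or digits for analog).
ITEM_RE = re.compile(r"^(\*|\d+)?([-xoblat])(\d+)?$")

def parse_column_formats(spec):
    """Split a column_formats spec into (count, fmt, number) tuples."""
    items = []
    for text in spec.split(","):
        match = ITEM_RE.match(text.strip())
        if match is None:
            raise ValueError(f"unsupported item: {text!r}")
        cols, fmt, number = match.groups()
        count = "*" if cols == "*" else int(cols or 1)
        items.append((count, fmt, int(number) if number else None))
    return items

def count_channels(items):
    """Count channels for a spec without '*' (assumption: 1 bit if unspecified)."""
    total = 0
    for count, fmt, number in items:
        if count == "*":
            raise ValueError("'*' depends on the column count of the input file")
        if fmt in ("-", "t"):
            continue              # ignored columns and timestamps add no channel
        if fmt == "a":
            total += count        # one analog channel per column
        else:
            total += count * (number or 1)
    return total

items = parse_column_formats("-,2a,l,a,x4,a,-,b3")
print(items)
print(count_channels(items))  # prints 12
```

The spec used here is the one from the mixed signal example further down, where sigrok-cli also reports 12 channels in total.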

CSV output module

  • TODO

 $ sigrok-cli -O csv --show
 ID: csv
 Name: CSV
 Description: Comma-separated values  
 Options:
   gnuplot: gnuplot script file name (default )
   scale: Scale gnuplot graphs (default true)
   value: Character to print between values (default ',')
   record: String to print between records (default '\n')
   frame: String to print between frames (default '\n')
   comment: String used at start of comment lines (default ';')
   header: Output header comment with capture metdata (default true)
   label: Type of column labels (default 'units', possible values 'units', 'channel', 'off')
   time: Output sample time as column 1 (default true)
   trigger: Output trigger indicator as last column  (default false)
   dedup: Set to false to output duplicate rows (default false)

Examples

CSV input module

Simple multi-column data without header (the default format of the input module).

 $ cat simple-multi-column-no-header.csv
 1,0,1,0,1,1
 0,1,0,1,0,1
 1,0,1,0,1,1
 $ sigrok-cli -i simple-multi-column-no-header.csv
 libsigrok 0.6.0-git-b99c8ecdec45
 Acquisition with 6/6 channels
 0:101
 1:010
 2:101
 3:010
 4:101
 5:111
 (also works when the input module is selected while no options get specified)
 $ sigrok-cli -I csv -i simple-multi-column-no-header.csv
 libsigrok 0.6.0-git-b99c8ecdec45
 Acquisition with 6/6 channels
 0:101
 1:010
 2:101
 3:010
 4:101
 5:111
 (being explicit this time, and using a more general format spec, the default format in that case)
 $ sigrok-cli -I csv:column_formats='*l' -i simple-multi-column-no-header.csv
 libsigrok 0.6.0-git-b99c8ecdec45
 Acquisition with 6/6 channels
 0:101
 1:010
 2:101
 3:010
 4:101
 5:111
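
The relation between the text rows and the per-channel output in the transcripts above can be sketched in a few lines of Python. This is an illustration only, not the module's code; rows_to_channels is a hypothetical helper name. Each text line is one point in time, each column is one logic channel, so the channel traces are the transposed columns.

```python
def rows_to_channels(lines, separator=","):
    """Transpose text rows (one sample set each) into per-channel bit strings."""
    columns = [line.split(separator) for line in lines]
    return ["".join(bits) for bits in zip(*columns)]

rows = [
    "1,0,1,0,1,1",
    "0,1,0,1,0,1",
    "1,0,1,0,1,1",
]
for index, trace in enumerate(rows_to_channels(rows)):
    print(f"{index}:{trace}")
# prints 0:101, 1:010, 2:101, 3:010, 4:101, 5:111 (one per line)
```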

Simple multi-column data with header (users need to specify options, which involves selecting an input module).

 $ cat simple-multi-column-with-header.csv
 a,b,c,d,e,f
 1,0,1,0,1,1
 0,1,0,1,0,1
 1,0,1,0,1,1
 $ sigrok-cli -I csv:header=yes -i simple-multi-column-with-header.csv
 libsigrok 0.6.0-git-b99c8ecdec45
 Acquisition with 6/6 channels
 a:101
 b:010
 c:101
 d:010
 e:101
 f:111
 (using the more general format spec) 
 $ sigrok-cli -I csv:header=yes:column_formats="*l" -i simple-multi-column-with-header.csv
 libsigrok 0.6.0-git-b99c8ecdec45
 Acquisition with 6/6 channels
 a:101
 b:010
 c:101
 d:010
 e:101
 f:111

Simple single-column data with hex numbers (needs module and options specs):

 $ cat simple-single-column-number-formats.csv
 x,y,8,z,1000
 x,y,d,z,1101
 x,y,7,z,0111
 x,y,2,z,0010
 $ sigrok-cli -I csv:single_column=3:single_format=hex:logic_channels=4 -i simple-single-column-number-formats.csv
 libsigrok 0.6.0-git-b99c8ecdec45
 Acquisition with 4/4 channels
 0:0110
 1:0011
 2:0110
 3:1100
 (using the more general format spec) 
 $ sigrok-cli -I csv:column_formats=2-,x4 -i simple-single-column-number-formats.csv
 libsigrok 0.6.0-git-b99c8ecdec45
 Acquisition with 4/4 channels
 0:0110
 1:0011
 2:0110
 3:1100
 (also works with other number formats)
 $ sigrok-cli -I csv:column_formats=4-,b4 -i simple-single-column-number-formats.csv
 libsigrok 0.6.0-git-b99c8ecdec45
 Acquisition with 4/4 channels
 0:0110
 1:0011
 2:0110
 3:1100
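
How a multi-bit column spreads across logic channels can be illustrated with a short Python sketch. It assumes that channel 0 carries the least significant bit, which is consistent with the output shown above. The helper name spread_bits is hypothetical and the code is not taken from libsigrok.

```python
def spread_bits(values, base, bit_count):
    """Turn a column of multi-bit numbers into one bit string per logic channel."""
    traces = [""] * bit_count
    for text in values:
        number = int(text, base)
        for bit in range(bit_count):
            # Channel 0 receives the least significant bit (assumption).
            traces[bit] += str((number >> bit) & 1)
    return traces

# Column 3 of the example file, read as hexadecimal.
print(spread_bits(["8", "d", "7", "2"], 16, 4))
# Column 5 of the example file, read as binary.
print(spread_bits(["1000", "1101", "0111", "0010"], 2, 4))
# both print ['0110', '0011', '0110', '1100']
```

Both columns of the example file encode the same values, which is why the hex and binary invocations above yield identical traces.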

Mixed signal data in arbitrary order. Timestamps and automatic samplerate.

 $ cat mixed-signal-data.csv
 time,ch1,ch2,logic,ch3,gray4,ch4,ignore,bits3
 0.000,25.00,50.00,0,75.00,0,0.00,0,000
 0.001,26.00,51.00,1,76.00,1,1.00,1,001
 0.002,27.00,52.00,0,77.00,3,2.00,2,010
 0.003,28.00,53.00,1,78.00,2,3.00,3,011
 0.004,29.00,54.00,0,79.00,6,4.00,4,100
 0.005,30.00,55.00,1,80.00,7,5.00,5,101
 0.006,31.00,56.00,0,81.00,5,6.00,6,110
 0.007,32.00,57.00,1,82.00,4,7.00,7,111
 0.008,33.00,58.00,0,83.00,c,8.00,8,000
 0.009,34.00,59.00,1,84.00,d,9.00,9,001
 (grab the data, ignore the first column)
 $ sigrok-cli -I csv:header=yes:column_formats=-,2a,l,a,x4,a,-,b3 -i mixed-signal-data.csv | head -n 8
 libsigrok 0.6.0-git-b99c8ecdec45
 Acquisition with 8/12 channels
 ch1: 25.000
 ch1: 26.000
 ch1: 27.000
 ch1: 28.000
 ch1: 29.000
 ch1: 30.000
 (user specified samplerate)
 $ sigrok-cli -I csv:header=yes:column_formats=-,2a,l,a,x4,a,-,b3:samplerate=8000 -i mixed-signal-data.csv | head -n 8
 META samplerate: 8000
 libsigrok 0.6.0-git-b99c8ecdec45
 Acquisition with 8/12 channels at 8 kHz
 ch1: 25.000
 ch1: 26.000
 ch1: 27.000
 ch1: 28.000
 ch1: 29.000
 (automatic samplerate)
 $ sigrok-cli -I csv:header=yes:column_formats=t,2a,l,a,x4,a,-,b3 -i mixed-signal-data.csv
 META samplerate: 1000
 libsigrok 0.6.0-git-b99c8ecdec45
 Acquisition with 8/12 channels at 1 kHz
 ch1: 25.000
 ch1: 26.000
 ...
 ch1: 33.000
 ch1: 34.000
 ch2: 50.000
 ch2: 51.000
 ...
 ch2: 58.000
 ch2: 59.000
 ch3: 75.000
 ch3: 76.000
 ...
 ch3: 83.000
 ch3: 84.000
 ch4: 0.000
 ch4: 1.000
 ...
 ch4: 8.000
 ch4: 9.000
 logic:01010101 01
 gray4[0]:01100110 01
 gray4[1]:00111100 00
 gray4[2]:00001111 11
 gray4[3]:00000000 11
 bits3[0]:01010101 01
 bits3[1]:00110011 00
 bits3[2]:00001111 00
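
The automatic samplerate in the last invocation follows from the timestamp column. A minimal sketch of the idea, assuming equidistant samples with timestamps in seconds, and using only the first two rows (the input module's actual method may differ; derive_samplerate is a hypothetical name):

```python
def derive_samplerate(timestamps):
    """Estimate the samplerate in Hz from the first two timestamps (in seconds)."""
    first, second = float(timestamps[0]), float(timestamps[1])
    period = second - first
    if period <= 0:
        raise ValueError("timestamps must increase")
    return round(1.0 / period)

times = ["0.000", "0.001", "0.002", "0.003"]
print(derive_samplerate(times))  # prints 1000
```

A period of 1 ms between rows corresponds to the 1 kHz reported in the transcript above.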

Comments, empty lines, skipped lines.

 $ cat -n comments-empty-skipped.csv  
  1  These lines neither are comments 
  2  nor are they header nor data lines.
  3  It's some introductory text, captions,
  4  or whatever -- let's not process that.
  5
  6  ; comments get trimmed and skipped out of the box
  7  ; as are empty lines like above and below
  8
  9  ; yet another comment
 10  1,0,1,0
 11  0,1,0,1
 12  1,0,1,0
 $ sigrok-cli -I csv:start_line=5 -i comments-empty-skipped.csv
 libsigrok 0.6.0-git-b99c8ecdec45
 Acquisition with 4/4 channels
 0:101
 1:010
 2:101
 3:010
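
The interplay of start_line, comments, and blank lines can be sketched as follows. This is a simplified illustration of the behaviour described above, not the module's code, and data_lines is a hypothetical helper. Python's str.splitlines() accepts CR/LF, LF, and CR (among other separators), which stands in for the line termination auto-detection.

```python
def data_lines(text, start_line=1, comment_leader=";"):
    """Return the lines that carry data, after skipping and comment removal."""
    result = []
    # Skipped blank and comment lines still count towards line numbers.
    for number, line in enumerate(text.splitlines(), start=1):
        if number < start_line:
            continue
        # Drop everything from the comment leader up to the end of the line.
        line = line.split(comment_leader, 1)[0].strip()
        if line:
            result.append(line)
    return result

text = "\n".join([
    "These lines neither are comments",
    "nor are they header nor data lines.",
    "It's some introductory text, captions,",
    "or whatever.",
    "",
    "; comments get trimmed and skipped out of the box",
    "; as are empty lines like above and below",
    "",
    "; yet another comment",
    "1,0,1,0",
    "0,1,0,1",
    "1,0,1,0",
])
print(data_lines(text, start_line=5))
# prints ['1,0,1,0', '0,1,0,1', '1,0,1,0']
```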

CSV output module

TODO

Resources