File format:Csv
Latest revision as of 19:16, 22 December 2019
| Name | CSV | 
|---|---|
| Status | supported | 
| Source code (in) | csv.c | 
| Source code (out) | csv.c | 
| Common extension(s) | .csv | 
| MIME type | text/csv, text/csv;header | 
| ASCII format | yes | 
| Compression | none | 
| Website | tools.ietf.org | 
CSV is the abbreviation for comma-separated values. It is a text file format where data is arranged in a tabular representation. CSV files were traditionally used with spreadsheet software, but have also been used as an import and export format for signal analysis software.
In the context of signal analysis, text rows typically relate to time (one text line per point in time, so the number of rows is the number of sample sets taken), while columns contain the values within a sample set (one value per column). By convention, column captions can be kept in a so-called header line.
See RFC4180 for a formal specification, and the Wikipedia Comma-separated values article.
Format
The text files usually contain all-printable characters, and use CR/LF for text line termination. Some software versions also accept alternative line termination sequences, like LF only, or CR only.
Columns are traditionally separated by commas. Over time other separators came into use as well, and application software may optionally support them.
Numbers are communicated without specific marking (example: just "123.45" or "-12"). Floating point numbers may use exponential representation (example: "100.0E-3"). Text may or may not be enclosed in a pair of double quotes. There can be other data types for cells like dates, currency, etc, which shall not be discussed here.
Some applications support the notion of comments, which span from a comment leader (usually semi-colon) up to the end of the line. Applications may or may not accept "trailing" comments after data, or require that comments span a complete line on their own.
Since there is a great variety of software and platforms, it's best to generate output files conservatively, and accept the minimum feature set, and optionally a few variants which are rather popular in the field.
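As an illustration of "accept conservatively", the following Python sketch splits text into rows while tolerating the three line termination variants and trailing comments mentioned above. It is not taken from any existing implementation; the function name `split_rows` and its parameters are made up for this example, and quoted fields are deliberately not handled.

```python
import re

def split_rows(text, separator=",", comment_leader=";"):
    """Split CSV text into rows of cell strings (hypothetical helper).

    Accepts CR/LF, LF-only and CR-only line endings, drops trailing
    comments, and skips lines that end up empty. Quoting is not handled.
    """
    rows = []
    for line in re.split(r"\r\n|\n|\r", text):
        # Cut off a trailing comment, if a comment leader is configured.
        if comment_leader and comment_leader in line:
            line = line.split(comment_leader, 1)[0]
        line = line.strip()
        if not line:
            continue
        rows.append([cell.strip() for cell in line.split(separator)])
    return rows
```

Calling `split_rows("1,0\r\n0,1 ; note\r; only a comment\n\n1,1")` yields three rows of two cells each, regardless of the mixed line endings.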
Properties
- The file size is not limited, neither is the line length, nor the acceptable set of used characters (specific applications may be limited, but the file format itself is not).
- Line termination sequences, column separators, and comment leaders have standard values (CR/LF, comma, semi-colon), but users may want to override these.
- There is no reliable condition to determine that a file contains CSV data. The .csv file extension is a convention but need not be used. Application software need not have access to "a filename" either. This weakens the support for automatic detection of the file format.
Implementation
It is important to note that the sigrok project supports CSV formatted data both as an input format and as an output format. These are separate code paths, and need not be symmetric or identical in their feature set or default behaviour. The input path needs to accept text and translate it to the internal .sr representation, while the output path dumps the internally available .sr representation to a text file, so not all concepts translate equally well on either side. Aside from that conceptual difference, other factors like available time and manpower may influence the completeness or reliability of the two code paths: they are not necessarily touched at the same moment, by the same persons, or to address the same requirements.
CSV input module
The input module for the CSV file format implements these features:
- Supports logic data, analog data, and timestamps.
- Logic data can use individual columns per data bit (the default), or can combine multiple bits in a column. For multi-bit columns the user can specify which number representation applies (binary, octal, hexadecimal) and how many bits the column contains.
- Analog values always occupy one column per value.
- The order of columns and their data types are up to the user, but must be specified when they differ from the default case. Even for the default layout it is a good idea to specify the format for the purpose of being explicit to avoid surprises, and to become self documenting.
  - The default case is all-logic data with one bit per column in all columns of a text line ("multi-column mode"), for backwards compatibility.
  - "Simple" layouts can be described in a backwards compatible style with specific keywords. The generic format specifier is the most flexible and feature complete, and is considered just as accessible and readable, so it should be preferred.
  - For the single column case where a multi-bit value spans the corresponding number of logic channels, users need to specify the column number and channel count, and the number format if it is not binary.
  - For multiple adjacent columns which each contain one bit of logic data (so-called "multi-column mode") the first column number must be specified. The channel count is optional and defaults to the remaining number of columns in the text line. The number format does not apply to single-bit data (binary fits as an implementation detail).
  - A more generic "column formats" feature can express the above backwards compatible modes. It also allows single- and multi-bit data in any number format and any order of text columns, analog input data (a so-called "mixed signal" source), and optional timestamps, which allow the input data's samplerate to be determined automatically in simple cases.
- A number of text lines at the top of the file can get skipped when necessary. The default is to process the complete file.
- An optional header line can be used to determine channel names. This feature is exclusive to multi-column mode and off by default; simple channel names get assigned when the feature is off or a column lacks input text.
- The line termination gets auto-detected. The column separator and comment leader are user adjustable.
- Blank lines and comment-only lines get skipped (but are counted where line numbers get referenced).
- Users can specify a samplerate (by means of options, outside of the file). In the absence of a user spec but the presence of timestamps in the file the samplerate will get determined from input data.
- There are workarounds for odd platforms/applications which don't terminate the last text line in the file, or which specify which one of one bytes goes first, that is, prepend a byte order mark to single-byte text (useless use of BOM award).
- Automatic file format detection is supported, which enables the use of simple multi-column data files without header lines, without the need for user intervention or option specification.
- "Funnies" like double double quotes (escapes), embedded CRLF in fields enclosed in double quotes, etc. are not supported.
Note that in 2019-12 the user visible option names were renamed for consistency with other parts of the project. Examples and existing external scripts need adjustment; interactive use in GUI applications should not be affected.
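To make the generic format specifier more tangible, here is a minimal Python sketch that splits a `column_formats` value into its items, following the `[<cols>]<fmt>[<bits>]` shape given in the module's help text. This is an illustration only and not the module's actual parser: the function name, the tuple layout, and the choice to treat a missing column count as 1 are assumptions made for this example.

```python
import re

# One list item: optional column count (digits or '*'), one format
# letter (- x o b l a t), optional trailing digits (bit count or similar).
_ITEM = re.compile(r"^(\d+|\*)?([-xoblat])(\d+)?$")

def parse_column_formats(spec):
    """Return a list of (columns, fmt, digits) tuples (hypothetical layout).

    columns is an int, or None for '*' (all remaining columns).
    digits is an int, or None when the item carries no trailing number.
    """
    items = []
    for text in spec.split(","):
        match = _ITEM.match(text.strip())
        if match is None:
            raise ValueError(f"bad column format item: {text!r}")
        cols, fmt, digits = match.groups()
        # Assumption: an item without a count describes a single column.
        columns = None if cols == "*" else int(cols or 1)
        items.append((columns, fmt, int(digits) if digits else None))
    return items
```

For example, `parse_column_formats("2-,x4")` returns `[(2, '-', None), (1, 'x', 4)]`, which reads as "ignore two columns, then one hexadecimal column of four bits".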
List of options and builtin help text for the CSV input module:
    $ sigrok-cli -I csv --show
    ID: csv
    Name: CSV
    Description: Comma-separated values
    Options:
      column_formats: Text columns data types. A comma separated list of [<cols>]<fmt>[<bits>] items. * for all remaining columns. - ignores columns, x/o/b/l logic data, a (and digits) analog data, t timestamps. (default )
      single_column: Simple single-column mode, exclusively use text from the specified column (number starting at 1). Obsoleted by 'column_formats=4-,x16'. (default 0)
      first_column: First column with logic data in simple multi-column mode (number starting at 1, default 1). Obsoleted by 'column_formats=4-,*l'. (default 1)
      logic_channels: Logic channel count, required in simple single-column mode, defaults to "all remaining columns" in simple multi-column mode. Obsoleted by 'column_formats=8l'. (default 0)
      single_format: The input text number format of simple single-column mode: bin, hex, oct. Obsoleted by 'column_formats=x8'. (default 'bin', possible values 'bin', 'hex', 'oct')
      start_line: The line number at which to start processing input text (default: 1). (default 1)
      header: Use the first processed line's column captions (when available) as channel names. Off by default (default false)
      samplerate: The input data's sample rate in Hz. No default value. (default 0)
      column_separator: The sequence which separates text columns. Non-empty text, comma by default. (default ',')
      comment_leader: The text which starts comments at the end of text lines, semicolon by default. (default ';')
CSV output module
- TODO
    $ sigrok-cli -O csv --show
    ID: csv
    Name: CSV
    Description: Comma-separated values
    Options:
      gnuplot: gnuplot script file name (default )
      scale: Scale gnuplot graphs (default true)
      value: Character to print between values (default ',')
      record: String to print between records (default '\n')
      frame: String to print between frames (default '\n')
      comment: String used at start of comment lines (default ';')
      header: Output header comment with capture metdata (default true)
      label: Type of column labels (default 'units', possible values 'units', 'channel', 'off')
      time: Output sample time as column 1 (default true)
      trigger: Output trigger indicator as last column (default false)
      dedup: Set to false to output duplicate rows (default false)
Examples
CSV input module
Simple multi-column data without header (the default format of the input module).
    $ cat simple-multi-column-no-header.csv
    1,0,1,0,1,1
    0,1,0,1,0,1
    1,0,1,0,1,1

    $ sigrok-cli -i simple-multi-column-no-header.csv
    libsigrok 0.6.0-git-b99c8ecdec45
    Acquisition with 6/6 channels
    0:101
    1:010
    2:101
    3:010
    4:101
    5:111

    (also works when the input module is selected while no options get specified)
    $ sigrok-cli -I csv -i simple-multi-column-no-header.csv
    libsigrok 0.6.0-git-b99c8ecdec45
    Acquisition with 6/6 channels
    0:101
    1:010
    2:101
    3:010
    4:101
    5:111
    (being explicit this time, and using a more general format spec, the default format in that case)
    $ sigrok-cli -I csv:column_formats='*l' -i simple-multi-column-no-header.csv
    libsigrok 0.6.0-git-b99c8ecdec45
    Acquisition with 6/6 channels
    0:101
    1:010
    2:101
    3:010
    4:101
    5:111
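The relation between the input rows and the per-channel output in this example is a transposition: each text column becomes one channel, and each row contributes one sample. The following Python sketch reproduces that mapping for the sample file; it only illustrates the data layout and is not how the input module is implemented. The name `channel_strings` is made up for this example.

```python
def channel_strings(rows):
    """Transpose rows of single-bit cells into one bit string per column."""
    return ["".join(bits) for bits in zip(*rows)]

lines = ["1,0,1,0,1,1", "0,1,0,1,0,1", "1,0,1,0,1,1"]
rows = [line.split(",") for line in lines]
for index, bits in enumerate(channel_strings(rows)):
    # Prints one "channel:bits" line per text column.
    print(f"{index}:{bits}")
```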
Simple multi-column data with header (users need to specify options, which involves selecting an input module).
    $ cat simple-multi-column-with-header.csv
    a,b,c,d,e,f
    1,0,1,0,1,1
    0,1,0,1,0,1
    1,0,1,0,1,1

    $ sigrok-cli -I csv:header=yes -i simple-multi-column-with-header.csv
    libsigrok 0.6.0-git-b99c8ecdec45
    Acquisition with 6/6 channels
    a:101
    b:010
    c:101
    d:010
    e:101
    f:111

    (using the more general format spec)
    $ sigrok-cli -I csv:header=yes:column_formats="*l" -i simple-multi-column-with-header.csv
    libsigrok 0.6.0-git-b99c8ecdec45
    Acquisition with 6/6 channels
    a:101
    b:010
    c:101
    d:010
    e:101
    f:111
Simple single-column data with hex numbers (needs module and options specs):
    $ cat simple-single-column-number-formats.csv
    x,y,8,z,1000
    x,y,d,z,1101
    x,y,7,z,0111
    x,y,2,z,0010

    $ sigrok-cli -I csv:single_column=3:single_format=hex:logic_channels=4 -i simple-single-column-number-formats.csv
    libsigrok 0.6.0-git-b99c8ecdec45
    Acquisition with 4/4 channels
    0:0110
    1:0011
    2:0110
    3:1100

    (using the more general format spec)
    $ sigrok-cli -I csv:column_formats=2-,x4 -i simple-single-column-number-formats.csv
    libsigrok 0.6.0-git-b99c8ecdec45
    Acquisition with 4/4 channels
    0:0110
    1:0011
    2:0110
    3:1100

    (also works with other number formats)
    $ sigrok-cli -I csv:column_formats=4-,b4 -i simple-single-column-number-formats.csv
    libsigrok 0.6.0-git-b99c8ecdec45
    Acquisition with 4/4 channels
    0:0110
    1:0011
    2:0110
    3:1100
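The output above suggests that channel 0 carries the least significant bit of each multi-bit value. Under that assumption, which is inferred from the example output and not from the module's source, the mapping from one text column to several logic channels can be sketched in Python as follows. The names `BASES` and `column_to_channels` are made up for this example.

```python
# Number bases for the three number formats named in the option help text.
BASES = {"bin": 2, "oct": 8, "hex": 16}

def column_to_channels(values, number_format, channel_count):
    """Spread multi-bit cell values over channel_count bit strings.

    Assumption: channel 0 receives the least significant bit.
    """
    base = BASES[number_format]
    channels = ["" for _ in range(channel_count)]
    for text in values:
        number = int(text, base)
        for bit in range(channel_count):
            channels[bit] += str((number >> bit) & 1)
    return channels
```

With the third column of the sample file, `column_to_channels(["8", "d", "7", "2"], "hex", 4)` gives `['0110', '0011', '0110', '1100']`, and the fifth column read as binary gives the same result.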
Mixed signal data in arbitrary order. Timestamps and automatic samplerate.
    $ cat mixed-signal-data.csv
    time,ch1,ch2,logic,ch3,gray4,ch4,ignore,bits3
    0.000,25.00,50.00,0,75.00,0,0.00,0,000
    0.001,26.00,51.00,1,76.00,1,1.00,1,001
    0.002,27.00,52.00,0,77.00,3,2.00,2,010
    0.003,28.00,53.00,1,78.00,2,3.00,3,011
    0.004,29.00,54.00,0,79.00,6,4.00,4,100
    0.005,30.00,55.00,1,80.00,7,5.00,5,101
    0.006,31.00,56.00,0,81.00,5,6.00,6,110
    0.007,32.00,57.00,1,82.00,4,7.00,7,111
    0.008,33.00,58.00,0,83.00,c,8.00,8,000
    0.009,34.00,59.00,1,84.00,d,9.00,9,001

    (grab the data, ignore the first column)
    $ sigrok-cli -I csv:header=yes:column_formats=-,2a,l,a,x4,a,-,b3 -i mixed-signal-data.csv | head -n 8
    libsigrok 0.6.0-git-b99c8ecdec45
    Acquisition with 8/12 channels
    ch1: 25.000
    ch1: 26.000
    ch1: 27.000
    ch1: 28.000
    ch1: 29.000
    ch1: 30.000

    (user specified samplerate)
    $ sigrok-cli -I csv:header=yes:column_formats=-,2a,l,a,x4,a,-,b3:samplerate=8000 -i mixed-signal-data.csv | head -n 8
    META samplerate: 8000
    libsigrok 0.6.0-git-b99c8ecdec45
    Acquisition with 8/12 channels at 8 kHz
    ch1: 25.000
    ch1: 26.000
    ch1: 27.000
    ch1: 28.000
    ch1: 29.000

    (automatic samplerate)
    $ sigrok-cli -I csv:header=yes:column_formats=t,2a,l,a,x4,a,-,b3 -i mixed-signal-data.csv
    META samplerate: 1000
    libsigrok 0.6.0-git-b99c8ecdec45
    Acquisition with 8/12 channels at 1 kHz
    ch1: 25.000
    ch1: 26.000
    ...
    ch1: 33.000
    ch1: 34.000
    ch2: 50.000
    ch2: 51.000
    ...
    ch2: 58.000
    ch2: 59.000
    ch3: 75.000
    ch3: 76.000
    ...
    ch3: 83.000
    ch3: 84.000
    ch4: 0.000
    ch4: 1.000
    ...
    ch4: 8.000
    ch4: 9.000
    logic:01010101 01
    gray4[0]:01100110 01
    gray4[1]:00111100 00
    gray4[2]:00001111 11
    gray4[3]:00000000 11
    bits3[0]:01010101 01
    bits3[1]:00110011 00
    bits3[2]:00001111 00
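How the input module derives the samplerate from the timestamp column is not spelled out here. One plausible approach for evenly spaced timestamps is to divide the number of sample intervals by the covered time span, as in the Python sketch below. This is an assumption for illustration, and the function name `estimate_samplerate` is made up.

```python
def estimate_samplerate(timestamps):
    """Estimate a samplerate in Hz from evenly spaced timestamps in seconds.

    Assumption: samples are equidistant, so the rate is the number of
    intervals divided by the time span from first to last sample.
    """
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    span = timestamps[-1] - timestamps[0]
    if span <= 0:
        raise ValueError("timestamps must increase")
    return round((len(timestamps) - 1) / span)

# The "time" column of the sample file: 0.000 to 0.009 in 1 ms steps.
times = [0.000, 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008, 0.009]
print(estimate_samplerate(times))
```

For the sample file this agrees with the 1 kHz that sigrok-cli reports in the automatic case.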
Comments, empty lines, skipped lines.
    $ cat -n comments-empty-skipped.csv
         1  These lines neither are comments
         2  nor are they header nor data lines.
         3  It's some introductory text, captions,
         4  or whatever -- let's not process that.
         5
         6  ; comments get trimmed and skipped out of the box
         7  ; as are empty lines like above and below
         8
         9  ; yet another comment
        10  1,0,1,0
        11  0,1,0,1
        12  1,0,1,0

    $ sigrok-cli -I csv:start_line=5 -i comments-empty-skipped.csv
    libsigrok 0.6.0-git-b99c8ecdec45
    Acquisition with 4/4 channels
    0:101
    1:010
    2:101
    3:010
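The interplay of `start_line`, comments and blank lines can be sketched in Python as follows: every text line is counted, lines before the start line are dropped, and lines that are blank once the comment is removed get skipped. This mirrors the behaviour described for the input module but is not its implementation; `data_lines` is a name made up for this example.

```python
def data_lines(text, start_line=1, comment_leader=";"):
    """Yield (line_number, content) for lines that carry data.

    Line numbers start at 1 and count every text line, including
    blank lines and comment-only lines.
    """
    for number, line in enumerate(text.splitlines(), start=1):
        if number < start_line:
            continue
        content = line.split(comment_leader, 1)[0].strip()
        if content:
            yield number, content

sample = "\n".join([
    "These lines neither are comments",
    "nor are they header nor data lines.",
    "It's some introductory text, captions,",
    "or whatever -- let's not process that.",
    "",
    "; comments get trimmed and skipped out of the box",
    "; as are empty lines like above and below",
    "",
    "; yet another comment",
    "1,0,1,0",
    "0,1,0,1",
    "1,0,1,0",
])
for number, content in data_lines(sample, start_line=5):
    print(number, content)
```

Only lines 10 to 12 are reported, with their original line numbers intact.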
CSV output module
TODO