Check the Format and Content of a Delimited Text File¶
chkcsv.py
is a Python module and program that checks the format
and content of a comma-separated-value (CSV) or similar delimited text
file. It can check whether required columns are present, and the type,
length, and pattern of each column.
Syntax and Options¶
chkcsv.py should be run at the operating-system commannd line–i.e., at a shell prompt in Linux or in a command window in Windows. Python may or may not need to be explicitly invoked, and the .py extension may or may not need to be included, depending on your operating system, operating system seetings, and how execsql is installed.
For Linux users: The chkcsv.py file contains a shebang line pointing to /usr/bin/python, so there should be no need to invoke the Python interpreter. Depending on how chkcsv.py was obtained and installed, it may need to be made executable with the chmod command.
For Windows users: If you are unfamiliar with running Python programs at the command prompt, see https://docs.python.org/2/faq/windows.html.
In the following syntax descriptions, angle brackets identify required replaceable elements, and square brackets identify optional replaceable elements.
chkcsv.py [options] <CSV file name>
Arguments:
<CSV file name> The name of the CSV file to check.
Options:
--version Show program's version number and exit
-h, --help Show this help message and exit
-s, --showspecs Show the format specifications allowed in the
configuration file, and exit.
-f FORMATSPEC, --formatspec=FORMATSPEC
Name of the file with the format specification.
The default is the name of the CSV file with an
extension of fmt.
-r, --required A data value is required in data columns for
which the format specification does not include
an explicit specification of whether data is
required for a column. The default is false
(i.e., data are not required).
-q, --columnsnotrequired
Columns listed in the format configuration file
are not required to be present unless the
column_required specification is explicitly set
in the configuration file. The default is true
(i.e., all columns in the configuration file
are required in the CSV file).
-c, --columnexit Exit immediately if there are more columns in
the CSV file header than are specified in the
format configuration file.
-l, --linelength Allow rows of the CSV file to have fewer columns
than in the column headers. The default is to
report an error for short data rows. If short
data rows are allowed, any row without enough
columns to match the format specification will
still be reported as an error.
-i, --case-insensitive
Case-insensitive matching of column names in
the format configuration file and the CSV file.
The default is case-sensitive (i.e., column
names must match exactly).
-e ENCODING, --encoding=ENCODING
Character encoding of the CSV file. It should
be one of the strings listed at
http://docs.python.org/library/codecs.html#standard- encodings.
-o OPTSECTION, --optsection=OPTSECTION
An alternate name for the chkcsv options section
in the format specification configuration file.
-x, --exitonerror Exit when the first error is found.
Installation¶
The chkcsv program is available on PyPi. It can be installed with:
pip install chkcsv
By default, chkcsv.py will be installed as an executable script. To use it as a library module, it must be copied to Python’s site-packages directory.
Format Specifications¶
The format of each of the columns of the CSV file is specified in a separate configuration file containing a section for each column. Each section begins with the column name in square brackets, followed by key-value pairs identifying the specifications for that column. Each key-value pair consists of a keyword and an associated value. Keywords and values should be separated by either “=” or “:”. Each keyword should be at the beginning of a line.
By default, the configuration file has the same name as the CSV file, but with an extension of “.fmt”. An alternate configuration file can be specified with the “-f” command-line option.
The keywords that can be used for column format specifications are listed below. A specific type of value should be provided for each keyword. Boolean values are indicated by “Yes”, “No”, “True”, “False”, “On”, “Off”, “1”, or “0”. Format specification keywords and values should not be quoted in the configuration file. The allowable keywords are:
- column_required
- Indicates whether or not the column must be present in the CSV file. This is a Boolean value. The default value is True, and can be changed with the “-q” command-line option. This format option need be included in the format configuration file only when the default is to be overridden.
- data_required
- Indicates whether or not a value is required in this column on every row of the CSV file. This is a Boolean value. The default value is False, and can be changed with the “-r” command-line option. This format option need be included in the format configuration file only when the default is to be overridden.
- type
- Identifies the type of data in the data column. Valid values are “string”, “integer”, “float”, “bool”, “date”, and “datetime”. Data values in the CSV file will be checked for compatibility with the specified type. If the data type is not specified, data values will be treated as strings—that is, minimum and maximum lengths and the pattern will be checked if they have been specified.
- minlen
- The required minimum length of data values for this column. This is only checked for string data types and for data with no type specified.
- maxlen
- The maximum allowed length of data values for this column. This is only checked for string data types and for data with no type specified.
- pattern
- A regular expression specifying the content of the column value. Patterns must match at the beginning of the column value. This is checked for string, date, and datetime data types, and for data with no type specified.
Output¶
If the format of the CSV file matches the specification, chkcsv sets the system exit status to 0; no other output is produced. If the CSV file does not match the specification, chkcsv sets the system exit status to 1 and writes error messages to the standard error device. An example set of error message appears as follows:
Error: invalid date in file bad_input.csv on line 4 in column Date.
Error: pattern mismatch in file bad_input.csv on line 6 in column 1221_qual.
Error: not a floating-point number in file bad_input.csv on line 9 in column 1242.
Error: data too short in file bad_input.csv on line 11 in column Station.
Error: missing data in file bad_input.csv on line 12 in column Station.
Error: data too short in file bad_input.csv on line 12 in column Station.
Error: pattern mismatch in file bad_input.csv on line 13 in column 1248_qual.
Error: data too long in file bad_input.csv on line 13 in column Station.
Error: pattern mismatch in file bad_input.csv on line 14 in column Station.
Usage Notes¶
- The first line of the CSV file must contain the names of the columns.
- The order of column specifications in the configuration file does not have to match the order of columns in the CSV file.
- Format specification keywords for a column may be in any order within the column section in the configuration file
- Column names in the CSV file and in the configuration file are case-sensitive, and must match exactly by default. If column names in the configuration file and the CSV file don’t match because the case is different, an error will be reported only if the unmatched column is required. The “-i” command-line option can be used to allow case-insensitive matching of column names.
- The pattern that a column should match is specified by a regular expression. The regular expression syntax supported by chkcsv.py is as documented at http://docs.python.org/library/re.html.
- Patterns (regular expressions) must match at the beginning of the column value. To ensure that the regular expression matches the entire column value, you may need to include “$” at the end of the regular expression.
- By default, all columns listed in the configuration file are considered to be required, and if the column name is not present in the CSV file (header row), this will be considered to be an error and chkcsv.py will halt immediately. The default behavior can be changed with the “-q” command-line option. If “-q” is used, or the “column_required” format specification is set to False, and the column is not present, no error will occur. If the column is present, any other format specifications will be applied. That is, even if a column is not required, if it is present and its data fails some other test, an error will be reported.
- chkcsv.py recognizes a wide variety of date and datetime formats. It may actually recognize a date or datetime format that the target software (e.g., a DBMS) does not. In this case, specifying a pattern for the date column can usefully restrict the types of date and datetime values that are accepted.
- The CSV file is expected to have the same number of data items on each row as there are column names in the first row of the file. If the “-l” command-line option is used, the CSV file may have varying numbers of data values in each row, as long as each row has enough values to correspond to each data column that will be checked. That is, if “-l” is used, and there are columns to the right of all of the required columns, those data items may or may not be present in any row without causing an error. However, if a row is short because a value in a required column is missing, and this omission does not cause any violation of any format specification, this error will not necessarily be recognized.
- chkcsv.py does not transform the input file in any way. It does not produce any output file or send any output to stdout except for help and version messages. chkcsv.py only writes error messages, if any, to stderr and sets the exit status value when it terminates.
- chkcsv.py is intended to verify that a data file is suitable for import to database, statistical, graphics, modeling, or other software. The checks that it can perform are generally sufficient to determine whether each data column is compatible with typical specifications for a database column. However, chkcsv.py does not do any row-level checks to verify that column values within a row are consistent with each other. Nor does it do any dataset-level checks to ensure, for example, that each row is unique.
- chkcsv.py includes a provision to allow additional options to be specified in a special section of the configuration file. By default, the name of this special section is “chkcsvoptions”. A different name for this special section can be specified with the “-o” command-line argument. Currently there are no special options supported, and if this special section is present in the configuration file, it will be ignored.
Example¶
An example configuration file might look like this:
[Study]
data_required=True
type=string
minlen=5
maxlen=20
[Station]
data_required=True
type=string
minlen=4
maxlen=12
[SampleDate]
type=date
[Sample]
type=string
data_required=True
minlen=4
maxlen=20
pattern=(SO|SD|WA).*
[Description]
type=string
column_required=False
maxlen=120
[UpperDepth]
type=float
data_required=True
[LowerDepth]
type=float
[DepthUnits]
type=string
data_required=True
pattern=(?i)(FT|M|CM)$
Module Documentation¶
To use chkcsv.py as a library module, it must be placed in either the global or personal site-packages directory. The module provides the following classes and functions to support reading of format specifications from a file and checking a CSV file.
Classes¶
-
class
chkcsv.
CsvChecker
(fmt_spec, colname, column_required_default, data_required_default)¶ Create an object to check a specific column of a defined type.
Parameters: - fmt_spec – A ConfigParser object.
- colname – The name of the data column.
- column_required_default – A Boolean indicating whether the column is required by default.
- data_required_default – A Boolean indicating whether data values are required (non-null) by default.
After initialization, the ‘check()’ method will return a boolean indicating whether a data value is acceptable.
Functions¶
-
chkcsv.
read_format_specs
(fmt_file, column_required, data_required, chkopts='chkcsvoptions')¶ Read format specifications from a file.
Parameters: - fmt_file – The name of the file containing format specifications.
- column_required – Whether or not the column must be in the CSV file to be checked.
- data_required – Whether or not a data value is required on every row of the CSV file.
- chkopts – The name of a section in the format specification file containing additional options.
-
chkcsv.
check_csv_file
(csv_fname, cols, halt_on_err, columnexit, linelength, caseinsensitive, encoding=None)¶ Check that all of the required columns and data are present in the CSV file, and that the data conform to the appropriate type and other specifications.
Parameters: - csv_fname – The name of the CSV file to check.
- cols – A dictionary of specifications (CsvChecker objects) indexed by column name.
- halt_on_err – Whether to exit on the first error.
- columnexit – Whether to exit if the CSV file doesn’t have exactly the same columns in the format specifications.
- linelength – Whether to report an error if any data row has a different number of items than indicated by the column headers.
- casesensitive – Whether column names in the specifications and CSV file should be compared case-insensitively.
- encoding – The character encoding of the CSV file.
-
chkcsv.
show_errors
(errlist)¶ Write a list of error messages to stderr.
Parameters: errlist – A tuple of a narrative message, the name of the file in which the error occurred, the line number of the file, and the column name of the file. All but the first may be null.
Copyright and License¶
Copyright (c) 2011-2019, R.Dreas Nielsen
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. The GNU General Public License is available at http://www.gnu.org/licenses/.