Friday 1 June 2012

Representing empty CSV columns in a database

The particular case I want to nail down is a CSV file where field values are not quoted:

key,name,options,
1,one,,

The example can be represented as a three field table with non-nullable fields key and name and a nullable field options (yes I know about the trailing comma - they do occur in the wild).

For these purposes the data can be divided into Strings and others, as an empty string ("") is not a legal value for any other type.

The case is straight forward for XML. Paired tags with no content represent a zero length string. Unpaired tags (empty elements) or missing tags represent null.

For CSV the situation is more confused. When importing CSV files there are three possibilities for a column: it is not present, empty or filled with a value.

If the column is missing then the value in the database should be null.

If the column is present and empty and the column type is not String then the value should be Null; for a String column the value might be either an empty string or null: we have to decide.

Both string treatments can be argued for:

  1. A present, empty column, represented by two consecutive commas, is just the limiting case of a normal string and so represents a string of zero length. This ensures that no null value will ever enter that column.
  2. String handling should be consistent with the handling of other types where an empty column represents null, hence an empty string is an illegal value in a non-nullable column.

The problem with position 1. is that there is no way to represent null,
the problem with position 2. is that there is no way to represent the empty string.

Our hand is forced, as it happens, by the behaviour of MySQL, which in characteristic fashion chooses the wrong option.

MySQL chooses option 2, so if we wish to gain nulls we must either hand craft the import or post-process the column.

I have chosen to post process all nullable string columns to convert empty strings to null.

I do not believe there is a use case for storing zero length strings in nullable database columns.

No comments:

Post a Comment