The storage format commands

Storage format commands as appinfo and the hierarchy

Storage format commands are placed in appinfo so that they do not affect the normal operation of the schema. There is a single format element in the appinfo, and all the storage format commands are attributes of that format element. Here is an example.

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:data="http://www.datamech.com/storage">
 <xsd:element name="hexDump" type="xsd:hexBinary">
  <xsd:annotation>
    <xsd:appinfo>   
     <data:format dataCounter="EOF"/>
    </xsd:appinfo>
  </xsd:annotation>
 </xsd:element>
</xsd:schema>

Storage format commands can also be written as foreign attributes, as in this example.

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:data="http://www.datamech.com/storage">
 <xsd:element name="hexDump" type="xsd:hexBinary" data:dataCounter="EOF"/>
</xsd:schema>

The two forms are considered equivalent. They can be mixed, even in the same element. Writing the commands as appinfo involves more work, but the resulting schema may be accepted by more XML tools. The annotation code has also been in use for a while, whereas the foreign attribute code was written only recently and may have more bugs. So it is possible that a command works as appinfo but not as a foreign attribute.

A command can be attached to any element, compositor or type (but not to a group or attribute group at this point). When the converter looks for the storage format of an element in the schema, it first checks whether one is attached to that element. If not, it looks at the type of the element, and then possibly at the type of that type. If the format is still not found, it looks at the containing element or compositor, and continues all the way up the containment hierarchy until it reaches the top-level element. Then it looks at the annotation of the schema, written as appinfo as in the sample above. Here you must use appinfo since there is no element to attach the commands to. Finally, there are built-in defaults, so even if no format is defined, the XML data can generate a reasonable binary output. It therefore makes sense to store common values high in the hierarchy, or to rely on the defaults, to simplify the schema.

While the hierarchy description is generally true, there are exceptions. The choice tag identifies which choice is being used, and it is never inherited. Imagine a choice inside another choice. Each tag in a single collection of choices must obviously be unique. If choice tags were inherited, the contained choices would all inherit the same value from the containing choice, which would not make sense. Another exception is using EOF as the item terminator. It does not make sense to inherit the EOF terminator since there can only be one EOF terminator. So during item counter inheritance, an EOF terminator is bypassed so that your default can still be used.
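The lookup order and its two exceptions can be sketched in Python as follows. This is only an illustrative model, not the converter's actual code: the Node class, the lookup function and the BUILT_IN_DEFAULTS table are hypothetical names, and treating "inherited" as "found on a containing node" is an assumption.

```python
# Hypothetical model of the format lookup; all names here are illustrative.
BUILT_IN_DEFAULTS = {"byteOrder": "littleEndian", "dataCounter": "\\0"}

class Node:
    """A schema component: element, type, compositor, or the schema itself."""
    def __init__(self, name, formats=None, type_node=None, parent=None):
        self.name = name
        self.formats = formats or {}
        self.type_node = type_node  # type of an element, or type of a type
        self.parent = parent        # containing element/compositor/schema

def lookup(node, command):
    # Exception 1: a choice tag is never inherited.
    if command == "choiceTag":
        return node.formats.get(command)
    start = node
    while node is not None:
        candidate = node
        # Check the component, then its type, then the type of that type.
        while candidate is not None:
            value = candidate.formats.get(command)
            # Exception 2: an inherited EOF item terminator is bypassed.
            bypass = (command == "itemCounter" and value == "EOF"
                      and node is not start)
            if value is not None and not bypass:
                return value
            candidate = candidate.type_node
        node = node.parent  # move up the containment hierarchy
    return BUILT_IN_DEFAULTS.get(command)
```

With a schema node that sets byteOrder and a container that sets itemCounter="EOF", a child element would pick up the byteOrder but skip the EOF terminator and continue to whatever item counter is defined higher up.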

Commands will be described below using this notation:

<data:format booleanTrue="\1|string"/>

where | separates the alternatives, the underlined option is the default, and an italic xyz means a value of the type xyz.

It is understood that a command can also be written in attribute form, as in

data:booleanTrue="\1|string"

and it will not be listed separately.

All the command names are subject to change. We are still in the design phase, and not all commands are defined yet. Once we have a full set of commands, we can decide on the names with the full picture in mind. We will also try to borrow names from standards such as DFDL when they are functionally equivalent.

Numeric datatypes

<data:format byteOrder="littleEndian|bigEndian"/>

When the datatype is byte, short, int, long, unsignedByte, unsignedShort, unsignedInt or unsignedLong, the value is stored as a 1, 2, 4 or 8 byte binary number. Float and double are also supported. The only format command is byteOrder. You can set byteOrder="littleEndian" or byteOrder="bigEndian", the default being littleEndian. Chances are that you will want to store the byteOrder in the schema default. It is unlikely that you have mixed endianness throughout the document, but it is permitted.
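As a rough illustration of what byteOrder changes, the following Python sketch packs a value with the standard struct module. The function name encode_number and the STRUCT_CODES table are hypothetical and are not part of the converter.

```python
import struct

# XSD numeric datatype -> struct format character (1, 2, 4 or 8 bytes).
STRUCT_CODES = {
    "byte": "b", "short": "h", "int": "i", "long": "q",
    "unsignedByte": "B", "unsignedShort": "H",
    "unsignedInt": "I", "unsignedLong": "Q",
    "float": "f", "double": "d",
}

def encode_number(value, datatype, byte_order="littleEndian"):
    """Pack a number as fixed-size binary in the requested byte order."""
    prefix = "<" if byte_order == "littleEndian" else ">"
    return struct.pack(prefix + STRUCT_CODES[datatype], value)
```

For example, the int 5354 would come out as EA 14 00 00 in littleEndian and 00 00 14 EA in bigEndian.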

Decimal datatype

<data:format printf="string%optionsnumber.numberfstring"/>

There are so many ways to store a decimal that we cannot hope to support them all. We just handle the simplest case and store the decimal as a simple string. However, there are still many ways the string can be written: left or right justification, spaces or zeros for leading positions, and a leading space or plus sign for positive numbers. So you would need a lot of commands to specify exactly how the decimal number is written.

Fortunately there is a commonly used standard to describe all of these; in fact, it is quite possible that the file was written using it. That standard is the printf formatting command used in C as well as in a number of scripting languages. While printf uses d, u, f etc. to describe the type of data, this is not needed in our case because we already know the type. So you can use any of s|d|u|f and we will replace it with the right type for you. The floating point type g is not supported because floating point values will most likely be stored as binary float or double anyway.

You can also use * instead of a number for the field width or precision. It will be replaced by the totalDigits or fractionDigits facet.

If you are not familiar with flags such as +, -, 0 and space used in the justification, you can look them up in the (s)printf manual.

If the datatype is nonNegativeInteger or positiveInteger, you may not even need a printf command. The value is stored as a numeric string whose digit count is equal to the totalDigits facet. For example, 15 with totalDigits=3 will be stored in three bytes as '015'.
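The printf handling described in this section might look roughly like the Python sketch below, which relies on the fact that Python's % operator accepts C-style specifications. The names format_decimal and zero_padded_integer are hypothetical, the sketch handles a single specification per template, and it does not handle a literal %%.

```python
import re

# One printf specification: flags, width, optional precision, type letter.
SPEC = re.compile(r"%([-+ 0]*)(\*|\d*)(?:\.(\*|\d*))?[sduf]")

def format_decimal(value, template, total_digits=None, fraction_digits=None):
    """Render a decimal through a printf-style template (illustrative)."""
    def render(match):
        flags, width, precision = match.groups()
        if width == "*":          # '*' width is taken from totalDigits
            width = str(int(total_digits))
        if precision == "*":      # '*' precision is taken from fractionDigits
            precision = str(int(fraction_digits))
        spec = "%" + flags + width
        if precision is not None:
            spec += "." + precision
        # The type letter in the template is ignored; the datatype is known.
        return (spec + "f") % value
    return SPEC.sub(render, template)

def zero_padded_integer(value, total_digits):
    """Store a nonNegativeInteger as a digit string of totalDigits bytes."""
    return str(value).zfill(total_digits).encode("ascii")
```

Any text before or after the specification in the template is copied through unchanged.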

Boolean datatypes

<data:format booleanTrue="\1|string" booleanFalse="\0|string"/>

Boolean values are normally stored as "\0" or "\1", but you may have stored them as some other tag, such as "0" or "1", "F" or "T", "no" or "yes", or anything you like, as long as the pair is not ambiguous like "a" and "aa".

<data:format boolean01="false|true"/>

In XML, boolean true can be expressed as "true" or "1", false as "false" or "0". Normally it is written out as false/true. If you want to save space you can set boolean01="true" and get 0 and 1 instead.

<data:format bitField="false|lowBitFirst|highBitFirst"/>

A boolean is normally stored as one byte (or more if you change booleanTrue); however, it is possible to store 8 booleans in one byte. There are two ways of doing this:

lowBitFirst, true => 0x01, false true => 0x02, false false true => 0x04 etc.
highBitFirst, true => 0x80, false true => 0x40, false false true => 0x20 etc.
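The two bit orders can be sketched in Python as follows. The function pack_booleans is a hypothetical helper, and padding the last byte with false bits is an assumption.

```python
def pack_booleans(values, order="lowBitFirst"):
    """Pack a list of booleans into bytes, eight per byte."""
    out = bytearray()
    for start in range(0, len(values), 8):
        byte = 0
        for position, flag in enumerate(values[start:start + 8]):
            if flag:
                # lowBitFirst fills from bit 0 up, highBitFirst from bit 7 down.
                shift = position if order == "lowBitFirst" else 7 - position
                byte |= 1 << shift
        out.append(byte)  # unused bits in the last byte stay 0 (assumption)
    return bytes(out)
```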

However, even if you declare bitField, it will not always be used. Consider an element of type boolean that occurs multiple times; normally we can use bitField for it. However, if the element is nillable, then we really have three states (false, true or nil) and we cannot use bitField.

If we have a list of required boolean attributes, then we can use bitField. However, if the attributes are optional, we have the same problem as above. Since it is very common for a missing boolean attribute to mean that the attribute is false, we assume that if the attributes are optional and the schema still declares bitField, a missing boolean attribute is false and bitField can be used.

bitField can also be used for a list of booleans. Earlier we said that a boolean element with maxOccurs > 1 can use bitField provided that it is not nillable. What about a sequence of boolean elements like the following?

Schema of sequence of boolean elements
  <xsd:sequence>
    <xsd:element name="elm1" type="xsd:boolean"/>
    <xsd:element name="elm2" type="xsd:boolean"/>
    <xsd:element name="elm3" type="xsd:boolean"/>
    <xsd:element name="elm4" type="xsd:boolean"/>
    <xsd:element name="elm5" type="xsd:boolean"/>
  </xsd:sequence>

In theory we can use bitField here, but this is not implemented in this version.

String datatypes

There are many different ways to store a string. In the sample schema that comes with the application, we try to do each string in a different way to illustrate these techniques.

If the string is fixed length, as specified in the facet length, then we can use exactly that many bytes to store the string.

Element state in the purchaseOrder schema is fixed length
  <xsd:element name="state">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:length value="2"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

Debugging output:
<state>CA</state> according to data 0x4341 ("CA")

<data:format fieldWidth="number|xsd:maxLength"/>

If the string is variable length but we want to use a fixed number of bytes to store it, we can use fieldWidth to specify the number of bytes. While it is possible to use a number, it makes sense to use the maxLength facet instead, since it should be a normal part of the schema anyway.

If trailing spaces are not significant, we can just store the string and pad it with spaces. If you specify that there is no other way to determine the string length (by using dataCounter="", discussed later), then it will be done this way.

City element in the purchaseOrder schema has fixed field width and no dataCounter
  <xsd:element name="city" data:dataCounter="" data:fieldWidth="xsd:maxLength">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:maxLength value="20"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

Debugging output:
<city>Mill Valley</city> according to data 0x4D696C6C2056616C6C6579202020202020202020 ("Mill Valley         ")
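The space padding shown in the debugging output can be reproduced with a short Python sketch. The helper names are hypothetical, and ASCII encoding is an assumption since the document does not state the character encoding.

```python
def encode_fixed_width(text, field_width, pad=" "):
    """Store a string in exactly field_width bytes, padded with spaces."""
    data = text.encode("ascii")  # encoding is an assumption
    if len(data) > field_width:
        raise ValueError("string is longer than fieldWidth")
    return data.ljust(field_width, pad.encode("ascii"))

def decode_fixed_width(data, pad=" "):
    """Recover the string; trailing pad characters are not significant."""
    return data.decode("ascii").rstrip(pad)
```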

As mentioned above, dataCounter="" means that there is no counter or terminator.

We can also determine the length of the string by either a counter in front of the string or a terminator at the end. This is specified by dataCounter. The default is the null byte, which means the string is a C string.

PartNum attribute in the purchaseOrder schema just uses the default dataCounter
  <xsd:attribute use="required" name="partNum" type="SKU"/>

Debugging output:
partNum="872-AA" according to data 0x3837322D414100 ("872-AA`")

If the terminator is EOF, then the string would go all the way to the end of file. Or you can pick your own terminator such as "\r\n".

You can also use a 1-byte, 2-byte or 4-byte counter in front of the string.

Element name in the purchaseOrder schema has a 1 byte counter in the front
  <xsd:element name="name" type="xsd:string" data:dataCounter="xsd:unsignedByte"/>

Debugging output:
<name>Alice Smith</name> according to data 0x0B416C69636520536D697468 ("`Alice Smith")
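The terminator and binary counter variants can be sketched together in Python. The function encode_string and the COUNTER_CODES table are hypothetical; ASCII encoding and a counter that follows byteOrder are assumptions.

```python
import struct

# Counter datatype -> struct format character for 1, 2 or 4 bytes.
COUNTER_CODES = {
    "xsd:unsignedByte": "B",
    "xsd:unsignedShort": "H",
    "xsd:unsignedInt": "I",
}

def encode_string(text, data_counter="\0", byte_order="littleEndian"):
    """Encode a string with a counter in front or a terminator at the end."""
    data = text.encode("ascii")  # encoding is an assumption
    if data_counter in COUNTER_CODES:
        prefix = "<" if byte_order == "littleEndian" else ">"
        code = prefix + COUNTER_CODES[data_counter]
        return struct.pack(code, len(data)) + data
    if data_counter == "EOF":
        return data  # the string simply runs to the end of the file
    return data + data_counter.encode("ascii")  # terminator, null by default
```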

<data:format dataCounterDigits="number|xsd:maxLength"/>

We can also have a counter with datatype nonNegativeInteger. Then we need to know the number of digits. You can specify the number of digits using dataCounterDigits, either as a number or as "xsd:maxLength". The latter does not mean that the number of digits is equal to the maxLength of the string. Rather, the number of digits is the minimum needed to accommodate maxLength. So if maxLength is 125, the number of digits is 3.

Element street in the purchaseOrder schema has a two-digit counter in the front
  <xsd:element name="street" data:choiceTag="S" data:dataCounter="xsd:nonNegativeInteger" data:dataCounterDigits="xsd:maxLength">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:maxLength value="40"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

Debugging output:
<street>123 Maple Street</street> according to data 0x3136313233204D61706C6520537472656574 ("16123 Maple Street")
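The digit counter derived from maxLength can be sketched in Python like this. The helper names are hypothetical and ASCII encoding is an assumption.

```python
def counter_digits(max_length):
    """Minimum number of decimal digits that can hold max_length."""
    return len(str(max_length))

def encode_with_digit_counter(text, max_length):
    """Prefix the string with its length as a zero-padded digit string."""
    data = text.encode("ascii")  # encoding is an assumption
    if len(data) > max_length:
        raise ValueError("string is longer than maxLength")
    counter = str(len(data)).zfill(counter_digits(max_length))
    return counter.encode("ascii") + data
```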

In the last few examples, the space taken depends on the length of the string. Often, however, the space taken is the same regardless of the length of the string, especially when you are writing a whole struct out to the file. We can do this by specifying both fieldWidth and dataCounter. For example, the MacOS datatype Str255 always takes up 256 bytes; the first byte is the length of the string, so a maximum of 255 characters can be accommodated. It can be specified as fieldWidth="255" dataCounter="xsd:unsignedByte".

In the decimal datatype section, we talked about the printf command. It can also be used with string datatypes. Note that printf allows extra text to be printed before and after the data. We can make use of this when some text should appear in the binary file but not in the XML data. Here is an example from the examples page. In the "binary" configuration file we have "[homes]", but in the XML data we want it to be just "homes". Here is how it is accomplished.

Printf command with extra printout before and after
  <xs:element name="name" type="xs:string" data:printf="[%s]"/>

Debugging output:
<name>homes</name> according to data 0x5B686F6D65735D0D0A ("[homes]``")

Enumeration

Currently nothing special is done about enumeration. However if the goal is a compact binary file, I can imagine that enumeration can be used to reduce the size of binary data. This will be done in a manner similar to choiceTag discussed below, but not in this version.

Binary string encoded datatypes

Both hexBinary and base64Binary datatypes are supported. As with strings, we can have fixed-length and variable-length data. The fieldWidth, dataCounter and dataCounterDigits commands still apply. Of course, dataCounter as a terminator does not make sense except for EOF.

List datatypes

List datatypes are only partially implemented. For now, just stay with variable-length lists with a counter in front.

Date and time datatypes

There are so many different ways to store these data that implementing a reasonably rich set will take a lot of work. For now they are just treated as strings.

Union type

This is not yet implemented. Going from binary to XML should be easy, but it would require some work to go the other direction. So it is not supported in this version.

Nil value

<data:format nil="\0|string" notNil="\1|string"/>

If an element is nillable, then we need a flag to indicate whether it is nil or has a value. This command lets you specify the flags for nil and non-nil elements. The example below uses the default nil flags.

Element billTo in the purchaseOrder schema is nillable
  <xsd:element name="billTo" type="USAddress" nillable="true"/>

Debugging output:
billTo not nil according to data 0x01 ("`")

Optional attribute or parameter

<data:format optionAbsent="\0|string" optionPresent="\1|string"/>

An attribute can be optional. An element with minOccurs="0" maxOccurs="1" is also considered optional. In both cases we need a flag to indicate whether the optional entity is absent or present. The optionAbsent and optionPresent commands let you set the value of the flag.

Attribute orderDate and element shipDate in the purchaseOrder schema are optional
  <xsd:annotation>
    <xsd:appinfo>   
      <data:format optionPresent="P" optionAbsent="A"/>
    </xsd:appinfo>
  </xsd:annotation>
  
    <xsd:attribute name="orderDate" type="xsd:date"/>

    <xsd:element minOccurs="0" name="shipDate" type="xsd:date"/>

Debugging output:
optional @orderDate present according to data 0x50 ("P")
orderDate="1999-10-20" according to data 0x313939392D31302D323000 ("1999-10-20`")

optional shipDate absent according to data 0x41 ("A")
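The presence flag can be sketched in Python as follows. The function encode_optional is hypothetical; it assumes the value is a string stored with the default null terminator, as the date in the debugging output above appears to be.

```python
def encode_optional(value, option_present="\1", option_absent="\0"):
    """Write the presence flag, followed by the value when it is present."""
    if value is None:
        return option_absent.encode("ascii")
    # The value is stored as a null-terminated string (assumption).
    return option_present.encode("ascii") + value.encode("ascii") + b"\0"
```

Calling it with option_present="P" and option_absent="A" gives the same bytes as the orderDate and shipDate lines in the debugging output.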

Element count

<data:format itemCounterDigits="number|xsd:maxOccurs"/>

When minOccurs is equal to maxOccurs, we know how many times an element repeats. However, if they are not equal and maxOccurs > 1, we need to keep track of the repeat count. The situation is similar to the length of a string. We can have a counter in front or a terminator at the back, or use EOF as the terminator. As discussed earlier, itemCounter="EOF" is never inherited and will be bypassed in the inheritance hierarchy. While we could use dataCounter for both purposes, it would not be a good idea. The reason is that a string has no child elements, because we do not support mixed content, so we do not have to worry about any element inheriting the command. This is not true for elements. If a single counter served both purposes, specifying a counter on an element might force us to respecify the counter for all of its child elements. That is why we have separate counters for data and items. Changing the item counter does not affect the inheritance of the data counter.

When we have a terminator string, the repeating element terminates when the terminator is encountered. The converter then skips over the terminator and continues by reading the next element. However, if the terminator is considered to be part of the next element, you want the converter to look ahead for the terminator but leave it in place. You can achieve this by putting ?= before the terminator. You can also terminate when the look-ahead does not find a certain string, by putting ?! in front of it. This feature is only implemented for the item counter, not the data counter. We may implement it for the data counter in a future version.
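The three terminator styles can be sketched as a small reading loop in Python. The function read_items and its read_item callback are hypothetical and only model the behavior described above; they are not the converter's code.

```python
def read_items(data, pos, read_item, terminator):
    """Read repeated items from data starting at pos.

    read_item(data, pos) must return (item, new_pos).
    Returns (items, position after the repeated items).
    """
    lookahead = negate = False
    if terminator.startswith("?="):    # stop at the marker, leave it in place
        lookahead, terminator = True, terminator[2:]
    elif terminator.startswith("?!"):  # stop when the marker is NOT next
        lookahead, negate, terminator = True, True, terminator[2:]
    mark = terminator.encode("ascii")
    items = []
    while pos < len(data):
        found = data.startswith(mark, pos)
        if found != negate:
            if not lookahead:
                pos += len(mark)       # plain terminator: skip over it
            break
        item, pos = read_item(data, pos)
        items.append(item)
    return items, pos
```

With a plain terminator the returned position is just past the terminator, while with ?= or ?! it points at the first byte that was looked at but not consumed.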

As for itemCounterDigits, we no longer have maxLength as in string, however maxOccurs plays a similar role and can be used to determine the totalDigits if you prefer.

Sequence and choice can also have a repeat count. You can apply the itemCounter to choice too. However, counting of sequence can be tricky. So we are not supporting it in this version and may never support it.

In the purchaseOrder schema, the purchased item element is an example of using EOF as the terminator.

The compositors: sequence, choice, all, any

<data:format csv="false|true"/>

<data:format csvTerminator="\r\n|string"/>

<data:format csvSeparatorChar=",|char"/>

<data:format csvQuoteChar='"|char'/>

<data:format csvEscapeChar="none|char"/>

Comma separated values (csv) is a commonly used file format. Usually when csv files are translated into XML, the fields in each record correspond to a sequence in the XML schema. These delimited values do not mix very well with the rest of the storage formats discussed in this document. However, we cannot afford to ignore csv because it is so common. We need to think more about the issues and may revise this in the future.

For now you can declare that a sequence will be stored as csv just by declaring data:csv="true". Not all sequences can be stored as csv. If the sequence contains an optional element or a variable count element, then we cannot use csv because we do not know how to store the option flag or counter. However, if the last element in a sequence can have variable occurrence, then we can treat all the fields at the end as belonging to that element, and we can handle that.

If one of the elements in the sequence in turn contains a sequence, then the embedded sequence will also be stored as csv. Therefore you do not even have to declare the embedded sequence to be csv. So there is no mixing of csv and other formats, and this keeps things simple. In general the default csv terminator, separator character, quote character and escape character are the ones used in the file, so you do not need to worry about them. If you are using a different one, just declare its value.

The current implementation uses the Perl Text::CSV_XS module. Some problems with the module are that it does not ignore leading or trailing blanks, and that strings are quoted even when it is not necessary. There are workarounds, even though they involve doing our own parsing. Another strange behavior is that in the examples page, we start with the data

777227878,Simi? D Roy,123000.00

convert it to XML

  <employee>
    <ssn>777227878</ssn>
    <name>
      <fName>Simi D</fName>
      <lName>Roy</lName>
    </name>
    <salary>123000</salary>
  </employee>
  
convert XML back to binary and get

777227878,"""Simi D"" Roy",123000.00

The name looks completely different but it is really logically equivalent. At first I thought this was wrong because it fails round-trip fidelity. After working on it for a while I now think this is really the right thing to do.

According to Postel's Law, you should "Be liberal in what you accept, and conservative in what you send". This means that we should read and tolerate the variations in format from different software packages, but write in a format that different software packages can read without problems. So when we read a csv file, we should accept that there may be spaces around the record separator that are not part of the data. When we write, we should remove the extraneous spaces so that the file is readable by more software. Similarly, if we write """Simi D"" Roy", more software will be able to interpret it correctly. So this is a feature and not a bug. The format we write out is more readable by other software; a spreadsheet program would show the difference below.

777227878	Simi? D Roy	123000
		
777227878	"Simi D" Roy	123000

"""Simi D"" Roy" may look bad to the untrained eyes, but it shows up just fine in programs.

On further examination it is not that simple; I need to understand more about Text::CSV_XS. For now it works well enough to give acceptable results.
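The quoting of a field that itself contains quote characters can be reproduced with Python's standard csv module. This is a different library from Text::CSV_XS and is used here only to illustrate the common csv convention of doubling embedded quotes; the helper names are hypothetical.

```python
import csv
import io

def write_record(fields):
    """Write one csv record using the default quoting rules."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\r\n").writerow(fields)
    return buffer.getvalue()

def read_record(line):
    """Parse one csv record back into its fields."""
    return next(csv.reader([line]))
```

A field whose value is "Simi D" Roy, quotes included, is written with the whole field quoted and each embedded quote doubled, and reading that record back returns the original value.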

Currently we do not accept linefeeds inside quoted fields. This should be solved as part of the general issue of csv inside csv, which requires more careful design.

<data:format choiceTag="string"/>

When there is a choice, the binary data needs to identify which choice is in the data. Each choice has a choice tag, which occurs immediately before the data of the chosen element. The choice tag is never inherited, since inheritance would give all tags the same value and make it impossible to distinguish between them. However, there is a default value for the choice tag. The first choice has a default tag of '\0', the second one '\1', and so on. We only have limited support for choice. Each choice must be a different element. If the particle in the choice is a sequence, identifying the choice in the XML data can be tricky and is not supported in this version. Here is an example of using choice.

In the address of purchaseOrder schema, we can choose between street or poBox
  <xsd:choice>
    <xsd:element name="street" data:choiceTag="S" data:dataCounter="xsd:nonNegativeInteger" data:dataCounterDigits="xsd:maxLength">
      <xsd:simpleType>
        <xsd:restriction base="xsd:string">
          <xsd:maxLength value="40"/>
        </xsd:restriction>
      </xsd:simpleType>
    </xsd:element>
    <xsd:element name="poBox" type="xsd:int" data:choiceTag="P"/>
  </xsd:choice>

Debugging output:
Choice is poBox according to data 0x50 ("P")
<poBox>5354</poBox> according to data 0xEA140000 ("````")
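The bytes in this debugging output can be reproduced with a short Python sketch. The helpers choice_tag and encode_po_box are hypothetical, and the default tag is modeled as a single byte holding the position of the choice.

```python
import struct

def choice_tag(index, declared=None):
    """Return the declared tag, or the positional default (\\0, \\1, ...)."""
    if declared is not None:
        return declared.encode("ascii")
    return bytes([index])

def encode_po_box(number):
    """Tag 'P' followed by the xsd:int in the default littleEndian order."""
    return choice_tag(1, "P") + struct.pack("<i", number)
```

For poBox 5354 this gives the tag byte 0x50 followed by EA 14 00 00.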

The all compositor is similar to the choice compositor in the sense that the particles also need a choice tag. This will tell you the order the elements are encountered.

The schema allows comment and browseable to appear in any order:
  <xs:all>
    <xs:element name="comment" type="xs:string" data:choiceTag=" comment="/>
    <xs:element name="browseable" minOccurs="0" type="xs:string" data:choiceTag=" browseable="/>
  </xs:all>

Debugging output:
Item in all is comment according to data 0x20636F6D6D656E743D (" comment=")
<comment>All Printers</comment> according to data 0x416C6C205072696E746572730D0A ("All Printers``")
Item in all is browseable according to data 0x2062726F77736561626C653D (" browseable=")
<browseable>no</browseable> according to data 0x6E6F0D0A ("no``")