view mde/mergetag/doc/file-format-text.txt @ 0:d547009c104c

Repository creation. committer: Diggory Hardy <diggory.hardy@gmail.com>
author Diggory Hardy <diggory.hardy@gmail.com>
date Sat, 27 Oct 2007 18:05:39 +0100

This is the file format for mergetag text files.
Version: 0.1 unfinalised


Files should be encoded as Unicode UTF-8, UTF-16 or UTF-32; any encoding other than UTF-8 must include a BOM.
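
A minimal Python sketch of how a reader might pick the encoding from the BOM. The function name
decode_mergetag_bytes is hypothetical, and treating a file with no BOM as UTF-8 is an assumption
drawn from the rule above.

```python
import codecs

# Longest BOMs first: the UTF-32 LE BOM starts with the same bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_mergetag_bytes(raw: bytes) -> str:
    """Decode raw file bytes, using the BOM if present and UTF-8 otherwise."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            # Strip the BOM so the text begins with the header tag.
            return raw[len(bom):].decode(encoding)
    return raw.decode("utf-8")
```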


Hierarchy:
+	Sections	(special section: see header)
++	Data Tags


IDs:
IDs are used for several purposes; they are always stored as an unsigned 32-bit integer (uint,
0-4294967295). They may be given in the file as a base-10 or hex number or, where a lookup table is
provided to the reader, as a double-quoted string (with no escape sequences).
Multiple section or data tags with the same ID are allowed; see the "Merging rules" section.
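
The ID rules can be sketched in Python as follows. This is an illustration, not the reader's actual
code: parse_id and the lookup argument are hypothetical names, and the 0x prefix for hex IDs is an
assumption borrowed from the hexadecimal data format.

```python
import re
from typing import Dict, Optional

UINT_MAX = 4294967295


def parse_id(token: str, lookup: Optional[Dict[str, int]] = None) -> int:
    """Convert an ID token (decimal, hex or quoted string) to its uint value."""
    token = token.strip()
    if len(token) >= 2 and token[0] == '"' and token[-1] == '"':
        # String IDs have no escape sequences, so the content is taken literally.
        name = token[1:-1]
        if lookup is None or name not in lookup:
            raise ValueError("string ID %r has no lookup entry" % name)
        value = lookup[name]
    elif re.fullmatch(r"0[xX][0-9a-fA-F]+", token):
        value = int(token[2:], 16)  # assumed 0x prefix for hex IDs
    elif re.fullmatch(r"[0-9]+", token):
        value = int(token, 10)
    else:
        raise ValueError("malformed ID: %r" % token)
    if not 0 <= value <= UINT_MAX:
        raise ValueError("ID out of uint range: %r" % token)
    return value
```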


Outside of tags, only whitespace and valid tags are allowed. Whitespace is ignored.
The following tags are valid (see below for details):
tag		purpose
{...}		section identifiers
<...>		data items
!{...}		simple comment block
!<...>		comment block parsed the same as <...>
Within tags, type specifications or data items, whitespace is allowed between symbols.


Section identifier tags:
Format: {ID} or {ID|ID}
In the {ID|ID} case, the first ID is the section type, and the second ID the section name.
In the {ID} case, the section type ID has been omitted and the default type is used (0).
A section identifier marks the beginning of a new section, extending until the next section
identifier or the end of the file. When a section is read, a new 


Data item tags:
Format: <tp|ID=dt>
A data item with type tp, identifier ID and data dt. If the data does not fit the given type it is
an error and the tag is ignored. Once split into a type string, ID and data string, the contents
are passed to an addTag() function within the DataSection class which will parse tags of a
recognised format and either ignore or print a warning about other tags.
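
A rough Python sketch of the splitting step described above, assuming the text between < and > has
already been extracted. split_data_tag is a hypothetical name; the real addTag() interface is not
specified here.

```python
from typing import Tuple

TERMINATORS = "<>|="


def split_data_tag(inner: str) -> Tuple[str, str, str]:
    """Split the text between '<' and '>' into (type, ID, data) strings."""
    # The type token ends at the first of < > | =, and only | is legal there.
    end = next((i for i, ch in enumerate(inner) if ch in TERMINATORS), -1)
    if end < 0 or inner[end] != "|":
        raise ValueError("type must be terminated by '|'")
    tp = inner[:end].strip()
    rest = inner[end + 1:].lstrip()
    if rest.startswith('"'):
        # A string ID has no escape sequences, so it ends at the next quote.
        close = rest.find('"', 1)
        if close < 0:
            raise ValueError("unterminated string ID")
        id_token, rest = rest[:close + 1], rest[close + 1:].lstrip()
        if not rest.startswith("="):
            raise ValueError("expected '=' after ID")
        dt = rest[1:]
    else:
        id_token, sep, dt = rest.partition("=")
        if not sep:
            raise ValueError("expected '=' after ID")
        id_token = id_token.strip()
    return tp, id_token, dt.strip()
```

Reading a quoted ID up to its closing quote means an = inside the string does not split the tag
early.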


Data item tags: Type format:
Note:
	The type is not initially parsed; it is read as a token terminated by any of these
	characters:	<>|=
	A token terminated by any character other than | is an error.
Format:
	tp		a basic type
	tp[]		a dynamic list of sub-type tp
Possible future additions:
	tp()		a dynamic merging list of sub-type tp (only valid as the primary type, i.e.
			<subtype()|...>, not a sub-type of a tuple or another dynamic list)
	{t1,t2,...,tn}	a tuple with sub-types t1, t2, ..., tn
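
For the two currently defined forms (tp and tp[]), parsing the type token could look like the
Python sketch below. parse_type is a hypothetical name, and accepting nested lists such as tp[][]
is an assumption; the future tp() and tuple forms are not handled.

```python
from typing import Tuple


def parse_type(type_token: str) -> Tuple[str, int]:
    """Return (basic type name, list depth) for 'tp', 'tp[]', 'tp[][]', ..."""
    # Whitespace is allowed between symbols, so drop it before matching.
    # (This is lenient: it also joins a name that was split by whitespace.)
    text = "".join(type_token.split())
    depth = 0
    while text.endswith("[]"):
        text = text[:-2]
        depth += 1
    if not text or not text.isalnum():
        raise ValueError("unsupported type: %r" % type_token)
    return text, depth
```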

Basic types (only items with a + are currently supported):
	abbrev./full name (each type has two names which can be used):
	
	0	void	--- less useful type
+	1	bool	--- integer types
+	s8	byte
+	u8	ubyte
+	s16	short
+	u16	ushort
+	s32	int
+	u32	uint
+	s64	long
+	u64	ulong
	s128	cent
	u128	ucent
	
+		binary	--- alias for ubyte[]
	
+	fp32	float	--- floating point types
+	fp64	double
+	fp	real
	im32	ifloat
	im64	idouble
	im	ireal
	cpx32	cfloat
	cpx64	cdouble
	cpx	creal
	
+	UTF8	char	--- character types (actually these CANNOT support UTF8 chars with length > 1)
	UTF16	wchar
	UTF32	dchar
+		string	--- alias for char[] --- (DOES support UTF8)
		wstring	--- alias for wchar[]
		dstring	--- alias for dchar[]
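
Since each type has two names, a reader will probably want to normalise them. The Python table
below covers only the types marked + above; canonical_type is a hypothetical helper, and treating
names as case-sensitive is an assumption.

```python
from typing import Optional

# abbreviation -> full name, for the currently supported (+) types only
ABBREVIATIONS = {
    "1": "bool",
    "s8": "byte", "u8": "ubyte",
    "s16": "short", "u16": "ushort",
    "s32": "int", "u32": "uint",
    "s64": "long", "u64": "ulong",
    "fp32": "float", "fp64": "double", "fp": "real",
    "UTF8": "char",
}

# 'binary' and 'string' are aliases with no abbreviation of their own.
SUPPORTED = set(ABBREVIATIONS.values()) | {"binary", "string"}


def canonical_type(name: str) -> Optional[str]:
    """Return the full name of a supported basic type, or None otherwise."""
    full = ABBREVIATIONS.get(name, name)
    return full if full in SUPPORTED else None
```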


Data item tags: Data format:
Valid chars:	[](){},+-.0-9eEixXa-fA-F '.' ".*"
Format:
	[d1,d2,...,dn]	data all of type t corresponding to t[]
	(d1,d2,...,dn)	data all of type t corresponding to t()
	{d1,d2,...,dn}	data corresponding to a type declaration of {t1,t2,...,tn}
	d		a single data element

Single data elements:
	z		an integer number (regexp: [+-]?[0-9]+)
	z		a floating point number (rough regexp: [+-]?[0-9]*[.]?[0-9]*(e[+-]?[0-9]+)?)
	zi		an imaginary floating point number (z is a floating point number)
	y+zi, y-zi	a complex number (4+0i may be written as 4, etc) (y, z are f.p.s)
	0xz, -0xz	a hexadecimal integer z (composed of chars 0-9,a-f,A-F)
	'c'		a char/wchar/dchar character, depending on the type specified (c may be an
			escape sequence or any single character except ')
	"string"	equivalent to ['s','t','r','i','n','g'] (for a string/wstring/dstring type)
			may contain escape sequences
			Escape sequences are a subset of those supported by D: \" \' \\ \a \b \f \n \r \t \v
	XX...XX		Binary (ubyte[]); each pair of chars is read as a hex ubyte
	<void>		void "data" has no symbols
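
A hedged Python sketch of three of the element forms above: integers (decimal or hexadecimal),
binary data and escaped strings. The function names are hypothetical, and ignoring whitespace
between hex pairs in binary data is an assumption (it is what bytes.fromhex does).

```python
import re

_ESCAPES = {'"': '"', "'": "'", "\\": "\\", "a": "\a", "b": "\b",
            "f": "\f", "n": "\n", "r": "\r", "t": "\t", "v": "\v"}


def parse_int(text: str) -> int:
    """Parse z, 0xz or -0xz."""
    text = text.strip()
    match = re.fullmatch(r"(-?)0[xX]([0-9a-fA-F]+)", text)
    if match:
        value = int(match.group(2), 16)
        return -value if match.group(1) else value
    if re.fullmatch(r"[+-]?[0-9]+", text):
        return int(text, 10)
    raise ValueError("not an integer: %r" % text)


def parse_binary(text: str) -> bytes:
    """Parse XX...XX; raises ValueError on a non-hex char or an odd digit count."""
    return bytes.fromhex(text.strip())


def parse_string(text: str) -> str:
    """Parse "string", expanding the escape sequences listed above."""
    text = text.strip()
    if len(text) < 2 or text[0] != '"' or text[-1] != '"':
        raise ValueError("string must be double-quoted")
    out, i, end = [], 1, len(text) - 1
    while i < end:
        ch = text[i]
        if ch == "\\":
            i += 1
            if i >= end or text[i] not in _ESCAPES:
                raise ValueError("bad escape sequence")
            out.append(_ESCAPES[text[i]])
        elif ch == '"':
            raise ValueError("unescaped quote inside string")
        else:
            out.append(ch)
        i += 1
    return "".join(out)
```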


Data format: Escape sequences:
To be created and written.


Comment tags (there are no line comments):
Simple comment blocks:
Format: !{...}
This is a simple comment block, and only curly braces ({,}) are treated specially. A {, whether or
not it is preceded by a !, starts an embedded comment block, and a } ends either an embedded block
or the actual comment block. Note: beware commenting out {...} tags with a string ID containing
curly braces which aren't in matching pairs.
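
The brace-matching rule can be sketched in Python as a simple depth counter. skip_comment_block is
a hypothetical name; it returns the index just past the closing brace.

```python
def skip_comment_block(text: str, start: int) -> int:
    """Given text[start:] beginning with '!{', return the index after its end."""
    if not text.startswith("!{", start):
        raise ValueError("not a comment block")
    depth = 0
    # Begin at the '{' itself; a preceding '!' on embedded blocks is irrelevant.
    for i in range(start + 1, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return i + 1
    raise ValueError("unterminated comment block")
```

Because only braces are counted, a quoted ID containing an unmatched brace inside the comment will
shift where the block ends, which is the pitfall noted above.
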
Commented data tags:
Format: !<tp|ID=dt>
A commented-out data tag. Conformance to the above spec may not be checked as strictly as
normal, but the dt section is checked for strings so that a > within a string won't end the tag.


Merging rules:
If, when a data item is read, a data item with the same identifier
within the same section already exists in the DataSet being read into:
+	if the types are identical:
++		if the primary type is a tp() mergeable dynamic list:
+++			the entries from the item being read are concatenated to those in the item
+++			in the DataSet
++		else:
++-			the item already in the DataSet takes priority and is left untouched
+	else:
+-		a warning is issued, and the data item within the DataSet is left untouched
This allows settings in a user config file to be merged with the remaining settings in a complete
system config file, and gives some support for modifications which override or add to existing data.
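
The merging rules map onto a small function. The Python sketch below is illustrative only: a
section is modelled as a dict from ID to (type string, value), type strings are assumed to be
already normalised, and the function returns a status instead of printing the warning itself.

```python
from typing import Any, Dict, Tuple

Item = Tuple[str, Any]  # (type string, parsed value)


def merge_item(section: Dict[int, Item], item_id: int, tp: str, value: Any) -> str:
    """Apply the merging rules for one data item; return what happened."""
    if item_id not in section:
        section[item_id] = (tp, value)
        return "added"
    old_tp, old_value = section[item_id]
    if old_tp != tp:
        # Types differ: the caller should issue a warning; nothing changes.
        return "type mismatch"
    if tp.endswith("()"):
        # Mergeable dynamic list: append the new entries to the existing ones.
        section[item_id] = (tp, list(old_value) + list(value))
        return "merged"
    # Same type, not mergeable: the item already in the DataSet wins.
    return "kept existing"
```

Under these rules the file read first takes priority, so a user config would be read before the
system config it overrides.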


Header:
The header is a standard section which is mandatory and must be the first section. Its section
identifier must start at the beginning of the file with no whitespace, declared with:
	{MTXY}		where XY is a two-digit CAPITAL HEX version number representing the
			mergetag format version, e.g. {MT01}.
If these are not the first 6 characters of the file, the file will not be regarded as valid.
This formatting is very strict to allow reliable low-level parsing.
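
A strict header check might look like this Python sketch. header_version is a hypothetical name; it
operates on already-decoded text (with any BOM removed) and returns the two version digits
uninterpreted.

```python
import re
from typing import Optional

_HEADER = re.compile(r"\{MT([0-9A-F]{2})\}")


def header_version(text: str) -> Optional[str]:
    """Return the two hex digits XY if text starts with {MTXY}, else None."""
    # re.match anchors at position 0, so leading whitespace is rejected.
    match = _HEADER.match(text)
    return match.group(1) if match else None
```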


The data tags within the header have no special meaning; any may be used such as the following:
	<string|"Author"="...">
	<string|"Name"="...">
	<string|"Description"="...">
	<string|"Program"="...">	(which program created/uses this?)
	<*|"Version"=...>		(use any supported type)
	<string|"Date"="YYYYMMDD">	(reverse date format; optionally "YYYYMMDDhhmmss")
	<{u16,u8,u8}|"Date"={YYYY,MM,DD}>	(actually this type probably won't be supported by
						a standard section)
	<string|"Copyright"=...>


Example:
{MT01}
{"example section"}
<u32|"num"=5>
<{u32,UTF8[]}()|"DATA"=(
	{1,['a']},
	{59,['w','o','r','d']},
	{2,"strings can be written like this"} )>
<wchar[]|"name"="This string is stored in UTF16, regardless of the file's encoding.">
<{u32,UTF8[]}()|"DATA"=(
	{3,"this is appended to the previous 'DATA' item"} )>
{"section: section identifiers and tuples are not confused since tuples only occur inside <...> items"}
<void|"Empty tag"= >
!{this is a comment {containing a comment}}