view codeDoc/file/mergetag/file-format-text.txt @ 82:ac1e3fd07275

New ssi file format. (De)serializer now supports non-ascii wide characters (encoded to UTF-8) and no longer supports non-ascii 8-bit chars which would result in bad UTF-8. Moved/renamed a few things left over from the last commit.
author Diggory Hardy <diggory.hardy@gmail.com>
date Sat, 30 Aug 2008 09:37:35 +0100
parents codeDoc/mergetag/file-format-text.txt@611f7b9063c6

Part of mde: a Modular D game-oriented Engine
Copyright © 2007-2008 Diggory Hardy

This program is free software: you can redistribute it and/or modify it under the terms
of the GNU General Public License as published by the Free Software Foundation, either
version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.


This is the file format for mergetag text files.
Version: 0.1 unfinalised


The encoding should be Unicode: UTF-8, UTF-16 or UTF-32. A file in any encoding other than UTF-8 must include a BOM.
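
The encoding rule above can be sketched in Python as follows. This is only an illustration, not the mde reader: the function name decode_mergetag_bytes is hypothetical, and treating a BOM-less file as UTF-8 is an assumption drawn from the rule that every other encoding must carry a BOM.

```python
import codecs

# Order matters: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so the UTF-32 BOMs are tested first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_mergetag_bytes(raw: bytes) -> str:
    """Decode raw file bytes, using the BOM if present (hypothetical helper)."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            # Strip the BOM so the text starts with the header tag.
            return raw[len(bom):].decode(encoding)
    # Assumption: without a BOM the file can only be UTF-8.
    return raw.decode("utf-8")
```

Stripping the BOM before decoding keeps the header check simple, since the decoded text then starts directly with the {MTXY} tag.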


Hierarchy:
+	Sections	(special section: see header)
++	Data Tags


IDs:
IDs are used for several purposes; they are UTF-8 strings. They are stored in text files as unquoted strings; escape sequences are not supported and the strings should not contain the following characters, although this is not checked: <|=>{}
All characters between the appropriate markers are consumed into the ID, hence whitespace is meaningful.
Multiple section or data tags with the same ID are allowed; see the "Merging rules" section.
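
Since the reader does not check IDs for the reserved characters, a writer may want to check them itself. A minimal sketch in Python; the function name is hypothetical and nothing like it is described in this document:

```python
# Characters which the format says should not appear in an ID.
FORBIDDEN_ID_CHARS = set("<|=>{}")


def find_forbidden_id_chars(identifier: str) -> list[str]:
    """Return the reserved characters found in an ID, sorted (hypothetical helper)."""
    return sorted(set(identifier) & FORBIDDEN_ID_CHARS)
```

An empty result means the ID is safe to write. Whitespace is deliberately not rejected, since it is a meaningful part of an ID.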


Only whitespace and valid tags are allowed at the top level of a file. Whitespace outside of tags is ignored.
The following tags are valid (see below for details):
tag		purpose
{...}		section identifiers
<...>		data items
!{...}		simple comment block
!<...>		comment block parsed the same as <...>
Within tags, type specifications or data items, whitespace is allowed between symbols.


Section identifier tags:
Format: {ID}
The ID is the section identifier/name. The ID type is DefaultData unless overridden by the code using the reader.
A section identifier marks the beginning of a new section, extending until the next section identifier or the end of the file.


Data item tags:
Format: <tp|ID=dt>
A data item with type tp, identifier ID and data dt. It is an error if the data does not fit the given type; in that case the tag is ignored. Once the tag has been split into a type string, an ID and a data string, these are passed to an addTag() function within the DataSection class, which parses tags of a recognised format and either ignores other tags or prints a warning about them.
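
The splitting step described above might look like the following Python sketch. It is an illustration under assumptions, not the mde reader: split_data_tag is a hypothetical name, the tag is assumed to have been isolated already (including its enclosing < and >), and the data string is returned uninterpreted.

```python
def split_data_tag(tag: str) -> tuple[str, str, str]:
    """Split '<tp|ID=dt>' into (type, ID, data) strings (hypothetical helper)."""
    if not (tag.startswith("<") and tag.endswith(">")):
        raise ValueError("a data tag must be enclosed in < and >")
    body = tag[1:-1]
    # The type token ends at the first of < > | = and that character must be |.
    for pos, ch in enumerate(body):
        if ch in "<>|=":
            break
    else:
        raise ValueError("no '|' found after the type")
    if ch != "|":
        raise ValueError(f"type terminated by {ch!r} instead of '|'")
    type_str = body[:pos]
    # Every character up to the first '=' belongs to the ID, whitespace included.
    id_str, sep, data_str = body[pos + 1:].partition("=")
    if not sep:
        raise ValueError("no '=' found after the ID")
    return type_str, id_str, data_str
```

For example, split_data_tag("<int|num=5>") yields the three strings "int", "num" and "5", which a caller could then hand to something like addTag().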


Data item tags: Type format:
Note:
	The type is read as a single token terminated by any of these characters:	<>|=
	There must not be spaces within the type; e.g. "char []" is invalid.
	A type token terminated by any character other than | is an error.
Format:
	tp		a basic type
	tp[]		a dynamic list of sub-type tp
	t1[t2]		an associative array with key-type t2
Possible future additions:
	tp()		a dynamic merging list of sub-type tp (only valid as the primary type, ie <subtype()|...>, not a sub-type of a tuple or another dynamic list)
	{t1,t2,...,tn}	a tuple with sub-types t1, t2, ..., tn
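
As an illustration of the currently supported type formats (tp, tp[] and t1[t2]), here is a small recursive parser in Python. The function name and the tuple representation are hypothetical. Two assumptions are made that this document does not state: t1 is taken to be the value type of an associative array, and types may nest (e.g. int[][]). The possible future tp() and {...} forms are not handled.

```python
def parse_type(spec: str):
    """Parse a type string into a nested tuple (hypothetical representation):
    ("basic", name), ("list", element) or ("assoc", value, key)."""
    if " " in spec or "\t" in spec:
        raise ValueError("whitespace is not allowed within a type")
    if not spec.endswith("]"):
        if not spec or "[" in spec or "]" in spec:
            raise ValueError(f"malformed type: {spec!r}")
        return ("basic", spec)
    # Scan backwards for the '[' that matches the final ']'.
    depth = 0
    for pos in range(len(spec) - 1, -1, -1):
        if spec[pos] == "]":
            depth += 1
        elif spec[pos] == "[":
            depth -= 1
            if depth == 0:
                break
    else:
        raise ValueError(f"unbalanced brackets in type: {spec!r}")
    inner = parse_type(spec[:pos])   # the type to the left of the brackets
    key = spec[pos + 1:-1]           # the text between the brackets
    if key == "":
        return ("list", inner)
    # Assumption: t1 in t1[t2] is the value type, as in D.
    return ("assoc", inner, parse_type(key))
```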

Basic types (only items with a + are currently supported, items with * are in DefaultData):
	name
	
	void	--- less useful type
+*	bool	--- integer types
+*	byte
+*	ubyte
+*	short
+*	ushort
+*	int
+*	uint
+*	long
+*	ulong
	cent
	ucent
	
+*	binary	--- alias for ubyte[]
	
+*	float	--- floating point types
+*	double
+*	real
	ifloat
	idouble
	ireal
	cfloat
	cdouble
	creal
	
+*	char	--- single character types (actually these CANNOT support UTF8 symbols with length > 1)
	wchar
	dchar
+*	string	--- alias for char[] --- (DOES support UTF8)
	wstring	--- alias for wchar[]
	dstring	--- alias for dchar[]


Data item tags: Data format:
Valid chars:	[](){},+-.0-9eEixXa-fA-F '.' ".*"
Format:
	[d1,d2,...,dn]	data all of type t corresponding to t[]
	(d1,d2,...,dn)	data all of type t corresponding to t()
	{d1,d2,...,dn}	data corresponding to a type declaration of {t1,t2,...,tn}
	d		a single data element

Single data elements:
	z		an integer number (regexp: [+-]?[0-9]+)
	z		a floating point number (rough regexp: [+-]?[0-9]*[.]?[0-9]*(e[+-]?[0-9]+)?)
	zi		an imaginary floating point number (z is a floating point number)
	y+zi, y-zi	a complex number (4+0i may be written as 4, etc) (y, z are f.p.s)
	0xz, -0xz	a hexadecimal integer z (composed of chars 0-9,a-f,A-F)
	'c'		a char/wchar/dchar character, depending on the type specified (c is either a single character other than ', or an escape sequence)
	"string"	equivalent to ['s','t','r','i','n','g'] --- may contain the following escape sequences as defined in D: \" \' \\ \a \b \f \n \r \t \v
	XX...XX		Binary (ubyte[]); each pair of chars is read as a hex ubyte
	<void>		void "data" has no symbols


Data format: Escape sequences:
To be created and written.


Comment tags (there are no line comments):
Simple comment blocks:
Format: !{...}
This is a simple comment block, and only curly braces ({,}) are treated specially. A {, whether or not it is preceded by a !, starts an embedded comment block, and a } ends either an embedded block or the actual comment block. Note: beware commenting out anything containing curly braces which aren't in matching pairs.
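
The brace matching described above amounts to a depth counter. A minimal Python sketch, with a hypothetical function name:

```python
def skip_comment_block(text: str, start: int) -> int:
    """Given the index of the '!' of a '!{' comment, return the index just
    past the '}' which closes it (hypothetical helper)."""
    if text[start:start + 2] != "!{":
        raise ValueError("not at the start of a comment block")
    depth = 0
    for pos in range(start + 1, len(text)):
        if text[pos] == "{":
            depth += 1          # opens the comment or an embedded block
        elif text[pos] == "}":
            depth -= 1          # closes an embedded block or the comment
            if depth == 0:
                return pos + 1
    raise ValueError("unterminated comment block")
```

An unmatched { inside the comment leaves the depth above zero, so the comment swallows everything up to the end of the file; this is the pitfall the note above warns about.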
Commented data tags:
Format: !<tp|ID=dt>
Basically a commented out data tag. Conformance to the above spec may not be checked as strictly as normal, but the dt section is checked for strings so that a > within a string won't end the tag.


Merging rules:
if, when a data item is read, a data item with the same identifier
within the same section exists in the DataSet being read into:
+	if the types are identical:
++		if the primary type is a tp() mergeable dynamic list:
+++			the entries from the item being read are concatenated to those in the item
+++			in the DataSet
++		else:
++-			the item already in the DataSet takes priority and is left untouched
+	else:
+-		a warning is issued, and the data item within the DataSet is left untouched
This allows merging some config settings in a user config file with the remaining settings in a
complete system config file and some support for modifications overriding or adding to some data.
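
The merging rules can be sketched in Python as below. This is an illustration only: a section is modelled as a plain dict mapping each ID to a (type string, value) pair, which is not how the DataSet is described here, and merge_item with its status strings is hypothetical. A type is treated as a mergeable list when its type string ends in "()".

```python
def merge_item(section: dict, item_id: str, type_str: str, value) -> str:
    """Apply the merging rules to one data item just read; return what happened."""
    if item_id not in section:
        section[item_id] = (type_str, value)
        return "added"
    existing_type, existing_value = section[item_id]
    if existing_type != type_str:
        # The caller should issue a warning; the stored item is left untouched.
        return "type-mismatch"
    if type_str.endswith("()"):
        # Mergeable list: entries just read are appended to the stored ones.
        section[item_id] = (type_str, list(existing_value) + list(value))
        return "merged"
    # Otherwise the item already stored takes priority.
    return "kept-existing"
```

Because the first item read wins, reading the user config file before the system config file lets user settings take priority while the system file supplies the rest.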


Header:
The header is a standard section which is mandatory and must be the first section. Its section identifier must start at the beginning of the file with no whitespace, declared with:
	{MTXY}		where XY is a two-digit CAPITAL HEX version number representing the mergetag format version, e.g. {MT01}.
If these are not the first 6 characters of the file, the file will not be regarded as valid.
This formatting is very strict to allow reliable low-level parsing.
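
A low-level check of this kind could be as simple as the following Python sketch. The function name is hypothetical, and the text is assumed to have been decoded already with any BOM removed.

```python
import re

# '{MT', two capital hex digits, '}': exactly six characters.
_HEADER = re.compile(r"\{MT([0-9A-F]{2})\}")


def read_format_version(text: str) -> int:
    """Return the format version from the header tag, or raise ValueError."""
    match = _HEADER.match(text)   # match() only matches at the very start
    if match is None:
        raise ValueError("file does not start with a {MTXY} header")
    return int(match.group(1), 16)
```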


The data tags within the header have no special meaning; any may be used such as the following:
	<string|Author="...">
	<string|Name="...">
	<string|Description="...">
	<string|Program="...">	(which program created/uses this?)
	<*|Version=...>		(use any supported type)
	<string|Date="YYYYMMDD">	(reverse date format; optionally "YYYYMMDDhhmmss")
	<{ushort,ubyte,ubyte}|Date={YYYY,MM,DD}>	(a tuple type; this probably won't be supported by a standard section)
	<string|Copyright=...>


Example:	!THIS IS NO LONGER VALID!
{MT01}
{example section}
<u32|"num"=5>
<{u32,UTF8[]}()|"DATA"=(
	{1,['a']},
	{59,['w','o','r','d']},
	{2,"strings can be written like this"} )>
<wchar[]|"name"="This string is stored in UTF16, regardless of the file's encoding.">
<{u32,UTF8[]}()|"DATA"=(
	{3,"this is appended to the previous 'DATA' item"} )>
{"section: section identifiers and tuples are not confused since tuples only occur inside <...> items"}
<void|Empty tag= >
!{this is a comment {containing a comment}}