FreeM String data structure The FreeM symbol table is fairly complex now, capable of storing multiple types of data (text, numbers, UNICODE text, references, etc..) and capable of growing to unlimited sizes as necessary. It is dynamic, growing as needed and shrinking as variables are killed. This overview describes the complete sym storage structure. Keep in mind that most structure elements are pointers and are not allocated unless necessary. string { // Linked list of strings (structure) unsigned char datalen; // Length of block string *next; // Next block of this string string_rc *rc { // Resource data (structure) char *key; // Keyname m_numeric key_equiv; // Numeric equivalent of keyname (numeric keys only) char status; // Status byte (contains string type) subscript *subs { // Link to subscripts, if any (structure) string *index[97]; } string *next_node; // Next string in LL references *refs { // Linked list of references to this string string *ref; references *next; } } stdata *data { // String data union (all types overlap) char *text; // Pointer to ordinary text short *unicode_text; // Pointer to UNICODE text (hook) m_numeric *numeric; // Pointer to numeric data (currently a float) void *all_data; // Used to refer to any type of data reference *reference { // Pointer to a reference structure subscript *stack_level; // Pointer to the subscript level char *name; // Name of the original variable string *real; // Pointer to data element of original string *previous; // Referring to the string } } } There are at least five separate structures and one union used to cover the different possibilities for data storage. In detail: typedef struct _string { struct _string_resources *rc; /* Resource data */ unsigned char datalen; /* Length of data */ union _string_data *data; /* data block */ struct _string *next_data; /* Next data block, if any */ } string; This creates a variable of type 'string' with four elements. string.rc: A pointer to string resource data that tells us various things about this string, like it's name, what kind of data it is storing, various status flags and more (described elsewhere). string.datalen: The length of data in this block, i.e. how many characters of space this string would take up if printed out. For a number, it would be how many characters long the number is (even though all numbers take up the same four bytes) string.data: A pointer to the data block for this string. Unallocated for empty strings. Can store several types of data. string.next_data: A pointer to the next data block if there is one. typedef struct _string_resources { char *key; /* String keyname */ m_numeric key_equiv; /* String keyname (canonical number) */ char status; /* Status byte */ struct _subscript *subs; /* Subscripts? */ struct _string *next_node; /* Next string in linked list */ struct _references *refs; /* References? */ } string_rc; This structure contains all the meta-information about the string and it's data. string_rc.key: The key for the string at this level. Using the command 's x(1,2,"foo")=3' would create four separate string structures. At the top level would be a string called 'x', at the second level, there would be a string called '1' and at level 4 there would be a string called 'foo'. The keyname is just the segment of the tree currently being looked at, so just 'foo' and not 'x(1,2,"foo")'. Full key names are generated as necessary. string_rc.key_equiv: If the key is a number, this is it's numeric equivalent. string_rc.status: Stores all kinds of status info, what data type this string is and whether it is trashable or not. Expandable to add your own flags and data types. string_rc.subs: If there are subscripts, this links to a subscript index. string_rc.next_node: The next string in the linked list. string_rc.refs: A link to all references to this string. typedef union _string_data { char *text; short *unicode_text; m_numeric *numeric; struct _reference *reference; void *all_data; } stdata; Creates a union called 'stdata'. The union takes up only as much space as it's largest element. All elements here are pointers so this really only uses 4 bytes. Additional data types can be easily added. FILE *s, your own special structures, whatever. At the moment, elements are assigned for storing ordinary text data, UNICODE data (this feature is unimplemented, though numerous 'hooks' are available so it may be added easily in the future), numeric data, references to other strings (also called an m_alias). The variable type 'm_numeric' is currently defined as a double-precision floating point variable (8 bytes of storage, allowing a range +/- 9.2e18 for integers, less to store decimal precision) typedef struct _reference { subscript *stack_level; char *name; struct _string *real; struct _string *previous; } reference; This special structure is used to create an alias/reference to a sym. References frequently refer to earlier stack levels (i.e. to a variable that may have been NEW'd) so the subscript *stack_level refers to the level at which the original variable was defined. You can use a reference to kill a variable, then re-create it as in this example: main_func: s x="foo" d otherfunc(.x,.x) ... func(y,z): n x ; vars y & z point to the 'old' x s x="bar" ; x is 'bar', y is 'foo', z is 'foo' k y ; x is 'bar', y and 'old' x is dead, z is "" s z="foo" ; x is 'bar', y, 'old' x and z are "foo" I don't know if you could do this originally, but you can now. reference.stack_level: Pointer to the level of the stack at which the 'old' variable exists. reference.name: The full name of the variable; x(1,2,3) . This is used when the old variable doesn't exist yet (you pass-by-reference an undefined variable) reference.real: A link to the string if it exists or when it is created. Saves us from doing a lookup based on the name each time. reference.previous: The prior-element that refers to the string (used when killing a variable) typedef struct _subscript { string *index[97]; } subscript; An index of the subscripts to a variable. All top-level variables ( like un-subscripted x, y, z) are indexed this way too using the special subscript index *SymTable (with another one for the UDFSVN) There are 97 entries, one for each printable character valid as the first character of a variable name or subscript, plus one to hold canonical numeric subscript names (so that $q/$o work OK). typedef struct _references { struct string *ref; struct _references *next; } references; Actually just another linked-list of strings. This is used to store information about what references are attached to what variables (so when we kill a variable, we adjust all it's references too) and what references aren't attached to anything. Overhead issues: In the original implementation, the symbol table is stored in one long string with keys and data in alphabetical order (?) and pointers to the start of each alphabet letter used. Storing x, x(1), x(1,1), xa and y results in (x)(data)(x,1)(data)(x,1,1)(data)(xa)(data)(y)(data) With pointers to x and y. Actually, there are embedded "length" bytes before each key and each data segment specifying how long the key+subscript and data segments are so they can be skipped when searching. This has some issues, namely: - Since the entire symbol table is a (char *), no value can be greater than 255. This is an acceptable limitation for data (since we're not using UNICODE), but since the "length" field is also a character, this limits the length of the key+subscript and the data to 255 characters each. - Storing the symbol table in linear fashion makes for slow searches. The symtab() function has to search through all of a variable's subscripts to get to the next variable. It does make $QUERY go slightly faster than in another symbol-table implementation. - The entire variable name down to the subscript is repeated for each key. I consider this wasted space; For the variable x(1,1) and x(1,2), the x,1 part is repeated needlessly. This was done in part due to the way the original was implemented. This is a fairly trivial amount of waste, though. - Both the key and the data are EOL terminated, and copied in and out of memory one byte at a time. This is slow, and means that the string can't have the EOL character anywhere in the middle of it's data or key. - Even numeric strings are stored a series of characters. The "overhead" needed to store a string in the original implementation is fairly small. It amounts to two bytes used to store lengths, two EOL character-terminators and the length of all subscript-keys leading up to the current subscript level, plus one DELIM character per subscript level. To store a 200-character variable at the fourth level of subscripts, the overhead is Length-of-key byte: 1 byte Length-of-data byte: 1 byte DELIM characters: 4 bytes = 6 bytes, plus the length of the names of the prior subscript levels In the new implementation, strings are stored as branches of a tree with an index for each subscript level of each variable. Storing a, x, x(1), x(1,0), xa and y results in: 0 ... a - b - c ... x - y - z { Symbol Table Index } | | | (a)-(data) (x)-(data) (y)-data |\ | \ | 0 - 1 ... z { X Index } | | | (1)-(data) | \ | \ | 0 - 1 ... z { X(1) Index } | | | (0)-(data) | (xa)-(data) Advantages of storing it this way are: - Its more intuitive; ordered more the way you'd expect. - It's easy to take an entire branch out of commission for a NEW, or refer to an entire branch with a pass-by-reference. This allows the potential for both "n x(1)" and "d ^foo(.x(1))" - Inserting, deleting and merging is MUCH faster and doesn't require shuffling memory. - Indexes make finding any subscript faster. - It's dynamic, allocating as much memory as it needs and freeing memory it doesn't. - It's unrestricted. It can grow as big as you like. - It's much faster to do arithmetic operations on numbers - It actually saves memory on storing numeric data; any number in range takes up only 4 bytes (for a 'long') and doesn't use a data block at all. - It can be easily re-worked into a B-Tree or some other management model at a later time without doing much more work. Disadvantages are: - At the moment, its hard to find the entry prior to a given entry (as in for $ORDER(x,-1) and $QUERY(x,-1)) - A lot more "overhead" memory is used usually. The 'Overhead' memory used amounts to 4 pointers + a 'long' to store information about ANY string (20 bytes on most systems), plus 3 pointers plus a 'char' for each block (13 bytes on most systems), plus any portion of a block that is unused. Each subscript level also costs a lot; 96 pointers (384 bytes). To store the same 200-character variable on the fourth level of subscripts: (with 255-byte block size) 4 previous levels (counts as three strings) - 80 bytes (4 x 20) 3 blocks for previous levels (with no data) - 39 bytes (3 x 13) 4 levels of subscripts - 1536 bytes (384 x 4) 1 block for current level (with data) - 13 bytes Unused space in this block (200/255 bytes used) - 55 bytes For a total overhead of: 1723 bytes! As you can see, the subscript indexes are the killer here. However; this is only to store the first string at this level. Adding more strings at this subscript level (or any previous level) costs no more memory for the level or for the subscript indexes. If I added 1000 more 200-character strings at the 1st through 4th subscript levels, the majority of the above memory (1655 bytes) would just link to the new strings, reducing the "administrative" overhead to only 1.6 bytes per string (not including wasted block space). In fact, this method would catch up with the old symbol-table storage method in "administrative" overhead after (at most) 183 variables at any subscript layer (sooner if you use subscript keys longer than one letter) For large amounts of data in local memory, this ends up saving memory AND time over the original implementation. The future: Dynamic block sizes and discreet blocks At the moment, the user must enter a block size in the source code (defaults to 255 bytes which is fairly inefficient for small strings). This becomes the MINIMUM size for any string. If your block size is 255, you have a 254-byte overhead for small strings. I have written algorithms to tell you what the optimum block size would be if you can guess about how long most of your strings are (assumes a bell-curve distribution). Still, this is kludgy. Why use block sizes at all? This cuts down calls to memory-allocation routines which can slow down processing. If I assign a string a block size of 128 bytes (for a 100 byte string), then I don't need to do anything if the string grows to 108 bytes, or shrinks to 90. It speeds up a lot of things. In a future implementation (there is still more work to be done on the current one before I get to this), the symtab function will dynamically determine a block size for each string block based on the size of data stored in it. If you only store one byte in it, the block size might be 8 or 16 bytes (with 7-byte and 15-byte overhead respectively). If, on the other hand, you store 1 MB of data in local memory, the block size might become 1k or greater (with 1023-byte overhead for a string that is 1MB+1 byte). The symtab() function would allocate as many blocks at one time as it could to a single string structure, recording both the block size used and the number of blocks allocated to one string. To store 5 bytes for example, symtab() would decide to use an 8-byte block, and allocate 1 to the structure. To store 200 bytes, it might choose to use 128-byte blocks and allocate 2 to the structure, resulting in a 72-byte overhead, but giving the string room to grow. Yes, this stores multiple blocks per each string structure as compared to the current system, which uses 1-block per structure. If the string grows beyond the number of blocks allocated to it, symtab would try to realloc() and add more discreet blocks on to the end of the same structure. If it couldn't get enough more continuous memory, it would only then choose to tack on another string structure through the next_data link. This would be more memory-efficient, but sacrifice some speed to do the necessary calculating (should be a trivial speed hit actually)