Many data processing algorithms are explicitly or implicitly table-driven; that is, they process a sequence of data objects that can be viewed as a table. This table, like the css3.0 wfdisc table, can contain indirect references to larger data objects accessible by some other mechanism. At present such a reference would normally be a file visible to a running program through a file system, but it could be a more generic handle for any method of accessing data. For example, we can anticipate that data from IRIS will soon be visible through a URL mechanism via FISSURES.
That said, keep in mind that the basic model remains a table. The pfstream library implements transport of tables through a somewhat verbose file structure based on the parameter file (pf(5)) format. That is, at the outer level a pfstream is best thought of as a series of pf format files (all simple ascii text) separated by a special EOF sentinel. This could be illustrated as follows (within the limits of a man page):
pf1 __EOF__ pf2 __EOF__ ... __EOF__ pfn __EOF__ __EOI__
where pf1, pf2, ..., pfn are each (potentially large) parameter files used to describe part of a table. The implicit assumption is that each of the pfi is generated by the same engine, producing views of one or more rows of one or more tables. The number of rows in each virtual table is variable. The number of columns (attributes) is implicitly assumed fixed, although nothing in the library actually depends on this.
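The outer structure above can be illustrated with a minimal sketch in Python. This is not part of the pfstream library; the sentinel spellings (__EOF__, __EOI__) are taken directly from the illustration above, and the function simply cuts the stream into its component pf blocks:

```python
# Hypothetical sketch (not part of the pfstream library): split a
# pfstream into its component pf blocks.  __EOF__ separates blocks and
# __EOI__ marks end of input, as in the illustration above.

def split_pfstream(text):
    """Return the list of pf blocks contained in a pfstream string."""
    # Discard anything after the end-of-input sentinel, if present.
    body, _, _ = text.partition("__EOI__")
    blocks = []
    for chunk in body.split("__EOF__"):
        chunk = chunk.strip()
        if chunk:
            blocks.append(chunk)
    return blocks

stream = "a 1\nb 2\n__EOF__\na 3\nb 4\n__EOF__\n__EOI__\n"
print(split_pfstream(stream))  # → ['a 1\nb 2', 'a 3\nb 4']
```

Each element of the returned list would then be handed to a pf(5) parser; the sketch deliberately treats the blocks as opaque text.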
Given that each block (pf1, pf2, etc. above) is a parameter file description of M rows of Nv virtual tables, the next level of the hierarchy is how the Nv tables are distinguished. The answer is that each table has a name (tag) that surrounds all of the data associated with it. Thus, for example, the ith pf in a pfstream might have this outer structure:
    ensemble_parameter_1 x
    ensemble_parameter_2 y
    arrival &Arr{
        --- arrival table parameter contents ---
    }
    assoc &Arr{
        --- assoc table parameter contents ---
    }
    __EOF__
That is, the entire block can have global parameters defined outside the &Arr blocks. On input these are parsed as global parameters for an "ensemble"; on output they can be used to define output options. Each table is defined by a keyword that identifies that table (in the above example, "arrival" and "assoc"). The typical use of these names at the moment is pfstream2db(1), which assumes they are tables in a schema for output (in the above example, the standard arrival and assoc tables of css3.0).
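For flat blocks like the one above, the table tags could be recovered with a short sketch (Python, hypothetical, not part of the library). Note that a simple line-anchored pattern like this does not track nesting, so it is only reliable when no nested &Arr sections are present:

```python
import re

# Hypothetical sketch (not part of the pfstream library): list the table
# tags (e.g. "arrival", "assoc") that open an &Arr{ block in one pf
# block.  This pattern does not track brace nesting, so it is only a
# reliable outer-level scan for blocks without nested &Arr sections.

def table_tags(pf_text):
    return re.findall(r"^\s*(\w+) &Arr\{", pf_text, flags=re.M)

block = """ensemble_parameter_1 x
ensemble_parameter_2 y
arrival &Arr{
}
assoc &Arr{
}
"""
print(table_tags(block))  # → ['arrival', 'assoc']
```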
The next level in the hierarchy is the block within the individual table name tags (arrival and assoc in the example above). Each of these blocks describes a sequence of M rows of the associated table. Each such block has this internal structure:
    table_global_1 x
    table_global_2 y
    ...
    table_global_n n
    group_keys &Tbl{
        --list of key attributes--
    }
    group_records &Tbl{
        --record list--
    }
    ensemble &Arr{
        00000 &Arr{
            --row 0 attributes--
        }
        00001 &Arr{
            --row 1 attributes--
        }
        00002 &Arr{
            --row 2 attributes--
        }
        ...
        n &Arr{
            --row n attributes--
        }
    }
Note the position of the globals for each table, referred to above with the generic term "table_global_?". This allows global parameters for each table. The group_keys and group_records lists are special cases that are required for certain types of processing. They can be viewed as a description of a "group by" clause in SQL or dbgroup in Antelope. The group_keys list contains the attributes used to define the grouping, and group_records is a list of starting and ending record numbers for each group. The number of groups is derived internally from the size of the group_records list, and it is blindly assumed the list has valid data consistent with the table description that follows.
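The record list itself is only sketched above ("--record list--"). Assuming each entry holds the starting and ending record numbers of one group, as just described, the grouping could be expanded as in this hypothetical Python sketch (the "start end" entry format is an assumption, not something the library documents here):

```python
# Hypothetical sketch: expand a group_records list into (start, end) row
# ranges, one per group.  The "start end" format of each entry is an
# assumption; the text above only says each entry holds starting and
# ending record numbers for a group.

def expand_groups(group_records):
    """Yield one (start, end) record range per group."""
    for entry in group_records:
        start, end = (int(x) for x in entry.split())
        yield (start, end)

# Three groups covering rows 0..6 of a table:
print(list(expand_groups(["0 2", "3 4", "5 6"])))
# → [(0, 2), (3, 4), (5, 6)]
```

The number of groups is then simply the length of the list, matching the statement above that the group count is derived from the size of group_records.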
At the bottom of the hierarchy is the actual description of each row of the table. Each of the simply numbered blocks within the surrounding &Arr{} block defines one row of a table. At the bottom level each element is assumed to be a standard type, which at the moment means real, integer, string, and boolean. This could presumably be extended, but that is SEP (Somebody Else's Problem) if anyone wants to take it on for some reason. That is, these blocks are simple lists of attribute names followed by the actual values. Here is a simple example:
    00003 &Arr{
        arid 2755
        orid 180
        sta AAK
        phase P
    }
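A hypothetical sketch of reading one such row block into the four standard types follows. It is not the pfstream library's parser, and the true/false boolean spelling is illustrative only (pf(5) defines its own conventions):

```python
# Hypothetical sketch: turn the attribute lines of one row block into a
# Python dict, inferring the standard types named above (boolean,
# integer, real, string).  The true/false spelling is illustrative; the
# real pf(5) format defines its own boolean conventions.

def parse_row(lines):
    row = {}
    for line in lines:
        name, _, value = line.strip().partition(" ")
        value = value.strip()
        if value in ("true", "false"):          # boolean
            row[name] = (value == "true")
        else:
            try:
                row[name] = int(value)          # integer
            except ValueError:
                try:
                    row[name] = float(value)    # real
                except ValueError:
                    row[name] = value           # string
    return row

print(parse_row(["arid 2755", "orid 180", "sta AAK", "phase P"]))
# → {'arid': 2755, 'orid': 180, 'sta': 'AAK', 'phase': 'P'}
```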
In actual practice it is expected that this structure can be used to transport database data in one of two forms: (1) it can describe a large view, in appropriate blocks, as an input file for simple, table-driven processing; or (2) it can describe output of parameters into one or more output tables. For virtual tables (views) the name placed on the table is generally irrelevant, but for output to a specific schema the name is critical. At this time there is an input program, db2pfstream(1), which takes data from an Antelope database and writes a pfstream. The inverse is pfstream2db(1), which takes a pfstream and writes one or more output database tables. A program based on this model reads and/or writes a pfstream that is produced or consumed by these reader and writer programs. This is classic stream processing; the only variant is that it is extremely flexible in how the tables are defined. The pfstream programming interface man page (pfstream(3)) describes how this is used in parallel processing for data-driven applications.
pf(3), pf(5), pfstream(3)
The functionality of this file structure could eventually be superseded by XML. Had I been able to find a useful programming library to implement an XML-based transport instead of inventing this file structure, that might have been better in the long run. Nonetheless, I have attempted to design the libraries based on the pfstream so that the transport layer can easily be changed from this file structure to something else.
The restriction of the hierarchy to three levels could be removed with some effort. I elected not to do this because it would require a generic algorithm using recursion, which I chose to avoid.
Gary L. Pavlis Indiana University pavlis@indiana.edu