pfstream2db file db [-V -v -pf pffile]
pfstream2db is a database output program for a processing system based on the concept of a pfstream. A pfstream encapsulates pieces of metadata related to an arbitrary data object and provides a transfer structure for moving data through a data-driven processing system. The structure of a pfstream is described in pfstream(5). It builds on the Antelope pf file structure to transfer database tables in ensembles, which are blocks equivalent to a grouped database table. The program reads a pfstream input (file) and writes named parameters from the pfstream to an output database (db). Output can be directed to multiple tables, with one, some, or all rows of an ensemble saved to the database. This program can be viewed as the inverse of db2pfstream(1).
The db and file parameters are required. The input pfstream, file, can be a FIFO connected to another program that emits pfstream-format data. The db parameter is assumed to be an Antelope database name. If the database exists, pfstream2db will try to append to it. If a requested database table is empty, it will be created and data added until end of file is reached on input.
The behavior of this program is mostly driven by a fairly complex parameter file. To understand the logic of the parameter file it might be useful to first read pfstream(5). The outermost grouping of a pfstream file is a name attached to a block of text surrounded by the &Arr{ ... }. The contents of each of these named blocks are associated with database tables through a Tbl list associated with the parameter table_name_maps. For example,
    table_name_maps &Tbl{
        pmel_arrivals origin
        pmel_arrivals assoc
        station_corrections gridstat
        station_corrections gridscor
    }
says the origin and assoc table information is to be extracted from the block of data tagged with "pmel_arrivals", while two extension tables (gridstat and gridscor) are to be extracted from a different block tagged with "station_corrections".
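The association described above can be pictured as a lookup from block tag to output tables. The following is a minimal Python sketch, not the program's own code: the function name tables_by_block is hypothetical, and it assumes each Tbl entry is a whitespace-separated "block_tag table" pair as in the example.

```python
from collections import defaultdict

# Entries as they appear inside table_name_maps &Tbl{ ... } in the example:
# one "block_tag table" pair per line.
TABLE_NAME_MAPS = """
pmel_arrivals origin
pmel_arrivals assoc
station_corrections gridstat
station_corrections gridscor
"""


def tables_by_block(tbl_text):
    """Group output table names under the pfstream block tag they come from.

    Hypothetical helper for illustration only.
    """
    mapping = defaultdict(list)
    for line in tbl_text.strip().splitlines():
        tag, table = line.split()
        mapping[tag].append(table)
    return dict(mapping)


if __name__ == "__main__":
    for tag, tables in tables_by_block(TABLE_NAME_MAPS).items():
        print(tag, "->", ", ".join(tables))
```

Read this way, each named block in the pfstream can feed more than one table, and each table is fed by exactly one block.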
Which attributes are extracted from each of these named blocks and how this is done is controlled by three key parameters: save_by_row, save_by_group, and save_by_ensemble. As described in pfstream(5) the pfstream format essentially defines an ordered table with three levels of grouping: (1) rows of the table, (2) ranges of rows grouped by one or more attributes in the table, and (3) the whole table that defines one block (ensemble) of data. These three parameters control which attributes are pushed to output tables and, by inference, how the output has been grouped. For example,
    save_by_group &Arr{
        origin &Tbl{
            origin.lat real origin.lat
            origin.lon real origin.lon
            origin.depth real origin.z
            origin.time time origin.time
            orid int orid
            evid int evid
            origin.jdate int origin.jdate
            origin.nass int nass
            origin.ndef int ndef
            origin.ndp int ndp
            grn int grn
            srn int srn
            etype string etype
            review string review
            depdp real depdp
            dtype string dtype
            mb real mb
            mbid int mbid
            ms real ms
            msid int msid
            ml real ml
            mlid int mlid
            auth string auth
            algorithm string algorithm
        }
    }
can be used to save origin rows in a CSS3.0 database when the pfstream is grouped by orid. Note the nesting of the parameter space here. The outer block says these attributes are to be saved once for each group boundary (note this blindly assumes all entries in the table have the same attributes for the range that defines this group). The "origin &Tbl{" says the enclosed attributes are to be mapped to the table called origin. The inner &Tbl is a list of name maps. This list structure is intentionally identical to that used in the (inverse) program db2pfstream to allow the list to be cut and pasted. The first item is the database attribute name, the second is the type of the field, and the third is the name used in the pfstream file.
The same structure is used for save_by_row and save_by_ensemble. The difference is in how many rows are saved per ensemble (many for save_by_row and only one for save_by_ensemble).
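The difference between the three save levels can be illustrated with a small Python sketch. It is not the program's implementation: the sample ensemble, the function names, and the choice to take the first row of each group (and of the ensemble) are assumptions made for illustration. The text only says that one row is saved per group or per ensemble.

```python
from itertools import groupby

# Hypothetical ensemble: rows already ordered by the grouping key (orid).
ensemble = [
    {"orid": 1, "sta": "AAK", "origin.lat": 42.1},
    {"orid": 1, "sta": "BBK", "origin.lat": 42.1},
    {"orid": 2, "sta": "AAK", "origin.lat": 39.5},
]


def rows_saved_by_row(rows):
    """save_by_row: every row of the ensemble is written."""
    return list(rows)


def rows_saved_by_group(rows, key):
    """save_by_group: one row per run of consecutive rows sharing `key`.

    Taking the first row of each run is an assumption of this sketch.
    """
    return [next(group) for _, group in groupby(rows, key=lambda r: r[key])]


def rows_saved_by_ensemble(rows):
    """save_by_ensemble: a single row for the whole ensemble."""
    return rows[:1]


if __name__ == "__main__":
    print(len(rows_saved_by_row(ensemble)), "rows by row")
    print(len(rows_saved_by_group(ensemble, "orid")), "rows by group")
    print(len(rows_saved_by_ensemble(ensemble)), "row by ensemble")
```

Because only one row stands in for each group, attributes listed under save_by_group should be constant within a group, which is the assumption noted in the origin example.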
Finally, pfstream2db has to deal with the infamous id problem in the output database. That is, some parameters in the pfstream may be database id fields that have to be dealt with carefully to avoid problems with duplicate key errors. The &Arr parameter newids_required tells pfstream2db how to handle this. For example,
    newids_required &Arr{
        arrival arid
        origin orid
    }
tells pfstream2db that the attributes arid in the arrival table and orid in the origin table are keys and need to be handled as such. That is, dbnextid will be called for arid when writing to an arrival table to replace the value of this attribute stored in the pfstream. The same, for this example, is done for orid when writing to the origin table.
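The id replacement described above can be sketched in Python as follows. This is an illustration only: IdAllocator is a hypothetical stand-in for dbnextid (a simple per-name counter, not the Antelope call), and prepare_row is a hypothetical helper. The sketch covers only the replacement in the listed table and does not address how ids in related tables are handled.

```python
from itertools import count

# Mirrors the newids_required example: table name -> id attribute.
NEWIDS_REQUIRED = {"arrival": "arid", "origin": "orid"}


class IdAllocator:
    """Hypothetical stand-in for dbnextid: increasing ids per key name."""

    def __init__(self, start=1):
        self._start = start
        self._counters = {}

    def next_id(self, name):
        counter = self._counters.setdefault(name, count(self._start))
        return next(counter)


def prepare_row(table, row, allocator):
    """Return a copy of `row` with the table's id attribute replaced, if listed."""
    out = dict(row)
    key = NEWIDS_REQUIRED.get(table)
    if key is not None:
        # The id carried in the pfstream is discarded in favor of a fresh one.
        out[key] = allocator.next_id(key)
    return out


if __name__ == "__main__":
    alloc = IdAllocator(start=100)
    print(prepare_row("origin", {"orid": 7, "origin.lat": 42.1}, alloc))
    print(prepare_row("gridstat", {"orid": 7}, alloc))  # table not listed: unchanged
```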
Be aware that this program is multithreaded, with a read thread that pulls data from the pfstream and a processing thread that does the work. This was done to implement a type of nonblocking I/O on both ends of a pfstream. For compute-intensive jobs this can be important because database updates can be time consuming.
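The reader/processor arrangement can be pictured with the following Python sketch. It shows the general two-thread pattern with a bounded queue, not the program's actual code: all names here (reader, processor, run_pipeline, maxsize) are hypothetical, and the blocks are arbitrary stand-ins for pfstream ensembles.

```python
import queue
import threading

_END = object()  # sentinel marking end of input


def reader(blocks, q):
    """Read thread: pull blocks from the input and hand them to the queue."""
    for block in blocks:
        q.put(block)
    q.put(_END)


def processor(q, handle):
    """Processing thread: take blocks off the queue and do the slow work."""
    while True:
        block = q.get()
        if block is _END:
            break
        handle(block)


def run_pipeline(blocks, handle, maxsize=8):
    """Run reader and processor concurrently; the queue decouples the two."""
    q = queue.Queue(maxsize=maxsize)
    t_read = threading.Thread(target=reader, args=(blocks, q))
    t_proc = threading.Thread(target=processor, args=(q, handle))
    t_read.start()
    t_proc.start()
    t_read.join()
    t_proc.join()


if __name__ == "__main__":
    written = []
    run_pipeline(["ensemble-1", "ensemble-2", "ensemble-3"], written.append)
    print(written)
```

With this arrangement the upstream writer is not stalled while a slow database update is in progress, as long as the queue has room.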
db2pfstream(1), pfstream(5), pfstream(3), pf(3)
Gary L. Pavlis Indiana University (pavlis@indiana.edu)