Generating a data file


A Bayesian network model is a representation of the joint probability distribution over its variables. Given this distribution, it is sometimes useful to generate a data file from it. Such a data file can subsequently be used, for example, to test a learning algorithm. GeNIe generates a data file from a model through the Generate Data File... command in the Learning menu.

learning_menu

The following dialog appears:

generate_data_file_dialog

The Generate Data File command generates a text file of records that are representative of the network: each record contains one value per node, drawn at random from the joint probability distribution modeled by the network.
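The idea of drawing records from the joint distribution can be illustrated with a minimal forward-sampling sketch. It is not GeNIe's implementation; the two-node network (Rain, WetGrass), its probabilities, and the helper names draw and sample_record are all hypothetical and serve only to show that parents are sampled before their children.

```python
import random

# Hypothetical two-node network Rain -> WetGrass (illustrative numbers only).
PRIOR_RAIN = {"Yes": 0.2, "No": 0.8}
CPT_WET = {
    "Yes": {"Wet": 0.9, "Dry": 0.1},
    "No": {"Wet": 0.1, "Dry": 0.9},
}


def draw(dist, rng):
    """Draw one state from a {state: probability} mapping."""
    states = list(dist)
    weights = [dist[s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]


def sample_record(rng):
    """Sample the parent first, then the child given the parent's value."""
    rain = draw(PRIOR_RAIN, rng)
    wet = draw(CPT_WET[rain], rng)
    return {"Rain": rain, "WetGrass": wet}


rng = random.Random(42)
records = [sample_record(rng) for _ in range(1000)]
```

Each element of records plays the role of one line in the generated data file.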

Filename specifies the location where the data file will be stored. The Browse (browse_button) button invokes the Save As dialog, which helps with finding a location to save the file.

Number of records specifies the number of records to be generated.

Separator character allows for selecting a character that separates individual node states in records. The choices are Space, Tab and Comma.

separator_character_dialog

Add header with node ID's, when checked, makes the first record of the output file contain IDs of the nodes. If the file is to be read into GeNIe, this option should be checked.

Use state indices instead of state ID's, when checked, saves records with state indices (0, 1, 2, etc.) for discrete variables instead of state IDs or state names.
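The combined effect of the separator, header, and state index options on the file contents can be sketched as follows. This is an illustration using Python's csv module, not GeNIe's own writer; the node definitions, the records, and the function format_records are hypothetical.

```python
import csv
import io

# Hypothetical node definitions: node ID -> ordered list of state IDs.
NODE_STATES = {"Rain": ["Yes", "No"], "WetGrass": ["Wet", "Dry"]}
RECORDS = [
    {"Rain": "Yes", "WetGrass": "Wet"},
    {"Rain": "No", "WetGrass": "Dry"},
]


def format_records(records, separator=",", header=True, use_indices=False):
    """Render records as text, mimicking the dialog's formatting options."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter=separator, lineterminator="\n")
    node_ids = list(NODE_STATES)
    if header:
        writer.writerow(node_ids)  # first record holds the node IDs
    for rec in records:
        row = []
        for node in node_ids:
            state = rec[node]
            # Replace the state ID by its position in the node's state list.
            row.append(NODE_STATES[node].index(state) if use_indices else state)
        writer.writerow(row)
    return out.getvalue()


text = format_records(RECORDS, separator="\t", header=True, use_indices=True)
```

With a Tab separator, a header, and state indices, text holds the header line followed by one line of integers per record.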

Bias samples by existing evidence, when checked, generates a data file from the posterior joint probability distribution (i.e., biased by the observations) rather than from the original joint probability distribution. This option will not work for continuous or hybrid networks with evidence in nodes with parents. In such cases, GeNIe will signal an error:

generated_data_evidence_hybrid

Missing values (%), when checked, produces an output file with missing values. The values are Missing At Random (this is also known as the MAR assumption). Percentage specifies the percentage of values missing.
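One way to picture the effect of this option is to blank each value independently with a probability equal to the chosen percentage. The sketch below does exactly that; whether GeNIe removes values independently or removes an exact share of them, and which token it writes for a missing value, are assumptions here (the "*" token and the function blank_values are hypothetical).

```python
import random


def blank_values(rows, percent, rng, missing_token="*"):
    """Replace each value by missing_token with probability percent/100."""
    p = percent / 100.0
    return [
        [missing_token if rng.random() < p else value for value in row]
        for row in rows
    ]


rng = random.Random(7)
rows = [["Yes", "Wet"], ["No", "Dry"]] * 500
masked = blank_values(rows, 10, rng)
```

Because every value is blanked independently, the realized share of missing values in masked is close to, but not exactly, the requested percentage.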

Explicit random seed allows for reproducibility of the record generation process: the same non-zero seed yields the same records on every run. Leaving the option unchecked, or checking it with a zero seed (the default), makes GeNIe seed the random number generator from the system clock, so the generated records differ from run to run.
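The role of the seed can be demonstrated with Python's own random number generator, which is unrelated to the one GeNIe uses. The function generate is hypothetical; it mirrors the convention described above by treating a zero seed as "no explicit seed", whereas in Python itself 0 would be an ordinary fixed seed.

```python
import random


def generate(n, seed=0):
    """Return n random states; a zero seed means 'do not fix the seed'."""
    # random.Random(None) seeds from an OS entropy source or the clock.
    rng = random.Random(seed if seed != 0 else None)
    return [rng.choice(["Yes", "No"]) for _ in range(n)]


first = generate(20, seed=123)
second = generate(20, seed=123)  # identical to first
unseeded = generate(20)          # may differ on every run
```

Fixing the seed is what makes an experiment on generated data repeatable.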

Only selected nodes, when enabled, restricts the contents of the output file to a subset of the nodes. In this case, the user selects nodes from the list in the pane below, and the generated records contain only the selected nodes. With the option enabled, two buttons, Select all and Select none, help with the selection process by selecting all nodes or clearing the selection, respectively.

It is also possible to select nodes in the Graph view before invoking the Generate Data File command. In this case, the nodes selected in the Graph view will be selected in the list of selected nodes. If nothing is selected in the Graph view, GeNIe recreates the last selection used in generating a data file.

Pressing the OK button starts the generation process, which ends with the following dialog:

generating_data_file_dialog

The dialog reports the time taken by the generation process, the number of samples generated, and the generation speed (Samples per second). The samples are divided into Samples accepted and Samples rejected. This distinction matters only when generating a data file biased by observed evidence, in which case samples incompatible with the evidence are rejected. The dialog also allows for opening the newly generated file in GeNIe and in Notepad.
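The accepted and rejected counts can be understood through a small rejection-sampling sketch: records are drawn from the original joint distribution, and those that contradict the evidence are discarded. This illustrates the principle only and is not GeNIe's code; the network, its probabilities, and the function names are hypothetical.

```python
import random

# Hypothetical two-node network Rain -> WetGrass (illustrative numbers only).
PRIOR_RAIN = {"Yes": 0.2, "No": 0.8}
CPT_WET = {
    "Yes": {"Wet": 0.9, "Dry": 0.1},
    "No": {"Wet": 0.1, "Dry": 0.9},
}


def draw(dist, rng):
    """Draw one state from a {state: probability} mapping."""
    states = list(dist)
    return rng.choices(states, weights=[dist[s] for s in states], k=1)[0]


def sample_with_evidence(n_records, evidence, rng):
    """Keep sampling until n_records samples agree with the evidence."""
    accepted, rejected = [], 0
    while len(accepted) < n_records:
        rain = draw(PRIOR_RAIN, rng)
        record = {"Rain": rain, "WetGrass": draw(CPT_WET[rain], rng)}
        if all(record[node] == state for node, state in evidence.items()):
            accepted.append(record)
        else:
            rejected += 1  # incompatible with the evidence
    return accepted, rejected


rng = random.Random(1)
accepted, rejected = sample_with_evidence(200, {"WetGrass": "Wet"}, rng)
```

The less likely the evidence, the more samples are rejected for each accepted one, which lowers the effective generation speed.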

Open in GeNIe shows the newly generated file in a GeNIe Data View window.

generated_data_genie

Please note that the network used in this example is hybrid. This is reflected in the data: two columns contain categorical values and the remaining columns are numerical, matching the types of the variables in the model.

Please remember to check the Add header with node ID's option in the Generate Data File dialog. If it is not checked, the generated file will not conform to GeNIe's requirement that the first record in the file contain variable names. The same file opened in Notepad looks as follows:

generated_data_notepad
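Why the header matters to any program that reads the file can be shown with a short parsing sketch based on Python's csv module. The sample contents and the function read_records are hypothetical; a reader that expects variable names in the first record will misinterpret a file generated without a header.

```python
import csv
import io

# Hypothetical file contents, with and without the header record.
WITH_HEADER = "Rain,WetGrass\nYes,Wet\nNo,Dry\n"
WITHOUT_HEADER = "Yes,Wet\nNo,Dry\n"


def read_records(text, separator=","):
    """Parse text, treating the first record as the list of variable names."""
    reader = csv.DictReader(io.StringIO(text), delimiter=separator)
    columns = reader.fieldnames  # consumes the first record
    return columns, list(reader)


columns, rows = read_records(WITH_HEADER)
bad_columns, bad_rows = read_records(WITHOUT_HEADER)
```

In the second call the first data record is consumed as column names, so one record is lost and the variables are mislabeled.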