PC algorithm

The PC algorithm is one of the earliest and most popular structure learning algorithms, introduced by Spirtes et al. (1993). It uses independencies observed in the data (established by means of classical statistical independence tests) to infer the structure that generated them. Here is the Learn New Network dialog for the PC algorithm:
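
To give a flavor of the independence tests that PC relies on, here is a minimal Python sketch (not GeNIe's implementation) of a partial-correlation test with Fisher's z transform, the classical test for continuous data; the function name, the toy data, and the use of NumPy/SciPy are purely illustrative:

# Minimal sketch of a conditional independence test of the kind PC relies on:
# a partial-correlation (Fisher's z) test for continuous data. Illustrative
# only; GeNIe's internal implementation may differ.
import numpy as np
from scipy import stats

def fisher_z_test(data, x, y, cond, alpha=0.05):
    """Test whether column x is independent of column y given columns cond."""
    cols = [x, y] + list(cond)
    corr = np.corrcoef(data[:, cols], rowvar=False)
    prec = np.linalg.inv(corr)                            # precision matrix
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])    # partial correlation
    n = data.shape[0]
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - len(cond) - 3)
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    return p_value > alpha            # True -> independence accepted at level alpha

# Toy usage: X -> Z -> Y, so X and Y are dependent, but independent given Z.
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
z = x + rng.normal(size=2000)
y = z + rng.normal(size=2000)
data = np.column_stack([x, y, z])
print(fisher_z_test(data, 0, 1, []))     # typically False: marginally dependent
print(fisher_z_test(data, 0, 1, [2]))    # typically True: independent given Z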

[Screenshot: learn_new_network_dialog_pc]

The PC algorithm has three parameters:

Max Adjacency Size (default 8), which limits the number of neighbors of a node. In practice, this parameter is important for limiting the number of parents that a node corresponding to a variable can have. Because the size of a node's conditional probability table grows exponentially with the number of the node's parents, it is a good idea to put a limit on the number of parents so that the construction of the network does not exhaust all available computer memory.

Significance Level (default 0.05) is the alpha value used in classical independence tests on which the PC algorithm rests.

Max Time (seconds) (default 0, which means no time limit) sets a limit on the running time of the search phase of the PC algorithm. It is a good idea to set a limit for any sizable data set so that the algorithm terminates within a reasonable amount of time.
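
For illustration only, the following Python sketch shows how the three parameters above could enter the adjacency (skeleton) phase of a PC-style search; here Max Adjacency Size is read as a cap on the size of the conditioning sets, one plausible reading. The function and variable names are hypothetical and this is not GeNIe's actual code.

# Hypothetical skeleton phase of a PC-style search. ci_test stands for any
# classical independence test (e.g., a wrapper around the Fisher's z sketch
# above) that returns True when independence is accepted at level alpha.
import time
from itertools import combinations

def pc_skeleton(variables, ci_test, max_adjacency=8, significance=0.05, max_time=0):
    start = time.time()
    # Start from a complete undirected graph.
    adjacent = {v: set(variables) - {v} for v in variables}
    for cond_size in range(max_adjacency + 1):                 # Max Adjacency Size
        for x, y in combinations(variables, 2):
            if y not in adjacent[x]:
                continue
            neighbors = adjacent[x] - {y}
            if len(neighbors) < cond_size:
                continue
            for cond in combinations(neighbors, cond_size):
                if max_time and time.time() - start > max_time:  # Max Time (seconds)
                    return adjacent                              # stop searching
                if ci_test(x, y, cond, alpha=significance):      # Significance Level
                    adjacent[x].discard(y)                       # remove the edge
                    adjacent[y].discard(x)
                    break
    return adjacent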

Pressing OK and then OK in the Learn New Network dialog starts the algorithm, which ends in the Pattern Editor dialog.

[Screenshot: pattern_editor_dialog_credit_pc]

There is one undirected arc in the pattern (between Assets and Worth). Clicking the Create DAG (create_dag_button) button, which picks the direction of each undirected arc at random so as to yield an acyclic directed graph, then clicking the Learn parameters button, and finally running a graph layout algorithm (Parent Ordering from left to right) results in the following Bayesian network:
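
Conceptually, the Create DAG step can be pictured as in the sketch below (written with Python and networkx, which are not part of GeNIe; the example edges are hypothetical): each arc left undirected in the pattern receives a random orientation, which is flipped if it would introduce a cycle.

# Conceptual sketch, not GeNIe's code: randomly orient the remaining
# undirected arcs of a pattern while keeping the graph acyclic.
import random
import networkx as nx

def create_dag(directed_edges, undirected_edges, seed=None):
    rng = random.Random(seed)
    dag = nx.DiGraph(directed_edges)                      # arcs already oriented by PC
    for u, v in undirected_edges:
        a, b = (u, v) if rng.random() < 0.5 else (v, u)   # pick a direction at random
        dag.add_edge(a, b)
        if not nx.is_directed_acyclic_graph(dag):         # flip if a cycle was created
            dag.remove_edge(a, b)
            dag.add_edge(b, a)
    return dag

# Hypothetical pattern: one directed arc plus the undirected Assets - Worth arc.
dag = create_dag(directed_edges=[("Income", "Worth")],
                 undirected_edges=[("Assets", "Worth")], seed=1)
print(list(dag.edges()))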

[Screenshot: result_pc_algorithm]

The PC algorithm is the only structure learning algorithm in GeNIe that accepts continuous data. The data have to reasonably satisfy the assumption that they come from a multivariate Normal distribution. To verify this assumption, please check that the histogram of every variable is close to a Normal distribution and that the scatter plot of every pair of variables shows an approximately linear relationship. Voortman & Druzdzel (2008) verified experimentally that the PC algorithm is fairly robust to violations of the assumption of multivariate Normality.
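
An informal way to eyeball these assumptions before learning is sketched below in Python; pandas, SciPy, and matplotlib are used purely for illustration, and the file name and tab delimiter (referring to the retention.txt example used next) are assumptions:

# Informal check of the multivariate-Normality assumption before running PC.
# The file name and delimiter are assumptions about the example data set.
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

df = pd.read_csv("retention.txt", sep="\t")

# Per-variable normality: histograms plus a D'Agostino-Pearson test.
for col in df.columns:
    stat, p = stats.normaltest(df[col].dropna())
    print(f"{col}: normality test p-value = {p:.3f}")
df.hist(figsize=(10, 8))

# Pairwise linearity: a scatter-plot matrix; roughly elliptical clouds are fine.
pd.plotting.scatter_matrix(df, figsize=(10, 8))
plt.show()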

Let us demonstrate how the PC algorithm works on the continuous data set retention.txt, included among the example data sets.

[Screenshot: learn_new_network_dialog_pc]

Here is the suggested specification of prior knowledge about the interactions among the variables in the data set:

[Screenshot: knowledge_editor_dialog2]

Pressing OK and then OK in the Learn New Network dialog starts the algorithm, which ends in the Pattern Editor dialog.

[Screenshot: pattern_editor_dialog]

The only two causes of the variable apret (average percentage of student retention) are tstsc (average standardized test scores of incoming students) and strat (student-teacher ratio). The connection between strat and apret disappears when we repeat the learning with the Significance Level parameter set to 0.01. This example is the subject of a paper by Druzdzel & Glymour (1998), who concluded that the only direct cause of low student retention at US universities is the quality of incoming students. This study is one of the success stories of causal discovery, and its conclusion was later verified empirically in a real-life experiment conducted by Carnegie Mellon University.
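
To see why the strat-apret connection can come and go with the Significance Level, consider the toy illustration below (the p-value of 0.03 is hypothetical, not computed from the retention data): a test result that falls between the two thresholds keeps the arc at 0.05 but drops it at 0.01.

# Hypothetical p-value for the strat-apret independence test; dependence
# (and hence the arc) is declared only when p < alpha.
p_value = 0.03
for alpha in (0.05, 0.01):
    keep_edge = p_value < alpha
    print(f"alpha = {alpha}: edge kept = {keep_edge}")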