Problems with GeNIe-Datasets and Validation


Post by Daniel »

Dear BayesFusion-Team,

I have two problems which may be related:

1. I created the structure of a BN manually and want to learn the model parameters by EM using GeNIe and an artificial dataset which I generated from the distribution parameters of the variables (such as median, mean, range, standard deviation, etc.) reported in the available literature. To use GeNIe, I first used the "Generate Data File..." option from GeNIe's Learning module and created a data file into which I then copy-pasted the data from my artificial dataset in MS Excel. After that, GeNIe satisfactorily learned the parameters for the BN from that data file via EM. I also saved this data file from GeNIe, as I have different versions of the artificial dataset and sometimes want to re-learn the parameters with a certain version after using some other versions. However, after starting GeNIe again and opening the saved data file and the network, GeNIe no longer recognizes the states of some variables in the data file and thus can't match them with the variables' states in the network. Therefore I always need to repeat the whole copy-paste procedure. :-/

2. The artificial dataset is divided into a learning part and a validation part. Using GeNIe's validation tools, I'm not quite sure whether the resulting accuracy values for my model are really valid. For most class variables, I have ~10 states, of which 2-3 show an accuracy of 0.4-0.8%, while the other states have an accuracy of 0. What makes me doubt the results: if I directly compare the relevant posterior distributions of the class variables with the corresponding distributions in the validation dataset, all states show a very good fit. How can that possibly be?

Thanks in advance for your help!

Daniel

P.S.: GeNIe is a really great tool! I really appreciate that you offer a free academic version for research at universities! :-)

Post by shooltz[BayesFusion] »

Can you post your data files here? Alternatively, you can contact me using the private messaging feature of this forum.

Post by shooltz[BayesFusion] »

Thanks for sharing your data files. Due to the large number of missing values, GeNIe's text file parser incorrectly treats some of the columns as containing numbers - this is a bug which we'll fix soon. You can see the problem in a column like Apoptosis: sort the data by this column in descending order and the top rows will have Apoptosis=normal. Note that the value is right-justified; this justification is used for numbers in GeNIe's datasheet window. The problem in the EM/validation matching dialog is caused by this misclassification.

As a short-term solution, you can consider moving some data rows to ensure that all columns have at least one non-missing value in the first 200 rows of the data file. If the CSV import is correct, you'll see that all columns are left-justified.
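If it helps to check a file before importing it, here is a minimal Python sketch (not GeNIe or SMILE code) that lists the columns whose first 200 rows contain only missing values. It assumes a tab-separated file with a header row and missing values written as blanks or "*"; the file name is only an example.

import csv

def columns_missing_in_head(path, n_rows=200, missing=("", "*")):
    # Report columns where every cell in the first n_rows is missing.
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)
        seen_value = [False] * len(header)
        for i, row in enumerate(reader):
            if i >= n_rows:
                break
            for j, cell in enumerate(row):
                if j < len(seen_value) and cell.strip() not in missing:
                    seen_value[j] = True
    return [name for name, seen in zip(header, seen_value) if not seen]

print(columns_missing_in_head("AllVars_L.txt"))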

Post by Daniel »

Thanks for your help!
Instead of moving the rows, I added a single "normal" case with no missing values in any column at the beginning of the data files. Just as you said, the columns are now left-justified and the data files can be reopened again.

The validation bug, however, is still not solved. Any idea what else I could try?

Edit:

For testing purposes I also generated a complete data file using the trained BN and relearned the parameters on this complete data file. I then used the same complete data file for validation and still got the same bug...

Edit 2:
Using the complete data file for validation, however, I noticed that although the accuracy results are still odd, the ROC curves of the states seem pretty good. Some states have a near perfect ROC curve while still having an accuracy of 0. In contrast, when using the incomplete data files for validation, ROC curves are not shown.

Post by shooltz[BayesFusion] »

Can you re-upload your datafile, so I can reproduce the exact scenario you're describing?

Post by shooltz[BayesFusion] »

You mentioned class nodes having approx. 10 states, but your validation file has full data only in Imatinib and Expression_of_Bcr_Abl. Both have fewer than 10 states. Which node did you select as the class variable? Also, did you use test only, k-fold, or leave-one-out as the validation method?

Post by Daniel »

The dataset "AllVars_L" and "AllVars_V" are like the old one except with one case with no missing values. "AllComplete" is a dataset which has no missing values (which i used for validation in the scenario described above). As class variables i selected every variable except for imatinib and expression_of_bcr_abl. I used the "test only" validation method.

Post by shooltz[BayesFusion] »

I'm looking into the ROC vs. accuracy issue now.

One aspect of GeNIe's validation code I'd like to highlight is the selection of evidence nodes when multiple class nodes are selected. In such a case, the class nodes are never set as evidence. Your network has 10 nodes; 8 are selected as class nodes while imatinib and expression_of_bcr_abl are not. For each of the 20,000 rows in the data file, GeNIe will read the values of imatinib and expression_of_bcr_abl and set these two values as evidence. Posteriors for the 8 other nodes will be calculated and the outcome with the highest probability will be considered the prediction.

Effectively, you can get different results if you run validation 8 times with a single class node vs. one run with 8 class nodes.
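A minimal Python sketch may make this logic clearer; posterior(evidence, node) is a hypothetical helper standing in for the inference engine (it is not a SMILE call):

def validate(rows, class_nodes, evidence_nodes, posterior):
    # posterior(evidence, node) returns a dict {state: probability} for 'node'
    # given the evidence; only the non-class nodes are ever entered as evidence.
    correct = {n: 0 for n in class_nodes}
    total = {n: 0 for n in class_nodes}
    for row in rows:                                    # one record per data row
        evidence = {n: row[n] for n in evidence_nodes}  # e.g. imatinib, expression_of_bcr_abl
        for n in class_nodes:
            if row[n] is None:                          # skip missing true values
                continue
            post = posterior(evidence, n)
            prediction = max(post, key=post.get)        # most probable outcome wins
            total[n] += 1
            correct[n] += (prediction == row[n])
    return {n: correct[n] / total[n] for n in class_nodes if total[n]}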

Post by Daniel »

Well, that explains at least the accuracy results... For the variable "platelet count", for example, there are 3 states out of 10 which are always the most probable under any combination of the evidence nodes "imatinib" and "expression_of_bcr_abl". Thus, after running the validation, only these three states have a nonzero accuracy, while the other 7 have an accuracy of 0...

However, this algorithm leaves me a bit puzzled. Shouldn't it rather roll a die and then assign a state according to the posterior distribution over the class node's states? That would seem much more appropriate to me.
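A tiny Python sketch of the two decision rules, with made-up posterior numbers, just to make the difference concrete:

import random

# Made-up posterior over three states of a class node, given some evidence.
posterior = {"low": 0.6, "normal": 0.3, "high": 0.1}

argmax_prediction = max(posterior, key=posterior.get)   # always "low"

def sampled_prediction(post):
    # "roll a die": draw one state with probability equal to its posterior
    return random.choices(list(post), weights=list(post.values()), k=1)[0]

print(argmax_prediction, sampled_prediction(posterior))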

Post by marek [BayesFusion] »

Daniel wrote: However, this algorithm leaves me a bit puzzled. Shouldn't it rather roll a die and then assign a state according to the posterior distribution over the class node's states? That would seem much more appropriate to me.
This is an interesting idea but it seems counter-intuitive to me. Would you be willing to justify your approach? I am dead serious here -- I have been in this business long enough to distrust my intuition (and anybody else's intuition for that matter). Probabilities are quite hard and not always intuitive :-).
Cheers,

Marek

Post by marek [BayesFusion] »

Tomek and I discussed your problem and we started wondering whether you really want to do what you wrote that you want to do. When dealing with multiple class nodes, it does not seem fair to enter the values of all the other classes when testing the accuracy for a given class. To us, validation with multiple class nodes makes sense when the values of all symptoms and all other measurements are known but none of the class nodes are known/observed. This is what GeNIe does: it treats all the nodes that you designate as class nodes as unknown. If we understand correctly, you would like to run validation for each class node x as if the values of all other (class!) nodes were known/had been observed. I have a hard time understanding the (common) sense of this operation, although (as I pointed out in my previous post above) I have been in this business long enough to distrust my intuition.

If you want to do what you are describing, you can still do it. Just designate each of your eight class nodes as the class node in a separate validation run. When you do this, GeNIe will treat the other class nodes as ordinary nodes and observe their values during the validation run. The additional difficulty for you is that you will have to do it eight times, each time for a different one of the eight class nodes. The results are going to be different (probably better) than what you get from running validation with all class nodes at the same time. The intuition (which I trust in this case -- I have theoretical reasons for trusting it :-)) is that you have more information available -- you know the values of all the other class nodes.

I would really like to hear your motivation for doing what you want to do. If we judge this to be common (at the moment, I have to say I find it an exceptionally uncommon situation), we will consider adding this as an option to the Validation dialog.

In any case, thank you so much for bringing this up. I am setting out to expand the GeNIe manual and explain how validation works in the case of multiple class nodes. Please expect the Validation section to be expanded in the next release of GeNIe :-).
Cheers,

Marek

Post by Daniel »

Thank you very much. I will send you both a PM in response.

Post by shooltz[BayesFusion] »

Daniel wrote: Some states have a near perfect ROC curve while still having an accuracy of 0.
Note that an ROC curve by definition represents the performance of a classifier as the decision threshold changes. No such threshold change happens when GeNIe computes accuracy during validation.

For example, when running validation for Platelet_count with AllComplete.txt, you'll get an accuracy of zero for the p3000_3500 outcome (because only 3 out of 10 states are ever predicted, and p3000_3500 is not one of them). However, for the ROC curve we have to assume a binary case of p3000_3500 vs. non-p3000_3500 and a sliding threshold. There are three distinct posterior values for the 19 rows with p3000_3500:
P1=0.00047325299402649942 (1 record)
P2=0.0010252756316987189 (4 records)
P3=0.0052401864706895594 (14 records)
With threshold=P1 you get 100% true positives and 100% false positives, i.e., a point in the upper-right corner of the ROC chart. Moving the threshold up to P2, you miss exactly one p3000_3500 case and TPR=(19-1)/19=0.947. Using P3 gives TPR=(19-1-4)/19=0.74.
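A few lines of Python reproduce this TPR arithmetic; the negative (non-p3000_3500) rows would be handled the same way to obtain the false-positive axis:

# The 19 rows whose true state is p3000_3500, grouped by the posterior
# the model assigns to that state (values quoted above).
positives = [0.00047325299402649942] * 1 \
          + [0.0010252756316987189] * 4 \
          + [0.0052401864706895594] * 14

def tpr(threshold):
    # a positive row counts as detected when its posterior >= threshold
    return sum(p >= threshold for p in positives) / len(positives)

for t in sorted(set(positives)):
    print(f"threshold={t:.6g}  TPR={tpr(t):.3f}")
# prints 1.000, 0.947 and 0.737 for thresholds P1, P2 and P3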

Post by marek [BayesFusion] »

Daniel wrote: Thank you very much. I will send you both a PM in response.
Daniel wrote: What puzzles me about the algorithm for predictive accuracy is the following:

Consider the following network A->B with states a1, a2, a3 and b1, b2, b3 respectively. Let the CPT of B be such that, under any constellation of B's parents, P(b1|par(B)) > P(b2|par(B)) > P(b3|par(B)). Although the probabilities for B's states given the different states of A can vary considerably, the algorithm for predictive accuracy always picks b1. (I can also make a simple GeNIe example for that if you want.) Introducing a "rolling the dice" step before picking a state for the class variable would still yield b1 in the majority of cases, but it would also account for the differences in the probabilities P(b1|a1), P(b1|a2) and P(b1|a3) and would therefore be more accurate.

I spent this afternoon with the chapter on this issue in Korb and Nicholson's "Bayesian Artificial Intelligence", and they point out the same problem with predictive accuracy. However, I also learned that there already are some alternative approaches available, such as Kullback-Leibler divergence. Although a different metric, the dice step in the predictive accuracy algorithm would in my eyes be closer to such approaches.

By the way, I tried to publish a paper about an earlier version of this model in "Artificial Intelligence in Medicine". Previously, we worked with rather simple software which had no PE-learning algorithms etc., so we estimated the parameters by hand, which probably led to the rejection of the paper. I was really happy to learn that a tool like GeNIe exists. It makes it A LOOOOT easier to create a better version and to implement the reviewers' suggestions.

So thank you guys for this awesome software and the support you offer! :-)
Hi Daniel,

I quoted a part of your PM that I think is of general nature (and general interest) and does not reveal the details of your project.

My argument against rolling the dice in the last stage is that you are always better off betting on the more likely thing. Your argument resembles an interesting experiment in psychology, due to W.K. Estes, and an interesting effect that he called "probability matching." Subjects face a black box with two lights, green and red, and are asked to bet on one of the lights. When they guess correctly, they are rewarded. The game goes on for a long time, with tens or even hundreds of bets. Suppose that the probability of red is 0.3 and the probability of green is 0.7 (this is generally hidden from the subjects, but they figure out over time that green is more likely). What is the optimal betting strategy? Always bet on green, which is more probable. What do the subjects do? It turns out that they try to match the probabilities of the lights: they bet on red 30% of the time and on green 70% of the time. This strategy is clearly sub-optimal and leads to lower gains. There are many explanations of this, but the one that I like most is that humans have a hard time believing in true randomness and always try to beat the system and learn its deterministic rules.
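The expected hit rate per bet makes the difference explicit:

# Expected hit rate of the two betting strategies when P(green) = 0.7:
p = 0.7
always_green = p                      # always bet on the more likely light
matching = p * p + (1 - p) * (1 - p)  # bet green 70% and red 30% of the time
print(always_green, matching)         # 0.7 vs 0.58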

Now, when you are evaluating a system and your biggest concern is accuracy, you should allow the system to pick the most likely state. Assuming that the system is the best you can build, doing anything else will impact the accuracy negatively. The ROC curve will show you a spectrum of possible behaviors, and you can pick another decision rule depending on the utility of being right or wrong. For example, when diagnosing cancer, you would like high sensitivity at the expense of specificity. Does this make sense?

I may be misunderstanding the description of your hypothetical network above but I have a feeling that it is impossible. Building it may be a good exercise. The P(b1|par(B)) > P(b2|par(B)) > P(b3|par(B)) relationship has to be violated by some combinations of states of the parents -- it cannot hold always :-).

KL divergence is a useful measure for comparing probability distributions. In the case of evaluating a model, you may want to compare posteriors. SMILE has a function call for KL divergence. We use it in GeNIe as well, for example when displaying arc strength through arc thickness.
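For a quick check outside GeNIe, the discrete KL divergence is also easy to compute directly; a minimal sketch with made-up numbers (not the SMILE call):

import math

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_i p_i * log(p_i / q_i); assumes q_i > 0 wherever p_i > 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# e.g. a node's posterior under the model vs. the empirical distribution
# observed in the validation data (numbers made up for illustration):
model_posterior = [0.05, 0.60, 0.35]
data_frequency  = [0.04, 0.63, 0.33]
print(kl_divergence(model_posterior, data_frequency))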

Thank you for your kind words about GeNIe -- we have been trying hard for the last 20 years, and we have always had more ideas than person-power to implement them. With our move outside of academia, we have more resources, so it is only going to get better. Stay tuned!

Marek

Post by shooltz[BayesFusion] »

GeNIe 2.1.1104 (released today) contains a fix for the issues with the text file parser you've encountered. The ROC display is also improved; for example, you should be able to see the actual classifier thresholds in the tooltips. You can also copy & paste the numbers behind the ROC curve - right-click on the chart, select "Copy", then paste into Notepad or any text editor.