Learning process, fixed nodes, and validation process


Post by danielhombing »

Hi,

I constructed a BN for my research. It has several layers. Unfortunately, I only have data for the outer layers and the main outcome node (with only two states: yes and no), but no data for the intermediate nodes. I have quite a big data set (about 10,000 records).

To fill the CPTs of the intermediate nodes, I follow the fault tree analysis (FTA) concept, using Boolean algebra and AND/OR gates. When running parameter learning, I set all intermediate nodes as "fixed nodes" in order to keep their original CPTs, which are already filled in using the FTA concept.
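
To be concrete, here is a minimal sketch in Python (an illustration only, not my actual network) of the deterministic CPT column that an AND or OR gate produces for each combination of parent states:

import itertools

def gate_cpt(n_parents, gate="AND"):
    """Deterministic CPT for a Boolean gate: one column per parent combination.
    True = 'yes', False = 'no'; each column is (P(yes | parents), P(no | parents))."""
    cpt = {}
    for combo in itertools.product([True, False], repeat=n_parents):
        fires = all(combo) if gate == "AND" else any(combo)
        cpt[combo] = (1.0, 0.0) if fires else (0.0, 1.0)
    return cpt

# Example: a two-parent OR gate, as in a fault tree
for parents, column in gate_cpt(2, gate="OR").items():
    print(parents, "->", column)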

I have some questions:
1. I cannot run the EM algorithm and get this message: "em: please run with relevance enabled". But if I run it without fixed nodes, it runs well. Why? And can you also give suggestions on how to fill these intermediate nodes?

2. After the learning process, I run validation. I am wondering about the function of "fixed nodes" here. In your manual, you set some nodes as "fixed nodes" (page 415); why?
My understanding is that you get your CPTs from the learning process and afterwards validate your model, and validation does not change the original CPTs you got from learning. Then why do you need to fix those nodes?

3. In parameter learning, I am a bit confused about "confidence".
Let's say I have a node in the outer layer (without a parent node) with 3 states: A, B, C. The original CPT that I got from GeNIe is 0.5, 0.5, 0 (with only two states, GeNIe assigns 0.5 to each, and when you add a third state, its value is zero).
My understanding is that the CPT of an outer-layer node represents the frequencies of all its states. Say I have 10,000 records and the frequencies of A, B, C are 20%, 50%, and 30%. In my opinion, if I run parameter learning, the CPT should end up close to these percentages.

Because I have 10,000 records, I use this number as the "confidence". But the problem is that the learned CPT is not close to those frequency percentages; it only changes the original CPT that GeNIe provided "a bit", say to 0.4, 0.5, 0.1.
But if I set the "confidence" to 1, the CPT does come close to the frequency percentages. So, in my opinion, I should use a low confidence even though I have an abundant data set. Am I right?

Thank you for your kind response and clear answer.

Best,
Daniel

Re: Learning process, fixed nodes, and validation process

Post by shooltz[BayesFusion] »

danielhombing wrote: 1. I cannot run the EM algorithm and get this message: "em: please run with relevance enabled". But if I run it without fixed nodes, it runs well. Why? And can you also give suggestions on how to fill these intermediate nodes?
Can you enable relevance in EM options?
danielhombing wrote: 2. After the learning process, I run validation. I am wondering about the function of "fixed nodes" here. In your manual, you set some nodes as "fixed nodes" (page 415); why?
My understanding is that you get your CPTs from the learning process and afterwards validate your model, and validation does not change the original CPTs you got from learning. Then why do you need to fix those nodes?
The Validation window in GeNIe allows for both validation and cross-validation (k-fold and leave-one-out). During cross-validation, the network's parameters are modified using the part of the data file selected as input for validation. Fixed nodes are applicable to cross-validation only.
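
Conceptually, cross-validation behaves like the sketch below (plain Python pseudocode; learn_em and evaluate are hypothetical placeholders, not our actual API). Fixed nodes are simply skipped in the re-learning step of each fold:

def k_fold_validate(network, records, k, fixed_nodes):
    """Conceptual k-fold cross-validation; learn_em and evaluate are hypothetical."""
    fold = len(records) // k
    results = []
    for i in range(k):
        test = records[i * fold:(i + 1) * fold]
        train = records[:i * fold] + records[(i + 1) * fold:]
        net = network.copy()
        learn_em(net, train, fixed=fixed_nodes)  # fixed nodes keep their CPTs
        results.append(evaluate(net, test))      # score on the held-out fold
    return results
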
danielhombing wrote: Because I have 10,000 records, I use this number as the "confidence".
The confidence applies to the data used previously to train the network. If your network contains default parameters and you run EM for the first time on that network, the confidence should stay at 1.
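
Roughly speaking (a sketch of the idea, not the exact implementation), the confidence acts as an equivalent sample size: the current parameters count as that many virtual records that get pooled with the data. With the numbers from your example:

def blended(prior, counts, confidence):
    """Pool the current CPT (worth 'confidence' virtual records) with the data."""
    n = sum(counts)
    return [(confidence * p + c) / (confidence + n) for p, c in zip(prior, counts)]

prior = [0.5, 0.5, 0.0]        # GeNIe's default CPT for states A, B, C
counts = [2000, 5000, 3000]    # 20%/50%/30% observed in 10,000 records

print(blended(prior, counts, 10000))  # ~[0.35, 0.50, 0.15]: the default dominates
print(blended(prior, counts, 1))      # ~[0.20, 0.50, 0.30]: the data dominate

This is why a confidence of 10,000 pulls the learned values halfway toward the default 0.5/0.5/0 table, while a confidence of 1 essentially reproduces the data frequencies.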

Re: Learning process, fixed nodes, and validation process

Post by danielhombing »

Thank you for your answer. It works; I did not realize that before.

But what is the function of "fixed nodes" in validation?

I have more questions :)

1. In the outcome node, I have only 2 states: "yes" and "no".
In one of my data sets (I have several data sets for different cases), about 80% of the records have outcome "yes". So when I run the validation, the sensitivity (outcome "yes") is 1, the specificity ("no") is 0, and the AUC is not good (0.6).
In another data set, about 20% have outcome "yes", and the validation gives a sensitivity (outcome "yes") of 0 and a specificity ("no") of 1.
And sometimes there is no result on the ROC curve at all.

Why does that happen? Is it because I have more "yes" cases than "no"? That would mean the model perhaps only works when the proportion is between 20% and 80% (<20% and >80% being extreme situations).

2. If I re-adjust the CPTs after parameter learning and then validate, the validation process will use the re-adjusted CPTs, won't it?

3. I want to know how to validate using "Test only", because the manual says we should validate the model using new data. How can I import this new data?
Say I have 10,000 records and divide them into 5,000 for parameter learning and 5,000 for validation. How can I do the validation?

4. And a strange thing: when I tried both k-fold cross-validation and "Test only", "Test only" gave a much better AUC and accuracy than k-fold cross-validation.
In this case, though, I used all 10,000 records for both parameter learning and validation in both procedures (so not the 50-50 split I mentioned above, because I do not know how to import that data). How can the results be so different?


5. Another question, related to the EM algorithm.

As I said before, I have no data on the intermediate nodes. If I run parameter learning, GeNIe will fill in the CPT values for these intermediate nodes. In your opinion, can I use these probability values? Or do I need to "calibrate" them somehow and re-adjust the values?

Thank you for your kind answers. I do not know anyone at my university who can help me with this.

Best,
Daniel

Re: Learning process, fixed nodes, and validation process

Post by marek [BayesFusion] »

Hi Daniel,
danielhombing wrote: But what is the function of "fixed nodes" in validation?
You fix them if you do not want their CPTs to change during the cross-validation. They will normally change, because cross-validation includes a learning and a testing phase.
danielhombing wrote: 1. In the outcome node, I have only 2 states: "yes" and "no".
In one of my data sets (I have several data sets for different cases), about 80% of the records have outcome "yes". So when I run the validation, the sensitivity (outcome "yes") is 1, the specificity ("no") is 0, and the AUC is not good (0.6).
In another data set, about 20% have outcome "yes", and the validation gives a sensitivity (outcome "yes") of 0 and a specificity ("no") of 1.
And sometimes there is no result on the ROC curve at all.

Why does that happen? Is it because I have more "yes" cases than "no"? That would mean the model perhaps only works when the proportion is between 20% and 80% (<20% and >80% being extreme situations).
This question is hard to answer without seeing your model and your data. Certainly, the fact that your class is not evenly distributed is not a problem in itself -- an uneven distribution of classes is a fact of life and happens in every data set. Have you tried looking at the output file? You can see the posterior probability distributions over the class variable for each of your data records.
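
For example, here is a quick check you could run against such an output (plain Python; the numbers are made up, and the 0.5 cutoff is my assumption about what the accuracy figures use):

def sens_spec(posteriors_yes, truth, cutoff=0.5):
    """posteriors_yes: P(class = yes) per record; truth: 'yes'/'no' labels."""
    tp = sum(p >= cutoff and t == "yes" for p, t in zip(posteriors_yes, truth))
    fn = sum(p < cutoff and t == "yes" for p, t in zip(posteriors_yes, truth))
    tn = sum(p < cutoff and t == "no" for p, t in zip(posteriors_yes, truth))
    fp = sum(p >= cutoff and t == "no" for p, t in zip(posteriors_yes, truth))
    return tp / (tp + fn), tn / (tn + fp)

# If the model puts P(yes) above the cutoff for every record, you get
# sensitivity 1 and specificity 0, exactly the pattern you describe:
print(sens_spec([0.8, 0.7, 0.9, 0.6], ["yes", "yes", "yes", "no"]))  # (1.0, 0.0)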
danielhombing wrote: 2. If I re-adjust the CPTs after parameter learning and then validate, the validation process will use the re-adjusted CPTs, won't it?
Only if you use plain validation ("Test only"). In cross-validation, these probabilities will change at each fold (unless you fix these nodes, which is what you were asking about earlier).
danielhombing wrote: 3. I want to know how to validate using "Test only", because the manual says we should validate the model using new data. How can I import this new data?
Say I have 10,000 records and divide them into 5,000 for parameter learning and 5,000 for validation. How can I do the validation?
When you use "Test only", the entire data set is the test set. The standard procedure is to open your training set, learn parameters, open the test set and then use "Test only" on that test set. The procedure tests how well you have learned from the training data set. Does this make sense?
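
If you want to produce the two files yourself, a split along these lines would work (a sketch using pandas; the file names are made up):

import pandas as pd

data = pd.read_csv("all_cases.csv")              # your 10,000 records
shuffled = data.sample(frac=1, random_state=42)  # shuffle before splitting
shuffled.iloc[:5000].to_csv("train.csv", index=False)  # learn parameters from this
shuffled.iloc[5000:].to_csv("test.csv", index=False)   # open this and use "Test only"
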
danielhombing wrote: 4. And a strange thing: when I tried both k-fold cross-validation and "Test only", "Test only" gave a much better AUC and accuracy than k-fold cross-validation.
In this case, though, I used all 10,000 records for both parameter learning and validation in both procedures (so not the 50-50 split I mentioned above, because I do not know how to import that data). How can the results be so different?
I am somewhat confused by your statement, so I am just providing a general answer that may help you realize the source of my confusion. If you train and test your model on the same data, you will obtain much better results than when training and testing your model on two distinct data sets, which is what happens in cross-validation. Training and testing the model on the same data is similar to a teacher revealing all exam questions at the beginning of the semester, teaching the students how to answer these questions, and then testing them at the end of the semester using the same questions. Cross-validation is similar to teaching the students how to answer a certain type of question but testing them on different (similar) questions, so that the students are tested on whether they have mastered the principles. Does this make sense?
danielhombing wrote: 5. Another question, related to the EM algorithm.

As I said before, I have no data on the intermediate nodes. If I run parameter learning, GeNIe will fill in the CPT values for these intermediate nodes. In your opinion, can I use these probability values? Or do I need to "calibrate" them somehow and re-adjust the values?
EM is capable of filling in the CPTs of the intermediate nodes so that the joint probability distribution over all the other nodes reflects the distribution represented by the data. No adjustment is needed. I'm not sure what you mean by using these values; if you mean these values being part of the resulting model, the answer is absolutely yes.
I hope this helps.

Marek

Re: Learning process, fixed nodes, and validation process

Post by danielhombing »

Thanks, Marek, for all your answers. They're very helpful.

If you don't mind, I have more questions :)

1.
marek [BayesFusion] wrote: Have you tried looking at the output file? You can see the posterior probability distributions over the class variable for each of your data records.
Sorry, I did not get this. What do you mean by looking at the output file?

2. When I validate my model (using "Test only"), the accuracy is 65%. The model predicts the outcome "no" (84% correct) better than "yes" (only 31% correct). But why do I not get any picture in my ROC curve? It's totally empty. Please see the picture in the attachment.

Thanks for your help!
Attachment: empty ROC curve.jpg (caption: no result on ROC curve)

Re: Learning process, fixed nodes, and validation process

Post by shooltz[BayesFusion] »

danielhombing wrote: But why do I not get any picture in my ROC curve? It's totally empty. Please see the picture in the attachment.
Can you click on the 'Mark points on the curve' checkbox and repost the images?

Re: Learning process, fixed nodes, and validation process

Post by danielhombing »

When I click the checkbox, a green point appears in the bottom-left corner (0,0).
Attachment: Untitled12.jpg

Re: Learning process, fixed nodes, and validation process

Post by marek [BayesFusion] »

danielhombing wrote: 1.
marek [BayesFusion] wrote: Have you tried looking at the output file? You can see the posterior probability distributions over the class variable for each of your data records.
Sorry, I did not get this. What do you mean by looking at the output file?
When you run validation, you can create an output file. It will replicate each of the records and show you the probabilities assigned to each of the classes (please look at the Validation dialog). This should be insightful for you.
danielhombing wrote: 2. When I validate my model (using "Test only"), the accuracy is 65%. The model predicts the outcome "no" (84% correct) better than "yes" (only 31% correct). But why do I not get any picture in my ROC curve? It's totally empty. Please see the picture in the attachment.
From the picture that you placed in reply to Tomek, it looks like your classifier produces zero probability for every instance of HWT=Yes. There is clearly a problem with your model. If you post your model, I will be glad to have a look at it.
I hope this helps.

Marek

Re: Learning process, fixed nodes, and validation process

Post by danielhombing »

Dear Marek, I have some extra questions:

1. When performing parameter learning, we get a log(p) value. Do I need to consider this value to evaluate my model's performance? I see that closer to zero is better. My log(p) value is -1084380.522721, which is very far from zero.

2. Is it possible to have an illogical CPT after parameter learning but still get good results in validation? If that is a kind of trade-off, which one should I prioritize or sacrifice, in your opinion?

3. I have no data on the intermediate nodes, and when I do parameter learning, the EM algorithm fills in the CPTs of these nodes. But I am curious: how can EM get these values, when logically EM does not know the relationship/effect of each parent state on the child node?

For example, take this network: housing condition (states: finished, natural) -> wealth condition (rich, middle, poor) -> buy product (yes, no).
Let's say I have data on "housing condition" and "buy product" but not on "wealth condition". Logically, if the housing condition is natural, then the wealth condition is poor, and they buy the product. The EM algorithm will fill in the probabilities for the wealth condition, but how can EM find this "logical" relationship? I have tried to find the answer in the literature, but they do not explain this part. They only explain that EM can learn from missing values, whereas in my case all values of one node are missing.

4. Can you explain more about "randomize initial parameters" and "uniformize" in parameter learning? What I know is that you choose these options to wipe out the current/default parameters. Which one should I use?

5. I also found that the validation results differ depending on whether I tick "randomize initial parameters" or "uniformize"; randomizing gives the better result. Can you explain this as well, please?

Thank you again for your kind response. I really appreciate your help.

Best,
Daniel

Re: Learning process, fixed nodes, and validation process

Post by marek [BayesFusion] »

danielhombing wrote: 1. When performing parameter learning, we get a log(p) value. Do I need to consider this value to evaluate my model's performance? I see that closer to zero is better. My log(p) value is -1084380.522721, which is very far from zero.
The absolute value of log(p) depends on the size of your network, your data file, and other factors. It is impossible to judge it out of context, so your statement that it is very far from zero is hard to evaluate. I'd like to point out that this is the logarithm of p, so p is a very small number, as one would expect the probability of the data given the model to be.
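
One normalization that makes the number easier to read (a rough sketch, assuming natural logarithms and the 10,000 records you mentioned earlier, which may not match your actual file) is the average log-likelihood per record:

import math

logp = -1084380.522721    # as reported after learning
n_records = 10000         # an assumption taken from your earlier posts

avg = logp / n_records    # about -108.4 per record
print(avg, math.exp(avg)) # exp(avg) is the typical probability of one full record
# A record is a joint outcome over all the variables in the network, so its
# probability being tiny is normal, not by itself a sign of a bad model.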
danielhombing wrote: 2. Is it possible to have an illogical CPT after parameter learning but still get good results in validation? If that is a kind of trade-off, which one should I prioritize or sacrifice, in your opinion?
I'm not sure what you mean by an illogical CPT. GeNIe does not allow CPTs that are inconsistent, so whatever CPT you enter or learn from data will always be mathematically correct.
danielhombing wrote: 3. I have no data on the intermediate nodes, and when I do parameter learning, the EM algorithm fills in the CPTs of these nodes. But I am curious: how can EM get these values, when logically EM does not know the relationship/effect of each parent state on the child node?

For example, take this network: housing condition (states: finished, natural) -> wealth condition (rich, middle, poor) -> buy product (yes, no).
Let's say I have data on "housing condition" and "buy product" but not on "wealth condition". Logically, if the housing condition is natural, then the wealth condition is poor, and they buy the product. The EM algorithm will fill in the probabilities for the wealth condition, but how can EM find this "logical" relationship? I have tried to find the answer in the literature, but they do not explain this part. They only explain that EM can learn from missing values, whereas in my case all values of one node are missing.
Please have a look at the workings of the EM algorithm; there is a reasonable article on it on Wikipedia. Essentially, you cannot trust that the EM algorithm will know the semantics of your model. It will just find a set of parameters for the hidden node that maximizes the probability of the data given the model. The joint probability distribution over the observed nodes will correctly reflect what is in the data.
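
To make this concrete, here is a toy EM run in plain Python on exactly your housing -> wealth -> buy chain, with wealth hidden. This is a sketch of the algorithm, not GeNIe's implementation, and the data counts are invented. Note that nothing in it forces any particular hidden state to mean "poor"; EM merely finds parameters that fit the observed (housing, buy) table:

import random

random.seed(0)
# Records are (housing, buy) pairs of state indices; wealth is never observed.
data = [(0, 0)] * 300 + [(0, 1)] * 100 + [(1, 0)] * 100 + [(1, 1)] * 500

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

# Random starting point (this is what "randomize initial parameters" does)
p_w_h = [normalize([random.random() for _ in range(3)]) for _ in range(2)]  # P(W|H)
p_b_w = [normalize([random.random() for _ in range(2)]) for _ in range(3)]  # P(B|W)

for _ in range(200):
    # E-step: responsibility of each hidden wealth state for each record,
    # P(w | h, b) proportional to P(w | h) * P(b | w)
    cw_h = [[1e-9] * 3 for _ in range(2)]  # expected counts for P(W|H)
    cb_w = [[1e-9] * 2 for _ in range(3)]  # expected counts for P(B|W)
    for h, b in data:
        r = normalize([p_w_h[h][w] * p_b_w[w][b] for w in range(3)])
        for w in range(3):
            cw_h[h][w] += r[w]
            cb_w[w][b] += r[w]
    # M-step: re-estimate the CPTs from the expected counts
    p_w_h = [normalize(row) for row in cw_h]
    p_b_w = [normalize(row) for row in cb_w]

print("P(W|H):", p_w_h)
print("P(B|W):", p_b_w)

Re-running with a different seed typically yields a different, equally good labeling of the hidden states, which is exactly the sense in which EM does not know your semantics.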
danielhombing wrote: 4. Can you explain more about "randomize initial parameters" and "uniformize" in parameter learning? What I know is that you choose these options to wipe out the current/default parameters. Which one should I use?
Uniformize will wipe out the current parameters, replacing them with uniform distributions. Randomize will wipe out the current parameters and replace them with random values. Because the learning process is heuristic in nature and could be compared to looking for a needle in a haystack, it is hard to say which of the two will give you a better model. Try both and look at the log(p) that you are getting. We are actually working on a module that will perform a more thorough search for the optimal log(p). Please stay tuned.
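
Until that module is ready, a simple workaround is random restarts: run EM several times from different random starting points and keep the result with the best log(p). A sketch (run_em here is a hypothetical stand-in for a single EM run that returns the learned parameters and their log(p)):

import math

def best_of_restarts(run_em, data, n_restarts=10):
    best_logp, best_params = -math.inf, None
    for seed in range(n_restarts):
        params, logp = run_em(data, seed)   # one EM run from a random start
        if logp > best_logp:                # keep the best-scoring restart
            best_logp, best_params = logp, params
    return best_params, best_logp
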
danielhombing wrote: 5. I also found that the validation results differ depending on whether I tick "randomize initial parameters" or "uniformize"; randomizing gives the better result. Can you explain this as well, please?
This certainly does not always happen. If you are looking for a research topic, try this one: under what circumstances is uniformizing better than randomizing? A sub-topic: which randomized starting points lead to the best results? I haven't done a thorough search of the literature on this topic, but I believe it is still weak.

I hope this helps,

Marek