empty ROC

danielhombing
Posts: 17
Joined: Sat Jun 10, 2017 6:53 pm

empty ROC

Post by danielhombing »

Dear Marek,

Previously I asked why I got nothing on my ROC curve. Could you help me find out what the reason is, please? I sent you a private message with my model and data. Thanks.

Best,
Daniel
marek [BayesFusion]
Site Admin
Posts: 430
Joined: Tue Dec 11, 2007 4:24 pm

Re: empty ROC

Post by marek [BayesFusion] »

Hi Daniel,

I have looked at your model and your data, and there is indeed something strange going on that has to do with records with a missing class variable. Once you delete these from the Validation data, you will get both the ROC and the calibration curves. GeNIe should be able to handle this case; I will let you know as soon as we have found a solution.
Cheers,

Marek
danielhombing
Posts: 17
Joined: Sat Jun 10, 2017 6:53 pm

Re: empty ROC

Post by danielhombing »

Thanks Marek!

Which variable do you mean? Yes, I have missing values both in some nodes and in the outcome node (the class node, in the case of validation analysis). Is it allowed to have missing values in the outcome node?

Looking forward to your answer. And again, thanks a lot!
marek [BayesFusion]
Site Admin
Posts: 430
Joined: Tue Dec 11, 2007 4:24 pm

Re: empty ROC

Post by marek [BayesFusion] »

Dear Daniel,

Having missing values in the class variable is generally odd -- you cannot test such a record for accuracy, as you do not know which class it belongs to. Quality software like GeNIe should, however, behave correctly even in this case. We have fixed this, and you can download the newest GeNIe to check it -- GeNIe now essentially ignores those records in the data file that have no value in the class variable and derives the ROC curve correctly. I hope this helps!
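What "ignoring records with a missing class variable" amounts to can be sketched in a few lines of Python. This is only an illustration, not GeNIe's actual code; the column name "HeartDisease" and the records below are made up:

```python
# Drop rows whose class column is empty before computing validation
# statistics: a record without a class value cannot be scored.
rows = [
    {"BP": "high", "HeartDisease": "yes"},
    {"BP": "low",  "HeartDisease": ""},    # missing class: cannot be scored
    {"BP": "high", "HeartDisease": "no"},
]
usable = [r for r in rows if r["HeartDisease"] != ""]
```

Only the records in `usable` contribute to the ROC and calibration curves.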

Marek
danielhombing
Posts: 17
Joined: Sat Jun 10, 2017 6:53 pm

Re: empty ROC

Post by danielhombing »

Thanks Marek for your kind help!

If you don't mind, I have another question. :)
I build a simple model about heart disease (please find the model in attachment). I define also all CPTs. I am trying to compare the Genie's result with manual bayes calculation (in order to improve my knowledge in bayesian concept). But I found the result different.

I want to know the probability of someone having heart disease if he has high blood pressure.

According to bayes rule:
P(HD=yes | BP=high ) = P( BP = high | HD=yes) x P (HD=yes) / P(BP=high)

According to the CPT value in the model:
P( BP = high | HD=yes) = 0.85
P (HD=yes) = 0.39 -> I got also this number when calculate manually

And P(BP=high) can be elaborated into:
P(BP=high) = P( BP = high | HD=yes) x P (HD=yes) + P( BP = high | HD=no) x P (HD=no)
P(BP=high) = 0.85 x 0.39 + 0.2 x 0.61
P(BP=high) = 0.45

So,
P(HD=yes | BP=high ) = P( BP = high | HD=yes) x P (HD=yes) / P(BP=high)
P(HD=yes | BP=high ) = 0.85 x 0.39 / 0.45
= 0.73

But why does GeNIe give the probability of heart disease "yes" as 0.87 and not 0.73? (please see the attached picture). Can you explain this to me? Am I doing something wrong?

Thanks again for your response!


*I really like GeNIe (for its simplicity and for being free of charge for research purposes) and want to use this software in my research.
Attachments
heart disease.jpg
heart disease.jpg (45.54 KiB) Viewed 8304 times
test_heart disease.xdsl
(2.93 KiB) Downloaded 408 times
marek [BayesFusion]
Site Admin
Posts: 430
Joined: Tue Dec 11, 2007 4:24 pm

Re: empty ROC

Post by marek [BayesFusion] »

Dear Daniel,

One error that I see in your calculation is that you assumed/calculated P(HD=yes) = 0.39. Actually, P(HD=yes) = 0.61 and P(HD=no) = 0.39, so you may have swapped the two values. With the corrected prior, Bayes rule gives 0.85 x 0.61 / (0.85 x 0.61 + 0.2 x 0.39) = 0.87, which matches GeNIe. Does this solve your problem?
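The corrected arithmetic can be checked with a few lines of Python. The CPT values below are the ones quoted in the thread (with the prior swapped back to P(HD=yes) = 0.61):

```python
# Bayes rule: P(HD=yes | BP=high) = P(BP=high | HD=yes) P(HD=yes) / P(BP=high)
p_hd_yes = 0.61
p_hd_no = 0.39
p_bp_high_given_yes = 0.85
p_bp_high_given_no = 0.20

# Law of total probability for the evidence term:
p_bp_high = p_bp_high_given_yes * p_hd_yes + p_bp_high_given_no * p_hd_no

posterior = p_bp_high_given_yes * p_hd_yes / p_bp_high
print(round(posterior, 2))  # 0.87, matching GeNIe
```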

I love to hear that you are going to use GeNIe in your research. GeNIe is very popular in academia. Price is one factor, but I believe it is the quality of the software that makes all the difference. Please remember to acknowledge GeNIe in your publications, and do not hesitate to contact us in case of any problems.
Cheers,

Marek
danielhombing
Posts: 17
Joined: Sat Jun 10, 2017 6:53 pm

Re: empty ROC

Post by danielhombing »

Dear Marek,

Thanks for your answer. Yes, that's my mistake.

I have another question about EM algorithm.

To construct my model, I follow a theoretical concept that contains several layers from the outer nodes to the outcome node. Unfortunately, I have no data for some intermediate nodes, yet EM still gives me CPTs for these intermediate nodes, and when I did the validation, the results seemed good. I am wondering how EM works in this situation (when there are no data for some nodes). What I understand about EM is that it learns the probabilities from the known data, generates "dummy" probabilities, and performs the expectation and maximization steps iteratively until convergence. But I don't know how EM assigns probabilities to a node that is completely empty (no data). I cannot find any explanation of this situation.

I created a simple Excel dataset with 50 rows and three nodes in a serial connection: house's floor -> affordability -> use product.
When I delete all the data from "affordability" and then run EM, it still gives a CPT for affordability, and the values depend on which type of parameter initialization I choose. If I choose "uniformize", the CPT in affordability will be 0.5 for all states; if I choose "randomize", the CPT is not uniform (it gives various probabilities).

Could you explain to me how EM works in this situation, please? If there is a paper discussing this situation, that would be helpful. If possible, I would like both a "simple" explanation (because I am not from mathematics) and a mathematical one (maybe I can learn from that as well). If you prefer to write it on paper, scan it, and attach it here, that is also fine with me.

Thank you a lot for your help.

Best,
Daniel
Attachments
test EM.xdsl
(1.68 KiB) Downloaded 403 times
learn_EM.csv
(865 Bytes) Downloaded 304 times
marek [BayesFusion]
Site Admin
Posts: 430
Joined: Tue Dec 11, 2007 4:24 pm

Re: empty ROC

Post by marek [BayesFusion] »

The EM algorithm works quite simply -- it searches for the set of parameters that maximizes the log-likelihood of the data given the model. Missing values are handled by producing the most likely values for them given the current model, and these change as the model develops. When there are no data at all for a variable, all of its parameters are essentially produced randomly at first and then modified in the search for the optimal set. You cannot trust the meaning of the states and their probabilities for such variables, but the learned model will still capture the joint probability distribution over the variables that are present in the data file -- hence the reasonable performance of your learned model. The EM algorithm is described on Wikipedia (https://en.wikipedia.org/wiki/Expectati ... _algorithm) and in the original paper by Dempster, Laird and Rubin (reference [1] in the Wikipedia article).
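The two-step loop described above can be sketched in plain Python for a chain A -> H -> B in which H is never observed, mirroring the floor -> affordability -> product example. All data and names here are synthetic, and this is only an illustration of the textbook EM recursion, not GeNIe's implementation:

```python
import math
import random

random.seed(0)

# Synthetic observed records: (a, b) pairs, both binary; h is never observed.
data = [(0, 0)] * 20 + [(0, 1)] * 5 + [(1, 0)] * 7 + [(1, 1)] * 18

def normalize(d):
    s = sum(d.values())
    return {k: v / s for k, v in d.items()}

# P(A) can be estimated directly from the data; the CPTs that involve the
# hidden node start as random noise ("randomize"; "uniformize" would use 0.5).
p_a = normalize({a: sum(1 for x, _ in data if x == a) for a in (0, 1)})
p_h_given_a = {a: normalize({h: random.random() for h in (0, 1)}) for a in (0, 1)}
p_b_given_h = {h: normalize({b: random.random() for b in (0, 1)}) for h in (0, 1)}

def log_likelihood():
    # log P(data | model), summing the hidden state out of each record
    return sum(math.log(p_a[a] * sum(p_h_given_a[a][h] * p_b_given_h[h][b]
                                     for h in (0, 1)))
               for a, b in data)

history = [log_likelihood()]
for _ in range(100):
    # E-step: posterior over the hidden state for every record.
    counts_ha = {a: {h: 0.0 for h in (0, 1)} for a in (0, 1)}
    counts_bh = {h: {b: 0.0 for b in (0, 1)} for h in (0, 1)}
    for a, b in data:
        post = normalize({h: p_h_given_a[a][h] * p_b_given_h[h][b]
                          for h in (0, 1)})
        for h in (0, 1):
            counts_ha[a][h] += post[h]
            counts_bh[h][b] += post[h]
    # M-step: turn the expected counts into new CPTs.
    p_h_given_a = {a: normalize(counts_ha[a]) for a in (0, 1)}
    p_b_given_h = {h: normalize(counts_bh[h]) for h in (0, 1)}
    history.append(log_likelihood())
```

The log-likelihood in `history` never decreases, which is the defining property of EM. Running this with different seeds gives different CPTs for H (its states are interchangeable), but the implied joint distribution over A and B converges to the same fit, which is exactly why the states of a fully unobserved node carry no trustworthy meaning.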
I hope this helps,

Marek
danielhombing
Posts: 17
Joined: Sat Jun 10, 2017 6:53 pm

Re: empty ROC

Post by danielhombing »

Thanks for your answer Marek.

Yes, I agree with you not to trust the CPTs of these no-data nodes. The reason I still use these intermediate nodes is that I follow a theoretical concept, and I also have about 10 nodes connected to the outcome node, which is too many, so I "clustered" them using these intermediate nodes.

You mentioned that "the algorithm produces the numbers randomly at first and then modifies them in the search for the optimal set of parameters". Can I say that the algorithm works like this: on the first try, it assigns a random value, for example 0.4, and uses this value to predict the outcome node. If the prediction is below the probability of the outcome node, then on the next turn the algorithm uses a different number, for example 0.5, and so on, until it finds the CPT that gives the closest probability to the "real" probability of the outcome node. Am I right?

And can I also conclude that these "no-data" intermediate nodes are useless in this situation, because they only capture the joint probability over the variables with known data?

Thanks again!
marek [BayesFusion]
Site Admin
Posts: 430
Joined: Tue Dec 11, 2007 4:24 pm

Re: empty ROC

Post by marek [BayesFusion] »

Yes, I think you have it right. Cheers -- Marek
danielhombing
Posts: 17
Joined: Sat Jun 10, 2017 6:53 pm

Re: empty ROC

Post by danielhombing »

Hi,

I have other questions regarding ROC.

1. If I have 2 states in my class node (yes/no), then the ROC gives 2 figures: one for each state. Both AUC numbers are similar, but the graphs are slightly different. If I want to report the ROC, which graph should I use? Or does it not really matter which one I choose?

2. When I validate my model, the AUC is 0.88, but the accuracy results are: the accuracy for predicting "no" is 0.97 and for "yes" it is 0.54; the overall accuracy is 0.88. In other words, specificity is good but sensitivity is not. This may be because the ratio of "yes" to "no" is 1:3.5 (more cases are "no"). But I am wondering how the ROC can give a high AUC while the model predicts "yes" correctly only 54% of the time? It seems the AUC is similar to the overall accuracy.

3. Given these results, what can I say about my model's performance?

4. Could you also tell me how to make the ROC manually? If I copy from the ROC picture and paste into Notepad, I get the coordinates used to draw the ROC. But how do you obtain these coordinates? And is the number of coordinates equal to the number of rows in our dataset?

Thank you for your kind answer (for this question and my questions on other threads).

Daniel
marek [BayesFusion]
Site Admin
Posts: 430
Joined: Tue Dec 11, 2007 4:24 pm

Re: empty ROC

Post by marek [BayesFusion] »

Hi Daniel,

Let me try to answer your questions:

Ad 1: Generally, you should use the ROC for the abnormality that you want to detect. If you have two classes and each of them is equally interesting, you should report both ROCs. They will be complementary and highly dependent on each other but still you want to see the ROC for each class.

Ad 2: AUC and overall accuracy may be correlated, but they are completely different concepts. Every point on the ROC corresponds to a concrete value of sensitivity, specificity, and accuracy, so each of them changes as you move along the ROC. I recommend the excellent Wikipedia article on ROC curves.

Ad 3: The triple sensitivity, specificity, and accuracy is one way of characterizing performance; ROC and AUC is another. You can also report both. Keep in mind that sensitivity, specificity, and accuracy correspond to just one point on the ROC, so you can make each of them higher at the cost of the others. There is also a point on the ROC that corresponds to the highest accuracy.

Ad 4: Again, I suggest you look at the Wikipedia article on ROC curves. The axes are described there, and you can plot the ROC curve once you have run the model on each of your data points: as you change the decision threshold, you get different sensitivities and specificities, and hence different points on the ROC curve.
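The threshold sweep just described can be written out directly. This is a hedged sketch, not GeNIe's code: the labels and scores below are invented, ties between scores are not treated specially, and the curve has one point per record plus the origin (which answers the question about the number of coordinates):

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points: sweep the threshold from high to low,
    admitting one record at a time."""
    pairs = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)             # number of "yes" records
    neg = len(labels) - pos       # number of "no" records
    tp = fp = 0
    points = [(0.0, 0.0)]         # threshold above every score
    for score, label in pairs:
        if label == 1:
            tp += 1               # a true positive crosses the threshold
        else:
            fp += 1               # a false positive crosses the threshold
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC polyline."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up example: 1 = "yes", 0 = "no"; scores are P(yes) from the model.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
print(auc(points))  # 8/9, about 0.889
```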

I hope this helps,

Marek