Validation using data with known diagnoses
I saw/read somewhere that there is a video on validation, but I can't find it. Is this video available somewhere? I'm particularly interested in figuring out what percentage of the cases in my data set end up with the correct diagnosis (using the "diagnosis" functionality), utilizing a naive Bayesian network already created in GeNIe with probability distributions derived from data.
Re: Validation using data with known diagnoses
p.s. After reading through the information in the Help > Using GeNIe > Learning > Validation section, I was able to see what percentage of cases were correctly diagnosed from the test and validation sets, the test set being the set from which the Bayesian network was derived and the validation set an independent set of cases. I was wondering if it is possible to identify which specific cases were misdiagnosed/misclassified. If so, could you please explain how?

 Site Admin
 Posts: 1424
 Joined: Mon Nov 26, 2007 5:51 pm
Re: Validation using data with known diagnoses
I was wondering if it is possible to identify which specific cases were misdiagnosed/misclassified.

If you specify a name for the validation output file, the validation will produce a data set based on the input data with additional columns added: one column for each class node, containing the predicted outcome, and one column for each class node outcome, containing the probability of that outcome. By comparing the actual entry in the input data with the predicted outcome, you'll be able to determine which cases were misclassified.
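For example, once you have the validation output file, a short script can list the misclassified cases. This is just a sketch: the column names (`CaseID`, `Diagnosis`, `Diagnosis_predicted`) are hypothetical stand-ins, so check the actual headers GeNIe writes to your output file.

```python
# Toy stand-in for GeNIe's validation output: each row pairs the actual
# class value from the input data with the predicted-outcome column that
# validation appends (column names here are hypothetical).
rows = [
    {"CaseID": 1, "Diagnosis": "flu",  "Diagnosis_predicted": "flu"},
    {"CaseID": 2, "Diagnosis": "cold", "Diagnosis_predicted": "flu"},
    {"CaseID": 3, "Diagnosis": "flu",  "Diagnosis_predicted": "flu"},
    {"CaseID": 4, "Diagnosis": "cold", "Diagnosis_predicted": "cold"},
]

# A case is misclassified when the actual entry differs from the prediction.
misclassified = [r["CaseID"] for r in rows
                 if r["Diagnosis"] != r["Diagnosis_predicted"]]
print(misclassified)  # → [2]
```

In practice you would load the validation output file (e.g. with the `csv` module) instead of the inline list above.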
Re: Validation using data with known diagnoses
Thanks! That worked perfectly.
Re: Validation using data with known diagnoses
Question:
Once cases that were "misdiagnosed" are identified (given that the actual diagnoses are known for each case), is it possible to identify and prioritize the nodes/values responsible for the misdiagnosis?

 Site Admin
 Posts: 435
 Joined: Tue Dec 11, 2007 4:24 pm
Re: Validation using data with known diagnoses
The answer to your question is unfortunately negative: it is not easy to pinpoint automatically which part of the model was responsible for a misdiagnosis. This task is best done manually.
Cheers,
Marek
Re: Validation using data with known diagnoses
Thank you. Next question:
I created a naive Bayesian network and entered probability distributions based on data. Then I used the "learn new network" feature to learn a naive Bayesian network using default parameters. I then compared both networks against a separate data set, and they performed slightly differently. I set my 0s to 0.01 and my 1.0s to 0.99. It appears the automatically generated naive Bayes did something similar, but changed 0s to roughly 0.05, though not exactly. For my own network, if there was a missing value I took an average (hypothetically, if there are 10 samples and the node is binary with 2 positive and 8 negative, the distribution would be 0.2 pos and 0.8 neg; but if there were 10 samples and values could only be obtained for 8 of them, 2 pos and 6 neg, then the distribution would be 2/8 and 6/8, i.e., 0.25 pos and 0.75 neg). The automatically generated naive Bayesian network did not fill in distributions this way (similar, but not exactly). Could you explain how the probability distributions are filled in for the auto-generated naive Bayes?

Re: Validation using data with known diagnoses
Parameter learning is based on the EM algorithm, an iterative procedure that replaces the missing values with the most likely values and then does the counting, like you did. I am not surprised that the parameters you obtained are slightly different from the parameters obtained by EM: the counting will involve 10 records rather than 8 in your example. The values of the missing elements will first be replaced, and then the algorithm will perform the counting. Zeros and ones are troublesome probabilities: once something is zero, it can never become anything else, even if the evidence is overwhelmingly strong. For example, if you assume that there are no ghosts, i.e., p(Ghosts)=0, then no matter how strong the evidence for their existence is, you will never change your mind. So, zeros are best avoided :).
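To make the "replace missing values, then count" idea concrete, here is a deliberately stripped-down, one-variable sketch of EM-style fractional counting for the 2-positive / 6-negative / 2-missing example from the question. Real EM in GeNIe works over the whole network and uses a prior over the parameters, so its numbers will differ; this only illustrates the iteration.

```python
# One-variable sketch of EM-style fractional counting (no prior, no other
# nodes): 2 positive, 6 negative, 2 missing, out of 10 records total.
pos_obs, missing, n = 2, 2, 10

p = 0.5  # initial guess for P(positive)
for _ in range(50):
    # E-step: each missing record contributes a fractional count p
    # to the "positive" tally under the current parameter estimate.
    expected_pos = pos_obs + missing * p
    # M-step: re-estimate the parameter from the completed counts,
    # dividing by all 10 records (not just the 8 observed ones).
    p = expected_pos / n

print(p)  # converges to 0.25, i.e. the same 2/8 as counting observed rows
```

In this degenerate one-variable case the fixed point (p = (2 + 2p)/10, so p = 0.25) coincides with the "ignore the missing rows" estimate; in a full network, the other variables in each incomplete record pull the fractional counts away from that value.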
Cheers,
Marek
Re: Validation using data with known diagnoses
I don't think I understand what "parameter learning" means. If I provide a data set with known values, what parameters is the EM algorithm learning? I.e., if I have a feature that is present in 2 of 10 cases, absent in 6 of 10 cases, and unknown in 2 of 10 cases, what is being learned? The network structure? The missing values?
What parameters are being learned if there are no missing values?
Also, what does the EM algorithm do with 0s and 1s (i.e. if 0 of 10 are positive and 10 negative, what probability distribution will the algorithm provide)?

Re: Validation using data with known diagnoses
Parameter learning is essentially learning probability distributions from data. If all values were present and the feature were present in two cases out of ten, the probability distribution learned would be 0.2/0.8. However, when two of the 10 values are absent, I cannot predict what the distribution will be without seeing the whole network and the other records/values. Instead of 0.2, you will get something between 2/10 and 2/8.
Parameter learning assumes that you already have the structure. In the case of naive Bayes, which is what you referred to, the structure is given by assumption. EM uses a prior over the parameters, and it should never learn a probability of zero. This is a feature of the algorithm, as zeros are dangerous probabilities. In the case that you described, with zero cases of the feature present, some small, nonzero value will be placed on the probability of the feature being present. The exact value will depend on the number of records (i.e., it will be different for 10 and for 100 records).
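The mechanism behind "never learns a zero" can be sketched with Dirichlet (Laplace-style) smoothing: prior pseudo-counts are added to the observed counts before dividing. The equivalent sample size of 1.0 below is an assumption for illustration; GeNIe's actual prior settings may differ.

```python
# Dirichlet-smoothed probability estimate: prior pseudo-counts keep the
# result strictly between 0 and 1. The equivalent sample size (ess=1.0)
# is an assumed value, not necessarily what GeNIe uses.
def smoothed_estimate(count, total, n_states=2, ess=1.0):
    alpha = ess / n_states            # prior pseudo-count per state
    return (count + alpha) / (total + ess)

# 0 positive out of 10 records: a small but nonzero probability
print(smoothed_estimate(0, 10))       # 0.5/11 ≈ 0.045
# with 100 records, the same zero frequency gives a smaller estimate
print(smoothed_estimate(0, 100))      # 0.5/101 ≈ 0.005
```

This matches the behavior described above: zero observed cases still yield a small positive probability, and the exact value shrinks as the number of records grows.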
I hope this helps,
Marek
Re: Validation using data with known diagnoses
It's still a little fuzzy, but becoming clearer. I have a network that was learned from data where one of the variables (binary) had 9 positives out of 10 (and one negative). Rather than the probability distribution being 0.9 and 0.1, it's 0.86363636 and 0.13636364. Based on what you're saying, is this because, given the rest of the network, the prediction is that fewer than 0.9 will be positive if many more cases are sampled?

Re: Validation using data with known diagnoses
Correct: if there are missing values in that column, then the other variables (or the rest of the network in general) will impact the learned probabilities.
It is also correct if there are few records (like 10 :)), as EM starts with a prior in each distribution; the data just modify this prior. If there are very few records, you will see a discrepancy between the frequency and the learned probability, as the prior will play a more important role. Please try 90 records out of 100 and you should see that the departure from the frequency (i.e., 0.9/0.1) will be smaller; then 900 out of 1000 and 9000 out of 10000.
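A quick numeric sketch of the prior washing out as records grow. Note that with an assumed uniform prior of equivalent sample size 1.0, the 10-record case gives 9.5/11 ≈ 0.8636, which happens to match the 0.86363636 reported above, though GeNIe's exact prior (and the influence of missing values) may make the real numbers differ.

```python
# Same smoothed estimator as before: prior pseudo-counts added to the data
# counts. ess=1.0 is an assumed equivalent sample size, not a GeNIe fact.
def smoothed(count, total, n_states=2, ess=1.0):
    alpha = ess / n_states
    return (count + alpha) / (total + ess)

# Fixed 90% positive frequency with growing record counts:
for n in (10, 100, 1000, 10000):
    print(n, smoothed(0.9 * n, n))   # estimates approach the frequency 0.9
```

With 10 records the prior pulls the estimate noticeably below 0.9; by 10,000 records the departure is negligible, exactly the experiment suggested above.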
Cheers,
Marek