Parameter Learning with not complete Datasets

Christian · Post by **Christian** » Wed Nov 28, 2007 12:47 pm

Hello,

I have a large Bayesian network. Let's say it has 10 chance nodes, N0-N9.

I could train on a dataset with x rows that have no missing fields in N0-N9, but that way I don't have much data.

However, I have many datasets with only some node values filled in. For example:

N0, N3,N5,N6
N0,N3,N4
... and so on

Can GeNIe also use this data? Can I just combine it with the complete dataset? Will GeNIe know what to do with empty data fields?

Or should I learn the parameters first with all data rows of the kind N0,N3,N4, then with all rows of the kind N0,N3,N6, then N5,N8 (these are only examples)? Or will training after training like this destroy the network?

I hope you can help me.

Thank you,

Christian

mark · Post by **mark** » Wed Nov 28, 2007 5:05 pm

Christian wrote: I have a large Bayesian network. Let's say it has 10 chance nodes, N0-N9.

I could train on a dataset with x rows that have no missing fields in N0-N9, but that way I don't have much data.

However, I have many datasets with only some node values filled in. For example:

N0, N3,N5,N6
N0,N3,N4
... and so on

Can GeNIe also use this data? Can I just combine it with the complete dataset? Will GeNIe know what to do with empty data fields?

Parameter learning can handle missing data, so you can just use all your data (complete and incomplete records together) to learn.
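As a rough sketch of what "just use all your data" could look like in practice: the partial tables can be stacked into one table over all ten nodes, with the unobserved fields left empty. The helper names (`merge_partial`, `to_csv`), the true/false values and the comma-separated layout are assumptions for illustration, and the token GeNIe expects for a missing value should be checked against your version.

```python
import csv
import io

ALL_NODES = [f"N{i}" for i in range(10)]  # N0..N9, as in the question


def merge_partial(tables, columns=ALL_NODES, missing=""):
    """Stack tables that each cover only some columns into one table.

    Each table is a list of dicts (column name -> value). Columns that a
    row does not contain are filled with `missing`.
    """
    merged = []
    for table in tables:
        for row in table:
            merged.append([row.get(col, missing) for col in columns])
    return merged


def to_csv(rows, columns=ALL_NODES):
    """Render the merged rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical example rows: one of the kind N0,N3,N5,N6 and one of the kind N0,N3,N4.
part_a = [{"N0": "true", "N3": "false", "N5": "true", "N6": "true"}]
part_b = [{"N0": "false", "N3": "true", "N4": "true"}]

merged = merge_partial([part_a, part_b])
print(to_csv(merged))
```

Every row keeps the values it has, and the learning algorithm is left to deal with the empty cells.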

Christian wrote: Or should I learn the parameters first with all data rows of the kind N0,N3,N4, then with all rows of the kind N0,N3,N6, then N5,N8 (these are only examples)? Or will training after training like this destroy the network?

So do you mean to learn the network incrementally, where in every increment you use data that has more missing values? This is a possibility if you use the confidence parameter every time you learn the next batch. But, as I said before, you can also just use the whole data set as input at once. I don't think the results of the two approaches should differ much, but my guess is that using the whole data set at once should lead to the best results.
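A minimal sketch of why batch-by-batch learning with a confidence value can match learning on everything at once. It assumes the confidence acts as an equivalent sample size, i.e. the number of records the current parameters are treated as being worth. That interpretation, the function name `update` and the counts are assumptions for illustration. With complete data the two routes give the same estimate; with missing values the results are only expected to be close.

```python
def update(prior, confidence, counts):
    """Blend an existing distribution with the counts from a new batch.

    prior:      current probabilities, one per state
    confidence: number of (virtual) records the prior is worth (assumed meaning)
    counts:     observed counts per state in the new batch
    """
    total = confidence + sum(counts)
    return [(confidence * p + c) / total for p, c in zip(prior, counts)]


# Hypothetical counts for one binary node in two batches.
batch1 = [30, 70]
batch2 = [45, 55]

# Incremental: learn batch1 from scratch, then batch2 with
# confidence = number of records already seen.
step1 = update([0.5, 0.5], 0, batch1)
step2 = update(step1, sum(batch1), batch2)

# All at once: pool the counts and learn in one go.
pooled = [a + b for a, b in zip(batch1, batch2)]
at_once = update([0.5, 0.5], 0, pooled)

print(step2, at_once)
```

With a confidence of 0 on the second step, the first batch would simply be overwritten, which is one way repeated training can "destroy" what was learned before.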

Christian · Post by **Christian** » Wed Nov 28, 2007 5:57 pm

I made some tests using a four-node network (Cloudy, Sprinkler, Rain, WetGrass).

First I recreated exactly this network. After that I used the option to generate a data file.

A) I selected all 4 variables for generation (Cloudy, Sprinkler, Rain, WetGrass): 1000 data rows.
B) I created 3 lists: 1. Cloudy-Sprinkler, 2. Cloudy-Rain, 3. Sprinkler-Rain-WetGrass, each with 100000 data rows.
C) The same as B, but afterwards I joined the 3 files. Missing data was left empty.

Then I set the parameters in the original network to junk values (that should be the same as what the random option of the parameter import does, am I right?). Cloudy was left at 0.5/0.5.

OK, then I let GeNIe learn the parameters using list A. Result: very accurate.
Learning from lists B and C gave the same result: inaccurate. I tried the training more than once, but the Rain probabilities, for example, came out as 0.7/0.3 or 0.9/0.1.
So this was not a good way.

Then I tried the following. Using the three lists from B, I calculated the relations manually, for example by counting in the Cloudy/Sprinkler txt file the occurrences of TRUE/TRUE, FALSE/TRUE, TRUE/FALSE and FALSE/FALSE. I then set these values manually as expert advice in GeNIe.
I did the same for the other two txt files.

This way I got values very near to the original values, as good as with parameter learning on list A.

My assumption is: parameter learning is only good for complete data that does not have too many empty fields.

If you have data that is not connected (Cloudy-Sprinkler, Cloudy-Rain, Sprinkler-Rain-WetGrass) you should calculate the relations on your own and set these values manually (or via the API).

I am currently building a large network, but I only have a few thousand complete datasets and millions of independent datasets. So I should do the calculations on my own instead of using parameter learning.

I don't know why this is the case, but I also tested the above with another example and got the same results.

Kind regards,

Christian

mark · Post by **mark** » Thu Nov 29, 2007 4:25 am

So, summarizing, when you generated data for just 2 out of the 4 variables (e.g., Cloudy and Sprinkler), and I assume the data for the 2 variables was complete, the results were not as good as was the case when you used counts. That doesn't make much sense, because parameter learning just uses counts when there is no missing data. Did you, by any chance, set the confidence level to a non-zero value? Also, could you share your programs and data so that I could have a look at it?

Christian · Post by **Christian** » Thu Nov 29, 2007 11:47 am

Hello Mark,

mark wrote:So, summarizing, when you generated data for just 2 out of the 4 variables (e.g., Cloudy and Sprinkler), and I assume the data for the 2 variables was complete, the results were not as good as was the case when you used counts.

Training on just Cloudy-Sprinkler or Cloudy-Rain was not OK. Training WetGrass with Sprinkler and Rain was also not OK. I got a notice in GeNIe that the Cloudy data is needed as well. Training the WetGrass node with only WetGrass, Sprinkler and Rain information messed things up.

But calculating the WetGrass probabilities given Sprinkler and Rain myself was fine. I just counted the rows using Excel.

Maybe the learning algorithm needs the Cloudy information? There was an exclamation mark on Cloudy while I learned only from the WetGrass, Sprinkler, Rain data.

So I did the following:
Opened the network.
Set all the values to junk.
Learn WetGrass_sprinkler_cloudy.txt
Learn WetGrass_rain_cloudy.txt
Learn WetGrass_wetgrass_sprinkler_rain.txt

The Sprinkler and Rain parameters are not correct.

Then I did the following:
Opened the network.
Set all the values to junk.
Learn WetGrass_combined.txt (this is WetGrass_sprinkler_cloudy.txt, WetGrass_rain_cloudy.txt and WetGrass_wetgrass_sprinkler_rain.txt joined)

Now there is no exclamation mark for Cloudy in GeNIe while learning parameters.

The result: the parameters in Sprinkler and in Rain are again not correct.

Here is one simple example of what I mean, which you should easily be able to reproduce:

If I just count the values in WetGrass_sprinkler_cloudy.txt I get the correct values.
Cloudy false, sprinkler false: 2575
Cloudy false, sprinkler true: 2532
Cloudy true, sprinkler false: 4368
Cloudy true, sprinkler true: 525

OK. So for cloudy = false we get:
sprinkler false: 2575 / (2575+2532) = 0.504
sprinkler true: 2532 / (2575+2532) = 0.496

And for cloudy = true:
sprinkler false: 4368 / (4368+525) = 0.893
sprinkler true: 525 / (4368+525) = 0.107

That is very near to the optimum.
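The hand calculation above can be written as a few lines of Python. This is only a sketch of the counting approach using the four counts from this post; the function name `cpt_from_counts` is made up for the example.

```python
# (Cloudy, Sprinkler) -> number of rows, as counted in WetGrass_sprinkler_cloudy.txt
counts = {
    ("false", "false"): 2575,
    ("false", "true"): 2532,
    ("true", "false"): 4368,
    ("true", "true"): 525,
}


def cpt_from_counts(counts):
    """Estimate P(child | parent) by dividing each count by its parent total."""
    totals = {}
    for (parent, _child), n in counts.items():
        totals[parent] = totals.get(parent, 0) + n
    return {
        (parent, child): n / totals[parent]
        for (parent, child), n in counts.items()
    }


cpt = cpt_from_counts(counts)
for (cloudy, sprinkler), p in sorted(cpt.items()):
    print(f"P(Sprinkler={sprinkler} | Cloudy={cloudy}) = {p:.3f}")
```

For complete data without any prior weight, this is what parameter learning is expected to reproduce.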

BUT: now open the WetGrass_sprinkler_cloudy.txt file in GeNIe and let GeNIe learn these values.
What do we get? First, the values are not correct. Second, the values change a bit every time we train again. My assumption is that the learning algorithm did something different from just learning.

mark wrote: That doesn't make much sense, because parameter learning just uses counts when there is no missing data. Did you, by any chance, set the confidence level to a non-zero value?

I used random values for training (I left the checkbox checked), so there is no possibility to enter a confidence level.
But of course there is a chance that I used a non-zero value at some point, as I did a lot of testing yesterday. For everything I wrote above, though, the checkbox "randomize initial values" was checked.

mark wrote: Also, could you share your programs and data so that I could have a look at it?

I have uploaded the network and the training data:

http://rapidshare.com/files/73074257/WetGrass.zip.html

I hope you can at least reproduce the example I wrote below my bold heading.

Thank you,

Christian

mark · Post by **mark** » Thu Nov 29, 2007 3:59 pm

Christian wrote: BUT: now open the WetGrass_sprinkler_cloudy.txt file in GeNIe and let GeNIe learn these values.
What do we get? First, the values are not correct. Second, the values change a bit every time we train again. My assumption is that the learning algorithm did something different from just learning.

I tried this in my GeNIe and got exactly the results you calculated by hand. Also, the values are exactly the same every time. Maybe there is a bug introduced in a later GeNIe version that causes the problem. I'm using version 2.0.2762.2 built on 7/25/2007, how about you?

Christian · Post by **Christian** » Thu Nov 29, 2007 4:10 pm

Hello,

My version is 2.0.2811.0, built on 09/12/2007. I got the latest Windows version from the homepage.

Then this should be a bug, I think.

Are you using Windows or Linux?

Christian · Post by **Christian** » Thu Nov 29, 2007 7:57 pm

Hmm. It really seems that this version is buggy, but only when importing partial data; importing complete data works fine.

Do you know if the ODBC connection uses the same import path for partial data, or could I load my partial data into the database first and then import it using the ODBC connection?

Is it possible to get the older, working version anywhere?

Thank you,

Christian

mark · Post by **mark** » Thu Nov 29, 2007 9:14 pm

You indeed found a (nasty) bug. I just fixed it and a new release of GeNIe (version 2.0.2889.0) has been uploaded to the website. Could you please give it a shot and let me know if it works? Sorry for the inconvenience and thanks for finding this bug!

Christian wrote: Do you know if the ODBC connection uses the same import path for partial data, or could I load my partial data into the database first and then import it using the ODBC connection?

There are no special considerations with respect to ODBC connections. All the data ends up in the same (internal) GeNIe database which is then fed into the learning algorithms.

Christian · Post by **Christian** » Thu Nov 29, 2007 10:15 pm

Yes this is working now! Thank you very much!

The only thing that is left is the following:

Learn WetGrass_sprinkler_cloudy.txt
Learn WetGrass_rain_cloudy.txt

Everything is good so far.

Learn WetGrass_wetgrass_sprinkler_rain.txt

The WetGrass node is correct, but the Sprinkler and Rain nodes changed to wrong values.

While importing WetGrass_wetgrass_sprinkler_rain.txt, an exclamation mark is shown on the left side next to Cloudy. OK, Sprinkler and Rain depend on Cloudy, but I only want GeNIe to set the WetGrass values.

BUT:

Using the following training order everything is fine:
Learn WetGrass_wetgrass_sprinkler_rain.txt
Learn WetGrass_rain_cloudy.txt
Learn WetGrass_sprinkler_cloudy.txt

I don't know if this is a bug or if it happens because Rain and Sprinkler depend on Cloudy.

It would be nice if you could look into this or explain to me why the first import order destroys the already learned values of Sprinkler and Rain.

Thank you,

Christian

mark · Post by **mark** » Thu Nov 29, 2007 11:32 pm

Christian wrote:Learn WetGrass_wetgrass_sprinkler_rain.txt

The WetGrass node is correct, but the Sprinkler and Rain nodes changed to wrong values.

I did a few test runs myself. The value of WetGrass is very precise, because all the variables needed to learn its CPT are observed. For Sprinkler and Rain the situation is different, because their common parent is not observed. However, in my test runs the results are reasonably accurate and usually not far from the correct values. (By the way, please note that the columns in the CPTs for Sprinkler and Rain could be switched, because of the uniform distribution in Cloudy.) Sometimes the results are further off, but that is just a property of the EM algorithm, namely that it can get stuck in local optima. This is a good reason to take the approach I proposed in the other thread (http://genie.sis.pitt.edu/forum/viewtopic.php?t=4). I hope this explanation helps.
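To illustrate why Sprinkler and Rain are harder to learn when Cloudy is never observed, here is a small hand-written EM loop for just those three nodes. It is a sketch under stated assumptions, not GeNIe's implementation: the (Sprinkler, Rain) counts are hypothetical, values are coded 1 = true and 0 = false, and the names `run_em` and `bernoulli` are made up for the example. WetGrass is left out because, given Sprinkler and Rain, it carries no extra information about Cloudy in this network.

```python
import math
import random

# Hypothetical (Sprinkler, Rain) counts; Cloudy is never observed.
COUNTS = {(1, 1): 900, (1, 0): 2100, (0, 1): 4100, (0, 0): 2900}


def bernoulli(p_true, value):
    """Probability of `value` (0 or 1) when P(value = 1) is p_true."""
    return p_true if value == 1 else 1.0 - p_true


def run_em(counts, seed, iterations=200):
    """EM for Cloudy -> Sprinkler and Cloudy -> Rain with Cloudy hidden."""
    rng = random.Random(seed)
    p_cloudy = 0.5
    # Random starting CPTs; list index = state of Cloudy (0 or 1).
    p_sprinkler = [rng.uniform(0.1, 0.9) for _ in range(2)]
    p_rain = [rng.uniform(0.1, 0.9) for _ in range(2)]
    n_rows = sum(counts.values())
    log_likelihoods = []
    for _ in range(iterations):
        # E-step: expected counts of each Cloudy state given the observed row.
        weight = [0.0, 0.0]
        sprinkler_true = [0.0, 0.0]
        rain_true = [0.0, 0.0]
        log_likelihood = 0.0
        for (s, r), n in counts.items():
            joint = [
                bernoulli(p_cloudy, c)
                * bernoulli(p_sprinkler[c], s)
                * bernoulli(p_rain[c], r)
                for c in (0, 1)
            ]
            evidence = joint[0] + joint[1]
            log_likelihood += n * math.log(evidence)
            for c in (0, 1):
                expected = n * joint[c] / evidence
                weight[c] += expected
                sprinkler_true[c] += expected * s
                rain_true[c] += expected * r
        log_likelihoods.append(log_likelihood)
        # M-step: re-estimate all parameters from the expected counts.
        p_cloudy = weight[1] / n_rows
        p_sprinkler = [sprinkler_true[c] / weight[c] for c in (0, 1)]
        p_rain = [rain_true[c] / weight[c] for c in (0, 1)]
    return p_cloudy, p_sprinkler, p_rain, log_likelihoods


for seed in range(3):
    p_c, p_s, p_r, lls = run_em(COUNTS, seed)
    print(seed, round(p_c, 3), [round(p, 3) for p in p_s],
          [round(p, 3) for p in p_r], round(lls[-1], 2))
```

Comparing the printed parameters across seeds shows how much the outcome depends on the random starting point, including which Cloudy state ends up playing which role.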

Christian · Post by **Christian** » Fri Nov 30, 2007 1:06 am

Thank you Mark for your reply.

You have helped me a lot so far. I have not yet tried averaging the training results, as it is very complicated to do manually. I will need to write a program for that and I am not that far yet, but it is something I will do.
Or is there a way to do the averaging automatically?
For example, just let it learn 100 times and take the averages?

What I don't understand is the following:

When I am learning WetGrass_wetgrass_sprinkler_rain.txt,
why are the nodes Sprinkler and Rain touched? Does GeNIe not just count all the values and assign them to the WetGrass node?

As the file "WetGrass_wetgrass_sprinkler_rain.txt" contains no Sprinkler-Cloudy and no Rain-Cloudy information, I thought these nodes would be left untouched.

That is the only thing that is not clear now. Apart from that, I think I have understood everything.

mark · Post by **mark** » Fri Nov 30, 2007 1:52 am

Christian wrote: You have helped me a lot so far. I have not yet tried averaging the training results, as it is very complicated to do manually. I will need to write a program for that and I am not that far yet, but it is something I will do.
Or is there a way to do the averaging automatically?
For example, just let it learn 100 times and take the averages?

The only way to do it automatically is to write a small program. GeNIe cannot do it for you (at least at the moment).
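Such a small program could look roughly like the sketch below. It assumes a hypothetical callable `learn_once(seed)` that runs one learning pass and returns the CPTs as plain nested lists (one inner list per parent configuration). How to obtain those numbers from the real learning API is not shown here, and `fake_learner` is only a stand-in that returns a noisy table.

```python
import random


def average_cpts(learn_once, runs=100):
    """Call learn_once(seed) `runs` times and average the CPTs entry by entry.

    learn_once is a hypothetical callable returning
    {node_name: [[p, ...], ...]} with one inner list per parent configuration.
    """
    total = None
    for seed in range(runs):
        cpts = learn_once(seed)
        if total is None:
            total = {
                node: [[0.0] * len(column) for column in table]
                for node, table in cpts.items()
            }
        for node, table in cpts.items():
            for i, column in enumerate(table):
                for j, p in enumerate(column):
                    total[node][i][j] += p
    return {
        node: [[p / runs for p in column] for column in table]
        for node, table in total.items()
    }


def fake_learner(seed):
    """Stand-in for one learning run: a fixed table plus some noise."""
    rng = random.Random(seed)
    p = 0.8 + rng.uniform(-0.1, 0.1)
    return {"Rain": [[p, 1.0 - p], [0.2, 0.8]]}


avg = average_cpts(fake_learner, runs=100)
print(avg)
```

Because each run's columns sum to 1, the averaged columns do too. If the columns of a CPT can come out switched between runs, the runs would need to be aligned before averaging.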

Christian wrote:What I don't understand is the following:

When I am learning WetGrass_wetgrass_sprinkler_rain.txt,
why are the nodes Sprinkler and Rain touched? Does GeNIe not just count all the values and assign them to the WetGrass node?

As the file "WetGrass_wetgrass_sprinkler_rain.txt" contains no Sprinkler-Cloudy and no Rain-Cloudy information, I thought these nodes would be left untouched.

It does more than just counting values for WetGrass. It also tries to infer the most likely parameter values for the other nodes in the network.

Christian · Post by **Christian** » Fri Nov 30, 2007 4:26 pm

Thank you, Mark.

Everything is clear now.
