EM Learn

The engine.
oscarpc
Posts: 33
Joined: Mon Jun 04, 2012 1:17 am

EM Learn

Post by oscarpc »

Hi all,

I am trying to learn the parameters of my DBN with EM, but I have not been successful so far and I feel a bit frustrated :-(.
I created and printed my dataset, created my network, ... but when the line em.Learn is executed, the result is never DSL_OKAY. Any ideas?
I have attached my network and I have put below the code where the error happens, plus my toy dataset.
Thank you very much.
Oscar PC

vector<DSL_datasetMatch> dsMap(ds.GetNumberOfVariables());

string errMsg;
if (ds.MatchNetwork(theNet, dsMap, errMsg) != DSL_OKAY)
{
    cout << "Matching failed!" << endl;
}

DSL_em em;
if (em.Learn(ds, theNet, dsMap) != DSL_OKAY) {
    cout << "Cannot learn parameters... exiting." << endl;
    exit(1);
}

My dataset is like this:

Let's create a database
Name DS : Sodium
Name DS : Potassium
Name DS : Sodium_0
Name DS : Potassium_0
Name DS : Sodium_1
Name DS : Potassium_1
Name DS : Sodium_2
Name DS : Potassium_2
Name DS : Sodium_3
Name DS : Potassium_3
Name DS : Sodium_4
Name DS : Potassium_4
Name DS : Sodium_5
Name DS : Potassium_5
Name DS : Sodium_6
Name DS : Potassium_6
Name DS : Sodium_7
Name DS : Potassium_7
Name DS : Sodium_8
Name DS : Potassium_8
Name DS : DM
Name DS : DM_0
Name DS : DM_1
Name DS : DM_2
Name DS : DM_3
Name DS : DM_4
Name DS : DM_5
Name DS : DM_6
Name DS : DM_7
Name DS : DM_8
Database created
===================
-- variable info --
number of variables = 36
Variable 0
id: Age
is discrete
Missing element value: -1
State names:
Variable 1
id: Icd10
is discrete
Missing element value: -1
State names:
Variable 2
id: SoR
is discrete
Missing element value: -1
State names:
Variable 3
id: MoA
is discrete
Missing element value: -1
State names:
Variable 4
id: DoA
is discrete
Missing element value: -1
State names:
Variable 5
id: ToA
is discrete
Missing element value: -1
State names:
Variable 6
id: Sodium
is discrete
Missing element value: -1
State names:
Variable 7
id: Potassium
is discrete
Missing element value: -1
State names:
Variable 8
id: Sodium_0
is discrete
Missing element value: -1
State names:
Variable 9
id: Potassium_0
is discrete
Missing element value: -1
State names:
Variable 10
id: Sodium_1
is discrete
Missing element value: -1
State names:
Variable 11
id: Potassium_1
is discrete
Missing element value: -1
State names:
Variable 12
id: Sodium_2
is discrete
Missing element value: -1
State names:
Variable 13
id: Potassium_2
is discrete
Missing element value: -1
State names:
Variable 14
id: Sodium_3
is discrete
Missing element value: -1
State names:
Variable 15
id: Potassium_3
is discrete
Missing element value: -1
State names:
Variable 16
id: Sodium_4
is discrete
Missing element value: -1
State names:
Variable 17
id: Potassium_4
is discrete
Missing element value: -1
State names:
Variable 18
id: Sodium_5
is discrete
Missing element value: -1
State names:
Variable 19
id: Potassium_5
is discrete
Missing element value: -1
State names:
Variable 20
id: Sodium_6
is discrete
Missing element value: -1
State names:
Variable 21
id: Potassium_6
is discrete
Missing element value: -1
State names:
Variable 22
id: Sodium_7
is discrete
Missing element value: -1
State names:
Variable 23
id: Potassium_7
is discrete
Missing element value: -1
State names:
Variable 24
id: Sodium_8
is discrete
Missing element value: -1
State names:
Variable 25
id: Potassium_8
is discrete
Missing element value: -1
State names:
Variable 26
id: DM
is discrete
Missing element value: -1
State names:
Variable 27
id: DM_0
is discrete
Missing element value: -1
State names:
Variable 28
id: DM_1
is discrete
Missing element value: -1
State names:
Variable 29
id: DM_2
is discrete
Missing element value: -1
State names:
Variable 30
id: DM_3
is discrete
Missing element value: -1
State names:
Variable 31
id: DM_4
is discrete
Missing element value: -1
State names:
Variable 32
id: DM_5
is discrete
Missing element value: -1
State names:
Variable 33
id: DM_6
is discrete
Missing element value: -1
State names:
Variable 34
id: DM_7
is discrete
Missing element value: -1
State names:
Variable 35
id: DM_8
is discrete
Missing element value: -1
State names:
-- data records --
number of records = 8
1935 5 0 1 3 1 * * 5 4 5 5 5 5 5 5 5 6 5 7 5 5 6 5 5 7 0 0 0 0 0 0 0 0 0 2
1956 5 0 1 1 1 5 4 5 4 4 5 4 4 5 5 5 4 5 4 5 4 5 4 * * 0 0 0 0 0 0 0 0 1 1
1934 0 0 2 3 1 5 5 5 3 5 4 5 4 5 5 5 4 5 4 * * * * * * 0 0 0 0 0 0 0 1 1 1
1948 0 0 3 1 0 5 4 4 4 3 4 3 4 4 3 4 3 5 4 5 5 * * * * 0 0 0 0 0 0 0 0 1 1
1942 5 0 1 0 1 4 5 5 5 4 4 4 7 3 7 4 5 4 4 4 4 4 4 5 4 0 0 0 0 0 0 0 0 0 0
1937 5 0 1 5 1 6 4 5 5 4 4 4 4 5 4 5 3 4 2 4 3 5 2 5 2 0 0 0 0 0 0 0 0 0 0
1958 0 0 2 4 1 5 4 4 3 3 5 3 3 4 3 4 2 5 4 5 3 * * * * 0 0 0 0 0 0 0 1 1 1
1944 5 0 1 3 1 * * 6 5 6 5 6 5 6 5 7 3 * * * * 6 4 6 4 0 0 0 0 0 0 0 0 0 0
Attachments
DBN_v07.xdsl
(6 KiB) Downloaded 450 times
shooltz[BayesFusion]
Site Admin
Posts: 1417
Joined: Mon Nov 26, 2007 5:51 pm

Re: EM Learn

Post by shooltz[BayesFusion] »

oscarpc wrote:when the line em.Learn is executed, the result is always not DSL_OKAY.
What is the result then?

You can also add the following line at the start of your function to get diagnostic messages in the standard output stream:

ErrorH.RedirectToFile(stdout);
Martijn
Posts: 76
Joined: Sun May 29, 2011 12:23 am

Re: EM Learn

Post by Martijn »

Hi Oscar,

Since your network is a DBN, you may need a different approach to EM learning (Shooltz, does MatchNetwork work with DBNs already?)

On our wiki we have two tutorials dealing with performing EM on DBNs:
http://genie.sis.pitt.edu/wiki/SMILearn ... Dynamic_1)
http://genie.sis.pitt.edu/wiki/SMILearn ... Dynamic_2)

With these approaches you should be able to perform EM on your network (with maybe a little tweaking here and there).
You do have to make sure that your data file is properly formatted.
Variable names and states should match between network and data.
Also, you will want to ensure that your missing values are handled properly.
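
A minimal sketch of the missing-value point, assuming records are whitespace-separated and use * for a missing cell, as in the data records printed earlier in this thread, and that missing cells should end up as -1 (the "Missing element value" shown in the dump). parseCell, parseRecord and kMissing are hypothetical names for illustration, not SMILE functions.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical marker for a missing cell (matches the -1 shown in the dump).
const int kMissing = -1;

// Converts one cell: "*" marks a missing value, anything else must be an integer.
// std::stoi throws if the token is not a number.
int parseCell(const std::string& token) {
    assert(!token.empty());  // tokens produced by operator>> are never empty
    return token == "*" ? kMissing : std::stoi(token);
}

// Parses one whitespace-separated record into integer cells.
std::vector<int> parseRecord(const std::string& line) {
    std::vector<int> values;
    std::istringstream in(line);
    std::string token;
    while (in >> token) {
        values.push_back(parseCell(token));
    }
    return values;
}
```

Parsing every record this way before learning makes it easy to confirm that each row has the expected number of columns and that missing cells are marked consistently.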

Best,

Martijn
oscarpc
Posts: 33
Joined: Mon Jun 04, 2012 1:17 am

Re: EM Learn

Post by oscarpc »

Hi Martijn and Shooltz,

Thank you so much for your replies.

>>On our wiki we have two tutorials dealing with performing EM on DBNs:
>>http://genie.sis.pitt.edu/wiki/SMILearn ... Dynamic_1)
>>http://genie.sis.pitt.edu/wiki/SMILearn ... Dynamic_2)

Martijn, I have tested algorithms 6 and 7 and I haven't been successful so far.
I think I am indeed doing something wrong with the missing values. What exactly do you mean by "you will want to ensure that your missing values are handled properly"? How should I handle the missing values? I haven't done anything with them yet.

Thank you so much.
Oscar PC.
Martijn
Posts: 76
Joined: Sun May 29, 2011 12:23 am

Re: EM Learn

Post by Martijn »

Hi Oscar,

I've tried to run EM myself on your data and network (which was attached to your previous post) and found a few things you will have to fix.
I wasn't successful running the SMILE code, and when trying to learn the parameters using GeNIe I found the following:

1. There's a mismatch between the variable states and the states in the dataset; most of the variables in the network do not have enough states to match what is in the dataset. Could you add the necessary states to the nodes?
2. The SoP variable, which is present in the network, is not present in the dataset. I assume this is a latent variable you are trying to learn. The easiest solution should be to add columns to the dataset containing only missing values.
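
Both points can be checked before running EM. The sketch below is plain C++ and does not use SMILE: it assumes the data is held as rows of integer state indices with -1 for a missing cell, and the names maxObserved, fitsNode and addMissingColumn are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical marker for a missing cell.
const int kMissing = -1;

// Returns the largest observed value in one column, or kMissing if the
// column holds no observed values. Rows are records of state indices.
int maxObserved(const std::vector<std::vector<int>>& records, std::size_t column) {
    int best = kMissing;
    for (const std::vector<int>& row : records) {
        assert(column < row.size());  // every record must have this column
        best = std::max(best, row[column]);
    }
    return best;
}

// True when a node with stateCount outcomes can represent every observed value.
bool fitsNode(const std::vector<std::vector<int>>& records, std::size_t column,
              int stateCount) {
    return maxObserved(records, column) < stateCount;
}

// Appends a column made only of missing values, e.g. for a latent variable.
void addMissingColumn(std::vector<std::vector<int>>& records) {
    for (std::vector<int>& row : records) {
        row.push_back(kMissing);
    }
}
```

A column for which fitsNode returns false points at a node that needs more states; addMissingColumn corresponds to the all-missing column suggested for SoP.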

Best,

Martijn
oscarpc
Posts: 33
Joined: Mon Jun 04, 2012 1:17 am

Re: EM Learn

Post by oscarpc »

Hi Martijn,

This is so useful and so nice of you! Thank you so much. You helped me a lot.

>>The SoP variable, which is present in the network, is not present in the dataset, I assume this is a latent variable you are trying to learn. Easiest solution should be to add columns to the dataset with only missing values

Thank you very much for this. I am programming with SMILE on a Mac, so I don't use GeNIe very often and I am not very familiar with it. Nonetheless, I am going to start using it to detect this kind of error.
As you advised me, I have added a new column to my dataset, SoP, with all missing values. Thank you.

>>There's a mismatch between variable states and states in the dataset; most of the variables in the network do not have enough states to match up with what is in the dataset. Could you add the necessary states to the nodes?

Again, thank you very much. I have detected some errors in the states. However, I am very surprised to discover that certain variables do not have enough states. For example, the variable Age.
I defined Age in my Network as follows:


/****************************************************************************/
/* Age */
int Age = theNet.AddNode(DSL_CPT, "Age");
DSL_idArray AgeOutcomes;
for (i = 1900; i < 2013; i++) {
    char name[5]; // four digits plus the terminating '\0'
    sprintf(name, "%d", i); // writes i as a decimal string
    std::cout << "Let's create a database " << endl;
    AgeOutcomes.Add(name);
}
theNet.GetNode(Age)->Definition()->SetNumberOfOutcomes(AgeOutcomes);
/****************************************************************************/

Nonetheless, when I check Age in GeNIe, it has only State0 and State1, which is weird since I define 113 states, one per year from 1900 to 2012 (age == year of birth).
I don't understand what is going on here, but I will look into it.

Thank you so much for your help. I really appreciate it.

Kind regards,
Oscar PC
oscarpc
Posts: 33
Joined: Mon Jun 04, 2012 1:17 am

Re: EM Learn

Post by oscarpc »

Hi all,

I have also tried this piece of code, but the node still ends up with only two states: State0 and State1 ...

In addition, I am now a bit confused. It seems that the States we give to the nodes in the network are "strings", but the actual values in the Database will be integers. Am I right?

int Age = theNet.AddNode(DSL_CPT, "Age");
DSL_idArray AgeOutcomes;
for (i = 1900; i < 2013; i++) {
    string name;           // string which will contain the result
    ostringstream convert; // stream used for the conversion
    convert << i;          // insert the textual representation of i into the stream
    name = convert.str();  // retrieve the resulting string
    AgeOutcomes.Add(name.c_str());
}
theNet.GetNode(Age)->Definition()->SetNumberOfOutcomes(AgeOutcomes);
shooltz[BayesFusion]
Site Admin
Posts: 1417
Joined: Mon Nov 26, 2007 5:51 pm

Re: EM Learn

Post by shooltz[BayesFusion] »

oscarpc wrote:
for (i=1900;i<2013;i++){
string name; // string which will contain the result
ostringstream convert; // stream used for the conversion
convert << i; // insert the textual representation of 'Number' in the characters in the stream
name = convert.str(); //
Identifiers used by SMILE must start with a letter. Your values are numeric, so SMILE replaces them with generic ids.

To match year values to node outcomes, I suggest prefixing the values in your input file with a constant string. Otherwise SMILearn will use the years as indices into the node outcomes. The other option is to rebase the years by subtracting a constant such as 1900. In that scenario you don't need to prefix the values in the input file.
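
The two options could look roughly like this. This is only a sketch in plain C++; prefixedId, rebasedIndex and the "year_" prefix are illustrative choices, not part of SMILE.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Option 1: turn a numeric value into an identifier that starts with a letter.
// The default prefix "year_" is an arbitrary choice for illustration.
std::string prefixedId(int year, const std::string& prefix = "year_") {
    assert(!prefix.empty() &&
           std::isalpha(static_cast<unsigned char>(prefix[0])));
    return prefix + std::to_string(year);
}

// Option 2: shift the years so that the first outcome gets index 0.
int rebasedIndex(int year, int baseYear = 1900) {
    assert(year >= baseYear);  // earlier years would give a negative index
    return year - baseYear;
}
```

With option 1 the same prefixed strings would be used both as outcome ids in the network and as values in the input file; with option 2 the file keeps plain integers, but they start at 0 instead of 1900.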
oscarpc
Posts: 33
Joined: Mon Jun 04, 2012 1:17 am

Re: EM Learn

Post by oscarpc »

Dear Shooltz,

Thank you so much. I didn't know that the identifiers used by SMILE must start with a letter. Problem solved!

But I have a new question. Let's say that now one of the states of the network for node "Year_Date_of_birth" is "year_1978".
Nonetheless, the dataset will have stored the integer 1978. How does SMILE associate the state with the year?
Maybe it is something very trivial, so sorry in advance if I am asking something irrelevant.

Thank you again.
Kind regards.

Oscar PC
shooltz[BayesFusion]
Site Admin
Posts: 1417
Joined: Mon Nov 26, 2007 5:51 pm

Re: EM Learn

Post by shooltz[BayesFusion] »

Let's say that now one of the states of the network for node "Year_Date_of_birth" is "year_1978".
Nonetheless, the dataset will have stored the Integer 1978.
If the column in the input file contains non-numeric strings (like year_1978), then the dataset stores each value as an integer index into the alphabetically sorted list of those strings. You can access the actual strings with the following DSL_dataset method:

const std::vector<std::string>& GetStateNames(int var) const
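
To illustrate what an index into alphabetically sorted strings means, here is a small plain C++ sketch. It mimics the idea only and is not SMILE's implementation; sortedStates and stateIndex are hypothetical helpers.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Returns the distinct values of a column in alphabetical order.
std::vector<std::string> sortedStates(std::vector<std::string> column) {
    std::sort(column.begin(), column.end());
    column.erase(std::unique(column.begin(), column.end()), column.end());
    return column;
}

// Returns the index of value within the sorted state list, or -1 if absent.
int stateIndex(const std::vector<std::string>& states, const std::string& value) {
    assert(std::is_sorted(states.begin(), states.end()));
    auto it = std::lower_bound(states.begin(), states.end(), value);
    if (it == states.end() || *it != value) {
        return -1;
    }
    return static_cast<int>(it - states.begin());
}
```

Because the order comes from string comparison over the values that actually occur in the file, the resulting indices do not have to coincide with the order of the node's outcomes, which is why a separate state-matching step is needed.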
oscarpc
Posts: 33
Joined: Mon Jun 04, 2012 1:17 am

Re: EM Learn

Post by oscarpc »

First of all, thank you very much for your great help.

Unfortunately, I have still not been successful with the EM learning.

I think the network and the dataset are now in the correct format, but my SMILE program still doesn't learn. I have tried to run it in GeNIe and I get the same problem. I really do not know what the problem could be.

I wanted to attach my network to this message, but the attachment is too big. I am happy to attach it in any other way if possible.
I would be really grateful if you could give me some advice.

Kind regards,
Oscar
shooltz[BayesFusion]
Site Admin
Posts: 1417
Joined: Mon Nov 26, 2007 5:51 pm

Re: EM Learn

Post by shooltz[BayesFusion] »

oscarpc wrote:I wanted to attach my network to this message, but the attachment is too big. I am happy to attach it in any other way if possible.
Was the network zipped? If not, try to attach a zip file with the network, otherwise send me a private message.
oscarpc
Posts: 33
Joined: Mon Jun 04, 2012 1:17 am

Re: EM Learn

Post by oscarpc »

How silly of me! Thank you.
Attached to this message, you can find the network and the data.
Thank you so much.

Oscar PC
Attachments
Network&Data.zip
(42.67 KiB) Downloaded 389 times
oscarpc
Posts: 33
Joined: Mon Jun 04, 2012 1:17 am

Re: EM Learn

Post by oscarpc »

I have been debugging my program and found that it gets stuck on the line "ds.SetInt(colIdx, k, map[ds.GetInt(colIdx, k)]);" in the code that matches the states of the variables. Any ideas? I am really stuck on this problem. Thank you very much in advance.
Oscar PC

The code for matching states of variables looks like this:
for (int i = 0; i < dsMap.size(); i++) {
    DSL_datasetMatch &m = dsMap[i];
    int nodeHdl = m.node;
    int colIdx = m.column;

    DSL_idArray* ids = net.GetNode(nodeHdl)->Definition()->GetOutcomesNames();
    const DSL_datasetVarInfo &varInfo = ds.GetVariableInfo(colIdx);
    const vector<string> &stateNames = varInfo.stateNames;
    vector<int> map(stateNames.size(), -1);
    for (int j = 0; j < (int) stateNames.size(); j++) {
        const char* id = stateNames[j].c_str();
        for (int k = 0; k < ids->NumItems(); k++) {
            char* tmpid = (*ids)[k];
            if (!strcmp(id, tmpid)) {
                map[j] = k;
            }
        }
    }
    for (int k = 0; k < ds.GetNumberOfRecords(); k++) {
        if (ds.GetInt(colIdx, k) >= 0) {
            ds.SetInt(colIdx, k, map[ds.GetInt(colIdx, k)]);
        }
    }
}
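
One thing that may be worth checking here, offered as a guess rather than a diagnosis: when a dataset column is purely numeric, its state-name list is empty (the dump in the first post shows an empty "State names:" for every variable), so map has size zero and map[...] is indexed out of range. A defensive version of the remapping step, sketched in plain C++ on std::vector instead of DSL_dataset (buildStateMap and remapColumn are hypothetical names), could look like this:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Builds map[j] = index of stateNames[j] among outcomeIds, or -1 when no
// outcome carries that name.
std::vector<int> buildStateMap(const std::vector<std::string>& stateNames,
                               const std::vector<std::string>& outcomeIds) {
    std::unordered_map<std::string, int> position;
    for (std::size_t k = 0; k < outcomeIds.size(); ++k) {
        position.emplace(outcomeIds[k], static_cast<int>(k));
    }
    std::vector<int> map(stateNames.size(), -1);
    for (std::size_t j = 0; j < stateNames.size(); ++j) {
        auto it = position.find(stateNames[j]);
        if (it != position.end()) {
            map[j] = it->second;
        }
    }
    return map;
}

// Remaps one column in place. Negative (missing) cells are kept as they are.
// A cell with no usable entry in map is left unchanged and counted.
// Returns the number of such cells.
int remapColumn(std::vector<int>& column, const std::vector<int>& map) {
    int unmapped = 0;
    for (int& cell : column) {
        if (cell < 0) {
            continue;  // missing value
        }
        if (static_cast<std::size_t>(cell) >= map.size() || map[cell] < 0) {
            ++unmapped;  // out of range, or state name without a matching outcome
            continue;
        }
        cell = map[cell];
        assert(cell >= 0);  // a mapped cell always refers to a real outcome
    }
    return unmapped;
}
```

A non-zero return value from remapColumn would indicate that some values in the column have no matching outcome in the node, which is easier to act on than a crash inside the loop.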
shooltz[BayesFusion]
Site Admin
Posts: 1417
Joined: Mon Nov 26, 2007 5:51 pm

Re: EM Learn

Post by shooltz[BayesFusion] »

oscarpc wrote:Attached to this message , you can find the network and the data.
EM returns an error code for this network/data because of internal limitations in how it handles DBN structure (related to the presence of initial conditions). It's likely that we'll be able to relax these restrictions. I'll notify you once a new SMILE build is available.