EM-based learning problem

Looney
Posts: 12
Joined: Sat Jul 11, 2009 9:23 pm

EM-based learning problem

Post by Looney »

Hi All,
I have put together a belief network with about two dozen discrete nodes. 22 of these nodes are Boolean-based, and 3 of them are target nodes.
I intend to learn the parameter weights using DSL_em. I have all the data, but training the model is taking forever; to be specific, a single record takes about 3.5 minutes. I would like to train the model with 700,000 records from a database, which is not feasible at this learning speed. I am running the latest build with VC 9 SP1 on Windows Server 2003 with a quad-core CPU and 8 GB of RAM.

I am not sure what I am doing wrong. What follows is a brief description of my setup; hopefully someone can help me find my blunder.

1) Create the network, call SetDefaultBNAlgorithm with DSL_ALG_BN_EPISSAMPLING, and call SetNumberOfSamples with 493476.
2) Create a DSL_em with default settings.
3) Add all nodes with type DSL_CPT. For the Boolean variables I set up the outcomes with a DSL_stringArray containing "false" and "true", and I also store each string's index in the array so I can pass those indexes to learning. I was doubtful about this; is it the correct approach, given that in some cases I might have more discrete states than just Boolean? I set 3 nodes as targets, and each of my 3 target nodes has arcs from all the remaining evidence nodes as parents (20 arcs per target).
4) On the DSL_em I call SetRelevance(true), and I call ActivateRelevance on the network.
5) To train, I do the following:

Code: Select all

DSL_dataset ds;
std::vector<DSL_datasetMatch> matchings;
std::string err_msg;

// set up all the dataset variables by calling
// ds.AddIntVar(node->Info().Header().GetId(), DSL_MISSING_INT) for each node

ds.MatchNetwork(*m_network, matchings, err_msg);

while (More_records)
{
   ds.AddEmptyRecord();
   // for each variable, call ds.SetInt with the dataset variable index,
   // the record number, and the index of the matching state from the
   // DSL_stringArray described in step 3
}
Then I call Learn().
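
For concreteness, the final call is roughly the following (a sketch only: m_em stands for the DSL_em created in step 2 and configured in step 4, and the exact Learn overload may differ between SMILE versions):

Code: Select all

// run EM over the dataset built above;
// matchings was filled in by ds.MatchNetwork(...)
int res = m_em.Learn(ds, *m_network, matchings);
if (res != DSL_OKAY)
{
   // report the failure; the return code indicates what went wrong
}
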
As I said, I am almost certain I am doing something wrong or inefficient. Any help or suggestions would be highly appreciated.
Looney
Posts: 12
Joined: Sat Jul 11, 2009 9:23 pm

Post by Looney »

Hi All

Just a few quick questions to add to my initial post.

1- Would explicitly setting up state names in the dataset variables help speed up learning?

2- Would learning a dataset with multiple records be faster than learning one record at a time? I have been experimenting with learning 2000 records per chunk, which seems to have brought the per-record learning time down from 3 minutes to 6 seconds (a sketch of the chunking loop is below, after question 3).

3- I also observed that the first 2000-record chunk trained in 190 minutes and the second finished in 154 minutes; the third is still going. Should I expect the third to train even faster, or was that just coincidental?
In other words, should network parameter learning get faster as the network converges?
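
For context, the chunking loop I am describing looks roughly like this (a sketch only: CHUNK_SIZE, FetchNextChunk and rows are placeholders for my database code, and m_em is the DSL_em from my first post):

Code: Select all

const int CHUNK_SIZE = 2000; // records handed to EM per call

while (FetchNextChunk(rows, CHUNK_SIZE)) // placeholder for the DB loop
{
   DSL_dataset ds;
   // ds.AddIntVar(...) for every node, as in the first post
   std::vector<DSL_datasetMatch> matchings;
   std::string err_msg;
   ds.MatchNetwork(*m_network, matchings, err_msg);

   for (size_t r = 0; r < rows.size(); r++)
   {
      ds.AddEmptyRecord();
      // ds.SetInt(var, r, stateIndex) for each variable in rows[r]
   }

   // learn on this chunk, starting from the current network parameters
   m_em.Learn(ds, *m_network, matchings);
}
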

Thanks once again in advance. This library seems like a godsend for people like me who are new to machine learning.
Looney
Posts: 12
Joined: Sat Jul 11, 2009 9:23 pm

Post by Looney »

For the record, I observed that chunks take varying amounts of time to be learned by EM. I also wonder how to tell whether the network is overtrained. Is that possible?
shooltz[BayesFusion]
Site Admin
Posts: 1419
Joined: Mon Nov 26, 2007 5:51 pm

Post by shooltz[BayesFusion] »

Looney wrote:1- Would explicitly setting up state names in the dataset variables help speed up learning?
The presence of state names in the DSL_dataset object passed to EM has no effect at all. EM internally uses only the integer values stored in the dataset; it is assumed that the node state names were matched to indices before EM starts.
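
For illustration, translating an outcome label into the integer index before filling the dataset looks roughly like this (a sketch; GetOutcomesNames/FindPosition are the usual accessors, but check them against your SMILE version):

Code: Select all

// look up the integer index of an outcome label on a node
DSL_node *node = m_network->GetNode(handle);
DSL_idArray *outcomes = node->Definition()->GetOutcomesNames();
int stateIndex = outcomes->FindPosition("true"); // negative if the label is unknown
// ds.SetInt(var, rec, stateIndex) then stores the value EM actually uses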

Can you post your network and data file here? Alternatively, you can send me a private message with the network/data attached.
Looney
Posts: 12
Joined: Sat Jul 11, 2009 9:23 pm

Post by Looney »

Firstly, it's great to hear from you; thanks for your response.

When you said you'd like me to post my model, did you mean a .dsl file or the actual code that builds and learns the SMILE model? I am happy to share my code; the only issue is that it is in C++/CLI and a bit of C#, and it would span a few hundred lines. Would that be OK?

I could generate a file from my database to attach as well.
Please advise on the code vs. .dsl file question. I am at work right now, so I will only be able to post the model and the file in about 8-9 hours.
Thanks once again for your help and guidance.
shooltz[BayesFusion]
Site Admin
Posts: 1419
Joined: Mon Nov 26, 2007 5:51 pm

Post by shooltz[BayesFusion] »

Looney wrote:When you said you'd like me to post my model, did you mean a .dsl file or the actual code that builds and learns the SMILE model?


I didn't say 'model' in my reply to your post :) But I did in other replies on the forum. Model == network.

I'll need your network (preferably in .xdsl format) and the data file. The code may also help to pinpoint the issue(s). I'd like to check the complexity of your network first - I assume this is the reason you use EPIS sampling instead of the default exact inference algorithm.
Looney
Posts: 12
Joined: Sat Jul 11, 2009 9:23 pm

network

Post by Looney »

Hey Mate,

Please find attached the zip file with the network and a subset of the data. The attached network snapshot is trained on about 90,000 records (possibly repetitive), though the attached data file contains only the distinct records. Also, since the original network configured with 3 target variables was taking too long to learn each chunk, I simplified it to have only one target (a necessity rather than a nicety); the attached network reflects this new structure.

Even with this modification, EM takes between 40 minutes and 1 hour 45 minutes to learn a chunk of 2000 records.

While reviewing the xdsl file I noticed something that seemed unusual:

Code: Select all

<cpt id="ChartTimeFrame" mandatory="true">
			<state id="State0" />
			<state id="State1" />
			<probabilities>4e-008 0.99999996</probabilities>
		</cpt>
		<cpt id="CurrencyPair" mandatory="true">
			<state id="State0" />
			<state id="State1" />
			<probabilities>0.77731 0.22269</probabilities>
ChartTimeFrame was supposed to have about 5 different states and CurrencyPair about 30, but somehow they each show only 2. I think this may be a bug in my setup code, although every other variable in my network is supposed to have only 2 states. Would you say it is a bug too, or am I misreading something in the XML?

I am using EPIS because, even with the sizeable set of data I have, I still do not have a record for every possible combination that could occur, and I was also under the impression that it would be faster. However, I am quite concerned about overtraining the network, because what I took from the reference document is that accuracy declines with overtraining. Is that true, or did I misinterpret something?

I am quite happy with the inference speed, which currently clocks in somewhere between 25-30 seconds, though the goal is still the highest accuracy, even if it took twice the inference time.

Also, on an unrelated subject, I was going through "Tutorial 3: Building an Influence Diagram" and wondering whether an influence diagram can be made with a variable utility node (by variable I mean varying the utility values of 10,000, -5,000 and 500 on a record-by-record basis), or would you advise against doing that?

Any guidance, suggestions and constructive criticism would be highly appreciated.

Thanks very much in advance.

:D
Attachments
EpisSampling.zip
xdsl file and the data
(911.02 KiB) Downloaded 439 times
shooltz[BayesFusion]
Site Admin
Posts: 1419
Joined: Mon Nov 26, 2007 5:51 pm

Re: network

Post by shooltz[BayesFusion] »

Looney wrote:

Code: Select all

<cpt id="ChartTimeFrame" mandatory="true">
			<state id="State0" />
			<state id="State1" />
			<probabilities>4e-008 0.99999996</probabilities>
		</cpt>
		<cpt id="CurrencyPair" mandatory="true">
			<state id="State0" />
			<state id="State1" />
			<probabilities>0.77731 0.22269</probabilities>
ChartTimeFrame was supposed to have about 5 different states and CurrencyPair about 30, but somehow they each show only 2. I think this may be a bug in my setup code, although every other variable in my network is supposed to have only 2 states. Would you say it is a bug too, or am I misreading something in the XML?
CurrencyPair and ChartTimeFrame have two outcomes each. Are you creating the network by calling AddNode/AddArc in your own code, or is it the output of SMILearn's structure learning?
Looney
Posts: 12
Joined: Sat Jul 11, 2009 9:23 pm

Post by Looney »

No, I am creating the network by calling the AddNode/AddArc functions in my own code. I'll double-check where it is going off the rails.
Looney
Posts: 12
Joined: Sat Jul 11, 2009 9:23 pm

AddArc problem

Post by Looney »

I fixed the discrete states setup bug: I now explicitly call
SetNumberOfOutcomes first, followed by RenameOutcomes, roughly as in the sketch below.
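
For reference, the fixed setup now looks roughly like this (a sketch only: the node id and outcome labels are placeholders, not my actual variables, and the exact array calls may differ slightly by SMILE version):

Code: Select all

// create a discrete chance node and give it named outcomes:
// SetNumberOfOutcomes first, then RenameOutcomes
int handle = m_network->AddNode(DSL_CPT, "ChartTimeFrame");
DSL_nodeDefinition *def = m_network->GetNode(handle)->Definition();

DSL_stringArray names;
names.Add("M1");
names.Add("M5");
names.Add("M15");
names.Add("H1");
names.Add("D1");

def->SetNumberOfOutcomes(names.NumItems());
def->RenameOutcomes(names);
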
However, I have a new problem. After setting up all the evidence nodes, I set up my target node and then wire up arcs from all the evidence nodes. The last couple of AddArc calls throw a std::bad_alloc, even though the function returns DSL_OKAY.
I am using Visual Studio 2008 (VC 9) SP1 on an 8 GB, 64-bit machine; my app is only 32-bit, though. If I re-sequence my evidence input, the bad_alloc is still thrown when I arc up the last discrete node.
Have you seen this issue crop up before? Any help would be highly appreciated.
shooltz[BayesFusion]
Site Admin
Posts: 1419
Joined: Mon Nov 26, 2007 5:51 pm

Re: AddArc problem

Post by shooltz[BayesFusion] »

The CPT of the 'Reversal' node, with 30+ outcomes and 22 parents, will not fit into the memory available to a 32-bit application. Adding a new arc multiplies the CPT size of the child node by the number of outcomes in the parent.
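
For a rough sense of scale (the exact outcome counts are taken from this thread, so treat this as an estimate): with about 20 binary parents plus one 30-state and one 6-state parent, even a binary child's CPT holds 2 × 2^20 × 30 × 6 ≈ 3.8e8 doubles, roughly 2.8 GB at 8 bytes each - more than the ~2 GB address space of a 32-bit process, and larger still if the child itself has 30+ outcomes.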

You have two options:
1) restructure the network (reduce the outcome counts / drop some arcs)
2) use the 64-bit version of SMILE; unfortunately, this can only be done on Linux at the moment.
Looney
Posts: 12
Joined: Sat Jul 11, 2009 9:23 pm

Post by Looney »

Thanks so much for that info. I have decided to try the first option for now. I dropped the two discrete parents (with 30 and 6 outcomes, respectively), but I still have 18-19 parents with 2 outcomes each. Learning is now super fast in comparison to before; EM tore through 140K records overnight.

I will post back some test results when the model is a bit more trained.

Since my network is a lot smaller now, would you recommend trying out some other sampling algorithm? If so, which ones would you recommend?
Cheers
Last edited by Looney on Wed Jul 15, 2009 11:07 pm, edited 1 time in total.
shooltz[BayesFusion]
Site Admin
Posts: 1419
Joined: Mon Nov 26, 2007 5:51 pm

Post by shooltz[BayesFusion] »

Looney wrote:Since my network is a lot smaller now, would you recommend trying out some other sampling algorithm? If so, which ones would you recommend?
EPIS should be OK, but did you try the default exact inference algorithm after reducing the network complexity?
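
For reference, switching back is a one-line change (a sketch; net stands for your DSL_network, and DSL_ALG_BN_LAURITZEN is the constant for the default clustering algorithm):

Code: Select all

// switch from EPIS sampling back to the default exact (clustering) algorithm
net.SetDefaultBNAlgorithm(DSL_ALG_BN_LAURITZEN);
net.UpdateBeliefs(); // subsequent inference now uses exact clustering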
Looney
Posts: 12
Joined: Sat Jul 11, 2009 9:23 pm

Post by Looney »

No, I haven't yet, but that sounds like a very good idea. Does the default select the best algorithm based on some EM-generated metric, and if not, which algorithm is it?

Also, out of curiosity, I am interested to find out whether there are any plans to implement Gibbs sampling on the belief network side of things, and if not, whether it is possible to extend SMILE to add it. My mentor originally recommended that I use Gibbs sampling; that is the only reason I am asking.

I know Gibbs sampling is applicable when the joint distribution is not known explicitly but the conditional distribution of each variable is known. Knowing my model now, could you please shed some light on whether EPIS sampling would be better than Gibbs sampling, or is it hard to say? That is assuming the default does not beat both of them.
Thanks very much in advance.
:)
shooltz[BayesFusion]
Site Admin
Posts: 1419
Joined: Mon Nov 26, 2007 5:51 pm

Post by shooltz[BayesFusion] »

Looney wrote:No, I haven't yet, but that sounds like a very good idea. Does the default select the best algorithm based on some EM-generated metric, and if not, which algorithm is it?
The default algorithm is deterministic clustering running on top of the 'relevance layer', which simplifies the network. EM does not choose the inference algorithm.

Since I'm not a sampling expert, I'll try to find someone to help with the Gibbs vs. EPIS question.