Crash/error upon inference in a dynamic Bayesian network

Post by **PSGH** » Fri Mar 13, 2020 5:45 pm

Hello,

I am using GeNIe and SMILE in my PhD thesis in telecommunications. I am trying to learn the frame structure of a protocol by using a dynamic Bayesian network. I have designed the network, which contains eight nodes, each with a few dozen states, and is unrolled over a few hundred slices.

Here is my network:

Network1.xdsl: (296.33 KiB) Downloaded 3075 times

And here is a very simple data set I use for testing learning:

Trace1.txt: (2.33 KiB) Downloaded 3051 times

I use a Windows 7 Professional 64-bit system with 8GB RAM.

When I attempt to learn the parameters of my system from a data set, whatever its size, GeNIe crashes within the first few seconds, and Windows closes the application. The same happens when I run inference alone, whichever inference algorithm I use.

I tried running the learning via SMILE, and it gives me the error: Exception in thread "main" smile.SMILEException: SMILE error -1 in function Learn. Logged information: EM: inference cannot be performed

What I have understood so far is that my network is too complex for the Lauritzen algorithm, because GeNIe tells me so when I try inference with it. When I reduce the number of states of my nodes to 8, learning works, so the failure may be caused by a lack of RAM allocated to the application. However, when I try inference with a larger number of states for my variables (around 20, still fewer than my objective), it doesn't crash, but just stops with the error message "EM: Inference cannot be performed".

What do you think of my problem? Is it only caused by a lack of RAM, or is there something else wrong?

Post by **marek [BayesFusion]** » Sat Mar 14, 2020 3:48 pm

Hi,

Your network is indeed far too complex for any implementation of Bayesian networks. SMILE may well be the most efficient and fastest piece of Bayesian network software there is but it cannot handle problems that outgrow your computer's memory. One source of complexity is the number of states in your nodes: Some have 20, some have as many as 40. Please keep in mind that the size of the probability tables grows exponentially with the number of parents and the base of the function is the number of states. When the algorithm builds a junction tree, it often makes tables even larger, as it has to combine several families of nodes into one cluster. The second source of complexity is your node Trame -- if you look at the unrolled network, you will see that the node is the parent of two nodes in every slice (of 160). This results in even larger structural complexity and size of the junction tree. Yet another source of complexity is the number of slices -- 160 slices result in a network of well over 800 nodes, each with 20+ states. Quite impressive :-).
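To make the exponential growth concrete, here is a minimal Python sketch. It is not SMILE code: the function names are made up for illustration, and the 8 bytes per entry assumes each probability is stored as a double-precision number. The 20-state node is a hypothetical example, not one taken from the attached network.

```python
from math import prod
from typing import Sequence


def cpt_entries(node_states: int, parent_states: Sequence[int]) -> int:
    """Number of probabilities in a node's conditional probability table.

    One column per combination of parent states, one row per node state.
    """
    return node_states * prod(parent_states)


def table_bytes(entries: int, bytes_per_entry: int = 8) -> int:
    """Raw storage for a table, assuming 8-byte doubles (an assumption)."""
    return entries * bytes_per_entry


# Hypothetical 20-state node with 0..4 parents of 20 states each.
for n_parents in range(5):
    entries = cpt_entries(20, [20] * n_parents)
    print(f"{n_parents} parents: {entries} entries, {table_bytes(entries)} bytes")
```

Each extra 20-state parent multiplies the table by 20. A junction tree cluster that merges several such families grows in the same way, with the product taken over every node in the cluster.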

Now, how to solve the problem. One way is, of course, to reduce the complexity of your model by reducing the number of states of your variables. Another is perhaps observing the value of the node Trame before performing inference. Isn't it your intention? This will simplify the problem greatly. Yet another is to use an approximate algorithm, like EPIS-BN. EPIS-BN on your current network with 1000 samples, which may be enough given the deterministic character of your nodes, runs in less than one minute.
I hope this helps,

Marek

Post by **PSGH** » Mon Mar 16, 2020 9:48 am

Hi,

Thank you for your answer.

I know my network is quite big, but, actually, it is only a first version, with relatively few connections, so I knew exact inference would be a bit complicated. Also, the only observation is the Bit value, everything else has to be inferred, as it is the underlying structure of the frame.
I think there must be a way to learn such a big network, as there are barely a few million variables to learn, as opposed to dozens or even hundreds of millions in neural networks.

I'm perfectly fine with using EPIS Sampling or any other approximate inference algorithm; however, they all cause GeNIe to crash with my current number of states. When I reduce the number of states, inference alone no longer crashes, but when I try learning I get the error message "EM: Inference cannot be performed" after it has run for some time. So I tried with SMILE, but then, whether it is inference or learning, it gives me the error message "SMILE error -1 in function Learn. Logged information: EM: inference cannot be performed", and is also far slower. That's incomprehensible, so I must have done something wrong. Here is the SMILE Java wrapper code I use:

JsmileCode.txt: (5.29 KiB) Downloaded 2960 times

So, I can conclude that the problem with inference on GeNIe was about memory, so I need to find ways to allocate more. Is it possible? Concerning the issues with learning on GeNIe, and inference and learning on SMILE, I always get an error about inference which cannot be performed, and I have absolutely no idea why.

You talk about setting the number of samples for EPIS sampling to 1000, but how do you do that? When you say it runs in less than a minute, is it with GeNIe or with SMILE? And do you tweak the memory allocated to the program? If so, I'd like to have a way to estimate the memory needed for each algorithm as well as its complexity, to choose a machine for computation accordingly, and allocate the necessary resources. By the way, are GeNIe and SMILE designed for multi-threaded processing and CUDA? My school has a workstation with an i9 9900K and an NVIDIA RTX 2080Ti, along with 32GB of RAM. If the software can use all of this processing power, the learning shouldn't be a problem.

Thank you in advance,
PSGH

Post by **marek [BayesFusion]** » Wed Mar 18, 2020 10:51 pm

Hi PSGH,

Let me give some general thoughts and then answer your specific questions. You are dealing with a problem of exponential complexity, so increasing the amount of memory will help a little but only a little. An exponential curve rises faster than you can keep up with by adding more memory. Neural networks work on different principles than Bayesian networks and it is not fair to compare the number of nodes in a neural network to the number of nodes in a Bayesian network. GeNIe/SMILE may well be the fastest package for Bayesian network inference, so you are really testing what the theoretical limits are.

EPIS-BN works fine on my computer and runs under one minute with 1000 samples in GeNIe. To change the number of samples, please look at Network properties, tab Inference. I have not tried EM but I believe it will still have problems, as it uses quite complex data structures. Also, it performs inference repeatedly and you may find that the time EM takes is too long for you.

SMILE should be faster than GeNIe, as GeNIe is just a GUI to SMILE. To compare GeNIe to SMILE, you will have to call the same algorithm and use the same parameters. Please look at the SMILE manual to make sure that you do this.

SMILE can be run in a multi-threaded environment. GeNIe does not use multiple threads. No special facilities for CUDA.
I hope this helps,

Marek

Post by **PSGH** » Thu Mar 19, 2020 12:14 pm

Hi,

I have modified my network to reduce the CPT of the variables while retaining an equivalent network.

Here is my new network:

NetworkCPTReduced.xdsl: (340.09 KiB) Downloaded 2620 times

There are no longer crash problems caused by memory when doing inference. However, I still have either crashes or the "EM: inference cannot be performed" errors when I try learning parameters. What causes this error? I believe it is not memory, as I monitor it, and it stays well under 2 GB. What is contradictory, however, is that I can perform learning with a low number of states for my variables (10-15), which implies lower RAM usage.

I would like to know what exactly the log(p) given at the end of learning represents. I thought it was the log-likelihood of the learning data given the learned parameters, but that doesn't make sense: I get a log-likelihood far lower than what I would get by drawing my observations at random from a uniform distribution. Or is it not a base 10 logarithm? What is the formula of log(p)?

Also I tried a bit of validating, and I either get the "EM: inference cannot be performed" error, or the "Can't instantiate data record #1; error -4 during SetTarget on node Bit" error.

Could you help me understand what's going on? So many problems with so little information on errors is very frustrating; isn't there an error log file generated somewhere?

Post by **marek [BayesFusion]** » Fri Mar 20, 2020 1:17 pm

Hi PSGH,

I looked at your data set. You have only one variable measured and only one record to learn from. Is this correct? I wonder what sense this makes in theory -- I assume you know what you are doing :-).

The EM algorithm has a set of complex data structures that optimize learning. It looks like your model is too complex for these structures to fit in memory and this is the reason why the algorithm fails. I do want to stress again that your model is quite challenging because of the number of states in individual nodes and the number of time slices.

The logarithm in the log-likelihood is natural, i.e., e-base.
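As a rough way to compare a reported log(p) against the uniform baseline mentioned in the question, the sketch below computes that baseline in natural-log units and converts it to base 10. It is an illustration only: the function names are made up, and the figures used (160 observations of a two-state variable) are assumptions, since the thread does not state how many states the Bit node has.

```python
import math


def uniform_log_likelihood(n_observations: int, n_states: int) -> float:
    """Natural-log likelihood of n independent observations, each drawn
    uniformly from n_states possible values."""
    return n_observations * math.log(1.0 / n_states)


def to_log10(natural_log_value: float) -> float:
    """Convert a natural-log value to base 10."""
    return natural_log_value / math.log(10.0)


# Assumed figures: 160 observed slices, binary observed variable.
baseline = uniform_log_likelihood(160, 2)
print("uniform baseline, natural log:", baseline)
print("uniform baseline, base 10:   ", to_log10(baseline))
```

The natural-log figure is about 2.3 times larger in magnitude than the base-10 one, so reading a natural-log value as if it were base 10 makes the fit look much worse than it is.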

GeNIe does not implement validation in dynamic networks. The theoretical problem is the question of what exactly you are trying to predict, i.e., which variable at which time step. We should have blocked validation in DBNs and we will do it now, prompted by your questions (thanks!). The only way you can run validation in DBNs is to run it on unrolled networks. We were able to run validation on your network after unrolling it. However, in order to obtain meaningful results from validation, your data file should have more rows, with at least one row containing a value for each state of the class node.
I hope this helps,

Marek

Post by **PSGH** » Fri Mar 20, 2020 3:28 pm

Hi,

The data set I attached was just a trivial one featuring only one observation, the real one will contain thousands of them. I tried learning on this trivial data set in order to check if learning worked, and it didn't.

I am currently trying to reduce memory consumption further, but I don't know how far I need to go. I would like to have a way to estimate the amount of memory needed for EM along with its complexity, before trying out learning. I want to know if it is better to have lots of variables with many arcs but fewer states for each, or the opposite. I have created a new version of my previous network, modelling the same phenomenon, while slightly reducing the total number of parameters to learn and the maximum number of parameters per variable. However, I have almost doubled the number of variables and more than doubled the number of arcs. I can't upload it, as it is too large (more than 4MB). I plan on doing parent divorcing if possible to reduce the number of parameters of the model even further, and/or switching to using SMILE directly with Java on a more powerful machine.

Concerning validation, I was able to perform it without unrolling the network, using EPIS sampling. I'm not sure if it gives the right results, but it produces results without error messages. It wasn't functioning before because I was trying to predict the only observed variable. Should GeNIe be able to do so? That is to say, test the network not as an inference machine but as a generative model.

Also, is there some way to get more detailed error messages, like a verbose or debug mode, or even log files? That would help me to solve my problems myself.

Post by **marek [BayesFusion]** » Tue Mar 24, 2020 12:18 am

Hi PSGH,

A general rule is that the complexity of BNs is a function of their connectivity. The number of parameters in a node goes up exponentially with the number of parents of the node. Multiply connected networks lead to complex join trees in inference. So, the short answer is that the network's connectivity can be thought of as the exponent of the exponential function and bears more of the blame than the number of states, although the number of states is the base of the exponential function, so it also plays an important role.

Validation will not be predictable in DBNs I'm afraid.

I'm afraid GeNIe and SMILE do not have a debug/verbose mode accessible to users.
Cheers,

Marek

Post by **PSGH** » Tue Mar 24, 2020 9:32 am

Hi,

I understand the complexity of the network well, and how to manage it. What I am asking for is information on the complexity and the memory needs of the EM algorithm, which I can't estimate. They seem to be a multiple of what is needed for inference, as all approximate inference algorithms work fine on my network, but EM "cannot perform inference" after a number of iterations. So how do I estimate that precisely, knowing the connections and states of my network? I need that in order to dimension my network and the hardware I will use; I can't just proceed by trial and error.

What is the stopping criteria of EM (value), and is it possible to modify it? It would be strange to not be able to. Is there different versions of GeNIe and SMILE allowing more tunability and more precise error messages?

Wed Mar 25, 2020 8:41 pm

What is the stopping criteria of EM (value), and is it possible to modify it?

We use a ratio of log likelihoods from two subsequent iterations of EM. In the current version it's not possible to fine-tune this ratio. We'll be adding this as an option in the next major release of SMILE.

Is there different versions of GeNIe and SMILE allowing more tunability and more precise error messages?

No, the publicly available binaries are the only ones we have.

I agree that EM error messages should be improved. I'll be looking at the behavior of EM with your network/data. The NetworkCPTReduced model has 40 time steps, while the Trace1.txt from the original message uses 160 (as does the initial network). I assume that for CompteurChampActuel and CompteurBitActuel the step 159 should be changed to 39. Is that correct?

Post by **PSGH** » Thu Mar 26, 2020 10:33 am

Hi,

Concerning the threshold on the ratio of log likelihood from two subsequent iterations, what is its value?

I'd be glad if you could tell me why EM doesn't work with my network, but what I seek is a general answer, not one specific to this network, as it isn't a final version, just the current one, as I'm still in the prototyping phase. The problem lies probably around available RAM, but I need a way to estimate the need in RAM before running the EM.

I tried to run EM on my network with both GeNIe and SMILE while changing the number of states of the variables. The result is that GeNIe crashes when EM's RAM needs get close to 2 GB. When I run the same learning in SMILE, with a 4 GB limit on RAM set in the Java VM, the networks that crashed in GeNIe do not crash. However, when I increase the number of states of my variables even further, the outcome depends on which variable gets the extra state: sometimes it works, sometimes it fails with "EM: inference cannot be performed". So RAM may not be the only issue.

The trace1.txt contains 160 time slices, so the network should be configured for 160 time slices, I just forgot to change the number of slices as one of my datafiles contains only 40 time slices.

I think GeNIe and SMILE are very powerful, flexible, and easy-to-use software, but they are a bit too much like "black boxes". There should be much more detailed error messages, or the possibility to step further into the code while in debug mode (for SMILE). I also think the ability to use CUDA cores would be very much appreciated, as it could lead to up to a hundredfold increase in processing speed.

Mon Mar 30, 2020 7:34 pm

I've used SMILE compiled with full debugging information to run EM on your network with slice count set to 40. The EM fails due to insufficient memory.

The first part of the algorithm which puts pressure on RAM is the probability of evidence calculation. It uses a special algorithm based on a jointree by default, but automatically falls back to a chain rule-based option. If you set your network to use EPIS sampling, you'll be trading memory usage for time and precision (as EPIS is sampling, therefore inexact).

The second part of EM which requires substantial memory is the calculation of the expected sufficient statistics; this is where your code reports the "inference cannot be performed" error. The data structures created in this phase are closely related to the jointree used by the exact inference algorithm. Unfortunately, there's no workaround for this in SMILE. I also don't think increasing RAM from 8 GB to 16 or 64 GB is going to help here: the structure of your unrolled network causes the jointree to grow very quickly.

Concerning the threshold on the ratio of log likelihood from two subsequent iterations, what is its value?

The current EM implementation in SMILE converges when prevLogLik / currentLogLik is less than 1.0001. Note that changing this value will not change the memory requirements; the condition is checked after each successful EM iteration.

Post by **PSGH** » Tue Mar 31, 2020 8:48 am

Hi,

Thank you for your answer. I strongly suspected it, but it surprises me a bit that EM takes that much memory, as inference doesn't need nearly as much when using a sampling algorithm. I can do inference on my network with 160 slices, and it only takes a few hundred MB, while EM would need countless GB? Is that always the case with EM, or is it just the SMILE implementation for increased speed? I thought expected sufficient statistics were directly obtained from the marginals by simple multiplication, so there would be no need for any additional resources.

Whatever the answer, I need a formula to estimate the RAM needed for EM directly from the number of nodes, states, arcs and slices of my networks, as well as the nodes excluded from learning. If I knew this, I would be able to adapt my network to the hardware I have available. You said it was linked to the memory needed for the jointree of the graph, this I can estimate quite easily, but is it a multiple of that? If yes, what is it?

Concerning the stopping criterion of EM, prevLogLik / currentLogLik less than 1.0001 doesn't make any sense to me. Isn't the LogLikelihood supposed to increase at each iteration? That would mean this ratio is always inferior to 1, so it converges instantly. Didn't you reverse the ratio?

I'd be really glad if you could answer me quickly, as I need to have some results of learning within the week.

Tue Mar 31, 2020 4:46 pm

Is that always the case with EM, or is it just the SMILE implementation for increased speed?

I believe this was a performance-related decision. I'll forward this question to the person who wrote SMILE's EM implementation.

Whatever the answer, I need a formula to estimate the RAM needed for EM directly from the number of nodes, states, arcs and slices of my networks, as well as the nodes excluded from learning

As far as the size of the join tree is concerned, there is no simple formula that will calculate it from the number of nodes or arcs. It all depends on the connectivity of the network. If you want to have a better grasp on this problem, any book on Bayesian networks should explain this. There is also a good Coursera course by Daphne Koller on probabilistic reasoning (she also wrote a book on Bayesian networks) that you may consider following.

Concerning the stopping criterion of EM, prevLogLik / currentLogLik less than 1.0001 doesn't make any sense to me. Isn't the LogLikelihood supposed to increase at each iteration? That would mean this ratio is always inferior to 1, so it converges instantly. Didn't you reverse the ratio?

Did you consider the fact that it is the logarithm of a fraction, and is therefore negative? So you are right that the value gets larger, but its absolute value gets smaller.
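A small numeric illustration of the criterion quoted in this thread (prevLogLik / currentLogLik less than 1.0001), assuming both log-likelihoods are negative. The function name and the sample values are invented for the example; this is a sketch of the arithmetic, not SMILE's actual code.

```python
def em_converged(prev_log_lik: float, current_log_lik: float,
                 threshold: float = 1.0001) -> bool:
    """Convergence test as described in the thread.

    With negative log-likelihoods, an improving fit moves the value toward
    zero, so |prev| > |current| and the ratio is greater than 1. The ratio
    approaches 1 from above as the improvement per iteration shrinks.
    """
    return prev_log_lik / current_log_lik < threshold


# Large improvement: -120 -> -100, ratio 1.2, keep iterating.
print(em_converged(-120.0, -100.0))

# Tiny improvement: -100 -> -99.995, ratio about 1.00005, stop.
print(em_converged(-100.0, -99.995))
```

So the ratio is not always below 1: it stays above 1 while the log-likelihood is still increasing, and only drops under the threshold once successive iterations barely differ.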

Post by **PSGH** » Wed Apr 01, 2020 8:47 am

Hi,

I know how a junction tree is formed, but even when I estimate the number and size of the cliques in a very pessimistic way, the real memory usage is always a multiple of that.
Here is how I estimate it: for each node of each slice, I consider the node together with all of its neighbours as one clique, then I multiply together the numbers of states of the nodes within each clique, and then multiply again by 8 to get the size in bytes of the network's cliques.
This should be a more than sufficient upper bound; however, it is always inaccurate, and I've observed that the memory needs do not grow linearly with the number of slices as they should: whether I learn on 10 slices or 40, the difference is barely 1%.
What's wrong with my way of estimating the RAM needs?
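For reference, the estimate described above can be written as the following Python sketch. The function name and the toy three-node chain are hypothetical, and summing the per-node figures over the whole network is an assumption about how the totals are combined. Note that this neighbourhood-based figure is not guaranteed to be an upper bound on junction tree size: triangulation can add fill-in edges, producing cliques that contain more nodes than any single node's neighbourhood.

```python
from math import prod
from typing import Dict, Iterable, Tuple


def neighbourhood_estimate_bytes(states: Dict[str, int],
                                 arcs: Iterable[Tuple[str, str]],
                                 bytes_per_entry: int = 8) -> int:
    """Estimate described in the post: treat each node plus its neighbours
    as one clique, multiply their state counts, and multiply by 8 bytes.
    The per-node figures are summed (an assumption)."""
    # Start each group with the node itself, then add both arc endpoints.
    neighbours = {node: {node} for node in states}
    for parent, child in arcs:
        neighbours[parent].add(child)
        neighbours[child].add(parent)
    return sum(bytes_per_entry * prod(states[m] for m in group)
               for group in neighbours.values())


# Toy chain A -> B -> C with 3 states per node (hypothetical).
toy_states = {"A": 3, "B": 3, "C": 3}
toy_arcs = [("A", "B"), ("B", "C")]
print(neighbourhood_estimate_bytes(toy_states, toy_arcs))
```

For the toy chain the groups are {A, B}, {A, B, C} and {B, C}, giving 9 + 27 + 9 = 45 table entries, or 360 bytes.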

BayesFusion Support Forum
