Issues with PC, questions to continuous networks, and more

CptQuak
Posts: 3
Joined: Wed Oct 11, 2023 11:14 am

Issues with PC, questions to continuous networks, and more

Post by CptQuak »

Hi, I have run into some problems when playing with GeNIe/SMILE, so maybe you can help me. For context, I'm using pysmile on Linux, Python version 3.10.12, pysmile 2.0.10. I'm working on a rather complex dataset with about 50 variables and 1 million records. I'm also sharing a sample dataset and some networks in which I noticed the following issues: https://github.com/CptQuak/smile_problems

1. When using the PC algorithm in pysmile there is no way to set the maximum search time like in GeNIe. There is an option to set this time for the BayesianSearch algorithm, which makes me wonder why it is not available for PC.
2. When using PC for both discrete and continuous networks, I think there is a problem with the Max Adjacency parameter, as it makes no difference to the final number of parents in the network. I often see nodes with more than 5 parents when the parameter is set to 3. (All the shared networks were trained with max 3.)
3. When we learn the parameters of a discrete network using EM, I find many cases where a probability is 0. Since the prior is 0, the posterior will always be zero for such states. This sometimes makes sense in my data (e.g. we'll never observe a temperature of 288 K at 10 km above sea level), but I've been told it can be problematic because it makes such states completely impossible (e.g. they could still occur in the real world due to sensor reading error). I could easily go through the whole CPT, add some value like 1e-6 to the zero-probability states and normalise, but shouldn't there be an option to do this automatically in SMILE?
4. When trying to set discretisation intervals for continuous variables using the SMILE interface, I noticed that there is no way to set infinity as an upper bound. The parameter requires a float value, whereas in GeNIe we can type inf/-inf in the definition of a node.
5. For each node in a continuous network I have specified the discretisation intervals, and there is a problem with inference; you can find these networks in the GitHub repo (with intervals: cont_discretization_intervals.xdsl, without: cont_targets.xdsl). Inference on the plain continuous network is quite fast, but once the discretisation intervals are specified it takes a very long time to run (I let it run for over 2 h for a single sample in SMILE and it still didn't finish, while GeNIe just crashes).
6. I am also not sure whether specifying targets actually reduces the time needed for inference. I ran an inference loop to gather information about the variables RUL, X_flow_mod, X_eff_mod (for X substitute lpc, hpc, hpt, lpt). About 20 variables were treated as observed and evidence was set for them. I ran an experiment with 100 samples in each network (with and without targets), but the times were essentially identical: 2 min 17 s vs 2 min 19 s (cont_targets.xdsl vs cont_notargets.xdsl).
7. When performing inference we are using an approximate algorithm to sample from the posterior distribution. There is an option to set the number of samples, but honestly I'm not sure it's working. In theory, if we set it to some low value like 1 or 10 (ignoring for now that this would misrepresent the posterior), it should run quite fast. But again, I see no change in run time whether it's 10, 10,000 or 10,000,000 samples.
8. I'm also interested in more details about continuous networks, so I made a smaller test network, hpt.xdsl. For example, I want to perform inference on variable P45, and let's say we set evidence on Nc (value 8000).
- First of all, why is the discretisation necessary here? I still have access to the mean and stddev of that node, which feel more useful to me.
- Next, the output at the bottom in GeNIe tells me that there was a large number of under/overflow samples when discretising nodes. I'm not sure how to feel about that; something like 70% of samples fall outside my specified intervals. Is this similar to rejection sampling in MCMC algorithms, or what happens here? Also, it reports 10,000,000 samples when the network properties say only 10,000; did it automatically increase the count so that there are some non-rejected samples?
- Can I somehow access the drawn samples from SMILE so that I can calculate the mean, highest-density intervals, etc. in my own code?
- Finally, sampling in SMILE on the continuous network runs rather slowly. I wonder whether, when I iterate over the rows of the dataset, it has to perform the discretisation every single time, or whether it remembers the result once computed.
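To illustrate what I'd expect from the sample count in item 7, here is a generic Monte Carlo sketch in plain Python. Nothing here is SMILE-specific; it is just my assumption of how a sample-count knob should behave: the work grows linearly with the count and the estimate tightens, which is why identical run times for 10 and 10,000,000 samples surprised me.

```python
import random
import statistics

def mc_mean(sample_count, seed=0):
    """Naive Monte Carlo estimate of E[X] for X ~ Uniform(0, 1).

    A generic sampler, not SMILE's algorithm: the work done is linear
    in sample_count, and the spread of the estimate shrinks as the
    count grows.
    """
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(sample_count)) / sample_count

# Spread of the estimate across 20 independent runs:
small = [mc_mean(10, seed=s) for s in range(20)]
large = [mc_mean(10_000, seed=s) for s in range(20)]
print(statistics.pstdev(small) > statistics.pstdev(large))  # prints True
```

With 10 samples the estimates scatter widely around 0.5; with 10,000 they barely move, and the runtime difference is obvious on a stopwatch. That is the behaviour I was expecting from SMILE's sample-count setting.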
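And to make my question about the under/overflow counts concrete, this is the toy model I have in mind for what might happen during discretisation. It is purely my assumption about the mechanism, not SMILE's actual code:

```python
def discretize(draws, edges):
    """Bin continuous draws into the intervals defined by `edges`,
    counting draws outside [edges[0], edges[-1]) as under/overflow
    (my guess at what the numbers reported by GeNIe mean).
    """
    counts = [0] * (len(edges) - 1)
    underflow = overflow = 0
    for x in draws:
        if x < edges[0]:
            underflow += 1
        elif x >= edges[-1]:
            overflow += 1
        else:
            # linear scan is fine for a sketch; bisect would be faster
            for i in range(len(edges) - 1):
                if edges[i] <= x < edges[i + 1]:
                    counts[i] += 1
                    break
    return counts, underflow, overflow

print(discretize([-1.0, 0.5, 1.5, 3.0], [0.0, 1.0, 2.0]))
# prints ([1, 1], 1, 1)
```

If SMILE keeps drawing until it has enough in-range samples, that would explain the 10,000,000 figure, but that is exactly what I would like to have confirmed.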
shooltz[BayesFusion]
Site Admin
Posts: 1417
Joined: Mon Nov 26, 2007 5:51 pm

Re: Issues with PC, questions to continuous networks, and more

Post by shooltz[BayesFusion] »

Thanks for the long and detailed post. We will try to provide answers to all your questions in multiple posts.

1. We have added an entry in our issue tracker re: missing maxSearchTime for the PC learning. This is a simple change, and we should be able to implement it before the upcoming SMILE release.

2. We are aware of the max adjacency problem with PC and are currently working on it.

6. The performance improvement to be gained by setting explicit targets depends on the network structure and on the actual targets selected - there are certainly cases where the output of the relevance layer with targets is no simpler than the full network, and some time is required to determine which parts of the network can be pruned when targets are present. AFAIR the following model was built for an industrial diagnosis application and shows very significant gains for typical evidence sets: https://repo.bayesfusion.com/network/pe ... nosis.xdsl
piotr [BayesFusion]
Site Admin
Posts: 61
Joined: Mon Nov 06, 2017 6:41 pm

Re: Issues with PC, questions to continuous networks, and more

Post by piotr [BayesFusion] »

4. When trying to set discretisation intervals for continuous variables using the smile interface, I noticed that there is no way to set 'infinity' as an upper bound. The parameter requires a float value, whereas in Genie we can type inf/-inf. in the definition of a node.
In PySMILE, you can set infinity as the upper discretization interval as follows:

Code:

net.set_node_equation_bounds("nodeID", lower_bound, float("inf"))
shooltz[BayesFusion]
Site Admin
Posts: 1417
Joined: Mon Nov 26, 2007 5:51 pm

Re: Issues with PC, questions to continuous networks, and more

Post by shooltz[BayesFusion] »

Re: item 3, we will add an option to avoid zeros in the learned parameters. We already have similar functionality in the hybrid inference sampling algorithm. The replacement value for a zero will be 1/N, where N is the number of records in the dataset.
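In the meantime, the manual workaround from item 3 can be done from pysmile. The helper below is a plain-Python sketch operating on the flat CPT list; it assumes SMILE's definition layout, where the node's own outcomes vary fastest, so every consecutive chunk of `outcome_count` values is one conditional distribution. The epsilon default and the chunking assumption are mine, not shipped behaviour:

```python
def smooth_cpt(flat_probs, outcome_count, eps=1e-6):
    """Replace zeros in a flat CPT with `eps` and renormalise each
    conditional distribution (each chunk of `outcome_count` values).

    For the 1/N rule described above, pass eps=1.0/record_count
    instead of the default.
    """
    smoothed = []
    for i in range(0, len(flat_probs), outcome_count):
        col = [p if p > 0 else eps for p in flat_probs[i:i + outcome_count]]
        total = sum(col)
        smoothed.extend(p / total for p in col)
    return smoothed
```

Applied per node, this would look roughly like `net.set_node_definition(h, smooth_cpt(net.get_node_definition(h), net.get_outcome_count(h)))` - please check the exact accessor names against the pysmile documentation.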