Learning structure with missing data

Post by **liid** » Tue May 04, 2021 9:47 pm

Hi,

1. From what I read in your manual, the structure learning algorithms do not work with incomplete data.
Are there any alternatives that can be used in GeNIe for learning structure in the presence of missing values and latent variables?

2. Does GeNIe offer any FS algorithms?

Thanks!

Post by **marek [BayesFusion]** » Wed May 05, 2021 2:52 pm

When learning the structure with missing values, you can rely on any available ML methodology for handling them. One way is to delete all records with missing values. This works when the number of missing values is small. Another way is imputation: you replace each missing value with a value/label that will then make it into the learned model. I would advise this when the number of missing values is large. Really, any available method for dealing with missing values will do.
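
The two options above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records, the column names, and the `"Missing"` label are hypothetical, and `None` stands for a missing value.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"Fever": "Yes", "Cough": "No", "Flu": "Yes"},
    {"Fever": None, "Cough": "Yes", "Flu": "No"},
    {"Fever": "No", "Cough": None, "Flu": "No"},
    {"Fever": "No", "Cough": "No", "Flu": "No"},
]


def drop_incomplete(rows):
    """Listwise deletion: keep only records without any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_label(rows, label="Missing"):
    """Imputation with an explicit label that becomes an extra state."""
    return [{k: (label if v is None else v) for k, v in r.items()} for r in rows]


complete_only = drop_incomplete(records)
labelled = impute_label(records)

print(len(complete_only), "of", len(records), "records survive deletion")
print(labelled[1])
```

Deletion discards half of this tiny data set, which is why it is only attractive when few values are missing. Neither function modifies the original records.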

I assume that by FS you mean "feature selection." There is nothing automatic built into GeNIe, although there are several techniques that I have been recommending to people doing practical work on learning BN models. One is selecting features based on the strength of their dependence on the class variable(s). Please create a Naïve Bayes model and then use GeNIe's arc strength capability. You can view a spreadsheet of arcs with their strengths. Very often, selecting the nodes whose arcs are at the top of the list, especially when considering MAX impact, will lead to higher prediction accuracy. The second approach is through structural learning. When you run the Bayesian Search, PC, or GTT algorithm, the result will generally tell you which variables are directly connected to the class variable(s). You can use this information in selecting features. These two approaches are generally very effective in selecting features and building accurate models.
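
Outside GeNIe, the first idea (ranking features by the strength of their dependence on the class variable) can be approximated with mutual information. The sketch below is a stand-in and not GeNIe's arc strength computation; the function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def rank_features(rows, target):
    """Sort all non-target columns by their dependence on the target."""
    ys = [r[target] for r in rows]
    scores = {
        name: mutual_information([r[name] for r in rows], ys)
        for name in rows[0]
        if name != target
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: Fever determines Flu, Hat colour is unrelated to it.
rows = [
    {"Fever": "Yes", "Hat": "Red", "Flu": "Yes"},
    {"Fever": "Yes", "Hat": "Blue", "Flu": "Yes"},
    {"Fever": "No", "Hat": "Red", "Flu": "No"},
    {"Fever": "No", "Hat": "Blue", "Flu": "No"},
]
ranking = rank_features(rows, "Flu")
for name, score in ranking:
    print(f"{name}: {score:.3f} bits")
```

Features at the top of the ranking are the candidates to keep. The sketch assumes complete data, so missing values would have to be handled first.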

I hope this helps,

Marek

Post by **liid** » Wed May 05, 2021 5:36 pm

Thank you!

As for FS, I was actually thinking about the structure learning (SL) approach, but again, since I have quite a lot of missing values in both the training and the testing data, it's a problem. I am aware that I can do something like kNN imputation and then apply structure learning, but I am not sure how well it's going to work. I just thought you might have implemented SEM or something similar so that the learning can work directly with missing values.

May I ask two other questions, please:
1. What do you think is the best way to impute the missing values with regard to your SL algos? (I know that in general there is no single answer as to which imputation method is better, but I thought that perhaps you would have a sense of which method fits best with the BN SL algos.)

2. Am I right in saying that if I impute the missing values for SL, then after the SL (for example, for parameter learning or just inference) I can un-impute the dataset and use the dataset with missing values again? My intuition is that if I use SL for FS, this won't hurt, because we use the imputation only to find out *which* features are the most "important". But then, when we actually estimate *what* the influence of each feature is, we use the original dataset with its missing values.
My intuition is less clear for the scenario where I use SL for SL itself, and not for FS.

This question turned out complex, so let me sum it up: if I impute my missing values and use the imputed dataset for FS using SL, for pure SL, or for both, is it "ok" to then un-impute the dataset and use the original dataset with the missing values for parameter learning and inference?

Thanks!

Post by **marek [BayesFusion]** » Wed May 05, 2021 7:53 pm

I'd like to confirm that the procedure that you proposed is fine, i.e., using structural learning with imputed values and then going back to the original file with missing values for parameter learning and possibly validation.
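
To illustrate why parameter learning can go back to the incomplete file once the structure is fixed, here is a minimal EM sketch for a hypothetical two-node structure C -> F with binary states. It is not GeNIe's implementation; the function name `em_two_node`, the starting values, and the data are assumptions made for the example, and it assumes F is always observed and both states of C occur.

```python
def em_two_node(rows, iterations=50):
    """EM for the fixed structure C -> F with states 0/1.

    rows: (c, f) pairs; c may be None (missing), f is always observed.
    Returns (P(C=1), P(F=1 | C=0), P(F=1 | C=1)).
    """
    pi, theta = 0.5, [0.4, 0.6]  # arbitrary starting point
    for _ in range(iterations):
        # E-step: w = P(C=1 | record); observed values of C are kept as is.
        weights = []
        for c, f in rows:
            if c is not None:
                weights.append(float(c))
                continue
            like1 = pi * (theta[1] if f else 1 - theta[1])
            like0 = (1 - pi) * (theta[0] if f else 1 - theta[0])
            weights.append(like1 / (like1 + like0))
        # M-step: re-estimate the parameters from the expected counts.
        n1 = sum(weights)
        n0 = len(rows) - n1
        pi = n1 / len(rows)
        theta = [
            sum((1 - w) * f for w, (_, f) in zip(weights, rows)) / n0,
            sum(w * f for w, (_, f) in zip(weights, rows)) / n1,
        ]
    return pi, theta[0], theta[1]


complete = [(1, 1), (1, 1), (1, 0), (1, 1), (0, 0), (0, 0), (0, 1), (0, 0)]
incomplete = complete + [(None, 1), (None, 0)]

print(em_two_node(complete))    # reduces to plain frequency counts
print(em_two_node(incomplete))  # records with missing C still contribute
```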
Cheers,

Marek

Post by **liid** » Wed May 05, 2021 9:22 pm

Thank you, Marek.

Any chance you can answer my first question as well? I know it's a bit out of scope, but I have been reading mixed research about how best to impute values in the training data for SL (and FS as well), and each paper makes different compelling arguments. I thought that you might have some insight on that, as adequate imputation is essential for the SL to actually be meaningful (especially given that for parameter learning I am going back to the unimputed, incomplete dataset).

Post by **marek [BayesFusion]** » Wed May 05, 2021 9:33 pm

I see, I thought you summarized both questions at the end. As far as the best method of imputation is concerned, I don't have much insight. I once co-authored a study in which we compared different methods. The results were inconclusive and there was not much difference between the various methods. One finding that was robust is that when you have missing values in medical data sets, imputing "normal" values worked best. "Normal" here means, for example, entering "NoFever" in a node "Fever": usually a missing value in a medical record means that the symptom was not observed because there was no abnormality.
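
A minimal sketch of imputing "normal" values, assuming the records are plain dictionaries with `None` for missing entries. The mapping `NORMAL_STATE` and the state names other than "NoFever" are hypothetical and would have to come from domain knowledge.

```python
# Hypothetical mapping from each symptom to its "normal" state.
NORMAL_STATE = {"Fever": "NoFever", "Cough": "NoCough"}


def impute_normal(rows, normal_state):
    """Fill missing values with the normal state where one is defined.

    Columns without an entry in normal_state keep their missing values.
    """
    return [
        {k: (normal_state.get(k, v) if v is None else v) for k, v in r.items()}
        for r in rows
    ]


patients = [
    {"Fever": None, "Cough": "Cough", "Age": "Adult"},
    {"Fever": "Fever", "Cough": None, "Age": None},
]
imputed = impute_normal(patients, NORMAL_STATE)
print(imputed)
```

"Age" has no natural normal state, so the sketch leaves it missing instead of guessing.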

I hope this helps,

Marek

Post by **liid** » Fri May 07, 2021 1:08 am

Thank you so much for the replies, Marek, I really appreciate it.

Going back to FS and the Markov blanket suggestion: something didn't sound right to me, and I think I know what it is.
In my data I have not only a "latent" target node. I also have other latent variables (i.e., I have values for them in the training set but not in the test set). So doesn't this mean that I need to do the SL and look for the blankets of EACH of those latent variables (more accurately, the blankets of the latent variables that also appear in the target's blanket)? If so, I assume that I can find all the blankets in the same SL run and don't need separate runs to discover the blankets of different latent variables?

In general, does this scheme make sense to you?
1. Run an SL algorithm (PC or GTT) on all the training data (perhaps on a subset of it produced by a simple univariate filter-based method like IG, just to do the initial filtering and remove the real junk) and mark the Markov blankets of ALL the latent variables (including, of course, the target).
2. Ignore: (a) everything that is not in one of the Markov blankets; (b) the latent variables that are not in the Markov blanket of the target AND their Markov blankets (unless one of the nodes in such a blanket also belongs to the target's blanket or to another latent variable's blanket).
3. Consider all the Markov blankets of all the latent variables. Because I have missing values AND some strong domain knowledge about some of the features and correlations, make some quick "corrections" (reconciling the differences between what the SL-based FS found and the knowledge that I have).
4. Run SL again, only on the features obtained in step 3, this time purely for SL purposes (perhaps using a *different SL algo, like the Bayesian algo with some sort of priors that reflect the knowledge acquired up to and including step 3). (Is this step even necessary?)

*I do remember that the previous time I used GeNIe, I received different nets from different SL algos.
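
Steps 1 and 2 of the scheme above can be sketched as follows, assuming the learned structure has been exported as a list of directed edges (parent, child). The edge list, the variable names, and the helper `select_features` are hypothetical. A node's Markov blanket consists of its parents, its children, and the other parents of its children.

```python
def markov_blanket(edges, node):
    """Parents, children, and children's other parents of a node in a DAG."""
    parents = {a for a, b in edges if b == node}
    children = {b for a, b in edges if a == node}
    spouses = {a for a, b in edges if b in children and a != node}
    return parents | children | spouses


def select_features(edges, target, latent):
    """Target's blanket plus the blankets of latent variables inside it."""
    target_mb = markov_blanket(edges, target)
    selected = set(target_mb)
    for var in latent:
        if var in target_mb:
            selected |= markov_blanket(edges, var)
    selected.discard(target)
    return selected


# Hypothetical learned structure.
edges = [
    ("Age", "Disease"),
    ("Disease", "Fever"),
    ("Infection", "Fever"),
    ("Disease", "Stage"),
    ("Lab", "Stage"),
    ("Stage", "Treatment"),
    ("Noise", "Other"),
]
features = select_features(edges, "Disease", latent=["Stage", "Other"])
print(sorted(features))
```

All blankets come from the same edge list, so a single structure learning run is enough for this step. "Other" is latent but lies outside the target's blanket, so it and its blanket are ignored.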

Thanks a lot, again.

Post by **liid** » Mon May 10, 2021 2:18 am

Politely requesting: perhaps someone can answer the question from my previous message regarding the use of Markov blankets for FS in GeNIe?

Post by **marek [BayesFusion]** » Mon May 10, 2021 8:22 pm

The procedure makes sense but not completely. If you have missing values in your data set, it could happen that they are missing in the Markov blanket variables, in which case other variables, outside the Markov blanket, may play a role in calculating the probability distribution over your target variable.
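
A small worked example of this effect, using a hypothetical chain A -> B -> T with made-up probabilities. The Markov blanket of T is {B}: when B is observed, A adds nothing, but when B is missing, A changes the distribution over T.

```python
from itertools import product

# Hypothetical parameters for the chain A -> B -> T (binary states 0/1).
P_A = {1: 0.5, 0: 0.5}
P_B_GIVEN_A = {1: 0.9, 0: 0.1}  # P(B=1 | A=a)
P_T_GIVEN_B = {1: 0.8, 0: 0.2}  # P(T=1 | B=b)


def joint(a, b, t):
    """Joint probability P(A=a, B=b, T=t) from the chain factorization."""
    pb = P_B_GIVEN_A[a] if b else 1 - P_B_GIVEN_A[a]
    pt = P_T_GIVEN_B[b] if t else 1 - P_T_GIVEN_B[b]
    return P_A[a] * pb * pt


def prob_t(evidence):
    """P(T=1 | evidence) by enumeration; evidence maps 'A'/'B' to 0 or 1."""
    num = den = 0.0
    for a, b, t in product((0, 1), repeat=3):
        world = {"A": a, "B": b}
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(a, b, t)
        den += p
        if t:
            num += p
    return num / den


# B observed: A is irrelevant.
print(prob_t({"A": 1, "B": 1}), prob_t({"A": 0, "B": 1}))
# B missing: A now matters.
print(prob_t({"A": 1}), prob_t({"A": 0}))
```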

When I mentioned structural learning as an aid to feature selection (FS) a couple of posts ago, I meant that this serves as an indication of which variables could possibly be close to the target variable and could act as useful/important features.

I hope this helps,

Marek

Post by **liid** » Sat May 15, 2021 2:27 am

Thanks, Marek. I see your point. However, isn't this the case for every FS method? I mean, many FS algos (not just GeNIe's SL) require complete datasets. So obviously, if you first impute the missing values and then feed the dataset into the FS algorithm, the FS algorithm (no matter which one) would "miss" the connections between the features with missing values (that are now artificially "complete") and other features that could have somehow helped in "predicting" the missing features' values.

Post by **marek [BayesFusion]** » Sun May 16, 2021 11:17 am

What you are writing is correct, with one exception, I believe. Missing values are more likely to add variables to the Markov blanket than to remove them. When some variables in the Markov blanket are unobserved, other variables (i.e., variables outside of the Markov blanket) remain dependent on the target.

I hope this helps,

Marek

Post by **liid** » Sat May 22, 2021 7:16 pm

I see.
Perhaps I should also include in my FS the MBs of features with a significant number of missing values (say, >40% or so).
I can't really find work on that topic; most of the papers talk about MBs in the context of "measured" variables.
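
Finding the features that cross such a threshold is straightforward. The sketch below assumes records stored as dictionaries with `None` for missing values; the 40% cut-off is the rough figure from this post, not a recommended value, and the data are made up.

```python
def missing_fraction(rows):
    """Fraction of records in which each column is missing (None)."""
    return {
        name: sum(r[name] is None for r in rows) / len(rows)
        for name in rows[0]
    }


def heavily_missing(rows, threshold=0.4):
    """Columns whose missing fraction is strictly above the threshold."""
    fractions = missing_fraction(rows)
    return sorted(n for n, frac in fractions.items() if frac > threshold)


rows = [
    {"Fever": None, "Cough": "Yes", "Flu": "Yes"},
    {"Fever": None, "Cough": None, "Flu": "No"},
    {"Fever": "No", "Cough": "No", "Flu": "No"},
    {"Fever": None, "Cough": "No", "Flu": "Yes"},
    {"Fever": "Yes", "Cough": "Yes", "Flu": "Yes"},
]
candidates = heavily_missing(rows)
print(missing_fraction(rows))
print(candidates)
```

The Markov blankets of the returned features are the ones to consider adding to the selected feature set.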

Thanks, Marek.

BayesFusion Support Forum
