Naive Bayes

nazim91
Posts: 7
Joined: Mon Sep 12, 2016 3:41 pm

Naive Bayes

Post by nazim91 »

Hi,

I'm still new to BNs and GeNIe. Currently, I am working on a simple Naive Bayes model consisting of 1 parent node and 5 child nodes. The parent node represents overall team performance (worst, bad, medium, good, superb), while the 5 child nodes represent the performance of each of the first five players who played in a basketball game (worst, bad, medium, good, superb). How do I compute the overall team strength (worst, bad, medium, good, superb) from the child nodes? I then want to compare it with the opponent team in order to make an early prediction of the probabilities of a win, draw, and loss.
P.S.: I have already constructed the DAG and entered the player performance data.

Thank you...
marek [BayesFusion]
Site Admin
Posts: 430
Joined: Tue Dec 11, 2007 4:24 pm

Re: Naive Bayes

Post by marek [BayesFusion] »

The general direction of your work is fine, i.e., creating a class node and feature nodes that are children of the class node. It looks like you need to read up a little on the theory, i.e., what a Naive Bayes model does and how it works. There is plenty of literature on that; just try "naive Bayes" in Google. From the point of view of GeNIe, this is the simplest model that you can build, and it is one that performs reasonably (although it can typically be improved upon). If you have a data set, you can learn both the structure and the parameters from the data. If you don't have data, parameters are not terribly difficult to estimate. Just remember that the probability distribution over the states of a feature node is the conditional probability given the class. Good luck!
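To make the inference step concrete, here is a minimal sketch of Naive Bayes inference for the basketball example. All probability numbers are invented for illustration, and for brevity every player shares the same conditional table; in a real GeNIe model each child node has its own CPT, entered by hand or learned from data.

```python
# Hypothetical Naive Bayes inference sketch for the basketball example.
# The prior and CPT numbers below are made up for illustration only.

STATES = ["worst", "bad", "medium", "good", "superb"]

# Prior over the class node (overall team performance) -- assumed numbers.
prior = {"worst": 0.1, "bad": 0.2, "medium": 0.4, "good": 0.2, "superb": 0.1}

# P(player_state | team_state) for one feature node -- assumed numbers.
# Here every player shares the same CPT; a real model has one per child node.
cpt = {
    "worst":  {"worst": 0.5,  "bad": 0.3,  "medium": 0.1,  "good": 0.07, "superb": 0.03},
    "bad":    {"worst": 0.2,  "bad": 0.4,  "medium": 0.25, "good": 0.1,  "superb": 0.05},
    "medium": {"worst": 0.05, "bad": 0.2,  "medium": 0.5,  "good": 0.2,  "superb": 0.05},
    "good":   {"worst": 0.05, "bad": 0.1,  "medium": 0.25, "good": 0.4,  "superb": 0.2},
    "superb": {"worst": 0.03, "bad": 0.07, "medium": 0.1,  "good": 0.3,  "superb": 0.5},
}

def posterior(evidence):
    """P(team_state | observed player performances), by Bayes' rule."""
    scores = {}
    for team_state in STATES:
        p = prior[team_state]
        for player_state in evidence:      # naive independence assumption
            p *= cpt[team_state][player_state]
        scores[team_state] = p
    total = sum(scores.values())           # normalize to a distribution
    return {s: p / total for s, p in scores.items()}

post = posterior(["good", "good", "medium", "superb", "good"])
best = max(post, key=post.get)             # most probable team strength
```

The posterior over the class node is exactly the "overall team strength" distribution the first post asks about; comparing two teams means comparing these posteriors.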

Marek

Re: Naive Bayes

Post by nazim91 »

Thank you Marek for your reply.

I have followed all your guidance, strengthened my fundamental knowledge of BNs, and found them very interesting. I have a dataset and tried learning from it using the Bayesian Search algorithm. However, I am now stuck on validation. May I ask whether GeNIe supports a confusion matrix for multiple classes, and how to produce it? When I run validation with Leave One Out on a class (match result) containing three outcomes (win, draw, and loss), I only get an accuracy percentage for win, not for draw and loss. I really appreciate your help. Thank you.

Re: Naive Bayes

Post by marek [BayesFusion] »

GeNIe supports validation for multiple classes. Just select the nodes containing your classes in the validation dialog; the classes do not all have to be in the same node. The results will be shown as accuracy data for each of the nodes/classes, confusion matrices, and ROC and calibration curves. Try it on something simple to get the idea. Also, please download the newest GeNIe, released on November 4 -- we have made some changes to the program and the documentation, including the validation section. Good luck!
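For readers who want to see what a multi-class confusion matrix contains, here is a minimal sketch with invented win/draw/loss predictions. GeNIe computes all of this for you in the Validation dialog; the point here is only to show how per-class accuracy differs from overall accuracy, which is what the question above runs into.

```python
# Confusion matrix for a three-state class, with made-up predictions.
CLASSES = ["win", "draw", "loss"]

actual    = ["win", "win",  "draw", "loss", "win", "draw", "loss", "loss"]
predicted = ["win", "draw", "draw", "loss", "win", "win",  "loss", "win"]

# matrix[a][p] = number of records whose true class is a, predicted as p
matrix = {a: {p: 0 for p in CLASSES} for a in CLASSES}
for a, p in zip(actual, predicted):
    matrix[a][p] += 1

# overall accuracy = fraction of records on the matrix diagonal
overall = sum(matrix[c][c] for c in CLASSES) / len(actual)

# per-class accuracy = diagonal cell divided by that class's row total
per_class = {}
for c in CLASSES:
    row_total = sum(matrix[c].values())
    per_class[c] = matrix[c][c] / row_total if row_total else 0.0
```

With these toy numbers the overall accuracy is 62.5%, while each class has its own rate -- the "draw" and "loss" figures the poster was missing are the other diagonal entries of the same matrix.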

Marek

Re: Naive Bayes

Post by nazim91 »

Hi Marek,

I have already downloaded and installed the new version of GeNIe and tried to learn a new network from a simpler dataset for learning purposes. As a result, there are some independent nodes, i.e., nodes not connected to any other node. Secondly, my accuracy during validation (Leave One Out) keeps changing between 0.88+, 0.83+, and 0.77+, although the network and dataset are the same. Why does this happen? I also found that there are three methods for discretization: Hierarchical, Uniform Widths, and Uniform Counts. Can you give me a brief explanation of them? Lastly, I read an article (Cooper and Herskovits, 1992) listed in the GeNIe manual. Are the Bayesian Search algorithm and the Hill Climbing algorithm the same algorithm? Sorry if this is a silly question.

Thank you.

Re: Naive Bayes

Post by marek [BayesFusion] »

I'm not sure what you are doing, so I cannot give you a 100% certain answer. If you have the same data set and the same model and perform Validation multiple times, you can get different results each time if you use Folding seed of 0. In that case, GeNIe will use the system clock to start the random number generator and will generally choose different records for the folds, hence different results. They should be similar, though. If you want the results to be always the same, please enter a non-zero random number seed and use the same seed each time. This is explained in the on-line help. Assuming that you are using a random number seed of zero, the big differences that you are observing may be due to a small data set -- then the variance is large. I cannot help more without seeing the data and the model.
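The effect of the folding seed can be mimicked with a small sketch. This imitates the idea only -- fixing the seed makes the random fold assignment reproducible -- and is not GeNIe's actual fold-assignment code:

```python
# Why a fixed (non-zero) seed gives repeatable validation results:
# the random shuffle behind the fold assignment becomes deterministic.
import random

def assign_folds(n_records, n_folds, seed):
    rng = random.Random(seed)          # fixed seed -> same shuffle every run
    order = list(range(n_records))
    rng.shuffle(order)
    # deal the shuffled record indices round-robin into the folds
    return [order[i::n_folds] for i in range(n_folds)]

a = assign_folds(10, 2, seed=42)
b = assign_folds(10, 2, seed=42)       # identical to a: same seed
c = assign_folds(10, 2, seed=7)        # generally a different partition
```

With a seed of 0 (system clock), each run behaves like a fresh `seed` value, so the folds, and hence the accuracy, vary from run to run.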

Nodes that are not connected to other nodes are possible -- it all depends on whether they are dependent on any other nodes.

There are three discretization methods implemented in GeNIe now: Uniform Widths, which makes the widths of the discretization intervals the same; Uniform Counts, which makes the number of values in each of the discretization bins the same; and Hierarchical, which is an unsupervised discretization method related to clustering. We do not have the literature reference handy, but here is a sketch of the algorithm implemented in GeNIe:
Input: N=# of records, K=# of desired bins
1. Let k denote the running number of bins, initialized to k=N (each record starts in its own cluster)
2. If k=K, stop; otherwise set k=k-1 by combining the two bins whose mean values have the smallest separation
3. Repeat step 2
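Under a straightforward reading of this sketch, the merging loop could look like the following (a simplified illustration, not GeNIe's actual implementation; since the data are sorted, the pair of bins with the closest means is always a pair of neighbours):

```python
def hierarchical_bins(values, k_desired):
    """Greedy agglomerative binning: start with one bin per value and
    repeatedly merge the two adjacent bins whose means are closest."""
    # each bin is a list of values; start with one bin per record (k = N)
    bins = [[v] for v in sorted(values)]
    while len(bins) > k_desired:
        means = [sum(b) / len(b) for b in bins]
        # neighbouring pair with the smallest separation of means
        i = min(range(len(bins) - 1), key=lambda j: means[j + 1] - means[j])
        bins[i:i + 2] = [bins[i] + bins[i + 1]]   # merge bins i and i+1
    return bins

data = [1.0, 1.1, 1.2, 5.0, 5.1, 9.7, 10.0, 10.2]
bins = hierarchical_bins(data, 3)
# the three natural clusters around 1, 5, and 10 are recovered
```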

Finally, Hill Climbing is a general term for optimization algorithms that look a limited number of steps ahead (often one) and pick the step that leads to the biggest improvement in score. Bayesian Search is a hill climbing algorithm, but hill climbing is not necessarily Bayesian Search :-).
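As an abstract illustration of one-step-lookahead hill climbing (a generic sketch of the idea, unrelated to GeNIe's structure-search code):

```python
# Generic hill climbing: from the current state, evaluate all neighbours
# and move to the best one, stopping when no neighbour improves the score.

def hill_climb(start, neighbours, score):
    current = start
    while True:
        best = max(neighbours(current), key=score, default=None)
        if best is None or score(best) <= score(current):
            return current                 # local optimum reached
        current = best

# Toy example: maximize f(x) = -(x - 3)^2 over the integers.
result = hill_climb(
    start=0,
    neighbours=lambda x: [x - 1, x + 1],
    score=lambda x: -(x - 3) ** 2,
)
```

In structure learning the "state" is a network structure, the "neighbours" are structures reachable by adding, removing, or reversing one arc, and the score is a metric such as the Bayesian score used by Bayesian Search.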
I hope this helps.

Marek

Re: Naive Bayes

Post by nazim91 »

Sorry for disturbing you again (~.~")

I have another question: why is there an error while learning with Bayesian Search saying that constant values are not allowed? This is the situation: I am listing the presence/absence of each player in a game; some players are present in all matches, while others are present only in some matches. The match outcome is the class (win, draw, lose).

Thank you.

Re: Naive Bayes

Post by marek [BayesFusion] »

At least one of the columns in your data file contains a constant value, as you say. Constant values are useless in learning. The common-sense explanation is this: if a player is present during each of the games, then his/her presence cannot be a predictor of whether the team wins. He/she will be there anyway, right? If you want to predict what happens when he/she is not there, there is no basis for this judgment during the learning process, as you have no cases when he/she was absent. So this variable is useless in learning. You could enhance your model afterwards by adding this variable and making a judgment about what happens during this player's absence, but not during the learning procedure.
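As an aside, such constant columns are easy to screen for before handing the file to a learning algorithm. A minimal sketch with a made-up presence/absence table (all names and records invented):

```python
# Find columns whose value never varies across the records -- these are
# the columns that trigger the "constant values not allowed" error.

records = [
    {"PlayerA": "present", "PlayerB": "present", "Result": "win"},
    {"PlayerA": "present", "PlayerB": "absent",  "Result": "loss"},
    {"PlayerA": "present", "PlayerB": "present", "Result": "draw"},
]

constant = [
    col for col in records[0]
    if len({row[col] for row in records}) == 1   # only one distinct value
]
# PlayerA is present in every match, so the column should be dropped
```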
I hope this helps,

Marek

Re: Naive Bayes

Post by nazim91 »

Hi Marek,

Thank you for your reply. Actually, I am trying to reproduce the prediction model of Joseph et al. (Predicting football results using Bayesian nets and other machine learning techniques, 2006) in order to study football prediction models using Bayesian networks. For their general model, they use the presence/absence of players and other variables such as opponent quality (team ranking) and venue (home/away). Why, when I use Bayesian Search for learning, are none of the nodes connected? Next, I have a question regarding the EM algorithm for parameter learning. It is stated that the final Log(p), which measures how well the data fit the model, ranges from minus infinity to zero. Is it better for Log(p) to be near zero or near minus infinity for the model to fit the data well?

Thank you

Re: Naive Bayes

Post by marek [BayesFusion] »

Your first question cannot be answered without data. I looked at the paper but cannot find the source of the data. Do you have the data or know where I can take it from? We can see whether variables are independent in the data or not.

log(p) is the overall log likelihood score of the final iteration of the EM algorithm and it expresses how well the final set of parameters fits all the data. The higher, the better, with log(1)=0 being the largest possible value. Please be careful with interpreting the log(p), as it does not take into account model complexity. Adding arcs, for example, will never decrease log(p), so most structure search algorithms use also a penalty score for complexity.
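The shape of log(p) can be illustrated with a toy computation (invented probabilities, not output from any real model): it is the sum of the log-probabilities the model assigns to the observed records, so its maximum is log(1)=0 and it grows more negative as the model finds the data less likely.

```python
# Toy illustration of why log(p) lies in (-inf, 0] and why
# "closer to zero is better".
import math

def log_likelihood(probs):
    """probs = probability the model assigns to each observed record."""
    return sum(math.log(p) for p in probs)

good_fit = log_likelihood([0.9, 0.8, 0.95])   # model finds the data likely
poor_fit = log_likelihood([0.2, 0.1, 0.3])    # model finds the data unlikely
# good_fit is closer to zero than poor_fit, i.e., a better fit
```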

The soundest approach for comparing the quality of models is cross-validation. It measures how well a model does in predicting the values of variables of interest.
I hope this helps.

Marek

Re: Naive Bayes

Post by nazim91 »

Thank you for your previous response.

I have already sent you the data by PM. Besides, I tried the Naive BN, but I do not get the same accuracy results as stated in Table 1 for the separate training and test data between periods, except for the exact percentage of 26 (68.24%) for the 95/96 training period and the 95/96 test period. Is this because of a different tool, scoring metric, or data?

Thank you :)

Re: Naive Bayes

Post by marek [BayesFusion] »

Got your data. It has only 38 data records (very few) and 23 variables. What is it exactly that you would like to do with the data? Replicating the results from the paper will be hard, if not impossible, given that you don't know which of the records were used for training and which for testing. Was their data set exactly what you sent me? How did they split the records between the training and test sets? I have to say that the number of records is very small, so I would expect the results to be weak. I have noticed that Sharingham and Walker were there during each of the games, so you can and should remove them from learning -- their presence tells nothing. I have tried several algorithms, including Naive Bayes with leave-one-out cross-validation, with rather poor results; the accuracy depends on whether we focus on Results or Team Ranking, and on the algorithm. Please advise what you would like to do and how I can help.
Cheers,

Marek

Re: Naive Bayes

Post by nazim91 »

Thank you for your response, and I apologize for my late reply.

Actually, I am studying this paper for my literature review and found it very interesting. I organized the data from the paper accordingly. The data were divided into three groups of ten matches and one group of eight matches, organised chronologically. So periods 1 to 3 have ten matches each, while period 4 has eight matches. The data I sent you were organised by me, and their reliability may be questionable because I extracted them from a website. Yes, I excluded some players whose presence was constant across all matches, because their presence tells nothing. For the time being, my plan is to go through other papers about BNs in sports and strengthen my fundamental knowledge of BNs.


P.S.: I find GeNIe a really great tool. I learned so much about BNs while using it, and I really appreciate the free academic version that you offer. Thank you very much for your fast support, and if I have other questions, we will meet again :)

Re: Naive Bayes

Post by marek [BayesFusion] »

Glad to be of service -- Marek