The question about processing of continuous data to discretized data

wxk8000
Posts: 20
Joined: Fri Jan 19, 2018 11:58 am

Post by wxk8000 »

I used GeNIe's Discretize function to discretize my continuous data set.
discretize.png
The discretized data raise some questions for me, as shown in the following figure.
discretize process.png
In the original continuous data, different values of the parent nodes correspond to different values of the child node, but after discretization some rows become identical. Moreover, in some rows the same values of the parent nodes are paired with different values of the child node. Can the discretized data still be used to learn a Bayesian network? Will this have a negative effect on my network?
1. What should I pay attention to when discretizing a continuous data set, for example the method, the bin count, or the boundaries?
2. How can I assess the quality of the discretized data? Are there any criteria for a good discretization?
marek [BayesFusion]
Site Admin
Posts: 430
Joined: Tue Dec 11, 2007 4:24 pm

Re: The question about processing of continuous data to discretized data

Post by marek [BayesFusion] »

wxk8000 wrote:In the original continuous data, different values of the parent nodes correspond to different values of the child node, but after discretization some rows become identical. Moreover, in some rows the same values of the parent nodes are paired with different values of the child node. Can the discretized data still be used to learn a Bayesian network? Will this have a negative effect on my network?
1. What should I pay attention to when discretizing a continuous data set, for example the method, the bin count, or the boundaries?
2. How can I assess the quality of the discretized data? Are there any criteria for a good discretization?
I'm not sure what to make of the example that you gave. Are you suggesting that there is an error in the discretization, or are you asking for an explanation of how it works? Discretization is the process of replacing numerical values with the labels of the intervals into which they fall, so different numerical values of a variable may well result in the same label. To verify that the discretized values that you have circled are correct (which I suspect they are), I would have to know the discretization thresholds that you used. The names of the labels are not enough, as you may have changed the interval bounds manually.
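To make the mapping from values to interval labels concrete, here is a minimal sketch in plain Python. It is not GeNIe's implementation: the thresholds, the labels, and the convention that a value equal to a threshold goes into the upper interval are all assumptions chosen for illustration.

```python
from bisect import bisect_right

def discretize(value, thresholds, labels):
    """Return the label of the interval that `value` falls into.

    `thresholds` must be sorted ascending and there must be exactly one
    more label than thresholds. A value equal to a threshold is placed
    in the upper interval (an assumed convention, not GeNIe's).
    """
    if len(labels) != len(thresholds) + 1:
        raise ValueError("need exactly one more label than thresholds")
    return labels[bisect_right(thresholds, value)]

# Hypothetical thresholds and labels, for illustration only.
thresholds = [10.0, 20.0]
labels = ["low", "medium", "high"]

values = [3.2, 9.9, 10.0, 14.7, 25.1]
discretized = [discretize(v, thresholds, labels) for v in values]
print(discretized)  # ['low', 'low', 'medium', 'medium', 'high']
```

Note that 3.2 and 9.9 are different numbers but receive the same label, which is exactly the effect described above. Checking a circled value by hand requires the actual thresholds, not just the label names.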

Discretized data can absolutely be used to learn the structure of a BN; this is what discretization is often used for. There are theoretical reasons for concern, however: discretization may lead to spurious dependencies (discussed, for example, in the book by Spirtes, Glymour and Scheines; the reference can be found in the GeNIe manual).

There is no general advice that I can give you for discretization. Various discretization methods have been shown in many settings to lead to similar results. Too many intervals will increase the complexity of your model, which may become a problem if your data set is small. In decision analysis, there is a popular heuristic that 3-5 intervals express a distribution reasonably well. The ultimate test of the quality of a discretization is model accuracy or calibration, which you can check using cross-validation on your original data. When working on practical data sets, I have found that natural boundaries for the intervals work well (for example, I would discretize a variable "body temperature" into "hypothermia", "normal" and "fever", using generally accepted thresholds for these).
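As a rough illustration of how the choice of method affects the intervals, the sketch below computes cut points for two common schemes, equal-width and equal-frequency binning, and counts how many values land in each interval. The function names and the sample data are made up for this example, and the code does not claim to reproduce what GeNIe does internally.

```python
from bisect import bisect_right

def equal_width_cuts(values, bins):
    """Cut points that split [min, max] into `bins` intervals of equal width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins
    return [lo + step * i for i in range(1, bins)]

def equal_frequency_cuts(values, bins):
    """Cut points that put roughly the same number of values in each interval."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // bins] for i in range(1, bins)]

def bin_counts(values, cuts):
    """Number of values per interval; interval i covers [cuts[i-1], cuts[i])."""
    counts = [0] * (len(cuts) + 1)
    for v in values:
        counts[bisect_right(cuts, v)] += 1
    return counts

# Made-up sample with one outlier, for illustration only.
data = [1, 2, 3, 4, 5, 6, 7, 8, 100]

width_cuts = equal_width_cuts(data, 3)
freq_cuts = equal_frequency_cuts(data, 3)
print(width_cuts, bin_counts(data, width_cuts))  # [34.0, 67.0] [8, 0, 1]
print(freq_cuts, bin_counts(data, freq_cuts))    # [4, 7] [3, 3, 3]
```

With a single outlier, equal-width binning leaves one interval empty, while equal-frequency binning spreads the records evenly. Neither is automatically better; hand-picked natural boundaries can replace either list of cut points.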
I hope this helps,

Marek
wxk8000
Posts: 20
Joined: Fri Jan 19, 2018 11:58 am

Re: The question about processing of continuous data to discretized data

Post by wxk8000 »

marek wrote:Discretization is the process of replacing numerical values by labels of the intervals in which they fall and it may happen that different numerical values of a variable result in the same label.
Dear Marek, thank you for your advice. I have exactly the problem you point out:
1. Different numerical values of a variable result in the same label. Will this reduce the number of valid records in the data set?
2. Sometimes rows have the same four labels but a different fifth label. How can the same four states lead to different labels? Will this affect the training result of my network?

I have uploaded my data set as an attachment. I hope you can give me some advice on how to discretize it. Thank you!
CL continuous.csv
marek [BayesFusion]
Site Admin
Posts: 430
Joined: Tue Dec 11, 2007 4:24 pm

Re: The question about processing of continuous data to discretized data

Post by marek [BayesFusion] »

There seems to be nothing wrong with your data, so I suspect that it is just confusion on your part. Discretization will obviously lead to a loss of information and may decrease the quality of your model. Sometimes, however, you have no choice and need to discretize. I would definitely use 2 intervals for Ma and D_mmin, 5 for feed_mm, and anything between 3 and 10 for CL_mm. Only the number of intervals in CL_mm will make a difference for you, as the first four variables naturally have a small number of values. I hope this helps.
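On the question of rows in which the same parent labels appear with different child labels: such rows are not lost or invalid. They simply become counts in a conditional frequency table. The sketch below shows plain relative-frequency estimation on made-up records; the state names are hypothetical, the records are not taken from the attached file, and GeNIe's own parameter learning may differ (for example, by using priors).

```python
from collections import Counter, defaultdict

# Made-up discretized records: (parent_a, parent_b, child).
records = [
    ("s1", "low", "c1"),
    ("s1", "low", "c1"),
    ("s1", "low", "c1"),
    ("s1", "low", "c2"),   # same parent labels, different child label
    ("s2", "high", "c3"),
]

# Count child labels separately for every parent configuration.
counts = defaultdict(Counter)
for *parents, child in records:
    counts[tuple(parents)][child] += 1

# Turn the counts into relative frequencies, i.e. an estimate of
# P(child | parents) for each configuration seen in the data.
cpt = {
    parents: {child: n / sum(c.values()) for child, n in c.items()}
    for parents, c in counts.items()
}

print(cpt[("s1", "low")])   # {'c1': 0.75, 'c2': 0.25}
print(cpt[("s2", "high")])  # {'c3': 1.0}
```

Every record contributes to exactly one count, so identical rows strengthen an estimate rather than reducing the amount of usable data.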

Marek