Discretization of Distributions



Post by Floistderbeste »

Hello everyone,

I have a dataset from a production system with both discrete and continuous variables. For example, I have some continuous production input variables such as LoadingVolume, which ranges from 100 to 200 and is almost uniformly distributed.
I also have the output variable Throughput (150 to 400), which is almost normally distributed.
How can I discretize these variables? Just binning them with equal width seems arbitrary to me, and I also don't know how to determine the number of bins.
I have heard about several algorithms like ChiMerge, k-means, or entropy-based discretization, but they need class-labeled data, which I don't have. So is there nothing left but equal-width/equal-frequency binning?
For the normal distributions, I heard that I can partition them by standard deviations. Would you agree with this?

Sorry for my bad English, and thanks in advance for your help!

Re: Discretization of Distributions

Post by marek [BayesFusion] »

Since you don't have the class labels, you need to rely on unsupervised discretization, which is what GeNIe offers. In my experience, the difference among various methods for discretization is minimal in terms of resulting model accuracy, as long as you don't do it too weirdly, i.e., as long as you conform to some reasonable rules of behavior. You will see this in the literature testing various discretization methods.
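To make the unsupervised options concrete outside GeNIe, here is a minimal NumPy sketch of equal-width, equal-frequency, and standard-deviation-based binning. It is not GeNIe's implementation. The data are synthetic stand-ins for LoadingVolume and Throughput, and the helper names, the mean of 275, and the standard deviation of 40 are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the variables in the question (assumed, not real data).
loading_volume = rng.uniform(100, 200, size=1000)                # roughly uniform
throughput = np.clip(rng.normal(275, 40, size=1000), 150, 400)   # roughly normal


def equal_width_edges(x, k):
    """k intervals of identical width between min(x) and max(x)."""
    return np.linspace(x.min(), x.max(), k + 1)


def equal_frequency_edges(x, k):
    """k intervals that each hold about the same number of samples."""
    return np.quantile(x, np.linspace(0, 1, k + 1))


def std_edges(x):
    """Three intervals split at mean - 1 sd and mean + 1 sd."""
    mu, sigma = x.mean(), x.std()
    return np.array([x.min(), mu - sigma, mu + sigma, x.max()])


def assign_bins(x, edges):
    """Map each value to an interval index 0..k-1 using the interior edges."""
    return np.digitize(x, edges[1:-1])


ew_edges = equal_width_edges(loading_volume, 4)
ef_edges = equal_frequency_edges(throughput, 4)
sd_edges = std_edges(throughput)

ew_counts = np.bincount(assign_bins(loading_volume, ew_edges), minlength=4)
ef_counts = np.bincount(assign_bins(throughput, ef_edges), minlength=4)
sd_counts = np.bincount(assign_bins(throughput, sd_edges), minlength=3)

print("equal width    ", ew_edges.round(1), ew_counts)
print("equal frequency", ef_edges.round(1), ef_counts)
print("mean +/- 1 sd  ", sd_edges.round(1), sd_counts)
```

For an almost uniform variable, equal-width and equal-frequency bins end up nearly identical. For an almost normal variable, equal-width bins leave few samples in the outer intervals, while the split at one standard deviation puts roughly two thirds of the samples in the middle interval.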

What does it mean to discretize reasonably? You need at least 3-5 discretization intervals for a continuous variable, but 3-5 may also be enough, so you should not go crazy with too many intervals. The more intervals, the more complex your model is and the more data you need to train it. I would suggest doing it interactively in GeNIe. I usually start with one of the available methods and then adjust the interval boundaries manually by dragging them with the mouse (I assume you know how to do it). Please look for natural boundaries. Sometimes you see them in the data: there are clusters of values that you can see on the histogram. Sometimes the boundaries are described in the literature, e.g., we know the normal range of blood pressure and what counts as hypotension and hypertension. Finally, please test your models with different discretizations and pick the one that works best for your data/problem.
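If you want a starting point for natural boundaries before adjusting them by hand, one option is a one-dimensional k-means, which is unsupervised and needs no class labels. The sketch below is an illustrative assumption, not something GeNIe is known to do: the function name kmeans_1d_boundaries, the three-cluster synthetic data, and the quantile-based initialization are all hypothetical choices.

```python
import numpy as np


def kmeans_1d_boundaries(x, k, n_iter=100):
    """Plain Lloyd's k-means on 1-D data; returns the k-1 interval
    boundaries placed midway between neighbouring cluster centers."""
    x = np.asarray(x, dtype=float)
    # Deterministic start: centers at evenly spaced quantiles of the data.
    centers = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(n_iter):
        # Assign every value to its nearest center.
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        # Move each center to the mean of its values (keep it if empty).
        new_centers = np.array([
            x[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    centers = np.sort(centers)
    return (centers[:-1] + centers[1:]) / 2


# Synthetic data with three visible clusters (assumed, for illustration only).
rng = np.random.default_rng(1)
values = np.concatenate([
    rng.normal(120, 5, 300),
    rng.normal(160, 5, 300),
    rng.normal(190, 4, 300),
])

boundaries = kmeans_1d_boundaries(values, k=3)
labels = np.digitize(values, boundaries)   # interval index 0, 1 or 2
counts = np.bincount(labels, minlength=3)

print("suggested boundaries:", boundaries.round(1))
print("samples per interval:", counts)
```

Running this with a few values of k between 3 and 5 gives candidate boundaries that can then be compared against the histogram and adjusted manually. On data without visible clusters the result is less meaningful and equal-width or equal-frequency bins are an equally reasonable start.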
I hope this helps,

Marek