Distance and entropy-based measures
The diagnostic measures calculated for observations depend on the set of pursued faults. The pursued faults define the focus of reasoning, and the changes in their probabilities are the inputs to the measure algorithms. To select the algorithm, use DSL_diagSession::SetSingleFaultAlgorithm and DSL_diagSession::SetMultiFaultAlgorithm. The output of the algorithm is a single number for each uninstantiated observation.
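A minimal sketch of where this selection fits in code is shown below. Only the two setter names come from this manual; the header name, the int parameter type, and the zero algorithm identifiers are assumptions made for illustration, as the actual constants are defined in the SMILE distribution.

```cpp
// A minimal sketch only: "smile.h" and the zero identifiers below are
// placeholders; consult the SMILE headers for the actual algorithm constants.
#include "smile.h"

void selectMeasureAlgorithms(DSL_diagSession &session)
{
    int singleFaultAlg = 0; // placeholder algorithm identifier
    int multiFaultAlg = 0;  // placeholder algorithm identifier
    session.SetSingleFaultAlgorithm(singleFaultAlg);
    session.SetMultiFaultAlgorithm(multiFaultAlg);
}
```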
The available algorithms for single fault diagnosis are:
•Max probability change (default): for each observation, the outcome with the largest absolute change in the probability of the pursued fault is found, and the signed change for that outcome is reported. Note that this is a signed measure, meaning that some values may be negative (when the largest-magnitude change in fault probability is a decrease). A sketch of all three single-fault measures follows this list.
•Cross-entropy: the insight from Max probability change is limited in the sense that it does not tell us a key piece of information: how likely each of the changes is to happen. For example, a positive test result for cancer will make a huge change in the probability of cancer. However, the probability of seeing a positive test may be very small in a generally healthy person, so, effectively, the expected amount of diagnostic information from performing this test is rather small. Cross-entropy is an information-theoretic measure that takes into account both the amount of information flowing from observing the individual states of an observation variable and the probabilities of observing these states. A high cross-entropy indicates a high expected contribution of observing a variable to the probability of the pursued fault. Cross-entropy is unsigned.
•Normalized cross-entropy: the cross-entropy divided by the current entropy of the pursued fault node.
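To make the three single-fault measures concrete, the sketch below computes them from raw probabilities. It is an illustration only, not the SMILE implementation: in particular, it assumes that the cross-entropy measure is formalized as the expected, outcome-probability-weighted information gain about the fault (the KL divergence between the posterior and prior fault distributions, weighted by the probability of each outcome), which matches the description above. The struct and function names are made up for the example.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Probabilities relevant to one candidate observation (test) T and the
// pursued fault F. Assumes a binary fault with 0 < P(F) < 1.
struct TestInfo {
    double faultPrior;                   // P(F) before the observation
    std::vector<double> outcomeProbs;    // P(T = o) for each outcome o
    std::vector<double> faultPosteriors; // P(F | T = o) for each outcome o
};

// Signed measure: the fault-probability change with the largest magnitude
// over all outcomes of the observation.
double maxProbabilityChange(const TestInfo &t)
{
    double best = 0.0;
    for (std::size_t o = 0; o < t.outcomeProbs.size(); ++o) {
        double change = t.faultPosteriors[o] - t.faultPrior;
        if (std::fabs(change) > std::fabs(best))
            best = change;
    }
    return best;
}

// Binary entropy H(F) of the pursued fault node.
static double binaryEntropy(double p)
{
    double h = 0.0;
    if (p > 0.0) h -= p * std::log2(p);
    if (p < 1.0) h -= (1.0 - p) * std::log2(1.0 - p);
    return h;
}

// Unsigned measure: each outcome's information contribution, weighted by
// how likely that outcome is to be observed.
double crossEntropy(const TestInfo &t)
{
    if (t.faultPrior <= 0.0 || t.faultPrior >= 1.0)
        return 0.0; // a certain fault cannot gain information
    double ce = 0.0;
    for (std::size_t o = 0; o < t.outcomeProbs.size(); ++o) {
        double po = t.outcomeProbs[o];
        if (po <= 0.0) continue; // impossible outcomes contribute nothing
        double post = t.faultPosteriors[o];
        // KL divergence between posterior and prior fault distributions
        double kl = 0.0;
        if (post > 0.0) kl += post * std::log2(post / t.faultPrior);
        if (post < 1.0) kl += (1.0 - post) * std::log2((1.0 - post) / (1.0 - t.faultPrior));
        ce += po * kl;
    }
    return ce;
}

// Cross-entropy divided by the current entropy of the pursued fault node.
double normalizedCrossEntropy(const TestInfo &t)
{
    double h = binaryEntropy(t.faultPrior);
    return h > 0.0 ? crossEntropy(t) / h : 0.0;
}
```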
For multi-fault diagnosis, the following algorithms are available:
•Max probability change (default): the value of the measure is the probability change with the largest magnitude over all pursued faults and all outcomes of the observation. This is a signed measure.
•Euclidean distance, or L2 norm: calculates the distance between two vectors in Euclidean space, where the vector coordinates are the probabilities of the pursued faults before and after the observation. The distance is normalized so that a change from impossible (all coordinates zero) to certain (all coordinates one) is equal to 1.0. For each observation, the selected value is the greatest distance over all observation outcomes. The larger the distance, the larger the impact of the observation. A sketch of the distance-based measures follows this list.
•Cityblock distance: as above, but using the cityblock (L1) metric.
•Averaged L2 and cityblock distance: the mean of the two distances above.
•Cosine distance, or cosine similarity: calculated between the two fault probability vectors. Since the vector coordinates are probabilities, and therefore non-negative, this measure always produces non-negative values (even though cosine similarity in general ranges between -1 and 1).
•A family of six entropy-based measures. These require the joint probability distribution over all pursued faults, which is computationally prohibitive to calculate exactly, so approximations of the joint distribution are used. The approximations are based on two strong assumptions about the dependencies among the faults: (1) complete independence (taken by the first group of approaches) and (2) complete dependence (taken by the second group). Each of the two extremes is further divided into three variants: (1) At Least One, (2) Only One, and (3) All. These refer to different partitionings of the combinations of faults in the cross-entropy calculation.
•Two marginal probability-based measures, which are much faster than the joint distribution approximations above but less accurate, because they make a stronger assumption about the joint probability distribution. Entropy calculations in this approach are based purely on the marginal probabilities of the pursued faults. The two algorithms that use the marginal probability approach differ essentially in the function that they use to score the tests to perform. Both functions are scaled so that they return values between 0 and 1. Entropy/Marginal 1 uses a function without support for maximum distance; its minimum is reached when all probabilities of the faults are equal to 0.5. Entropy/Marginal 2 uses a function that has support for maximum distance and is continuous on the interval [0, 1].
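The distance-based multi-fault measures above are straightforward to express in code. The sketch below is an illustration under the normalizations stated in the list, not the SMILE implementation; `before` and `after` are hypothetical vectors holding the probabilities of the pursued faults prior to and after instantiating one outcome of the observation, and the per-observation value would be the maximum over all outcomes.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Normalized Euclidean (L2) distance: dividing the squared sum by n makes
// the distance between the all-zeros and all-ones vectors equal to 1.0.
double euclideanDistance(const std::vector<double> &before,
                         const std::vector<double> &after)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < before.size(); ++i) {
        double d = after[i] - before[i];
        sum += d * d;
    }
    return std::sqrt(sum / before.size());
}

// Normalized cityblock (L1) distance: dividing by n again maps the
// impossible-to-certain change to 1.0.
double cityblockDistance(const std::vector<double> &before,
                         const std::vector<double> &after)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < before.size(); ++i)
        sum += std::fabs(after[i] - before[i]);
    return sum / before.size();
}

// Mean of the two normalized distances above.
double averagedDistance(const std::vector<double> &before,
                        const std::vector<double> &after)
{
    return 0.5 * (euclideanDistance(before, after) +
                  cityblockDistance(before, after));
}

// Cosine similarity between the two fault probability vectors; since all
// coordinates are non-negative probabilities, the result lies in [0, 1].
double cosineSimilarity(const std::vector<double> &before,
                        const std::vector<double> &after)
{
    double dot = 0.0, nb = 0.0, na = 0.0;
    for (std::size_t i = 0; i < before.size(); ++i) {
        dot += before[i] * after[i];
        nb += before[i] * before[i];
        na += after[i] * after[i];
    }
    if (nb == 0.0 || na == 0.0)
        return 0.0; // degenerate all-zero vector
    return dot / (std::sqrt(nb) * std::sqrt(na));
}
```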