Dear Community,
I am following the PDF version of the tutorial for the SMILE wrappers, specifically for the Python language, and I am running the following experiment to learn more about the SMILE package.
(1)
I defined a network using only categorical variables in the following Python code:
```
cate_net = pysmile.Network()
A = create_cpt_node(cate_net, "A", "A", ['a1', 'a2', 'a3'], 10, 20)
B = create_cpt_node(cate_net, "B", "B", ['b1', 'b2'], 10, 30)
C = create_cpt_node(cate_net, "C", "C", ['c1', 'c2', 'c3'], 10, 40)
D = create_cpt_node(cate_net, "D", "D", ['d1', 'd2'], 10, 50)
cate_net.add_arc(C, A)
cate_net.add_arc(A, D)
cate_net.add_arc(B, D)
cate_net.write_file("./simulated_data/predefined_cate_net.xdsl")
```
The network file saved from this predefined network looks like this:
```
# predefined network
<nodes>
<cpt id="C">
<state id="c1" />
<state id="c2" />
<state id="c3" />
<probabilities>0.5 0.5 0</probabilities>
</cpt>
<cpt id="A">
<state id="a1" />
<state id="a2" />
<state id="a3" />
<parents>C</parents>
<probabilities>0.5 0.5 0 0.5 0.5 0 0.5 0.5 0</probabilities>
</cpt>
<cpt id="B">
<state id="b1" />
<state id="b2" />
<probabilities>0.5 0.5</probabilities>
</cpt>
<cpt id="D">
<state id="d1" />
<state id="d2" />
<parents>A B</parents>
<probabilities>0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5</probabilities>
</cpt>
</nodes>
```
(2)
I also simulated data based on this predefined network, used the simulated dataset to learn a network with the PC algorithm, and saved the learned network to a file.
Part of that network file is shown below:
```
# learned network
<cpt id="C">
<state id="a1" />
<state id="a2" />
<state id="a3" />
<probabilities>0.3333333333333333 0.3333333333333333 0.3333333333333333</probabilities>
</cpt>
<cpt id="A">
<state id="c1" />
<state id="c2" />
<state id="c3" />
<parents>C</parents>
<probabilities>0.3333333333333333 0.3333333333333333 0.3333333333333333 0.3333333333333333 0.3333333333333333 0.3333333333333333 0.3333333333333333 0.3333333333333333 0.3333333333333333</probabilities>
</cpt>
```
My question is:
Why do the probabilities in the predefined network look wrong? For instance, shouldn't the probabilities of variable C be something like `<probabilities>0.3333333333333333 0.3333333333333333 0.3333333333333333</probabilities>`, i.e. all equal?
Why is it `<probabilities>0.5 0.5 0</probabilities>` when saved from the predefined network?
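To show what I mean, the same defaults should also be visible when reading the CPTs back from the network object. A small check (a sketch only, using the `cate_net` and node handles from the code above):
```
# Print the raw CPT that SMILE stores for each node right after the
# structure is created (no parameters have been set or learned yet).
for handle in [C, A, B, D]:
    print(cate_net.get_node_id(handle), cate_net.get_node_definition(handle))
# For C this should print [0.5, 0.5, 0.0], matching the saved XDSL above.
```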
I have attached the generated network files to this post. Since it is not possible to attach a .py file, the Python code is pasted below (in case you want to run it):
```
# %% [markdown]
# # Simulate Categorical Network
# %%
# env: smile2
import pandas as pd
import pysmile
import pysmile_license
import numpy as np
# %%
!pip show pysmile
# %% [markdown]
# ## Create simulated data from a network with discrete variables
#
# To simulate categorical variables $A, B, C, D$ with the following Bayesian network dependencies:
#
# * $C \rightarrow A$
# * $A, B \rightarrow D$
#
# We'll:
#
# 1. Sample $C$ independently.
# 2. Sample $A$ conditioned on $C$.
# 3. Sample $B$ independently.
# 4. Sample $D$ conditioned on $A$ and $B$.
# %%
simulated_dataset_path = "./simulated_data/simulated_cate_data.csv"
# %%
import numpy as np
import pandas as pd
# Define categories
categories_C = ['c1', 'c2', 'c3']
categories_A = ['a1', 'a2', 'a3']
categories_B = ['b1', 'b2']
categories_D = ['d1', 'd2']
n_samples = 10000
# 1. Sample C independently
C = np.random.choice(categories_C, size=n_samples, p=[0.3, 0.4, 0.3])
# 2. Sample A conditioned on C
# Conditional probability table: P(A | C)
P_A_given_C = {
    'c1': [0.7, 0.2, 0.1],
    'c2': [0.2, 0.5, 0.3],
    'c3': [0.1, 0.3, 0.6],
}
A = np.array([
    np.random.choice(categories_A, p=P_A_given_C[c_val])
    for c_val in C
])
# 3. Sample B independently
B = np.random.choice(categories_B, size=n_samples, p=[0.6, 0.4])
# 4. Sample D conditioned on A and B
# Conditional probability table: P(D | A, B)
P_D_given_AB = {
    ('a1', 'b1'): [0.9, 0.1],
    ('a1', 'b2'): [0.6, 0.4],
    ('a2', 'b1'): [0.5, 0.5],
    ('a2', 'b2'): [0.4, 0.6],
    ('a3', 'b1'): [0.3, 0.7],
    ('a3', 'b2'): [0.2, 0.8],
}
D = np.array([
    np.random.choice(categories_D, p=P_D_given_AB[(a_val, b_val)])
    for a_val, b_val in zip(A, B)
])
# Create DataFrame
df = pd.DataFrame({'C': C, 'A': A, 'B': B, 'D': D})
print(df.head())
# Save to CSV
df.to_csv(simulated_dataset_path, index=False)
# %%
cate_ds = pysmile.learning.DataSet()
try:
    cate_ds.read_file(simulated_dataset_path)
except pysmile.SMILEException:
    print("Dataset load failed")
#endtry
print(f"Dataset has {cate_ds.get_variable_count()} variables (columns) "
      + f"and {cate_ds.get_record_count()} records (rows)")
# %% [markdown]
# ## Structure learning from simulated dataset
# %%
pc = pysmile.learning.PC()
try:
    pattern = pc.learn(cate_ds)
except pysmile.SMILEException:
    print("PC failed")
#endtry
learned_cate_net = pattern.make_network(cate_ds)
print("PC finished, proceeding to parameter learning")
learned_cate_net.write_file("./simulated_data/learned_cate_net.xdsl")
# %%
def create_cpt_node(net, id, name, outcomes, x_pos, y_pos):
    handle = net.add_node(pysmile.NodeType.CPT, id)
    net.set_node_name(handle, name)
    net.set_node_position(handle, x_pos, y_pos, 85, 55)
    initial_outcome_count = net.get_outcome_count(handle)
    for i in range(0, initial_outcome_count):
        net.set_outcome_id(handle, i, outcomes[i])
    for i in range(initial_outcome_count, len(outcomes)):
        net.add_outcome(handle, outcomes[i])
    return handle
# %% [markdown]
# To simulate categorical variables $A, B, C, D$ with the following Bayesian network dependencies:
#
# * $C \rightarrow A$
# * $A, B \rightarrow D$
# %%
# Define categories
# categories_C = ['c1', 'c2', 'c3']
# categories_A = ['a1', 'a2', 'a3']
# categories_B = ['b1', 'b2']
# categories_D = ['d1', 'd2']
cate_net = pysmile.Network()
A = create_cpt_node(cate_net, "A", "A", ['a1', 'a2', 'a3'], 10, 20)
B = create_cpt_node(cate_net, "B", "B", ['b1', 'b2'], 10, 30)
C = create_cpt_node(cate_net, "C", "C", ['c1', 'c2', 'c3'], 10, 40)
D = create_cpt_node(cate_net, "D", "D", ['d1', 'd2'], 10, 50)
cate_net.add_arc(C, A)
cate_net.add_arc(A, D)
cate_net.add_arc(B, D)
cate_net.write_file("./simulated_data/predefined_cate_net.xdsl")
# %%
em = pysmile.learning.EM()
try:
    matching = cate_ds.match_network(cate_net)
except pysmile.SMILEException:
    print("Can't automatically match network with dataset")
#endtry
em.set_uniformize_parameters(False)
em.set_randomize_parameters(False)
em.set_eq_sample_size(0)
try:
    em.learn(cate_ds, cate_net, matching)
except pysmile.SMILEException:
    print("EM failed")
#endtry
print("EM finished")
cate_net.write_file("simulated_data_em_cate.xdsl")
print("Complete.")
```
How is the default probability set in a network with only categorical variables?
Attachments:
- predefined_cate_net.xdsl
- learned_cate_net.xdsl
Site Admin:
Re: How is the default probability set in a network with only categorical variables?
```
A = create_cpt_node(cate_net, "A", "A", ['a1', 'a2', 'a3'], 10, 20)
```
Creating a node this way only gives it default parameters: a fresh CPT node starts with two outcomes sharing a uniform 0.5/0.5 distribution, and every outcome added afterwards gets probability zero, which is exactly where the `0.5 0.5 0` entries in your saved file come from. It's up to the SMILE user to provide the actual values for the CPTs, possibly by learning parameters from data with EM.
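For example, to put the distributions you used when simulating the data into the predefined network, you could set each CPT explicitly with `set_node_definition` after the arcs are added. A minimal sketch, reusing the node handles and probabilities from your own code (values are ordered with the parent states changing slowest and the node's own outcomes changing fastest):
```
# Sketch: fill the CPTs with the distributions used in the simulation code.
cate_net.set_node_definition(C, [0.3, 0.4, 0.3])   # P(C)
cate_net.set_node_definition(B, [0.6, 0.4])        # P(B)
cate_net.set_node_definition(A, [0.7, 0.2, 0.1,    # P(A | C=c1)
                                 0.2, 0.5, 0.3,    # P(A | C=c2)
                                 0.1, 0.3, 0.6])   # P(A | C=c3)
cate_net.set_node_definition(D, [0.9, 0.1,         # P(D | a1, b1)
                                 0.6, 0.4,         # P(D | a1, b2)
                                 0.5, 0.5,         # P(D | a2, b1)
                                 0.4, 0.6,         # P(D | a2, b2)
                                 0.3, 0.7,         # P(D | a3, b1)
                                 0.2, 0.8])        # P(D | a3, b2)
```
After this, `write_file` would save the intended probabilities instead of the defaults.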
Re: How is the default probability set in a network with only categorical variables?
Currently I am just learning how to use the package; my final goal is to use it in a real-world project.
In that project, we first want to use our own causal discovery algorithm to generate the skeleton of the network (similar to how the PC algorithm outputs a skeleton), and then use SMILE to learn the parameters of the network from our real-world dataset.
Our causal discovery algorithm gives us an adjacency matrix that represents the network structure.
For categorical variables, is it OK to create the network structure as demonstrated by the code below?
```
cate_net = pysmile.Network()
A = create_cpt_node(cate_net, "A", "A", ['a1', 'a2', 'a3'], 10, 20)
B = create_cpt_node(cate_net, "B", "B", ['b1', 'b2'], 10, 30)
C = create_cpt_node(cate_net, "C", "C", ['c1', 'c2', 'c3'], 10, 40)
D = create_cpt_node(cate_net, "D", "D", ['d1', 'd2'], 10, 50)
cate_net.add_arc(C, A)
cate_net.add_arc(A, D)
cate_net.add_arc(B, D)
```
Later we will call the EM algorithm for parameter learning on this constructed network `cate_net`; the initial numbers like 0.5, 0.5, 0, etc. do not matter then, right?
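To make the question concrete, this is roughly how we plan to turn the adjacency matrix into a network structure before calling EM. A sketch only: the variable names, outcome lists, and adjacency matrix below are made-up placeholders, and it reuses the `create_cpt_node` helper from my code above.
```
import numpy as np
import pysmile
import pysmile_license

# Placeholder output of our own causal discovery algorithm:
# adj[i, j] == 1 means an arc from variables[i] to variables[j].
variables = ["C", "A", "B", "D"]
outcomes = {"C": ['c1', 'c2', 'c3'], "A": ['a1', 'a2', 'a3'],
            "B": ['b1', 'b2'], "D": ['d1', 'd2']}
adj = np.array([[0, 1, 0, 0],   # C -> A
                [0, 0, 0, 1],   # A -> D
                [0, 0, 0, 1],   # B -> D
                [0, 0, 0, 0]])

# Create one CPT node per variable, then add the arcs from the matrix.
net = pysmile.Network()
handles = {v: create_cpt_node(net, v, v, outcomes[v], 10, 20 + 10 * i)
           for i, v in enumerate(variables)}
for i, src in enumerate(variables):
    for j, dst in enumerate(variables):
        if adj[i, j]:
            net.add_arc(handles[src], handles[dst])
# After this, EM would be run on `net` exactly as in the code above.
```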