When testing a new data set on a trained network, is it possible to use node states and column values for automatic matching, instead of just labels?
In my use case, network nodes and data set columns that correspond to each other have significantly different labels, so the automatic label-based match doesn't work. However, having an automated workflow is still important.
I'm considering implementing a simulated annealing approach to find the optimal mapping.
Asking here first though, since I'm still discovering new features even after months of experimenting with the BayesFusion ecosystem. :)
Thanks in advance for any info.
Automatic network-data matching based on values?
Re: Automatic network-data matching based on values?
The automatic data column/network node matching is done by comparing node ids and column labels only.
If you plan to automate the matching, please keep in mind that, for performance reasons, the parameter learning and validation algorithms like EM treat the dataset values as integer outcome indices. If your node N has two outcomes, True and False (with True at index 0 and False at index 1), a data element in the matched dataset column will simply set evidence on N to 0 or 1 when processing a dataset record. If the dataset column outcomes are naturally ordered as False at index 0 and True at index 1, the evidence will be set to the wrong outcome. The automatic matching in SMILE takes care of this by rewriting the integers in the dataset column when it is matched to a node.
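To illustrate the index rewriting described above, here is a minimal sketch in plain Python. It does not use or reproduce SMILE's implementation; the function name remap_column and the -1 missing-value marker are assumptions made for this example only.

```python
def remap_column(values, column_labels, node_outcomes, missing=-1):
    """Rewrite integer-coded column values so they index into node_outcomes.

    values        -- integers indexing into column_labels
    column_labels -- outcome labels in the order the dataset column uses
    node_outcomes -- outcome labels in the order the network node uses
    missing       -- hypothetical marker for missing values, left untouched
    """
    index_of = {label: i for i, label in enumerate(node_outcomes)}
    unknown = [label for label in column_labels if label not in index_of]
    if unknown:
        raise ValueError(f"column labels not among node outcomes: {unknown}")
    # translate[i] is the node outcome index for column index i
    translate = [index_of[label] for label in column_labels]
    return [missing if v == missing else translate[v] for v in values]


# The column orders its outcomes False, True; the node orders them True, False.
column = [0, 1, 1, -1, 0]
fixed = remap_column(column, ["False", "True"], ["True", "False"])
print(fixed)
```

Without this step, a record holding False (column index 0) would be read as True (node index 0).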
If your node outcomes and dataset element labels (not the column labels) are already matched, one way to proceed would be to write an algorithm that changes the node ids to match the column labels, and then let SMILE handle the actual data by calling match_network. I assume your column labels can be used as node ids (no spaces; only letters, digits and underscores; starting with a letter). Note that you currently cannot set the column label from PySMILE.
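As a rough sketch of the preparation step for that approach, the plain-Python snippet below checks column labels against the id rules just mentioned and builds a rename plan. The names is_valid_node_id and build_rename_plan are hypothetical, and the node-to-column mapping is assumed to come from your own matching algorithm. Applying the renames to the network and calling match_network is deliberately left out.

```python
import re

# Id rule as stated above: letters, digits and underscores, starting with a letter.
ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_node_id(label):
    """Return True if the column label satisfies the id rule above."""
    return ID_PATTERN.fullmatch(label) is not None


def build_rename_plan(node_to_column):
    """Validate a {current node id: column label} mapping and return it as a plan.

    Raises ValueError if a label cannot serve as a node id or if two nodes
    would end up with the same id.
    """
    invalid = [c for c in node_to_column.values() if not is_valid_node_id(c)]
    if invalid:
        raise ValueError(f"labels unusable as node ids: {invalid}")
    targets = list(node_to_column.values())
    if len(set(targets)) != len(targets):
        raise ValueError("two nodes are mapped to the same column label")
    return dict(node_to_column)


# Hypothetical mapping produced by a value-based matching step.
plan = build_rename_plan({"Node_A": "smoker", "Node_B": "age_group"})
for old_id, new_id in plan.items():
    print(f"rename {old_id} -> {new_id}")
```

Validating the whole plan before touching the network avoids ending up with a half-renamed model if one label turns out to be unusable.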