Dear SMILE creators,
Thank you for your time! I want to prepare the input data for training a Hidden Markov Model (a special type of DBN) using jSMILE, and I have some questions that do not seem to be answered by previous posts.
My data consists of multiple samples. Each sample (corresponding to a student) is a sequence of observed performance (correct/incorrect). Different samples can have different lengths. In a toy example, the 1st sample has 3 observed data points (time slices) and the 2nd sample has 5 observed data points (time slices). My questions are:
-- Should I set the number of columns for each variable equal to the length of the longest sequence among all samples (e.g., 5 in this toy example)?
-- Is there a limit (or suggested limit) on the number of columns (i.e., the length of a sample)? In my real dataset, the longest sample has 70 time slices (resulting in 140 columns once the hidden variables are included), and the average is 21 time slices per sample. However, I only have 289 samples for this HMM. Do you have any suggestions on how I should process this data and use it as input for jSMILE?
-- Should I put missing-value symbols at the end of a sample if it is shorter than the longest sequence (e.g., in the toy example, missing-value symbols for the last 2 time slices of the 1st sample)?
Here is how I prepare the data for this toy example. K denotes the hidden variable, C denotes the observed variable.
   K    K_1  K_2  K_3  K_4  C    C_1  C_2  C_3  C_4
   *    *    *    *    *    f    f    t    *    *
   *    *    *    *    *    f    f    f    t    t
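The layout above can be produced mechanically from the raw sequences. The following is a minimal Python sketch (independent of jSMILE) that assumes the column naming and the `*` missing-value symbol used in the toy example; the function name `to_wide_rows` is made up for illustration, and whether this padding is the right input for SMILE's EM is exactly the open question here.

```python
MISSING = "*"  # missing-value symbol, as used in the toy example


def to_wide_rows(sequences, hidden="K", observed="C"):
    """Pad variable-length sequences into fixed-width rows, one row per sample."""
    n_slices = max(len(seq) for seq in sequences)

    def names(base):
        # Slice 0 uses the bare name; later slices get a _t suffix.
        return [base] + [f"{base}_{t}" for t in range(1, n_slices)]

    header = names(hidden) + names(observed)
    rows = []
    for seq in sequences:
        padding = [MISSING] * (n_slices - len(seq))
        # The hidden variable is never observed, so all its columns are missing.
        rows.append([MISSING] * n_slices + list(seq) + padding)
    return header, rows


header, rows = to_wide_rows([["f", "f", "t"], ["f", "f", "f", "t", "t"]])
print(" ".join(header))
for row in rows:
    print(" ".join(row))
```

Short sequences are padded only at the end, so the observed values always occupy the earliest slices.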
-- When I create the network file (xdsl) and set "numslices", should I first find the length of the longest sequence in my data (e.g., 5 in this toy example) and then set "numslices" to that number? I found that its value affects the learned parameters.
	<dynamic numslices="5">
		<cpt id="K1" order="1">
			<parents>K1</parents>
			<probabilities>0.9 0.1 0.3 0.7</probabilities>
		</cpt>
	</dynamic>
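If "numslices" is tied to the longest sequence, as the question proposes, the value could be derived from the data instead of being typed by hand. Below is a small Python sketch using the standard `xml.etree.ElementTree` module on the fragment above; `set_numslices` is a hypothetical helper, and the sketch does not settle whether the maximum length is the correct setting.

```python
import xml.etree.ElementTree as ET

# The <dynamic> fragment from the post, used here as standalone XML.
XDSL_FRAGMENT = """<dynamic numslices="5">
    <cpt id="K1" order="1">
        <parents>K1</parents>
        <probabilities>0.9 0.1 0.3 0.7</probabilities>
    </cpt>
</dynamic>"""


def set_numslices(dynamic_xml, sequences):
    """Return the fragment with numslices set to the longest sequence length."""
    root = ET.fromstring(dynamic_xml)
    longest = max(len(seq) for seq in sequences)
    root.set("numslices", str(longest))
    return ET.tostring(root, encoding="unicode")


# Hypothetical data: one sample of length 3 and one of length 7.
updated = set_numslices(XDSL_FRAGMENT, [["f", "f", "t"], ["f"] * 7])
print(updated)
```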
Thank you so much for your help! I look forward to hearing from you soon!
Yun
Intelligent Systems Program
University of Pittsburgh
Data input for HMM(DBN) with different lengths across samples
Re: Data input for HMM(DBN) with different lengths across samples
Dear SMILE creators,
I have been trying to answer the question in my previous post and still can't find a solution. I created different datasets and compared the learned parameters with those from BNT (Kevin Murphy's toolkit). In the following two cases, my way of preparing the data (described in my previous post) yields similar learned parameters:
-- when all my HMM samples have the same length (so no missing values are needed at the end of the observables)
-- when half of my HMM samples have only half the length of numslices, which is the maximum length among all samples (i.e., the second half of their observables is filled with the missing-value symbol *), and the remaining half of the samples all have length numslices
However, when I did this on the real data, the parameters learned by the two toolkits differ significantly. One characteristic of my real data is that once I set numslices to the maximum length among all samples, most of the samples have missing values at the end of the observable sequence. The magnitude of the difference doesn't seem to be related to numslices (e.g., I have an HMM with numslices=70, but it doesn't produce the largest difference among my HMMs).
I think an HMM should be a standard DBN model, and it seems typical to have samples with different sequence lengths. I would really appreciate your help with this problem, because I want to keep using the tool and have deadlines approaching. Otherwise, I will have to turn to other, slower toolkits :( Please let me know whether I have misunderstood anything or missed any tutorial/post that already solves this problem. Thanks a lot!
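When comparing parameters learned by two toolkits, it may help to put a number on "differ significantly". Hidden-state labels in an HMM are arbitrary, so two toolkits can learn the same model with the states swapped. The Python sketch below reports the smallest maximum absolute difference over all relabelings of the hidden states. The matrix layout (rows indexed by hidden state) and all function names are assumptions for illustration, not the export format of SMILE or BNT, and relabeling is not claimed to be the cause of the mismatch in this thread.

```python
from itertools import permutations


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def relabel(trans, emit, perm):
    """Reorder hidden states: perm[i] is the old index of new state i."""
    new_trans = [[trans[i][j] for j in perm] for i in perm]
    new_emit = [emit[i] for i in perm]
    return new_trans, new_emit


def best_match_diff(trans_a, emit_a, trans_b, emit_b):
    """Smallest max-abs difference over all hidden-state relabelings of model b."""
    best = None
    for perm in permutations(range(len(trans_a))):
        t, e = relabel(trans_b, emit_b, perm)
        d = max(max_abs_diff(trans_a, t), max_abs_diff(emit_a, e))
        if best is None or d < best:
            best = d
    return best


# Hypothetical parameters: model b is model a with the two states swapped.
trans_a = [[0.9, 0.1], [0.3, 0.7]]
emit_a = [[0.8, 0.2], [0.1, 0.9]]
trans_b = [[0.7, 0.3], [0.1, 0.9]]
emit_b = [[0.1, 0.9], [0.8, 0.2]]
diff = best_match_diff(trans_a, emit_a, trans_b, emit_b)
print(diff)
```

The initial-state distribution is left out for brevity; it would need the same reordering. Enumerating permutations is only practical for a small number of hidden states.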
Re: Data input for HMM(DBN) with different lengths across samples
Based on your description, I might know what the issue is. In SMILE's EM, a time series is shortened to length k if slices k+1 and beyond contain no evidence. The reason is that a time series could go on forever without any evidence, and this (or any arbitrary cutoff point) should not affect the parameters. I'm not 100% sure this is the cause, but it sounds plausible.
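The shortening rule described here can be illustrated as follows. This is only a Python sketch of the stated behaviour, not SMILE's source code; `effective_length` and the `*` missing-value marker are names chosen for the example.

```python
MISSING = "*"  # missing-value symbol, as in the toy data file


def effective_length(observations):
    """Number of slices up to and including the last one that carries evidence."""
    k = len(observations)
    while k > 0 and observations[k - 1] == MISSING:
        k -= 1
    return k


print(effective_length(["f", "f", "t", "*", "*"]))  # trailing gaps are dropped
print(effective_length(["f", "*", "t", "*"]))       # a gap in the middle is kept
```

Under this rule, the first toy sample would be treated as a series of 3 slices no matter how many missing-value symbols are appended to it.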
Perhaps we could add an option to make SMILE not do this. At least then we can compare the outputs and see whether they match (if not, something else must be wrong). Unfortunately, I do not have access to the SMILE source code anymore so someone else would need to help you with that.
Re: Data input for HMM(DBN) with different lengths across samples
Dear Mark,
Thank you for your help. Today I tried again, this time putting the sample with the longest sequence in the first row, hoping that numSlices would be set automatically from the first sample. However, it still produces the same parameters as before (which are significantly different from the ones I get from BNT). I would appreciate it if someone could look into the source code to help with this issue :) Thank you so much for your time!
shooltz[BayesFusion]
- Site Admin
- Posts: 1477
- Joined: Mon Nov 26, 2007 5:51 pm
Re: Data input for HMM(DBN) with different lengths across samples
Yun,
We're considering adding an option to control the slice count handling in EM/DBN. What is the platform you're using jSMILE on?
Re: Data input for HMM(DBN) with different lengths across samples
Dear Shooltz,
Thank you so much for taking the time to consider adding the option! I really appreciate it!
I am using Mac OS X (64-bit, 10.10.2) and have been using jSMILE in my Eclipse Java projects. Also, I have successfully deployed some BNs (not DBNs, though) that I built with jSMILE on my Mac onto a Windows server in our lab for last semester's classroom and user studies.
It would be great if this option could be added, so that I can keep using jSMILE to build (more complex) DBNs for our project! Looking forward to hearing from you!
Thanks,
Yun
shooltz[BayesFusion]
Re: Data input for HMM(DBN) with different lengths across samples
I've sent you a private message with the download link included.
Re: Data input for HMM(DBN) with different lengths across samples
Thank you soooo much! I really appreciate it!!!
Thanks,
Yun