Very long training Time with smaller network
Hello,
I have built a network with 6,500,000 parameters, 13 nodes, and 217 states. I trained it with 2,500,000 rows of training data.
It took 110 minutes. That's okay.
To make the net a bit smaller I used the exact same network, but merged each group of three states in node one into a single state. Every other node and every connection stayed the same. The amount of training data was also the same; I only adapted it to the changed node.
So I had 2,200,000 parameters, 13 nodes, 106 states.
So I decreased the parameter count to 33%.
I let it train using the normal EM algorithm in GeNIe with random initial parameters. But now it has been training for 1131 minutes; that's over 18 hours.
I don't want to stop it because I fear I would need to do this again. But what's wrong?
Meanwhile I started training the same smaller network on two other cores, but stopped it after 4 hours.
Once I train the old network (just MORE states in one node), it finishes in 110 minutes.
What could be wrong? Is it possible that a smaller network with the same node count and the same connections, but fewer states in one node, needs 10x the training time?
Or is the EM algorithm only efficient if the amount of training data is less than the parameter count? That does not sound logical to me, but it is the only difference I can see.
I have a Quad Core Q6600, 64-bit, and 4 GB RAM.
I really hope you can help me, as I need to train this net several more times to build the averages.
Re: Very long training Time with smaller network
Christian wrote:
So I decreased the parameter count to 33%.
I let it train using the normal EM algorithm in GeNIe with random initial parameters. But now it has been training for 1131 minutes; that's over 18 hours.
I don't want to stop it because I fear I would need to do this again. But what's wrong?

I'm not sure what's going on. The only thing I can think of at the moment is that, for some reason, the convergence is slower in the smaller network. Can you share the network and the data so that I can take a look at it?
The network is a very big one and the data is confidential. I hope you can solve the problem without seeing the data. But I could create a similar network with the same nodes (renamed), same connections, and same states.
I would have no training data for this one.
But since I could just rename everything in the already trained networks, you could easily create your own data using GeNIe.
Would that help?
But good to know: the training finished after 1864 minutes. So it needs 17x the training time with the same data.
Other things I think could cause this error:
1) The small network is 40 MB, the big one 120 MB. Maybe you handle large networks in a different way? Maybe you allocate memory in a different way?
2) Maybe you use a different implementation of the algorithm for bigger networks?
Do you really think slower convergence could lead to 17 times slower training?
All networks are learned in the same way, so that cannot cause the difference. But I thought of an explanation of this phenomenon that has to do with the number of records per parameter. In the slow case, there are many more records per parameter, so the parameter values might fluctuate more compared to the faster case. And because the changes in the parameters are bigger, it is not unreasonable to think that the convergence is slower. Another way of putting it is that a lower number of parameters makes the learning problem in some sense more constrained and finding a good solution more difficult. What do you think of this explanation?
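This convergence behavior is easier to see on a toy problem. The sketch below (plain Python, nothing to do with GeNIe's actual implementation) runs EM on a mixture of two biased coins and stops when the per-iteration log-likelihood gain drops below a tolerance, which is the usual kind of stopping rule. The point is that EM's iteration count depends on how fast the likelihood improvements shrink, not on the parameter count itself:

```python
import math

def em_two_coins(data, n_flips, max_iters=200, tol=1e-6):
    """Toy EM for a mixture of two biased coins.

    data: list of head counts, each from n_flips tosses of one of two coins.
    Returns (pi, p1, p2, iterations_used).
    """
    pi, p1, p2 = 0.5, 0.3, 0.7              # arbitrary initial parameters
    prev_ll = -math.inf
    for it in range(1, max_iters + 1):
        ll = 0.0
        r_sum = h1 = n1 = h2 = n2 = 0.0
        for h in data:                       # E-step: coin-1 responsibilities
            l1 = pi * p1 ** h * (1 - p1) ** (n_flips - h)
            l2 = (1 - pi) * p2 ** h * (1 - p2) ** (n_flips - h)
            r = l1 / (l1 + l2)
            ll += math.log(l1 + l2)
            r_sum += r
            h1 += r * h
            n1 += r * n_flips
            h2 += (1 - r) * h
            n2 += (1 - r) * n_flips
        pi, p1, p2 = r_sum / len(data), h1 / n1, h2 / n2   # M-step
        if ll - prev_ll < tol:               # converged: tiny likelihood gain
            break
        prev_ll = ll
    return pi, p1, p2, it
```

With two well-separated coins (e.g. 0.2 and 0.8) this typically converges in a handful of iterations; the stopping criterion is a likelihood-gain threshold, not a fixed iteration count, which is why one model can need many more iterations than another on the same data.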
I am not that good at understanding the EM algorithm, but what you say sounds reasonable.
Since the parameters fluctuate much more in the slower network, do you think the faster-trained network will produce better forecasts, as the values are clearer (not fluctuating so much) and thus more correct?
Christian wrote:
I am not that good at understanding the EM algorithm, but what you say sounds reasonable.
Since the parameters fluctuate much more in the slower network, do you think the faster-trained network will produce better forecasts?

No, the fluctuations only occur during learning and don't affect the performance of the network afterwards. I think the network with fewer parameters will perform better, because there is more data to learn each parameter. But of course it is also a question of how well the model fits the real world. It could be that the model with more parameters fits better in this case.
I had the problem that some states in the node with many states showed 0, I think due to a lack of available information. So I merged each group of three logical states into one. I thought I would get better results this way.
Calculation speed does not make such a difference. I just thought that when it is so complicated for the EM algorithm to find the correct values, the later prognosis might also be less accurate, as the EM algorithm may have made compromises.
Would it be possible to implement some kind of remaining-time estimation in GeNIe while parameter learning is active?
Christian wrote:
I had the problem that some states in the node with many states showed 0, I think due to a lack of available information. So I merged each group of three logical states into one. I thought I would get better results this way.

If you have zero probabilities in the learned CPTs, you could also use a prior to prevent this from happening. The reason they occur is that there is no record that actually matches the configuration of states for that parameter.
Christian wrote:
Would it be possible to implement some kind of remaining-time estimation in GeNIe while parameter learning is active?

I'm not sure how hard time estimation would be; I never really looked into it. Maybe it is possible to recognize a trend in convergence and predict the time remaining.
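As a rough illustration of that idea (purely hypothetical, not GeNIe code): if the per-iteration log-likelihood improvements shrink roughly geometrically, one can estimate the decay ratio from recent iterations and extrapolate how many iterations remain until the improvement falls below the convergence tolerance. Multiplying by the average wall-clock time per iteration would then give a remaining-time estimate:

```python
import math

def estimate_remaining_iterations(improvements, tol):
    """Extrapolate EM iterations left, assuming the per-iteration
    log-likelihood improvement d_k shrinks geometrically: d_k ~ d_0 * r^k.

    improvements: recent per-iteration log-likelihood gains (at least two,
    all positive). tol: learning stops when the gain falls below this.
    Returns 0 if already converged, None if no geometric trend is visible.
    """
    if improvements[-1] <= tol:
        return 0
    # Estimate the decay ratio r from consecutive improvements.
    ratios = [b / a for a, b in zip(improvements, improvements[1:]) if a > 0]
    r = sum(ratios) / len(ratios)
    if r >= 1.0:            # improvements not shrinking; no estimate possible
        return None
    # Solve d_last * r^k < tol  =>  k > log(tol / d_last) / log(r)
    k = math.log(tol / improvements[-1]) / math.log(r)
    return math.ceil(k)
```

For example, if the gains so far were 100, 50, 25, 12.5, 6.25 and the tolerance is 0.01, the estimated ratio is 0.5 and about 10 more iterations would be predicted.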
Hello Marc,
what exactly do you mean by using a prior? Could you please give me a little example?
I have another idea about what the reason for the long training time could be.
I have GeNIe open three times and am training a network in each instance, as I have a quad core and need many networks to build the averages.
Could GeNIe get confused by this? For example, could the instances try to use the same memory allocations, or something like that?
Christian wrote:
What exactly do you mean by using a prior? Could you please give me a little example? Could GeNIe get confused by running three instances at once?

By prior I mean the equivalent sample size parameter that is part of the EM class. Basically, you can give a weight to the parameters in the network so that they are not overwritten by learning, but updated. If you run three different GeNIe instances, they should not interfere with each other, because they are separate OS processes and cannot access each other's memory directly.
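As a little example of the effect of such a prior (illustrative arithmetic only, not the actual EM internals): an equivalent sample size acts like pseudo-counts added to the observed counts, so a state configuration that never occurs in the data still ends up with a small nonzero probability instead of a hard zero:

```python
def posterior_cpt_column(counts, prior_probs, ess):
    """Combine observed counts with a prior weighted by an
    equivalent sample size (ESS), Dirichlet-style:

        p_i = (count_i + ess * prior_i) / (N + ess)
    """
    total = sum(counts) + ess
    return [(c + ess * p) / total for c, p in zip(counts, prior_probs)]

# A state never seen in the data (count 0) gets probability 0 with
# plain maximum likelihood, but stays positive with a prior:
ml = posterior_cpt_column([40, 60, 0], [1/3, 1/3, 1/3], ess=0)
smoothed = posterior_cpt_column([40, 60, 0], [1/3, 1/3, 1/3], ess=3)
```

Here `ml` is [0.4, 0.6, 0.0], while `smoothed` gives the unseen state a small positive probability (1/103), which is exactly the kind of zero you were seeing in the node with many states.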
Can you post your data and networks (or at least the obfuscated ones) so that I can take a look at it if I have some spare time today?
Thank you for your offer. I think you were right with the assumption that the training took longer because the values had problems converging to the optimum.
I had two nodes, each with 6 different incoming connections from other nodes. Let's call these two nodes, which receive all those incoming connections, root nodes.
Both root nodes were only connected via two other nodes.
The training data consists of 2,400,000 datasets. 2,100,000 contain only values for the nodes connecting to root node 1 (and that node itself, and the interconnection nodes to root node 2).
The other 300,000 datasets contained training data for all nodes.
The training was also fast (2 hours) when I did one of the following:
Only used the 2,100,000 training datasets for root node 1.
Only used the 300,000 training datasets for root node 2.
Eliminated either root node 1 or root node 2 (and all its parent nodes).
So the training takes very long when I have training data from all nodes and the interconnection node has fewer states.
When the interconnection node has more states, the EM algorithm converges much faster to the optimum, as there are more possibilities to find an optimum.
The problem is that if I post the obfuscated node setup, I have no training data (as it is not obfuscated). If you let GeNIe extract training values, you would get values where all nodes have data. That way the net will converge very fast, I think.
But if you could print out the convergence ratio while training, that would be nice. That way I would know how long I need to wait.
If you send me the network, I can generate data using GeNIe and write a small program in SMILE to remove some of the values (or, if you want, you can do it yourself). Then I'll be able to inspect the convergence and also use a profiler to see in which methods the program spends most of its time, and then apply some optimizations. Having all this set up I will also experiment with predicting the remaining learning time, which can then be reported in GeNIe. I think that would be very useful. Thanks for all your input!
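That value-removing step could be sketched roughly like this (a hypothetical layout is assumed: each record is a row with Node1..Node13 as columns, the first 8 always observed). It keeps Node1 to Node8 intact and blanks Node9 to Node13 in about 80% of the records, mimicking the real data source:

```python
import random

def blank_tail_columns(rows, keep_cols=8, keep_fraction=0.2, rng=None):
    """Blank every column after keep_cols in all but keep_fraction of
    the records, so only ~20% of rows carry data for the tail nodes
    (here: Node9 to Node13)."""
    rng = rng or random.Random()
    out = []
    for row in rows:
        if rng.random() < keep_fraction:
            out.append(list(row))            # full record, all 13 nodes
        else:
            # partial record: tail columns become missing values
            out.append(list(row[:keep_cols]) + [""] * (len(row) - keep_cols))
    return out
```

The rows could come from `csv.reader` over a data file GeNIe generated and go back out through `csv.writer`; the empty string stands in for whatever missing-value marker the learning tool expects.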
Hello Marc,
I have prepared both networks. I am just compressing them and will upload them via RapidShare. I will send you the link via PM.
I really hope it helps to improve GeNIe's EM algorithm implementation.
When you write the software that eliminates about 80 percent of the dataset, maybe you could just extend GeNIe? Something like "only generate x% data for row x".
However, I made a mistake in the training data counts I posted from memory last time. The settings were the following for both networks:
1,157,000 datasets with Node1 to Node8; Node9 to Node13 empty
237,000 datasets with data for Node1 to Node13
Average training times:
Big network: 110 minutes
Small network: about 2,400 minutes.
Every network was trained about 5 times to get averages.
By the way, I have found a good way to simulate soft evidence:
For example, if I want to set 30% state1 and 70% state2, I just set 100% state1 and multiply the results by 0.3. After that I set 100% state2 and multiply those results by 0.7.
Then I add both values together and I have soft evidence. Sounds very simple.
Maybe I am making a big mistake in my reasoning, but if not, you could very easily adopt this in SMILE.
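For what it's worth, the trick described above is essentially Jeffrey's rule: the posterior under soft evidence on a node is the mixture of the posteriors under each hard state, weighted by the soft probabilities. A tiny sketch with made-up numbers (note this differs from virtual/likelihood evidence, where the weights would multiply the likelihood instead of the posterior):

```python
def mix_posteriors(weights, posteriors):
    """Soft evidence via mixing: P'(Q) = sum_s w_s * P(Q | E = s)."""
    n = len(posteriors[0])
    return [sum(w * p[i] for w, p in zip(weights, posteriors))
            for i in range(n)]

# Posterior of a query node Q given each hard state of the evidence node E:
p_q_given_e1 = [0.9, 0.1]   # results after setting 100% state1
p_q_given_e2 = [0.2, 0.8]   # results after setting 100% state2

# Soft evidence of 30% state1 / 70% state2:
p_q = mix_posteriors([0.3, 0.7], [p_q_given_e1, p_q_given_e2])
# 0.3 * 0.9 + 0.7 * 0.2 = 0.41   and   0.3 * 0.1 + 0.7 * 0.8 = 0.59
```

The mixture always sums to 1 as long as each per-state posterior does, so the two-run-and-add procedure produces a valid distribution.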
AFAIK, if you set the missing value to 80%, then 80% of the data is missing at random. But that is not the true data source: Node1 to Node8 have 100% data, and Node9 to Node13 about 20%.
If you set the missing value to 80%, then data is missing everywhere, not only for Node9 to Node13. At least this is how I interpret GeNIe's missing value function.