Data format
The data used by all GeNIe learning functionality has to conform to certain format constraints. Columns in a data file correspond to variables, and rows (records) correspond to individual combinations of their values. Variable names (the first row of the data file) should be strings of alphanumeric characters, with no spaces and no special characters other than the underscore. Values of variables should likewise be strings of alphanumeric characters with no spaces and no special characters other than the underscore, or, in the case of continuous variables, numbers. Letters include a-z and A-Z, as well as all Unicode characters above code point 127, which makes it possible to use alphabets other than the Latin alphabet.
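As an illustration, below is a small, hypothetical data file that conforms to these constraints. The first row holds the variable names, each subsequent row is one record, and Age is continuous, so its values are numbers (the variable names, values, and the use of spaces as separators are all illustrative):

    Smoking Age Cancer
    non_smoker 34 absent
    smoker 52 present
    non_smoker 47 absent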
GeNIe is somewhat tolerant of minor departures from these requirements, but we advise strict adherence, because a departure from the required format will lead to GeNIe replacing illegal characters, which may later cause a mismatch between a model and the data from which it was learned. Even if you do depart from strict adherence to the format, please make sure that you avoid certain characters in variable names. We have found that the underlying database in which we save data in GeNIe format does not like the "/" and "." characters.
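If you prepare data files programmatically, it can be safer to sanitize variable names yourself before importing, so that you control the mapping instead of relying on GeNIe's automatic replacement. Below is a minimal Python sketch under the assumptions that the input is a comma-separated file and that the replacement character is an underscore (the file names and the replacement rule are placeholders, not part of GeNIe):

    import re

    def sanitize(name):
        # Keep letters (including Unicode letters above code point 127),
        # digits, and underscores; replace everything else, notably
        # '/', '.', and spaces, with an underscore.
        return re.sub(r"\W", "_", name)

    with open("raw.csv") as src, open("clean.csv", "w") as dst:
        header, *records = src.read().splitlines()
        # Only the header (variable names) is sanitized here; values
        # follow the same rules and could be cleaned the same way.
        dst.write(",".join(sanitize(n) for n in header.split(",")) + "\n")
        dst.write("\n".join(records) + "\n")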
Users often ask us about limitations on the size of the data file: (1) What is the largest number of records in a data file that GeNIe can handle? and (2) What is the minimum required number of records in a data file? Neither question has a clear-cut answer.
There is no such thing as too many data records in learning: as their number increases, so does the quality of the learned structures and parameters. GeNIe deals with data quite efficiently and usually compresses them before or during learning. Should the number of data records slow learning down (we have not encountered this, but it is theoretically possible), we suggest reducing the data set by selecting a representative sample of the records, as in the sketch below. As long as the sample reflects the properties of the original data, the learning procedure will remain sound. To ensure that the sample is representative, give every member of the population (the original, large set of records) an equal chance of being selected.
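A simple random sample, drawn without replacement so that each record has the same probability of being selected, satisfies this requirement. A minimal Python sketch (the file names and sample size are placeholders):

    import random

    SAMPLE_SIZE = 100_000  # placeholder: pick a size that learning handles comfortably

    with open("full_data.txt") as f:
        header, *records = f.read().splitlines()

    # random.sample draws without replacement, giving every record
    # an equal chance of being selected.
    subset = random.sample(records, k=min(SAMPLE_SIZE, len(records)))

    with open("sample_data.txt", "w") as f:
        f.write(header + "\n")
        f.write("\n".join(subset) + "\n")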
Too few records in a data file is a much more common problem in practice. Here, there is no clear-cut answer as to what the minimum should be, as it depends on the properties of the underlying model. We have seen the heuristic "at least ten times as many records as there are variables," but it is really just a heuristic and can easily be criticized. One way of estimating the number of records needed to learn a network is to look at the largest CPT (conditional probability table) in the model. When learning parameters, all records in the data are distributed among the columns of that largest CPT. Suppose we have 100 records in the data file and the largest CPT consists of 32 columns. The hundred records will be distributed among the 32 columns, which gives roughly three records per column on average. It is hard to learn a probability distribution from just three data points. A further difficulty is that the distribution of records among the 32 columns will not be uniform, so many columns will receive zero records. This is a rough, order-of-magnitude argument, but it should give an idea of how many records are needed to achieve reasonable accuracy.
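This back-of-the-envelope calculation is easy to reproduce. The Python sketch below (all numbers are illustrative, not taken from any particular model) computes the number of columns in a node's CPT from the state counts of its parents, along with the resulting average number of records per column:

    # Illustrative numbers: a node whose parents have these state counts.
    parent_state_counts = [2, 4, 4]  # 2 * 4 * 4 = 32 columns in the CPT
    num_records = 100

    columns = 1
    for states in parent_state_counts:
        columns *= states

    print(f"CPT columns: {columns}")                           # 32
    print(f"Records per column: {num_records / columns:.2f}")  # ~3.12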