dataset textfile parser

nordic
Posts: 14
Joined: Wed Sep 25, 2013 5:16 pm

Post by nordic »

Hi,

I'm using the ReadFile(fileName) command to read a dataset from a file. I already noticed that if a variable has only integer values, it is assumed to be "discrete". Because this doesn't fit my needs, I decided to manually set those variables as continuous, so that I can discretize them into intervals later. My code is the following:

// The parser detects integer-only variables as discrete.
// Convert those without state names to continuous.
for (int i = 0; i < m_trainData.GetNumberOfVariables(); ++i)
{
    const bool hasStateNames = !m_trainData.GetVariableInfo(i).stateNames.empty();
    if (!m_trainData.IsDiscrete(i) || hasStateNames)
        continue;

    // add a new float variable and look up its index once
    if (m_trainData.AddFloatVar("tmp") != DSL_OKAY)
        return 1;
    const int tmp = m_trainData.FindVariable("tmp");

    // copy the content into the float variable
    for (int j = 0; j < m_trainData.GetNumberOfRecords(); ++j)
        m_trainData.SetFloat(tmp, j, (double) m_trainData.GetInt(i, j));

    // remember the id, then delete the old int variable
    const std::string id = m_trainData.GetId(i);
    if (m_trainData.RemoveVar(i) != DSL_OKAY)
        return 1;

    // give the float variable the original id; look it up again because
    // removing variable i is assumed to shift the indices of later variables
    if (m_trainData.SetId(m_trainData.FindVariable("tmp"), id) != DSL_OKAY)
        return 1;

    // the next variable now sits at index i, so revisit this index
    --i;
}

So, as you can see, my current criterion for detecting variables that are marked discrete but should be continuous is that they don't have any state names. But is this really correct, or is it just a fluke? I read the documentation for the dsl_dataset type, but it seems to be out of date.

Thanks in advance,
nordic
shooltz[BayesFusion]
Site Admin
Posts: 1417
Joined: Mon Nov 26, 2007 5:51 pm

Re: dataset textfile parser

Post by shooltz[BayesFusion] »

nordic wrote: So, as you can see, my current criterion for detecting variables that are marked discrete but should be continuous is that they don't have any state names. But is this really correct, or is it just a fluke?
That's highly dependent on the actual dataset and application. If the dataset column has no states, but the only values are zero and one, it's most likely discrete. If you're writing general-purpose code, you may want to provide the user with the means for overriding the default types.
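One way to let the user override the default types is a lookup table keyed by column id. This is only a minimal sketch in plain C++: VarType, ResolveType and CheckOverrideExample are hypothetical names invented for illustration and are not part of SMILE.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical type tag; not a SMILE type.
enum class VarType { Discrete, Continuous };

// Returns the user-supplied type for columnId if there is one,
// otherwise the type the parser detected.
VarType ResolveType(const std::string& columnId, VarType detected,
                    const std::map<std::string, VarType>& overrides)
{
    const auto it = overrides.find(columnId);
    return it != overrides.end() ? it->second : detected;
}

// Usage check: "salary" is forced to continuous, "gender" keeps its detected type.
void CheckOverrideExample()
{
    std::map<std::string, VarType> overrides;
    overrides["salary"] = VarType::Continuous;
    assert(ResolveType("salary", VarType::Discrete, overrides) == VarType::Continuous);
    assert(ResolveType("gender", VarType::Discrete, overrides) == VarType::Discrete);
}
```

The resolved type would then decide which columns get converted after ReadFile has loaded the data.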
nordic

Re: dataset textfile parser

Post by nordic »

Thanks for the quick reply. So what exactly is the criterion to distinguish between discrete and continuous variables? Is it correct to say a column consisting of only integers is always a discrete variable with no states?
shooltz[BayesFusion]

Re: dataset textfile parser

Post by shooltz[BayesFusion] »

nordic wrote: Thanks for the quick reply. So what exactly is the criterion to distinguish between discrete and continuous variables? Is it correct to say a column consisting of only integers is always a discrete variable with no states?
No, that's not correct. Consider a column without state names containing a range of integer values like [0..N-1]. This can be easily mapped to an N-ary discrete node. On the other hand, you may encounter another integer column without state names, which represents a salary rounded to the nearest monetary unit. This will have the characteristics of a continuous variable, despite using integers only.

The bottom line is that you can't predict how to handle a dataset variable without external knowledge (a context for your application). If you know the context and can make assumptions about the sources of your data, you can derive some heuristic approach based on the number of unique values in the column, its min/max values, etc.
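A heuristic of that kind could look like the sketch below. It is plain C++ and does not touch the SMILE API; ColumnKind, GuessColumnKind and the maxStates threshold are hypothetical names and parameters chosen for illustration, and the right threshold depends entirely on your data.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical type tag; not a SMILE type.
enum class ColumnKind { Discrete, Continuous };

// Treats an integer column as continuous when it has more distinct
// values than maxStates, otherwise as discrete.
// maxStates is an application-specific threshold, not a SMILE setting.
ColumnKind GuessColumnKind(const std::vector<int>& values, std::size_t maxStates)
{
    const std::set<int> unique(values.begin(), values.end());
    return unique.size() > maxStates ? ColumnKind::Continuous
                                     : ColumnKind::Discrete;
}

// Usage check: a 0..2 coded column versus a salary-like column.
void CheckHeuristicExample()
{
    assert(GuessColumnKind({0, 1, 1, 0, 2}, 10) == ColumnKind::Discrete);
    assert(GuessColumnKind({31000, 42500, 28750, 51200}, 3) == ColumnKind::Continuous);
}
```

A min/max check (for example, requiring discrete columns to cover a compact range such as [0..N-1]) could be added in the same way.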
nordic

Re: dataset textfile parser

Post by nordic »

No, that's not correct. Consider a column without state names containing a range of integer values like [0..N-1]. This can be easily mapped to an N-ary discrete node. On the other hand, you may encounter another integer column without state names, which represents a salary rounded to the nearest monetary unit. This will have the characteristics of a continuous variable, despite using integers only.
Yes, I think we misunderstood each other, or rather my question wasn't formulated correctly. It's clear that I need some context knowledge to interpret the column. But my problem is that the built-in parser in the dataset class automatically detects every integer column as discrete, which doesn't fit my needs, so I have to fix it manually.
Since my text data holds only continuous variables (integer or float columns) and discrete variables with state names (strings), it seems straightforward to use the above-mentioned criterion.

In fact, this question:
So what exactly is the criterion to distinguish between discrete and continuous variables? Is it correct to say a column consisting of only integers is always a discrete variable with no states?
has to be reformulated as:

What exactly is the criterion by which the dsl_dataset::readFile() method distinguishes between discrete and continuous variables? Is it correct to say a column consisting of only integers is always detected to be a discrete variable with no states?
shooltz[BayesFusion]

Re: dataset textfile parser

Post by shooltz[BayesFusion] »

nordic wrote: What exactly is the criterion by which the dsl_dataset::readFile() method distinguishes between discrete and continuous variables? Is it correct to say a column consisting of only integers is always detected to be a discrete variable with no states?
A column is considered continuous if it contains only numeric values and at least one of them has a fractional part. Otherwise the column is considered discrete.
nordic

Re: dataset textfile parser

Post by nordic »

But zero fractional parts don't count, so

Var1
22.00
21.00
56.00

is a discrete variable?
shooltz[BayesFusion]

Re: dataset textfile parser

Post by shooltz[BayesFusion] »

nordic wrote: But zero fractional parts don't count, so

Var1
22.00
21.00
56.00

is a discrete variable?
Yes, that's a discrete variable.
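
The rule as described in this thread can be mimicked in a few lines of plain C++. This is only a sketch of the stated behavior, not SMILE's actual implementation: ParseNumber, LooksContinuous and CheckThreadExample are hypothetical names, and missing values are not handled.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <string>
#include <vector>

// Returns true if the whole string parses as a number; the value goes to out.
bool ParseNumber(const std::string& text, double& out)
{
    if (text.empty())
        return false;
    char* end = nullptr;
    out = std::strtod(text.c_str(), &end);
    return end == text.c_str() + text.size();
}

// Continuous: every cell is numeric and at least one cell has a
// nonzero fractional part. Anything else counts as discrete.
bool LooksContinuous(const std::vector<std::string>& column)
{
    bool sawFraction = false;
    for (const std::string& cell : column)
    {
        double value = 0.0;
        if (!ParseNumber(cell, value))
            return false; // a non-numeric cell makes the column discrete
        double integerPart = 0.0;
        if (std::modf(value, &integerPart) != 0.0)
            sawFraction = true;
    }
    return sawFraction;
}

// Usage check with the Var1 column from this thread.
void CheckThreadExample()
{
    assert(!LooksContinuous({"22.00", "21.00", "56.00"})); // discrete
    assert(LooksContinuous({"22.00", "21.50", "56.00"}));  // continuous
}
```

Under this rule, writing integers as 22.00 changes nothing; only a value such as 21.50 makes the column continuous.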