Discretize floats with exponential notation

Post by **rickyegeland** » Fri Apr 12, 2013 2:29 am

I've run into a problem with DSL_dataset::Discretize() and a column with floats in exponential notation. Is exponential notation not supported by Discretize()?
I have the following data:

tot_flux	gwill	avg_grad	eff_sep	pil_len	fut_fl_gtM_48
2.46E+22	3853.68	22.699	34.3525	169.773	F
5.89E+22	21875.6	44.9802	82.9117	486.339	F
2.41E+22	2962.65	19.0078	31.3857	155.865	F
2.39E+22	2603.8	20.5585	31.1818	126.653	F
2.39E+22	3131.52	24.5121	31.4442	127.754	F

I run the following code to discretize it:

void prepareData(string file, DSL_dataset& ds,
                 DSL_dataset::DiscretizeAlgorithm method = DSL_dataset::UniformWidth,
                 int nbins = 5) {
  // Open the data set
  cout << "Reading file " << file << "...";
  if (ds.ReadFile(file.c_str()) != DSL_OKAY) {
    cout << "Cannot read data file... exiting." << endl;
    exit(1);
  }
  cout << "Done." << endl;

  cout << "Discretizing dataset..." << endl;
  std::vector<double> edges;
  for (int v=0; v < ds.GetNumberOfVariables(); v++) {
    if (! ds.IsDiscrete(v)) {
      edges.clear();
      ds.Discretize(v, method, nbins, "State_", edges);
    }
  }
  cout << "Done." << endl;
  PrintDataset(ds);
}

Here PrintDataset() is the function given in the tutorials. I get the following output:

Reading file test6.tsv...Done.
Discretizing dataset...
Done.
 ===================
 -- variable info --
number of variables = 6
 Variable 0
	id: tot_flux
	is discrete
	Missing element value: -1
	State names: 
 Variable 1
	id: gwill
	is discrete
	Missing element value: -1
	State names: State_1_below_2783_22 State_2_2783_22_3047_08 State_3_3047_08_3492_6 State_4_3492_6_12864_6 State_5_12864_6_up 
 Variable 2
	id: avg_grad
	is discrete
	Missing element value: -1
	State names: State_1_below_19_7831 State_2_19_7831_21_6287 State_3_21_6287_23_6055 State_4_23_6055_34_7462 State_5_34_7462_up 
 Variable 3
	id: eff_sep
	is discrete
	Missing element value: -1
	State names: State_1_below_31_2838 State_2_31_2838_31_415 State_3_31_415_32_8984 State_4_32_8984_58_6321 State_5_58_6321_up 
 Variable 4
	id: pil_len
	is discrete
	Missing element value: -1
	State names: State_1_below_127_203 State_2_127_203_141_81 State_3_141_81_162_819 State_4_162_819_328_056 State_5_328_056_up 
 Variable 5
	id: fut_fl_gtM_48
	is discrete
	Missing element value: -1
	State names: F 
 -- data records --
number of records = 5
first 5 records:
-2147483648	3	2	3	3	0	
-2147483648	4	4	4	4	0	
-2147483648	1	0	1	2	0	
-2147483648	0	1	0	0	0	
-2147483648	2	3	2	1	0

Notice that the first variable, "tot_flux", has no state names, and its post-Discretize value in every record is -2147483648, the most negative 32-bit signed integer.
I'm running:

> uname -v && gcc --version
Darwin Kernel Version 11.4.2: Thu Aug 23 16:26:45 PDT 2012; root:xnu-1699.32.7~1/RELEASE_I386
i686-apple-darwin11-llvm-gcc-4.2 (GCC) 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)

Fri Apr 12, 2013 1:04 pm

rickyegeland wrote: I've run into a problem with DSL_dataset::Discretize() and a column with floats in exponential notation.

This is an overflow issue in the dataset parser. If you check the dataset contents right after the ReadFile call you'll see that all values are negative integers with large magnitude.

We'll fix the problem in the next SMILE release.
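
Until a fixed release is available, it may help to check a column for values that cannot fit in a 32-bit signed integer before handing the file to ReadFile(). The following is only a minimal sketch in plain C++ and does not use SMILE: `fitsInInt32` and `countOutOfRange` are hypothetical helper names, and the assumption that the parser overflows exactly at the 32-bit signed range is taken from the explanation above, not from the SMILE source.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper (not part of SMILE): returns true when x lies within
// the range of a 32-bit signed integer. The fractional part is ignored.
// NaN compares false against both bounds, so it is reported as not fitting.
bool fitsInInt32(double x) {
  const double lo = static_cast<double>(std::numeric_limits<std::int32_t>::min());
  const double hi = static_cast<double>(std::numeric_limits<std::int32_t>::max());
  return x >= lo && x <= hi;
}

// Hypothetical helper: counts the values in a column that fall outside
// the 32-bit signed integer range.
int countOutOfRange(const std::vector<double>& column) {
  int n = 0;
  for (std::size_t i = 0; i < column.size(); ++i) {
    if (!fitsInInt32(column[i])) ++n;
  }
  return n;
}
```

Every value in the tot_flux column above (on the order of 1E+22) fails this check, while the other numeric columns pass.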

Post by **rickyegeland** » Fri Apr 12, 2013 2:14 pm

Alright, in the meantime I'll rescale my data. Any idea what the maximum allowable float might be?

Fri Apr 12, 2013 2:24 pm

rickyegeland wrote: Alright, in the meantime I'll rescale my data. Any idea what the maximum allowable float might be?

With the binaries you have now, the problem will manifest itself if you cross the 32-bit signed integer boundary. If you stay below 2^31 you should be OK.
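
One way to do the rescaling is to divide the whole column by a power of ten chosen so that the largest magnitude drops below 2^31. The sketch below is plain C++ applied to the data before it is written to the file; `rescaleBelowInt32` is a hypothetical helper name, not a SMILE function, and the 2^31 limit is simply the figure quoted above.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of SMILE): divides every value in `column`
// by the smallest power of ten that brings the largest magnitude below 2^31.
// Returns the exponent used, so original = rescaled * 10^exponent.
int rescaleBelowInt32(std::vector<double>& column) {
  const double limit = 2147483648.0;  // 2^31

  // Find the largest finite magnitude in the column.
  double maxAbs = 0.0;
  for (std::size_t i = 0; i < column.size(); ++i) {
    const double a = std::fabs(column[i]);
    if (std::isfinite(a) && a > maxAbs) maxAbs = a;
  }

  // Grow the divisor by factors of ten until the maximum fits.
  int exponent = 0;
  double divisor = 1.0;
  while (maxAbs / divisor >= limit) {
    divisor *= 10.0;
    ++exponent;
  }

  // Apply the same divisor to every value so relative spacing is preserved.
  for (std::size_t i = 0; i < column.size(); ++i) {
    column[i] /= divisor;
  }
  return exponent;
}
```

For the tot_flux column above this picks an exponent of 14, so 5.89E+22 becomes roughly 5.89E+08. Because the scaling is linear, uniform-width bins contain the same records as before; any bin edges computed on the rescaled column need to be multiplied by 10^14 to get back to the original units.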

BayesFusion Support Forum
