MBPA4.0 AdvStatsFuncSpec

Overview of Stats

Available statistics features include:

  • Univariate
  • Frequency
  • Quantiles
  • Principal Component Analysis
  • Covariance

The Stats API is a programmatic API for configuring and running tasks that generate results and reports. Additional API calls can access the results.

Functional Descriptions

The main user APIs for configuring, running, and interrogating stats are the Stats, StatResults, and StatResultTable classes. These classes let you build complex Stats task configurations using the Groovy Builder pattern (see the Groovy Builder design pattern in Groovy in Action, Ch. 8).

The Stats builder allows you to specify a data source (a dataset) and transformations, such as binners and variable library invocations, to apply before calculating stats.

The combination of raw data and transformations is called the data pipeline.

You can then configure univariate, frequency, quantile, and covariance calculations to be made on the data pipeline. The results of these calculations can generate formatted reports, print to the console or to files, or be output to tables (in-memory runtime objects) that you can manipulate and interrogate.

Data Pipeline

A data pipeline is a stream of records. The records are read from data sources, and fields in the records can be transformed and replaced, or new fields can be added to records, via binners, variable libraries, and custom user code.

Data Sources

A data source block describes the path to a dataset, the optional sample weight to be used, and optional transformations to be applied to the data records.

The Stats API is able to calculate stats directly from .mbd dataset files (native files). To work with CSV, fixed width, SAS, or other non-Model Builder files, import data into a .mbd file.

Here is a simple example of a frequency calculation of the variable "age" in a census dataset.

def stats = new Stats()
stats.configure() {
	data(source:'file:/c:/temp/census.mbng')
	freq(vars:'age', table:'ageFreq')
}

This example reads records from the census dataset and feeds them to the frequency calculator. The frequency calculator collects frequency counts for the unique values of "age" and puts the results in a StatResultTable called "ageFreq".

The options for the data element are:

  • source – a String path to a dataset, or a reference to a dataset instance
  • one or more transform specifications

Data transformations

A data pipeline can include an arbitrary number of transformations. A transformation is added as a child element of the data command, as in this example:

def ageBinner = new Binning('numeric')
ageBinner.configure() {
	spec '(0,50]'
	spec '(50,100+)'
}
def stats = new Stats()
stats.configure() {
	data(source:'file:/c:/temp/census.mbng') {
		binner(vars:'age=ageB', ref:ageBinner)
		// Compute length of string variable,
		// save the result to new "address_len" variable
		strlen(var:'address')
		// Rename "zip" variable to new "zipCode" variable
		rename(vars:'zip=zipCode')
	}
	freq(vars:'ageB', table:'ageFreq')
	uni(vars:'@n(*) address_len', stats:'min max mean stddev', table:'uni1')
}


This example applies a binner to the "age" variable and places the binned value in each record as a new variable "ageB". The frequency calculator then computes counts of the binned values and generates a table called "ageFreq".

The options for configuring binners are:

  • vars – a String describing a list of variables to apply the binner to. You can optionally name the binned variable via the <var>=<alias> syntax. If the alias is omitted, the binned variable is added to the record with the suffix "_binned".
  • ref – a reference to a Binner instance, or a String naming a globally available system Binner such as "GeometricFine".
  • use – configures the return type from the binner. By default, the binner label is used. The available options are: 'label', 'index', 'numeric'.
  • suffix – overrides the default suffix "_binned" for binned result variables.
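As a sketch of the use and suffix options (reusing the ageBinner from the transformation example above; variable and table names here are illustrative, not from the spec):

```groovy
def stats = new Stats()
stats.configure() {
	data(source:'file:/c:/temp/census.mbng') {
		// bin "age", return the bin index instead of the default label,
		// and name the result "age_idx" instead of the default "age_binned"
		binner(vars:'age', ref:ageBinner, use:'index', suffix:'_idx')
	}
	freq(vars:'age_idx', table:'ageIdxFreq')
}
```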

Some system-provided transforms include:

  • strlen – calculates the length of a string variable. The newly created variable holding the length has the default suffix "_len".
  • copy – duplicates a variable into a new variable. The newly created duplicate has the default suffix "_copied".
  • rename – renames a variable. The current variable is removed, and the newly created variable has the default suffix "_renamed".
  • zscale – z-scales a variable with a provided mean and stddev. The newly z-scaled variable has the default suffix "_Z". The additional required settings "mean" and "stddev" take the numeric values used in the z-scale calculation.

All of the above transforms accept the standard configuration options:

  • vars – as defined for the binner transform.
  • suffix – a String that overrides the default suffix of the transform.
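A sketch combining these transforms (the mean, stddev, and suffix values below are illustrative, not from the spec):

```groovy
def stats = new Stats()
stats.configure() {
	data(source:'file:/c:/temp/census.mbng') {
		// z-scale "income" with a known mean and stddev;
		// the result lands in "income_Z" by default
		zscale(vars:'income', mean:42000, stddev:12500)
		// duplicate "age", overriding the default "_copied" suffix
		copy(vars:'age', suffix:'_raw')
	}
	uni(vars:'income_Z age_raw', stats:'min max mean stddev', table:'uni2')
}
```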

Weight Specifications

For Stats weights, simply specify the variable that holds the weight value for each record. The weight is applied to all stat table calculations, as this example illustrates:

def stats = new Stats()
stats.configure() {
	data(source:'file:/c:/temp/census.mbng') {
	}
	weight 'sampleWeightVar'
	freq(vars:'ageB', table:'ageFreq')
	uni(vars:'@n(*)', stats:'min max mean stddev', table:'uni1')
}

Univariate statistics

Univariate Configuration

You can configure and run the Stats task to calculate univariate statistics. This includes min, max, mean, variance, standard deviation, kurtosis, and skewness of a series of numerical data.

The command for specifying univariate stats is uni. Each argument to the uni command is a name/value pair. The uni command has these options:

  • vars – The value for this argument is a String describing a list of variables in the dataset. The syntax of this list is very similar to SLIM and supports wild carding for names and types. See the variable list documentation for more information.
  • stats – the value for this option is a String containing a space-delimited list of stat names. If no stats option is specified, all available stats are computed. The available univariate stats are:
    • min – the minimum value of all the non-missing data in this column
    • max – the maximum value of all the non-missing data in this column
    • mean – the mean (or average) of all the non-missing data in this column
    • stddev – the standard deviation of all the non-missing data in this column
    • variance – the variance of all the non-missing data in this column
    • kurtosis – the kurtosis of all the non-missing data in this column
    • skewness – the skewness of all the non-missing data in this column
    • count – the total count of all records read
    • countNum – the count of all non-missing values in this column
    • countMiss – the missing value count in this column
    • pctNum – the percentage of non-missing values (countNum / count * 100)
    • pctMiss – the percentage of missing values (countMiss / count * 100)
  • by - the value for this option is a String specifying a list of variables to use as the by variables. The syntax of this list is the same as the vars list.
  • table – a String that is the name of a result table to be generated by the stats task. A table can be retrieved from the StatsResults object.
  • report – a String representing the path to where the generated report will be written. The type of the report is derived from the suffix. For example, "c:/reports/census_freq.html" will be a generated as an HTML report.

Here is a simple example:

def stats = new Stats()
stats.configure() {
	// specify the URL to the data file containing the
	// numeric variables 'income' and 'age'
	data(source:'file:/c:/temp/census.mbng')

	// calculate univariate stats for age and income,
	// put the results in a table called 'ageIncome', and
	// generate an HTML report
	uni(vars:'age income', stats:'min max', table:'ageIncome', report:'c:/work/jdoe/reports/ageIncome.html')
}
def results = stats.run()

// retrieve the results
Table ageIncomeTable = results.ageIncome
println ageIncomeTable

output:

var		min	max
---------------------------
age		10	90
income	0	100000

Univariate statistics tables

Univariate stats can be programmatically accessed via the StatResultTable. A StatResultTable is a list of rows where each variable selected in the uni command is a row and each column contains a univariate calculation.

You can then access individual univariate calculations by selecting a row and accessing a column. Using the ageIncome table generated above, you can access values like this:

// select the row where the var column equals 'age'
def ageRow = ageIncomeTable.find{row -> row.var == 'age'}
println "age min: " + ageRow.min
println "age max: " + ageRow.max

output:

age min: 10
age max: 90

// select the row where the var column equals 'income'
def incomeRow = ageIncomeTable.find{row -> row.var == 'income'}
println "income min: " + incomeRow.min
println "income max: " + incomeRow.max

output:

income min: 0
income max: 100000

Univariate by-table results

Univariate calculations can also be done with by variables. Each row is a combination of a unique by-variable value combination and the univariate results. For example,

def stats = new Stats()
stats.configure() {
	// specify the URL to the data file containing the
	// numeric variables 'income' and 'acctBal'
	data(source:'file:/c:/temp/census.mbng')

	// calculate univariate stats for income and acctBal
	// by age, and put the results in a table called 'ageIncome'
	uni(vars:'income acctBal', by:'age', stats:'min max mean', table:'ageIncome')
}

def results = stats.run()
def ageIncomeTable = results.ageIncome
meanIncome = ageIncomeTable.find{it.age == 15 && it.var == 'income'}.mean
println 'mean income of 15 year olds: ' + meanIncome

meanAcctBal = ageIncomeTable.find{it.age == 15 && it.var == 'acctBal'}.mean
println 'mean account balance of 15 year olds: ' + meanAcctBal

The results of a by table are always tables within tables. Each by-level permutation defines a unique row in the table, and the sub-table is then accessed via the .table accessor method. The univariate table then has one row per variable specified and one column per stat specified. Here is an example:

Contents of ageIncomeTable

age var min max mean
15 income 0 15000 6500
15 acctBal 0 2000 560.35
16 income 0 20000 9000
16 acctBal 0 4000 998.45
17 income 0 20000 8500
17 acctBal 0 4567 1500

Frequencies

You can configure and run the Stats task to calculate frequency statistics. Frequencies are counts of the occurrences of unique values of variables.

Frequency Configuration

The command for specifying frequency stats is freq. Each option for the freq command is a name/value pair. The freq command has these options:

  • vars – The value for this argument is a String describing a list of variables in the dataset. The syntax of this list is very similar to SLIM and supports wild carding for names and types. See the variable list documentation for more information.
  • by - the value for this option is a String specifying a list of variables to use as the by variables. The syntax of this list is the same as the vars list.
  • table – a String that is the name of a result table to be generated by the stats task. A table can be retrieved from the StatsResults object.
  • report – a String representing the path to where the generated report will be written. The type of the report is derived from the suffix. For example, "c:/reports/census_freq.html" will be a generated as an HTML report.

Here is an example of configuring and running a frequency calculation.


def stats = new Stats()
stats.configure() {
	// specify the URL to the data file containing the string variable
	// 'US_state'
	data(source:'file:/c:/temp/census.mbng')

	// calculate freq stats for US states, put the results
	// in a table called 'stateFreq', and pretty print the results to HTML
	freq(vars:'US_state', table:'stateFreq', report:'/work/jdoe/stateFreq.html')
}

def results = stats.run()

// retrieve the results
def stateFreq = results.stateFreq

Frequency Table Results

Frequency tables are lists of rows, where each row contains the variable name, value, and count. To access individual cells in the table, you need to find the matching row. For example, using the table generated above, the user can access the VT frequency count like this:

def row = stateFreq.find{ it.value == 'VT' }
println "VT count: " + row.count

output:

VT count: 100

If you have created a table with multiple vars, the results are a bit more complex. For example,

def stats = new Stats()
stats.configure() {
	// specify the URL to the data file containing the string variable
	// 'US_state'
	data(source:'file:/c:/temp/census.mbng')
	freq(vars:'US_state age', table:'ageStateFreq')
}
def results = stats.run()
Table ageStateTable = results.ageStateFreq
def ageResults = ageStateTable.findAll{ it.var == 'age' }
def ageRow = ageResults.find{ it.value == '20' }
println "age count where age is 20: " + ageRow.count

output:

age count where age is 20: 53

Raw contents of ageStateFreq table

var value count
age 20 53
age 30 45
age 40 50
…
US_state AK 30
…
US_state VT 100

Frequency by-table results

By-table results are also placed into a StatResultTable. Like univariate by-var tables, each row is uniquely defined by a permutation of the by-var levels. Each row then has a table cell containing the frequency table for the vars specified. For example,

def stats = new Stats()
stats.configure() {
	// specify the URL to the data file containing the string variable
	// ‘US_state’
	data(source:’file:/c:/temp/census.mbng’)
	freq(vars: ‘US_state’, by: ’age’, table:’stateByAge’)
}
def results = stats.run()
def stateByAgeTable = results.stateByAge

Contents of stateByAge table

age var value count cumCount cumPct
10 US_state AK 12 12
10 US_state AR 5 17
…
11 US_state AK 8 8
11 US_state AR 5 13
…
12 US_state AK 15 15
12 US_state AR 8 23
…

You can then query the table to find individual rows to access counts:

// get count of people where age is 12 and US state is Alaska
freqTable = stateByAgeTable.findAll{ it.age == 12 }
count = freqTable.find{ it.var == 'US_state' && it.value == 'AK' }.count
println 'count: ' + count

output:

count: 15

Here is an example with two by variables:

def stats = new Stats()
stats.configure() {
	// specify the URL to the data file containing the string variable
	// 'US_state'
	data(source:'file:/c:/temp/census.mbng')
	freq(vars:'US_state', by:'age income', table:'stateByAgeByIncome')
}
def results = stats.run()
def byTable = results.stateByAgeByIncome

Contents of stateByAgeByIncome table

age income var value count cumCount cumPct
10 0 US_state AK 12 12 2.2
10 0 US_state AR 5 17 3.15
…
10 0 US_state WY 8 540 100
10 1-10000 US_state AK 32 32 1.39
10 1-10000 US_state AR 28 60 2.61
…
10 1-10000 US_state WY 25 2300 100

You can then query the table to find individual rows to access counts:

// get all rows where age is 10 and income is '1-10000'
subTable = byTable.findAll{ it.age == 10 && it.income == '1-10000' }

// get count of first row where US_state == AK
count = subTable.find{ it.var == 'US_state' && it.value == 'AK' }.count
println 'count: ' + count

output:

count: 32

Multidimensional frequency tables

Multidimensional tables let you see joint distributions of two or more variables. In the 2-dimensional case, the additional values of pct, pctCol, and pctRow are calculated. Respectively, these values are the percentage of total counts, percentage of column, and percentage of row.

This example shows the joint distribution of "zip1" and "reactivated".
stats = new Stats()
stats.configure() {
	data(source:'/work/data/postmail.mbng')
	freq(vars:'zip1 & reactivated', table:'zip1Reactivated')
}
def results = stats.run()
Table zip1Reactivated = results.zip1Reactivated

'zip1Reactivated' raw results:

reactivated zip1 count pct pctCol pctRow
0 0 1229 5.012 10.0049 49.1207
0 1 1184      
0 2 1207      
0 3 1183      
0 4 1273      
0 5 1239      
0 6 1254      
0 7 1267      
0 8 1220      
0 9 1228      
1 0 1273      
1 1 1182      
1 2 1165      
1 3 1226      
1 4 1222      
1 5 1256      
1 6 1221      
1 7 1239      
1 8 1237      
1 9 1216      

The raw data can be accessed with scripts like this:

row = zip1Reactivated.find{ it.reactivated == 1 && it.zip1 == '0'}
println 'count where zip1 = 0 and reactivated = 1: ' + row.count
println '% of total where zip1 = 0 and reactivated = 1: ' + row.pct

Sample report

Zip1, Mailed, file: '/work/data/postmail.mbng'

Each cell contains Count / % / % Column / % Row.

zip1   reactivated = 0                            reactivated = 1                              Total
0      1229.0 / 5.012 / 10.0049 / 49.1207         1273.0 / 5.1914687 / 10.402877 / 50.879295   2502.0 / 10.203499 / 10.203499 / 100.0
1      1184.0 / 4.828514 / 9.638555 / 50.042267   1182.0 / 4.8203583 / 9.65923 / 49.957733     2366.0 / 9.648872 / 9.648872 / 100.0
2      1207.0 / 4.9223113 / 9.825789 / 50.88533   1165.0 / 4.7510295 / 9.520308 / 49.11467     2366.0 / 9.648872 / 9.648872 / 100.0
3      1183.0 / 4.824436 / 9.630414 / 49.107513   1165.0 / 4.7510295 / 9.520308 / 49.11467     2366.0 / 9.648872 / 9.648872 / 100.0
4      1273.0 / 5.1914687 / 10.363074 / 51.022045 1165.0 / 4.7510295 / 9.520308 / 49.11467     2495.0 / 10.1749525 / 10.1749525 / 100.0
5      1239.0 / 5.052812 / 10.086291 / 49.659317  1256.0 / 5.1221404 / 10.263953 / 50.340683   2495.0 / 10.1749525 / 10.1749525 / 100.0
6      1254.0 / 5.113984 / 10.208401 / 50.666668  1221.0 / 4.9794054 / 9.977936 / 49.333332    2475.0 / 10.0933895 / 10.0933895 / 100.0
7      1267.0 / 5.167 / 10.31423 / 50.55866       1239.0 / 5.052812 / 10.1250305 / 49.44134    2506.0 / 10.219811 / 10.219811 / 100.0
8      1220.0 / 4.9753275 / 9.931619 / 49.65405   1237.0 / 5.044656 / 10.108686 / 50.34595     2457.0 / 10.019983 / 10.019983 / 100.0
9      1228.0 / 5.007952 / 9.996744 / 50.2455     1216.0 / 4.959015 / 9.937076 / 49.7545       2444.0 / 9.966967 / 9.966967 / 100.0
Total  12284.0 / 50.095837 / 100.0 / 50.095837    12237.0 / 49.904163 / 100.0 / 49.904163      24521.0 / 100.0 / 100.0 / 100.0

For 3 or more dimensions, the only available value is the raw count. For example,

stats = new Stats()
stats.configure() {
	data(source:'/work/data/postmail.mbng')
	freq(vars:'zip1 & reactivated & age', table:'zip1Reactivated')
}

def results = stats.run()
Table zip1Table = results.zip1Reactivated

Raw results

Zip1 Reactivated Age Count
0 1 15 12
0 1 16 15
…

Multidimensional frequency tables with by variables

Multidimensional frequency tables allow the user to subdivide a dataset with by variables. Here is an example of how to create a 2d freq table by income. The results are a table of tables.

stats = new Stats()
stats.configure() {
	data(source:'/work/data/postmail.mbng')
	freq(vars:'zip1 & reactivated', by:'income', table:'zip1ReactByIncome')
}

def results = stats.run()
Table zipTable = results.zip1ReactByIncome 
println zipTable

Sample output of zipTable:

Income (str) Reactivated (num) zip1 (str) Count (num) pct (num) pctCol (num) pctRow (num)
0 0 0 200      
0 0 1 189      
0 0 2 203      
0 0 3 178      
0 0 4 178      
0 0 5 193      
0 0 6 201      
0 0 7 204      
0 0 8 180      
0 0 9 185      
0 1 0 190      
0 1 1 193      
0 1 2 188      
0 1 3 193      
0 1 4 202      
0 1 5 200      
0 1 6 203      
0 1 7 191      
0 1 8 187      
0 1 9 200      
1-10000 0 1 5
…

Code example of accessing 2d freq table with by var

// accessing results in a 2d freq table with by vars
subTable = zipTable.findAll{ it.income == '0' }

count = subTable.find{ it.reactivated == '0' && it.zip1 == '0' }.count

println 'count where income = 0 and reactivated is 0 and zip1 is 0: ' + count

output:
count where income = 0 and reactivated is 0 and zip1 is 0: 200

Quantiles

Quantiles are points taken at regular intervals from the cumulative distribution function of a random variable. The system provides a set of pre-canned quantile bounds: default, percentiles (100-quantiles), quartiles (4-quantiles), duo-deciles (20-quantiles), and tails (as mapped to those provided by classic ModelBuilder).

Quantile Configuration

The primary end-user API for quantiles is the quantile() element in the Groovy Stats builder.

The following keywords are supported by quantile:

  • table – a String naming the table that holds the quantile result. This is optional; if table is omitted, the quantile result table is assigned a default name with the prefix "quantile" followed by a numeric value (the next available number among tables with the same prefix in the stat result).
  • vars – a collection of input variables, or a filter expression for a number of input variables, to include in this quantile table.
  • by – a collection of input variables, or a filter expression for a number of input variables, to use as by-var variables in the quantile table.
  • bounds – defines the bounds for quantile tables. The value can be:
    • a String naming one of the system-provided quantile bounds. The name can be as short as the first three letters of the actual name; for example, the user may enter "per" instead of "percentiles".
    • a String containing space-delimited doubles to specify on-the-fly bounds.
    • a re-usable Bounds instance defined by the user.
  • report – a String in URI format specifying the location where the report for the quantile result will be generated.

For example:

stats = new Stats()
stats.configure() {
	data(source:'/work/mydata/census.mbng')

	// Create a percentile table named 'spending' on the
	// variable 'income'
	quantile(vars:'income', bounds:'per', table:'spending')

	// Create a quantile table using the default name (quantile1, given
	// this is the first result table with prefix "quantile").
	// The embedded bounds are defined by the user.
	quantile(vars:'age', by:'income', bounds:'0.15 0.25 0.35 0.65 0.75 0.85')
}
results = stats.run()

Quantile Table Results

The result for each quantile computation is also placed in a StatResultTable with a name. The result is itself a table, in which the quantile result is placed under a column named quantileResult. If the quantile has by-vars, the table has additional columns for the by-var variables.

Quantile Result table without by-var

quantileResult
quantileResult Instance

Quantile Result table with by-var

byVar1 byVar2 quantileResult
1 aa quantileResult Instance
2 bb quantileResult Instance
…..   quantileResult Instance
…..   quantileResult Instance

Each quantileResult instance can be viewed as a table with two or more columns, where the first column is "bounds" and each additional column is named for a variable in the quantile calculation. Accessing the "bounds" column or a variable column yields an array of doubles corresponding to the quantile bounds and their values for that variable.

bounds cb1 cb2 cb3 cb4
double[] double[] double[] double[] double[]

d = [0.15, 0.25] as double[]
def bnds = new Bounds(d)

q = new Quantile()
q.configure() {
	data(source:'/work/jdoe/data/census.mbng')
	quantile(vars:'cb*', bounds:bnds)
	quantile(vars:'zip', by:'target', bounds:bnds)
}
def rs = q.run()
// Access the first quantile table and its simple result using index 0
def quantileRS1 = rs.quantile1[0].quantileResult
// Access the quantile result for cb2
def bounds = quantileRS1.bounds  // bounds is a double[] with {0.15, 0.25}
def cb2 = quantileRS1.cb2        // cb2 is a double[2] with two values
                                 // corresponding to the 0.15 and 0.25 quantiles

// Access the quantile result for by-var target value '1'
def quantileRS2 = rs.quantile2.find{it.target == '1'}.quantileResult

To quickly access a quantile result, the user can simply invoke a println statement on the table result, such as:

  • println results.spending
  • println results.quantile1

Here is an example of the raw result produced by a println statement:

Quantile	spending	cb1
(Percentiles)
0.0000		1.2800		601
1.0000		11.0900		641
2.0000		14.5400		648
3.0000		17.2000		653
…
98.0000		972.8900	650
99.0000		1321.3900	653
100.0000	8495.4902	655

For a quantile with a by-variable, the result is printed as:

Quantile	A
(Quartiles)

By-Var [B = 1]
0.000000	1
25.000000	23
50.000000	48
75.000000	73
100.000000	96

By-Var [B = 2]
0.000000	2
25.000000	24
50.000000	49
75.000000	74
100.000000	97
…

Covariance

The user can compute variable covariance with the covar element in a Stats task. The covar element supports weight variables, by-variables, and binners on both by-variables and input variables. Optionally, on symmetric correlation matrices, the user can also perform PCA.

Covariance Configuration

The following keywords are supported by covar:

  • table – a String naming the table that holds the covariance result. This is optional; if table is omitted, the covariance result table is assigned a default name with the prefix "covar" followed by a numeric value (the next available number among tables with the same prefix in the stat result).
  • report – a String in URI format specifying the location where the report for the covariance result will be generated.
  • rows – a collection of input variables, or a filter expression for a number of input variables, to use as row variables.
  • cols – a collection of input variables, or a filter expression for a number of input variables, to use as column variables. If this is omitted, the column variables are the same as those defined by rows, and the covariance will be symmetric.
  • by – a collection of input variables, or a filter expression for a number of input variables, to use as by-var variables in the covariance table.
  • pca – accepts the strings ('yes', 'no'), booleans (true, false), or numbers (1, 0) to compute a PCA result when the input variables are symmetric. When the input is symmetric and pca is omitted, no PCA result is generated.
  • covar – accepts the strings ('yes', 'no'), booleans (true, false), or numbers (1, 0). If this value is true, the PCA result (if generated) is computed from the covariance matrix instead of the default correlation matrix. The default is false.
  • missing – a String ('drop', 'pairwise') indicating how to handle records with missing values. The default behavior is to drop records with missing values.

The following example illustrates how to use covar() to produce a simple symmetric covariance matrix for three variables 'A C D'. The selection of rows and columns works just like the SLIM vars list in frequency and univariate statistics. For example,


stats = new Stats()
stats.configure() {
	data(source:'/work/jdoe/data/somedata.mbng')
	covar(rows:'A C D', table:'myTable')
}
results = stats.run()

The following example illustrates how to configure a covariance generator to do PCA (Principal Component Analysis). The ’pca’ option indicates a PCA analysis will be included in the result if and only if the covariance is symmetric. Because a table name is omitted, the Stats task will place the results in a table with an auto-generated name of "covar1".


stats = new Stats()
stats.configure() {
	data(source:'/work/jdoe/data/somedata.mbng')
	covar(rows:'A C D', pca:true)
}
results = stats.run()
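Combining the pca and covar options, the PCA can be computed from the covariance matrix rather than the default correlation matrix. A sketch (the table name here is illustrative):

```groovy
stats = new Stats()
stats.configure() {
	data(source:'/work/jdoe/data/somedata.mbng')
	// pca:true requests the analysis; covar:true computes it from
	// the covariance matrix instead of the correlation matrix
	covar(rows:'A C D', pca:true, covar:true, table:'covarPca')
}
results = stats.run()
```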

This example shows how to configure a covar table with by-variables and the missing value handling as "pairwise".

stats = new Stats()
stats.configure() {
	data(source:'/work/jdoe/data/somedata.mbng')
	covar(rows:'A C D', by:'B', pca:true, missing:'pairwise')
}
results = stats.run()

By default, if only a rows option is provided, the covar matrices will be symmetric. If the user wants to configure asymmetric covariance, they must provide a column list with the "cols" option. If the table is asymmetric, PCA will not be computed and the pca option is ignored. The following example shows an asymmetric covar declaration with a single by-variable.

stats = new Stats()
stats.configure() {
	data(source:'/work/jdoe/data/somedata.mbng')
	covar(table:'myCovar', rows:'A C D', by:'B', cols:'A')
}
results = stats.run()

Covariance Results

The result for each covar request is also placed in a StatResultTable with a name. The result is itself a table, in which the covariance result is placed under a column named covarResult. If the covar has by-vars, the table has additional columns for the by-var variables.

Covariance Result table without by-var

covarResult
CovarianceResult Instance

Covariance Result table with by-var

byVar1 byVar2 covarResult
1 aa CovarianceResult Instance
2 bb CovarianceResult Instance
…..   CovarianceResult Instance
…..   CovarianceResult Instance

Each CovarianceResult instance is also a table with the following columns:

vars CovarMatrix CorrMatrix PWeightMatrix CountMatrix CorrSumProductsMatrix
String[] double[][] double[][] double[][] double[][] double[][]


SampleWeightSumMatrix UncorrSumProdMatrix PCountMatrix RowMeansMatrix ColumnMeansMatrix PCA
double[][] double[][] double[][] double[][] double[][] PCA instance

Each PCA instance is also a table with n+5 columns, where n is the number of by-var variables.

PCA without by-var

vars EigenVectors EigenValues VarMeans VarVariances
String[] double[][] double[] double[] double[]

PCA with by-var

byVar1 byVar2 vars EigenVectors EigenValues VarMeans VarVariances
1 aaa String[] double[][] double[] double[] double[]

The following example shows how to access a covariance result instance through the result table, along with its PCA:


c = new Covar()
c.configure() {
	data(source:'/work/jdoe/somedata.mbng')
	covar(rows:'A C D', pca:'yes')          // results are in table 'covar1'
	covar(rows:'A C', by:'B D', pca:'yes')  // results are in table 'covar2'
}
def statRS = c.run()
// Get the covariance result instance by first retrieving the named table,
// then get the covariance result using the column name (covarResult) at index 0,
// since we know this covariance result table does not have by-vars
def covarRS = statRS.covar1[0].covarResult
// Access its PCA and its eigenvalues
def pca = covarRS.PCA
def eigenvalues = pca.EigenValues
// Access the Covariance Matrix
println("Covariance Matrix " + covarRS.CovarMatrix)
// Access the Correlation Matrix
println("Correlation Matrix " + covarRS.CorrMatrix)

// Get the covariance result with by-vars by first retrieving the named
// table, then finding the covariance result with the (by-var) key
covarRS = statRS.covar2.find{it.B == '1' && it.D == 'aaa'}.covarResult

A quick snapshot of a covariance result is included here:

results = stats.run()
println results

[Covariance]	A	C		D	
A		833.250000	-8.250000	-833.250000	
C		-8.250000	8.250000	8.250000	
D		-833.250000	8.250000	833.250000

[P(Count)]	A	C		D	
A		0.000000	0.324637	0.000000	
C		0.324635	0.000000	0.324637	
D		0.000000	0.324637	0.000000

[Correlation]	A	C		D	
A		1.000000	-0.099504	-1.000000	
C		-0.099504	1.000000	0.099504	
D		-1.000000	0.099504	1.000000
….
[P(Weight)]	A	C		D
…
[Count]		A	C		D
…
[CorrSumProd]	A	C		D
…
[Sum Weight]	A	C		D
…
[UncorrSumProd]	A	C		D	
A		338350.0	26950.0	171700.0	
C		26950.0	3850.0		28600.0
D		171700.0	28600.0	338350.0	 

If PCA is produced as part of the covar() statement:

PCA Result
Eigenvalues	2.019425	0.980575	-0.000000	
A			-0.700465	0.096691	-0.707107
C			0.136742	0.990607	0.000000
D			0.700465	-0.096691	-0.707107	

Principal Component Analysis Projection

The PCA projection is a system-provided transform. You can apply a PCA projection as part of a data transform, either in a Stats command or using the Apply task. The PCA projection needs a PCA result from the covariance calculation. You must specify the number of eigenvectors (the subspace size) to use when computing the main projection vector; the eigenvectors with the largest eigenvalues are used. The result vectors (main, coeff, and residual) of the PCA can be either flattened out (the default) or kept as numeric arrays.

PCA Projection Configuration

The following keywords are supported in PCA projection.

  • data – the data node is used to load the target dataset. The arguments for this data section are very similar to those accepted by stats(). Using a system-provided transform, you can also rename or copy variables on the fly so the new dataset provides the variables expected by the PCA result.
  • project – projection configuration options include:
    • ref – takes a PCA result from covariance analysis.
    • vectors (v) – takes a numeric value indicating the subspace size.
    • compute (c) – the compute list (what will be included in the output). It takes any combination of c (coefficients), r (residuals), and m (main vectors).
    • zscaling (z) – takes a boolean value [yes/no, true/false, 1/0] indicating whether z-scaling should be on. The default is on.
    • normalize (n) – takes a boolean value [yes/no, true/false, 1/0] defining whether normalization should be applied. The default is to apply it.
    • print – takes a boolean value [yes/no, true/false, 1/0] indicating whether a summary should be displayed at the end. The default is to display the summary.
    • flat – takes a boolean value [yes/no, true/false, 1/0] indicating whether the projection result should be flattened out as individual (double) variables for each component; if flat is false, the projection result is output as double[] for the variables mains, coefficients, and residuals. The default is to flatten (flat = true).

The output dataset contains all original variables plus the projection values:

Flat option = true (by default)

main1 main2 coefficient1 coefficient2 residual1 residual2
double double double double double double

or

Flat option = false

mains coefficients residuals
double[] double[] double[]

The following example shows how to apply a PCA projection to a dataset using a pre-calculated PCA instance. This example makes use of all available options. The dataset can be the original dataset used to calculate the eigenvectors, or a new dataset. A new dataset will not have names for the variables used in the PCA calculation, so be sure to rename variables in the new dataset to map to the variables expected by the PCA result.

// Assume that we have generated a PCA result
pcaInstance = covarRS.PCA

// Create an Apply task to perform the PCA projection.
Apply apply = new Apply()

apply.configure() {
	// define source data to be projected
	data(source:Dataset, sink:'/work/data/projectPCA.mbd') {
		…
		rename(vars:'zip=zipCode')
		pca(ref:pcaInstance, vectors:2, compute:'c r m',
			zScaling:true, normalize:true,
			print:true)
	}
}

apply.run()

Here is another example, projecting PCA with all default options:

apply.configure() {
	data(source:'/work/somedataset', sink:'/work/someOutput.mbd') {
		pca(ref:somePCA)
	}
}
apply.run()
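If the array layout is preferred over the flattened per-component variables, the flat option can be turned off. A sketch under the same assumptions as the example above (the somePCA reference and paths are the same illustrative names):

```groovy
apply.configure() {
	data(source:'/work/somedataset', sink:'/work/someOutput.mbd') {
		// keep mains, coefficients, and residuals as double[] variables
		// instead of flattening them into main1, main2, coefficient1, ...
		pca(ref:somePCA, vectors:2, flat:false)
	}
}
apply.run()
```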
