Overview of Stats
Available statistics features include
- Univariate
- Frequency
- Quantiles
- Principal Component Analysis
- Covariance
The Stats API is a programmatic API for configuring and running tasks that generate results and reports. Additional API calls can access the results.
Functional Descriptions
The main user APIs for configuring, running, and interrogating stats are the Stats, StatResults, and StatResultTable classes. These classes let you build complex Stats task configurations using the Groovy Builder pattern (see the Groovy Builder design pattern in Groovy in Action, Ch. 8).
The Stats builder allows you to specify a data source (a dataset) and transformations, such as binners and variable library invocations, to apply before calculating stats.
The combination of raw data and transformations is called the data pipeline.
You can then configure univariate, frequency, quantile, and covariance calculations to be made on the data pipeline. The results of these calculations can be written to formatted reports, printed to the console or to files, or output to tables (in-memory runtime objects) that you can manipulate and interrogate.
Data Pipeline
A data pipeline is a stream of records. The records are read from data sources. Binners, variable libraries, and custom user code can transform and replace fields in the records, or add new fields to them.
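To make the pipeline idea concrete, here is a minimal sketch in plain Groovy. It does not use the Stats API: it assumes records can be modeled as maps and transformations as closures that return an extended record. The names `records`, `transforms`, and `pipeline` are illustrative only.

```groovy
// Plain-Groovy illustration of a record pipeline (not the Stats API).
// Each record is modeled as a Map; each transform returns a new record.
def records = [
    [age: 34, address: '12 Elm St'],
    [age: 71, address: '9 Oak Avenue']
]

// Hypothetical transforms: add a string-length field, then a binned-age field.
def transforms = [
    { rec -> rec + [address_len: rec.address.length()] },
    { rec -> rec + [ageB: (rec.age <= 50 ? '(0,50]' : '(50,100+)')] }
]

// Apply every transform, in order, to every record.
def pipeline = records.collect { rec ->
    transforms.inject(rec) { current, transform -> transform(current) }
}

pipeline.each { println it }
```

The source records are left untouched here; whether the real pipeline replaces fields in place or copies records is not something this sketch claims to reproduce.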
Data Sources
A data source block describes the path to a dataset, the optional sample weight to be used, and optional transformations to be applied to the data records.
The Stats API can calculate stats directly from .mbd dataset files (native files). To work with CSV, fixed-width, SAS, or other non-Model Builder files, first import the data into a .mbd file.
Here is a simple example of a frequency calculation of the variable "age" in a census dataset.
    def stats = new Stats()
    stats.configure() {
        data(source: 'file:/c:/temp/census.mbng')
        freq(vars: 'age', table: 'ageFreq')
    }
This example reads records from the census dataset and feeds them to the frequency calculator. The frequency calculator collects frequency counts for the unique values of "age" and puts the results in a StatResultTable called "ageFreq".
The options for the data element are:
- source -- a String indicating a path to a dataset, or a reference to a dataset instance
- one or more transform specs.
Data transformations
A data pipeline can include an arbitrary number of transformations. A transformation is added as a child element to the data command, as in this example:
    def ageBinner = new Binning('numeric')
    ageBinner.configure() {
        spec '(0,50]'
        spec '(50,100+)'
    }

    def stats = new Stats()
    stats.configure() {
        data(source: 'file:/c:/temp/census.mbng') {
            binner(vars: 'age=ageB', ref: ageBinner)
            // Compute the length of the string variable "address" and
            // save the result to the new "address_len" variable
            strlen(vars: 'address')
            // Rename the "zip" variable to the new "zipCode" variable
            rename(vars: 'zip=zipCode')
        }
        freq(vars: 'ageB', table: 'ageFreq')
        uni(vars: '@n(*) address_len', stats: 'min max mean stddev', table: 'uni1')
    }
This example applies a binner to the "age" variable and places the binned value in each record as a new variable "ageB". The frequency calculator then computes counts of the binned values and generates a table called "ageFreq".
The options for configuring binners are:
- vars – a String describing a list of variables to apply the binner to. You can name the binned variable with the <var>=<alias> syntax. If the alias is omitted, the binned variable is added to the record with the suffix "_binned".
- ref – a reference to a Binner instance, or a String naming a globally available system Binner such as "GeometricFine".
- use – the return type from the binner. By default, the binner label is used. The available options are: 'label', 'index', 'numeric'.
- suffix – overrides the default suffix "_binned" for the result variables of binning.
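The following plain-Groovy sketch illustrates what interval binning with a label or index return type could look like. It is not the Binning class: it assumes that '(0,50]' means 0 < x <= 50, that '(50,100+)' means any x above 50, and that bin indexes are zero-based. The names `bins` and `binOf` are hypothetical.

```groovy
// Plain-Groovy sketch of interval binning (not the Binning class itself).
// Assumption: '(0,50]' means 0 < x <= 50 and '(50,100+)' means x > 50.
def bins = [
    [label: '(0,50]',    test: { x -> x > 0 && x <= 50 }],
    [label: '(50,100+)', test: { x -> x > 50 }]
]

// Return the bin label, or the zero-based bin index, for a value.
def binOf = { x, use = 'label' ->
    def idx = bins.findIndexOf { it.test.call(x) }
    if (idx < 0) return null          // value falls outside every bin
    use == 'index' ? idx : bins[idx].label
}

println binOf(42)            // label of the first bin
println binOf(77, 'index')   // index of the second bin
```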
Some system-provided transforms include:
- strlen – calculates the length of a string variable. The new variable that holds the result has the default suffix "_len".
- copy – duplicates a variable to a new variable. The new variable has the default suffix "_copied".
- rename – renames a variable. The current variable is removed. The new variable has the default suffix "_renamed".
- zscale – z-scales a variable with a provided mean and stddev. The new variable has the default suffix "_Z". zscale also requires the "mean" and "stddev" settings, which take the numeric values used in the calculation.
All of the above transforms accept the standard configurations:
- vars – as defined for the binner transform.
- suffix – a String that overrides the default suffix of that transform.
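As a rough illustration of the zscale transform, the sketch below applies the standard z-score formula, z = (x - mean) / stddev, to one field of a record. It is plain Groovy with a hypothetical `zscale` closure and assumes the product uses this same formula.

```groovy
// Plain-Groovy sketch of z-scaling with a supplied mean and stddev.
// z = (x - mean) / stddev; the result is stored under "<name>_Z".
def zscale = { Map rec, String name, double mean, double stddev ->
    rec + [(name + '_Z'): (rec[name] - mean) / stddev]
}

def scaled = zscale([income: 60000.0d], 'income', 50000.0d, 20000.0d)
println scaled   // original field plus income_Z
```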
Weight Specifications
For Stats weights, specify the variable that holds the weight value for each record. The weight is applied to all stat table calculations, as this example illustrates:
    def stats = new Stats()
    stats.configure() {
        data(source: 'file:/c:/temp/census.mbng') {
        }
        weight 'sampleWeightVar'
        freq(vars: 'ageB', table: 'ageFreq')
        uni(vars: '@n(*)', stats: 'min max mean stddev', table: 'uni1')
    }
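The effect of a weight variable can be pictured with a short plain-Groovy sketch. It assumes that each record contributes its weight, rather than a count of 1, to frequencies and means. How the Stats task applies weights internally is not specified here, so treat this as one plausible reading. The sample records are made up.

```groovy
// Plain-Groovy sketch of weighted counts and a weighted mean.
// Assumption: each record contributes its weight instead of a count of 1.
def records = [
    [ageB: '(0,50]',    income: 100.0d, sampleWeightVar: 2.0d],
    [ageB: '(0,50]',    income: 200.0d, sampleWeightVar: 1.0d],
    [ageB: '(50,100+)', income: 400.0d, sampleWeightVar: 1.0d]
]

// Weighted frequency: sum of weights per unique value.
def grouped = records.groupBy { it.ageB }
def weightedFreq = grouped.collectEntries { value, rows ->
    [value, rows.sum { it.sampleWeightVar }]
}

// Weighted mean: sum(w * x) / sum(w).
def totalWeight  = records.sum { it.sampleWeightVar }
def weightedMean = records.sum { it.sampleWeightVar * it.income } / totalWeight

println weightedFreq
println weightedMean
```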
Univariate statistics
Univariate Configuration
You can configure and run the Stats task to calculate univariate statistics. This includes min, max, mean, variance, standard deviation, kurtosis, and skewness of a series of numerical data.
The command for specifying univariate stats is uni. Each argument to the uni command is a name/value pair. The uni command has these options:
- vars – the value for this argument is a String describing a list of variables in the dataset. The syntax of this list is very similar to SLIM and supports wildcarding for names and types. See the variable list documentation for more information.
- stats – the value for this option is a String containing a space-delimited list of stat names. If no stats option is specified, all available stats are computed. The available univariate stats are:
- min – the minimum value of all the non-missing data in this column
- max – the maximum value of all the non-missing data in this column
- mean – the mean (or average) of all the non-missing data in this column
- stddev – the standard deviation of all the non-missing data in this column
- variance – the variance of all the non-missing data in this column
- kurtosis – the kurtosis of all the non-missing data in this column
- skewness – the skewness of all the non-missing data in this column
- count – the total count of all records read
- countNum – the count of all non-missing values in this column
- countMiss – the missing value count in this column
- pctNum – the percentage of non-missing values (countNum / count * 100)
- pctMiss – the percentage of missing values (countMiss / count * 100)
- by - the value for this option is a String specifying a list of variables to use as the by variables. The syntax of this list is the same as the vars list.
- table – a String that is the name of a result table to be generated by the stats task. A table can be retrieved from the StatResults object.
- report – a String representing the path to where the generated report will be written. The type of the report is derived from the suffix. For example, "c:/reports/census_freq.html" will be generated as an HTML report.
Here is a simple example:
    def stats = new Stats()
    stats.configure() {
        // specify the URL to the data file containing the
        // numeric variables 'income' and 'age'
        data(source: 'file:/c:/temp/census.mbng')
        // calculate univariate stats for age and income,
        // put the results in a table called 'ageIncome', and
        // generate an HTML report
        uni(vars: 'age income', stats: 'min max', table: 'ageIncome',
            report: 'c:/work/jdoe/reports/ageIncome.html')
    }
    def results = stats.run()
    // retrieve the results
    Table ageIncomeTable = results.ageIncome
    println ageIncomeTable

    output:
    var        min    max
    ---------------------------
    age        10     90
    income     0      100000
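To show how the missing-value statistics relate to the others, here is a plain-Groovy sketch that computes several of the univariate stats over one made-up column. It is not the Stats implementation. In particular it assumes a population variance (divide by n), whereas the product may divide by n - 1.

```groovy
// Plain-Groovy sketch of the univariate stats; null marks a missing value.
def column = [10, 25, null, 40, 25, null, 90, 50]

def present   = column.findAll { it != null }
def count     = column.size()          // all records read
def countNum  = present.size()         // non-missing values
def countMiss = count - countNum       // missing values
def pctNum    = countNum / count * 100
def pctMiss   = countMiss / count * 100

def mean = present.sum() / countNum
// Assumption: population variance (divide by n); the product may use n - 1.
def variance = present.sum { (it - mean) * (it - mean) } / countNum
def stddev   = Math.sqrt(variance as double)

println "min=${present.min()} max=${present.max()} mean=${mean}"
println "countNum=${countNum} countMiss=${countMiss} pctMiss=${pctMiss}"
println "variance=${variance} stddev=${stddev}"
```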
Univariate statistics tables
Univariate stats can be programmatically accessed via the StatResultTable. A StatResultTable is a list of rows where each variable selected in the uni command is a row and each column contains a univariate calculation.
You can then access individual univariate calculations by selecting a row and accessing a column. Using the ageIncome table generated above, you can access values like this:
    // select the row where var column equals 'age'
    def ageRow = ageIncomeTable.find{ row -> row.var == 'age' }
    println "age min: " + ageRow.min
    println "age max: " + ageRow.max

    output:
    age min: 10
    age max: 90

    // select the row where var column equals 'income'
    def incomeRow = ageIncomeTable.find{ row -> row.var == 'income' }
    println "income min: " + incomeRow.min
    println "income max: " + incomeRow.max

    output:
    income min: 0
    income max: 100000
Univariate by-table results
Univariate calculations can also be done with by variables. Each row combines one unique combination of by-variable values with the univariate results for one variable. For example:
    def stats = new Stats()
    stats.configure() {
        // specify the URL to the data file containing the
        // numeric variables 'income', 'acctBal', and 'age'
        data(source: 'file:/c:/temp/census.mbng')
        // calculate univariate stats for income and acctBal by age
        // and put the results in a table called 'ageIncome'
        uni(vars: 'income acctBal', by: 'age', stats: 'min max mean', table: 'ageIncome')
    }
    def results = stats.run()
    def ageIncomeTable = results.ageIncome
    meanIncome = ageIncomeTable.find{ it.age == 15 && it.var == 'income' }.mean
    println 'mean income of 15 year olds: ' + meanIncome
    meanAcctBal = ageIncomeTable.find{ it.age == 15 && it.var == 'acctBal' }.mean
    println 'mean account balance of 15 year olds: ' + meanAcctBal
The results of a by table are always tables within tables. Each by-level permutation defines a unique row in the table. The sub-table is then accessed via the .table accessor method. The contents of the univar table then have one row per variable specified and one column per stat specified. Here is an example:
Contents of ageIncomeTable
age | var | min | max | mean |
15 | income | 0 | 15000 | 6500 |
15 | acctBal | 0 | 2000 | 560.35 |
16 | income | 0 | 20000 | 9000 |
16 | acctBal | 0 | 4000 | 998.45 |
17 | income | 0 | 20000 | 8500 |
17 | acctBal | 0 | 4567 | 1500 |
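The row layout of a by-table can be reproduced with a short plain-Groovy sketch that groups made-up records by the by variable and emits one row per by-level and variable. It only mirrors the flat layout shown above and does not use the Stats API. The names `vars` and `byTable` are illustrative.

```groovy
// Plain-Groovy sketch of "by" processing: one result row per by-level and variable.
def records = [
    [age: 15, income: 5000, acctBal: 100],
    [age: 15, income: 8000, acctBal: 300],
    [age: 16, income: 9000, acctBal: 500]
]
def vars = ['income', 'acctBal']

def byTable = []
records.groupBy { it.age }.each { age, rows ->
    vars.each { v ->
        def values = rows.collect { it[v] }
        byTable << [age: age, var: v, min: values.min(), max: values.max(),
                    mean: values.sum() / values.size()]
    }
}

byTable.each { println it }
def row = byTable.find { it.age == 15 && it.var == 'income' }
println "mean income of 15 year olds: ${row.mean}"
```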
Frequencies
You can configure and run the Stats task to calculate frequency statistics. Frequencies are counts of the occurrences of unique values of variables.
Frequency Configuration
The command for specifying frequency stats is freq. Each option for the freq command is a name/value pair. The freq command has these options:
- vars – the value for this argument is a String describing a list of variables in the dataset. The syntax of this list is very similar to SLIM and supports wildcarding for names and types. See the variable list documentation for more information.
- by – the value for this option is a String specifying a list of variables to use as the by variables. The syntax of this list is the same as the vars list.
- table – a String that is the name of a result table to be generated by the stats task. A table can be retrieved from the StatResults object.
- report – a String representing the path to where the generated report will be written. The type of the report is derived from the suffix. For example, "c:/reports/census_freq.html" will be generated as an HTML report.
Here is an example of configuring and running a frequency calculation.
    def stats = new Stats()
    stats.configure() {
        // specify the URL to the data file containing the string variable
        // 'US_state'
        data(source: 'file:/c:/temp/census.mbng')
        // calculate freq stats for US states and put the results
        // in a table called 'stateFreq' and pretty print the results to HTML.
        freq(vars: 'US_state', table: 'stateFreq', report: '/work/jdoe/stateFreq.html')
    }
    def results = stats.run()
    // retrieve the results
    def stateFreq = results.stateFreq
Frequency Table Results
Frequency tables are lists of rows, where each row contains the variable name, value, and count. To access individual cells in the table, you need to find the matching row in the table. For example, using the table generated above, you can access the VT frequency count like this:
    def row = stateFreq.find{ it.value == 'VT' }
    println "VT count: " + row.count

    output:
    VT count: 100
If you have created a table with multiple vars, the results are a bit more complex. For example,
    def stats = new Stats()
    stats.configure() {
        // specify the URL to the data file containing the string variable
        // 'US_state'
        data(source: 'file:/c:/temp/census.mbng')
        freq(vars: 'US_state age', table: 'ageStateFreq')
    }
    def results = stats.run()
    Table ageStateTable = results.ageStateFreq
    def ageResults = ageStateTable.findAll{ it.var == 'age' }
    def ageRow = ageResults.find{ it.value == '20' }
    println "age count where age is 20: " + ageRow.count

    output:
    age count where age is 20: 53
Raw contents of ageStateFreq table
var | value | count |
age | 20 | 53 |
age | 30 | 45 |
age | 40 | 50 |
… | … | … |
US_state | AK | 30 |
… | … | … |
US_state | VT | 100 |
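The var | value | count layout above can be reproduced for a handful of made-up records with plain Groovy collection methods. This sketch is only meant to show how rows for several variables end up in one table. It does not use the Stats API, and `freqRows` is a hypothetical name.

```groovy
// Plain-Groovy sketch of a multi-variable frequency table.
def records = [
    [US_state: 'VT', age: 20],
    [US_state: 'AK', age: 20],
    [US_state: 'VT', age: 30]
]

// One row per (variable, unique value), mirroring the var | value | count layout.
def freqRows = ['US_state', 'age'].collectMany { v ->
    records.countBy { it[v] }.collect { value, n -> [var: v, value: value, count: n] }
}

freqRows.each { println it }

// Look up a single cell the same way the examples above do.
def vtRow = freqRows.find { it.var == 'US_state' && it.value == 'VT' }
println "VT count: ${vtRow.count}"
```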
Frequency by-table results
By-table results are also placed into a StatResultTable. Like univar by-var tables, each row is uniquely defined by the permutations of the by-var levels. Each row then has a table cell containing the frequency table for the vars specified. For example:
    def stats = new Stats()
    stats.configure() {
        // specify the URL to the data file containing the string variable
        // 'US_state'
        data(source: 'file:/c:/temp/census.mbng')
        freq(vars: 'US_state', by: 'age', table: 'stateByAge')
    }
    def results = stats.run()
    def stateByAgeTable = results.stateByAge
Contents of stateByAge table
age | var | value | count | cumCount | cumPct |
10 | US_state | AK | 12 | 12 | |
10 | US_state | AR | 5 | 17 | |
10 | US_state | … | … | … | … |
11 | US_state | AK | 8 | 8 | |
11 | US_state | AR | 5 | 13 | |
11 | US_state | … | … | … | … |
12 | US_state | AK | 15 | 15 | |
12 | US_state | AR | 8 | 23 | |
12 | US_state | … | … | … | … |
You can then query the table to find individual rows to access counts:
    // get count of people where age is 12 and US state is Alaska
    freqTable = stateByAgeTable.findAll{ it.age == 12 }
    count = freqTable.find{ it.var == 'US_state' && it.value == 'AK' }.count
    println 'count: ' + count

    output:
    count: 15
Here is an example with two by variables:
    def stats = new Stats()
    stats.configure() {
        // specify the URL to the data file containing the string variable
        // 'US_state'
        data(source: 'file:/c:/temp/census.mbng')
        freq(vars: 'US_state', by: 'age income', table: 'stateByAgeByIncome')
    }
    def results = stats.run()
    def byTable = results.stateByAgeByIncome
Contents of stateByAgeByIncome table
age | income | var | value | count | cumCount | cumPct |
10 | 0 | US_state | AK | 12 | 12 | 2.2 |
10 | 0 | US_state | AR | 5 | 17 | 3.15 |
10 | … | … | … | … | … | … |
10 | 0 | US_state | WY | 8 | 540 | 100 |
10 | 1-10000 | US_state | AK | 32 | 32 | 1.39 |
10 | 1-10000 | US_state | AR | 28 | 60 | 2.61 |
10 | … | … | … | … | … | … |
10 | 1-10000 | US_state | WY | 25 | 2300 | 100 |
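The cumCount and cumPct columns in the table above are consistent with a running total of count and that running total as a percentage of the group total (for example, 17 / 540 is about 3.15%). The plain-Groovy sketch below recomputes the first two rows of the age 10, income 0 group under that assumption. The group total of 540 is taken from the WY row, and the rounding to two decimals is a choice made for this sketch.

```groovy
// Plain-Groovy sketch of cumulative counts and percentages within one by-group.
def counts = [AK: 12, AR: 5]   // first two rows of the age 10, income 0 group
def total  = 540               // final cumCount of that group (the WY row)

def running = 0
def rows = counts.collect { value, n ->
    running += n
    [value: value, count: n, cumCount: running,
     cumPct: ((running * 100 / total) as double).round(2)]
}

rows.each { println it }
```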
You can then query the table to find individual rows to access counts:
    // get all rows where age is 10, income is '1-10000'
    subTable = byTable.findAll{ it.age == 10 && it.income == '1-10000' }
    // get count of first row where US_state == AK
    count = subTable.find{ it.var == 'US_state' && it.value == 'AK' }.count
    println 'count: ' + count

    output:
    count: 32
Multidimensional frequency tables
Multidimensional tables let you see joint distributions of two or more variables. In the 2-dimensional case, the additional values of pct, pctCol, and pctRow are calculated. Respectively, these values are the percentage of total counts, percentage of column, and percentage of row. This example shows the joint distribution of zip1 and reactivated.
    stats = new Stats()
    stats.configure() {
        data(source: '/work/data/postmail.mbng')
        freq(vars: 'zip1 & reactivated', table: 'zip1Reactivated')
    }
    def results = stats.run()
    Table zip1Reactivated = results.zip1Reactivated
'zip1Reactivated' raw results:
reactivated | zip1 | count | pct | pctCol | pctRow |
0 | 0 | 1229 | 5.012 | 10.0049 | 49.1207 |
0 | 1 | 1184 | | | |
0 | 2 | 1207 | | | |
0 | 3 | 1183 | | | |
0 | 4 | 1273 | | | |
0 | 5 | 1239 | | | |
0 | 6 | 1254 | | | |
0 | 7 | 1267 | | | |
0 | 8 | 1220 | | | |
0 | 9 | 1228 | | | |
1 | 0 | 1273 | | | |
1 | 1 | 1182 | | | |
1 | 2 | 1165 | | | |
1 | 3 | 1226 | | | |
1 | 4 | 1222 | | | |
1 | 5 | 1256 | | | |
1 | 6 | 1221 | | | |
1 | 7 | 1239 | | | |
1 | 8 | 1237 | | | |
1 | 9 | 1216 | | | |
The raw data can be accessed with scripts like this:
    row = zip1Reactivated.find{ it.reactivated == 1 && it.zip1 == '0' }
    println 'count where zip1 = 0 and reactivated = 1: ' + row.count
    println '% of total where zip1 = 0 and reactivated = 1: ' + row.pct
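The three percentages can be recomputed from the raw counts. The plain-Groovy sketch below copies the counts from the raw zip1Reactivated table and treats reactivated as the column variable and zip1 as the row variable, which is how the one fully populated row (5.012, 10.0049, 49.1207) works out. It uses integer zip1 values for brevity and does not call the Stats API.

```groovy
// Counts copied from the raw zip1Reactivated table; list index = zip1 digit.
def counts = [
    0: [1229, 1184, 1207, 1183, 1273, 1239, 1254, 1267, 1220, 1228],
    1: [1273, 1182, 1165, 1226, 1222, 1256, 1221, 1239, 1237, 1216]
]
def grandTotal = counts.values().flatten().sum()

def cells = []
counts.each { reactivated, col ->
    def colTotal = col.sum()                                // all rows for this reactivated value
    col.eachWithIndex { n, zip1 ->
        def rowTotal = counts.values().sum { it[zip1] }     // all columns for this zip1 value
        cells << [reactivated: reactivated, zip1: zip1, count: n,
                  pct:    100.0d * n / grandTotal,
                  pctCol: 100.0d * n / colTotal,
                  pctRow: 100.0d * n / rowTotal]
    }
}

def first = cells.find { it.reactivated == 0 && it.zip1 == 0 }
printf('%.4f %.4f %.4f%n', first.pct, first.pctCol, first.pctRow)
```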
Sample report
Zip1, Mailed, file: '/work/data/postmail.mbng'
Count % % Column % Row | 0 | 1 | Total |
0 | 1229.0 5.012 10.0049 49.1207 | 1273.0 5.1914687 10.402877 50.879295 | 2502.0 10.203499 10.203499 100.0 |
1 | 1184.0 4.828514 9.638555 50.042267 | 1182.0 4.8203583 9.65923 49.957733 | 2366.0 9.648872 9.648872 100.0 |
2 | 1207.0 4.9223113 9.825789 50.88533 | 1165.0 4.7510295 9.520308 49.11467 | 2366.0 9.648872 9.648872 100.0 |
3 | 1183.0 4.824436 9.630414 49.107513 | 1165.0 4.7510295 9.520308 49.11467 | 2366.0 9.648872 9.648872 100.0 |
4 | 1273.0 5.1914687 10.363074 51.022045 | 1165.0 4.7510295 9.520308 49.11467 | 2495.0 10.1749525 10.1749525 100.0 |
5 | 1239.0 5.052812 10.086291 49.659317 | 1256.0 5.1221404 10.263953 50.340683 | 2495.0 10.1749525 10.1749525 100.0 |
6 | 1254.0 5.113984 10.208401 50.666668 | 1221.0 4.9794054 9.977936 49.333332 | 2475.0 10.0933895 10.0933895 100.0 |
7 | 1267.0 5.167 10.31423 50.55866 | 1239.0 5.052812 10.1250305 49.44134 | 2506.0 10.219811 10.219811 100.0 |
8 | 1220.0 4.9753275 9.931619 49.65405 | 1237.0 5.044656 10.108686 50.34595 | 2457.0 10.019983 10.019983 100.0 |
9 | 1228.0 5.007952 9.996744 50.2455 | 1216.0 4.959015 9.937076 49.7545 | 2444.0 9.966967 9.966967 100.0 |
Total | 12284.0 50.095837 100.0 50.095837 | 12237.0 49.904163 100.0 49.904163 | 24521.0 100.0 100.0 100.0 |
For 3 or more dimensions, the only available value is the raw count. For example,
    stats = new Stats()
    stats.configure() {
        data(source: '/work/data/postmail.mbng')
        freq(vars: 'zip1 & reactivated & age', table: 'zip1Reactivated')
    }
    def results = stats.run()
    Table zip1Table = results.zip1Reactivated
Raw results
Zip1 | Reactivated | Age | Count |
0 | 1 | 15 | 12 |
0 | 1 | 16 | 15 |
Multidimensional frequency tables with by variables
Multidimensional frequency tables allow the user to subdivide a dataset with by variables. Here is an example of how to create a 2d freq table by income. The results are a table of tables.
    stats = new Stats()
    stats.configure() {
        data(source: '/work/data/postmail.mbng')
        freq(vars: 'zip1 & reactivated', by: 'income', table: 'zip1ReactByIncome')
    }
    def results = stats.run()
    Table zipTable = results.zip1ReactByIncome
    println zipTable
Sample output of zipTable:
Income (str) | Reactivated (num) | zip1 (str) | Count (num) | pct (num) | pctCol (num) | pctRow (num) |
0 | 0 | 0 | 200 | | | |
0 | 0 | 1 | 189 | | | |
0 | 0 | 2 | 203 | | | |
0 | 0 | 3 | 178 | | | |
0 | 0 | 4 | 178 | | | |
0 | 0 | 5 | 193 | | | |
0 | 0 | 6 | 201 | | | |
0 | 0 | 7 | 204 | | | |
0 | 0 | 8 | 180 | | | |
0 | 0 | 9 | 185 | | | |
0 | 1 | 0 | 190 | | | |
0 | 1 | 1 | 193 | | | |
0 | 1 | 2 | 188 | | | |
0 | 1 | 3 | 193 | | | |
0 | 1 | 4 | 202 | | | |
0 | 1 | 5 | 200 | | | |
0 | 1 | 6 | 203 | | | |
0 | 1 | 7 | 191 | | | |
0 | 1 | 8 | 187 | | | |
0 | 1 | 9 | 200 | | | |
1-10000 | 0 | 1 | 5 | | | |
… | … | … | … | … | … | … |
Code example of accessing a 2d freq table with a by var:
    // accessing results in 2d freq table with by vars
    subTable = zipTable.findAll{ it.income == '0' }
    count = subTable.find{ it.reactivated == 0 && it.zip1 == '0' }.count
    println 'count where income = 0 and reactivated is 0 and zip1 is 0: ' + count

    output:
    count where income = 0 and reactivated is 0 and zip1 is 0: 200
Quantiles
Quantiles are points taken at regular intervals from the cumulative distribution function of a random variable. The system provides a set of predefined quantiles: default, percentiles (100-quantiles), quartiles (4-quantiles), duo-deciles (20-quantiles), and tails (as mapped to those provided by classic ModelBuilder).
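As a point of reference, here is a plain-Groovy sketch of one common quantile definition, the nearest-rank method. Which estimation method the Stats task actually uses is not stated in this document, so this is an assumption chosen for simplicity. The `quantile` closure is hypothetical.

```groovy
// Nearest-rank quantile: the smallest value with at least a fraction p
// of the data at or below it. This is one of several common definitions.
def quantile = { List values, double p ->
    def sorted = values.sort(false)     // sorted copy; input is not modified
    int rank = Math.max(1, (int) Math.ceil(p * sorted.size()))
    sorted[rank - 1]
}

def data = (1..100).toList()
def quartiles = [0.25d, 0.5d, 0.75d].collect { quantile(data, it) }
println quartiles
```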
Quantile Configuration
The primary end-user API for quantiles is the quantile() element in the Groovy Stats builder.
The following keywords are supported by quantile:
- table: a String naming the table that holds the quantile result. This is optional. If table is omitted, the result table is assigned a default name made of the prefix "quantile" followed by a number (the next available number among tables with the same prefix in the stat result).
- vars: a collection of input variables, or a filter expression selecting the input variables, to include in this quantile table.
- by: a collection of input variables, or a filter expression selecting the input variables, to use as by-variables in the quantile table.
- bounds: defines the bounds for the quantile table. This value can be:
- String: the name of one of the system-provided quantile bounds. The name can be abbreviated to as few as its first three letters. For example, you can enter "per" instead of "percentiles".
- String: space-delimited doubles that specify bounds on the fly.
- Bounds instance: a reusable, user-defined Bounds instance.
- report: a String in URI format specifying the location where the report for the quantile result is generated.
For example:
    stats = new Stats()
    stats.configure() {
        data(source: '/work/mydata/census.mbng')
        // Create a percentile table named 'spending' on the variable
        // income
        quantile(vars: 'income', bounds: 'per', table: 'spending')
        // Create a quantile table using the default name (quantile1, given
        // this is the first result table with the prefix "quantile").
        // The embedded bounds are defined by the user.
        quantile(vars: 'age', by: 'income', bounds: '0.15 0.25 0.35 0.65 0.75 0.85')
    }
    results = stats.run()
Quantile Table Results
The result of each quantile computation is also placed in a named StatResultTable. The result is itself a table in which the quantile result is placed in a column named quantileResult. If the quantile has by-variables, the table has additional columns for them.
Quantile Result table without by-var
quantileResult |
quantileResult Instance |
Quantile Result table with by-var
byVar1 | byVar2 | quantileResult |
1 | aa | quantileResult Instance |
2 | bb | quantileResult Instance |
… | … | quantileResult Instance |
… | … | quantileResult Instance |
Each quantileResult instance can be viewed as a table with two or more columns, where the first column is "bounds" and each additional column is named after a variable with quantile calculations. Accessing "bounds" or a variable column yields an array of doubles holding the quantile bounds or the corresponding values for that variable.
bounds | cb1 | cb2 | cb3 | cb4 |
double[] | double[] | double[] | double[] | double[] |
    d = [0.15, 0.25] as double[]
    def bnds = new Bounds(d)
    q = new Quantile()
    q.configure() {
        data(source: '/work/jdoe/data/census.mbng')
        quantile(vars: 'cb*', bounds: bnds)
        quantile(vars: 'zip', by: 'target', bounds: bnds)
    }
    def rs = q.run()
    // Access first quantile table and its simple result using first index 0
    def quantileRS1 = rs.quantile1[0].quantileResult
    // Access the quantile result for cb2
    def bounds = quantileRS1.bounds   // bounds is a double[] with {0.15, 0.25}
    def cb2 = quantileRS1.cb2         // cb2 is a double[2] with two values
                                      // corresponding to the 0.15 and 0.25 quantiles
    // Access the quantile result for by-var target value '1'
    def quantileRS2 = rs.quantile2.find{ it.target == '1' }.quantileResult
To quickly view a quantile result, invoke a println statement on the result table, such as:
- println results.spending
- println results.quantile1
Here’s an example of the raw result produced by the println statement:
    Quantile spending
    cb1 (Percentiles)
      0.0000       1.2800    601
      1.0000      11.0900    641
      2.0000      14.5400    648
      3.0000      17.2000    653
    ……
     98.0000     972.8900    650
     99.0000    1321.3900    653
    100.0000    8495.4902    655
For a quantile with a by-variable, the result is printed as:
    Quantile A (Quartiles)
    By-Var [B = 1]
      0.000000     1
     25.000000    23
     50.000000    48
     75.000000    73
    100.000000    96
    By-Var [B=2]
      0.000000     2
     25.000000    24
     50.000000    49
     75.000000    74
    100.000000    97
    ….
Covariance
You can compute variable covariance with the covar element in a Stats task. The covar element supports weight variables, by-variables, and binners on both by-variables and input variables. Optionally, on symmetric correlation matrices, you can also perform PCA.
Covariance Configuration
The following keywords are supported in covariance.
- table: a String naming the table that holds the covariance result. This is optional. If table is omitted, the result table is assigned a default name made of the prefix "covar" followed by a number (the next available number among tables with the same prefix in the stat result).
- report: a String in URI format specifying the location where the report for the covariance result is generated.
- rows: a collection of input variables, or a filter expression selecting the input variables, to use as row variables.
- cols: a collection of input variables, or a filter expression selecting the input variables, to use as column variables. If this is omitted, the column variables are the same as those defined by "rows", and the covariance is symmetric.
- by: a collection of input variables, or a filter expression selecting the input variables, to use as by-variables in the covariance table.
- pca: accepts the strings ('yes', 'no'), booleans (true, false), or numbers (1, 0) to compute a PCA result when the input variables are symmetric. When the input is symmetric and pca is omitted, no PCA result is generated.
- covar: accepts the strings ('yes', 'no'), booleans (true, false), or numbers (1, 0). If this value is true, the PCA result (if generated) is computed with the covariance matrix instead of the default correlation matrices. The default is false.
- missing: a String ('drop', 'pairwise') indicating how to handle records with missing variables. The default behavior is to drop records with missing variables.
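For orientation, the plain-Groovy sketch below computes a covariance and a correlation for two made-up columns. It divides by n (population covariance). This is an assumption: the value 833.25 it produces for the values 1..100 happens to match the [Covariance] A/A entry in the sample output later in this document, which suggests but does not confirm that convention. The closures `cov` and `corr` are hypothetical helpers, not part of the Stats API.

```groovy
// Population covariance: mean of the products of deviations from the means.
def cov = { List x, List y ->
    double mx = x.sum() / x.size()
    double my = y.sum() / y.size()
    (0..<x.size()).sum { i -> (x[i] - mx) * (y[i] - my) } / x.size()
}

// Pearson correlation: covariance scaled by both standard deviations.
def corr = { List x, List y -> cov(x, y) / Math.sqrt(cov(x, x) * cov(y, y)) }

def colA = (1..100).toList()
def colD = colA.reverse()      // perfectly anti-correlated with colA

println "cov(A,A)  = ${cov(colA, colA)}"
println "cov(A,D)  = ${cov(colA, colD)}"
println "corr(A,D) = ${corr(colA, colD)}"
```

Because cov(x, y) equals cov(y, x), a matrix built from a single rows list is symmetric, which is the case in which PCA is available.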
The following example illustrates how to use covar() to produce a simple symmetric covariance matrix for the three variables 'A C D'. The selection of rows and columns works just like the SLIM vars list in frequency and univariate statistics. For example:
    stats = new Stats()
    stats.configure() {
        data(source: '/work/jdoe/data/somedata.mbng')
        covar(rows: 'A C D', table: 'myTable')
    }
    results = stats.run()
The following example illustrates how to configure a covariance generator to do PCA (Principal Component Analysis). The ’pca’ option indicates a PCA analysis will be included in the result if and only if the covariance is symmetric. Because a table name is omitted, the Stats task will place the results in a table with an auto-generated name of "covar1".
    stats = new Stats()
    stats.configure() {
        data(source: '/work/jdoe/data/somedata.mbng')
        covar(rows: 'A C D', pca: true)
    }
    results = stats.run()
This example shows how to configure a covar table with by-variables and the missing value handling as "pairwise".
    stats = new Stats()
    stats.configure() {
        data(source: '/work/jdoe/data/somedata.mbng')
        covar(rows: 'A C D', by: 'B', pca: true, missing: 'pairwise')
    }
    results = stats.run()
By default if only a rows option is provided, the covar matrices will be symmetric. If the user wants to configure asymmetric covariance they must provide a columns list. This is done with the "cols" option. If the table is asymmetric, PCA will not be computed and the pca option will be ignored. The following example shows an asymmetric covar declaration with a single by-variable.
    stats = new Stats()
    stats.configure() {
        data(source: '/work/jdoe/data/somedata.mbng')
        covar(table: 'myCovar', rows: 'A C D', by: 'B', cols: 'A')
    }
    results = stats.run()
Covariance Results
The result of each covar request is also placed in a named StatResultTable. The result is itself a table in which the covariance result is placed in a column named covarResult. If the covar request has by-variables, the table has additional columns for them.
Covariance Result table without by-var
covarResult |
CovarianceResult Instance |
Covariance Result table with by-var
byVar1 | byVar2 | CovarResult |
1 | aa | CovarianceResult Instance |
2 | bb | CovarianceResult Instance |
… | … | CovarianceResult Instance |
… | … | CovarianceResult Instance |
Each CovarianceResult instance is also a table with the following columns:
vars | CovarMatrix | CorrMatrix | PWeightMatrix | CountMatrix | CorrSumProductsMatrix |
String[] | double[][] | double[][] | double[][] | double[][] | double[][] |
SampleWeightSumMatrix | UncorrSumProdMatrix | PCountMatrix | RowMeansMatrix | ColumnMeansMatrix | PCA |
double[][] | double[][] | double[][] | double[][] | double[][] | PCA instance |
Each PCA instance is also a table with n+5 columns, where n is the number of by-var variables.
PCA without by-var
vars | EigenVectors | EigenValues | VarMeans | VarVariances |
String[] | double[][] | double[] | double[] | double[] |
PCA with by-var
byVar1 | byVar2 | vars | EigenVectors | EigenValues | VarMeans | VarVariances |
1 | aaa | String[] | double[][] | double[] | double[] | double[] |
The following example shows how to access a covariance instance and its PCA through the result table:
    c = new Covar()
    c.configure() {
        data(source: '/work/jdoe/somedata.mbng')
        covar(rows: 'A C D', pca: 'yes')            // results are in table 'covar1'
        covar(rows: 'A C', by: 'B D', pca: 'yes')   // results are in table 'covar2'
    }
    def statRS = c.run()
    // Get to the covariance result instance by first retrieving the named table
    // and then getting the covariance result using the column name (covarResult)
    // at index 0, since we know this covariance result table has no by-var
    def covarRS = statRS.covar1[0].CovarResult
    // Access its PCA and its eigenvalues
    def pca = covarRS.PCA
    def eigenvalues = pca.EigenValues
    // Access Covariance Matrix
    println("Covariance Matrix " + covarRS.CovarMatrix)
    // Access Correlation Matrix
    println("Correlation Matrix " + covarRS.CorrMatrix)
    ….
    // Get to the covariance result with by-var by first retrieving the named
    // table and finding the covariance result with the (by-var) key
    covarRS = statRS.covar2.find{ it.B == '1' && it.D == 'aaa' }.CovarResult
A quick snapshot of a covariance result is included here:
    results = stats.run()
    println results

    [Covariance]
                  A             C             D
    A    833.250000     -8.250000   -833.250000
    C     -8.250000      8.250000      8.250000
    D   -833.250000      8.250000    833.250000
    [P(Count)]
                  A             C             D
    A      0.000000      0.324637      0.000000
    C      0.324635      0.000000      0.324637
    D      0.000000      0.324637      0.000000
    [Correlation]
                  A             C             D
    A      1.000000     -0.099504     -1.000000
    C     -0.099504      1.000000      0.099504
    D     -1.000000      0.099504      1.000000
    ….
    [P(Weight)]
                  A             C             D
    …
    [Count]
                  A             C             D
    …
    [CorrSumProd]
                  A             C             D
    …
    [Sum Weight]
                  A             C             D
    …
    [UncorrSumProd]
                  A             C             D
    A      338350.0       26950.0      171700.0
    C       26950.0        3850.0       28600.0
    D      171700.0       28600.0      338350.0
If PCA is produced as part of the covar() statement, the output also includes:
    PCA Result
    Eigenvalues    2.019425    0.980575   -0.000000
    A             -0.700465    0.096691   -0.707107
    C              0.136742    0.990607    0.000000
    D              0.700465   -0.096691   -0.707107
Principal Component Analysis Projection
The PCA projection is a system-provided transform. You can project PCA as part of a data transform, either in a Stats command or using the Apply task. The PCA projection needs a PCA result from the covariance calculation. You must specify the number of eigenvectors (subspace size) to use when computing the main projection vector; the eigenvectors with the largest eigenvalues are used. The result vectors (main, coeff, and residual) of PCA can be either flattened out (the default) or kept as numeric arrays.
PCA Projection Configuration
The following keywords are supported in PCA projection.
- data. The data node is used to load the target dataset. The arguments for this data section are very similar to those accepted by stats(). Using a system-provided transform, you can also rename or copy variables on the fly so that the new dataset has the variables expected by the PCA result.
- project. Projection configuration options include:
- ref: a PCA result from covariance analysis.
- vectors (v): a numeric value indicating the subspace size (the number of eigenvectors to use).
- compute (c): the compute list (what is included in the output). It takes any combination of c (coefficients), r (residuals), and m (main vectors).
- zscaling (z): a boolean value [yes/no, true/false, 1/0] indicating whether z-scaling is on. The default is on.
- normalize (n): a boolean value [yes/no, true/false, 1/0] indicating whether normalization is applied. The default is to apply it.
- print: a boolean value [yes/no, true/false, 1/0] indicating whether a summary is displayed at the end. The default is to display the summary.
- flat: a boolean value [yes/no, true/false, 1/0] indicating whether the projection result is flattened into an individual (double) variable for each component. If flat is false, the projection result is output as double[] for the variables mains, coefficients, and residuals. The default is to flatten (flat = true).
The output dataset contains all original variables plus the projection values, laid out as follows:
Flat option = true (by default)
main1 | main2 | … | coefficient1 | coefficient2 | … | residual1 | residual2 | … |
double | double | double | double | double | double | double | double | double |
or
Flat option = false
Mains | coefficients | residuals |
double[] | double[] | double[] |
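The relationship between coefficients, main vectors, and residuals can be sketched in plain Groovy. The sketch assumes the usual PCA projection: each coefficient is the dot product of the (already z-scaled) record with one eigenvector, the main vector is the sum of coefficient times eigenvector over the chosen subspace, and the residual is what remains. The eigenvectors and the input record are made-up values, and whether the product defines these outputs in exactly this way is not confirmed by this document.

```groovy
// Dot product of two equally sized numeric lists.
def dot = { List a, List b -> (0..<a.size()).sum { a[it] * b[it] } }

// Project a z-scaled record onto the first `vectors` eigenvectors.
def project = { List z, List eigenVectors, int vectors ->
    def basis = eigenVectors.take(vectors)
    def coefficients = basis.collect { v -> dot(z, v) }
    def mainVec = (0..<z.size()).collect { i ->
        (0..<basis.size()).sum { k -> coefficients[k] * basis[k][i] }
    }
    def residual = (0..<z.size()).collect { i -> z[i] - mainVec[i] }
    [coefficients: coefficients, main: mainVec, residual: residual]
}

// Hypothetical orthonormal eigenvectors, ordered by decreasing eigenvalue.
def eigenVectors = [[0.6d, 0.8d], [-0.8d, 0.6d]]
def result = project([1.0d, 2.0d], eigenVectors, 1)
println result
```

With flat = true these lists would correspond to separate main1, main2, coefficient1, and so on variables; with flat = false they would stay as arrays.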
The following example shows how to apply a PCA projection to a dataset using a pre-calculated PCA instance. This example makes use of all available options. The dataset can be the original dataset used to calculate the eigenvectors or a new dataset. A new dataset may not use the same names for the variables used in the PCA calculation, so be sure to rename variables in the new dataset to match the variables expected by the PCA calculation.
    // Assume that we have generated a PCA result
    pcaInstance = covarRS.PCA
    // Create an Apply task to perform PCA projection.
    Apply apply = new Apply()
    apply.configure() {
        // define source data to be projected.
        data(source: Dataset, sink: '/work/data/projectPCA.mbd') {
            …
            rename(vars: 'zip=zipCode')
            pca(ref: pcaInstance, vectors: 2, compute: 'c r m', zScaling: true,
                normalize: true, print: true)
        }
    }
    apply.run()
Here’s another example for projecting PCA with all default options:
    apply.configure() {
        data(source: '/work/somedataset', sink: '/work/someOutput.mbd') {
            pca(ref: somePCA)
        }
    }
    apply.run()