binning - WeAreCAS

Description

The `binning` action in the `dataPreprocess` action set is a powerful tool for unsupervised variable discretization. It groups continuous numerical variables into a smaller number of 'bins'. This is a common data preparation step for many machine learning algorithms, as it can help manage outliers, reduce noise, and handle non-linear relationships. The action supports several methods for creating bins, such as equal-width (bucket), equal-frequency (quantile), or user-defined cutpoints.

dataPreprocess.binning { binEnds={double-1, double-2, ...}, binMapping="LEFT"|"RIGHT", binMissing=TRUE|FALSE, binStarts={double-1, double-2, ...}, binWidths={double-1, double-2, ...}, casOut={...}, casOutBinDetails={...}, code={...}, copyAllVars=TRUE|FALSE, copyVars={"variable-name-1", "variable-name-2", ...}, cutPoints={double-1, double-2, ...}, freq="variable-name", fuzzyCompare=double, includeInputVars=TRUE|FALSE, includeMissingGroup=TRUE|FALSE, inputs={{...}, {...}}, method="BUCKET"|"CUTPTS"|"QUANTILE", nBinsArray={integer-1, integer-2, ...}|integer, noDataLowerUpperBound=TRUE|FALSE, outputTableOptions={...}, outVarsNamePrefix="string", outVarsNameSuffix="string", percentileDefinition=integer, percentileMaxIterations=integer, percentileTolerance=double, sasVarNameLength=TRUE|FALSE, table={...}, weight="variable-name" };

Settings

Parameter	Description
binEnds	Specifies the bin end values. If applicable, they override the data maximum values.
binMapping	Controls how to map values that fall at the boundary between consecutive bins. LEFT enables you to express the bins with [], (], ..., (] notation. RIGHT enables [), [), ..., [] notation.
binMissing	When set to True, bins missing values into a separate bin. The ID for this bin is 0.
binStarts	Specifies the bin start values. If applicable, they override the data minimum values.
binWidths	Specifies the bin width.
casOut	Specifies the output table to store the scored data.
casOutBinDetails	Specifies the output table to store details about the created bins.
code	Specifies settings for generating SAS DATA step scoring code.
copyAllVars	When set to True, all variables from the input table are copied to the output table.
copyVars	Specifies a list of variables to copy from the input table to the output table.
cutPoints	Specifies the user-provided cutpoints for the 'CUTPTS' binning method.
freq	Specifies the frequency variable for the analysis.
fuzzyCompare	Specifies the fuzzy comparison threshold used to determine the distinctness of numeric values.
includeInputVars	When set to True, the original analysis variables are included in the output table.
includeMissingGroup	When set to True, missing values are allowed as group-by keys.
inputs	Specifies the numerical variables to be binned.
method	Specifies the binning technique to use: BUCKET (equal-width), QUANTILE (equal-frequency), or CUTPTS (user-defined).
nBinsArray	Specifies the number of bins to create for each variable.
noDataLowerUpperBound	When set to True, the global lower and upper bounds of the bin set are unlimited in the generated score code.
outputTableOptions	Specifies options for the result tables, such as which tables to return.
outVarsNamePrefix	Specifies a prefix to apply to the names of the generated binned variables.
outVarsNameSuffix	Specifies a suffix to apply to the names of the generated binned variables.
percentileDefinition	Specifies the percentile definition to use for the QUANTILE method (from 1 to 6).
percentileMaxIterations	Specifies the maximum number of iterations for percentile computation.
percentileTolerance	Specifies the tolerance for percentile computation.
sasVarNameLength	When set to True, constrains the output variable names to a maximum length of 32 characters.
table	Specifies the input CAS table containing the data to be processed.
weight	Specifies the weight variable for the analysis.

Data Preparation

Data Creation

This example creates a sample dataset `sample_data` in the active caslib. The table contains customer information, including age and income, which will be used in the binning examples.

data mycas.sample_data;
  do i = 1 to 100;
    age = 20 + floor(rand('UNIFORM') * 50);
    income = 30000 + floor(rand('UNIFORM') * 70000);
    output;
  end;
run;

Examples

This example performs a simple bucket (equal-width) binning on the `age` variable, dividing it into 5 bins. The results are stored in a new table named `binned_age`.

SAS® / CAS Code

Copied!

1	PROC CAS;
2	dataPreprocess.binning
3	TABLE={name='sample_data'},
4	inputs={{name='age'}},
5	method='BUCKET',
6	nBinsArray=5,
7	casOut={name='binned_age', replace=true};
8	RUN;

Result :
The action generates an output table `binned_age` in the active caslib. This table contains the original variables plus a new variable, `bin_age`, which holds the bin number (from 1 to 5) for each observation's age.

This example demonstrates quantile (equal-frequency) binning on both `age` and `income`. It creates 4 bins for `age` and 10 for `income`. It also generates two output tables: `binned_customers` containing the scored data, and `bin_details` containing the metadata about the bins (like lower/upper bounds for each bin). The original input variables are also copied to the output table.

SAS® / CAS Code

Copied!

1	PROC CAS;
2	dataPreprocess.binning
3	TABLE={name='sample_data'},
4	inputs={{name='age'}, {name='income'}},
5	method='QUANTILE',
6	nBinsArray={4, 10},
7	includeInputVars=true,
8	casOut={name='binned_customers', replace=true},
9	casOutBinDetails={name='bin_details', replace=true};
10	RUN;

Result :
Two tables are created in the active caslib. `binned_customers` contains the original `age` and `income` columns, plus two new columns: `bin_age` (with values from 1-4) and `bin_income` (with values from 1-10). The `bin_details` table contains detailed information about the cutoffs and statistics for each bin created for both variables.

This example uses the `CUTPTS` method to create custom bins for the `income` variable based on specific financial thresholds. It also demonstrates how to customize the output variable name using a prefix and suffix.

SAS® / CAS Code

Copied!

1	PROC CAS;
2	dataPreprocess.binning
3	TABLE={name='sample_data'},
4	inputs={{name='income'}},
5	method='CUTPTS',
6	cutPoints={50000, 75000, 90000},
7	outVarsNamePrefix='custom',
8	outVarsNameSuffix='group',
9	casOut={name='income_groups', replace=true};
10	RUN;

Result :
An output table `income_groups` is created. It includes a new variable named `custom_income_group`. This variable will have values 1 for income <= 50000, 2 for income between 50000 and 75000, 3 for income between 75000 and 90000, and 4 for income > 90000.

Description

Data Creation

Examples

Simple Bucket Binning

Quantile Binning with Multiple Variables and Detailed Output

User-Defined Binning with Cut Points