?> binning - WeAreCAS
dataPreprocess

binning

Description

The `binning` action in the `dataPreprocess` action set is a powerful tool for unsupervised variable discretization. It groups continuous numerical variables into a smaller number of 'bins'. This is a common data preparation step for many machine learning algorithms, as it can help manage outliers, reduce noise, and handle non-linear relationships. The action supports several methods for creating bins, such as equal-width (bucket), equal-frequency (quantile), or user-defined cutpoints.

dataPreprocess.binning { binEnds={double-1, double-2, ...}, binMapping="LEFT"|"RIGHT", binMissing=TRUE|FALSE, binStarts={double-1, double-2, ...}, binWidths={double-1, double-2, ...}, casOut={...}, casOutBinDetails={...}, code={...}, copyAllVars=TRUE|FALSE, copyVars={"variable-name-1", "variable-name-2", ...}, cutPoints={double-1, double-2, ...}, freq="variable-name", fuzzyCompare=double, includeInputVars=TRUE|FALSE, includeMissingGroup=TRUE|FALSE, inputs={{...}, {...}}, method="BUCKET"|"CUTPTS"|"QUANTILE", nBinsArray={integer-1, integer-2, ...}|integer, noDataLowerUpperBound=TRUE|FALSE, outputTableOptions={...}, outVarsNamePrefix="string", outVarsNameSuffix="string", percentileDefinition=integer, percentileMaxIterations=integer, percentileTolerance=double, sasVarNameLength=TRUE|FALSE, table={...}, weight="variable-name" };
Settings
ParameterDescription
binEndsSpecifies the bin end values. If applicable, they override the data maximum values.
binMappingControls how to map values that fall at the boundary between consecutive bins. LEFT enables you to express the bins with [], (], ..., (] notation. RIGHT enables [), [), ..., [] notation.
binMissingWhen set to True, bins missing values into a separate bin. The ID for this bin is 0.
binStartsSpecifies the bin start values. If applicable, they override the data minimum values.
binWidthsSpecifies the bin width.
casOutSpecifies the output table to store the scored data.
casOutBinDetailsSpecifies the output table to store details about the created bins.
codeSpecifies settings for generating SAS DATA step scoring code.
copyAllVarsWhen set to True, all variables from the input table are copied to the output table.
copyVarsSpecifies a list of variables to copy from the input table to the output table.
cutPointsSpecifies the user-provided cutpoints for the 'CUTPTS' binning method.
freqSpecifies the frequency variable for the analysis.
fuzzyCompareSpecifies the fuzzy comparison threshold used to determine the distinctness of numeric values.
includeInputVarsWhen set to True, the original analysis variables are included in the output table.
includeMissingGroupWhen set to True, missing values are allowed as group-by keys.
inputsSpecifies the numerical variables to be binned.
methodSpecifies the binning technique to use: BUCKET (equal-width), QUANTILE (equal-frequency), or CUTPTS (user-defined).
nBinsArraySpecifies the number of bins to create for each variable.
noDataLowerUpperBoundWhen set to True, the global lower and upper bounds of the bin set are unlimited in the generated score code.
outputTableOptionsSpecifies options for the result tables, such as which tables to return.
outVarsNamePrefixSpecifies a prefix to apply to the names of the generated binned variables.
outVarsNameSuffixSpecifies a suffix to apply to the names of the generated binned variables.
percentileDefinitionSpecifies the percentile definition to use for the QUANTILE method (from 1 to 6).
percentileMaxIterationsSpecifies the maximum number of iterations for percentile computation.
percentileToleranceSpecifies the tolerance for percentile computation.
sasVarNameLengthWhen set to True, constrains the output variable names to a maximum length of 32 characters.
tableSpecifies the input CAS table containing the data to be processed.
weightSpecifies the weight variable for the analysis.
Data Preparation
Data Creation

This example creates a sample dataset `sample_data` in the active caslib. The table contains customer information, including age and income, which will be used in the binning examples.

data mycas.sample_data;
  do i = 1 to 100;
    age = 20 + floor(rand('UNIFORM') * 50);
    income = 30000 + floor(rand('UNIFORM') * 70000);
    output;
  end;
run;

Examples

This example performs a simple bucket (equal-width) binning on the `age` variable, dividing it into 5 bins. The results are stored in a new table named `binned_age`.

SAS® / CAS Code
Copied!
1PROC CAS;
2 dataPreprocess.binning
3 TABLE={name='sample_data'},
4 inputs={{name='age'}},
5 method='BUCKET',
6 nBinsArray=5,
7 casOut={name='binned_age', replace=true};
8RUN;
Result :
The action generates an output table `binned_age` in the active caslib. This table contains the original variables plus a new variable, `bin_age`, which holds the bin number (from 1 to 5) for each observation's age.

This example demonstrates quantile (equal-frequency) binning on both `age` and `income`. It creates 4 bins for `age` and 10 for `income`. It also generates two output tables: `binned_customers` containing the scored data, and `bin_details` containing the metadata about the bins (like lower/upper bounds for each bin). The original input variables are also copied to the output table.

SAS® / CAS Code
Copied!
1PROC CAS;
2 dataPreprocess.binning
3 TABLE={name='sample_data'},
4 inputs={{name='age'}, {name='income'}},
5 method='QUANTILE',
6 nBinsArray={4, 10},
7 includeInputVars=true,
8 casOut={name='binned_customers', replace=true},
9 casOutBinDetails={name='bin_details', replace=true};
10RUN;
Result :
Two tables are created in the active caslib. `binned_customers` contains the original `age` and `income` columns, plus two new columns: `bin_age` (with values from 1-4) and `bin_income` (with values from 1-10). The `bin_details` table contains detailed information about the cutoffs and statistics for each bin created for both variables.

This example uses the `CUTPTS` method to create custom bins for the `income` variable based on specific financial thresholds. It also demonstrates how to customize the output variable name using a prefix and suffix.

SAS® / CAS Code
Copied!
1PROC CAS;
2 dataPreprocess.binning
3 TABLE={name='sample_data'},
4 inputs={{name='income'}},
5 method='CUTPTS',
6 cutPoints={50000, 75000, 90000},
7 outVarsNamePrefix='custom',
8 outVarsNameSuffix='group',
9 casOut={name='income_groups', replace=true};
10RUN;
Result :
An output table `income_groups` is created. It includes a new variable named `custom_income_group`. This variable will have values 1 for income <= 50000, 2 for income between 50000 and 75000, 3 for income between 75000 and 90000, and 4 for income > 90000.