The `binning` action in the `dataPreprocess` action set performs unsupervised variable discretization: it groups the values of continuous numeric variables into a smaller number of bins. Binning is a common data preparation step for many machine learning algorithms because it can limit the influence of outliers, reduce noise, and help capture non-linear relationships. The action supports several methods for creating bins: equal-width (`BUCKET`), equal-frequency (`QUANTILE`), and user-defined cutpoints (`CUTPTS`).

| Parameter | Description |
|---|---|
| binEnds | Specifies the bin end values. If applicable, they override the data maximum values. |
| binMapping | Controls how to map values that fall at the boundary between consecutive bins. LEFT enables you to express the bins with [], (], ..., (] notation. RIGHT enables [), [), ..., [] notation. |
| binMissing | When set to True, bins missing values into a separate bin. The ID for this bin is 0. |
| binStarts | Specifies the bin start values. If applicable, they override the data minimum values. |
| binWidths | Specifies the bin width. |
| casOut | Specifies the output table to store the scored data. |
| casOutBinDetails | Specifies the output table to store details about the created bins. |
| code | Specifies settings for generating SAS DATA step scoring code. |
| copyAllVars | When set to True, all variables from the input table are copied to the output table. |
| copyVars | Specifies a list of variables to copy from the input table to the output table. |
| cutPoints | Specifies the user-provided cutpoints for the 'CUTPTS' binning method. |
| freq | Specifies the frequency variable for the analysis. |
| fuzzyCompare | Specifies the fuzzy comparison threshold used to determine the distinctness of numeric values. |
| includeInputVars | When set to True, the original analysis variables are included in the output table. |
| includeMissingGroup | When set to True, missing values are allowed as group-by keys. |
| inputs | Specifies the numerical variables to be binned. |
| method | Specifies the binning technique to use: BUCKET (equal-width), QUANTILE (equal-frequency), or CUTPTS (user-defined). |
| nBinsArray | Specifies the number of bins to create for each variable. |
| noDataLowerUpperBound | When set to True, the global lower and upper bounds of the bin set are unlimited in the generated score code. |
| outputTableOptions | Specifies options for the result tables, such as which tables to return. |
| outVarsNamePrefix | Specifies a prefix to apply to the names of the generated binned variables. |
| outVarsNameSuffix | Specifies a suffix to apply to the names of the generated binned variables. |
| percentileDefinition | Specifies the percentile definition to use for the QUANTILE method (from 1 to 6). |
| percentileMaxIterations | Specifies the maximum number of iterations for percentile computation. |
| percentileTolerance | Specifies the tolerance for percentile computation. |
| sasVarNameLength | When set to True, constrains the output variable names to a maximum length of 32 characters. |
| table | Specifies the input CAS table containing the data to be processed. |
| weight | Specifies the weight variable for the analysis. |
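
The boundary rules described for `binMapping` and `binMissing` can be illustrated outside of CAS. The following Python sketch is not the action's implementation; it only mirrors the interval notation in the table above. The function name `assign_bin`, the `edges` list, and the choice to return `None` for out-of-range values (or for missing values when `bin_missing` is off) are assumptions made for illustration.

```python
import math
from bisect import bisect_left, bisect_right


def assign_bin(value, edges, mapping="LEFT", bin_missing=True):
    """Return a 1-based bin ID for `value`, given sorted bin `edges`.

    `edges` holds the lowest bound, the interior boundaries, and the
    highest bound, e.g. [0, 10, 20, 30] describes three bins.
    """
    # Missing values go to bin 0 when bin_missing is on (per the table);
    # returning None otherwise is an assumption of this sketch.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return 0 if bin_missing else None

    # Assumption: values outside the overall range are left unbinned.
    if value < edges[0] or value > edges[-1]:
        return None

    interior = edges[1:-1]
    if mapping == "LEFT":
        # [], (], ..., (]  -> a boundary value falls into the lower bin.
        return bisect_left(interior, value) + 1
    # RIGHT: [), [), ..., []  -> a boundary value falls into the upper bin.
    return bisect_right(interior, value) + 1
```

With `edges = [0, 10, 20, 30]`, the value 10 lands in bin 1 under `LEFT` and in bin 2 under `RIGHT`; the two settings differ only for values that sit exactly on an interior boundary.
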

This example creates a sample table named `sample_data` through the `mycas` libref, which is assumed to point to the active caslib. The table contains 100 rows of simulated customer information, including `age` and `income`, which are used in the binning examples.

    data mycas.sample_data;
        do i = 1 to 100;
            age = 20 + floor(rand('UNIFORM') * 50);
            income = 30000 + floor(rand('UNIFORM') * 70000);
            output;
        end;
    run;

This example performs a simple bucket (equal-width) binning on the `age` variable, dividing it into 5 bins. The results are stored in a new table named `binned_age`.

    PROC CAS;
        dataPreprocess.binning /
            table={name='sample_data'},
            inputs={{name='age'}},
            method='BUCKET',
            nBinsArray=5,
            casOut={name='binned_age', replace=true};
    RUN;

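
To make the equal-width idea concrete, here is a minimal Python sketch of how bucket boundaries can be derived from the observed minimum and maximum. It illustrates the general technique only and is not taken from the action's internals; the function name `bucket_edges` is hypothetical.

```python
def bucket_edges(values, n_bins):
    """Return n_bins + 1 equally spaced edges spanning min..max of `values`."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Build the lower edge of every bin, then close the last bin at the
    # observed maximum to avoid floating-point drift on the final edge.
    edges = [lo + i * width for i in range(n_bins)]
    edges.append(hi)
    return edges
```

For ages that span 20 to 69, five buckets would each be 9.8 years wide. Every bin covers the same range of values, but the number of rows per bin can differ widely when the data is skewed.
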

This example demonstrates quantile (equal-frequency) binning on both `age` and `income`. It creates 4 bins for `age` and 10 for `income`. It also generates two output tables: `binned_customers`, which contains the scored data, and `bin_details`, which contains metadata about the bins (such as the lower and upper bounds of each bin). Because `includeInputVars` is set to true, the original `age` and `income` variables are also included in the output table.

    PROC CAS;
        dataPreprocess.binning /
            table={name='sample_data'},
            inputs={{name='age'}, {name='income'}},
            method='QUANTILE',
            nBinsArray={4, 10},
            includeInputVars=true,
            casOut={name='binned_customers', replace=true},
            casOutBinDetails={name='bin_details', replace=true};
    RUN;

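
The equal-frequency idea can be sketched in Python with the standard library. This is an illustration under stated assumptions, not the action's algorithm: `statistics.quantiles` with `method="inclusive"` is one of several percentile definitions and need not match any of the `percentileDefinition` settings, and the helper names `quantile_cuts` and `bin_counts` are hypothetical.

```python
from bisect import bisect_left
from collections import Counter
from statistics import quantiles


def quantile_cuts(values, n_bins):
    """Return the n_bins - 1 interior cut points that split `values`
    into groups of roughly equal size."""
    return quantiles(values, n=n_bins, method="inclusive")


def bin_counts(values, cuts):
    """Count how many values land in each bin (1-based IDs). A value
    equal to a cut point goes to the lower bin."""
    counts = Counter(bisect_left(cuts, v) + 1 for v in values)
    return [counts.get(b, 0) for b in range(1, len(cuts) + 2)]
```

Unlike bucket binning, the bin widths here vary with the data while the row counts stay close to equal; heavy ties in the data can still make the counts uneven.
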
This example uses the `CUTPTS` method to create custom bins for the `income` variable based on specific financial thresholds. It also demonstrates how to customize the output variable name using a prefix and suffix.

    PROC CAS;
        dataPreprocess.binning /
            table={name='sample_data'},
            inputs={{name='income'}},
            method='CUTPTS',
            cutPoints={50000, 75000, 90000},
            outVarsNamePrefix='custom',
            outVarsNameSuffix='group',
            casOut={name='income_groups', replace=true};
    RUN;
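
Three cutpoints divide the value range into four bins. The Python sketch below lists the resulting intervals and assigns a bin ID to an income value. It assumes right-closed intervals and unbounded outer bins purely for illustration; the action's actual boundary handling depends on settings such as `binMapping` and `noDataLowerUpperBound`. The helper names `interval_labels` and `bin_id` are hypothetical.

```python
from bisect import bisect_left


def interval_labels(cut_points):
    """Describe the k + 1 intervals produced by k sorted cut points."""
    bounds = [float("-inf"), *cut_points, float("inf")]
    labels = []
    for lo, hi in zip(bounds, bounds[1:]):
        # An infinite upper bound is written as open; others are closed.
        close = ")" if hi == float("inf") else "]"
        labels.append(f"({lo}, {hi}{close}")
    return labels


def bin_id(value, cut_points):
    """Return the 1-based bin ID; a value on a cut point goes to the lower bin."""
    return bisect_left(cut_points, value) + 1
```

With the cutpoints from the example, `interval_labels([50000, 75000, 90000])` yields four intervals, and an income of 82,000 falls into the third.
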