The bartGauss action fits Bayesian additive regression trees (BART) models for a continuous response variable that is assumed to follow a normal distribution. BART is a non-parametric regression method that uses a sum of regression trees to model the relationship between predictors and a response. It is particularly effective for capturing complex, non-linear relationships and interactions in the data without requiring pre-specification of the model form. The method is Bayesian, meaning it uses priors for the model parameters and provides a full posterior distribution for predictions, allowing for robust uncertainty quantification.
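The sum-of-trees form can be illustrated with a toy sketch (Python here, purely schematic, not the SAS implementation): each tree contributes a small piece of the fit, and the prediction is the sum of all the trees' leaf outputs. Real BART samples the tree structures and leaf values from their posterior via MCMC; the hand-written stumps below only show the additive structure.

```python
# Schematic of BART's sum-of-trees model: f(x) = sum_j g(x; T_j, M_j).
# Each "tree" here is a fixed depth-1 stump; real BART draws tree
# structures and leaf values from their posterior distribution.

def stump(x, split, left_val, right_val):
    """A depth-1 regression tree: one split, two leaf values."""
    return left_val if x < split else right_val

# An ensemble of small trees, each explaining a little of the signal.
trees = [
    lambda x: stump(x, 0.3, -5, 5),
    lambda x: stump(x, 0.5, -2, 2),
    lambda x: stump(x, 0.7, -1, 1),
]

def predict(x):
    # The BART fit is the sum over all trees' leaf outputs.
    return sum(tree(x) for tree in trees)

print(predict(0.1))  # all three stumps take their left leaf: -8
print(predict(0.9))  # all three take their right leaf: 8
```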
| Parameter | Description |
|---|---|
| alpha | Specifies the significance level for constructing equal-tail credible limits for predictive margins. |
| attributes | Changes the attributes of variables used in the action. |
| class | Names the classification variables to use as explanatory variables in the analysis. |
| distributeChains | Specifies a distributed mode that divides the MCMC sampling in a grid environment. When you specify a value of 0, a single chain is run, and each worker node is assigned a portion of the training data. |
| freq | Names the numeric variable that contains the frequency of occurrence for each observation. |
| inputs | Specifies the input variables to use in the analysis. |
| leafSigmaK | Specifies the value used to determine the prior variance for the leaf parameter. |
| maxTrainTime | Specifies an upper limit (in seconds) on the time for MCMC sampling. |
| minLeafSize | Specifies the minimum number of observations that each child of a split must contain in the training data for the split to be considered. |
| missing | Specifies how to handle missing values in predictor variables. 'SEPARATE' is often a good default as it treats missingness as a potentially informative category. |
| model | Defines the model structure, including the dependent variable (target) and the explanatory variables (effects). |
| nBI | Specifies the number of burn-in iterations to perform before the action starts to save samples for prediction. These initial samples are discarded to allow the Markov chain to reach its stationary distribution. |
| nBins | Specifies the number of bins to use for discretizing continuous input variables, which can improve performance. |
| nClassLevelsPrint | Limits the display of class levels in the output tables. A value of 0 suppresses all levels. |
| nMC | Specifies the number of MCMC iterations to perform after the burn-in phase. This is the main sample size for posterior inference. |
| nMCDist | Specifies the number of MCMC iterations for each chain when running in distributed mode. |
| nominals | Specifies the nominal (categorical) input variables to use in the analysis. |
| nThin | Specifies the thinning rate of the simulation, which saves one sample every 'nThin' iterations to reduce autocorrelation in the saved chain. |
| nTree | Specifies the number of trees in the sum-of-trees ensemble. A larger number of trees can capture more complex patterns but increases computation time. |
| obsLeafMapInMem | When set to True, stores a mapping of each observation to terminal nodes in memory, which can speed up certain post-processing tasks. |
| orderSplit | Specifies the minimum cardinality for which a categorical input uses splitting rules according to level ordering. |
| output | Creates an output table containing observation-wise statistics, such as predicted values and residuals. |
| outputTables | Lists the names of results tables (e.g., ModelInfo, VarImp) to save as CAS tables on the server. |
| partByFrac | Specifies the fraction of the data to be used for testing, allowing for random partitioning. |
| partByVar | Names a variable in the input table whose values are used to partition the data into training and testing roles. |
| quantileBin | When set to True, bin boundaries are set at quantiles of numeric inputs, which can handle skewed distributions better than equal-width bins. |
| sampleSummary | Creates an output table that contains a summary of the sum-of-trees ensemble samples. |
| seed | Specifies a seed for the pseudorandom number generator to ensure reproducibility of the analysis. |
| sigmaDF | Specifies the degrees of freedom of the scaled inverse chi-square prior for the error variance parameter. |
| sigmaLambda | Specifies the scale parameter of the scaled inverse chi-square prior for the error variance parameter. |
| sigmaQuantile | Specifies the quantile level to determine the scale parameter of the inverse chi-square prior for the error variance. |
| store | Saves the fitted model to a binary object in a CAS table, which can be used later for scoring new data with the bart.bartScore action. |
| table | Specifies the input data table for the analysis. |
| target | Specifies the target (dependent or response) variable for the model. |
| trainInMem | When set to True, stores the training data in memory to potentially speed up the training process. |
| treePrior | Specifies the parameters of the regularization prior for the tree structure, controlling its complexity. |
| varAutoCorr | Specifies the autocorrelation lags to compute for the variance parameter in the MCMC chain diagnostics. |
| varEst | Specifies an initial value for the error variance. If not specified, it's estimated from an initial linear regression. |
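The sampling parameters interact in a simple way: the nBI burn-in draws are discarded, nMC further iterations are run, thinning keeps one draw every nThin iterations, and alpha sets equal-tail credible limits on the retained draws. A small Python sketch of that bookkeeping (illustrative arithmetic only, not how the action is invoked, and under the reading of nMC and nThin given in the table above):

```python
def saved_draws(n_mc, n_thin):
    """Posterior draws retained after thinning: one per n_thin iterations."""
    return n_mc // n_thin

def equal_tail_limits(draws, alpha):
    """Equal-tail credible limits: the alpha/2 and 1 - alpha/2
    empirical quantiles of the retained draws."""
    ordered = sorted(draws)
    n = len(ordered)
    lo = ordered[int((alpha / 2) * (n - 1))]
    hi = ordered[int((1 - alpha / 2) * (n - 1))]
    return lo, hi

# With nBI=500, nMC=2000, nThin=10: 2500 total iterations,
# but only 200 draws are kept for posterior inference.
print(saved_draws(2000, 10))

# 95% equal-tail interval (alpha=0.05) over toy posterior draws 1..100.
print(equal_tail_limits(range(1, 101), 0.05))
```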
This example creates a sample dataset named 'sample_data'. The target variable 'y' is generated from a combination of linear and non-linear functions of the predictors 'x1', 'x2', 'x3', and 'c1', with added Gaussian noise. This makes it suitable for the bartGauss action, which models a normally distributed response.
data sample_data;
   call streaminit(123);
   do i = 1 to 1000;
      x1 = rand('UNIFORM');                /* uniform on [0, 1] */
      x2 = rand('UNIFORM') * 2 - 1;        /* uniform on [-1, 1] */
      x3 = rand('NORMAL');
      if rand('UNIFORM') < 0.5 then c1 = 'A';
      else c1 = 'B';
      /* Non-linear signal plus Gaussian noise. ifn returns the numeric
         value 5 when c1='A'; the character function ifc would force an
         implicit character-to-numeric conversion here. */
      y = 10 * sin(3.14 * x1) + 20 * (x2 - 0.5)**2 + 10 * x3
          + ifn(c1='A', 5, 0) + rand('NORMAL');
      output;
   end;
run;

This example demonstrates a basic call to the bartGauss action. It assumes the 'sample_data' table has been loaded into a CAS session, then fits a BART model with 'y' as the target and 'x1', 'x2', 'x3', and 'c1' as predictors. This is the simplest way to run the action, relying on default settings for the number of trees, MCMC iterations, and other hyperparameters.
PROC CAS;
   loadactionset 'bart';
   bart.bartGauss /
      table='sample_data',
      target='y',
      inputs={'x1', 'x2', 'x3', 'c1'};
RUN;
This example shows a more advanced usage of the bartGauss action. It specifies the number of trees (nTree=100), burn-in iterations (nBI=500), and MCMC samples (nMC=2000). It also partitions the data, using 25% for testing (partByFrac). The fitted model is saved to a CAS table named 'bart_model_store' for later use. Additionally, an output table 'bart_predictions' is created to store predicted values and residuals for each observation.
PROC CAS;
   loadactionset 'bart';
   bart.bartGauss /
      table={name='sample_data'},
      target='y',
      inputs={'x1', 'x2', 'x3'},
      nominals={'c1'},
      nTree=100,
      nBI=500,
      nMC=2000,
      seed=456,
      partByFrac={test=0.25, seed=123},
      store={name='bart_model_store', replace=true},
      output={casOut={name='bart_predictions', replace=true},
              pred='predicted_y', resid='residual_y'},
      display={'FitStatistics', 'VarImp'};
RUN;
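Once the 'bart_predictions' table is fetched to the client (for example with the SWAT Python package, not shown here), holdout fit statistics follow directly from the observed target and the saved 'predicted_y' column. A small Python sketch of that arithmetic on toy stand-in values:

```python
import math

def fit_stats(actual, predicted):
    """RMSE and R-squared from observation-wise predictions, as one
    would compute from a downloaded predictions column against the
    observed target values."""
    resid = [a - p for a, p in zip(actual, predicted)]
    n = len(actual)
    rmse = math.sqrt(sum(r * r for r in resid) / n)
    mean_a = sum(actual) / n
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    ss_res = sum(r * r for r in resid)
    return rmse, 1 - ss_res / ss_tot

# Toy values standing in for the downloaded columns.
actual = [10.0, 12.0, 9.0, 15.0]
predicted = [11.0, 12.0, 10.0, 14.0]
rmse, r2 = fit_stats(actual, predicted)
print(round(rmse, 3), round(r2, 3))
```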