bartGauss - WeAreCAS

bartGauss

Description

The bartGauss action fits Bayesian additive regression trees (BART) models for a continuous response variable that is assumed to follow a normal distribution. BART is a non-parametric regression method that uses a sum of regression trees to model the relationship between predictors and a response. It is particularly effective for capturing complex, non-linear relationships and interactions in the data without requiring pre-specification of the model form. The method is Bayesian, meaning it uses priors for the model parameters and provides a full posterior distribution for predictions, allowing for robust uncertainty quantification.
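The sum-of-trees idea can be illustrated with a toy Python sketch. This is not the BART algorithm itself (BART samples the trees and leaf values via MCMC under a regularization prior); each "tree" below is just a fixed depth-1 stump, and the ensemble prediction is the sum of the stump outputs.

```python
import numpy as np

def stump(threshold, left_val, right_val):
    """A depth-1 regression tree: predict left_val if x < threshold, else right_val."""
    return lambda x: np.where(x < threshold, left_val, right_val)

# Three hand-written "trees" standing in for the sampled trees T_1..T_m.
trees = [stump(0.3, -1.0, 1.0), stump(0.6, 0.5, 2.0), stump(0.9, 0.0, -0.5)]

def sum_of_trees(x):
    # A BART fit models f(x) as the sum of the individual tree outputs.
    return sum(t(x) for t in trees)

x = np.array([0.1, 0.5, 0.95])
print(sum_of_trees(x))  # each prediction is a sum over all three stumps
```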

bart.bartGauss result=results status=rc / alpha=double, attributes={{name="variable-name", format="string", formattedLength=integer, label="string", nfd=integer, nfl=integer}, ...}, class={{vars={"variable-name-1", ...}, descending=TRUE | FALSE, order="FORMATTED" | "FREQ" | "FREQFORMATTED" | "FREQINTERNAL" | "INTERNAL", ref="FIRST" | "LAST" | double | "string"}, ...}, distributeChains=integer, freq="variable-name", inputs={{name="variable-name", ...}, ...}, leafSigmaK=double, maxTrainTime=double, minLeafSize=integer, missing="MACBIG" | "MACSMALL" | "NONE" | "SEPARATE", model={depVars={{name="variable-name"}, ...}, effects={{vars={"string-1", ...}}, ...}}, nBI=integer, nBins=integer, nClassLevelsPrint=integer, nMC=integer, nMCDist=integer, nominals={{name="variable-name", ...}, ...}, nThin=integer, nTree=integer, obsLeafMapInMem=TRUE | FALSE, orderSplit=integer, output={alpha=double, avgOnly=TRUE | FALSE, casOut={caslib="string", ...}, copyVars="ALL" | "ALL_MODEL" | "ALL_NUMERIC" | {"variable-name-1", ...}, lcl="string", pred="string", resid="string", role="string", ucl="string"}, outputTables={names={"string-1", ...} | {key-1={casouttable-1}, ...}, ...}, partByFrac={seed=integer, test=double}, partByVar={name="variable-name", test="string", train="string"}, quantileBin=TRUE | FALSE, sampleSummary={casout={caslib="string", ...}, avgNode="string", propAccepted="string", sampSaved="string", variance="string"}, seed=64-bit-integer, sigmaDF=double, sigmaLambda=double, sigmaQuantile=double, store={caslib="string", name="table-name", ...}, table={name="table-name", ...}, target="variable-name", trainInMem=TRUE | FALSE, treePrior={depthBase=double, depthPower=double, pPrune=double, pSplit=double}, varAutoCorr={integer-1, ...}, varEst=double;
Settings
alpha: Specifies the significance level for constructing equal-tail credible limits for predictive margins.
attributes: Changes the attributes of variables used in the action.
class: Names the classification variables to use as explanatory variables in the analysis.
distributeChains: Specifies a distributed mode that divides the MCMC sampling in a grid environment. When you specify a value of 0, a single chain is run, and each worker node is assigned a portion of the training data.
freq: Names the numeric variable that contains the frequency of occurrence for each observation.
inputs: Specifies the input variables to use in the analysis.
leafSigmaK: Specifies the value used to determine the prior variance for the leaf parameter.
maxTrainTime: Specifies an upper limit (in seconds) on the time for MCMC sampling.
minLeafSize: Specifies the minimum number of observations that each child of a split must contain in the training data for the split to be considered.
missing: Specifies how to handle missing values in predictor variables. 'SEPARATE' is often a good default because it treats missingness as a potentially informative category.
model: Defines the model structure, including the dependent variable (target) and the explanatory variables (effects).
nBI: Specifies the number of burn-in iterations to perform before the action starts to save samples for prediction. These initial samples are discarded to allow the Markov chain to reach its stationary distribution.
nBins: Specifies the number of bins to use for discretizing continuous input variables, which can improve performance.
nClassLevelsPrint: Limits the display of class levels in the output tables. A value of 0 suppresses all levels.
nMC: Specifies the number of MCMC iterations to perform after the burn-in phase. This is the main sample size for posterior inference.
nMCDist: Specifies the number of MCMC iterations for each chain when running in distributed mode.
nominals: Specifies the nominal (categorical) input variables to use in the analysis.
nThin: Specifies the thinning rate of the simulation, which saves one sample every nThin iterations to reduce autocorrelation in the saved chain.
nTree: Specifies the number of trees in the sum-of-trees ensemble. More trees can capture more complex patterns but increase computation time.
obsLeafMapInMem: When set to True, stores a mapping of each observation to terminal nodes in memory, which can speed up certain post-processing tasks.
orderSplit: Specifies the minimum cardinality for which a categorical input uses splitting rules according to level ordering.
output: Creates an output table containing observation-wise statistics, such as predicted values and residuals.
outputTables: Lists the names of results tables (for example, ModelInfo and VarImp) to save as CAS tables on the server.
partByFrac: Specifies the fraction of the data to be used for testing, allowing for random partitioning.
partByVar: Names a variable in the input table whose values are used to partition the data into training and testing roles.
quantileBin: When set to True, bin boundaries are set at quantiles of numeric inputs, which can handle skewed distributions better than equal-width bins.
sampleSummary: Creates an output table that contains a summary of the sum-of-trees ensemble samples.
seed: Specifies a seed for the pseudorandom number generator to ensure reproducibility of the analysis.
sigmaDF: Specifies the degrees of freedom of the scaled inverse chi-square prior for the error variance parameter.
sigmaLambda: Specifies the scale parameter of the scaled inverse chi-square prior for the error variance parameter.
sigmaQuantile: Specifies the quantile level that determines the scale parameter of the inverse chi-square prior for the error variance.
store: Saves the fitted model to a binary object in a CAS table, which can be used later for scoring new data with the bart.bartScore action.
table: Specifies the input data table for the analysis.
target: Specifies the target (dependent or response) variable for the model.
trainInMem: When set to True, stores the training data in memory to potentially speed up the training process.
treePrior: Specifies the parameters of the regularization prior for the tree structure, controlling its complexity.
varAutoCorr: Specifies the autocorrelation lags to compute for the variance parameter in the MCMC chain diagnostics.
varEst: Specifies an initial value for the error variance. If it is not specified, the value is estimated from an initial linear regression.
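The sampling parameters nBI, nMC, and nThin interact in a simple way. The sketch below assumes the common MCMC bookkeeping convention (run burn-in, discard it, then keep every nThin-th post-burn-in draw); the exact accounting inside bartGauss may differ.

```python
# Hypothetical helper illustrating how burn-in, sampling, and thinning
# settings combine; not part of the bartGauss action itself.
def mcmc_schedule(nBI, nMC, nThin):
    total_iterations = nBI + nMC   # burn-in draws are run first, then discarded
    saved_samples = nMC // nThin   # keep every nThin-th post-burn-in draw
    return total_iterations, saved_samples

total, saved = mcmc_schedule(nBI=500, nMC=2000, nThin=10)
print(total, saved)  # 2500 200
```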
Data Preparation
Data Creation

This example creates a sample dataset named 'sample_data'. The target variable 'y' is generated from a combination of linear and non-linear functions of the predictors 'x1', 'x2', 'x3', and 'c1', with added Gaussian noise. This makes it suitable for the bartGauss action, which models a normally distributed response.

data sample_data;
  call streaminit(123);
  do i = 1 to 1000;
    x1 = rand('UNIFORM');
    x2 = rand('UNIFORM') * 2 - 1;
    x3 = rand('NORMAL');
    if rand('UNIFORM') < 0.5 then c1 = 'A';
    else c1 = 'B';
    y = 10 * sin(3.14 * x1) + 20 * (x2 - 0.5)**2 + 10 * x3 + (ifc(c1='A', 5, 0)) + rand('NORMAL');
    output;
  end;
run;
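For readers who want to inspect the simulated signal outside SAS, the same data-generating process can be reproduced with numpy. The individual random draws will not match SAS's streaminit(123) stream, but the structure of the signal is identical.

```python
import numpy as np

# numpy re-creation of the SAS DATA step above (different random stream).
rng = np.random.default_rng(123)
n = 1000
x1 = rng.uniform(size=n)                 # rand('UNIFORM')
x2 = rng.uniform(size=n) * 2 - 1         # uniform on [-1, 1]
x3 = rng.normal(size=n)                  # rand('NORMAL')
c1 = np.where(rng.uniform(size=n) < 0.5, 'A', 'B')
y = (10 * np.sin(3.14 * x1)
     + 20 * (x2 - 0.5) ** 2
     + 10 * x3
     + np.where(c1 == 'A', 5, 0)         # ifc(c1='A', 5, 0)
     + rng.normal(size=n))               # Gaussian noise
print(y.shape)
```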

Examples

This example demonstrates a basic call to the bartGauss action. It loads the 'sample_data' table into a CAS session, then fits a BART model with 'y' as the target and 'x1', 'x2', 'x3', and 'c1' as predictors. This is the simplest way to run the action, relying on default settings for the number of trees, MCMC iterations, and other hyperparameters.

SAS® / CAS Code

PROC CAS;
   loadactionset 'bart';
   bart.bartGauss /
      table='sample_data',
      target='y',
      inputs={'x1', 'x2', 'x3', 'c1'},
      nominals={'c1'};
RUN;

This example shows a more advanced usage of the bartGauss action. It specifies the number of trees (nTree=100), burn-in iterations (nBI=500), and MCMC samples (nMC=2000). It also partitions the data, using 25% for testing (partByFrac). The fitted model is saved to a CAS table named 'bart_model_store' for later use. Additionally, an output table 'bart_predictions' is created to store predicted values and residuals for each observation.

SAS® / CAS Code

PROC CAS;
   loadactionset 'bart';
   bart.bartGauss /
      table={name='sample_data'},
      target='y',
      inputs={'x1', 'x2', 'x3', 'c1'},
      nominals={'c1'},
      nTree=100,
      nBI=500,
      nMC=2000,
      seed=456,
      partByFrac={test=0.25, seed=123},
      store={name='bart_model_store', replace=true},
      output={casOut={name='bart_predictions', replace=true},
              pred='predicted_y', resid='residual_y'},
      display={'FitStatistics', 'VarImp'};
RUN;
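The pred, lcl, and ucl columns requested through the output parameter are posterior summaries of the saved MCMC draws. The numpy sketch below shows how such summaries are typically formed (posterior mean plus equal-tail quantiles at level alpha); it is illustrative only, since the action computes these internally from its own sampled trees.

```python
import numpy as np

# Stand-in for saved posterior prediction draws: 200 samples x 5 observations.
rng = np.random.default_rng(0)
draws = rng.normal(loc=3.0, scale=0.5, size=(200, 5))

alpha = 0.05
pred = draws.mean(axis=0)                        # posterior mean prediction
lcl = np.quantile(draws, alpha / 2, axis=0)      # lower equal-tail credible limit
ucl = np.quantile(draws, 1 - alpha / 2, axis=0)  # upper equal-tail credible limit
print(pred.shape)
```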

FAQ

What is the purpose of the bart.bartGauss action?
Which parameter is used to specify the input data table?
How can I define the model's dependent and independent variables?
What does the 'nTree' parameter control?
How are the MCMC iterations managed in this action?
How can the trained model be saved for future scoring?
How does the action handle missing values in predictor variables?