bartGauss - WeAreCAS

bartGauss

Description

The bartGauss action fits Bayesian additive regression trees (BART) models for a continuous response variable that is assumed to follow a normal distribution. BART is a non-parametric regression method that uses a sum of regression trees to model the relationship between predictors and a response. It is particularly effective for capturing complex, non-linear relationships and interactions in the data without requiring pre-specification of the model form. The method is Bayesian, meaning it uses priors for the model parameters and provides a full posterior distribution for predictions, allowing for robust uncertainty quantification.
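The sum-of-trees idea can be illustrated with a toy Python sketch. This is not the BART algorithm itself (BART samples the trees and leaf values via MCMC under a regularization prior); each "tree" below is just a fixed depth-1 stump, and the ensemble prediction is the sum of the stump outputs.

```python
import numpy as np

def stump(threshold, left_val, right_val):
    """A depth-1 regression tree: predict left_val if x < threshold, else right_val."""
    return lambda x: np.where(x < threshold, left_val, right_val)

# Three hand-written "trees" standing in for the sampled trees T_1..T_m.
trees = [stump(0.3, -1.0, 1.0), stump(0.6, 0.5, 2.0), stump(0.9, 0.0, -0.5)]

def sum_of_trees(x):
    # A BART fit models f(x) as the sum of the individual tree outputs.
    return sum(t(x) for t in trees)

x = np.array([0.1, 0.5, 0.95])
print(sum_of_trees(x))  # each prediction is a sum over all three stumps
```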

bart.bartGauss result=results status=rc / alpha=double, attributes={{name="variable-name", format="string", formattedLength=integer, label="string", nfd=integer, nfl=integer}, ...}, class={{vars={"variable-name-1", ...}, descending=TRUE | FALSE, order="FORMATTED" | "FREQ" | "FREQFORMATTED" | "FREQINTERNAL" | "INTERNAL", ref="FIRST" | "LAST" | double | "string"}, ...}, distributeChains=integer, freq="variable-name", inputs={{name="variable-name", ...}, ...}, leafSigmaK=double, maxTrainTime=double, minLeafSize=integer, missing="MACBIG" | "MACSMALL" | "NONE" | "SEPARATE", model={depVars={{name="variable-name"}, ...}, effects={{vars={"string-1", ...}}, ...}}, nBI=integer, nBins=integer, nClassLevelsPrint=integer, nMC=integer, nMCDist=integer, nominals={{name="variable-name", ...}, ...}, nThin=integer, nTree=integer, obsLeafMapInMem=TRUE | FALSE, orderSplit=integer, output={alpha=double, avgOnly=TRUE | FALSE, casOut={caslib="string", ...}, copyVars="ALL" | "ALL_MODEL" | "ALL_NUMERIC" | {"variable-name-1", ...}, lcl="string", pred="string", resid="string", role="string", ucl="string"}, outputTables={names={"string-1", ...} | {key-1={casouttable-1}, ...}, ...}, partByFrac={seed=integer, test=double}, partByVar={name="variable-name", test="string", train="string"}, quantileBin=TRUE | FALSE, sampleSummary={casout={caslib="string", ...}, avgNode="string", propAccepted="string", sampSaved="string", variance="string"}, seed=64-bit-integer, sigmaDF=double, sigmaLambda=double, sigmaQuantile=double, store={caslib="string", name="table-name", ...}, table={name="table-name", ...}, target="variable-name", trainInMem=TRUE | FALSE, treePrior={depthBase=double, depthPower=double, pPrune=double, pSplit=double}, varAutoCorr={integer-1, ...}, varEst=double;
Settings
alpha: Specifies the significance level for constructing equal-tail credible limits for predictive margins.
attributes: Changes the attributes of variables used in the action.
class: Names the classification variables to use as explanatory variables in the analysis.
distributeChains: Specifies a distributed mode that divides the MCMC sampling in a grid environment. When you specify a value of 0, a single chain is run, and each worker node is assigned a portion of the training data.
freq: Names the numeric variable that contains the frequency of occurrence for each observation.
inputs: Specifies the input variables to use in the analysis.
leafSigmaK: Specifies the value used to determine the prior variance for the leaf parameter.
maxTrainTime: Specifies an upper limit (in seconds) on the time for MCMC sampling.
minLeafSize: Specifies the minimum number of observations that each child of a split must contain in the training data for the split to be considered.
missing: Specifies how to handle missing values in predictor variables. 'SEPARATE' is often a good default because it treats missingness as a potentially informative category.
model: Defines the model structure, including the dependent variable (target) and the explanatory variables (effects).
nBI: Specifies the number of burn-in iterations to perform before the action starts to save samples for prediction. These initial samples are discarded to allow the Markov chain to reach its stationary distribution.
nBins: Specifies the number of bins to use for discretizing continuous input variables, which can improve performance.
nClassLevelsPrint: Limits the display of class levels in the output tables. A value of 0 suppresses all levels.
nMC: Specifies the number of MCMC iterations to perform after the burn-in phase. This is the main sample size for posterior inference.
nMCDist: Specifies the number of MCMC iterations for each chain when running in distributed mode.
nominals: Specifies the nominal (categorical) input variables to use in the analysis.
nThin: Specifies the thinning rate of the simulation, which saves one sample every nThin iterations to reduce autocorrelation in the saved chain.
nTree: Specifies the number of trees in the sum-of-trees ensemble. More trees can capture more complex patterns but increase computation time.
obsLeafMapInMem: When set to True, stores a mapping of each observation to terminal nodes in memory, which can speed up certain post-processing tasks.
orderSplit: Specifies the minimum cardinality for which a categorical input uses splitting rules according to level ordering.
output: Creates an output table containing observation-wise statistics, such as predicted values and residuals.
outputTables: Lists the names of results tables (for example, ModelInfo and VarImp) to save as CAS tables on the server.
partByFrac: Specifies the fraction of the data to be used for testing, allowing for random partitioning.
partByVar: Names a variable in the input table whose values are used to partition the data into training and testing roles.
quantileBin: When set to True, bin boundaries are set at quantiles of numeric inputs, which can handle skewed distributions better than equal-width bins.
sampleSummary: Creates an output table that contains a summary of the sum-of-trees ensemble samples.
seed: Specifies a seed for the pseudorandom number generator to ensure reproducibility of the analysis.
sigmaDF: Specifies the degrees of freedom of the scaled inverse chi-square prior for the error variance parameter.
sigmaLambda: Specifies the scale parameter of the scaled inverse chi-square prior for the error variance parameter.
sigmaQuantile: Specifies the quantile level that determines the scale parameter of the inverse chi-square prior for the error variance.
store: Saves the fitted model to a binary object in a CAS table, which can be used later for scoring new data with the bart.bartScore action.
table: Specifies the input data table for the analysis.
target: Specifies the target (dependent or response) variable for the model.
trainInMem: When set to True, stores the training data in memory to potentially speed up the training process.
treePrior: Specifies the parameters of the regularization prior for the tree structure, controlling its complexity.
varAutoCorr: Specifies the autocorrelation lags to compute for the variance parameter in the MCMC chain diagnostics.
varEst: Specifies an initial value for the error variance. If it is not specified, the value is estimated from an initial linear regression.
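The sampling parameters nBI, nMC, and nThin interact in a simple way. The sketch below assumes the common MCMC bookkeeping convention (run burn-in, discard it, then keep every nThin-th post-burn-in draw); the exact accounting inside bartGauss may differ.

```python
# Hypothetical helper illustrating how burn-in, sampling, and thinning
# settings combine; not part of the bartGauss action itself.
def mcmc_schedule(nBI, nMC, nThin):
    total_iterations = nBI + nMC   # burn-in draws are run first, then discarded
    saved_samples = nMC // nThin   # keep every nThin-th post-burn-in draw
    return total_iterations, saved_samples

total, saved = mcmc_schedule(nBI=500, nMC=2000, nThin=10)
print(total, saved)  # 2500 200
```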
Data Preparation
Data Creation

This example creates a sample dataset named 'sample_data'. The target variable 'y' is generated from a combination of linear and non-linear functions of the predictors 'x1', 'x2', 'x3', and 'c1', with added Gaussian noise. This makes it suitable for the bartGauss action, which models a normally distributed response.

data sample_data;
  call streaminit(123);
  do i = 1 to 1000;
    x1 = rand('UNIFORM');
    x2 = rand('UNIFORM') * 2 - 1;
    x3 = rand('NORMAL');
    if rand('UNIFORM') < 0.5 then c1 = 'A';
    else c1 = 'B';
    y = 10 * sin(3.14 * x1) + 20 * (x2 - 0.5)**2 + 10 * x3 + (ifc(c1='A', 5, 0)) + rand('NORMAL');
    output;
  end;
run;
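For readers who want to inspect the simulated signal outside SAS, the same data-generating process can be reproduced with numpy. The individual random draws will not match SAS's streaminit(123) stream, but the structure of the signal is identical.

```python
import numpy as np

# numpy re-creation of the SAS DATA step above (different random stream).
rng = np.random.default_rng(123)
n = 1000
x1 = rng.uniform(size=n)                 # rand('UNIFORM')
x2 = rng.uniform(size=n) * 2 - 1         # uniform on [-1, 1]
x3 = rng.normal(size=n)                  # rand('NORMAL')
c1 = np.where(rng.uniform(size=n) < 0.5, 'A', 'B')
y = (10 * np.sin(3.14 * x1)
     + 20 * (x2 - 0.5) ** 2
     + 10 * x3
     + np.where(c1 == 'A', 5, 0)         # ifc(c1='A', 5, 0)
     + rng.normal(size=n))               # Gaussian noise
print(y.shape)
```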

Examples

This example demonstrates a basic call to the bartGauss action. It loads the 'sample_data' table into a CAS session, then fits a BART model with 'y' as the target and 'x1', 'x2', 'x3', and 'c1' as predictors. This is the simplest way to run the action, relying on default settings for the number of trees, MCMC iterations, and other hyperparameters.

SAS® / CAS Code

PROC CAS;
   loadactionset 'bart';
   bart.bartGauss /
      table='sample_data',
      target='y',
      inputs={'x1', 'x2', 'x3', 'c1'},
      nominals={'c1'};
RUN;

This example shows a more advanced usage of the bartGauss action. It specifies the number of trees (nTree=100), burn-in iterations (nBI=500), and MCMC samples (nMC=2000). It also partitions the data, using 25% for testing (partByFrac). The fitted model is saved to a CAS table named 'bart_model_store' for later use. Additionally, an output table 'bart_predictions' is created to store predicted values and residuals for each observation.

SAS® / CAS Code

PROC CAS;
   loadactionset 'bart';
   bart.bartGauss /
      table={name='sample_data'},
      target='y',
      inputs={'x1', 'x2', 'x3', 'c1'},
      nominals={'c1'},
      nTree=100,
      nBI=500,
      nMC=2000,
      seed=456,
      partByFrac={test=0.25, seed=123},
      store={name='bart_model_store', replace=true},
      output={casOut={name='bart_predictions', replace=true},
              pred='predicted_y', resid='residual_y'},
      display={'FitStatistics', 'VarImp'};
RUN;
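The pred, lcl, and ucl columns requested through the output parameter are posterior summaries of the saved MCMC draws. The numpy sketch below shows how such summaries are typically formed (posterior mean plus equal-tail quantiles at level alpha); it is illustrative only, since the action computes these internally from its own sampled trees.

```python
import numpy as np

# Stand-in for saved posterior prediction draws: 200 samples x 5 observations.
rng = np.random.default_rng(0)
draws = rng.normal(loc=3.0, scale=0.5, size=(200, 5))

alpha = 0.05
pred = draws.mean(axis=0)                        # posterior mean prediction
lcl = np.quantile(draws, alpha / 2, axis=0)      # lower equal-tail credible limit
ucl = np.quantile(draws, 1 - alpha / 2, axis=0)  # upper equal-tail credible limit
print(pred.shape)
```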

FAQ

What is the purpose of the bart.bartGauss action?
Which parameter is used to specify the input data table?
How can I define the model's dependent and independent variables?
What does the 'nTree' parameter control?
How are the MCMC iterations managed in this action?
How can the trained model be saved for future scoring?
How does the action handle missing values in predictor variables?