analyzeMissingPatterns - WeAreCAS
dataSciencePilot

analyzeMissingPatterns

Description

The analyzeMissingPatterns action performs a missing pattern analysis. It is a part of the Data Science Pilot action set, designed to automate and enhance data science workflows. This action is particularly useful in the exploratory data analysis phase to understand the extent and nature of missing data, which is crucial for subsequent modeling steps. It can identify different patterns of missingness across variables and analyze their relationship with a target variable, helping to decide on an appropriate imputation strategy.

dataSciencePilot.analyzeMissingPatterns / table={...} casOut={...} <inputs={{...}, ...}> <nominals={"variable-name-1", ...}> <target="variable-name"> <freq="variable-name"> <distinctCountLimit=integer> <ecdfTolerance=double> <misraGries=TRUE | FALSE>;
Settings
Parameter: Description
table: Specifies the input table for the analysis. This table should contain the data for which you want to analyze missing value patterns. This is a required parameter.
casOut: Specifies the output table to store the results of the missing pattern analysis. This is a required parameter.
inputs: Specifies the list of variables to be included in the analysis. If not specified, all numeric and character variables from the input table are used.
nominals: Specifies which of the input variables should be treated as nominal (categorical). This affects how statistics are calculated for these variables.
target: Specifies a target variable. When a target is provided, the action analyzes the relationship between missing value patterns and the target variable's distribution.
freq: Specifies a frequency variable. Each observation in the input table is treated as if it appears n times, where n is the value of the frequency variable for that observation.
distinctCountLimit: Sets a limit on the number of distinct values for frequency counting. If this limit is exceeded, the action may switch to an estimation algorithm (Misra-Gries, see misraGries) or abort the count.
ecdfTolerance: Specifies the tolerance for the empirical cumulative distribution function (ECDF), used by the quantile sketch algorithm for robust statistics.
misraGries: When set to TRUE, enables the use of the Misra-Gries algorithm for frequency estimation if the distinct count limit is surpassed.
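
As a rough illustration of how the optional settings combine, the call below passes a frequency variable together with the distinct-count controls. It is only a sketch: the table name my_table, the frequency column n_obs, the output name my_table_missing, and the limit of 10000 are hypothetical placeholders, not recommended defaults. Only the parameter names come from the syntax shown above.

```sas
proc cas;
  /* Load the action set in case it is not yet available in the session */
  loadactionset "dataSciencePilot";

  /* Hypothetical names: my_table, n_obs, my_table_missing.       */
  /* distinctCountLimit=10000 is an illustrative value only.      */
  dataSciencePilot.analyzeMissingPatterns /
    table={name="my_table"},
    freq="n_obs",
    distinctCountLimit=10000,
    misraGries=true,
    casOut={name="my_table_missing", replace=true};
run;
quit;
```

With misraGries=true, a variable whose distinct count exceeds the limit would be handled by frequency estimation, as described in the settings above.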
Data Preparation
Creating a Sample Dataset with Missing Values

This SAS code creates a sample dataset named 'sample_data_missing' in the 'casuser' caslib. The dataset includes several variables with intentionally placed missing values (.) to demonstrate how the analyzeMissingPatterns action works.

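/* Assumes an active CAS session and a CAS engine libref named casuser */
data casuser.sample_data_missing;
  /* var3 is the character column (A, B, ...), so the $ belongs after var3 */
  input var1 var2 var3 $ var4 target;
  cards;
1 10 A 100 1
2 . B 200 0
3 30 . 300 1
4 40 C . 0
5 . D 500 1
6 60 . 600 0
7 70 E . 1
;
run;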
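
Before running the analysis, it can help to confirm that the missing values landed where expected. One possible check, assuming the table was created in the 'casuser' caslib as described above, is the simple.distinct action, which reports the number of distinct and missing values per column:

```sas
proc cas;
  /* Reports distinct and missing value counts for every column */
  simple.distinct /
    table={name="sample_data_missing", caslib="casuser"};
run;
quit;
```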

Examples

This example performs a fundamental missing pattern analysis on the 'sample_data_missing' table. The results, which include tables detailing missing counts and patterns, are saved to a CAS table named 'missing_patterns_summary'.

SAS® / CAS Code
PROC CAS;
  dataSciencePilot.analyzeMissingPatterns /
    TABLE={name='sample_data_missing', caslib='casuser'},
    casOut={name='missing_patterns_summary', replace=true};
RUN;
QUIT;
Result:
The action generates several result tables. The 'MissingPatterns' table outlines the different combinations of missing values found. The 'MissingCounts' table provides the count and percentage of missing values for each variable. 'NumVarInfo' and 'CharVarInfo' provide descriptive statistics for numeric and character variables, respectively.
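
To look at what was actually written to the casOut table, a minimal sketch is to list its columns and fetch the first rows. This assumes the output table 'missing_patterns_summary' from the example above sits in the active caslib; no particular column layout is assumed here.

```sas
proc cas;
  /* List the columns of the output table */
  table.columnInfo / table={name="missing_patterns_summary"};

  /* Display the first 20 rows */
  table.fetch / table={name="missing_patterns_summary"}, to=20;
run;
quit;
```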

This detailed example analyzes missing value patterns in relation to a specific target variable ('target'). It explicitly defines 'var3' as a nominal variable and focuses the analysis on a specified list of input variables. This approach helps in understanding if the missingness of data is correlated with the outcome variable.

SAS® / CAS Code
PROC CAS;
  dataSciencePilot.analyzeMissingPatterns /
    TABLE={name='sample_data_missing', caslib='casuser'},
    inputs={{name='var1'}, {name='var2'}, {name='var3'}, {name='var4'}},
    nominals={'var3'},
    target='target',
    casOut={name='missing_patterns_details', replace=true};
RUN;
QUIT;
Result:
The output includes several analytical tables. Notably, the 'TargetCounts' table shows the frequency distribution of the target variable for each identified missing value pattern. The 'TargetMeans' table provides the mean of the target variable for each pattern. These results help assess whether missingness is associated with the target. Such an association indicates that the data are not Missing Completely At Random (MCAR); observed data alone cannot confirm that the mechanism is Missing Not At Random (MNAR).
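
The idea of a target mean per missing pattern can be cross-checked by hand on the small sample table. The sketch below is not the action's own implementation: it builds a 0/1 flag string per row from var1 through var4 and summarizes the target within each string. The data set name work.pattern_flags and the variable name pattern are hypothetical, and a CAS engine libref named casuser is assumed.

```sas
/* Build a missingness flag string per row: a 1 in position k means */
/* the k-th variable (var1..var4) is missing in that row.           */
data work.pattern_flags;
  set casuser.sample_data_missing;
  length pattern $4;
  /* MISSING() works for both numeric and character variables */
  pattern = cats(missing(var1), missing(var2), missing(var3), missing(var4));
run;

/* Count and mean of the target within each missing pattern */
proc means data=work.pattern_flags n mean;
  class pattern;
  var target;
run;
```

With only seven rows, each pattern contains very few observations, so the means are useful for understanding the mechanics rather than for drawing conclusions.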

FAQ

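What is the purpose of the analyzeMissingPatterns action?
It performs a missing pattern analysis: it identifies patterns of missingness across variables and, when a target is given, relates them to the target variable, which helps in choosing an imputation strategy.
What are the required parameters for the analyzeMissingPatterns action?
The table and casOut parameters are required. All other parameters shown in the syntax are optional.
How can I specify which variables to analyze for missing patterns?
Use the inputs parameter to list the variables. If it is omitted, all numeric and character variables in the input table are used.
What is the role of the 'target' parameter?
It names a target variable so that the action analyzes the relationship between missing value patterns and the target's distribution.
How does the action handle variables with a high number of unique values?
The distinctCountLimit parameter caps the number of distinct values for frequency counting. When the limit is surpassed and misraGries is TRUE, the Misra-Gries algorithm is used for frequency estimation.
Can I incorporate observation frequencies into the analysis?
Yes. The freq parameter names a frequency variable, and each observation is treated as if it appears n times, where n is that variable's value.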