The analyzeMissingPatterns action analyzes patterns of missing values in a table. It is part of the Data Science Pilot (dataSciencePilot) action set, which automates common steps in data science workflows. The action is most useful during exploratory data analysis, where the extent and structure of missing data affect every later modeling step. It identifies the distinct patterns of missingness across variables and, when a target is supplied, relates those patterns to the target variable, which helps in choosing an imputation strategy.
| Parameter | Description |
|---|---|
| table | Specifies the input table for the analysis. This table should contain the data for which you want to analyze missing value patterns. |
| casOut | Specifies the output table to store the results of the missing pattern analysis. This is a required parameter. |
| inputs | Specifies the list of variables to be included in the analysis. If not specified, all numeric and character variables from the input table are used. |
| nominals | Specifies which of the input variables should be treated as nominal (categorical). This affects how statistics are calculated for these variables. |
| target | Specifies a target variable. When a target is provided, the action analyzes the relationship between missing value patterns and the target variable's distribution. |
| freq | Specifies a frequency variable. Each observation in the input table is treated as if it appears n times, where n is the value of the frequency variable for that observation. |
| distinctCountLimit | Sets a limit on the number of distinct values for frequency counting. If this limit is exceeded, the action may switch to an estimation algorithm (Misra-Gries) or abort. |
| ecdfTolerance | Specifies the tolerance for the empirical cumulative distribution function (ECDF), used by the quantile sketch algorithm for robust statistics. |
| misraGries | When set to TRUE, enables the use of the Misra-Gries algorithm for frequency estimation if the distinct count limit is surpassed. |
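The distinctCountLimit and misraGries descriptions both refer to the Misra-Gries frequency-estimation algorithm. As a rough illustration only, the DATA step below is a textbook version of that algorithm with two counters, run on a made-up stream of values. The table names (work.stream, work.mg_summary), the values, and the number of counters are assumptions chosen for the sketch; nothing here is taken from the action's internal implementation.

```sas
/* Hypothetical toy stream of category values (not from the documentation). */
data work.stream;
    input item $ @@;
    datalines;
a b a c a b d a e a
;
run;

/* Textbook Misra-Gries summary that tracks at most two distinct values. */
data work.mg_summary(keep=item est_count);
    array keys[2] $8 _temporary_;        /* tracked values (retained across rows) */
    array cnt[2] _temporary_ (0 0);      /* their counters; 0 means the slot is free */
    set work.stream end=last;

    /* Step 1: is the incoming value already tracked? */
    slot = 0;
    do i = 1 to dim(cnt);
        if cnt[i] > 0 and keys[i] = item then slot = i;
    end;

    if slot > 0 then cnt[slot] = cnt[slot] + 1;
    else do;
        /* Step 2: otherwise take the first free slot, if any. */
        do i = 1 to dim(cnt);
            if cnt[i] = 0 and slot = 0 then slot = i;
        end;
        if slot > 0 then do;
            keys[slot] = item;
            cnt[slot] = 1;
        end;
        /* Step 3: no free slot, so decrement every counter. */
        else do i = 1 to dim(cnt);
            cnt[i] = cnt[i] - 1;
        end;
    end;

    /* After the last row, write out the surviving values and their estimates. */
    if last then do i = 1 to dim(cnt);
        if cnt[i] > 0 then do;
            item = keys[i];
            est_count = cnt[i];
            output;
        end;
    end;
run;

proc print data=work.mg_summary noobs;
run;
```

On this stream the summary keeps a with an estimated count of 3 (its true count is 5) and e with an estimate of 1. Misra-Gries estimates never exceed the true count, which is the trade-off for using a fixed amount of memory when the number of distinct values is large.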
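This SAS code creates a sample dataset named 'sample_data_missing' in the 'casuser' caslib. The dataset contains several variables with intentionally placed missing values (.) to demonstrate how the analyzeMissingPatterns action works. The code assumes an active CAS session with a libref named casuser assigned to that caslib. Note that var3 is the character variable, so the $ modifier follows var3 in the INPUT statement.

    data casuser.sample_data_missing;
        /* var3 holds the letters A-E, so it is read as character */
        input var1 var2 var3 $ var4 target;
        cards;
    1 10 A 100 1
    2 . B 200 0
    3 30 . 300 1
    4 40 C . 0
    5 . D 500 1
    6 60 . 600 0
    7 70 E . 1
    ;
    run;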
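To make the idea of a missing-value pattern concrete, the sketch below tabulates the patterns by hand in plain SAS. It rebuilds the same seven rows in a WORK table so that it runs without a CAS session; the table name work.miss_demo and the variable pattern are hypothetical names introduced for illustration. This is a manual cross-check, not a reproduction of the action's output format.

```sas
/* Local copy of the seven sample rows (hypothetical table name). */
data work.miss_demo;
    input var1 var2 var3 $ var4 target;
    length pattern $4;
    /* One digit per variable, in the order var1-var4: 1 = missing, 0 = present. */
    pattern = cats(missing(var1), missing(var2), missing(var3), missing(var4));
    datalines;
1 10 A 100 1
2 . B 200 0
3 30 . 300 1
4 40 C . 0
5 . D 500 1
6 60 . 600 0
7 70 E . 1
;
run;

/* Count how many rows share each missingness pattern. */
proc freq data=work.miss_demo;
    tables pattern / nocum;
run;
```

For these rows, pattern 0000 (nothing missing) occurs once, and 0100, 0010, and 0001 (only var2, only var3, or only var4 missing) occur twice each.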
This example performs a fundamental missing pattern analysis on the 'sample_data_missing' table. The results, which include tables detailing missing counts and patterns, are saved to a CAS table named 'missing_patterns_summary'.

    PROC CAS;
        dataSciencePilot.analyzeMissingPatterns /
            table={name='sample_data_missing'},
            casOut={name='missing_patterns_summary', replace=true};
    RUN;
    QUIT;
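Once the action has run, the output table can be inspected with general-purpose actions from the table action set. The sketch below assumes the same CAS session is still active and that 'missing_patterns_summary' was written to the active caslib. It makes no assumption about the column names the action produces, which is why it lists them first.

```sas
proc cas;
    /* List the columns of the output table; their names are not assumed here. */
    table.columnInfo / table={name='missing_patterns_summary'};

    /* Print up to 20 rows of the result. */
    table.fetch / table={name='missing_patterns_summary'}, to=20;
run;
quit;
```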
This detailed example analyzes missing value patterns in relation to a specific target variable ('target'). It explicitly declares 'var3' as a nominal variable and restricts the analysis to a specified list of input variables. This approach helps show whether missingness is related to the outcome variable.

    PROC CAS;
        dataSciencePilot.analyzeMissingPatterns /
            table={name='sample_data_missing'},
            inputs={{name='var1'}, {name='var2'}, {name='var3'}, {name='var4'}},
            nominals={'var3'},
            target='target',
            casOut={name='missing_patterns_details', replace=true};
    RUN;
    QUIT;
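As a hand-rolled point of comparison, the relationship between missingness and the target can also be checked with missing-value indicators and PROC FREQ. The sketch below is plain SAS on a local copy of the sample rows; the table name work.miss_target and the miss_ indicator variables are hypothetical names, and the cross-tabulation is a simple stand-in for illustration, not the statistic the action computes.

```sas
/* Local copy of the sample rows with one missing-value indicator per input. */
data work.miss_target;
    input var1 var2 var3 $ var4 target;
    /* 1 when the variable is missing on this row, 0 otherwise */
    miss_var2 = missing(var2);
    miss_var3 = missing(var3);
    miss_var4 = missing(var4);
    datalines;
1 10 A 100 1
2 . B 200 0
3 30 . 300 1
4 40 C . 0
5 . D 500 1
6 60 . 600 0
7 70 E . 1
;
run;

/* Cross-tabulate each indicator against the target. */
proc freq data=work.miss_target;
    tables (miss_var2 miss_var3 miss_var4)*target / norow nocol nopercent;
run;
```

In this toy data, each indicator's missing group contains one row with target 0 and one with target 1, so the sample shows no obvious link between missingness and the target. With only seven rows, that is an illustration of the mechanics rather than evidence of anything.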