brTrain - WeAreCAS
boolRule

brTrain

Description

The brTrain action extracts Boolean rules from a collection of documents. It is a key part of supervised learning for text categorization, creating a human-readable model that explains the classification logic. This action analyzes the relationship between terms present in documents and their assigned categories (targets) to generate a set of IF-THEN rules. These rules can then be used by the brScore action to classify new documents.

proc cas;
   boolRule.brTrain /
      table={name='cas_table_in'}
      docId='_document_'
      termId='_termnum_'
      docInfo={table={name='doc_info_table'}, id='_document_', targets={'target_var'}}
      casOut={name='rules_out', replace=true};
run;

Settings

Parameter: Description
table: Specifies the input data table that contains the document-term information for rule extraction.
docId: Specifies the variable in the input table that contains the document ID.
termId: Specifies the variable in the input table that contains the term ID.
docInfo: Specifies the table containing document metadata, including the target variables for classification.
termInfo: Specifies the table containing term metadata, such as the term's text (label).
casOuts: Specifies the output tables to be created, which can include the generated rules, the terms used in those rules, and the candidate terms considered.
gPositive: Specifies the minimum g-score for a positive term to be considered for rule extraction. Higher values lead to more selective term inclusion.
gNegative: Specifies the minimum g-score for a negative term to be considered. This helps identify terms that indicate a document does NOT belong to a category.
mPositive: Specifies the 'm' value for computing estimated precision for positive terms, used in statistical calculations to smooth probability estimates.
mNegative: Specifies the 'm' value for computing estimated precision for negative terms.
maxCandidates: Specifies the maximum number of term candidates to be selected for each category during the rule generation process.
maxTriesIn: Specifies the k-in value for the k-best search in the term ensemble process for creating individual rules.
maxTriesOut: Specifies the k-out value for the k-best search in the rule ensemble process for creating the final rule set.
minSupports: Specifies the minimum number of documents in which a term must appear to be considered for rule creation.
nThreads: Specifies the number of threads to use per node for the computation.
useOldNames: When set to TRUE, uses legacy variable names from the HPBOOLRULE procedure for the output tables.

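The tuning parameters listed above can be combined in a single call. The following is a minimal sketch, not a recommended configuration: the table and variable names are the placeholders from the syntax summary, and every numeric value is an arbitrary illustration chosen for this example, not a documented default.

```sas
proc cas;
   boolRule.brTrain /
      table={name='cas_table_in'},
      docId='_document_',
      termId='_termnum_',
      docInfo={table={name='doc_info_table'}, id='_document_', targets={'target_var'}},
      /* Term-selection thresholds (illustrative values only) */
      gPositive=8,
      gNegative=8,
      /* Smoothing values for estimated precision (illustrative values only) */
      mPositive=2,
      mNegative=2,
      /* Search-size limits (illustrative values only) */
      maxCandidates=100,
      maxTriesIn=50,
      maxTriesOut=20,
      /* A term must appear in at least this many documents */
      minSupports=3,
      nThreads=4,
      casOut={name='rules_out', replace=true};
run;
quit;
```

Raising gPositive, gNegative, or minSupports narrows the set of terms that can enter a rule, while the maxCandidates, maxTriesIn, and maxTriesOut limits bound how much of the search space is explored.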
Data Preparation
Data Creation: Document-Term and Document-Category Data

First, we create two tables. 'doc_term_data' contains the sparse representation of documents, linking document IDs to term IDs. 'doc_info_data' contains the category (target) for each document. This setup is typical for text mining tasks where document content and metadata are stored separately.
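The DATA steps that follow write to a libref named mycas, which the page does not define. A minimal setup sketch, assuming a reachable CAS server, a placeholder session name (mysess), and the CASUSER caslib as the target, might look like this:

```sas
/* Start a CAS session; the session name mysess is a placeholder. */
cas mysess;

/* Bind the libref mycas to a caslib so DATA steps can create CAS tables.
   CASUSER is an assumption; substitute the caslib you actually use. */
libname mycas cas caslib=casuser;

/* Load the boolRule action set so that brTrain can be called. */
proc cas;
   loadactionset 'boolRule';
quit;
```

If the action calls later refer to tables by name only, the caslib bound here should also be the active caslib of the session.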

data mycas.doc_term_data;
   infile datalines delimiter=',';
   input docid termid;
   datalines;
1,1
1,2
2,2
2,3
3,3
3,4
4,1
4,4
;
run;

data mycas.doc_info_data;
   infile datalines delimiter=',';
   input docid category $;
   datalines;
1,A
2,A
3,B
4,B
;
run;
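The termInfo parameter described in the settings expects a table of term metadata. The sketch below builds such a table for the four term IDs used above and passes it to brTrain. The term strings are invented for illustration, and the id= and label= subparameter names are assumptions that should be checked against the action's reference documentation.

```sas
/* Hypothetical term labels for the term IDs 1 through 4. */
data mycas.term_info_data;
   infile datalines delimiter=',';
   length term $ 16;
   input termid term $;
   datalines;
1,alpha
2,beta
3,gamma
4,delta
;
run;

proc cas;
   boolRule.brTrain /
      table={name='doc_term_data'},
      docId='docid',
      termId='termid',
      docInfo={table={name='doc_info_data'}, id='docid', targets={'category'}},
      /* Assumed subparameters: id= names the term ID column, label= the term text. */
      termInfo={table={name='term_info_data'}, id='termid', label='term'},
      casOut={name='rules_out', replace=true};
run;
quit;
```

Supplying term labels should make the generated rules easier to read, since rules can then refer to term text rather than numeric IDs.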

Examples

This example performs a basic training operation. It uses the document-term data from 'doc_term_data' and the document category information from 'doc_info_data'. The action will identify rules that predict the 'category' variable and store them in the 'rules_out' table.

SAS® / CAS Code
PROC CAS;
   /* Train Boolean rules that predict 'category' from the document-term pairs */
   boolRule.brTrain /
      TABLE={name='doc_term_data'},
      docId='docid',
      termId='termid',
      docInfo={TABLE={name='doc_info_data'}, id='docid', targets={'category'}},
      casOut={name='rules_out', replace=true};
RUN;
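To confirm that the training step produced a rules table, the output can be inspected with general-purpose table actions. This is a small sketch that assumes 'rules_out' was created in the active caslib by the call above; the row limit of 20 is arbitrary.

```sas
proc cas;
   /* List the columns of the rules table produced by brTrain. */
   table.columnInfo / table={name='rules_out'};

   /* Fetch the first rows of the table; to=20 is an arbitrary limit. */
   table.fetch / table={name='rules_out'}, to=20;
run;
quit;
```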

This example demonstrates a more advanced use case. It specifies all three possible output tables: 'rules_out' for the final rules, 'rule_terms_out' for the terms within each rule, and 'candidate_terms_out' for all terms considered. It also sets targetType to 'MULTICLASS' in docInfo and adjusts the statistical parameters 'gPositive' and 'mPositive' to be more selective, requiring a higher statistical significance for terms to be included in a rule. The final PROC PRINT steps display the rules and rule-terms tables.

SAS® / CAS Code
PROC CAS;
   boolRule.brTrain /
      TABLE={name='doc_term_data'},
      docId='docid',
      termId='termid',
      docInfo={TABLE={name='doc_info_data'}, id='docid', targets={'category'}, targetType='MULTICLASS'},
      /* Stricter term-selection settings */
      gPositive=10,
      mPositive=1,
      /* Request all three output tables */
      casOut={name='rules_out', replace=true,
              candidateTerms={name='candidate_terms_out', replace=true},
              ruleTerms={name='rule_terms_out', replace=true}};
RUN;

/* Display the generated rules and the terms used in each rule */
PROC PRINT DATA=mycas.rules_out;
RUN;
PROC PRINT DATA=mycas.rule_terms_out;
RUN;
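The example above prints two of the three requested output tables. A short sketch for checking the third one, assuming the mycas libref points at the caslib where the tables were written:

```sas
/* List the tables in the active caslib to confirm that all outputs exist. */
proc cas;
   table.tableInfo;
quit;

/* Print the candidate terms that were considered during rule generation. */
proc print data=mycas.candidate_terms_out;
run;
```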

FAQ

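What is the purpose of the brTrain action in SAS Viya?
It extracts Boolean rules from a collection of documents, producing a human-readable set of IF-THEN rules that the brScore action can use to classify new documents.

What does the `docId` parameter specify?
The variable in the input table that contains the document ID.

What is the `docInfo` parameter used for?
It identifies the table of document metadata, including the target variables used for classification.

How do `gPositive` and `gNegative` parameters influence rule extraction?
They set the minimum g-score that a positive or negative term must reach to be considered. Higher values make term selection more selective.

What is the function of the `maxCandidates` parameter?
It caps the number of term candidates selected for each category during rule generation.

What are the `maxTriesIn` and `maxTriesOut` parameters?
They are the k-in and k-out values for the k-best search: `maxTriesIn` applies to the term ensemble process that builds individual rules, and `maxTriesOut` applies to the rule ensemble process that builds the final rule set.

What does the `minSupports` parameter define?
The minimum number of documents in which a term must appear to be considered for rule creation.

What are the `mPositive` and `mNegative` parameters used for?
They supply the 'm' value used to compute estimated precision for positive and negative terms, which smooths the probability estimates.

What information does the `termInfo` parameter require?
A table of term metadata, such as the term's text (label).

What is the purpose of the `casOuts` parameter?
It specifies the output tables to create: the generated rules, the terms used in those rules, and the candidate terms considered.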