brTrain - WeAreCAS

Description

The brTrain action extracts Boolean rules from a collection of documents. It is a key part of supervised learning for text categorization, creating a human-readable model that explains the classification logic. This action analyzes the relationship between terms present in documents and their assigned categories (targets) to generate a set of IF-THEN rules. These rules can then be used by the brScore action to classify new documents.

proc cas;
   boolRule.brTrain /
      table={name='cas_table_in'},
      docId='_document_',
      termId='_termnum_',
      docInfo={table={name='doc_info_table'}, id='_document_', targets={'target_var'}},
      casOuts={rules={name='rules_out', replace=true}};
run;

Settings

Parameter	Description
table	Specifies the input data table that contains the document-term information for rule extraction.
docId	Specifies the variable in the input table that contains the document ID.
termId	Specifies the variable in the input table that contains the term ID.
docInfo	Specifies the table containing document metadata, including the target variables for classification.
termInfo	Specifies the table containing term metadata, such as the term's text (label).
casOuts	Specifies the output tables to be created, which can include the generated rules, the terms used in those rules, and the candidate terms considered.
gPositive	Specifies the minimum g-score for a positive term to be considered for rule extraction. Higher values lead to more selective term inclusion.
gNegative	Specifies the minimum g-score for a negative term to be considered. This helps in identifying terms that are indicative of a document NOT belonging to a category.
mPositive	Specifies the 'm' value for computing estimated precision for positive terms, used in statistical calculations to smooth probability estimates.
mNegative	Specifies the 'm' value for computing estimated precision for negative terms.
maxCandidates	Specifies the maximum number of term candidates to be selected for each category during the rule generation process.
maxTriesIn	Specifies the k-in value for the k-best search in the term ensemble process for creating individual rules.
maxTriesOut	Specifies the k-out value for the k-best search in the rule ensemble process for creating the final rule set.
minSupports	Specifies the minimum number of documents in which a term must appear to be considered for rule creation.
nThreads	Specifies the number of threads to use per node for the computation.
useOldNames	When set to TRUE, uses legacy variable names from the HPBOOLRULE procedure for the output tables.
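
The tuning parameters in the table can be combined in a single call. The following is a minimal sketch rather than a recommended configuration: the table and variable names are the placeholders from the syntax summary, every numeric value is a hypothetical choice made only for illustration, and the nesting of the rules table inside casOuts is an assumption that should be checked against the action's syntax reference.

```sas
/* Sketch only: all numeric values are hypothetical placeholders,     */
/* not defaults or recommendations.                                   */
proc cas;
   loadactionset 'boolRule';
   boolRule.brTrain /
      table={name='cas_table_in'},
      docId='_document_',
      termId='_termnum_',
      docInfo={table={name='doc_info_table'}, id='_document_', targets={'target_var'}},
      gPositive=8,        /* minimum g-score for positive terms        */
      gNegative=8,        /* minimum g-score for negative terms        */
      mPositive=8,        /* m value for positive-term precision       */
      mNegative=8,        /* m value for negative-term precision       */
      maxCandidates=200,  /* candidate terms kept per category         */
      maxTriesIn=100,     /* k-in for the term ensemble search         */
      maxTriesOut=30,     /* k-out for the rule ensemble search        */
      minSupports=2,      /* minimum documents a term must appear in   */
      nThreads=4,         /* threads per node                          */
      casOuts={rules={name='rules_out', replace=true}};  /* assumed nesting */
run;
quit;
```

Raising the g-score thresholds or minSupports should leave fewer terms eligible for rules, while the maxTries values bound how much searching is done for each rule and for the final rule set.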

Data Preparation

Data Creation: Document-Term and Document-Category Data

First, we create two tables. 'doc_term_data' contains the sparse representation of documents, linking document IDs to term IDs. 'doc_info_data' contains the category (target) for each document. This setup is typical for text mining tasks where document content and metadata are stored separately.

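/* Assumption: a CAS session is already active and the libref mycas points to
   its active caslib, for example after running: cas; libname mycas cas; */
data mycas.doc_term_data;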
   infile datalines delimiter=',';
   input docid termid;
   datalines;
1,1
1,2
2,2
2,3
3,3
3,4
4,1
4,4
;
run;

data mycas.doc_info_data;
   infile datalines delimiter=',';
   input docid category $;
   datalines;
1,A
2,A
3,B
4,B
;
run;
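
The parameter table also lists termInfo, which expects a table of term metadata such as each term's text. A minimal sketch of such a table for the four term IDs used above could look like this; the table name 'term_info_data', the variable name 'label', and the label values are all hypothetical and exist only for illustration.

```sas
/* Hypothetical term metadata: one row per term ID with a readable label. */
data mycas.term_info_data;
   infile datalines delimiter=',';
   length label $ 16;
   input termid label $;
   datalines;
1,alpha
2,beta
3,gamma
4,delta
;
run;

/* Quick check that every term ID in doc_term_data has a label. */
proc print data=mycas.term_info_data;
run;
```

A table like this could then be referenced through the termInfo parameter so that the generated rules can be read in terms of labels rather than numeric IDs. The exact sub-parameter names for termInfo are not shown in this document and should be taken from the action's syntax reference.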

Examples

Basic Rule Training

This example performs a basic training operation. It uses the document-term data from 'doc_term_data' and the document category information from 'doc_info_data'. The action identifies rules that predict the 'category' variable and stores them in the 'rules_out' table.

proc cas;
   loadactionset 'boolRule';   /* make the boolRule action set available */
   boolRule.brTrain /
      table={name='doc_term_data'},   /* document-term pairs */
      docId='docid',
      termId='termid',
      docInfo={table={name='doc_info_data'}, id='docid', targets={'category'}},
      casOuts={rules={name='rules_out', replace=true}};   /* extracted rules */
run;
quit;

Detailed Rule Training with Multiple Outputs and Tuned Parameters

This example demonstrates a more advanced use case. It specifies all three possible output tables: 'rules_out' for the final rules, 'rule_terms_out' for the terms within each rule, and 'candidate_terms_out' for all terms considered. It also adjusts the statistical parameters 'gPositive' and 'mPositive' to be more selective, requiring a higher statistical significance for terms to be included in a rule.

proc cas;
   loadactionset 'boolRule';   /* make the boolRule action set available */
   boolRule.brTrain /
      table={name='doc_term_data'},
      docId='docid',
      termId='termid',
      docInfo={table={name='doc_info_data'}, id='docid', targets={'category'}, targetType='MULTICLASS'},
      gPositive=10,   /* minimum g-score for positive terms */
      mPositive=1,    /* m value for positive-term estimated precision */
      casOuts={rules={name='rules_out', replace=true},
               ruleTerms={name='rule_terms_out', replace=true},
               candidateTerms={name='candidate_terms_out', replace=true}};
run;
quit;

/* Inspect the extracted rules and the terms they use. */
proc print data=mycas.rules_out;
run;
proc print data=mycas.rule_terms_out;
run;
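
The example prints two of the three output tables. As a sketch, the candidate terms table could be inspected the same way, either through the mycas libref or from within PROC CAS with the table.fetch action. The row limit of 10 is an arbitrary choice, and the code assumes the training step above has already created 'candidate_terms_out'.

```sas
/* Print the candidate terms table through the mycas libref. */
proc print data=mycas.candidate_terms_out;
run;

/* Alternative: fetch only the first rows inside PROC CAS. */
proc cas;
   table.fetch / table={name='candidate_terms_out'}, to=10;
run;
quit;
```

With only four documents in the sample data, the output tables may be very small or empty, so this check is mainly useful on realistic collections.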

