The brTrain action extracts Boolean rules from a collection of documents. It is a key part of supervised learning for text categorization, creating a human-readable model that explains the classification logic. This action analyzes the relationship between terms present in documents and their assigned categories (targets) to generate a set of IF-THEN rules. These rules can then be used by the brScore action to classify new documents.
| Parameter | Description |
|---|---|
| table | Specifies the input data table that contains the document-term information for rule extraction. |
| docId | Specifies the variable in the input table that contains the document ID. |
| termId | Specifies the variable in the input table that contains the term ID. |
| docInfo | Specifies the table containing document metadata, including the target variables for classification. |
| termInfo | Specifies the table containing term metadata, such as the term's text (label). |
| casOuts | Specifies the output tables to be created, which can include the generated rules, the terms used in those rules, and the candidate terms considered. |
| gPositive | Specifies the minimum g-score for a positive term to be considered for rule extraction. Higher values lead to more selective term inclusion. |
| gNegative | Specifies the minimum g-score for a negative term to be considered. This helps in identifying terms that are indicative of a document NOT belonging to a category. |
| mPositive | Specifies the 'm' value for computing estimated precision for positive terms, used in statistical calculations to smooth probability estimates. |
| mNegative | Specifies the 'm' value for computing estimated precision for negative terms. |
| maxCandidates | Specifies the maximum number of term candidates to be selected for each category during the rule generation process. |
| maxTriesIn | Specifies the k-in value for the k-best search in the term ensemble process for creating individual rules. |
| maxTriesOut | Specifies the k-out value for the k-best search in the rule ensemble process for creating the final rule set. |
| minSupports | Specifies the minimum number of documents in which a term must appear to be considered for rule creation. |
| nThreads | Specifies the number of threads to use per node for the computation. |
| useOldNames | When set to TRUE, uses legacy variable names from the HPBOOLRULE procedure for the output tables. |

First, we create two tables. 'doc_term_data' contains the sparse representation of documents, linking document IDs to term IDs. 'doc_info_data' contains the category (target) for each document. This setup is typical for text mining tasks where document content and metadata are stored separately.

    /* Sparse document-term pairs: one row per (document, term) occurrence */
    data mycas.doc_term_data;
       infile datalines delimiter=',';
       input docid termid;
       datalines;
    1,1
    1,2
    2,2
    2,3
    3,3
    3,4
    4,1
    4,4
    ;
    run;

    /* Document metadata: one row per document with its target category */
    data mycas.doc_info_data;
       infile datalines delimiter=',';
       input docid category $;
       datalines;
    1,A
    2,A
    3,B
    4,B
    ;
    run;

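Before training, it can help to see the term-to-category relationship that rule extraction works from. The following is a minimal sketch, not part of the brTrain action itself. It assumes that `mycas` is a CAS libref that PROC SQL can read, and it simply counts how many documents of each category contain each term.

```sas
/* Count, for each category, the number of documents containing each term. */
/* Assumes the two tables created above are reachable through libref mycas. */
proc sql;
   select i.category,
          t.termid,
          count(*) as n_docs
   from mycas.doc_term_data as t
        inner join mycas.doc_info_data as i
        on t.docid = i.docid
   group by i.category, t.termid
   order by i.category, t.termid;
quit;
```

In the sample data, term 2 occurs only in category A documents and term 4 only in category B documents, so these are the terms one would expect to be most useful for separating the two categories.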
This example performs a basic training operation. It uses the document-term data from 'doc_term_data' and the document category information from 'doc_info_data'. The action will identify rules that predict the 'category' variable and store them in the 'rules_out' table.

    PROC CAS;
       boolRule.brTrain /
          table={name='doc_term_data'},
          docId='docid',
          termId='termid',
          docInfo={table={name='doc_info_data'}, id='docid', targets={'category'}},
          /* casOuts groups the output tables; rules holds the generated rules */
          casOuts={rules={name='rules_out', replace=true}};
    RUN;

This example demonstrates a more advanced use case. It specifies all three possible output tables: 'rules_out' for the final rules, 'rule_terms_out' for the terms within each rule, and 'candidate_terms_out' for all terms considered. It also sets the statistical parameters 'gPositive' and 'mPositive' explicitly; raising 'gPositive' makes term selection stricter, because a term needs a higher g-score to be included in a rule.

    PROC CAS;
       boolRule.brTrain /
          table={name='doc_term_data'},
          docId='docid',
          termId='termid',
          docInfo={table={name='doc_info_data'}, id='docid', targets={'category'}, targetType='MULTICLASS'},
          gPositive=10,   /* minimum g-score for positive terms */
          mPositive=1,    /* m value for estimated precision of positive terms */
          casOuts={rules={name='rules_out', replace=true},
                   candidateTerms={name='candidate_terms_out', replace=true},
                   ruleTerms={name='rule_terms_out', replace=true}};
    RUN;

    /* Inspect the generated rules and the terms they use */
    PROC PRINT DATA=mycas.rules_out;
    RUN;
    PROC PRINT DATA=mycas.rule_terms_out;
    RUN;
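
The overview notes that the rules produced by brTrain can be used by the brScore action to classify documents. The following is a rough sketch of what that step might look like, applied back to the training documents. The parameter names `ruleTerms` and `casOut` are assumptions about the brScore syntax and should be checked against the action's reference page; the output table name 'scored_out' is hypothetical.

```sas
/* Sketch only: score documents with the rule terms produced by brTrain.  */
/* ruleTerms= and casOut= are assumed parameter names for brScore;        */
/* 'scored_out' is a hypothetical output table name.                      */
PROC CAS;
   boolRule.brScore /
      table={name='doc_term_data'},
      docId='docid',
      termId='termid',
      ruleTerms={name='rule_terms_out'},
      casOut={name='scored_out', replace=true};
RUN;

/* Inspect the scoring output */
PROC PRINT DATA=mycas.scored_out;
RUN;
```

Scoring the training data first is a quick way to confirm that the extracted rules match the documents they were derived from before applying them to new documents.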