The applyCategory action categorizes text documents based on a pre-built category model (in MCO file format). It processes an input table containing the text data and applies the categorization rules, generating output tables with the results. This action is essential for automated text classification tasks, allowing users to assign documents to predefined categories.
| Parameter | Description |
|---|---|
| casOut | Specifies the output CAS table to store the categorization results for each document. |
| docId | Specifies the name of the variable in the input table that contains the unique document identifier. |
| docType | Specifies the type of the input documents. Can be 'TEXT' for plain text or 'XML' for XML documents. |
| groupedMatchOut | Specifies an output table that groups all matched terms for each category within a single row per document. |
| matchDelimiter | Specifies the character or string used to separate matched terms in the 'groupedMatchOut' table. |
| matchOut | Specifies the output table to store detailed information about each individual term match that contributes to a category assignment. |
| model | Specifies the input CAS table that contains the compiled category model (MCO) to be used for scoring. |
| modelOut | Specifies an output table to save the model information used for scoring. |
| scoringAlgorithm | Specifies the algorithm used for scoring. 'FREQUENCY' counts the number of times a category's rules are met, while 'WEIGHTED' considers the weights assigned to the rules. |
| table | Specifies the input CAS table containing the text documents to be categorized. |
| text | Specifies the name of the variable in the input table that contains the text content of the documents. |
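This SAS code creates two tables: 'reviews', which contains the text data to be categorized, and 'category_model_table', which stands in for the pre-compiled categorization model. The 'reviews' table has a unique document ID ('docId') and the review text ('text'). The model table is only a placeholder: it does not contain a real MCO model, so it must be replaced with a table that holds an actual compiled model before the scoring examples can produce results.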
```sas
/* Assumes an active CAS session, a caslib named mycas that is the
   session's active caslib, and a CAS engine libref mycas bound to it,
   for example: libname mycas cas caslib=mycas; */
data mycas.reviews;
    length docId $ 8 text $ 200;
    infile datalines delimiter='|' truncover;
    input docId $ text $;
    datalines;
1|This is a great product, I love it!
2|The service was terrible and the food was cold.
3|I am not sure how I feel about this.
;
run;

/* Placeholder only: a real model table holds a compiled MCO model. */
data mycas.category_model_table;
    _mco_ = 12345;
run;
```
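Before scoring, it can help to confirm that both input tables are actually loaded. The following is a minimal sketch, assuming the two tables were created in the session's active caslib; `table.tableInfo` and `table.fetch` are general-purpose CAS table actions and are not specific to applyCategory.

```sas
proc cas;
    /* Report metadata (row and column counts) for each input table. */
    table.tableInfo / name='reviews';
    table.tableInfo / name='category_model_table';

    /* Print the review rows to check that the text was read intact. */
    table.fetch / table={name='reviews'}, to=5;
quit;
```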
This example demonstrates a basic use of the applyCategory action. It takes the 'reviews' table and the 'category_model_table' as input, categorizes the text in the 'text' column, and stores the main results in the 'reviews_categorized' table.
```sas
PROC CAS;
    textRuleScore.applyCategory /
        TABLE={name='reviews'},
        docId='docId',
        text='text',
        model={name='category_model_table'},
        casOut={name='reviews_categorized', replace=true};
RUN;
QUIT;
```
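Once the action completes, the scored table can be inspected like any other CAS table. The sketch below assumes the call above succeeded and that 'reviews_categorized' was written to the active caslib. It makes no assumption about the output column names, so it lists the columns before printing rows.

```sas
proc cas;
    /* List the columns the action wrote; their names are not assumed here. */
    table.columnInfo / table={name='reviews_categorized'};

    /* Print up to 10 scored rows. */
    table.fetch / table={name='reviews_categorized'}, to=10;
quit;
```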
This example shows a more advanced use of applyCategory. It uses the 'WEIGHTED' scoring algorithm and generates three distinct output tables: 'categorized_docs' for the main category scores, 'category_matches' for detailed term-level matches, and 'category_grouped_matches' which aggregates matches by category for each document, using a semicolon as a delimiter.
```sas
PROC CAS;
    textRuleScore.applyCategory /
        TABLE={name='reviews', caslib='mycas'},
        docId='docId',
        text='text',
        model={name='category_model_table'},
        casOut={name='categorized_docs', replace=true},
        matchOut={name='category_matches', replace=true},
        groupedMatchOut={name='category_grouped_matches', replace=true},
        matchDelimiter=';',
        scoringAlgorithm='WEIGHTED';
RUN;
QUIT;
```
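With three output tables, a short CASL loop is a convenient way to compare them side by side. This is a sketch only: it assumes the call above succeeded and that all three tables were written to the active caslib. The variable names `outTables` and `t` are illustrative.

```sas
proc cas;
    /* Output tables named in the applyCategory call above. */
    outTables = {'categorized_docs', 'category_matches', 'category_grouped_matches'};

    do t over outTables;
        /* Row and column counts for the current table. */
        table.tableInfo / name=t;
        /* First few rows of the current table. */
        table.fetch / table={name=t}, to=5;
    end;
quit;
```

Comparing 'category_matches' with 'category_grouped_matches' should show the same matches in two layouts: one row per match in the former, and delimiter-joined matches per document and category in the latter.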