textRuleScore

applyCategory

Description

The applyCategory action categorizes text documents based on a pre-built category model (in MCO file format). It processes an input table containing the text data and applies the categorization rules, generating output tables with the results. This action is essential for automated text classification tasks, allowing users to assign documents to predefined categories.

textRuleScore.applyCategory { casOut={...}, docId="string", docType="TEXT"|"XML", groupedMatchOut={...}, matchDelimiter="string", matchOut={...}, model={...}, modelOut={...}, scoringAlgorithm="FREQUENCY"|"WEIGHTED", table={...}, text="string" };
Settings
Parameter: Description
casOut: Specifies the output CAS table to store the categorization results for each document.
docId: Specifies the name of the variable in the input table that contains the unique document identifier.
docType: Specifies the type of the input documents. Can be 'TEXT' for plain text or 'XML' for XML documents.
groupedMatchOut: Specifies an output table that groups all matched terms for each category within a single row per document.
matchDelimiter: Specifies the character or string used to separate matched terms in the 'groupedMatchOut' table.
matchOut: Specifies the output table to store detailed information about each individual term match that contributes to a category assignment.
model: Specifies the input CAS table that contains the compiled category model (MCO) to be used for scoring.
modelOut: Specifies an output table to save the model information used for scoring.
scoringAlgorithm: Specifies the algorithm used for scoring. 'FREQUENCY' counts the number of times a category's rules are met, while 'WEIGHTED' considers the weights assigned to the rules.
table: Specifies the input CAS table containing the text documents to be categorized.
text: Specifies the name of the variable in the input table that contains the text content of the documents.
Data Preparation
Data Creation for Categorization

This SAS code creates two tables: 'reviews' which contains the text data to be categorized, and 'category_model_table' which holds the pre-compiled categorization model. The 'reviews' table has a unique ID ('docId') and the review text ('text'). The model table is a placeholder for a real MCO model.

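/* Assumes an active CAS session and a CAS engine libref named mycas */
/* (for example, LIBNAME mycas CAS;) that points to the caslib used in the examples. */
/* A client-side DATA step is used because DATALINES is not supported in a DATA step that runs in CAS. */
DATA mycas.reviews;
 length text $200;
 infile datalines delimiter="|";
 input docId $ text $;
 datalines;
1|This is a great product, I love it!
2|The service was terrible and the food was cold.
3|I am not sure how I feel about this.
;
run;

/* Placeholder only: a real model table must hold a compiled MCO model, not this dummy value. */
DATA mycas.category_model_table;
 _mco_ = 12345;
run;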
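
Before scoring, it can help to confirm that both input tables are actually loaded. The following is a minimal sketch, assuming the tables live in the active caslib of the current CAS session; it uses the general-purpose table.tableExists and table.fetch actions, which are not part of the textRuleScore action set.

```sas
proc cas;
   /* tableExists returns 0 in r.exists when the table is not loaded */
   table.tableExists result=r / name='reviews';
   print "reviews exists: " r.exists;

   table.tableExists result=r / name='category_model_table';
   print "category_model_table exists: " r.exists;

   /* Preview the documents that will be categorized */
   table.fetch / table={name='reviews'}, to=5;
run;
quit;
```

A nonzero value for both checks means the tables are available to applyCategory. It does not validate that the model table holds a usable MCO model.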

Examples

This example demonstrates a basic use of the applyCategory action. It takes the 'reviews' table and the 'category_model_table' as input, categorizes the text in the 'text' column, and stores the main results in the 'reviews_categorized' table.

SAS® / CAS Code
PROC CAS;
 textRuleScore.applyCategory /
 TABLE={name='reviews'},
 docId='docId',
 text='text',
 model={name='category_model_table'},
 casOut={name='reviews_categorized', replace=true};
RUN;
QUIT;
Result:
An output table named 'reviews_categorized' is created in the active caslib (assumed here to be 'mycas'), because casOut does not name a caslib. It contains the original data plus new columns for each category, indicating whether a document belongs to that category (1 for a match, 0 otherwise).
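
To check the actual layout of the output rather than rely on the description above, the table can be inspected directly. This is a small sketch, assuming 'reviews_categorized' was written to the active caslib; table.columnInfo and table.fetch are general CAS table actions, and the column names they report depend on the model that was applied.

```sas
proc cas;
   /* List the column names and types of the categorization output */
   table.columnInfo / table={name='reviews_categorized'};

   /* Show the first rows of the output */
   table.fetch / table={name='reviews_categorized'}, to=10;
run;
quit;
```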

This example shows a more advanced use of applyCategory. It uses the 'WEIGHTED' scoring algorithm and generates three distinct output tables: 'categorized_docs' for the main category scores, 'category_matches' for detailed term-level matches, and 'category_grouped_matches' which aggregates matches by category for each document, using a semicolon as a delimiter.

SAS® / CAS Code
PROC CAS;
 textRuleScore.applyCategory /
 TABLE={name='reviews', caslib='mycas'},
 docId='docId',
 text='text',
 model={name='category_model_table'},
 casOut={name='categorized_docs', replace=true},
 matchOut={name='category_matches', replace=true},
 groupedMatchOut={name='category_grouped_matches', replace=true},
 matchDelimiter=';',
 scoringAlgorithm='WEIGHTED';
RUN;
QUIT;
Result:
Three tables are created in the active caslib (assumed here to be 'mycas'), because none of the output parameters names a caslib: 'categorized_docs' with weighted category scores, 'category_matches' detailing every rule match, and 'category_grouped_matches' providing a summarized view of matches per document and category.
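
The three outputs can be compared side by side with a short CASL loop. This is a sketch under the assumption that all three tables were written to the active caslib; the list variable outTables and the loop variable t are illustrative names, and table.recordCount and table.fetch are general CAS table actions.

```sas
proc cas;
   /* Names of the output tables produced by the example above */
   outTables = {"categorized_docs", "category_matches", "category_grouped_matches"};

   do t over outTables;
      /* Row count, then a short preview, for each output table */
      table.recordCount / table={name=t};
      table.fetch / table={name=t}, to=5;
   end;
run;
quit;
```

Comparing the row counts gives a quick sense of granularity: 'category_matches' holds one row per individual match, so it is typically the longest of the three.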

FAQ

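What is the primary function of the applyCategory action?
It categorizes text documents by applying a pre-built category model (in MCO format) to an input table and writes the results to output tables.

Which parameter is mandatory for specifying the category model to be used?
The `model` parameter, which names the input CAS table that contains the compiled category model (MCO).

How can I define the input data table and the specific text variable to be categorized?
Use `table` to name the input CAS table and `text` to name the variable that holds the document text.

What are the available scoring algorithms for this action?
`scoringAlgorithm` accepts 'FREQUENCY', which counts how many times a category's rules are met, and 'WEIGHTED', which takes the weights assigned to the rules into account.

What is the difference between the `casOut`, `matchOut`, and `groupedMatchOut` output parameters?
`casOut` stores the categorization results for each document. `matchOut` stores detailed information about each individual term match. `groupedMatchOut` groups all matched terms for each category into a single row per document, separated by the `matchDelimiter` value.

How can I specify a unique identifier for each document in the input table?
Set `docId` to the name of the input-table variable that contains the unique document identifier.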