applyCategory - WeAreCAS

Description

The applyCategory action categorizes text documents based on a pre-built category model (in MCO file format). It processes an input table containing the text data and applies the categorization rules, generating output tables with the results. This action is essential for automated text classification tasks, allowing users to assign documents to predefined categories.

textRuleScore.applyCategory { casOut={...}, docId="string", docType="TEXT"|"XML", groupedMatchOut={...}, matchDelimiter="string", matchOut={...}, model={...}, modelOut={...}, scoringAlgorithm="FREQUENCY"|"WEIGHTED", table={...}, text="string" };

Settings

Parameter	Description
casOut	Specifies the output CAS table to store the categorization results for each document.
docId	Specifies the name of the variable in the input table that contains the unique document identifier.
docType	Specifies the type of the input documents. Can be 'TEXT' for plain text or 'XML' for XML documents.
groupedMatchOut	Specifies an output table that groups all matched terms for each category within a single row per document.
matchDelimiter	Specifies the character or string used to separate matched terms in the 'groupedMatchOut' table.
matchOut	Specifies the output table to store detailed information about each individual term match that contributes to a category assignment.
model	Specifies the input CAS table that contains the compiled category model (MCO) to be used for scoring.
modelOut	Specifies an output table to save the model information used for scoring.
scoringAlgorithm	Specifies the algorithm used for scoring. 'FREQUENCY' counts the number of times a category's rules are met, while 'WEIGHTED' considers the weights assigned to the rules.
table	Specifies the input CAS table containing the text documents to be categorized.
text	Specifies the name of the variable in the input table that contains the text content of the documents.

Data Preparation

Data Creation for Categorization

This SAS code creates two tables: 'reviews', which contains the text data to be categorized, and 'category_model_table', which stands in for the pre-compiled categorization model. The 'reviews' table has a unique ID ('docId') and the review text ('text'). The model table is only a placeholder; replace it with a real compiled MCO model before running the examples.

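PROC CAS;
  /* Build the sample documents. DATALINES is not available in a CAS  */
  /* DATA step, so each row is assigned and written explicitly.       */
  /* single='yes' runs the step on one thread so rows are not         */
  /* duplicated across threads.                                       */
  datastep.runCode /
    single='yes',
    code='data mycas.reviews;
            length docId $8 text $200;
            docId="1"; text="This is a great product, I love it!"; output;
            docId="2"; text="The service was terrible and the food was cold."; output;
            docId="3"; text="I am not sure how I feel about this."; output;
          run;';

  /* Placeholder only: a real model table holds a compiled MCO model, */
  /* not a numeric value.                                             */
  datastep.runCode /
    single='yes',
    code='data mycas.category_model_table; _mco_ = 12345; run;';
QUIT;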

Examples

Basic Text Categorization

This example demonstrates a basic use of the applyCategory action. It takes the 'reviews' table and the 'category_model_table' as input, categorizes the text in the 'text' column, and stores the main results in the 'reviews_categorized' table.

SAS® / CAS Code

PROC CAS;
  textRuleScore.applyCategory /
    table={name='reviews', caslib='mycas'},            /* input documents */
    docId='docId',                                     /* document identifier column */
    text='text',                                       /* text column to categorize */
    model={name='category_model_table', caslib='mycas'},
    casOut={name='reviews_categorized', caslib='mycas', replace=true};
RUN;
QUIT;

Result:
An output table named 'reviews_categorized' is created in the 'mycas' caslib. It contains the original data plus new columns for each category, indicating whether a document belongs to that category (1 for a match, 0 otherwise).
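
To check which columns the action actually produced, the output table can be inspected with the standard table actions. This is a minimal sketch, assuming the basic example has already run and that 'reviews_categorized' was written to the 'mycas' caslib (adjust the caslib name if your session writes elsewhere):

```sas
PROC CAS;
  /* List the columns and their types in the output table. */
  table.columnInfo / table={name='reviews_categorized', caslib='mycas'};

  /* Print the first rows; to= caps the number of rows fetched. */
  table.fetch / table={name='reviews_categorized', caslib='mycas'}, to=10;
QUIT;
```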

Detailed Categorization with Match Outputs and Weighted Scoring

This example shows a more advanced use of applyCategory. It uses the 'WEIGHTED' scoring algorithm and generates three distinct output tables: 'categorized_docs' for the main category scores, 'category_matches' for detailed term-level matches, and 'category_grouped_matches', which aggregates matches by category for each document, using a semicolon as the delimiter.

SAS® / CAS Code

PROC CAS;
  textRuleScore.applyCategory /
    table={name='reviews', caslib='mycas'},
    docId='docId',
    text='text',
    model={name='category_model_table', caslib='mycas'},
    casOut={name='categorized_docs', caslib='mycas', replace=true},                  /* category scores */
    matchOut={name='category_matches', caslib='mycas', replace=true},                /* one row per term match */
    groupedMatchOut={name='category_grouped_matches', caslib='mycas', replace=true}, /* matches grouped per category */
    matchDelimiter=';',                                                              /* separator in groupedMatchOut */
    scoringAlgorithm='WEIGHTED';
RUN;
QUIT;

Result:
Three tables are created in the 'mycas' caslib: 'categorized_docs' with weighted category scores, 'category_matches' detailing every rule match, and 'category_grouped_matches' providing a summarized view of matches per document and category.
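
Because the two scoring modes differ only in the scoringAlgorithm value, one way to see the effect is to score the same documents twice and compare the outputs. The following is a sketch under stated assumptions: 'category_model_table' must hold a real compiled model rather than the placeholder, and the output table names 'scores_frequency' and 'scores_weighted' are hypothetical names chosen for illustration.

```sas
PROC CAS;
  /* Make sure the action set is available in this session. */
  loadactionset "textRuleScore";

  /* Pass 1: FREQUENCY scoring. */
  textRuleScore.applyCategory /
    table={name='reviews', caslib='mycas'},
    docId='docId',
    text='text',
    model={name='category_model_table', caslib='mycas'},
    scoringAlgorithm='FREQUENCY',
    casOut={name='scores_frequency', caslib='mycas', replace=true};

  /* Pass 2: WEIGHTED scoring on the same input and model. */
  textRuleScore.applyCategory /
    table={name='reviews', caslib='mycas'},
    docId='docId',
    text='text',
    model={name='category_model_table', caslib='mycas'},
    scoringAlgorithm='WEIGHTED',
    casOut={name='scores_weighted', caslib='mycas', replace=true};

  /* Print both results so they can be compared by eye. */
  table.fetch / table={name='scores_frequency', caslib='mycas'}, to=10;
  table.fetch / table={name='scores_weighted', caslib='mycas'}, to=10;
QUIT;
```

Writing each pass to its own table keeps both results available for comparison; reusing a single casOut name with replace=true would overwrite the first pass.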

FAQ

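What is the primary function of the applyCategory action?

It categorizes text documents using a pre-built category model in MCO format. The action reads an input table of text, applies the categorization rules, and writes the results to output tables.

Which parameter is mandatory for specifying the category model to be used?

The model parameter. It names the input CAS table that contains the compiled category model (MCO) used for scoring.

How can I define the input data table and the specific text variable to be categorized?

Use the table parameter to name the input CAS table and the text parameter to name the variable in that table that holds the document text.

What are the available scoring algorithms for this action?

The scoringAlgorithm parameter accepts 'FREQUENCY', which counts the number of times a category's rules are met, and 'WEIGHTED', which takes the weights assigned to the rules into account.

What is the difference between the `casOut`, `matchOut`, and `groupedMatchOut` output parameters?

casOut stores the categorization results for each document. matchOut stores detailed information about each individual term match that contributes to a category assignment. groupedMatchOut groups all matched terms for each category into a single row per document, separated by the string given in matchDelimiter.

How can I specify a unique identifier for each document in the input table?

Use the docId parameter to name the variable in the input table that contains the unique document identifier.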
