applyConcept - WeAreCAS

Description

The applyConcept action performs concept extraction using a predefined or custom concept extraction model (a LITI file). It is part of the Text Analytics Rule Score action set, which provides tools for linguistic rule scoring for categorization, concept extraction, and sentiment analysis. This action processes an input text document or a table of documents and identifies occurrences of concepts defined in the model, outputting detailed match information.

textRuleScore.applyConcept { casOut={...}, docId="string", dropConcepts={"string-1", ...}, factOut={...}, language="string", litiChunkSize="string", matchType="ALL"|"BEST"|"LONGEST", model={...}, parseTableIn={...}, parseTableOut={...}, ruleMatchOut={...}, table={...}, text="string" };

Settings

Parameter	Description
casOut	Specifies the output CAS table to store the concept match results.
docId	Specifies the name of the variable in the input table that contains the document IDs.
dropConcepts	Specifies a list of concept names to exclude from the output tables. This is useful for filtering out predefined concepts without modifying the model.
factOut	Specifies the output CAS table for storing fact match results.
language	Specifies the language of the input text. Default is 'ENGLISH'.
litiChunkSize	Specifies the chunk size for document processing (e.g., '32K', '1M', 'ALL'). Smaller sizes can help manage memory for large documents. Default is '32K'.
matchType	Specifies the matching strategy: 'ALL' for all matches, 'BEST' for the best match, or 'LONGEST' for the longest match. Default is 'ALL'.
model	Specifies the input CAS table containing the user-defined LITI (Language Interpretation for Text Information) model for concept extraction.
parseTableIn	Specifies a CAS table containing pre-parsed documents from a previous run, which can improve performance, especially when using the CLAUS_n operator.
parseTableOut	Specifies a CAS table to save pre-parsed documents, which can be used as input for future runs to improve performance.
ruleMatchOut	Specifies the output CAS table to store detailed rule match information, which can be used as input for the ruleGen action.
table	Specifies the input CAS table that contains the documents to be processed.
text	Specifies the name of the variable in the input table that contains the document text.

Data Preparation

Data Creation

This example creates a sample CAS table named 'my_documents' with two columns: 'doc_id' for the document identifier and 'text' for the document content. This table will be used as input for the concept extraction.

/* Assumes the libref 'mycas' points to a caslib in an active CAS session */
DATA mycas.my_documents;
  INFILE DATALINES delimiter='|';
  LENGTH doc_id $ 10 text $ 300;
  INPUT doc_id $ text $;
  DATALINES;
doc1|The new SAS Viya platform is a powerful analytics tool.
doc2|SAS Cloud Analytic Services (CAS) is the engine behind Viya.
doc3|You can use LITI models for concept extraction.
;
RUN;
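The DATA step above writes to a libref named `mycas`, which this page does not define. A minimal sketch of one way to set it up before running the DATA step follows. The session name `mysess` and the choice of the CASUSER caslib are assumptions for illustration, not something the original example specifies.

```sas
/* Start a CAS session; 'mysess' is an arbitrary, hypothetical session name. */
cas mysess;

/* Bind the libref 'mycas' to a caslib through the CAS LIBNAME engine.
   CASUSER is assumed here; substitute the caslib you actually work in. */
libname mycas cas caslib=casuser;
```

If the libref points to a caslib other than the session's active caslib, the later `table={name='my_documents'}` references would also need a `caslib=` entry.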

Examples

This example applies the default concept extraction model to the 'my_documents' table. It identifies concepts in the 'text' column, using 'doc_id' as the document identifier. The results are stored in a CAS table named 'concept_matches'.

SAS® / CAS Code

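PROC CAS;
  /* Load the action set in case it is not yet available in the session */
  builtins.loadActionSet / actionSet='textRuleScore';
  /* No model is specified, so the default concept model is applied */
  textRuleScore.applyConcept /
    table={name='my_documents'},
    docId='doc_id',
    text='text',
    casOut={name='concept_matches', replace=true};
RUN;
QUIT;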

Result:
The action will produce an output table 'concept_matches' in the current caslib. This table will contain the concepts found in each document, such as 'SAS' or 'platform', along with their start and end positions.
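To look at what the action actually wrote, the output table can be inspected with the standard `table` action set. The sketch below lists the columns first instead of assuming their names, then prints a few rows. The row limit of 10 is an arbitrary choice.

```sas
proc cas;
  /* List the columns of the output table rather than assuming their names */
  table.columnInfo / table={name='concept_matches'};
  /* Show the first rows of the concept match table */
  table.fetch / table={name='concept_matches'}, to=10;
run;
quit;
```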

This example demonstrates a more advanced use case. It assumes that a compiled custom LITI model is already available in a CAS table named 'my_liti_model' and applies that model to the 'my_documents' table. It specifies 'LONGEST' for the match type so that only the longest matching string is returned for overlapping concepts. It generates three output tables: 'concept_matches' for the main results, 'fact_matches' for extracted facts, and 'rulematch_details' for detailed rule matching information that can be used for debugging or further analysis.

SAS® / CAS Code

PROC CAS;
  textRuleScore.applyConcept /
    table={name='my_documents'},
    docId='doc_id',
    text='text',
    /* Custom LITI model; must already exist as a CAS table */
    model={name='my_liti_model'},
    /* Keep only the longest match among overlapping matches */
    matchType='LONGEST',
    casOut={name='concept_matches', replace=true},
    factOut={name='fact_matches', replace=true},
    ruleMatchOut={name='rulematch_details', replace=true};
RUN;
QUIT;

Result:
Three tables will be created in the current caslib: 'concept_matches' with the longest concept matches, 'fact_matches' containing any facts extracted based on the LITI rules, and 'rulematch_details' with granular data about which rules were triggered for each match.
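The custom-model example above needs the 'my_liti_model' table to exist, and this page does not show how it is built. One way such a table might be produced is with the `compileConcept` action from the `textRuleDevelop` action set, sketched below. The rule table name `concept_rules`, its column names, the concept name `PRODUCT`, and the rule text are all hypothetical. The parameter names reflect my understanding of `compileConcept` and should be checked against the SAS documentation before use.

```sas
/* Hypothetical rule table: one LITI rule per row */
data mycas.concept_rules;
  length rule $ 200;
  ruleId = 1; rule = 'ENABLE:PRODUCT'; output;
  ruleId = 2; rule = 'CLASSIFIER:PRODUCT:SAS Viya'; output;
run;

proc cas;
  builtins.loadActionSet / actionSet='textRuleDevelop';
  /* Compile the rules into a LITI model table (assumed parameter names) */
  textRuleDevelop.compileConcept /
    table={name='concept_rules'},
    config='rule',
    ruleId='ruleId',
    casOut={name='my_liti_model', replace=true};
run;
quit;
```

The resulting table is what the `model` parameter of `applyConcept` would point to.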

This example shows a two-step process to improve performance. First, `applyConcept` is called with the `parseTableOut` parameter to create a table of pre-parsed documents named 'parsed_docs'. In the second call, this 'parsed_docs' table is used as input via the `parseTableIn` parameter, which can speed up processing, especially with complex models or large documents.

SAS® / CAS Code

PROC CAS;
  /* Step 1: Parse documents and save the intermediate table */
  textRuleScore.applyConcept /
    table={name='my_documents'},
    docId='doc_id',
    text='text',
    parseTableOut={name='parsed_docs', replace=true};

  /* Step 2: Use the pre-parsed table for faster concept extraction */
  textRuleScore.applyConcept /
    table={name='my_documents'},
    docId='doc_id',
    text='text',
    parseTableIn={name='parsed_docs'},
    casOut={name='concept_matches_fast', replace=true};
RUN;
QUIT;

Result:
The first step creates the 'parsed_docs' table. The second step uses this intermediate table to create 'concept_matches_fast', which will contain the same concept matches as a single run but may complete more quickly.
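A quick, rough way to check the claim that both paths yield the same matches is to compare row counts. This sketch assumes 'concept_matches' still holds the output of the basic example (it is overwritten if the custom-model example was run afterward). Equal counts are only a sanity check, not proof that the tables are identical.

```sas
proc cas;
  /* Row count of the single-run result from the basic example */
  table.recordCount / table={name='concept_matches'};
  /* Row count of the result produced from the pre-parsed table */
  table.recordCount / table={name='concept_matches_fast'};
run;
quit;
```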

FAQ

What is the purpose of the `applyConcept` action in SAS Viya?

What are the primary input and output parameters for the `applyConcept` action?

How can I use a custom concept model with the `applyConcept` action?

What does the `matchType` parameter control?

How can I optimize the performance of the `applyConcept` action, especially with large documents or complex models?

Is it possible to exclude certain concepts from the output results?

What does the `litiChunkSize` parameter do?
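The questions on `matchType`, `dropConcepts`, and `litiChunkSize` can be illustrated with a single call that combines those settings from the parameter table. This is a sketch only: the concept names passed to `dropConcepts` (`nlpDate`, `nlpMoney`) and the output table name are assumptions, so substitute the concept names that actually appear in your results.

```sas
proc cas;
  textRuleScore.applyConcept /
    table={name='my_documents'},
    docId='doc_id',
    text='text',
    language='ENGLISH',
    /* Return only the best match instead of all matches */
    matchType='BEST',
    /* Process each document in 1M chunks instead of the 32K default */
    litiChunkSize='1M',
    /* Exclude these concepts from the output without editing the model
       (concept names are assumed for illustration) */
    dropConcepts={'nlpDate', 'nlpMoney'},
    casOut={name='concept_matches_filtered', replace=true};
run;
quit;
```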

