Semarchy AI zero-shot classification enricher

The Semarchy AI zero-shot classification enricher classifies text inputs into predetermined categories without requiring explicit training on labeled data.

Plugin ID

Semarchy AI Zero-Shot Classification Enricher - com.semarchy.engine.plugins.ai.classification.zeroshot

Description

The AI zero-shot classification enricher uses machine learning models, accessed through the Hugging Face API, to analyze text inputs and assign them to predetermined categories without prior training on labeled datasets. It reduces the manual effort needed to classify and organize data.

Plugin parameters

The following table lists the plugin parameters.

Parameter name Mandatory Type Description

API Key

Yes

String

Client-side API key for establishing connectivity with the Hugging Face API.

Model

Yes

String

Classification model to use. Any zero-shot classification model is applicable (e.g., facebook/bart-large-mnli).

Base URL

Yes

String

Base URL for the Hugging Face model, available either directly on Hugging Face or on Azure (e.g., https://router.huggingface.co/hf-inference/models).

Deployment

String

Method for accessing the Hugging Face API: either direct API calls to Hugging Face or requests routed through an alternative provider. Possible values are:

HUGGING_FACE (default)
AZURE_ML

The base URL must be set accordingly.

Datasource

String

Name of the platform datasource from which class information is retrieved. If not specified, the enricher defaults to using the data location’s datasource.

Candidate Class (JSON)^*

String

JSON object containing one or more "class identifier":"label" pairs, representing the potential categories that the model may classify inputs into. Identifiers may be numeric, alphabetic, or alphanumeric (e.g., {"345":"Hats", "SHOES":"Shoes", "MENSWEAR.01":"Men’s shirts"}).

At least two classes are required for proper data classification.

Candidate Class Table
Candidate Class ID
Candidate Class Label^*

String

Name of a specific table or column from which to retrieve and organize class information. For example:

Candidate Class Table: GD_FAMILY (indicates that the class information is stored in the GD_FAMILY table within the database)
Candidate Class ID: ID (indicates that the unique identifiers for each class are stored in the ID column within the GD_FAMILY table)
Candidate Class Label: NAME (indicates that the labels or names corresponding to each class ID are stored in the NAME column within the GD_FAMILY table)

Candidate Class (Custom SQL)^*

String

Custom SQL query for retrieving potential classes for the input text (e.g., SELECT ID, NAME FROM GD_FAMILY WHERE 1=1).

The query must select the class identifier first, followed by the class label column, using the column names exactly as they appear in the queried table.

Min Score To Classify

Number

Value between 0 and 100 used to define the minimum confidence score required for the enricher to classify input text into one of the potential classes.
Default value: 0.

Multi-Label

Boolean

Choice of whether multiple labels (i.e., candidate classes) can be assigned to a single input text sample.
Default value: false.

Use Cache

Boolean

Choice of whether to use the cache layer on the inference API to accelerate the processing of requests that have been made previously.
Default value: true.

When using deterministic models, which consistently produce the same results, cached data can be reliably used. However, if you are employing a non-deterministic model, set the parameter to false to bypass the cache layer and ensure fresh results are retrieved.

Wait For Model

Boolean

Choice of whether to wait for the model to be ready before processing requests, or to immediately return a 503 error indicating that the service is unavailable.
Default value: false.

To avoid unexpected timeouts, enable this option only after encountering a 503 error.

^* Choose one of these options to specify the candidate classes for classification.
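
As an illustration of the Candidate Class (JSON) format, the following Python sketch parses a candidate class value and checks the two-class minimum described above. The function name parse_candidate_classes is hypothetical and is not part of the plugin; the enricher's own validation may differ.

```python
import json


def parse_candidate_classes(raw: str) -> dict[str, str]:
    """Parse a Candidate Class (JSON) value into an {identifier: label} mapping.

    Hypothetical helper for illustration only; not the plugin's implementation.
    """
    classes = json.loads(raw)
    if not isinstance(classes, dict):
        raise ValueError("expected a JSON object of identifier/label pairs")
    if len(classes) < 2:
        raise ValueError("at least two candidate classes are required")
    # JSON object keys are always strings, so a numeric identifier
    # such as 345 is written and read as "345".
    return {str(key): str(label) for key, label in classes.items()}


classes = parse_candidate_classes(
    '{"345": "Hats", "SHOES": "Shoes", "MENSWEAR.01": "Men\'s shirts"}'
)
print(classes)
```

Running a check like this before saving the parameter can catch malformed JSON or a single-class configuration early.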

Classification models

The AI zero-shot classification enricher is powered by machine learning models that are accessed through the Hugging Face API. The enricher allows flexibility in choosing a model that suits the nature of the data to be classified, considering factors such as domain-specific requirements and performance characteristics.
Here are some relevant models for zero-shot classification:

facebook/bart-large-mnli: versatile and highly accurate, making it suitable for general-purpose classification tasks, though it has longer inference times due to its large size.
cross-encoder/nli-roberta-base: balances performance and efficiency, making it ideal for zero-shot classification tasks that require lower latency without significant accuracy trade-offs.
MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli: robust and highly accurate, particularly effective for complex or nuanced texts and tasks involving fact-checking or adversarial examples.

For more information on classification models, see the official Hugging Face documentation or the official documentation about Hugging Face on Azure.
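
To make the relationship between the parameters and a model call more concrete, here is a minimal Python sketch that assembles (but does not send) a zero-shot classification request for the HUGGING_FACE deployment. It assumes the commonly documented Hugging Face Inference API shape: a JSON body with inputs and parameters.candidate_labels, and the x-use-cache and x-wait-for-model headers. How the enricher builds its requests internally is not documented here, so the function name build_zero_shot_request and the parameter-to-header mapping are assumptions.

```python
import json


def build_zero_shot_request(base_url, model, api_key, text, labels,
                            multi_label=False, use_cache=True,
                            wait_for_model=False):
    """Assemble URL, headers, and JSON body for a zero-shot classification call.

    Illustrative only: nothing is sent, and the enricher's real request
    may be built differently.
    """
    url = f"{base_url.rstrip('/')}/{model}"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        # Assumed mapping of Use Cache / Wait For Model onto request headers.
        "x-use-cache": str(use_cache).lower(),
        "x-wait-for-model": str(wait_for_model).lower(),
    }
    body = {
        "inputs": text,
        "parameters": {
            "candidate_labels": list(labels),
            "multi_label": multi_label,
        },
    }
    return url, headers, json.dumps(body)


url, headers, body = build_zero_shot_request(
    "https://router.huggingface.co/hf-inference/models",
    "facebook/bart-large-mnli",
    "hf_xxx",  # placeholder, not a real API key
    "Floral cotton summer dress with side pockets",
    ["Girls' clothing", "Boys' clothing"],
)
print(url)
```

The defaults in this sketch mirror the parameter defaults listed earlier (Multi-Label false, Use Cache true, Wait For Model false).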

Plugin inputs

The following table lists the plugin inputs.

Input name	Mandatory	Type	Description
Input Text	Yes	String	Text sample to classify, provided as a string.

Plugin outputs

The following table lists the plugin outputs.

Output name	Type	Description
Most Probable Class ID	String	The category that the enricher identifies as the most likely classification for the input text based on the predefined labels.
Classification Score	Number	The confidence level associated with the most probable class identified by the enricher, represented as a percentage.

An error message is displayed on the user interface and raised in the error log if the classification score falls below the required threshold for any class.

Examples and use cases

Automated data classification: classifying new records in a product catalog

Imagine a scenario where a new record is added to a product catalog with the following description:

"This adorable summer dress features a vibrant floral print on soft, breathable cotton, ensuring comfort and style for any occasion. Its A-line silhouette and practical details, like side pockets and a back zipper, make it a versatile wardrobe essential."

Suppose the zero-shot classification enricher is configured as follows:

In the plugin parameters:
- Candidate Class (JSON): {"GIRLSCLOTHING":"Girls' clothing", "BOYSCLOTHING":"Boys' clothing", "GIRLSSHOES":"Girls' shoes", "BOYSSHOES":"Boys' shoes"}
  or
- Candidate Class Table: GD_FAMILY
  Candidate Class ID: ID
  Candidate Class Label: NAME
  or
- Candidate Class (Custom SQL): SELECT ID, NAME FROM GD_FAMILY WHERE 1=1
- Min Score To Classify: 50
In the plugin input properties:
- Input Text: Description
In the plugin output properties:
- FID_Family: Most Probable Class ID
- EnrichmentConfidenceScore: Classification Score

The enricher automatically classifies the new product record into the Girls' clothing family with a confidence score of 77, using the facebook/bart-large-mnli model.
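
The way a result like this could be derived from model scores can be sketched as follows. This is not the plugin's code: the function most_probable_class is hypothetical, it assumes the model returns parallel lists of labels and scores in the 0 to 1 range, and the scores below are invented so that the top score matches the 77 in this example.

```python
def most_probable_class(classes, labels, scores, min_score=0.0):
    """Return (class_id, score_percent) for the top label, or None.

    classes: {class identifier: label}, as in Candidate Class (JSON)
    labels, scores: parallel lists (scores assumed to be in the range 0..1)
    min_score: threshold on the 0..100 scale, like Min Score To Classify
    """
    label_to_id = {label: class_id for class_id, label in classes.items()}
    best_label, best_score = max(zip(labels, scores), key=lambda pair: pair[1])
    percent = best_score * 100
    if percent < min_score:
        return None  # no class reaches the required confidence
    return label_to_id[best_label], round(percent)


classes = {
    "GIRLSCLOTHING": "Girls' clothing",
    "BOYSCLOTHING": "Boys' clothing",
    "GIRLSSHOES": "Girls' shoes",
    "BOYSSHOES": "Boys' shoes",
}
# Invented scores for illustration; not actual model output.
labels = ["Girls' clothing", "Boys' clothing", "Girls' shoes", "Boys' shoes"]
scores = [0.77, 0.12, 0.08, 0.03]

result = most_probable_class(classes, labels, scores, min_score=50)
print(result)  # ('GIRLSCLOTHING', 77)
```

With the same invented scores and a Min Score To Classify of 80, the sketch returns None, which corresponds to the case where the score falls below the required threshold.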

Additional use cases

Common use cases for zero-shot classification may include:

Supplier data classification: automatically classify supplier descriptions into predefined categories such as "electronics components," "raw materials," "office supplies," or "logistics services" to facilitate procurement, auditing, and strategic sourcing.
Customer segmentation: segment customers, based on their activity descriptions, into categories such as "high-value," "medium-value," "low-value," or "loyalty program member" to enable targeted marketing, improved customer service, and personalized offerings.
Product type categorization: analyze product names or descriptions and classify them into categories such as "electronics," "apparel," "home goods," or "personal care," to enhance searchability, inventory management, and reporting efficiency for large product inventories.