Semarchy AI-based few-shot classification enricher

The Semarchy AI-based few-shot classification enricher classifies text inputs into predetermined categories using minimal labeled examples.

Plugin ID

Semarchy AI Few-Shot Classification Enricher - com.semarchy.engine.plugins.ai.classification.fewshot

Description

The few-shot classification enricher generalizes from a small number of labeled examples, called sentences, to identify and categorize new, unseen inputs. This approach is particularly useful in specialized contexts where too little labeled data is available to train a generic machine learning model.

Plugin parameters

The following table lists the plugin parameters.

Parameter name Mandatory Type Description

API Key

Yes

String

Client-side API key for establishing connectivity with the Hugging Face API.

Model

Yes

String

Classification model to use. Any few-shot classification model is applicable (e.g., sentence-transformers/all-MiniLM-L6-v2).

Datasource

No

String

Name of the platform datasource from which class information is retrieved. If not specified, the enricher defaults to using the data location’s datasource.

Sentence (JSON)*

No

String

JSON object containing labeled examples, in which each class identifier maps to a list of descriptive input texts ("class identifier":["descriptive input text", ...]), to help the model learn how to associate similar inputs with the correct labels (e.g., {"BOYSSHOES":["<Boys' shoes description>"], "GIRLSDRESSES":["<Girls' dress description 1>", "<Girls' dress description 2>"]}).

Sentence Table
Sentence ID
Sentence Text*

No

String

Labeled text samples associated with a specified table and identifiers, to help the model learn how to associate similar inputs with the correct labels. For example:

  • Sentence Table: GD_PRODUCT (indicates that the sentences are stored in the GD_PRODUCT table within the database)

  • Sentence ID: F_BRAND (indicates that the unique identifiers for each class are stored in the F_BRAND column within the GD_PRODUCT table)

  • Sentence Text: DESCRIPTION (indicates that the text to associate with a class ID is stored in the DESCRIPTION column within the GD_PRODUCT table)

Sentence (Custom SQL)*

No

String

Custom SQL query for retrieving existing classes and their corresponding labels to help the model learn how to associate similar inputs with the correct labels.

Basic query:

SELECT F_BRAND, DESCRIPTION
FROM GD_PRODUCT
WHERE DESCRIPTION IS NOT NULL

Specifying a maximum number of sentences to retrieve per brand:

SELECT ID, LABEL FROM (
  SELECT F_BRAND AS ID, DESCRIPTION AS LABEL,
    ROW_NUMBER() OVER (PARTITION BY F_BRAND ORDER BY b_upddate DESC) AS n
  FROM GD_PRODUCT
  WHERE DESCRIPTION IS NOT NULL
) x
WHERE n <= 20

Min Score To Classify

No

Number

Value between 0 and 100, used to define the minimum confidence score required for the enricher to classify input text into one of the potential classes.
Default value: 0

Max Samples Per Class

No

Number

Maximum number of examples to use for each candidate class.
Default value: 20

* Choose one of these options to specify sentences, that is, the labeled examples from which the model will learn.
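
To illustrate the Sentence (JSON) format, the following Python sketch groups labeled (class identifier, text) pairs into the expected structure and serializes it. The helper name build_sentence_json and the sample descriptions are hypothetical and are not part of the plugin. Capping the number of examples per class on the client side is an assumption made for illustration; it only mirrors the default value of Max Samples Per Class.

```python
import json

def build_sentence_json(pairs, max_per_class=20):
    """Group (class_id, text) pairs into {"CLASS": ["text", ...]} and serialize."""
    grouped = {}
    for class_id, text in pairs:
        if not text:
            # Skip empty or missing descriptions: they carry no signal.
            continue
        samples = grouped.setdefault(class_id, [])
        if len(samples) < max_per_class:
            samples.append(text)
    return json.dumps(grouped, ensure_ascii=False)

# Hypothetical labeled examples.
pairs = [
    ("BOYSSHOES", "Boys' leather school shoes"),
    ("GIRLSDRESSES", "Girls' cotton summer dress"),
    ("GIRLSDRESSES", "Girls' party dress with sequins"),
    ("GIRLSDRESSES", None),
]
sentence_json = build_sentence_json(pairs)
print(sentence_json)
```

The resulting string can be pasted into the Sentence (JSON) parameter. Generating it from code rather than by hand helps avoid quoting mistakes in long descriptions.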

Classification models

The few-shot classification enricher is powered by machine learning models that are accessed through the Hugging Face API. The enricher allows flexibility in choosing a model that suits the nature of the data to be classified, considering factors such as domain-specific requirements and performance characteristics.
Here are some relevant models for few-shot classification:

  • sentence-transformers/all-MiniLM-L6-v2: a compact transformer model optimized for tasks like clustering or semantic search.

  • sentence-transformers/all-mpnet-base-v2: an efficient model applicable for tasks such as information retrieval, clustering, or assessing sentence similarity.

For more information on classification models, see the official Hugging Face documentation.
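
The models listed above are sentence-embedding models, so one common way to use them for few-shot classification is to embed the input text and the labeled sentences, then select the class whose example is most similar. The sketch below illustrates that idea only; this page does not document the enricher's internal algorithm. The embed function is a toy bag-of-words stand-in for a real embedding model, and classify and the sample data are hypothetical names introduced for the example.

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a sentence-embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def classify(input_text, sentences):
    """Return (class_id, score on a 0-100 scale) for the most similar example."""
    query = embed(input_text)
    best_class, best_score = None, 0.0
    for class_id, examples in sentences.items():
        for example in examples:
            score = cosine(query, embed(example))
            if score > best_score:
                best_class, best_score = class_id, score
    return best_class, round(best_score * 100, 1)

# Hypothetical labeled sentences, in the same shape as the Sentence (JSON) parameter.
sentences = {
    "SHOES": ["leather running shoes for boys"],
    "DRESSES": ["cotton summer dress for girls", "party dress with sequins"],
}
class_id, score = classify("red summer dress", sentences)
print(class_id, score)
```

With a real embedding model, texts that share meaning but not vocabulary would also score as similar, which is what makes a handful of examples per class sufficient.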

Plugin inputs

The following table lists the plugin inputs.

Input name Mandatory Type Description

Input Text

Yes

String

Text sample to classify, provided as a string.

Plugin outputs

The following table lists the plugin outputs.

Output name Type Description

Most Probable Class ID

String

The category that the enricher identifies as the most likely classification for the input text based on the predefined labels.

Classification Score

Number

The confidence level associated with the most probable class identified by the enricher, represented as a percentage.

An error message is displayed on the user interface and raised in the PDE log if the classification score falls below the required threshold for any class.
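
The interaction between Classification Score and Min Score To Classify can be sketched as follows. This is a simplified model of the behavior described above, not the plugin's code: the names apply_min_score and BelowThresholdError are hypothetical, and treating a score equal to the threshold as acceptable is an assumption, since the boundary case is not specified here.

```python
class BelowThresholdError(Exception):
    """Hypothetical error raised when the best score is below the minimum."""

def apply_min_score(class_id, score, min_score=0):
    # Min Score To Classify is documented as a value between 0 and 100.
    if not 0 <= min_score <= 100:
        raise ValueError("min_score must be between 0 and 100")
    # Assumption: a score equal to the threshold is accepted.
    if score < min_score:
        raise BelowThresholdError(
            f"score {score} is below the minimum of {min_score}"
        )
    return class_id, score

result = apply_min_score("ECO-CONSCIOUS", 72, min_score=50)
print(result)
```

With the default value of 0, every input is classified, because no score can fall below the threshold.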

Examples and use cases

Automated data classification: classifying new records in a product catalog

Imagine a scenario where a new record is added to a product catalog with the following description:

"The Heritage Houndstooth Blazer, crafted from a blend of 70% organic cotton and 30% Tencel™, is GOTS-certified, FLA-compliant, and produced in WRAP-certified facilities. Featuring PrecisionFit tailoring and Repreve® recycled polyester lining, it combines sustainability with modern design."

This product catalog includes various brands spanning regular, luxury, and ethical clothing lines. Below are descriptions of some of the products included in the inventory.

  • In the Everyday Essentials (regular) line:

    • "Wear this Esperanza Contrast Color T-Shirt as a casual t-shirt, pairing it with jeans, or wear it as a smart shirt with bright shorts. It’s a perfect addition to any girl’s wardrobe."

    • "Relaxed-fit jean that fits easily over boots. Casual enough for a date, durable enough for hard work, rugged enough for motorcycle riding."

  • In the Prestige (luxury) line:

    • "Luxury V-neck designed for a layered look, this Moretti sweater is the perfect knit fabric for a warm and comfortable pullover. This lightweight sweater provides a slim fit for the fashionable man who wants to look meticulous and well-dressed."

    • "Valdo Cashmere is a men’s stylish cashmere wool blended double-breasted pea coat with bronze parallel buttons. The finest materials are used to create this pea coat that makes this piece a warm overcoat for spring and winter seasons."

  • In the Eco-Conscious (ethical) line:

    • "Introducing the Vegan Leather Jacket, made from cruelty-free materials and manufactured in facilities adhering to the FLA’s fair labor code. This jacket combines sleek design with ethical principles, ensuring both style and sustainability in every stitch."

    • "Our Fairtrade Wool Blend Coat blends merino wool with recycled fibers, promoting sustainable practices and fair wages for workers. GOTS-certified and dyed with low-impact methods, it offers warmth and style while supporting ethical fashion initiatives."

Suppose the few-shot classification enricher is configured as follows:

  • In the plugin parameters:

    • Sentence (JSON): {"EVERYDAY ESSENTIALS":["Relaxed-fit jean that fits easily over boots. Casual enough for a date, durable enough for hard work, rugged enough for motorcycle riding.", "Wear this Esperanza Contrast Color T-Shirt as a casual t-shirt, pairing it with jeans, or wear it as a smart shirt with bright shorts. It’s a perfect addition to any girl’s wardrobe."], "PRESTIGE":["Luxury V-neck designed for a layered look, this Moretti sweater is the perfect knit fabric for a warm and comfortable pullover. This lightweight sweater provides a slim fit for the fashionable man who wants to look meticulous and well-dressed.", "Valdo Cashmere is a men’s stylish cashmere wool blended double-breasted pea coat with bronze parallel buttons. The finest materials are used to create this pea coat that makes this piece a warm overcoat for spring and winter seasons."], "ECO-CONSCIOUS":["Our Fairtrade Wool Blend Coat blends merino wool with recycled fibers, promoting sustainable practices and fair wages for workers. GOTS-certified and dyed with low-impact methods, it offers warmth and style while supporting ethical fashion initiatives.", "Introducing the Vegan Leather Jacket, made from cruelty-free materials and manufactured in facilities adhering to the FLA’s fair labor code. This jacket combines sleek design with ethical principles, ensuring both style and sustainability in every stitch."]}
      or

    • Sentence Table: GD_PRODUCT
      Sentence ID: F_LINE
      Sentence Text: DESCRIPTION
      or

    • Sentence (Custom SQL): SELECT F_LINE, DESCRIPTION FROM GD_PRODUCT WHERE 1=1

    • Min Score To Classify: 50

  • In the plugin input properties:

    • Input Text: Description

  • In the plugin output properties:

    • FID_Line: Most Probable Class ID

    • EnrichmentConfidenceScore: Classification Score

Based on the product description and provided sentences, the enricher automatically classifies the new product record into the Eco-Conscious clothing line with a confidence score of 72, using the sentence-transformers/all-MiniLM-L6-v2 model.
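
The three ways of supplying sentences in this example describe the same data. As a rough illustration, the Python sketch below uses an in-memory SQLite database as a stand-in for the platform datasource, runs a custom SQL query that keeps the most recent examples per line, and groups the rows into the Sentence (JSON) shape. The sample rows, the shortened descriptions, and the limit of 2 are made up for the example; the query requires SQLite 3.25 or later for window functions.

```python
import json
import sqlite3

# In-memory stand-in for the GD_PRODUCT table (sample rows are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE GD_PRODUCT (F_LINE TEXT, DESCRIPTION TEXT, B_UPDDATE TEXT)")
conn.executemany(
    "INSERT INTO GD_PRODUCT VALUES (?, ?, ?)",
    [
        ("PRESTIGE", "Luxury V-neck sweater", "2024-01-03"),
        ("PRESTIGE", "Cashmere pea coat", "2024-01-02"),
        ("PRESTIGE", "Silk evening scarf", "2024-01-01"),
        ("ECO-CONSCIOUS", "Vegan leather jacket", "2024-01-02"),
        ("ECO-CONSCIOUS", None, "2024-01-05"),
    ],
)

MAX_PER_CLASS = 2
query = """
SELECT ID, LABEL FROM (
    SELECT F_LINE AS ID, DESCRIPTION AS LABEL,
           ROW_NUMBER() OVER (PARTITION BY F_LINE ORDER BY B_UPDDATE DESC) AS n
    FROM GD_PRODUCT
    WHERE DESCRIPTION IS NOT NULL
) x
WHERE n <= ?
ORDER BY ID, n
"""

# Group (class ID, text) rows into {"CLASS": ["text", ...]}.
sentences = {}
for class_id, text in conn.execute(query, (MAX_PER_CLASS,)):
    sentences.setdefault(class_id, []).append(text)

print(json.dumps(sentences, indent=2))
```

Rows with a missing description are filtered out, and only the two most recently updated descriptions per line are kept, so the oldest PRESTIGE row does not appear in the result.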

Additional use cases

Relevant use cases for few-shot classification may include:

  • Product code mapping in e-commerce: classify product listings using specific industry codes or SKU numbers (e.g., BSH-043 in "bookshelf systems", DSK-011 in "modular desks", WRD-408 in "wardrobes" for a furniture store) to streamline inventory management, improve search functionality, and enhance user experience.

  • Clinical trial data classification: categorize clinical trial data using medical research codes (e.g., "RCT" for randomized controlled trials, "PK" for pharmacokinetics, "AE" for adverse events) to enhance data analysis, reporting, and regulatory compliance.

  • Employee skill categorization: organize employee profiles according to their educational background (BSc, MEng, PhD, etc.), credentials (PMP, CPA, etc.), software proficiency (DBMS, CRM, CAD, GIS, etc.), and other relevant criteria into categories (e.g., "consulting," "accounting," "IT," "training," or "customer service" in the human resources industry) to support better workforce management, targeted training and development initiatives, and enhanced talent management strategies.