Semarchy GenAI Gemini structured enricher
The Semarchy GenAI Gemini structured enricher extracts structured data from unstructured text to enhance data completeness and streamline the data entry process.
Plugin ID
Semarchy GenAI Gemini Structured Enricher - com.semarchy.engine.plugins.genai.gemini.structured
Description
The GenAI Gemini structured enricher is designed to extract structured data from unstructured text using Google Gemini language models. It can generate or extract up to 20 outputs in JSON format, including strings, booleans, numbers, and dates.
Prerequisites
To authenticate with the Vertex AI API, you need to have a Google Cloud service account and must configure xDM to integrate Gemini models.
- Create a Google Cloud service account. For detailed information, see the official Vertex AI documentation.
- At the end of the account creation process, download the service account key file.
- In xDM’s startup configuration, set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of the service account key file that contains your credentials.
For additional information on how to get started with Vertex AI, see the LangChain4j documentation.
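Before starting xDM, it can help to confirm that the variable points to a usable key file. The following is a minimal sketch, not part of the plugin: `check_credentials` is a hypothetical helper, and it assumes only that the key is the standard JSON file downloaded from Google Cloud, which carries a `type` of `service_account` and a `client_email` field.

```python
import json
import os
from pathlib import Path


def check_credentials(env=os.environ):
    """Return the service account email if the key file looks usable.

    Hypothetical pre-flight check; raises ValueError with a readable
    message when the variable or the file is not as expected.
    """
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise ValueError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    key_file = Path(path)
    if not key_file.is_file():
        raise ValueError(f"Key file not found: {key_file}")
    try:
        key = json.loads(key_file.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        raise ValueError(f"Key file is not valid JSON: {key_file}") from exc
    # Service account keys are JSON objects with type "service_account".
    if not isinstance(key, dict) or key.get("type") != "service_account":
        raise ValueError("Key file is not a service account key")
    return key.get("client_email")
```

Calling `check_credentials()` in the same environment that launches xDM reports a missing or misplaced key file before the enricher makes its first request.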
Plugin parameters
The following table lists the plugin parameters.
Parameter name | Mandatory | Type | Description |
---|---|---|---|
Project | Yes | String | A globally unique, permanent identifier generated by the Google Cloud console. The project ID can be a combination of lowercase letters, numbers, and hyphens. |
Location | Yes | String | Geographical region where the Gemini model will be deployed and used. The following regions are supported: |
Model Name | Yes | String | Language model to be used. |
Temperature | No | Number | Value ranging from 0.0 to 1.0 that balances conservative, coherent outputs (0) against creative variations (1) during text generation. |
Max Output Tokens | No | Integer | Maximum number of tokens allowed in the generated output during text generation. |
Top K | No | Number | Value ranging from 1 to 40 that limits the model’s predictions to the most probable tokens at each step of generation. |
Top P | No | Number | Value ranging from 0.0 to 1.0 that defines the cumulative probability threshold for nucleus sampling (i.e., token selection). |
Max Retries | No | Integer | Maximum number of attempts allowed for an API request before it is considered unsuccessful. |
Boolean output <N> (BOOLEAN_OUT_<N>) | No | String | Descriptor for the Nth boolean output (N from 1 to 5), describing the boolean value that the enricher extracts into its structured output. |
Date output <N> (DATE_OUT_<N>) | No | String | Descriptor for the Nth date output (N from 1 to 5), describing the date value that the enricher extracts into its structured output. |
Number output <N> (NUMBER_OUT_<N>) | No | String | Descriptor for the Nth number output (N from 1 to 5), describing the number value that the enricher extracts into its structured output. |
String output <N> (STRING_OUT_<N>) | No | String | Descriptor for the Nth string output (N from 1 to 5), describing the string value that the enricher extracts into its structured output. |

The enricher can return up to 20 outputs (five of each type).

Each output descriptor matches only attributes of its own type (date, string, number, or boolean). For example, the Date output 1 descriptor exclusively matches date attributes. The plugin handles this matching automatically.
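The documented ranges for the sampling parameters can be checked before they are entered in the plugin configuration. The sketch below is illustrative only: `validate_sampling` is a hypothetical helper, not something the plugin exposes, and it encodes nothing beyond the ranges listed in the table above.

```python
def validate_sampling(temperature=None, top_k=None, top_p=None):
    """Return a list of problems; an empty list means every value set is in range.

    Parameters left as None are treated as "not configured" and skipped,
    since all three are optional in the plugin.
    """
    problems = []
    if temperature is not None and not 0.0 <= temperature <= 1.0:
        problems.append("Temperature must be between 0.0 and 1.0")
    if top_k is not None and not 1 <= top_k <= 40:
        problems.append("Top K must be between 1 and 40")
    if top_p is not None and not 0.0 <= top_p <= 1.0:
        problems.append("Top P must be between 0.0 and 1.0")
    return problems
```

For extraction tasks, where repeatable results usually matter more than variety, a low temperature is a common starting point.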
Language models
Language models are AI systems trained on vast amounts of text data to understand and generate human-like language, enabling tasks like text completion, translation, summarization, and sentiment analysis.
The Vertex AI API offers a range of models with distinct capabilities. For more information on Gemini models and a list of stable model versions, see the official Vertex AI documentation.
Tokens
Tokens are units of text that language models use to process and generate language. They can range from individual characters to entire words, depending on the language and the specific model being used.
For more information about tokens, see the official Vertex AI documentation.
Plugin inputs
The following table lists the plugin inputs.
Input name | Mandatory | Type | Description |
---|---|---|---|
User Prompt | No | String | Instructions specifying the information to be extracted and the method for structuring the outputs accordingly. |
Source Text for Extraction | No | String | Unstructured text from which structured data is extracted. |
System Prompt | No | String | Initial instruction designed to guide the model towards specific topics, styles, tones, or formats of generated text. |

You must define either a user prompt or a source text for extraction: if you do not set one, the other is required.

When opting for the Source Text for Extraction method, the enricher injects the provided text along with the configured descriptors (e.g., String output 1, Number output 1, etc.) into a standard user prompt. The pieces of information specified by the descriptors are then extracted from the text content.

When opting for the User Prompt method, model designers construct a user prompt containing unstructured values and output keys (e.g., STRING_OUT_1, NUMBER_OUT_1, etc.). These keys are then mapped to the relevant attributes in the plugin output properties.

For a detailed demonstration of these methods, see Examples and use cases.
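To make the Source Text for Extraction method more concrete, the sketch below shows one way a source text and a set of descriptors could be combined into a single prompt. It is an assumption-laden illustration: the enricher's actual standard prompt is not documented here, and both the `build_prompt` helper and the template wording are invented for this example.

```python
def build_prompt(source_text, descriptors):
    """Combine a source text and output descriptors into one prompt string.

    descriptors maps output keys (for example "STRING_OUT_1") to the
    descriptions configured in the plugin parameters. The template text
    is hypothetical and is not the enricher's real prompt.
    """
    if not descriptors:
        raise ValueError("At least one output descriptor is required")
    lines = [f"- {key}: {description}" for key, description in descriptors.items()]
    return (
        "Extract the following fields from the text and answer in JSON.\n"
        + "\n".join(lines)
        + "\n\nText:\n"
        + source_text
    )
```

With the User Prompt method, the model designer writes the equivalent of this string directly and references the output keys in it.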
Plugin outputs
The following table lists the plugin outputs.
Output name | Type | Description |
---|---|---|
Boolean output <N> (BOOLEAN_OUT_<N>) | String | Extracted boolean corresponding to the Nth boolean output descriptor, numbered from 1 to 5, and applied to a designated attribute. |
Date output <N> (DATE_OUT_<N>) | String | Extracted date corresponding to the Nth date output descriptor, numbered from 1 to 5, and applied to a designated attribute. |
Number output <N> (NUMBER_OUT_<N>) | String | Extracted number corresponding to the Nth number output descriptor, numbered from 1 to 5, and applied to a designated attribute. |
String output <N> (STRING_OUT_<N>) | String | Extracted string corresponding to the Nth string output descriptor, numbered from 1 to 5, and applied to a designated attribute. |
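The naming scheme for the output keys follows directly from the table: four types, each numbered from 1 to 5. As a small illustration, the hypothetical `output_keys` helper below enumerates all of them, which can be handy when checking that a user prompt only references keys that exist.

```python
# The four output types and the per-type limit come from the tables above.
OUTPUT_TYPES = ("BOOLEAN", "DATE", "NUMBER", "STRING")
OUTPUTS_PER_TYPE = 5


def output_keys():
    """Return every valid output key, from BOOLEAN_OUT_1 to STRING_OUT_5."""
    return [
        f"{output_type}_OUT_{n}"
        for output_type in OUTPUT_TYPES
        for n in range(1, OUTPUTS_PER_TYPE + 1)
    ]
```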
Examples and use cases
Imagine a scenario where a user wants to expedite product record creation by automatically extracting a product’s name, price, and country of origin from a detailed description. In practice, the user wants the Product Name, Price, and Country of Origin fields to be automatically populated based on the Description field’s content.
For instance, consider a new product record with the following description:
"The Aerodynamic Helmet by Velocity Bikes is expertly crafted in France for speed, style, and safety. Its sleek profile reduces drag while prioritizing rider protection. Priced at $129.99, it’s the ultimate choice for safety-conscious cyclists."
Two methods can be employed to achieve the desired result.
- Using the Source Text for Extraction method, a model designer can configure the enricher as follows:
  - In the plugin properties:
    - String output 1 (STRING_OUT_1): Name of the product
    - String output 2 (STRING_OUT_2): Country of origin of the product
    - Number output 1 (NUMBER_OUT_1): Price of the product
  - In the plugin input properties:
    - Source Text for Extraction: Description
  - In the plugin output properties:
    - ProductName: String output 1 (STRING_OUT_1)
    - Origin: String output 2 (STRING_OUT_2)
    - Price: Number output 1 (NUMBER_OUT_1)
- Using the User Prompt method, the model designer can configure the enricher as follows:
  - In the plugin input properties:
    - User Prompt:
      'From ' || Description || ' extract the following information in a structured JSON format: STRING_OUT_1: the product name, STRING_OUT_2: the country of origin, NUMBER_OUT_1: the product price.'
  - In the plugin output properties:
    - ProductName: String output 1 (STRING_OUT_1)
    - Origin: String output 2 (STRING_OUT_2)
    - Price: Number output 1 (NUMBER_OUT_1)
Regardless of the method selected, the enricher’s response populates the Product Name, Country of Origin, and Price fields in the new record with the following information:
- Product Name: Aerodynamic Helmet
- Country of Origin: France
- Price: 129.99
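The mapping step in this example can be sketched in a few lines. The code below is illustrative only: it assumes the model answers with a flat JSON object keyed by the output keys, and the `sample` response string is invented for the example rather than captured from the plugin. The attribute names come from the plugin output properties above.

```python
import json

# Attribute-to-output-key mapping, as configured in the example's
# plugin output properties.
OUTPUT_MAPPING = {
    "ProductName": "STRING_OUT_1",
    "Origin": "STRING_OUT_2",
    "Price": "NUMBER_OUT_1",
}


def apply_response(response_text, mapping=OUTPUT_MAPPING):
    """Parse a JSON response and return attribute values keyed by attribute name.

    Keys missing from the response yield None for the matching attribute.
    """
    data = json.loads(response_text)
    return {attribute: data.get(key) for attribute, key in mapping.items()}


# Hypothetical response, assuming a flat JSON object keyed by output keys.
sample = '{"STRING_OUT_1": "Aerodynamic Helmet", "STRING_OUT_2": "France", "NUMBER_OUT_1": 129.99}'
record = apply_response(sample)
print(record)
# {'ProductName': 'Aerodynamic Helmet', 'Origin': 'France', 'Price': 129.99}
```

Both configuration methods end with this same mapping, which is why they populate the record identically.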