
LLM Index Document

The LLM Index Document task is used to index a document into a vector database for efficient search, retrieval, and processing at a later stage.

The task uses a large language model (LLM) to create embeddings of the indexed document text, which are then stored in the vector database for later retrieval.

Task configuration

Configure these parameters for the LLM Index Document task.

Each entry below gives the parameter, whether it is required or optional, and its description.

inputParameters.vectorDB (Required)
The vector database in which to store the data.

Note: If you haven’t configured the vector database on your Orkes Conductor cluster, navigate to the Integrations tab and configure your required provider. Refer to the documentation on how to integrate Vector Databases with Orkes console.

inputParameters.index (Required)
The index in your vector database where the text or data will be stored.

The terminology of the index field varies depending on the integration:
  • For Weaviate, the index field indicates the class name.
  • For other integrations, it denotes the index name.

inputParameters.namespace (Required)
Namespaces are separate, isolated environments within the database used to manage and organize vector data. Choose from the namespaces configured within the chosen vector database.

The usage and terminology of the namespace field vary depending on the integration:
  • For Pinecone, the namespace field is applicable.
  • For Weaviate, the namespace field is not applicable.
  • For MongoDB, the namespace field is referred to as “Collection”.
  • For Postgres, the namespace field is referred to as “Table”.

inputParameters.embeddingModelProvider (Required)
The LLM provider for generating the embeddings.

Note: If you haven’t configured your AI/LLM provider on your Orkes console, navigate to the Integrations tab and configure your required provider. Refer to the documentation on how to integrate the LLM providers with Orkes Conductor.

inputParameters.embeddingModel (Required)
The embedding model, offered by the selected LLM provider, that generates the embeddings.

inputParameters.url (Required)
The URL of the file to be indexed.

inputParameters.mediaType (Optional)
The media type of the file to be indexed. Supported media types:
  • application/java-archive
  • application/EDI-X12
  • application/EDIFACT
  • application/javascript
  • application/octet-stream
  • application/ogg
  • application/pdf
  • application/xhtml+xml
  • application/x-shockwave-flash
  • application/json
  • application/ld+json
  • application/xml
  • application/zip
  • application/x-www-form-urlencoded
  • audio/mpeg
  • audio/x-ms-wma
  • audio/vnd.rn-realaudio
  • audio/x-wav
  • image/gif
  • image/jpeg
  • image/png
  • image/tiff
  • image/vnd.microsoft.icon
  • image/x-icon
  • image/vnd.djvu
  • image/svg+xml

inputParameters.chunkSize (Optional)
The length of each text segment when the document is divided for processing by the LLM. For example, if the document contains 2,000 words and the chunk size is set to 500, the document is divided into four chunks for processing (assuming no chunk overlap).

inputParameters.chunkOverlap (Optional)
The overlap between adjacent chunks. For example, if the chunk overlap is set to 100, the first 100 words of each chunk repeat the last 100 words of the previous chunk.
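
To make the interaction between chunk size and chunk overlap concrete, here is a minimal Python sketch of word-based chunking. It is an illustration only: the function name chunk_words is hypothetical, and it assumes chunks are measured in whitespace-separated words, as in the examples above. The splitter that the task actually uses may measure length differently.

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Split text into chunks of up to `chunk_size` words.

    Each chunk after the first starts with the last `chunk_overlap`
    words of the previous chunk. Hypothetical helper for illustration.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the document
    return chunks


# A synthetic 2,000-word document.
document = " ".join(f"word{i}" for i in range(2000))

print(len(chunk_words(document, 500)))       # 4
print(len(chunk_words(document, 500, 100)))  # 5
```

With an overlap of 100, each window advances by only 400 words, so the same 2,000-word document yields five chunks instead of four.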

Task definition

The following is an example JSON configuration for an LLM Index Document task.

{
  "name": "llm_index_document_task",
  "taskReferenceName": "llm_index_document_task_ref",
  "inputParameters": {
    "vectorDB": "pineconedb",
    "namespace": "myNewModel",
    "index": "test",
    "embeddingModelProvider": "azure_openai",
    "embeddingModel": "text-embedding-ada-002",
    "url": "${workflow.input.url}",
    "mediaType": "application/xhtml+xml",
    "chunkSize": 500,
    "chunkOverlap": 100
  },
  "type": "LLM_INDEX_DOCUMENT"
}
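
If you generate workflow definitions programmatically, a small helper can catch missing required parameters before the workflow is registered. The Python sketch below is one possible approach, not part of any Orkes SDK: build_index_document_task and REQUIRED_PARAMS are hypothetical names, and the integration and model names are placeholders to be replaced with the ones configured on your cluster.

```python
import json

# Parameters marked as required in the task configuration above.
REQUIRED_PARAMS = (
    "vectorDB",
    "index",
    "namespace",
    "embeddingModelProvider",
    "embeddingModel",
    "url",
)


def build_index_document_task(ref_name: str, **input_parameters) -> dict:
    """Return an LLM_INDEX_DOCUMENT task as a dict (hypothetical helper)."""
    missing = [p for p in REQUIRED_PARAMS if not input_parameters.get(p)]
    if missing:
        raise ValueError("missing required inputParameters: " + ", ".join(missing))
    return {
        "name": "llm_index_document_task",
        "taskReferenceName": ref_name,
        "inputParameters": input_parameters,
        "type": "LLM_INDEX_DOCUMENT",
    }


task = build_index_document_task(
    "llm_index_document_task_ref",
    vectorDB="pineconedb",                    # placeholder integration name
    namespace="myNewModel",
    index="test",
    embeddingModelProvider="azure_openai",    # placeholder integration name
    embeddingModel="text-embedding-ada-002",  # assumed embedding model name
    url="${workflow.input.url}",
    mediaType="application/xhtml+xml",
    chunkSize=500,
    chunkOverlap=100,
)

# Serialize to JSON for inclusion in a workflow definition.
print(json.dumps(task, indent=2))
```

The check covers only the presence of required parameters; it does not verify that the named integrations, index, or model exist on the cluster.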

Task output

The LLM Index Document task returns no output. It stores the indexed data in the specified vector database.

Adding an LLM Index Document task in UI

To add an LLM Index Document task:

  1. In your workflow, select the (+) icon and add an LLM Index Document task.
  2. Choose the Vector database, Index, Namespace, Embedding model provider, and Embedding model.
  3. Enter the URL of the document to be indexed.
  4. Choose the Media type, and enter the Chunk Size and Chunk Overlap.

Figure: LLM Index Document task in the UI.