LLM Index Document

A system task designed to index a provided document into a vector database for efficient search, retrieval, and processing at a later stage.

Definitions

{
  "name": "llm_index_document_task",
  "taskReferenceName": "llm_index_document_task_ref",
  "inputParameters": {
    "vectorDB": "pineconedb",
    "namespace": "myNewModel",
    "index": "test",
    "embeddingModelProvider": "azure_openai",
    "embeddingModel": "text-davinci-003",
    "url": "${workflow.input.url}",
    "mediaType": "application/xhtml+xml",
    "chunkSize": 500,
    "chunkOverlap": 100
  },
  "type": "LLM_INDEX_DOCUMENT"
}
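The url input above uses the expression ${workflow.input.url}, which Conductor resolves from the workflow’s input when the workflow is started. A minimal sketch of a start-workflow request payload that supplies this input (the workflow name my_indexing_workflow and the document URL are placeholder assumptions, not part of the task definition):

```python
import json

# Hypothetical start-workflow request; the workflow name and document
# URL below are placeholders for your own setup.
start_request = {
    "name": "my_indexing_workflow",
    "version": 1,
    "input": {
        # This value is substituted wherever ${workflow.input.url}
        # appears in the task's inputParameters.
        "url": "https://example.com/docs/guide.xhtml"
    },
}

payload = json.dumps(start_request)
print(payload)
```

A payload like this would typically be POSTed to the Conductor server’s start-workflow endpoint.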

Input Parameters

vectorDB: Choose the required vector database.

Note: If you haven’t configured the vector database in your Orkes console, navigate to the Integrations tab and configure your required provider. Refer to the documentation on how to integrate vector databases with the Orkes console.
namespace: Choose from the available namespaces configured within the chosen vector database.

Namespaces are separate isolated environments within the database to manage and organize vector data effectively.

Note: The namespace field has different names and applicability based on the integration:
  • For Pinecone integration, the namespace field is applicable.
  • For Weaviate integration, the namespace field is not applicable.
  • For MongoDB integration, the namespace field corresponds to a “Collection”.
  • For Postgres integration, the namespace field corresponds to a “Table”.
index: Choose the index in your vector database where the indexed text or data will be stored.

Note: For Weaviate integration, this field refers to the class name, while for other integrations, it denotes the index name.
embeddingModelProvider: Choose the required LLM provider for embedding.

Note: If you haven’t configured your AI/LLM provider in your Orkes console, navigate to the Integrations tab and configure your required provider. Refer to the documentation on how to integrate LLM providers with the Orkes console.
embeddingModel: Choose from the available embedding models provided by the selected LLM provider.
url: Provide the URL of the file to be indexed.
mediaType: Select the media type of the file to be indexed. Currently supported media types include:
  • application/java-archive
  • application/EDI-X12
  • application/EDIFACT
  • application/javascript
  • application/octet-stream
  • application/ogg
  • application/pdf
  • application/xhtml+xml
  • application/x-shockwave-flash
  • application/json
  • application/ld+json
  • application/xml
  • application/zip
  • application/x-www-form-urlencoded
  • audio/mpeg
  • audio/x-ms-wma
  • audio/vnd.rn-realaudio
  • audio/x-wav
  • image/gif
  • image/jpeg
  • image/png
  • image/tiff
  • image/vnd.microsoft.icon
  • image/x-icon
  • image/vnd.djvu
  • image/svg+xml
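When the media type is not known in advance, it can often be inferred from the file’s extension. A small sketch using Python’s standard mimetypes module (the fallback to application/octet-stream is a choice made for this example, not behavior of the task itself):

```python
import mimetypes

def guess_media_type(url: str) -> str:
    """Guess a mediaType value from a file URL's extension,
    falling back to application/octet-stream when unknown."""
    media_type, _ = mimetypes.guess_type(url)
    return media_type or "application/octet-stream"

print(guess_media_type("https://example.com/report.pdf"))   # application/pdf
print(guess_media_type("https://example.com/data.unknown")) # application/octet-stream
```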
chunkSize: Specifies how long each input text segment should be when the document is divided for processing by the LLM.

For example, if the article contains 2000 words and the chunk size is configured as 500, then the document would be divided into four chunks for processing.
chunkOverlap: Specifies the overlap between adjacent chunks.

For example, if the chunk overlap is specified as 100, then the first 100 words of each chunk would overlap with the last 100 words of the previous chunk.
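The interplay of chunkSize and chunkOverlap can be sketched with a simple word-based splitter. This is an illustrative approximation only; the task’s actual chunking strategy (for example, token-based rather than word-based splitting) may differ:

```python
def chunk_words(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Split text into chunks of chunk_size words, where each chunk
    repeats the last chunk_overlap words of the previous chunk."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# With no overlap, a 2000-word document at chunk size 500 yields the
# four chunks described in the example above; adding overlap produces
# more (shifted) chunks.
doc = " ".join(f"w{i}" for i in range(2000))
print(len(chunk_words(doc, 500, 0)))    # 4
print(len(chunk_words(doc, 500, 100)))  # 5
```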

Examples



  1. Add task type LLM Index Document.
  2. Choose the vector database and the LLM provider for embedding the document.
  3. Provide the document URL to be indexed and other input parameters.

[Image: LLM Index Document task configuration in the Orkes UI]