Transforming and Enriching
Transformers are the bread and butter of an indexing pipeline. They can transform the chunk; extract, modify, and add metadata; add vectors; and probably a whole lot more that we haven’t thought of.
There are two ways to apply a transformer: per node or in batch.
The Transformer trait
The Transformer trait is very straightforward:
Or in human language: “You get a node, do your thing, then return a result with the node”. That’s it.
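That contract can be sketched with a few lines of Rust. The following is a simplified, synchronous stand-in rather than the library's own definition: the `Node` struct, the `String` error type, and the `ChunkLength` transformer are assumptions made for illustration, and the real trait may differ (for example by being async).

```rust
// Stand-in node type for illustration only; not the library's definition.
#[derive(Debug, Clone, PartialEq)]
struct Node {
    chunk: String,
    metadata: Vec<(String, String)>,
}

// Simplified, synchronous version of the per-node contract:
// take a node, do your thing, return a result with the node.
trait Transformer {
    fn transform_node(&self, node: Node) -> Result<Node, String>;
}

// Hypothetical transformer that records the chunk length as metadata.
struct ChunkLength;

impl Transformer for ChunkLength {
    fn transform_node(&self, mut node: Node) -> Result<Node, String> {
        if node.chunk.is_empty() {
            return Err("empty chunk".to_string());
        }
        let len = node.chunk.chars().count();
        node.metadata.push(("length".to_string(), len.to_string()));
        Ok(node)
    }
}

fn main() {
    let node = Node {
        chunk: "hello".to_string(),
        metadata: Vec::new(),
    };
    let out = ChunkLength.transform_node(node).expect("transform failed");
    println!("{:?}", out.metadata);
}
```

Returning a `Result` lets a transformer reject a node without panicking, so the caller can decide how to handle the failure.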
The BatchableTransformer trait
Batchable transformers take a list of nodes and then yield them back as an IndexingStream.
The trait is defined as follows:
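As a rough illustration of that shape, here is a simplified, synchronous sketch. It does not reproduce the library's definition: `Node` is a stand-in struct, `IndexingStream` is approximated by a boxed iterator of results, and `Uppercase` is a hypothetical transformer. The real trait may differ, for example by being async and returning a true stream.

```rust
// Stand-in node type for illustration only; not the library's definition.
#[derive(Debug, Clone, PartialEq)]
struct Node {
    chunk: String,
}

// Assumed approximation of the stream type: a boxed iterator of per-node results.
type IndexingStream = Box<dyn Iterator<Item = Result<Node, String>>>;

// Simplified batch contract: take a list of nodes, yield them back as a stream.
trait BatchableTransformer {
    fn batch_transform(&self, nodes: Vec<Node>) -> IndexingStream;
}

// Hypothetical transformer that uppercases every chunk in the batch.
struct Uppercase;

impl BatchableTransformer for Uppercase {
    fn batch_transform(&self, nodes: Vec<Node>) -> IndexingStream {
        Box::new(nodes.into_iter().map(|mut node| -> Result<Node, String> {
            node.chunk = node.chunk.to_uppercase();
            Ok(node)
        }))
    }
}

fn main() {
    let nodes = vec![
        Node { chunk: "a".to_string() },
        Node { chunk: "b".to_string() },
    ];
    for result in Uppercase.batch_transform(nodes) {
        println!("{:?}", result);
    }
}
```

Receiving the whole batch at once is what makes this variant useful for work that is cheaper in bulk, such as sending many chunks to an embedding model in a single request.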
Closures
Transformer and BatchableTransformer are also implemented for closures! That means you can also do the following:
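One way such closure support can work is a blanket implementation over `Fn` types. The sketch below shows the idea with a simplified, synchronous stand-in trait; the `Node` struct, the `String` error type, and the `apply` helper are assumptions made for illustration and are not the library's API.

```rust
// Stand-in node type for illustration only; not the library's definition.
#[derive(Debug, Clone, PartialEq)]
struct Node {
    chunk: String,
}

// Simplified, synchronous stand-in for the per-node trait.
trait Transformer {
    fn transform_node(&self, node: Node) -> Result<Node, String>;
}

// Blanket implementation: any closure with a matching signature is a Transformer.
impl<F> Transformer for F
where
    F: Fn(Node) -> Result<Node, String>,
{
    fn transform_node(&self, node: Node) -> Result<Node, String> {
        self(node)
    }
}

// Hypothetical helper standing in for a pipeline step that accepts any Transformer.
fn apply<T: Transformer>(transformer: &T, node: Node) -> Result<Node, String> {
    transformer.transform_node(node)
}

fn main() {
    // A closure used directly where a Transformer is expected.
    let shout = |mut node: Node| -> Result<Node, String> {
        node.chunk = node.chunk.to_uppercase();
        Ok(node)
    };
    let out = apply(&shout, Node { chunk: "hi".to_string() }).unwrap();
    println!("{}", out.chunk);
}
```

Closures suit small one-off tweaks; a named struct is usually the better fit once a transformer needs configuration or shared state.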
Built-in transformers
Name | Description | Feature Flag |
---|---|---|
Embed | Generic embedding transformer, requires an embedding model | |
MetadataKeywords | Uses an LLM to extract keywords and add as metadata | |
MetadataQACode | Uses an LLM to generate questions and answers for Code | |
MetadataQAText | Uses an LLM to generate questions and answers for Text | |
MetadataSummary | Uses an LLM to generate a summary | |
MetadataTitle | Uses an LLM to generate a title | |
HtmlToMarkdownTransformer | Converts html in a node to markdown | scraping |
MetadataRefsDefsCode | Extracts references and definitions with tree-sitter from code | tree-sitter |