Transforming and Enriching
Transformers are the bread and butter of an indexing pipeline. They can transform the chunk; extract, modify, and add metadata; add vectors; and probably a whole lot more that we haven’t thought of.
There are two ways to apply a transformer: per node or in batch.
The Transformer trait
The Transformer trait is very straightforward:
pub trait Transformer: Send + Sync {
    async fn transform_node(&self, node: Node) -> Result<Node>;

    fn concurrency(&self) -> Option<usize> {
        None
    }
}
Or in human language: “You get a node, do your thing, then return a result with the node”. That’s it.
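To make "do your thing" concrete, here is a minimal sketch of a custom transformer. It is written to compile without any crates, so `Node`, `Result`, and the trait itself are simplified stand-ins (assumptions) rather than the library's real definitions. `TrimChunk` and `block_on` are hypothetical names introduced only for this illustration.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Stand-ins (assumptions): the real `Node` and `Result` come from the library.
#[derive(Debug, Clone, PartialEq)]
pub struct Node {
    pub chunk: String,
}
pub type Result<T> = std::result::Result<T, String>;

// Same shape as the trait shown above, redeclared so the sketch is self-contained.
#[allow(async_fn_in_trait)]
pub trait Transformer: Send + Sync {
    async fn transform_node(&self, node: Node) -> Result<Node>;

    fn concurrency(&self) -> Option<usize> {
        None
    }
}

// Hypothetical transformer: trims whitespace and rejects empty chunks.
pub struct TrimChunk;

impl Transformer for TrimChunk {
    async fn transform_node(&self, mut node: Node) -> Result<Node> {
        node.chunk = node.chunk.trim().to_string();
        if node.chunk.is_empty() {
            return Err("empty chunk".to_string());
        }
        Ok(node)
    }
}

// Tiny executor for this sketch only; real code would use an async runtime.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

Calling `block_on(TrimChunk.transform_node(node))` returns the trimmed node, or an error if nothing is left after trimming. Returning `Err` is how a transformer in this sketch signals that a node could not be processed.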
The BatchableTransformer trait
Batchable transformers take a list of nodes and yield them back as an IndexingStream.
The trait is defined as follows:
pub trait BatchableTransformer: Send + Sync {
    async fn batch_transform(&self, nodes: Vec<Node>) -> IndexingStream;

    fn concurrency(&self) -> Option<usize> {
        None
    }
}
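A batch transformer is useful when the work needs to see several nodes at once. The sketch below removes duplicate chunks within a batch. As before, it is self-contained: `Node`, `Result`, and especially `IndexingStream` are simplified stand-ins (here just a wrapper around a `Vec`, which is an assumption and not how the library defines it), and `DedupChunks` and `block_on` are hypothetical names.

```rust
use std::collections::HashSet;
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Stand-ins (assumptions) for the library types.
#[derive(Debug, Clone, PartialEq)]
pub struct Node {
    pub chunk: String,
}
pub type Result<T> = std::result::Result<T, String>;

// Stand-in: a plain Vec wrapper, only mirroring the `iter` constructor.
pub struct IndexingStream(pub Vec<Result<Node>>);

impl IndexingStream {
    pub fn iter(items: impl IntoIterator<Item = Result<Node>>) -> Self {
        IndexingStream(items.into_iter().collect())
    }
}

// Same shape as the trait shown above.
#[allow(async_fn_in_trait)]
pub trait BatchableTransformer: Send + Sync {
    async fn batch_transform(&self, nodes: Vec<Node>) -> IndexingStream;

    fn concurrency(&self) -> Option<usize> {
        None
    }
}

// Hypothetical batch transformer: keeps the first node for each distinct chunk.
pub struct DedupChunks;

impl BatchableTransformer for DedupChunks {
    async fn batch_transform(&self, nodes: Vec<Node>) -> IndexingStream {
        let mut seen = HashSet::new();
        IndexingStream::iter(
            nodes
                .into_iter()
                // `insert` returns false when the chunk was already present.
                .filter(|node| seen.insert(node.chunk.clone()))
                .map(Ok),
        )
    }
}

// Tiny executor for this sketch only; real code would use an async runtime.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

Because the return type is a stream of results rather than a `Vec<Node>`, a batch transformer is free to yield fewer (or more) nodes than it received, which a per-node transformer cannot do.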
Closures
Transformer and BatchableTransformer are also implemented for closures! That means you can also do the following:
pipeline.then(|mut node| {
    node.chunk = node.chunk.to_uppercase();
    Ok(node)
})

// Or in batch
pipeline.then_in_batch(|nodes| {
    IndexingStream::iter(nodes.into_iter().map(|mut node| {
        node.chunk = node.chunk.to_uppercase();
        Ok(node)
    }))
})
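For readers wondering how a closure can stand in for a trait implementation, the usual Rust technique is a blanket implementation over `Fn`. The sketch below shows how that could look for a per-node transformer. It illustrates the technique under simplified stand-in types and is not taken from the library's source; `apply` and `block_on` are hypothetical helpers.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Stand-ins (assumptions) for the library types.
#[derive(Debug, Clone, PartialEq)]
pub struct Node {
    pub chunk: String,
}
pub type Result<T> = std::result::Result<T, String>;

#[allow(async_fn_in_trait)]
pub trait Transformer: Send + Sync {
    async fn transform_node(&self, node: Node) -> Result<Node>;
}

// Blanket implementation: any matching closure or function is a Transformer.
impl<F> Transformer for F
where
    F: Fn(Node) -> Result<Node> + Send + Sync,
{
    async fn transform_node(&self, node: Node) -> Result<Node> {
        self(node)
    }
}

// Tiny executor for this sketch only; real code would use an async runtime.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

// Hypothetical helper: accepts anything implementing Transformer,
// which is what lets a pipeline step take a struct or a closure alike.
pub fn apply<T: Transformer>(transformer: &T, node: Node) -> Result<Node> {
    block_on(transformer.transform_node(node))
}
```

With this in place, `apply(&|mut node: Node| -> Result<Node> { node.chunk = node.chunk.to_uppercase(); Ok(node) }, node)` type-checks the same way a dedicated struct would.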
Built-in transformers
Name | Description | Feature Flag |
---|---|---|
Embed | Generic embedding transformer, requires an embedding model | |
MetadataKeywords | Uses an LLM to extract keywords and add as metadata | |
MetadataQACode | Uses an LLM to generate questions and answers for Code | |
MetadataQAText | Uses an LLM to generate questions and answers for Text | |
MetadataSummary | Uses an LLM to generate a summary | |
MetadataTitle | Uses an LLM to generate a title | |
HtmlToMarkdownTransformer | Converts HTML in a node to Markdown | scraping |
MetadataRefsDefsCode | Extracts references and definitions with tree-sitter from code | tree-sitter |