Transforming and Enriching
Transformers are the bread and butter of an indexing pipeline. They can transform the chunk, extract, modify and add metadata, adding vectors, and probably a whole lot more that we haven’t thought of.
There’s two ways to apply a transformer. Per node or in batch.
The Transformer trait
The Transformer trait is very straightforward:
pub trait Transformer: Send + Sync { async fn transform_node(&self, node: Node) -> Result<Node>;
fn concurrency(&self) -> Option<usize> { None }}Or in human language: “You get a node, do your thing, then return a result with the node”. That’s it.
The BatchableTransformer trait
Batchable transformers take a list of nodes, and then yield it back as an IndexingStream.
The trait is defined as follows:
pub trait BatchableTransformer: Send + Sync { async fn batch_transform(&self, nodes: Vec<Node>) -> IndexingStream;
fn concurrency(&self) -> Option<usize> { None }}Closures
Transformer and BatchableTransformer are also implemented for closures! That means you can also do the following:
pipeline.then(|mut node| { node.chunk = node.chunk.to_uppercase().to_string();
Ok(node)})
// Or in batch
pipeline.then_in_batch(|mut nodes| { IndexingStream::iter(nodes .into_iter() .map(|mut node| { node.chunk = node.chunk.to_uppercase().to_string();
Ok(Node) }) )})Built in transformers
| Name | Description | Feature Flag |
|---|---|---|
| Embed | Generic embedding transformer, requires an embedding model | |
| MetadataKeywords | Uses an LLM to extract keywords and add as metadata | |
| MetadataQACode | Uses an LLM to generate questions and answers for Code | |
| MetadataQAText | Uses an LLM to generate questions and answers for Text | |
| MetadataSummary | Uses an LLM to generate a summary | |
| MetadataTitle | Uses an LLM to generate a title | |
| HtmlToMarkdownTransformer | Converts html in a node to markdown | scraping |
| MetadataRefsDefsCode | Extracts references and definitions with tree-sitter from code | tree-sitter |