
Transforming and Enriching

Transformers are the bread and butter of an indexing pipeline. They can transform the chunk, extract, modify, and add metadata, add vectors, and probably do a whole lot more that we haven’t thought of.

There are two ways to apply a transformer: per node or in batch.

The Transformer trait

The Transformer trait is very straightforward:

pub trait Transformer: Send + Sync {
    async fn transform_node(&self, node: Node) -> Result<Node>;

    fn concurrency(&self) -> Option<usize> {
        None
    }
}

Or in human language: “You get a node, do your thing, then return a result with the node”. That’s it.
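
As a minimal sketch of what implementing the trait could look like, the example below defines a hypothetical WordCount transformer that stores the chunk's word count as metadata. To keep it self-contained, Node, Result, and the trait itself are local stand-ins that mirror the definition above rather than the library's real types (the metadata map and the String error type in particular are assumptions), and block_on is a tiny helper executor that exists only so the sketch can run without an async runtime.

```rust
use std::collections::HashMap;
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Stand-in types (assumptions): the real library provides its own Node and Result.
pub type Result<T> = std::result::Result<T, String>;

#[derive(Debug, Clone, Default, PartialEq)]
pub struct Node {
    pub chunk: String,
    pub metadata: HashMap<String, String>,
}

// Local copy of the trait shown above, so the sketch compiles on its own.
#[allow(async_fn_in_trait)]
pub trait Transformer: Send + Sync {
    async fn transform_node(&self, node: Node) -> Result<Node>;

    fn concurrency(&self) -> Option<usize> {
        None
    }
}

/// Hypothetical transformer: records the chunk's word count as metadata.
pub struct WordCount {
    pub concurrency: Option<usize>,
}

impl Transformer for WordCount {
    async fn transform_node(&self, mut node: Node) -> Result<Node> {
        // Returning an error is how a transformer signals a failed node.
        if node.chunk.trim().is_empty() {
            return Err("cannot count words in an empty chunk".to_string());
        }
        let count = node.chunk.split_whitespace().count();
        node.metadata.insert("word_count".to_string(), count.to_string());
        Ok(node)
    }

    // Override the default to suggest a concurrency for this step.
    fn concurrency(&self) -> Option<usize> {
        self.concurrency
    }
}

// Helper (not part of any library): polls a future to completion on this thread.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
        std::thread::yield_now();
    }
}
```

In a real pipeline the runtime drives the future, so block_on would not be needed; a value such as WordCount { concurrency: Some(4) } would simply be handed to the pipeline like any other transformer.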

The BatchableTransformer trait

Batchable transformers take a list of nodes and yield them back as an IndexingStream.

The trait is defined as follows:

pub trait BatchableTransformer: Send + Sync {
    async fn batch_transform(&self, nodes: Vec<Node>) -> IndexingStream;

    fn concurrency(&self) -> Option<usize> {
        None
    }
}
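
To illustrate the batch variant, here is a rough sketch of a hypothetical BatchPosition transformer that tags every node with its index in the batch and the batch size. Everything outside the trait signature is an assumption made so the example stands alone: Node and Result are local stand-ins, and IndexingStream is reduced to a thin wrapper around a Vec with an iter constructor, whereas the real type is an asynchronous stream. The block_on helper is again only there to run the future without an async runtime.

```rust
use std::collections::HashMap;
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Stand-in types (assumptions): the real library provides its own versions.
pub type Result<T> = std::result::Result<T, String>;

#[derive(Debug, Clone, Default, PartialEq)]
pub struct Node {
    pub chunk: String,
    pub metadata: HashMap<String, String>,
}

/// Simplified stand-in: eagerly collects results instead of streaming them.
pub struct IndexingStream(Vec<Result<Node>>);

impl IndexingStream {
    pub fn iter(nodes: impl IntoIterator<Item = Result<Node>>) -> Self {
        Self(nodes.into_iter().collect())
    }

    pub fn into_vec(self) -> Vec<Result<Node>> {
        self.0
    }
}

// Local copy of the trait shown above.
#[allow(async_fn_in_trait)]
pub trait BatchableTransformer: Send + Sync {
    async fn batch_transform(&self, nodes: Vec<Node>) -> IndexingStream;

    fn concurrency(&self) -> Option<usize> {
        None
    }
}

/// Hypothetical transformer: annotates each node with its place in the batch.
pub struct BatchPosition;

impl BatchableTransformer for BatchPosition {
    async fn batch_transform(&self, nodes: Vec<Node>) -> IndexingStream {
        // The whole batch is visible here, which per-node transformers never see.
        let batch_size = nodes.len();
        IndexingStream::iter(nodes.into_iter().enumerate().map(move |(index, mut node)| {
            node.metadata.insert("batch_index".to_string(), index.to_string());
            node.metadata.insert("batch_size".to_string(), batch_size.to_string());
            Ok(node)
        }))
    }
}

// Helper (not part of any library): polls a future to completion on this thread.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
        std::thread::yield_now();
    }
}
```

Having the full batch in hand is what makes this variant useful for work that is cheaper in bulk, such as sending many chunks to a model in a single request.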

Closures

Transformer and BatchableTransformer are also implemented for closures! That means you can also do the following:

pipeline.then(|mut node| {
    // to_uppercase() already returns an owned String
    node.chunk = node.chunk.to_uppercase();
    Ok(node)
})
// Or in batch
pipeline.then_in_batch(10, |nodes| {
    IndexingStream::iter(nodes.into_iter().map(|mut node| {
        node.chunk = node.chunk.to_uppercase();
        // Return the transformed value, not the Node type
        Ok(node)
    }))
})

Built-in transformers

| Name | Description | Feature Flag |
| --- | --- | --- |
| Embed | Generic embedding transformer, requires an embedding model | |
| MetadataKeywords | Uses an LLM to extract keywords and add as metadata | |
| MetadataQACode | Uses an LLM to generate questions and answers for Code | |
| MetadataQAText | Uses an LLM to generate questions and answers for Text | |
| MetadataSummary | Uses an LLM to generate a summary | |
| MetadataTitle | Uses an LLM to generate a title | |
| HtmlToMarkdownTransformer | Converts HTML in a node to Markdown | scraping |
| MetadataRefsDefsCode | Extracts references and definitions with tree-sitter from code | treesitter |