Transforming and Enriching

Transformers are the bread and butter of an indexing pipeline. They can transform the chunk, extract, modify, and add metadata, add vectors, and probably a whole lot more that we haven’t thought of.

There are two ways to apply a transformer: per node or in batch.

The `Transformer` trait

The Transformer trait is very straightforward:

pub trait Transformer: Send + Sync {
    async fn transform_node(&self, node: Node) -> Result<Node>;

    fn concurrency(&self) -> Option<usize> {
        None
    }
}

Or in human language: “You get a node, do your thing, then return a result with the node”. That’s it.
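As a minimal sketch of what an implementation could look like, the example below defines a hypothetical `WordCount` transformer that stores the number of words in the chunk as metadata. To stay self-contained it uses simplified, synchronous stand-ins for `Node`, `Result`, and `Transformer`. The `metadata` field as a plain string map and the `String` error type are assumptions made for illustration, not the library's actual definitions.

```rust
use std::collections::HashMap;

// Simplified stand-in for illustration only; the real node type has more fields.
#[derive(Debug, Clone, Default)]
pub struct Node {
    pub chunk: String,
    pub metadata: HashMap<String, String>,
}

// Assumed error type for this sketch.
pub type Result<T> = std::result::Result<T, String>;

// Synchronous stand-in for the trait shown above (the real method is async).
pub trait Transformer: Send + Sync {
    fn transform_node(&self, node: Node) -> Result<Node>;

    fn concurrency(&self) -> Option<usize> {
        None
    }
}

/// Hypothetical transformer that records the chunk's word count as metadata.
pub struct WordCount;

impl Transformer for WordCount {
    fn transform_node(&self, mut node: Node) -> Result<Node> {
        // "Do your thing": derive a value from the chunk and attach it.
        let count = node.chunk.split_whitespace().count();
        node.metadata
            .insert("word_count".to_string(), count.to_string());

        // Hand the (modified) node back to the pipeline.
        Ok(node)
    }
}
```

The transformer takes ownership of the node, mutates it, and returns it, so no cloning is needed along the way.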

The `BatchableTransformer` trait

Batchable transformers take a list of nodes and yield them back as an IndexingStream.

The trait is defined as follows:

pub trait BatchableTransformer: Send + Sync {
    async fn batch_transform(&self, nodes: Vec<Node>) -> IndexingStream;

    fn concurrency(&self) -> Option<usize> {
        None
    }
}
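Batching tends to pay off when a single call can handle many nodes at once, for example an embedding request. The sketch below illustrates that shape with a hypothetical `FakeEmbed` transformer. `Node`, `Result`, `IndexingStream`, and the trait itself are simplified, synchronous stand-ins (the stream is just a wrapped `Vec`), and `fake_embed_batch` is an invented placeholder for a real embedding call.

```rust
// Simplified stand-ins for illustration only.
#[derive(Debug, Clone, Default)]
pub struct Node {
    pub chunk: String,
    pub vector: Option<Vec<f32>>,
}

pub type Result<T> = std::result::Result<T, String>;

/// Stand-in for the real stream type: here it is just a list of results.
pub struct IndexingStream(pub Vec<Result<Node>>);

impl IndexingStream {
    pub fn iter(items: impl IntoIterator<Item = Result<Node>>) -> Self {
        IndexingStream(items.into_iter().collect())
    }
}

// Synchronous stand-in for the trait shown above (the real method is async).
pub trait BatchableTransformer: Send + Sync {
    fn batch_transform(&self, nodes: Vec<Node>) -> IndexingStream;

    fn concurrency(&self) -> Option<usize> {
        None
    }
}

/// Hypothetical embedder: one call produces a vector for every chunk.
fn fake_embed_batch(chunks: &[&str]) -> Vec<Vec<f32>> {
    chunks.iter().map(|c| vec![c.len() as f32]).collect()
}

pub struct FakeEmbed;

impl BatchableTransformer for FakeEmbed {
    fn batch_transform(&self, nodes: Vec<Node>) -> IndexingStream {
        // Gather all chunks so the embedder is called once for the whole batch.
        let chunks: Vec<&str> = nodes.iter().map(|n| n.chunk.as_str()).collect();
        let vectors = fake_embed_batch(&chunks);

        // Pair each node with its vector and yield the nodes back as a stream.
        IndexingStream::iter(nodes.into_iter().zip(vectors).map(|(mut node, vector)| {
            node.vector = Some(vector);
            Ok(node)
        }))
    }
}
```

Because each item in the stream is a `Result`, a batch transformer can report a failure for one node without discarding the rest of the batch.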

Closures

Transformer and BatchableTransformer are also implemented for closures! That means you can also do the following:

pipeline.then(|mut node| {
    // `to_uppercase` already returns a `String`
    node.chunk = node.chunk.to_uppercase();

    Ok(node)
})

// Or in batch

pipeline.then_in_batch(|nodes| {
    IndexingStream::iter(nodes
        .into_iter()
        .map(|mut node| {
            node.chunk = node.chunk.to_uppercase();

            // Return the transformed node itself, not the `Node` type
            Ok(node)
        })
    )
})
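One common way to make a trait usable with closures is a blanket implementation over `Fn`. The sketch below shows that general Rust pattern with simplified, synchronous stand-ins. It is not necessarily how the library implements it, and `apply` and `uppercase_example` are hypothetical helpers added for illustration.

```rust
// Simplified stand-ins for illustration only.
#[derive(Debug, Clone, Default)]
pub struct Node {
    pub chunk: String,
}

pub type Result<T> = std::result::Result<T, String>;

pub trait Transformer: Send + Sync {
    fn transform_node(&self, node: Node) -> Result<Node>;
}

// Blanket implementation: any thread-safe closure with the right
// signature can be used wherever a `Transformer` is expected.
impl<F> Transformer for F
where
    F: Fn(Node) -> Result<Node> + Send + Sync,
{
    fn transform_node(&self, node: Node) -> Result<Node> {
        self(node)
    }
}

/// Hypothetical helper that accepts anything implementing the trait.
pub fn apply(transformer: impl Transformer, node: Node) -> Result<Node> {
    transformer.transform_node(node)
}

/// Passes a closure where a `Transformer` is expected.
pub fn uppercase_example() -> Result<Node> {
    apply(
        |mut node: Node| -> Result<Node> {
            node.chunk = node.chunk.to_uppercase();
            Ok(node)
        },
        Node {
            chunk: "hello".to_string(),
        },
    )
}
```

The explicit parameter and return types on the closure are there to keep type inference unambiguous in this standalone sketch.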

Built-in transformers

Name	Description	Feature Flag
Embed	Generic embedding transformer, requires an embedding model
MetadataKeywords	Uses an LLM to extract keywords and add as metadata
MetadataQACode	Uses an LLM to generate questions and answers for Code
MetadataQAText	Uses an LLM to generate questions and answers for Text
MetadataSummary	Uses an LLM to generate a summary
MetadataTitle	Uses an LLM to generate a title
HtmlToMarkdownTransformer	Converts HTML in a node to Markdown	scraping
MetadataRefsDefsCode	Extracts references and definitions with tree-sitter from code	tree-sitter