Skip to content

Transforming and Enriching

Transformers are the bread and butter of an ingestion pipeline. They can transform the chunk, extract, modify and add metadata, adding vectors, and probably a whole lot more that we haven’t thought of.

There’s two ways to apply a transformer. Per node or in batch.

The Transformer trait

The Transformer trait is very straightforward:

pub trait Transformer: Send + Sync {
async fn transform_node(&self, node: IngestionNode) -> Result<IngestionNode>;
fn concurrency(&self) -> Option<usize> {
None
}
}

Or in human language: “You get a node, do your thing, then return a result with the node”. That’s it.

In batches, the BatchableTransformer trait is similar, except that it needs to return a stream. See docs.rs for more details.

Built in transformers

NameDescriptionFeature Flag
EmbedGeneric embedding transformer, requires an LLM
MetadataKeywordsUses an LLM to extract keywords and add as metadata
MetadataQACodeUses an LLM to generate questions and answers for Code
MetadataQATextUses an LLM to generate questions and answers for Text
MetadataSummaryUses an LLM to generate a summary
MetadataTitleUses an LLM to generate a title
HtmlToMarkdownTransformerConverts html in a node to markdownscraping