Skip to content

Architecture and Design

Design principles

  • Modular: The pipeline is built from small, composable parts.
  • Extensible: It is easy to add new parts to the pipeline by extending straightforward traits.
  • Performance: Performance and ease-of-use are the main goals of the library. Performance always has priority.
  • Tracable: tracing is used throughout the pipeline.

When designing integrations, transformers, chunkers

  • Simple: The API should be simple and easy to use.
  • Sane defaults, fully configurable: The library should have sane defaults that are easy to override.
  • Builder pattern: The builder pattern is used to create new instances of the pipeline.

The-things-we-talk-about

  • Pipeline: The main struct that holds the pipeline. It is a stream of Nodes.
  • Node: The main struct that holds the data. It has a path, chunk and metadata.
  • IndexingStream: The internal stream of Nodes in the pipeline.
  • Loader: The starting point of the stream, creates and emits Nodes.
  • Transformers: Some behaviour that modifies the Nodes.
  • BatchTransformers: Transformers that transform multiple nodes.
  • Chunkers: Transformers that split a node into multiple nodes.
  • Storages: Persist the Nodes.
  • NodeCache: Filters cached nodes.
  • Integrations: External libraries that can be used with the pipeline.

Pipeline structure and traits

  • from_loader (impl Loader) starting point of the stream, creates and emits Nodes
  • filter_cached (impl NodeCache) filters cached nodes
  • then (impl Transformer) transforms the node and puts it on the stream
  • then_in_batch (impl BatchTransformer) transforms multiple nodes and puts them on the stream
  • then_chunk (impl ChunkerTransformer) transforms a single node and emits multiple nodes
  • then_store_with (impl Storage) stores the nodes in a storage backend, this can be chained

Additionally, several generic transformers are implemented. They take implementers of SimplePrompt and EmbeddingModel to do their things.