Project: datalens
81 entity types
Matrix/Intent/Docling extraction system
CapabilityIntent

Docling extraction system

The Docling extraction system was built as part of the DataLens Phase 2 GPU-first document extraction system. Its core process is Docling extraction running on the elin GPU. Within the GPU-first pipeline it is the mandatory method for DOCX and PPTX extraction, and it is also the component against which Docling extraction quality is validated.

The system enforces three business rules. Semantic chunking: documents are chunked by section (DOCX) or slide (PPTX). Table embedding: tables extracted from documents are embedded as JSON within the semantic chunks. Rich metadata: each chunk carries hierarchy, confidence, and provenance to support DS-STAR reasoning.

The DocxExtractor and the PptxExtractor are the parts of the system that handle DOCX and PPTX documents, respectively, using GPU extraction. The system uses EmbeddingService to produce GPU-accelerated vector embeddings for the semantic chunks, generated with Ollama embeddings (nomic-embed-text) for semantic search and reasoning.

The RQ worker depends on the Docling extraction system to process DOCX/PPTX extraction jobs, and there is no fallback path: a Docling failure is not tolerated. The semantic chunks the system produces are stored in the DuckDB text_chunks physical table for querying and analysis. The Deploy_gpu_extractors.sh script installs and configures the Docling extraction system and the related GPU-first extraction components.
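
The three business rules (one chunk per section or slide, tables embedded as JSON, and hierarchy/confidence/provenance metadata) can be illustrated with a minimal Python sketch. This is not the DataLens implementation: the SemanticChunk class, the build_chunk function, every field name, and the choice to append each table as a JSON line after the body text are assumptions made for illustration, and the sample values are made up.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SemanticChunk:
    """Hypothetical chunk record; field names are illustrative only."""
    chunk_id: str
    text: str          # section/slide body, followed by any tables as JSON
    hierarchy: list    # heading path, e.g. ["Report", "2. Results"]
    confidence: float  # extraction confidence in [0, 1]
    provenance: dict   # where the chunk came from


def build_chunk(source_file, unit_index, hierarchy, body, tables, confidence):
    """Build one chunk for a single section (DOCX) or slide (PPTX)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    # Assumption: each table is a list of row dicts, embedded as one JSON line.
    table_lines = [json.dumps({"table": rows}, sort_keys=True) for rows in tables]
    text = "\n".join([body.strip(), *table_lines])
    return SemanticChunk(
        chunk_id=f"{source_file}#{unit_index}",
        text=text,
        hierarchy=list(hierarchy),
        confidence=confidence,
        provenance={
            "source_file": source_file,
            "unit_index": unit_index,
            "extractor": "docling",
        },
    )


# Made-up sample input: one section containing one small table.
chunk = build_chunk(
    source_file="report.docx",
    unit_index=2,
    hierarchy=["Report", "2. Results"],
    body="Quarterly totals are summarised below.",
    tables=[[{"quarter": "Q1", "total": 10}, {"quarter": "Q2", "total": 12}]],
    confidence=0.93,
)
print(json.dumps(asdict(chunk), indent=2))
```

Keeping the table JSON inside the chunk text means the table travels with its surrounding section through embedding and storage, at the cost of a longer text to embed; a real pipeline might instead keep tables in a separate column.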