MKB Explorer

Page	User Interface

backend/app/workers/extract.py

The extract.py worker runs as one of the RQ workers and processes extraction jobs asynchronously. For DOCX and PPTX files it calls the DOCX extractor and the PPTX extractor respectively, both of which perform GPU-first extraction using Docling. Docling is the mandatory extractor for these formats: under the no-fallback policy, if Docling fails, the worker fails the extraction hard instead of trying another extractor. The worker has been modified to write extracted data through pg_data_service.py instead of DuckDBService. After a successful DOCX/PPTX extraction, the worker chains the results to the batch vectorize job and the embedding service, which compute the embeddings in batch on GPU.
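The no-fallback flow described above can be illustrated with a minimal sketch. It is not the actual contents of extract.py: the names `run_extraction`, `ExtractionError`, `extract_docx`, `extract_pptx`, `save`, and `enqueue_vectorize` are hypothetical stand-ins for the Docling-based extractors, the pg_data_service.py write, and the chained batch vectorize job. Raising on any other file type is also an assumption made only to keep the sketch short.

```python
from pathlib import Path
from typing import Callable, Dict


class ExtractionError(RuntimeError):
    """Raised when the mandatory extractor fails; no fallback is attempted."""


# Hypothetical stand-ins for the Docling-based DOCX and PPTX extractors.
def extract_docx(path: str) -> dict:
    return {"path": path, "kind": "docx", "text": "..."}


def extract_pptx(path: str) -> dict:
    return {"path": path, "kind": "pptx", "text": "..."}


EXTRACTORS: Dict[str, Callable[[str], dict]] = {
    ".docx": extract_docx,
    ".pptx": extract_pptx,
}


def run_extraction(
    path: str,
    save: Callable[[dict], None],
    enqueue_vectorize: Callable[[str], None],
    extractors: Dict[str, Callable[[str], dict]] = EXTRACTORS,
) -> dict:
    """Extract one file, persist the result, then chain vectorization."""
    suffix = Path(path).suffix.lower()
    extractor = extractors.get(suffix)
    if extractor is None:
        # Assumption: this sketch only covers DOCX and PPTX.
        raise ExtractionError(f"no extractor registered for {suffix!r}")
    try:
        result = extractor(path)
    except Exception as exc:
        # No-fallback policy: fail hard instead of trying another extractor.
        raise ExtractionError(f"extraction failed for {path}") from exc
    save(result)             # stand-in for the pg_data_service.py write
    enqueue_vectorize(path)  # stand-in for chaining the batch vectorize job
    return result
```

Because persistence and chaining happen only after the extractor returns, a failed extraction leaves no partial rows behind and never schedules a vectorize job. In an RQ worker, the raised exception would mark the job as failed.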

Attributes

labels	Page,Entity
page type	Page
layout	Default
functional area	Data Extraction
validation rules	Ensure extraction handles both success and error states, with fallbacks and logging for debugging.

Relationships	10 connections

Related Entities

Page	backend/app/extractors/docx_extractor.py	User Interface	RELATES_TO
Page	backend/app/extractors/pptx_extractor.py	User Interface	RELATES_TO
Page	backend/app/services/embedding_service.py	User Interface	RELATES_TO
BatchJob	batch_vectorize_job	Integrations	RELATES_TO
ThirdPartyComponent	Docling	Architecture	RELATES_TO
Capability	pg_data_service.py	Intent	RELATES_TO
PasswordPolicy	RQ workers	Security	RELATES_TO