MKB Explorer

Page (User Interface)

backend/app/extractors/docx_extractor.py

DocxExtractor, defined in backend/app/extractors/docx_extractor.py, is the DOCX component of the Docling extraction system. It calls Docling on the elin GPU over SSH and produces semantic chunks that carry heading hierarchy, provenance, and other rich metadata, with tables embedded as JSON. The module was refactored for Phase 2 GPU-first processing, in which Docling on the elin GPU is a mandatory tool. The entry also mentions a python-docx fallback for simple files; under the no-fallback policy, however, the RQ worker that calls the extractor fails hard if Docling extraction fails. The resulting chunks support DS-STAR queries for document reasoning.
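As a rough illustration of the chunking described above, the following Python sketch attaches a heading hierarchy to each chunk and embeds tables as JSON. It is not the code in docx_extractor.py and does not use Docling's API: the names `Chunk`, `ExtractionError`, and `build_chunks`, and the shape of the input blocks, are all assumptions made for the example.

```python
import json
from dataclasses import dataclass, field


class ExtractionError(RuntimeError):
    """Hypothetical error type: raised instead of falling back to another tool."""


@dataclass
class Chunk:
    text: str
    heading_path: list  # headings above this chunk, outermost first
    metadata: dict = field(default_factory=dict)


def build_chunks(blocks, source):
    """Turn a flat list of blocks into chunks that carry their heading path.

    Assumed (hypothetical) block shapes:
      {"type": "heading", "level": 1, "text": "..."}
      {"type": "paragraph", "text": "..."}
      {"type": "table", "rows": [[...], ...]}
    """
    if not blocks:
        # Fail hard on empty output rather than retrying with another extractor.
        raise ExtractionError(f"no content extracted from {source}")

    chunks = []
    path = []
    for index, block in enumerate(blocks):
        kind = block["type"]
        if kind == "heading":
            # Keep only the ancestors of this heading, then append it.
            path = path[: block["level"] - 1] + [block["text"]]
            continue
        if kind == "table":
            # Tables are embedded in the chunk text as JSON.
            text = json.dumps({"table": block["rows"]})
        else:
            text = block["text"]
        chunks.append(
            Chunk(
                text=text,
                heading_path=list(path),
                metadata={"source": source, "block_index": index, "kind": kind},
            )
        )
    return chunks
```

Storing the heading path and the source position on every chunk is one simple way to provide the hierarchy and provenance that downstream queries can filter on.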

Attributes

labels	Entity, Page
page type	file
layout	standard
functional area	Data Extraction
validation rules	Ensure the script uses GPU-accelerated Docling for DOCX/PPTX extraction without fallback to CPU-based tools; validate extraction success and handle errors explicitly; confirm that GPU resource checks are in place before extraction; test file processing for different formats and sizes.
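
The GPU resource check named in the validation rules could look roughly like the Python sketch below, which queries free GPU memory over SSH with `nvidia-smi` and raises instead of falling back to a CPU tool. The function names, the error type, the memory threshold, and the use of `elin` as an SSH host alias are assumptions for illustration; the actual check in the codebase may differ.

```python
import subprocess


class GpuUnavailableError(RuntimeError):
    """Hypothetical error type: raised when the pre-check fails (no CPU fallback)."""


def parse_free_memory_mib(nvidia_smi_output):
    """Parse one free-memory value (MiB) per GPU from nvidia-smi CSV output."""
    values = [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]
    if not values:
        raise GpuUnavailableError("nvidia-smi reported no GPUs")
    return values


def check_gpu(host, min_free_mib, run=subprocess.run):
    """Return the largest free-memory value on `host`, or fail hard.

    `run` is injectable so the check can be exercised without a real SSH call.
    """
    cmd = [
        "ssh", host, "nvidia-smi",
        "--query-gpu=memory.free", "--format=csv,noheader,nounits",
    ]
    result = run(cmd, capture_output=True, text=True, timeout=30)
    if result.returncode != 0:
        raise GpuUnavailableError(f"GPU check on {host} failed: {result.stderr.strip()}")
    free = max(parse_free_memory_mib(result.stdout))
    if free < min_free_mib:
        raise GpuUnavailableError(f"only {free} MiB free on {host}, need {min_free_mib}")
    return free
```

A caller would run `check_gpu("elin", 4000)` before starting extraction and let any `GpuUnavailableError` propagate, so the job is marked failed rather than silently processed on the CPU.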

Relationships (14 connections)

Related Entities

Page: backend/app/workers/extract.py (User Interface)

RELATES_TO

ThirdPartyComponent: Docling (Architecture)

RELATES_TO

Capability: Docling extraction system (Intent)

RELATES_TO

Page: DS-STAR queries (User Interface)

RELATES_TO

Capability: GPU-first document extraction system (Intent)