MKB Explorer

ThirdPartyComponent · Architecture

DOCX extractor

The batch upload pipeline depends on several new extractors, including the DOCX extractor. The DOCX extractor is built on the python-docx third-party component. It can optionally use Docling as the extraction method for better text quality and semantic structure, and it falls back to python-docx for faster extraction of simple documents. To preserve document structure, it implements semantic chunking with section-based chunk boundaries. Tables are embedded as JSON within text chunks rather than stored as separate DuckDB tables. Processing is asynchronous and relies on background workers. The extractor's behavior is validated by the test_extractors test case.
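The section-based chunking and JSON table embedding described above can be illustrated with a minimal sketch. It is not the extractor's actual code: the `Block` tuple format, the `Chunk` class, and the `chunk_by_section` function are hypothetical names chosen for illustration. The sketch assumes that an earlier step (python-docx or Docling) has already turned the document into an ordered list of heading, paragraph, and table blocks.

```python
import json
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple

# Hypothetical intermediate form (an assumption, not the extractor's real model):
#   ("heading", "Title text")
#   ("paragraph", "Body text")
#   ("table", [[cell, cell, ...], ...])   # list of rows
Block = Tuple[str, object]


@dataclass
class Chunk:
    section: str                          # heading that opened this chunk
    parts: List[str] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(self.parts)


def chunk_by_section(blocks: Iterable[Block]) -> List[Chunk]:
    """Start a new chunk at each heading; embed tables as JSON in the chunk text."""
    chunks: List[Chunk] = []
    current = Chunk(section="(preamble)")  # content that precedes the first heading
    for kind, content in blocks:
        if kind == "heading":
            # A heading closes the current chunk; sections with no content are dropped.
            if current.parts:
                chunks.append(current)
            current = Chunk(section=str(content))
        elif kind == "table":
            # The table stays inline with its section instead of going to a separate store.
            current.parts.append(json.dumps({"table": content}))
        elif kind == "paragraph":
            text = str(content).strip()
            if text:
                current.parts.append(text)
    if current.parts:
        chunks.append(current)
    return chunks


example_blocks = [
    ("paragraph", "Cover note."),
    ("heading", "Results"),
    ("paragraph", "Latency improved."),
    ("table", [["metric", "value"], ["p95", "120 ms"]]),
    ("heading", "Next steps"),
    ("paragraph", "Roll out to all tenants."),
]
example_chunks = chunk_by_section(example_blocks)
```

With this input, the sketch yields three chunks (the preamble, "Results", and "Next steps"), and the table travels with the "Results" chunk as a JSON string. Keeping the table inside its section's text means a retrieved chunk carries its own tabular context, at the cost of not being able to query the table separately.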

Attributes

labels

ThirdPartyComponent, Entity

Relationships (8 connections)

Related Entities

Layer: Background workers (Architecture)

RELATES_TO

IntegrationEndpoint: batch upload pipeline with smart processing (Integrations)

RELATES_TO

ThirdPartyComponent: Docling (Architecture)

RELATES_TO