Project: datalens
81 entity types
Matrix/Architecture/DOCX extractor
ThirdPartyComponentArchitecture

DOCX extractor

The DOCX extractor is one of the new extractors that the batch upload pipeline depends on. It is built on the python-docx third-party component and can optionally use Docling as an extraction method when better text quality and semantic structure are needed; for simple documents it falls back to python-docx, which is faster. Chunking is semantic: chunk boundaries follow document sections so that the document's structure is preserved. Tables are embedded as JSON within text chunks rather than stored as separate DuckDB tables. Processing is asynchronous and relies on background workers. The extractor's capability is validated by the test_extractors test case.
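
The section-based chunking and JSON table embedding described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the datalens implementation: Block, Chunk, choose_method, chunk_by_section, the "(preamble)" section label, and the {"table": ...} JSON shape are all hypothetical names chosen for this example. The sketch assumes that either extraction method (Docling or python-docx) has already been reduced to a flat list of headings, paragraphs, and tables, and it uses only the Python standard library.

```python
import json
from dataclasses import dataclass, field
from importlib.util import find_spec


@dataclass
class Block:
    """Hypothetical intermediate element produced by either extraction method."""
    kind: str                                   # "heading", "paragraph", or "table"
    text: str = ""                              # used by headings and paragraphs
    rows: list = field(default_factory=list)    # used by tables: list of row lists


@dataclass
class Chunk:
    """One text chunk, labelled with the section it came from."""
    section: str
    text: str


def choose_method(prefer_quality: bool) -> str:
    """Pick Docling only when it is requested and installed; else python-docx.

    Assumption: the caller decides when higher quality is worth the slower path.
    """
    if prefer_quality and find_spec("docling") is not None:
        return "docling"
    return "python-docx"


def chunk_by_section(blocks):
    """Start a new chunk at every heading and inline tables as JSON text."""
    chunks = []
    section, parts = "(preamble)", []
    for block in blocks:
        if block.kind == "heading":
            # A heading closes the current section (if it has any content).
            if parts:
                chunks.append(Chunk(section, "\n".join(parts)))
            section, parts = block.text, []
        elif block.kind == "table":
            # The table stays inside the text chunk as one JSON line,
            # instead of being written to a separate DuckDB table.
            parts.append(json.dumps({"table": block.rows}))
        else:
            parts.append(block.text)
    if parts:
        chunks.append(Chunk(section, "\n".join(parts)))
    return chunks


if __name__ == "__main__":
    demo = [
        Block("paragraph", "Intro."),
        Block("heading", "Results"),
        Block("paragraph", "See table."),
        Block("table", rows=[["a", "b"], ["1", "2"]]),
    ]
    for chunk in chunk_by_section(demo):
        print(chunk.section, "->", repr(chunk.text))
```

Keeping each table as a single JSON line inside its section's chunk means the table travels with the prose that refers to it, which is the trade-off the description makes against storing tables separately. In this sketch a heading with no following content produces no chunk.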