Project: datalens

backend/app/extractors/docx_extractor.py

backend/app/extractors/docx_extractor.py provides the DOCX extractor (DocxExtractor), part of the Docling extraction system. It calls Docling on the elin GPU over SSH and produces semantic chunks that preserve the heading hierarchy, embed tables as JSON, and carry rich metadata, including hierarchy and provenance, to support DS-STAR queries for document reasoning. The module was refactored for Phase 2 GPU-first processing, in which Docling on the elin GPU is a mandatory tool. The RQ worker calls the DOCX extractor and fails hard if extraction fails, enforcing the no-fallback policy. The description also lists a fallback to python-docx for simple files, which is at odds with the no-fallback policy and presumably predates the Phase 2 refactor.
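The fail-hard, no-fallback behavior described above can be sketched as follows. This is a minimal illustration, not the project's code: `ExtractionError`, `Chunk`, `extract_docx`, and the `run_remote` callable (standing in for the SSH call to Docling on elin) are hypothetical names, and the shape of the raw chunk dictionaries is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


class ExtractionError(RuntimeError):
    """Raised when GPU extraction fails; callers must not fall back."""


@dataclass
class Chunk:
    text: str
    heading_path: List[str]  # heading hierarchy, outermost heading first
    metadata: Dict[str, str] = field(default_factory=dict)  # provenance etc.


# Hypothetical signature: takes a DOCX path, returns raw chunk dicts.
RemoteRunner = Callable[[str], List[dict]]


def extract_docx(path: str, run_remote: RemoteRunner) -> List[Chunk]:
    """Run remote extraction and convert the result; never fall back."""
    try:
        raw = run_remote(path)
    except Exception as exc:
        # No CPU or python-docx fallback: surface the failure to the worker.
        raise ExtractionError(f"GPU extraction failed for {path}") from exc
    if not raw:
        raise ExtractionError(f"GPU extraction returned no chunks for {path}")
    return [
        Chunk(
            text=item["text"],
            heading_path=list(item.get("heading_path", [])),
            # Record the source file as provenance alongside remote metadata.
            metadata={"source": path, **item.get("metadata", {})},
        )
        for item in raw
    ]
```

Passing the remote call in as a parameter keeps the sketch independent of how the SSH connection is made, and lets a worker job fail simply by letting `ExtractionError` propagate.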

Attributes
labels: Entity, Page
page type: file
layout: standard
functional area: Data Extraction
validation rules: Ensure the script uses GPU-accelerated Docling for DOCX/PPTX extraction without fallback to CPU-based tools; validate extraction success and handle errors explicitly; confirm that GPU resource checks are in place before extraction; test file processing for different formats and sizes.
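One way to read the validation rules is as a preflight check that runs before any extraction is attempted. The sketch below is an assumption-laden illustration: `preflight`, `PreflightError`, `SUPPORTED_SUFFIXES`, and the `gpu_available` probe are hypothetical names, and how the GPU on elin is actually probed is not stated in the source.

```python
from pathlib import Path
from typing import Callable

# Formats named in the validation rules.
SUPPORTED_SUFFIXES = {".docx", ".pptx"}


class PreflightError(RuntimeError):
    """Raised when a precondition for GPU extraction is not met."""


def preflight(path: str, gpu_available: Callable[[], bool]) -> None:
    """Check format and GPU availability; raise instead of falling back."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_SUFFIXES:
        raise PreflightError(f"unsupported format: {suffix or '(none)'}")
    # gpu_available is a hypothetical probe supplied by the caller.
    if not gpu_available():
        raise PreflightError("GPU not available; refusing to fall back to CPU")
```

Calling `preflight(path, probe)` before extraction makes both failure modes explicit errors rather than silent downgrades to a CPU-based tool.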
Relationships: 14 connections