Project: datalens

backend/app/extractors/pptx_extractor.py

The PPTX extractor runs Docling on the elin GPU over SSH to extract PPTX files while preserving slide semantics and metadata. There is no fallback path. It builds on python-pptx for slide and text extraction, adding semantic chunking and slide layout detection (title, bullets, blank). It also extracts image metadata, such as counts and types, for added context, and is integrated into the GPU-first extraction workflow.
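
To make the layout detection and image metadata steps concrete, here is a minimal sketch of how they could look with python-pptx alone. It is not the module's actual code: `SlideSummary`, `classify_layout`, and `summarize_presentation` are hypothetical names, the three-way heuristic is an assumption, and the Docling, GPU, and SSH parts are left out entirely.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SlideSummary:
    """Hypothetical per-slide record; not the extractor's real schema."""
    index: int
    layout: str  # "title", "bullets", or "blank"
    title: str
    bullets: list[str]
    image_types: dict[str, int] = field(default_factory=dict)


def classify_layout(title: str, bullets: list[str]) -> str:
    """Assumed heuristic: body text wins, then a lone title, else blank."""
    if bullets:
        return "bullets"
    if title.strip():
        return "title"
    return "blank"


def summarize_presentation(path: str) -> list[SlideSummary]:
    # Imported lazily so classify_layout stays usable without python-pptx.
    from pptx import Presentation
    from pptx.shapes.picture import Picture

    summaries = []
    for index, slide in enumerate(Presentation(path).slides, start=1):
        # shapes.title is None when the slide has no title placeholder.
        title_shape = slide.shapes.title
        title = title_shape.text_frame.text if title_shape is not None else ""
        title_id = title_shape.shape_id if title_shape is not None else None

        bullets: list[str] = []
        image_types: Counter = Counter()
        for shape in slide.shapes:
            if isinstance(shape, Picture):
                # Assumes embedded pictures; content_type is a MIME type
                # such as "image/png".
                image_types[shape.image.content_type] += 1
            elif shape.has_text_frame and shape.shape_id != title_id:
                # Every non-empty paragraph outside the title counts as body text.
                bullets.extend(
                    p.text for p in shape.text_frame.paragraphs if p.text.strip()
                )

        summaries.append(
            SlideSummary(
                index=index,
                layout=classify_layout(title, bullets),
                title=title,
                bullets=bullets,
                image_types=dict(image_types),
            )
        )
    return summaries
```

Calling `summarize_presentation("deck.pptx")` (a placeholder file name) would return one `SlideSummary` per slide. Keeping `classify_layout` as a pure function means the heuristic can be checked without a real presentation file.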