Data Model
152 entities found
extract worker
The extract worker is a component of the background workers; the batch processor orchestrator relies on it to perform data extraction from files.
extracted_tables
Each project schema (e.g. project_14.*) contains multiple extracted tables whose metadata is registered in the extracted_tables registry table in the public schema. PgDataService updates and queries this registry in PostgreSQL to track table metadata per project schema.
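A minimal sketch of how such a registry could look and be queried. The column names are assumptions, and sqlite3 stands in for PostgreSQL here:

```python
import sqlite3

# Hypothetical registry of extracted tables; columns are illustrative
# assumptions, and sqlite3 stands in for the PostgreSQL public schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE extracted_tables (
        id INTEGER PRIMARY KEY,
        project_id INTEGER NOT NULL,
        schema_name TEXT NOT NULL,      -- e.g. 'project_14'
        table_name TEXT NOT NULL,
        row_count INTEGER,
        UNIQUE (schema_name, table_name)
    )
""")
conn.execute(
    "INSERT INTO extracted_tables (project_id, schema_name, table_name, row_count) "
    "VALUES (?, ?, ?, ?)",
    (14, "project_14", "budget_2022", 1200),
)

# Look up one project's registered tables, as a data service might.
rows = conn.execute(
    "SELECT table_name, row_count FROM extracted_tables WHERE project_id = ?",
    (14,),
).fetchall()
print(rows)  # [('budget_2022', 1200)]
```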
extracted_tables registry
extracted_text_chunks
The extracted_text_chunks table stores text chunks from all projects centrally in a single unified table in PostgreSQL's public schema. PgDataService manages inserting into and querying this table for textual data storage and retrieval.
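A sketch of the unified-table pattern described above. Column names are assumptions, and sqlite3 stands in for PostgreSQL:

```python
import sqlite3

# Hypothetical shape of a unified text-chunk table; columns are
# illustrative assumptions, and sqlite3 stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE extracted_text_chunks (
        id INTEGER PRIMARY KEY,
        project_id INTEGER NOT NULL,
        file_id INTEGER NOT NULL,
        chunk_index INTEGER NOT NULL,
        content TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO extracted_text_chunks (project_id, file_id, chunk_index, content) "
    "VALUES (?, ?, ?, ?)",
    [(14, 1, 0, "Fiscal year 2022 overview"), (15, 2, 0, "HR policy notes")],
)

# Retrieve one project's chunks in order, as a data service might.
chunks = conn.execute(
    "SELECT content FROM extracted_text_chunks WHERE project_id = ? ORDER BY chunk_index",
    (14,),
).fetchall()
print(chunks)  # [('Fiscal year 2022 overview',)]
```

Keeping all projects in one table (rather than one table per project schema) makes cross-project retrieval a single filtered query.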
extracted_text_chunks table
ExtractedTable model
extraction worker
The Extraction Worker currently writes extracted data to DuckDB files; it will be migrated to write extracted data to PostgreSQL schemas instead.
extractors
DS-STAR Intelligence integrates with the existing extractors to improve extraction quality. The plan includes a CSV Extractor to validate, clean, and load CSV data into DuckDB; an Excel Extractor to handle multi-sheet workbooks, normalize headers, detect merged cells, and load data into DuckDB; and a PDF Extractor that uses vLLM to extract tables as JSON and loads validated data into DuckDB. The DataLens Master Implementation Plan depends on these extractor components for data ingestion and processing during extraction phases.
file_upload.catalog_data
file_uploads
The file_uploads table tracks the processing status of the 132 files in the project_14 schema. It contains a project_id column linking uploaded files to projects and an uploaded_by field linking each file to the user who uploaded it. The text_chunks table includes a file_id referencing the file from which each chunk originated, and GDPR flags reference files through the file_id column in project_gdpr_flags. The data catalog generation process depends on the upload directory where user files are stored.
FileUpload record ai_summary column
AI Summary Generation stores summaries in the FileUpload record's ai_summary column after extraction completes in the cataloging workflow. The dashboard's file.ai_summary display renders this column so users can see file summaries in the PostgreSQL-backed FileUpload records. The file prioritizer uses FileUpload columns such as tier and ai_summary to assign processing priorities.
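A toy sketch of tier- and summary-driven prioritization. The field names and ordering rule are illustrative assumptions, not the actual prioritizer logic:

```python
# Hypothetical prioritization over FileUpload-like records; field names
# and the ordering rule are assumptions for illustration only.
def priority(file_record):
    tier = file_record.get("tier", 3)
    has_summary = bool(file_record.get("ai_summary"))
    # Lower tiers first; within a tier, files lacking a summary come
    # first so one can be generated.
    return (tier, has_summary)

files = [
    {"name": "a.csv", "tier": 2, "ai_summary": "Budget data"},
    {"name": "b.pdf", "tier": 1, "ai_summary": None},
]
first = sorted(files, key=priority)[0]["name"]
print(first)  # b.pdf
```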
FL 2022-2028 budget file
Docling was used to extract 30 tables, including from the FL 2022-2028 budget file, within the SVGV Budget Analysis Project.
frontend/tests/test-discovery.sh
A comprehensive test script created to validate the DataDiscovery feature. It supports automated or interactive testing of the table discovery, consolidation, and analysis workflows against real SVGV data, providing full end-to-end functionality and performance validation.
GDPR-blocked data
generate_sql
Converts natural language questions into SQL queries using LLM, schema info, and prompt.
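A minimal sketch of the prompt-assembly step such a function might perform. The prompt template and schema-dict shape are assumptions, not the actual DataLens implementation, and the LLM call itself is omitted:

```python
# Hypothetical prompt builder for NL-to-SQL generation; the template and
# schema_info shape are illustrative assumptions.
def build_prompt(question: str, schema_info: dict) -> str:
    schema_lines = [
        f"TABLE {table} ({', '.join(cols)})" for table, cols in schema_info.items()
    ]
    return (
        "Given the following schema:\n"
        + "\n".join(schema_lines)
        + f"\n\nWrite a SQL query answering: {question}"
    )

prompt = build_prompt(
    "total budget per department",
    {"budget_2022": ["department", "amount"]},
)
print("TABLE budget_2022 (department, amount)" in prompt)  # True
```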
get_all_schemas
Gets complete schema info for all tables in a project, used for prompt building.
get_schema
Retrieves schema details for specific table in a project.
get_tables
Lists all tables within a project's DuckDB database.
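The get_tables / get_schema helpers above could be sketched as follows. The source uses a per-project DuckDB database; sqlite3 stands in here, and the function shapes are assumptions:

```python
import sqlite3

# Illustrative versions of get_tables and get_schema; sqlite3 stands in
# for DuckDB, and the signatures are assumptions.
def get_tables(conn):
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    return [r[0] for r in rows]

def get_schema(conn, table):
    # PRAGMA table_info rows are (cid, name, type, notnull, default, pk).
    return [(r[1], r[2]) for r in conn.execute(f"PRAGMA table_info({table})")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE budget (department TEXT, amount REAL)")
print(get_tables(conn))            # ['budget']
print(get_schema(conn, "budget"))  # [('department', 'TEXT'), ('amount', 'REAL')]
```

get_all_schemas would then simply map get_schema over the result of get_tables when building the prompt.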
HR data
information_schema.tables
A system view in SQL databases used to retrieve metadata about existing tables, crucial for detecting and managing schema changes during the data extraction and consolidation processes.
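As a concrete illustration, a consolidation step might check which tables already exist in a project schema with a standard query like the following (the schema name is illustrative):

```sql
-- List existing tables in one project schema (illustrative schema name).
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'project_14'
  AND table_type = 'BASE TABLE';
```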
Insight
The Insight physical table holds insights derived from project data analysis; its project_id column links each insight to a project, and each project possesses an insights table storing its generated insights. The Insight table is populated from the Query physical table, which produces analytical insights from executed queries, and is derived from the AgentFinding data entity representing findings extracted by the agent.
insights table
The insights table is the data source that Analysis Recommendations build upon to generate actionable cards using the project's goal context.
intelligent table discovery
Built as part of DataLens' data discovery system, it automates table ranking and join discovery, enhancing query success from 70% to over 95%. It uses entity extraction, relevance scoring, and pattern recognition to pre-select related tables, creating transient views for Arctic to generate accurate SQL.
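A toy sketch of the relevance-scoring idea, pre-selecting tables whose names and columns overlap the question's terms. The scoring scheme is an illustrative assumption, not the DataLens algorithm:

```python
# Hypothetical relevance scorer for table pre-selection; the token-overlap
# scheme is an assumption for illustration only.
def relevance(question_terms, table_name, columns):
    tokens = set(table_name.lower().split("_")) | {c.lower() for c in columns}
    hits = sum(1 for t in question_terms if t.lower() in tokens)
    return hits / max(len(question_terms), 1)

tables = {
    "budget_2022": ["department", "amount"],
    "hr_staff": ["employee", "salary"],
}
q = ["budget", "department"]
ranked = sorted(tables, key=lambda t: relevance(q, t, tables[t]), reverse=True)
print(ranked[0])  # budget_2022
```

A real implementation would combine such scores with entity extraction and join-path discovery before creating the transient views handed to SQL generation.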
IronClaw agent tables
IronClaw agent tables are physical tables in the PostgreSQL database providing persistent storage for agent sessions and related data. The Agent session data entity maps to these tables, and the IronClaw agent feature depends on them for storing session and message data.
IronClaw database
The IronClaw Agent requires the IronClaw database to be configured and operational in order to persist sessions and support agent session creation. The IronClaw onboarding process configures and initializes this database to enable session storage and persistent agent threads, and the IronClaw service on elin depends on it for session persistence and agent thread management.
Join discovery
The Backend discovery service depends on Join discovery mechanisms to find table relationships.
Join relationships
JSON format (not JSONB/ARRAY)
To ensure cross-database compatibility, the DataLens Platform uses JSON format instead of JSONB or ARRAY for storing structured data.
LLM
WrenAI supports multiple LLM providers including OpenAI, Anthropic, Ollama, and Bedrock.
MDL
A declarative schema layer used in other architectures for data modeling, not primary to DataLens.