Data Ingestion

How data flows into QANATIX — 12 sources, one pipeline.

QANATIX accepts data from 12 sources. All data flows through the same pipeline regardless of source.

Supported sources

| Source | Method | Format |
|---|---|---|
| CSV | File upload | .csv |
| JSON | File upload or batch API | .json |
| NDJSON | File upload or streaming | .ndjson |
| PDF | File upload | .pdf — extracted to markdown |
| XML / BMEcat | File upload | .xml — catalog standards |
| SAP IDoc | File upload | .xml — MATMAS, DEBMAS, CREMAS |
| PostgreSQL | Database connector | Server-side cursors |
| MySQL | Database connector | Streaming query |
| MongoDB | Database connector | Collection sync |
| Neo4j | Database connector | Cypher queries |
| REST API | Push / webhook | JSON payload |
| NDJSON stream | Streaming endpoint | Backpressure-controlled |
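Of the formats above, NDJSON is the simplest to produce programmatically: one JSON object per line. A minimal reader sketch — the field names in the sample records are illustrative, not a fixed QANATIX schema:

```python
import io
import json

def read_ndjson(stream):
    """Parse an NDJSON stream lazily: one JSON object per line."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)

# Example: two records in NDJSON form (field names are illustrative)
raw = '{"name": "Widget", "entity_type": "product"}\n' \
      '{"name": "Gadget", "entity_type": "product"}\n'
records = list(read_ndjson(io.StringIO(raw)))
```

Because the reader is a generator, arbitrarily large files can be streamed without loading them into memory.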

Pipeline

Every record goes through 5 stages:

Extract → Normalize → Validate → Embed → Index
  1. Extract — parse the source format (CSV rows, JSON objects, PDF pages, XML elements, DB rows)
  2. Normalize — map to QANATIX entity structure (name, entity_type, vertical_data)
  3. Validate — check against the vertical's JSON schema (if registered)
  4. Embed — generate dense + sparse vector embeddings (OpenAI, Cohere, or BGE-M3)
  5. Index — upsert into Qdrant for hybrid search
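The five stages above can be sketched as composed functions. All function bodies here are placeholders for illustration, not the actual QANATIX implementation — in production, validation runs against the vertical's JSON schema, embedding calls a model provider, and indexing upserts into Qdrant:

```python
import json

def extract(raw_line):
    # Extract: parse the source format (here, one NDJSON line)
    return json.loads(raw_line)

def normalize(record):
    # Normalize: map to the QANATIX entity structure
    return {
        "name": record["name"],
        "entity_type": record.get("type", "unknown"),
        "vertical_data": {k: v for k, v in record.items() if k not in ("name", "type")},
    }

def validate(entity):
    # Validate: stand-in for the vertical's JSON-schema check
    assert entity["name"], "name is required"
    return entity

def embed(entity):
    # Embed: stand-in for a dense + sparse embedding call (OpenAI, Cohere, or BGE-M3)
    entity["vector"] = [0.0, 0.0]  # placeholder vector
    return entity

def index(entity, store):
    # Index: stand-in for an upsert into Qdrant (here, a plain dict keyed by name)
    store[entity["name"]] = entity

store = {}
for line in ['{"name": "Widget", "type": "product", "price": 9.99}']:
    index(embed(validate(normalize(extract(line)))), store)
```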

Deduplication

QANATIX deduplicates by content hash. If you upload the same file twice, duplicates are skipped. Re-ingestion with source_id upserts existing entities instead of creating duplicates.
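A minimal sketch of that deduplication logic, assuming SHA-256 over the canonical JSON form of each record (the actual hash function and entity store are not specified in this document):

```python
import hashlib
import json

def content_hash(record):
    # Canonical JSON (sorted keys) so field order does not change the hash
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

seen = set()   # content hashes already ingested
store = {}     # stand-in entity store, keyed by source_id (or hash)

def ingest(record, source_id=None):
    h = content_hash(record)
    if h in seen:
        return "skipped"           # exact duplicate content: do nothing
    seen.add(h)
    if source_id is not None and source_id in store:
        store[source_id] = record  # re-ingestion with source_id: upsert in place
        return "updated"
    store[source_id or h] = record
    return "created"
```

With this logic, uploading the same file twice skips every record the second time, while re-sending a changed record under the same `source_id` updates the existing entity rather than creating a duplicate.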

Batch sizes

| Setting | Default |
|---|---|
| JSON batch max records | 5,000 |
| File upload max size | 50 MB |
| Database connector batch | 5,000 rows |
| Streaming buffer | 100 records or 5 s flush |
