# Data Ingestion

How data flows into QANATIX — 10 sources, one pipeline.
QANATIX accepts data from 10 sources. All data flows through the same pipeline regardless of source.
## Supported sources
| Source | Method | Format / Notes |
|---|---|---|
| CSV | File upload | .csv |
| JSON | File upload or batch API | .json |
| NDJSON | File upload or streaming | .ndjson |
| PDF | File upload | .pdf — extracted to markdown |
| XML / BMEcat | File upload | .xml — catalog standards |
| SAP IDoc | File upload | .xml — MATMAS, DEBMAS, CREMAS |
| PostgreSQL | Database connector | Server-side cursors |
| MySQL | Database connector | Streaming query |
| MongoDB | Database connector | Collection sync |
| Neo4j | Database connector | Cypher queries |
| REST API | Push / webhook | JSON payload |
| NDJSON stream | Streaming endpoint | Backpressure-controlled |
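As an illustration of the REST push path, a batch payload might be assembled like this. The field names `source_id` and `records` are illustrative assumptions, not the confirmed API schema — see the API Ingestion guide for the real endpoint and shape:

```python
import json

# Hypothetical REST push payload. Field names are assumptions for
# illustration; the actual schema lives in the API Ingestion guide.
payload = {
    "source_id": "crm-export-2024",  # stable ID so re-ingestion upserts
    "records": [
        {
            "name": "ACME Pump X200",
            "entity_type": "product",
            "vertical_data": {"sku": "X200", "price_eur": 1299.0},
        },
    ],
}

# Serialize for the POST body.
body = json.dumps(payload)
```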
## Pipeline
Every record goes through 5 stages:
Extract → Normalize → Validate → Embed → Index

- Extract — parse the source format (CSV rows, JSON objects, PDF pages, XML elements, DB rows)
- Normalize — map to the QANATIX entity structure (`name`, `entity_type`, `vertical_data`)
- Validate — check against the vertical's JSON schema (if registered)
- Embed — generate dense + sparse vector embeddings (OpenAI, Cohere, or BGE-M3)
- Index — upsert into Qdrant for hybrid search
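The five stages above can be sketched end to end. Only the entity shape (`name`, `entity_type`, `vertical_data`) comes from this page; the validation, embedding, and indexing bodies are stand-ins, not the real implementation:

```python
import hashlib
import json
from typing import Any

def extract(raw: str) -> list[dict[str, Any]]:
    # Extract: parse the source format (here, NDJSON lines).
    return [json.loads(line) for line in raw.splitlines() if line.strip()]

def normalize(record: dict[str, Any]) -> dict[str, Any]:
    # Normalize: map to the QANATIX entity structure.
    return {
        "name": record.get("title") or record.get("name"),
        "entity_type": record.get("type", "unknown"),
        "vertical_data": {k: v for k, v in record.items()
                          if k not in ("title", "name", "type")},
    }

def validate(entity: dict[str, Any]) -> bool:
    # Validate: stand-in for the vertical's JSON-schema check.
    return bool(entity["name"])

def embed(entity: dict[str, Any]) -> list[float]:
    # Embed: stand-in for a dense/sparse embedding call
    # (OpenAI, Cohere, or BGE-M3 in the real pipeline).
    digest = hashlib.sha256(entity["name"].encode()).digest()
    return [b / 255 for b in digest[:4]]  # toy 4-dim vector

def index(entity: dict[str, Any], vector: list[float], store: dict) -> None:
    # Index: stand-in for the Qdrant upsert.
    store[entity["name"]] = {"entity": entity, "vector": vector}

store: dict[str, Any] = {}
raw = '{"title": "Pump X200", "type": "product", "sku": "X200"}'
for rec in extract(raw):
    ent = normalize(rec)
    if validate(ent):
        index(ent, embed(ent), store)
```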
## Deduplication
QANATIX deduplicates by content hash. If you upload the same file twice, duplicates are skipped. Re-ingestion with `source_id` upserts existing entities instead of creating duplicates.
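A minimal sketch of content-hash deduplication, assuming SHA-256 over the raw record bytes (the actual hashing scheme is internal and not documented here):

```python
import hashlib

def content_hash(record: bytes) -> str:
    # Assumption: SHA-256 over raw bytes; the real scheme is internal.
    return hashlib.sha256(record).hexdigest()

seen: set[str] = set()

def ingest(record: bytes) -> bool:
    """Return True if the record was ingested, False if skipped as a duplicate."""
    h = content_hash(record)
    if h in seen:
        return False  # same content seen before: skip
    seen.add(h)
    return True

first = ingest(b'{"name": "Pump X200"}')   # new content: ingested
second = ingest(b'{"name": "Pump X200"}')  # identical content: skipped
```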
## Batch sizes
| Setting | Default |
|---|---|
| JSON batch max records | 5,000 |
| File upload max size | 50 MB |
| Database connector batch | 5,000 rows |
| Streaming buffer | 100 records or 5s flush |
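The streaming buffer policy in the table — flush at 100 records or after 5 seconds, whichever comes first — can be sketched as follows. The class and the flush target are illustrative stubs, not the real implementation:

```python
import time

MAX_RECORDS = 100       # flush threshold from the table above
MAX_AGE_SECONDS = 5.0   # time-based flush from the table above

class StreamBuffer:
    """Illustrative buffer: flush at 100 records or a 5-second age."""

    def __init__(self) -> None:
        self.records: list[dict] = []
        self.last_flush = time.monotonic()
        self.flushes: list[int] = []  # batch sizes, for illustration

    def add(self, record: dict) -> None:
        self.records.append(record)
        if (len(self.records) >= MAX_RECORDS
                or time.monotonic() - self.last_flush >= MAX_AGE_SECONDS):
            self.flush()

    def flush(self) -> None:
        if self.records:
            # Stand-in for the real index write.
            self.flushes.append(len(self.records))
            self.records.clear()
        self.last_flush = time.monotonic()

buf = StreamBuffer()
for i in range(250):
    buf.add({"seq": i})
buf.flush()  # drain the tail when the stream ends
```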
## Guides
- File Upload — CSV, JSON, NDJSON, PDF
- XML Catalogs — BMEcat, SAP IDoc
- API Ingestion — REST push, webhooks
- Database Connectors — Postgres, MySQL, MongoDB, Neo4j
- Streaming — NDJSON streaming with backpressure