# Data Ingestion

How data flows into QANATIX — 10 sources, one pipeline.
QANATIX accepts data from 10 sources. All data flows through the same pipeline regardless of source.
## Supported sources
| Source | Method | Format / Notes |
|---|---|---|
| CSV | File upload | .csv |
| JSON | File upload or batch API | .json |
| NDJSON | File upload or streaming | .ndjson |
| PDF | File upload | .pdf — extracted to markdown |
| XML / BMEcat | File upload | .xml — catalog standards |
| SAP IDoc | File upload | .xml — MATMAS, DEBMAS, CREMAS |
| PostgreSQL | Database connector | Server-side cursors |
| MySQL | Database connector | Streaming query |
| MongoDB | Database connector | Collection sync |
| Neo4j | Database connector | Cypher queries |
| REST API | Push / webhook | JSON payload |
| NDJSON stream | Streaming endpoint | Backpressure-controlled |
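As an illustration of the REST push path, a batch payload might be assembled like this. The field names `source_id` and `records` are illustrative assumptions, not the confirmed API schema — see the API Ingestion guide for the real endpoint and shape:

```python
import json

# Hypothetical REST push payload. Field names are assumptions for
# illustration; the actual schema lives in the API Ingestion guide.
payload = {
    "source_id": "crm-export-2024",  # stable ID so re-ingestion upserts
    "records": [
        {
            "name": "ACME Pump X200",
            "entity_type": "product",
            "vertical_data": {"sku": "X200", "price_eur": 1299.0},
        },
    ],
}

# Serialize for the POST body.
body = json.dumps(payload)
```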
## Pipeline
Every record goes through 5 stages:
Extract → Normalize → Validate → Embed → Index

- Extract — parse the source format (CSV rows, JSON objects, PDF pages, XML elements, DB rows)
- Normalize — map to the QANATIX entity structure (`name`, `entity_type`, `vertical_data`)
- Validate — check against the vertical's JSON schema (if registered)
- Embed — generate dense + sparse vector embeddings (OpenAI, Cohere, or BGE-M3)
- Index — upsert into Qdrant for hybrid search
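The five stages above can be sketched end to end. Only the entity shape (`name`, `entity_type`, `vertical_data`) comes from this page; the validation, embedding, and indexing bodies are stand-ins, not the real implementation:

```python
import hashlib
import json
from typing import Any

def extract(raw: str) -> list[dict[str, Any]]:
    # Extract: parse the source format (here, NDJSON lines).
    return [json.loads(line) for line in raw.splitlines() if line.strip()]

def normalize(record: dict[str, Any]) -> dict[str, Any]:
    # Normalize: map to the QANATIX entity structure.
    return {
        "name": record.get("title") or record.get("name"),
        "entity_type": record.get("type", "unknown"),
        "vertical_data": {k: v for k, v in record.items()
                          if k not in ("title", "name", "type")},
    }

def validate(entity: dict[str, Any]) -> bool:
    # Validate: stand-in for the vertical's JSON-schema check.
    return bool(entity["name"])

def embed(entity: dict[str, Any]) -> list[float]:
    # Embed: stand-in for a dense/sparse embedding call
    # (OpenAI, Cohere, or BGE-M3 in the real pipeline).
    digest = hashlib.sha256(entity["name"].encode()).digest()
    return [b / 255 for b in digest[:4]]  # toy 4-dim vector

def index(entity: dict[str, Any], vector: list[float], store: dict) -> None:
    # Index: stand-in for the Qdrant upsert.
    store[entity["name"]] = {"entity": entity, "vector": vector}

store: dict[str, Any] = {}
raw = '{"title": "Pump X200", "type": "product", "sku": "X200"}'
for rec in extract(raw):
    ent = normalize(rec)
    if validate(ent):
        index(ent, embed(ent), store)
```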
## Deduplication
QANATIX deduplicates by content hash. If you upload the same file twice, duplicates are skipped. Re-ingestion with `source_id` upserts existing entities instead of creating duplicates.
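A minimal sketch of content-hash deduplication, assuming SHA-256 over the raw record bytes (the actual hashing scheme is internal and not documented here):

```python
import hashlib

def content_hash(record: bytes) -> str:
    # Assumption: SHA-256 over raw bytes; the real scheme is internal.
    return hashlib.sha256(record).hexdigest()

seen: set[str] = set()

def ingest(record: bytes) -> bool:
    """Return True if the record was ingested, False if skipped as a duplicate."""
    h = content_hash(record)
    if h in seen:
        return False  # same content seen before: skip
    seen.add(h)
    return True

first = ingest(b'{"name": "Pump X200"}')   # new content: ingested
second = ingest(b'{"name": "Pump X200"}')  # identical content: skipped
```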
## Batch sizes
| Setting | Default |
|---|---|
| JSON batch max records | 5,000 |
| File upload max size | 50 MB |
| Database connector batch | 5,000 rows |
| Streaming buffer | 100 records or 5s flush |
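The streaming buffer policy in the table — flush at 100 records or after 5 seconds, whichever comes first — can be sketched as follows. The class and the flush target are illustrative stubs, not the real implementation:

```python
import time

MAX_RECORDS = 100       # flush threshold from the table above
MAX_AGE_SECONDS = 5.0   # time-based flush from the table above

class StreamBuffer:
    """Illustrative buffer: flush at 100 records or a 5-second age."""

    def __init__(self) -> None:
        self.records: list[dict] = []
        self.last_flush = time.monotonic()
        self.flushes: list[int] = []  # batch sizes, for illustration

    def add(self, record: dict) -> None:
        self.records.append(record)
        if (len(self.records) >= MAX_RECORDS
                or time.monotonic() - self.last_flush >= MAX_AGE_SECONDS):
            self.flush()

    def flush(self) -> None:
        if self.records:
            # Stand-in for the real index write.
            self.flushes.append(len(self.records))
            self.records.clear()
        self.last_flush = time.monotonic()

buf = StreamBuffer()
for i in range(250):
    buf.add({"seq": i})
buf.flush()  # drain the tail when the stream ends
```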
## Guides
- File Upload — CSV, JSON, NDJSON, PDF
- XML Catalogs — BMEcat, SAP IDoc
- API Ingestion — REST push, webhooks
- Database Connectors — Postgres, MySQL, MongoDB, Neo4j
- Streaming — NDJSON streaming with backpressure