Vector Indexing & Metadata
This guide covers the schema and best practices for populating the Cloudflare Vectorize index.
1. Index Schema
| Property | Value | Description |
|---|---|---|
| Dimensions | 1024 | Required by the bge-m3 embedding model. |
| Metric | cosine | Cosine similarity is recommended for text embeddings. |
Metadata Fields
To enable efficient filtering, the following metadata fields are indexed:
| Field Name | Type | Description |
|---|---|---|
| lang | string | The language of the article chunk (e.g., ar, en). |
| source | string | The key of the source (e.g., SANA, verify-sy). |
| published_at_bucket | number | The publication month, formatted as YYYYMM (e.g., 202508). |
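As an illustration of how the indexed fields might be used, the sketch below derives the YYYYMM bucket from a date and assembles a metadata filter. It assumes buckets are computed in UTC (the guide does not name a time zone) and that the index supports range operators such as `$gte` and `$lte` on numeric metadata. `toPublishedAtBucket` and `buildFilter` are hypothetical helper names, not part of any library.

```typescript
// Convert a publication date to its YYYYMM bucket, e.g. 2025-08-14 -> 202508.
// Assumption: the bucket is derived from the UTC date.
function toPublishedAtBucket(publishedAt: Date): number {
  const year = publishedAt.getUTCFullYear();
  const month = publishedAt.getUTCMonth() + 1; // getUTCMonth() is 0-based
  return year * 100 + month;
}

// Hypothetical helper: build a filter over the three indexed metadata fields.
// Plain values mean equality; the bucket uses an inclusive numeric range.
function buildFilter(
  lang: string,
  source: string,
  fromBucket: number,
  toBucket: number
) {
  return {
    lang: lang,
    source: source,
    published_at_bucket: { $gte: fromBucket, $lte: toBucket },
  };
}
```

Because YYYYMM values sort numerically in calendar order, a range such as 202411 to 202502 spans a year boundary without special handling.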
2. Vector & Metadata Structure
Each document stored in Vectorize represents a single chunk of an article.
Vector ID Naming Convention
Vector IDs follow a strict naming convention to ensure uniqueness and traceability:
article:<uuid>:<chunk_no>
- article: A static prefix.
- <uuid>: The unique identifier of the parent article.
- <chunk_no>: The 0-indexed number of the chunk within the article.
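The naming convention can be kept in one place with a pair of small helpers. This is a minimal sketch: `buildVectorId` and `parseVectorId` are hypothetical names, and the parser assumes the article identifier contains no colon, which holds for standard UUIDs.

```typescript
// Build a vector ID of the form article:<uuid>:<chunk_no>.
function buildVectorId(articleId: string, chunkNo: number): string {
  if (chunkNo < 0 || Math.floor(chunkNo) !== chunkNo) {
    throw new RangeError("chunk_no must be a non-negative integer, got " + chunkNo);
  }
  return "article:" + articleId + ":" + chunkNo;
}

// Parse a vector ID back into its parts; returns null if the ID does not
// follow the convention. Assumes the article ID itself contains no colon.
function parseVectorId(
  vectorId: string
): { articleId: string; chunkNo: number } | null {
  const match = /^article:([^:]+):(\d+)$/.exec(vectorId);
  if (match === null) {
    return null;
  }
  return { articleId: match[1], chunkNo: Number(match[2]) };
}
```

Parsing the ID is mainly useful for tracing a search hit back to its article when the metadata payload is not returned with the match.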
Recommended Metadata Payload
This is the metadata object that should be stored alongside each vector.
{
  "article_id": "<uuid>",
  "chunk_no": 0,
  "lang": "ar",
  "source": "SANA",
  "published_at_bucket": 202508,
  "url": "https://example.com/article/123",
  "title": "Article Title",
  "text": "A short snippet of the article text (2-4 KiB max)."
}
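One way to keep payloads consistent is to build them through a typed helper that also enforces the snippet size. The sketch below assumes a 4 KiB cap (the upper end of the 2-4 KiB range above) measured in UTF-8 bytes, which matters for Arabic text where most characters take two bytes. `ChunkMetadata`, `truncateUtf8`, and `buildChunkMetadata` are hypothetical names.

```typescript
interface ChunkMetadata {
  article_id: string;
  chunk_no: number;
  lang: string;
  source: string;
  published_at_bucket: number;
  url: string;
  title: string;
  text: string;
}

// Assumption: cap the stored snippet at 4 KiB of UTF-8.
const MAX_SNIPPET_BYTES = 4 * 1024;

// Truncate so the UTF-8 encoding fits in maxBytes without splitting a
// surrogate pair. Byte sizes are computed from UTF-16 code units.
function truncateUtf8(text: string, maxBytes: number): string {
  let bytes = 0;
  let i = 0;
  while (i < text.length) {
    const code = text.charCodeAt(i);
    const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
    let size = 3; // default: BMP character >= U+0800 (or a lone surrogate)
    let units = 1;
    if (code < 0x80) {
      size = 1;
    } else if (code < 0x800) {
      size = 2;
    } else if (code >= 0xd800 && code <= 0xdbff && next >= 0xdc00 && next <= 0xdfff) {
      size = 4; // valid surrogate pair: one code point, two UTF-16 units
      units = 2;
    }
    if (bytes + size > maxBytes) {
      break;
    }
    bytes += size;
    i += units;
  }
  return text.slice(0, i);
}

function buildChunkMetadata(
  article: { id: string; lang: string; source: string; publishedAtBucket: number; url: string; title: string },
  chunkNo: number,
  chunkText: string
): ChunkMetadata {
  return {
    article_id: article.id,
    chunk_no: chunkNo,
    lang: article.lang,
    source: article.source,
    published_at_bucket: article.publishedAtBucket,
    url: article.url,
    title: article.title,
    text: truncateUtf8(chunkText, MAX_SNIPPET_BYTES),
  };
}
```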
Chunk Size
Aim for a chunk size of approximately 600-800 tokens per vector for optimal retrieval performance.
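A rough chunker along these lines is sketched below. It treats whitespace-separated words as a stand-in for tokens, which is an assumption: the embedding model's real tokenizer will count differently, especially for Arabic, so the target should be calibrated against actual token counts. `chunkByApproxTokens` is a hypothetical helper.

```typescript
// Split text into consecutive chunks of at most targetTokens "tokens".
// Assumption: one whitespace-separated word approximates one token.
function chunkByApproxTokens(text: string, targetTokens: number = 700): string[] {
  if (targetTokens < 1) {
    throw new RangeError("targetTokens must be at least 1");
  }
  const words = text.split(/\s+/).filter(function (w) {
    return w.length > 0;
  });
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += targetTokens) {
    chunks.push(words.slice(start, start + targetTokens).join(" "));
  }
  return chunks;
}
```

The index of each returned chunk maps directly to the 0-indexed chunk number used in the vector ID.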
3. Backfill & Data Management
Pilot Backfill
For initial data loading or small backfills, use the /rag/dev/upsert-batch development endpoint or a one-off script that calls VECTORIZE.upsert directly.
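A one-off backfill script could look roughly like the sketch below. To stay self-contained, it declares a small local interface for the one method it uses instead of importing the Workers types; `VectorRecord`, `UpsertCapableIndex`, `toBatches`, and `backfill` are hypothetical names, and the default batch size of 500 is an arbitrary assumption that should be checked against the current Vectorize limits.

```typescript
interface VectorRecord {
  id: string;
  values: number[];
  metadata?: Record<string, string | number | boolean>;
}

// Local stand-in for the part of the Vectorize binding used here.
interface UpsertCapableIndex {
  upsert(vectors: VectorRecord[]): Promise<unknown>;
}

// Split an array into consecutive batches of at most batchSize items.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (batchSize < 1) {
    throw new RangeError("batchSize must be at least 1");
  }
  const batches: T[][] = [];
  for (let start = 0; start < items.length; start += batchSize) {
    batches.push(items.slice(start, start + batchSize));
  }
  return batches;
}

// Upsert all vectors batch by batch; returns the number of vectors sent.
// Batches are sent sequentially, so a failure stops the run at that batch.
async function backfill(
  index: UpsertCapableIndex,
  vectors: VectorRecord[],
  batchSize: number = 500
): Promise<number> {
  let sent = 0;
  for (const batch of toBatches(vectors, batchSize)) {
    await index.upsert(batch);
    sent += batch.length;
  }
  return sent;
}
```

Inside a Worker, the call would be something like `await backfill(env.VECTORIZE, vectors)`, assuming the binding is named `VECTORIZE` as in this guide. Because upserts overwrite by ID, rerunning a partially failed backfill is safe.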
Data Updates & Deletes
- Database Mapping: It is crucial to maintain a mapping in the primary PostgreSQL database between (article_id, chunk_no) and the corresponding vector_id.
- Updates: To update a vector, re-embed the content and call VECTORIZE.upsert() with the same vector ID. This will overwrite the existing vector.
- Deletes: To remove vectors, use VECTORIZE.deleteByIds(['id1', 'id2', ...]). For large-scale deletions, consider maintaining a "tombstone" list and purging periodically.
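The tombstone approach might be sketched as follows. The in-memory `TombstoneList` is a stand-in for illustration only; in practice the list would more plausibly live in a PostgreSQL table next to the ID mapping. `DeleteCapableIndex`, `TombstoneList`, and `purgeTombstones` are hypothetical names, and the batch size of 100 is an assumption.

```typescript
// Local stand-in for the part of the Vectorize binding used here.
interface DeleteCapableIndex {
  deleteByIds(ids: string[]): Promise<unknown>;
}

// Tracks vector IDs that are marked for deletion but not yet purged.
class TombstoneList {
  private ids: string[] = [];

  // Mark a vector for later deletion; duplicates are ignored.
  mark(vectorId: string): void {
    if (this.ids.indexOf(vectorId) === -1) {
      this.ids.push(vectorId);
    }
  }

  get size(): number {
    return this.ids.length;
  }

  // Return up to maxIds pending IDs without removing them.
  peek(maxIds: number): string[] {
    return this.ids.slice(0, maxIds);
  }

  // Drop IDs once their deletion has succeeded.
  remove(purged: string[]): void {
    this.ids = this.ids.filter(function (id) {
      return purged.indexOf(id) === -1;
    });
  }
}

// Purge tombstoned vectors in batches. IDs leave the list only after the
// delete call resolves, so a failed batch is retried on the next run.
async function purgeTombstones(
  index: DeleteCapableIndex,
  tombstones: TombstoneList,
  batchSize: number = 100
): Promise<number> {
  if (batchSize < 1) {
    throw new RangeError("batchSize must be at least 1");
  }
  let purged = 0;
  while (tombstones.size > 0) {
    const ids = tombstones.peek(batchSize);
    await index.deleteByIds(ids);
    tombstones.remove(ids);
    purged += ids.length;
  }
  return purged;
}
```

A scheduled job could call `purgeTombstones(env.VECTORIZE, tombstones)` periodically, again assuming the binding name used in this guide.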