Architecture
High-Level Overview
- Service name: sinatools (branded as NLP Lab sidecar).
- Purpose: Provide Arabic NLP microservices (currently powered by the SinaTools SDK) with optional dialect enhancements.
- Interface: REST over HTTP, OpenAPI v3 spec available at sinatools/openapi.json.
- Deployment: Docker Compose service that mounts sinatools/app at /app and starts through a Uvicorn entrypoint.
Module Layout
sinatools/app/
├── main.py          # Entrypoint: loads src.app.create_app()
├── src/
│   ├── app.py       # FastAPI factory, router registration, warmup logic
│   ├── config.py    # Paths to datasets and shared constants
│   ├── routers/     # Feature-specific FastAPI routers
│   ├── services/    # Cached SDK loaders, Nabra helpers
│   └── utils.py     # Normalisation, tokenisation, similarity helpers
└── tests/
    └── test_api.py  # FastAPI TestClient coverage for all endpoints
Components
| Component | Description |
| --- | --- |
| Routers | Each router (ner, wsd, morph, dialect, relation, health) defines request models, response schemas, and error handling. |
| Services | Thin wrappers exposing cached SDK functions. Nabra service handles CSV ingestion and glossary lexicon building. |
| Datasets | Nabra CSVs mounted under /app/Nabra. Glosses for WSD pulled from sinatools.wsd.glosses_dic. |
| SDK | SinaTools Python package bundles Wojood, Salma, Alma, Hadath models. |
| Entry / Warmup | src/app.py optionally preloads models based on SINA_WARM. |
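
The cached-loader and warmup roles in the table can be illustrated with a small standard-library sketch. It is not the project's code: `get_model`, `warm_models`, and `LOAD_COUNTS` are hypothetical names, the placeholder dictionary stands in for an expensive SDK model load, and the comma-separated format assumed for SINA_WARM is a guess, since the format is not specified here.

```python
import os
from functools import lru_cache
from typing import List, Mapping, Optional

# Counts real loads so the caching effect is observable (illustration only).
LOAD_COUNTS = {"ner": 0, "wsd": 0}


@lru_cache(maxsize=None)
def get_model(name: str) -> dict:
    """Load a model once; later calls with the same name reuse the cached object."""
    if name not in LOAD_COUNTS:
        raise KeyError(f"unknown model: {name}")
    LOAD_COUNTS[name] += 1
    # Placeholder for an expensive SDK load (hypothetical).
    return {"name": name}


def warm_models(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Preload the models named in SINA_WARM.

    Assumption: SINA_WARM holds comma-separated model names, e.g. "ner,wsd".
    An unset or empty value preloads nothing.
    """
    env = os.environ if env is None else env
    names = [n.strip() for n in env.get("SINA_WARM", "").split(",") if n.strip()]
    for name in names:
        get_model(name)
    return names
```

Calling `warm_models()` at application startup would pay the load cost before the first request, while routers that call `get_model("ner")` later hit the cache instead of reloading.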
Data Flow
- Client calls a REST endpoint.
- Router validates payload via Pydantic models.
- Router invokes the corresponding service (cached SDK loader or Nabra lookup).
- Response is normalised into JSON (adds metadata like sense_url, lemma_forms, match_type).
- FastAPI returns the response; the OpenAPI schema and interactive docs are generated from the router and model definitions, so they stay in sync with the code.
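
The validate, invoke, normalise steps above can be sketched without any framework. The real routers use Pydantic models and FastAPI. Here a dataclass stands in for the request model, and `LookupRequest`, `LEXICON`, `lookup_service`, `handle_lookup`, and the `match_type` values "exact" and "none" are all hypothetical, chosen only to show the shape of the flow.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class LookupRequest:
    """Hypothetical request model; stands in for a Pydantic model."""
    text: str

    def __post_init__(self):
        # Validation step: reject missing or blank input early.
        if not isinstance(self.text, str) or not self.text.strip():
            raise ValueError("text must be a non-empty string")


# Hypothetical in-memory lexicon standing in for a cached service.
LEXICON = {"كتاب": {"lemma_forms": ["كتاب", "كتب"]}}


def lookup_service(term: str) -> Optional[dict]:
    """Service step: return the lexicon entry for the term, or None."""
    return LEXICON.get(term)


def handle_lookup(payload: dict) -> str:
    """Router step: validate, call the service, normalise to JSON."""
    request = LookupRequest(**payload)
    term = request.text.strip()
    hit = lookup_service(term)
    body = {
        "query": term,
        "match_type": "exact" if hit else "none",
        "lemma_forms": hit["lemma_forms"] if hit else [],
    }
    # ensure_ascii=False keeps Arabic text readable in the JSON output.
    return json.dumps(body, ensure_ascii=False)
```

Keeping the normalisation in one place means every endpoint can return the same metadata keys whether or not the service found a match.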
Dependencies & Integrations
- No direct datastore; all data is in-memory or local files.
- Downstream services consume these APIs for content tagging, search ranking, and UI tooltips.
- Observability hooks (metrics/logging) can be added later via FastAPI middleware.
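
As one possible shape for such an observability hook, the sketch below is a pure ASGI middleware that logs method, path, status, and duration for each HTTP request. It uses only the standard library. The class name `TimingMiddleware` and the logger name are hypothetical, and nothing like it is confirmed to exist in the service today.

```python
import logging
import time

logger = logging.getLogger("sinatools.access")  # hypothetical logger name


class TimingMiddleware:
    """Pure ASGI middleware that logs one line per HTTP request."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        # Pass through non-HTTP traffic (lifespan, websocket) untouched.
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        start = time.perf_counter()
        status = {"code": None}

        async def send_wrapper(message):
            # Capture the status code from the response start message.
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            logger.info(
                "%s %s -> %s in %.1f ms",
                scope["method"], scope["path"], status["code"], elapsed_ms,
            )
```

With FastAPI, a class of this shape can be registered through `app.add_middleware(TimingMiddleware)`. A metrics counter could replace or complement the log call.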
Extensibility Points
- New models: Drop in additional routers/services to integrate future NLP tasks under the same sidecar.
- Datasets: Additional dialect corpora can reuse the glossary patterns.
- Auth / Rate Limiting: Currently unauthenticated; add FastAPI middleware or a shared gateway if needed.
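
For the Datasets point above, reusing the glossary pattern for a new dialect corpus might look roughly like the following. This is a sketch under assumptions: the Nabra CSV schema is not described here, so the column names `term`, `gloss`, and `dialect` and the function `build_lexicon` are hypothetical and would need to be mapped to the real files.

```python
import csv
from collections import defaultdict
from typing import Dict, List, TextIO


def build_lexicon(
    csv_file: TextIO,
    term_col: str = "term",        # hypothetical column name
    gloss_col: str = "gloss",      # hypothetical column name
    dialect_col: str = "dialect",  # hypothetical column name
) -> Dict[str, List[dict]]:
    """Group glossary rows by term so lookups are a single dict access."""
    lexicon = defaultdict(list)
    for row in csv.DictReader(csv_file):
        term = (row.get(term_col) or "").strip()
        if not term:
            continue  # skip rows without a usable term
        lexicon[term].append({
            "gloss": (row.get(gloss_col) or "").strip(),
            "dialect": (row.get(dialect_col) or "").strip(),
        })
    return dict(lexicon)
```

Passing the column names as parameters lets a new corpus with different headers reuse the same builder, for example `build_lexicon(open(path, encoding="utf-8", newline=""), term_col="word")`.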