# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Role and Working Principles
You act as a senior Python / ML / AI / DL developer and system architect.
You work in an enterprise-level project (multi-tier backend) with initial base architecture similar to the REST template from `rest_template.md`.
You design code strictly according to patterns (Singleton, Repository, Interface, DTO, CRUD, Service, Context, Adapter, etc.).
You write clean production-level Python code (FastAPI, SQLAlchemy 2.x, asyncio, Pydantic v2, PostgreSQL, aiohttp, structlog/loguru).
**Working Rules:**
- Do not add comments unless explicitly requested.
- Always add docstrings to functions, classes, and modules.
- Always follow PEP 8 style and architectural layer isolation (api / service / repositories / models / schemas / interfaces / logger / config / context).
- Prefer typing via `from __future__ import annotations`.
- All dependencies are passed through `AppContext` (DI Singleton pattern).
- Log through the context-aware logger (plain `logger.info("msg")` calls, without structured payloads).
- When creating projects from scratch, rely on the structure from `rest_template.md`.
- Respond strictly to the point, no fluff, like a senior developer during code review.
- All logic in examples must be correct, asynchronous, and production-ready.
- Use only modern library versions.
Your style is minimalistic, precise, clean, and architecturally sound.
## Project Overview
**Dataloader** is an asynchronous FastAPI service for managing and executing long-running ETL tasks via a PostgreSQL-based job queue. The service uses PostgreSQL's `LISTEN/NOTIFY` for efficient worker wakeup, advisory locks for concurrency control, and `SELECT ... FOR UPDATE SKIP LOCKED` for job claiming.
This is a Clean Architecture implementation following the project template `rest_template.md`, built with Python 3.11+, FastAPI, SQLAlchemy 2.0 (async), and asyncpg.
## Development Commands
### Running the Application
```bash
# Install dependencies with Poetry
poetry install
# Run the application
poetry run dataloader
# or
uvicorn dataloader.__main__:main
# The app will start on port 8081 by default (configurable via APP_PORT)
```
### Testing
```bash
# Run all tests
poetry run pytest
# Run specific test file
poetry run pytest tests/integration_tests/v1_api/test_service.py
# Run with verbose output
poetry run pytest -v
# Run integration tests only
poetry run pytest tests/integration_tests/
```
### Database
The database schema is already applied (see `DDL.sql`). The queue uses:
- Table `dl_jobs` - main job queue with statuses: queued, running, succeeded, failed, canceled, lost
- Table `dl_job_events` - audit log of job lifecycle events
- PostgreSQL triggers for `LISTEN/NOTIFY` on job insertion/updates
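A minimal listener sketch with asyncpg is shown below; the channel name `dl_jobs` is an assumption, the real channel is defined by the triggers in `DDL.sql` and wired up in `storage/notify_listener.py`.
```python
import asyncio
import asyncpg


async def listen_for_jobs(dsn: str, channel: str = "dl_jobs") -> None:
    """Wake on queue notifications; the channel name here is an assumption, see DDL.sql."""
    wakeup = asyncio.Event()

    def on_notify(connection, pid, chan, payload) -> None:
        # Any notification means new or updated work may be available.
        wakeup.set()

    conn = await asyncpg.connect(dsn)
    try:
        await conn.add_listener(channel, on_notify)
        while True:
            await wakeup.wait()
            wakeup.clear()
            # Claim and process available jobs here.
    finally:
        await conn.close()
```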
## Architecture
### High-Level Structure
The codebase follows Clean Architecture with clear separation of concerns:
1. **API Layer** (`src/dataloader/api/`)
   - `v1/router.py` - HTTP endpoints for job management
   - `v1/service.py` - Business logic layer
   - `v1/schemas.py` - Pydantic request/response models
   - `os_router.py` - Infrastructure endpoints (`/health`, `/status`) **DO NOT MODIFY**
   - `metric_router.py` - Metrics endpoints (BETA) **DO NOT MODIFY**
   - `middleware.py` - Request/response logging middleware **DO NOT MODIFY**
2. **Storage Layer** (`src/dataloader/storage/`)
   - `repositories.py` - PostgreSQL queue operations using SQLAlchemy ORM
   - `db.py` - Database engine and session management
   - `notify_listener.py` - PostgreSQL LISTEN/NOTIFY implementation
3. **Worker Layer** (`src/dataloader/workers/`)
   - `manager.py` - Manages lifecycle of async worker tasks
   - `base.py` - Core worker implementation with claim/heartbeat/execute cycle
   - `reaper.py` - Background task to requeue lost jobs (expired leases)
   - `pipelines/registry.py` - Pipeline registration and resolution system
   - `pipelines/` - Individual pipeline implementations
4. **Logger** (`src/dataloader/logger/`)
   - Structured logging with automatic sensitive data masking
   - **DO NOT MODIFY** these files - they're from the template
5. **Core** (`src/dataloader/`)
   - `__main__.py` - Application entry point
   - `config.py` - Pydantic Settings for all configuration
   - `context.py` - AppContext singleton for dependency injection
   - `base.py` - Base classes and types
   - `exceptions.py` - Global exception definitions
### Key Architectural Patterns
#### Job Queue Protocol
Jobs flow through the system via a strict state machine:
1. **Enqueue** (`trigger` API) - Creates job in `queued` status
   - Idempotent via `idempotency_key`
   - PostgreSQL trigger fires `LISTEN/NOTIFY` to wake workers
2. **Claim** (worker) - Worker acquires job atomically (see the claim sketch after this list)
   - Uses `FOR UPDATE SKIP LOCKED` to prevent contention
   - Sets status to `running`, increments attempt counter
   - Attempts PostgreSQL advisory lock on `lock_key`
   - If lock fails → job goes back to `queued` with backoff delay
3. **Execute** (worker) - Runs the pipeline with heartbeat
   - Heartbeat updates every `DL_HEARTBEAT_SEC` seconds
   - Extends `lease_expires_at` to prevent the reaper from reclaiming the job
   - Checks `cancel_requested` flag between pipeline chunks
   - Pipeline yields between chunks to allow cooperative cancellation
4. **Complete** (worker) - Finalize job status
   - **Success**: `status = succeeded`, release advisory lock
   - **Failure**:
     - If `attempt < max_attempts` → `status = queued` (retry with a backoff of `30 * attempt` seconds)
     - If `attempt >= max_attempts` → `status = failed`
   - **Cancel**: `status = canceled`
   - Always releases advisory lock
5. **Reaper** (background) - Recovers lost jobs
   - Runs every `DL_REAPER_PERIOD_SEC`
   - Finds jobs where `status = running` AND `lease_expires_at < now()`
   - Resets them to `queued` for retry
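A sketch of the claim step (step 2), assuming the `dl_jobs` columns described in this document; the advisory-lock attempt is shown separately below, and the actual query in `storage/repositories.py` may differ.
```python
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession


async def claim_next_job(session: AsyncSession, queue: str, lease_ttl_sec: int) -> dict | None:
    """Atomically claim one queued job; column names are illustrative, see DDL.sql."""
    result = await session.execute(
        text(
            """
            UPDATE dl_jobs
            SET status = 'running',
                attempt = attempt + 1,
                lease_expires_at = now() + make_interval(secs => :ttl)
            WHERE job_id = (
                SELECT job_id
                FROM dl_jobs
                WHERE queue = :queue
                  AND status = 'queued'
                  AND available_at <= now()
                ORDER BY priority, available_at
                LIMIT 1
                FOR UPDATE SKIP LOCKED
            )
            RETURNING job_id, task, args, lock_key
            """
        ),
        {"queue": queue, "ttl": lease_ttl_sec},
    )
    row = result.mappings().first()
    await session.commit()
    return dict(row) if row else None
```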
#### Concurrency Control
The system uses multiple layers of concurrency control:
- **`lock_key`**: a PostgreSQL advisory lock ensures only one worker processes jobs with the same `lock_key` (see the sketch after this list)
- **`partition_key`**: Logical grouping for job ordering (currently informational)
- **`FOR UPDATE SKIP LOCKED`**: Prevents multiple workers from claiming the same job
- **Async workers**: Multiple workers can run concurrently within a single process
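The advisory-lock layer can be sketched as below; deriving the numeric key via `hashtextextended` is an assumption, the repository may hash `lock_key` differently. Session-level advisory locks are tied to the database connection, so lock and unlock must run on the same connection.
```python
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession


async def try_advisory_lock(session: AsyncSession, lock_key: str) -> bool:
    """Try to take a session-level advisory lock derived from lock_key."""
    result = await session.execute(
        text("SELECT pg_try_advisory_lock(hashtextextended(:key, 0))"),
        {"key": lock_key},
    )
    return bool(result.scalar())


async def release_advisory_lock(session: AsyncSession, lock_key: str) -> None:
    """Release the advisory lock taken for the same lock_key on the same connection."""
    await session.execute(
        text("SELECT pg_advisory_unlock(hashtextextended(:key, 0))"),
        {"key": lock_key},
    )
```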
#### Worker Configuration
Workers are configured via `WORKERS_JSON` environment variable:
```json
[
{"queue": "load.cbr", "concurrency": 2},
{"queue": "load.sgx", "concurrency": 1}
]
```
This spawns one async worker task per unit of concurrency (three tasks for the example above) within the FastAPI process.
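A hedged sketch of this fan-out, where `run_worker` and the exact structure of `WorkerManager` in `workers/manager.py` are assumptions:
```python
import asyncio
import json
import os
from collections.abc import Awaitable, Callable


async def start_workers(
    run_worker: Callable[[str, int, asyncio.Event], Awaitable[None]],
    stop_event: asyncio.Event,
) -> list[asyncio.Task]:
    """Spawn one worker task per unit of concurrency for every configured queue."""
    configs = json.loads(os.environ["WORKERS_JSON"])
    tasks: list[asyncio.Task] = []
    for cfg in configs:
        for slot in range(cfg["concurrency"]):
            tasks.append(asyncio.create_task(run_worker(cfg["queue"], slot, stop_event)))
    return tasks
```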
#### Pipeline System
Pipelines are registered via decorator in `workers/pipelines/`:
```python
from dataloader.workers.pipelines.registry import register

@register("my.task")
async def my_pipeline(args: dict):
    # Process chunk 1
    yield  # Allow heartbeat & cancellation check
    # Process chunk 2
    yield
    # Process chunk 3
```
The `yield` statements enable:
- Heartbeat updates during long operations
- Cooperative cancellation via `cancel_requested` checks
- Progress tracking
All pipelines must be imported in the `load_all()` function of `workers/pipelines/__init__.py`.
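On the worker side, a registered pipeline is an async generator. A sketch of how it can be driven between yields follows; the `heartbeat` and `is_cancel_requested` callables are hypothetical stand-ins for the repository calls in `workers/base.py`.
```python
from collections.abc import AsyncIterator, Awaitable, Callable


async def run_pipeline(
    pipeline: Callable[[dict], AsyncIterator[None]],
    args: dict,
    heartbeat: Callable[[], Awaitable[None]],
    is_cancel_requested: Callable[[], Awaitable[bool]],
) -> bool:
    """Iterate the pipeline, heartbeating and checking cancellation at each yield."""
    async for _ in pipeline(args):
        await heartbeat()
        if await is_cancel_requested():
            return False
    return True
```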
### Application Lifecycle
1. **Startup** (`lifespan` in `api/__init__.py`; see the sketch after this list)
   - Initialize logging
   - Create database engine and sessionmaker
   - Load all pipelines from registry
   - Build WorkerManager from `WORKERS_JSON`
   - Start all worker tasks and reaper
2. **Runtime**
   - FastAPI serves HTTP requests
   - Workers pick up jobs from the queue, woken by LISTEN/NOTIFY
   - Reaper runs in background
3. **Shutdown** (on SIGTERM)
   - Signal all workers to stop via `asyncio.Event`
   - Cancel worker tasks and wait for completion
   - Cancel reaper task
   - Dispose database engine
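A minimal `lifespan` sketch reflecting these steps; the `AppContext`, `WorkerManager`, and `load_all` method names below are assumptions based on the module layout in this document.
```python
from contextlib import asynccontextmanager

from fastapi import FastAPI

from dataloader.context import AppContext
from dataloader.workers.manager import WorkerManager
from dataloader.workers.pipelines import load_all


@asynccontextmanager
async def lifespan(app: FastAPI):
    """Start workers and the reaper on startup, stop them and dispose the engine on shutdown."""
    ctx = AppContext()                     # assumed: initializes logging, engine, sessionmaker
    load_all()                             # register all pipelines
    manager = WorkerManager.from_env(ctx)  # assumed: builds workers from WORKERS_JSON
    await manager.start()                  # start worker tasks and the reaper
    try:
        yield
    finally:
        await manager.stop()               # signal stop, cancel tasks, await completion
        await ctx.dispose()                # assumed: disposes the database engine
```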
## Configuration
All configuration is via environment variables (`.env` file or system environment):
### Application Settings
- `APP_HOST` - Server bind address (default: `0.0.0.0`)
- `APP_PORT` - Server port (default: `8081`)
- `DEBUG` - Debug mode (default: `False`)
- `LOCAL` - Local development flag (default: `False`)
### Database Settings
- `PG_HOST`, `PG_PORT`, `PG_USER`, `PG_PASSWORD`, `PG_DATABASE`, `PG_SCHEMA` - PostgreSQL connection
- `PG_POOL_SIZE`, `PG_MAX_OVERFLOW`, `PG_POOL_RECYCLE` - Connection pool configuration
- `DL_DB_DSN` - Optional override for queue database DSN (if different from main DB)
### Worker Settings
- `WORKERS_JSON` - JSON array of worker configurations (required)
- `DL_HEARTBEAT_SEC` - Heartbeat interval (default: `10`)
- `DL_DEFAULT_LEASE_TTL_SEC` - Default lease duration (default: `60`)
- `DL_REAPER_PERIOD_SEC` - Reaper run interval (default: `10`)
- `DL_CLAIM_BACKOFF_SEC` - Backoff when advisory lock fails (default: `15`)
### Logging Settings
- `LOG_PATH`, `LOG_FILE_NAME` - Application log location
- `METRIC_PATH`, `METRIC_FILE_NAME` - Metrics log location
- `AUDIT_LOG_PATH`, `AUDIT_LOG_FILE_NAME` - Audit events log location
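A hedged sketch of how these variables can map onto pydantic-settings models; the grouping and field names in the real `config.py` may differ.
```python
from pydantic_settings import BaseSettings, SettingsConfigDict


class AppSettings(BaseSettings):
    """Application-level settings (subset)."""

    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    app_host: str = "0.0.0.0"
    app_port: int = 8081
    debug: bool = False
    local: bool = False


class WorkerSettings(BaseSettings):
    """Worker tuning settings read from the DL_* variables."""

    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    workers_json: str
    dl_heartbeat_sec: int = 10
    dl_default_lease_ttl_sec: int = 60
    dl_reaper_period_sec: int = 10
    dl_claim_backoff_sec: int = 15
```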
## API Endpoints
### Business API (v1)
- `POST /api/v1/jobs/trigger` - Create or get existing job (idempotent; example request after this list)
  - Body: `{queue, task, args?, idempotency_key?, lock_key, partition_key?, priority?, available_at?}`
  - Response: `{job_id, status}`
- `GET /api/v1/jobs/{job_id}/status` - Get job status
  - Response: `{job_id, status, attempt, started_at?, finished_at?, heartbeat_at?, error?, progress}`
- `POST /api/v1/jobs/{job_id}/cancel` - Request job cancellation (cooperative)
  - Response: Same as the status endpoint
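An example trigger call with aiohttp (already part of the stack); the host, queue, and payload values are illustrative.
```python
import aiohttp


async def trigger_job() -> dict:
    """Enqueue a job (or get the existing one when idempotency_key matches)."""
    payload = {
        "queue": "load.cbr",
        "task": "load.cbr",
        "args": {"date": "2024-01-01"},
        "idempotency_key": "load.cbr:2024-01-01",
        "lock_key": "cbr",
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(
            "http://localhost:8081/api/v1/jobs/trigger", json=payload
        ) as resp:
            resp.raise_for_status()
            return await resp.json()
```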
### Infrastructure API
- `GET /health` - Health check (no database access, <20ms)
- `GET /status` - Service status with version/uptime
## Development Guidelines
### Adding a New Pipeline
1. Create pipeline file in `src/dataloader/workers/pipelines/`:
```python
from dataloader.workers.pipelines.registry import register

@register("myqueue.mytask")
async def my_task_pipeline(args: dict):
    # Your implementation
    # Use yield between chunks for heartbeat
    yield
```
2. Import in `src/dataloader/workers/pipelines/__init__.py`:
```python
def load_all() -> None:
    from . import noop
    from . import my_task  # Add this line
```
3. Add queue to `.env`:
```
WORKERS_JSON=[{"queue":"myqueue","concurrency":1}]
```
### Idempotent Operations
All pipelines should be idempotent since jobs may be retried:
- Use `idempotency_key` for external API calls
- Use `UPSERT` or `INSERT ... ON CONFLICT` for database writes (see the sketch after this list)
- Design pipelines to be safely re-runnable from any point
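A sketch of such an idempotent write using SQLAlchemy's PostgreSQL `ON CONFLICT` support; the `rates` table and its columns are hypothetical.
```python
from datetime import date
from decimal import Decimal

from sqlalchemy import Column, Date, MetaData, Numeric, String, Table
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.ext.asyncio import AsyncSession

metadata = MetaData()
rates = Table(
    "rates",
    metadata,
    Column("currency", String, primary_key=True),
    Column("rate_date", Date, primary_key=True),
    Column("value", Numeric, nullable=False),
)


async def upsert_rate(session: AsyncSession, currency: str, rate_date: date, value: Decimal) -> None:
    """Insert or update a rate row so the pipeline can be re-run safely."""
    stmt = insert(rates).values(currency=currency, rate_date=rate_date, value=value)
    stmt = stmt.on_conflict_do_update(
        index_elements=[rates.c.currency, rates.c.rate_date],
        set_={"value": stmt.excluded.value},
    )
    await session.execute(stmt)
    await session.commit()
```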
### Security & Data Masking
The logger automatically masks sensitive fields (defined in `logger/utils.py`):
- Keywords: `password`, `token`, `secret`, `key`, `authorization`, etc.
- Never log credentials directly
- Use structured logging: `logger.info("message", extra={...})`
### Error Handling
- Pipelines should raise exceptions for transient errors (will trigger retry)
- Use `max_attempts` in job creation to control retry limits
- Permanent failures should be logged rather than raised (mark the job as succeeded but record the error in `dl_job_events`)
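A hedged illustration of this split inside a pipeline; `fetch_batch` is a hypothetical helper, and the standard-library logger stands in for the project logger.
```python
import logging

from dataloader.workers.pipelines.registry import register

log = logging.getLogger(__name__)


async def fetch_batch(args: dict) -> list[dict]:
    """Hypothetical helper standing in for a real external call."""
    return [{"id": 1}, {"id": None}]


@register("load.example")
async def load_example(args: dict):
    """Illustrative error handling for a pipeline."""
    # Transient failures: let exceptions from fetch_batch propagate;
    # the worker requeues the job with backoff until max_attempts is reached.
    rows = await fetch_batch(args)
    yield
    # Permanent data issues: log and continue instead of raising,
    # so the job finishes as succeeded and the problem stays visible in the logs.
    bad_rows = [r for r in rows if r.get("id") is None]
    if bad_rows:
        log.warning("skipped %d rows without id", len(bad_rows))
    yield
```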
### Testing
Integration tests should:
- Use test fixtures from `tests/conftest.py`
- Test full job lifecycle: trigger → claim → execute → complete
- Test failure scenarios: cancellation, retries, lock contention
- Mock external dependencies, use real database for queue operations
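A hedged outline of a lifecycle test, assuming pytest-asyncio, an httpx-style async `client` fixture from `tests/conftest.py`, and the built-in `noop` pipeline; names and timings are illustrative.
```python
import asyncio

import pytest


@pytest.mark.asyncio
async def test_job_lifecycle(client):
    """Trigger a job, then poll its status until a worker finishes it."""
    resp = await client.post(
        "/api/v1/jobs/trigger",
        json={"queue": "load.noop", "task": "noop", "lock_key": "noop"},
    )
    assert resp.status_code == 200
    job_id = resp.json()["job_id"]

    status = None
    for _ in range(30):
        status = (await client.get(f"/api/v1/jobs/{job_id}/status")).json()["status"]
        if status in {"succeeded", "failed", "canceled"}:
            break
        await asyncio.sleep(1)

    assert status == "succeeded"
```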
## Important Files to Reference
- `TZ.md` - Full technical specification (Russian)
- `TODO.md` - Implementation progress and next steps
- `rest_template.md` - Project structure template
- `DDL.sql` - Database schema