Production-ready Flink job controller with security, resilience, and observability best practices
Production-ready declarative Flink job lifecycle controller with security, resilience, and observability as first-class concerns.
This agent assists with developing and maintaining a Flink job controller that follows declarative reconciliation patterns with robust error handling, authentication, monitoring, and progressive quality gates.
**Core Architecture:**
**Control Loop Pattern:** Reconcile desired state vs. actual state with comprehensive error handling and state management.
1. Run `just setup` to install dependencies, configure environment, and enable pre-commit hooks
2. Verify setup with `just status` to check project health
3. Use `just help` to see all available commands
1. Start development mode with `just dev` (auto-test, auto-format on file changes)
2. Run fast unit tests during TDD cycles: `just test-fast` (target: <30s)
3. Run full test suite before commits: `just test` (unit + integration)
4. Fix code quality issues: `just fix`
5. Validate quality gates: `just check`
**Unit Tests (tests/unit/):**
**Integration Tests:**
**Security Tests:**
**Commands:**
1. Build application: `just build`
2. Deploy to target environment: `just deploy`
3. Monitor deployment: `just monitor-dashboards`
**Circuit Breaker Pattern:**
```python
```
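As a minimal sketch of the circuit breaker pattern, the class below rejects calls for a cool-down period once a run of consecutive failures reaches a threshold, then lets one trial call through. The names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `reset_timeout` are illustrative assumptions, not part of this project's codebase.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock          # injectable so tests can control time
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or reaching the threshold, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller could wrap each request to the Flink cluster as `breaker.call(client.get_job_status, job_id)` and treat `CircuitOpenError` as "cluster unavailable, retry on the next reconcile pass".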
**Credential Management:**
```python
```
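One way to handle credentials, sketched under assumptions: read them from the environment at startup, fail fast when any are missing, and keep sensitive fields out of `repr()` so they do not leak into logs. The environment variable names `FLINK_PRINCIPAL` and `FLINK_KEYTAB` and the `Credentials` type are hypothetical.

```python
import os
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Credentials:
    """Hypothetical credential holder for Kerberos-style auth."""
    principal: str
    # repr=False keeps the keytab location out of log lines and tracebacks.
    keytab_path: str = field(repr=False)


def load_credentials(env=None):
    """Load credentials from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    required = ("FLINK_PRINCIPAL", "FLINK_KEYTAB")  # assumed variable names
    missing = [key for key in required if not env.get(key)]
    if missing:
        # Report which settings are missing, never their values.
        raise RuntimeError(f"missing credential settings: {', '.join(missing)}")
    return Credentials(
        principal=env["FLINK_PRINCIPAL"],
        keytab_path=env["FLINK_KEYTAB"],
    )
```

Passing the mapping explicitly makes the loader easy to unit test without touching the real process environment.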
**Artifact Verification:**
```python
```
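A common form of artifact verification is comparing a SHA-256 digest of the job artifact against an expected value before submitting it. The sketch below assumes the expected digest is supplied as a hex string from a trusted source; the function names are illustrative.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large JARs are never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_sha256):
    """Return True only if the file's digest matches the expected hex digest."""
    actual = sha256_of_file(path)
    # compare_digest gives a constant-time comparison; normalise case first.
    return hmac.compare_digest(actual, expected_sha256.strip().lower())
```

A deployment step could refuse to proceed when `verify_artifact(...)` returns `False`, logging the artifact path but not treating the mismatch as retryable.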
**Job Reconciliation:**
```python
```
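The core of declarative reconciliation is a pure decision step: given the desired spec and the state observed on the cluster, decide which action to take. The sketch below is a simplified illustration; `JobSpec`, its fields, and the action names are assumptions rather than this project's actual types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JobSpec:
    """Hypothetical job description used for both desired and observed state."""
    name: str
    artifact: str
    parallelism: int


def plan_action(desired: Optional[JobSpec], actual: Optional[JobSpec]) -> str:
    """Compare desired vs. actual state and return the action to apply."""
    if desired is None and actual is None:
        return "noop"      # nothing wanted, nothing running
    if actual is None:
        return "create"    # wanted but missing from the cluster
    if desired is None:
        return "delete"    # running but no longer wanted
    if desired != actual:
        return "update"    # running with a drifted configuration
    return "noop"          # already converged
```

Keeping the decision free of I/O means it can be covered by fast unit tests, while the code that applies the action carries the error handling and retries.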
**Development Phase:**
**Production Phase:**
When implementing features or fixing bugs:
1. Update README.md with new capabilities
2. Update design docs in `docs/` with architectural changes
3. Keep milestone tracker current (weekly updates)
4. Update roadmap and policy docs per release
5. Refresh workflow documentation when processes change
6. Keep setup and troubleshooting guides accurate
All assets are version-controlled; documentation is part of the definition of done.
**Unit Test Pattern:**
```python
def test_reconcile_creates_missing_job(mock_flink_client, sample_spec):
    """Test reconciler creates job when missing from cluster."""
    reconciler = JobReconciler(mock_flink_client)
    result = reconciler.reconcile(sample_spec)
    assert result.action == "created"
    # Use the real Mock assertion method; `.called_once()` is not a valid
    # Mock assertion and does not verify the call count.
    mock_flink_client.create_job.assert_called_once()
```
**Integration Test Pattern:**
```python
def test_job_deployment_with_real_auth(real_flink_cluster, kerberos_credentials, sample_job_spec):
    """Test end-to-end job deployment with real Kerberos auth."""
    # sample_job_spec is requested as a fixture so the name is defined in the test.
    client = FlinkClient(real_flink_cluster.url, kerberos_credentials)
    job_id = client.submit_job(sample_job_spec)
    assert client.get_job_status(job_id) == "RUNNING"
```
1. **Justfile First:** Always use `just` commands for tasks; never bypass the defined workflow
2. **Reality-Based Testing:** Prefer real system integration over mocks, especially for security
3. **Progressive Enhancement:** Start simple, improve coverage and quality incrementally
4. **Documentation as Code:** Update docs alongside code changes
5. **Security by Default:** Never compromise on authentication, validation, or audit logging
6. **Fast Feedback:** Keep unit tests under 30 seconds to enable rapid TDD cycles