Observatory

A window into the human-AI collaboration behind Austin AI Events

Calendar
πŸ’‘

The Big Picture

This is a fully autonomous system. It discovers events, monitors its own health, and fixes its own code β€” every day, without human intervention. The Observatory exists to give transparency into this process so people can see how an autonomous AI system actually works. This isn't a black box.

How It Works

Three autonomous loops work together β€” no human runs this system.

LOOP 1

Discovery Pipeline

β€” runs daily at midnight
πŸ”
Search

Scans 8+ sources and the web for Austin AI events

πŸ”€
Deduplicate

Catches the same event listed on different platforms

βœ…
Validate

AI confirms: real event? In Austin? AI-related?

🏷️
Classify

Tags audience, skill level, and free/paid

πŸ“…
Publish

Approved events appear on the calendar

run complete β€” monitor evaluates
LOOP 2

Self-Monitoring

β€” evaluates every run
πŸ“ŠGather Metrics

Collects data on scraper health, error rates, source performance, and calendar coverage

🧠Opus Evaluates

The most powerful Claude model reviews everything, assigns a health grade, and identifies issues

⚑Take Action

Creates search queries, manages sources, and escalates code issues for the repair agent

issues found β€” repair agent activates
LOOP 3

Self-Healing

β€” runs daily, 2 hours after discovery
πŸ“‹Read Issues

Picks up the highest-priority action item from the monitor

πŸ”§Fix Code

Reads the codebase, understands the bug, and writes a fix

πŸ§ͺTest

Runs the test suite β€” only pushes if all tests pass

πŸš€Deploy

Pushes the fix to production β€” next run uses the improved code

Cycle repeats daily β€” the system continuously improves itself
πŸ€–

Multi-Model Architecture

Three Claude AI models split the work based on what each task needs β€” like having a junior analyst, a senior reviewer, and a strategic director on the same team.

Haiku(Speed)

Handles 80% of decisions β€” validation, classification, dedup

Sonnet(Balance)

Evaluates new sources and extracts event details

Opus(Strategy)

The system brain β€” monitors health and drives improvements

πŸ’¬

Community Input

Anyone can submit an event the system missed using the β€œMissing an event?” button on the calendar. The agent scrapes the submitted URL, validates it, and adds it to the calendar β€” all in the same daily run. It also learns from each submission, adding new sources and search strategies to find similar events in the future.

πŸ€–

Agent Performance

What the agent is doing autonomously

Events Added (Last 30 Days)

Recent Activity

πŸ”

Under the Hood

How the agent thinks, decides, and sometimes fails

🩺

Health Report

Automated self-evaluation of system effectiveness

GradeScraper HealthSourcesError RateActivity
A80%+4+ contributing<5%Events added in last 7d
B60-79%3+ contributing<10%Active discovery
C40-59%2-3 contributing>10%Some source issues
D<40%<2 contributingHighMultiple broken scrapers
Fβ€”β€”β€”System not running

Updated 2026-03-29: Grades now measure infrastructure health (what the agent controls), not event count or empty days (which reflect community activity). The agent still actively maximizes calendar coverage as a separate mission.

🀝

Human Stewardship

How humans guide the agent's growth

Human Stewardship

How humans guide the agent's growth using Claude Code

🀝
🧠 14 Learning⚑ 17 Optimization✨ 23 New CapabilityπŸ—οΈ 24 Foundation
πŸ—οΈ
❌Problem Identified

The autonomous outer-loop repair agent has been live since 2026-03-29. Over six weeks of operation, the routine prompt grew monotonically from a focused ~250-line workflow to a 700+-line decision tree, with each new failure mode adding 50–200 tokens of "ALWAYS / NEVER" instructions: 2026-04-30 β†’ 2026-05-04 silent orphan-on-sandbox failures (added explicit `git checkout main` + push verification by hash); 2026-05-09 403-on-push to main (added PR-fallback procedure); 2026-05-11 routine wrote contradictory pushed_at fields (added a long "NEVER write success fields..." paragraph in Rules); 2026-05-16 c57f6f7 partial fix landed on main, missed the else-if fallback branch in websearch.js, sat there for 4 days, eventually had to be reverted manually. Each patch made sense individually. Together they degraded the prompt's clarity and the rate of "model misreads its own instructions" rose. By mid-May the outer loop's autonomous-fix success rate had collapsed: 1 clean fix in the last 14 days (5/18 AITX Austin) against 3 PR-fallback events, 2 verify-failed cases, and 1 partial fix that needed rollback. The maintenance cost was now negative β€” the human attention required to disambiguate "did this fix actually ship?" exceeded the time the routine was saving. The same incident (c57f6f7) had also driven the prompt-bloat verdict: the test gate said "tests pass," but the existing tests didn't exercise the call-site branch the fix needed to also cover. A prompt-level "always think about branch coverage" instruction would have been the next patch under the old pattern. The day's question was whether that next patch was the right move at all.

πŸ› οΈAction Taken

Shifted the enforcement architecture from prompt-level to code-level across three layers, then trimmed the prompt to match. (1) **Pre-commit branch-test gate** (`agent/scripts/check-branch-tests.js`): a custom Node script wired into Husky's pre-commit hook that blocks any commit modifying `agent/src/` code without a matching change to the corresponding `*.test.js` file (and at least one new `it()` or `test()` block). 22 unit tests cover the script itself. This is the deterministic floor for the c57f6f7 class β€” the cron literally cannot commit source-without-tests anymore, regardless of what the prompt says. The script exits 0 if no in-scope files changed, exits 1 with a clear message on failure, and respects `git commit --no-verify` as the standard escape hatch. (2) **CHECK constraints on the partitioned `repair_log` table** (migration 045): `repair_log_pushed_implies_passed` (a row claiming `pushed_at IS NOT NULL` must have `test_result = 'passed'`; skipped/failed rows cannot advertise a push), and `repair_log_no_pr_fallback_contradiction` (a row cannot simultaneously have `pushed_at IS NOT NULL` and "PR-fallback opened" in `change_summary`). Both `NOT VALID`, so two pre-fix staleness-resolution rows are tolerated as legacy. Smoke-tested in prod: a deliberately contradictory INSERT was rejected as expected. (3) **Trimmed the routine prompt** from 205 lines / 1961 words to 189 lines / 1068 words β€” a 45% word reduction. Removed: the long "Motivating regressions" historical paragraph (those incidents now have code-level guards), duplicate ALWAYS/NEVER bullets in Rules that already appeared in step text, verbose multi-tenant prose (replaced with a tight bullet list), and the MCP Initialization sub-agent fallback section (per the original memory's "periodically test removing it" β€” tomorrow's run is the test, with a documented 5-minute recovery if the bug returns). (4) **Codified the principle in CLAUDE.md** as the Failure-Mode Triage Rule: when a new failure mode appears in any LLM-driven loop, the first question is "can this be prevented by code (CI, hook, DB constraint, type) rather than the prompt?" If yes, enforce in code and leave the prompt alone. If no, add the minimum necessary prompt instruction and document why no code path exists.

βœ…Result

Two failure classes that previously depended on the model reading and following long lists of instructions are now structurally impossible: source-without-tests commits are blocked at the git layer, and contradictory `pushed_at` / `change_summary` states are rejected at the DB layer. The cron CANNOT make these mistakes anymore β€” not via "the prompt told it not to" but via the system refusing the bad action. The prompt collapsed 45% in word count while keeping the load-bearing scaffolding (explicit git commands for orphan protection, the investigate-before-fixing procedure, multi-tenant SQL rules, the MCP fallback temporarily on the bench for testing). The principle is now in CLAUDE.md as a governing rule for any future failure mode in any LLM-driven loop β€” outer loop, Watson, future scheduled routines β€” so the prompt-bloat pattern doesn't restart on the next incident. The deeper lesson the day surfaced is older than LLM systems: in the reliability-engineering hierarchy of controls (elimination β†’ substitution β†’ engineering controls β†’ administrative β†’ PPE), written rules are administrative controls β€” they depend on the operator following them every time, which gets less reliable as procedures pile up. Engineering controls (CI, hooks, DB constraints) sit above administrative controls in every domain that takes safety seriously β€” aviation, medicine, manufacturing. The outer-loop architecture now matches that hierarchy where the technology supports it. The right division of labor between prompt and code: LLMs are powerful judgment engines but unreliable rule-followers; code/types/schemas are reliable rule-followers but useless at judgment. Match each component to what it's good at. Tomorrow's 3 AM CDT run is the first test of the new architecture β€” the daily report will tell us whether the slimmer prompt still produces working fixes and whether the MCP wrapper removal was the right bet.

⚑
⚑
✨
πŸ—οΈ

This agent is developed iteratively with Claude Code. The collaboration is part of the project's identity.

πŸ‘€
000000Calendar Visits
πŸ€–
000000AI Crawlers