
LLM Readme Update

Benjamin Harris, 2 months ago
Parent
Commit
3642c1be2a
2 changed files with 74 additions and 15 deletions
  1. CLAUDE.md (+15 -11)
  2. README.md (+59 -4)

+ 15 - 11
CLAUDE.md

@@ -9,7 +9,7 @@ This is a scraping pipeline that collects Tasmanian planning development applica
 ## Key Files
 
 | File | Role |
-|---|---|
+| --- | --- |
 | `lib/db.rb` | DB client, `ensure_table!`, `upsert` (dynamic columns, write-once semantics) |
 | `lib/http.rb` | HTTP client — retries, cookie jar, browser-fingerprint headers, 403/406 warmup, curl fallback |
 | `lib/geocode.rb` | Google Maps geocoding with SHA1 cache in `geo_cache` table |
@@ -23,6 +23,7 @@ This is a scraping pipeline that collects Tasmanian planning development applica
 | `scrapers/*.rb` | One scraper per council — parses HTML, upserts rows, calls `enrich_after_upsert!` |
 | `web/index.php` | Search portal — dynamic UNION across all `da_*` tables |
 | `tools/send_summary_email.rb` | Sends HTML error-summary email via SMTP (called by `run_all.sh` when any scraper ERRORs) |
+| `tools/classify_pdfs.rb` | LLM PDF classification backfill — sets `application_type` on rows with a downloaded PDF |
 | `tools/backfill_geocode.rb` | Batch geocode backfill for existing rows (supports `ONLY_TABLE`, `DRY_RUN`) |
 
 ---
@@ -58,7 +59,8 @@ docker compose run --rm \
 
 ## Architecture Conventions
 
-### Each scraper follows this pattern:
+### Each scraper follows this pattern
+
 1. `TABLE = ENV.fetch("TABLE_NAME")` — set by `run_all.sh` from the filename
 2. `DB.ensure_table!(TABLE)` — idempotent schema setup (all columns already included)
 3. Fetch HTML via `Http.get(url)` (handles retries, cookies, WAF warmup, browser-fingerprint headers)
@@ -66,7 +68,8 @@ docker compose run --rm \
 5. `DB.upsert(TABLE, row)` — upserts on `(council_reference, address)`, write-once for `date_received`
 6. `enrich_after_upsert!(table:, council_reference:, address:)` — geocodes and enriches
 
-### WAF / Cloudflare handling:
+### WAF / Cloudflare handling
+
 - `lib/http.rb` sends a full browser fingerprint on every request: `User-Agent`, `sec-ch-ua*`, `Sec-Fetch-*`, `Upgrade-Insecure-Requests`. This satisfies most WAF header checks automatically.
 - For sites that also need a **warm cookie state** (e.g. Burnie, King Island, Latrobe, Derwent Valley), the scraper implements a proactive homepage warmup before fetching the target page — see `burnie.rb` as the reference implementation.
 - Some councils (Kentish, Derwent Valley via direct site) use Cloudflare JS challenge which cannot be solved without a real browser. These exit cleanly with a warning. Where a PlanBuild equivalent exists (council code in `COUNCIL_MAP`), data is still collected via `planbuild.rb`.
@@ -119,6 +122,7 @@ docker compose run --rm \
 ## Adding or Modifying a Scraper
 
 When a council changes its website markup, only that scraper needs updating. The typical failure mode is:
+
 - `Found 0 rows` — CSS selector no longer matches; inspect the live page and update the selector
 - HTTP 403/406 — Council site added WAF; check `Http.get` options or add a proactive warmup step (see `burnie.rb`)
 - Cloudflare JS challenge (`"Just a moment"` in body) — cannot be solved in Ruby; exit cleanly with a warning
@@ -316,12 +320,12 @@ end
 RUN apt-get install -y poppler-utils
 ```
 
-### Key Decisions Before Implementation
-
-1. **Option A vs B vs C** — inline vs backfill tool vs PHP
-2. **Which model** — any Ollama model on the local server (check with `curl http://192.168.8.73:11434/api/tags`)
-3. **Prompt language** — zero-shot classification vs few-shot examples; JSON output vs plain text
-4. **Confidence threshold** — store raw LLM response for auditing? Flag low-confidence results?
-5. **Re-classification** — should existing `application_type` values be overwritten on re-run, or treated as write-once?
-6. **Dockerfile change** — confirm `poppler-utils` can be added to the scraper image
+### Implementation
 
+- **Approach**: Option B (backfill tool) — `tools/classify_pdfs.rb` runs independently of `run_all.sh`
+- **Model**: `llama3.2` (3B, fast) by default; override with `LLM_MODEL` env var
+- **Prompt**: Zero-shot, plain-text response (no JSON overhead for a fixed classification list)
+- **Raw response**: Always stored in `application_type_raw` for auditing
+- **Re-classification**: Write-once by default; set `RECLASSIFY=1` to overwrite
+- **PDF extraction**: `pdftotext -l 3` (first 3 pages, from `poppler-utils` in Dockerfile)
+- **Qwen3 note**: The tool strips `<think>...</think>` tags from models that produce reasoning output before normalising the response
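
The last two implementation notes (plain-text response, `<think>` stripping before normalisation) can be sketched roughly as follows. This is a minimal illustration and not the code in `tools/classify_pdfs.rb`: the helper names `strip_think` and `normalise_type`, the substring-matching rule, and the fallback to `Other` are all assumptions. The type list is the one given in the README section of this commit.

```ruby
# Hypothetical sketch; names and matching rule are assumptions,
# not taken from tools/classify_pdfs.rb.
APPLICATION_TYPES = [
  "Residential", "Commercial", "Industrial", "Subdivision",
  "Rural/Agriculture", "Tourism/Visitor Accommodation", "Outbuilding/Shed",
  "Change of Use", "Demolition", "Signage", "Other"
].freeze

# Remove <think>...</think> blocks that reasoning models (e.g. Qwen3)
# emit before the actual answer. The /m flag lets "." span newlines.
def strip_think(raw)
  raw.gsub(%r{<think>.*?</think>}m, "").strip
end

# Map a free-text reply onto the fixed list using a case-insensitive
# substring match; anything unrecognised falls back to "Other".
def normalise_type(raw)
  cleaned = strip_think(raw).downcase
  APPLICATION_TYPES.find { |type| cleaned.include?(type.downcase) } || "Other"
end
```

In a flow like this, the unmodified reply would still go to `application_type_raw`, so a wrong normalisation can be audited and corrected on a later `RECLASSIFY=1` run.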

+ 59 - 4
README.md

@@ -77,6 +77,15 @@ Copy `.env.example` to `.env` and fill in the values. **Never commit `.env`.**
 | `SCRAPE_EVERY_MINUTES` | No | If set, the scraper loops on this interval (default: run once) |
 | `DOWNLOAD_ATTACHMENTS` | No | Set to `1` to download PDF attachments |
 | `DOWNLOAD_DIR` | No | Host path for downloaded PDFs (default: `/app/downloads`) |
+| `LLAMA_URL` | No | Base URL of local Ollama instance for PDF classification (default: `http://192.168.8.73:11434`) |
+| `LLM_MODEL` | No | Ollama model name for PDF classification (default: `llama3.2`) |
+| `SMTP_HOST` | No | SMTP server for error summary emails |
+| `SMTP_PORT` | No | SMTP port (default: `587`) |
+| `SMTP_USERNAME` | No | SMTP username |
+| `SMTP_PASSWORD` | No | SMTP password |
+| `SMTP_SMTPSecure` | No | `tls` or `ssl` (default: `tls`) |
+| `SMTP_SENTFROM` | No | Sender email address |
+| `SMTP_ADDADDRESS` | No | Recipient email address |
 | `DEBUG` | No | Set to `1` for verbose scraper output |
 | `DRY_RUN` | No | Set to `1` to parse without writing to the DB |
 | `ENRICH_DEBUG` | No | Set to `1` for verbose geocode/lookup output |
@@ -150,14 +159,24 @@ Every `da_*` table shares the same base schema:
 | `on_notice_to` | `DATE` | Public comment close date |
 | `applicant` | `VARCHAR(255)` | |
 | `document_url` | `TEXT` | Remote PDF URL |
-| `local_document_url` | `TEXT` | Downloaded PDF path (relative to `/downloads`) |
+| `local_document_url` | `TEXT` | Downloaded PDF path (served via `/files/`) |
+| `documents_json` | `MEDIUMTEXT` | JSON array of `{name, url, local_url}` — multi-doc DAs (e.g. Launceston) |
 | `address_std` | `VARCHAR(255)` | Google-normalised address |
 | `lat` / `lng` | `DECIMAL(10,7)` | Geocoded coordinates |
 | `property_id` | `TEXT` | Land title PID |
 | `title_reference` | `TEXT` | Certificate of title reference |
+| `application_type` | `VARCHAR(60)` | LLM-classified type (e.g. `Residential`, `Subdivision`) |
+| `application_type_raw` | `TEXT` | Raw LLM response (for auditing) |
+| `application_type_at` | `DATETIME` | When classification was last run |
+| `status` | `VARCHAR(100)` | Application status (Launceston eProperty) |
+| `assigned_officer` | `VARCHAR(255)` | Assigned planning officer (Launceston) |
+| `category` | `VARCHAR(100)` | Application category (Launceston) |
+| `application_valid` | `DATE` | Date application was deemed valid (Launceston) |
+| `advertised_on` | `DATE` | Date first advertised (Launceston) |
+| `property_legal_description` | `TEXT` | Certificate of title / legal description (Launceston) |
 | `created_at` / `updated_at` | `DATETIME` | |
 
-Rows are upserted on `(council_reference, address)`. Some fields are **write-once** (e.g. `date_received`) — the first value is kept on subsequent scrapes.
+Rows are upserted on `(council_reference, address)`. Some fields are **write-once** (e.g. `date_received`, `document_url`) — the first value is kept on subsequent scrapes.
 
 ---
 
@@ -199,7 +218,7 @@ docker compose run --rm \
 
 For sites that additionally require a **warm cookie state**, the scraper does a proactive homepage GET before fetching the target URL. See `burnie.rb` for the reference implementation of this pattern (custom `CookieJar` class + `http_get_with_cookies`). Scrapers using this pattern: `burnie.rb`, `kingisland.rb`, `latrobe.rb`, `derwentvalley.rb`.
 
-**Cloudflare JS challenge** (the `"Just a moment..."` interstitial) cannot be solved from a Docker/server environment regardless of headers, because Cloudflare's bot score is based on the client IP reputation. Sites confirmed blocked in Docker: `derwentvalley.tas.gov.au`, `latrobe.tas.gov.au`. These scrapers detect the challenge, log a warning, and exit cleanly.
+**Cloudflare JS challenge** (the `"Just a moment..."` interstitial) cannot be solved from a Docker/server environment regardless of headers, because Cloudflare's bot score is based on the client IP reputation. Sites confirmed blocked in Docker: `derwentvalley.tas.gov.au`, `latrobe.tas.gov.au`, `kentish.tas.gov.au`, `centralhighlands.tas.gov.au`. These scrapers detect the challenge, log a warning, and exit cleanly. Where a PlanBuild equivalent exists, data is still collected via `planbuild.rb`.
 
 ---
 
@@ -216,11 +235,46 @@ For sites that additionally require a **warm cookie state**, the scraper does a
 
 ---
 
+## PDF Classification (LLM)
+
+After PDFs are downloaded, `tools/classify_pdfs.rb` extracts text from each PDF using `pdftotext` and sends it to a local Ollama instance to classify the application type.
+
+**Application types:** Residential, Commercial, Industrial, Subdivision, Rural/Agriculture, Tourism/Visitor Accommodation, Outbuilding/Shed, Change of Use, Demolition, Signage, Other
+
+```bash
+# Classify all unclassified PDFs (dry run first)
+docker compose run --rm -e DRY_RUN=1 scraper ruby /app/tools/classify_pdfs.rb
+
+# Run for real
+docker compose run --rm scraper ruby /app/tools/classify_pdfs.rb
+
+# Single council
+docker compose run --rm -e ONLY_TABLE=da_northernmidlands scraper ruby /app/tools/classify_pdfs.rb
+
+# Re-classify existing (overwrite)
+docker compose run --rm -e RECLASSIFY=1 scraper ruby /app/tools/classify_pdfs.rb
+
+# Use a different model
+docker compose run --rm -e LLM_MODEL=gemma3 scraper ruby /app/tools/classify_pdfs.rb
+```
+
+Results are written to `application_type`, `application_type_raw` (full LLM response for auditing), and `application_type_at` (timestamp). The web portal displays the type as a badge and supports filtering by type.
+
+---
+
+## Error Summary Emails
+
+When any scraper exits with an error, `run_all.sh` automatically calls `tools/send_summary_email.rb` to send an HTML summary email if `SMTP_HOST` is configured in `.env`. The email contains a colour-coded table of all scrapers with their saved counts and error status.
+
+---
+
 ## Tools
 
 | Script | Purpose |
 |---|---|
 | `tools/backfill_geocode.rb` | Batch geocode backfill for existing rows (supports `ONLY_TABLE`, `DRY_RUN`) |
+| `tools/classify_pdfs.rb` | LLM classification of downloaded PDFs — sets `application_type` on each row |
+| `tools/send_summary_email.rb` | Sends HTML error-summary email via SMTP — called by `run_all.sh` on ERROR |
 | `tools/backfill_dorset_docs.rb` | Backfill PDF links for Dorset rows |
 | `tools/import_sqlites.rb` | Import data from legacy SQLite exports |
 | `planbuild_fetch.js` | Playwright-based scraper for the PlanBuild TAS portal |
@@ -238,7 +292,8 @@ tas_councils/
 │   ├── enrich.rb         # Post-upsert enrichment pipeline
 │   ├── util.rb           # Date parsing, council/table name mappings
 │   ├── scraper_helpers.rb# Shared helpers: abs_url, text_or, upsert_and_enrich!
-│   └── migrate.rb        # Sequential schema migration runner
+│   ├── migrate.rb        # Sequential schema migration runner
+│   └── llm.php           # LLM inference helper for PHP (llama-swap + Ollama)
 ├── scrapers/             # One .rb file per council
 ├── web/                  # PHP search portal (Apache)
 ├── tools/                # Standalone backfill and migration scripts