|
|
@@ -16,10 +16,14 @@ This is a scraping pipeline that collects Tasmanian planning development applica
|
|
|
| `lib/enrich.rb` | `enrich_after_upsert!` — geocoding + property lookup after each DB write |
|
|
|
| `lib/util.rb` | `parse_aus_date`, council-name/table-name mappings |
|
|
|
| `lib/scraper_helpers.rb` | Shared helpers: `abs_url`, `text_or`, `upsert_and_enrich!` |
|
|
|
-| `run_all.sh` | Discovers `scrapers/*.rb`, filters by `ONLY`/`SKIP`, runs each with `TABLE_NAME` set |
|
|
|
+| `lib/migrate.rb` | Sequential schema migration runner — add new migrations at end of `MIGRATIONS` array |
|
|
|
+| `lib/llm.php` | LLM inference helper for PHP — calls Ollama-compatible API (llama-swap primary, Ollama fallback) |
|
|
|
+| `run_all.sh` | Discovers `scrapers/*.rb`, filters by `ONLY`/`SKIP`, runs each with `TABLE_NAME` set; prints summary table; emails on error |
|
|
|
| `entrypoint.sh` | Docker entry; waits for DB then runs `run_all.sh` (looping if `SCRAPE_EVERY_MINUTES` is set) |
|
|
|
| `scrapers/*.rb` | One scraper per council — parses HTML, upserts rows, calls `enrich_after_upsert!` |
|
|
|
| `web/index.php` | Search portal — dynamic UNION across all `da_*` tables |
|
|
|
+| `tools/send_summary_email.rb` | Sends HTML error-summary email via SMTP (called by `run_all.sh` when any scraper ERRORs) |
|
|
|
+| `tools/backfill_geocode.rb` | Batch geocode backfill for existing rows (supports `ONLY_TABLE`, `DRY_RUN`) |
|
|
|
|
|
|
---
|
|
|
|
|
|
@@ -36,7 +40,7 @@ docker compose run --rm scraper /app/run_all.sh
|
|
|
TABLE_NAME=da_brighton DEBUG=1 ruby scrapers/brighton.rb
|
|
|
|
|
|
# Run a subset
|
|
|
-ONLY=meandervalley,kent docker compose run --rm scraper /app/run_all.sh
|
|
|
+ONLY=meandervalley,westtamar docker compose run --rm scraper /app/run_all.sh
|
|
|
|
|
|
# Geocode backfill (batch, all tables)
|
|
|
docker compose run --rm \
|
|
|
@@ -68,24 +72,41 @@ docker compose run --rm \
|
|
|
 - Some councils (Kentish, Derwent Valley via direct site) use a Cloudflare JS challenge, which cannot be solved without a real browser. These scrapers exit cleanly with a warning. Where a PlanBuild equivalent exists (council code in `COUNCIL_MAP`), data is still collected via `planbuild.rb`.
|
|
|
- The warmup pattern (custom `CookieJar` + `http_get` with redirect handling) is self-contained in scrapers that need it and does **not** depend on `lib/http.rb`.
|
|
|
|
|
|
-### Write-once fields (in `DB.upsert`):
|
|
|
+### PDF Downloads
|
|
|
+
|
|
|
+- Only happen when `DOWNLOAD_ATTACHMENTS=1` (set in `docker-compose.yml` or at runtime)
|
|
|
+- Files land in `DOWNLOAD_DIR/<councilname>/<ref>/filename.pdf` inside the container
|
|
|
+- The web container mounts the same folder at `/srv/files` and Apache serves it via `Alias /files /srv/files`
|
|
|
+- **`local_document_url` must be stored as `/files/<councilname>/...`** — not `/downloads/...`. The Apache alias is `/files`, not `/downloads`.
|
|
|
+- The web portal prefers `local_document_url` over `document_url` when rendering the document button
|
|
|
+- For multi-document DAs (e.g. Launceston), all docs are stored as JSON in `documents_json` and rendered as a list of buttons in the portal
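The path rules above can be sketched as a pair of helpers. The names `fs_path_for` and `local_url_for` and the `/app/downloads` default are assumptions made for illustration; they are not taken from the repo.

```ruby
# Hypothetical helpers; the repo's real download code may differ.

# Filesystem path where a scraper would write the PDF inside the container.
# The default root is an assumption, overridden by DOWNLOAD_DIR when set.
def fs_path_for(council, ref, filename, root: ENV.fetch("DOWNLOAD_DIR", "/app/downloads"))
  File.join(root, council, ref, filename)
end

# Value to store in local_document_url. It must start with /files/
# so that it matches the Apache alias, never /downloads/.
def local_url_for(council, ref, filename)
  "/" + File.join("files", council, ref, filename)
end
```

Keeping both helpers side by side makes it harder for the stored URL and the on-disk layout to drift apart.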
|
|
|
+
|
|
|
+### Write-once fields (in `DB.upsert`)
|
|
|
+
|
|
|
- `date_received` — never overwritten once set
|
|
|
- `date_received_raw` — never overwritten once non-blank
|
|
|
- `document_url` / `local_document_url` — new value only replaces if existing is NULL
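As a rough illustration of the write-once rules, the sketch below merges an incoming row into an existing one in plain Ruby. `merge_write_once` and `LOCKED_WHEN` are hypothetical names; the real logic lives in `DB.upsert` and may be implemented in SQL instead.

```ruby
# Per-field test for "already set", following the three rules listed above.
LOCKED_WHEN = {
  "date_received"      => ->(v) { !v.nil? },               # never overwritten once set
  "date_received_raw"  => ->(v) { !v.to_s.strip.empty? },  # never overwritten once non-blank
  "document_url"       => ->(v) { !v.nil? },               # replaced only while NULL
  "local_document_url" => ->(v) { !v.nil? }
}.freeze

# Returns a new hash: incoming values win, except for locked write-once fields.
def merge_write_once(existing, incoming)
  incoming.each_with_object(existing.dup) do |(key, value), row|
    locked = LOCKED_WHEN[key]&.call(row[key])
    row[key] = value unless locked
  end
end
```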
|
|
|
|
|
|
-### Table names:
|
|
|
+### Table names
|
|
|
+
|
|
|
- Always derived from the scraper filename: `scrapers/foo.rb` → `da_foo`
|
|
|
- `run_all.sh` sets `TABLE_NAME=da_<basename>` before invoking each scraper
|
|
|
- The `COUNCIL_MAP` in `lib/util.rb` maps internal council keys to table names (used by PlanBuild integration)
|
|
|
|
|
|
+### run_all.sh summary table
|
|
|
+
|
|
|
+- After all scrapers finish, prints a formatted table: Council | Saved | Warns | Status
|
|
|
+- Status values: `ok`, `warn`, `blocked` (Cloudflare), `ERROR` (non-zero exit)
|
|
|
+- Saved count: parsed from scraper stdout — looks for `"Saved N"` (case-insensitive) first, falls back to counting `"Upserted"` lines
|
|
|
+- All scrapers should end with `puts "Done #{TABLE}. Saved #{n} item(s)."` for correct summary parsing
|
|
|
+- If any scraper has ERROR status and `SMTP_HOST` is set, `tools/send_summary_email.rb` sends an HTML summary email
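The Saved-count rule can be expressed in a few lines of Ruby. This is only an equivalent sketch of the described behaviour, since `run_all.sh` itself is a shell script; `saved_count` is a hypothetical name, and taking the first `Saved N` match is an assumption.

```ruby
# Returns the number of saved rows reported in a scraper's stdout.
def saved_count(stdout)
  if (m = stdout.match(/saved\s+(\d+)/i))
    m[1].to_i # preferred: the explicit "Saved N" figure
  else
    # fallback: one "Upserted" line per row written
    stdout.lines.count { |line| line.include?("Upserted") }
  end
end
```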
|
|
|
+
|
|
|
---
|
|
|
|
|
|
## Error Handling Conventions
|
|
|
|
|
|
-After a refactor, the project follows these rules:
|
|
|
-
|
|
|
- **URI building** (`URI.join`, `URI.parse`) → `rescue URI::InvalidURIError`
|
|
|
-- **DB operations** (prepare/execute) → `rescue Mysql2::Error => e; warn "[scraper] ..."`
|
|
|
+- **DB operations** (prepare/execute) → `rescue Mysql2::Error => e; Log.warn ...`
|
|
|
- **Zlib decompression** → `rescue Zlib::Error`
|
|
|
- **Date parsing** (`Date.strptime`, `Date.parse`) → `rescue ArgumentError, Date::Error`
|
|
|
- **JSON parsing** → `rescue JSON::ParserError`
|
|
|
@@ -104,7 +125,10 @@ When a council changes its website markup, only that scraper needs updating. The
|
|
|
- `date_received` all nil — Date format changed; update the format string passed to `Util.parse_aus_date` or `Date.strptime`
|
|
|
|
|
|
**Template choice:**
|
|
|
-- Simple HTML list/table → copy `glamorgan.rb`
|
|
|
+
|
|
|
+- Simple HTML list/table with one entry per row → copy `glamorgan.rb`
|
|
|
+- Single page, entries grouped under `<h2>` headings → copy `northernmidlands.rb`
|
|
|
+- Single page, entries under `<h2>` with labeled `<strong>` fields + PDF in `<ul>` → copy `westtamar.rb`
|
|
|
- Link/PDF listing → copy `centralhighlands.rb`
|
|
|
- WAF-protected site needing homepage warmup → copy `kingisland.rb` (minimal) or `burnie.rb` (full-featured with PDF download)
|
|
|
- Multi-hop redirect to detail pages → copy `derwentvalley.rb`
|
|
|
@@ -120,12 +144,40 @@ The shared infrastructure (`Http`, `DB`, `enrich_after_upsert!`) handles everyth
|
|
|
- Schema changes go in `lib/migrate.rb` (new migration at end of `MIGRATIONS` array) or `lib/db.rb` (`ensure_table!`) for columns every new table gets
|
|
|
- The `geo_cache` table stores geocoding results keyed by SHA1 of the normalised query string — avoids redundant Google API calls
|
|
|
- The `UNIQUE KEY uniq_ref_addr (council_reference, address)` constraint drives the upsert behaviour
|
|
|
+- Current migration versions: v1 (enrichment/geocode columns), v2 (geo_cache table), v3 (documents_json), v4 (Launceston detail columns), v5 (rewrite /downloads/ → /files/ in local_document_url)
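For the `geo_cache` key, a minimal sketch might look like the following. The normalisation steps (trim, collapse whitespace, lowercase) and the name `geo_cache_key` are assumptions; check the repo for the actual rule before relying on matching keys.

```ruby
require "digest/sha1"

# Builds a cache key from an address query.
# Assumed normalisation: trim, collapse runs of whitespace, lowercase.
def geo_cache_key(query)
  normalised = query.to_s.strip.gsub(/\s+/, " ").downcase
  Digest::SHA1.hexdigest(normalised)
end
```

With this kind of normalisation, queries that differ only in spacing or letter case share one cache row, which is what avoids the redundant API calls.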
|
|
|
+
|
|
|
+### Schema — notable columns added beyond base
|
|
|
+
|
|
|
+| Column | Type | Notes |
|
|
|
+| --- | --- | --- |
|
|
|
+| `documents_json` | MEDIUMTEXT | JSON array of `{name, url, local_url}` — used when a DA has multiple PDFs (e.g. Launceston) |
|
|
|
+| `status` | VARCHAR(100) | Application status text (Launceston eProperty) |
|
|
|
+| `assigned_officer` | VARCHAR(255) | Assigned planning officer (Launceston) |
|
|
|
+| `group` | VARCHAR(100) | Application group (Launceston) — reserved SQL word, always quoted |
|
|
|
+| `category` | VARCHAR(100) | Application category (Launceston) |
|
|
|
+| `application_valid` | DATE | Date application deemed valid (Launceston) |
|
|
|
+| `advertised_on` | DATE | Date first advertised (Launceston) |
|
|
|
+| `property_legal_description` | TEXT | Certificate of Title / legal description (Launceston) |
|
|
|
+
|
|
|
+---
|
|
|
|
|
|
## Web Portal Notes
|
|
|
|
|
|
- `web/index.php` dynamically discovers all `da_*` tables and builds a UNION query
|
|
|
- It handles missing columns gracefully (not all tables have every column)
|
|
|
-- `web/backfill_pid_title.php` is a legacy admin tool — it should not be publicly accessible; consider moving it out of the web root or placing it behind authentication
|
|
|
+- Document display: if `documents_json` is present → renders one button per document, labelled with the name from the JSON; otherwise falls back to a single "Open document" button, preferring `local_document_url` over `document_url`
|
|
|
+- `web/backfill_pid_title.php` is a legacy admin tool — it should not be publicly accessible
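The document-display fallback can be sketched as below. The portal itself is PHP, so this Ruby version only illustrates the decision order; `document_buttons` is a hypothetical name, and preferring `local_url` over `url` inside each JSON entry is an assumption that mirrors the single-document case.

```ruby
require "json"

# Returns [[label, url], ...] for the buttons the portal would render.
def document_buttons(row)
  docs = begin
    JSON.parse(row["documents_json"].to_s)
  rescue JSON::ParserError
    [] # missing or malformed JSON falls through to the single-button path
  end

  if docs.is_a?(Array) && !docs.empty?
    docs.map { |d| [d["name"], d["local_url"] || d["url"]] }
  else
    url = row["local_document_url"] || row["document_url"]
    url ? [["Open document", url]] : []
  end
end
```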
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+## Email Summary
|
|
|
+
|
|
|
+`tools/send_summary_email.rb` is called by `run_all.sh` when any scraper exits with ERROR status. It:
|
|
|
+
|
|
|
+- Reads SMTP config from env vars: `SMTP_HOST`, `SMTP_PORT`, `SMTP_USERNAME`, `SMTP_PASSWORD`, `SMTP_SMTPSecure` (`tls`/`ssl`), `SMTP_SENTFROM`, `SMTP_ADDADDRESS`
|
|
|
+- Uses Ruby stdlib `net/smtp` — no gems required
|
|
|
+- Sends multipart (plain + HTML) email with colour-coded summary table
|
|
|
+- Silently skips if `SMTP_HOST` is not set
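A possible shape for the config step is sketched here. The variable names come from the list above; the function name `smtp_config` and the defaults (`587`, `tls`) are assumptions, not values read from `tools/send_summary_email.rb`.

```ruby
# Reads SMTP settings from an ENV-like object (ENV or a plain Hash).
# Returns nil when SMTP_HOST is unset, so the caller can skip sending.
def smtp_config(env = ENV)
  host = env["SMTP_HOST"].to_s.strip
  return nil if host.empty?

  {
    host: host,
    port: Integer(env.fetch("SMTP_PORT", "587")),         # default is an assumption
    username: env["SMTP_USERNAME"],
    password: env["SMTP_PASSWORD"],
    secure: env.fetch("SMTP_SMTPSecure", "tls").downcase, # "tls" or "ssl"
    from: env["SMTP_SENTFROM"],
    to: env["SMTP_ADDADDRESS"]
  }
end
```

Passing the environment in as an argument keeps the function easy to exercise with a plain Hash in tests.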
|
|
|
|
|
|
---
|
|
|
|
|
|
@@ -134,7 +186,142 @@ The shared infrastructure (`Http`, `DB`, `enrich_after_upsert!`) handles everyth
|
|
|
- **`TABLE` constant conflicts**: Each scraper defines `TABLE = ENV.fetch("TABLE_NAME")` at the top level. If you `require` two scrapers in the same Ruby process you'll get a constant redefinition warning. Each scraper is designed to be run as a standalone script.
|
|
|
- **`COUNCIL_FILTER` / `COUNCIL_WHITELIST`**: The `docker-compose.yml` has a `COUNCIL_WHITELIST` env var that is passed to the scraper container but is not wired into `run_all.sh`. Use `ONLY` / `SKIP` in `run_all.sh` instead.
|
|
|
- **PlanBuild scrapers**: `planbuild.rb` handles councils on the state-run PlanBuild portal. It writes to per-council tables using `Util.ref_to_table`. These run alongside the council-specific scrapers.
|
|
|
-- **PDF downloads**: Only happen when `DOWNLOAD_ATTACHMENTS=1`. Files land in `DOWNLOAD_DIR/<councilname>/`. The web portal serves them from `/downloads/` via an Apache alias.
|
|
|
+- **PDF download path**: `local_document_url` must begin with `/files/` (not `/downloads/`). The Apache alias in `web/000-files.conf` is `Alias /files /srv/files`. Using `/downloads/` results in a 404 in the web portal.
|
|
|
+- **Binary PDF downloads**: Pass `headers: { "Accept" => "application/pdf,*/*", "Referer" => URL }` to `Http.get` when downloading PDFs from CDN subdomains — some CDNs reject requests without a valid referrer.
|
|
|
- **Non-ASCII in PDF URLs**: Some council sites embed Unicode characters (e.g. en-dash `–`) directly in PDF filenames. Always percent-encode hrefs before passing to `URI.join` — see `burnie.rb` `first_pdf_on_detail` for the pattern.
|
|
|
- **Redirect loops in `Net::HTTP.start` blocks**: `next` inside a `Net::HTTP.start` block exits the block, not the enclosing `while` loop. Use a `redirect_to` variable set inside the block and call `next` on the `while` loop after the block returns — see `burnie.rb` `http_get_with_cookies`.
|
|
|
 - **Cloudflare JS challenge vs IP block**: A JS challenge (`"Just a moment"`) may be passable from a residential IP but always blocks requests from a datacenter/Docker IP. Detect it and exit cleanly. Sites confirmed blocked in Docker: `derwentvalley.tas.gov.au`, `latrobe.tas.gov.au`.
|
|
|
+- **`group` column**: This is a reserved SQL word. In `DB.upsert` it is safe because all column names are backtick-quoted. In raw SQL always write `` `group` ``.
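For the non-ASCII PDF URL gotcha, one way to percent-encode an href before `URI.join` is sketched below. This is not the code from `burnie.rb`; `encode_href` and `safe_join` are hypothetical names, and the sketch encodes only spaces and non-ASCII characters so that existing escapes and URL punctuation are left alone.

```ruby
require "uri"

# Percent-encode spaces and non-ASCII characters as UTF-8 bytes.
def encode_href(href)
  href.gsub(/[^[:ascii:]]| /) do |ch|
    ch.bytes.map { |b| format("%%%02X", b) }.join
  end
end

# Join a possibly messy href onto a base URL; nil when it still cannot be parsed.
def safe_join(base, href)
  URI.join(base, encode_href(href)).to_s
rescue URI::InvalidURIError
  nil
end
```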
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+## Next Phase — LLM-Based PDF Classification
|
|
|
+
|
|
|
+### Goal
|
|
|
+
|
|
|
+Extract structured information from downloaded DA PDFs using a local LLaMA model — primarily **application type** (Residential, Commercial, Industrial, Subdivision, etc.) but potentially other fields not reliably scraped from HTML (e.g. lot size, number of dwellings, value of works).
|
|
|
+
|
|
|
+### LLM Infrastructure
|
|
|
+
|
|
|
+A local Ollama instance is running at `http://192.168.8.73:11434` (env var: `LLAMA_URL`).
|
|
|
+
|
|
|
+`lib/llm.php` (already in the repo) shows the integration pattern for PHP:
|
|
|
+
|
|
|
+- Primary backend: llama-swap via OpenAI-compatible `/v1/chat/completions`
|
|
|
+- Fallback: Ollama `/api/generate`
|
|
|
+- Config loaded from `config/ai.php` — `LLAMACPP_HOST`, `OLLAMA_HOST`, `LLAMACPP_MODEL`, `OLLAMA_MODEL`, etc.
|
|
|
+
|
|
|
+For the Ruby scraper pipeline the equivalent is a direct Ollama HTTP call (no gems needed — stdlib `net/http`):
|
|
|
+
|
|
|
+```ruby
+# Minimal Ollama call: POST to /api/generate
+require "net/http"
+require "json"
+
+def llm_classify(text, model: "llama3.2")
+  uri = URI("#{ENV.fetch('LLAMA_URL', 'http://192.168.8.73:11434')}/api/generate")
+  body = JSON.generate(model: model, prompt: text, stream: false)
+  res = Net::HTTP.post(uri, body, "Content-Type" => "application/json")
+  # Treat non-2xx responses as failures rather than parsing an error body
+  unless res.is_a?(Net::HTTPSuccess)
+    warn "[llm] HTTP #{res.code}"
+    return nil
+  end
+  JSON.parse(res.body)["response"].to_s.strip
+rescue StandardError => e
+  warn "[llm] #{e.class}: #{e.message}"
+  nil
+end
+```
|
|
|
+
|
|
|
+### Proposed Pipeline
|
|
|
+
|
|
|
+```text
|
|
|
+Downloaded PDF (local_document_url)
|
|
|
+ │
|
|
|
+ ▼
|
|
|
+Extract text (pdftotext CLI or pdf-reader gem)
|
|
|
+ │
|
|
|
+ ▼
|
|
|
+Prompt LLM → application_type string
|
|
|
+ │
|
|
|
+ ▼
|
|
|
+DB.upsert / UPDATE da_* SET application_type = ?
|
|
|
+```
|
|
|
+
|
|
|
+### Suggested Prompt
|
|
|
+
|
|
|
+```text
|
|
|
+You are classifying a Tasmanian planning development application.
|
|
|
+Read the following text and return ONLY the single most appropriate
|
|
|
+application type from this list:
|
|
|
+ Residential, Commercial, Industrial, Subdivision, Rural/Agriculture,
|
|
|
+ Tourism/Visitor Accommodation, Outbuilding/Shed, Change of Use,
|
|
|
+ Demolition, Signage, Other
|
|
|
+
|
|
|
+Text:
|
|
|
+<first 1500 characters of PDF text>
|
|
|
+
|
|
|
+Reply with the type only. No explanation.
|
|
|
+```
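The suggested prompt could be assembled in Ruby along these lines. `build_prompt` and `APPLICATION_TYPES` are hypothetical names; the wording and the 1500-character cut-off are copied from the prompt above.

```ruby
# Type list copied from the suggested prompt.
APPLICATION_TYPES = [
  "Residential", "Commercial", "Industrial", "Subdivision", "Rural/Agriculture",
  "Tourism/Visitor Accommodation", "Outbuilding/Shed", "Change of Use",
  "Demolition", "Signage", "Other"
].freeze

# Builds the classification prompt from the first max_chars of extracted PDF text.
def build_prompt(pdf_text, max_chars: 1500)
  excerpt = pdf_text.to_s[0, max_chars]
  <<~PROMPT
    You are classifying a Tasmanian planning development application.
    Read the following text and return ONLY the single most appropriate
    application type from this list:
      #{APPLICATION_TYPES.join(', ')}

    Text:
    #{excerpt}

    Reply with the type only. No explanation.
  PROMPT
end
```

Keeping the type list in one constant means the prompt and any later validation of the model's reply can share the same vocabulary.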
|
|
|
+
|
|
|
+### Schema Changes Needed
|
|
|
+
|
|
|
+```sql
|
|
|
+-- Add to ensure_table! and as a new migration:
|
|
|
+application_type VARCHAR(60) NULL -- e.g. "Residential", "Subdivision"
|
|
|
+application_type_raw TEXT NULL -- full LLM response for debugging
|
|
|
+application_type_at DATETIME NULL -- when classification was last run
|
|
|
+```
|
|
|
+
|
|
|
+### Implementation Options
|
|
|
+
|
|
|
+**Option A — Inline during scrape** (simplest):
|
|
|
+
|
|
|
+- Each scraper that downloads PDFs calls `llm_classify` immediately after download
|
|
|
+- Adds latency to each scrape run (LLM inference per PDF)
|
|
|
+- Suitable if the LLM is fast (< 5s per classification)
|
|
|
+
|
|
|
+**Option B — Backfill tool** (recommended):
|
|
|
+
|
|
|
+- New script `tools/classify_pdfs.rb` — iterates rows where `local_document_url IS NOT NULL AND application_type IS NULL`
|
|
|
+- Run separately from `run_all.sh`, on demand or on a cron
|
|
|
+- Supports `ONLY_TABLE` env var to process one council at a time
|
|
|
+- Safer — scrape failures don't block classification; can re-run without re-scraping
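The row selection for Option B might be built as follows. `tools/classify_pdfs.rb` does not exist yet, so everything here is a sketch: `pending_rows_sql` is a hypothetical name, and the `id` column and the batch limit are assumptions.

```ruby
# Builds the SELECT for rows that have a local PDF but no classification yet.
# The table name is validated because it is interpolated into the SQL string.
def pending_rows_sql(table, limit: 100)
  unless table.match?(/\Ada_[a-z0-9_]+\z/)
    raise ArgumentError, "unexpected table name: #{table.inspect}"
  end

  "SELECT id, local_document_url FROM `#{table}` " \
    "WHERE local_document_url IS NOT NULL AND application_type IS NULL " \
    "LIMIT #{Integer(limit)}"
end
```

Processing in limited batches lets the tool be interrupted and re-run, since classified rows drop out of the selection on the next pass.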
|
|
|
+
|
|
|
+**Option C — PHP tool in web container**:
|
|
|
+
|
|
|
+- New `tools/classify_pdfs.php` using the existing `lib/llm.php`
|
|
|
+- Reads PDFs from `/srv/files`, calls `llmGenerate`, updates DB
|
|
|
+- Advantage: reuses the already-written PHP LLM helper
|
|
|
+- Disadvantage: PDF text extraction harder in PHP (needs `pdftotext` shell call or a PHP PDF lib)
|
|
|
+
|
|
|
+### PDF Text Extraction
|
|
|
+
|
|
|
+`pdftotext` (part of `poppler-utils`) is the most reliable option:
|
|
|
+
|
|
|
+```ruby
+require "open3"
+
+def extract_pdf_text(local_path, max_chars: 2000)
+  # local_path is the stored URL path, e.g. "/files/northernmidlands/PLN-26-0030/doc.pdf"
+  # Map it to the filesystem path inside the container
+  fs_path = local_path.sub(%r{\A/files/}, "#{ENV.fetch('DOWNLOAD_DIR', '/app/downloads')}/")
+  return nil unless File.exist?(fs_path)
+
+  # -l 3 stops after page 3; the trailing "-" writes the text to stdout
+  text, status = Open3.capture2("pdftotext", "-l", "3", fs_path, "-")
+  return nil unless status.success?
+
+  text.to_s.gsub(/\s+/, " ").strip[0, max_chars]
+rescue StandardError => e
+  warn "[classify] pdftotext failed for #{fs_path}: #{e.message}"
+  nil
+end
+```
|
|
|
+
|
|
|
+`pdftotext` may need to be installed in the scraper Dockerfile:
|
|
|
+
|
|
|
+```dockerfile
|
|
|
+RUN apt-get update && apt-get install -y --no-install-recommends poppler-utils && rm -rf /var/lib/apt/lists/*
|
|
|
+```
|
|
|
+
|
|
|
+### Key Decisions Before Implementation
|
|
|
+
|
|
|
+1. **Option A vs B vs C** — inline vs backfill tool vs PHP
|
|
|
+2. **Which model** — any Ollama model on the local server (check with `curl http://192.168.8.73:11434/api/tags`)
|
|
|
+3. **Prompt design** — zero-shot classification vs few-shot examples; JSON output vs plain text
|
|
|
+4. **Confidence threshold** — store raw LLM response for auditing? Flag low-confidence results?
|
|
|
+5. **Re-classification** — should existing `application_type` values be overwritten on re-run, or treated as write-once?
|
|
|
+6. **Dockerfile change** — confirm `poppler-utils` can be added to the scraper image
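Whatever is decided for points 3 to 5, the raw model reply will probably need mapping onto the fixed type list before it is stored. A minimal sketch, assuming plain-text output and hypothetical names `ALLOWED_TYPES` and `normalise_type`:

```ruby
# Same vocabulary as the suggested prompt.
ALLOWED_TYPES = [
  "Residential", "Commercial", "Industrial", "Subdivision", "Rural/Agriculture",
  "Tourism/Visitor Accommodation", "Outbuilding/Shed", "Change of Use",
  "Demolition", "Signage", "Other"
].freeze

# Maps a raw LLM reply to a canonical type, or nil when it is not on the list.
def normalise_type(raw)
  cleaned = raw.to_s.strip.sub(/[.\s]+\z/, "") # drop trailing full stops and whitespace
  ALLOWED_TYPES.find { |type| type.casecmp?(cleaned) }
end
```

A `nil` result is a natural signal for flagging a row for review, with the unmodified reply kept in `application_type_raw` for auditing.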
|
|
|
+
|