updates and readme

Benjamin Harris 2 months ago
parent
commit
0639f40eb9

+ 3 - 1
.claude/settings.local.json

@@ -3,7 +3,9 @@
     "allow": [
       "Bash(ls /f/GIT_REPO/tas_councils/scrapers/*.rb)",
       "Bash(grep -c \"def \" /f/GIT_REPO/tas_councils/scrapers/*.rb)",
-      "Bash(sort -t: -k2 -rn)"
+      "Bash(sort -t: -k2 -rn)",
+      "Bash(grep -v \"next\\\\s*$\")",
+      "Bash(grep -n -B3 \"^\\\\s*rescue\\\\s*$\\\\|^rescue\\\\s*$\" \"f:/GIT_REPO/tas_councils/scrapers/centralcoast.rb\" \"f:/GIT_REPO/tas_councils/scrapers/circularhead.rb\" \"f:/GIT_REPO/tas_councils/scrapers/clarence.rb\" \"f:/GIT_REPO/tas_councils/scrapers/flinders_council.rb\" \"f:/GIT_REPO/tas_councils/scrapers/georgetown.rb\" \"f:/GIT_REPO/tas_councils/scrapers/glamorgan.rb\" \"f:/GIT_REPO/tas_councils/scrapers/glenorchy.rb\" \"f:/GIT_REPO/tas_councils/scrapers/huonvalley.rb\" \"f:/GIT_REPO/tas_councils/scrapers/tasman.rb\" \"f:/GIT_REPO/tas_councils/scrapers/westtamar.rb\" \"f:/GIT_REPO/tas_councils/scrapers/meandervalley.rb\" \"f:/GIT_REPO/tas_councils/scrapers/devonportcity.rb\" \"f:/GIT_REPO/tas_councils/scrapers/dorset.rb\" \"f:/GIT_REPO/tas_councils/scrapers/launcestoncity.rb\" \"f:/GIT_REPO/tas_councils/scrapers/waratah_wynyard.rb\")"
     ]
   }
 }

+ 1 - 0
.gitignore

@@ -0,0 +1 @@
+/downloads

+ 117 - 0
CLAUDE.md

@@ -0,0 +1,117 @@
+# CLAUDE.md — Project Guide for Claude Code
+
+## What This Project Does
+
+This is a scraping pipeline that collects Tasmanian planning development applications (DAs) from 29 council websites, geocodes them, and serves them via a PHP search portal. The code is a mix of Ruby (scrapers, enrichment), PHP (web portal), and shell (orchestration), all wired together with Docker Compose.
+
+---
+
+## Key Files
+
+| File | Role |
+|---|---|
+| `lib/db.rb` | DB client, `ensure_table!`, `upsert` (with write-once semantics for some fields) |
+| `lib/http.rb` | HTTP client — retries, cookie jar, 403/406 warmup, curl fallback |
+| `lib/geocode.rb` | Google Maps geocoding with SHA1 cache in `geo_cache` table |
+| `lib/enrich.rb` | `enrich_after_upsert!` — geocoding + property lookup after each DB write |
+| `lib/util.rb` | `parse_aus_date`, council-name/table-name mappings |
+| `run_all.sh` | Discovers `scrapers/*.rb`, filters by `ONLY`/`SKIP`, runs each with `TABLE_NAME` set |
+| `entrypoint.sh` | Docker entry; waits for DB then runs `run_all.sh` (looping if `SCRAPE_EVERY_MINUTES` is set) |
+| `scrapers/*.rb` | One scraper per council — parses HTML, upserts rows, calls `enrich_after_upsert!` |
+| `web/index.php` | Search portal — dynamic UNION across all `da_*` tables |
+
+---
+
+## Running Things Locally
+
+```bash
+# Full stack
+docker compose up -d
+
+# Run all scrapers once
+docker compose run --rm scraper /app/run_all.sh
+
+# Run a single scraper
+TABLE_NAME=da_brighton DEBUG=1 ruby scrapers/brighton.rb
+
+# Run a subset
+ONLY=meandervalley,kentish docker compose run --rm scraper /app/run_all.sh
+
+# Geocode backfill
+docker compose run --rm \
+  -e GOOGLE_MAPS_API_KEY="..." \
+  scraper ruby /app/tools/enrich.rb --table=da_brighton
+```
+
+---
+
+## Architecture Conventions
+
+### Each scraper follows this pattern:
+1. `TABLE = ENV.fetch("TABLE_NAME")` — set by `run_all.sh` from the filename
+2. `DB.ensure_table!(TABLE)` + `ensure_extra_columns!(TABLE)` — idempotent schema setup
+3. Fetch HTML via `Http.get(url)` (handles retries, cookies, WAF warmup)
+4. Parse with Nokogiri
+5. `DB.upsert(TABLE, row)` — upserts on `(council_reference, address)`, write-once for `date_received`
+6. `enrich_after_upsert!(table:, council_reference:, address:)` — geocodes and enriches
+
+### Write-once fields (in `DB.upsert`):
+- `date_received` — never overwritten once set
+- `date_received_raw` — never overwritten once non-blank
+- `document_url` / `local_document_url` — new value only replaces if existing is NULL
+
+### Table names:
+- Always derived from the scraper filename: `scrapers/foo.rb` → `da_foo`
+- `run_all.sh` sets `TABLE_NAME=da_<basename>` before invoking each scraper
+- The `COUNCIL_MAP` in `lib/util.rb` maps internal council keys to table names (used by PlanBuild integration)
+
+---
+
+## Error Handling Conventions
+
+After a refactor, the project follows these rules:
+
+- **URI building** (`URI.join`, `URI.parse`) → `rescue URI::InvalidURIError`
+- **DB operations** (prepare/execute) → `rescue Mysql2::Error => e; warn "[scraper] ..."`
+- **Zlib decompression** → `rescue Zlib::Error`
+- **Date parsing** (`Date.strptime`, `Date.parse`) → `rescue ArgumentError, Date::Error`
+- **JSON parsing** → `rescue JSON::ParserError`
+- **Network/HTTP** → `rescue Net::HTTPError, Net::ReadTimeout, Net::OpenTimeout, OpenSSL::SSL::SSLError, Errno::ECONNRESET, EOFError`
+- **Enrichment failures** always `warn` to stderr — do not gate them behind `ENRICH_DEBUG`
+- **No bare `rescue`** — always specify the exception class(es)
+
+---
+
+## Adding or Modifying a Scraper
+
+When a council changes its website markup, only that scraper needs updating. The typical failure modes are:
+- `Found 0 rows` — CSS selector no longer matches; inspect the live page and update the selector
+- HTTP 403/406 — Council site added WAF; check `Http.get` options or add a warmup step
+- `date_received` all nil — Date format changed; update the format string passed to `Util.parse_aus_date` or `Date.strptime`
+
+To add a new scraper, copy a structurally similar one (e.g. `glamorgan.rb` for table-based sites, `centralhighlands.rb` for link/PDF-based sites) and adapt the parsing logic. The shared infrastructure (`Http`, `DB`, `enrich_after_upsert!`) handles everything else.
+
+---
+
+## Database Notes
+
+- MariaDB 10.11, `utf8mb4` encoding throughout
+- Schema is created on-the-fly — `CREATE TABLE IF NOT EXISTS` + `ALTER TABLE ... ADD COLUMN IF NOT EXISTS`
+- There is no migration framework; schema changes go in `lib/db.rb` (`ensure_table!`) or `lib/enrich.rb` (`ensure_extra_columns!`)
+- The `geo_cache` table stores geocoding results keyed by SHA1 of the normalised query string — avoids redundant Google API calls
+- The `UNIQUE KEY uniq_ref_addr (council_reference, address)` constraint drives the upsert behaviour
+
+## Web Portal Notes
+
+- `web/index.php` dynamically discovers all `da_*` tables and builds a UNION query
+- It handles missing columns gracefully (not all tables have every column)
+- `web/backfill_pid_title.php` is a legacy admin tool — it should not be publicly accessible; consider moving it out of the web root or placing it behind authentication
+
+---
+
+## Common Gotchas
+
+- **`TABLE` constant conflicts**: Each scraper defines `TABLE = ENV.fetch("TABLE_NAME")` at the top level. If you `require` two scrapers in the same Ruby process you'll get a constant redefinition warning. Each scraper is designed to be run as a standalone script.
+- **`COUNCIL_FILTER` / `COUNCIL_WHITELIST`**: The `docker-compose.yml` has a `COUNCIL_WHITELIST` env var that is passed to the scraper container but is not wired into `run_all.sh`. Use `ONLY` / `SKIP` in `run_all.sh` instead.
+- **PlanBuild scrapers**: `planbuild.rb` and `planbuild_fetch.js` handle councils on the state-run PlanBuild portal. They write to per-council tables using `Util.ref_to_table`. These are separate from the council-specific scrapers.
+- **PDF downloads**: Only happen when `DOWNLOAD_ATTACHMENTS=1`. Files land in `DOWNLOAD_DIR/<councilname>/`. The web portal serves them from `/downloads/` via an Apache alias.
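
Note on the write-once rules in CLAUDE.md: the sketch below models them on plain hashes so the behaviour can be checked without a database. It is not the code in `lib/db.rb`; `merge_write_once`, `blank?` and `WRITE_ONCE` are hypothetical names, and treating an empty string like NULL for every write-once column is a simplifying assumption.

```ruby
# Hypothetical model of the write-once rules described in CLAUDE.md.
# The real logic lives in DB.upsert (lib/db.rb), which is not shown in this commit.
WRITE_ONCE = %i[date_received date_received_raw document_url local_document_url].freeze

# Assumption: nil and whitespace-only strings both count as "not set".
def blank?(value)
  value.nil? || value.to_s.strip.empty?
end

def merge_write_once(existing, incoming)
  incoming.each_with_object(existing.dup) do |(key, value), merged|
    # A write-once column keeps its first non-blank value.
    next if WRITE_ONCE.include?(key) && !blank?(existing[key])
    merged[key] = value
  end
end

stored  = { council_reference: "DA-1", date_received: "2025-08-19",
            document_url: nil, description: "Shed" }
scraped = { date_received: "2025-08-20",
            document_url: "https://example.com/da-1.pdf",
            description: "Shed and carport" }
merged  = merge_write_once(stored, scraped)
```

In this model `date_received` keeps its stored value, `document_url` is filled because it was empty, and `description` is overwritten as an ordinary column.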

+ 234 - 0
README.md

@@ -0,0 +1,234 @@
+# TAS Councils Planning Applications Scraper
+
+A web scraping and data aggregation system for Tasmanian development applications (DAs). It collects planning application notices from all 29 Tasmanian council websites, normalises and geocodes the data, and exposes it via a PHP search portal.
+
+---
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────┐
+│  29 Ruby scrapers  (scrapers/*.rb)                  │
+│  Each polls one council website on a schedule       │
+└──────────────────────┬──────────────────────────────┘
+                       │ upserts rows
+                       ▼
+               MariaDB (da_* tables)
+                       │
+         ┌─────────────┴─────────────┐
+         │                           │
+   PHP web portal              Adminer UI
+   (web/index.php)             port 9980
+   port 9981
+```
+
+**Services (Docker Compose):**
+
+| Service | Image | Port | Purpose |
+|---|---|---|---|
+| `db` | `mariadb:10.11` | 3306 | Database |
+| `scraper` | Custom (Ruby 3.2) | — | Runs all scrapers on a schedule |
+| `web` | Custom (PHP/Apache) | 9981 | Search portal |
+| `adminer` | `adminer` | 9980 | Database admin UI |
+
+---
+
+## Quick Start
+
+### 1. Copy and configure environment
+
+```bash
+cp .env.example .env
+# Edit .env — set DB passwords and your Google Maps API key
+```
+
+### 2. Start all services
+
+```bash
+docker compose up -d
+```
+
+- Web portal: http://localhost:9981
+- Adminer:    http://localhost:9980
+
+### 3. Run scrapers manually (once)
+
+```bash
+docker compose run --rm scraper /app/run_all.sh
+```
+
+---
+
+## Environment Variables
+
+Copy `.env.example` to `.env` and fill in the values. **Never commit `.env`.**
+
+| Variable | Required | Description |
+|---|---|---|
+| `MYSQL_DATABASE` | Yes | Database name (default: `planning_scrapes`) |
+| `MYSQL_USER` | Yes | Database username |
+| `MYSQL_PASSWORD` | Yes | Database password |
+| `MYSQL_ROOT_PASSWORD` | Yes | MariaDB root password |
+| `GOOGLE_MAPS_API_KEY` | Yes | Used to geocode DA addresses |
+| `LOOKUP_URL` | No | URL of the property lookup service (PID/title enrichment) |
+| `LOOKUP_THROTTLE_MS` | No | Milliseconds between lookup requests (default: 150) |
+| `SCRAPE_EVERY_MINUTES` | No | If set, the scraper loops on this interval (default: run once) |
+| `DOWNLOAD_ATTACHMENTS` | No | Set to `1` to download PDF attachments |
+| `DOWNLOAD_DIR` | No | Host path for downloaded PDFs (default: `/app/downloads`) |
+| `DEBUG` | No | Set to `1` for verbose scraper output |
+| `DRY_RUN` | No | Set to `1` to parse without writing to the DB |
+| `ENRICH_DEBUG` | No | Set to `1` for verbose geocode/lookup output |
+| `ALLOW_INSECURE` | No | Set to `1` to skip SSL verification (use only for legacy council sites) |
+
+---
+
+## Running Scrapers Selectively
+
+Use `ONLY` or `SKIP` environment variables with `run_all.sh`. Values are comma-separated scraper names (filename without `.rb`).
+
+```bash
+# Run only two councils
+ONLY=meandervalley,kentish docker compose run --rm scraper /app/run_all.sh
+
+# Run all except one
+SKIP=hobartcity docker compose run --rm scraper /app/run_all.sh
+```
+
+---
+
+## Council → Table Mapping
+
+Each scraper writes to its own `da_*` table. The table name is derived from the scraper filename.
+
+| Council | Scraper file | DB table |
+|---|---|---|
+| Break O'Day | `break_oday.rb` | `da_break_oday` |
+| Brighton | `brighton.rb` | `da_brighton` |
+| Burnie | `burnie.rb` | `da_burnie` |
+| Central Coast | `centralcoast.rb` | `da_centralcoast` |
+| Central Highlands | `centralhighlands.rb` | `da_centralhighlands` |
+| Circular Head | `circularhead.rb` | `da_circularhead` |
+| Clarence | `clarence.rb` | `da_clarence` |
+| Derwent Valley | `derwentvalley.rb` | `da_derwentvalley` |
+| Devonport | `devonportcity.rb` | `da_devonportcity` |
+| Dorset | `dorset.rb` | `da_dorset` |
+| Flinders | `flinders_council.rb` | `da_flinders_council` |
+| George Town | `georgetown.rb` | `da_georgetown` |
+| Glamorgan Spring Bay | `glamorgan.rb` | `da_glamorgan` |
+| Glenorchy | `glenorchy.rb` | `da_glenorchy` |
+| Hobart | `hobartcity.rb` | `da_hobartcity` |
+| Huon Valley | `huonvalley.rb` | `da_huonvalley` |
+| Kentish | `kentish.rb` | `da_kentish` |
+| Kingborough | `kingborough.rb` | `da_kingborough` |
+| Latrobe | `latrobe.rb` | `da_latrobe` |
+| Launceston | `launcestoncity.rb` | `da_launcestoncity` |
+| Meander Valley | `meandervalley.rb` | `da_meandervalley` |
+| Northern Midlands | `northernmidlands.rb` | `da_northernmidlands` |
+| Southern Midlands | `southernmidlands.rb` | `da_southernmidlands` |
+| Sorell | *(PlanBuild)* | `da_sorell` |
+| Tasman | `tasman.rb` | `da_tasman` |
+| Waratah–Wynyard | `waratah_wynyard.rb` | `da_waratah_wynyard` |
+| West Coast | `westcoast.rb` | `da_westcoast` |
+| West Tamar | `westtamar.rb` | `da_westtamar` |
+| Various (PlanBuild portal) | `planbuild.rb` | Per-council `da_*` tables |
+
+---
+
+## Database Schema
+
+Every `da_*` table shares the same base schema:
+
+| Column | Type | Notes |
+|---|---|---|
+| `id` | `BIGINT` | Auto-increment PK |
+| `council_reference` | `VARCHAR(100)` | DA reference number |
+| `address` | `VARCHAR(255)` | Street address |
+| `description` | `TEXT` | Proposal description |
+| `date_received` | `DATE` | Application date |
+| `on_notice_to` | `DATE` | Public comment close date |
+| `applicant` | `VARCHAR(255)` | |
+| `document_url` | `TEXT` | Remote PDF URL |
+| `local_document_url` | `TEXT` | Downloaded PDF path (relative to `/downloads`) |
+| `address_std` | `VARCHAR(255)` | Google-normalised address |
+| `lat` / `lng` | `DECIMAL(10,7)` | Geocoded coordinates |
+| `property_id` | `TEXT` | Land title PID |
+| `title_reference` | `TEXT` | Certificate of title reference |
+| `created_at` / `updated_at` | `DATETIME` | |
+
+Rows are upserted on `(council_reference, address)`. Some fields are **write-once** (e.g. `date_received`) — the first value is kept on subsequent scrapes.
+
+---
+
+## Enrichment Pipeline
+
+After each upsert, `enrich_after_upsert!` runs two optional enrichment steps:
+
+1. **Geocoding** (requires `GOOGLE_MAPS_API_KEY`) — calls the Google Maps Geocoding API, caches results in the `geo_cache` table, and populates `address_std`, `street`, `locality`, `state`, `postcode`, `lat`, `lng`.
+
+2. **Property lookup** (requires `LOOKUP_URL`) — POSTs `{lat, lng}` to a property data service and populates `property_id` and `title_reference`.
+
+To run enrichment as a standalone backfill over existing rows:
+
+```bash
+docker compose run --rm \
+  -e GOOGLE_MAPS_API_KEY="$GOOGLE_MAPS_API_KEY" \
+  -e LOOKUP_URL="$LOOKUP_URL" \
+  scraper ruby /app/tools/enrich.rb
+```
+
+Run against a single table with a dry run:
+
+```bash
+docker compose run --rm \
+  -e GOOGLE_MAPS_API_KEY="$GOOGLE_MAPS_API_KEY" \
+  -e LOOKUP_URL="$LOOKUP_URL" \
+  -e DRY_RUN=1 \
+  scraper ruby /app/tools/enrich.rb --table=da_dorset
+```
+
+---
+
+## Adding a New Scraper
+
+1. Create `scrapers/<councilname>.rb` — use an existing simple scraper (e.g. `glamorgan.rb`) as a template.
+2. At minimum the scraper must:
+   - Read `TABLE = ENV.fetch("TABLE_NAME")`
+   - Call `DB.ensure_table!(TABLE)` and `ensure_extra_columns!(TABLE)`
+   - Call `DB.upsert(TABLE, row)` with at least `council_reference` and `address`
+   - Call `enrich_after_upsert!` after each upsert
+3. Add the council to `COUNCIL_MAP` in `lib/util.rb` if PlanBuild integration is needed.
+4. Test locally: `TABLE_NAME=da_<name> ruby scrapers/<name>.rb`
+
+---
+
+## Tools
+
+| Script | Purpose |
+|---|---|
+| `tools/enrich.rb` | Batch geocode + property lookup for existing rows |
+| `tools/backfill_geocode.rb` | Geocode-only backfill |
+| `tools/backfill_dorset_docs.rb` | Backfill PDF links for Dorset rows |
+| `tools/import_sqlites.rb` | Import data from legacy SQLite exports |
+| `planbuild_fetch.js` | Playwright-based scraper for the PlanBuild TAS portal |
+
+---
+
+## Project Structure
+
+```
+tas_councils/
+├── lib/
+│   ├── db.rb          # DB connection, table creation, upsert logic
+│   ├── http.rb        # HTTP client with retries, cookie jar, WAF warmup
+│   ├── geocode.rb     # Google Maps geocoding with SHA1 cache
+│   ├── enrich.rb      # Post-upsert enrichment pipeline
+│   └── util.rb        # Date parsing, council/table name mappings
+├── scrapers/          # One .rb file per council
+├── web/               # PHP search portal (Apache)
+├── tools/             # Standalone backfill and migration scripts
+├── run_all.sh         # Discovers and runs scrapers (supports ONLY/SKIP)
+├── entrypoint.sh      # Docker entrypoint; optionally loops on a schedule
+├── Dockerfile         # Ruby 3.2 scraper image
+├── docker-compose.yml # Full stack: db, scraper, web, adminer
+└── .env               # Secrets — never commit this file
+```
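
Note on `ONLY` / `SKIP` and table naming: `run_all.sh` itself is not part of this diff, so the following is only a Ruby sketch of the behaviour the README describes (comma-separated scraper names, table name `da_<basename>`). `select_scrapers` and `csv_names` are hypothetical helpers, and applying both filters when both variables are set is an assumption.

```ruby
# Hypothetical Ruby rendering of the filtering that run_all.sh is described as doing.
def csv_names(value)
  value.to_s.split(",").map(&:strip).reject(&:empty?)
end

def select_scrapers(paths, only: nil, skip: nil)
  only_names = csv_names(only)
  skip_names = csv_names(skip)
  paths.filter_map do |path|
    name = File.basename(path, ".rb")
    # An empty ONLY list means "no restriction".
    next unless only_names.empty? || only_names.include?(name)
    next if skip_names.include?(name)
    { script: path, table_name: "da_#{name}" }
  end
end

paths    = ["scrapers/kentish.rb", "scrapers/meandervalley.rb", "scrapers/hobartcity.rb"]
selected = select_scrapers(paths, only: "meandervalley,kentish")
```

`filter_map` needs Ruby 2.7 or later, which fits the Ruby 3.2 image named in the README.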

+ 21 - 13
lib/enrich.rb

@@ -35,14 +35,20 @@ end
 
 def ensure_extra_columns!(table)
   esc = DB.client.escape(table)
-  begin DB.client.query("ALTER TABLE `#{esc}` ADD COLUMN IF NOT EXISTS address_std VARCHAR(255) NULL"); rescue; end
-  begin DB.client.query("ALTER TABLE `#{esc}` ADD COLUMN IF NOT EXISTS lat DOUBLE NULL"); rescue; end
-  begin DB.client.query("ALTER TABLE `#{esc}` ADD COLUMN IF NOT EXISTS lng DOUBLE NULL"); rescue; end
-  begin DB.client.query("ALTER TABLE `#{esc}` ADD COLUMN IF NOT EXISTS property_id VARCHAR(50) NULL"); rescue; end
-  begin DB.client.query("ALTER TABLE `#{esc}` ADD COLUMN IF NOT EXISTS title_reference VARCHAR(80) NULL"); rescue; end
-  begin DB.client.query("ALTER TABLE `#{esc}` ADD COLUMN IF NOT EXISTS local_document_url TEXT NULL"); rescue; end
-  begin DB.client.query("ALTER TABLE `#{esc}` ADD COLUMN IF NOT EXISTS on_notice_to DATE NULL"); rescue; end
-  begin DB.client.query("ALTER TABLE `#{esc}` ADD COLUMN IF NOT EXISTS on_notice_to_raw VARCHAR(80) NULL"); rescue; end
+  {
+    "address_std"        => "VARCHAR(255) NULL",
+    "lat"                => "DOUBLE NULL",
+    "lng"                => "DOUBLE NULL",
+    "property_id"        => "VARCHAR(50) NULL",
+    "title_reference"    => "VARCHAR(80) NULL",
+    "local_document_url" => "TEXT NULL",
+    "on_notice_to"       => "DATE NULL",
+    "on_notice_to_raw"   => "VARCHAR(80) NULL"
+  }.each do |col, defn|
+    DB.client.query("ALTER TABLE `#{esc}` ADD COLUMN IF NOT EXISTS `#{col}` #{defn}")
+  rescue Mysql2::Error => e
+    warn "[enrich] schema migration skipped for #{table}.#{col}: #{e.message}"
+  end
 end
 
 def http_post_json(url, payload, timeout: 15)
@@ -56,7 +62,9 @@ def http_post_json(url, payload, timeout: 15)
   req.body = JSON.generate(payload)
   res = http.request(req)
   raise "HTTP #{res.code}" unless res.is_a?(Net::HTTPSuccess)
-  JSON.parse(res.body) rescue {}
+  JSON.parse(res.body)
+rescue JSON::ParserError
+  {}
 end
 
 # Call this right after DB.upsert in each scraper
@@ -83,7 +91,7 @@ def enrich_after_upsert!(table:, council_reference:, address:)
       # refresh row to fetch lat/lng for next step
       row = sel.execute(council_reference, address).first
     rescue => e
-      log_enrich("enrich: geocode failed #{table} #{council_reference}: #{e.class} #{e.message}")
+      warn "[enrich] geocode failed #{table} #{council_reference}: #{e.class} #{e.message}"
     end
   end
 
@@ -96,14 +104,14 @@ def enrich_after_upsert!(table:, council_reference:, address:)
       if resp["ok"]
         pid   = (resp["pid"] || "").to_s
         title = (resp["title_id"] || "").to_s
-        upd = DB.client.prepare("UPDATE `#{esc}` SET property_id = COALESCE(NULLIF(?,''), property_id), title_reference = COALESCE(NULLIF(?,''), title_reference) WHERE council_reference = ? AND address = ?")
+        upd = DB.client.prepare("UPDATE `#{esc}` SET property_id = COALESCE(NULLIF(?,''), property_id), title_reference = COALESCE(NULLIF(?,''), title_reference) WHERE council_reference = ? AND address = ?")
         upd.execute(pid, title, council_reference, address)
         log_enrich("enrich: lookup ok #{table} #{council_reference} pid=#{pid} title=#{title}")
       else
-        log_enrich("enrich: lookup error #{table} #{council_reference}: #{resp["error"]}")
+        warn "[enrich] lookup error #{table} #{council_reference}: #{resp["error"]}"
       end
     rescue => e
-      log_enrich("enrich: lookup failed #{table} #{council_reference}: #{e.class} #{e.message}")
+      warn "[enrich] lookup failed #{table} #{council_reference}: #{e.class} #{e.message}"
     end
   end
 end
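
Note on the `ensure_extra_columns!` rewrite: the loop relies on a `rescue` clause placed directly inside a `do ... end` block, so a failure on one column is logged and the remaining columns are still attempted. The sketch below shows that behaviour in isolation. `FakeDbError` stands in for `Mysql2::Error`, and `add_columns` is a hypothetical runner that yields the SQL instead of sending it to a database.

```ruby
# Stand-in for Mysql2::Error so the sketch runs without a database.
class FakeDbError < StandardError; end

# Hypothetical runner: yields one ALTER statement per column and records the outcome.
def add_columns(table, columns)
  results = {}
  columns.each do |col, defn|
    yield "ALTER TABLE `#{table}` ADD COLUMN IF NOT EXISTS `#{col}` #{defn}"
    results[col] = :ok
  rescue FakeDbError => e
    # A rescue clause directly inside do/end needs Ruby 2.5 or later.
    warn "[enrich] schema migration skipped for #{table}.#{col}: #{e.message}"
    results[col] = :skipped
  end
  results
end

columns = { "lat" => "DOUBLE NULL", "lng" => "DOUBLE NULL", "on_notice_to" => "DATE NULL" }
outcome = add_columns("da_example", columns) do |sql|
  # Simulate a failure for exactly one column.
  raise FakeDbError, "simulated failure" if sql.include?("`lng`")
end
```

The failing column is reported on stderr and marked `:skipped`, while the columns before and after it are still processed.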

+ 11 - 8
lib/geocode.rb

@@ -39,8 +39,8 @@ module Geocode
     DB.client.query("ALTER TABLE `#{esc}` ADD COLUMN IF NOT EXISTS postcode VARCHAR(10) NULL")
     DB.client.query("ALTER TABLE `#{esc}` ADD COLUMN IF NOT EXISTS lat DECIMAL(10,7) NULL")
     DB.client.query("ALTER TABLE `#{esc}` ADD COLUMN IF NOT EXISTS lng DECIMAL(10,7) NULL")
-  rescue => e
-    warn "ensure columns skipped for #{table}: #{e.class} #{e.message}"
+  rescue Mysql2::Error => e
+    warn "[geocode] ensure columns skipped for #{table}: #{e.message}"
   end
 
   # Public helper to geocode and return a hash of normalized components
@@ -73,7 +73,9 @@ module Geocode
     url.query = URI.encode_www_form(params)
 
     body = Http.get(url.to_s, headers: { "Accept" => "application/json" })
-    data = JSON.parse(body) rescue {}
+    data = begin
+      JSON.parse(body)
+    rescue JSON::ParserError then {} end
 
     status = data["status"].to_s
     unless status == "OK" && data["results"].is_a?(Array) && !data["results"].empty?
@@ -90,10 +92,11 @@ module Geocode
 
     res
   rescue Error => e
-    warn "Geocode error: #{e.message}"
+    warn "[geocode] #{e.message}"
     nil
-  rescue => e
-    warn "Geocode unexpected error: #{e.class} #{e.message}"
+  rescue Net::HTTPError, Net::ReadTimeout, Net::OpenTimeout, OpenSSL::SSL::SSLError,
+         Errno::ECONNRESET, EOFError, Mysql2::Error => e
+    warn "[geocode] network/db error for #{raw_address.inspect}: #{e.class} #{e.message}"
     nil
   end
 
@@ -126,8 +129,8 @@ module Geocode
       council_reference,
       orig_address
     )
-  rescue => e
-    warn "Failed to update normalized address for #{table}/#{council_reference}: #{e.class} #{e.message}"
+  rescue Mysql2::Error => e
+    warn "[geocode] failed to update normalized address for #{table}/#{council_reference}: #{e.message}"
   end
 
   # Helpers
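
Note on the JSON fallback in `lib/geocode.rb`: in Ruby, a `rescue` clause written at method level covers the whole method body, and every statement that follows it belongs to the handler. Keeping the fallback inside a small helper (or a `begin ... end` expression) limits it to the parse step. A minimal sketch, assuming a hypothetical helper name `parse_json_or_empty`:

```ruby
require "json"

# Hypothetical helper: returns the parsed document, or {} when the body is not valid JSON.
def parse_json_or_empty(body)
  JSON.parse(body.to_s)
rescue JSON::ParserError
  {}
end

data   = parse_json_or_empty('{"status":"OK","results":[{"place_id":"abc"}]}')
status = data["status"].to_s

# An HTML error page falls back to an empty hash, so the status check sees "".
fallback = parse_json_or_empty("<html>Service unavailable</html>")
```

With the fallback contained this way, the status check and result handling that follow always run on the normal code path.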

+ 12 - 11
lib/util.rb

@@ -3,25 +3,26 @@ require "date"
 module Util
     def self.parse_epoch_ms(v)
         return nil if v.nil? || v.to_s.strip.empty?
-        Time.at(v.to_i / 1000).to_date rescue nil
+        begin
+            Time.at(v.to_i / 1000).to_date
+        rescue ArgumentError, RangeError, TypeError
+            nil
+        end
     end
 
     def self.parse_aus_date(text)
         s = text.to_s.strip
         return nil if s.empty?
-        begin
-            Date.strptime(s, "%d/%m/%Y")
-        rescue
+        ["%d/%m/%Y", "%d/%m/%y"].each do |fmt|
             begin
-                Date.strptime(s, "%d/%m/%y")
-            rescue
-                begin
-                    Date.parse(s)
-                rescue
-                    nil
-                end
+                return Date.strptime(s, fmt)
+            rescue ArgumentError, Date::Error
+                next
             end
         end
+        Date.parse(s)
+    rescue ArgumentError, Date::Error
+        nil
     end
     COUNCIL_MAP = {
         "BREAK_ODAY"         => "da_break_oday",

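Note on `Util.parse_aus_date`: the method body below is copied from the added lines of this hunk and wrapped in a throwaway module (`UtilSketch`, a hypothetical name) so it can be exercised on its own. The sample inputs are illustrative only.

```ruby
require "date"

# Copy of parse_aus_date as it reads after this change, isolated for experimentation.
module UtilSketch
  def self.parse_aus_date(text)
    s = text.to_s.strip
    return nil if s.empty?
    ["%d/%m/%Y", "%d/%m/%y"].each do |fmt|
      begin
        return Date.strptime(s, fmt)
      rescue ArgumentError, Date::Error
        next
      end
    end
    # Last resort: let Date.parse guess the format (handles ISO dates, for example).
    Date.parse(s)
  rescue ArgumentError, Date::Error
    nil
  end
end

day_first = UtilSketch.parse_aus_date(" 19/08/2025 ")
iso       = UtilSketch.parse_aus_date("2025-08-19")
garbage   = UtilSketch.parse_aus_date("???")
```

Day-first and ISO inputs both resolve to 19 August 2025 here, and unparseable or blank input yields `nil` rather than an exception.
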
+ 1 - 1
scrapers/brighton.rb

@@ -41,7 +41,7 @@ end
 def abs_url(base, href)
     return "" if href.to_s.strip.empty?
     URI.join(base, href).to_s
-rescue
+rescue URI::InvalidURIError
     href.to_s
 end
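
Note on the `abs_url` helpers: the same three-line pattern is narrowed to `URI::InvalidURIError` in several scrapers in this commit. The sketch below repeats it in runnable form; the base URL and hrefs are made-up examples. Hrefs containing a raw space are a common trigger for `URI::InvalidURIError`, in which case the href is returned unchanged.

```ruby
require "uri"

# Same shape as the abs_url helpers touched in this commit.
def abs_url(base, href)
  return "" if href.to_s.strip.empty?
  URI.join(base, href).to_s
rescue URI::InvalidURIError
  # Malformed hrefs are passed through rather than dropped.
  href.to_s
end

base     = "https://example.com/planning/"
resolved = abs_url(base, "docs/da-1.pdf")
raw      = abs_url(base, "/files/da 1.pdf")
empty    = abs_url(base, "   ")
```

One consequence of the narrower rescue is that other errors from `URI.join`, such as `URI::BadURIError` when the base itself is relative, now propagate instead of being swallowed.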
 

+ 11 - 3
scrapers/burnie.rb

@@ -85,11 +85,15 @@ def decompress(body, enc)
   if enc.to_s =~ /gzip/i
     Zlib::GzipReader.new(StringIO.new(body)).read
   elsif enc.to_s =~ /deflate/i
-    Zlib::Inflate.inflate(body) rescue body
+    begin
+      Zlib::Inflate.inflate(body)
+    rescue Zlib::Error
+      body
+    end
   else
     body
   end
-rescue
+rescue Zlib::Error
   body
 end
 
@@ -184,7 +188,11 @@ def decode_seamless_viewstate(doc)
   b64 = doc.at_css("#__SEAMLESSVIEWSTATE")&.[]("value").to_s
   return nil if b64.empty?
   raw  = Base64.decode64(b64)
-  html = (Zlib::GzipReader.new(StringIO.new(raw)).read rescue raw)
+  html = begin
+    Zlib::GzipReader.new(StringIO.new(raw)).read
+  rescue Zlib::Error
+    raw
+  end
   Nokogiri::HTML(html)
 rescue => e
   warn "Failed to decode __SEAMLESSVIEWSTATE: #{e.class} #{e.message}"
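
Note on the `Zlib::Error` rescues in `scrapers/burnie.rb`: both `Zlib::GzipFile::Error` (raised for input that is not gzip) and `Zlib::DataError` (raised for invalid deflate data) descend from `Zlib::Error`, so a single class covers both branches. The following is a condensed sketch of the same idea, not the file's exact code.

```ruby
require "zlib"
require "stringio"

# Condensed variant of decompress: fall back to the raw body on any Zlib failure.
def decompress(body, enc)
  case enc.to_s
  when /gzip/i    then Zlib::GzipReader.new(StringIO.new(body)).read
  when /deflate/i then Zlib::Inflate.inflate(body)
  else body
  end
rescue Zlib::Error
  # Covers Zlib::GzipFile::Error and Zlib::DataError alike.
  body
end

gzipped  = decompress(Zlib.gzip("hello"), "gzip")
deflated = decompress(Zlib::Deflate.deflate("hello"), "deflate")
bad_gzip = decompress("plain text", "gzip")
```

A body that is mislabelled as compressed comes back unchanged, which matches the fallback behaviour the scraper had before the rescue was narrowed.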

+ 1 - 1
scrapers/centralcoast.rb

@@ -23,7 +23,7 @@ ensure_extra_columns!(TABLE)
 def abs_url(base, href)
     return "" if href.to_s.strip.empty?
     URI.join(base, href).to_s
-rescue
+rescue URI::InvalidURIError
     href.to_s
 end
 

+ 2 - 2
scrapers/centralhighlands.rb

@@ -118,8 +118,8 @@ links.each_with_index do |a, idx|
   begin
     upd = DB.client.prepare("UPDATE `#{DB.client.escape(TABLE)}` SET document_url = ? WHERE council_reference = ? AND address = ?")
     upd.execute(pdf, ref, address)
-  rescue
-    # ignore if column missing
+  rescue Mysql2::Error => e
+    warn "[centralhighlands] db update skipped for #{ref}: #{e.message}"
   end
 
   puts "Upserted #{ref} -> #{address}"

+ 2 - 2
scrapers/circularhead.rb

@@ -79,8 +79,8 @@ items.each_with_index do |li, idx|
   begin
     upd = DB.client.prepare("UPDATE `#{DB.client.escape(TABLE)}` SET document_url = ?, title_reference = ? WHERE council_reference = ? AND address = ?")
     upd.execute(document_url, title_reference, council_reference, address)
-  rescue
-    # ignore if ALTER failed
+  rescue Mysql2::Error => e
+    warn "[circularhead] db update skipped for #{council_reference}: #{e.message}"
   end
 
   puts "Upserted #{council_reference} -> #{address}"

+ 1 - 1
scrapers/clarence.rb

@@ -24,7 +24,7 @@ ensure_extra_columns!(TABLE)
 def abs_url(base, href)
     return "" if href.to_s.strip.empty?
     URI.join(base, href).to_s
-rescue
+rescue URI::InvalidURIError
     href.to_s
 end
 

+ 3 - 3
scrapers/devonportcity.rb

@@ -141,7 +141,7 @@ def extract_on_notice_to_from_title(title)
                 # handle 19-08-2025
                 begin
                     date_received = Date.strptime(date_received_raw, "%d-%m-%Y")
-                rescue
+                rescue ArgumentError, Date::Error
                     date_received = parse_date_any(date_received_raw)
                 end
             end
@@ -188,8 +188,8 @@ def extract_on_notice_to_from_title(title)
             begin
                 upd = DB.client.prepare("UPDATE `#{DB.client.escape(TABLE)}` SET document_url = ?, title_reference = ? WHERE council_reference = ? AND address = ?")
                 upd.execute(document_url, title_reference, council_reference, address)
-            rescue
-                # ignore if ALTER failed
+            rescue Mysql2::Error => e
+                warn "[devonportcity] db update skipped for #{council_reference}: #{e.message}"
             end
 
             puts "Upserted #{council_reference} -> #{address}"

+ 2 - 2
scrapers/dorset.rb

@@ -26,7 +26,7 @@ ensure_extra_columns!(TABLE)
 def abs_url(href)
   return "" if href.to_s.strip.empty?
   URI.join(BASE_HTTPS, href).to_s
-rescue
+rescue URI::InvalidURIError
   href.to_s
 end
 
@@ -198,7 +198,7 @@ def id_from_url(u)
   uri = URI.parse(u)
   q   = uri.query.to_s
   q[/\bid=([^&]+)/, 1] || File.basename(uri.path)
-rescue
+rescue URI::InvalidURIError
   nil
 end
 

+ 21 - 13
scrapers/enrich.rb

@@ -35,14 +35,20 @@ end
 
 def ensure_extra_columns!(table)
   esc = DB.client.escape(table)
-  begin DB.client.query("ALTER TABLE `#{esc}` ADD COLUMN IF NOT EXISTS address_std VARCHAR(255) NULL"); rescue; end
-  begin DB.client.query("ALTER TABLE `#{esc}` ADD COLUMN IF NOT EXISTS lat DOUBLE NULL"); rescue; end
-  begin DB.client.query("ALTER TABLE `#{esc}` ADD COLUMN IF NOT EXISTS lng DOUBLE NULL"); rescue; end
-  begin DB.client.query("ALTER TABLE `#{esc}` ADD COLUMN IF NOT EXISTS property_id VARCHAR(50) NULL"); rescue; end
-  begin DB.client.query("ALTER TABLE `#{esc}` ADD COLUMN IF NOT EXISTS title_reference VARCHAR(80) NULL"); rescue; end
-  begin DB.client.query("ALTER TABLE `#{esc}` ADD COLUMN IF NOT EXISTS local_document_url TEXT NULL"); rescue; end
-  begin DB.client.query("ALTER TABLE `#{esc}` ADD COLUMN IF NOT EXISTS on_notice_to DATE NULL"); rescue; end
-  begin DB.client.query("ALTER TABLE `#{esc}` ADD COLUMN IF NOT EXISTS on_notice_to_raw VARCHAR(80) NULL"); rescue; end
+  {
+    "address_std"        => "VARCHAR(255) NULL",
+    "lat"                => "DOUBLE NULL",
+    "lng"                => "DOUBLE NULL",
+    "property_id"        => "VARCHAR(50) NULL",
+    "title_reference"    => "VARCHAR(80) NULL",
+    "local_document_url" => "TEXT NULL",
+    "on_notice_to"       => "DATE NULL",
+    "on_notice_to_raw"   => "VARCHAR(80) NULL"
+  }.each do |col, defn|
+    DB.client.query("ALTER TABLE `#{esc}` ADD COLUMN IF NOT EXISTS `#{col}` #{defn}")
+  rescue Mysql2::Error => e
+    warn "[enrich] schema migration skipped for #{table}.#{col}: #{e.message}"
+  end
 end
 
 def http_post_json(url, payload, timeout: 15)
@@ -56,7 +62,9 @@ def http_post_json(url, payload, timeout: 15)
   req.body = JSON.generate(payload)
   res = http.request(req)
   raise "HTTP #{res.code}" unless res.is_a?(Net::HTTPSuccess)
-  JSON.parse(res.body) rescue {}
+  JSON.parse(res.body)
+rescue JSON::ParserError
+  {}
 end
 
 # Call this right after DB.upsert in each scraper
@@ -83,7 +91,7 @@ def enrich_after_upsert!(table:, council_reference:, address:)
       # refresh row to fetch lat/lng for next step
       row = sel.execute(council_reference, address).first
     rescue => e
-      log_enrich("enrich: geocode failed #{table} #{council_reference}: #{e.class} #{e.message}")
+      warn "[enrich] geocode failed #{table} #{council_reference}: #{e.class} #{e.message}"
     end
   end
 
@@ -96,14 +104,14 @@ def enrich_after_upsert!(table:, council_reference:, address:)
       if resp["ok"]
         pid   = (resp["pid"] || "").to_s
         title = (resp["title_id"] || "").to_s
-        upd = DB.client.prepare("UPDATE `#{esc}` SET property_id = COALESCE(NULLIF(?,''), property_id), title_reference = COALESCE(NULLIF(?,''), title_reference) WHERE council_reference = ? AND address = ?")
+        upd = DB.client.prepare("UPDATE `#{esc}` SET property_id = COALESCE(NULLIF(?,''), property_id), title_reference = COALESCE(NULLIF(?,''), title_reference) WHERE council_reference = ? AND address = ?")
         upd.execute(pid, title, council_reference, address)
         log_enrich("enrich: lookup ok #{table} #{council_reference} pid=#{pid} title=#{title}")
       else
-        log_enrich("enrich: lookup error #{table} #{council_reference}: #{resp["error"]}")
+        warn "[enrich] lookup error #{table} #{council_reference}: #{resp["error"]}"
       end
     rescue => e
-      log_enrich("enrich: lookup failed #{table} #{council_reference}: #{e.class} #{e.message}")
+      warn "[enrich] lookup failed #{table} #{council_reference}: #{e.class} #{e.message}"
     end
   end
 end

+ 2 - 2
scrapers/flinders_council.rb

@@ -89,8 +89,8 @@ links.each do |a|
   begin
     upd = DB.client.prepare("UPDATE `#{DB.client.escape(TABLE)}` SET document_url = ? WHERE council_reference = ? AND address = ?")
     upd.execute(pdf, ref, address)
-  rescue
-    # ignore if column missing
+  rescue Mysql2::Error => e
+    warn "[flinders] db update skipped for #{ref}: #{e.message}"
   end
 
   puts "Upserted #{ref} -> #{address}"

+ 2 - 2
scrapers/georgetown.rb

@@ -135,8 +135,8 @@ items.each do |row|
   begin
     upd = DB.client.prepare("UPDATE `#{DB.client.escape(TABLE)}` SET document_url = ? WHERE council_reference = ? AND address = ?")
     upd.execute(row[:document_url], row[:council_reference], row[:address])
-  rescue
-    # ignore if column wasn’t created
+  rescue Mysql2::Error => e
+    warn "[georgetown] db update skipped for #{row[:council_reference]}: #{e.message}"
   end
 
   puts "Upserted #{row[:council_reference]} -> #{row[:address]}"

+ 3 - 3
scrapers/glamorgan.rb

@@ -35,7 +35,7 @@ end
 def safe_abs(base, href)
   return "" if href.to_s.strip.empty?
   URI.join(base, href).to_s
-rescue
+rescue URI::InvalidURIError
   href.to_s
 end
 
@@ -106,8 +106,8 @@ rows.each_with_index do |row, idx|
   begin
     upd = DB.client.prepare("UPDATE `#{DB.client.escape(TABLE)}` SET document_url = ? WHERE council_reference = ? AND address = ?")
     upd.execute(document_url, council_reference, address)
-  rescue
-    # column might not exist on first run if ALTER failed
+  rescue Mysql2::Error => e
+    warn "[glamorgan] db update skipped for #{council_reference}: #{e.message}"
   end
 
   puts "Upserted #{council_reference} -> #{address}"

+ 1 - 1
scrapers/glenorchy.rb

@@ -31,7 +31,7 @@ end
 def abs_url(href)
     return "" if href.to_s.strip.empty?
     URI.join(URL, href).to_s
-rescue
+rescue URI::InvalidURIError
     href.to_s
 end
 

+ 2 - 2
scrapers/huonvalley.rb

@@ -133,8 +133,8 @@ loop do
     begin
       upd = DB.client.prepare("UPDATE `#{DB.client.escape(TABLE)}` SET document_url = ? WHERE council_reference = ? AND address = ?")
       upd.execute(r[:document_url], r[:council_reference], r[:address])
-    rescue
-      # ignore if column missing
+    rescue Mysql2::Error => e
+      warn "[huonvalley] db update skipped for #{r[:council_reference]}: #{e.message}"
     end
 
     puts "Upserted #{r[:council_reference]} -> #{r[:address]}"

+ 4 - 4
scrapers/launcestoncity.rb

@@ -99,7 +99,7 @@ end
 def absolute(base, href)
   return nil if href.to_s.empty?
   URI.join(base, href).to_s
-rescue
+rescue URI::InvalidURIError
   nil
 end
 
@@ -164,7 +164,7 @@ def variants_for_doc_list(url)
           "f" => "$P1.ESB.PUBNOT.VIW"
         ))
         uri2.to_s
-      rescue
+      rescue URI::InvalidURIError
         s
       end
     ]
@@ -228,8 +228,8 @@ def probe_common_docs(base_url:, key:, danum:, referer:)
         end
         found << { name: File.basename(pdf_url), url: pdf_url, local_url: local_rel }
       end
-    rescue
-      # ignore and try next candidate
+    rescue StandardError => e
+      warn "[launcestoncity] probe failed for #{pdf_url}: #{e.class} #{e.message}"
       next
     end
   end

+ 1 - 1
scrapers/meandervalley.rb

@@ -132,7 +132,7 @@ def safe_name(s) = s.to_s.gsub(/[^\w\-.]+/, "_")
         href = a["href"].to_s
         pdf  = begin
             URI.join(URL, href).to_s
-        rescue
+        rescue URI::InvalidURIError
             href
         end
 

+ 2 - 2
scrapers/tasman.rb

@@ -102,8 +102,8 @@ items.each_with_index do |row, idx|
   begin
     upd = DB.client.prepare("UPDATE `#{DB.client.escape(TABLE)}` SET document_url = ? WHERE council_reference = ? AND address = ?")
     upd.execute(document_url, council_reference, address)
-  rescue
-    # ignore if column missing
+  rescue Mysql2::Error => e
+    warn "[tasman] db update skipped for #{council_reference}: #{e.message}"
   end
 
   puts "Upserted #{council_reference} -> #{address}"

+ 2 - 2
scrapers/waratah_wynyard.rb

@@ -26,7 +26,7 @@ end
 def abs_url(base, href)
   return "" if href.to_s.strip.empty?
   URI.join(base, href).to_s
-rescue
+rescue URI::InvalidURIError
   href.to_s
 end
 
@@ -188,7 +188,7 @@ doc = Nokogiri::HTML(html)
 
 host = begin
   URI.parse(URL).host
-rescue
+rescue URI::InvalidURIError
   nil
 end
 

+ 2 - 1
scrapers/westtamar.rb

@@ -154,7 +154,8 @@ detail_links.each do |u|
   begin
     upd = DB.client.prepare("UPDATE `#{DB.client.escape(TABLE)}` SET document_url = ?, on_notice_to = ?, on_notice_to_raw = ?, title_reference = ? WHERE council_reference = ? AND address = ?")
     upd.execute(item[:document_url], item[:date_received], item[:date_received_raw], item[:title_reference], item[:council_reference], item[:address])
-  rescue
+  rescue Mysql2::Error => e
+    warn "[westtamar] db update skipped for #{item[:council_reference]}: #{e.message}"
   end
 
   puts "Upserted #{item[:council_reference]} -> #{item[:address]}"