
Changelog

All notable changes to the TAS Councils scraping pipeline are recorded here. Entries are grouped by push/session in reverse-chronological order.


2026-04-13 — Scraper Fixes & Audit

scrapers/planbuild.rb — rewrote to fix crash on first item:

  • Added missing require "zlib", require "stringio", require_relative "../lib/log"
  • fetch_detail now always returns a Hash (parsed.is_a?(Hash) ? parsed : {}); bare rescue {} replaced with rescue JSON::ParserError, Zlib::Error
  • Removed debug puts — replaced with Log.debug/Log.info
  • local_document_url now passes nil (not "") when no downloads — prevents COALESCE overwriting an existing URL with empty string
  • Per-item rescue so one bad reference skips and logs rather than killing the run
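
The "always returns a Hash" contract described for fetch_detail can be sketched roughly as follows. This is an illustration only, not the scraper's code: parse_detail_body is a hypothetical helper, and it assumes the response body is gzip-compressed JSON.

```ruby
require "json"
require "zlib"
require "stringio"

# Hypothetical helper: decode a gzip-compressed JSON body and always
# return a Hash, so callers can use [] / fetch without nil checks.
def parse_detail_body(gzipped_body)
  json = Zlib::GzipReader.new(StringIO.new(gzipped_body)).read
  parsed = JSON.parse(json)
  # Valid JSON that is not an object (e.g. an array) is treated as "no detail".
  parsed.is_a?(Hash) ? parsed : {}
rescue JSON::ParserError, Zlib::Error
  # Only the two expected failure modes are rescued; anything else propagates.
  {}
end
```

Rescuing the two named classes, rather than everything, means an unexpected failure such as a NoMethodError still surfaces in the per-item rescue and gets logged.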

scrapers/southernmidlands.rb — rewrote detail page parser:

  • Detail pages use Location: / Proposal: paragraph format, not table rows — old table tr th/td selector found nothing, causing 0 saves
  • New parser splits <br>-separated lines per paragraph, extracts Location/Proposal fields, handles multiple DAs per item page
  • Removed redundant ALTER TABLE block (columns already in DB.ensure_table!)
  • Added explicit require_relative "../lib/http", "../lib/db", "../lib/util"
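
The paragraph-based parsing described above might look something like this sketch. The helper names are hypothetical, and plain string handling stands in for whatever HTML parser the scraper actually uses, so that the example runs on its own.

```ruby
# Hypothetical helper: extract Location/Proposal from one paragraph's inner HTML.
def parse_da_paragraph(inner_html)
  fields = {}
  # Each <br>-separated chunk is treated as one "Label: value" line.
  inner_html.split(/<br\s*\/?>/i).each do |line|
    text = line.gsub(/<[^>]+>/, "").strip   # drop inline tags such as <strong>
    match = text.match(/\A(Location|Proposal):\s*(.+)\z/i)
    next unless match
    fields[match[1].downcase.to_sym] = match[2].strip
  end
  fields
end

# Hypothetical helper: one page may list several DAs, one per paragraph.
def parse_da_paragraphs(paragraphs)
  paragraphs.map { |html| parse_da_paragraph(html) }.reject(&:empty?)
end
```

Paragraphs that contain neither field produce an empty Hash and are dropped, which is one way to skip introductory text on the same page.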

Missing require_relative "../lib/log" — 20 scrapers fixed:

  • break_oday, brighton, burnie, centralcoast, circularhead, clarence, derwentvalley, devonportcity, dorset, flinders_council, glenorchy, huonvalley, kentish, launcestoncity, meandervalley, northernmidlands, southernmidlands, waratah_wynyard, westcoast, westtamar
  • Log.warn called in rescue blocks in all of these — without the require, the first error would raise NameError: uninitialized constant Log instead of logging

enrich_after_upsert! variable scope bugs — 4 scrapers fixed:

  • flinders_council.rb: council_reference (undefined) → ref; folded separate UPDATE document_url into DB.upsert; removed redundant ALTER TABLE
  • huonvalley.rb: council_reference/address (undefined) → r[:council_reference]/r[:address]; folded UPDATE document_url into upsert; removed redundant ALTER TABLE
  • kentish.rb: council_reference/address (undefined) → r[:council_reference]/r[:address]; folded extras UPDATE into upsert
  • westcoast.rb: address (undefined) → item[:address]; fixed upsert field names (on_notice → on_notice_to, on_notice_raw → on_notice_to_raw); fixed values referencing non-existent item keys; folded extras UPDATE into upsert

Redundant ALTER TABLE blocks removed from circularhead.rb and waratah_wynyard.rb — all columns already created by DB.ensure_table!


2026-04-13 — Code Quality Pass 3

Logging

  • All 63 bare warn "..." calls across scrapers/*.rb replaced with Log.warn "scraper", "..." — structured logging now consistent throughout; stderr output is now filtered by LOG_LEVEL.

DB.upsert dynamic rewrite (lib/db.rb)

  • Removed hardcoded 22-column array — upsert now derives columns from row.keys, so scrapers that pass scraper-specific columns (e.g. advertised_date, legal_description) are no longer silently ignored.
  • Added SAFE_COLUMN_RE = /\A[a-z][a-z0-9_]*\z/ — each key is validated before interpolation into SQL; unsafe names raise ArgumentError rather than silently passing.
  • Extracted write-once/merge semantics into UPSERT_ON_DUP constant (date_received, date_received_raw, document_url, local_document_url) — easier to audit and extend.
  • Non-existent columns now raise Mysql2::Error (caught by scraper rescue) instead of silently being dropped, surfacing schema mismatches early.
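
As a rough sketch of the dynamic upsert described above: build_upsert_sql is a hypothetical function, and the exact ON DUPLICATE KEY UPDATE expressions are an assumption about what "write-once" means here (keep an existing non-NULL value). Table-name validation is assumed to happen elsewhere.

```ruby
SAFE_COLUMN_RE = /\A[a-z][a-z0-9_]*\z/
# Columns listed in the changelog entry as write-once/merge.
UPSERT_ON_DUP = %w[date_received date_received_raw document_url local_document_url].freeze

# Hypothetical builder: returns [sql, bind_values] derived from row.keys.
def build_upsert_sql(table, row)
  cols = row.keys.map(&:to_s)
  cols.each do |c|
    # Column names are interpolated into SQL, so each one is validated first.
    raise ArgumentError, "unsafe column name: #{c.inspect}" unless c.match?(SAFE_COLUMN_RE)
  end

  quoted = cols.map { |c| "`#{c}`" }.join(", ")
  marks  = (["?"] * cols.size).join(", ")
  updates = cols.map do |c|
    if UPSERT_ON_DUP.include?(c)
      # Assumed write-once rule: keep the stored value unless it is NULL.
      "`#{c}` = COALESCE(`#{c}`, VALUES(`#{c}`))"
    else
      "`#{c}` = VALUES(`#{c}`)"
    end
  end.join(", ")

  ["INSERT INTO `#{table}` (#{quoted}) VALUES (#{marks}) ON DUPLICATE KEY UPDATE #{updates}",
   row.values]
end
```

Values always travel as bind parameters; only identifiers that pass SAFE_COLUMN_RE reach the SQL string.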

2026-04-13 — Code Quality Pass 2

Security

  • lib/http.rb curl fallback: replaced shell-interpolated backtick call with Open3.capture2 array form — eliminates shell injection risk from URL-derived ref/uri values.
  • web/index.php: added validate_table_name() helper (enforces /\Ada_[a-z0-9_]+\z/) applied before every backtick-quoted table name interpolation (tableHasColumn, $stageT stages fetch, UNION SELECT builder).
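
To illustrate the array form mentioned for the curl fallback, here is a minimal sketch. capture_no_shell is a hypothetical helper, and echo stands in for curl so the example runs without network access.

```ruby
require "open3"

# Hypothetical helper: run a command without a shell by passing the program
# and each argument as separate strings (the "array form").
def capture_no_shell(program, *args)
  out, status = Open3.capture2(program, *args)
  raise "#{program} exited with #{status.exitstatus}" unless status.success?
  out
end

# A URL-derived value containing shell metacharacters is handed to the
# program as one literal argument; nothing in it is expanded or executed.
hostile = "https://example.invalid/doc?ref=$(id);`id`"
echoed = capture_no_shell("echo", hostile)
```

With a backtick call or a single interpolated command string, the same value would be parsed by the shell, which is where the injection risk comes from.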

Schema consolidation

  • Removed Geocode.ensure_da_columns! from lib/geocode.rb — redundant, covered by DB.ensure_table! (new tables) and migration v1 (existing tables). Removed its call from tools/backfill_geocode.rb.
  • Removed ensure_extra_columns! from lib/enrich.rb and all 10 scraper call-sites — same reasoning; was also using wrong column types (DOUBLE/VARCHAR(50)) vs canonical schema (DECIMAL(10,7)/TEXT).

Error handling

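  • 66 bare rescue => e replaced with rescue StandardError => e across all scrapers, lib, and tools. A bare rescue already defaults to StandardError, so behaviour is unchanged and SystemExit/SignalException still propagate; the explicit class makes that intent visible.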
  • lib/enrich.rb: two warn calls replaced with Log.warn for structured logging; stale file header comment removed.
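
The explicit rescue form can be exercised in isolation. In this small sketch (guarded is a hypothetical wrapper, not a function in this codebase), rescue StandardError => e handles an ordinary error while SystemExit passes through to the caller.

```ruby
# Hypothetical wrapper around a unit of scraper work.
def guarded
  yield
  :ok
rescue StandardError => e
  # Ordinary errors (ArgumentError, IOError, ...) end up here.
  "rescued #{e.class}"
end

result_ok    = guarded { 1 + 1 }
result_error = guarded { raise ArgumentError, "bad row" }

# SystemExit is not a StandardError, so it is not swallowed by guarded.
exit_propagated =
  begin
    guarded { exit 1 }
    false
  rescue SystemExit
    true
  end
```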

Removed

  • Deleted scrapers/enrich.rb — stale duplicate with wrong require_relative paths, old broken COALESCE(NULLIF(?, '')) query, no main batch loop. Was picked up by run_all.sh's glob and failing every full run with LoadError.

Docs

  • CLAUDE.md: corrected scraper pattern (removed ensure_extra_columns!(TABLE) step), updated geocode-backfill command, corrected schema-change guidance.
  • README.md: removed stale tools/enrich.rb references; corrected enrichment/backfill examples and tools table; added link to VERSIONS.md.
  • VERSIONS.md: created — changelog covering all changes from initial upload.

2026-04-13 — Code Quality & Bug Fixes

Bug fixes

  • Fixed Mysql2::Error Unknown column '''' in 'SET' — MariaDB 10.11's prepared-statement parser mishandles string literals ('') inside NULLIF/IF expressions in SET clauses. Replaced COALESCE(NULLIF(?, ''), col) with COALESCE(?, col) passing nil when the value is empty (lib/enrich.rb).
  • Fixed private method 'da_tables' called error in lib/migrate.rb — migration lambdas call Migrate.da_tables with an explicit receiver, which counts as a public call. Removed da_tables from private_class_method declaration.
  • Fixed unmatched end / dangling rescue syntax error in scrapers/launcestoncity.rb introduced during a prior cleanup pass.
  • Eliminated duplicate "Docs page had no usable links" warning (fired twice per DA) in scrapers/launcestoncity.rb.
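
The private-method failure in lib/migrate.rb can be reproduced with a small sketch. MigrateDemo and its table names are hypothetical; the point is only that a lambda calling through the module name counts as an explicit receiver.

```ruby
module MigrateDemo
  # Public class method: callable as MigrateDemo.da_tables from anywhere.
  def self.da_tables
    %w[da_example_one da_example_two]
  end

  def self.hidden_tables
    da_tables
  end
  # Marking a class method private forbids calls with an explicit receiver.
  private_class_method :hidden_tables

  # Migration-style lambdas that call through the module name.
  STEP_OK  = -> { MigrateDemo.da_tables }
  STEP_BAD = -> { MigrateDemo.hidden_tables }
end
```

Calling MigrateDemo::STEP_BAD.call raises NoMethodError because of the explicit receiver, which is the same failure the fix removes by leaving da_tables public.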

Removed

  • Deleted scrapers/enrich.rb — stale copy of lib/enrich.rb with wrong require_relative paths, old broken COALESCE(NULLIF(?, '')) query, and no main batch loop. Was being picked up by run_all.sh's scrapers/*.rb glob and failing every full run with a LoadError.

Docs

  • Updated CLAUDE.md: corrected geocode-backfill command to use tools/backfill_geocode.rb, updated schema-change guidance to point to lib/migrate.rb.
  • Updated README.md: removed stale tools/enrich.rb references, corrected enrichment/backfill examples, updated tools table.

2026-04-13 — Structure Updates (5f60868)

  • General structural cleanup across scrapers.

2026-04-13 — Launceston City Scraper (3fc874c, bc3490f)

  • Implemented scrapers/launcestoncity.rb for the Launceston eProperty portal (ASP.NET session-based site).
  • Session cookie management (merge_set_cookie!) to maintain ASP.NET_SessionId across requests.
  • Document listing via docget.asp with multi-variant URL probing (path-case and route-param variants).
  • probe_common_docs fallback: constructs known PDF filenames from DA number when the document list page returns no links.
  • PDF download to DOWNLOAD_DIR/launceston/<da_ref>/ when DOWNLOAD_ATTACHMENTS=1.
  • Enriches each DA from the details page (applicant, received date, advertised date, legal description).
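
Session-cookie handling of the kind described above could be sketched as follows. The signature of the real merge_set_cookie! is not given in this changelog, so the arguments here (a Hash jar plus one or more Set-Cookie header values) are an assumption, as is the cookie_header helper.

```ruby
# Assumed shape: fold Set-Cookie header values into a name => value jar.
def merge_set_cookie!(jar, set_cookie_headers)
  Array(set_cookie_headers).each do |header|
    # Keep only "name=value"; attributes such as path or HttpOnly are dropped.
    pair = header.split(";", 2).first.to_s
    name, value = pair.split("=", 2)
    next if name.nil? || name.strip.empty? || value.nil?
    jar[name.strip] = value.strip   # a later value replaces an earlier one
  end
  jar
end

# Hypothetical helper: render the jar as a Cookie request header value.
def cookie_header(jar)
  jar.map { |name, value| "#{name}=#{value}" }.join("; ")
end
```

Merging after every response keeps ASP.NET_SessionId current even if the server rotates it partway through a run.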

2026-04-13 — Structured Logging (c03bfae)

  • Added lib/log.rb: Log.debug, Log.info, Log.warn, Log.error with LOG_LEVEL env filtering.
  • Replaced puts/warn calls across lib/ with Log.* calls.
  • Added LOG_LEVEL env var to docker-compose.yml (default: info).
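
A level-filtered logger along these lines might look like the sketch below. It is not the contents of lib/log.rb: the output format, the io: keyword, and the boolean return value are assumptions made so the example is easy to test. The (tag, message) argument order follows the Log.warn "scraper", "..." usage shown elsewhere in this changelog.

```ruby
module Log
  LEVELS = { "debug" => 0, "info" => 1, "warn" => 2, "error" => 3 }.freeze

  # Read LOG_LEVEL on each call; unknown or missing values fall back to info.
  def self.threshold
    LEVELS.fetch(ENV.fetch("LOG_LEVEL", "info").downcase, LEVELS["info"])
  end

  # Returns true if the message was written, false if it was filtered out.
  def self.log(level, tag, msg, io: $stderr)
    return false if LEVELS.fetch(level) < threshold
    io.puts "[#{level.upcase}] #{tag}: #{msg}"
    true
  end

  # Define Log.debug, Log.info, Log.warn and Log.error.
  LEVELS.each_key do |level|
    define_singleton_method(level) { |tag, msg, io: $stderr| log(level, tag, msg, io: io) }
  end
end
```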

2026-04-13 — Schema Migrations (0e4e035)

  • Added lib/migrate.rb — lightweight sequential migration runner backed by a schema_migrations table.
  • Migration v1: adds enrichment and geocode columns to all existing da_* tables.
  • Migration v2: creates geo_cache table.
  • run_all.sh now runs ruby /app/lib/migrate.rb before scrapers.
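
The idea of a sequential runner can be sketched without a database. MiniMigrator is hypothetical, and its in-memory applied list stands in for the schema_migrations table; the real runner in lib/migrate.rb may be structured differently.

```ruby
# Hypothetical in-memory runner; `applied` stands in for schema_migrations.
class MiniMigrator
  def initialize(applied = [])
    @applied = applied
    @steps = {}
  end

  # Register a migration body under an integer version.
  def register(version, &block)
    @steps[version] = block
  end

  # Run pending steps in ascending version order and return the versions run.
  def migrate!
    pending = @steps.keys.sort - @applied
    pending.each do |version|
      @steps[version].call
      # A database-backed runner would record the version in its table here.
      @applied << version
    end
    pending
  end
end
```

Because applied versions are recorded, running the migrator before every scrape is idempotent: a second run finds nothing pending.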

2026-04-13 — SQL Injection Hardening (f3c06ab)

  • Added DB.validate_table_name! — enforces da_[a-z0-9_]+ pattern on every table name before interpolation into SQL.
  • Applied DB.client.escape() on all remaining identifier interpolations.
  • Applied validate_table_name! in lib/geocode.rb and lib/enrich.rb.
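
A check of this kind can be sketched as a standalone function. The anchors (\A and \z) and the use of ArgumentError are assumptions; the changelog only states the da_[a-z0-9_]+ pattern.

```ruby
# Assumed anchored form of the da_[a-z0-9_]+ pattern.
TABLE_NAME_RE = /\Ada_[a-z0-9_]+\z/

# Returns the name unchanged when valid; raises before it can reach any SQL.
def validate_table_name!(name)
  unless name.is_a?(String) && name.match?(TABLE_NAME_RE)
    raise ArgumentError, "invalid table name: #{name.inspect}"
  end
  name
end
```

Using \A and \z rather than ^ and $ matters in Ruby, where ^ and $ match at line boundaries and would accept a valid first line followed by a newline and arbitrary SQL.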

2026-04-12 — Initial Upload (ab11792)

  • 28 council scrapers covering all Tasmanian councils.
  • lib/db.rb — DB client, ensure_table!, upsert with write-once semantics.
  • lib/http.rb — HTTP client with retries, cookie jar, 403/406 warmup, curl fallback.
  • lib/geocode.rb — Google Maps geocoding with SHA1 cache in geo_cache.
  • lib/enrich.rb: enrich_after_upsert! for per-row geocoding and property lookup.
  • lib/util.rb: parse_aus_date, council/table name mappings.
  • web/index.php — PHP search portal with dynamic UNION across all da_* tables.
  • tools/backfill_geocode.rb — batch geocode backfill.
  • tools/import_sqlites.rb — import from legacy SQLite exports.
  • Docker Compose stack: MariaDB 10.11, Ruby 3.2 scraper, PHP/Apache web, Adminer.
  • run_all.sh — discovers and runs scrapers with ONLY/SKIP filtering.
  • entrypoint.sh — Docker entry with optional loop via SCRAPE_EVERY_MINUTES.
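
As an illustration of the day-first date handling that a helper named parse_aus_date implies, here is a minimal sketch. The formats it accepts are assumptions; the changelog does not list what the real function in lib/util.rb supports.

```ruby
require "date"

# Assumed day-first formats; the real helper may accept more or fewer.
AUS_DATE_FORMATS = ["%d/%m/%Y", "%d-%m-%Y", "%d %B %Y"].freeze

# Returns a Date, or nil when the text is blank or matches no format.
def parse_aus_date(text)
  s = text.to_s.strip
  return nil if s.empty?
  AUS_DATE_FORMATS.each do |fmt|
    begin
      return Date.strptime(s, fmt)
    rescue ArgumentError
      # Date::Error is a subclass of ArgumentError; try the next format.
      next
    end
  end
  nil
end
```

Returning nil rather than raising fits the upsert convention noted elsewhere in this changelog, where nil leaves an existing column value untouched.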