
Changelog

All notable changes to the TAS Councils scraping pipeline are recorded here. Entries are grouped by push/session in reverse-chronological order.


2026-04-13 — Code Quality & Bug Fixes (current)

Bug fixes

  • Fixed Mysql2::Error Unknown column '''' in 'SET' — MariaDB 10.11's prepared-statement parser mishandles string literals ('') inside NULLIF/IF expressions in SET clauses. Replaced COALESCE(NULLIF(?, ''), col) with COALESCE(?, col); the caller now passes nil when the value is empty (lib/enrich.rb).
  • Fixed private method 'da_tables' called error in lib/migrate.rb — migration lambdas call Migrate.da_tables with an explicit receiver, which Ruby does not allow for private methods. Removed da_tables from the private_class_method declaration.
  • Fixed unmatched end / dangling rescue syntax error in scrapers/launcestoncity.rb introduced during a prior cleanup pass.
  • Eliminated duplicate "Docs page had no usable links" warning (fired twice per DA) in scrapers/launcestoncity.rb.
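
The first two fixes above can be sketched in plain Ruby. This is a minimal illustration, not the code in lib/enrich.rb or lib/migrate.rb: the helper `blank_to_nil`, the table `da_example`, the column `applicant`, and the `MigrateDemo` modules are all hypothetical names chosen for the example.

```ruby
# Sketch 1: turn empty values into nil on the Ruby side, so the SQL
# statement no longer needs a '' literal inside NULLIF.
# `blank_to_nil` is a hypothetical helper name.
def blank_to_nil(value)
  value.to_s.strip.empty? ? nil : value
end

# When the bound parameter is NULL, COALESCE(?, applicant) falls back to
# the existing column value. Table and column names are illustrative.
UPDATE_SQL = "UPDATE da_example SET applicant = COALESCE(?, applicant) WHERE id = ?"

# Sketch 2: a private class method cannot be called with an explicit
# receiver, even from a lambda defined inside the same module.
module MigrateDemo
  def self.da_tables
    %w[da_example_one da_example_two]
  end
  private_class_method :da_tables

  # Calling this lambda raises NoMethodError (private method called).
  CALL_WITH_RECEIVER = -> { MigrateDemo.da_tables }
end

# The same module with the method left public: the lambda now works.
module MigrateFixedDemo
  def self.da_tables
    %w[da_example_one da_example_two]
  end

  CALL_WITH_RECEIVER = -> { MigrateFixedDemo.da_tables }
end
```

Normalising on the Ruby side keeps the prepared statement free of string literals, which is the construct the changelog entry identifies as the trigger.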

Removed

  • Deleted scrapers/enrich.rb — a stale copy of lib/enrich.rb with wrong require_relative paths, the old broken COALESCE(NULLIF(?, '')) query, and no main batch loop. It was being picked up by run_all.sh's scrapers/*.rb glob and failed every full run with a LoadError.

Docs

  • Updated CLAUDE.md: corrected geocode-backfill command to use tools/backfill_geocode.rb, updated schema-change guidance to point to lib/migrate.rb.
  • Updated README.md: removed stale tools/enrich.rb references, corrected enrichment/backfill examples, updated tools table.

2026-04-13 — Structure Updates (5f60868)

  • General structural cleanup across scrapers.

2026-04-13 — Launceston City Scraper (3fc874cbc3490f)

  • Implemented scrapers/launcestoncity.rb for the Launceston eProperty portal (ASP.NET session-based site).
  • Session cookie management (merge_set_cookie!) to maintain ASP.NET_SessionId across requests.
  • Document listing via docget.asp with multi-variant URL probing (path-case and route-param variants).
  • probe_common_docs fallback: constructs known PDF filenames from DA number when the document list page returns no links.
  • PDF download to DOWNLOAD_DIR/launceston/<da_ref>/ when DOWNLOAD_ATTACHMENTS=1.
  • Enriches each DA from the details page (applicant, received date, advertised date, legal description).
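
The session cookie handling listed above might look roughly like the sketch below. The real signature of `merge_set_cookie!` is not shown in this changelog, so the arguments (a Hash jar plus one or more raw Set-Cookie header values) and the `cookie_header` helper are assumptions made for illustration.

```ruby
# Hypothetical sketch: merge Set-Cookie header values into a
# name => value Hash so a session id survives across requests.
def merge_set_cookie!(jar, set_cookie_headers)
  Array(set_cookie_headers).each do |header|
    # Only the leading "name=value" pair is stored; attributes such as
    # Path or HttpOnly follow the first semicolon and are ignored here.
    pair = header.to_s.split(";", 2).first.to_s.strip
    name, value = pair.split("=", 2)
    next if name.nil? || name.empty? || value.nil?

    jar[name] = value
  end
  jar
end

# Build the Cookie request header from the jar (hypothetical helper).
def cookie_header(jar)
  jar.map { |name, value| "#{name}=#{value}" }.join("; ")
end
```

With Net::HTTP, the raw header values could come from `response.get_fields("Set-Cookie")`; a later response that sets the same cookie name simply overwrites the stored value.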

2026-04-13 — Structured Logging (c03bfae)

  • Added lib/log.rb — Log.debug, Log.info, Log.warn, Log.error with LOG_LEVEL env filtering.
  • Replaced puts/warn calls across lib/ with Log.* calls.
  • Added LOG_LEVEL env var to docker-compose.yml (default: info).
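
A level-filtered logger of the kind described above can be sketched as follows. This is not the contents of lib/log.rb: the module name `LogDemo`, the output format, the `io:` keyword, and the fallback to info for unknown levels are assumptions.

```ruby
# Hypothetical sketch of LOG_LEVEL filtering; not the real lib/log.rb.
module LogDemo
  LEVELS = { "debug" => 0, "info" => 1, "warn" => 2, "error" => 3 }.freeze

  # Read LOG_LEVEL on each call; unknown values fall back to "info".
  def self.threshold
    LEVELS.fetch(ENV.fetch("LOG_LEVEL", "info").downcase, LEVELS["info"])
  end

  def self.enabled?(level)
    LEVELS.fetch(level.to_s) >= threshold
  end

  # Returns true when the message was written, false when filtered out.
  def self.write(level, message, io: $stderr)
    return false unless enabled?(level)

    io.puts("[#{level.to_s.upcase}] #{message}")
    true
  end

  # Define LogDemo.debug, LogDemo.info, LogDemo.warn and LogDemo.error.
  LEVELS.each_key do |level|
    define_singleton_method(level) do |message, io: $stderr|
      write(level, message, io: io)
    end
  end
end
```

Running with `LOG_LEVEL=warn` would then suppress `LogDemo.info` and `LogDemo.debug` while still emitting warnings and errors.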

2026-04-13 — Schema Migrations (0e4e035)

  • Added lib/migrate.rb — lightweight sequential migration runner backed by a schema_migrations table.
  • Migration v1: adds enrichment and geocode columns to all existing da_* tables.
  • Migration v2: creates geo_cache table.
  • run_all.sh now runs ruby /app/lib/migrate.rb before scrapers.
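
The idea of a sequential runner that records applied versions can be sketched without a database. In this illustration an in-memory array stands in for the schema_migrations table, and the class name `MigrationRunnerDemo` and its methods are hypothetical; lib/migrate.rb may be organised quite differently.

```ruby
# Hypothetical sketch of a sequential migration runner. An in-memory
# array stands in for the schema_migrations table.
class MigrationRunnerDemo
  Migration = Struct.new(:version, :description, :up)

  attr_reader :applied

  def initialize(applied_versions = [])
    @applied = applied_versions.dup
    @migrations = []
  end

  def register(version, description, &up)
    @migrations << Migration.new(version, description, up)
  end

  # Apply pending migrations in ascending version order. A version is
  # recorded only after its block returns without raising, so a failed
  # migration is retried on the next run.
  def run!
    @migrations.sort_by(&:version).each do |migration|
      next if @applied.include?(migration.version)

      migration.up.call
      @applied << migration.version
    end
    @applied
  end
end
```

Because applied versions are skipped, calling `run!` before every scraper run (as run_all.sh does with the real runner) is safe to repeat.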

2026-04-13 — SQL Injection Hardening (f3c06ab)

  • Added DB.validate_table_name! — enforces da_[a-z0-9_]+ pattern on every table name before interpolation into SQL.
  • Applied DB.client.escape() on all remaining identifier interpolations.
  • Applied validate_table_name! in lib/geocode.rb and lib/enrich.rb.
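
A table-name check enforcing the da_[a-z0-9_]+ pattern could look like the sketch below. The module name `DBDemo`, the error class, and the `count_sql` helper are assumptions; only the pattern itself comes from the entry above.

```ruby
# Hypothetical sketch of table-name validation before SQL interpolation.
module DBDemo
  # \A and \z anchor the whole string. In Ruby, ^ and $ match at line
  # boundaries, so a name containing a newline could slip past them.
  TABLE_NAME_PATTERN = /\Ada_[a-z0-9_]+\z/

  def self.validate_table_name!(name)
    text = name.to_s
    unless TABLE_NAME_PATTERN.match?(text)
      raise ArgumentError, "invalid table name: #{text.inspect}"
    end
    text
  end

  # Example of interpolating a table name only after validation.
  def self.count_sql(table)
    "SELECT COUNT(*) FROM `#{validate_table_name!(table)}`"
  end
end
```

Table names cannot be bound as prepared-statement parameters, which is why an allow-list pattern is a common guard for identifiers that must be interpolated.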

2026-04-12 — Initial Upload (ab11792)

  • 28 council scrapers covering all Tasmanian councils.
  • lib/db.rb — DB client, ensure_table!, upsert with write-once semantics.
  • lib/http.rb — HTTP client with retries, cookie jar, 403/406 warmup, curl fallback.
  • lib/geocode.rb — Google Maps geocoding with SHA1 cache in geo_cache.
  • lib/enrich.rb — enrich_after_upsert! for per-row geocoding and property lookup.
  • lib/util.rb — parse_aus_date, council/table name mappings.
  • web/index.php — PHP search portal with dynamic UNION across all da_* tables.
  • tools/backfill_geocode.rb — batch geocode backfill.
  • tools/import_sqlites.rb — import from legacy SQLite exports.
  • Docker Compose stack: MariaDB 10.11, Ruby 3.2 scraper, PHP/Apache web, Adminer.
  • run_all.sh — discovers and runs scrapers with ONLY/SKIP filtering.
  • entrypoint.sh — Docker entry with optional loop via SCRAPE_EVERY_MINUTES.
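
As one illustration of the components listed above, a SHA1-keyed geocode cache can be sketched with an in-memory Hash standing in for the geo_cache table. The class name `GeoCacheDemo`, the address normalisation (trim, lowercase, collapse whitespace), and the injected geocoder block are assumptions; how lib/geocode.rb actually derives its keys is not stated in this changelog.

```ruby
require "digest/sha1"

# Hypothetical sketch of a SHA1-keyed geocode cache. A Hash stands in
# for the geo_cache table and a block stands in for the geocoding API.
class GeoCacheDemo
  attr_reader :lookups

  def initialize(&geocoder)
    @geocoder = geocoder
    @store = {}
    @lookups = 0
  end

  # Normalise the address (an assumed scheme) so trivially different
  # spellings share one key, then hash it to a fixed-length string.
  def self.cache_key(address)
    normalized = address.to_s.strip.downcase.gsub(/\s+/, " ")
    Digest::SHA1.hexdigest(normalized)
  end

  # Return the cached result, calling the geocoder only on a miss.
  def fetch(address)
    key = self.class.cache_key(address)
    return @store[key] if @store.key?(key)

    @lookups += 1
    @store[key] = @geocoder.call(address)
  end
end
```

Hashing gives a fixed-length key that is convenient to index, and caching by key avoids repeating paid API calls for addresses that recur across councils.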