# Changelog

All notable changes to the TAS Councils scraping pipeline are recorded here. Entries are grouped by push/session in reverse-chronological order.

---

## 2026-04-14 — WAF Warmup, Scraper Rewrites & Bug Fixes

**`lib/http.rb` — Full browser fingerprint headers**
- Added `Upgrade-Insecure-Requests: 1`, `Sec-Fetch-Dest`, `Sec-Fetch-Mode`, `Sec-Fetch-Site`, `Sec-Fetch-User`, `sec-ch-ua`, `sec-ch-ua-mobile`, `sec-ch-ua-platform` to `BASE_HEADERS` — these are sent by all scrapers using `Http.get`/`Http.request` automatically
- Updated curl fallback to pass the same headers for consistency

**`scrapers/burnie.rb` — Two bug fixes**
- Fixed redirect loop: `next` inside `Net::HTTP.start` block only exits the block, not the `while` loop; fixed by setting a `redirect_to` variable inside the block and calling `next` on the outer loop
- Fixed `URI::InvalidURIError` on PDF URLs containing non-ASCII characters (e.g. en-dash `–` in filename): percent-encode non-ASCII chars in href before `URI.join`

**`scrapers/kingisland.rb` — Complete rewrite**
- Previously a stub that immediately exited; now implements homepage warmup + planning page fetch with browser fingerprint headers
- Parses WordPress accordion section (`div#accordion-1-c4`) for DA notices
- Extracts ref (`DA YYYY/NN`), address, description, on-notice date, and PDF link from structured paragraph text
- Falls back gracefully with a warning if the fetch fails or returns a Cloudflare challenge

**`scrapers/latrobe.rb` — Complete rewrite**
- Previous version targeted PlanBuild portal (incorrect — Latrobe is not on PlanBuild)
- Now scrapes `https://www.latrobe.tas.gov.au/services/building-and-planning-services/planningapp` directly
- Uses homepage warmup to bypass Cloudflare WAF
- Parses `li.generic-list__item h3.generic-list__title a` — link text format: `L-DA007/2026 ADDRESS - DESCRIPTION (submissions by DATE)`
- Note: site blocks Docker/datacenter IPs via Cloudflare JS challenge; scraper exits cleanly when blocked

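
As a rough illustration of the Latrobe link-text format noted above, the following sketch splits one title into its parts. The constant `LINK_TEXT_RE`, the method `parse_link_text`, and the sample address are hypothetical and not taken from `scrapers/latrobe.rb`. It also assumes the first ` - ` separates the address from the description, which real titles may not always honour.

```ruby
# Hypothetical sketch: parse a Latrobe planning link title of the form
#   L-DA007/2026 ADDRESS - DESCRIPTION (submissions by DATE)
# Names and sample text are illustrative, not from the real scraper.
LINK_TEXT_RE = %r{
  \A\s*
  (?<ref>L-DA\d+/\d{4})\s+                  # council reference
  (?<address>.+?)\s+-\s+                    # address, up to the first spaced hyphen
  (?<description>.+?)\s*
  \(submissions\s+by\s+(?<closes>[^)]+)\)   # closing date, kept as raw text
  \s*\z
}xi

def parse_link_text(text)
  m = LINK_TEXT_RE.match(text)
  return nil unless m # caller can log and skip unrecognised titles

  {
    council_reference: m[:ref],
    address: m[:address].strip,
    description: m[:description].strip,
    on_notice_to_raw: m[:closes].strip
  }
end

sample = "L-DA007/2026 1 Example Street, Latrobe - Dwelling (submissions by 28 April 2026)"
p parse_link_text(sample)
```

Returning `nil` for a non-matching title lets the caller decide whether to warn and skip, rather than raising in the middle of a listing page.
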
**`scrapers/derwentvalley.rb` — Complete rewrite**
- Previous version found 0 links (CSS selectors didn't match; news listing used lgasa/squiz.cloud redirect chain)
- Now uses homepage warmup + browser headers to pass Cloudflare
- Fetches `/home/latest-news?...=Public+Notice`; for each `news-listing__item` link extracts the `index_url` parameter from the lgasa href, GETs `lgasa-web.squiz.cloud/?a=ID` (non-following), reads `Location` header to get the real DV detail page URL
- Fetches each detail page (with DV cookies) and parses the `APP No / SITE / PROPOSAL` table
- Extracts closing date from "no later than ... DATE" pattern (fixed regex to allow dots in "5.00pm")
- Note: site blocks Docker/datacenter IPs via Cloudflare JS challenge; scraper exits cleanly when blocked

**`scrapers/georgetown.rb` — Fixed field name matching**
- `"Location"` was not matched by `/(Address|Property)/i` — address was always empty, causing all rows to be skipped
- `"Opening Date"` was not matched by date received regex
- Added `Location` and `Opening Date` to the respective patterns
- Now also extracts `applicant` ("Applicant Name"), `title_reference` ("Title reference"), and `on_notice_to` ("Closing Date") into the upsert

**`scrapers/kingisland.rb` (original stub) → replaced with full implementation** (see above)

**Docs**
- `CLAUDE.md`: added WAF/Cloudflare handling section, warmup pattern guidance, template scraper recommendations, new common gotchas (non-ASCII PDF URLs, redirect-in-block bug)
- `README.md`: added WAF and Cloudflare Handling section; updated project structure tree to include `scraper_helpers.rb` and `migrate.rb`
- `VERSIONS.md`: this entry

---

## 2026-04-13 — Scraper Fixes & Audit

**`scrapers/planbuild.rb`** — rewrote to fix crash on first item:
- Added missing `require "zlib"`, `require "stringio"`, `require_relative "../lib/log"`
- `fetch_detail` now always returns a Hash (`parsed.is_a?(Hash) ? parsed : {}`); bare `rescue {}` replaced with `rescue JSON::ParserError, Zlib::Error`
- Removed debug `puts` — replaced with `Log.debug`/`Log.info`
- `local_document_url` now passes `nil` (not `""`) when no downloads — prevents COALESCE overwriting an existing URL with empty string
- Per-item rescue so one bad reference skips and logs rather than killing the run

**`scrapers/southernmidlands.rb`** — rewrote detail page parser:
- Detail pages use `Location: / Proposal:` paragraph format, not table rows — old `table tr th/td` selector found nothing, causing 0 saves
- New parser splits `
`-separated lines per paragraph, extracts Location/Proposal fields, handles multiple DAs per item page
- Removed redundant `ALTER TABLE` block (columns already in `DB.ensure_table!`)
- Added explicit `require_relative "../lib/http"`, `"../lib/db"`, `"../lib/util"`

**Missing `require_relative "../lib/log"` — 20 scrapers fixed:**
- `break_oday`, `brighton`, `burnie`, `centralcoast`, `circularhead`, `clarence`, `derwentvalley`, `devonportcity`, `dorset`, `flinders_council`, `glenorchy`, `huonvalley`, `kentish`, `launcestoncity`, `meandervalley`, `northernmidlands`, `southernmidlands`, `waratah_wynyard`, `westcoast`, `westtamar`
- `Log.warn` called in rescue blocks in all of these — without the require, the first error would raise `NameError: uninitialized constant Log` instead of logging

**`enrich_after_upsert!` variable scope bugs — 4 scrapers fixed:**
- `flinders_council.rb`: `council_reference` (undefined) → `ref`; folded separate `UPDATE document_url` into `DB.upsert`; removed redundant `ALTER TABLE`
- `huonvalley.rb`: `council_reference`/`address` (undefined) → `r[:council_reference]`/`r[:address]`; folded `UPDATE document_url` into upsert; removed redundant `ALTER TABLE`
- `kentish.rb`: `council_reference`/`address` (undefined) → `r[:council_reference]`/`r[:address]`; folded extras UPDATE into upsert
- `westcoast.rb`: `address` (undefined) → `item[:address]`; fixed upsert field names (`on_notice` → `on_notice_to`, `on_notice_raw` → `on_notice_to_raw`); fixed values referencing non-existent item keys; folded extras UPDATE into upsert

**Redundant `ALTER TABLE` blocks removed** from `circularhead.rb` and `waratah_wynyard.rb` — all columns already created by `DB.ensure_table!`

---

## 2026-04-13 — Code Quality Pass 3

**Logging**
- All 63 bare `warn "..."` calls across `scrapers/*.rb` replaced with `Log.warn "scraper", "..."` — structured logging now consistent throughout; stderr output is now filtered by `LOG_LEVEL`.

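
To make the `Log.warn "scraper", "..."` call shape and the `LOG_LEVEL` filtering concrete, here is a minimal sketch of a tag-plus-message logger. It is not the project's `lib/log.rb`: the module name `SketchLog`, the output format, and the `io:` keyword are assumptions made so the example is self-contained.

```ruby
# Hypothetical stand-in for a LOG_LEVEL-filtered logger; not the real lib/log.rb.
module SketchLog
  LEVELS = { "debug" => 0, "info" => 1, "warn" => 2, "error" => 3 }.freeze

  # Threshold read from the environment; unknown or missing values fall back to info.
  def self.threshold
    LEVELS.fetch(ENV.fetch("LOG_LEVEL", "info").downcase, LEVELS["info"])
  end

  LEVELS.each_key do |level|
    # Defines SketchLog.debug / .info / .warn / .error(tag, message).
    define_singleton_method(level) do |tag, message, io: $stderr|
      next if LEVELS[level] < threshold # below threshold: drop silently

      io.puts "[#{level.upcase}] [#{tag}] #{message}"
    end
  end
end

ENV["LOG_LEVEL"] = "warn"
SketchLog.info "burnie", "fetched listing page"         # filtered out
SketchLog.warn "burnie", "detail page returned no rows" # written to stderr
```

Reading the threshold on every call keeps the sketch simple; a real logger might cache it at load time instead.
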
**DB.upsert dynamic rewrite** (`lib/db.rb`)
- Removed hardcoded 22-column array — `upsert` now derives columns from `row.keys`, so scrapers that pass scraper-specific columns (e.g. `advertised_date`, `legal_description`) are no longer silently ignored.
- Added `SAFE_COLUMN_RE = /\A[a-z][a-z0-9_]*\z/` — each key is validated before interpolation into SQL; unsafe names raise `ArgumentError` rather than silently passing.
- Extracted write-once/merge semantics into `UPSERT_ON_DUP` constant (`date_received`, `date_received_raw`, `document_url`, `local_document_url`) — easier to audit and extend.
- Non-existent columns now raise `Mysql2::Error` (caught by scraper rescue) instead of silently being dropped, surfacing schema mismatches early.

---

## 2026-04-13 — Code Quality Pass 2

**Security**
- `lib/http.rb` curl fallback: replaced shell-interpolated backtick call with `Open3.capture2` array form — eliminates shell injection risk from URL-derived `ref`/`uri` values.
- `web/index.php`: added `validate_table_name()` helper (enforces `/\Ada_[a-z0-9_]+\z/`) applied before every backtick-quoted table name interpolation (`tableHasColumn`, `$stageT` stages fetch, UNION SELECT builder).

**Schema consolidation**
- Removed `Geocode.ensure_da_columns!` from `lib/geocode.rb` — redundant, covered by `DB.ensure_table!` (new tables) and migration v1 (existing tables). Removed its call from `tools/backfill_geocode.rb`.
- Removed `ensure_extra_columns!` from `lib/enrich.rb` and all 10 scraper call-sites — same reasoning; was also using wrong column types (`DOUBLE`/`VARCHAR(50)`) vs canonical schema (`DECIMAL(10,7)`/`TEXT`).

**Error handling**
- 66 bare `rescue => e` replaced with `rescue StandardError => e` across all scrapers, lib, and tools. This makes the rescued class explicit; a bare `rescue => e` already catches only `StandardError`, so `SystemExit`/`SignalException` continue to propagate as before.
- `lib/enrich.rb`: two `warn` calls replaced with `Log.warn` for structured logging; stale file header comment removed.

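
The shell-injection fix listed under **Security** relies on passing the command as separate arguments so that no shell ever parses them. The sketch below illustrates that property with a harmless child Ruby process standing in for curl. The helper `echo_argument` and the sample URL are hypothetical, and the actual arguments used in `lib/http.rb` are not shown in this changelog.

```ruby
require "open3"
require "rbconfig"

# Hypothetical helper: run a child process with `value` as ONE argv entry.
# Given several separate arguments, Open3.capture2 executes the program
# directly (no /bin/sh), so shell metacharacters are never interpreted.
def echo_argument(value)
  out, status = Open3.capture2(RbConfig.ruby, "-e", "print ARGV[0]", value)
  raise "child failed: #{status.exitstatus}" unless status.success?

  out
end

hostile = "https://example.invalid/doc.pdf; echo pwned $(id)"
puts echo_argument(hostile) # the child receives and prints the string unchanged
```

By contrast, interpolating the same string into a backtick command would hand `; echo pwned $(id)` to the shell for execution.
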
**Removed**
- Deleted `scrapers/enrich.rb` — stale duplicate with wrong `require_relative` paths, old broken `COALESCE(NULLIF(?, ''))` query, no main batch loop. Was picked up by `run_all.sh`'s glob and failing every full run with `LoadError`.

**Docs**
- `CLAUDE.md`: corrected scraper pattern (removed `ensure_extra_columns!(TABLE)` step), updated geocode-backfill command, corrected schema-change guidance.
- `README.md`: removed stale `tools/enrich.rb` references; corrected enrichment/backfill examples and tools table; added link to VERSIONS.md.
- `VERSIONS.md`: created — changelog covering all changes from initial upload.

---

## 2026-04-13 — Code Quality & Bug Fixes

**Bug fixes**
- Fixed `Mysql2::Error Unknown column '''' in 'SET'` — MariaDB 10.11's prepared-statement parser mishandles string literals (`''`) inside `NULLIF`/`IF` expressions in `SET` clauses. Replaced `COALESCE(NULLIF(?, ''), col)` with `COALESCE(?, col)` passing `nil` when the value is empty (`lib/enrich.rb`).
- Fixed `private method 'da_tables' called` error in `lib/migrate.rb` — migration lambdas call `Migrate.da_tables` with an explicit receiver, which counts as a public call. Removed `da_tables` from `private_class_method` declaration.
- Fixed unmatched `end` / dangling `rescue` syntax error in `scrapers/launcestoncity.rb` introduced during a prior cleanup pass.
- Eliminated duplicate "Docs page had no usable links" warning (fired twice per DA) in `scrapers/launcestoncity.rb`.

**Removed**
- Deleted `scrapers/enrich.rb` — stale copy of `lib/enrich.rb` with wrong `require_relative` paths, old broken `COALESCE(NULLIF(?, ''))` query, and no main batch loop. Was being picked up by `run_all.sh`'s `scrapers/*.rb` glob and failing every full run with a `LoadError`.

**Docs**
- Updated `CLAUDE.md`: corrected geocode-backfill command to use `tools/backfill_geocode.rb`, updated schema-change guidance to point to `lib/migrate.rb`.
- Updated `README.md`: removed stale `tools/enrich.rb` references, corrected enrichment/backfill examples, updated tools table.

---

## 2026-04-13 — Structure Updates (5f60868)

- General structural cleanup across scrapers.

---

## 2026-04-13 — Launceston City Scraper (3fc874c → bc3490f)

- Implemented `scrapers/launcestoncity.rb` for the Launceston eProperty portal (ASP.NET session-based site).
- Session cookie management (`merge_set_cookie!`) to maintain ASP.NET_SessionId across requests.
- Document listing via `docget.asp` with multi-variant URL probing (path-case and route-param variants).
- `probe_common_docs` fallback: constructs known PDF filenames from DA number when the document list page returns no links.
- PDF download to `DOWNLOAD_DIR/launceston//` when `DOWNLOAD_ATTACHMENTS=1`.
- Enriches each DA from the details page (applicant, received date, advertised date, legal description).

---

## 2026-04-13 — Structured Logging (c03bfae)

- Added `lib/log.rb` — `Log.debug`, `Log.info`, `Log.warn`, `Log.error` with `LOG_LEVEL` env filtering.
- Replaced `puts`/`warn` calls across `lib/` with `Log.*` calls.
- Added `LOG_LEVEL` env var to `docker-compose.yml` (default: `info`).

---

## 2026-04-13 — Schema Migrations (0e4e035)

- Added `lib/migrate.rb` — lightweight sequential migration runner backed by a `schema_migrations` table.
- Migration v1: adds enrichment and geocode columns to all existing `da_*` tables.
- Migration v2: creates `geo_cache` table.
- `run_all.sh` now runs `ruby /app/lib/migrate.rb` before scrapers.

---

## 2026-04-13 — SQL Injection Hardening (f3c06ab)

- Added `DB.validate_table_name!` — enforces `da_[a-z0-9_]+` pattern on every table name before interpolation into SQL.
- Applied `DB.client.escape()` on all remaining identifier interpolations.
- Applied `validate_table_name!` in `lib/geocode.rb` and `lib/enrich.rb`.

---

## 2026-04-12 — Initial Upload (ab11792)

- 28 council scrapers covering all Tasmanian councils.
- `lib/db.rb` — DB client, `ensure_table!`, upsert with write-once semantics.
- `lib/http.rb` — HTTP client with retries, cookie jar, 403/406 warmup, curl fallback.
- `lib/geocode.rb` — Google Maps geocoding with SHA1 cache in `geo_cache`.
- `lib/enrich.rb` — `enrich_after_upsert!` for per-row geocoding and property lookup.
- `lib/util.rb` — `parse_aus_date`, council/table name mappings.
- `web/index.php` — PHP search portal with dynamic UNION across all `da_*` tables.
- `tools/backfill_geocode.rb` — batch geocode backfill.
- `tools/import_sqlites.rb` — import from legacy SQLite exports.
- Docker Compose stack: MariaDB 10.11, Ruby 3.2 scraper, PHP/Apache web, Adminer.
- `run_all.sh` — discovers and runs scrapers with `ONLY`/`SKIP` filtering.
- `entrypoint.sh` — Docker entry with optional loop via `SCRAPE_EVERY_MINUTES`.