# Changelog
All notable changes to the TAS Councils scraping pipeline are recorded here.
Entries are grouped by push/session in reverse-chronological order.
---
## 2026-04-14 — WAF Warmup, Scraper Rewrites & Bug Fixes
**`lib/http.rb` — Full browser fingerprint headers**
- Added `Upgrade-Insecure-Requests: 1`, `Sec-Fetch-Dest`, `Sec-Fetch-Mode`, `Sec-Fetch-Site`, `Sec-Fetch-User`, `sec-ch-ua`, `sec-ch-ua-mobile`, `sec-ch-ua-platform` to `BASE_HEADERS`; every scraper that goes through `Http.get`/`Http.request` now sends them automatically
- Updated curl fallback to pass the same headers for consistency
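
As a rough illustration of how a header set like this might be attached to a request and mirrored in curl arguments: the names `SKETCH_HEADERS`, `build_get` and `curl_header_args` are hypothetical, and every header value is an assumption (typical of a top-level navigation in a Chromium-based browser), not the contents of `lib/http.rb`.

```ruby
require "net/http"
require "uri"

# Hypothetical header set. Values are typical browser defaults and are NOT
# copied from lib/http.rb.
SKETCH_HEADERS = {
  "Upgrade-Insecure-Requests" => "1",
  "Sec-Fetch-Dest" => "document",
  "Sec-Fetch-Mode" => "navigate",
  "Sec-Fetch-Site" => "none",
  "Sec-Fetch-User" => "?1",
  "sec-ch-ua" => '"Chromium";v="124", "Not-A.Brand";v="99"',
  "sec-ch-ua-mobile" => "?0",
  "sec-ch-ua-platform" => '"Linux"'
}.freeze

# Build a GET request carrying every header (no network access happens here).
def build_get(url, headers = SKETCH_HEADERS)
  request = Net::HTTP::Get.new(URI(url))
  headers.each { |name, value| request[name] = value }
  request
end

# The same headers expressed as curl arguments, in array form.
def curl_header_args(headers = SKETCH_HEADERS)
  headers.flat_map { |name, value| ["-H", "#{name}: #{value}"] }
end
```

Keeping one hash as the source for both code paths is what keeps the Net::HTTP and curl requests consistent.
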
**`scrapers/burnie.rb` — Two bug fixes**
- Fixed redirect loop: `next` inside the `Net::HTTP.start` block only exits the block, not the enclosing `while` loop. The block now sets a `redirect_to` variable, and the outer loop calls `next` when it is set
- Fixed `URI::InvalidURIError` on PDF URLs containing non-ASCII characters (e.g. en-dash `–` in filename): percent-encode non-ASCII chars in href before `URI.join`
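
Both fixes can be sketched in isolation. This is a minimal sketch, not the code in `scrapers/burnie.rb`: `escape_non_ascii`, `with_connection` (a stand-in for `Net::HTTP.start`), `follow_redirects` and the `fetch` callable are all hypothetical names.

```ruby
require "uri"

# Percent-encode every non-ASCII character in an href, leaving ASCII untouched.
def escape_non_ascii(href)
  href.gsub(/[^[:ascii:]]/) { |ch| ch.bytes.map { |b| format("%%%02X", b) }.join }
end

# Stand-in for Net::HTTP.start: yields and returns the block's value.
def with_connection
  yield
end

# Follow redirects. `fetch` is a hypothetical callable returning [status, location].
def follow_redirects(url, fetch, limit: 5)
  hops = 0
  while hops < limit
    redirect_to = nil
    status = with_connection do
      code, location = fetch.call(url)
      # A `next` here would only leave this block, so record the target instead.
      redirect_to = location if code.between?(300, 399)
      code
    end
    if redirect_to
      url = redirect_to
      hops += 1
      next # this `next` belongs to the while loop
    end
    return [status, url]
  end
  raise "too many redirects"
end
```

With the encoding applied first, `URI.join(base, escape_non_ascii(href))` receives an ASCII-only string and no longer raises on an en-dash in the filename.
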
**`scrapers/kingisland.rb` — Complete rewrite**
- Previously a stub that immediately exited; now implements homepage warmup + planning page fetch with browser fingerprint headers
- Parses WordPress accordion section (`div#accordion-1-c4`) for DA notices
- Extracts ref (`DA YYYY/NN`), address, description, on-notice date, and PDF link from structured paragraph text
- Falls back gracefully with a warning if the fetch fails or returns a Cloudflare challenge
**`scrapers/latrobe.rb` — Complete rewrite**
- Previous version targeted PlanBuild portal (incorrect — Latrobe is not on PlanBuild)
- Now scrapes `https://www.latrobe.tas.gov.au/services/building-and-planning-services/planningapp` directly
- Uses homepage warmup to bypass Cloudflare WAF
- Parses `li.generic-list__item h3.generic-list__title a` — link text format: `L-DA007/2026 ADDRESS - DESCRIPTION (submissions by DATE)`
- Note: site blocks Docker/datacenter IPs via Cloudflare JS challenge; scraper exits cleanly when blocked
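
A minimal sketch of how link text in that format could be split into fields. The regex, the helper name `parse_link_text` and the sample string in the usage note are assumptions for illustration; the real scraper may differ.

```ruby
# Hypothetical pattern for "L-DA007/2026 ADDRESS - DESCRIPTION (submissions by DATE)".
LINK_TEXT_RE = /
  \A\s*
  (?<ref>L-DA\d+\/\d{4})\s+                # council reference
  (?<address>.+?)\s+-\s+                   # address, up to the first " - "
  (?<description>.+?)\s*                   # free-text description
  \(submissions\s+by\s+(?<closes>[^)]+)\)  # closing date, left unparsed
  \s*\z
/xi

# Returns a Hash of fields, or nil when the text is not a DA link.
def parse_link_text(text)
  m = LINK_TEXT_RE.match(text.to_s)
  return nil unless m

  {
    council_reference: m[:ref],
    address: m[:address].strip,
    description: m[:description].strip,
    on_notice_to_raw: m[:closes].strip
  }
end
```

For example, `parse_link_text("L-DA007/2026 12 Example Street, Latrobe - Dwelling (submissions by 28 April 2026)")` yields the reference, address, description and raw closing date as separate keys. An address that itself contains " - " would be split early, so treat this as a starting point only.
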
**`scrapers/derwentvalley.rb` — Complete rewrite**
- Previous version found 0 links (CSS selectors didn't match; news listing used lgasa/squiz.cloud redirect chain)
- Now uses homepage warmup + browser headers to pass Cloudflare
- Fetches `/home/latest-news?...=Public+Notice`; for each `news-listing__item` link extracts the `index_url` parameter from the lgasa href, GETs `lgasa-web.squiz.cloud/?a=ID` (non-following), reads `Location` header to get the real DV detail page URL
- Fetches each detail page (with DV cookies) and parses the `APP No / SITE / PROPOSAL` table
- Extracts closing date from "no later than ... DATE" pattern (fixed regex to allow dots in "5.00pm")
- Note: site blocks Docker/datacenter IPs via Cloudflare JS challenge; scraper exits cleanly when blocked
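
The closing-date extraction might look roughly like the following. The exact wording on the Derwent Valley pages is not recorded here, so the regex, the helper name `closing_date` and the sample sentence in the usage note are assumptions.

```ruby
require "date"

# Hypothetical pattern: optional time (dots or colons allowed, e.g. "5.00pm"),
# optional "on", optional weekday, then a "27 April 2026" style date.
CLOSING_RE = /
  no\s+later\s+than\s+
  (?<time>\d{1,2}(?:[.:]\d{2})?\s*[ap]\.?m\.?)?\s*
  (?:on\s+)?
  (?:[A-Za-z]+day,?\s+)?
  (?<date>\d{1,2}\s+[A-Za-z]+\s+\d{4})
/xi

# Returns a Date, or nil when no parseable closing date is present.
def closing_date(text)
  m = CLOSING_RE.match(text.to_s)
  return nil unless m

  Date.strptime(m[:date], "%d %B %Y")
rescue ArgumentError
  nil
end
```

Calling `closing_date("received no later than 5.00pm on Monday, 27 April 2026")` returns the `Date` for 27 April 2026; a pattern that only allowed digits and colons in the time would fail to match at the dot.
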
**`scrapers/georgetown.rb` — Fixed field name matching**
- `"Location"` was not matched by `/(Address|Property)/i` — address was always empty, causing all rows to be skipped
- `"Opening Date"` was not matched by date received regex
- Added `Location` and `Opening Date` to the respective patterns
- Now also extracts `applicant` ("Applicant Name"), `title_reference` ("Title reference"), and `on_notice_to` ("Closing Date") into the upsert
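
A label-to-field lookup along these lines would cover the cases above. Only `Address|Property`, `Location` and `Opening Date` come from this entry; the other date alternatives, the constant `FIELD_PATTERNS` and the helper `field_for` are assumptions.

```ruby
# Hypothetical mapping from page labels to upsert fields.
FIELD_PATTERNS = {
  address: /(Address|Property|Location)/i,
  date_received: /(Received|Lodged|Opening Date)/i, # first two alternatives assumed
  applicant: /Applicant Name/i,
  title_reference: /Title reference/i,
  on_notice_to: /Closing Date/i
}.freeze

# Returns the field symbol for a label, or nil when nothing matches.
def field_for(label)
  FIELD_PATTERNS.each { |field, pattern| return field if label.to_s.match?(pattern) }
  nil
end
```
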
**`scrapers/kingisland.rb` (original stub) → replaced with full implementation** (see above)

**Docs**
- `CLAUDE.md`: added WAF/Cloudflare handling section, warmup pattern guidance, template scraper recommendations, new common gotchas (non-ASCII PDF URLs, redirect-in-block bug)
- `README.md`: added WAF and Cloudflare Handling section; updated project structure tree to include `scraper_helpers.rb` and `migrate.rb`
- `VERSIONS.md`: this entry
---
## 2026-04-13 — Scraper Fixes & Audit
**`scrapers/planbuild.rb`** — rewrote to fix crash on first item:
- Added missing `require "zlib"`, `require "stringio"`, `require_relative "../lib/log"`
- `fetch_detail` now always returns a Hash (`parsed.is_a?(Hash) ? parsed : {}`); bare `rescue {}` replaced with `rescue JSON::ParserError, Zlib::Error`
- Removed debug `puts` — replaced with `Log.debug`/`Log.info`
- `local_document_url` now passes `nil` (not `""`) when no downloads — prevents COALESCE overwriting an existing URL with empty string
- Per-item rescue so one bad reference skips and logs rather than killing the run
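
The "always returns a Hash" contract can be sketched as below. `parse_detail` is a hypothetical helper working on a response body that has already been fetched; the gzip sniffing is an assumption about why `zlib` and `stringio` are required.

```ruby
require "json"
require "zlib"
require "stringio"

# Decode an optionally gzip-compressed JSON body. Always returns a Hash.
def parse_detail(body)
  raw = body.to_s
  # gzip streams start with the magic bytes 0x1F 0x8B
  if raw.byteslice(0, 2).bytes == [0x1F, 0x8B]
    raw = Zlib::GzipReader.new(StringIO.new(raw)).read
  end
  parsed = JSON.parse(raw)
  parsed.is_a?(Hash) ? parsed : {}
rescue JSON::ParserError, Zlib::Error
  # Only the two expected failure modes are swallowed; anything else propagates.
  {}
end
```

Callers can then use `detail["key"]` without a nil check, and an unexpected exception class still surfaces in the per-item rescue.
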
**`scrapers/southernmidlands.rb`** — rewrote detail page parser:
- Detail pages use `Location: / Proposal:` paragraph format, not table rows — old `table tr th/td` selector found nothing, causing 0 saves
- New parser splits `<br>`-separated lines per paragraph, extracts Location/Proposal fields, handles multiple DAs per item page
- Removed redundant `ALTER TABLE` block (columns already in `DB.ensure_table!`)
- Added explicit `require_relative` calls for `"../lib/http"`, `"../lib/db"`, and `"../lib/util"`
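
A minimal sketch of the paragraph-based parsing, assuming the lines inside each paragraph are separated by `<br>` tags. It works on a raw HTML string with the standard library only; `parse_paragraph` and `parse_paragraphs` are hypothetical names and the real parser may use an HTML library instead.

```ruby
# Extract Location/Proposal fields from the inner HTML of one paragraph.
def parse_paragraph(inner_html)
  fields = {}
  inner_html.to_s.split(/<br\s*\/?>/i).each do |line|
    text = line.gsub(/<[^>]+>/, "").strip # drop inline tags such as <strong>
    m = text.match(/\A(Location|Proposal)\s*:\s*(.+)\z/i)
    fields[m[1].downcase.to_sym] = m[2].strip if m
  end
  fields
end

# One item page may list several DAs, one per paragraph; keep the complete ones.
def parse_paragraphs(paragraphs)
  paragraphs.map { |html| parse_paragraph(html) }
            .select { |f| f[:location] && f[:proposal] }
end
```
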
**Missing `require_relative "../lib/log"` — 20 scrapers fixed:**
- `break_oday`, `brighton`, `burnie`, `centralcoast`, `circularhead`, `clarence`, `derwentvalley`, `devonportcity`, `dorset`, `flinders_council`, `glenorchy`, `huonvalley`, `kentish`, `launcestoncity`, `meandervalley`, `northernmidlands`, `southernmidlands`, `waratah_wynyard`, `westcoast`, `westtamar`
- `Log.warn` called in rescue blocks in all of these — without the require, the first error would raise `NameError: uninitialized constant Log` instead of logging
**`enrich_after_upsert!` variable scope bugs — 4 scrapers fixed:**
- `flinders_council.rb`: `council_reference` (undefined) → `ref`; folded separate `UPDATE document_url` into `DB.upsert`; removed redundant `ALTER TABLE`
- `huonvalley.rb`: `council_reference`/`address` (undefined) → `r[:council_reference]`/`r[:address]`; folded `UPDATE document_url` into upsert; removed redundant `ALTER TABLE`
- `kentish.rb`: `council_reference`/`address` (undefined) → `r[:council_reference]`/`r[:address]`; folded extras UPDATE into upsert
- `westcoast.rb`: `address` (undefined) → `item[:address]`; fixed upsert field names (`on_notice` → `on_notice_to`, `on_notice_raw` → `on_notice_to_raw`); fixed values referencing non-existent item keys; folded extras UPDATE into upsert
**Redundant `ALTER TABLE` blocks removed** from `circularhead.rb` and `waratah_wynyard.rb` — all columns already created by `DB.ensure_table!`
---
## 2026-04-13 — Code Quality Pass 3
**Logging**
- All 63 bare `warn "..."` calls across `scrapers/*.rb` replaced with `Log.warn "scraper", "..."` — structured logging now consistent throughout; stderr output is now filtered by `LOG_LEVEL`.
**DB.upsert dynamic rewrite** (`lib/db.rb`)
- Removed hardcoded 22-column array — `upsert` now derives columns from `row.keys`, so scrapers that pass scraper-specific columns (e.g. `advertised_date`, `legal_description`) are no longer silently ignored.
- Added `SAFE_COLUMN_RE = /\A[a-z][a-z0-9_]*\z/` — each key is validated before interpolation into SQL; unsafe names raise `ArgumentError` rather than silently passing.
- Extracted write-once/merge semantics into `UPSERT_ON_DUP` constant (`date_received`, `date_received_raw`, `document_url`, `local_document_url`) — easier to audit and extend.
- Non-existent columns now raise `Mysql2::Error` (caught by scraper rescue) instead of silently being dropped, surfacing schema mismatches early.
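
The shape of a key-derived upsert can be sketched as follows. This is not the code in `lib/db.rb`: `build_upsert_sql` and `MERGE_COLUMNS` are hypothetical names, and the `COALESCE(VALUES(col), col)` rule is one assumed reading of "write-once/merge". The table name is taken to be validated elsewhere.

```ruby
SAFE_COLUMN_RE = /\A[a-z][a-z0-9_]*\z/

# Columns listed under UPSERT_ON_DUP in this entry. The COALESCE treatment
# below is an assumption about what their merge semantics mean.
MERGE_COLUMNS = %w[date_received date_received_raw document_url local_document_url].freeze

# Build a parameterised INSERT ... ON DUPLICATE KEY UPDATE from the row's keys.
def build_upsert_sql(table, row)
  columns = row.keys.map(&:to_s)
  columns.each do |column|
    raise ArgumentError, "unsafe column name: #{column.inspect}" unless column.match?(SAFE_COLUMN_RE)
  end

  quoted = columns.map { |column| "`#{column}`" }
  updates = quoted.zip(columns).map do |q, column|
    MERGE_COLUMNS.include?(column) ? "#{q} = COALESCE(VALUES(#{q}), #{q})" : "#{q} = VALUES(#{q})"
  end

  "INSERT INTO `#{table}` (#{quoted.join(', ')}) " \
    "VALUES (#{Array.new(columns.size, '?').join(', ')}) " \
    "ON DUPLICATE KEY UPDATE #{updates.join(', ')}"
end
```

Under this reading, passing `nil` for a merge column keeps the stored value, while an empty string would replace it.
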
---
## 2026-04-13 — Code Quality Pass 2
**Security**
- `lib/http.rb` curl fallback: replaced shell-interpolated backtick call with `Open3.capture2` array form — eliminates shell injection risk from URL-derived `ref`/`uri` values.
- `web/index.php`: added `validate_table_name()` helper (enforces `/\Ada_[a-z0-9_]+\z/`) applied before every backtick-quoted table name interpolation (`tableHasColumn`, `$stageT` stages fetch, UNION SELECT builder).
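
The difference the array form makes can be shown without touching the network. `curl_argv` and `run_argv` are hypothetical helpers and the curl flags chosen are assumptions; the point is that `Open3.capture2` with separate arguments never invokes a shell.

```ruby
require "open3"

# Build curl arguments as an array; the URL is a single argv element.
def curl_argv(url, headers = {})
  argv = ["curl", "-sS", "-L", "--max-time", "30"]
  headers.each { |name, value| argv.push("-H", "#{name}: #{value}") }
  argv << url.to_s
end

# Run a command given as an array. Returns [stdout, success?].
def run_argv(argv)
  stdout, status = Open3.capture2(*argv)
  [stdout, status.success?]
end
```

Running `run_argv(["echo", "$(whoami); echo pwned"])` prints the string literally, whereas the same text inside backticks would be expanded and executed by the shell.
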
**Schema consolidation**
- Removed `Geocode.ensure_da_columns!` from `lib/geocode.rb` — redundant, covered by `DB.ensure_table!` (new tables) and migration v1 (existing tables). Removed its call from `tools/backfill_geocode.rb`.
- Removed `ensure_extra_columns!` from `lib/enrich.rb` and all 10 scraper call-sites — same reasoning; was also using wrong column types (`DOUBLE`/`VARCHAR(50)`) vs canonical schema (`DECIMAL(10,7)`/`TEXT`).
**Error handling**
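- 66 bare `rescue => e` replaced with `rescue StandardError => e` across all scrapers, lib, and tools. A bare `rescue` already catches only `StandardError`, so runtime semantics are unchanged; the explicit form documents that `SystemExit`/`SignalException` are meant to propagate and guards against a later edit widening the clause to `Exception`.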
- `lib/enrich.rb`: two `warn` calls replaced with `Log.warn` for structured logging; stale file header comment removed.
**Removed**
- Deleted `scrapers/enrich.rb` — stale duplicate with wrong `require_relative` paths, old broken `COALESCE(NULLIF(?, ''))` query, no main batch loop. Was picked up by `run_all.sh`'s glob and failing every full run with `LoadError`.
**Docs**
- `CLAUDE.md`: corrected scraper pattern (removed `ensure_extra_columns!(TABLE)` step), updated geocode-backfill command, corrected schema-change guidance.
- `README.md`: removed stale `tools/enrich.rb` references; corrected enrichment/backfill examples and tools table; added link to VERSIONS.md.
- `VERSIONS.md`: created — changelog covering all changes from initial upload.
---
## 2026-04-13 — Code Quality & Bug Fixes
**Bug fixes**
- Fixed `Mysql2::Error Unknown column '''' in 'SET'` — MariaDB 10.11's prepared-statement parser mishandles string literals (`''`) inside `NULLIF`/`IF` expressions in `SET` clauses. Replaced `COALESCE(NULLIF(?, ''), col)` with `COALESCE(?, col)` passing `nil` when the value is empty (`lib/enrich.rb`).
- Fixed `private method 'da_tables' called` error in `lib/migrate.rb` — migration lambdas call `Migrate.da_tables` with an explicit receiver, which counts as a public call. Removed `da_tables` from `private_class_method` declaration.
- Fixed unmatched `end` / dangling `rescue` syntax error in `scrapers/launcestoncity.rb` introduced during a prior cleanup pass.
- Eliminated duplicate "Docs page had no usable links" warning (fired twice per DA) in `scrapers/launcestoncity.rb`.
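
The `private_class_method` failure reproduces in a few lines. The module names below are hypothetical stand-ins, not `lib/migrate.rb`.

```ruby
module PrivateSketch
  def self.da_tables
    %w[da_example]
  end
  private_class_method :da_tables

  # Explicit constant receiver: Ruby applies the visibility check and raises.
  STEP = -> { PrivateSketch.da_tables }
end

module PublicSketch
  def self.da_tables
    %w[da_example]
  end

  # Same call, but the method is left public.
  STEP = -> { PublicSketch.da_tables }
end

# Run a step and report a visibility failure as a string instead of raising.
def try_step(step)
  step.call
rescue NoMethodError => e
  "NoMethodError: #{e.message}"
end
```

`try_step(PrivateSketch::STEP)` reports the private-method error, while `try_step(PublicSketch::STEP)` returns the table list.
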
**Removed**
- Deleted `scrapers/enrich.rb` — stale copy of `lib/enrich.rb` with wrong `require_relative` paths, old broken `COALESCE(NULLIF(?, ''))` query, and no main batch loop. Was being picked up by `run_all.sh`'s `scrapers/*.rb` glob and failing every full run with a `LoadError`.
**Docs**
- Updated `CLAUDE.md`: corrected geocode-backfill command to use `tools/backfill_geocode.rb`, updated schema-change guidance to point to `lib/migrate.rb`.
- Updated `README.md`: removed stale `tools/enrich.rb` references, corrected enrichment/backfill examples, updated tools table.
---
## 2026-04-13 — Structure Updates (5f60868)
- General structural cleanup across scrapers.
---
## 2026-04-13 — Launceston City Scraper (3fc874c → bc3490f)
- Implemented `scrapers/launcestoncity.rb` for the Launceston eProperty portal (ASP.NET session-based site).
- Session cookie management (`merge_set_cookie!`) to maintain ASP.NET_SessionId across requests.
- Document listing via `docget.asp` with multi-variant URL probing (path-case and route-param variants).
- `probe_common_docs` fallback: constructs known PDF filenames from DA number when the document list page returns no links.
- PDF download to `DOWNLOAD_DIR/launceston//` when `DOWNLOAD_ATTACHMENTS=1`.
- Enriches each DA from the details page (applicant, received date, advertised date, legal description).
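
Session cookie handling of this kind can be sketched with a plain Hash as the jar. The signature of the real `merge_set_cookie!` is not recorded here, so `merge_cookies!` and `cookie_header` below are hypothetical; with `Net::HTTP`, the input would typically come from `response.get_fields("set-cookie")`.

```ruby
# Merge Set-Cookie header values into a name => value jar, ignoring attributes
# such as path or HttpOnly. Later values overwrite earlier ones.
def merge_cookies!(jar, set_cookie_headers)
  Array(set_cookie_headers).each do |header|
    pair = header.to_s.split(";", 2).first.to_s
    name, value = pair.split("=", 2)
    next if name.nil? || value.nil? || name.strip.empty?

    jar[name.strip] = value.strip
  end
  jar
end

# Render the jar as a Cookie request header value.
def cookie_header(jar)
  jar.map { |name, value| "#{name}=#{value}" }.join("; ")
end
```
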
---
## 2026-04-13 — Structured Logging (c03bfae)
- Added `lib/log.rb` — `Log.debug`, `Log.info`, `Log.warn`, `Log.error` with `LOG_LEVEL` env filtering.
- Replaced `puts`/`warn` calls across `lib/` with `Log.*` calls.
- Added `LOG_LEVEL` env var to `docker-compose.yml` (default: `info`).
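
A level-filtered logger in this style takes only a few lines. `SketchLog` is a hypothetical stand-in and not the contents of `lib/log.rb`; the two-argument call shape (tag, message) and the output format are assumptions.

```ruby
module SketchLog
  LEVELS = %w[debug info warn error].freeze

  # Index of the minimum level to print; unknown values fall back to "info".
  def self.threshold
    LEVELS.index(ENV.fetch("LOG_LEVEL", "info").downcase) || LEVELS.index("info")
  end

  def self.enabled?(level)
    LEVELS.index(level.to_s) >= threshold
  end

  # Defines SketchLog.debug, .info, .warn and .error.
  LEVELS.each do |level|
    define_singleton_method(level) do |tag, message|
      next unless enabled?(level)

      $stderr.puts "[#{level.upcase}] #{tag}: #{message}"
    end
  end
end
```

With `LOG_LEVEL=warn`, a call such as `SketchLog.info("burnie", "fetched page")` prints nothing, while `SketchLog.warn` and `SketchLog.error` still reach stderr.
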
---
## 2026-04-13 — Schema Migrations (0e4e035)
- Added `lib/migrate.rb` — lightweight sequential migration runner backed by a `schema_migrations` table.
- Migration v1: adds enrichment and geocode columns to all existing `da_*` tables.
- Migration v2: creates `geo_cache` table.
- `run_all.sh` now runs `ruby /app/lib/migrate.rb` before scrapers.
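
The core of a sequential runner can be sketched without a database. `SketchMigrator` is hypothetical: it keeps applied versions in memory, whereas the entry above describes tracking them in a `schema_migrations` table.

```ruby
class SketchMigrator
  # applied_versions stands in for rows already recorded as applied.
  def initialize(applied_versions = [])
    @applied = applied_versions.dup
    @migrations = {}
  end

  # Register a migration body under an integer version.
  def migration(version, &block)
    @migrations[version] = block
  end

  # Run pending migrations in ascending version order; returns the versions run.
  def run!
    ran = []
    @migrations.keys.sort.each do |version|
      next if @applied.include?(version)

      @migrations[version].call
      @applied << version # recorded only after the body succeeds
      ran << version
    end
    ran
  end
end
```

Because applied versions are skipped, running the migrator before every scrape is cheap and safe to repeat.
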
---
## 2026-04-13 — SQL Injection Hardening (f3c06ab)
- Added `DB.validate_table_name!` — enforces `da_[a-z0-9_]+` pattern on every table name before interpolation into SQL.
- Applied `DB.client.escape()` on all remaining identifier interpolations.
- Applied `validate_table_name!` in `lib/geocode.rb` and `lib/enrich.rb`.
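
A validator of this kind might look like the sketch below. The `\A`/`\z` anchoring and the choice of `ArgumentError` are assumptions; the entry above only records the `da_[a-z0-9_]+` pattern.

```ruby
TABLE_NAME_RE = /\Ada_[a-z0-9_]+\z/

# Returns the name unchanged when it is safe to interpolate, otherwise raises.
def validate_table_name!(name)
  name = name.to_s
  raise ArgumentError, "invalid table name: #{name.inspect}" unless name.match?(TABLE_NAME_RE)

  name
end
```

The anchors matter: `\z` rejects a trailing newline that `$` would let through, and without `\A` a string such as `x; DROP TABLE da_burnie` would still contain a match.
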
---
## 2026-04-12 — Initial Upload (ab11792)
- 28 council scrapers covering all Tasmanian councils.
- `lib/db.rb` — DB client, `ensure_table!`, upsert with write-once semantics.
- `lib/http.rb` — HTTP client with retries, cookie jar, 403/406 warmup, curl fallback.
- `lib/geocode.rb` — Google Maps geocoding with SHA1 cache in `geo_cache`.
- `lib/enrich.rb` — `enrich_after_upsert!` for per-row geocoding and property lookup.
- `lib/util.rb` — `parse_aus_date`, council/table name mappings.
- `web/index.php` — PHP search portal with dynamic UNION across all `da_*` tables.
- `tools/backfill_geocode.rb` — batch geocode backfill.
- `tools/import_sqlites.rb` — import from legacy SQLite exports.
- Docker Compose stack: MariaDB 10.11, Ruby 3.2 scraper, PHP/Apache web, Adminer.
- `run_all.sh` — discovers and runs scrapers with `ONLY`/`SKIP` filtering.
- `entrypoint.sh` — Docker entry with optional loop via `SCRAPE_EVERY_MINUTES`.