Update Versions & Readme

Benjamin Harris 2 months ago
parent
commit
3fc284b11c
2 changed files with 94 additions and 10 deletions
  1. README.md (14 additions, 10 deletions)
  2. VERSIONS.md (80 additions, 0 deletions)

+ 14 - 10
README.md

@@ -2,6 +2,8 @@
 
 A web scraping and data aggregation system for Tasmanian development applications (DAs). It collects planning application notices from all 29 Tasmanian council websites, normalises and geocodes the data, and exposes it via a PHP search portal.
 
+See [VERSIONS.md](VERSIONS.md) for the changelog.
+
 ---
 
 ## Architecture
@@ -167,23 +169,26 @@ After each upsert, `enrich_after_upsert!` runs two optional enrichment steps:
 
 2. **Property lookup** (requires `LOOKUP_URL`) — POSTs `{lat, lng}` to a property data service and populates `property_id` and `title_reference`.
 
-To run enrichment as a standalone backfill over existing rows:
+To run geocode backfill as a standalone pass over existing rows:
 
 ```bash
+# All tables
 docker compose run --rm \
   -e GOOGLE_MAPS_API_KEY="$GOOGLE_MAPS_API_KEY" \
-  -e LOOKUP_URL="$LOOKUP_URL" \
-  scraper ruby /app/tools/enrich.rb
-```
+  scraper ruby /app/tools/backfill_geocode.rb
 
-Run against a single table with a dry run:
+# Single table
+docker compose run --rm \
+  -e GOOGLE_MAPS_API_KEY="$GOOGLE_MAPS_API_KEY" \
+  -e ONLY_TABLE=da_dorset \
+  scraper ruby /app/tools/backfill_geocode.rb
 
-```bash
+# Dry run
 docker compose run --rm \
   -e GOOGLE_MAPS_API_KEY="$GOOGLE_MAPS_API_KEY" \
-  -e LOOKUP_URL="$LOOKUP_URL" \
+  -e ONLY_TABLE=da_dorset \
   -e DRY_RUN=1 \
-  scraper ruby /app/tools/enrich.rb --table=da_dorset
+  scraper ruby /app/tools/backfill_geocode.rb
 ```
 
 ---
@@ -205,8 +210,7 @@ docker compose run --rm \
 
 | Script | Purpose |
 |---|---|
-| `tools/enrich.rb` | Batch geocode + property lookup for existing rows |
-| `tools/backfill_geocode.rb` | Geocode-only backfill |
+| `tools/backfill_geocode.rb` | Batch geocode backfill for existing rows (supports `ONLY_TABLE`, `DRY_RUN`) |
 | `tools/backfill_dorset_docs.rb` | Backfill PDF links for Dorset rows |
 | `tools/import_sqlites.rb` | Import data from legacy SQLite exports |
 | `planbuild_fetch.js` | Playwright-based scraper for the PlanBuild TAS portal |
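
Reviewer note: the README hunk above documents `ONLY_TABLE` and `DRY_RUN` for `tools/backfill_geocode.rb` but the script itself is not part of this diff. The following is a minimal sketch of how such environment switches are commonly read in Ruby. The helper names `select_tables` and `dry_run?`, the table `da_hobart`, and the rule that only the value `1` enables a dry run are assumptions for illustration, not the script's actual code.

```ruby
# Hypothetical helpers: only the ONLY_TABLE and DRY_RUN variable names
# come from the README. Everything else here is an illustrative assumption.

# Returns the tables to process. When ONLY_TABLE is unset or blank,
# every table is kept; otherwise only the exact match is kept.
def select_tables(all_tables, env = ENV)
  only = env["ONLY_TABLE"].to_s.strip
  return all_tables if only.empty?

  all_tables.select { |table| table == only }
end

# Treats DRY_RUN=1 as "report what would change, write nothing".
# Assumption: any other value (or no value) means a real run.
def dry_run?(env = ENV)
  env["DRY_RUN"].to_s == "1"
end

# Usage with a plain Hash standing in for ENV.
fake_env = { "ONLY_TABLE" => "da_dorset", "DRY_RUN" => "1" }
tables = select_tables(%w[da_dorset da_hobart], fake_env)
mode = dry_run?(fake_env) ? "dry run" : "live"
puts "#{mode}: #{tables.join(', ')}"
```

Passing the environment in as an argument keeps the helpers testable without mutating the real `ENV`.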

+ 80 - 0
VERSIONS.md

@@ -0,0 +1,80 @@
+# Changelog
+
+All notable changes to the TAS Councils scraping pipeline are recorded here.
+Entries are grouped by push/session in reverse-chronological order.
+
+---
+
+## 2026-04-13 — Code Quality & Bug Fixes (current)
+
+**Bug fixes**
+- Fixed `Mysql2::Error Unknown column '''' in 'SET'` — MariaDB 10.11's prepared-statement parser mishandles string literals (`''`) inside `NULLIF`/`IF` expressions in `SET` clauses. Replaced `COALESCE(NULLIF(?, ''), col)` with `COALESCE(?, col)` passing `nil` when the value is empty (`lib/enrich.rb`).
+- Fixed `private method 'da_tables' called` error in `lib/migrate.rb` — migration lambdas call `Migrate.da_tables` with an explicit receiver, which counts as a public call. Removed `da_tables` from `private_class_method` declaration.
+- Fixed unmatched `end` / dangling `rescue` syntax error in `scrapers/launcestoncity.rb` introduced during a prior cleanup pass.
+- Eliminated duplicate "Docs page had no usable links" warning (fired twice per DA) in `scrapers/launcestoncity.rb`.
+
+**Removed**
+- Deleted `scrapers/enrich.rb` — stale copy of `lib/enrich.rb` with wrong `require_relative` paths, old broken `COALESCE(NULLIF(?, ''))` query, and no main batch loop. Was being picked up by `run_all.sh`'s `scrapers/*.rb` glob and failing every full run with a `LoadError`.
+
+**Docs**
+- Updated `CLAUDE.md`: corrected geocode-backfill command to use `tools/backfill_geocode.rb`, updated schema-change guidance to point to `lib/migrate.rb`.
+- Updated `README.md`: removed stale `tools/enrich.rb` references, corrected enrichment/backfill examples, updated tools table.
+
+---
+
+## 2026-04-13 — Structure Updates (5f60868)
+
+- General structural cleanup across scrapers.
+
+---
+
+## 2026-04-13 — Launceston City Scraper (3fc874c → bc3490f)
+
+- Implemented `scrapers/launcestoncity.rb` for the Launceston eProperty portal (ASP.NET session-based site).
+- Session cookie management (`merge_set_cookie!`) to maintain ASP.NET_SessionId across requests.
+- Document listing via `docget.asp` with multi-variant URL probing (path-case and route-param variants).
+- `probe_common_docs` fallback: constructs known PDF filenames from DA number when the document list page returns no links.
+- PDF download to `DOWNLOAD_DIR/launceston/<da_ref>/` when `DOWNLOAD_ATTACHMENTS=1`.
+- Enriches each DA from the details page (applicant, received date, advertised date, legal description).
+
+---
+
+## 2026-04-13 — Structured Logging (c03bfae)
+
+- Added `lib/log.rb` — `Log.debug`, `Log.info`, `Log.warn`, `Log.error` with `LOG_LEVEL` env filtering.
+- Replaced `puts`/`warn` calls across `lib/` with `Log.*` calls.
+- Added `LOG_LEVEL` env var to `docker-compose.yml` (default: `info`).
+
+---
+
+## 2026-04-13 — Schema Migrations (0e4e035)
+
+- Added `lib/migrate.rb` — lightweight sequential migration runner backed by a `schema_migrations` table.
+- Migration v1: adds enrichment and geocode columns to all existing `da_*` tables.
+- Migration v2: creates `geo_cache` table.
+- `run_all.sh` now runs `ruby /app/lib/migrate.rb` before scrapers.
+
+---
+
+## 2026-04-13 — SQL Injection Hardening (f3c06ab)
+
+- Added `DB.validate_table_name!` — enforces `da_[a-z0-9_]+` pattern on every table name before interpolation into SQL.
+- Applied `DB.client.escape()` on all remaining identifier interpolations.
+- Applied `validate_table_name!` in `lib/geocode.rb` and `lib/enrich.rb`.
+
+---
+
+## 2026-04-12 — Initial Upload (ab11792)
+
+- 28 council scrapers covering all Tasmanian councils.
+- `lib/db.rb` — DB client, `ensure_table!`, upsert with write-once semantics.
+- `lib/http.rb` — HTTP client with retries, cookie jar, 403/406 warmup, curl fallback.
+- `lib/geocode.rb` — Google Maps geocoding with SHA1 cache in `geo_cache`.
+- `lib/enrich.rb` — `enrich_after_upsert!` for per-row geocoding and property lookup.
+- `lib/util.rb` — `parse_aus_date`, council/table name mappings.
+- `web/index.php` — PHP search portal with dynamic UNION across all `da_*` tables.
+- `tools/backfill_geocode.rb` — batch geocode backfill.
+- `tools/import_sqlites.rb` — import from legacy SQLite exports.
+- Docker Compose stack: MariaDB 10.11, Ruby 3.2 scraper, PHP/Apache web, Adminer.
+- `run_all.sh` — discovers and runs scrapers with `ONLY`/`SKIP` filtering.
+- `entrypoint.sh` — Docker entry with optional loop via `SCRAPE_EVERY_MINUTES`.
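
Reviewer note: the `private_class_method` fix described in the 2026-04-13 "Code Quality & Bug Fixes" entry can be reproduced in isolation. The sketch below is not taken from `lib/migrate.rb`; the module names `BrokenMigrate` and `FixedMigrate`, the constant `MIGRATION`, and the table `da_launceston` are hypothetical. It shows that a lambda calling a class method through an explicit constant receiver fails when that method is private, and succeeds once the `private_class_method` declaration is dropped.

```ruby
# Hypothetical modules illustrating the changelog entry. Only the method
# name da_tables and the explicit-receiver call style come from the source.

module BrokenMigrate
  def self.da_tables
    %w[da_dorset da_launceston]
  end
  private_class_method :da_tables

  # The explicit receiver (BrokenMigrate.) makes this a public-style call,
  # which Ruby rejects for a private method.
  MIGRATION = -> { BrokenMigrate.da_tables }
end

module FixedMigrate
  # Same method, left public, so the explicit receiver is allowed.
  def self.da_tables
    %w[da_dorset da_launceston]
  end

  MIGRATION = -> { FixedMigrate.da_tables }
end

# Capture the error message from the broken variant instead of crashing.
broken_error =
  begin
    BrokenMigrate::MIGRATION.call
    nil
  rescue NoMethodError => e
    e.message
  end

fixed_tables = FixedMigrate::MIGRATION.call

puts "broken: #{broken_error}"
puts "fixed:  #{fixed_tables.inspect}"
```

An alternative that keeps the method private would be to call `da_tables` without a receiver from code that runs with the module as `self`. The changelog says the project removed the declaration instead.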