@@ -10,11 +10,12 @@ This is a scraping pipeline that collects Tasmanian planning development applica

| File | Role |
|---|---|
-| `lib/db.rb` | DB client, `ensure_table!`, `upsert` (with write-once semantics for some fields) |
-| `lib/http.rb` | HTTP client — retries, cookie jar, 403/406 warmup, curl fallback |
+| `lib/db.rb` | DB client, `ensure_table!`, `upsert` (dynamic columns, write-once semantics) |
+| `lib/http.rb` | HTTP client — retries, cookie jar, browser-fingerprint headers, 403/406 warmup, curl fallback |
| `lib/geocode.rb` | Google Maps geocoding with SHA1 cache in `geo_cache` table |
| `lib/enrich.rb` | `enrich_after_upsert!` — geocoding + property lookup after each DB write |
| `lib/util.rb` | `parse_aus_date`, council-name/table-name mappings |
+| `lib/scraper_helpers.rb` | Shared helpers: `abs_url`, `text_or`, `upsert_and_enrich!` |
| `run_all.sh` | Discovers `scrapers/*.rb`, filters by `ONLY`/`SKIP`, runs each with `TABLE_NAME` set |
| `entrypoint.sh` | Docker entry; waits for DB then runs `run_all.sh` (looping if `SCRAPE_EVERY_MINUTES` is set) |
| `scrapers/*.rb` | One scraper per council — parses HTML, upserts rows, calls `enrich_after_upsert!` |
@@ -56,11 +57,17 @@ docker compose run --rm \

### Each scraper follows this pattern:
1. `TABLE = ENV.fetch("TABLE_NAME")` — set by `run_all.sh` from the filename
2. `DB.ensure_table!(TABLE)` — idempotent schema setup (all columns already included)
-3. Fetch HTML via `Http.get(url)` (handles retries, cookies, WAF warmup)
+3. Fetch HTML via `Http.get(url)` (handles retries, cookies, WAF warmup, browser-fingerprint headers)
4. Parse with Nokogiri
5. `DB.upsert(TABLE, row)` — upserts on `(council_reference, address)`, write-once for `date_received`
6. `enrich_after_upsert!(table:, council_reference:, address:)` — geocodes and enriches

+### WAF / Cloudflare handling:
+- `lib/http.rb` sends a full browser fingerprint on every request: `User-Agent`, `sec-ch-ua*`, `Sec-Fetch-*`, `Upgrade-Insecure-Requests`. This satisfies most WAF header checks automatically.
+- For sites that also need a **warm cookie state** (e.g. Burnie, King Island, Latrobe, Derwent Valley), the scraper implements a proactive homepage warmup before fetching the target page — see `burnie.rb` as the reference implementation.
+- Some councils (Kentish, Derwent Valley via direct site) use a Cloudflare JS challenge, which cannot be solved without a real browser. These scrapers exit cleanly with a warning. Where a PlanBuild equivalent exists (council code in `COUNCIL_MAP`), data is still collected via `planbuild.rb`.
+- The warmup pattern (custom `CookieJar` + `http_get` with redirect handling) is self-contained in scrapers that need it and does **not** depend on `lib/http.rb`.
+
### Write-once fields (in `DB.upsert`):
- `date_received` — never overwritten once set
- `date_received_raw` — never overwritten once non-blank
@@ -92,10 +99,17 @@ After a refactor, the project follows these rules:

When a council changes its website markup, only that scraper needs updating. The typical failure mode is:
- `Found 0 rows` — CSS selector no longer matches; inspect the live page and update the selector
-- HTTP 403/406 — Council site added WAF; check `Http.get` options or add a warmup step
+- HTTP 403/406 — Council site added WAF; check `Http.get` options or add a proactive warmup step (see `burnie.rb`)
+- Cloudflare JS challenge (`"Just a moment"` in body) — cannot be solved in Ruby; exit cleanly with a warning
- `date_received` all nil — Date format changed; update the format string passed to `Util.parse_aus_date` or `Date.strptime`
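
The Cloudflare check in the list above could be sketched roughly as follows. This is a minimal illustration, not code taken from the scrapers: `cloudflare_challenge?` and `challenge_blocked?` are hypothetical helper names, and matching only on the `"Just a moment"` string is an assumption based on the marker mentioned above.

```ruby
# Hypothetical helpers, not part of lib/http.rb or any existing scraper.
# Assumption: the interstitial page can be recognised by the
# "Just a moment" text alone.
CHALLENGE_MARKERS = ["Just a moment"].freeze

# True when the response body looks like a Cloudflare JS challenge page.
def cloudflare_challenge?(body)
  text = body.to_s
  CHALLENGE_MARKERS.any? { |marker| text.include?(marker) }
end

# Prints a warning and returns true when the page is a challenge,
# so the caller can decide to stop without raising.
def challenge_blocked?(body, council:)
  return false unless cloudflare_challenge?(body)

  warn "[#{council}] Cloudflare JS challenge detected; skipping this run"
  true
end
```

A scraper might then call `exit 0 if challenge_blocked?(html, council: "kentish")` right after the fetch, so that `run_all.sh` carries on with the next council instead of treating the run as a failure.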

-To add a new scraper, copy a structurally similar one (e.g. `glamorgan.rb` for table-based sites, `centralhighlands.rb` for link/PDF-based sites) and adapt the parsing logic. The shared infrastructure (`Http`, `DB`, `enrich_after_upsert!`) handles everything else.
+**Template choice:**
+- Simple HTML list/table → copy `glamorgan.rb`
+- Link/PDF listing → copy `centralhighlands.rb`
+- WAF-protected site needing homepage warmup → copy `kingisland.rb` (minimal) or `burnie.rb` (full-featured with PDF download)
+- Multi-hop redirect to detail pages → copy `derwentvalley.rb`
+
+The shared infrastructure (`Http`, `DB`, `enrich_after_upsert!`) handles everything else.

---

@@ -119,5 +133,8 @@ To add a new scraper, copy a structurally similar one (e.g. `glamorgan.rb` for t

- **`TABLE` constant conflicts**: Each scraper defines `TABLE = ENV.fetch("TABLE_NAME")` at the top level. If you `require` two scrapers in the same Ruby process you'll get a constant redefinition warning. Each scraper is designed to be run as a standalone script.
- **`COUNCIL_FILTER` / `COUNCIL_WHITELIST`**: The `docker-compose.yml` has a `COUNCIL_WHITELIST` env var that is passed to the scraper container but is not wired into `run_all.sh`. Use `ONLY` / `SKIP` in `run_all.sh` instead.
-- **PlanBuild scrapers**: `planbuild.rb` and `planbuild_fetch.js` handle councils on the state-run PlanBuild portal. They write to per-council tables using `Util.ref_to_table`. These are separate from the council-specific scrapers.
+- **PlanBuild scrapers**: `planbuild.rb` handles councils on the state-run PlanBuild portal. It writes to per-council tables using `Util.ref_to_table` and runs alongside the council-specific scrapers.
- **PDF downloads**: Only happen when `DOWNLOAD_ATTACHMENTS=1`. Files land in `DOWNLOAD_DIR/<councilname>/`. The web portal serves them from `/downloads/` via an Apache alias.
+- **Non-ASCII in PDF URLs**: Some council sites embed Unicode characters (e.g. en-dash `–`) directly in PDF filenames. Always percent-encode hrefs before passing to `URI.join` — see `burnie.rb` `first_pdf_on_detail` for the pattern.
+- **Redirect loops in `Net::HTTP.start` blocks**: `next` inside a `Net::HTTP.start` block exits the block, not the enclosing `while` loop. Use a `redirect_to` variable set inside the block and call `next` on the `while` loop after the block returns — see `burnie.rb` `http_get_with_cookies`.
+- **Cloudflare JS challenge vs IP block**: A JS challenge (`"Just a moment"`) may be passable from a residential IP but always blocks requests from a datacenter/Docker IP. Detect it and exit cleanly. Sites confirmed blocked in Docker: `derwentvalley.tas.gov.au`, `latrobe.tas.gov.au`.
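
Two of the gotchas above (non-ASCII hrefs and `next` inside a block) can be illustrated without touching the network. The sketch below is not the code from `burnie.rb`: `encode_non_ascii`, `with_connection`, and `follow_redirects` are hypothetical names, `with_connection` is only a stand-in for `Net::HTTP.start`, the `hops` hash simulates `Location` headers, and hrefs are assumed to be UTF-8 strings.

```ruby
require "uri"

# Percent-encode every non-ASCII character in an href (assumed UTF-8),
# leaving ASCII characters and existing %XX escapes untouched.
def encode_non_ascii(href)
  href.gsub(/[^\x00-\x7F]/) do |ch|
    ch.bytes.map { |b| format("%%%02X", b) }.join
  end
end

# Stand-in for Net::HTTP.start: yields once and returns the block's value.
def with_connection
  yield
end

# Follows simulated redirects. `hops` maps a URL to its redirect target;
# a missing key means the response is final.
def follow_redirects(start_url, hops, limit: 5)
  url = start_url
  visited = []
  while visited.size < limit
    redirect_to = nil
    with_connection do
      visited << url
      # A `next` here would only leave this block, so the target is
      # recorded in a variable instead.
      redirect_to = hops[url]
    end
    if redirect_to
      url = redirect_to
      next # this `next` belongs to the while loop
    end
    return [url, visited]
  end
  raise "too many redirects"
end

# Example: an en-dash in a PDF filename is encoded before joining.
pdf_url = URI.join("https://example.com/docs/", encode_non_ascii("DA-2024–01.pdf")).to_s
```

Keeping the redirect decision outside the block also gives the hop limit a single place to apply, which guards against sites that redirect back to themselves.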