2 months ago · d19d144796
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -10,11 +10,12 @@ This is a scraping pipeline that collects Tasmanian planning development applica
 
 | File | Role |
 |---|---|
-| `lib/db.rb` | DB client, `ensure_table!`, `upsert` (with write-once semantics for some fields) |
-| `lib/http.rb` | HTTP client — retries, cookie jar, 403/406 warmup, curl fallback |
+| `lib/db.rb` | DB client, `ensure_table!`, `upsert` (dynamic columns, write-once semantics) |
+| `lib/http.rb` | HTTP client — retries, cookie jar, browser-fingerprint headers, 403/406 warmup, curl fallback |
 | `lib/geocode.rb` | Google Maps geocoding with SHA1 cache in `geo_cache` table |
 | `lib/enrich.rb` | `enrich_after_upsert!` — geocoding + property lookup after each DB write |
 | `lib/util.rb` | `parse_aus_date`, council-name/table-name mappings |
+| `lib/scraper_helpers.rb` | Shared helpers: `abs_url`, `text_or`, `upsert_and_enrich!` |
 | `run_all.sh` | Discovers `scrapers/*.rb`, filters by `ONLY`/`SKIP`, runs each with `TABLE_NAME` set |
 | `entrypoint.sh` | Docker entry; waits for DB then runs `run_all.sh` (looping if `SCRAPE_EVERY_MINUTES` is set) |
 | `scrapers/*.rb` | One scraper per council — parses HTML, upserts rows, calls `enrich_after_upsert!` |
@@ -56,11 +57,17 @@ docker compose run --rm \
 ### Each scraper follows this pattern:
 1. `TABLE = ENV.fetch("TABLE_NAME")` — set by `run_all.sh` from the filename
 2. `DB.ensure_table!(TABLE)` — idempotent schema setup (all columns already included)
-3. Fetch HTML via `Http.get(url)` (handles retries, cookies, WAF warmup)
+3. Fetch HTML via `Http.get(url)` (handles retries, cookies, WAF warmup, browser-fingerprint headers)
 4. Parse with Nokogiri
 5. `DB.upsert(TABLE, row)` — upserts on `(council_reference, address)`, write-once for `date_received`
 6. `enrich_after_upsert!(table:, council_reference:, address:)` — geocodes and enriches
 
+### WAF / Cloudflare handling:
+- `lib/http.rb` sends a full browser fingerprint on every request: `User-Agent`, `sec-ch-ua*`, `Sec-Fetch-*`, `Upgrade-Insecure-Requests`. This satisfies most WAF header checks automatically.
+- For sites that also need a **warm cookie state** (e.g. Burnie, King Island, Latrobe, Derwent Valley), the scraper implements a proactive homepage warmup before fetching the target page — see `burnie.rb` as the reference implementation.
+- Some council sites (Kentish, Derwent Valley via its direct site) sit behind a Cloudflare JS challenge, which cannot be solved without a real browser. These scrapers exit cleanly with a warning. Where a PlanBuild equivalent exists (council code in `COUNCIL_MAP`), data is still collected via `planbuild.rb`.
+- The warmup pattern (custom `CookieJar` + `http_get` with redirect handling) is self-contained in scrapers that need it and does **not** depend on `lib/http.rb`.
+
 ### Write-once fields (in `DB.upsert`):
 - `date_received` — never overwritten once set
 - `date_received_raw` — never overwritten once non-blank
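Editor's sketch of the two write-once rules listed above, modelled on plain hashes. `merge_write_once` and `blank?` are hypothetical names used only for illustration; the real `DB.upsert` works against the database and may be implemented quite differently.

```ruby
# Hypothetical in-memory model of the write-once rules (not the real DB.upsert).
def blank?(value)
  value.nil? || value.to_s.strip.empty?
end

# Merge an incoming row over an existing one, protecting write-once fields.
def merge_write_once(existing, incoming)
  merged = existing.merge(incoming)
  # date_received: never overwritten once set
  merged[:date_received] = existing[:date_received] unless existing[:date_received].nil?
  # date_received_raw: never overwritten once non-blank
  merged[:date_received_raw] = existing[:date_received_raw] unless blank?(existing[:date_received_raw])
  merged
end

existing = { date_received: "2026-04-01", date_received_raw: "", description: "Shed" }
incoming = { date_received: "2026-04-09", date_received_raw: "9 April 2026", description: "Shed and carport" }
p merge_write_once(existing, incoming)
```

In this model the stored `date_received` survives, the blank `date_received_raw` is filled in, and ordinary fields such as `description` are updated as usual.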
			
@@ -92,10 +99,17 @@ After a refactor, the project follows these rules:
 
 When a council changes its website markup, only that scraper needs updating. The typical failure mode is:
 - `Found 0 rows` — CSS selector no longer matches; inspect the live page and update the selector
-- HTTP 403/406 — Council site added WAF; check `Http.get` options or add a warmup step
+- HTTP 403/406 — Council site added WAF; check `Http.get` options or add a proactive warmup step (see `burnie.rb`)
+- Cloudflare JS challenge (`"Just a moment"` in body) — cannot be solved in Ruby; exit cleanly with a warning
 - `date_received` all nil — Date format changed; update the format string passed to `Util.parse_aus_date` or `Date.strptime`
 
-To add a new scraper, copy a structurally similar one (e.g. `glamorgan.rb` for table-based sites, `centralhighlands.rb` for link/PDF-based sites) and adapt the parsing logic. The shared infrastructure (`Http`, `DB`, `enrich_after_upsert!`) handles everything else.
+**Template choice:**
+- Simple HTML list/table → copy `glamorgan.rb`
+- Link/PDF listing → copy `centralhighlands.rb`
+- WAF-protected site needing homepage warmup → copy `kingisland.rb` (minimal) or `burnie.rb` (full-featured with PDF download)
+- Multi-hop redirect to detail pages → copy `derwentvalley.rb`
+
+The shared infrastructure (`Http`, `DB`, `enrich_after_upsert!`) handles everything else.
 
 ---
 
@@ -119,5 +133,8 @@ To add a new scraper, copy a structurally similar one (e.g. `glamorgan.rb` for t
 
 - **`TABLE` constant conflicts**: Each scraper defines `TABLE = ENV.fetch("TABLE_NAME")` at the top level. If you `require` two scrapers in the same Ruby process you'll get a constant redefinition warning. Each scraper is designed to be run as a standalone script.
 - **`COUNCIL_FILTER` / `COUNCIL_WHITELIST`**: The `docker-compose.yml` has a `COUNCIL_WHITELIST` env var that is passed to the scraper container but is not wired into `run_all.sh`. Use `ONLY` / `SKIP` in `run_all.sh` instead.
-- **PlanBuild scrapers**: `planbuild.rb` and `planbuild_fetch.js` handle councils on the state-run PlanBuild portal. They write to per-council tables using `Util.ref_to_table`. These are separate from the council-specific scrapers.
+- **PlanBuild scrapers**: `planbuild.rb` handles councils on the state-run PlanBuild portal. It writes to per-council tables using `Util.ref_to_table`. These run alongside the council-specific scrapers.
 - **PDF downloads**: Only happen when `DOWNLOAD_ATTACHMENTS=1`. Files land in `DOWNLOAD_DIR/<councilname>/`. The web portal serves them from `/downloads/` via an Apache alias.
+- **Non-ASCII in PDF URLs**: Some council sites embed Unicode characters (e.g. en-dash `–`) directly in PDF filenames. Always percent-encode hrefs before passing to `URI.join` — see `burnie.rb` `first_pdf_on_detail` for the pattern.
+- **Redirect loops in `Net::HTTP.start` blocks**: `next` inside a `Net::HTTP.start` block exits the block, not the enclosing `while` loop. Use a `redirect_to` variable set inside the block and call `next` on the `while` loop after the block returns — see `burnie.rb` `http_get_with_cookies`.
+- **Cloudflare JS challenge vs IP block**: A site behind a JS challenge (`"Just a moment"`) may load from a residential IP yet always be blocked from a datacenter/Docker IP. Detect the challenge and exit cleanly. Sites confirmed blocked in Docker: `derwentvalley.tas.gov.au`, `latrobe.tas.gov.au`.
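Editor's sketch of the non-ASCII gotcha above: percent-encode the href first, then hand it to `URI.join`. The helper name `encode_non_ascii`, the base URL and the filename are made up for illustration; the code in `burnie.rb` `first_pdf_on_detail` is not shown in this diff and may differ.

```ruby
require "uri"

# Percent-encode every non-ASCII character as its UTF-8 bytes.
# ASCII characters, including existing %XX escapes, are left untouched.
def encode_non_ascii(href)
  href.gsub(/[^[:ascii:]]/) do |ch|
    ch.bytes.map { |b| format("%%%02X", b) }.join
  end
end

base = "https://www.example.com/planning/"   # hypothetical page URL
href = "/files/DA-2026–12-plans.pdf"         # en-dash in the filename

# Passing the raw href to URI.join is what triggers URI::InvalidURIError,
# so the href is encoded before joining.
pdf_url = URI.join(base, encode_non_ascii(href)).to_s
puts pdf_url
```

This sketch only touches non-ASCII characters; spaces or other reserved characters in an href would need separate handling.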
			
--- a/README.md
+++ b/README.md
@@ -193,6 +193,16 @@ docker compose run --rm \
 
 ---
 
+## WAF and Cloudflare Handling
+
+`lib/http.rb` sends a full Chrome browser fingerprint on every request, including `sec-ch-ua`, `Sec-Fetch-*`, and `Upgrade-Insecure-Requests` headers. This satisfies most WAF checks without any extra scraper code.
+
+For sites that additionally require a **warm cookie state**, the scraper does a proactive homepage GET before fetching the target URL. See `burnie.rb` for the reference implementation of this pattern (custom `CookieJar` class + `http_get_with_cookies`). Scrapers using this pattern: `burnie.rb`, `kingisland.rb`, `latrobe.rb`, `derwentvalley.rb`.
+
+**Cloudflare JS challenge** (the `"Just a moment..."` interstitial) cannot be solved from a Docker/server environment regardless of headers, because Cloudflare's bot score is based on the client IP reputation. Sites confirmed blocked in Docker: `derwentvalley.tas.gov.au`, `latrobe.tas.gov.au`. These scrapers detect the challenge, log a warning, and exit cleanly.
+
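Editor's sketch of the "detect the challenge, log a warning, exit cleanly" behaviour described above. The names `cloudflare_challenge?` and `check_page` are hypothetical, and the only marker assumed is the `"Just a moment"` text quoted in these docs; the real scrapers may check more than that.

```ruby
# Marker text of the Cloudflare interstitial, as quoted in the docs above.
CHALLENGE_MARKER = "Just a moment"

# True when the response body looks like the JS challenge page.
def cloudflare_challenge?(body)
  body.to_s.include?(CHALLENGE_MARKER)
end

# Returns :blocked (after logging a warning) or :ok.
# A standalone scraper script would typically call `exit 0` on :blocked.
def check_page(body, council:)
  if cloudflare_challenge?(body)
    warn "[#{council}] Cloudflare JS challenge detected; skipping this run"
    return :blocked
  end
  :ok
end

p check_page("<html><title>Just a moment...</title></html>", council: "example")
p check_page("<html><h1>Planning applications</h1></html>", council: "example")
```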
			
 
+---
+
 ## Adding a New Scraper
 
 1. Create `scrapers/<councilname>.rb` — use an existing simple scraper (e.g. `glamorgan.rb`) as a template.
@@ -222,17 +232,19 @@ docker compose run --rm \
 ```
 tas_councils/
 ├── lib/
-│   ├── db.rb          # DB connection, table creation, upsert logic
-│   ├── http.rb        # HTTP client with retries, cookie jar, WAF warmup
-│   ├── geocode.rb     # Google Maps geocoding with SHA1 cache
-│   ├── enrich.rb      # Post-upsert enrichment pipeline
-│   └── util.rb        # Date parsing, council/table name mappings
-├── scrapers/          # One .rb file per council
-├── web/               # PHP search portal (Apache)
-├── tools/             # Standalone backfill and migration scripts
-├── run_all.sh         # Discovers and runs scrapers (supports ONLY/SKIP)
-├── entrypoint.sh      # Docker entrypoint; optionally loops on a schedule
-├── Dockerfile         # Ruby 3.2 scraper image
-├── docker-compose.yml # Full stack: db, scraper, web, adminer
-└── .env               # Secrets — never commit this file
+│   ├── db.rb             # DB connection, table creation, dynamic upsert logic
+│   ├── http.rb           # HTTP client — browser-fingerprint headers, retries, WAF warmup, curl fallback
+│   ├── geocode.rb        # Google Maps geocoding with SHA1 cache
+│   ├── enrich.rb         # Post-upsert enrichment pipeline
+│   ├── util.rb           # Date parsing, council/table name mappings
+│   ├── scraper_helpers.rb # Shared helpers: abs_url, text_or, upsert_and_enrich!
+│   └── migrate.rb        # Sequential schema migration runner
+├── scrapers/             # One .rb file per council
+├── web/                  # PHP search portal (Apache)
+├── tools/                # Standalone backfill and migration scripts
+├── run_all.sh            # Discovers and runs scrapers (supports ONLY/SKIP)
+├── entrypoint.sh         # Docker entrypoint; optionally loops on a schedule
+├── Dockerfile            # Ruby 3.2 scraper image
+├── docker-compose.yml    # Full stack: db, scraper, web, adminer
+└── .env                  # Secrets — never commit this file
 ```
--- a/VERSIONS.md
+++ b/VERSIONS.md
@@ -5,6 +5,52 @@ Entries are grouped by push/session in reverse-chronological order.
 
 ---
 
+## 2026-04-14 — WAF Warmup, Scraper Rewrites & Bug Fixes
+
+**`lib/http.rb` — Full browser fingerprint headers**
+- Added `Upgrade-Insecure-Requests: 1`, `Sec-Fetch-Dest`, `Sec-Fetch-Mode`, `Sec-Fetch-Site`, `Sec-Fetch-User`, `sec-ch-ua`, `sec-ch-ua-mobile`, `sec-ch-ua-platform` to `BASE_HEADERS` — these are sent by all scrapers using `Http.get`/`Http.request` automatically
+- Updated curl fallback to pass the same headers for consistency
+
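Editor's sketch of what a shared header set of this kind might look like, and how the same hash can feed both `Net::HTTP` and a curl fallback. Only the header names come from the entry above; every value, the constant name `EXAMPLE_HEADERS`, and both helper names are assumptions, not the contents of `lib/http.rb`.

```ruby
require "net/http"
require "uri"

# Illustrative values only; the real BASE_HEADERS in lib/http.rb may differ.
EXAMPLE_HEADERS = {
  "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " \
                  "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
  "Upgrade-Insecure-Requests" => "1",
  "Sec-Fetch-Dest" => "document",
  "Sec-Fetch-Mode" => "navigate",
  "Sec-Fetch-Site" => "none",
  "Sec-Fetch-User" => "?1",
  "sec-ch-ua" => '"Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"',
  "sec-ch-ua-mobile" => "?0",
  "sec-ch-ua-platform" => '"Windows"'
}.freeze

# Build a GET request carrying the shared headers (hypothetical helper).
def build_request(url, extra_headers = {})
  Net::HTTP::Get.new(URI(url), EXAMPLE_HEADERS.merge(extra_headers))
end

# Turn the same hash into curl arguments so both code paths stay consistent.
def curl_header_args(headers = EXAMPLE_HEADERS)
  headers.flat_map { |name, value| ["-H", "#{name}: #{value}"] }
end

request = build_request("https://www.example.com/planning")
puts request["Sec-Fetch-Mode"]
puts curl_header_args.first(2).inspect
```

Deriving the curl arguments from the same hash is one way to keep the fallback from drifting out of sync with the primary client.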
			
 
+**`scrapers/burnie.rb` — Two bug fixes**
+- Fixed redirect loop: `next` inside a `Net::HTTP.start` block only exits the block, not the `while` loop; fixed by setting a `redirect_to` variable inside the block and calling `next` on the outer loop
+- Fixed `URI::InvalidURIError` on PDF URLs containing non-ASCII characters (e.g. en-dash `–` in filename): percent-encode non-ASCII chars in href before `URI.join`
+
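Editor's sketch of the redirect fix described above, using a stand-in for `Net::HTTP.start` so it runs without network access. `with_connection`, `FakeResponse`, `PAGES` and `fetch_following_redirects` are all hypothetical; only the `redirect_to` idea comes from the entry, and `http_get_with_cookies` in `burnie.rb` may look different.

```ruby
# Stand-in response and site map, purely for illustration.
FakeResponse = Struct.new(:code, :location, :body)

PAGES = {
  "/old" => FakeResponse.new("302", "/new", ""),
  "/new" => FakeResponse.new("200", nil, "<html>ok</html>")
}.freeze

# Plays the role of Net::HTTP.start: yields a connection to the block.
def with_connection
  yield(->(path) { PAGES.fetch(path) })
end

def fetch_following_redirects(path, max_redirects: 5)
  redirects = 0
  while redirects <= max_redirects
    redirect_to = nil
    body = nil
    with_connection do |conn|
      res = conn.call(path)
      if res.code.start_with?("3") && res.location
        # `next` here would only leave this block, so record the target instead.
        redirect_to = res.location
      else
        body = res.body
      end
    end
    if redirect_to
      path = redirect_to
      redirects += 1
      next # outside the block, so this applies to the while loop
    end
    return body
  end
  raise "too many redirects"
end

puts fetch_following_redirects("/old")
```

The redirect counter also bounds the loop, so a site that redirects to itself ends in an error rather than spinning forever.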
			
 
+**`scrapers/kingisland.rb` — Complete rewrite**
+- Previously a stub that immediately exited; now implements homepage warmup + planning page fetch with browser fingerprint headers
+- Parses WordPress accordion section (`div#accordion-1-c4`) for DA notices
+- Extracts ref (`DA YYYY/NN`), address, description, on-notice date, and PDF link from structured paragraph text
+- Falls back gracefully with a warning if the fetch fails or returns a Cloudflare challenge
+
+**`scrapers/latrobe.rb` — Complete rewrite**
+- Previous version targeted PlanBuild portal (incorrect — Latrobe is not on PlanBuild)
+- Now scrapes `https://www.latrobe.tas.gov.au/services/building-and-planning-services/planningapp` directly
+- Uses homepage warmup to bypass Cloudflare WAF
+- Parses `li.generic-list__item h3.generic-list__title a` — link text format: `L-DA007/2026 ADDRESS - DESCRIPTION (submissions by DATE)`
+- Note: site blocks Docker/datacenter IPs via Cloudflare JS challenge; scraper exits cleanly when blocked
+
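Editor's sketch of splitting the Latrobe link text format quoted above into fields. The regex, the `parse_latrobe_link` name, the hash keys and the sample text are assumptions based only on the stated format `L-DA007/2026 ADDRESS - DESCRIPTION (submissions by DATE)`; the pattern in `latrobe.rb` is not shown here.

```ruby
# Assumed pattern for "REF ADDRESS - DESCRIPTION (submissions by DATE)".
LATROBE_LINK = %r{
  \A(?<ref>L-DA\d+/\d{4})\s+          # e.g. L-DA007/2026
  (?<address>.+?)\s+-\s+              # address, up to the first " - "
  (?<description>.+?)\s*
  \(submissions\s+by\s+(?<closes>[^)]+)\)\s*\z
}xi

# Returns a hash of fields, or nil when the text does not fit the format.
def parse_latrobe_link(text)
  m = LATROBE_LINK.match(text.to_s.strip)
  return nil unless m

  {
    council_reference: m[:ref],
    address: m[:address].strip,
    description: m[:description].strip,
    on_notice_to_raw: m[:closes].strip
  }
end

sample = "L-DA007/2026 12 Example Street, Latrobe - Dwelling and outbuilding (submissions by 28 April 2026)"
p parse_latrobe_link(sample)
```

Because the address is cut at the first ` - `, an address that itself contains a spaced hyphen would be split too early; a real scraper would need to decide how to handle that case.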
			
 
+**`scrapers/derwentvalley.rb` — Complete rewrite**
+- Previous version found 0 links (CSS selectors didn't match; news listing used lgasa/squiz.cloud redirect chain)
+- Now uses homepage warmup + browser headers to pass Cloudflare
+- Fetches `/home/latest-news?...=Public+Notice`; for each `news-listing__item` link extracts the `index_url` parameter from the lgasa href, GETs `lgasa-web.squiz.cloud/?a=ID` (non-following), reads `Location` header to get the real DV detail page URL
+- Fetches each detail page (with DV cookies) and parses the `APP No / SITE / PROPOSAL` table
+- Extracts closing date from "no later than ... DATE" pattern (fixed regex to allow dots in "5.00pm")
+- Note: site blocks Docker/datacenter IPs via Cloudflare JS challenge; scraper exits cleanly when blocked
+
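Editor's sketch of pulling a closing date out of a "no later than ... DATE" sentence while tolerating a dotted time such as "5.00pm". The regex, the `closing_date` name and the sample sentence are assumptions; the actual wording on the Derwent Valley pages and the regex in `derwentvalley.rb` are not shown in this diff.

```ruby
require "date"

MONTH_NAMES = Date::MONTHNAMES.compact.join("|")

# Lazily skip anything (including "5.00pm on Monday") up to the first
# "DAY MONTH YEAR" that follows "no later than".
CLOSING_DATE = /no later than\b.*?(\d{1,2}\s+(?:#{MONTH_NAMES})\s+\d{4})/im

# Returns a Date, or nil when no closing date is found.
def closing_date(text)
  m = CLOSING_DATE.match(text.to_s)
  m && Date.parse(m[1])
end

notice = "Representations must be received no later than 5.00pm on Monday 27 April 2026."
puts closing_date(notice)
```

Skipping with `.*?` rather than a character class that excludes dots is what lets the time "5.00pm" sit between the phrase and the date.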
			
 
+**`scrapers/georgetown.rb` — Fixed field name matching**
+- `"Location"` was not matched by `/(Address|Property)/i` — address was always empty, causing all rows to be skipped
+- `"Opening Date"` was not matched by the date-received regex
+- Added `Location` and `Opening Date` to the respective patterns
+- Now also extracts `applicant` ("Applicant Name"), `title_reference` ("Title reference"), and `on_notice_to` ("Closing Date") into the upsert
+
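Editor's sketch of label-to-field matching after this fix. The address pattern extends the `/(Address|Property)/i` regex quoted above with `Location`; the date-received pattern is an assumption apart from `Opening Date`, and `FIELD_PATTERNS`, `map_fields` and the sample labels are hypothetical.

```ruby
# Label patterns per output field. Only "Location", "Opening Date" and the
# three quoted labels come from the changelog; "Received" is an assumption.
FIELD_PATTERNS = {
  address: /(Address|Property|Location)/i,
  date_received_raw: /(Received|Opening Date)/i,
  applicant: /Applicant Name/i,
  title_reference: /Title reference/i,
  on_notice_to: /Closing Date/i
}.freeze

# Map [label, value] pairs scraped from a detail page onto row fields.
def map_fields(pairs)
  pairs.each_with_object({}) do |(label, value), row|
    key, = FIELD_PATTERNS.find { |_, pattern| label.match?(pattern) }
    row[key] = value.strip if key
  end
end

pairs = [
  ["Location", " 1 Example Road, George Town "],
  ["Opening Date", "14/04/2026"],
  ["Closing Date", "28/04/2026"],
  ["Applicant Name", "A. Person"]
]
p map_fields(pairs)
```

With the old address pattern the `Location` pair would have been dropped, leaving `address` empty, which matches the skipped-row symptom described in the entry.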
			
 
+**`scrapers/kingisland.rb` (original stub) → replaced with full implementation** (see above)
+
+**Docs**
+- `CLAUDE.md`: added WAF/Cloudflare handling section, warmup pattern guidance, template scraper recommendations, new common gotchas (non-ASCII PDF URLs, redirect-in-block bug)
+- `README.md`: added WAF and Cloudflare Handling section; updated project structure tree to include `scraper_helpers.rb` and `migrate.rb`
+- `VERSIONS.md`: this entry
+
+---
+
 ## 2026-04-13 — Scraper Fixes & Audit
 
 **`scrapers/planbuild.rb`** — rewrote to fix crash on first item: