
Readme Updates

Benjamin Harris, 2 months ago
Parent
Current commit
d19d144796
3 files changed, including 94 insertions and 19 deletions
  1. CLAUDE.md (23 insertions, 6 deletions)
  2. README.md (25 insertions, 13 deletions)
  3. VERSIONS.md (46 insertions, 0 deletions)

+ 23 - 6
CLAUDE.md

@@ -10,11 +10,12 @@ This is a scraping pipeline that collects Tasmanian planning development applica
 
 | File | Role |
 |---|---|
-| `lib/db.rb` | DB client, `ensure_table!`, `upsert` (with write-once semantics for some fields) |
-| `lib/http.rb` | HTTP client — retries, cookie jar, 403/406 warmup, curl fallback |
+| `lib/db.rb` | DB client, `ensure_table!`, `upsert` (dynamic columns, write-once semantics) |
+| `lib/http.rb` | HTTP client — retries, cookie jar, browser-fingerprint headers, 403/406 warmup, curl fallback |
 | `lib/geocode.rb` | Google Maps geocoding with SHA1 cache in `geo_cache` table |
 | `lib/enrich.rb` | `enrich_after_upsert!` — geocoding + property lookup after each DB write |
 | `lib/util.rb` | `parse_aus_date`, council-name/table-name mappings |
+| `lib/scraper_helpers.rb` | Shared helpers: `abs_url`, `text_or`, `upsert_and_enrich!` |
 | `run_all.sh` | Discovers `scrapers/*.rb`, filters by `ONLY`/`SKIP`, runs each with `TABLE_NAME` set |
 | `entrypoint.sh` | Docker entry; waits for DB then runs `run_all.sh` (looping if `SCRAPE_EVERY_MINUTES` is set) |
 | `scrapers/*.rb` | One scraper per council — parses HTML, upserts rows, calls `enrich_after_upsert!` |
@@ -56,11 +57,17 @@ docker compose run --rm \
 ### Each scraper follows this pattern:
 1. `TABLE = ENV.fetch("TABLE_NAME")` — set by `run_all.sh` from the filename
 2. `DB.ensure_table!(TABLE)` — idempotent schema setup (all columns already included)
-3. Fetch HTML via `Http.get(url)` (handles retries, cookies, WAF warmup)
+3. Fetch HTML via `Http.get(url)` (handles retries, cookies, WAF warmup, browser-fingerprint headers)
 4. Parse with Nokogiri
 5. `DB.upsert(TABLE, row)` — upserts on `(council_reference, address)`, write-once for `date_received`
 6. `enrich_after_upsert!(table:, council_reference:, address:)` — geocodes and enriches
 
+### WAF / Cloudflare handling:
+- `lib/http.rb` sends a full browser fingerprint on every request: `User-Agent`, `sec-ch-ua*`, `Sec-Fetch-*`, `Upgrade-Insecure-Requests`. This satisfies most WAF header checks automatically.
+- For sites that also need a **warm cookie state** (e.g. Burnie, King Island, Latrobe, Derwent Valley), the scraper implements a proactive homepage warmup before fetching the target page — see `burnie.rb` as the reference implementation.
+- Some councils (Kentish, Derwent Valley via direct site) use a Cloudflare JS challenge, which cannot be solved without a real browser. These scrapers exit cleanly with a warning. Where a PlanBuild equivalent exists (council code in `COUNCIL_MAP`), data is still collected via `planbuild.rb`.
+- The warmup pattern (custom `CookieJar` + `http_get` with redirect handling) is self-contained in scrapers that need it and does **not** depend on `lib/http.rb`.
+
 ### Write-once fields (in `DB.upsert`):
 - `date_received` — never overwritten once set
 - `date_received_raw` — never overwritten once non-blank
@@ -92,10 +99,17 @@ After a refactor, the project follows these rules:
 
 When a council changes its website markup, only that scraper needs updating. The typical failure mode is:
 - `Found 0 rows` — CSS selector no longer matches; inspect the live page and update the selector
-- HTTP 403/406 — Council site added WAF; check `Http.get` options or add a warmup step
+- HTTP 403/406 — Council site added WAF; check `Http.get` options or add a proactive warmup step (see `burnie.rb`)
+- Cloudflare JS challenge (`"Just a moment"` in body) — cannot be solved in Ruby; exit cleanly with a warning
 - `date_received` all nil — Date format changed; update the format string passed to `Util.parse_aus_date` or `Date.strptime`
 
-To add a new scraper, copy a structurally similar one (e.g. `glamorgan.rb` for table-based sites, `centralhighlands.rb` for link/PDF-based sites) and adapt the parsing logic. The shared infrastructure (`Http`, `DB`, `enrich_after_upsert!`) handles everything else.
+**Template choice:**
+- Simple HTML list/table → copy `glamorgan.rb`
+- Link/PDF listing → copy `centralhighlands.rb`
+- WAF-protected site needing homepage warmup → copy `kingisland.rb` (minimal) or `burnie.rb` (full-featured with PDF download)
+- Multi-hop redirect to detail pages → copy `derwentvalley.rb`
+
+The shared infrastructure (`Http`, `DB`, `enrich_after_upsert!`) handles everything else.
 
 ---
 
@@ -119,5 +133,8 @@ To add a new scraper, copy a structurally similar one (e.g. `glamorgan.rb` for t
 
 - **`TABLE` constant conflicts**: Each scraper defines `TABLE = ENV.fetch("TABLE_NAME")` at the top level. If you `require` two scrapers in the same Ruby process you'll get a constant redefinition warning. Each scraper is designed to be run as a standalone script.
 - **`COUNCIL_FILTER` / `COUNCIL_WHITELIST`**: The `docker-compose.yml` has a `COUNCIL_WHITELIST` env var that is passed to the scraper container but is not wired into `run_all.sh`. Use `ONLY` / `SKIP` in `run_all.sh` instead.
-- **PlanBuild scrapers**: `planbuild.rb` and `planbuild_fetch.js` handle councils on the state-run PlanBuild portal. They write to per-council tables using `Util.ref_to_table`. These are separate from the council-specific scrapers.
+- **PlanBuild scrapers**: `planbuild.rb` handles councils on the state-run PlanBuild portal. It writes to per-council tables using `Util.ref_to_table`. These run alongside the council-specific scrapers.
 - **PDF downloads**: Only happen when `DOWNLOAD_ATTACHMENTS=1`. Files land in `DOWNLOAD_DIR/<councilname>/`. The web portal serves them from `/downloads/` via an Apache alias.
+- **Non-ASCII in PDF URLs**: Some council sites embed Unicode characters (e.g. en-dash `–`) directly in PDF filenames. Always percent-encode hrefs before passing to `URI.join` — see `burnie.rb` `first_pdf_on_detail` for the pattern.
+- **Redirect loops in `Net::HTTP.start` blocks**: `next` inside a `Net::HTTP.start` block exits the block, not the enclosing `while` loop. Use a `redirect_to` variable set inside the block and call `next` on the `while` loop after the block returns — see `burnie.rb` `http_get_with_cookies`.
+- **Cloudflare JS challenge vs IP block**: A JS challenge (`"Just a moment"`) may be passable from a residential IP but always blocks requests from a datacenter/Docker IP. Detect it and exit cleanly. Sites confirmed blocked in Docker: `derwentvalley.tas.gov.au`, `latrobe.tas.gov.au`.
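
Note on the redirect-loop gotcha in the CLAUDE.md hunk above: the behaviour of `next` inside a block can be reproduced without any network access. The sketch below is not the code from `burnie.rb`; `with_connection`, `RESPONSES`, and `fetch_following_redirects` are hypothetical names, and `with_connection` only stands in for a block-yielding call such as `Net::HTTP.start`.

```ruby
# Canned responses keyed by path (illustrative data only).
RESPONSES = {
  "/old"  => { status: 302, location: "/new" },
  "/new"  => { status: 200, body: "planning page" },
  "/loop" => { status: 302, location: "/loop" }
}.freeze

# Hypothetical stand-in for Net::HTTP.start: yields to the block and
# returns whatever the block returns.
def with_connection(path)
  yield path
end

def fetch_following_redirects(path, max_hops: 5)
  hops = 0
  while hops <= max_hops
    redirect_to = nil
    body = nil

    with_connection(path) do |current|
      res = RESPONSES.fetch(current)
      if res[:status] == 302
        redirect_to = res[:location]
        next # leaves only this block; the while loop is unaffected
      end
      body = res[:body]
    end

    if redirect_to
      path = redirect_to
      hops += 1
      next # this `next` belongs to the while loop and restarts it
    end

    return body
  end
  raise "too many redirects"
end

puts fetch_following_redirects("/old")
```

The hop counter is an addition for the sketch so that a self-redirecting path fails loudly instead of spinning.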

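Note on the non-ASCII PDF URL gotcha in the CLAUDE.md hunk above: one way to percent-encode an href before `URI.join` is sketched here. This is an assumption about the general approach, not the contents of `first_pdf_on_detail`; `encode_non_ascii`, `safe_join`, and the example URLs are hypothetical.

```ruby
require "uri"

# Percent-encode every non-ASCII character as its UTF-8 bytes.
# ASCII characters (including "/", "?", "#" and existing %XX escapes)
# are left untouched.
def encode_non_ascii(href)
  href.gsub(/[^[:ascii:]]/) do |ch|
    ch.bytes.map { |b| format("%%%02X", b) }.join
  end
end

# Resolve a possibly relative href against a base URL after encoding it.
def safe_join(base, href)
  URI.join(base, encode_non_ascii(href)).to_s
end

base = "https://example.com/planning/"
href = "/docs/Notice\u2013DA-12.pdf" # \u2013 is an en-dash

puts safe_join(base, href)
```

Only non-ASCII characters are handled here; an href containing spaces or other characters that are invalid in a URI would need additional escaping.
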
+ 25 - 13
README.md

@@ -193,6 +193,16 @@ docker compose run --rm \
 
 ---
 
+## WAF and Cloudflare Handling
+
+`lib/http.rb` sends a full Chrome browser fingerprint on every request, including `sec-ch-ua`, `Sec-Fetch-*`, and `Upgrade-Insecure-Requests` headers. This satisfies most WAF checks without any extra scraper code.
+
+For sites that additionally require a **warm cookie state**, the scraper does a proactive homepage GET before fetching the target URL. See `burnie.rb` for the reference implementation of this pattern (custom `CookieJar` class + `http_get_with_cookies`). Scrapers using this pattern: `burnie.rb`, `kingisland.rb`, `latrobe.rb`, `derwentvalley.rb`.
+
+**Cloudflare JS challenge** (the `"Just a moment..."` interstitial) cannot be solved from a Docker/server environment regardless of headers, because Cloudflare's bot score is based on the client IP reputation. Sites confirmed blocked in Docker: `derwentvalley.tas.gov.au`, `latrobe.tas.gov.au`. These scrapers detect the challenge, log a warning, and exit cleanly.
+
+---
+
 ## Adding a New Scraper
 
 1. Create `scrapers/<councilname>.rb` — use an existing simple scraper (e.g. `glamorgan.rb`) as a template.
@@ -222,17 +232,19 @@ docker compose run --rm \
 ```
 tas_councils/
 ├── lib/
-│   ├── db.rb          # DB connection, table creation, upsert logic
-│   ├── http.rb        # HTTP client with retries, cookie jar, WAF warmup
-│   ├── geocode.rb     # Google Maps geocoding with SHA1 cache
-│   ├── enrich.rb      # Post-upsert enrichment pipeline
-│   └── util.rb        # Date parsing, council/table name mappings
-├── scrapers/          # One .rb file per council
-├── web/               # PHP search portal (Apache)
-├── tools/             # Standalone backfill and migration scripts
-├── run_all.sh         # Discovers and runs scrapers (supports ONLY/SKIP)
-├── entrypoint.sh      # Docker entrypoint; optionally loops on a schedule
-├── Dockerfile         # Ruby 3.2 scraper image
-├── docker-compose.yml # Full stack: db, scraper, web, adminer
-└── .env               # Secrets — never commit this file
+│   ├── db.rb             # DB connection, table creation, dynamic upsert logic
+│   ├── http.rb           # HTTP client — browser-fingerprint headers, retries, WAF warmup, curl fallback
+│   ├── geocode.rb        # Google Maps geocoding with SHA1 cache
+│   ├── enrich.rb         # Post-upsert enrichment pipeline
+│   ├── util.rb           # Date parsing, council/table name mappings
+│   ├── scraper_helpers.rb # Shared helpers: abs_url, text_or, upsert_and_enrich!
+│   └── migrate.rb        # Sequential schema migration runner
+├── scrapers/             # One .rb file per council
+├── web/                  # PHP search portal (Apache)
+├── tools/                # Standalone backfill and migration scripts
+├── run_all.sh            # Discovers and runs scrapers (supports ONLY/SKIP)
+├── entrypoint.sh         # Docker entrypoint; optionally loops on a schedule
+├── Dockerfile            # Ruby 3.2 scraper image
+├── docker-compose.yml    # Full stack: db, scraper, web, adminer
+└── .env                  # Secrets — never commit this file
 ```
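
Note on the README's WAF and Cloudflare section above: detecting the challenge interstitial can be as small as a substring check on the response body. The sketch below assumes the `"Just a moment"` marker quoted in the docs; `cloudflare_challenge?` and `handle_page` are hypothetical names, and the sketch returns a symbol rather than exiting so it can be run in isolation.

```ruby
# True when the body looks like the Cloudflare JS challenge page.
# The marker string is the one quoted in the docs above.
def cloudflare_challenge?(body)
  body.to_s.include?("Just a moment")
end

# Returns :blocked after logging a warning, otherwise :ok.
# A real scraper would likely call `exit 0` instead of returning :blocked.
def handle_page(body)
  if cloudflare_challenge?(body)
    warn "WARN: Cloudflare JS challenge detected; skipping this council"
    return :blocked
  end
  :ok
end

puts handle_page("<title>Just a moment...</title>")
puts handle_page("<ul><li>DA 2026/01</li></ul>")
```
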

+ 46 - 0
VERSIONS.md

@@ -5,6 +5,52 @@ Entries are grouped by push/session in reverse-chronological order.
 
 ---
 
+## 2026-04-14 — WAF Warmup, Scraper Rewrites & Bug Fixes
+
+**`lib/http.rb` — Full browser fingerprint headers**
+- Added `Upgrade-Insecure-Requests: 1`, `Sec-Fetch-Dest`, `Sec-Fetch-Mode`, `Sec-Fetch-Site`, `Sec-Fetch-User`, `sec-ch-ua`, `sec-ch-ua-mobile`, `sec-ch-ua-platform` to `BASE_HEADERS` — these are sent by all scrapers using `Http.get`/`Http.request` automatically
+- Updated curl fallback to pass the same headers for consistency
+
+**`scrapers/burnie.rb` — Two bug fixes**
+- Fixed redirect loop: `next` inside `Net::HTTP.start` block only exits the block, not the `while` loop; fixed by setting a `redirect_to` variable inside the block and calling `next` on the outer loop
+- Fixed `URI::InvalidURIError` on PDF URLs containing non-ASCII characters (e.g. en-dash `–` in filename): percent-encode non-ASCII chars in href before `URI.join`
+
+**`scrapers/kingisland.rb` — Complete rewrite**
+- Previously a stub that immediately exited; now implements homepage warmup + planning page fetch with browser fingerprint headers
+- Parses WordPress accordion section (`div#accordion-1-c4`) for DA notices
+- Extracts ref (`DA YYYY/NN`), address, description, on-notice date, and PDF link from structured paragraph text
+- Falls back gracefully with a warning if the fetch fails or returns a Cloudflare challenge
+
+**`scrapers/latrobe.rb` — Complete rewrite**
+- Previous version targeted PlanBuild portal (incorrect — Latrobe is not on PlanBuild)
+- Now scrapes `https://www.latrobe.tas.gov.au/services/building-and-planning-services/planningapp` directly
+- Uses homepage warmup to bypass Cloudflare WAF
+- Parses `li.generic-list__item h3.generic-list__title a` — link text format: `L-DA007/2026 ADDRESS - DESCRIPTION (submissions by DATE)`
+- Note: site blocks Docker/datacenter IPs via Cloudflare JS challenge; scraper exits cleanly when blocked
+
+**`scrapers/derwentvalley.rb` — Complete rewrite**
+- Previous version found 0 links (CSS selectors didn't match; news listing used lgasa/squiz.cloud redirect chain)
+- Now uses homepage warmup + browser headers to pass Cloudflare
+- Fetches `/home/latest-news?...=Public+Notice`; for each `news-listing__item` link extracts the `index_url` parameter from the lgasa href, GETs `lgasa-web.squiz.cloud/?a=ID` (non-following), reads `Location` header to get the real DV detail page URL
+- Fetches each detail page (with DV cookies) and parses the `APP No / SITE / PROPOSAL` table
+- Extracts closing date from "no later than ... DATE" pattern (fixed regex to allow dots in "5.00pm")
+- Note: site blocks Docker/datacenter IPs via Cloudflare JS challenge; scraper exits cleanly when blocked
+
+**`scrapers/georgetown.rb` — Fixed field name matching**
+- `"Location"` was not matched by `/(Address|Property)/i` — address was always empty, causing all rows to be skipped
+- `"Opening Date"` was not matched by date received regex
+- Added `Location` and `Opening Date` to the respective patterns
+- Now also extracts `applicant` ("Applicant Name"), `title_reference` ("Title reference"), and `on_notice_to` ("Closing Date") into the upsert
+
+**`scrapers/kingisland.rb` (original stub) → replaced with full implementation** (see above)
+
+**Docs**
+- `CLAUDE.md`: added WAF/Cloudflare handling section, warmup pattern guidance, template scraper recommendations, new common gotchas (non-ASCII PDF URLs, redirect-in-block bug)
+- `README.md`: added WAF and Cloudflare Handling section; updated project structure tree to include `scraper_helpers.rb` and `migrate.rb`
+- `VERSIONS.md`: this entry
+
+---
+
 ## 2026-04-13 — Scraper Fixes & Audit
 
 **`scrapers/planbuild.rb`** — rewrote to fix crash on first item: