reconsider prior backlog each run + recognize CVEs from context

2026-07-21 12:31:27 +02:00 · 2026-04-19 10:41:52 +00:00
parent 12f545dc45
commit b305cc48c3
3 changed files with 234 additions and 41 deletions
@@ -28,10 +28,7 @@ subsystems.
  {
    "scan_date": "2026-04-18T14:24:43+00:00",
    "window_cutoff": "2026-04-17T13:24:43+00:00",
-    "per_source": {
-      "phoronix": {"status": 200, "new": 2, "total_in_feed": 75},
-      "oss-sec":  {"status": 304, "new": 0}
-    },
+    "per_source": { "phoronix": {"status": 200, "new": 2, "total_in_feed": 75} },
    "items": [
      {
        "source": "phoronix",
@@ -44,13 +41,27 @@ subsystems.
        "vendor_ids": [],
        "snippet": "first 400 chars of description, tags stripped"
      }
+    ],
+    "reconsider": [
+      {
+        "canonical_id":   "INTEL-SA-00145",
+        "current_bucket": "toimplement",
+        "title":          "Lazy FP State Restore",
+        "sources":        ["intel-psirt"],
+        "urls":           ["https://www.intel.com/.../intel-sa-00145.html"],
+        "extracted_cves": [],
+        "first_seen":     "2026-04-19T09:41:44+00:00"
+      }
    ]
  }
  ```

-  `items` is already: (a) within the time window, (b) not known to prior
-  state under any of its alt-IDs. If `items` is empty, your only job is to
-  write the three stub output files with `(no new items in this window)`.
+  - `items` are fresh observations from today's fetch: already inside the
+    time window and not yet present in state under any alt-ID.
+  - `reconsider` holds existing `toimplement`/`tocheck` entries from state,
+    submitted for re-review each run (see "Reconsideration rules" below).
+    If both arrays are empty, your only job is to write the three stub
+    output files with `(no new items in this window)`.

 - `./checker/` is a checkout of the **`test`** branch of this repo (the
  development branch where coded-but-unreleased CVE checks live). This is
@@ -82,6 +93,30 @@ in `tocheck`.
 follow-ups per run total**. Do not use it for items you already plan to file
 as `unrelated` or `toimplement`.

+## Reconsideration rules (for `reconsider` entries)
+
+Each `reconsider` entry is an item that a prior run *already* filed in state,
+with `current_bucket` set to `toimplement` or `tocheck`. Re-examine it against
+the **current** `./checker/` tree and current knowledge. You may:
+
+- **Demote** `toimplement` → `tocheck` or `unrelated` if the checker now
+  covers the CVE/codename (grep confirms), or if reinterpreting the
+  advisory shows it's out of scope.
+- **Demote** `tocheck` → `unrelated` if new context settles the ambiguity
+  as out-of-scope.
+- **Promote** `tocheck` → `toimplement` if you now have firm evidence it's
+  a real, in-scope, not-yet-covered CVE.
+- **Leave it unchanged** (same bucket) — emit a record anyway; it's cheap
+  and documents that the reconsideration happened today.
+- **Reassign the canonical ID** — if a CVE has since been assigned to a
+  vendor advisory (e.g., an INTEL-SA that previously had no CVE), put the
+  CVE in `extracted_cves` and use it as the new `canonical_id`. The merge
+  step will rekey the record under the CVE and keep the old ID as an alias.
+
+For every reconsider record you emit, set `"reconsider": true` in its
+classification entry. This tells the merge step to **overwrite** the
+stored bucket (which allows demotions) rather than only promote it.
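
The difference between the two merge modes can be sketched as follows. This is a minimal illustration, not the repository's code: `state.promote_bucket` is not shown in this diff, so the `promote_bucket` stand-in and its strength ranking (`unrelated` < `tocheck` < `toimplement`) are assumptions inferred from the promote/demote wording above, and `apply_bucket` is a hypothetical helper name.

```python
# Assumed strength order; the real state.promote_bucket is not shown in this diff.
RANK = {"unrelated": 0, "tocheck": 1, "toimplement": 2}


def promote_bucket(stored: str, incoming: str) -> str:
    """Hypothetical 'strongest wins' merge, as used for fresh items."""
    return stored if RANK[stored] >= RANK[incoming] else incoming


def apply_bucket(stored: str, incoming: str, reconsider: bool) -> str:
    """Overwrite when the record is flagged reconsider, otherwise promote only."""
    return incoming if reconsider else promote_bucket(stored, incoming)


# Without the flag, a weaker incoming bucket is ignored.
fresh = apply_bucket("toimplement", "tocheck", reconsider=False)
# With the flag, the incoming bucket replaces the stored one (a demotion).
reviewed = apply_bucket("toimplement", "tocheck", reconsider=True)
```

Under these assumptions, omitting the flag on a reconsider record would silently drop any demotion.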
+
 ## Outputs

 Compute `TODAY` = the `YYYY-MM-DD` prefix of `scan_date`. Write three files at
@@ -91,6 +126,11 @@ the repo root, overwriting if present:
 - `watch_${TODAY}_tocheck.md`
 - `watch_${TODAY}_unrelated.md`

+These delta files cover the **`items`** array only — they answer "what
+did today's fetch surface". Reconsider decisions update state (and surface
+in the `current_*.md` snapshots the merge step rewrites); don't duplicate
+them here.
+
 Each file uses level-2 headers per source short-name, then one bullet per
 item: the stable ID, the permalink, and 1–2 sentences of context.

@@ -112,6 +152,9 @@ otherwise empty):
 - per-source counts (from per_source): ...
 - fetch failures (status != 200/304): ...
 - total classified this run: toimplement=<n>, tocheck=<n>, unrelated=<n>
+- reconsidered: <n> entries re-reviewed; <list any bucket transitions, e.g.
+  "CVE-2018-3665: toimplement -> tocheck (now covered at src/vulns/...)">,
+  or "no transitions" if every reconsider entry kept its existing bucket.
 ```

 ## `classifications.json` — required side-channel for the merge step
@@ -134,14 +177,27 @@ record per item in `new_items.json.items`:

 Rules:

- One record per input item. Same `stable_id` as in `new_items.json`.
+- One record per input item (`items` + `reconsider`). For items, use the
+  same `stable_id` as in `new_items.json`. For reconsider entries, use the
+  entry's `canonical_id` from state as the record's `stable_id`.
 - `canonical_id`: prefer the first `extracted_cves` entry if any; otherwise
  the item's `stable_id`. **Use the same `canonical_id` for multiple items
  that are really the same CVE from different sources** — the merge step
  will collapse them into one entry and add alias rows automatically.
+- **Populate `extracted_cves` / `canonical_id` from context when the feed
+  didn't.** If the title, body, or a well-known transient-execution codename
+  mapping lets you identify a CVE the feed didn't emit (e.g., "Lazy FP
+  State Restore" → `CVE-2018-3665`, "LazyFP" → the same CVE, "FP-DSS" →
+  its AMD/Intel-assigned CVE), put the CVE in `extracted_cves` and use it as
+  `canonical_id`. This prevents Intel's CVE-less listing entries from
+  creating orphan `INTEL-SA-NNNNN` records in the backlog.
 - `sources` / `urls`: arrays; default to the item's own single source and
  permalink if you didn't enrich further.
- If `new_items.json.items` is empty, write `[]`.
+- **`reconsider: true`** — set on every record that corresponds to an
+  input from the `reconsider` array. The merge step uses this flag to
+  overwrite the stored bucket instead of merging by "strongest wins" —
+  this is what enables demotions.
+- If both `items` and `reconsider` are empty, write `[]`.
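
As a rough illustration of the rules above, the sketch below builds the classification record for one `reconsider` entry. The helper name `reconsider_record` and its `extra_cves` parameter are hypothetical; the field names follow the record shape described in this prompt, and the sample entry reuses the `INTEL-SA-00145` example with its URL list left empty.

```python
from typing import Any


def reconsider_record(entry: dict[str, Any], bucket: str,
                      extra_cves: tuple[str, ...] = ()) -> dict[str, Any]:
    """Build one classifications.json record for a reconsider entry (hypothetical helper)."""
    # Keep the CVEs state already knows, then append any identified from context.
    cves = list(entry.get("extracted_cves") or [])
    for cve in extra_cves:
        if cve not in cves:
            cves.append(cve)
    return {
        # For reconsider entries, stable_id is the entry's canonical_id from state.
        "stable_id": entry["canonical_id"],
        # Prefer the first known CVE; otherwise keep the existing key.
        "canonical_id": cves[0] if cves else entry["canonical_id"],
        "bucket": bucket,
        "extracted_cves": cves,
        "sources": list(entry.get("sources") or []),
        "urls": list(entry.get("urls") or []),
        # Tells the merge step to overwrite the stored bucket.
        "reconsider": True,
    }


entry = {
    "canonical_id": "INTEL-SA-00145",
    "current_bucket": "toimplement",
    "title": "Lazy FP State Restore",
    "sources": ["intel-psirt"],
    "urls": [],
    "extracted_cves": [],
}
record = reconsider_record(entry, "toimplement", extra_cves=("CVE-2018-3665",))
```

Here the bucket stays `toimplement`, but the record still asks the merge step to rekey the entry under the CVE because `canonical_id` now differs from `stable_id`.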

 ## Guardrails

@@ -362,6 +362,46 @@ def _resolve_window_hours() -> float:
        return float(DEFAULT_WINDOW_HOURS)


+def backlog_to_reconsider(data: dict[str, Any]) -> list[dict[str, Any]]:
+    """Walk state.seen and emit toimplement/tocheck entries for re-review.
+
+    Each entry carries enough context that Claude can re-grep ./checker/
+    and decide whether the prior classification still holds. Items in
+    `unrelated` are skipped — those are settled.
+
+    Every CVE alias pointing at this canonical ID is included in
+    `extracted_cves`, so Claude sees every known CVE for the item without
+    having to consult the full alias map.
+    """
+    seen = data.get("seen", {})
+    aliases = data.get("aliases", {})
+    # Reverse-index aliases: canonical -> [alt, ...]
+    by_canonical: dict[str, list[str]] = {}
+    for alt, canon in aliases.items():
+        by_canonical.setdefault(canon, []).append(alt)
+
+    out: list[dict[str, Any]] = []
+    for canonical, rec in seen.items():
+        if rec.get("bucket") not in ("toimplement", "tocheck"):
+            continue
+        cves: list[str] = []
+        if canonical.startswith("CVE-"):
+            cves.append(canonical)
+        for alt in by_canonical.get(canonical, []):
+            if alt.startswith("CVE-") and alt not in cves:
+                cves.append(alt)
+        out.append({
+            "canonical_id":   canonical,
+            "current_bucket": rec.get("bucket"),
+            "title":          rec.get("title") or "",
+            "sources":        list(rec.get("sources") or []),
+            "urls":           list(rec.get("urls") or []),
+            "extracted_cves": cves,
+            "first_seen":     rec.get("first_seen"),
+        })
+    return out
+
+
 def candidate_ids(item: dict[str, Any]) -> list[str]:
    """All identifiers under which this item might already be known."""
    seen: set[str] = set()
@@ -451,19 +491,25 @@ def main() -> int:
    # Persist updated HTTP cache metadata regardless of whether Claude runs.
    state.save(data)

+    reconsider = backlog_to_reconsider(data)
+
    out = {
        "scan_date": scan_date_iso,
        "window_cutoff": cutoff.isoformat(),
        "per_source": per_source,
        "items": all_new,
+        "reconsider": reconsider,
    }
    args.output.write_text(json.dumps(out, indent=2, sort_keys=True) + "\n")

-    # GitHub Actions step outputs
+    # GitHub Actions step outputs. Downstream `if:` conditions gate the
+    # classify step on `new_count || reconsider_count`; both must be 0
+    # for Claude to be skipped.
    gh_out = os.environ.get("GITHUB_OUTPUT")
    if gh_out:
        with open(gh_out, "a") as f:
            f.write(f"new_count={len(all_new)}\n")
+            f.write(f"reconsider_count={len(reconsider)}\n")
            failures = [
                s for s, v in per_source.items()
                if not (isinstance(v["status"], int) and v["status"] in (200, 304))
@@ -474,6 +520,7 @@ def main() -> int:
    print(f"Window:       {window_hours:g} h")
    print(f"Cutoff:       {cutoff.isoformat()}")
    print(f"New items:    {len(all_new)}")
+    print(f"Reconsider:   {len(reconsider)} existing toimplement/tocheck entries")
    for s, v in per_source.items():
        print(f"  {s:14s} status={str(v['status']):>16} new={v['new']}")
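
To make the shape of the `reconsider` array concrete, here is a condensed restatement of `backlog_to_reconsider` run against a toy state. It is a sketch for illustration only: it emits a subset of the fields, and the `CVE-0000-0001` entry is made-up sample data, not a real identifier.

```python
from typing import Any


def backlog_to_reconsider(data: dict[str, Any]) -> list[dict[str, Any]]:
    """Condensed restatement of the helper in the diff, for illustration only."""
    # Reverse-index aliases: canonical -> [alt, ...]
    by_canonical: dict[str, list[str]] = {}
    for alt, canon in data.get("aliases", {}).items():
        by_canonical.setdefault(canon, []).append(alt)

    out: list[dict[str, Any]] = []
    for canonical, rec in data.get("seen", {}).items():
        # Entries already settled as unrelated are not re-reviewed.
        if rec.get("bucket") not in ("toimplement", "tocheck"):
            continue
        ids = [canonical] + by_canonical.get(canonical, [])
        # Keep CVE-shaped IDs only, de-duplicated in first-seen order.
        cves = list(dict.fromkeys(i for i in ids if i.startswith("CVE-")))
        out.append({
            "canonical_id": canonical,
            "current_bucket": rec["bucket"],
            "extracted_cves": cves,
        })
    return out


state_data = {
    "seen": {
        "INTEL-SA-00145": {"bucket": "toimplement", "title": "Lazy FP State Restore"},
        "CVE-0000-0001": {"bucket": "unrelated", "title": "made-up settled entry"},
    },
    "aliases": {"CVE-2018-3665": "INTEL-SA-00145"},
}
backlog = backlog_to_reconsider(state_data)
```

Only the `toimplement` entry is emitted, and the CVE that state knows as an alias shows up in its `extracted_cves`.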

@@ -14,11 +14,22 @@ Each classification record has shape:
      "bucket":         "toimplement|tocheck|unrelated",
      "extracted_cves": ["...", ...],    # optional
      "sources":        ["...", ...],    # optional
-      "urls":           ["...", ...]     # optional
+      "urls":           ["...", ...],    # optional
+      "reconsider":     true             # optional; set by Claude for reconsidered
+                                         #   backlog entries — merge overwrites
+                                         #   the stored bucket (incl. demotions)
+                                         #   instead of promoting
    }

 Behavior:
-    - Upsert seen[canonical_id], union sources/urls, promote bucket strength.
+    - For records WITHOUT `reconsider: true` (fresh items):
+      upsert seen[canonical_id], union sources/urls, promote bucket strength.
+    - For records WITH `reconsider: true` (previously-classified entries):
+      overwrite the stored bucket unconditionally (permits demotions), union
+      sources/urls. If Claude's canonical_id differs from the stable_id (the
+      previous canonical), rekey the seen entry under the new ID and leave
+      the old as an alias — used when a CVE has since been assigned to what
+      was previously a bare vendor-ID entry.
    - For every alt_id in (stable_id, vendor_ids, extracted_cves) that differs
      from canonical_id, set aliases[alt_id] = canonical_id.
    - Update last_run to SCAN_DATE.
@@ -92,38 +103,117 @@ def merge(
    scan_date: str,
 ) -> None:
    for rec in classifications:
-        stable_id = rec.get("stable_id")
-        if not stable_id:
+        if not rec.get("stable_id"):
            continue
-        meta = new_items_by_stable_id.get(stable_id, {})
-        canonical = _canonical(rec, meta)
-        bucket = rec.get("bucket", "unrelated")
-
-        title = (meta.get("title") or "").strip()
-
-        existing = data["seen"].get(canonical)
-        if existing is None:
-            data["seen"][canonical] = {
-                "bucket": bucket,
-                "first_seen": scan_date,
-                "seen_at": scan_date,
-                "title": title,
-                "sources": _unique(list(rec.get("sources") or []) + ([meta.get("source")] if meta.get("source") else [])),
-                "urls":    _unique(list(rec.get("urls") or []) + ([meta.get("permalink")] if meta.get("permalink") else [])),
-            }
+        if rec.get("reconsider"):
+            _apply_reconsider(data, rec, scan_date)
        else:
-            existing["bucket"] = state.promote_bucket(existing["bucket"], bucket)
-            existing["seen_at"] = scan_date
-            existing.setdefault("first_seen", existing.get("seen_at") or scan_date)
-            if not existing.get("title") and title:
-                existing["title"] = title
-            existing["sources"] = _unique(list(existing.get("sources") or []) + list(rec.get("sources") or []) + ([meta.get("source")] if meta.get("source") else []))
-            existing["urls"] = _unique(list(existing.get("urls") or []) + list(rec.get("urls") or []) + ([meta.get("permalink")] if meta.get("permalink") else []))
+            _apply_new_item(data, rec, new_items_by_stable_id, scan_date)

-        # Aliases: every alt id that is not the canonical key points at it.
-        for alt in _alt_ids(rec, meta):
-            if alt != canonical:
-                data["aliases"][alt] = canonical
+
+def _apply_new_item(
+    data: dict[str, Any],
+    rec: dict[str, Any],
+    new_items_by_stable_id: dict[str, dict[str, Any]],
+    scan_date: str,
+) -> None:
+    stable_id = rec["stable_id"]
+    meta = new_items_by_stable_id.get(stable_id, {})
+    canonical = _canonical(rec, meta)
+    bucket = rec.get("bucket", "unrelated")
+    title = (meta.get("title") or "").strip()
+
+    existing = data["seen"].get(canonical)
+    if existing is None:
+        data["seen"][canonical] = {
+            "bucket": bucket,
+            "first_seen": scan_date,
+            "seen_at": scan_date,
+            "title": title,
+            "sources": _unique(list(rec.get("sources") or []) + ([meta.get("source")] if meta.get("source") else [])),
+            "urls":    _unique(list(rec.get("urls") or []) + ([meta.get("permalink")] if meta.get("permalink") else [])),
+        }
+    else:
+        existing["bucket"] = state.promote_bucket(existing["bucket"], bucket)
+        existing.setdefault("first_seen", existing.get("seen_at") or scan_date)  # backfill from the prior seen_at
+        existing["seen_at"] = scan_date
+        if not existing.get("title") and title:
+            existing["title"] = title
+        existing["sources"] = _unique(list(existing.get("sources") or []) + list(rec.get("sources") or []) + ([meta.get("source")] if meta.get("source") else []))
+        existing["urls"] = _unique(list(existing.get("urls") or []) + list(rec.get("urls") or []) + ([meta.get("permalink")] if meta.get("permalink") else []))
+
+    for alt in _alt_ids(rec, meta):
+        if alt != canonical:
+            data["aliases"][alt] = canonical
+
+
+def _apply_reconsider(
+    data: dict[str, Any],
+    rec: dict[str, Any],
+    scan_date: str,
+) -> None:
+    """Re-review of a previously-classified entry. The record's stable_id
+    is the entry's current canonical key in state; `canonical_id` may name
+    a new key (e.g. a freshly-assigned CVE) — in which case we rekey."""
+    old_key = rec["stable_id"]
+    new_canonical = _canonical(rec, None)
+    bucket = rec.get("bucket", "unrelated")
+
+    # Resolve the current record — may need to follow an alias if the
+    # backlog snapshot the classifier reviewed is slightly out of sync.
+    current_key = old_key if old_key in data["seen"] else data["aliases"].get(old_key)
+    if not current_key or current_key not in data["seen"]:
+        print(f"warning: reconsider record for {old_key!r} points at no "
+              f"state entry; skipping.", file=sys.stderr)
+        return
+
+    existing = data["seen"][current_key]
+
+    # Overwrite bucket unconditionally (allows demotions) and stamp the
+    # reconsideration date so we can later throttle if this grows.
+    existing["bucket"] = bucket
+    existing["seen_at"] = scan_date
+    existing["reconsidered_at"] = scan_date
+
+    # Union any fresh sources/urls the classifier surfaced.
+    if rec.get("sources"):
+        existing["sources"] = _unique(list(existing.get("sources") or []) + list(rec["sources"]))
+    if rec.get("urls"):
+        existing["urls"] = _unique(list(existing.get("urls") or []) + list(rec["urls"]))
+
+    # Alias every alt ID the classifier provided to the current key
+    # (before a possible rekey below redirects them).
+    for alt in _alt_ids(rec, None):
+        if alt != current_key:
+            data["aliases"][alt] = current_key
+
+    # Rekey if Claude newly identified a canonical ID (e.g., a CVE for a
+    # vendor-ID entry). If the destination already exists, merge; else
+    # move. In both cases, retarget all aliases and leave the old key
+    # itself as an alias.
+    if new_canonical and new_canonical != current_key:
+        if new_canonical in data["seen"]:
+            dest = data["seen"][new_canonical]
+            dest["bucket"] = state.promote_bucket(dest.get("bucket", "unrelated"), existing.get("bucket", "unrelated"))
+            dest["sources"] = _unique(list(dest.get("sources") or []) + list(existing.get("sources") or []))
+            dest["urls"] = _unique(list(dest.get("urls") or []) + list(existing.get("urls") or []))
+            if not dest.get("title") and existing.get("title"):
+                dest["title"] = existing["title"]
+            dest["seen_at"] = scan_date
+            dest["reconsidered_at"] = scan_date
+            dest.setdefault("first_seen", existing.get("first_seen") or scan_date)
+            del data["seen"][current_key]
+        else:
+            data["seen"][new_canonical] = existing
+            del data["seen"][current_key]
+
+        for alias_key, target in list(data["aliases"].items()):
+            if target == current_key:
+                data["aliases"][alias_key] = new_canonical
+        data["aliases"][current_key] = new_canonical
+        # Clean up any self-aliases the retarget may have produced.
+        for k in [k for k, v in data["aliases"].items() if k == v]:
+            del data["aliases"][k]


 def ensure_stub_reports(scan_date: str) -> None:
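
The rekey step at the end of `_apply_reconsider` can be sketched in isolation. This is a simplified illustration that assumes the destination key does not yet exist in `seen` (the "move" case); the `rekey` helper name and the `lazyfp` alias are hypothetical.

```python
from typing import Any


def rekey(data: dict[str, Any], current_key: str, new_canonical: str) -> None:
    """Move seen[current_key] to seen[new_canonical] and retarget aliases.

    Simplified: assumes new_canonical is not already a key in data["seen"].
    """
    data["seen"][new_canonical] = data["seen"].pop(current_key)
    # Every alias that pointed at the old key now points at the new one.
    for alias_key, target in list(data["aliases"].items()):
        if target == current_key:
            data["aliases"][alias_key] = new_canonical
    # The old key survives as an alias of the new one.
    data["aliases"][current_key] = new_canonical
    # Drop self-aliases such as new_canonical -> new_canonical.
    for k in [k for k, v in data["aliases"].items() if k == v]:
        del data["aliases"][k]


data = {
    "seen": {"INTEL-SA-00145": {"bucket": "tocheck"}},
    "aliases": {"CVE-2018-3665": "INTEL-SA-00145", "lazyfp": "INTEL-SA-00145"},
}
rekey(data, "INTEL-SA-00145", "CVE-2018-3665")
```

After the call, the record lives under the CVE, the vendor ID resolves to it through the alias map, and the CVE's former alias row is gone because it would have pointed at itself.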