
# Prometheus Batch Mode

`--batch prometheus` emits Prometheus text-format metrics that can be fed into any
Prometheus-compatible monitoring stack. It is designed for **fleet-scale security
monitoring**: run the script periodically on every host, push the output to a
Prometheus Pushgateway (or drop it into a node_exporter textfile directory), then
alert and dashboard from Prometheus/Grafana like any other infrastructure metric.

---

## Quick start

### Pushgateway (recommended for cron/batch fleet scans)

```sh
#!/bin/sh
PUSHGATEWAY="http://pushgateway.internal:9091"
INSTANCE=$(hostname -f)

spectre-meltdown-checker.sh --batch prometheus \
  | curl --silent --show-error --data-binary @- \
    "${PUSHGATEWAY}/metrics/job/smc/instance/${INSTANCE}"
```

Run this as root via cron or a systemd timer on every host.  The Pushgateway
retains the last pushed values, so Prometheus can scrape them on its own schedule.
A stale-data alert on `smc_last_scan_timestamp_seconds` catches hosts that have
stopped reporting.
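
If a scan aborts part-way, the pipeline above still pushes whatever was printed. A minimal sketch of a guard, assuming every complete scan contains an `smc_build_info` sample (as the metric reference below suggests). `scan_output_ok` and `push_scan` are hypothetical helper names, not part of the script:

```sh
#!/bin/sh
# Sketch only: validate the scan output by content before pushing it.

# scan_output_ok FILE: succeed if FILE is non-empty and contains an
# smc_build_info sample (assumed to be present in every complete scan).
scan_output_ok() {
    [ -s "$1" ] && grep -q '^smc_build_info{' "$1"
}

# push_scan FILE URL: push FILE to the Pushgateway only if it validates.
push_scan() {
    if scan_output_ok "$1"; then
        curl --silent --show-error --fail --data-binary @"$1" "$2"
    else
        echo "scan output missing or incomplete, not pushing" >&2
        return 1
    fi
}

# Example use (paths are placeholders):
#   spectre-meltdown-checker.sh --batch prometheus > /tmp/smc.out
#   push_scan /tmp/smc.out "${PUSHGATEWAY}/metrics/job/smc/instance/${INSTANCE}"
```

With this guard, a host whose scan breaks simply stops updating its timestamp, and the stale-data alert picks it up.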

### node_exporter textfile collector

```sh
#!/bin/sh
TEXTFILE_DIR="/var/lib/node_exporter/textfile_collector"
TMP="${TEXTFILE_DIR}/smc.prom.$$"

spectre-meltdown-checker.sh --batch prometheus > "$TMP"
mv "$TMP" "${TEXTFILE_DIR}/smc.prom"
```

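Because the temporary file is created in the same directory, the `mv` is an
atomic rename, so node_exporter never reads a partially written file.  The `.$$`
suffix also keeps the temporary file from matching the `*.prom` pattern that the
textfile collector reads.  node_exporter must be started with
`--collector.textfile.directory` pointing at `TEXTFILE_DIR`.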

---

## Metric reference

All metric names are prefixed `smc_` (spectre-meltdown-checker).  All metrics
are **gauges**: they represent the state at the time of the scan, not a running
counter.

---

### `smc_build_info`

Script metadata.  Always value `1`; all data is in labels.

| Label | Values | Meaning |
|---|---|---|
| `version` | string | Script version (e.g. `25.30.0250400123`) |
| `mode` | `live` / `offline` | `live` = running on the active kernel; `offline` = inspecting a kernel image |
| `run_as_root` | `true` / `false` | Whether the script ran as root.  Non-root scans skip MSR reads and may miss mitigations |
| `paranoid` | `true` / `false` | `--paranoid` mode: stricter criteria (e.g. requires SMT disabled) |
| `sysfs_only` | `true` / `false` | `--sysfs-only` mode: only the kernel's own sysfs report was used, not independent detection |
| `reduced_accuracy` | `true` / `false` | Kernel information was incomplete (no kernel image, config, or map); some checks may be less precise |
| `mocked` | `true` / `false` | Debug/test mode: CPU values were overridden.  Results do **not** reflect the real system |

**Example:**
```
smc_build_info{version="25.30.0250400123",mode="live",run_as_root="true",paranoid="false",sysfs_only="false",reduced_accuracy="false",mocked="false"} 1
```

**Important labels for fleet operators:**

- `run_as_root="false"` means the scan was incomplete.  Treat those results as
  lower confidence and alert separately.
- `sysfs_only="true"` means the script trusted the kernel's self-report without
  independent verification.  The kernel may be wrong about its own mitigation
  status (known to happen on older kernels).
- `paranoid="true"` raises the bar: a host that reports `smc_vulnerable_count`
  of `0` under `paranoid="true"` has passed stricter checks than one that reports
  `0` under `paranoid="false"`.  Do not compare counts across hosts with different
  `paranoid` values.
- `mocked="true"` must never appear on a production host; if it does, the results
  are fabricated and every downstream alert is unreliable.

---

### `smc_system_info`

Operating system and kernel metadata.  Always value `1`.

Absent in offline mode when neither `uname -r` nor `uname -m` is available.

| Label | Values | Meaning |
|---|---|---|
| `kernel_release` | string | Output of `uname -r` (live mode only) |
| `kernel_arch` | string | Output of `uname -m` (live mode only) |
| `hypervisor_host` | `true` / `false` | Whether this machine is detected as a hypervisor host (running KVM, Xen, VMware, etc.) |

**Example:**
```
smc_system_info{kernel_release="5.15.0-100-generic",kernel_arch="x86_64",hypervisor_host="false"} 1
```

**`hypervisor_host`** materially changes the risk profile of several CVEs.
L1TF (CVE-2018-3646) and MDS (CVE-2018-12126/12130/12127) are significantly more
severe on hypervisor hosts because they can be exploited across VM boundaries by
a malicious guest.  Always prioritise remediation on hosts where
`hypervisor_host="true"`.

---

### `smc_cpu_info`

CPU hardware and microcode metadata.  Always value `1`.  Absent when `--no-hw`
is used.

| Label | Values | Meaning |
|---|---|---|
| `vendor` | string | CPU vendor (e.g. `Intel`, `AuthenticAMD`) |
| `model` | string | CPU friendly name from `/proc/cpuinfo` |
| `family` | integer string | CPU family number |
| `model_id` | integer string | CPU model number |
| `stepping` | integer string | CPU stepping number |
| `cpuid` | hex string | Full CPUID value (e.g. `0x000906ed`); absent on some ARM CPUs |
| `codename` | string | Intel CPU codename (e.g. `Coffee Lake`); absent on AMD and ARM |
| `smt` | `true` / `false` | Whether SMT (HyperThreading) is currently enabled |
| `microcode` | hex string | Installed microcode version (e.g. `0xf4`) |
| `microcode_latest` | hex string | Latest known-good microcode version from the firmware database |
| `microcode_up_to_date` | `true` / `false` | Whether `microcode == microcode_latest` |
| `microcode_blacklisted` | `true` / `false` | Whether the installed microcode is known to cause problems and should be rolled back |

**Example:**
```
smc_cpu_info{vendor="Intel",model="Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz",family="6",model_id="158",stepping="13",cpuid="0x000906ed",codename="Coffee Lake",smt="true",microcode="0xf4",microcode_latest="0xf4",microcode_up_to_date="true",microcode_blacklisted="false"} 1
```

**Microcode labels:**

- `microcode_up_to_date="false"` means a newer microcode is available in the
  firmware database.  This does not necessarily mean the system is vulnerable
  (the current microcode may still provide all required mitigations), but it
  warrants investigation.
- `microcode_blacklisted="true"` means the installed microcode is known to
  cause system instability or incorrect behaviour and must be rolled back
  immediately.  Treat this as a P1 incident.
- `microcode_latest` may be absent if the CPU is not in the firmware database
  (very new, very old, or exotic CPUs).

**`smt`** affects the risk level of several CVEs (MDS, L1TF).  For those CVEs,
full mitigation requires disabling SMT in addition to kernel and microcode updates.
The script accounts for this in its status assessment; use this label to audit
which hosts still have SMT enabled.

---

### `smc_vulnerability_status`

One time series per CVE.  The **numeric value** encodes the check result:

| Value | Meaning |
|---|---|
| `0` | Not vulnerable (CPU is unaffected by design, or all required mitigations are in place) |
| `1` | Vulnerable (mitigations are missing or insufficient) |
| `2` | Unknown (the script could not determine the status, e.g. due to missing kernel info or insufficient privileges) |

| Label | Values | Meaning |
|---|---|---|
| `cve` | CVE ID string | The CVE identifier (e.g. `CVE-2017-5753`) |
| `name` | string | Human-readable CVE name and aliases (e.g. `Spectre Variant 1, bounds check bypass`) |
| `cpu_affected` | `true` / `false` | Whether this CPU's hardware design is concerned by this CVE |

**Example:**
```
smc_vulnerability_status{cve="CVE-2017-5753",name="Spectre Variant 1, bounds check bypass",cpu_affected="true"} 0
smc_vulnerability_status{cve="CVE-2017-5715",name="Spectre Variant 2, branch target injection",cpu_affected="true"} 1
smc_vulnerability_status{cve="CVE-2022-29900",name="Retbleed, arbitrary speculative code execution with return instructions (AMD)",cpu_affected="false"} 0
```

**`cpu_affected` explained:**

A value of `0` with `cpu_affected="false"` means the CPU hardware is architecturally
immune to this CVE; no patch was needed or applied.

A value of `0` with `cpu_affected="true"` means the CPU has the hardware weakness
but all required mitigations (kernel, microcode, or both) are in place.

This distinction is important when auditing a fleet: if you need to verify that
all at-risk systems are patched, filter on `cpu_affected="true"` to exclude
hardware-immune systems from the analysis.
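
For a quick local audit without Prometheus, the value and label conventions above can be read straight from the batch output. A minimal sketch, assuming one sample per line in the form shown in the example and no escaped quotes inside label values. Both function names are hypothetical:

```sh
#!/bin/sh
# Sketch only: extract CVE IDs from saved --batch prometheus output.

# cves_with_status FILE VALUE: print the cve label of every
# smc_vulnerability_status sample in FILE whose value equals VALUE
# (0 = not vulnerable, 1 = vulnerable, 2 = unknown).
# FILE may be "-" to read standard input.
cves_with_status() {
    awk -v want="$2" '
        /^smc_vulnerability_status[{]/ && $NF == want {
            # Pull out the text between cve=" and the closing quote.
            if (match($0, /cve="[^"]*"/))
                print substr($0, RSTART + 5, RLENGTH - 6)
        }' "$1"
}

# mitigated_on_affected FILE: CVEs where the CPU is affected by design
# but the status is 0, i.e. mitigations are in place.
mitigated_on_affected() {
    grep 'cpu_affected="true"' "$1" | cves_with_status - 0
}

# Example use (path is a placeholder):
#   spectre-meltdown-checker.sh --batch prometheus > /tmp/smc.out
#   cves_with_status /tmp/smc.out 1
```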

---

### `smc_vulnerable_count`

Number of CVEs with status `1` (vulnerable) in this scan.  Value is `0` when
no CVEs are vulnerable.

---

### `smc_unknown_count`

Number of CVEs with status `2` (unknown) in this scan.  A non-zero value
typically means the scan lacked sufficient privileges or kernel information.
Treat unknown the same as vulnerable for alerting purposes.

---

### `smc_last_scan_timestamp_seconds`

Unix timestamp (seconds since epoch) when the scan completed.  Use this to
detect hosts that have stopped reporting.
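
The same staleness check can be run locally against a saved output file, for example from a health-check script. A minimal sketch, assuming the sample appears on one line with the timestamp as the last field; `scan_age_seconds` is a hypothetical helper, and `date +%s` is widely available but not strictly POSIX:

```sh
#!/bin/sh
# Sketch only: compute how old a saved scan result is.

# scan_age_seconds FILE [NOW]: print the number of seconds between the
# scan timestamp in FILE and NOW (defaults to the current time).
# Returns non-zero if FILE has no smc_last_scan_timestamp_seconds sample.
scan_age_seconds() {
    awk -v now="${2:-$(date +%s)}" '
        /^smc_last_scan_timestamp_seconds[ {]/ {
            printf "%d\n", now - $NF
            found = 1
            exit
        }
        END { exit(found ? 0 : 1) }
    ' "$1"
}

# Example use (path and threshold are placeholders):
#   age=$(scan_age_seconds /var/lib/node_exporter/textfile_collector/smc.prom) &&
#     [ "$age" -gt $((8 * 86400)) ] && echo "scan is stale"
```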

---

## Alerting rules

```yaml
groups:
  - name: spectre_meltdown_checker
    rules:

      # Fire when any CVE is confirmed vulnerable
      - alert: SMCVulnerable
        expr: smc_vulnerable_count > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has {{ $value }} vulnerable CVE(s)"
          description: >
            Run spectre-meltdown-checker.sh interactively on {{ $labels.instance }}
            for remediation guidance.

      # Fire when status is unknown (usually means scan ran without root)
      - alert: SMCUnknown
        expr: smc_unknown_count > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} has {{ $value }} CVE(s) with unknown status"
          description: >
            Ensure the checker runs as root on {{ $labels.instance }}.

      # Fire when a host stops reporting (scan not run in 8 days)
      - alert: SMCScanStale
        expr: time() - smc_last_scan_timestamp_seconds > 8 * 86400
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} has not reported scan results in 8 days"

      # Fire when installed microcode is known-bad
      - alert: SMCMicrocodeBlacklisted
        expr: smc_cpu_info{microcode_blacklisted="true"} == 1
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is running blacklisted microcode"
          description: >
            The installed microcode ({{ $labels.microcode }}) is known to cause
            instability.  Roll back to the previous version immediately.

      # Fire when scan ran without root (results may be incomplete)
      - alert: SMCScanNotRoot
        expr: smc_build_info{run_as_root="false"} == 1
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} scan ran without root privileges"

      # Fire when mocked data is detected on a production host
      - alert: SMCScanMocked
        expr: smc_build_info{mocked="true"} == 1
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} scan results are mocked and unreliable"
```

---

## Useful PromQL queries

```promql
# All vulnerable CVEs across the fleet
smc_vulnerability_status == 1

# Vulnerable CVEs on hosts that are also hypervisor hosts (highest priority)
(smc_vulnerability_status == 1)
  * on(instance) group_left(hypervisor_host)
  smc_system_info{hypervisor_host="true"}

# Vulnerable CVEs on affected CPUs only (excludes hardware-immune systems)
smc_vulnerability_status{cpu_affected="true"} == 1

# Fleet-wide: how many hosts are vulnerable to each CVE
count by (cve, name) (smc_vulnerability_status == 1)

# Hosts with outdated microcode, with CPU model context
smc_cpu_info{microcode_up_to_date="false"}

# Hosts with SMT still enabled (relevant for MDS/L1TF remediation)
smc_cpu_info{smt="true"}

# For a specific CVE: hosts affected by hardware but fully mitigated
smc_vulnerability_status{cve="CVE-2018-3646", cpu_affected="true"} == 0

# Proportion of fleet that is fully clean (no vulnerable, no unknown)
(
  count(smc_vulnerable_count == 0 and smc_unknown_count == 0)
  /
  count(smc_vulnerable_count >= 0)
)

# Hosts where scan ran without root, results less reliable
smc_build_info{run_as_root="false"}

# Hosts with sysfs_only mode, independent detection was skipped
smc_build_info{sysfs_only="true"}

# Vulnerable CVEs joined with kernel release for patch tracking
(smc_vulnerability_status == 1)
  * on(instance) group_left(kernel_release)
  smc_system_info

# Vulnerable CVEs joined with CPU model and microcode version
(smc_vulnerability_status == 1)
  * on(instance) group_left(vendor, model, microcode, microcode_up_to_date)
  smc_cpu_info
```

---

## Caveats and edge cases

**Offline mode (`--kernel`)**
`smc_system_info` will have no `kernel_release` or `kernel_arch` labels (those
come from `uname`, which reports the running kernel, not the inspected one).
`mode="offline"` in `smc_build_info` signals this.  Offline mode is primarily
useful for pre-deployment auditing, not fleet runtime monitoring.

**`--no-hw`**
`smc_cpu_info` is not emitted, so queries and joins that rely on CPU or microcode
labels return nothing for these hosts.  CVE checks that rely on hardware
capability detection (`cap_*` flags, MSR reads) will report `unknown` status.

**`--sysfs-only`**
The script trusts the kernel's sysfs report (`/sys/devices/system/cpu/vulnerabilities/`)
without running its own independent detection.  Some older kernels are known to
misreport their mitigation status.  `sysfs_only="true"` in `smc_build_info`
flags this condition.  Do not use `--sysfs-only` for production fleet monitoring.

**`--paranoid`**
Enables defense-in-depth checks beyond the security community consensus (e.g.
requires SMT to be disabled, IBPB always-on).  A host reports
`smc_vulnerable_count` of `0` under `--paranoid` only if it meets this higher bar.
Do not compare `smc_vulnerable_count` across hosts with different `paranoid` values.

**`reduced_accuracy`**
Set when the kernel image, config file, or System.map could not be read.  Some
checks fall back to weaker heuristics and may report `unknown` for CVEs that are
actually mitigated.  This typically happens when the script runs without root or
on a kernel with an inaccessible image.

**Label stability**
Prometheus identifies time series by their full label set.  If a script upgrade
adds or renames a label (e.g. a new label is added to `smc_cpu_info`),
Prometheus will create a new time series and the old one will become stale.  Plan
for this in long-retention dashboards: join the info metrics on `instance` with
`group_left` and name only the labels you need, rather than hardcoding matchers
on the full label set.
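
One way to spot such a change before rolling out a new script version is to compare a metric's label names between old and new output. A minimal sketch, assuming label values contain no escaped double quotes; `label_names` is a hypothetical helper:

```sh
#!/bin/sh
# Sketch only: list the label names used by a metric in a saved output file.

# label_names METRIC FILE: print the sorted label names of the first
# METRIC sample found in FILE, one per line.
label_names() {
    grep "^$1{" "$2" | head -n 1 \
        | sed -n 's/"[^"]*"//g; s/^[^{]*{\([^}]*\)}.*$/\1/p' \
        | tr ',' '\n' \
        | sed 's/=$//' \
        | LC_ALL=C sort
}

# Example use (file names are placeholders):
#   label_names smc_cpu_info old.prom > old.labels
#   label_names smc_cpu_info new.prom > new.labels
#   diff old.labels new.labels
```

The first `sed` expression drops every quoted value, so commas inside values (as in the `name` label) cannot be mistaken for label separators.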