spectre-meltdown-checker/doc/batch_prometheus.md
github-actions[bot] a2823830a6 chore: create doc/ in -build branch
built from commit 2b1389e5c667a3c10c8e47fca7cb14d81695165c
 dated 2026-04-08 21:57:03 +0200
 by Stéphane Lesimple (speed47_github@speed47.net)
2026-04-08 20:10:38 +00:00

Prometheus Batch Mode — Fleet Operator Guide

--batch prometheus emits Prometheus text-format metrics that can be fed into any Prometheus-compatible monitoring stack. It is designed for fleet-scale security monitoring: run the script periodically on every host, push the output to a Prometheus Pushgateway (or drop it into a node_exporter textfile directory), then alert and dashboard from Prometheus/Grafana like any other infrastructure metric.


Quick start

#!/bin/sh
PUSHGATEWAY="http://pushgateway.internal:9091"
INSTANCE=$(hostname -f)

spectre-meltdown-checker.sh --batch prometheus \
  | curl --silent --show-error --data-binary @- \
    "${PUSHGATEWAY}/metrics/job/smc/instance/${INSTANCE}"

Run this as root via cron or a systemd timer on every host. The Pushgateway retains the last pushed value, so Prometheus can scrape it on its own schedule regardless of when the scan ran. A stale-data alert on smc_last_scan_timestamp_seconds catches hosts that stopped reporting.
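
As one possible way to schedule the push script, the sketch below writes a cron.d entry that runs it once a day as root. The script path /usr/local/sbin/smc-push.sh, the file name smc-push, the 03:17 schedule, and the helper name install_smc_cron are all assumptions made for illustration; none of them come from the checker itself.

```shell
#!/bin/sh
# Sketch: write a cron.d entry that runs the push script once a day as root.
# Assumptions (not from the checker): script path, file name, schedule.
install_smc_cron() {
    cron_dir=${1:-/etc/cron.d}
    push_script=${2:-/usr/local/sbin/smc-push.sh}
    # cron.d format: minute hour day-of-month month day-of-week user command
    printf '17 3 * * * root %s\n' "$push_script" > "${cron_dir}/smc-push"
}
```

Calling install_smc_cron with no arguments targets /etc/cron.d. A daily run stays well inside the 8-day staleness window used by the alerting rules in this guide.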

node_exporter textfile collector

#!/bin/sh
TEXTFILE_DIR="/var/lib/node_exporter/textfile_collector"
TMP="${TEXTFILE_DIR}/smc.prom.$$"

spectre-meltdown-checker.sh --batch prometheus > "$TMP"
mv "$TMP" "${TEXTFILE_DIR}/smc.prom"

The atomic mv prevents node_exporter from reading a partially written file. node_exporter must be started with --collector.textfile.directory pointing at TEXTFILE_DIR.
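
A scan that fails part-way could still leave an empty or truncated temporary file. The sketch below publishes the file only if it contains an smc_build_info line, on the assumption that every successful scan emits one. publish_prom is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: only publish a scan result that looks complete.
# publish_prom is a hypothetical helper: $1 = temp file, $2 = final .prom path.
publish_prom() {
    tmp=$1
    dest=$2
    # Assumption: a complete scan always contains an smc_build_info line.
    if grep -q '^smc_build_info{' "$tmp"; then
        mv "$tmp" "$dest"
    else
        # Keep the previous result in place and discard the partial file.
        rm -f "$tmp"
        return 1
    fi
}
```

In the script above, this would replace the final mv: publish_prom "$TMP" "${TEXTFILE_DIR}/smc.prom".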


Metric reference

All metric names are prefixed with smc_ (for spectre-meltdown-checker). All metrics are gauges: they represent the state at the time of the scan, not a running counter.


smc_build_info

Script metadata. Always value 1; all data is in labels.

| Label | Values | Meaning |
| --- | --- | --- |
| version | string | Script version (e.g. 25.30.0250400123) |
| mode | live / offline | live = running on the active kernel; offline = inspecting a kernel image |
| run_as_root | true / false | Whether the script ran as root. Non-root scans skip MSR reads and may miss mitigations |
| paranoid | true / false | --paranoid mode: stricter criteria (e.g. requires SMT disabled) |
| sysfs_only | true / false | --sysfs-only mode: only the kernel's own sysfs report was used, not independent detection |
| reduced_accuracy | true / false | Kernel information was incomplete (no kernel image, config, or map); some checks may be less precise |
| mocked | true / false | Debug/test mode: CPU values were overridden. Results do not reflect the real system |

Example:

smc_build_info{version="25.30.0250400123",mode="live",run_as_root="true",paranoid="false",sysfs_only="false",reduced_accuracy="false",mocked="false"} 1

Important labels for fleet operators:

  • run_as_root="false" means the scan was incomplete. Treat those results as lower confidence and alert separately.
  • sysfs_only="true" means the script trusted the kernel's self-report without independent verification. The kernel may be wrong about its own mitigation status (known to happen on older kernels).
  • paranoid="true" raises the bar: a host reporting smc_vulnerable_count 0 under paranoid="true" has met a stricter standard than one reporting 0 under paranoid="false". Do not compare counts across hosts with different paranoid values.
  • mocked="true" must never appear on a production host; if it does, the results are fabricated and every downstream alert is unreliable.
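
The rules above can be sketched as a small host-side check on the smc_build_info line. The helper names label_value and scan_confidence, and the words invalid, low, and normal, are made up for this example. The label parser assumes label values contain no escaped double quotes.

```shell
#!/bin/sh
# Sketch: classify a scan's confidence from its smc_build_info line.
# label_value and scan_confidence are hypothetical helper names.

# Print the value of label $2 from the single metric line $1.
# Assumes label values contain no escaped double quotes.
label_value() {
    printf '%s\n' "$1" | sed -n "s/.*[{,]$2=\"\([^\"]*\)\".*/\1/p"
}

scan_confidence() {
    line=$1
    if [ "$(label_value "$line" mocked)" = "true" ]; then
        # Mocked CPU values: results do not reflect the real system.
        echo "invalid"
    elif [ "$(label_value "$line" run_as_root)" = "false" ] ||
         [ "$(label_value "$line" sysfs_only)" = "true" ]; then
        # Incomplete scan, or kernel self-report only.
        echo "low"
    else
        echo "normal"
    fi
}
```

For the example line shown earlier, scan_confidence prints normal.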

smc_system_info

Operating system and kernel metadata. Always value 1.

Absent in offline mode when neither uname -r nor uname -m is available.

| Label | Values | Meaning |
| --- | --- | --- |
| kernel_release | string | Output of uname -r (live mode only) |
| kernel_arch | string | Output of uname -m (live mode only) |
| hypervisor_host | true / false | Whether this machine is detected as a hypervisor host (running KVM, Xen, VMware, etc.) |

Example:

smc_system_info{kernel_release="5.15.0-100-generic",kernel_arch="x86_64",hypervisor_host="false"} 1

hypervisor_host materially changes the risk profile of several CVEs. L1TF (CVE-2018-3646) and MDS (CVE-2018-12126/12130/12127) are significantly more severe on hypervisor hosts because they can be exploited across VM boundaries by a malicious guest. Always prioritise remediation on hosts where hypervisor_host="true".


smc_cpu_info

CPU hardware and microcode metadata. Always value 1. Absent when --no-hw is used.

| Label | Values | Meaning |
| --- | --- | --- |
| vendor | string | CPU vendor (e.g. Intel, AuthenticAMD) |
| model | string | CPU friendly name from /proc/cpuinfo |
| family | integer string | CPU family number |
| model_id | integer string | CPU model number |
| stepping | integer string | CPU stepping number |
| cpuid | hex string | Full CPUID value (e.g. 0x000906ed); absent on some ARM CPUs |
| codename | string | Intel CPU codename (e.g. Coffee Lake); absent on AMD and ARM |
| smt | true / false | Whether SMT (HyperThreading) is currently enabled |
| microcode | hex string | Installed microcode version (e.g. 0xf4) |
| microcode_latest | hex string | Latest known-good microcode version from the firmware database |
| microcode_up_to_date | true / false | Whether microcode == microcode_latest |
| microcode_blacklisted | true / false | Whether the installed microcode is known to cause problems and should be rolled back |

Example:

smc_cpu_info{vendor="Intel",model="Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz",family="6",model_id="158",stepping="13",cpuid="0x000906ed",codename="Coffee Lake",smt="true",microcode="0xf4",microcode_latest="0xf4",microcode_up_to_date="true",microcode_blacklisted="false"} 1

Microcode labels:

  • microcode_up_to_date="false" means a newer microcode is available in the firmware database. This does not necessarily mean the system is vulnerable (the current microcode may still provide all required mitigations), but it warrants investigation.
  • microcode_blacklisted="true" means the installed microcode is known to cause system instability or incorrect behaviour and must be rolled back immediately. Treat this as a P1 incident.
  • microcode_latest may be absent if the CPU is not in the firmware database (very new, very old, or exotic CPUs).

smt affects the risk level of several CVEs (MDS, L1TF). For those CVEs, full mitigation requires disabling SMT in addition to kernel and microcode updates. The script accounts for this in its status assessment; use this label to audit which hosts still have SMT enabled.
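
The microcode guidance above can be sketched as a simple triage on one smc_cpu_info line. microcode_action is a hypothetical helper name, and the words rollback, investigate, and ok are illustrative labels chosen for this example, not output of the checker.

```shell
#!/bin/sh
# Sketch: triage one smc_cpu_info line by its microcode labels.
# microcode_action is a hypothetical helper; the printed words are illustrative.
microcode_action() {
    case $1 in
        # Known-bad microcode takes priority over everything else.
        *'microcode_blacklisted="true"'*) echo "rollback" ;;
        # A newer version exists; not necessarily vulnerable, but worth a look.
        *'microcode_up_to_date="false"'*) echo "investigate" ;;
        *) echo "ok" ;;
    esac
}
```

The blacklist check comes first on purpose, so a host whose microcode is both outdated and blacklisted is reported as rollback.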


smc_vulnerability_status

One time series per CVE. The numeric value encodes the check result:

| Value | Meaning |
| --- | --- |
| 0 | Not vulnerable (CPU is unaffected by design, or all required mitigations are in place) |
| 1 | Vulnerable (mitigations are missing or insufficient) |
| 2 | Unknown (the script could not determine the status, e.g. due to missing kernel info or insufficient privileges) |

| Label | Values | Meaning |
| --- | --- | --- |
| cve | CVE ID string | The CVE identifier (e.g. CVE-2017-5753) |
| name | string | Human-readable CVE name and aliases (e.g. Spectre Variant 1, bounds check bypass) |
| cpu_affected | true / false | Whether this CPU's hardware design is concerned by this CVE |

Example:

smc_vulnerability_status{cve="CVE-2017-5753",name="Spectre Variant 1, bounds check bypass",cpu_affected="true"} 0
smc_vulnerability_status{cve="CVE-2017-5715",name="Spectre Variant 2, branch target injection",cpu_affected="true"} 1
smc_vulnerability_status{cve="CVE-2022-29900",name="Retbleed, arbitrary speculative code execution with return instructions (AMD)",cpu_affected="false"} 0

cpu_affected explained:

A value of 0 with cpu_affected="false" means the CPU hardware is architecturally immune to this CVE — no patch was needed or applied.

A value of 0 with cpu_affected="true" means the CPU has the hardware weakness but all required mitigations (kernel, microcode, or both) are in place.

This distinction is important when auditing a fleet: if you need to verify that all at-risk systems are patched, filter on cpu_affected="true" to exclude hardware-immune systems from the analysis.
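
For a quick look on a single host, the same filter can be applied to the raw output before it ever reaches Prometheus. The sketch below reads Prometheus text format on stdin and prints the CVE IDs that have value 1 on hardware that is affected. list_vulnerable_affected is a hypothetical helper name, and the parsing assumes the sample value is the last whitespace-separated field (no trailing timestamp).

```shell
#!/bin/sh
# Sketch: list CVE IDs that are vulnerable (value 1) on affected hardware.
# Reads Prometheus text format on stdin.
# list_vulnerable_affected is a hypothetical helper name.
list_vulnerable_affected() {
    awk '
        index($0, "smc_vulnerability_status{") == 1 &&
        index($0, "cpu_affected=\"true\"") > 0 &&
        $NF == 1 {
            # Pull the CVE ID out of the cve label.
            if (match($0, /cve="[^"]*"/))
                print substr($0, RSTART + 5, RLENGTH - 6)
        }
    '
}
```

Usage: spectre-meltdown-checker.sh --batch prometheus | list_vulnerable_affected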


smc_vulnerable_count

Number of CVEs with status 1 (vulnerable) in this scan. Value is 0 when no CVEs are vulnerable.


smc_unknown_count

Number of CVEs with status 2 (unknown) in this scan. A non-zero value typically means the scan lacked sufficient privileges or kernel information. Treat unknown the same as vulnerable for alerting purposes.


smc_last_scan_timestamp_seconds

Unix timestamp (seconds since epoch) when the scan completed. Use this to detect hosts that have stopped reporting.
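
When the textfile collector is used, staleness can also be checked locally on the host. The sketch below compares the timestamp in a .prom file against a maximum age in seconds. scan_is_stale is a hypothetical helper name; the current time defaults to date +%s, which is assumed to be available (it is on GNU and BSD systems), and can be overridden through the third argument.

```shell
#!/bin/sh
# Sketch: local staleness check against a .prom file written by the checker.
# scan_is_stale is a hypothetical helper.
#   $1 = .prom file, $2 = max age in seconds, $3 = current time (optional)
# Returns 0 (true) when the scan is older than the max age or has no timestamp.
scan_is_stale() {
    prom_file=$1
    max_age=$2
    now=${3:-$(date +%s)}
    ts=$(awk '$1 == "smc_last_scan_timestamp_seconds" { print int($2); exit }' "$prom_file")
    # A missing timestamp is treated as stale.
    [ -z "$ts" ] && return 0
    [ $((now - ts)) -gt "$max_age" ]
}
```

For example, scan_is_stale /var/lib/node_exporter/textfile_collector/smc.prom 691200 mirrors the 8-day threshold (8 * 86400 seconds) used in the alerting rules.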


Alerting rules

groups:
  - name: spectre_meltdown_checker
    rules:

      # Fire when any CVE is confirmed vulnerable
      - alert: SMCVulnerable
        expr: smc_vulnerable_count > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has {{ $value }} vulnerable CVE(s)"
          description: >
            Run spectre-meltdown-checker.sh interactively on {{ $labels.instance }}
            for remediation guidance.

      # Fire when status is unknown (usually means scan ran without root)
      - alert: SMCUnknown
        expr: smc_unknown_count > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} has {{ $value }} CVE(s) with unknown status"
          description: >
            Ensure the checker runs as root on {{ $labels.instance }}.

      # Fire when a host stops reporting (scan not run in 8 days)
      - alert: SMCScanStale
        expr: time() - smc_last_scan_timestamp_seconds > 8 * 86400
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} has not reported scan results in 8 days"

      # Fire when installed microcode is known-bad
      - alert: SMCMicrocodeBlacklisted
        expr: smc_cpu_info{microcode_blacklisted="true"} == 1
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is running blacklisted microcode"
          description: >
            The installed microcode ({{ $labels.microcode }}) is known to cause
            instability.  Roll back to the previous version immediately.

      # Fire when scan ran without root (results may be incomplete)
      - alert: SMCScanNotRoot
        expr: smc_build_info{run_as_root="false"} == 1
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} scan ran without root privileges"

      # Fire when mocked data is detected on a production host
      - alert: SMCScanMocked
        expr: smc_build_info{mocked="true"} == 1
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} scan results are mocked and unreliable"

Useful PromQL queries

# All vulnerable CVEs across the fleet
smc_vulnerability_status == 1

# Vulnerable CVEs on hosts that are also hypervisor hosts (highest priority)
(smc_vulnerability_status == 1)
  * on(instance) group_left(hypervisor_host)
  smc_system_info{hypervisor_host="true"}

# Vulnerable CVEs on affected CPUs only (excludes hardware-immune systems)
smc_vulnerability_status{cpu_affected="true"} == 1

# Fleet-wide: how many hosts are vulnerable to each CVE
count by (cve, name) (smc_vulnerability_status == 1)

# Hosts with outdated microcode, with CPU model context
smc_cpu_info{microcode_up_to_date="false"}

# Hosts with SMT still enabled (relevant for MDS/L1TF remediation)
smc_cpu_info{smt="true"}

# For a specific CVE: hosts affected by hardware but fully mitigated
smc_vulnerability_status{cve="CVE-2018-3646", cpu_affected="true"} == 0

# Proportion of fleet that is fully clean (no vulnerable, no unknown)
(
  count(smc_vulnerable_count == 0 and smc_unknown_count == 0)
  /
  count(smc_vulnerable_count >= 0)
)

# Hosts where scan ran without root — results less reliable
smc_build_info{run_as_root="false"}

# Hosts with sysfs_only mode — independent detection was skipped
smc_build_info{sysfs_only="true"}

# Vulnerable CVEs joined with kernel release for patch tracking
(smc_vulnerability_status == 1)
  * on(instance) group_left(kernel_release)
  smc_system_info

# Vulnerable CVEs joined with CPU model and microcode version
(smc_vulnerability_status == 1)
  * on(instance) group_left(vendor, model, microcode, microcode_up_to_date)
  smc_cpu_info
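
Queries like these can also be run from a script through the Prometheus HTTP API's instant-query endpoint. The sketch below is one way to do that with curl. The base URL http://prometheus.internal:9090 and the helper name prom_query are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: run a PromQL query through the Prometheus HTTP API.
# PROMETHEUS is an assumed base URL; prom_query is a hypothetical helper name.
PROMETHEUS=${PROMETHEUS:-http://prometheus.internal:9090}

prom_query() {
    # --get turns the --data-urlencode payload into a URL query string,
    # so spaces and parentheses in the PromQL expression are encoded safely.
    curl --silent --show-error --get \
        --data-urlencode "query=$1" \
        "${PROMETHEUS}/api/v1/query"
}
```

Usage: prom_query 'count by (cve, name) (smc_vulnerability_status == 1)' prints the JSON response, which can then be post-processed with the tool of your choice.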

Caveats and edge cases

Offline mode (--kernel): smc_system_info will have no kernel_release or kernel_arch labels (those come from uname, which reports the running kernel, not the inspected one). mode="offline" in smc_build_info signals this. Offline mode is primarily useful for pre-deployment auditing, not fleet runtime monitoring.

--no-hw: smc_cpu_info is not emitted, so CPU and microcode labels are unavailable to any query or join. CVE checks that rely on hardware capability detection (cap_* flags, MSR reads) will report unknown status.

--sysfs-only: The script trusts the kernel's sysfs report (/sys/devices/system/cpu/vulnerabilities/) without running its own independent detection. Some older kernels are known to misreport their mitigation status. sysfs_only="true" in smc_build_info flags this condition. Do not use --sysfs-only for production fleet monitoring.

--paranoid: Enables defense-in-depth checks beyond the security community consensus (e.g. requires SMT to be disabled, IBPB always-on). Under --paranoid, a host reports smc_vulnerable_count 0 only if it meets this higher bar. Do not compare smc_vulnerable_count across hosts with different paranoid values.

reduced_accuracy: Set when the kernel image, config file, or System.map could not be read. Some checks fall back to weaker heuristics and may report unknown for CVEs that are actually mitigated. This typically happens when the script runs without root or on a kernel with an inaccessible image.

Label stability: Prometheus identifies time series by their full label set. If a script upgrade adds or renames a label (e.g. a new smc_cpu_info label is added for a new CVE), Prometheus will create a new time series and the old one will become stale. Plan for this in long-retention dashboards by using group_left joins rather than hardcoding label matchers.