From 5a0c391b067b40f2c9b2cdab59bc0de3b1ba5aa1 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?St=C3=A9phane=20Lesimple?= Date: Mon, 30 Mar 2026 21:12:15 +0200 Subject: [PATCH] doc: update development guidelines --- DEVELOPMENT.md | 338 +++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 329 insertions(+), 9 deletions(-) diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md index c9d5e8a..760cf62 100644 --- a/DEVELOPMENT.md +++ b/DEVELOPMENT.md @@ -4,18 +4,73 @@ spectre-meltdown-checker is a single self-contained shell script (`spectre-meltd The script must stay POSIX-compatible, and not use features only available in specific shells such as `bash` or `zsh`. The `local` keyword is accepted however. +## Project Mission + +This tool exists to give system administrators simple, actionable answers to two questions: + +1. **Am I vulnerable?** +2. **What do I have to do to mitigate these vulnerabilities on my system?** + +The script does not run exploits and cannot guarantee security. It reports whether a system is **affected**, **vulnerable**, or **mitigated** against known transient execution vulnerabilities, and provides detailed insight into the prerequisites for full mitigation (microcode, kernel, hypervisor, etc.). + +### Why this tool still matters + +Even though the Linux `sysfs` hierarchy (`/sys/devices/system/cpu/vulnerabilities/`) now reports mitigation status for most vulnerabilities, this script provides value beyond what `sysfs` offers: + +- **Independent of kernel knowledge**: A given kernel only understands vulnerabilities known at compile time. This script's detection logic is maintained independently, so it can identify gaps a kernel doesn't yet know about. +- **Detailed prerequisite breakdown**: Mitigating a vulnerability can involve multiple layers (microcode, host kernel, hypervisor, guest kernel, software). The script shows exactly which pieces are in place and which are missing. 
+- **Offline kernel analysis**: The script can inspect a kernel image before it is booted (`--kernel`, `--config`, `--map`), verifying it carries the expected mitigations. +- **Backport-aware**: It detects actual capabilities rather than checking version strings, so it works correctly with vendor kernels that silently backport or forward-port patches. +- **Covers gaps in sysfs**: Some vulnerabilities (e.g. Zenbleed) are not reported through `sysfs` at all. + +### Terminology + +These terms have precise meanings throughout the codebase and output: + +- **Affected**: The CPU hardware, as shipped from the factory, is known to be concerned by a vulnerability. Says nothing about whether the vulnerability is currently exploitable. +- **Vulnerable**: The system uses an affected CPU *and* has no (or insufficient) mitigations in place, meaning the vulnerability can be exploited. +- **Mitigated**: A previously vulnerable system has all required layers updated so the vulnerability cannot be exploited. + +## Branch Model + +The project uses 4 branches organized in two pipelines (production and dev/test). Developers work on the source branches; CI builds the monolithic script and pushes it to the corresponding output branch. + +| Branch | Contents | Pushed by | +|--------|----------|-----------| +| **`source`** | Production source (split files + Makefile) | Developers | +| **`master`** | Monolithic production script (built artifact) | CI from `source` | +| **`dev`** | Dev/test source (split files + Makefile) | Developers | +| **`dev-build`** | Monolithic test script (built artifact) | CI from `dev` | + +- **`source`** and **`dev`** contain the split source files and the Makefile. These are the branches developers commit to. +- **`master`** and **`dev-build`** contain only the monolithic `spectre-meltdown-checker.sh` built by CI. Nobody commits to these directly. +- **`master`** is the preexisting production branch that users pull from. It cannot be renamed. 
+- **`dev-build`** is a testing branch that users can pull from to test pre-release versions. + +Typical workflow: +1. Feature/fix branches are created from `dev` and merged back into `dev`. +2. CI builds the script and pushes it to `dev-build` for testing. +3. When ready for release, `dev` is merged into `source`. +4. CI builds the script and pushes it to `master` for production. + ## Linting and Testing ```bash -# Lint (used in CI) -shellcheck spectre-meltdown-checker.sh +# Assemble the final script +make build -# Indentation must use tabs only (CI enforces this) -grep -Pn '^ ' spectre-meltdown-checker.sh # should find nothing +# Lint the generated script +make fmt-check shellcheck # Run the script (requires root for full results) sudo ./spectre-meltdown-checker.sh +# Run specific tests that we might have just added (variant name) +sudo ./spectre-meltdown-checker.sh --variant l1tf --variant taa + +# Run specific tests that we might have just added (CVE name) +sudo ./spectre-meltdown-checker.sh --cve CVE-2018-3640 --cve CVE-2022-40982 + # Batch JSON mode (CI validates exactly 19 CVEs in output) sudo ./spectre-meltdown-checker.sh --batch json | jq '.[] | .CVE' | wc -l # must be 19 @@ -41,10 +96,44 @@ The entire tool is a single bash script with no external script dependencies. Ke ## Key Design Principles -- **Non-destructive**: Never modifies the system; any loaded kernel modules (cpuid, msr) are unloaded on exit -- **Version-agnostic**: Detects actual CPU/kernel capabilities rather than hardcoding version numbers -- **Whitelist approach**: CPUs are assumed affected unless proven unaffected -- **Offline mode**: Can analyze a non-running kernel via `--kernel`, `--config`, `--map` flags +These rules are non-negotiable and govern how every part of the script is written: + +### 1. Production-safe + +It must always be okay to run this script in a production environment. + +- **1a. Non-destructive**: Never modify the system. 
If the script loads a kernel module it needs (e.g. `cpuid`, `msr`), it must unload it on exit. +- **1b. Report only**: Never attempt to "fix" or "mitigate" any vulnerability, or modify any configuration. The script reports status and leaves all decisions to the sysadmin. +- **1c. No exploit execution**: Never run any kind of exploit or proof-of-concept. This would violate rule 1a, could cause unpredictable system behavior, and may produce wrong conclusions (especially for Spectre-class PoCs that require very specific build options and prerequisites). + +### 2. Never hardcode kernel versions + +Never look at the kernel version string to determine whether it supports a mitigation. This would defeat the script's purpose: it must detect mitigations in unknown, vendor-patched, or backported kernels. Similarly, do not blindly trust what `sysfs` reports when it is possible to verify directly. + +### 3. Never hardcode microcode versions + +Never look at the microcode version to determine whether it has the proper mitigation mechanisms. Instead, probe for the mechanisms themselves (CPUID bits, MSR values), as the kernel would. + +### 4. Assume affected unless proven otherwise (whitelist approach) + +When a CPU is not explicitly known to be unaffected by a vulnerability, assume that it is affected. This conservative default has been the right call since the early Spectre/Meltdown days and remains sound. + +### 5. Offline mode + +The script can analyze a non-running kernel via `--kernel`, `--config`, `--map` flags, allowing verification before deployment. + +## CVE Inclusion Criteria + +A vulnerability should be supported by this tool when mitigating it requires **kernel modifications**, **microcode modifications**, or **both**. + +A vulnerability is **out of scope** when: + +- Mitigation is handled entirely by a driver or userspace software update (e.g. CVE-2019-14615, which requires an Intel driver update). 
+- The vulnerability is a regression from a bad backport and cannot be detected without hardcoding kernel versions (violates rule 2). +- The vendor has determined it is not a new attack and issued no kernel or microcode changes, leaving nothing for the script to check. +- The industry has collectively decided not to address the vulnerability (no mitigations exist), leaving nothing to verify. + +When evaluating whether to add a new CVE, check the [information-tagged issues](https://github.com/speed47/spectre-meltdown-checker/issues?q=is%3Aissue+label%3Ainformation) for prior discussion and precedent. ## POSIX Compliance @@ -80,7 +169,14 @@ This script uses the following naming rules for variables: Other general constants go at the top of the file, below the `VERSION` affectation. `opt_*` : Command-line options set during argument parsing (e.g. opt_verbose, opt_batch). `cpu_*` : CPU identification/state filled by parse_cpu_details() (e.g. cpu_family, cpu_model). -`cap_*` : CPU capability flags read from hardware/firmware (e.g. cap_rdcl_no). +`cap_*` : CPU capability flags read from hardware/firmware (e.g. cap_verw_clear, cap_rdcl_no). + All `cap_*` variables are set in `check_cpu()`. They come in two flavors: + - **Immunity bits** (`cap_*_no`): The CPU vendor declares this hardware is not affected by a vulnerability. + The `_no` suffix mirrors the vendor's own bit naming (e.g. RDCL_NO, GDS_NO, TSA_SQ_NO). + These are consumed in `is_cpu_affected()` to mark a CPU as immune. + - **Mitigation bits** (all other `cap_*`): Microcode or hardware provides a mechanism to work around + a vulnerability the CPU *does* have (e.g. cap_verw_clear, cap_ibrs, cap_ssbd). + These are consumed in `check_CVE_*_linux()` functions to assess mitigation status. `affected_*` : Per-CVE vulnerability status from is_cpu_affected() (e.g. affected_l1tf). `ret_<func>_*` : "Out-parameters" set by a function for its caller (e.g. ret_read_cpuid_value, ret_read_msr_msg).
The `<func>` matches the function name so ownership is obvious, these variables can't be written @@ -103,6 +199,230 @@ Functions follow two naming tiers: These must never be called directly from the top-level main flow. Examples: `_echo`, `_emit_json`, `_cve_registry_field`. +## How to Implement a New CVE Check + +Adding a new CVE follows a fixed pattern. Every check uses the same three-function structure and the same decision algorithm. This section walks through both. + +### Prerequisites + +Before writing code, verify the CVE meets the inclusion criteria (see "CVE Inclusion Criteria" above). The vulnerability must require kernel and/or microcode changes to mitigate. + +### Step 1: Create the Vulnerability File + +Create `src/vulns/CVE-YYYY-NNNNN.sh`. The file must contain exactly three functions: + +```sh +# vim: set ts=4 sw=4 sts=4 et: +#################### +# SHORT_NAME section + +# CVE-YYYY-NNNNN SHORT_NAME (one-line description) - entry point +check_CVE_YYYY_NNNNN() { + check_cve 'CVE-YYYY-NNNNN' +} + +# CVE-YYYY-NNNNN SHORT_NAME (one-line description) - Linux mitigation check +check_CVE_YYYY_NNNNN_linux() { + # ... (see Step 3) +} + +# CVE-YYYY-NNNNN SHORT_NAME (one-line description) - BSD mitigation check +check_CVE_YYYY_NNNNN_bsd() { + if ! is_cpu_affected "$cve"; then + pvulnstatus "$cve" OK "your CPU vendor reported your CPU model as not affected" + else + pvulnstatus "$cve" UNK "your CPU is affected, but mitigation detection has not yet been implemented for BSD in this script" + fi +} +``` + +The entry point calls `check_cve`, which prints the CVE header and dispatches to `_linux()` or `_bsd()` based on `$g_os`. If BSD mitigations are not yet understood, use the stub above — it correctly reports UNK rather than a false OK. + +### Step 2: Register the CVE in the CPU Affection Logic + +In `src/libs/200_cpu_affected.sh`, add an `affected_yourname` variable and populate it inside `is_cpu_affected()`. 
The variable follows the whitelist principle: **assume affected (`1`) unless you can prove the CPU is immune (`0`)**. Two kinds of evidence can prove immunity: + +- **Static identifiers**: CPU vendor, family, model, stepping — these identify the hardware design. +- **Hardware immunity `cap_*` bits**: CPUID or MSR bits that the CPU vendor defines to explicitly declare "this hardware is not affected" (e.g. `cap_rdcl_no` for Meltdown, `cap_ssb_no` for Variant 4, `cap_gds_no` for Downfall, `cap_tsa_sq_no`/`cap_tsa_l1_no` for TSA). These are read in `check_cpu()` and stored as `cap_*` globals. + +Never use microcode version strings. + +**Important**: Do not confuse hardware immunity bits with *mitigation* capability bits. A hardware immunity bit (e.g. `GDS_NO`, `TSA_SQ_NO`) declares that the CPU design is architecturally free of the vulnerability — it belongs here in `is_cpu_affected()`. A mitigation capability bit (e.g. `VERW_CLEAR`, `MD_CLEAR`) indicates that updated microcode provides a mechanism to work around a vulnerability the CPU *does* have — it belongs in the `check_CVE_YYYY_NNNNN_linux()` function (Phase 2), where it is used to determine whether mitigations are in place. + +### Step 3: Implement the Linux Check + +The `_linux()` function follows a standard algorithm with four phases: + +**Phase 1 — Initialize and check sysfs:** + +```sh +check_CVE_YYYY_NNNNN_linux() { + local status sys_interface_available msg + status=UNK + sys_interface_available=0 + msg='' + if sys_interface_check "$VULN_SYSFS_BASE/vuln_name"; then + sys_interface_available=1 + status=$ret_sys_interface_check_status + fi +``` + +`sys_interface_check` reads `/sys/devices/system/cpu/vulnerabilities/` and parses the kernel's own assessment into `ret_sys_interface_check_status` (OK/VULN/UNK) and `ret_sys_interface_check_fullmsg`. If the sysfs file doesn't exist (older kernel, or the CVE predates kernel awareness), it returns false and `sys_interface_available` stays 0. 
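As a rough illustration of the sysfs parsing described above, the following sketch maps a kernel message to a status. The `classify_sysfs_msg` helper is hypothetical and is not the script's `sys_interface_check`; it only assumes the common kernel message prefixes (`Not affected`, `Mitigation`, `Vulnerable`) and treats anything else as unknown.

```shell
# Hypothetical helper, NOT part of the script: classify the content of a
# sysfs vulnerabilities file into OK / VULN / UNK based on its prefix.
classify_sysfs_msg() {
	case "$1" in
		"Not affected"*) echo OK ;;
		"Mitigation"*)   echo OK ;;
		"Vulnerable"*)   echo VULN ;;
		*)               echo UNK ;;
	esac
}

# Example: read one entry if it exists, otherwise note that sysfs is unavailable
f=/sys/devices/system/cpu/vulnerabilities/meltdown
if [ -r "$f" ]; then
	msg=$(cat "$f")
	echo "$(classify_sysfs_msg "$msg"): $msg"
else
	echo "sysfs interface not available for this entry"
fi
```

The real function also exposes the full message and handles more cases; the sketch only shows why an unreadable or unrecognized file should end up as UNK rather than OK.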
+ +**Phase 2 — Custom detection (kernel + runtime):** + +Guarded by `if [ "$opt_sysfs_only" != 1 ]; then` so users who trust sysfs can skip it. + +This is where the real detection lives. Check for mitigations at each layer: + +- **Kernel support**: Determine whether the kernel carries the mitigation code. Three sources of evidence are available, and any one of them is sufficient: + + - **Kernel image** (`$g_kernel`): Search for strings or symbols that prove the mitigation code is compiled in. + ```sh + if grep -q 'mitigation_string' "$g_kernel"; then + kernel_mitigated="found mitigation evidence in kernel image" + fi + ``` + Guard with `if [ -n "$g_kernel_err" ]; then` first — the kernel image may be unavailable. + + - **Kernel config** (`$g_kernel_config`): Look for the `CONFIG_*` option that enables the mitigation. + ```sh + if [ -n "$g_kernel_config" ] && grep -q '^CONFIG_MITIGATION_NAME=y' "$g_kernel_config"; then + kernel_mitigated="found mitigation config option enabled" + fi + ``` + + - **System.map** (`$g_kernel_map`): Look for function names directly linked to the mitigation. + ```sh + if [ -n "$g_kernel_map" ] && grep -q 'mitigation_function_name' "$g_kernel_map"; then + kernel_mitigated="found mitigation function in System.map" + fi + ``` + + Each source may independently be unavailable (offline mode without the file, or stripped kernel), so check all that are present. A match in any one confirms kernel support. + +- **Runtime state** (live mode only): Read MSRs, check cpuinfo flags, parse dmesg, inspect debugfs. + ```sh + if [ "$opt_live" = 1 ]; then + read_msr 0xADDRESS + ret=$? + if [ "$ret" = "$READ_MSR_RET_OK" ]; then + # check specific bits in ret_read_msr_value_lo / ret_read_msr_value_hi + fi + else + pstatus blue N/A "not testable in offline mode" + fi + ``` + +- **Microcode capabilities**: Check CPUID bits or MSR flags that indicate the CPU firmware supports the mitigation. Never compare microcode version numbers directly. 
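To make the "probe capability bits, not versions" idea concrete, here is a minimal sketch of testing a single bit in a register value with POSIX arithmetic. The `is_bit_set` helper and the `sample_edx` value are assumptions made for illustration; in the script, the value would come from its own CPUID/MSR reading helpers rather than a literal.

```shell
# Hypothetical helper, NOT part of the script: succeed if bit $2 of value $1 is set.
is_bit_set() {
	[ $(( ($1 >> $2) & 1 )) -eq 1 ]
}

# Made-up sample standing in for CPUID.(EAX=7,ECX=0):EDX.
# Bits 10 (MD_CLEAR) and 26 (IBRS/IBPB) are set in this example value.
sample_edx=0x04000400

if is_bit_set "$sample_edx" 10; then
	echo "MD_CLEAR is enumerated: microcode provides the VERW buffer-clearing mechanism"
else
	echo "MD_CLEAR is not enumerated"
fi
```

Because the check looks at the enumerated bit itself, it gives the same answer regardless of which microcode revision introduced the feature.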
+ +Close the `opt_sysfs_only` block with the forced-sysfs fallback: +```sh + elif [ "$sys_interface_available" = 0 ]; then + msg="/sys vulnerability interface use forced, but it's not available!" + status=UNK + fi +``` + +**Phase 3 — CPU affection gate:** + +```sh + if ! is_cpu_affected "$cve"; then + pvulnstatus "$cve" OK "your CPU vendor reported your CPU model as not affected" +``` + +If the CPU is not affected, nothing else matters — report OK and return. This overrides any sysfs or custom detection result. + +**Phase 4 — Final status determination:** + +For affected CPUs, combine the evidence from Phase 2 into a final verdict: + +```sh + elif [ "$opt_sysfs_only" != 1 ]; then + if [ "$microcode_ok" = 1 ] && [ -n "$kernel_mitigated" ]; then + pvulnstatus "$cve" OK "Both kernel and microcode mitigate the vulnerability" + elif [ "$microcode_ok" = 1 ]; then + pvulnstatus "$cve" OK "Microcode mitigates the vulnerability" + elif [ -n "$kernel_mitigated" ]; then + pvulnstatus "$cve" OK "Kernel mitigates the vulnerability" + else + pvulnstatus "$cve" VULN "Neither kernel nor microcode mitigate the vulnerability" + explain "Remediation advice here..." + fi + else + pvulnstatus "$cve" "$status" "$ret_sys_interface_check_fullmsg" + fi +} +``` + +The exact combination logic depends on the CVE. Some require **both** microcode and kernel fixes (report VULN if either is missing). Others are mitigated by **either** layer alone (report OK if one is present). Some also require SMT to be disabled — check with `is_cpu_smt_enabled()`. + +### Cross-Cutting Features + +Several command-line options affect the logic inside `_linux()` checks. New CVE implementations must account for them where relevant. + +#### `--explain` (`opt_explain`) + +When the user passes `--explain`, the `explain()` function prints actionable "How to fix" remediation advice. 
Call `explain` whenever reporting a VULN status, so the user knows what concrete steps to take: + +```sh +pvulnstatus "$cve" VULN "Neither kernel nor microcode mitigate the vulnerability" +explain "Update your kernel to a version that includes the mitigation, and update your CPU microcode. If you are using a distro, make sure you are up to date." +``` + +The text should be specific: mention kernel parameters to set (`nosmt`), sysctl knobs to toggle, or which component needs updating. If SMT must be disabled, say so explicitly. Multiple `explain` calls can be made for different failure paths, each tailored to the specific gap found. `explain` is a no-op when `--explain` was not passed, so it is always safe to call. + +#### `--paranoid` (`opt_paranoid`) + +Paranoid mode raises the bar for what counts as "mitigated". In normal mode, conditional mitigations or partial defenses may be accepted as sufficient. In paranoid mode, only the **maximum security configuration** qualifies as OK. + +The most common effect is requiring SMT (Hyper-Threading) to be disabled. For example, MDS and TAA mitigations are considered incomplete in paranoid mode if SMT is still enabled, because a sibling thread could still exploit the vulnerability: + +```sh +if [ "$opt_paranoid" != 1 ] || [ "$kernel_smt_allowed" = 0 ]; then + pvulnstatus "$cve" OK "Microcode and kernel mitigate the vulnerability" +else + pvulnstatus "$cve" VULN "Mitigation is active but SMT must be disabled for full protection" +fi +``` + +Other paranoid-mode effects include requiring unconditional (rather than conditional) L1D flushing, or requiring TSX to be fully disabled. When implementing a new CVE, consider whether there is a stricter configuration that paranoid mode should enforce and add the appropriate `opt_paranoid` branches. + +#### `--vmm` (`opt_vmm`) + +The `--vmm` option tells the script whether the system is a hypervisor host running untrusted virtual machines. 
It accepts three values: `auto` (default, auto-detect by looking for `qemu`/`kvm`/`xen` processes), `yes` (force hypervisor mode), or `no` (force non-hypervisor mode). The result is stored in `g_has_vmm` by the `check_has_vmm()` function. + +Some vulnerabilities (e.g. L1TF/CVE-2018-3646, ITLBMH/CVE-2018-12207) only matter — or require additional mitigations — when the host is running a hypervisor with untrusted guests. If `g_has_vmm` is 0, the system can be reported as not vulnerable to these VMM-specific aspects: + +```sh +if [ "$g_has_vmm" = 0 ]; then + pvulnstatus "$cve" OK "this system is not running a hypervisor" +else + # check hypervisor-specific mitigations (L1D flushing, EPT, etc.) +fi +``` + +CVEs that need VMM context should call `check_has_vmm` early in their `_linux()` function. Note the interaction with paranoid mode: when `--paranoid` is active and `--vmm` was not explicitly set, the script assumes a hypervisor is present (`g_has_vmm=2`), erring on the side of caution. + +### Step 4: Wire Up and Test + +1. **Add the CVE name mapping** in the `cve2name()` function so the header prints a human-readable name. +2. **Build** the monolithic script with `make`. +3. **Test live**: Run the built script and confirm your CVE appears in the output and reports a sensible status. +4. **Test batch JSON**: Run with `--batch json` and verify the CVE count incremented by one (currently 19 → 20). +5. **Test offline**: Run with `--kernel`/`--config`/`--map` pointing to a kernel image and verify the offline code path reports correctly. +6. **Lint**: Run `shellcheck` on the monolithic script and fix any warnings. + +### Key Rules to Remember + +- **Never hardcode kernel or microcode versions** — detect capabilities directly (design principles 2 and 3). +- **Assume affected by default** — only mark a CPU as unaffected when there is positive evidence (design principle 4). 
+- **Always handle both live and offline modes** — use `$opt_live` to branch, and print `N/A "not testable in offline mode"` for runtime-only checks when offline. +- **Use `explain()`** when reporting VULN to give actionable remediation advice (see "Cross-Cutting Features" above). +- **Handle `--paranoid` and `--vmm`** when the CVE has stricter mitigation tiers or VMM-specific aspects (see "Cross-Cutting Features" above). +- **All indentation must use tabs** (CI enforces this). +- **Stay POSIX-compatible** — no bashisms, no GNU-only flags in portable code paths. + ## Function documentation headers Every function must have a documentation header immediately above its definition. The format is: