| # VM interface |
| |
| This page provides an overview of the interface Hafnium provides to VMs. Hafnium |
makes a distinction between the 'primary VM', which controls scheduling and has
more direct access to some hardware, and 'secondary VMs', which exist mostly to
provide services to the primary VM and have a more paravirtualised interface.
| The intention is that the primary VM can run a mostly unmodified operating |
| system (such as Linux) with the addition of a Hafnium driver which |
| [fulfils certain expectations](SchedulerExpectations.md), while secondary VMs |
| will run more specialised trusted OSes or bare-metal code which is designed with |
| Hafnium in mind. |
| |
| The interface documented here is what is planned for the first release of |
| Hafnium, not necessarily what is currently implemented. |
| |
| [TOC] |
| |
| ## CPU scheduling |
| |
The primary VM will have one vCPU for each physical CPU and will control
scheduling.
| |
| Secondary VMs will have a configurable number of vCPUs, scheduled on arbitrary |
| physical CPUs at the whims of the primary VM scheduler. |
| |
| All VMs will start with a single active vCPU. Subsequent vCPUs can be started |
| through PSCI. |
| |
| ## PSCI |
| |
| The primary VM will be able to control the physical CPUs through the following |
| PSCI 1.1 calls, which will be forwarded to the underlying implementation in EL3: |
| |
| * PSCI_VERSION |
| * PSCI_FEATURES |
| * PSCI_SYSTEM_OFF |
| * PSCI_SYSTEM_RESET |
| * PSCI_AFFINITY_INFO |
| * PSCI_CPU_SUSPEND |
| * PSCI_CPU_OFF |
| * PSCI_CPU_ON |
| |
| All other PSCI calls are unsupported. |
| |
| Secondary VMs will be able to control their vCPUs through the following PSCI 1.1 |
| calls, which will be implemented by Hafnium: |
| |
| * PSCI_VERSION |
| * PSCI_FEATURES |
| * PSCI_AFFINITY_INFO |
| * PSCI_CPU_SUSPEND |
| * PSCI_CPU_OFF |
| * PSCI_CPU_ON |
| |
| All other PSCI calls are unsupported. |
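
As an illustration, a PSCI call from a VM might look like the sketch below. The
HVC conduit and the entry-point symbol are assumptions made for this example;
the function IDs are the standard PSCI 1.1 ones.

```c
#include <stdint.h>

/* Standard PSCI 1.1 function IDs (SMC64 form for PSCI_CPU_ON). */
#define PSCI_VERSION 0x84000000u
#define PSCI_CPU_ON 0xc4000003u

/*
 * Minimal SMCCC call using the HVC conduit. Whether a VM should use HVC or
 * SMC is advertised to it (e.g. via the device tree), so the conduit here is
 * an assumption. Treating x4-x17 as clobbered is conservative.
 */
static uint64_t smccc_call(uint64_t func, uint64_t arg1, uint64_t arg2,
			   uint64_t arg3)
{
	register uint64_t x0 __asm__("x0") = func;
	register uint64_t x1 __asm__("x1") = arg1;
	register uint64_t x2 __asm__("x2") = arg2;
	register uint64_t x3 __asm__("x3") = arg3;

	__asm__ volatile("hvc #0"
			 : "+r"(x0), "+r"(x1), "+r"(x2), "+r"(x3)
			 :
			 : "x4", "x5", "x6", "x7", "x8", "x9", "x10", "x11",
			   "x12", "x13", "x14", "x15", "x16", "x17", "memory");
	return x0;
}

void start_vcpu1(void)
{
	extern void vcpu1_entry(void); /* hypothetical entry point */

	/*
	 * x1 = target (v)CPU ID, x2 = entry point, x3 = context ID handed to
	 * the new vCPU in x0. Returns 0 on success or a negative PSCI error
	 * code such as -2 (INVALID_PARAMETERS).
	 */
	(void)smccc_call(PSCI_CPU_ON, 1, (uintptr_t)vcpu1_entry, 0);
}
```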
| |
| ## Hardware timers |
| |
| The primary VM will have access to both the physical and virtual EL1 timers |
| through the usual control registers (`CNT[PV]_TVAL_EL0` and `CNT[PV]_CTL_EL0`). |
| |
| Secondary VMs will have access to the virtual timer only, which will be emulated |
| with help from the kernel driver in the primary VM. |
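
For example, a VM might arm a one-shot virtual timer like this (a minimal
sketch using the architectural registers named above):

```c
#include <stdint.h>

/* Arm a one-shot virtual timer interrupt 'ms' milliseconds from now. */
static void vtimer_arm_ms(uint32_t ms)
{
	uint64_t freq;

	/* CNTFRQ_EL0 holds the system counter frequency in Hz. */
	__asm__ volatile("mrs %0, cntfrq_el0" : "=r"(freq));

	/* CNTV_TVAL_EL0 is a downcounter relative to the current count. */
	__asm__ volatile("msr cntv_tval_el0, %0" : : "r"(freq / 1000 * ms));

	/* CNTV_CTL_EL0 bit 0 = ENABLE, bit 1 = IMASK (leave unmasked). */
	__asm__ volatile("msr cntv_ctl_el0, %0" : : "r"((uint64_t)1));
}
```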
| |
| ## Interrupts |
| |
| The primary VM will have direct access to control the physical GIC, and receive |
| all interrupts (other than anything already trapped by TrustZone). It will be |
| responsible for forwarding any necessary interrupts to secondary VMs. The |
| Interrupt Translation Service (ITS) will be disabled by Hafnium so that it |
| cannot be used to circumvent access controls. |
| |
Secondary VMs will have access to a simple paravirtualised interrupt controller
| through two hypercalls: one to enable or disable a given virtual interrupt ID, |
| and one to get and acknowledge the next pending interrupt. There is no concept |
| of interrupt priorities or a distinction between edge and level triggered |
| interrupts. Secondary VMs may also inject interrupts into their own vCPUs. |
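
A sketch of how a secondary VM might drive this interface is below. The
wrapper names and the 'nothing pending' sentinel are assumptions about the
ABI, not its final form.

```c
#include <stdint.h>

/*
 * Hypothetical wrappers around the two hypercalls described above; the real
 * function IDs and register layout would come from Hafnium's ABI headers.
 */
void hf_interrupt_enable(uint32_t intid, int enable);
uint32_t hf_interrupt_get(void);

void handle_interrupt(uint32_t intid); /* VM-specific dispatch */

#define HF_INVALID_INTID 0xffffffffu /* assumed 'nothing pending' value */

void irq_init(void)
{
	/* Enable delivery of virtual interrupt 33 (a hypothetical ID). */
	hf_interrupt_enable(33, 1);
}

void irq_handler(void)
{
	uint32_t intid;

	/*
	 * Getting an interrupt also acknowledges it, so there is no separate
	 * end-of-interrupt step, and no priorities to juggle.
	 */
	while ((intid = hf_interrupt_get()) != HF_INVALID_INTID) {
		handle_interrupt(intid);
	}
}
```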
| |
| ## Performance counters |
| |
| VMs will be blocked from accessing performance counter registers (for the |
| performance monitor extensions described in chapter D5 of the Armv8-A reference |
manual) in production builds, to prevent them from being used as a side
channel to leak data between VMs.
| |
| Hafnium may allow VMs to use them in debug builds. |
| |
| ## Debug registers |
| |
| VMs will be blocked from accessing debug registers in production builds, to |
| prevent them from being used to circumvent access controls. |
| |
| Hafnium may allow VMs to use these registers in debug builds. |
| |
| ## RAS Extension registers |
| |
| Secondary VMs will be blocked from using registers associated with the RAS |
| Extension. |
| |
| ## Asynchronous message passing |
| |
| VMs will be able to send messages of up to 4 KiB to each other asynchronously, |
| with no queueing, as specified by FF-A. |
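
A send under FF-A's RX/TX mailbox model might look roughly like the sketch
below. The function IDs and register layout reflect the author's reading of
FF-A v1.0 and should be checked against the specification; `smccc_call` is the
conduit helper from the PSCI sketch above.

```c
#include <stdint.h>
#include <string.h>

uint64_t smccc_call(uint64_t func, uint64_t arg1, uint64_t arg2,
		    uint64_t arg3);

/* Assumed FF-A v1.0 function IDs; verify against the specification. */
#define FFA_SUCCESS 0x84000061u
#define FFA_MSG_SEND 0x8400006eu

/* TX buffer previously registered with the hypervisor via FFA_RXTX_MAP. */
extern uint8_t tx_buffer[4096];

int send_message(uint16_t own_id, uint16_t dest_id, const void *msg,
		 uint32_t size)
{
	uint64_t ret;

	if (size > sizeof(tx_buffer)) {
		return -1;
	}
	memcpy(tx_buffer, msg, size);

	/*
	 * w1 packs sender (bits [31:16]) and receiver (bits [15:0]); w3
	 * carries the message size. There is no queueing: if the receiver's
	 * RX buffer is still in use, the call fails and the sender retries.
	 */
	ret = smccc_call(FFA_MSG_SEND, ((uint64_t)own_id << 16) | dest_id, 0,
			 size);
	return ret == FFA_SUCCESS ? 0 : -1;
}
```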
| |
| ## Memory |
| |
VMs will be statically given access to mutually exclusive regions of the
| physical address space at boot. This includes MMIO space for controlling |
| devices, plus a fixed amount of RAM for secondaries, and all remaining address |
| space to the primary. Note that this means that only one VM can control any |
| given page of MMIO registers for a device. |
| |
VMs may choose to donate or share their memory with other VMs at runtime. Any
given page may be shared between at most two VMs at once (including the
original owning VM). Memory which has been donated or shared may not be
forcefully reclaimed, but the VM with which it was shared may choose to return
it.
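
As a sketch of this flow, with hypothetical wrapper names (under FF-A these
would correspond to the memory-sharing calls such as FFA_MEM_SHARE,
FFA_MEM_DONATE and FFA_MEM_RELINQUISH, whose real descriptors carry
considerably more detail):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrappers; signatures are illustrative only. */
int vm_share_memory(uint16_t dest_vm, uintptr_t base, size_t size);
int vm_relinquish_memory(uintptr_t base, size_t size);

void owner_example(void)
{
	/*
	 * Share one page with VM 2. Two VMs (the owner plus one borrower)
	 * is the maximum for any given page.
	 */
	vm_share_memory(2, 0x88000000, 4096);

	/*
	 * The owner cannot forcefully reclaim the page; it becomes usable
	 * again only after VM 2 calls vm_relinquish_memory() on its side.
	 */
}
```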
| |
| ## Cache |
| |
| VMs will be blocked from using cache maintenance instructions that operate by |
set/way. These operations are difficult to virtualise, and could expose the
| system to side-channel attacks. |
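
On the hypervisor side, trapping these instructions is typically achieved with
the HCR_EL2.TSW bit, which traps DC ISW/CSW/CISW from the VM to EL2. The
sketch below uses the architectural bit position; it is not taken from
Hafnium's source.

```c
#include <stdint.h>

#define HCR_EL2_TSW (UINT64_C(1) << 22) /* trap set/way cache maintenance */

/*
 * Run at EL2 while setting up a vCPU: any set/way cache maintenance the VM
 * then executes traps to the hypervisor instead of taking effect.
 */
static void trap_set_way_ops(void)
{
	uint64_t hcr;

	__asm__ volatile("mrs %0, hcr_el2" : "=r"(hcr));
	hcr |= HCR_EL2_TSW;
	__asm__ volatile("msr hcr_el2, %0" : : "r"(hcr));
}
```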
| |
| ## Logging |
| |
| VMs may send a character to a shared log by means of a hypercall or SMC call. |
| These log messages will be buffered per VM to make complete lines, then output |
| to a Hafnium-owned UART and saved in a shared ring buffer which may be extracted |
| from RAM dumps. VM IDs will be prepended to these logs. |
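
On the VM side this can be as simple as the sketch below; the hypercall
wrapper is hypothetical, and the per-VM line buffering happens inside Hafnium,
so the guest only ever sends single characters.

```c
/* Hypothetical wrapper around the single-character log hypercall. */
void hf_debug_log(char c);

/*
 * Log a NUL-terminated string one character at a time; Hafnium buffers per
 * VM until a newline so that lines from different VMs do not interleave.
 */
void log_string(const char *s)
{
	while (*s != '\0') {
		hf_debug_log(*s++);
	}
}
```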
| |
| This log API is intended for use in early bringup and low-level debugging. No |
| sensitive data should be logged through it. Higher level logs can be sent to the |
| primary VM through the asynchronous message passing mechanism described above, |
| or through shared memory. |
| |
| ## Configuration |
| |
Hafnium will read its configuration from a flattened device tree blob (FDT).
This may either be the same device tree that describes the rest of the system
or a separate minimal one just for Hafnium. It will include at least:
| |
| * The available RAM. |
| * The number of secondary VMs, how many vCPUs each should have, how much |
| memory to assign to each of them, and where to load their initial images. |
| (Most likely the initial image will be a minimal loader supplied with |
| Hafnium which will validate and load the rest of the image from the primary |
| later on.) |
| * Which devices exist on the system, their details (MMIO regions, interrupts |
| and SYSMMU details), and which VM each is assigned to. |
| * A single physical device may be split into multiple logical ‘devices’ |
| from Hafnium’s point of view if necessary to have different VMs own |
| different parts of it. |
| * A whitelist of which SMC calls each VM is allowed to make. |
| |
| ## Failure handling |
| |
| If a secondary VM tries to do something it shouldn't, Hafnium will either inject |
| a fault or kill it and inform the primary VM. The primary VM may choose to |
| restart the system or to continue without the secondary VM. |
| |
| If the primary VM tries to do something it shouldn't, Hafnium will either inject |
| a fault or restart the system. |
| |
| ## TrustZone communication |
| |
The primary VM will be able to communicate with a TEE running in TrustZone
through FF-A messages or whitelisted SMC calls, as well as through shared
memory.
| |
| ## Other SMC calls |
| |
Other than the PSCI calls described above and those used to communicate with
Hafnium, SMC calls will be blocked by default. Hafnium will allow SMC
| calls to be whitelisted on a per-VM, per-function ID basis, as part of the |
| static configuration described above. These whitelisted SMC calls will be |
| forwarded to the EL3 handler with the client ID (as described by the SMCCC) set |
| to the calling VM's ID. |
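
From Hafnium's side, the forwarding step might look roughly like the sketch
below. The helper names are invented, but the client ID's placement in bits
[15:0] of w7 follows the SMCCC.

```c
#include <stdbool.h>
#include <stdint.h>

struct vcpu; /* opaque; the accessors below are hypothetical */

uint16_t vcpu_vm_id(const struct vcpu *vcpu);
bool vm_smc_whitelisted(uint16_t vm_id, uint32_t func_id);
void smc_forward(uint64_t *regs); /* issues the real SMC to EL3 */

/* Handle an SMC trapped from a VM; regs[0..7] mirror x0-x7. */
bool handle_vm_smc(struct vcpu *vcpu, uint64_t *regs)
{
	uint32_t func_id = (uint32_t)regs[0];

	if (!vm_smc_whitelisted(vcpu_vm_id(vcpu), func_id)) {
		regs[0] = (uint64_t)-1; /* SMCCC NOT_SUPPORTED */
		return false;
	}

	/*
	 * Stamp the calling VM's ID into the SMCCC client ID field (w7
	 * bits [15:0]) so that EL3 can attribute the call.
	 */
	regs[7] = (regs[7] & ~UINT64_C(0xffff)) | vcpu_vm_id(vcpu);
	smc_forward(regs);
	return true;
}
```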