The purpose of Hafnium is to provide memory isolation between a set of security domains, to better separate untrusted code from security-critical code. It is implemented as a type-1 hypervisor, where each security domain is a VM.
On AArch64 (currently the only supported architecture) it runs at EL2, while the VMs it manages run at EL1 (and user space applications within those VMs at EL0). A Secure Monitor such as Trusted Firmware-A runs underneath it at EL3.
Hafnium provides memory isolation between these VMs by managing their stage 2 page tables, and using IOMMUs to restrict how DMA devices can be used to access memory. It must also prevent them from accessing system resources in a way which would allow them to escape this containment. It also provides:
See the VM interface documentation for more details.
Hafnium makes a distinction between a primary VM, which would typically run the main user-facing operating system such as Android, and a number of secondary VMs which are smaller and exist to provide various services to the primary VM. The primary VM typically owns the majority of the system resources, and is likely to be more latency-sensitive as it is running user-facing tasks. Some of the differences between primary and secondary VMs are explained below.
Hafnium runs a set of VMs without trusting any of them. Neither do the VMs trust each other. Hafnium aims to prevent malicious software running in one VM from compromising any of the other VMs. Specifically, we guarantee confidentiality and memory integrity of each VM: no other VM should be able to read or modify the memory that belongs to a VM without that VM's consent.
We do not make any guarantees of availability of VMs, except for the primary VM. In other words, a compromised primary VM may prevent secondary VMs from running, but it cannot gain unauthorised access to their memory. A compromised secondary VM should not be able to prevent the primary VM or other secondary VMs from running.
Hafnium is designed with the following principles in mind:
A VM in Hafnium consists of:
Each vCPU also has:
VMs and their vCPUs are configured statically from a manifest read at boot time. There is no way to create or destroy VMs at run time.
Unlike many other type-1 hypervisors, Hafnium does not include a scheduler. Instead, we rely on the primary VM to handle scheduling, calling Hafnium when it wants to run a secondary VM's vCPU. This is because:
Hafnium therefore maintains a 1:1 mapping of physical CPUs to vCPUs for the primary VM, and allows the primary VM to control the power state of physical CPUs directly through the standard Arm Power State Coordination Interface (PSCI). The primary VM should then create a kernel thread for each secondary VM vCPU and schedule these threads to run the vCPUs according to the interface expectations defined by Hafnium. When (Android) Linux runs in the primary VM, the Hafnium kernel module takes care of this. PSCI calls made by secondary VMs are handled by Hafnium itself, which changes the state of the VM's vCPUs accordingly.
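The scheduling arrangement described above can be sketched as the body of one such kernel thread. This is a minimal illustration, not Hafnium's actual interface: the exit reasons, `vcpu_run_fn`, `action_for_exit`, `vcpu_thread_run` and the `fake_vcpu_run` stand-in are all hypothetical names chosen for this example.

```c
#include <stdint.h>

/* Hypothetical reasons for a vCPU run call to return to the primary VM. */
enum run_exit {
    RUN_EXIT_PREEMPTED, /* the primary VM has other work to do */
    RUN_EXIT_YIELD,     /* the vCPU voluntarily gave up the CPU */
    RUN_EXIT_WAIT,      /* the vCPU is idle until it is woken up */
    RUN_EXIT_ABORTED    /* the vCPU cannot run any more */
};

/* What the primary VM's kernel thread does next. */
enum thread_action {
    THREAD_RESCHEDULE, /* let the primary scheduler run, then re-enter */
    THREAD_SLEEP,      /* block the thread until the vCPU is woken */
    THREAD_EXIT        /* stop the thread */
};

/* Stand-in for the hypervisor call that enters a secondary vCPU. */
typedef enum run_exit (*vcpu_run_fn)(uint16_t vm_id, uint16_t vcpu_idx);

enum thread_action action_for_exit(enum run_exit reason)
{
    switch (reason) {
    case RUN_EXIT_PREEMPTED:
    case RUN_EXIT_YIELD:
        return THREAD_RESCHEDULE;
    case RUN_EXIT_WAIT:
        return THREAD_SLEEP;
    case RUN_EXIT_ABORTED:
    default:
        return THREAD_EXIT;
    }
}

/*
 * Enters the vCPU until it has to sleep or stop. Returns the final action
 * and stores the number of times the vCPU was entered in *entries.
 */
enum thread_action vcpu_thread_run(vcpu_run_fn run, uint16_t vm_id,
                                   uint16_t vcpu_idx, unsigned int *entries)
{
    enum thread_action action;

    *entries = 0;
    do {
        action = action_for_exit(run(vm_id, vcpu_idx));
        (*entries)++;
        /* A real kernel thread would invoke its own scheduler here. */
    } while (action == THREAD_RESCHEDULE);

    return action;
}

/* Fake hypervisor for experimentation: preempted twice, then idle. */
static unsigned int fake_calls;

enum run_exit fake_vcpu_run(uint16_t vm_id, uint16_t vcpu_idx)
{
    (void)vm_id;
    (void)vcpu_idx;
    return (fake_calls++ < 2) ? RUN_EXIT_PREEMPTED : RUN_EXIT_WAIT;
}
```

The point of the sketch is that the decision of when a secondary vCPU runs stays entirely within the primary VM's scheduler; the hypervisor only reports why the vCPU stopped.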
For example, considering a simple system with a single physical CPU, and a single secondary VM with one vCPU, where the primary VM kernel has created thread 1 to run the secondary VM's vCPU while thread 2 is some other normal thread:
At boot time each VM owns a mutually exclusive subset of memory pages, as configured by the manifest. These pages are all identity mapped in the stage 2 page table which Hafnium manages for the VM, so that it has full access to use them however it wishes.
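The requirement that the initial memory subsets are mutually exclusive could be checked along the following lines. This is a sketch under assumptions: `struct mem_range` and both function names are hypothetical and do not reflect Hafnium's manifest format or code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one VM's boot-time memory region. */
struct mem_range {
    uint64_t begin; /* first address in the range (inclusive) */
    uint64_t end;   /* one past the last address (exclusive) */
};

/* Two half-open ranges overlap if each starts before the other ends. */
bool ranges_overlap(struct mem_range a, struct mem_range b)
{
    return a.begin < b.end && b.begin < a.end;
}

/* Returns true if no two ranges in the array share any address. */
bool ranges_are_disjoint(const struct mem_range *ranges, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        for (size_t j = i + 1; j < count; j++) {
            if (ranges_overlap(ranges[i], ranges[j])) {
                return false;
            }
        }
    }
    return true;
}
```

The pairwise comparison is quadratic in the number of ranges, which is acceptable here on the assumption that a statically configured system has only a handful of VMs.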
Hafnium tracks which VM owns each page, and which VMs have access to it. It does this using the stage 2 page tables of the VMs, with some extra application-defined bits in the page table entries. A VM may share, lend or donate memory pages to another VM using the appropriate SPCI requests. A given page of memory may never be shared with more than two VMs, either in terms of ownership or access. Thus, the following states are possible for each page, for some values of X and Y:
For now, in the interests of simplicity, Hafnium always uses identity mapping in all page tables it manages (stage 2 page tables for VMs, and stage 1 for itself) – i.e. the IPA (intermediate physical address) is always equal to the PA (physical address) in the stage 2 page table, if it is mapped at all.
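The ownership and access bookkeeping described above can be modelled, very roughly, as a small state record per page. The sketch below is a simplification with hypothetical names (`struct page_state`, `page_share`, `page_lend`, `page_donate`); it assumes every operation starts from a page held exclusively by its owner, which is an assumption of this example rather than a statement about the SPCI rules, and it does not model how Hafnium encodes the state in page table entries.

```c
#include <stdbool.h>
#include <stdint.h>

#define VM_NONE UINT16_MAX /* marks "no second VM involved" */

struct page_state {
    uint16_t owner;    /* VM that owns the page */
    bool owner_access; /* whether the owner can currently access it */
    uint16_t borrower; /* second VM with access, or VM_NONE */
};

/* Boot-time state: owned and accessed by a single VM. */
struct page_state page_init(uint16_t owner)
{
    struct page_state p = {.owner = owner, .owner_access = true,
                           .borrower = VM_NONE};
    return p;
}

bool page_can_access(const struct page_state *p, uint16_t vm)
{
    return (vm == p->owner && p->owner_access) ||
           (vm != VM_NONE && vm == p->borrower);
}

/* True if 'from' owns the page exclusively and 'to' is a different VM. */
static bool can_transfer(const struct page_state *p, uint16_t from,
                         uint16_t to)
{
    return p->owner == from && p->owner_access &&
           p->borrower == VM_NONE && to != from && to != VM_NONE;
}

/* Share: both VMs have access, ownership is unchanged. */
bool page_share(struct page_state *p, uint16_t from, uint16_t to)
{
    if (!can_transfer(p, from, to)) {
        return false;
    }
    p->borrower = to;
    return true;
}

/* Lend: only the borrower has access, ownership is unchanged. */
bool page_lend(struct page_state *p, uint16_t from, uint16_t to)
{
    if (!can_transfer(p, from, to)) {
        return false;
    }
    p->owner_access = false;
    p->borrower = to;
    return true;
}

/* Donate: ownership and access both move to the receiver. */
bool page_donate(struct page_state *p, uint16_t from, uint16_t to)
{
    if (!can_transfer(p, from, to)) {
        return false;
    }
    *p = page_init(to);
    return true;
}
```

Because every transition requires an exclusively held page, at most two VMs are ever involved with a given page in this model, matching the restriction stated above.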
From Hafnium's point of view a device consists of:
For now, each device is associated with exactly one VM, which is statically assigned at boot time (through the manifest) and cannot be changed at runtime.
Hafnium is responsible for mapping the device's MMIO pages into the owning VM's stage 2 page table with the appropriate attributes, and for configuring the IOMMU so that the device can only access the memory that is accessible by its owning VM. This needs to be kept in sync as the VM's memory access changes with memory sharing operations. Hafnium may also need to re-initialise the IOMMU if the device is powered off and powered on again.
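The synchronisation requirement can be illustrated with a toy model in which every change to a VM's access is applied to the IOMMU view of each device that VM owns. All names here (`struct vm_model`, `struct device_model`, `vm_set_access`, `device_in_sync`) are hypothetical, and real IOMMU programming is far more involved than flipping a flag per page.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PAGES 8 /* tiny illustrative address space */

struct vm_model {
    bool stage2[MAX_PAGES]; /* true if the VM may access the page */
};

struct device_model {
    uint16_t owner_vm;     /* index of the owning VM, fixed at boot */
    bool iommu[MAX_PAGES]; /* true if the device may DMA to the page */
};

/*
 * Changes a VM's access to one page and mirrors the change into the
 * IOMMU view of every device owned by that VM.
 */
void vm_set_access(struct vm_model *vms, struct device_model *devs,
                   size_t dev_count, uint16_t vm_id, size_t page,
                   bool allowed)
{
    vms[vm_id].stage2[page] = allowed;
    for (size_t i = 0; i < dev_count; i++) {
        if (devs[i].owner_vm == vm_id) {
            devs[i].iommu[page] = allowed;
        }
    }
}

/* Checks that a device can reach exactly the pages its owner can. */
bool device_in_sync(const struct vm_model *vms,
                    const struct device_model *dev)
{
    for (size_t page = 0; page < MAX_PAGES; page++) {
        if (dev->iommu[page] != vms[dev->owner_vm].stage2[page]) {
            return false;
        }
    }
    return true;
}
```

Routing every access change through a single function is one way to keep the two views from drifting apart; a check like `device_in_sync` could then serve as a debugging invariant.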
The primary VM is responsible for forwarding interrupts to the owning VM when the device is owned by a secondary VM. This does mean that a compromised primary VM may choose not to forward interrupts, or to inject spurious interrupts, but this is consistent with our security model, in which secondary VMs are not guaranteed any level of availability.