Commit f5f1327f authored by Omkar Kulkarni, committed by Thomas Abraham

infra/common: document dmc620 ras error handling support



Add a document that explains the procedure to inject a 1-bit ECC DMC-620
RAS error and ensure that it is handled.

Change-Id: I0ec7d04cfc8da90296451c14ef4624ec253e51ac
Signed-off-by: Omkar Anand Kulkarni <omkar.kulkarni@arm.com>
parent 9bbc0d68
DMC-620 RAS Error Injection and Handling
========================================
.. contents::
Overview of RAS
---------------
Reliability, Availability and Serviceability (RAS) is a measure of the
robustness of a system. A RAS-enabled platform ensures that the system
produces correct outputs, stays operational and is easy to maintain.
RAS reduces the system's downtime by detecting hardware errors and correcting
them when possible.
RAS test with DMC-620 memory controller
---------------------------------------
The DMC-620 supports single-bit ECC RAS errors, and the Neoverse reference
design platform software supports both injecting and handling these errors.
The firmware-first error handling framework and the standard error injection
mechanism defined by the ACPI EINJ table are used for this purpose. Error
injection is supported for 1-bit DRAM errors on the DMC-620. These are
corrected errors, and they are reported to the OS.
Error injection and error handling
----------------------------------
The Neoverse reference design platform software stack must first be built and
booted before the DMC-620 RAS error injection and handling test can be run. To
build and execute the platform software stack, follow the instructions on the
buildroot build and boot page `Buildroot boot`_. Make sure that the boot is
successful and that the buildroot command prompt is accessible.
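Before injecting an error, it can help to confirm that the platform advertises
the corrected memory error type (``0x8``) used in the procedure that follows.
The sketch below is an illustration, not part of the platform software. It
assumes that debugfs is already mounted at ``/sys/kernel/debug`` and that the
kernel's EINJ ``available_error_type`` file lists one hexadecimal type code per
line followed by a description. The function name ``supports_mem_correctable``
and the ``EINJ_DIR`` override are hypothetical.

```shell
#!/bin/sh
# Assumed default location of the EINJ debugfs files; override EINJ_DIR to
# point somewhere else (for example, a scratch directory).
EINJ_DIR="${EINJ_DIR:-/sys/kernel/debug/apei/einj}"

# Succeeds if the given file (default: $EINJ_DIR/available_error_type) lists
# error type 0x8. Assumes each line starts with a hexadecimal type code.
supports_mem_correctable() {
    types_file="${1:-$EINJ_DIR/available_error_type}"
    [ -r "$types_file" ] || return 1
    while read -r code _desc; do
        case "$code" in
            # Compare numerically so that 0x8 and 0x00000008 both match.
            0x*) [ "$(( code ))" -eq 8 ] && return 0 ;;
        esac
    done < "$types_file"
    return 1
}
```

A possible use is ``supports_mem_correctable || echo "error type 0x8 not listed"``.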
Procedure to perform 1-bit ECC error injection on DMC620
--------------------------------------------------------
To perform the 1-bit error injection test on the DMC-620, run the following
commands in sequence from the buildroot prompt.
::

    # mount -t debugfs none /sys/kernel/debug
    # echo <address> > /sys/kernel/debug/apei/einj/param1
    # echo 0xfffffffffffff000 > /sys/kernel/debug/apei/einj/param2
    # echo 0x8 > /sys/kernel/debug/apei/einj/error_type
    # echo 0x1 > /sys/kernel/debug/apei/einj/error_inject

* <address>
    - Valid physical address where the 1-bit error should be injected.

For example, to inject a 1-bit error at physical address 0x8f000000, set
``param1`` with the following command.
::

    # echo 0x8f000000 > /sys/kernel/debug/apei/einj/param1

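The injection steps can also be collected into a small helper. The following
is a minimal sketch under stated assumptions, not a script shipped with the
platform software: the function name ``inject_corrected_mem_error`` and the
``EINJ_DIR`` override are hypothetical, and debugfs is assumed to be mounted
already, as in the first command of the procedure. The values written are
exactly those listed in the procedure.

```shell
#!/bin/sh
# Assumed default location of the EINJ debugfs files; override EINJ_DIR to
# try the logic against a scratch directory.
EINJ_DIR="${EINJ_DIR:-/sys/kernel/debug/apei/einj}"

# Usage: inject_corrected_mem_error <address>
# <address> is a physical address in hexadecimal, with a 0x prefix.
inject_corrected_mem_error() {
    addr="$1"

    # Require the 0x prefix, then at least one hexadecimal digit and nothing else.
    case "$addr" in
        0x*) hex="${addr#0x}" ;;
        *)   echo "address must start with 0x: $addr" >&2; return 1 ;;
    esac
    case "$hex" in
        ''|*[!0-9a-fA-F]*) echo "invalid hexadecimal address: $addr" >&2; return 1 ;;
    esac

    # debugfs must already be mounted (mount -t debugfs none /sys/kernel/debug).
    [ -d "$EINJ_DIR" ] || { echo "EINJ directory not found: $EINJ_DIR" >&2; return 1; }

    # Same sequence and values as the documented procedure.
    echo "$addr"            > "$EINJ_DIR/param1" &&
    echo 0xfffffffffffff000 > "$EINJ_DIR/param2" &&
    echo 0x8                > "$EINJ_DIR/error_type" &&
    echo 0x1                > "$EINJ_DIR/error_inject"
}
```

With this helper, the example above becomes ``inject_corrected_mem_error 0x8f000000``.
Validating the address first avoids writing a malformed value to ``param1``.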
In response to the hardware-corrected 1-bit ECC error, the firmware-first
error handling framework reports the details of the error to the kernel, and
the kernel prints the following message on the console.
::

    {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
    {1}[Hardware Error]: event severity: recoverable
    {1}[Hardware Error]:  Error 0, type: corrected
    {1}[Hardware Error]:  fru_id: 00000000-0000-0000-0000-000000000000
    {1}[Hardware Error]:  fru_text:
    {1}[Hardware Error]:   section_type: memory error
    {1}[Hardware Error]:   physical_address: 0x00000001f03fedcd
    {1}[Hardware Error]:   physical_address_mask: 0xfffffffffffff000
    {1}[Hardware Error]:   error_type: 8, parity error

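To check for this report from a script rather than by eye, the kernel log can
be scanned for the two fields that identify it. The sketch below is an
illustration that assumes the log lines have the format shown above. The
function name ``has_corrected_mem_error`` is hypothetical. It reads a log on
standard input and succeeds only if the log contains both a
``type: corrected`` line and a ``section_type: memory error`` line.

```shell
#!/bin/sh
# Succeeds if the log on stdin contains a corrected memory error report in
# the format printed by the kernel for the injected error.
has_corrected_mem_error() {
    awk '
        /\[Hardware Error\]:.*type: corrected/            { corrected = 1 }
        /\[Hardware Error\]:.*section_type: memory error/ { memory = 1 }
        END { exit !(corrected && memory) }
    '
}
```

A possible use after the injection is
``dmesg | has_corrected_mem_error && echo "corrected memory error reported"``.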
--------------
*Copyright (c) 2021, Arm Limited. All rights reserved.*
.. _Buildroot boot: docs/infra/common/buildroot-boot.rst