Embedded ARM Tutorial




  • ARM, previously Advanced RISC Machine, originally Acorn RISC Machine, is a family of reduced instruction set computing (RISC) architectures for computer processors, configured for various environments.


  • British company ARM Holdings develops the architecture and licenses it to other companies, who design their own products that implement one of those architectures, including systems-on-chips (SoC) and systems-on-modules (SoM) that incorporate memory, interfaces, radios, etc.

It also designs cores that implement this instruction set and licenses these designs to a number of companies that incorporate those core designs into their own products.


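  • Processors with a RISC architecture typically require fewer transistors than those with a complex instruction set computing (CISC) architecture, which reduces cost, power consumption, and heat dissipation.
  • These characteristics are desirable for light, portable devices such as mobile phones.
  • With almost 100 billion ARM processors produced as of 2017, ARM is the most widely used instruction set architecture in terms of quantity produced.
  • The Cortex cores and the specialized SecurCore cores are among the currently available ARM core families.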


  • The British company Acorn Computers first developed the architecture in the 1980s.
  • After the success of its BBC Micro computer, Acorn considered how to move on from the relatively simple MOS Technology 6502 processor to address business markets like the one that was soon dominated by the IBM PC, launched in 1981.


  • ARM cores are used in a number of products, particularly PDAs and smartphones.


  • Examples of ARM-based computing devices:


  1. Microsoft’s first-generation Surface
  2. Surface 2
  3. Apple’s iPads
  4. Asus’s Eee Pad Transformer tablet computers
  5. Several Chromebook laptops

Others include:

  1. Apple’s iPhone smartphone
  2. iPod portable media player
  3. Canon PowerShot digital cameras
  4. Nintendo Switch hybrid game console
  5. Nintendo 3DS handheld game console
  6. TomTom turn-by-turn navigation systems.

All modern ARM processors include hardware debugging facilities, allowing software debuggers to perform the following operations on code, starting from reset:

  1. Halting
  2. Stepping
  3. Breakpointing

These facilities are built using JTAG (Joint Test Action Group) support.

JTAG is an IEEE standard (1149.1) developed in the 1980s to solve electronic boards manufacturing issues.

Nowadays it is used more as a programming, debug and probing port, though some newer cores optionally support ARM’s own two-wire “SWD” protocol.

In ARM7TDMI cores:

The “D” represented JTAG debug support.

The “I” represented the presence of an “EmbeddedICE” debug module.

For ARM7 and ARM9 core generations, EmbeddedICE over JTAG was a de facto debug standard, though not architecturally guaranteed.

The ARMv7 architecture defines basic debug facilities at an architectural level.

The actual transport mechanism used to access the debug facilities is not architecturally specified, but implementations generally include JTAG support.

There is a separate ARM “CoreSight” debug architecture, which is not architecturally required by ARMv7 processors.

32-bit architecture (ARMv7):

An ARMv7 processor powers the popular Raspberry Pi 2 microcomputer.

The 32-bit ARM architecture, such as ARMv7-A, was the most widely used architecture in mobile devices as of 2011.

Since 1995, the ARM Architecture Reference Manual has been the primary source of documentation on the ARM processor architecture and instruction set, distinguishing interfaces that all ARM processors are required to support (such as instruction semantics) from implementation details that may vary.

The architecture has evolved over time, and version seven of the architecture, ARMv7, defines three architecture “profiles”:

  • A-profile, the “Application” profile, implemented by 32-bit cores in the Cortex-A series and by some non-ARM cores.
  • R-profile, the “Real-time” profile, implemented by cores in the Cortex-R series.
  • M-profile, the “Microcontroller” profile, implemented by most cores in the Cortex-M series.

Although the architecture profiles were first defined for ARMv7, ARM subsequently defined the ARMv6-M architecture (used by the Cortex-M0/M0+/M1) as a subset of the ARMv7-M profile with fewer instructions.

CPU modes:
  • Except in the M-profile, the 32-bit ARM architecture specifies several CPU modes, depending on the implemented architecture features.
  • At any moment in time, the CPU can be in only one mode, but it can switch modes due to external events (interrupts) or programmatically:
  • User mode: The only non-privileged mode.
  • FIQ mode: A privileged mode that is entered whenever the processor accepts a fast interrupt request.
  • IRQ mode: A privileged mode that is entered whenever the processor accepts an interrupt.
  • Supervisor (svc) mode: A privileged mode entered whenever the CPU is reset or when an SVC instruction is executed.
  • Abort mode: A privileged mode that is entered whenever a prefetch abort or data abort exception occurs.
  • Undefined mode: A privileged mode that is entered whenever an undefined instruction exception occurs.
  • System mode (ARMv4 and above): The only privileged mode that is not entered by an exception. It can only be entered by executing an instruction that explicitly writes to the mode bits of the Current Program Status Register (CPSR).
  • Monitor mode (ARMv6 and ARMv7 Security Extensions, ARMv8 EL3): A monitor mode is introduced to support TrustZone extension in ARM cores.
  • Hyp mode (ARMv7 Virtualization Extensions, ARMv8 EL2): A hypervisor mode that supports Popek and Goldberg virtualization requirements for the non-secure operation of the CPU.
  • Thread mode (ARMv6-M, ARMv7-M, ARMv8-M): A mode that can be configured as either privileged or unprivileged. Whether the Main Stack Pointer (MSP) or the Process Stack Pointer (PSP) is used can also be selected in the CONTROL register, which requires privileged access. This mode is designed for user tasks in an RTOS environment, but it is also typically used for the super-loop in bare-metal applications.
  • Handler mode (ARMv6-M, ARMv7-M, ARMv8-M): A mode dedicated to exception handling (except reset, which is handled in Thread mode). Handler mode always uses the MSP and runs at the privileged level.
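
As a rough illustration of the A- and R-profile modes above, here is a small Python model. The 5-bit values are the standard CPSR mode-field encodings, which this post does not list; the function names are made up for this sketch, and the M-profile Thread and Handler modes are left out because they are not selected through the CPSR.

```python
# Illustrative model only: the helper names are hypothetical.
# Keys are the 5-bit mode encodings held in CPSR bits 0-4.
MODES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "Supervisor",
    0b10110: "Monitor",
    0b10111: "Abort",
    0b11010: "Hyp",
    0b11011: "Undefined",
    0b11111: "System",
}


def mode_name(cpsr: int) -> str:
    """Return the mode selected by the low five bits of a CPSR value."""
    return MODES.get(cpsr & 0x1F, "Reserved")


def is_privileged(cpsr: int) -> bool:
    """Every defined mode except User is privileged."""
    return mode_name(cpsr) not in ("User", "Reserved")


def entered_by_exception(cpsr: int) -> bool:
    """User and System are the only modes not entered by taking an exception."""
    return mode_name(cpsr) not in ("User", "System", "Reserved")
```

For example, `mode_name(0x600000D3)` gives `"Supervisor"`, the mode a core is in after reset.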
Instruction set:
  • The original (and subsequent) ARM implementation was hardwired without microcode, like the much simpler 8-bit 6502 processor used in prior Acorn microcomputers.
  • The 32-bit ARM architecture (and the 64-bit architecture for the most part) includes the following RISC features:
  • Load/store architecture.
  • No support for unaligned memory accesses in the original version of the architecture. ARMv6 and later, except some microcontroller versions, support unaligned accesses for half-word and single-word load/store instructions with some limitations, such as no guaranteed atomicity.
  • Uniform 16 × 32-bit register file (including the program counter, stack pointer and the link register).
  • Fixed instruction width of 32 bits to ease decoding and pipelining, at the cost of decreased code density.
  • Later, the Thumb instruction set added 16-bit instructions and increased code density.
  • Mostly single clock-cycle execution.

To compensate for the simpler design, compared with processors like the Intel 80286 and Motorola 68020, some additional design features were used:

  • Conditional execution of most instructions reduces branch overhead and compensates for the lack of a branch predictor.
  • Arithmetic instructions alter condition codes only when desired.
  • 32-bit barrel shifter can be used without performance penalty with most arithmetic instructions and address calculations.
  • Has powerful indexed addressing modes.
  • A link register supports fast leaf function calls.
  • A simple, but fast, 2-priority-level interrupt subsystem has switched register banks.
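
To make conditional execution and the barrel shifter more concrete, here is a minimal Python sketch of an instruction such as `ADDEQ r0, r1, r2, LSL #2`. The condition table follows the standard ARM condition mnemonics (not listed in this post); the function names and the simplified shift handling (immediate amounts 1 to 31 only, no flag updates) are assumptions of this sketch rather than a full model.

```python
MASK32 = 0xFFFFFFFF

# Condition mnemonics mapped to predicates over the N, Z, C, V flags.
CONDITIONS = {
    "EQ": lambda n, z, c, v: z,
    "NE": lambda n, z, c, v: not z,
    "CS": lambda n, z, c, v: c,
    "CC": lambda n, z, c, v: not c,
    "MI": lambda n, z, c, v: n,
    "PL": lambda n, z, c, v: not n,
    "VS": lambda n, z, c, v: v,
    "VC": lambda n, z, c, v: not v,
    "HI": lambda n, z, c, v: c and not z,
    "LS": lambda n, z, c, v: (not c) or z,
    "GE": lambda n, z, c, v: n == v,
    "LT": lambda n, z, c, v: n != v,
    "GT": lambda n, z, c, v: (not z) and n == v,
    "LE": lambda n, z, c, v: z or n != v,
    "AL": lambda n, z, c, v: True,
}


def condition_passed(cond, n, z, c, v):
    """Return True if an instruction with this condition would execute."""
    return bool(CONDITIONS[cond](bool(n), bool(z), bool(c), bool(v)))


def shift_operand(value, kind, amount):
    """Apply a barrel-shifter operation to a 32-bit value (amount 1..31)."""
    value &= MASK32
    if kind == "LSL":
        return (value << amount) & MASK32
    if kind == "LSR":
        return value >> amount
    if kind == "ASR":
        # Replicate the sign bit into the vacated high bits.
        fill = (MASK32 << (32 - amount)) & MASK32 if value & 0x80000000 else 0
        return (value >> amount) | fill
    if kind == "ROR":
        return ((value >> amount) | (value << (32 - amount))) & MASK32
    raise ValueError(f"unknown shift kind: {kind}")


def add_conditional(cond, flags, rd_old, rn, rm, kind, amount):
    """Model ADD<cond> Rd, Rn, Rm, <kind> #amount and return the new Rd."""
    if not condition_passed(cond, *flags):
        return rd_old  # a failed condition leaves Rd untouched
    return (rn + shift_operand(rm, kind, amount)) & MASK32
```

With the Z flag set, `add_conditional("EQ", (0, 1, 0, 0), 7, 100, 3, "LSL", 2)` returns 112 (100 + 3 × 4), while the same call with `"NE"` returns the old value 7. The shift and the add happen in one instruction, and no branch is needed to skip it.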
Pipelines and other implementation issues:
  • The ARM7 and earlier implementations have a three-stage pipeline; the stages being fetch, decode and execute.
  • Higher-performance designs, such as the ARM9, have deeper pipelines.
  • Cortex-A8 has thirteen stages.
  • Additional implementation changes for higher performance include a faster adder and more extensive branch prediction logic.
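
The three-stage pipeline can be pictured with a tiny Python sketch. It assumes an idealized pipeline with no stalls or branches, and the function names are invented for illustration. It also shows a well-known side effect that is not stated in this post: an instruction that reads the PC sees its own address plus 8 in ARM state (plus 4 in Thumb state), because two further instructions have already been fetched.

```python
STAGES = ("fetch", "decode", "execute")


def pipeline_snapshot(cycle, instruction_bytes=4, base=0):
    """Address held in each stage during a given cycle (0-based).

    Idealized model: one instruction enters the pipeline per cycle.
    A stage that has not been reached yet holds None.
    """
    snapshot = {}
    for depth, stage in enumerate(STAGES):
        index = cycle - depth  # which instruction occupies this stage
        snapshot[stage] = base + index * instruction_bytes if index >= 0 else None
    return snapshot


def pc_seen_by(address, thumb=False):
    """Value read from the PC by the instruction located at `address`."""
    return address + (4 if thumb else 8)
```

In cycle 2, `pipeline_snapshot(2)` returns `{"fetch": 8, "decode": 4, "execute": 0}`: while the instruction at address 0 executes, the one at address 8 is being fetched, which matches `pc_seen_by(0) == 8`.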

The difference between the ARM7DI and ARM7DMI cores, for example, was an improved multiplier; hence the added “M”.

Arithmetic instructions:
  • ARM includes integer arithmetic operations for add, subtract, and multiply; some versions of the architecture also support divide operations.
  • ARM supports 32-bit × 32-bit multiplies with either a 32-bit result or 64-bit result, though Cortex-M0 / M0+ / M1 cores don’t support 64-bit results.
  • Some ARM cores also support 16-bit × 16-bit and 32-bit × 16-bit multiplies.
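
The difference between the 32-bit and 64-bit multiply results can be sketched in Python. The function names borrow the ARM mnemonics MUL, UMULL and SMULL, but this is only an arithmetic model with hypothetical helpers, not an emulator.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret the low 32 bits of x as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def mul(rn, rm):
    """32 x 32 -> 32: keep only the low word of the product."""
    return (rn * rm) & MASK32


def umull(rn, rm):
    """Unsigned 32 x 32 -> 64, returned as (low word, high word)."""
    product = (rn & MASK32) * (rm & MASK32)
    return product & MASK32, product >> 32


def smull(rn, rm):
    """Signed 32 x 32 -> 64, returned as (low word, high word)."""
    product = (to_signed32(rn) * to_signed32(rm)) & ((1 << 64) - 1)
    return product & MASK32, product >> 32
```

Multiplying 0xFFFFFFFF by 2 gives the same low word (0xFFFFFFFE) in all three cases, but the high word is 1 for `umull` and 0xFFFFFFFF for `smull`, since the signed form treats the first operand as -1.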

The divide instructions are only included in the following ARM architectures:

  • ARMv7-M and ARMv7E-M architectures always include divide instructions.
  • ARMv7-R architecture always includes divide instructions in the Thumb instruction set, but optionally in its 32-bit instruction set.
  • ARMv7-A architecture optionally includes the divide instructions. They might not be implemented at all, implemented only in the Thumb instruction set, implemented in both the Thumb and ARM instruction sets, or implemented whenever the Virtualization Extensions are included.
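
Where the divide instructions exist, their arithmetic can be modelled as below. This sketch assumes the behaviour of the SDIV and UDIV instructions as I understand it: results round toward zero, and a division by zero yields 0 when divide-by-zero trapping is not enabled. The helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret the low 32 bits of x as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def udiv(rn, rm):
    """Unsigned divide; assumes trapping is disabled, so x / 0 gives 0."""
    rn &= MASK32
    rm &= MASK32
    return 0 if rm == 0 else rn // rm


def sdiv(rn, rm):
    """Signed divide rounding toward zero (Python's // rounds down instead)."""
    a, b = to_signed32(rn), to_signed32(rm)
    if b == 0:
        return 0
    quotient = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        quotient = -quotient
    return quotient & MASK32  # wraps for the -2**31 / -1 overflow case
```

For instance, `sdiv(-7 & MASK32, 2)` returns 0xFFFFFFFD (that is, -3, not the -4 that Python's floor division would give), and `sdiv(0x80000000, 0xFFFFFFFF)` wraps back to 0x80000000. On cores without these instructions, a compiler typically calls a software division routine instead.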
Thumb instruction set:
  • To improve compiled code density, processors since the ARM7TDMI (released in 1994) have featured the Thumb instruction set, which has its own state. (The “T” in “TDMI” indicates the Thumb feature.)
  • When in this state, the processor executes the Thumb instruction set, a compact 16-bit encoding for a subset of the ARM instruction set.
  • Most of the Thumb instructions are directly mapped to normal ARM instructions.
  • The space-saving comes from making some of the instruction operands implicit and limiting the number of possibilities compared to the ARM instructions executed in the ARM instruction set state.
  • In Thumb, the 16-bit opcodes have less functionality. For example, only branches can be conditional, and many opcodes are restricted to accessing only half of all of the CPU’s general-purpose registers.
  • The shorter opcodes give improved code density overall, even though some operations require extra instructions.
  • In situations where the memory port or bus width is constrained to less than 32 bits, the shorter Thumb opcodes allow increased performance compared with 32-bit ARM code, as less program code may need to be loaded into the processor over the constrained memory bandwidth.
  • Embedded hardware, such as the Game Boy Advance, typically has a small amount of RAM accessible with a full 32-bit data path; the majority is accessed via a 16-bit or narrower secondary data path.
  • In this situation, it usually makes sense to compile Thumb code and hand-optimize a few of the most CPU-intensive sections using full 32-bit ARM instructions, placing these wider instructions into the 32-bit bus accessible memory.
  • The first processor with a Thumb instruction decoder was the ARM7TDMI.
  • All ARM9 and later families, including XScale, have included a Thumb instruction decoder.
  • The Thumb instruction set was originally inspired by SuperH’s ISA; ARM licensed several patents from Hitachi.
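
To see where the space saving comes from, the following Python sketch encodes the same register-to-register add (`ADDS Rd, Rn, Rm`) in both instruction sets. The bit layouts are the standard ARM data-processing and 16-bit Thumb add-register formats as I understand them; the function names are invented for this example, and only this one instruction form is covered.

```python
def encode_arm_adds(rd, rn, rm):
    """32-bit ARM encoding of ADDS Rd, Rn, Rm (always executed, no shift)."""
    for reg in (rd, rn, rm):
        if not 0 <= reg <= 15:
            raise ValueError("ARM register fields are 4 bits wide (r0-r15)")
    cond_al = 0xE << 28       # explicit condition field: always
    opcode_add = 0b0100 << 21  # data-processing opcode for ADD
    set_flags = 1 << 20        # explicit S bit
    return cond_al | opcode_add | set_flags | (rn << 16) | (rd << 12) | rm


def encode_thumb_adds(rd, rn, rm):
    """16-bit Thumb encoding of ADDS Rd, Rn, Rm."""
    for reg in (rd, rn, rm):
        if not 0 <= reg <= 7:
            raise ValueError("this Thumb form has 3-bit register fields (r0-r7)")
    # No condition field and no S bit: both are implicit in this encoding.
    return (0b0001100 << 9) | (rm << 6) | (rn << 3) | rd
```

`encode_arm_adds(0, 1, 2)` yields 0xE0910002 and `encode_thumb_adds(0, 1, 2)` yields 0x1888. The Thumb form drops the condition field, the flag-setting bit and the shift options, and its 3-bit register fields are why it can only reach R0 through R7.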


Registers:
  • Registers R0 through R7 are the same across all CPU modes; they are never banked.
  • Registers R8 through R12 are the same across all CPU modes except FIQ mode.
  • FIQ mode has its own distinct R8 through R12 registers.
  • R13 and R14 are banked across all privileged CPU modes except system mode.
  • That is, each mode that can be entered because of an exception has its own R13 and R14.
  • These registers generally contain the stack pointer and the return address from function calls, respectively.


  • R13 is also referred to as SP, the Stack Pointer.
  • R14 is also referred to as LR, the Link Register.
  • R15 is also referred to as PC, the Program Counter.
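
The banking rules above can be summarized as a lookup from (mode, register number) to the physical register that is actually accessed. This is a minimal Python sketch: the mode abbreviations and the `R13_svc`-style names follow common ARM notation, the function name is hypothetical, and Monitor and Hyp modes are left out because the rules above do not cover them.

```python
# Modes entered by an exception; each has its own R13 and R14.
EXCEPTION_MODES = {"fiq", "irq", "svc", "abt", "und"}
ALL_MODES = EXCEPTION_MODES | {"usr", "sys"}


def physical_register(mode, n):
    """Name of the physical register reached as R<n> in the given mode."""
    if mode not in ALL_MODES:
        raise ValueError(f"unsupported mode: {mode}")
    if not 0 <= n <= 15:
        raise ValueError("register number must be 0..15")
    if n <= 7 or n == 15:
        return f"R{n}"  # never banked
    if n <= 12:
        # R8-R12 are banked only in FIQ mode.
        return f"R{n}_fiq" if mode == "fiq" else f"R{n}"
    # R13 (SP) and R14 (LR): User and System share one copy.
    return f"R{n}_{mode}" if mode in EXCEPTION_MODES else f"R{n}"
```

So `physical_register("irq", 13)` is `"R13_irq"`, while `physical_register("sys", 13)` and `physical_register("usr", 13)` both give `"R13"`. The extra FIQ copies of R8 through R12 let a fast interrupt handler work without saving those registers first.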

The Current Program Status Register (CPSR) has the following 32 bits:

  • M (bits 0–4) are the processor mode bits.
  • T (bit 5) is the Thumb state bit.
  • F (bit 6) is the FIQ disable bit.
  • I (bit 7) is the IRQ disable bit.
  • A (bit 8) is the imprecise data abort disable bit.
  • E (bit 9) is the data endianness bit.
  • IT (bits 10–15 and 25–26) are the if-then state bits.
  • GE (bits 16–19) are the greater-than-or-equal-to bits.
  • DNM (bits 20–23) are the do-not-modify bits.
  • J (bit 24) is the Java state bit.
  • Q (bit 27) is the sticky overflow bit.
  • V (bit 28) is the overflow bit.
  • C (bit 29) is the carry/borrow/extend bit.
  • Z (bit 30) is the zero bit.
  • N (bit 31) is the negative/less than bit.
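
The bit layout above translates directly into shifts and masks. Here is a minimal Python sketch that splits a 32-bit CPSR value into the listed fields; the function name is made up, and the way the two IT pieces are joined (bits 10–15 as the upper six bits, bits 25–26 as the lower two) is based on the architecture's IT[7:0] definition rather than on anything stated in this post.

```python
def decode_cpsr(value):
    """Split a 32-bit CPSR value into the fields listed above."""
    value &= 0xFFFFFFFF

    def bit(position):
        return (value >> position) & 1

    return {
        "N": bit(31),
        "Z": bit(30),
        "C": bit(29),
        "V": bit(28),
        "Q": bit(27),
        "J": bit(24),
        "GE": (value >> 16) & 0xF,
        # IT[7:2] sits in bits 15-10 and IT[1:0] in bits 26-25.
        "IT": (((value >> 10) & 0x3F) << 2) | ((value >> 25) & 0x3),
        "E": bit(9),
        "A": bit(8),
        "I": bit(7),
        "F": bit(6),
        "T": bit(5),
        "M": value & 0x1F,
    }
```

For example, `decode_cpsr(0x600000D3)` reports Z and C set, both IRQ and FIQ disabled (I = 1, F = 1), ARM state (T = 0), and mode bits 0x13.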