Arm Architecture, SoC Development, and AI Acceleration
1. Introduction to Arm
Arm is a processor design company that creates blueprints (BP) and licenses both its ISA (Instruction Set Architecture) and microarchitecture designs.
๐ Arm Developer
Armโs CPU architectures are implemented by a wide range of microarchitectures.
- Architecture (ISA) defines:
- Basic instruction set
- Exception handling
- Memory model
Arm Profiles
- A (Application) Profile: Runs full systems like Linux
- R (Real-time) Profile: Runs RTOS or bare-metal code (BMC), for safety-critical systems
- M (Microcontroller) Profile: Runs RTOS or BMC, optimized for efficiency
2. Arm Architecture Profiles
A-Profile
Supports both 32-bit and 64-bit execution:
- AArch32:
T32+A32IS (backward compatible with Armv7-A) - AArch64:
A64IS (modern 64-bit execution)
Latest versions (2025):
- v9.4-A (in sync with v8.9-A)
- Armv9 introduces:
- SVE2 (Scalable Vector Extension 2)
- aSIMD (NEON SIMD)
- Security isolation
๐ SIMD overview
๐ SVE2 documentation
Key Milestones
- v8.1: AArch64 introduction
- v8.2: FP16 support (Half-precision floating point)
- v8.3: Improved JS data type conversion
- v8.4: Memory partitioning, SHA acceleration
- v8.5 / v9.0: Memory tagging, branch target identification
- v8.6 / v9.1: GEMM (General Matrix Multiply)
- v8.7 / v9.2: PCIe hotplug
- v8.8 / v9.3: PAC enhancements, optimized
memcpy(),memset()
R-Profile
Designed for time-sensitive and safety-critical systems.
Features:
- PMSA & MPU
- Functional safety (ISO 26262, IEC 61508 certified from R52+)
- SSD/HDD controllers (R82 supports DRAM up to 1TB)
- Safety Island for isolation and rapid response
M-Profile
Optimized for microcontrollers and efficiency.
- v8-M Baseline: Minimalist, bare-metal
- v8-M Mainline: Feature-rich extension
- v8-M23: Successor to v6-M
- v8-M33: Successor to v7-M
Supports Helium technology (vector extension for ML & DSP).
- Cortex-M55, M85 support Helium
- Difference vs Neon:
- Helium = designed for small, low-power CPUs
- Optimized for loop prediction, lane prediction, and complex math ops
- Lacks a secure monitor
3. Arm System IP
AMBA (Advanced Microcontroller Bus Architecture)
Provides flexible interconnect protocols.
- APB: Low-bandwidth control (registers, system peripherals)
- AHB: High-speed single-clock edge
- AXI-CE: System-wide coherency
- CHI: High-performance coherent interconnect
- C2C (Chip-to-Chip): Multichip SMP & coherent links
Benefits:
- Efficient IP reuse
- Flexibility
- Compatibility
- Ecosystem support
Exception Levels (Armv8 & v9)
- EL0: Applications
- EL1: OS Kernel
- EL2: Hypervisor
- EL3: Trusted firmware
Armv9 enhancements:
- Memory tagging
- Secure EL2
- Pointer authentication
- Branch target identifier
- Crypto extensions
4. SoC Development Flow
Armโs SoC design flow is similar to other chips:
- Frontend
- Define system requirements
- Choose cores, IP, and system-level optimizations
- IP Selection & Integration
- Choose pre-verified RTL IP
- Ensure EDA compatibility
- Frontend Design & Verification
- RTL design (Verilog)
- Testbenches & constraints validation
- Software Development (Parallel)
- Virtual platforms (Arm FVPs)
- Firmware, bootloaders, drivers
- HW/SW co-design
- Backend Implementation
- Logic synthesis โ gate-level netlist
- Floorplanning, place & route
- Clock tree, parasitic extraction
- STA, DFM, DRC/LVS checks
- Tape-out
- GDSII layout to foundry
- Validation
- FPGA / validation boards
- Mass Production
- Yield optimization
Key Components of Arm SoCs
- CPU cores: Cortex-A, R, M
- Memory subsystems: caches, SRAM, ROM
- AMBA interconnect
- System IP & Corstone (prebuilt compute subsystems)
- Peripherals: UART, I2C, USB, SPI, GPIOs, timers, ADC/DAC
- Accelerators: AI/ML via Ethos NPUs
- Security: TrustZone, secure boot, TPM
- Power/Clock management: DVFS, gating, power islands
๐ SoC Labs
5. Arm Ethos-U (NPU)
Armโs Ethos-U NPUs provide dedicated ML acceleration.
- U55: 256 MAC (CNN focus)
- U65: 512 MAC
- U85: 2048 MAC (Transformer focus)
Workflow
- NN ops assigned to NPU
- Conditions & post-processing on CPU
- Weights in flash, activations in SRAM
- CPU triggers execution, handles interrupts
Standards & Software:
- TOSA (Tensor Operator Set Architecture) support
- CMSIS-NN kernels for Cortex-M fallback
- uTFL (TensorFlow Lite Micro) support (INT8 only)
- Vela compiler for operator partitioning & quantization
- Arm NN: ML framework bridge
- KleidiAI: Optimized NN runtime
6. CPU & GPU Optimization
CPU Optimizations
- Instruction-level parallelism (pipelines, OoO execution)
- Data-level parallelism (SIMD, SME)
- Branch prediction & speculative execution
- Cache prefetching & tuning
- Power efficiency (DVFS, C-states)
- Multi-core parallelism
GPU Optimizations
- SIMD & SIMT execution
- Memory hierarchy optimization
- Multi-precision support
- MMA units (like tensor cores)
7. Neural Network Optimizations
- Quantization (linear + QAT)
- Pruning (remove redundant layers)
- Knowledge Distillation (teacher/student models)
- Matrix optimizations:
- Loop reordering
- Tiling
- Loop unrolling
- Multi-core parallelism (OpenMP, SMT)
Advanced Features
- SVE2 + SME (Scalable Matrix Extension)
- Dedicated tile storage (ZA)
- Streaming execution mode
All of these are integrated into KleidiAI libraries.
๐ Arm University GitHub
๐ Arm Developer Labs
8. Conclusion
Armโs ecosystem spans CPU, GPU, and NPU architectures, connected by robust system IP and optimized through software libraries.
From application processors (A-profile) to real-time controllers (R-profile) and ultra-low-power microcontrollers (M-profile), Arm enables scalable solutions across domains such as IoT, AI, cloud, and mobile computing.
The integration of Ethos NPUs, Helium, Neon, SVE2, and SME highlights Armโs commitment to AI/ML acceleration and energy-efficient computing for the next generation of devices.
Note: (Latest information: 2025 Aug), any further updates from Arm are not get updated here.