Improving the Performance of the PULPissimo SoC for Machine Learning Applications

GitHub: janinduleelananda/Pulpissimo-Processor-Design: This repository contains proposed design changes to pulpissimo processor design and it's implementation on Verilog. (github.com)

PULPissimo is the microcontroller architecture of the more recent PULP chips. This post discusses the changes we proposed and implemented in the architecture to increase its performance for the MobileNetV1 application. We propose a three-stage pipeline consisting of the following stages:

Instruction Fetch
Decode and Execute
Write Back
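The effect of overlapping these three stages can be pictured with a small model. The sketch below is only an illustration of an ideal pipeline, assuming no stalls or flushes; the function names and the stage labels are hypothetical and are not taken from the RTL.

```python
def pipeline_cycles(num_instructions, num_stages=3):
    """Cycles needed by an ideal pipeline (assumes no stalls or flushes)."""
    if num_instructions <= 0:
        return 0
    # The first instruction needs num_stages cycles to pass every stage;
    # each later instruction finishes one cycle after the previous one.
    return num_stages + (num_instructions - 1)


def occupancy(num_instructions, stages=("IF", "ID/EX", "WB")):
    """For each cycle, report which instruction index sits in each stage."""
    table = []
    for cycle in range(pipeline_cycles(num_instructions, len(stages))):
        row = {}
        for position, name in enumerate(stages):
            instruction = cycle - position
            # None marks a stage that holds no instruction in this cycle.
            row[name] = instruction if 0 <= instruction < num_instructions else None
        table.append(row)
    return table


for cycle, row in enumerate(occupancy(4)):
    print(cycle, row)
```

In this idealized model, four instructions finish in six cycles instead of the twelve that would be needed if each instruction had to pass all three stages before the next one started.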

In the current design, the same ALU is used both for checking branch conditions and for calculating branch targets. To improve efficiency, we recommend introducing a dedicated branch target ALU that is responsible for branch target calculations. Furthermore, the multiplication and division unit currently shares the common ALU for addition operations. We suggest integrating a separate adder/subtractor into the multiplication/division unit so that the unit can perform its own arithmetic without occupying the main ALU. The HWPE is designed with memristors to improve data processing speed and memory storage, using their benefits to achieve faster performance and more parallel processing.
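A rough way to reason about the benefit of the branch target ALU is a simple cycle estimate. The sketch below assumes that a shared ALU costs one extra cycle on every taken branch and that a dedicated branch target ALU removes that cycle. The workload numbers, the one-cycle penalty, and all names are hypothetical; they are not measurements from our design.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    instructions: int      # total instructions executed (hypothetical)
    taken_branches: int    # how many of them are taken branches (hypothetical)


def estimated_cycles(workload, dedicated_branch_alu, taken_branch_stall=1):
    """Very rough estimate: one cycle per instruction plus branch stalls."""
    # Assumption: the stall disappears when the target has its own ALU.
    stall = 0 if dedicated_branch_alu else taken_branch_stall
    return workload.instructions + workload.taken_branches * stall


workload = Workload(instructions=1000, taken_branches=150)
shared = estimated_cycles(workload, dedicated_branch_alu=False)
dedicated = estimated_cycles(workload, dedicated_branch_alu=True)
print(shared, dedicated)
```

Under these assumptions the saving grows with the share of taken branches in the program, which is why branch-heavy code is where this change would be expected to matter most.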

Machine learning applications often handle diverse data streams originating from various sensors, which requires significant computational power. In response to this challenge, a new class of energy-efficient computing platforms has recently emerged. One member of this family is PULP, an open-source platform comprising both hardware and software components and designed specifically for IoT applications. PULP's objective is twofold: minimizing energy consumption and meeting the demanding computational requirements of IoT applications. PULPissimo is the microcontroller architecture found in the latest PULP chips. In our setup, we use the Ibex core as the primary core and have introduced various modifications and enhancements to the Ibex core and the HWPE unit. Ibex, previously known as Zero-riscy, is a single-issue, in-order core with a two-stage pipeline. It supports the base integer instruction set (RV32I version 2.1) and compressed instructions (RV32C version 2.0).

❖ Design changes and objectives

Implementing a three-stage pipeline, comprising Instruction Fetch, Decode and Execute, and Write Back, offers significant advantages to PULPissimo. First, it boosts throughput by allowing the processor to handle multiple instructions simultaneously in different stages of execution. Second, it divides the datapath into distinct stages with specific functions, so instructions move through the pipeline more rapidly and programs execute more quickly. Overall, the three-stage pipeline brings improved performance, reduced latency, and more efficient resource utilization.

In the current design, the same ALU is used both for checking branch conditions and for calculating branch targets. Because the two tasks share one resource, the branch decision and the branch target calculation must go through separate cycles, which leads to contention, pipeline stalls, and reduced overall performance. We therefore introduce a dedicated branch target ALU that is responsible exclusively for branch target calculation. It allows the branch condition check and the branch target calculation to execute simultaneously, and it minimizes contention for the primary ALU so that non-branch instructions can proceed through the pipeline without interruption.

Furthermore, in the previous design, the multiplication and division unit shared the common ALU for addition operations, which could lead to contention and suboptimal performance, particularly for complex arithmetic operations. To address this limitation, we recommend integrating a dedicated adder into the multiplication/division unit. This adder provides the adder result and the is_equal result required by the multiplication/division unit and allows it to work as a separate unit without affecting the concurrent execution in the main ALU.

In our proposal, we introduced a new Hardware Processing Engine (HWPE) featuring a Multiply Accumulate Unit (MAU), based on the existing multipulpy hardware accelerator and constructed with memristive crossbars. This MAU comprises several crossbar arrays, each of size 512 × 512. Because the size of the crossbar arrays has a direct impact on power consumption, we had to keep it small; consequently, matrix operations have to be divided into sub-operations and solved piece by piece. Our HWPE includes two Finite State Machines (FSMs) with 12 states and 24 transitions between them, akin to the original HWPE. Additionally, we have incorporated two micro-code processors for each engine, resulting in a total of two engines for calculations. Each engine receives a sub-vector and a sub-matrix and multiplies them using the ISAAC MAU. The engines monitor the ready and valid signals of the input matrix and vector streamers. Once the data becomes available, it is transferred to the MAU for processing. Finally, each engine streams the results from the IMA to the result streamer, mirroring the functionality of the original HWPE.
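The divide-and-conquer handling of matrix operations described above can be sketched in software. The example below is a minimal illustration, not the HWPE implementation: it splits a matrix-vector product into sub-matrix and sub-vector tiles no larger than the crossbar size and accumulates the partial results. The function names are hypothetical, and the plain Python multiply stands in for one crossbar operation.

```python
def matvec(matrix, vector):
    """Reference matrix-vector product (stand-in for one crossbar operation)."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def tiled_matvec(matrix, vector, tile=512):
    """Compute matrix @ vector using tiles of at most tile x tile elements."""
    rows, cols = len(matrix), len(vector)
    result = [0] * rows
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            # Each engine would receive one sub-matrix and one sub-vector.
            sub_matrix = [row[c0:c0 + tile] for row in matrix[r0:r0 + tile]]
            sub_vector = vector[c0:c0 + tile]
            partial = matvec(sub_matrix, sub_vector)
            # Partial sums from different column tiles are accumulated.
            for offset, value in enumerate(partial):
                result[r0 + offset] += value
    return result


# Small demonstration with a tile size of 2 instead of 512.
demo_matrix = [[r * 7 + c for c in range(7)] for r in range(5)]
demo_vector = list(range(1, 8))
print(tiled_matvec(demo_matrix, demo_vector, tile=2))
```

The tiled result matches the untiled product for any tile size, which is the property that lets a fixed-size crossbar serve matrices larger than itself.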

We mainly changed three RTL files in the Ibex core and made minor changes to other files.

• ibex_core.sv

• ibex_alu.sv

• ibex_ex_block.sv

We also introduced a new module in the following file.

• ibex_branch_mul_alu.sv

We have modified the optional writeback stage included in the latest Ibex core to fit our implementation.

The Eight-Stage Evaluation Process for the Modified Ibex Core

We have employed an eight-stage process to analyze the performance of both the Ibex core and the modified Ibex core. The following outlines these eight stages.

1. Executed the Simple System with the original Ibex core.

2. Executed the Simple System with the addition of a separate adder/subtractor for the multiplier.

3. Executed the Simple System with the addition of a branch target ALU + adder/subtractor unit.

4. Executed the Ibex core with the writeback stage + adder/subtractor unit + branch target ALU.

5. Evaluated the original Ibex core with CoreMark.

6. Evaluated the Ibex core with the addition of a separate adder/subtractor for the multiplier with CoreMark.

7. Evaluated the Ibex core with a branch target ALU + adder/subtractor unit with CoreMark.

8. Evaluated the Ibex core with the writeback stage + adder/subtractor unit + branch target ALU with CoreMark.

We compare the output file generated in the first step with the output files produced in the following three steps to verify the correctness of all three modified configurations. The output of the Simple System is saved in the 'ibex_simple_system.log' file. We have provided four screenshots, each showing the output of one step. Since all the outputs are identical, the RTL for each step behaves the same as the original core on this program, and we can reasonably conclude that the modified RTL works as intended.
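The comparison of the log files can also be automated instead of being checked by eye. The sketch below is one possible way to do it, assuming each run's 'ibex_simple_system.log' has been copied to its own location; the function names and any paths passed to them are hypothetical.

```python
from pathlib import Path


def first_difference(reference_lines, candidate_lines):
    """Return the 1-based number of the first mismatching line, or None."""
    pairs = zip(reference_lines, candidate_lines)
    for number, (ref, cand) in enumerate(pairs, start=1):
        if ref != cand:
            return number
    if len(reference_lines) != len(candidate_lines):
        # One log is a prefix of the other: report the first extra line.
        return min(len(reference_lines), len(candidate_lines)) + 1
    return None


def compare_logs(reference_path, candidate_paths):
    """Map each candidate log to None (identical) or its first differing line."""
    reference = Path(reference_path).read_text().splitlines()
    report = {}
    for path in candidate_paths:
        candidate = Path(path).read_text().splitlines()
        report[str(path)] = first_difference(reference, candidate)
    return report
```

Calling compare_logs with the log from the first step as the reference and the logs from steps two to four as candidates returns None for every configuration whose output is identical to the original.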

In the remaining steps, we evaluate the performance of the modified RTL designs using the CoreMark benchmark and then calculate the CoreMark/MHz value for each design. The CoreMark/MHz metric measures single-thread performance per unit of clock frequency. It is obtained by dividing the single-core CoreMark score by the clock frequency used during the benchmark run. In effect, it indicates how much work the core completes within a given number of clock cycles.

Accordingly, we can calculate the CoreMark/MHz value for the original Ibex core (Step 5) using the following equation.

$$\text{CoreMark/MHz} = \frac{10^{6}}{\text{Total Ticks}} \times \text{Iterations} = \frac{10^{6}}{4244465} \times 10 \approx 2.356$$
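The same calculation can be written as a small helper. This is a minimal sketch, assuming that the total ticks reported by CoreMark are clock cycles: the CoreMark score is iterations per second, so dividing it by the clock frequency in MHz leaves iterations × 10^6 / ticks. The function name is hypothetical.

```python
def coremark_per_mhz(total_ticks, iterations):
    """CoreMark/MHz from the tick count and iteration count of one run.

    Assumes total_ticks counts clock cycles, so the clock frequency cancels:
    (iterations / seconds) / MHz = iterations * 1e6 / total_ticks.
    """
    if total_ticks <= 0:
        raise ValueError("total_ticks must be positive")
    return 1e6 / total_ticks * iterations


# Values from the run with the original Ibex core.
original_score = coremark_per_mhz(total_ticks=4244465, iterations=10)
print(round(original_score, 3))
```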

In the sixth step, where we introduced the adder/subtractor unit for the multiplier, the total number of ticks did not change, so the CoreMark/MHz value also stayed the same. This unit alone therefore gave no performance improvement on this benchmark. In Step 7, we ran the design that incorporates the branch target ALU and the adder/subtractor unit, and calculated the CoreMark/MHz value for this configuration.

According to this we can calculate the CoreMark/MHz value by using the following equation.

$$\text{CoreMark/MHz} = \frac{10^{6}}{\text{Total Ticks}} \times \text{Iterations} = \frac{10^{6}}{4115819} \times 10 \approx 2.429$$

In the final step, we ran the benchmark once more, this time evaluating the design that contains both the branch target ALU and the writeback stage. This configuration gave the highest score of all the designs we tested.

According to this we can calculate the CoreMark/MHz value by using the following equation.

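$$\text{CoreMark/MHz} = \frac{10^{6}}{\text{Total Ticks}} \times \text{Iterations} = 2.46$$

(The tick count recorded alongside this result, 4115819, is the same as in Step 7 and would give 2.429 with 10 iterations, so the tick count for this run should be rechecked against the CoreMark log.)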

These results show that performance increases when the branch target ALU is integrated, and increases further when the branch target ALU and the writeback stage are incorporated together with the add/sub unit. CoreMark/MHz rises from 2.356 for the original core to 2.429 and 2.46, respectively.
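To put the reported scores in proportion, the relative gain over the original core can be computed as follows. This is a small sketch using only the CoreMark/MHz values reported above; the dictionary keys and the function name are hypothetical labels.

```python
def percent_gain(baseline, improved):
    """Relative improvement of improved over baseline, in percent."""
    return (improved - baseline) / baseline * 100.0


# CoreMark/MHz values reported for each configuration.
scores = {
    "original": 2.356,
    "branch_target_alu_plus_addsub": 2.429,
    "writeback_plus_branch_target_alu_plus_addsub": 2.46,
}

gains = {
    name: percent_gain(scores["original"], value)
    for name, value in scores.items()
    if name != "original"
}
for name, gain in gains.items():
    print(f"{name}: {gain:.1f}% over the original core")
```

With these values, the gains work out to roughly 3% for the branch target ALU configuration and roughly 4% for the configuration that also includes the writeback stage.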

Janindu's BLOG