Bigger is not always better: exploiting parallelism in NN hardware
When considering hardware platforms for executing high performance NNs (Neural Networks), automotive system designers frequently determine the total compute power by simply adding up each NN’s requirements. However, this approach usually leads to demands for a single large NN accelerator. By considering the task-parallel nature of the NN inferences within automotive software, a far more flexible approach, using multiple cores, can deliver superior results with far greater scalability and power efficiency.
Written by Tony King-Smith
Part 1: Looking “under the hood” of AI workloads
When designing a hardware platform capable of executing the AI workloads of automated driving, many factors need to be considered. However, the biggest one is uncertainty: what workload does the hardware actually need to execute, and how much performance is needed to safely and reliably execute it? How much of the time do I need to run that workload?
The challenge becomes even harder when designing an SoC (System on Chip) for automotive. Semiconductor vendors must sell their products to multiple customers to get sufficient return on the massive investment required to bring a new SoC to market. But how do they decide what workloads will best represent what customers will want to execute? And how will assumptions made at the start of the design process translate to real-world use, when automotive AI is such a fast-moving and constantly evolving field?
Also, as the complexity and sophistication of AI workloads continues to rise, so does their power consumption. Given the enormous amount of compute power we can now integrate on one chip, one way SoC designers tackle this problem is to make use of what is known as “dark silicon”: switching off all power to areas of the chip not being used.
This approach has been extensively used in mobile phones for the past few years and can make a substantial difference to power consumption — but only if the chip is designed alongside the software from the start to utilize it. This relies on the fact that the AI subsystems will use different workloads, that are switched in and out from time to time according to the current operating conditions and environment (e.g. highway, urban or parking modes, low vs high volumes of traffic, different weather conditions).
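The idea of switching workloads in and out by operating mode can be sketched in a few lines. The following is a minimal illustrative sketch, not a real power-management API: the mode names and subsystem names are hypothetical, and a real implementation would drive actual power domains on the chip.

```python
# Illustrative sketch of mode-based "dark silicon" power gating.
# All mode and subsystem names here are hypothetical examples.

# Which NN subsystems each operating mode actually needs.
MODE_SUBSYSTEMS = {
    "highway": {"front_camera_nn", "radar_nn", "lane_nn"},
    "urban":   {"front_camera_nn", "surround_camera_nn", "pedestrian_nn"},
    "parking": {"surround_camera_nn", "ultrasonic_nn"},
}

# Every subsystem present on the chip.
ALL_SUBSYSTEMS = set().union(*MODE_SUBSYSTEMS.values())

def power_state(mode):
    """Return (powered_on, powered_off) subsystem sets for a driving mode."""
    active = MODE_SUBSYSTEMS[mode]
    return active, ALL_SUBSYSTEMS - active

on, off = power_state("parking")
# In parking mode only two subsystems draw power; everything else is
# dark silicon until the vehicle switches mode.
```

The point of the sketch is that the mapping from operating conditions to active subsystems must be known at chip-design time, which is exactly why hardware and software need to be co-designed.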
Without a deep understanding of the range of workloads and how they each operate, SoC designers are forced to target the worst case, which often means large parts of the chip’s capabilities are rarely used. For an automated vehicle, that means unnecessary cost, and much higher power consumption. That is why AImotive develops both software and hardware technologies: allowing us to take a holistic approach to system design.
The total is the sum of many parts
When designing automated driving solutions based on CNNs, designers are faced with many challenges, such as:
- What are the worst-case workloads I need to execute, and do I know exactly what those workloads will be?
- How do I enable continuous improvement over the life of the system, as NN research continues to advance even after the design of the hardware has been frozen?
- How can I make my hardware platform scalable, so I can allow for performance upgrades, and smaller or larger configurations depending on the vehicle model or sensors?
What is the workload?
One challenge of designing hardware systems is that the final software is generally far from ready when the hardware design is being finalized. Indeed, if a new chip is part of the solution, it needs to be finalized 3–4 years ahead of SOP (Start of Production), at a time when often even the higher-level architecture is not finalized and implementation has barely begun.
Traditionally, engineers solve these problems by over-specifying the hardware platform. By building in a level of contingency for speed, memory etc., the hardware designers can be confident that the hardware will be able to support the final solution.
Executing NNs requires extremely high-performance engines, measured in many tens or hundreds of TOPS (Trillions of Operations Per Second). However, contingency comes with size and power consumption. Considering automotive power constraints, the goal is often to reduce the size of NN accelerators, a goal that does not fit well with the traditional contingency-driven approach.
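The cost of contingency can be made concrete with a back-of-envelope calculation. The numbers below are purely illustrative assumptions, not measurements of any real accelerator:

```python
# Back-of-envelope cost of over-specification; all numbers are hypothetical.
required_tops = 60.0   # estimated worst-case NN workload
contingency = 1.5      # 50% headroom because the software is not yet final
tops_per_watt = 2.0    # assumed accelerator efficiency

provisioned_tops = required_tops * contingency     # 90 TOPS provisioned
peak_power_w = provisioned_tops / tops_per_watt    # 45 W peak power

# The contingency alone accounts for 15 W of peak power (and the
# corresponding die area) that is rarely, if ever, exercised.
headroom_power_w = (provisioned_tops - required_tops) / tops_per_watt
```

Because power and die area scale roughly linearly with provisioned TOPS, every point of contingency is paid for directly in watts and silicon cost over the life of the vehicle.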
This is why we designed aiWare hardware IP first and foremost as a highly scalable architecture. Our aiWare hardware IP complements our modular aiDrive software technology portfolio, by offering our customers and partners alternatives to mainstream solutions that are often over-specified.
Do we need one big NN accelerator engine?
The simple answer: no! When we were designing the architecture for aiWare, we looked at all the different NN workloads within an AI-based system, which together required more than 100 TOPS. Based on experience with our own aiDrive software, as well as extensive discussions with our partners about how they are building their solutions, we saw that it is never one single large “monolithic” NN workload: it is a series of NN workloads, some applying the same intelligence on different sensor data in parallel, some executing similar NN tasks pipelined.
AI systems use multiple NNs in various ways, breaking down the task into a series of modules. Furthermore, pre-processing (the work done on each sensor’s raw data before it is combined with data from other sensors) often dominates the total TOPS budget. It’s just one of the ways that parallelism inherent in a NN-based system can be leveraged to design more flexible hardware platforms.
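A hypothetical budget breakdown illustrates the point. The sensor names and TOPS figures below are invented for illustration; the structure, with many independent per-sensor workloads feeding a smaller fusion stage, is what matters:

```python
# Hypothetical breakdown of a ~100 TOPS automated-driving NN budget.
# Each pre-processing entry is an independent workload, so each can map
# to its own smaller accelerator core instead of one monolithic engine.
preprocessing_tops = {
    "front_camera_1": 18,
    "front_camera_2": 18,  # same network on a different sensor: data parallel
    "side_camera_l": 10,
    "side_camera_r": 10,
    "radar": 8,
    "lidar": 12,
}
fusion_and_planning_tops = 24  # downstream, pipelined behind pre-processing

total = sum(preprocessing_tops.values()) + fusion_and_planning_tops  # 100
preprocessing_share = sum(preprocessing_tops.values()) / total       # 0.76

# No single workload needs anywhere near the full budget, so no single
# core has to deliver 100 TOPS.
largest_single_workload = max(max(preprocessing_tops.values()),
                              fusion_and_planning_tops)
```

In this (invented) example, pre-processing consumes 76% of the budget, yet the largest single workload is only 24 TOPS: several modest cores running in parallel can cover the whole system.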
Do you need to scale all performance equally?
Another challenging trend in NN research is that as knowledge grows, and demands for greater safety and robustness increase, performance requirements also rise, often multiplying. Advances in AI technology can result in significant improvements in safety and quality of sensing and decision making. So how do we take advantage of these advances in a vehicle designed for a 10–20-year operating life?
A single NN accelerator core is, by definition, limited in capacity, however powerful. What happens when you exceed the capabilities of that engine? If the NN accelerator is integrated into an SoC, will you be forced to move to a new SoC, if the primary reason for the upgrade is to substantially increase NN inference performance to keep up with the latest NN algorithms? That will require re-validation of all the software of every function executing on that SoC — not just the NN parts — a costly and time-consuming process if all you want is more performance from the NN engine.
There will always be reasons to increase other parts of an SoC like the CPU, memory, comms etc. However, that needs to be traded off against the expensive and time-consuming re-validation of any new SoC and the complex embedded software closely tied to it. By adopting an external NN accelerator approach, you can delay having to upgrade the SoC containing the host processor itself, just as PC gamers upgrade their GPU while keeping the same CPU and chassis.
Should hardware scale over lifetime?
As experience of integrating AI into automated vehicles grows rapidly, so too does the need for modularity and scalability. These days, cars often use common platforms for the underlying chassis, which can then be adapted to the various models. OEMs and Tier1s are now starting to apply similar concepts to vehicle software, by bringing together the different software components — sometimes distributed over 50 or 100 ECUs (Electronic Control Units) — into a common software platform.
As cars become increasingly upgradeable during their lifetime, the hardware platform also needs to move towards a standard, modular, scalable and upgradeable solution. But the different types of processors need to be upgradable separately: in particular the NN acceleration hardware, where workloads and algorithms are likely to change dramatically every year for the foreseeable future.
Make sure to follow us, so you don’t miss our upcoming whitepaper on how parallelism and deeper understanding of the NN workloads enables a variety of different strategies for implementing NN accelerators.