
Real-time Image Processing on Low Cost Embedded


Sunil Shah

Electrical Engineering and Computer Sciences
University of California at Berkeley

Technical Report No. UCB/EECS-2014-117

May 20, 2014



Compiler Optimisations. gcc and other compilers offer a myriad of optional performance optimisations. For example, they offer user-specified flags that cause the generated executable to be optimised for a certain instruction set. When compiling libraries such as OpenCV for an embedded computer, it is typical to cross-compile, since compilation on the embedded computer itself typically takes many times longer. Cross-compilation is compilation on a more powerful host computer that has a compiler for the target architecture available to it. In our case, compilation was on a quad-core x86 computer for an ARM target architecture. At this stage, it is possible to pass certain parameters to the compiler that permit it to use ARM NEON instructions in the generated binary code.

NEON is an “advanced SIMD instruction set” introduced in the ARMv7 architecture, the basis for all modern ARM processors. Single instruction, multiple data (SIMD) instructions are data-parallel instructions that allow a single operation to be performed in parallel on two or more operands. While the compiler has to be conservative in how it utilises these in the generated binary code so that correctness is guaranteed, they allow some performance gain.
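One common reason the compiler must be conservative is pointer aliasing: it cannot vectorise a loop if a store through one pointer might change what a later load through another pointer reads. The following is a minimal sketch of a loop written so that a vectorising compiler is free to use SIMD instructions. The function name `average_images` is hypothetical and does not come from the report, and whether NEON code is actually emitted depends on the compiler version and flags (for example gcc's `-O3` together with `-mfpu=neon` on a 32-bit ARM target; the report does not list the flags it used).

```c
#include <stddef.h>
#include <stdint.h>

/* Average two 8-bit greyscale images pixel by pixel, rounding up.
 * The `restrict` qualifiers promise the compiler that dst, a and b never
 * overlap. Without that promise it has to assume a store to dst could
 * change a later load from a or b, so it must either insert a runtime
 * overlap check or fall back to scalar code. */
void average_images(uint8_t *restrict dst,
                    const uint8_t *restrict a,
                    const uint8_t *restrict b,
                    size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* Both operands are promoted to int, so the sum cannot overflow. */
        dst[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
    }
}
```

The code is plain C and gives the same result whether or not the loop is vectorised, which is the correctness guarantee the compiler has to preserve.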

Additionally, library providers may implement optional code paths that are enabled when explicitly

compiling for an ARM target architecture. For instance, the OpenCV maintainers have several functions

that have optional NEON instructions implemented. This also results in a performance boost.

Library Optimisations. OpenCV utilises other libraries to perform fundamental functions such as image format parsing and multi-threading. For core functions, such as parsing of image formats, a standard library is included and used. For advanced functions, support is optional and so these are disabled by




We experimented with enabling multi-threading functionality by compiling OpenCV with Intel’s Threading Building Blocks (TBB) library. This library provides a multi-threading abstraction, and support for it is available in select OpenCV functions [3].

Secondly, we re-compiled OpenCV, replacing the default JPEG parsing library, libjpeg, with a performance-optimised variant called libjpeg-turbo. This claims to be 2.1 to 5.3x faster than the standard library [8] and has ARM-optimised code paths. Using this, it is possible to capture images at 30 frames per second on the BeagleBone Black [5].

Note also that we were unable to use the Intel Integrated Performance Primitives (IPP) library. IPP is the primary method of compiling a high-performance version of OpenCV. However, it accomplishes this through significant customisation for x86 processors, relying on instruction sets only available on desktop computers (e.g. Streaming SIMD Extensions (SSE), Advanced Vector Extensions (AVX)).

Modern embedded computers typically use some sort of aggressive CPU scaling in order to minimise power consumption through an ‘ondemand’ governor [11]. This is beneficial for consumer applications where battery life is a significant concern and commercial feature.

However, for a real-time application such as this, CPU scaling is undesirable since it introduces a very slight latency as load increases and the CPU frequency is increased by the governor. This is more undesirable still in architectures such as the big.LITTLE architecture used on the Odroid XU, which actually switches automatically between a low-power optimised processor and a performance optimised processor as load increases.

This can be mitigated by manually setting the governor to the ‘performance’ setting. This effectively disables frequency scaling and forces the board to run at maximum clock speed.
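On Linux, the governor is usually selected by writing its name to a cpufreq file under sysfs. The sketch below shows one way this could be done from C. The helper name `set_governor` is hypothetical, and the path in the usage note assumes the board's kernel exposes the standard cpufreq sysfs interface; the report does not say which method was used.

```c
#include <stdio.h>

/* Write a governor name (e.g. "performance") to a cpufreq sysfs file.
 * Returns 0 on success and -1 on failure. On a real board this normally
 * requires root privileges. */
int set_governor(const char *sysfs_path, const char *governor)
{
    FILE *f = fopen(sysfs_path, "w");
    if (f == NULL) {
        return -1;
    }

    int ok = (fputs(governor, f) >= 0);

    /* fclose flushes the buffered write, so a write rejected by the
     * kernel shows up here at the latest. */
    if (fclose(f) != 0) {
        ok = 0;
    }
    return ok ? 0 : -1;
}
```

A call such as `set_governor("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor", "performance")` would typically be repeated for each CPU core, since the setting is exposed per core (or per frequency domain).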

As mentioned previously, modern ARM processors implement the NEON SIMD instruction set. Compilers such as gcc make instruction-set-specific SIMD instructions available to the programmer through additional functions called intrinsics. Intrinsics essentially wrap calls to the underlying SIMD instruction. Using intrinsics, it is possible to exploit data parallelism in our own code, essentially rewriting higher-level, non-parallel function calls with low-level function calls that operate on multiple data simultaneously. This approach is laborious and is therefore used sparingly. It can, however, yield significant performance improvements [9].
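To illustrate what rewriting a loop with intrinsics looks like, here is a minimal sketch that brightens an 8-bit image with a saturating add. The function `brighten` is a hypothetical example and is not taken from the report's code. The NEON path uses intrinsics from `<arm_neon.h>` and is only compiled when the compiler reports NEON support; other targets fall through to the scalar loop, which computes the same result.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Add `offset` to every pixel of an 8-bit image, clamping at 255. */
void brighten(uint8_t *dst, const uint8_t *src, uint8_t offset, size_t n)
{
    size_t i = 0;

#if HAVE_NEON
    /* Broadcast the offset into all 16 lanes of a 128-bit register. */
    const uint8x16_t voffset = vdupq_n_u8(offset);
    for (; i + 16 <= n; i += 16) {
        uint8x16_t pixels = vld1q_u8(src + i);          /* load 16 pixels  */
        uint8x16_t result = vqaddq_u8(pixels, voffset); /* saturating add  */
        vst1q_u8(dst + i, result);                      /* store 16 pixels */
    }
#endif

    /* Scalar tail, and the whole loop on targets without NEON. */
    for (; i < n; ++i) {
        unsigned sum = (unsigned)src[i] + offset;
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

Each iteration of the NEON loop processes 16 pixels with three instructions, which is the kind of data parallelism described above; the cost is that the vector width and the tail handling have to be managed by hand.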

Multi-threading is a promising approach. Certain library functions may inherently exploit multi-threading and there is a slight benefit to a single-threaded process from having multiple cores (the operating system will balance processes across cores). However, our single-threaded imple-




[1] 3DRobotics. 3DR Pixhawk. [Online]. 2014. url:

[2] 3DRobotics. APM 2.6 Set. [Online]. 2013. url:

[3] OpenCV Adventure. Parallelizing Loops with Intel Thread Building Blocks. [Online]. 2011. url:

[4] AUVSI. The Economic Impact of Unmanned Aircraft Systems Integration in the United States. [Online]. 2013. url:

[5] Michael Darling. How to Achieve 30 fps with BeagleBone Black, OpenCV, and Logitech C920 Webcam. [Online]. 2013. url:

[6] Pedro J Garcia-Pardo, Gaurav S Sukhatme, and James F Montgomery. “Towards vision-based safe landing for an autonomous helicopter”. In: Robotics and Autonomous Systems 38.1 (2002), pp. 19–29.

[7] Stanley R Herwitz et al. “Precision agriculture as a commercial application for solar-powered unmanned aerial vehicles”. In: AIAA 1st Technical Conference and Workshop on Unmanned Aerospace Vehicles. 2002.

[8] libjpeg-turbo. Performance. [Online]. 2013. url:

[9] Gaurav Mitra et al. “Use of SIMD vector operations to accelerate application code performance on low-powered ARM and Intel platforms”. In: Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International. IEEE. 2013, pp. 1107–1116.

[10] Stack Overflow. What is the best library for computer vision in C/C++? [Online]. 2009. url:

[11] Venkatesh Pallipadi and Alexey Starikovskiy. “The ondemand governor”. In: Proceedings of the Linux Symposium. Vol. 2. sn. 2006, pp. 215–230.

[12] E. Pereira, R. Sengupta, and K. Hedrick. “The C3UV Testbed for Collaborative Control and Information Acquisition Using UAVs”. In: American Control Conference. AACC. 2013.



[13] Kari Pulli et al. “Real-time computer vision with OpenCV”. In: 55.6 (2012), pp. 61–69.

[14] Morgan Quigley et al. “ROS: an open-source Robot Operating System”. In: . Vol. 3. 3.2. 2009.

[15] John L. Rep. Mica et al. “FAA Modernization and Reform Act of 2012”. In: (2012).

[16] Katie Roberts-Hoffman and Pawankumar Hegde. “ARM cortex-a8 vs. intel atom: Architectural and benchmark comparisons”. In: (2009).

[17] Srikanth Saripalli, James F Montgomery, and Gaurav S Sukhatme. “Vision-based autonomous landing of an unmanned aerial vehicle”. In: . Vol. 3. IEEE. 2002, pp. 2799–2804.

[18] Courtney S. Sharp, Omid Shakernia, and Shankar Sastry. “A Vision System for Landing an Unmanned Aerial Vehicle.” In: . IEEE, 2001, pp. 1720–1727.

[19] Eric Stotzer et al. “OpenMP on the Low-Power TI Keystone II ARM/DSP System-on-Chip”. In: . Springer, 2013, pp. 114–127.

[20] Sebastian Thrun et al. “Stanley: The robot that won the DARPA Grand Challenge”. In: 23.9 (2006), pp. 661–692.

[21] Chunhua Zhang and John M Kovacs. “The application of small unmanned aerial systems for precision agriculture: a review”. In: 13.6 (2012), pp. 693–712.

