Download Lifting Scheme Cores for Wavelet Transform PDF

TitleLifting Scheme Cores for Wavelet Transform
Author
LanguageEnglish
File Size1.1 MB
Total Pages119
Table of Contents
                            Introduction
Discrete Wavelet Transform
	Lifting Scheme
	2-D Decomposition
	Non-Separable Lifting Scheme
	Capabilities of Lifting Scheme
Computation Schedules
	Processors
	Field-Programmable Gate Arrays
	Graphics Processing Units
Lifting Vectorization
	Horizontal Vectorization
	Vertical Vectorization
	Diagonal Vectorization
Lifting Core
	Core Reorganization
	Treatment of Signal Boundaries
	Lifting Scheme Choice
	Parallel Cores
Multi-Dimensional Cores
	2-D Core Reorganization
	Parallel 2-D Cores
	Extension to Multiple Dimensions
Evaluation
	Image Processing
	JPEG 2000
	3-D Decomposition
	Parallel Processing on GPU
	FPGA Devices
	Vectorization
	Discussion
Conclusions
                        
Document Text Contents
Page 1

Lifting Scheme Cores for Wavelet Transform
Jádra schématu lifting pro vlnkovou transformaci

Ph.D. thesis / dizertační práce

Ing. David Bařina

supervised by / školitel
prof. Dr. Ing. Pavel Zemčík

Brno, 2015

Page 2

Abstract

The thesis focuses on efficient computation of the two-dimensional discrete wavelet trans-
form. The state-of-the-art methods are extended in several ways to perform the transform
in a single loop, possibly in multi-scale fashion, using a compact streaming core. This
core can further be appropriately reorganized to target the minimization of certain plat-
form resources. The approach presented here nicely fits into common SIMD extensions,
exploits the cache hierarchy of modern general-purpose processors, and is suitable for
parallel evaluation. Finally, the approach presented is incorporated into the JPEG 2000
compression chain, in which it has proved to be fundamentally faster than widely used
implementations.

Abstrakt

Práce se zaměřuje na efektivní výpočet dvourozměrné diskrétní vlnkové transformace.
Současné metody jsou v práci rozšířeny v několika směrech a to tak, aby spočetly tuto
transformaci v jediném průchodu, a to případně víceúrovňově, použitím kompaktního
jádra. Tohle jádro dále může být vhodně přeorganizováno za účelem minimalizace užití
některých prostředků. Představený přístup krásně zapadá do běžně používaných rozšíření
SIMD, využívá hierarchii cache pamětí moderních procesorů a je vhodný k paralelnímu
výpočtu. Prezentovaný přístup je nakonec začleňen do kompresního řetězce formátu
JPEG 2000, ve kterém se ukázal být zásadně rychlejší než široce používané implementace.

Keywords

discrete wavelet transform, lifting scheme, Cohen-Daubechies-Feauveau wavelet, SIMD,
CPU cache, parallelization, JPEG 2000

Klíčová slova

diskrétní vlnková transformace, schéma lifting, vlnka Cohen-Daubechies-Feauveau, SIMD,
cache CPU, paralelizace, JPEG 2000

Citation

Please cite this work as: D. Barina. Lifting Scheme Cores for Wavelet Transform. PhD
thesis, Brno University of Technology, Brno, 2015.

Page 59

CHAPTER 6. MULTI-DIMENSIONAL CORES 58

1 2 3
a dh v

Figure 6.5: Proposed non-separable lifting core of CDF 5/3 with three stages. The input
coefficients of active core are in bright box. The output ones are in dark one.

d v

h a

d v

h a

d v

h a

d v

h a

d v

h a

d v

h a

d v

h a

d v

h a

d v

h a

d v

h a

d v

h a

d v

h a

Figure 6.6: Shapes of the individual lifting steps of the newly proposed CDF 5/3 scheme
with three steps.

calculation of these coefficients is subdivided into separate parts. The sum of these parts
gives the desired result.

I will now explain this approach in detail. The starting point of the solution is the non-
separable lifting scheme of CDF 5/3 transform as described in [15]. The operations inside
the core were further appropriately reorganized. After some rewriting of expressions,
the following scheme has appeared. First, the scheme requires only 8 coefficients to
be passed between core iterations. This means that this new scheme has the same
memory demands as in the original separable case. Second, the number of the non-trivial
arithmetic operations was reduced from 24 to 22 compared to the original non-separable
scheme. The trick is that the h, v and a coefficients are not calculated at once. Instead,
these are calculated separately, split in two parts each.

For better understanding, the new scheme is graphically illustrated in Figure 6.5. The
(v

(0)
m,n

, a
(0)
m,n

, v
(1)
m,n

) → (v(0)
m−1,m, a

(0)
m−1,n, v

(1)
m−1,n) coefficients have to be passed through the

buffer horizontally. Additional (h(0)
m,n

, a
(0)
m,n

, h
(1)
m,n

) → (h(0)
m,n−1, a

(0)
m,n−1, h

(1)
m,n−1) coefficients

have to be passed vertically. Lastly, (a(0)
m,n

, a
(1)
m,n

) → (a(0)
m−1,n−1, a

(1)
m−1,n−1) coefficients

have to be passed in diagonal direction. This gives 8 coefficients per core iteration in
total. It should be emphasized that this newly formed scheme cannot be derived using
instruments in [15]. This is caused by the fact that the authors of [15] do not specify

Page 60

CHAPTER 6. MULTI-DIMENSIONAL CORES 59

a sequence of the operations. For comparison, shapes of the individual lifting steps are
shown in Figure 6.6.

Using the matrix notation, the proposed transform core can be described as

y = Aβ Tα,β Dα x, (6.14)

where

x =
[
am−1,n−1 am,n−1 am−1,n am,n hm,n−1 hm,n vm−1,n vm,n dm,n

]T
y =

[
am−1,n−1 am,n−1 am−1,n am,n hm,n−1 hm,n vm−1,n vm,n dm,n

]T (6.15)

Page 118

REFERENCES 117

[57] J. Franco, G. Bernabe, J. Fernandez, and M. Acacio, “A parallel implementation of
the 2D wavelet transform using CUDA,” in 17th Euromicro International Conference
on Parallel, Distributed and Network-based Processing, 2 2009, pp. 111–118. doi:10.
1109/PDP.2009.40

[58] M. Blazewicz, M. Ciznicki, P. Kopta, K. Kurowski, and P. Lichocki, “Two-
dimensional discrete wavelet transform on large images for hybrid computing archi-
tectures: GPU and CELL,” in Euro-Par 2011: Parallel Processing Workshops, ser.
LNCS. Springer, 2012, vol. 7155, pp. 481–490. doi:10.1007/978-3-642-29737-3_
53

[59] V. Galiano, O. Lopez, M. Malumbres, and H. Migallon, “Improving the discrete
wavelet transform computation from multicore to GPU-based algorithms,” in Pro-
ceedings of the 11th International Conference on Computational and Mathematical
Methods in Science and Engineering (CMMSE), 2011, pp. 544–555.

[60] ——, “Parallel strategies for 2D discrete wavelet transform in shared memory sys-
tems and GPUs,” The Journal of Supercomputing, vol. 64, no. 1, pp. 4–16, 2013.
doi:10.1007/s11227-012-0750-5

[61] W. van der Laan, J. B. T. M. Roerdink, and A. Jalba, “Accelerating wavelet-based
video coding on graphics hardware using CUDA,” in Proceedings of 6th International
Symposium on Image and Signal Processing and Analysis (ISPA), Sep. 2009, pp.
608–613.

[62] W. J. van der Laan, A. C. Jalba, and J. B. T. M. Roerdink, “Accelerating wavelet
lifting on graphics hardware using CUDA,” IEEE Transactions on Parallel and Dis-
tributed Systems, vol. 22, no. 1, pp. 132–146, 2011. doi:10.1109/TPDS.2010.143

[63] J. Franco, G. Bernabe, J. Fernandez, and M. Ujaldon, “Parallel 3D fast wavelet
transform on manycore GPUs and multicore CPUs,” Procedia Computer Science,
vol. 1, no. 1, pp. 1101–1110, 2010, ICCS 2010. doi:10.1016/j.procs.2010.04.122

[64] G. Bernabe, G. Guerrero, and J. Fernandez, “CUDA and OpenCL implementations
of 3D fast wavelet transform,” in IEEE Third Latin American Symposium on Circuits
and Systems (LASCAS), Feb. 2012, pp. 1–4. doi:10.1109/LASCAS.2012.6180318

[65] J. Matela, “GPU-based DWT acceleration for JPEG2000,” in Annual Doctoral Work-
shop on Mathematical and Engineering Methods in Computer Science, 2009, pp.
136–143.

Page 119

REFERENCES 118

[66] R. Kutil, “Short-vector SIMD parallelization in signal processing,” in Parallel Com-
puting, R. Trobec, M. Vajtersic, and P. Zinterhof, Eds. Springer London, 2009, pp.
397–433. doi:10.1007/978-1-84882-409-6_13

[67] J. Sykora, L. Kohout, R. Bartosinski, L. Kafka, M. Danek, and P. Honzik, “The
architecture and the technology characterization of an FPGA-based customizable
Application-Specific Vector Processor,” in IEEE 15th International Symposium on
Design and Diagnostics of Electronic Circuits Systems (DDECS), 2012, pp. 62–67.
doi:10.1109/DDECS.2012.6219026

[68] J. Sykora, R. Bartosinski, L. Kohout, M. Danek, and P. Honzik, “Reducing instruc-
tion issue overheads in Application-Specific Vector Processors,” in Proceedings of the
15th Euromicro Conference on Digital System Design (DSD), ser. DSD ’12, 2012,
pp. 600–607.

[69] R. Bartosinski, M. Danek, J. Sykora, L. Kohout, and P. Honzik, “Foreground detec-
tion and image segmentation in a flexible ASVP platform for FPGAs,” in Conference
on Design and Architectures for Signal and Image Processing (DASIP), 2012, pp.
1–2.

[70] ——, “Video surveillance application based on application specific vector proces-
sors,” in Conference on Design and Architectures for Signal and Image Processing
(DASIP), 2012, pp. 1–8.

[71] D. S. Taubman and M. W. Marcellin, JPEG2000 Image Compression Fundamentals,
Standards and Practice, ser. The Springer International Series in Engineering and
Computer Science. Springer US, 2002.

[72] ——, “JPEG2000: standard for interactive imaging,” Proceedings of the IEEE,
vol. 90, no. 8, pp. 1336–1357, Aug. 2002. doi:10.1109/JPROC.2002.800725

[73] D. S. Taubman, E. Ordentlich, M. Weinberger, and G. Seroussi, “Embedded block
coding in JPEG 2000,” Signal Processing: Image Communication, vol. 17, no. 1, pp.
49–72, 2002. doi:10.1016/S0923-5965(01)00028-5

[74] D. S. Taubman, “Software architectures for JPEG2000,” in Proceedings of the IEEE
International Conference for Digital Signal Processing, 2002, pp. 197–200.

Similer Documents