The following tutorials will be part of VCIP 2021. They will be presented in the morning and afternoon of Sunday, December 5, 2021:

Immersive Imaging Technologies: from Capture to Display

Martin Alain, Trinity College Dublin, Ireland
Dr. Cagri Ozcinar, Samsung R&D Institute, UK
Dr. Emin Zerman, Trinity College Dublin, Ireland


Advances in imaging technologies over the last decade have brought several alternatives to the way we acquire and display visual information. These new imaging technologies are immersive: they provide the viewer with more information, which either surrounds the viewer or helps the viewer feel immersed in an augmented representation. Immersive imaging technologies include omnidirectional video, light fields, and volumetric (also known as free-viewpoint) video. These three modalities cover the full spectrum of immersive imaging, from 3 degrees of freedom (DoF) to 6DoF, and can be used for virtual reality (VR) as well as augmented reality (AR). Applications of immersive imaging notably include education, cultural heritage, tele-immersion, remote collaboration, and communication.

In this tutorial, we cover all stages of immersive imaging, from content capture to display and quality assessment. In particular, we discuss learning-based approaches at the relevant stages of the immersive imaging pipeline. We first introduce the main concepts of immersive imaging and creative experiments that build on them. Next, we present content acquisition systems, followed by content creation and the corresponding data formats. Content delivery is then discussed, notably ongoing coding standardisation efforts, followed by adaptive streaming strategies. Immersive imaging displays are then presented, as they play a crucial role in the user’s sense of immersion. Image rendering algorithms related to such displays are also explained. Finally, perception and quality evaluation of immersive imaging are presented.

Tensor Computation for Image Processing

Yipeng Liu, UESTC, China


Many classical signal processing methods rely on representation and computation in the form of vectors and matrices, where a multi-dimensional signal is unfolded into a matrix for processing. The multi-linear structure is lost in such vectorization or matricization, which leads to sub-optimal performance. In fact, the natural representation for multi-dimensional data is the tensor. By avoiding this loss of multi-linear structure, tensor computation can enhance a number of classical data processing techniques. As a typical kind of multi-dimensional data, images can be processed more efficiently and effectively by tensor-computation-based techniques.
This tutorial will provide basic coverage of tensor notation, preliminary operations, the main tensor decompositions, and their properties. Building on these, a series of tensor-based data processing methods are presented as multi-linear extensions of classical sparse learning, low-rank recovery, principal component analysis, linear regression, logistic regression, support vector machines, subspace clustering, deep neural networks, etc. Tensor-decomposition-based image processing techniques not only keep the multi-linear data structure, but also enjoy low computational complexity by performing processing in a few tensor subspaces. Experimental results are given for a number of applications, such as image reconstruction, image enhancement, image fusion, background extraction, human pose estimation, infrared small target detection, EEG and fMRI analysis, multi-view image clustering, and deep neural network compression.
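As a concrete illustration of the matricization discussed above, the sketch below unfolds a 3-D array along each mode, showing how a tensor is flattened into a matrix (a minimal NumPy sketch; the array sizes and names are illustrative, not from the tutorial):

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: move `mode` to the front, then flatten the rest.

    Each column of the result is a mode-`mode` fiber of the tensor, so the
    matrix view discards how the remaining modes were arranged.
    """
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

# A toy 3-D "image stack": 4 x 5 pixels, 3 colour channels.
x = np.arange(4 * 5 * 3).reshape(4, 5, 3)

print(unfold(x, 0).shape)  # (4, 15)
print(unfold(x, 2).shape)  # (3, 20)
```

The unfolding is reversible when the shape is known, but any algorithm that sees only the matrix can no longer exploit the relationship between the two flattened modes, which is the structure loss the tutorial addresses.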

The MPEG Immersive Video coding standard

Bart Kroon, Philips Research Eindhoven, Netherlands
Dawid Mieloch, Poznań University of Technology, Poland
Gauthier Lafruit, Université Libre de Bruxelles / Brussels University, Belgium


This tutorial gives a high-level overview of the MPEG Immersive Video (MIV) coding standard for compressing data from multiple cameras, with a view to supporting VR free navigation and light field applications.
The MIV coding standard (https://mpeg-miv.org) is video-codec agnostic, i.e., it consists of a pre- and post-processing shell around existing codecs such as AVC, EVC, HEVC and VVC. Consequently, no low-level coding details such as DCT block coding or motion vectors will be presented, but rather high-level concepts about how to prepare multiview+depth video sequences to be handled by MIV. Relations with other parts of the MPEG-I standard (“I” refers to “Immersive”), e.g. point cloud coding with V-PCC and streaming with DASH, will also be covered.
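The multiview+depth representation that MIV builds on can be illustrated with a toy depth-based reprojection: given a pixel and its depth in one camera, reconstruct the 3-D point and project it into another camera (a simplified pinhole-camera sketch, not part of the MIV specification; all names and parameter values are illustrative):

```python
import numpy as np

def reproject(pixel, depth, K_src, K_dst, R, t):
    """Warp a pixel from a source view to a destination view using its depth.

    K_src, K_dst: 3x3 pinhole intrinsics; R, t: rotation/translation taking
    source-camera coordinates to destination-camera coordinates.
    """
    u, v = pixel
    # Back-project to a 3-D point in the source camera frame.
    p_src = depth * np.linalg.inv(K_src) @ np.array([u, v, 1.0])
    # Move the point into the destination camera frame and project it.
    p_dst = K_dst @ (R @ p_src + t)
    return p_dst[:2] / p_dst[2]

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
# With an identity pose the pixel maps onto itself, whatever its depth.
print(reproject((100, 80), 2.0, K, K, np.eye(3), np.zeros(3)))
```

This per-pixel warp is the basic operation behind depth-based view synthesis; the standard itself operates at a higher level, packing view patches into atlases that a conventional 2-D codec then compresses.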

Learning-based point cloud processing and coding

Giuseppe Valenzise, Université Paris-Saclay, CNRS, CentraleSupelec, Laboratoire des Signaux et Systèmes, France
Dong Tian, InterDigital, US


Point clouds (PCs) are an essential data structure for numerous applications such as virtual and mixed reality, 3D content production, sensing for autonomous vehicle navigation, architecture, and cultural heritage. Point clouds are sets of 3D points identified by their coordinates, which constitute the geometry of the point cloud. In addition, each point can be associated with attributes such as colors, normals, and reflectance. Point clouds can have a massive number of points, especially in high-precision or large-scale captures. As a result, efficient point cloud processing and coding techniques are fundamental to enable the practical use of PCs, making them a hot topic in both academic research and industry. Point cloud coding is also the subject of ongoing standardization activities in MPEG, and the first versions of the G-PCC and V-PCC standards have recently been released. In this context, learning-based methods for point cloud processing and coding have attracted a great deal of attention in the past couple of years, thanks to their effectiveness in learning good 3D features and their competitive performance compared with more traditional PC coding techniques.
This tutorial intends to provide an overview of recent advances in learning-based processing and coding of 3D point clouds (PCs). More specifically, the course will introduce and review the most popular point cloud representations, including “native” representations (unordered point sets), and “non-native” PC representations such as octrees, voxels and 2D projections. We then review some processing techniques for point clouds, including low-level processing (sampling, denoising), and high-level processing (classification, analysis). Afterward, we present some relevant techniques for point cloud coding, with a particular focus on recent learning-based approaches and standardization activities. We also present the recent developments in point cloud quality assessment methodologies, metrics and datasets. We conclude with a discussion on the current trends and perspectives in the field.
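The contrast between "native" and "non-native" representations mentioned above can be made concrete by voxelizing an unordered point set (a minimal NumPy sketch; the voxel size and the sample points are illustrative):

```python
import numpy as np

def voxelize(points, voxel_size):
    """Quantize an unordered point set to a set of occupied voxel indices.

    Returns the unique integer grid coordinates: a lossy, regular
    "non-native" representation of the point cloud geometry.
    """
    indices = np.floor(points / voxel_size).astype(np.int64)
    return np.unique(indices, axis=0)

# Four points; the first two fall into the same 0.5-unit voxel.
pts = np.array([[0.10, 0.20, 0.30],
                [0.12, 0.21, 0.33],
                [0.90, 0.10, 0.10],
                [1.60, 1.60, 1.60]])
print(voxelize(pts, 0.5))  # 3 occupied voxels
```

Voxel grids (and their hierarchical form, octrees) trade geometric precision for regularity, which is what lets convolutional architectures and many codecs operate on point cloud data.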

This course is appropriate for students, researchers and practitioners with an interest in 3D point cloud processing and coding, and in particular for those willing to catch up with the recent advances in learning-based approaches to this field. The tutorial will introduce the basic aspects and tools necessary to understand the most recent developments in point cloud processing and coding, and is directed to an audience with a general background in multimedia signal processing, coding, and machine learning.

Versatile Video Coding – an Open Implementation and its Applications

Benjamin Bross, Fraunhofer Heinrich Hertz Institute, Germany
Christian Helmrich, Fraunhofer Heinrich Hertz Institute, Germany
Adam Wieckowski, Fraunhofer Heinrich Hertz Institute, Germany


The latest video coding standard, VVC (Versatile Video Coding), jointly developed by ITU-T and ISO/IEC, was finalized in July 2020. In September 2020, Fraunhofer HHI made optimized VVC software encoder (VVenC) and decoder (VVdeC) implementations publicly available on GitHub.
This tutorial details the open encoder implementation VVenC, with a specific focus on the challenges and opportunities in implementing the myriad new coding tools. This includes algorithmic optimizations for specific coding tools such as block partitioning and motion estimation, as well as implementation-specific optimizations such as SIMD vectorization and parallelization approaches.
Methods to approximate and increase subjectively perceived quality, based on a novel block-based XPSNR model and the quantization parameter adaptation algorithm in VVenC, will be discussed. Additionally, topics such as rate control, error-propagation analysis, and video coding in the context of modern codecs will be discussed based on real-world examples of problems encountered during codec development. Since its initial release, the software has been integrated into various workflows and application scenarios, e.g. SDR and HDR UHD content coding, adaptive streaming, and immersive video transport. The takeaways from these experiments will be discussed in the context of both VVC in general and VVenC specifically.
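To make the idea of a block-based quality measure concrete, the sketch below computes plain PSNR per non-overlapping block. Note this is ordinary PSNR, not the XPSNR model covered in the tutorial; the block size and frame sizes are illustrative:

```python
import numpy as np

def blockwise_psnr(ref, rec, block=8, peak=255.0):
    """Plain PSNR per non-overlapping block of two 8-bit grayscale frames."""
    h, w = ref.shape
    out = np.empty((h // block, w // block))
    for by in range(h // block):
        for bx in range(w // block):
            r = ref[by*block:(by+1)*block, bx*block:(bx+1)*block].astype(np.float64)
            d = rec[by*block:(by+1)*block, bx*block:(bx+1)*block].astype(np.float64)
            mse = np.mean((r - d) ** 2)
            out[by, bx] = np.inf if mse == 0 else 10 * np.log10(peak**2 / mse)
    return out

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, (16, 16), dtype=np.uint8)
rec = np.clip(ref.astype(np.int16) + rng.integers(-2, 3, (16, 16)), 0, 255).astype(np.uint8)
print(blockwise_psnr(ref, rec).shape)  # (2, 2)
```

A per-block quality map like this is what a perceptually driven quantization parameter adaptation can act on: blocks where distortion is less visible can be quantized more coarsely.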

A Journey towards Fully Immersive Media Access

Christian Timmerer, Alpen-Adria-Universität Klagenfurt & Bitmovin, Inc., Austria
Tobias Hoßfeld, University of Würzburg, Germany
Raimund Schatz, AIT Austrian Institute of Technology, Austria


Universal access to and provisioning of multimedia content is now a reality. It is easy to generate, distribute, share, and consume any multimedia content, anywhere, anytime, on any device, thanks to a plethora of applications and services that are now commodities in our daily life. Interestingly, most of these services adopt a streaming paradigm, are typically deployed over the open, unmanaged Internet, and account for most of today's Internet traffic. Currently, global video traffic accounts for more than 60 percent of all Internet traffic, and this share is expected to grow to more than 80 percent in the near future (according to Sandvine [https://www.sandvine.com/phenomena] and Cisco VNI [https://www.cisco.com/]). Additionally, Nielsen's Law of Internet bandwidth [https://www.nngroup.com/articles/law-of-bandwidth/] states that the users' bandwidth grows by 50 percent per year, which roughly fits data from 1983 to 2019. Thus, the users' bandwidth can be expected to reach approximately 1 Gbps by 2022. At the same time, network applications will grow and utilize the bandwidth provided, just as programs and their data expand to fill the memory available in a computer system. Most of the available bandwidth today is consumed by video applications, and the amount of data is further increasing due to already established and emerging applications, e.g., ultra-high definition, high dynamic range, virtual, augmented, and mixed realities, or immersive media applications in general, with the aim of increasing the Immersive Media Experience (IMEx) [https://arxiv.org/abs/2007.07032].
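The 1 Gbps projection above follows from simple compound growth. The sketch below reproduces it under the assumption of a 1984 baseline of roughly 300 bit/s (the figure Nielsen's own data series starts from; the baseline is an assumption here, not stated in this tutorial description):

```python
def nielsen_bandwidth(year, base_year=1984, base_bps=300.0, growth=1.5):
    """High-end user bandwidth projected by 50%-per-year compound growth."""
    return base_bps * growth ** (year - base_year)

# Projection for 2022: roughly 1.5 Gbit/s, i.e. on the order of 1 Gbps.
print(f"{nielsen_bandwidth(2022) / 1e9:.2f} Gbit/s")
```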

A major technical breakthrough was adaptive streaming over HTTP, resulting in the standardization of MPEG Dynamic Adaptive Streaming over HTTP (DASH), which enables content-/format-agnostic delivery over-the-top (OTT) of existing infrastructure. Thus, this tutorial takes DASH as a basis and explains how it is adopted for immersive media delivery, such as omnidirectional/360-degree video and volumetric video representations (e.g., point clouds, light fields, holography). The tutorial focuses on the principles of Quality of Experience (QoE) for such immersive media applications and services, including QoE assessment and management. Finally, the tutorial concludes with open research issues and industry efforts in this domain.
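The adaptation principle behind DASH can be sketched as a throughput-based rule: for each segment, the client picks the highest representation bitrate that fits the measured throughput with a safety margin. The following is a minimal illustrative sketch; the bitrate ladder and the margin are made-up values, and real players use considerably more elaborate logic (buffer-based and hybrid algorithms):

```python
def choose_representation(bitrates_bps, throughput_bps, safety=0.8):
    """Pick the highest available bitrate not exceeding safety * throughput.

    Falls back to the lowest representation when even that does not fit.
    """
    budget = safety * throughput_bps
    fitting = [b for b in sorted(bitrates_bps) if b <= budget]
    return fitting[-1] if fitting else min(bitrates_bps)

ladder = [500_000, 1_500_000, 3_000_000, 6_000_000]  # hypothetical ladder
print(choose_representation(ladder, 4_000_000))  # 3000000
print(choose_representation(ladder, 400_000))    # 500000
```

For immersive media the same loop applies per viewport or per tile, which is why QoE assessment for 360-degree and volumetric content is considerably harder than for flat video.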