Online access: Zoom


VCIP participant marked as *, contact author marked as .
Welcome and Opening
Oral Session
Session Chair: Dr. Deepayan Bhowmik (University of Stirling, UK)
(Special Session) Fake Media in the Age of Artificial Intelligence
Author(s) Title
Davide Cozzolino*, Diego Gragnaniello, Giovanni Poggi, Luisa Verdoliva*

The ever higher quality and wide diffusion of fake images have spawned a quest for reliable forensic tools. Many GAN image detectors have been proposed recently. In real-world scenarios, however, most of them show limited robustness and generalization ability. Moreover, they often rely on side information not available at test time, that is, they are not universal. We investigate these problems and propose a new GAN image detector based on a limited sub-sampling architecture and a suitable contrastive learning paradigm. Experiments carried out in challenging conditions prove the proposed method to be a first step towards universal GAN image detection, ensuring also good robustness to common image impairments, and good generalization to unseen architectures.

Towards Universal GAN Image Detection
Deepayan Bhowmik*, Mohamed Elawady, Keiller Nogueira

Advances in media compression indicate significant potential to drive future media coding standards, e.g., Joint Photographic Experts Group's learning-based image coding technologies (JPEG AI) and Joint Video Experts Team's (JVET) deep neural network (DNN) based video coding. These codecs in fact represent a new type of media format. As a dire consequence, traditional media security and forensic techniques will no longer be of use. This paper proposes an initial study on the effectiveness of traditional watermarking on two state-of-the-art learning-based image coding methods. Results indicate that traditional watermarking methods are no longer effective. We also examine the forensic trails of various DNN architectures in the learning-based codecs by proposing a residual noise based source identification algorithm that achieved 79% accuracy.

Security and Forensics Exploration of Learning-based Image Coding
Bachir Kaddar, Sid Ahmed Fezza, Wassim Hamidouche*, Zahid Akhtar, Prof. Abdenour Hadid (University of Oulu, Finland)

The amount of new falsified video content is increasing dramatically, making the need to develop effective deepfake detection methods more urgent than ever. Even though many existing deepfake detection approaches show promising results, the majority of them still suffer from a number of critical limitations. In general, poor generalization results have been obtained under unseen or new deepfake generation methods. Consequently, in this paper, we propose a deepfake detection method called HCiT, which combines a Convolutional Neural Network (CNN) with a Vision Transformer (ViT). The HCiT hybrid architecture combines the ability of the CNN to extract local information with the ViT's self-attention mechanism to improve the detection accuracy. In this hybrid architecture, the feature maps extracted from the CNN are fed into a ViT model that determines whether a specific video is fake or real. Experiments were performed on the FaceForensics++ and DeepFake Detection Challenge preview datasets, and the results show that the proposed method significantly outperforms the state-of-the-art methods. In addition, the HCiT method shows a great capacity for generalization on datasets covering various techniques of deepfake generation.

HCiT: Deepfake Video Detection Using a Hybrid Model of CNN features and Vision Transformer
Changtao Miao*, Qi Chu, Weihai Li, Tao Gong, Wanyi Zhuang, Nenghai Yu

Over the past several years, to solve the problem of malicious abuse of facial manipulation technology, face manipulation detection technology has attracted considerable attention and achieved remarkable progress. However, most existing methods have very poor generalization ability and robustness. In this paper, we propose a novel method for face manipulation detection, which improves generalization ability and robustness via bag-of-feature. Specifically, we extend Transformers using a bag-of-feature approach to encode inter-patch relationships, allowing them to learn forgery features without any additional mask supervision. Extensive experiments demonstrate that our method can outperform competing state-of-the-art methods on the FaceForensics++, Celeb-DF and DeeperForensics-1.0 datasets.

Towards Generalizable and Robust Face Manipulation Detection via Bag-of-feature
Yunqian Wen, Bo Liu, Rong Xie, Jingyi Cao, Li Song*

Advances in cameras and web technology have made it easy to capture large amounts of face video and share them with an unknown audience for uncontrollable purposes. This raises increasing concerns about unwanted identity-relevant computer vision devices invading the characters' privacy. Previous de-identification methods rely on designing novel neural networks and processing face videos frame by frame, which ignores the redundancy and continuity of the data. Besides, these techniques are incapable of balancing privacy and utility well, and per-frame processing easily causes flicker. In this paper, we present deep motion flow, which can create remarkable de-identified face videos with a good privacy-utility tradeoff. It calculates the relative dense motion flow between every two adjacent original frames and runs the high-quality image anonymization only on the first frame. The de-identified video is then obtained from the anonymous first frame via the relative dense motion flow. Extensive experiments demonstrate the effectiveness of our proposed de-identification method.

Deep Motion Flow Aided Face Video De-identification
Coffee break
Poster Session
Session Chair: Prof. Ching-Chun Huang (NYCU, Taiwan)
Advanced Techniques for Immersive Image/Video
No. Author(s) Title
1 Sarah Fachada*, Daniele Bonatto*, Mehrdad Teratani, Gauthier Lafruit*

Non-Lambertian objects present an aspect which depends on the viewer's position towards the surrounding scene. Contrary to diffuse objects, their features move non-linearly with the camera, which prevents rendering them with existing Depth Image-Based Rendering (DIBR) approaches or triangulating their surface with Structure-from-Motion (SfM). In this paper, we propose an extension of the DIBR paradigm to describe these non-linearities, by replacing the depth maps with more complete multi-channel "non-Lambertian maps", without attempting a 3D reconstruction of the scene. We provide a study of the importance of each coefficient of the proposed map, measuring the trade-off between visual quality and data volume to optimally render non-Lambertian objects. We compare our method to other state-of-the-art image-based rendering methods and outperform them with promising subjective and objective results on a challenging dataset.

Polynomial Image-Based Rendering for non-Lambertian Objects
2 Jong-Beom Jeong*, Soonbin Lee, Eun-Seok Ryu

With the new immersive video coding standard MPEG immersive video (MIV) and versatile video coding (VVC), six degrees of freedom (6DoF) virtual reality (VR) streaming technology is emerging for both computer-generated and natural content videos. This paper addresses the decoder-wise subpicture bitstream extracting and merging (DWS-BEAM) method for MIV and proposes two main ideas: (i) a selective streaming-aware subpicture allocation method using a motion-constrained tile set (MCTS), (ii) a decoder-wise subpicture extracting and merging method for single-pass decoding. In the experiments using the VVC test model (VTM), the proposed method shows 1.23% BD-rate saving for immersive video PSNR (IV-PSNR) and 15.78% decoding runtime saving compared to the VTM anchor. Moreover, while the MIV test model requires four decoders, the proposed method only requires one decoder.

DWS-BEAM: Decoder-Wise Subpicture Bitstream Extracting and Merging for MPEG Immersive Video
3 Si He*, Yu Liu, Yumei Wang

As an emerging media format, virtual reality (VR) has attracted the attention of researchers. 6-DoF VR can reconstruct the surrounding environment with the help of the depth information of the scene, so as to provide users with an immersive experience. However, due to the lack of depth information in panoramic images, it is still a challenge to convert a panorama to 6-DoF VR. In this paper, we propose a new depth estimation method, SPCNet, based on spherical convolution to solve the problem of depth information restoration for panoramic images. In particular, spherical convolution is introduced to improve depth estimation accuracy by reducing the distortion caused by Equi-Rectangular Projection (ERP). The experimental results show that many indicators of SPCNet are better than those of other advanced networks. For example, RMSE is 0.419 lower than that of UResNet. Moreover, the threshold accuracy of depth estimation has also been improved.

SPCNet: A Panoramic image depth estimation method based on spherical convolution
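
For readers unfamiliar with the two metrics named in the abstract above, the following is a minimal sketch of RMSE and threshold accuracy as they are commonly defined in the depth estimation literature. The function names, the 1.25 threshold, and the sample depth values are illustrative assumptions and are not taken from the SPCNet paper.

```python
import math


def rmse(pred, gt):
    """Root-mean-square error between predicted and ground-truth depths."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred))


def threshold_accuracy(pred, gt, threshold=1.25):
    """Fraction of pixels whose ratio max(p/g, g/p) is below the threshold."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return hits / len(pred)


# Hypothetical depth values in metres (all strictly positive).
pred = [1.0, 2.0, 4.0, 8.0]
gt = [1.0, 2.2, 4.0, 4.0]

error = rmse(pred, gt)
accuracy = threshold_accuracy(pred, gt)
```

A lower RMSE and a higher threshold accuracy both indicate better depth estimates, which is the direction of improvement the abstract reports.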
4 Antonin Gilles*

Thanks to its ability to provide accurate focus cues, holography is considered a promising display technology for augmented reality glasses. However, since it contains a large amount of data, the calculation of a hologram is a time-consuming process which results in prohibitive head-motion-to-photon latency, especially when using embedded calculation hardware. In this paper, we present a real-time hologram calculation method implemented on an NVIDIA Jetson AGX Xavier embedded platform. Our method is based on two modules: an offline pre-computation module and an on-the-fly hologram synthesis module. In the offline calculation module, the omnidirectional light field scattered by each scene object is individually pre-computed and stored in a Look-Up Table (LUT). Then, in the hologram synthesis module, the light waves corresponding to the viewer's position and orientation are extracted from the LUT in real-time to compute the hologram. Experimental results show that the proposed method is able to compute 2K1K color holograms at more than 50 frames per second, enabling its use in augmented reality applications.

Real-time embedded hologram calculation for augmented reality glasses
5 Shuho Umebayashi*, Kazuya Kodama, Takayuki Hamamoto

We propose a novel method of light field compression using multi-focus images and reference views. Light fields enable us to observe scenes from various viewpoints. However, a light field generally consists of enormous 4D data that are not suitable for storing or transmitting without effective compression at relatively low bit-rates. On the other hand, 4D light fields are essentially redundant because they include just 3D scene information. While robust 3D scene estimation such as depth recovery from light fields is not easy, a method of reconstructing light fields directly from 3D information composed of multi-focus images without any scene estimation has been successfully derived. Based on this method, we previously proposed light field compression via multi-focus images as an effective representation of 3D scenes. However, its high performance can be seen only at very low bit-rates, because there is some degradation of low-frequency components and occluded regions in light fields predicted from multi-focus images. In this paper, we study higher quality light field compression by using reference views to improve the quality of the prediction from multi-focus images. Our contribution is twofold: first, our improved method maintains good 4D light field compression performance over a wider range of low bit-rates than the previous one, which works effectively only at very low bit-rates; second, we clarify how the proposed method can improve its performance continuously by introducing recent video codecs such as HEVC and VVC into our compression framework, which does not depend on the 3D-SPIHT previously adopted for the corresponding component. We show experimental results using synthetic and real images, where the quality of reconstructed light fields is evaluated by PSNR and SSIM to analyze the characteristics of our method. We observe that it is much superior to light field compression using HEVC directly at low bit-rates, regardless of the light field scan order.

Image/Video Super-Resolution and Rescaling
No. Author(s) Title
6 Xueheng Zhang, Li Chen*, Li Song*, Zhiyong Gao

Increasing the spatial resolution and frame rate of a video simultaneously has attracted attention in recent years. Current one-stage space-time video super-resolution (STVSR) methods struggle with large motion and complex scenes, and are time-consuming and memory intensive. We propose an efficient STVSR framework, which can correctly handle complicated scenes such as occlusion and large motion and generate results with clearer texture. On the REDS dataset, our method outperforms all existing one-stage methods. Our method is lightweight and can generate 720p frames at 16 fps on an NVIDIA GTX 1080 Ti GPU.

Fast and Context-Aware Framework for Space-Time Video Super-Resolution
7 Li Ma, Sumei Li*

In stereo image super-resolution (SR), it is equally important to utilize intra-view and cross-view information. However, most existing methods only focus on the exploration of cross-view information and neglect the full mining of intra-view information, which limits the reconstruction performance of these methods. Since single image SR (SISR) methods are powerful in intra-view information exploitation, we propose to introduce the knowledge distillation strategy to transfer the knowledge of a SISR network (teacher network) to a stereo image SR network (student network). With the help of the teacher network, the student network can easily learn more intra-view information. Specifically, we propose pixel-wise distillation as the implementation method, which not only improves the intra-view information extraction ability of the student network, but also ensures the effective learning of cross-view information. Moreover, we propose a lightweight student network named Adaptive Residual Feature Aggregation network (ARFAnet). Its main unit, the ARFA module, can aggregate informative residual features and produce more representative features for image reconstruction. Experimental results demonstrate that our teacher-student network achieves state-of-the-art performance on all benchmark datasets.

Stereo Image Super-Resolution Based on Pixel-Wise Knowledge Distillation Strategy
8 Yijian Zheng, Sumei Li*

Stereo image super-resolution (SR) has achieved great progress in recent years. However, the two major problems of the existing methods are that the parallax correction is insufficient and the cross-view information fusion only occurs at the beginning of the network. To address these problems, we propose a two-stage parallax correction and a multi-stage cross-view fusion network for better stereo image SR results. Specifically, the two-stage parallax correction module consists of horizontal parallax correction and refined parallax correction. The first stage corrects horizontal parallax by parallax attention. The second stage is based on deformable convolution to refine horizontal parallax and correct vertical parallax simultaneously. Then, multiple cascaded enhanced residual spatial feature transform blocks are developed to fuse cross-view information at multiple stages. Extensive experiments show that our method achieves state-of-the-art performance on the KITTI2012, KITTI2015, Middlebury and Flickr1024 datasets.

Two-stage Parallax Correction and Multi-stage Cross-view Fusion Network Based Stereo Image Super-Resolution
9 Asfand Yaar, Hasan F. Ates, Bahadir Gunturk

Deep learning-based single image super-resolution (SR) consistently shows superior performance compared to the traditional SR methods. However, most of these methods assume that the blur kernel used to generate the low-resolution (LR) image is known and fixed (e.g., bicubic). Since blur kernels involved in real-life scenarios are complex and unknown, the performance of these SR methods is greatly reduced for real blurry images. Reconstruction of high-resolution (HR) images from randomly blurred and noisy LR images remains a challenging task. Typical blind SR approaches involve two sequential stages: i) kernel estimation; ii) SR image reconstruction based on the estimated kernel. However, due to the ill-posed nature of this problem, an iterative refinement could be beneficial for both the kernel and the SR image estimate. With this observation, in this paper, we propose an image SR method based on deep learning with iterative kernel estimation and image reconstruction. Simulation results show that the proposed method outperforms the state of the art in blind image SR and produces visually superior results as well.

Deep Learning-Based Blind Image Super-Resolution using Iterative Networks
10 Yan-An Chen, Ching-Chun Hsiao, Wen-Hsiao Peng*, Ching-Chun Huang*

This paper addresses image rescaling, the task of which is to downscale an input image followed by upscaling for the purposes of transmission, storage, or playback on heterogeneous devices. The state-of-the-art image rescaling network (known as IRN) tackles image downscaling and upscaling as mutually invertible tasks using invertible affine coupling layers. In particular, for upscaling, IRN models the missing high-frequency component by an input-independent (case-agnostic) Gaussian noise. In this work, we take one step further to predict a case-specific high-frequency component from textures embedded in the downscaled image. Moreover, we adopt integer coupling layers to avoid quantizing the downscaled image. When tested on commonly used datasets, the proposed method, termed DIRECT, improves high-resolution reconstruction quality both subjectively and objectively, while maintaining visually pleasing downscaled images.

DIRECT: Discrete Image Rescaling with Enhancement from Case-specific Textures
Recognition and Detection I
No. Author(s) Title
11 Yueyang Li, Sumei Li*, Yizhan Zhao

Scene text detection has achieved impressive progress over the past years. However, there are still two challenges for text detection. The first challenge is the limitation of the receptive field on account of the large aspect ratio of texts. The second one is the loss of spatial information due to the long path between the lower layers and the topmost feature. To address these two problems, we propose an effective text detection network for multi-oriented text. In this paper, we introduce a Hybrid Feature Enhancement Module (HFEM) and a Low-level Feature Refinement Module (LFRM). HFEM is a module with multiple parallel branches that enlarges the receptive field and captures multi-scale information for better detection. LFRM is proposed to suppress background noise and strengthen feature propagation. Moreover, a low-level feature compensation mechanism preserves rich spatial information for the model. Experiments on datasets including ICDAR 2015, MSRA-TD500 and MLT-2017 validate that the proposed method is effective for multi-oriented text. We also provide ablation experiments on ICDAR 2015 to indicate the effectiveness of the proposed modules in our network.

Multi-Oriented Text Detection Network Based on Hybrid Feature Enhancement and Shallow Feature Refinement
12 Yizhan Zhao, Sumei Li*, Yongli Chang

Recently, scene text detection based on deep learning has progressed substantially. Nevertheless, most previous models with FPN are limited by the drawback of simple interpolation algorithms, which fail to generate high-quality up-sampled features. Accordingly, we propose an end-to-end trainable text detector to alleviate this problem. Specifically, a Back Projection Enhanced Up-sampling (BPEU) block is proposed to alleviate the drawback of simple interpolation algorithms. It significantly enhances the quality of up-sampled features by employing back projection and detail compensation. Furthermore, a Multi-Dimensional Attention (MDA) block is devised to learn different knowledge from the spatial and channel dimensions, which intelligently selects features to generate more discriminative representations. Experimental results on three benchmarks, ICDAR2015, ICDAR2017-MLT and MSRA-TD500, demonstrate the effectiveness of our method.

Multi-Dimension Aware Back Projection Network For Scene Text Detection
13 Jinshi Kang, Xin Jin*, Zhidong Chen

Vision-based automatic defect detection is of great significance for industrial quality control. However, for small-scale defects on specular surfaces, existing methods lack the ability to capture defects under a single exposure. In this paper, light field spatial correlation fusion (LFSCF) is proposed for defect detection. The LFSCF algorithm fuses the spatial correlation maps (SCMs) from multiple perspectives in the structured light field (LF) image to generate a global confidence map. Benefiting from the 4-D imaging capability of LF cameras and the fused receptive field of LFSCF, small-scale defects are visible through a single exposure and detectable in several adjacent SCMs, hence the accuracy and robustness are improved. Experimental results show that the proposed method achieves a precision of 0.990 and a recall of 1.000.

Light Field Spatial Correlation Fusion for High Precision Defect Detection
14 Zijun Yu, Zhao Wenchao, Qiang Zhang, Wenming Yang*, Qingmin Liao

In this paper, we first build a new well-annotated Unmanned Aerial Vehicle (UAV) dataset termed the UAV19 dataset, which consists of 18,123 RGB images containing 19 categories of UAVs. Then, based on the simple one-stage detector FCOS, we propose a novel end-to-end UAV detector, RBUAVDet, which extracts fine-grained representative features by explicitly exploiting region and border features containing regional overall information and boundary information of UAV instances. By fusing region and border features effectively with the Region and Border Feature Enhancement (RBFE) module, RBUAVDet can improve the accuracy of both classification and localization. Experimental results on the UAV19 dataset show that, compared with the baseline FCOS, our proposed RBUAVDet achieves a gain of 2.35% in mean average precision (mAP) (74.63 vs. 72.28). Other popular object detectors are also compared and RBUAVDet obtains state-of-the-art accuracy.

RBUAVDet: Exploiting Region and Border Features for Unmanned Aerial Vehicle Detection
15 Gerald Xie, Zhu Li*, Asif Mehmood, Shuvra Bhattacharyya

Object detection is a classic computer vision task, which learns the mapping between an image and object bounding boxes plus class labels. Many applications of object detection involve images which are prone to degradation at capture time, notably motion blur from a moving camera, as on UAVs, or from the object itself. One approach to handling this blur involves using common deblurring methods to recover the clean pixel images and then applying the vision task. This task is typically ill-posed. On top of this, applying these methods also adds to the inference time of the vision network, which can hinder performance on video inputs. To address these issues, we propose a novel plug-and-play (PnP) solution that inserts deblurring features into the target vision task network without the need to retrain the task network. The deblur features are learned from a classification loss network on blur strength and directions, and the PnP scheme works well with the object detection network with minimum inference time complexity, compared with the state-of-the-art deblur-then-detect solution.

Plug-and-Play Deblurring for Robust Object Detection
Steganography I
No. Author(s) Title
16 Yiyan Yang, Zhongpai Gao, Guangtao Zhai*

The QR code is a powerful tool to bridge the offline and online worlds. It has been widely used because it can store a large amount of information in a small space. However, the black-and-white style of QR codes is not attractive to the human eye when embedded in videos, which greatly affects the viewing experience. Invisible QR codes have been proposed based on temporal psycho-visual modulation (TPVM) to embed invisible hyperlinks in shopping websites, copyright watermarks in movies, etc. However, existing embedding and detection methods are not robust enough. In this paper, we adopt a novel embedding method to greatly improve the visual quality of the embedded video. Furthermore, we build a new dataset of invisible QR codes named 'IQRCodes' to train deep neural networks. Finally, we propose localization, refinement, and segmentation neural networks (LRS-Net) to efficiently detect and restore invisible QR codes that are captured by mobile phones.

LRS-Net: invisible QR Code embedding, detection, and restoration
17 Muhammad Farhan, Syed Muhammad Ammar Alam, Moid Huda

In the age of digital content creation and distribution, steganography, that is, hiding secret data within other data, is needed in many applications, such as secret communication between two parties, piracy protection, etc. In image steganography, secret data is generally embedded within the image through an additional step after a mandatory image enhancement process. In this paper, we propose the idea of embedding data during the image enhancement process. This saves the additional work required to separately encode the data inside the cover image. We used the Alpha-Trimmed mean filter for image enhancement and the XOR of the 6 MSBs for embedding two bits of the bitstream in the 2 LSBs, whereas the extraction is the reverse process. Our quantitative and qualitative results are better than those of a methodology presented in a very recent paper.

Alpha-trimmed Mean Filter and XOR based Image Enhancement for Embedding Data in Image
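
As background for the enhancement step named in the abstract above, here is a minimal sketch of a generic alpha-trimmed mean filter: each pixel's neighbourhood is sorted, the smallest and largest values are discarded, and the remainder is averaged. The window radius, the trimming parameter `d`, the border clamping, and all function names are assumptions made for illustration; the abstract does not state the authors' settings, and the XOR-based embedding step is not reproduced here.

```python
def alpha_trimmed_mean(window, d):
    """Mean of a window after dropping the d/2 smallest and d/2 largest values."""
    if d < 0 or d % 2 or d >= len(window):
        raise ValueError("d must be even, non-negative, and smaller than the window")
    ordered = sorted(window)
    half = d // 2
    kept = ordered[half:len(ordered) - half]
    return sum(kept) / len(kept)


def alpha_trimmed_filter(image, d=2, radius=1):
    """Filter a 2-D list of grey levels, clamping coordinates at the borders."""
    height, width = len(image), len(image[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the (2 * radius + 1)^2 neighbourhood around (y, x).
            window = [
                image[min(max(y + dy, 0), height - 1)][min(max(x + dx, 0), width - 1)]
                for dy in range(-radius, radius + 1)
                for dx in range(-radius, radius + 1)
            ]
            row.append(alpha_trimmed_mean(window, d))
        out.append(row)
    return out


# Hypothetical 3x3 patch with one impulse-noise pixel in the centre.
noisy = [
    [10, 10, 10],
    [10, 255, 10],
    [10, 10, 10],
]
filtered = alpha_trimmed_filter(noisy, d=2)
```

With `d = 0` the filter reduces to a plain mean, and as `d` approaches the window size it approaches a median, which is why the trimmed variant suppresses isolated outliers such as the centre pixel in this patch.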
Lunch break
Keynote I
Does Deep Learning have to be so deep?
Presenter: Moncef Gabbouj (Tampere University, Finland)
Moderator: Andre Kaup (FAU, Germany)
Panel I
Learning-based Image and Video Coding
Moderator: Prof. João Ascenso (Instituto Superior Técnico, Portugal) and Dr. Elena Alshina (Huawei)
Coffee break
Oral Session
Session Chair: Dr. Elena Alshina (Huawei)
(Special Session) Learning-based Image and Video Coding I
Author(s) Title
Hiroaki Akutsu*, Takahiro Naruko, Akifumi Suzuki

Neural compression has benefited from technological advances such as convolutional neural networks (CNNs) to achieve advanced bitrates, especially in image compression. In neural image compression, an encoder and a decoder can run in parallel on a GPU, so the speed is relatively fast. However, the conventional entropy coding for neural image compression requires serialized iterations in which the probability distribution is estimated by multi-layer CNNs and entropy coding is processed on a CPU. Therefore, the total compression and decompression speed is slow. We propose a fast, practical, GPU-intensive entropy coding framework that consistently executes entropy coding on a GPU through highly parallelized tensor operations, as well as an encoder, decoder, and entropy estimator with an improved network architecture. We experimentally evaluated the speed and rate-distortion performance of the proposed framework and found that we could significantly increase the speed while maintaining the bitrate advantage of neural image compression.

GPU-Intensive Fast Entropy Coding Framework for Neural Image Compression
Franck Galpin*, Philippe Bordes, Thierry Dumas, Pavel Nikitin, Fabrice Le Leannec

This paper presents a learning-based method to improve bi-prediction in video coding. In conventional video coding solutions, the motion compensation of blocks from already decoded reference pictures stands out as the principal tool used to predict the current frame. In particular, bi-prediction, in which a block is obtained by averaging two different motion-compensated prediction blocks, significantly improves the final temporal prediction accuracy. In this context, we introduce a simple neural network that further improves the blending operation. A complexity balance, both in terms of network size and encoder mode selection, is carried out. Extensive tests on top of the recently standardized VVC codec are performed and show a BD-rate improvement of -1.4% in random access configuration for a network size of fewer than 10k parameters. We also propose a simple CPU-based implementation and direct network quantization to assess the complexity/gains tradeoff in a conventional codec framework.

Neural Network based Inter bi-prediction Blending
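
The conventional blending that the abstract above takes as its starting point can be sketched as a rounded integer average of two motion-compensated prediction blocks. This is only the baseline averaging, not the proposed neural network; the function name and sample values are hypothetical, and real codecs typically perform this step at a higher intermediate bit depth.

```python
def bi_predict_average(block0, block1):
    """Blend two prediction blocks sample by sample with rounded averaging."""
    same_shape = len(block0) == len(block1) and all(
        len(r0) == len(r1) for r0, r1 in zip(block0, block1)
    )
    if not same_shape:
        raise ValueError("prediction blocks must have the same shape")
    # Adding 1 before the right shift rounds halves up instead of truncating.
    return [
        [(a + b + 1) >> 1 for a, b in zip(r0, r1)]
        for r0, r1 in zip(block0, block1)
    ]


# Hypothetical 2x2 prediction blocks from two reference pictures.
p0 = [[100, 102], [98, 97]]
p1 = [[104, 101], [98, 100]]
blended = bi_predict_average(p0, p1)
```

A learned blending replaces this fixed equal-weight average with a function of the two predictions, which is where the reported gain would come from.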
Ahmet Burakhan Koyuncu*, Kai Cui, Atanas Boev, Eckehard Steinbach*

Learning-based image compression has reached the performance of classical methods such as BPG. One common approach is to use an autoencoder network to map the pixel information to a latent space and then approximate the symbol probabilities in that space with a context model. During inference, the learned context model provides symbol probabilities, which are used by the entropy encoder to obtain the bitstream. Currently, the most effective context models use autoregression, but autoregression results in a very high decoding complexity due to the serialized data processing. In this work, we propose a method to parallelize the autoregressive process used for image compression. In our experiments, we achieve a decoding speed that is over 8 times faster than the standard autoregressive context model almost without compression performance reduction.

Parallelized Context Modeling for Faster Image Coding
Honglei Zhang*, Francesco Cricri, Hamed Rezazadegan Tavakoli, Maria Santamaria, Yat-Hong Lam, Miska Hannuksela

For most machine learning systems, overfitting is an undesired behavior. However, overfitting a model to a test image or a video at inference time is a favorable and effective technique to improve the coding efficiency of learning-based image and video codecs. At the encoding stage, one or more neural networks that are part of the codec are finetuned using the input image or video to achieve a better coding performance. The encoder encodes the input content into a content bitstream. If the finetuned neural network is also part of the decoder, the encoder signals the weight update of the finetuned model to the decoder along with the content bitstream. At the decoding stage, the decoder first updates its neural network model according to the received weight update, and then proceeds with decoding the content bitstream. Since a neural network contains a large number of parameters, compressing the weight update is critical to reducing bitrate overhead. In this paper, we propose learning-based methods to find the important parameters to be overfitted, in terms of rate-distortion performance. Based on simple distribution models for variables in the weight update, we derive two objective functions. By optimizing the proposed objective functions, the importance scores of the parameters can be calculated and the important parameters can be determined. Our experiments on a lossless image compression codec show that the proposed method significantly outperforms a prior-art method where overfitted parameters were selected based on heuristics. Furthermore, our technique improved the compression performance of the state-of-the-art lossless image compression codec by 0.1 bit per pixel.

Learn to overfit better: finding the important parameters for learned image compression
Charles Bonnineau*, Wassim Hamidouche*, Jean-Yves Aubié, Jean-François Travers, Naty Sidaty, Prof. Olivier Deforges

In this paper, we present CAESR, a hybrid learning-based coding approach for spatial scalability based on the versatile video coding (VVC) standard. Our framework considers a low-resolution signal encoded with VVC intra-mode as a base-layer (BL), and a deep conditional autoencoder with hyperprior (AE-HP) as an enhancement-layer (EL) model. The EL encoder takes as inputs both the upscaled BL reconstruction and the original image. Our approach relies on conditional coding that learns the optimal mixture of the source and the upscaled BL image, enabling better performance than residual coding. On the decoder side, a super-resolution (SR) module is used to recover high-resolution details and invert the conditional coding process. Experimental results show that our solution is competitive with VVC full-resolution intra coding while being scalable.

CAESR: Conditional Autoencoder and Super-Resolution for Learned Spatial Scalability
Poster Session
Session Chair: Dr. Christian Herglotz (FAU, Germany)
Emerging Techniques for Image and Video Coding Standards I
No. Author(s) Title
1 Mário Saldanha, Gustavo Sanchez, César Marcon, Luciano Volcan Agostini

This paper presents a learning-based complexity reduction scheme for Versatile Video Coding (VVC) intra-frame prediction. VVC introduces several novel coding tools to improve the coding efficiency of the intra-frame prediction at the cost of a high computational effort. Thus, we developed an efficient complexity reduction scheme composed of three solutions based on machine learning and statistical analysis to reduce the number of intra prediction modes evaluated in the costly Rate-Distortion Optimization (RDO) process. Experimental results demonstrated that the proposed solution provides an 18.32% encoding time saving with a negligible impact on the coding efficiency.

Learning-Based Complexity Reduction Scheme for VVC Intra-Frame Prediction
2 Wei Peng, Li Yu*, Hongkui Wang

High Efficiency Video Coding - Screen Content Coding (HEVC-SCC) follows the traditional angular intra prediction technique in HEVC. However, the Planar mode and the DC mode are somewhat repetitive for screen content video with features such as no sensor noise. Hence, this paper proposes a new intra prediction mode called linear regression (LR) mode, which combines the Planar mode and the DC mode into one mode. The LR mode improves the prediction accuracy of intra prediction for fading regions in screen content video. Besides, by optimizing the most probable mode (MPM) construction, the hit rate of the best mode in the MPM list is improved. The experimental results show that the proposed method can achieve 0.57% BD-BR reduction compared with HM 16.20+SCM8.8, while the coding time remains largely the same.

Linear Regression Mode of Intra Prediction for Screen Content Coding
3 Zhao Liping, Zhou Kailun, Zhou Qingyang, Wang Huihui, Tao Lin

An efficient SCC tool named Intra String Copy (ISC) has been proposed and adopted in AVS3 recently. ISC has two CU level sub-modes: FPSP (fully-matching-string and partially-matching-string based string prediction) sub-mode and EUSP (equal-value-string, unit-basis-vector-string and unmatched-pixel-string based string prediction) sub-mode. Compared with the latest AVS3 reference software HPM with ISC disabled, using AVS3 SCC Common Test Condition and YUV test sequences in text and graphics with motion (TGM) and mixed content (MC) categories, the proposed tool achieves an average Y BD-rate reduction of 9.7%/6.0% and 14.5%/7.7% for TGM and MC in All Intra (AI)/Low Delay B (LDB) configurations, respectively, with low additional encoding complexity and almost the same decoding complexity.

An Intra String Copy Approach for SCC in AVS3
4 Yang Wang, Li Zhang*, Kai Zhang*, Yuwen He, Hongbin Liu

Intra prediction is typically used to exploit the spatial redundancy in video coding. In the latest video coding standard Versatile Video Coding (VVC), 67 intra prediction modes are adopted in intra prediction. The encoder selects the best one from 67 modes and signals it to the decoder. The bits consumed in signaling the selected mode may limit the coding efficiency. To reduce the overhead of signaling the intra prediction mode, a probability-based decoder-side intra mode derivation (P-DIMD) is proposed in this paper. Specifically, an intra prediction mode candidate set is constructed based on the probabilities of intra prediction modes. The probability of an intra prediction mode is mainly estimated in two ways. First, the textures are typically continuous within a local region and intra prediction modes of neighboring blocks are similar to each other. Second, some intra prediction modes are more likely to be used than others. For each intra prediction mode in the constructed candidate set, intra prediction is processed on a template to calculate a cost. The intra prediction mode with the minimum cost is determined as the optimal mode and used in the intra prediction of the current block. Experimental results demonstrate that P-DIMD can achieve 0.56% BD-rate saving on average compared to VTM-11.0 under all intra configuration.

Probability-based decoder-side intra mode derivation for VVC
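
The template-cost selection described in the abstract above can be illustrated with a toy sketch. This is not the authors' implementation: the three modes ("DC", "VER", "HOR"), the scoring rule in build_candidates, and the SAD cost are assumptions chosen only to keep the example small. The point it shows is that the template lies in already reconstructed samples, so a decoder can repeat the same search without the mode being signaled.

```python
def predict_template(mode, above, left, height, width):
    """Predict a height x width template from reference samples (toy modes)."""
    if mode == "DC":
        refs = above + left
        dc = round(sum(refs) / len(refs))
        return [[dc] * width for _ in range(height)]
    if mode == "VER":  # copy the reference row downwards
        return [list(above) for _ in range(height)]
    if mode == "HOR":  # copy the reference column rightwards
        return [[left[r]] * width for r in range(height)]
    raise ValueError(f"unknown mode: {mode}")


def sad(a, b):
    """Sum of absolute differences between two 2-D sample arrays."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def build_candidates(neighbor_modes, prior, k):
    """Hypothetical ranking: prior frequency plus one point per neighbor use."""
    score = dict(prior)
    for m in neighbor_modes:
        score[m] = score.get(m, 0.0) + 1.0
    return sorted(score, key=lambda m: (-score[m], m))[:k]


def derive_mode(candidates, template, above, left):
    """Return the candidate with the minimum template cost, plus all costs."""
    h, w = len(template), len(template[0])
    costs = {m: sad(predict_template(m, above, left, h, w), template)
             for m in candidates}
    return min(costs, key=costs.get), costs


# Toy data: the template continues the vertical structure of the row above it.
above = [50, 60, 70, 80]
left = [55, 55]
template = [[51, 61, 69, 80], [50, 59, 71, 79]]

candidates = build_candidates(["VER", "VER"], {"DC": 0.5, "HOR": 0.3, "VER": 0.2}, k=3)
best_mode, costs = derive_mode(candidates, template, above, left)
```

In this toy case the vertical mode wins because the template samples closely follow the reference row above them.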
5 Limin Wang*, Seungwook Hong, Krit Panusopone

Versatile Video Coding (VVC) is a new international video coding standard. One of the functionalities that VVC supports is the so-called Gradual Decoding Refresh (GDR). GDR is mainly for (ultra) low-delay applications. As the latest video coding standard, VVC employs many new and advanced coding tools. Among them is HMVP (History-based Motion Vector Prediction), which however can cause leaks for GDR applications. This paper analyzes the leak problem associated with HMVP for GDR and proposes suggestions on how to use HMVP for GDR applications.

History-Based MVP (HMVP) for Gradual Decoding Refresh of VVC
Evaluation of Image and Video Coding Standards
No. Author(s) Title
6 Mário Saldanha, Gustavo Sanchez, César Marcon, Luciano Volcan Agostini

This paper presents an encoding time and encoding efficiency analysis of the Quadtree with nested Multi-type Tree (QTMT) structure in the Versatile Video Coding (VVC) intra-frame prediction. The QTMT structure enables VVC to improve the compression performance compared to its predecessor standard at the cost of a higher encoding complexity. The intra-frame prediction time increased about 26 times compared to the HEVC reference software, and most of this time is related to the new block partitioning structure. Thus, this paper provides a detailed description of the VVC block partitioning structure and an in-depth analysis of the QTMT structure regarding coding time and coding efficiency. Based on the presented analyses, this paper can guide upcoming works focusing on the block partitioning of the VVC intra-frame prediction.

Analysis of VVC Intra Prediction Block Partitioning Structure
7 Reda Kaafarani*, Médéric Blestel, Michael Ropert, Aline Roumy, Thomas Maugey

Many video service providers take advantage of bitrate ladders in adaptive HTTP video streaming to account for different network states and user display specifications by providing bitrate/resolution pairs that best fit the client’s network conditions and display capabilities. These bitrate ladders, however, differ when using different codecs and thus the bitrate/resolution pairs differ as well. In addition, bitrate ladders are based on previously available codecs (H.264/MPEG4-AVC, HEVC, etc.), i.e. codecs that are already in service, hence the introduction of new codecs, e.g. Versatile Video Coding (VVC), requires re-analyzing these ladders. For that matter, we analyze the evolution of the bitrate ladder when using VVC. We show how VVC impacts this ladder when compared to HEVC and H.264/AVC and in particular, that there is no need to switch to lower resolutions at the lower bitrates defined in the Call for Evidence on Transcoding for Network Distributed Video Coding (CfE). Index Terms: bitrate ladder, VVC, HEVC, H.264/AVC

Evaluation Of Bitrate Ladders For Versatile Video Coder
8 Xiaohan Pan, Zongyu Guo, Zhibo Chen

We have witnessed the rapid development of learned image compression (LIC). The latest LIC models have outperformed almost all traditional image compression standards in terms of rate-distortion (RD) performance. However, the time complexity of LIC models is still underexplored, limiting their practical applications in industry. Even with GPU acceleration, LIC models still struggle with long coding time, especially on the decoder side. In this paper, we analyze and test a few prevailing and representative LIC models, and compare their complexity with traditional codecs including H.265/HEVC intra and H.266/VVC intra. We provide a comprehensive analysis of every module in the LIC models, and investigate how bitrate changes affect coding time. We observe that the time complexity bottleneck mainly exists in entropy coding and context modelling. Although this paper pays more attention to experimental statistics, our analysis reveals some insights for further acceleration of LIC models, such as model modification for parallel computing, model pruning and a more parallel context model.

Analyzing Time Complexity of Practical Learned Image Compression Models
9 Rongli Jia, Zhiming Zhou, Jiapeng Tang, Yu Dong, Lin Li, Bing Zhou, Li Song*

Nowadays, people are increasingly inclined to use video calls for remote communication. However, due to the complicated and unpredictable network environments, video communication often confronts undesirable conditions during the transmission, such as bandwidth adjustment, packet loss, delay and jitter. Assessing the quality of complex video communication has become an important research topic. Existing quality of experience (QoE) models are mostly aimed at streaming services, but fail to suit real-time communication (RTC) scenarios. In this work, we propose an improved QoE model for RTC video communication. We extract communication features, audio features and video features and improve the model's performance with a deep neural network. Besides, we establish a subjective QoE database based on an Android RTC platform to verify the effectiveness of the proposed model. Extensive experiments demonstrate that our approach outperforms other state-of-the-art ones, and achieves a relative improvement of 20.00%, 21.67% and 29.78% for PLCC, SRCC and KRCC, respectively.

QoRTC: An Improved Quality of Experience Measurement for RTC Video Communication
10 Xiaozhong Xu*, Shan Liu, Zeqiang Li

Learning-based visual data compression and analysis have attracted great interest from both academia and industry recently. More training as well as testing datasets, especially good quality video datasets are highly desirable for related research and standardization activities. A UHD video dataset is established to serve various purposes such as training neural network-based coding tools and testing machine vision tasks including object detection and tracking. This dataset contains 86 video sequences with a variety of content coverage. Each video sequence consists of 65 frames at 4K (3840x2160) spatial resolution. In this paper, the details of this dataset, as well as its performance when compressed by VVC and HEVC video codecs, are introduced.

A Video Dataset for Learning-based Visual Data Compression and Analysis
Machine Learning for Multimedia I
No. Author(s) Title
11 Jun Fu*, Chen Hou, Zhibo Chen

Learning-based image deraining methods have achieved remarkable success in the past few decades. Currently, most deraining architectures are developed by human experts, which is a laborious and error-prone process. In this paper, we present a study on employing neural architecture search (NAS) to automatically design deraining architectures, dubbed AutoDerain. Specifically, we first propose a U-shaped deraining architecture, which mainly consists of residual squeeze-and-excitation blocks (RSEBs). Then, we define a search space, where we search for the convolutional types and the use of the squeeze-and-excitation block. Considering that the differentiable architecture search is memory-intensive, we propose a memory-efficient differentiable architecture search scheme (MDARTS). In light of the success of training binary neural networks, MDARTS optimizes architecture parameters through the proximal gradient, which only consumes the same GPU memory as training a single deraining model. Experimental results demonstrate that the architecture designed by MDARTS is superior to manually designed derainers.

AutoDerain: Memory-efficient Neural Architecture Search for Image Deraining
12 Ping Wang, Wei Wu*, Zhu Li*, Yong Liu

Scale-Invariant Feature Transform (SIFT) is one of the most well-known image matching methods, which has been widely applied in various visual fields. Because of the adoption of a difference of Gaussian (DoG) pyramid and Gaussian gradient information for extrema detection and description, respectively, SIFT achieves accurate key points and thus has shown excellent matching results, except under adverse weather conditions like rain. To address the issue, in this paper we propose a divide-and-conquer SIFT key point recovery algorithm from a single rainy image. In the proposed algorithm, we do not aim to improve the quality of a derained image, but divide the key point recovery problem from a rainy image into two sub-problems, one being how to recover the DoG pyramid for the derained image and the other being how to recover the gradients of derained Gaussian images at multiple scales. We also propose two separate deep learning networks with different losses and structures to recover them, respectively. This divide-and-conquer scheme of setting different objectives for SIFT extrema detection and description leads to very robust performance. Experimental results show that our proposed algorithm achieves state-of-the-art performance on widely used image datasets in both quantitative and qualitative tests.

See SIFT in a Rain: Divide-and-conquer SIFT Key Point Recovery from a Single Rainy Image
Multimedia Content Analysis, Representation, and Understanding I
No. Author(s) Title
13 Yiran Tao, Yaosi Hu*, Zhenzhong Chen*

In the video saliency prediction task, one of the key issues is the utilization of temporal contextual information of keyframes. In this paper, a deep reinforcement learning agent for video saliency prediction is proposed, designed to look around adjacent frames and adaptively generate a salient contextual window that contains the most correlated information of the keyframe for saliency prediction. More specifically, an action set decides step by step whether to expand the window, while a state set and reward function evaluate the effectiveness of the current window. The deep Q-learning algorithm is followed to train the agent to learn a policy to achieve its goal. The proposed agent can be regarded as a plug-and-play module that is compatible with generic video saliency prediction models. Experimental results on various datasets demonstrate that our method can achieve an advanced performance.

Learn to Look Around: Deep Reinforcement Learning Agent for Video Saliency Prediction
14 Yiwei Yang, Yucheng Zhu, Zhongpai Gao, Guangtao Zhai*

The saliency prediction of panoramic images is dramatically affected by the distortion caused by non-Euclidean geometry characteristic. Traditional CNN based saliency prediction algorithms for 2D images are no longer suitable for 360-degree images. Intuitively, we propose a graph based fully convolutional network for saliency prediction of 360-degree images, which can reasonably map panoramic pixels to spherical graph data structures for representation. The saliency prediction network is based on residual U-Net architecture, with dilated graph convolutions and attention mechanism in the bottleneck. Furthermore, we design a fully convolutional layer for graph pooling and unpooling operations in spherical graph space to retain node-to-node features. Experimental results show that our proposed method outperforms other state-of-the-art saliency models on the large-scale dataset.

SalGFCN: Graph Based Fully Convolutional Network for Panoramic Saliency Prediction
15 Jiaomin Yue, Qiang Lu, Xiongkuo Min*, Dandan Zhu, Xiao-Ping Zhang, Guangtao Zhai*

There are individual differences in human visual attention between observers when viewing the same scene. Inter-observer visual congruency (IOVC) describes the dispersion between different people’s visual attention areas when they observe the same stimulus. Research on the IOVC of video is interesting but lacking. In this paper, we first introduce a measurement to calculate the IOVC of video, and conduct an eye-tracking experiment in a realistic movie-watching environment to establish a movie scene dataset. Then we propose a method to predict the IOVC of video, which employs a dual-channel network to extract and integrate content and optical flow features. The effectiveness of the proposed prediction model is validated on our dataset. Finally, the correlation between inter-observer congruency and video emotion is analyzed.

Inter-Observer Visual Congruency in Video-Viewing
16 Kaitai Zhang, Bin Wang, Wei Wang, Fahad Sohrab, Moncef Gabbouj*, C.-C. Jay Kuo*

An image anomaly localization method based on the successive subspace learning (SSL) framework, called AnomalyHop, is proposed in this work. AnomalyHop consists of three modules: 1) feature extraction via successive subspace learning (SSL), 2) normality feature distributions modeling via Gaussian models, and 3) anomaly map generation and fusion. Compared with state-of-the-art image anomaly localization methods based on deep neural networks (DNNs), AnomalyHop is mathematically transparent, easy to train, and fast in its inference speed. Besides, its area under the ROC curve (ROC-AUC) performance on the MVTec AD dataset is 95.9%, which is among the best of several benchmarking methods.

AnomalyHop: An SSL-based Image Anomaly Localization Method
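
As a rough illustration of modules 2 and 3 in the abstract above, the sketch below fits one Gaussian to feature vectors from normal images and scores each location of a test feature map by its Mahalanobis distance to that Gaussian. The abstract does not specify the distance measure, the number of Gaussians, or the feature extractor, so the Mahalanobis score, the single shared Gaussian, and the names fit_gaussian and anomaly_map are assumptions made for this example only.

```python
import numpy as np


def fit_gaussian(features, eps=1e-3):
    """Fit a Gaussian to normal-sample features of shape (N, D).

    Returns the mean and the inverse covariance; eps regularizes the
    covariance so that it stays invertible.
    """
    mean = features.mean(axis=0)
    cov = np.cov(features, rowvar=False) + eps * np.eye(features.shape[1])
    return mean, np.linalg.inv(cov)


def anomaly_map(feature_map, mean, cov_inv):
    """Mahalanobis distance of every location in an (H, W, D) feature map."""
    diff = feature_map - mean
    d2 = np.einsum("hwi,ij,hwj->hw", diff, cov_inv, diff)
    return np.sqrt(d2)


# Toy data: random vectors stand in for features extracted from normal images.
rng = np.random.default_rng(0)
normal_features = rng.normal(size=(500, 3))
mean, cov_inv = fit_gaussian(normal_features)

test_map = np.zeros((4, 4, 3))
test_map[2, 1] = [6.0, 6.0, 6.0]  # one location far from the normal distribution
scores = anomaly_map(test_map, mean, cov_inv)
peak = np.unravel_index(scores.argmax(), scores.shape)
```

The highest score falls on the location whose feature vector was moved away from the normal distribution, which is the behavior an anomaly map needs.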
17 Fang-Tsung Hsiao, Yi-Hsien Lin*, Yi-Chang Lu

In this paper, we propose a novel algorithm for summarization-based image resizing. In the past, a process of detecting precise locations of repeating patterns is required before the pattern removal step in resizing. However, it is difficult to find repeating patterns which are illuminated under different lighting conditions and viewed from different perspectives. To solve the problem, we first identify the regularity unit of repeating patterns by statistics. Then we can use the regularity unit for shift-map optimization to obtain a better resized image. The experimental results show that our method is competitive with other well-known methods.

Using Regularity Unit As Guidance For Summarization-Based Image Resizing


Oral Session
Session Chair: Prof. João Ascenso (Instituto Superior Técnico, Portugal)
(Special Session) Learning-based Image and Video Coding II
Author(s) Title
Saeed Ranjbar Alvar, Ivan Bajic*

Learning-based compression systems have shown great potential for multi-task inference from their latent-space representation of the input image. In such systems, the decoder is supposed to be able to perform various analyses of the input image, such as object detection or segmentation, besides decoding the image. At the same time, privacy concerns around visual analytics have grown in response to the increasing capabilities of such systems to reveal private information. In this paper, we propose a method to make latent-space inference more privacy-friendly using mutual information-based criteria. In particular, we show how organizing and compressing the latent representation of the image according to task-specific mutual information can make the model maintain high analytics accuracy while becoming less able to reconstruct the input image and thereby reveal private information.

Scalable Privacy in Multi-Task Image Compression
Zhenghao Zhang, Zhao Wang*, Yan Ye, Shiqi Wang, Changwen Zheng

Video chat is becoming more and more popular in our daily life. However, providing high-quality video chat with limited bandwidth is a key challenge. In this paper, beyond the state-of-the-art video compression system, we propose an encoder-decoder joint enhancement algorithm for video chat. In particular, the sparse map of the original frame is extracted at the encoder side and signaled to the decoder, where it is utilized together with the sparse map of the decoded frame to obtain the boundary transformation map. In this manner, the boundary transformation map represents the key difference between the original frame and the decoded frame and hence can be used to enhance the decoded frame. Experimental results show that the proposed algorithm brings clear subjective and objective quality improvements. At the same quality, the proposed algorithm can achieve 35% bitrate savings compared to VVC.

Encoder-Decoder Joint Enhancement for Video Chat
Jian Qian, Hongkui Wang, Li Yu*

To reduce compression artifacts in video coding, prior knowledge from the codec is often utilized by deep learning based enhancement methods. However, most existing algorithms only take limited kinds of prior knowledge into consideration and directly feed it into neural networks, resulting in limited information obtained by the enhancement module. In this paper, a distortion-based neural network is proposed to better take advantage of features from the codec and further improve the quality of decoded videos. Firstly, a variety of prior knowledge is jointly exploited to estimate the compression distortion. Secondly, an estimation module is also designed for the information obtained from the codec to improve the precision of estimation. With the accurately estimated distortion level, the enhancement module in the proposed method could reduce the corresponding artifacts by flexibly controlling the filtering strength. Moreover, the proposed method is integrated into VTM-7.1, achieving on average 5.92% BD-rate saving for Y component under All Intra configuration.

Distortion-based Neural Network for Compression Artifacts Reduction in VVC
Evgeniy Upenik*, Michela Testolina*, Joao Ascenso*, Fernando Pereira, Touradj Ebrahimi*

Learning-based image codecs produce different compression artifacts, when compared to the blocking and blurring degradation introduced by conventional image codecs, such as JPEG, JPEG 2000 and HEIC. In this paper, a crowdsourcing based subjective quality evaluation procedure was used to benchmark a representative set of end-to-end deep learning-based image codecs submitted to the MMSP'2020 Grand Challenge on Learning-Based Image Coding and the JPEG AI Call for Evidence. For the first time, a double stimulus methodology with a continuous quality scale was applied to evaluate this type of image codecs. The subjective experiment is one of the largest ever reported including more than 240 pair-comparisons evaluated by 118 naïve subjects. The results of the benchmarking of learning-based image coding solutions against conventional codecs are organized in a dataset of differential mean opinion scores along with the stimuli and made publicly available.

Large-Scale Crowdsourcing Subjective Quality Evaluation of Learning-Based Image Coding
Zezhi Zhu, Lili Zhao, XuHu Lin, Xuezhou Guo, Jianwen Chen*

In High Efficiency Video Coding (HEVC), inter prediction is an important module for removing temporal redundancy. The accuracy of inter prediction is much affected by the similarity between the current and reference frames. However, for blurry videos, the performance of inter coding will be degraded by varying motion blur, which is derived from camera shake or the acceleration of objects in the scene. To address this problem, we propose to synthesize an additional reference frame via a frame interpolation network. The synthesized reference frame is added into the reference picture lists to supply a more credible reference candidate, and the searching mechanism for motion candidates is changed accordingly. In addition, to make our interpolation network more robust to various inputs with different compression artifacts, we establish a new blurry video database to train our network. With the well-trained frame interpolation network, compared with the reference software HM-16.9, the proposed method achieves on average 1.55% BD-rate reduction under random access (RA) configuration for blurry videos, and also obtains on average 0.75% BD-rate reduction for common test sequences.

Deep Inter Prediction via Reference Frame Interpolation for Blurry Video Coding
Coffee break
Poster Session
Session Chair: Prof. Li Sumei (Tianjin University, China)
Depth Estimation
No. Author(s) Title
1 Yue Li*, Yueyi Zhang, Zhiwei Xiong

Deep neural networks (DNNs) have been widely used for stereo depth estimation, which achieve great success in performance. In this paper, we introduce a novel flipping strategy for DNN on the stereo depth estimation task. Specifically, based on a common DNN for stereo matching, we apply the flipping operation for both input stereo images, which are further fed to the original DNN. A flipping loss function is proposed to jointly train the network with the initial loss. We apply our strategy to many representative networks in both supervised and self-supervised manners. Extensive experimental results demonstrate that our proposed strategy improves the performance of these networks.

Revisiting Flipping Strategy for Learning-based Stereo Depth Estimation
2 Rithvik Anil, Mansi Sharma*, Rohit Choudhary

Stereo depth estimation is dependent on optimal correspondence matching between pixels of stereo-pair image to infer depth. In this paper, we attempt to revisit the stereo depth estimation problem in a simple dual convolutional neural network (CNN) based on EfficientNet that avoids the construction of a cost volume in stereo matching. This has been performed by considering different weights in otherwise identical towers of the CNN. The proposed algorithm is dubbed as SDE-DualENet. The architecture of SDE-DualENet eliminates the construction of cost-volume by learning to match correspondence between pixels with a different set of weights in the dual towers. The results are demonstrated on complex scenes with high details and large depth variations. The SDE-DualENet depth prediction network outperforms state-of-the-art monocular and stereo depth estimation methods, both qualitatively and quantitatively on challenging scene flow dataset. The code and pre-trained models will be made publicly available.

SDE-DualENet: A Novel Dual Efficient Convolutional Neural Network for Robust Stereo Depth Estimation
Image and Video Compression Beyond Standards I
No. Author(s) Title
3 Shiyu Huang*, Ziyuan Luo, Jiahua Xu, Wei Zhou, Zhibo Chen

Recently, pre-processed video transcoding has attracted wide attention and has been increasingly used in practical applications for improving the perceptual experience and saving transmission resources. However, very few works have been conducted to evaluate the performance of pre-processing methods. In this paper, we select the source (SRC) videos and various pre-processing approaches to construct the first Pre-processed and Transcoded Video Database (PTVD). Then, we conduct the subjective experiment, showing that compared with the video sent to the codec directly at the same bitrate, the appropriate pre-processing methods indeed improve the perceptual quality. Finally, existing image/video quality metrics are evaluated on our database. The results indicate that the performance of the existing image/video quality assessment (IQA/VQA) approaches remains to be improved. We will make our database publicly available soon.

Perceptual Evaluation of Pre-processing for Video Transcoding
4 Xi Huang, Luheng Jia*, Han Wang, Kebin Jia

In video coding, it is always an intractable problem to compress high frequency components including noise and visually imperceptible content that consume a large amount of bandwidth resources while providing limited quality improvement. Directly using denoising methods causes coding performance degradation, and is hence not suitable for the video coding scenario. In this work, we propose a video pre-processing approach by leveraging an edge preserving filter specifically designed for video coding, of which the filter parameters are optimized in the sense of rate-distortion (R-D) performance. The proposed pre-processing method removes low R-D cost-effective components for the video encoder while keeping important structural components, leading to higher coding efficiency and also better subjective quality. Compared with the conventional denoising filters, our proposed pre-processing method using the R-D optimized edge preserving filter can improve the coding efficiency by up to -5.2% BD-rate with low computational complexity.

Video Coding Pre-Processing Based on Rate-Distortion Optimized Weighted Guided Filter
5 Dongmei Xue, Haichuan Ma, Li Li*, Dong Liu, Zhiwei Xiong

With the rapid development of whole brain imaging technology, a large number of brain images have been produced, which puts forward a great demand for efficient brain image compression methods. At present, the most commonly used compression methods are all based on 3-D wavelet transform, such as JP3D. However, traditional 3-D wavelet transforms are designed manually with certain assumptions on the signal, but brain images are not as ideal as assumed. What's more, they are not directly optimized for compression task. In order to solve these problems, we propose a trainable 3-D wavelet transform based on the lifting scheme, in which the predict and update steps are replaced by 3-D convolutional neural networks. Then the proposed transform is embedded into an end-to-end compression scheme called iWave3D, which is trained with a large amount of brain images to directly minimize the rate-distortion loss. Experimental results demonstrate that our method outperforms JP3D significantly by 2.012 dB in terms of average BD-PSNR.

iWave3D: End-to-end Brain Image Compression with Trainable 3-D Wavelet Transform
6 Yunyi Xuan, Chunling Yang*, Xin Yang

Recently, network-based Image Compressive Sensing (ICS) algorithms have shown superior performance in reconstruction quality and speed, yet they are non-interpretable. Herein, we propose an Adaptive Threshold-based Sparse Representation Reconstruction Network (ATSR-Net), composed of the Convolutional Sparse Representation subnet (CSR-subnet) and the truly Adaptive Threshold Generation subnet (ATG-subnet). The traditional iterations are unfolded into several CSR-subnets, which can fully exploit the local and nonlocal similarities. The ATG-subnet automatically determines a threshold map based on the image intrinsic characterization for flexible feature selection. Moreover, we present a three-level consistency loss based on pixel-level, measurement-level, and feature-level, to accelerate the network convergence. Extensive experimental results demonstrate the superiority of the proposed network to the existing state-of-the-art methods by large margins, both quantitatively and qualitatively.

Adaptive Threshold-based Sparse Representation Network for Image Compressive Sensing Reconstruction
7 Saiping Zhang*, Marta Mrak, Luis Herranz, Marc Gorriz Blanch, Shuai Wan, Fuzheng Yang

Recent years have witnessed the significant development of learning-based video compression methods, which aim at optimizing objective or perceptual quality and bit rates. In this paper, we introduce deep video compression with perceptual optimizations (DVC-P), which aims at increasing perceptual quality of decoded videos. Our proposed DVC-P is based on Deep Video Compression (DVC) network, but improves it with perceptual optimizations. Specifically, a discriminator network and a mixed loss are employed to help our network trade off among distortion, perception and rate. Furthermore, nearest-neighbor interpolation is used to eliminate checkerboard artifacts which can appear in sequences encoded with DVC frameworks. Thanks to these two improvements, the perceptual quality of decoded sequences is improved. Experimental results demonstrate that, compared with the baseline DVC, our proposed method can generate videos with higher perceptual quality achieving 12.27% reduction in a perceptual BD-rate equivalent, on average.

DVC-P: Deep Video Compression with Perceptual Optimizations
Image/Video Quality Assessment I
No. Author(s) Title
8 Feng Jinhui, Sumei Li*, Yongli Chang

In recent years, with the popularization of 3D technology, stereoscopic image quality assessment (SIQA) has attracted extensive attention. In this paper, we propose a two-stage binocular fusion network for SIQA, which takes binocular fusion, binocular rivalry and binocular suppression into account to imitate the complex binocular visual mechanism in the human brain. Besides, to extract spatial saliency features of the left view, the right view, and the fusion view, saliency generating layers (SGLs) are applied in the network. The SGLs apply multi-scale dilated convolution to emphasize essential spatial information of the input features. Experimental results on four public stereoscopic image databases demonstrate that the proposed method outperforms the state-of-the-art SIQA methods on both symmetrical and asymmetrical distortion stereoscopic images.

Binocular Visual Mechanism Guided No-Reference Stereoscopic Image Quality Assessment Considering Spatial Saliency
9 Feng Jinhui, Sumei Li*, Yongli Chang

In this paper, we propose an optimized dual-stream convolutional neural network (CNN) considering binocular disparity and fusion compensation for no-reference stereoscopic image quality assessment (SIQA). Different from previous methods, we extract both disparity and fusion features from multiple levels to simulate the hierarchical processing of stereoscopic images in the human brain. Given that ocular dominance plays an important role in quality evaluation, a fusion weights assignment module (FWAM) is proposed to assign weights that guide the fusion of the left and the right features respectively. Experimental results on four public stereoscopic image databases show that the proposed method is superior to the state-of-the-art SIQA methods on both symmetrically and asymmetrically distorted stereoscopic images.

No-Reference Stereoscopic Image Quality Assessment Considering Binocular Disparity and Fusion Compensation
10 Fan Meng, Sumei Li*

With the development of stereoscopic imaging technology, stereoscopic image quality assessment (SIQA) has become increasingly important, and designing a method in line with human visual perception is challenging due to the complex relationship between binocular views. In this article, firstly, a convolutional neural network (CNN) based on the visual pathway of the human visual system (HVS) is built, which simulates different parts of the visual pathway such as the optic chiasm, lateral geniculate nucleus (LGN), and visual cortex. Secondly, the two pathways of our method simulate the ‘what’ and ‘where’ visual pathways respectively, which are endowed with different feature extraction capabilities. Finally, we find a different application for 3D convolution, employing it to fuse the information from the left and right views, rather than just extracting temporal features in video. The experimental results show that our proposed method is more in line with subjective scores and generalizes well.

No-Reference Stereoscopic Image Quality Assessment Based On The Visual Pathway Of Human Visual System
11 Mingyue Zhou, Sumei Li*

Simulation of the human visual system (HVS) is crucial for fitting human perception and improving assessment performance in stereoscopic image quality assessment (SIQA). In this paper, a no-reference SIQA method considering the feedback mechanism and orientation selectivity of the HVS is proposed. In the HVS, feedback connections are indispensable during the process of human perception, which has not been studied in existing SIQA models. Therefore, we design a new feedback module (FBM) to realize the guidance of the high-level region of the visual cortex to the low-level region. In addition, given the orientation selectivity of primary visual cortex cells, a deformable feature extraction block is explored to simulate it, and the block can adaptively select the regions of interest. Meanwhile, retinal ganglion cells (RGCs) with different receptive fields have different sensitivities to objects of different sizes in the image, so a new manner of extracting and fusing information from multiple receptive fields is realized in the network structure. Experimental results show that the proposed model is superior to the state-of-the-art no-reference SIQA methods and has excellent generalization ability.

Deformable Convolution Based No-Reference Stereoscopic Image Quality Assessment Considering Visual Feedback Mechanism
12 Yingjie Feng, Sumei Li*

Stereoscopic video quality assessment (SVQA) is of great importance for promoting the development of the stereoscopic video industry. In this paper, we propose a three-branch multi-level binocular fusion convolutional neural network (MBFNet) which is highly consistent with human visual perception. Our network mainly includes three innovative structures. Firstly, we construct a multi-scale cross-dimension attention module (MSCAM) on the left and right branches to capture more critical semantic information. Then, we design a multi-level binocular fusion unit (MBFU) to fuse the features from the left and right branches adaptively. In addition, a disparity compensation branch (DCB) containing an enhancement unit (EU) is added to provide disparity features. The experimental results show that the proposed method is superior to other existing SVQA methods and achieves state-of-the-art performance.

Stereoscopic Video Quality Assessment with Multi-level Binocular Fusion Network Considering Disparity and Multi-scale Information
Image/Video Restoration and Quality Enhancement I
No. Author(s) Title
13 Yosuke Ueki*, Masaaki Ikehara

Underwater images suffer from low contrast, color distortion and visibility degradation due to light scattering and attenuation. Over the past few years, the importance of underwater image enhancement has increased because of ocean engineering and underwater robotics. Existing underwater image enhancement methods are based on various assumptions. However, it is almost impossible to define appropriate assumptions for underwater images due to their diversity. Therefore, these methods are only effective for specific types of underwater images. Recently, underwater image enhancement algorithms using CNNs and GANs have been proposed, but they are not as advanced as other image processing methods due to the lack of suitable training data sets and the complexity of the issues. To solve these problems, we propose a novel underwater image enhancement method which combines a residual feature attention block with a novel combination of multi-scale and multi-patch structures. The multi-patch network extracts local features to adjust to various underwater images, which are often non-homogeneous. In addition, our network includes a multi-scale network, which is often effective for image restoration. Experimental results show that our proposed method outperforms the conventional method for various types of images.

Underwater Image Enhancement with Multi-Scale Residual Attention Network
Recognition and Detection II
No. Author(s) Title
14 Jacek Trelinski, Bogdan Kwolek*

We propose an effective framework for human action recognition on raw depth maps. We leverage a convolutional autoencoder to extract frame features from sequences of depth maps, which are then fed to a 1D-CNN responsible for embedding action features. A Siamese neural network trained on a single representative depth map for each sequence extracts features, which are then processed by a shapelets algorithm to extract action features. These features are then concatenated with features extracted by a BiLSTM with a TimeDistributed wrapper. Given the individual models learned on such features, we perform a selection of a subset of models. We demonstrate experimentally that on the SYSU 3DHOI dataset the proposed algorithm considerably outperforms all recent algorithms, including skeleton-based ones.

Human Action Recognition on Raw Depth Maps
15 Quang Duc Vu, Trang Phung, Jia-Ching Wang*

Knowledge distillation is an effective transfer of knowledge from a heavy network (teacher) to a small network (student) to boost the student's performance. Self-knowledge distillation, a special case of knowledge distillation, has been proposed to remove the large teacher network training process while preserving the student's performance. This paper introduces a novel self-knowledge distillation approach via Siamese representation learning, which minimizes the difference between the representation vectors of two different views of a given sample. Our proposed method, SKD-SRL, utilizes both soft label distillation and the similarity of representation vectors. Therefore, SKD-SRL can generate more consistent predictions and representations across various views of the same data point. Our method has been evaluated on various standard datasets. The experimental results show that SKD-SRL significantly improves accuracy compared to existing supervised learning and knowledge distillation methods, regardless of the network used.

A Novel Self-Knowledge Distillation Approach with Siamese Representation Learning for Action Recognition
16 Huan Lin, Hongtian Zhao, Hua Yang*

Events in videos usually contain a variety of factors: objects, environments, actions, and their interaction relations. These factors, as mid-level semantics, can bridge the gap between the event categories and the video clips. In this paper, we present a novel video event recognition method that uses graph convolution networks to represent and reason about the logical relations among the inner factors. Considering that different kinds of events may focus on different factors, we use transformer networks to extract spatial-temporal features, drawing upon the attention mechanism that can adaptively assign weights to the key factors concerned. Although transformers generally rely more on large datasets, we show the effectiveness of applying a 2D convolution backbone before the transformers. We train and test our framework on the challenging video event recognition dataset UCF-Crime and conduct ablation studies. The experimental results show that our method achieves state-of-the-art performance, outperforming previous leading models by a significant margin in recognition accuracy.

Complex Event Recognition via Spatial-Temporal Relation Graph Reasoning
17 Zhiyuan Huang, Zhaohui Hou, Pingyu Wang, Fei Su*, Zhicheng Zhao

Rotating object detection is more challenging than horizontal object detection because of the multi-orientation of the objects involved. In recent anchor-based rotating object detectors, the IoU-based matching mechanism suffers from mismatching and wrong-matching problems. Moreover, the encoding mechanism does not correctly reflect the location relationships between anchors and objects. In this paper, an RBox-Diff-based matching (RDM) mechanism and an angle-first encoding (AE) method are proposed to solve these problems. RDM optimizes the anchor-object matching by replacing IoU (Intersection-over-Union) with a new concept called RBox-Diff, while AE optimizes the encoding mechanism to make the encoding results more consistent with the relative position between objects and anchors. The proposed methods can be easily applied to most anchor-based rotating object detectors without introducing extra parameters. Extensive experiments on the DOTA-v1.0 dataset show the effectiveness of the proposed methods over other advanced methods.

Rethinking Anchor-Object Matching and Encoding in Rotating Object Detection
Lunch break
Keynote II
Perception: The Next Milestone in Learned Image Compression
Presenter: Johannes Ballé (Google, USA)
Moderator: Wen-Hsiao Peng (NYCU, Taiwan)
Demo Session
Session Chair: Wen-Hsiao Peng (NYCU, Taiwan)
No. Author(s) Title
1 Jens Schneider*, Johannes Sauer, Mathias Wien*

RDPlot is an open source GUI application for plotting Rate-Distortion (RD) curves and calculating Bjøntegaard Delta (BD) statistics [1]. It supports parsing the output of commonly used reference software packages, parsing *.csv-formatted files, and *.xml-formatted files. Once parsed, RDPlot offers the ability to evaluate video coding results interactively. Conceptually, several measures can be plotted over the bitrate and BD measurements can be conducted accordingly. Moreover, plots and corresponding BD statistics can be exported and directly integrated into LaTeX documents.

RDPlot – An Evaluation Tool for Video Coding Simulations
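
For readers unfamiliar with the BD statistics mentioned above, the following is a minimal sketch of a Bjøntegaard Delta rate calculation. It is not RDPlot's code: the function name `bd_rate` and all sample values are illustrative assumptions, and it uses the common cubic polynomial fit of log-rate over PSNR, while other interpolation variants exist.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bitrate difference in percent of test vs. anchor at equal PSNR.

    Hypothetical helper; negative values mean the test codec needs less rate.
    """
    pa = np.asarray(psnr_anchor, dtype=float)
    pt = np.asarray(psnr_test, dtype=float)
    log_ra = np.log(np.asarray(rate_anchor, dtype=float))
    log_rt = np.log(np.asarray(rate_test, dtype=float))

    # Only the PSNR interval covered by both curves is compared.
    lo = max(pa.min(), pt.min())
    hi = min(pa.max(), pt.max())
    span = hi - lo

    # Cubic fit of log-rate as a function of PSNR (shifted by `lo` for
    # better numerical conditioning).
    fit_a = np.polyfit(pa - lo, log_ra, 3)
    fit_t = np.polyfit(pt - lo, log_rt, 3)

    # Integrate both fits over [0, span]; the antiderivative is 0 at 0.
    int_a = np.polyval(np.polyint(fit_a), span)
    int_t = np.polyval(np.polyint(fit_t), span)

    avg_log_diff = (int_t - int_a) / span
    return float((np.exp(avg_log_diff) - 1.0) * 100.0)

# Made-up RD points: the test curve uses 10% less rate at every PSNR.
psnr = [32.0, 35.0, 38.0, 41.0]
anchor_rate = [1000.0, 2000.0, 4000.0, 8000.0]
test_rate = [900.0, 1800.0, 3600.0, 7200.0]
print(round(bd_rate(anchor_rate, psnr, test_rate, psnr), 2))
```

Because the test curve in this toy example is the anchor curve scaled by 0.9 in rate, the result is close to -10.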
2 Peter Fasogbon*, Yu You, Emre Aksu

We present a demonstration of a Network Based Media Processing (NBMP) standard-compliant cloud service for reconstruction and Augmented Reality (AR) display of a fully textured 3D human model using 2-5 images captured with a smartphone.

NBMP Standard Use Case: 3D Human Reconstruction Workflow
3 Chris Varekamp, Ernest Ma, Andy Willems, Simon Gu, Hongxin Chen

We demonstrate a new capture system that allows generation of virtual views corresponding with a virtual camera that is placed between the players on a sports field. Our depth estimation and segmentation pipeline can reduce 2K resolution views from 16 cameras to patches in a single 4K resolution texture atlas. We have created a real time, WebGL 2 based, playback application that renders an arbitrary view from the 4K atlas. The application allows a user to change viewpoint in real time. Additionally, to interpret the scene, a user can also remove objects such as a player or the ball. At the conference we will demonstrate both the automatic multi-camera conversion pipeline and the real-time rendering/object removal on a smartphone.

Multi-camera system for placing the viewer between the players of a live sports match
4 Chunjun Hua, Menghan Hu*, Yue Wu

Congenital glaucoma is an eye disease caused by embryonic developmental disorders, which damages the optic nerve. In this demo paper, we propose a portable non-contact congenital glaucoma detection system, which can evaluate the condition of children’s eyes by measuring the cornea size using the developed mobile application. The system consists of two modules, viz. a cornea identification module and a diagnosis module. The system can be used by anyone with a smartphone, which gives it wide applicability. It can be used as a convenient home self-examination tool for children in the large-scale screening of congenital glaucoma. The demo video of the proposed detection system is available at:

Portable Congenital Glaucoma Detection System
5 Chia-Chun Chung, Wen-Hsiao Peng*, Teng-Hu Cheng, ChiaHau Yu

This paper demonstrates a model-based reinforcement learning framework for training a self-flying drone. We implement the Dreamer proposed in a prior work as an environment model that responds to the action taken by the drone by predicting the next video frame as a new state signal. The Dreamer is a conditional video sequence generator. This model-based environment avoids the time-consuming interactions between the agent and the environment, greatly speeding up the training process. This demonstration showcases for the first time the application of the Dreamer to train an agent that can finish the racing task in the AirSim simulator.

Learning to Fly with a Video Generator
6 Irina RABAEV, Marina Litvak*, Alex Kreinis, Tom Damri, Tomer Leon

Autism spectrum disorder (ASD) is frequently accompanied by impairment in emotional expression recognition, and therefore individuals with ASD may find it hard to interpret emotions and interact. Inspired by this fact, we developed a web-based video chat to assist people with ASD, both for real-time recognition of facial emotions and for practicing. This real-time application detects the speaker's face in a video stream and classifies the expressed emotion into one of seven categories: neutral, surprise, happy, angry, disgust, fear, and sad. The classification is then displayed as a text label below the speaker's face. We developed this application as part of an undergraduate project for the B.Sc. degree in Software Engineering. Its development and testing were carried out in cooperation with the local society for children and adults with autism. The application has been released for unrestricted use on The demo is available at

Telemoji: A video chat with automated recognition of facial expressions
7 Daniele Bonatto*, Gregoire Hirt, Alexander Kvasov, Sarah Fachada*, Gauthier Lafruit*

Light-Field displays project hundreds of micro-parallax views for users to perceive 3D without wearing glasses. Transmitting all of these views would result in gigantic bandwidth requirements, even when using conventional video compression per view. MPEG Immersive Video (MIV) follows a smarter strategy by transmitting only key images and some metadata to synthesize all the missing views. We developed (and will demonstrate) real-time Depth Image Based Rendering software that follows this approach to synthesize all Light-Field micro-parallax views from a couple of RGBD input views.

MPEG Immersive Video tools for Light Field Head Mounted Displays
8 Marina Litvak*, Irina RABAEV, Sarit Divekar

Plant classification requires an expert because subtle differences in leaf or petal forms might differentiate between species. Conversely, some species are characterized by high variability in appearance. This paper introduces a web app that assists people in identifying plants in order to discover the best growing methods. The uploaded picture is submitted to the back-end server, and a pre-trained neural network classifies it into one of the predefined classes. The classification label and confidence are displayed to the end user on the front-end page. The application focuses on house and garden plant species that can be grown mainly in a desert climate and are not covered by existing datasets. For training a model, we collected the Urban Planter dataset. The installation code of the alpha version and the demo video of the app can be found on

Urban Planter: A Web App for Automatic Classification of Urban Plants
Coffee break
Oral Session
Session Chair: Dr. Robert Cohen (Unity Technologies, Canada)
Image/Video Coding for Machines
Author(s) Title
Ashiv Dhondea*, Robert Cohen, Ivan Bajic*

In edge-cloud collaborative intelligence (CI) applications, an unreliable transmission channel exists in the information path of the AI model performing the inference. It is important to be able to simulate the performance of the CI system across an imperfect channel in order to understand system behavior and develop appropriate error control strategies. In this paper we present a simulation framework called DFTS2, which enables researchers to define the components of the CI system in TensorFlow 2, select a packet-based channel model with various parameters, and simulate system behavior under various channel conditions and error/loss control strategies. Using DFTS2, we also present the most comprehensive study to date of the packet loss concealment methods for collaborative image classification models.

DFTS2: Deep Feature Transmission Simulation for Collaborative Intelligence
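
As a rough illustration of what simulating a packet-based channel for deep features can look like, the sketch below splits a feature tensor into row-wise packets and drops each packet independently. It does not use the DFTS2 API: the function name `simulate_packet_loss`, the row-wise packetization, the i.i.d. loss model, and zero-filling of lost packets are all assumptions made for this example.

```python
import numpy as np

def simulate_packet_loss(features, rows_per_packet=4, loss_prob=0.1, rng=None):
    """Drop row-wise packets of a feature tensor independently at random.

    Hypothetical helper: `features` has shape (H, W, C); every group of
    `rows_per_packet` rows is one packet, and lost packets are zero-filled.
    Returns the received tensor and a boolean mask of lost packets.
    """
    rng = np.random.default_rng() if rng is None else rng
    received = features.copy()
    height = features.shape[0]
    num_packets = -(-height // rows_per_packet)  # ceiling division

    # One independent Bernoulli draw per packet (i.i.d. loss model).
    lost = rng.random(num_packets) < loss_prob
    for p in np.flatnonzero(lost):
        start = p * rows_per_packet
        received[start:start + rows_per_packet] = 0.0
    return received, lost

feature_tensor = np.ones((16, 16, 8), dtype=np.float32)
received, lost_mask = simulate_packet_loss(
    feature_tensor, rows_per_packet=4, loss_prob=0.25,
    rng=np.random.default_rng(0),
)
print("lost packets:", int(lost_mask.sum()), "of", lost_mask.size)
```

A concealment method would then replace the zero-filled rows, for example by interpolating from neighboring packets, before the tensor is passed to the cloud-side part of the model.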
Benben Niu, Ziwei Wei*, Yun He*

With the emergence of various machine-to-machine and machine-to-human tasks with deep learning, the amount of deep feature data is increasing. Deep product quantization is widely applied in deep feature retrieval tasks and has achieved good accuracy. However, it does not primarily target compression, and its output is a fixed-length quantization index, which is not suitable for subsequent compression. In this paper, we propose an entropy-based deep product quantization algorithm for deep feature compression. Firstly, it introduces entropy into hard and soft quantization strategies, which can adapt to the codebook optimization and codeword determination operations in the training and testing processes respectively. Secondly, loss functions related to entropy are designed to adjust the distribution of the quantization index, so that it suits the subsequent entropy coding module. Experiments carried out on retrieval tasks show that the proposed method can be generally combined with deep product quantization and its extended schemes, and can achieve better compression performance under near-lossless conditions.

Entropy-based Deep Product Quantization for Visual Search and Deep Feature Compression
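
To make concrete why the distribution of the quantization index matters for a subsequent entropy coder, the sketch below computes the empirical entropy of a set of codeword indices. It only illustrates this general point and is not the proposed algorithm; the function name `index_entropy_bits` and the sample indices are assumptions.

```python
import numpy as np

def index_entropy_bits(indices, codebook_size):
    """Empirical entropy (bits per index) of quantization indices.

    Hypothetical helper: an ideal entropy coder needs about this many bits
    per index, versus log2(codebook_size) for a fixed-length code.
    """
    counts = np.bincount(np.asarray(indices).ravel(), minlength=codebook_size)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

uniform_indices = [0, 1, 2, 3, 0, 1, 2, 3]
skewed_indices = [0, 0, 0, 0, 0, 0, 1, 2]
print(index_entropy_bits(uniform_indices, 4))
print(index_entropy_bits(skewed_indices, 4))
```

The uniform case needs the full 2 bits per index of a 4-entry codebook, while the skewed case needs fewer, which is the kind of saving an entropy-aware training loss aims to enable.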
Yixin Mei, Fan Li*, Li Li*, Zhu Li*

Recent advances in sensor technology and the wide deployment of visual sensors have led to a new application in which images are compressed not mainly for pixel recovery for human consumption, but for communication to cloud-side machine vision tasks such as classification, identification, detection and tracking. This opens up new research dimensions for learning-based compression that directly optimizes the loss function of the vision task, and therefore achieves better compression performance than recovering the pixels first and then performing the vision task. In this work, we develop a learning-based compression scheme that learns a compact feature representation and appropriate bitstreams for the task of visual object detection. The Variational Auto-Encoder (VAE) framework is adopted for learning a compact representation, while a bridge network is trained to drive the detection loss function. Simulation results demonstrate that this approach achieves a new state of the art in task-driven compression efficiency, compared with pixel recovery approaches, including both learning-based and handcrafted solutions.

Learn A Compression for Objection Detection - VAE with a Bridge
Jinming Liu, Heming Sun*, Jiro Katto

Learned image compression (LIC) has demonstrated good ability for reconstruction quality driven tasks (e.g. PSNR, MS-SSIM) and machine vision tasks such as image understanding. However, most LIC frameworks operate in the pixel domain, which requires the decoding process. In this paper, we develop a learned compressed domain framework for machine vision tasks. 1) By sending the compressed latent representation directly to the task network, the decoding computation can be eliminated to reduce the complexity. 2) By sorting the latent channels by entropy, only selected channels are transmitted to the task network, which can reduce the bitrate. As a result, compared with traditional pixel domain methods, we can reduce multiply–add operations (MACs) by about 1/3 and inference time by about 1/5 while keeping the same accuracy. Moreover, the proposed channel selection can contribute up to 6.8% bitrate saving.

Learning in Compressed Domain for Faster Machine Vision Tasks
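
The channel selection idea in the abstract above (sorting latent channels by entropy and transmitting only some of them) can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: the names `channel_entropy` and `select_channels` are hypothetical, entropy is estimated empirically from quantized latent values, and keeping the highest-entropy channels is an assumed selection rule that the abstract does not specify.

```python
import numpy as np

def channel_entropy(channel):
    """Empirical entropy in bits per symbol of one quantized latent channel."""
    _, counts = np.unique(channel, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def select_channels(latent, keep):
    """Return indices of the `keep` highest-entropy channels of a (C, H, W) latent.

    Assumption: high-entropy channels are the ones worth transmitting.
    """
    entropies = np.array([channel_entropy(c) for c in latent])
    order = np.argsort(-entropies, kind="stable")  # descending entropy
    return np.sort(order[:keep]), entropies

# Toy quantized latent with three 2x2 channels of increasing entropy.
latent = np.array([
    [[0, 0], [0, 0]],   # 0 bits
    [[0, 1], [0, 1]],   # 1 bit
    [[0, 1], [2, 3]],   # 2 bits
])
selected, entropies = select_channels(latent, keep=2)
print(selected, entropies)
```

Only the selected channels would then be entropy coded and passed to the task network, which would need to be trained to accept the reduced input.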
Zerui Yang, Wen Fei, Wenrui Dai, Chenglin Li, Junni Zou, Hongkai Xiong*

Mixed-precision quantization with adaptive bitwidth allocation for neural networks has achieved higher compression rates and accuracy in classification tasks. However, it has not been well explored for object detection networks. In this paper, we propose a novel mixed-precision quantization scheme with a dynamical Hessian matrix for object detection networks. We iteratively select the layer with the lowest sensitivity based on the Hessian matrix and downgrade its precision until the required compression ratio is reached. The L-BFGS algorithm is utilized for updating the Hessian matrix in each quantization iteration. Moreover, we specifically design the loss function for object detection networks by jointly considering the quantization effects on classification and regression loss. Experimental results on RetinaNet and Faster R-CNN show that the proposed DHMQ achieves state-of-the-art performance for quantized object detectors.

Mixed-precision Quantization with Dynamical Hessian Matrix for Object Detection Network
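
The iterative allocation described in the abstract above (pick the least sensitive layer, lower its precision, repeat until the target compression ratio is met) can be sketched as a greedy loop. This is not the DHMQ implementation: the function name `allocate_bitwidths`, the bit-width ladder, and the `sensitivity_fn` callback are hypothetical. In the paper the sensitivity comes from a Hessian matrix updated with L-BFGS in each iteration; here it is simply supplied by the caller.

```python
def allocate_bitwidths(param_counts, sensitivity_fn, target_ratio,
                       bit_choices=(8, 6, 4, 2), full_bits=32):
    """Greedy mixed-precision allocation (illustrative sketch).

    param_counts:   number of parameters per layer.
    sensitivity_fn: hypothetical callback mapping the current bit-widths to
                    one sensitivity score per layer (lower = safer to quantize).
    target_ratio:   required size ratio of full precision to quantized model.
    """
    bits = [bit_choices[0]] * len(param_counts)
    full_size = full_bits * sum(param_counts)

    def ratio():
        return full_size / sum(b * n for b, n in zip(bits, param_counts))

    while ratio() < target_ratio:
        # Layers already at the lowest precision cannot be downgraded further.
        candidates = [i for i, b in enumerate(bits) if b != bit_choices[-1]]
        if not candidates:
            break
        scores = sensitivity_fn(bits)  # re-evaluated every iteration
        layer = min(candidates, key=lambda i: scores[i])
        bits[layer] = bit_choices[bit_choices.index(bits[layer]) + 1]
    return bits, ratio()

# Toy example with fixed, made-up sensitivities for three equal-size layers.
toy_bits, toy_ratio = allocate_bitwidths(
    param_counts=[100, 100, 100],
    sensitivity_fn=lambda bits: [3.0, 1.0, 2.0],
    target_ratio=6.0,
)
print(toy_bits, toy_ratio)
```

With these toy inputs the least sensitive layer is pushed to the lowest precision first, and the next least sensitive layer is lowered only as far as needed to reach the target ratio.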
Poster Session
Session Chair: Prof. Vladan Velisavljevic (University of Bedfordshire, UK)
Emerging Techniques for Image and Video Coding Standards II
No. Author(s) Title
1 Madhu Krishnan, Xin Zhao*

The Alliance for Open Media has recently initiated coding tool exploration activities towards next-generation video coding beyond AV1. In this regard, a frequency-domain coding tool, which is designed to leverage the cross-component correlation existing between collocated chroma blocks, is explored in this paper. The tool, henceforth known as multi-component secondary transform (MCST), is implemented as a low complexity secondary transform with primary transform coefficients of multiple color components as input. The proposed tool is implemented and tested on top of libaom. Experimental results show that, compared to libaom, the proposed method achieves an average overall coding efficiency gain of 0.34% to 0.44% for the All Intra (AI) coding configuration for a wide range of video content.

Multicomponent Secondary Transform
2 Yunrui Jian*, Jiaqi Zhang, Junru Li, Suhong Wang, Shanshe Wang, Siwei Ma, Wen Gao

Cross-component prediction has great potential for removing the redundancy between multiple components. Recently, cross-component sample adaptive offset (CCSAO) was adopted in the third generation of the Audio Video coding Standard (AVS3), which utilizes the intensities of co-located luma samples to determine the offsets of chroma sample filters. However, the frame-level offset is too coarse for varied content, and the edge information of classified samples is ignored. In this paper, we propose an enhanced CCSAO (ECCSAO) method to further improve the coding performance. Firstly, four selectable 1-D directional patterns are added to make the mapping between luma and chroma components more effective. Secondly, a four-layer quad-tree based structure is designed to improve the filtering flexibility of CCSAO. Experimental results show that the proposed approach achieves 1.51%, 2.33% and 2.68% BD-rate savings for All-Intra (AI), Random-Access (RA) and Low Delay B (LD) configurations compared to the AVS3 reference software, respectively. A subset of the ECCSAO improvements has been adopted by AVS3.

Enhanced Cross Component Sample Adaptive Offset for AVS3
3 Yixin Du, Xin Zhao*, Shan Liu

Existing cross-component video coding technologies have shown great potential for improving coding efficiency. The fundamental insight of cross-component coding technology is to exploit the statistical correlations among different color components. In this paper, a Cross-Component Sample Offset (CCSO) approach for image and video coding is proposed, inspired by the observation that the luma component tends to contain more texture, while the chroma component is relatively smoother. The key component of CCSO is a non-linear offset mapping mechanism implemented as a look-up table (LUT). The input of the mapping is the co-located reconstructed samples of the luma component, and the output is the offset values applied to the chroma component. The proposed method has been implemented on top of a recent version of libaom. Experimental results show that the proposed approach brings 1.16% Random Access (RA) BD-rate saving on top of AV1 with a marginal increase in encoding/decoding time.
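
A LUT-based mapping from luma samples to chroma offsets, as described above, might look like the following simplified sketch. It is not the libaom implementation: the function names, the two-neighbor horizontal filter shape, the three-level quantization of luma differences, the 4:4:4 sampling, and the LUT values are all assumptions chosen to keep the example short.

```python
import numpy as np

def quantize_delta(delta, step):
    """Map a luma difference to class 0, 1 or 2 (below -step, within, above step)."""
    return np.where(delta < -step, 0, np.where(delta > step, 2, 1))

def apply_cross_component_offset(luma, chroma, lut, step=8, bit_depth=8):
    """Add LUT offsets to chroma, indexed by local luma differences.

    Hypothetical sketch assuming 4:4:4 sampling (luma and chroma have the
    same shape) and a 9-entry `lut` indexed by the quantized differences
    between the co-located luma sample and its left and right neighbors.
    """
    padded = np.pad(luma.astype(np.int32), ((0, 0), (1, 1)), mode="edge")
    center = padded[:, 1:-1]
    d_left = quantize_delta(padded[:, :-2] - center, step)
    d_right = quantize_delta(padded[:, 2:] - center, step)
    index = d_left * 3 + d_right  # 0..8
    out = chroma.astype(np.int32) + lut[index]
    return np.clip(out, 0, (1 << bit_depth) - 1).astype(chroma.dtype)

# Made-up LUT: offsets only on the two sides of a rising luma edge.
lut = np.zeros(9, dtype=np.int32)
lut[5] = 2    # left neighbor similar, right neighbor brighter
lut[1] = -2   # left neighbor darker, right neighbor similar

luma = np.array([[100, 100, 200, 200]], dtype=np.uint8)
chroma = np.array([[128, 128, 128, 128]], dtype=np.uint8)
print(apply_cross_component_offset(luma, chroma, lut))
```

In an actual codec the LUT entries would be chosen by the encoder and signaled in the bitstream, which this sketch does not model.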

4 Shiyi Liu*, Zhenyu Wang, Ke Qiu, Jiayu Yang, Ronggang Wang

The third generation of the Audio Video Coding Standard (AVS3) achieves a 22% coding performance improvement compared with High Efficiency Video Coding (HEVC). However, this improvement in coding efficiency, which comes from a more flexible block partition scheme, is at the cost of much higher encoding complexity. This paper proposes a bottom-up fast algorithm to prune the time-consuming search process of the CU partition tree. To be specific, we design a scoring mechanism based on the splitting patterns traced back from the bottom to predict the likelihood that a partition type is selected as optimal. The score threshold for skipping the exhaustive Rate-Distortion Optimization (RDO) procedure of a partition type is determined by statistical analysis. The experimental results show that the proposed methods can achieve 24.56% time saving with 0.37% BDBR loss under the Random Access configuration and 12.50% complexity reduction with 0.08% BDBR loss under the All Intra configuration. Owing to this effectiveness, the method was adopted by the open-source platform of AVS3 after evaluation by the AVS working group.

A Bottom-up Fast CU Partition Scoring Mechanism for AVS3
5 Hewei Liu, Shuyuan Zhu*, Ruiqin Xiong, Guanghui Liu, Bing Zeng

In this paper, we propose a new fast CU partition method for VVC intra coding based on the cross-block difference. This difference is measured by the gradient and the content of sub-blocks obtained from partition and is employed to guide the skipping of unnecessary horizontal and vertical partition modes. With this guidance, a fast determination of block partitions is achieved. Compared with VVC, our proposed method can save 41.64% (on average) of the encoding time with only a 0.97% (on average) increase in BD-rate.

Cross-Block Difference Guided Fast CU Partition for VVC Intra Coding
Image/Video Restoration and Quality Enhancement II
No. Author(s) Title
6 Dengyan Luo, Mao Ye*, Shengjie Chen, Xue Li

The past few years have witnessed the great success of using multi-frame information to enhance the quality of compressed video. Most existing methods perform frame or feature alignment to bring similar information from neighboring frames much closer in order to enhance frame quality. However, inaccurate motion estimation introduces new artifacts. In this paper, we propose a new approach without alignment, which takes each non-Peak Quality Frame (non-PQF) and its two adjacent Peak Quality Frames (PQFs) as input. A pre-processing module based on a multi-scale feature extraction strategy is used to broaden the receptive field of the network. Then, an enhancement module uses a two-stream feature extraction architecture to combine a deep architecture and an attention mechanism to further gather similar information. In this module, the lost high-frequency and similar information can be further retrieved from the adjacent PQFs. The proposed network is trained in an end-to-end manner. Compared with alignment-based methods, competitive results can be obtained. A large number of qualitative and quantitative experimental results demonstrate the robustness and effectiveness of the proposed method.

Alignment-Free Video Compression Artifact Reduction
7 Wenbin Yin, Kai Zhang*, Li Zhang*, Yang Wang, Hongbin Liu

This paper describes an adaptive self-guided loop filter for video coding. The filter acts as a loop filter to reduce the artifacts caused by quantization. The proposed filter is applied to the reconstruction samples of a Transform Unit (TU). The unfiltered input block is regarded as guidance for self-guided edge-preserving smooth filtering. The filtering strength for each sample inside the input block is decided adaptively according to the TU size, the corresponding QP, and statistical information generated from a designed filtering window. The proposed filter can be performed as a prediction-loop-filter in which the filtered samples can be referenced by intra prediction. Alternatively, the proposed filter can be moved to the post-loop-filter stage after the deblocking filter for a low decoding dependency. The proposed method has been implemented and tested according to the common test conditions in Versatile Video Coding (VVC) test model version 11.0 (VTM11.0). The post-loop-filter version can achieve 0.46% and 0.41% BD-rate reduction in All Intra and Random-Access configurations respectively.

Adaptive Self-Guided Loop Filter for Video Coding
8 Paras Maharjan, Ning Xu, Xuan Xu, Zhu Li*

Pixel recovery with deep learning has been shown to be very effective for a variety of low-level vision tasks like image super-resolution, denoising, and deblurring. Most existing works operate in the spatial domain, and few works exploit the transform domain for image restoration tasks. In this paper, we present a transform domain approach for image deblocking using a deep neural network called DCTResNet. Our application is compressed video motion deblur, where the input video frame has blocking artifacts that make the deblurring task very challenging. Specifically, we use a block-wise Discrete Cosine Transform (DCT) to decompose the image into its low and high-frequency sub-band images and exploit the strong sub-band specific features for more effective deblocking solutions. Since JPEG also uses DCT for image compression, using DCT sub-band images for image deblocking helps to learn the JPEG compression prior and thus to effectively correct the blocking artifacts. Our experimental results show that DCTResNet compares favorably with other state-of-the-art (SOTA) methods in both PSNR and SSIM, while being significantly faster in inference time.

DCTResNet: Transform Domain Image Deblocking for Motion Blur Images
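
The block-wise DCT decomposition into sub-band images mentioned in the abstract above can be sketched with plain NumPy. This is an illustration only, not the DCTResNet code: the function names, the 8x8 block size, and the orthonormal DCT-II normalization are assumptions. Each sub-band image collects one frequency coefficient from every block.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis as an (n, n) matrix; row k is frequency k."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    d[0, :] = np.sqrt(1.0 / n)
    return d

def dct_subbands(image, n=8):
    """Split an (H, W) image into n*n sub-band images of shape (H/n, W/n).

    Hypothetical helper: the result has shape (n, n, H/n, W/n), where entry
    [u, v] holds coefficient (u, v) of every n x n block.
    """
    h, w = image.shape
    if h % n or w % n:
        raise ValueError("image size must be a multiple of the block size")
    d = dct_matrix(n)
    # Rearrange into a grid of blocks: (H/n, W/n, n, n).
    blocks = image.reshape(h // n, n, w // n, n).transpose(0, 2, 1, 3).astype(float)
    coeffs = d @ blocks @ d.T           # 2D DCT of every block via broadcasting
    return coeffs.transpose(2, 3, 0, 1)  # group by frequency instead of by block

flat_image = np.full((16, 16), 10.0)
subbands = dct_subbands(flat_image)
print(subbands.shape, subbands[0, 0])
```

For a constant image all the energy ends up in the DC sub-band `subbands[0, 0]`, while the higher-frequency sub-bands are numerically zero; a network could then process low- and high-frequency sub-bands separately.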
9 Ronglei Ji*, A. Murat Tekalp

Deinterlacing continues to be an important problem of interest since many digital TV broadcasts and catalog content are still in interlaced format. Although deep learning has had huge impact in all forms of image/video processing, learned deinterlacing has not received much attention in the industry or academia. In this paper, we propose a novel multi-field deinterlacing network that aligns features from adjacent fields to a reference field (to be deinterlaced) using deformable residual convolution blocks. To the best of our knowledge, this paper is the first to propose fusion of multi-field features that are aligned via deformable convolutions for deinterlacing. We demonstrate through extensive experimental results that the proposed method provides state-of-the-art deinterlacing results in terms of both PSNR and perceptual quality.

Learned Multi-Field De-interlacing with Feature Alignment via Deformable Residual Convolution Blocks
10 Gang Zhang, Haoquan Wang, Yedong Wang, Haijie Shen

JPEG compression artifacts seriously affect the viewing experience. Previous studies mainly focused on deep convolutional networks for compression artifact removal, whose model size and inference speed limit their application prospects. To address these problems, this paper proposes two methods that improve the training performance of a compact convolutional network without slowing down its inference speed. Firstly, a fully explainable attention loss is designed to guide the training of the network; it is calculated from local entropy to accurately locate compression artifacts. Secondly, a Fully Expanded Block (FEB) is proposed to replace the convolutional layer in the compact network; it can be contracted back to a normal convolutional layer after the training process is completed. Extensive experiments demonstrate that the proposed method outperforms existing lightweight methods in terms of performance and inference speed.

Attention-guided Convolutional Neural Network for Lightweight JPEG Compression Artifacts Removal
Immersive Media Capture, Processing, and Compression
No. Author(s) Title
11 Tso-Yuan Chen, Ching-Chun Hsiao, Wen-Huang Cheng, Hong-Han Shuai, Peter Chen, Ching-Chun Huang*

With the development of depth sensors, 3D point cloud upsampling, which generates a high-resolution point cloud from a sparse input, has become increasingly important. However, many previous works focused on single 3D object reconstruction and refinement. Although a few recent works have begun to discuss 3D structure refinement for more complex scenes, they do not target LiDAR-based point clouds, which have density imbalance issues from near to far. This paper proposes DensER, a Density-imbalance-Eased regional Representation. Notably, to learn robust representations and model local geometry under imbalanced point density, we design density-aware multiple receptive fields to extract the regional features. Moreover, building on the patch recurrence property of natural scenes, we propose a density-aided attentive module to enrich the extracted features of point-sparse areas by referring to other non-local regions. Finally, coupled with novel manifold-based upsamplers, DensER is able to super-resolve LiDAR-based whole-scene point clouds. The experimental results show that DensER outperforms related works in both qualitative and quantitative evaluation. We also demonstrate that the enhanced point clouds can improve downstream tasks such as 3D object detection and depth completion.

DensER: Density-imbalance-eased Representation for LiDAR-based Whole Scene Upsampling
12 Lu Wang, Jian Sun, Hui Yuan*, Raouf Hamzaoui, Xiaohui Wang

A point cloud is a set of points representing a three-dimensional (3D) object or scene. To compress a point cloud, the Moving Picture Experts Group (MPEG) geometry-based point cloud compression (G-PCC) scheme may use three attribute coding methods: region adaptive hierarchical transform (RAHT), predicting transform (PT), and lifting transform (LT). To improve the coding efficiency of PT, we propose to use a Kalman filter to refine the predicted attribute values. We also apply a Kalman filter to improve the quality of the reconstructed attribute values at the decoder side. Experimental results show that the combination of the two proposed methods can achieve an average Bjøntegaard delta bitrate of -0.5%, -5.2%, and -6.3% for the Luma, Chroma Cb, and Chroma Cr components, respectively, compared with a recent G-PCC reference software.

Kalman filter-based prediction refinement and quality enhancement for geometry-based point cloud compression
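
As an illustration of the refinement idea in the abstract above, the following is a minimal sketch of a scalar Kalman filter applied to a sequence of predicted attribute values. The random-walk state model, the noise variances, and the function name `kalman_refine` are assumptions made for this sketch; the abstract does not specify the filter design used within G-PCC.

```python
def kalman_refine(predictions, process_var=1.0, measurement_var=4.0):
    """Refine a sequence of noisy predictions with a scalar Kalman filter.

    Assumed state model (not from the paper): x_k = x_{k-1} + w_k, i.e. a
    random walk with process noise variance `process_var`; each prediction
    is treated as a measurement with noise variance `measurement_var`.
    """
    estimate = predictions[0]
    error_var = measurement_var
    refined = [estimate]
    for z in predictions[1:]:
        # Predict: the state is carried over, its uncertainty grows.
        prior_var = error_var + process_var
        # Update: blend the prior estimate with the new measurement.
        gain = prior_var / (prior_var + measurement_var)
        estimate = estimate + gain * (z - estimate)
        error_var = (1.0 - gain) * prior_var
        refined.append(estimate)
    return refined
```

A larger `measurement_var` makes the filter trust the incoming predictions less and smooth more strongly; a larger `process_var` has the opposite effect.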
13 Fangyu Shen*, Wei Gao

Video-based point cloud compression (V-PCC) is an emerging compression technology that projects the 3D point cloud onto 2D planes and uses high efficiency video coding (HEVC) to encode the projected 2D videos (geometry video and color video). In this work, we propose a rate control algorithm for the all-intra (AI) configuration of V-PCC. Specifically, based on the quality dependency that exists between the projected videos, we develop an optimization formulation to allocate target bits between the geometry video and the color video. Furthermore, we design a two-pass method for HEVC to adapt to the new characteristics of projected videos, which significantly improves the accuracy of rate control. Experimental results demonstrate that our algorithm outperforms V-PCC without rate control in R-D performance with just 0.43% bitrate error.

A Rate Control Algorithm for Video-based Point Cloud Compression
14 Fan Jiang, Xin Jin*, Kedeng Tong

Plenoptic 2.0 videos, which record time-varying light fields with focused plenoptic cameras, are promising for immersive visual applications because they capture densely sampled light fields with high spatial resolution in the rendered sub-apertures. In this paper, an intra prediction method is proposed for compressing multi-focus plenoptic 2.0 videos efficiently. Based on the estimation of the zooming factor, novel gradient-feature-based zooming, adaptive-bilinear-interpolation-based tailoring, and inverse-gradient-based boundary filtering are proposed and executed sequentially to generate accurate prediction candidates for weighted prediction with an adaptive skipping strategy. Experimental results demonstrate the superior performance of the proposed method relative to HEVC and state-of-the-art methods.

Pixel Gradient Based Zooming Method for Plenoptic Intra Prediction
15 Sheng Liu, Liangchen Song, Yi Xu, Junsong Yuan*

Existing human models, e.g., SMPL and STAR, represent the 3D geometry of a human body in the form of a polygon mesh obtained by deforming a template mesh according to a set of shape and pose parameters. The appearance, however, is not directly modeled by most existing human models. We present a novel 3D human model that faithfully models both the 3D geometry and the appearance of a clothed human body with a continuous volumetric representation, i.e., volume densities and emitted colors of continuous 3D locations in the volume encompassing the human body. In contrast to the mesh-based representation, whose resolution is limited by a mesh's fixed number of polygons, our volumetric representation does not limit the resolution of our model. Our volumetric representation can be rendered via differentiable volume rendering, thus enabling us to train the model using only 2D images (without ground truth 3D geometries of human bodies) by minimizing a loss function that measures the differences between rendered images and ground truth images. In contrast, existing human models are trained using ground truth 3D geometries of human bodies. Thanks to its ability to jointly model both the geometries and the appearances of clothed people, our model can benefit applications including human image synthesis, gaming, 3D television, and telepresence.

NeCH: Neural Clothed Human Model
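
To make the rendering step in the abstract above concrete, here is a minimal sketch of how volume densities and emitted colors sampled along one camera ray can be composited into a pixel color. It uses the standard quadrature rule known from neural radiance fields, which is an assumption here since the abstract does not state the exact formulation; the function name `render_ray` is hypothetical.

```python
import math


def render_ray(densities, colors, deltas):
    """Composite samples along one ray into an RGB value.

    densities: volume density sigma_i >= 0 at each sample
    colors:    emitted (r, g, b) at each sample
    deltas:    distance between consecutive samples
    Returns the pixel color and the remaining transmittance.
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        # Opacity contributed by this segment of the ray.
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        for c in range(3):
            pixel[c] += weight * color[c]
        transmittance *= 1.0 - alpha
    return pixel, transmittance
```

Every operation is differentiable with respect to the densities and colors, which is what allows a model of this kind to be trained from a loss between rendered and ground truth images.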
Steganography II
No. Author(s) Title
16 Arjon Das*, Xin Zhong

This paper presents a deep learning–based audio-in-image watermarking scheme. Audio-in-image watermarking is the process of covertly embedding audio watermarks in a cover image and extracting them from it. Using audio watermarks can open up possibilities for different downstream applications. To implement audio-in-image watermarking that adapts to the demands of increasingly diverse situations, a neural network architecture is designed to automatically learn the watermarking process in an unsupervised manner. In addition, a similarity network is developed to recognize the audio watermarks under distortions, thereby providing robustness to the proposed method. Experimental results show the high fidelity and robustness of the proposed blind audio-in-image watermarking scheme.

A Deep Learning–based Audio-in-Image Watermarking Scheme
Visual Communications
No. Author(s) Title
17 Vivien Boussard*, Stephane Coulombe, François-Xavier COUDOUX, Patrick CORLAY, Anthony TRIOUX

This paper analyzes the benefits of extending CRC-based error correction (CRC-EC) to handle more errors in the context of error-prone wireless networks. In the literature, CRC-EC has been used to correct up to 3 binary errors per packet. We first present a theoretical analysis of the CRC-EC candidate list while increasing the number of errors considered. We then analyze the candidate list reduction resulting from subsequent checksum validation and video decoding steps. Simulations conducted on two wireless networks show that the network considered has a huge impact on CRC-EC performance. Over a Bluetooth low energy (BLE) channel with Eb/No=8 dB, an average PSNR improvement of 4.4 dB on videos is achieved when CRC-EC corrects up to 5, rather than 3, errors per packet.

CRC-based Multi Error Correction of H.265 Encoded Videos in Wireless Communications
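
The notion of a CRC-EC candidate list in the abstract above can be illustrated with a brute-force sketch: every pattern of up to `max_errors` bit flips is tried, and the packets whose CRC matches the received CRC are kept as candidates. This is only an illustration under stated assumptions. It uses the CRC-32 of Python's `zlib` rather than the CRC of any particular wireless protocol, it assumes the CRC field itself arrived intact, and exhaustive search is not the method analyzed in the paper; the function name `crc_candidates` is hypothetical.

```python
import zlib
from itertools import combinations


def crc_candidates(received: bytes, expected_crc: int, max_errors: int = 2):
    """List all packets within `max_errors` bit flips of `received`
    whose CRC-32 equals `expected_crc` (brute force, small packets only)."""
    n_bits = len(received) * 8
    candidates = []
    for n_err in range(max_errors + 1):
        for positions in combinations(range(n_bits), n_err):
            trial = bytearray(received)
            for p in positions:
                trial[p // 8] ^= 1 << (p % 8)  # flip one bit
            if zlib.crc32(bytes(trial)) == expected_crc:
                candidates.append(bytes(trial))
    return candidates
```

The number of patterns grows combinatorially with `max_errors` and the packet length, which is why the size of the candidate list, and its reduction by later checksum validation and video decoding steps, matters when more errors per packet are considered.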


Oral Session
Session Chair: Prof. Azeddine Beghdadi (Université Sorbonne Paris Nord, France)
(Special Session) Visual data classification and analysis: In search of the limits of deep learning
Author(s) Title
Manh-Hung Ha, Oscal T.-C. Chen

Comprehensive activity understanding of multiple subjects in a video requires subject detection, action identification, and behavior interpretation, as well as modeling of the interactions among subjects and background. This work develops action recognition of subject(s) based on the correlations and interactions of the whole scene and subject(s) by using a Deep Neural Network (DNN). The proposed DNN consists of a 3D Convolutional Neural Network (CNN), a Spatial Attention (SA) generation layer, a mapping convolutional fused-depth layer, a Transformer Encoder (TE), and two fully connected layers with late fusion for final classification. In particular, the attention mechanisms in SA and TE are used to extract meaningful action information in the spatial and temporal domains, respectively, to enhance recognition performance. The experimental results show that the proposed DNN achieves accuracies of 97.8%, 98.4% and 85.6% on the traffic police, UCF101-24 and JHMDB-21 datasets, respectively. Our DNN is therefore an effective classifier for various action recognition tasks involving one or multiple subjects.

Shanmeng Shi, Cheolkon Jung

In this paper, we propose deep metric learning for human action recognition with SlowFast networks. We adopt SlowFast networks to extract slow-changing spatial semantic information of a single target entity in the spatial domain together with fast-changing motion information in the temporal domain. Since deep metric learning is able to learn the class difference between human actions, we utilize deep metric learning to learn a mapping from the original video to compact features in the embedding space. The proposed network consists of three main parts: 1) two branches independently operating at low and high frame rates to extract spatial and temporal features; 2) feature fusion of the two branches; 3) a joint training network with deep metric learning and classification losses. Experimental results on the KTH human action dataset demonstrate that the proposed method achieves a faster runtime with a smaller model size than C3D and R3D, while ensuring high accuracy.

Deep Metric Learning for Human Action Recognition with SlowFast Networks
Lan Zhou, Hui Yuan, Chuan Ge

We propose a convolutional long short-term memory (ConvLSTM)-based neural network for video semantic segmentation. The network captures the temporal information between frames through the ConvLSTM module to improve the prediction accuracy. The backbone network uses dense connections, atrous convolution, and a pooling pyramid structure to expand the receptive field. During training, data augmentation and learning rate decay strategies are used to avoid overfitting. The proposed method is end-to-end trainable and is evaluated on the street scene benchmark Cityscapes. Experimental results show that adding the ConvLSTM module to exploit the temporal information between frames improves the performance of video semantic segmentation, especially for dynamic objects and small objects such as trucks, pedestrians, and poles.

ConvLSTM-based Neural Network for Video Semantic Segmentation
Coffee break
Poster Session
Session Chair: Prof. Guangtao Zhai (Shanghai Jiao Tong University, China)
Image/Video Quality Assessment II
No. Author(s) Title
1 Yingjie Feng, Sumei Li, Sihan Hao

In recent years, deep learning has achieved significant progress in many respects. However, unlike other research fields with millions of labeled samples, such as image recognition, only several thousand labeled images are available for deep learning in the image quality assessment (IQA) field, which heavily hinders the development and application of IQA. To tackle this problem, in this paper, we propose an error self-learning semi-supervised method for no-reference (NR) IQA (ESSIQA), which is based on deep learning. We employ an advanced full-reference (FR) IQA method to expand databases and supervise the training of the network. In addition, the network outputs for the expanded images are used as proxy labels, replacing the errors between subjective scores and objective scores, to achieve error self-learning. Two weights for error back propagation are designed to reduce the impact of inaccurate outputs. The experimental results show that the proposed method yields comparable performance.

An Error Self-Learning Semi-supervised Method for No-Reference Image Quality Assessment
2 Tao Wang, Wei Sun, Xiongkuo Min, Wei Lu, Zicheng Zhang, Guangtao Zhai

With the development of the game industry and the popularization of mobile devices, mobile games have played an important role in people’s entertainment life. The aesthetic quality of mobile game images determines the users’ Quality of Experience (QoE) to a certain extent. In this paper, we propose a multi-task deep learning based method to evaluate the aesthetic quality of mobile game images in multiple dimensions (i.e. the fineness, color harmony, colorfulness, and overall quality). Specifically, we first extract the quality-aware feature representation by integrating the features from all intermediate layers of the convolutional neural network (CNN) and then map these quality-aware features into the quality score space in each dimension via the quality regressor module, which consists of three fully connected (FC) layers. The proposed model is trained in a multi-task learning manner, where the quality-aware features are shared by the different quality dimension prediction tasks, and the multi-dimensional quality scores of each image are regressed by multiple quality regression modules, one per dimension. We further introduce an uncertainty principle to balance the loss of each task in the training stage. The experimental results show that our proposed model achieves the best performance on the Multi-dimensional Aesthetic assessment for Mobile Game image database (MAMG) among state-of-the-art image quality assessment (IQA) algorithms and aesthetic quality assessment (AQA) algorithms.

A Multi-dimensional Aesthetic Quality Assessment Model for Mobile Game Images
3 Hadi Amirpour, Christian Timmerer, Mohammad Ghanbari, Raimund Schatz

Due to the growing importance of optimizing the quality and efficiency of video streaming delivery, accurate assessment of user-perceived video quality becomes increasingly important. However, due to the wide range of viewing distances encountered in real-world viewing settings, the perceived video quality can vary significantly in everyday viewing situations. In this paper, we investigate and quantify the influence of viewing distance on perceived video quality. A subjective experiment was conducted with full HD sequences at three different fixed viewing distances, with each video sequence being encoded at three different quality levels. Our study results confirm that the viewing distance has a significant influence on the quality assessment. In particular, they show that an increased viewing distance generally leads to increased perceived video quality, especially at low media encoding quality levels. In this context, we also provide an estimation of potential bitrate savings that knowledge of actual viewing distance would enable in practice. Since current objective video quality metrics do not systematically take into account viewing distance, we also analyze and quantify the influence of viewing distance on the correlation between objective and subjective metrics. Our results confirm the need for distance-aware objective metrics when the accurate prediction of perceived video quality in real-world environments is required.

On the Impact of Viewing Distance on Perceived Video Quality
4 Guanghui Yue, Siying Li, Yuan Li, Xue Bai, Jingfeng Du, Tianfu Wang

Owing to unavoidable improper operation or device limitations, colorectal endoscopic images (CEIs) collected in clinics often contain distortions such as noise, blur, underexposure/overexposure, bubbles, and occlusion by floating objects. Low-quality images impede the visual interpretation of endoscopy and subsequent disease analysis. Therefore, the quality assessment of endoscopic images is of great significance. Unfortunately, very few attempts have been made to investigate this issue over the past several decades. In this study, we carry out an in-depth investigation of the quality assessment of CEIs. First, we collected 800 authentically distorted images with diverse contents during colorectal endoscopy. Then, a subjective experiment was conducted to obtain a specific Colorectal Endoscopic Image Quality Assessment Database (CEIQAD) with the mean opinion score for each image under strict scoring rules. Finally, we investigated the feasibility of utilizing well-known no-reference (NR) image quality assessment (IQA) methods designed for natural scene images to tackle the IQA problem of CEIs. Experimental results on CEIQAD demonstrate that existing mainstream NR IQA methods achieve only mediocre prediction performance, and that there is an urgent need to design specific IQA methods for CEIs.

Subjective Quality Assessment of Colorectal Endoscopic Images
5 Zicheng Zhang, Wei Sun, Xiongkuo Min, Tao Wang, Wei Lu, Guangtao Zhai

Compressed image quality assessment (IQA) is a crucial part of a wide range of image services such as storage and transmission. Due to the effect of different bit rates and compression methods, compressed images usually have different levels of quality. Mainstream full-reference (FR) metrics are effective at predicting the quality of compressed images at coarse-grained levels; however, they may perform poorly when the quality differences between the compressed images are quite subtle. To improve the Quality of Experience (QoE) and provide useful guidance for compression algorithms, we propose an FR-IQA metric for fine-grained compressed images, which estimates the image quality by analyzing the differences in structure and texture. Our metric is validated mainly on the fine-grained compression IQA (FGIQA) database and is also tested on other commonly used compression IQA databases. The experimental results show that our metric outperforms mainstream full-reference metrics on the fine-grained compression IQA database and also obtains competitive performance on the coarse-grained compression IQA databases.

A Full-Reference Quality Assessment Metric for Fine-Grained Compressed Images
Machine Learning for Multimedia II
No. Author(s) Title
6 Cong Zou, Xuchen Wang, Yaosi Hu, Zhenzhong Chen, Shan Liu

Video captioning is considered to be challenging due to the combination of video understanding and text generation. Recent progress in video captioning has been made mainly using methods of visual feature extraction and sequential learning. However, the syntactic structure and semantic consistency of generated captions have not been fully explored. Thus, in our work, we propose a novel multimodal attention based framework with Part-of-Speech (POS) sequence guidance to generate more accurate video captions. In general, the word sequence generation and POS sequence prediction are hierarchically and jointly modeled in the framework. Specifically, different modalities including visual, motion, object and syntactic features are adaptively weighted and fused with the POS guided attention mechanism when computing the probability distributions of predicted words. Experimental results on two benchmark datasets, i.e. MSVD and MSR-VTT, demonstrate that the proposed method can not only fully exploit the information from video and text content, but also focus on the decisive feature modality when generating a word with a certain POS type. Thus, our approach boosts the video captioning performance and generates idiomatic captions.

MAPS: Joint Multimodal Attention and POS Sequence Generation for Video Captioning
7 Xinzhu Cao, Yuanzhi Yao, Nenghai Yu

Image-to-image translation tasks, which have been widely investigated with generative adversarial networks (GANs), aim to map an image from the source domain to the target domain. The translated image can be inversely mapped to the reconstructed source image. However, existing GAN-based schemes lack the ability to accomplish reversible translation. To remedy this drawback, this paper proposes a nearly reversible image-to-image translation scheme in which the reconstructed source image is approximately distortion-free compared with the corresponding source image. The proposed scheme jointly considers inter-frame coding and embedding. Firstly, we organize the GAN-generated reconstructed source image and the source image into a pseudo video. Secondly, the bitstream obtained by inter-frame coding is reversibly embedded in the translated image for nearly lossless source image reconstruction. Extensive experimental results and analysis demonstrate that the proposed scheme can achieve a high level of performance in image quality and security.

Nearly Reversible Image-to-Image Translation Using Joint Inter-Frame Coding and Embedding
8 Yuanquan Xu, Cheolkon Jung

Although most existing methods based on the 3D morphable model (3DMM) need annotated parameters as ground truth for training, only a few datasets contain them. Moreover, it is difficult to acquire accurate 3D face models aligned with the input images due to the gap in dimensions. In this paper, we propose a face 2D to 3D reconstruction network based on head pose and 3D facial landmarks. We build a head pose guided face reconstruction network to regress an accurate 3D face model with the help of 3D facial landmarks. Unlike 3DMM parameters, head pose and 3D facial landmarks can be successfully estimated even for in-the-wild images. Experiments on the 300W-LP, AFLW2000-3D and CelebA HQ datasets show that, thanks to 3D facial landmarks, the proposed method successfully reconstructs a 3D face model from a single RGB image and achieves state-of-the-art performance in terms of the normalized mean error (NME).

Face 2D to 3D Reconstruction Network Based on Head Pose and 3D Facial Landmarks
Multi- and Hyper-spectral Image Communications
No. Author(s) Title
9 Frank Sippel, Jürgen Seiler, Andre Kaup

In many image processing tasks, pixels or blocks of pixels are missing or lost in only some channels. For example, during defective transmissions of RGB images, one or more blocks in one color channel may be lost. Nearly all modern applications in image processing and transmission use at least three color channels, and some applications employ even more bands, for example in the infrared and ultraviolet regions of the light spectrum. Typically, only some pixels and blocks in a subset of color channels are distorted. Thus, other channels can be used to reconstruct the missing pixels, which is called spatio-spectral reconstruction. Current state-of-the-art methods rely purely on the local neighborhood, which works well for homogeneous regions. However, in high-frequency regions such as edges or textures, these methods fail to properly model the relationship between color bands. Hence, this paper introduces non-local filtering for building a linear regression model that describes the inter-band relationship and is used to reconstruct the missing pixels. Our novel method increases the PSNR by 2 dB on average and yields visually much more appealing images in high-frequency regions.

Spatio-spectral Image Reconstruction Using Non-local Filtering
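
The inter-band linear regression mentioned in the abstract above can be sketched in a few lines. This minimal example fits a single line between an intact reference channel and the known pixels of the damaged channel, then predicts the missing pixels. It is a simplification: the non-local selection of support pixels that the paper introduces is not reproduced, and the function names `fit_line` and `reconstruct_channel` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a * x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        # Flat reference region: fall back to the mean of the known pixels.
        return 0.0, mean_y
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov / var_x
    return a, mean_y - a * mean_x


def reconstruct_channel(reference, damaged, missing):
    """Fill the pixels of `damaged` listed in `missing`.

    reference: intact channel as a flat list of pixel values
    damaged:   channel with lost pixels (values at `missing` are ignored)
    missing:   set of indices of lost pixels
    """
    known = [i for i in range(len(damaged)) if i not in missing]
    a, b = fit_line([reference[i] for i in known],
                    [damaged[i] for i in known])
    restored = list(damaged)
    for i in missing:
        restored[i] = a * reference[i] + b
    return restored
```

A single global fit like this works only where one linear relationship holds across the region; choosing the pixels that enter the regression more carefully is exactly where local and non-local approaches differ.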
10 Jiayu Xie, Xin Jin, Hongkun Cao

Multi-modal image registration has received increasing attention in computer vision and computational photography. However, nonlinear intensity variations prevent accurate feature point matching between image pairs of different modalities. Thus, a robust image descriptor for multi-modal image registration is proposed, named the shearlet-based modality robust descriptor (SMRD). The anisotropic features of edge and texture information at multiple scales are encoded to describe the region around a point of interest based on the discrete shearlet transform. We conducted experiments to compare the proposed SMRD with several state-of-the-art multi-modal/multispectral descriptors on four different multimodal datasets. The experimental results show that our SMRD outperforms the other methods in terms of precision, recall and F1-score.

SMRD: A Local Feature Descriptor for Multi-modal Image Registration
11 Dong-Jae Lee, Kang-Kyu Lee, Jong-Ok Kim

Since the invention of the electric bulb, most of the lights surrounding us have been powered by alternating current (AC). The resulting intensity variation can be captured with a high-speed camera, and the intensity difference between consecutive video frames can be utilized for various vision tasks. For color constancy, conventional methods usually focus on exploiting only spatial features. To overcome the limitations of conventional methods, a couple of methods that utilize AC flickering have been proposed. The previous work employed the temporal correlation between high-speed video frames. To further enhance the previous work, we propose a deep spatio-temporal color constancy method using spatial and temporal correlations. To extract temporal features for illuminant estimation, we calculate the temporal correlation between feature maps in which global as well as local features are learned. By learning global features through spatio-temporal correlation, the proposed method can estimate illumination more accurately, and is particularly robust to noisy practical environments. The experimental results demonstrate that the performance of the proposed method is superior to that of existing methods.

Deep Color Constancy Using Spatio-Temporal Correlation of High-Speed Video
12 Khaoula SAKRANI, Sinda Elghoul, Sarra Falleh, Faouzi Ghorbel

Here we propose a novel affine registration method for planar curves. It is based on a pseudo-inverse algorithm applied to the source and target curves in their multi-scale version. The proposed registration system selects the relevant scales in the optimized L2 distances. The retrieved smoothing parameters are realized with the Gaussian Expectation-Maximization (EM) algorithm. We resolve the global system, formed by equations corresponding to EM selected scales.

SA(2,R) Multi-scale contour registration based on EM Algorithm
Rate Control and Bit Allocation for Video Coding
No. Author(s) Title
13 Christian R. Helmrich, Ivan Zupancic, Jens Brandenburg, Valeri George, Adam Wieckowski, Benjamin Bross

Two-pass rate control (RC) schemes have proven useful for generating low-bitrate video-on-demand or streaming catalogs. Visually optimized encoding particularly using latest-generation coding standards like Versatile Video Coding (VVC), however, is still a subject of intensive study. This paper describes the two-pass RC method integrated into version 1 of VVenC, an open VVC encoding software. The RC design is based on a novel two-step rate-quantization parameter (R-QP) model to derive the second-pass coding parameters, and it uses the low-complexity XPSNR visual distortion measure to provide numerically as well as visually stable, perceptually R-D optimized encoding results. Random-access evaluation experiments confirm the improved objective as well as subjective performance of our RC solution.

Visually Optimized Two-Pass Rate Control for Video Coding Using the Low-Complexity XPSNR Model
14 Xinye Jiang, Zhenyu Liu, Yongbing Zhang, Xiangyang Ji

Rate-distortion optimization (RDO) is widely used in video coding to improve coding efficiency. Conventionally, RDO is applied to each block independently to avoid high computational complexity. However, various prediction techniques introduce spatio-temporal dependency between blocks, so independent RDO is not optimal. Specifically, because of motion compensation, the distortion of reference blocks affects the quality of subsequent prediction blocks, and considering this temporal dependency in RDO can improve the global rate-distortion (R-D) performance. x265 uses a lookahead module to analyze the temporal dependency between blocks and weights the quality of each block based on its reference strength. However, the original algorithm in x265 ignores the impact of quantization, and this shortcoming degrades the R-D performance of x265. In this paper, we propose a new linear distortion propagation model to estimate the temporal dependency, which takes the impact of quantization into account. From the perspective of global RDO, a corresponding adaptive quantization formula is also presented. The proposed algorithm was implemented in x265 version 3.2. Experiments show that the proposed algorithm achieves average PSNR-based and SSIM-based BD-rate reductions of 15.43% and 23.81%, outperforming the original algorithm in x265 by 4.14% and 9.68%, respectively.

A Distortion Propagation Oriented CU-tree Algorithm for x265
15 Zizheng Liu, Zhenzhong Chen, Shan Liu

In this paper, we propose a two-stage optimal bit allocation scheme for the HEVC hierarchical coding structure. The two stages, i.e., the frame-level and the CTU-level bit allocation, are conducted separately in traditional rate control methods. In our proposed method, the optimal allocation in the second stage is considered first, and the second-stage allocation strategy is then treated as prior knowledge in the first stage and used to guide the frame-level bit allocation. With this formulation, the two-stage bit allocation problem can be converted into a joint optimization problem. By solving the formulated optimization problem, the two-stage optimal bit allocation scheme is established, in which a more appropriate number of bits can be allocated to each frame and each CTU. The experimental results show that our proposed method achieves higher coding efficiency while precisely satisfying the bit rate constraint.

Two stage optimal bit allocation for HEVC hierarchical coding structure
16 Guangjie Ren, Zizheng Liu, Zhenzhong Chen, Shan Liu

In this paper, we propose a reinforcement learning based region of interest (ROI) bit allocation method for gaming video coding in Versatile Video Coding (VVC). Most current ROI-based bit allocation methods rely on bit budgets based on frame-level empirical weight allocation. The restricted bit budgets influence the efficiency of ROI-based bit allocation and the stability of video quality. To address this issue, the bit allocation processes of the frame and the ROI are combined and formulated as a Markov decision process (MDP). A deep reinforcement learning (RL) method is adopted to solve this problem and obtain the appropriate bits for the frame and the ROI. Our target is to improve the quality of the ROI and reduce the frame-level quality fluctuation, whilst satisfying the bit budget constraint. The RL-based ROI bit allocation method is implemented in the latest video coding standard and verified for gaming video coding. The experimental results demonstrate that the proposed method achieves a better ROI quality while reducing the quality fluctuation compared to the reference methods.

Reinforcement Learning based ROI Bit Allocation for Gaming Video Coding in VVC
17 Jun Fu, Chen Hou, Zhibo Chen

Recently, reinforced adaptive bitrate (ABR) algorithms have achieved remarkable success in tile-based 360-degree video streaming. However, they heavily rely on accurate viewport prediction. To alleviate this issue, we propose a hierarchical reinforcement-learning (RL) based ABR algorithm, dubbed 360HRL. Specifically, 360HRL consists of a top agent and a bottom agent. The former is used to decide whether to download a new segment for continuous playback or re-download an old segment for correcting wrong bitrate decisions caused by inaccurate viewport estimation, and the latter is used to select bitrates for tiles in the chosen segment. In addition, 360HRL adopts a two-stage training methodology. In the first stage, the bottom agent is trained under the environment where the top agent always chooses to download a new segment. In the second stage, the bottom agent is fixed and the top agent is optimized with the help of a heuristic decision rule. Experimental results demonstrate that 360HRL outperforms existing RL-based ABR algorithms across a broad range of network conditions and quality of experience (QoE) objectives.

360HRL: Hierarchical Reinforcement Learning Based Rate Adaptation for 360-Degree Video Streaming
Lunch break
Keynote III
Design Space Exploration for AI Hardware Architectures
Presenter: Holger Blume (LUH, Germany)
Moderator: Jörn Ostermann (Leibniz Universität Hannover, Germany)
Panel II
3D Volumetric Media: Technology Challenges and Future Applications
Moderator: Zhu Li (UMKC, USA)
Coffee break
Oral Session
Session Chair: Prof. Marc Antonini and Dr. Melpomeni Dimopoulou (University Côte d’Azur and CNRS, France)
(Special Session) Recent Advances and Challenges of DNA Digital Data Storage
Author(s) Title
Giulio Franzese, Yiqing YAN, Giuseppe Serra, Ivan D’Onofrio, Raja Appuswamy, Pietro Michiardi

Synthetic DNA has received much attention recently as a long-term archival medium alternative due to its high density and durability characteristics. However, most current work has primarily focused on using DNA as a precise storage medium. In this work, we take an alternate view of DNA. Using neural-network-based compression techniques, we transform images into a latent-space representation, which we then store on DNA. By doing so, we transform DNA into an approximate image storage medium, as images generated back from DNA are only approximate representations of the original images. Using several datasets, we investigate the storage benefits of approximation, and study the impact of DNA storage errors (substitutions, indels, bias) on the quality of approximation. In doing so, we demonstrate the feasibility and potential of viewing DNA as an approximate storage medium.

Generative DNA: Representation Learning for DNA-based Approximate Image Storage
Arnav Solanki, Tonglin Chen, Marc Riedel

Ever since Watson and Crick first described the molecular structure of DNA, its information-bearing potential has been apparent to computer scientists. This has led to a concerted effort in academia and industry to deliver practical DNA data storage systems. This paper presents a novel approach for both storage and computation with DNA. Data is stored in the form of analog values of the relative concentration of different DNA molecules. Computation is in the form of cascadable NAND operations, effected via toehold-mediated strand displacement reactions operating on these concentration values. Results were verified with the “Peppercorn Enumerator,” a recent software tool for analyzing domain-level strand displacement. In all cases, the relative error in output concentration was less than 0.03%. The approach is robust to encoding errors and cross hybridization. It does not rely on long DNA strands, which are expensive to synthesize. It opens new avenues for storage and computing, including the implementation of a wide range of useful mathematical functions in vitro.

Cascadable Stochastic Logic for DNA Storage
Andreas Lenz, Paul H Siegel, Antonia Wachter-Zeh, Eitan Yaakobi

Advances in biochemical technologies, such as synthesizing and sequencing devices, have fueled manifold recent experiments on archival digital data storage using DNA. In this paper we review and analyze recent results on information-theoretic aspects of such storage systems. The discussion focuses on a channel model that incorporates the main properties of DNA-based data storage. Namely, the user data is synthesized many times onto a large number of short-length DNA strands. The receiver then draws strands from the stored sequences in an uncontrollable manner. Since the synthesis and sequencing are prone to errors, a received sequence can differ from its original strand, and their relationship is described by a probabilistic channel. Recently, the capacity of this channel was derived for the case of substitution errors inside the sequences. We review the main techniques used to prove a coding theorem and its converse, showing the achievability of the capacity and the fact that it cannot be exceeded. We further provide an intuitive interpretation of the capacity formula for relevant channel parameters, compare with sub-optimal decoding methods, and conclude with a discussion on cost-efficiency.

On the Capacity of DNA-based Data Storage under Substitution Errors
Eva GIL SAN ANTONIO, Marc Antonini

The exponential increase of digital data and the limited capacity of current storage devices have made clear the need for exploring new storage solutions. Thanks to its biological properties, DNA has proven to be a potential candidate for this task, allowing the storage of information at a high density for hundreds or even thousands of years. With the release of nanopore sequencing technologies, DNA data storage is one step closer to becoming a reality. Many works have proposed solutions for the simulation of this sequencing step, aiming to ease the development of algorithms addressing nanopore-sequenced reads. However, these simulators target the sequencing of complete genomes, whose characteristics differ from those of synthetic DNA. This work presents a nanopore sequencing simulator targeting synthetic DNA in the context of DNA data storage.

Nanopore Sequencing Simulator for DNA Data Storage
Hanyu Li, Sabrina Racine-Brzostek, Nan Xi, Jiwen Luo, Zhen Zhao, Junsong Yuan

Monoclonal protein (M-protein) detection with electrophoresis is of vital importance for the diagnosis of lymphoproliferative processes and monoclonal gammopathies (MGs). Although identifying M-proteins is key for the diagnosis and monitoring of these disorders, it requires specialized knowledge and is time consuming and labor intensive. Although powerful machine learning methods exist, they often require a large amount of labeled training data, which is difficult to obtain. Besides, electrophoresis image quality can vary dramatically, affecting the proper identification of M-protein. To address these challenges, we propose to represent electrophoresis images using a Gaussian Mixture Model (GMM) and leverage a peak detection method to identify visual features for M-protein detection. Utilizing a random forest classifier, our method can train the model with a small amount of labeled data and is not sensitive to samples of varying quality. Furthermore, with extracted image features, it is possible for specially trained technologists and pathologists to understand and check the decision process of the learned model. Extensive experiments indicate our proposed method achieves satisfactory results on test data, demonstrating the effectiveness and robustness of the proposed model for M-protein detection.

Learning to Detect Monoclonal Protein in Electrophoresis Images
Poster Session
Session Chair: TBD
Detection, Tracking, and Pose Estimation
No. Author(s) Title
1 Mateusz Majcher, Bogdan Kwolek

In this work we propose a new mixed-input neural network for instance level monocular object tracking. The proposed Y-like architecture has two inputs and one output. An object ID together with a quaternion representing the object rotation in the previous frame is fed to the first input, whereas the object sub-window is fed to the second input. We demonstrate that on the basis of quaternions the neural network learns attention blobs representing the object rotation in the previous frame. A single neural network has been trained for six objects to estimate their fiducial points in sequences of RGB images. A tracking by optimization approach has been leveraged in the experiments. The algorithm has been evaluated on the OPT benchmark dataset for 6DoF object pose tracking as well as a custom dataset including image sequences with both real and rendered objects.

Quaternion-driven CNN for Object Pose Tracking
2 Lee Aing, Wen-Nung Lie, Jui-Chiu Chiang

Predicting/estimating the 6DoF pose parameters for multi-instance objects accurately and quickly is an important issue in robotics and computer vision. Even though some bottom-up methods have been proposed to estimate multiple instance poses simultaneously, their accuracy is not good enough when compared to state-of-the-art top-down methods, and their processing speed still cannot meet the needs of practical applications. In this paper, we present a faster and finer bottom-up approach based on a deep convolutional neural network to estimate poses of the object pool even when multiple instances of the same object category exhibit heavy occlusion/overlap. Several techniques, such as prediction of a semantic segmentation map, multiple keypoint vector fields, and a 3D coordinate map, as well as diagonal graph clustering, are proposed and combined to achieve this purpose. Experimental results and ablation studies show that the proposed system can achieve comparable accuracy at a speed of 24.7 frames per second for up to 7 objects by evaluation on the well-known Occlusion LINEMOD dataset.

Faster and Finer Pose Estimation for Object Pool in a Single RGB Image
3 Wen-Jiin Tsai, Hsuan-Jen Pan

Anomaly detection is an important task in many traffic applications. Methods based on deep learning networks reach high accuracy; however, they typically rely on supervised training with large annotated datasets. Considering that anomalous data are not easy to obtain, we present data transformation methods which convert the data obtained from one intersection to other intersections to mitigate the effort of collecting training data. The proposed methods are demonstrated on the task of anomalous trajectory detection. A General model and a Universal model are proposed. The former focuses on saving data collection effort; the latter further reduces the network training effort. We evaluated the methods on a dataset with trajectories from four intersections in the GTA V virtual world. The experimental results show that with significant reduction in data collecting and network training efforts, the proposed anomalous trajectory detection still achieves state-of-the-art accuracy.

Data Transformer for Anomalous Trajectory Detection
4 Tsaipei Wang, Chih-Hao Liao, Li-Hsuan Hsieh, Arvin Wen Tsui, Hsin-Chien Huang

In this paper we study techniques for accurate detection, localization, and tracking of multiple people in an indoor scene covered by multiple top-view fisheye cameras. This is a rarely studied setting within the topic of multi-camera object tracking. The experimental results on test videos exhibit good performance for practical use. We also propose methods to account for occlusion by scene objects at different stages of the algorithm that lead to improved results.

People Detection and Tracking Using a Fisheye Camera Network
5 Zhiruo Zhou, Hongyu Fu, Christoph C. Borel-Donohue, Suya You, C.-C. Jay Kuo

An unsupervised online object tracking method that exploits both foreground and background correlations is proposed and named UHP-SOT (Unsupervised High-Performance Single Object Tracker) in this work. UHP-SOT consists of three modules: 1) appearance model update, 2) background motion modeling, and 3) trajectory-based box prediction. A state-of-the-art discriminative correlation filters (DCF) based tracker is adopted by UHP-SOT as the first module. We point out shortcomings of using the first module alone such as failure in recovering from tracking loss and inflexibility in object box adaptation and then propose the second and third modules to overcome them. Both are novel in single object tracking (SOT). We test UHP-SOT on two popular object tracking benchmarks, TB-50 and TB-100, and show that it outperforms all previous unsupervised SOT methods, achieves a performance comparable with the best supervised deep-learning-based SOT methods, and operates at a fast speed (i.e. 22.7-32.0 FPS on a CPU).

UHP-SOT: An Unsupervised High-Performance Single Object Tracker
Image and Video Compression Beyond Standards II
No. Author(s) Title
6 Yuzhuo Wei, Li Chen, Li Song

With the blooming of deep learning technology in the field of computer vision, the integration of deep learning techniques and the traditional video coding framework has made significant performance improvements, especially when super-resolution neural networks are applied as the post-processing module in a down-sampling-based video compression framework in low-bitrate scenarios. However, due to the non-differentiability of the traditional video codec, the pre-processing module lacks back-propagated gradients, and it is difficult to realize the potential for better compression performance by jointly considering down-sampling and up-sampling. In this paper, we propose an end-to-end down-sampling-based video compression framework applying convolutional neural networks for both down-sampling and up-sampling. We use a virtual codec neural network to approximate the projection between the uncompressed frame and the decoded frame so that the gradient can be effectively back-propagated for joint training. Experimental results show the superiority of our proposed framework compared with the predefined down-sampling-based video compression and various methods of approximated joint training.

Video Compression based on Jointly Learned Down-Sampling and Super-Resolution Networks
7 Chaoyi Lin, Yue Li, Kai Zhang, Zhaobin Zhang, Li Zhang

Down-sampling followed by up-sampling is a well-known strategy to compress high-resolution pictures given a limited bandwidth in image as well as video coding. Recently, inspired by the latest advances in image super resolution (SR) technologies using convolutional neural networks (CNNs), CNN-based SR has been explored for resampling-based coding. However, the side information generated during the compression process is not utilized efficiently by the SR network in prior art. In this paper, we propose a CNN-based SR method for video coding, where more side information is leveraged as a supplement to reconstruction samples. Specifically, we introduce prediction samples as the auxiliary information, as they can provide texture and directional information about the original picture. Considering their different characteristics, we design different networks for the luma and chroma components. When designing the chroma up-sampling CNN, the luma reconstruction is used as the auxiliary information of the chroma network, which exploits the cross-component correlation. Experimental results show that the proposed method achieves 11.07% BD-rate savings in all-intra configuration compared with VTM-11.0, which is the reference software for H.266/VVC. Further experiments validate the effectiveness of using luma information to aid the chroma up-sampling process.

CNN-based Super Resolution for Video Coding Using Decoded Information
8 Jens Schneider, Christian Rohlfing

Versatile Video Coding (VVC) introduces the concept of Reference Picture Resampling (RPR), which allows for a resolution change of the video during decoding, without introducing an additional Intra Random Access Point (IRAP) into the bitstream. When the resolution is increased, an upsampling operation of the reference picture is required in order to apply motion compensated prediction. Conceptually, the upsampling by linear interpolation filters fails to recover frequencies which were lost during downsampling. Yet, the quality of the upsampled reference picture is crucial to the prediction performance. In recent years, machine learning based Super-Resolution (SR) has been shown to outperform conventional interpolation filters by far in regard to super-resolving a previously downsampled image. In particular, Dictionary Learning-based Super-Resolution (DLSR) was shown to improve the inter-layer prediction in SHVC [1]. Thus, this paper introduces DLSR to the prediction process in RPR. Further, the approach is experimentally evaluated by an implementation based on the VTM-9.3 reference software. The simulation results show a reduction of the instantaneous bitrate of 0.98 % on average at the same objective quality in terms of PSNR. Moreover, the peak bitrate reduction is measured at 4.74 % for the “Johnny” sequence of the JVET test set.

Dictionary Learning-based Reference Picture Resampling in VVC
9 Hadi Amirpour, Christian Timmerer, Mohammad Ghanbari, Hannaneh B. Pasandi

In per-title encoding, to optimize a bitrate ladder over spatial resolution, each video segment is downscaled to a set of spatial resolutions, and they are all encoded at a given set of bitrates. To find the highest quality resolution for each bitrate, the low-resolution encoded videos are upscaled to the original resolution, and a convex hull is formed based on the scaled qualities. Deep learning-based video super-resolution (VSR) approaches show a significant gain over traditional upscaling approaches, and they are becoming more and more efficient over time. This paper improves per-title encoding over the traditional upscaling methods by using deep neural network-based VSR algorithms. Using a VSR algorithm to improve the quality of low-resolution encodings can improve the convex hull and, as a result, lead to an improved bitrate ladder. To avoid bandwidth wastage at perceptually lossless bitrates, a maximum threshold for the quality is set, and encodings beyond it are eliminated from the bitrate ladder. Similarly, a minimum threshold is set to avoid low-quality video delivery. The encodings between the maximum and minimum thresholds are selected based on one Just Noticeable Difference. Our experimental results show that the proposed per-title encoding results in a 24% bitrate reduction and 53% storage reduction compared to the state-of-the-art method.

Improving Per-title Encoding for HTTP Adaptive Streaming by Utilizing Video Super-resolution
10 Hannah Och, Tilo Strutz, Andre Kaup

Probability distribution modeling is the basis for most competitive methods for lossless coding of screen content. One such state-of-the-art method is known as soft context formation (SCF). For each pixel to be encoded, a probability distribution is estimated based on the neighboring pattern and the occurrence of that pattern in the already encoded image. Using an arithmetic coder, the pixel color can thus be encoded very efficiently, provided that the current color has been observed before in association with a similar pattern. If this is not the case, the color is instead encoded using a color palette or, if it is still unknown, via residual coding. Both palette-based coding and residual coding have significantly worse compression efficiency than coding based on soft context formation. In this paper, the residual coding stage is improved by adaptively trimming the probability distributions for the residual error. Furthermore, an enhanced probability modeling for indicating a new color depending on the occurrence of new colors in the neighborhood is proposed. These modifications result in a bitrate reduction of up to 2.9% on average. Compared to HEVC (HM-16.21 + SCM-8.8) and FLIF, the improved SCF method saves on average about 11% and 18% rate, respectively.

Optimization of Probability Distributions for Residual Coding of Screen Content
Machine Learning for Multimedia III
No. Author(s) Title
11 Birendra Kathariya, Zhu Li, Jianle Chen, Geert Van der Auwera

Federated Learning (FL), a distributed machine learning architecture, emerged to solve the intelligent data analysis on massive data generated at network edge-devices. With this paradigm, a model is jointly learned in parallel at edge-devices without needing to send voluminous data to a central FL server. This not only allows a model to learn in a feasible duration by reducing network latency but also preserves data privacy. Nonetheless, when thousands of edge-devices are attached to an FL framework, limited network resources inevitably impose intolerable training latency. In this work, we propose model-update compression to solve this issue in a very novel way. The proposed method learns multiple Gaussian distributions that best describe the high dimensional gradient parameters. In the FL server, high dimensional gradients are repopulated from Gaussian distributions utilizing likelihood function parameters which are communicated to the server. Since the distribution information parameters constitute a very small percentage of values compared to the high dimensional gradients themselves, our proposed method is able to save significant uplink bandwidth while preserving the model accuracy. Experimental results validated our claim.

Gradient Compression with a Variational Coding Scheme for Federated Learning
12 Neng Zhang, Ebroul Izquierdo

Camera calibration for sport videos enables precise and natural delivery of graphics on video footage and several other special effects. This in turn substantially improves the audience's visual experience and facilitates sports analysis during or after the live show. In this paper, we propose a high accuracy camera calibration method for sport videos. First, we generate a homography database by uniformly sampling camera parameters. This database includes more than 91 thousand different homography matrices. Then, we use a conditional generative adversarial network (cGAN) to perform semantic segmentation, splitting the broadcast frames into four classes. In a subsequent processing step, we build an effective feature extraction network to extract features from the semantically segmented images. After that, we search for the feature in the database to find the best matching homography. Finally, we refine the homography by image alignment. In a comprehensive evaluation using the 2014 World Cup dataset, our method outperforms other state-of-the-art techniques.

A High Accuracy Camera Calibration Method for Sport Videos
Multimedia Content Analysis, Representation, and Understanding II
No. Author(s) Title
13 Tian Zhang, Dongliang Chang, Zhanyu Ma, Jun Guo

Fine-grained visual classification aims to recognize images belonging to multiple sub-categories within the same category. It is a challenging task due to the inherently subtle variations among highly-confused categories. Most existing methods only take an individual image as input, which may limit the ability of models to recognize contrastive clues from different images. In this paper, we propose an effective method called progressive co-attention network (PCA-Net) to tackle this problem. Specifically, we calculate the channel-wise similarity by encouraging interaction between the feature channels within same-category image pairs to capture the common discriminative features. Considering that complementary information is also crucial for recognition, we erase the prominent areas enhanced by the channel interaction to force the network to focus on other discriminative regions. The proposed model has achieved competitive results on three fine-grained visual classification benchmark datasets: CUB-200-2011, Stanford Cars, and FGVC Aircraft.

Progressive co-attention network for fine-grained visual classification
14 Rijun Liao, Zhu Li, Shuvra Bhattacharyya, George York

With the development of airplane platforms, aerial image classification plays an important role in a wide range of remote sensing applications. The number of samples in most aerial image datasets is very limited compared with other computer vision datasets. Unlike many works that use data augmentation to solve this problem, we adopt a novel strategy, called label splitting, to deal with limited samples. Specifically, each sample keeps its original semantic label, and we assign a new appearance label to each sample via unsupervised clustering. Then an optimized triplet loss learning is applied to distill domain specific knowledge. This is achieved through a binary tree forest partitioning and a triplet selection and optimization scheme that controls the triplet quality. Simulation results on the NWPU, UCM and AID datasets demonstrate that the proposed solution achieves state-of-the-art performance in aerial image classification.

Aerial Image Classification with Label Splitting and Optimized Triplet Loss Learning
15 Zhigang Chang, Zhao Yang, Yongbiao Chen, Zhou Qin, Shibao Zheng

Video-based person re-identification (Re-ID) aims to match person images in video sequences captured by disjoint surveillance cameras. Traditional video-based person Re-ID methods focus on exploring appearance information and are thus vulnerable to illumination changes, scene noise, camera parameters, and especially clothes/carrying variations. Gait recognition provides an implicit biometric solution to alleviate these problems. Nonetheless, it suffers severe performance degradation as the camera view varies. In an attempt to address these problems, in this paper, we propose a framework that utilizes the sequence masks (SeqMasks) in the video to closely integrate appearance information and gait modeling. Specifically, to sufficiently validate the effectiveness of our method, we build a novel dataset named MaskMARS based on MARS. Comprehensive experiments on our proposed large wild video Re-ID dataset MaskMARS demonstrate the strong performance and generalization capability of our method. Validations on the gait recognition dataset CASIA-B further demonstrate the capability of our hybrid model.

Seq-Masks: Bridging the gap between appearance and gait modeling for video-based person re-identification
16 Bingxin Hou, Ying Liu, Nam Ling, Lingzhi Liu, Yongxiong Ren, Ming Kai Hsu

Deep learning methods have been actively applied in moving object detection and achieved great performance. However, many existing models achieve superior detection accuracy at the cost of high computational complexity and slow inference speed, which hinders their application on mobile and embedded devices with limited computing resources. In this paper, we propose a two-branch 3D separable convolutional neural network named “F3DsCNN” for moving object detection. The network extracts both high-level global features and low-level detailed features. It achieves a fast inference speed of 120 frames per second, suitable for tasks that need to be carried out in a timely manner on a computationally limited platform with high accuracy.

F3DsCNN: A Fast Two-Branch 3D Separable CNN for Moving Object Detection
17 Yurong Guo, Zhanyu Ma, Xiaoxu Li, Yuan Dong

Recently, graph neural networks (GNNs) have shown a powerful ability to handle the few-shot classification problem, which aims at classifying unseen samples when trained with limited labeled samples per class. GNN-based few-shot learning architectures mostly replace the traditional metric with a learnable GNN. In the GNN, the nodes are set as the samples’ embeddings, and the relationship between two connected nodes can be obtained by a network, the input of which is the difference of their embedding features. We consider that this method of measuring the relation of samples only models the sample-to-sample relation, while neglecting the specificity of different tasks. That is, this method of measuring relation does not take the task-level information into account. To this end, we propose a new relation measure method, namely the task-level relation module (TLRM), to explicitly model the task-level relation of one sample to all the others. The proposed module captures the relation representations between nodes by considering the sample-to-task instead of sample-to-sample embedding features. We conducted extensive experiments on four benchmark datasets: mini-ImageNet, tiered-ImageNet, CUB-200-2011, and CIFAR-FS. Experimental results demonstrate that the proposed module is effective for GNN-based few-shot learning.

TLRM: Task-level Relation Module for GNN-based Few-Shot Learning
18:30:00 Awards and Farewell Session