Mirror and Glass Detection/Segmentation

In this project, we are developing techniques for mirror and glass detection/segmentation. While a mirror is a reflective surface that reflects the scene in front of it, glass is a transparent surface that transmits the scene from the back side and often also reflects the scene in front of it too. In general, both mirrors and glass do not have their own visual appearances. They only reflect/transmit the appearances of their surroundings.

As mirrors and glass do not have their own appearances, it is not straightforward to develop automatic algorithms to detect and segment them. However, as they appear everywhere in our daily life, it can be problematic if we are not able to detect them reliably. For example, a vision-based depth sensor may falsely estimate the depth of a piece of mirror/glass as the depth of the objects inside it, a robot may not be aware of the presence of a mirror/glass wall, and a drone may collide into a high rise (noted that most high rises are covered by glass these days).

To the best of our knowledge, my team is the first to develop computational models for automatic detection and segmentation of mirror and transparent glass surfaces. Although there have been some works that investigate the detection of transparent glass objects, these methods mainly focus on detecting wine glass and small glass objects, which have some special visual properties that can be used for detection. Unlike these works, we are more interested in detecting general glass surfaces that may not possess any special properties of their own.

We are also interested in exploring the application of our mirror/glass detection methods in autonomous navigation.

Multi-view Dynamic Reflection Prior for Video Glass Surface Detection [paper] [suppl] [model] [dataset]

Fang Liu, Yuhao Liu, Jiaying Lin, Ke Xu, and Rynson Lau

Proc. AAAI, Feb. 2024

Comparison of our VGSD-Net with the state-of-the-art image-based glass detection methods, GlassNet (Lin, He, and Lau 2021) and PGSNet (Yu et al. 2022). They produce temporal-inconsistent results when applied to the VGSD task, as they do not exploit any temporal information. In contrast, our method learns dynamic reflection cues from the video, yielding more accurate and robust results.

Input-Output: Given an input video, our network outputs a sequence of binary masks indicating where the glass surfaces are in each frame.

Abstract. Recent research has shown significant interest in image-based glass surface detection (GSD). However, detecting glass surfaces in dynamic scenes remains largely unexplored due to the lack of a high-quality dataset and an effective video glass surface detection (VGSD) method. In this paper, we propose the first VGSD approach. Our key observation is that reflections frequently appear on glass surfaces, but they change dynamically as the camera moves. Based on this observation, we propose to offset the excessive dependence on a single uncertainty reflection via joint modeling of temporal and spatial reflection cues. To this end, we propose the VGSD-Net with two novel modules: a Location-aware Reflection Extraction (LRE) module and a Context-enhanced Reflection Integration (CRI) module, for the position-aware reflection feature extraction and the spatial-temporal reflection cues integration, respectively. We have also created the first large-scale video glass surface dataset (VGSD-D), consisting of 19,166 image frames with accurately-annotated glass masks extracted from 297 videos. Extensive experiments demonstrate that VGSD-Net outperforms state-of-the-art approaches adapted from related fields.

ZOOM: Learning Video Mirror Detection with Extremely-Weak Supervision [paper] [suppl] [model] [dataset]

Ke Xu, Tsun Wai Siu, and Rynson Lau

Proc. AAAI, Feb. 2024

Input

SATNet

HetNet

VMDNet

Ours

We propose ZOOM, which learns to detect mirrors with extremely weak supervision, i.e., the zero-one mirror indicators. The key insight of ZOOM is to model the similarity (blue arrow) and contrast (red arrows) in temporal variations for mirror detection. It achieves promising results against fully-supervised mirror detectors.

Input-Output: Given an input video, our network outputs a sequence of binary masks indicating where the mirrors are in each frame.

Abstract. Mirror detection is an active research topic in computer vision. However, all existing mirror detectors learn mirror representations from large-scale pixel-wise datasets, which are tedious and expensive to obtain. Although weakly-supervised learning has been widely explored in related topics, we note that popular weak supervision signals (e.g., bounding boxes, scribbles, points) still require some efforts from the user to locate the target objects, with a strong assumption that the images to annotate always contain the target objects. Such an assumption may result in the over-segmentation of mirrors. Our key idea of this work is that the existence of mirrors over a time period may serve as a weak supervision to train a mirror detector, for two reasons. First, if a network can predict the existence of mirrors, it can essentially locate the mirrors. Second, we observe that the reflected contents of a mirror tend to be similar to those in adjacent frames, but exhibit considerable contrast to regions in far-away frames (e.g., non-mirror frames). In this paper, we propose ZOOM, the first method to learn robust mirror representations from extremely-weak annotations of per-frame ZerO-One Mirror indicators in videos. The key insight of ZOOM is to model the similarity and contrast (between mirror and non-mirror regions) in temporal variations to locate and segment the mirrors. To this end, we propose a novel fusion strategy to leverage temporal consistency information for mirror localization, and a novel temporal similarity-contrast modeling module for mirror segmentation.We construct a new video mirror dataset for training and evaluation. Experimental results under new and standard metrics show that ZOOM performs favorably against existing fully-supervised mirror detection methods.

Self-supervised Pre-training for Mirror Detection [paper] [model]

Jiaying Lin and Rynson Lau

Proc. IEEE ICCV, Oct. 2023

State-of-the-art mirror detection methods [37, 30] are based on costly supervised ImageNet pre-training. They may fail even in obvious cases, e.g., (a) when the mirror clearly reflects a real object outside of the mirror. (b) and (c) are MirrorNet [37] and VCNet [30], respectively, pre-trained on ImageNet with full supervision. (c) is MirrorNet with our proposed SSL framework and without supervised ImageNet pre-training. Our SSL scheme leverages the mirror reflection cue and avoids feature redundancy in the pre-training stage. It outperforms those pre-trained on ImageNet with full supervision (i.e., (b) and (c)).

Input-Output: Given an input image, our network outputs a binary mask that indicates where the mirrors are.

Abstract. Existing mirror detection methods require supervised ImageNet pre-training to obtain good general-purpose image features. However, supervised ImageNet pre-training focuses on category-level discrimination and may not be suitable for downstream tasks like mirror detection, due to the overfitting upstream tasks (e.g., supervised image classification). We observe that mirror reflection is crucial to how people perceive the presence of mirrors, and such mid-level features can be better transferred from self-supervised pretrained models. Inspired by this observation, in this paper we aim to improve mirror detection methods by proposing a new self-supervised learning (SSL) pre-training framework for modeling the representation of mirror reflection progressively in the pre-training process. Our framework consists of three pre-training stages at different levels: 1) an image-level pre-training stage to globally incorporate mirror reflection features into the pre-trained model; 2) a patch-level pre-training stage to spatially simulate and learn local mirror reflection from image patches; and 3) a pixel-level pre-training stage to pixel-wisely capture mirror reflection via reconstructing corrupted mirror images based on the relationship between the inside and outside of mirrors. Extensive experiments show that our SSL pre-training framework significantly outperforms previous state-of-the-art CNN-based SSL pre-training frameworks and even outperforms supervised ImageNet pre-training when transferred to the mirror detection task.

Learning to Detect Mirrors from Videos via Dual Correspondences [paper] [suppl] [model] [dataset]

Jiaying Lin, Xin Tan, and Rynson Lau

Proc. IEEE CVPR, June 2023

Although state-of-the-art single-image mirror detection method VCNet [36] performs well on a single image (e.g., the first row) by using implicitly intra-frame correspondence, it may fail when the intra-frame cue is weak or even absent in some video frames (e.g., the second and third rows). The lack in exploiting inter-frame information causes the current mirror detection methods to produce inaccurate and inconsistent results when applied to VMD. In contrast, our method can perform well in both situations by utilizing the proposed dual correspondence module to exploit intra-frame (spatial) and inter-frame (temporal) correspondences.

Input-Output: Given an input video, our network outputs a sequence of binary masks indicating where the mirrors are in each frame.

Abstract. Detecting mirrors from static images has received significant research interest recently. However, detecting mirrors over dynamic scenes is still under-explored due to the lack of a high-quality dataset and an effective method for video mirror detection (VMD). To the best of our knowledge, this is the first work to address the VMD problem from a deep-learning-based perspective. Our observation is that there are often correspondences between the contents inside (reflected) and outside (real) of a mirror, but such correspondences may not always appear in every frame, e.g., due to the change of camera pose. This inspires us to propose a video mirror detection method, named VMD-Net, that can tolerate spatially missing correspondences by considering the mirror correspondences at both the intra-frame level as well as inter-frame level via a dual correspondence module that looks over multiple frames spatially and temporally for correlating correspondences. We further propose a first large-scale dataset for VMD (named VMD-D), which contains 14,987 image frames from 269 videos with corresponding manually annotated masks. Experimental results show that the proposed method outperforms SOTA methods from relevant fields. To enable real-time VMD, our method efficiently utilizes the backbone features by removing the redundant multi-level module design and gets rid of postprocessing of the output maps commonly used in existing methods, making it very efficient and practical for real-time video-based applications.

Mirror Detection with the Visual Chirality Cue [paper] [code]

Xin Tan, Jiaying Lin, Ke Xu, Pan Chen, Lizhuang Ma, and Rynson Lau

IEEE Trans. on Pattern Analysis and Machine Intelligence, 45(3):3492-3504, Mar. 2023

Existing single image based mirror detection methods [42] [21], which are based on modeling contrasts/correspondences between mirror and non-mirror regions, may fail when these relations are not reliable. For example, MirrorNet [42] would fail if the contrasts between mirror/non-mirror regions are weak (top row) or have multiple degrees (bottom row). PMDNet [21] would fail if correspondences do not exist (top row) or are incorrectly detected (bottom row). Our method (Ours) leverages the visual chirality cue, which is an intrinsic property of mirrors reflecting real-world scenes, to accurately differentiate mirror and non-mirror regions.

Input-Output: Given an input image, our network outputs a binary mask that indicates where the mirrors are.

Abstract. Mirror detection is challenging because the visual appearances of mirrors change depending on those of their surroundings. As existing mirror detection methods are mainly based on extracting contextual contrast and relational similarity between mirror and non-mirror regions, they may fail to identify a mirror region if these assumptions are violated. Inspired by a recent study of applying a CNN to help distinguish whether an image is flipped or not based on the visual chirality property, in this paper, we rethink this image-level visual chirality property and reformulate it as a learnable pixel level cue for mirror detection. Specifically, we first propose a novel flipping-convolution-flipping (FCF) transformation to model visual chirality as learnable commutative residual. We then propose a novel visual chirality embedding (VCE) module to exploit this commutative residual in multi-scale feature maps, to embed the visual chirality features into our mirror detection model. Besides, we also propose a visual chirality-guided edge detection (CED) module to integrate the visual chirality features with contextual features for detection refinement. Extensive experiments show that the proposed method outperforms state-of-the-art methods on three benchmark datasets.

Large-Field Contextual Feature Learning for Glass Detection [paper] [code]

Haiyang Mei, Xin Yang, Letian Yu, Qiang Zhang, Xiaopeng Wei, and Rynson Lau

IEEE Trans. on Pattern Analysis and Machine Intelligence, 45(3):3329-3346, March 2023

The pipeline of the proposed GDNet-B. First, we use the pre-trained ResNeXt-101 [75] as a multi-level feature extractor (MFE) to obtain features of different levels. Second, we embed four LCFI modules to the last four layers of MFE, to learn large-field contextual features at different levels. Third, the outputs of the last three LCFI modules are concatenated and fused via an attention module [76] to generate high-level large-field contextual features. An attention map is then learned from these high-level large-field contextual features and used to guide the low-level large-field contextual features, i.e., the output of the first LCFI module, to focus more on glass regions. Fourth, we apply two BFE modules on the highlevel/attentive low-level large-field contextual features to further perceive and integrate boundary cues. Finally, we combine high-level and attentive low-level large-field contextual features by concatenation and attention [76] operations to produce the final glass detection map.

Input-Output: Given an input image, our network outputs a binary mask that indicates where glass surfaces are.

Abstract. Glass is very common in our daily life. Existing computer vision systems neglect it and thus may have severe consequences, e.g., a robot may crash into a glass wall. However, sensing the presence of glass is not straightforward. The key challenge is that arbitrary objects/scenes can appear behind the glass. In this paper, we propose an important problem of detecting glass surfaces from a single RGB image. To address this problem, we construct the first large-scale glass detection dataset (GDD) and propose a novel glass detection network, called GDNet-B, which explores abundant contextual cues in a large field-of-view via a novel large-field contextual feature integration (LCFI) module and integrates both high-level and low-level boundary features with a boundary feature enhancement (BFE) module. Extensive experiments demonstrate that our GDNet-B achieves satisfying glass detection results on the images within and beyond the GDD testing set. We further validate the effectiveness and generalization capability of our proposed GDNet-B by applying it to other vision tasks, including mirror segmentation and salient object detection. Finally, we show the potential applications of glass detection and discuss possible future research directions.

Symmetry-Aware Transformer-based Mirror Detection [paper] [model]

Tianyu Huang, Bowen Dong, Jiaying Lin, Xiaohui Liu, Rynson Lau, and Wangmeng Zuo

Proc. AAAI, Feb. 2023

Comparison of mirror detection among state-of-the-art methods. MirrorNet (Yang et al. 2019) cannot handle scenes with vague mirror boundaries. Although PMDNet (Lin, Wang, and Lau 2020) considers similarity semantics, it can hardly detect the symmetry pair (1st row), and can easily count part of the similar non-mirror regions into mirrors (2nd row). SANet (Guan, Lin, and Lau 2022) only detects the mirror region above the sink (1st row), and it has worse predictions when semantic associations are lacking (2nd row). By modeling a loose symmetry relationship, SATNet succeeds in both cases.

Input-Output: Given an input image, our network outputs a binary mask that indicates where the mirrors are.

Abstract. Mirror detection aims to identify the mirror regions in the given input image. Existing works mainly focus on integrating the semantic features and structural features to mine specific relations between mirror and non-mirror regions, or introducing mirror properties like depth or chirality to help analyze the existence of mirrors. In this work, we observe that a real object typically forms a loose symmetry relationship with its corresponding reflection in the mirror, which is beneficial in distinguishing mirrors from real objects. Based on this observation, we propose a dual-path Symmetry-Aware Transformer-based mirror detection Network (SATNet), which includes two novel modules: Symmetry-Aware Attention Module (SAAM) and Contrast and Fusion Decoder Module (CFDM). Specifically, we first adopt a transformer backbone to model global information aggregation in images, extracting multi-scale features in two paths. We then feed the high-level dual-path features to SAAMs to capture the symmetry relations. Finally, we fuse the dual-path features and refine our prediction maps progressively with CFDMs to obtain the final mirror mask. Experimental results show that SATNet outperforms both RGB and RGB-D mirror detection methods on all available mirror detection datasets.

Efficient Mirror Detection via Multi-level Heterogeneous Learning (Oral Presentation) [paper] [model]

Ruozhen He, Jiaying Lin, and Rynson Lau

Proc. AAAI, Feb. 2023

Comparison of our proposed method with mirror detection and SOD models on F-measure, FLOPs, and FPS using the MSD dataset. Our method achieves a new SOTA result with considerable efficiency.

Input-Output: Given an input image, our network outputs a binary mask that indicates where the mirrors are.

Abstract. We present HetNet (Multi-level Heterogeneous Network), a highly efficient mirror detection network. Current mirror detection methods focus more on performance than efficiency, limiting the real-time applications (such as drones). Their lack of efficiency is aroused by the common design of adopting homogeneous modules at different levels, which ignores the difference between different levels of features. In contrast, HetNet detects potential mirror regions initially through low-level understandings (e.g., intensity contrasts) and then combines with high-level understandings (contextual discontinuity for instance) to finalize the predictions. To perform accurate yet efficient mirror detection, HetNet follows an effective architecture that obtains specific information at different stages to detect mirrors. We further propose a multi-orientation intensity-based contrasted module (MIC) and a reflection semantic logical module (RSL), equipped on HetNet, to predict potential mirror regions by low-level understandings and analyze semantic logic in scenarios by high-level understandings, respectively. Compared to the state-of-the-art method, HetNet runs 664% faster and draws an average performance gain of 8.9% on MAE, 3.1% on IoU, and 2.0% on F-measure on two mirror detection benchmarks.

Exploiting Semantic Relations for Glass Surface Detection (Spotlight Presentation) [paper] [code] [dataset]

Jiaying Lin^*, Yuen Hei Yeung^*, and Rynson Lau (^* joint first authors)

Proc. NeurIPS, Nov. 2022

Visual comparisons of our method to state-of-the-art methods for glass surface detection [8, 18] on some example images.

Input-Output: Given an input image, our network outputs a binary mask that indicates where glass surfaces are.

Abstract. Glass surfaces are omnipresent in our daily lives and often go unnoticed by the majority of us. While humans are generally able to infer their locations and thus avoid collisions, it can be difficult for current object detection systems to handle them due to the transparent nature of glass surfaces. Previous methods approached the problem by extracting global context information to obtain priors such as boundary and reflection. However, their performances cannot be guaranteed when these critical features are not available. We observe that humans often reason through the semantic context of the environment, which offers insights into the categories of and proximity between entities that are expected to appear in the surrounding. For example, the odds of co-occurrence of glass windows with walls and curtains is generally higher than that with other objects such as cars and trees, which have relatively less semantic relevance. Based on this observation, we propose a model that integrates the contextual relationship of the scene for glass surface detection with two novel modules: (1) Scene Aware Activation (SAA) Module to adaptively filter critical channels with respect to spatial and semantic features, and (2) Context Correlation Attention (CCA) Module to progressively learn the contextual correlations among objects both spatially and semantically. In addition, we propose a large-scale glass surface detection dataset named GSD-S, which contains 4,519 real-world RGB glass surface images from diverse real-world scenes with detailed annotations. Experimental results show that our model outperforms contemporary works, especially with 48.8% improvement on MAE from our proposed GSD-S dataset.

Learning Semantic Associations for Mirror Detection [paper] [suppl] [code] [dataset]

Huankang Guan, Jiaying Lin, and Rynson Lau

Proc. IEEE CVPR, June 2022

Existing mirror detection methods based on learning contextual contrasts [47] or corresponding relations [24] falsely identify some distractors (e.g., the doorway in the 1^st row and the painting in the 2^nd row) as mirrors, and miss the mirror (3^rd row) when the mirror is captured at an oblique angle to the camera along with some occluding lights. In contrast, our method considers the semantic associations between mirrors and their surrounding objects (e.g., the vanity table in the 1st row and the sink in the 2^nd and 3rd rows), yielding accurate results.

Input-Output: Given an input image, our network outputs a binary mask that indicates where the mirrors are.

Abstract. Mirrors generally lack a consistent visual appearance, making mirror detection very challenging. Although recent works that are based on exploiting contextual contrasts and corresponding relations have achieved good results, heavily relying on contextual contrasts and corresponding relations to discover mirrors tend to fail in complex real-world scenes, where a lot of objects, e.g., doorways, may have similar features as mirrors. We observe that humans tend to place mirrors in relation to certain objects for specific functional purposes, e.g., a mirror above the sink. Inspired by this observation, we propose a model to exploit the semantic associations between the mirror and its surrounding objects for a reliable mirror localization. Our model first acquires class-specific knowledge of the surrounding objects via a semantic side-path. It then uses two novel modules to exploit semantic associations: 1) an Associations Exploration (AE) Module to extract the associations of the scene objects based on fully connected graph models, and 2) a Quadruple-Graph (QG) Module to facilitate the diffusion and aggregation of semantic association knowledge using graph convolutions. Extensive experiments show that our method outperforms the existing methods and sets the new state-of-the-art on both PMD dataset (f-measure: 0.844) and MSD dataset (f-measure: 0.889).

Rich Context Aggregation with Reflection Prior for Glass Surface Detection [paper] [suppl] [video] [pretrained code] [dataset]

Jiaying Lin, Zebang He, and Rynson Lau

Proc. IEEE CVPR, June 2021

Two popular scenarios where existing methods [20, 30] fail. GDNet [20] is based on extracting/integrating abundant context features for glass surface detection. As it does not consider any specific glass properties, it tends to fail in scenes with insufficient contexts (e.g., top row where the glass surface covers almost the whole image) or with glass-lookalike regions (e.g., bottom row where the center region is not covered by glass). TransLab [30] is based on a boundary-guided network for transparent object detection. It also fails to detect glass surfaces correctly. Our method, which considers reflections and boundaries, can accurately detect the glass surfaces in these complex scenes.

Input-Output: Given an input image, our network outputs a binary mask that indicates where glass surfaces are.

Abstract. Glass surfaces appear everywhere. Their existence can however pose a serious problem to computer vision tasks. Recently, a method is proposed to detect glass surfaces by learning multi-scale contextual information. However, as it is only based on a general context integration operation and does not consider any specific glass surface properties, it gets confused when the images contain objects that are similar to glass surfaces and degenerates in challenging scenes with insufficient contexts. We observe that humans often rely on identifying reflections in order to sense the existence of glass and on locating the boundary in order to determine the extent of the glass. Hence, we propose a model for glass surface detection, which consists of two novel modules: (1) a rich context aggregation module (RCAM) to extract multi-scale boundary features from rich context features for locating glass surface boundaries of different sizes and shapes, and (2) a reflection-based refinement module (RRM) to detect reflection and then incorporate it so as to differentiate glass regions from non-glass regions. In addition, we also propose a challenging dataset consisting of 4,012 glass images with annotations for glass surface detection. Our experiments demonstrate that the proposed model outperforms state-of-the-art methods from relevant fields.

Progressive Mirror Detection [paper] [suppl] [code] [dataset]

Jiaying Lin, Guodong Wang, and Rynson Lau

Proc. IEEE CVPR, June 2020

Visualization of our progressive approach to recognizing mirrors from a single image. By finding correspondences between objects inside and outside of the mirror and then explicitly locating the miror edges, we can detect the mirror region more reliably.

Input-Output: Given an input image, our network outputs a binary mask that indicates where the mirrors are.

Abstract. The mirror detection problem is important as mirrors can affect the performances of many vision tasks. It is a difficult problem since it requires an understanding of global scene semantics. Recently, a method was proposed to detect mirrors by learning multi-level contextual contrasts between inside and outside of mirrors, which helps locate mirror edges implicitly. We observe that the content of a mirror reflects the content of its surrounding, separated by the edge of the mirror. Hence, we propose a model in this paper to progressively learn the content similarity between the inside and outside of the mirror while explicitly detecting the mirror edges. Our work has two main contributions. First, we propose a new relational contextual contrasted local (RCCL) module to extract and compare the mirror features with its corresponding context features, and an edge detection and fusion (EDF) module to learn the features of mirror edges in complex scenes via explicit supervision. Second, we construct a challenging benchmark dataset of 6,461 mirror images. Unlike the existing MSD dataset, which has limited diversity, our dataset covers a variety of scenes and is much larger in scale. Experimental results show that our model outperforms relevant state-of-the-art methods.

Don't Hit Me! Glass Detection in Real-world Scenes [paper] [suppl] [code] [dataset]

Haiyang Mei, Xin Yang, Yang Wang, Yuanyuan Liu, Shengfeng He, Qiang Zhang, Xiaopeng Wei, and Rynson Lau

Proc. IEEE CVPR, June 2020

Problems with glass in existing vision tasks. In depth prediction, existing method [16] wrongly predicts the depth of the scene behind the glass, instead of the depth to the glass (1st row of (b)). For instance segmentation, Mask RCNN [9] only segments the instances behind the glass, not aware that they are actually behind the glass (2nd row of (b)). Besides, if we directly apply an existing singe-image reflection removal (SIRR) method [36] to an image that is only partially covered by glass, the non-glass region can be corrupted (3rd row of (b)). GDNet can detect the glass (c) and then correct these failure cases (d).

Input-Output: Given an input image, our network outputs a binary mask that indicates where glass surfaces are.

Abstract. Transparent glass is very common in our daily life. Existing computer vision systems neglect it and thus may have severe consequences, e.g., a robot may crash into a glass wall. However, sensing the presence of glass is not straightforward. The key challenge is that arbitrary objects/scenes can appear behind the glass, and the content within the glass region is typically similar to those behind it. In this paper, we propose an important problem of detecting glass from a single RGB image. To address this problem, we construct a large-scale glass detection dataset (GDD) and design a glass detection network, called GDNet, which explores abundant contextual cues for robust glass detection with a novel large-field contextual feature integration (LCFI) module. Extensive experiments demonstrate that the proposed method achieves more superior glass detection results on our GDD test set than state-of-the-art methods fine-tuned for glass detection.

Where is My Mirror? [paper] [suppl] [code and updated results] [dataset]

Xin Yang*, Haiyang Mei*, Ke Xu, Xiaopeng Wei, Baocai Yin, and Rynson Lau (* joint first authors)

Proc. IEEE ICCV, Oct. 2019

Problems with mirrors in existing vision tasks. In depth prediction, NYU-v2 dataset [32] uses a Kinect to capture depth as ground truth. It wrongly predicts the depths of the reflected contents, instead of the mirror depths (b). In instance semantic segmentation, Mask RCNN [12] wrongly detects objects inside the mirrors (c). With MirrorNet, we first detect and mask out the mirrors (d). We then obtain the correct depths (e), by interpolating the depths from surrounding pixels of the mirrors, and segmentation maps (f).

Input-Output: Given an input image, our network outputs a binary mask that indicates where the mirrors are.

Abstract. Mirrors are everywhere in our daily lives. Existing computer vision systems do not consider mirrors, and hence may get confused by the reflected content inside a mirror, resulting in a severe performance degradation. However, separating the real content outside a mirror from the reflected content inside it is non-trivial. The key challenge is that mirrors typically reflect contents similar to their surroundings, making it very difficult to differentiate the two. In this paper, we present a novel method to segment mirrors from an input image. To the best of our knowledge, this is the first work to address the mirror segmentation problem with a computational approach. We make the following contributions. First, we construct a large-scale mirror dataset that contains mirror images with corresponding manually annotated masks. This dataset covers a variety of daily life scenes, and will be made publicly available for future research. Second, we propose a novel network, called MirrorNet, for mirror segmentation, by modeling both semantical and low-level color/texture discontinuities between the contents inside and outside of the mirrors. Third, we conduct extensive experiments to evaluate the proposed method, and show that it outperforms the carefully chosen baselines from the state-of-the-art detection and segmentation methods

Last updated in December 2023.