Learning Privacy from Visual Entities

Alessio Xompero, Andrea Cavallaro

TL;DR

The paper investigates how to predict image privacy and challenges the necessity of large graph-based models by showing that a transfer-learning pipeline using a pre-trained CNN and a single trainable FC layer can match the performance of graph-based methods. It critically analyzes the components of graph-based privacy classifiers, demonstrating that fine-tuning CNNs largely drives performance while the graph component contributes minimally and at great parameter cost. By comparing against GIP, GPA, MLP, and GA-MLP across IPD and PrivacyAlert, it highlights that a lightweight approach (S2P) achieves comparable or superior accuracy with orders of magnitude fewer trainable parameters. The findings imply practical benefits in efficiency and interpretability, and suggest future work should focus on human-interpretable visual-entity features and more efficient graph designs to improve privacy recognition at scale.

Abstract

Subjective interpretation and content diversity make predicting whether an image is private or public a challenging task. Graph neural networks combined with convolutional neural networks (CNNs), which consist of 14,000 to 500 million parameters, generate features for visual entities (e.g., scene and object types) and identify the entities that contribute to the decision. In this paper, we show that using a simpler combination of transfer learning and a CNN to relate privacy with scene types optimises only 732 parameters while achieving comparable performance to that of graph-based methods. In contrast, end-to-end training of graph-based methods can mask the contribution of individual components to the classification performance. Furthermore, we show that a high-dimensional feature vector, extracted with CNNs for each visual entity, is unnecessary and adds complexity to the model. The graph component also has a negligible impact on performance, which is driven by fine-tuning the CNN to optimise image features for privacy nodes.
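
The 732-parameter figure is consistent with a single fully connected layer that maps per-scene scores to the two privacy classes. The sketch below only illustrates that arithmetic. It assumes a scene vocabulary of 365 categories and a bias term per class; neither is stated in the abstract, and the function name `fc_trainable_params` is hypothetical.

```python
def fc_trainable_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Count the trainable parameters of one fully connected layer."""
    weights = in_features * out_features   # one weight per input/output pair
    biases = out_features if bias else 0   # one bias per output unit
    return weights + biases


# Assumption (not stated in the abstract): the frozen, pre-trained CNN
# outputs one score per scene type over a 365-category vocabulary.
NUM_SCENE_TYPES = 365
NUM_PRIVACY_CLASSES = 2  # private, public

print(fc_trainable_params(NUM_SCENE_TYPES, NUM_PRIVACY_CLASSES))  # 732
```

Under these assumptions the count is 365 × 2 weights plus 2 biases. Because the CNN stays frozen, these would be the only parameters updated during training.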

Paper Structure

This paper contains 34 sections, 10 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: Illustration of different training strategies to understand the relative contribution of individual components in a learning-based model for image privacy classification (illustration inspired by and adapted from Modas2021ICIP). On top, a transfer-learning-based strategy that fine-tunes the layers of a convolutional neural network (CNN), initially pre-trained on a source domain, to the downstream task (image privacy) on a target domain. In the middle, an architecture that stacks a CNN with a graph neural network (GNN) and is trained end-to-end while also fine-tuning the CNN parameters. This strategy masks the relative contribution of the GNN to the overall performance. On the bottom, the training involves only the GNN while keeping the parameters of the CNN fixed, which isolates the relative contribution of the GNN.
  • Figure 2: Illustrative diagrams of GIP Yang2020PR (top) and GPA Stoidis2022BigMM (bottom), and their end-to-end training. The methods have two main components: Convolutional Neural Networks (CNNs) to extract deep features and a Graph Neural Network (GNN) to refine the features based on a graph computed a priori. The graph is designed with two node types: privacy nodes (public and private) and object nodes. The number of object nodes is determined by a pre-defined and fixed-size vocabulary (e.g., 80 categories in the COCO dataset Lin2018ECCV_COCO). After initialising the node features, the GNN refines the features based on the prior graph. The features of the privacy nodes at the last layer are used as input to a classifier that consists of multiple fully connected layers with shared parameters. Both models are trained end-to-end with a cross-entropy loss $\mathcal{L}$, which also guides the fine-tuning of the CNNs. Note that the image is resized and normalised based on the statistics computed from ImageNet, and that some of the connections in the GNN blocks are omitted from the visualisation.
  • Figure 3: Frequency of public images and private images based on the subset of images with only X localised object types (# of co-occurring objects). The number of images differs for each stacked vertical bar.
  • Figure 4: Percentage of images with no localised objects, with only one object type, and with more than one object type. Each object type can have any number of localised instances. From top to bottom: training, validation, and testing splits for each dataset.
  • Figure 5: Comparison of the pipelines when relating recall of the private class and balanced accuracy. Best performance is in the top-right corner. Note that relying only on scenes as visual entities makes S2P and GPA correctly recognise fewer private images in PrivacyAlert, showing the different underlying distributions of the two datasets. Legend: MLP, MLP-I, GA-MLP, GIP, GPA, S2P.
  • ...and 4 more figures