The intuition here is that large probabilities have less information than small probabilities. The conditional probability captures our interest in image quality. To capture our interest in a variety of images, we use the marginal probability.
Salimans et al.; Pros and Cons of GAN Evaluation Measures, 2018.
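The intuition about information can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper name `entropy` and the two example distributions are illustrative and not taken from the source. A prediction concentrated on one class carries little uncertainty (low entropy), while a spread-out distribution has high entropy.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]  # drop zero entries, treating 0 * log(0) as 0
    return float(-(nz * np.log(nz)).sum())

# Hypothetical class distributions over four classes.
confident = [0.97, 0.01, 0.01, 0.01]  # one class clearly dominates
uniform = [0.25, 0.25, 0.25, 0.25]    # every class equally likely

print("entropy of confident prediction:", entropy(confident))
print("entropy of uniform distribution:", entropy(uniform))
```

A sharp conditional distribution for each image and a broad marginal distribution over all images are the two quantities the score rewards.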

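The generated images should be sharp rather than blurry, or equivalently the conditional label distribution p(y|x) should have low entropy. The generative algorithm should output a high diversity of images from all the different classes in ImageNet, or equivalently the marginal label distribution p(y) should have high entropy. If both of these traits are satisfied by a generative model, then we expect a large KL-divergence between the distributions p(y|x) and p(y). Let's see why the proposed score codifies these qualities.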
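One way to make this concrete is the usual statement of the score, given here as a sketch. The symbols are assumptions, since none survive in the text above: x is a generated image drawn from the generator distribution p_g, y is a class label, p(y|x) is the classifier's conditional label distribution, and p(y) is its average over generated images.

```latex
\begin{align}
p(y) &= \mathbb{E}_{x \sim p_g}\left[ p(y \mid x) \right] \\
\mathrm{IS}(G) &= \exp\left( \mathbb{E}_{x \sim p_g}\left[ D_{\mathrm{KL}}\left( p(y \mid x) \,\|\, p(y) \right) \right] \right) \\
\ln \mathrm{IS}(G) &= H(y) - H(y \mid x)
\end{align}
```

The last line follows because the expected KL-divergence equals the mutual information between x and y. It shows both qualities at once: the score grows when the marginal entropy H(y) is high (diverse classes) and when the conditional entropy H(y|x) is low (sharp, recognizable images).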
This is illustrated in Table This shows that the Inception Score is sensitive to small changes in network weights that do not affect the final classification accuracy of the network.

In generative modeling, we are given a dataset of samples. Many metrics have been proposed for the evaluation of black-box generative models, and some metrics have been devised that use the structure within an individual class of generative models to compare them. The authors further argue that generative models need to be directly evaluated for the application they are intended for.

The Inception Score, or IS for short, is an objective metric for automatically evaluating the quality of image generative models, specifically the synthetic images output by generative adversarial network models. Suppose we are trying to evaluate a trained generative model. A large number of generated images are classified using the Inception v3 model, which gives the probability of each class label conditional on the generated image.

The authors who proposed the IS aimed to codify two desirable qualities of a generative model into a metric. The first is that the images generated should contain clear objects: one goal is that the generated images should be recognizable. Images that are classified strongly as one class over all other classes indicate a high quality.

Figure: Sample images from a generative algorithm that achieves nearly optimal Inception Scores.

Table: Inception Scores on 50k CIFAR-10 training images, 50k ImageNet validation images and ImageNet Validation top-1 accuracy.

Possibly the most popular metric is the Inception Score (Salimans et al., 2016), which measures the quality and diversity of the generated images using an external model, the Google Inception network (Szegedy et al., 2014), trained on the large-scale ImageNet dataset (Deng et al., 2009). Some other metrics, although less widely used, are still very valuable.

Deep generative models are powerful tools that have produced impressive results in recent years. An outstanding example of successful empirical research within machine learning is the Large Scale Visual Recognition Challenge benchmark for computer vision tasks, which has arguably produced most of the greatest computer vision advances of the last decade. In this note, we highlighted a number of suboptimalities of the Inception Score and explicated some of the difficulties in designing a good metric for evaluating generative models.

Typically, when computing the Inception Score, 50,000 images are generated and split into 10 groups of 5,000 each. Each group is plugged into formula (3) to compute the Inception Score, giving 10 values, and their mean and variance are reported as the final metric (mean ± variance).
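The split-and-average procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: it assumes the classifier outputs are already available as an array `probs` of shape (number of images, number of classes) with rows summing to 1, and the function name `inception_score` and the smoothing constant `eps` are hypothetical. The text above reports the variance across groups; this sketch returns the standard deviation instead, which is a swapped-in choice (use `np.var` to match the text exactly).

```python
import numpy as np

def inception_score(probs, n_splits=10, eps=1e-12):
    """Mean and standard deviation of the Inception Score over n_splits groups.

    probs: array of shape (n_images, n_classes); row i is assumed to hold
    the classifier's predicted distribution p(y|x_i) for generated image i.
    """
    probs = np.asarray(probs, dtype=float)
    scores = []
    for part in np.array_split(probs, n_splits):
        # Marginal label distribution p(y) estimated within this group.
        p_y = part.mean(axis=0, keepdims=True)
        # KL(p(y|x) || p(y)) for every image; eps guards against log(0).
        kl = (part * (np.log(part + eps) - np.log(p_y + eps))).sum(axis=1)
        # The score of the group is exp of the average KL-divergence.
        scores.append(np.exp(kl.mean()))
    return float(np.mean(scores)), float(np.std(scores))

# Usage with synthetic predictions standing in for real classifier outputs.
rng = np.random.default_rng(0)
fake_probs = rng.dirichlet(np.ones(10), size=50_000)  # 50,000 "images", 10 classes
mean_is, std_is = inception_score(fake_probs, n_splits=10)
print(f"Inception Score: {mean_is:.3f} +/- {std_is:.3f}")
```

With real data, `probs` would come from running the generated images through the Inception network. The score is bounded below by 1 (every image gets the same prediction) and above by the number of classes (confident predictions spread evenly over all classes).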

