Perceptual metrics: interactive plots

Disagreements between metrics

These pages help with manually investigating which metrics to trust. They show cases where the different metrics disagree on which image is best.

Metrics used

The following metrics were used:

VMAF-NEG v2.3.1. The model without enhancement gain (NEG) was used.
SSIM, PSNR-Y, PSNR-HVS: as implemented in VMAF
Butteraugli: the libjxl implementation, using the 3-norm score
DSSIM v3.2.3
SSIMULACRA
SSIMULACRA 2 (commit 5d5f429, called 'modelD' here)

Halls of fame/shame

For each metric, we consider pairs of images A and B (both derived from the same reference image) and look for cases where almost all metrics say A is worse than B, but the metric under scrutiny disagrees with all of them and says A is better than B. Since many of these cases involve low-quality images, there is also a second page that only includes cases where A is better than the overall median according to the metric under scrutiny.

SSIMULACRA 2 modelD: Hall of fame/shame (only higher quality images)
SSIM: Hall of fame/shame (only higher quality images)
PSNR-Y: Hall of fame/shame (only higher quality images)
Butteraugli 3norm: Hall of fame/shame (only higher quality images)
DSSIM: Hall of fame/shame (only higher quality images)
PSNR-HVS: Hall of fame/shame (only higher quality images)
VMAF-NEG: Hall of fame/shame (only higher quality images)
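
The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind these pages: the function name `outlier_pairs`, the layout of `scores`, and the `min_agreeing` and `min_score` parameters are hypothetical, and all scores are assumed to be oriented so that higher means better (distance-type metrics such as Butteraugli or DSSIM would have to be negated first).

```python
from itertools import permutations


def outlier_pairs(scores, metric, min_agreeing=None, min_score=None):
    """Find pairs (a, b) where `metric` disagrees with the other metrics.

    scores: {image_id: {metric_name: score}} for images derived from the
            same reference image; higher is assumed to mean better.
    min_agreeing: how many other metrics must say a is worse than b
                  (hypothetical knob; defaults to all of them).
    min_score: optional floor on the score of a according to `metric`,
               e.g. the overall median, to keep only higher quality cases.
    """
    if not scores:
        return []
    others = [m for m in next(iter(scores.values())) if m != metric]
    if min_agreeing is None:
        min_agreeing = len(others)
    pairs = []
    for a, b in permutations(scores, 2):
        if scores[a][metric] <= scores[b][metric]:
            continue  # the metric under scrutiny must say a is better
        if min_score is not None and scores[a][metric] <= min_score:
            continue  # skip low quality cases when a floor is given
        against = sum(scores[a][m] < scores[b][m] for m in others)
        if against >= min_agreeing:
            pairs.append((a, b))
    return pairs
```

Lowering `min_agreeing` below the number of other metrics corresponds to the "almost all" wording; the threshold actually used for the pages is not stated here.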

Metric X vs metric Y

For each metric X, we compare it to each other metric Y. First, we look at pairs of images A and B (both derived from the same reference image) where metric X says A is worse than B (by how much is plotted on the horizontal axis) and metric Y says A is better than B (by how much is plotted on the vertical axis). If you click on one of the points, images A, B and the reference image will be shown. Only cases are included where at least one of the two images is considered by at least one of the metrics to be better than the overall median metric result; otherwise, most high-amplitude disagreements would be about two very poor images.
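
As a rough sketch of how the points in such a plot could be derived (assuming higher scores mean better, and reading "at least one of the metrics" as X or Y, which is an interpretation), the hypothetical helper `disagreement_points` below returns one point per disagreeing pair. The `medians` argument is assumed to hold the overall median score per metric, computed elsewhere over the whole image set.

```python
from itertools import permutations


def disagreement_points(scores, x, y, medians):
    """Return (a, b, dx, dy) for pairs where X says a is worse than b
    and Y says a is better than b.

    scores:  {image_id: {metric_name: score}} for one reference image.
    medians: {metric_name: overall median score} (assumed precomputed).
    """
    points = []
    for a, b in permutations(scores, 2):
        dx = scores[b][x] - scores[a][x]  # how much worse X rates a than b
        dy = scores[a][y] - scores[b][y]  # how much better Y rates a than b
        if dx <= 0 or dy <= 0:
            continue  # no disagreement in the required direction
        # Keep the pair only if some image beats the median for X or Y.
        above_median = any(
            scores[img][m] > medians[m] for img in (a, b) for m in (x, y)
        )
        if above_median:
            points.append((a, b, dx, dy))
    return points
```

Here `dx` would go on the horizontal axis and `dy` on the vertical axis.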

Second, we look at single compressed images and compute the percentile of each of their metric scores within the whole set of images (that is, including all compressed images regardless of which reference image they came from). This percentile can be seen as an estimate of the mean opinion score (MOS). The plot shows cases where the percentiles differ significantly. Click on a point to open the compressed image and the corresponding reference image.
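
The percentile comparison can be illustrated as follows. This is a minimal sketch, not the script used for the plots: the mid-rank definition of a percentile, the function names, and the `threshold` parameter (what counts as "significantly") are all assumptions, and both score lists are assumed to be oriented so that higher means better.

```python
from bisect import bisect_left, bisect_right


def percentile_ranks(values):
    """Mid-rank percentile (0..100) of each value within the whole list."""
    ordered = sorted(values)
    n = len(ordered)
    return [
        100.0 * (bisect_left(ordered, v) + bisect_right(ordered, v)) / (2 * n)
        for v in values
    ]


def percentile_gaps(scores_x, scores_y, threshold):
    """Indices of images whose percentiles under two metrics differ by
    more than `threshold` percentage points (hypothetical cutoff)."""
    ranks_x = percentile_ranks(scores_x)
    ranks_y = percentile_ranks(scores_y)
    return [
        i
        for i, (px, py) in enumerate(zip(ranks_x, ranks_y))
        if abs(px - py) > threshold
    ]
```

Working in percentiles puts metrics with very different numeric ranges on a common scale before they are compared.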

Encoder performance according to various metrics

For every metric, there are various plots that can help to evaluate encoder performance and speed/compression trade-offs. The plots show bitrate/distortion curves on the left and encode speed on the right, so both aspects can be viewed simultaneously. In most plots, the aggregation is done per encode setting (the 'quality' parameter), showing the average file size versus the average (or 10th percentile) metric score. The average file size is expressed either in bits per pixel, or as a percentage gain compared to a baseline codec (unoptimized JPEG) that achieves the same (average or 10th percentile) metric score. Another way of comparing is to select the encode quality setting on a per-image basis, choosing the lowest setting that reaches a given metric score. Finally, there are plots of 'encoder consistency', an indicator of how much image-dependent variation there is in the resulting perceptual quality (according to the metric) for each encode setting. Lower curves are better here, since they mean that encode settings can be mapped to perceptual fidelity targets more reliably.
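
The aggregations described above could look roughly like the sketch below. Everything here is illustrative: the function names and data layout are hypothetical, scores are assumed to be higher-is-better, the baseline curve is interpolated linearly (the actual plots may interpolate differently), and the population standard deviation is only a stand-in for the consistency measure, which is not specified on this page.

```python
from statistics import mean, pstdev, quantiles


def aggregate_by_setting(results):
    """results: (quality_setting, bits_per_pixel, metric_score) per image.
    Needs at least two images per setting for the percentile."""
    grouped = {}
    for quality, bpp, score in results:
        grouped.setdefault(quality, []).append((bpp, score))
    summary = {}
    for quality, rows in sorted(grouped.items()):
        scores = [score for _, score in rows]
        summary[quality] = {
            "mean_bpp": mean(bpp for bpp, _ in rows),
            "mean_score": mean(scores),
            "p10_score": quantiles(scores, n=10, method="inclusive")[0],
            "spread": pstdev(scores),  # assumed consistency indicator
        }
    return summary


def baseline_bpp_at(curve, score):
    """Linearly interpolate the baseline's bpp at a given metric score.
    curve: (score, bpp) points of the baseline codec."""
    points = sorted(curve)
    for (s0, b0), (s1, b1) in zip(points, points[1:]):
        if s0 <= score <= s1:
            t = (score - s0) / (s1 - s0) if s1 != s0 else 0.0
            return b0 + t * (b1 - b0)
    return None  # score lies outside the baseline's measured range


def percent_gain(codec_bpp, baseline_bpp):
    """Size reduction in percent relative to the baseline."""
    return 100.0 * (1.0 - codec_bpp / baseline_bpp)


def lowest_setting_reaching(scores_by_setting, target):
    """Per-image selection: lowest quality setting whose score >= target."""
    reaching = [q for q, s in scores_by_setting.items() if s >= target]
    return min(reaching) if reaching else None
```

For example, `percent_gain(0.75, baseline_bpp_at([(60, 1.0), (80, 2.0)], 70))` evaluates a codec that needs 0.75 bpp against a baseline interpolated at the same score.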

Encoders

The following encoders were tested:

MozJPEG v4.1.1
libjxl v0.7 and commit 225b6884
libavif v0.10.1 with aom [enc/dec]:3.4.0; tune=ssim
Visionular Aurora v1.0.7-6-g61faa30
cavif-rs 1.3.5
WebP v1.2.4

Most of these encoders were tested with various speed settings. mozjpeg -revert corresponds to unoptimized libjpeg-turbo, while default mozjpeg is slower and more optimized. All encoders were tested using only a single core (no multithreading) on an Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz under moderate load. Speed measurements were not done very accurately (CPU time was simply measured once per encoded image), so at the high-speed end in particular this led to inaccurate measurements: mozjpeg -revert is in reality faster, and the reported speed is an underestimate.
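
For illustration, a single CPU-time measurement of one encoder invocation might be taken as in the sketch below. This is an assumption about the general approach, not the actual benchmark script: the helper names are hypothetical, the `resource` module is only available on Unix-like systems, and expressing speed in megapixels per second is a choice made here for the example.

```python
import resource
import subprocess
import sys


def cpu_seconds(cmd):
    """CPU time (user + system) consumed by one run of `cmd`, measured once."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(
        cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)


def megapixels_per_second(width, height, seconds):
    """Encode speed in MP/s, or None if the measured time is not positive."""
    if seconds <= 0:
        return None  # too fast for the timer to resolve in a single run
    return (width * height) / 1e6 / seconds


if __name__ == "__main__":
    # Stand-in command; a real run would invoke an encoder binary instead.
    elapsed = cpu_seconds([sys.executable, "-c", "pass"])
    print(megapixels_per_second(2000, 1000, elapsed))
```

A measurement like this includes process startup and file I/O along with the encode itself, so repeating it several times per image would give more reliable numbers for fast encoders.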

SSIMULACRA 2 modelD

DSSIM

Butteraugli 3-norm

SSIMULACRA

PSNR-Y

PSNR-HVS

SSIM

MS-SSIM

CIEDE2000

VMAF-NEG