Perceptual metrics: interactive plots

Disagreements between metrics

These pages can be useful to manually investigate which metrics to trust. They show cases where there's disagreement between the different metrics on which image is best.

Metrics used

The following metrics were used:

Halls of fame/shame

For each metric, we look at pairs of images A and B (both derived from the same reference image) where almost all other metrics say A is worse than B, but the metric under scrutiny disagrees with them and says A is better than B. Since many of these cases involve low-quality images, there is also a second page that includes only cases where A is better than the overall median according to the metric under scrutiny.

SSIMULACRA 2 modelD: Hall of fame/shame (only higher quality images)
SSIM: Hall of fame/shame (only higher quality images)
PSNR-Y: Hall of fame/shame (only higher quality images)
Butteraugli 3norm: Hall of fame/shame (only higher quality images)
DSSIM: Hall of fame/shame (only higher quality images)
PSNR-HVS: Hall of fame/shame (only higher quality images)
VMAF-NEG: Hall of fame/shame (only higher quality images)
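
The selection behind the hall of fame/shame pages can be sketched roughly as follows. This is only an illustration, not the code that generated the pages: the data layout (`scores`, `groups`), the function name and the `min_agree` threshold standing in for "almost all" are assumptions, and every metric is assumed to be oriented so that higher means better (distance-style metrics such as DSSIM would need to be negated first).

```python
from itertools import permutations
from statistics import median


def hall_of_shame(scores, groups, metric, min_agree=None, only_high_quality=False):
    """Find pairs (A, B) where `metric` says A is better but the others say A is worse.

    scores: {metric_name: {image_id: score}}, higher = better (assumed layout).
    groups: {reference_id: [image_id, ...]}, images derived from the same reference.
    min_agree: how many other metrics must say A is worse (hypothetical knob;
               defaults to all of them).
    """
    others = [m for m in scores if m != metric]
    if min_agree is None:
        min_agree = len(others)
    overall_median = median(scores[metric].values())

    found = []
    for ref, images in groups.items():
        for a, b in permutations(images, 2):
            # The metric under scrutiny must claim that A is better than B.
            if scores[metric][a] <= scores[metric][b]:
                continue
            # Second page: keep only cases where A is above the overall median.
            if only_high_quality and scores[metric][a] <= overall_median:
                continue
            # Count the other metrics that say A is worse than B.
            agree = sum(scores[m][a] < scores[m][b] for m in others)
            if agree >= min_agree:
                found.append((ref, a, b))
    return found
```

Lowering `min_agree` below the number of other metrics relaxes "all" to "almost all"; which threshold the pages actually used is not stated here.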

Metric X vs metric Y

For each metric X, we compare it to each other metric Y. First we look at pairs of images A and B (both derived from the same reference image) where metric X says A is worse than B (by how much is plotted on the horizontal axis) and metric Y says A is better than B (by how much is plotted on the vertical axis). If you click on one of the points, images A, B and the reference image will be shown. Only cases where at least one of the images is considered by at least one of the metrics to be better than the overall median metric result are shown (otherwise most high-amplitude disagreements are about two horribly poor images).
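
As a rough illustration of how such pairwise disagreement points could be collected, here is a minimal sketch. The function name and data layout are assumptions, both metrics are assumed to be oriented so that higher means better, and the differences are left in each metric's own units, which may not match how the actual plots scale their axes.

```python
from itertools import permutations
from statistics import median


def pairwise_disagreements(x, y, groups):
    """Return (dx, dy, A, B) points where X says A is worse and Y says A is better.

    x, y: {image_id: score} for metric X and metric Y, higher = better (assumed).
    groups: {reference_id: [image_id, ...]}.
    """
    med_x, med_y = median(x.values()), median(y.values())
    points = []
    for images in groups.values():
        for a, b in permutations(images, 2):
            dx = x[b] - x[a]  # how much worse X says A is (horizontal axis)
            dy = y[a] - y[b]  # how much better Y says A is (vertical axis)
            if dx <= 0 or dy <= 0:
                continue  # the two metrics do not disagree in this direction
            # Keep the pair only if at least one image beats the overall
            # median according to at least one of the two metrics.
            above_median = (
                x[a] > med_x or x[b] > med_x or y[a] > med_y or y[b] > med_y
            )
            if above_median:
                points.append((dx, dy, a, b))
    return points
```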

Secondly, we look at single compressed images and compute the percentile at which each image's metric score falls within the whole set of images (that is, including all compressed images regardless of which reference image they came from). This percentile can be seen as a MOS estimate. The plot shows cases where the percentiles differ significantly between the two metrics. Click on a point to open the compressed image and the corresponding reference image.
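
The percentile comparison can be sketched as below. This is an assumption-laden illustration: the mid-rank treatment of ties, the function names and the `threshold` of 30 percentile points (standing in for "differ significantly") are all hypothetical, and both metrics are assumed to be oriented so that higher means better.

```python
from bisect import bisect_left, bisect_right


def percentile_ranks(scores):
    """Map each image to the percentile (0..100) of its score in the whole set.

    scores: {image_id: score}. Ties receive the middle of their rank range.
    """
    ordered = sorted(scores.values())
    n = len(ordered)
    return {
        img: 100.0 * (bisect_left(ordered, s) + bisect_right(ordered, s)) / (2 * n)
        for img, s in scores.items()
    }


def mos_discrepancies(x, y, threshold=30.0):
    """List (percentile difference, image) where metrics X and Y disagree strongly."""
    px, py = percentile_ranks(x), percentile_ranks(y)
    diffs = [(px[img] - py[img], img) for img in px]
    # Keep only large differences, largest absolute difference first.
    large = [d for d in diffs if abs(d[0]) >= threshold]
    return sorted(large, key=lambda d: -abs(d[0]))
```

A positive difference means metric X ranks the image higher within the set than metric Y does.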

SSIM versus SSIMULACRA 2 modelD: pairwise disagreements | predicted MOS discrepancies (per encoder setting)
PSNR-Y versus SSIMULACRA 2 modelD: pairwise disagreements | predicted MOS discrepancies (per encoder setting)
PSNR-Y versus SSIM: pairwise disagreements | predicted MOS discrepancies (per encoder setting)
Butteraugli 3norm versus SSIMULACRA 2 modelD: pairwise disagreements | predicted MOS discrepancies (per encoder setting)
Butteraugli 3norm versus SSIM: pairwise disagreements | predicted MOS discrepancies (per encoder setting)
Butteraugli 3norm versus PSNR-Y: pairwise disagreements | predicted MOS discrepancies (per encoder setting)
DSSIM versus SSIMULACRA 2 modelD: pairwise disagreements | predicted MOS discrepancies (per encoder setting)
DSSIM versus SSIM: pairwise disagreements | predicted MOS discrepancies (per encoder setting)
DSSIM versus PSNR-Y: pairwise disagreements | predicted MOS discrepancies (per encoder setting)
DSSIM versus Butteraugli 3norm: pairwise disagreements | predicted MOS discrepancies (per encoder setting)
PSNR-HVS versus SSIMULACRA 2 modelD: pairwise disagreements | predicted MOS discrepancies (per encoder setting)
PSNR-HVS versus SSIM: pairwise disagreements | predicted MOS discrepancies (per encoder setting)
PSNR-HVS versus PSNR-Y: pairwise disagreements | predicted MOS discrepancies (per encoder setting)
PSNR-HVS versus Butteraugli 3norm: pairwise disagreements | predicted MOS discrepancies (per encoder setting)
PSNR-HVS versus DSSIM: pairwise disagreements | predicted MOS discrepancies (per encoder setting)
VMAF-NEG versus SSIMULACRA 2 modelD: pairwise disagreements | predicted MOS discrepancies (per encoder setting)
VMAF-NEG versus SSIM: pairwise disagreements | predicted MOS discrepancies (per encoder setting)
VMAF-NEG versus PSNR-Y: pairwise disagreements | predicted MOS discrepancies (per encoder setting)
VMAF-NEG versus Butteraugli 3norm: pairwise disagreements | predicted MOS discrepancies (per encoder setting)
VMAF-NEG versus DSSIM: pairwise disagreements | predicted MOS discrepancies (per encoder setting)
VMAF-NEG versus PSNR-HVS: pairwise disagreements | predicted MOS discrepancies (per encoder setting)

Encoder performance according to various metrics

For every metric, there are various plots that can help to evaluate encoder performance and speed/compression trade-offs. The plots show bitrate/distortion curves on the left and encode speed on the right, so both aspects can be viewed simultaneously. In most plots, the aggregation is done per encode setting (the 'quality' parameter), showing the average file size versus the average (or 10th percentile) metric score. The average file size is expressed either in bits per pixel, or as a percentage gain compared to using a baseline codec (unoptimized JPEG) that achieves the same (average or 10th percentile) metric score. Another way of comparing is to select the encode quality setting on a per-image basis, choosing the lowest setting that reaches a given metric score. Finally, there are plots of 'encoder consistency', which is an indicator of how much image-dependent variation there is in the resulting perceptual quality (according to the metric) for each encode setting: lower curves are better here, since they mean encode settings can be mapped to perceptual fidelity targets more reliably.
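
The aggregations described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code behind the plots: the record layout, the function names, the linear interpolation between curve points and the use of the population standard deviation are all assumptions, and the metric is assumed to be oriented so that higher means better (so the "10th percentile" is the low end of the scores).

```python
from statistics import mean, pstdev


def percentile(values, p):
    """Linearly interpolated p-th percentile, 0 <= p <= 100."""
    v = sorted(values)
    k = (len(v) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(v) - 1)
    return v[lo] + (v[hi] - v[lo]) * (k - lo)


def aggregate(records):
    """Aggregate (setting, bpp, score) rows, one row per encoded image.

    Returns {setting: {"bpp", "mean", "p10", "std"}}; "std" is the
    consistency indicator (lower = more consistent).
    """
    by_setting = {}
    for setting, bpp, score in records:
        by_setting.setdefault(setting, []).append((bpp, score))
    out = {}
    for setting, rows in sorted(by_setting.items()):
        scores = [s for _, s in rows]
        out[setting] = {
            "bpp": mean(b for b, _ in rows),
            "mean": mean(scores),
            "p10": percentile(scores, 10),
            "std": pstdev(scores),
        }
    return out


def bpp_at(curve, target):
    """Interpolate bpp at a target score on a curve of (score, bpp) points.

    Returns None when the target lies outside the measured score range.
    """
    pts = sorted(curve)
    for (s0, b0), (s1, b1) in zip(pts, pts[1:]):
        if s0 <= target <= s1:
            if s1 == s0:
                return min(b0, b1)
            return b0 + (b1 - b0) * (target - s0) / (s1 - s0)
    return None


def gain_percent(curve, baseline, target):
    """Percentage size reduction versus the baseline codec at the same score."""
    ours, base = bpp_at(curve, target), bpp_at(baseline, target)
    if ours is None or base is None:
        return None
    return 100.0 * (1.0 - ours / base)


def lowest_setting_reaching(per_image, target):
    """Per-image selection: lowest setting whose score reaches the target.

    per_image: list of (setting, bpp, score) rows for a single image.
    """
    reaching = [row for row in per_image if row[2] >= target]
    return min(reaching) if reaching else None
```

For a per-setting gain plot, `curve` and `baseline` would be built from the `"mean"` (or `"p10"`) and `"bpp"` values returned by `aggregate` for the tested encoder and for unoptimized JPEG respectively.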

Encoders

The following encoders were tested:

Most of these encoders were tested with various speed settings. Mozjpeg -revert corresponds to unoptimized libjpeg-turbo, while default mozjpeg is slower and more optimized. All encoders were tested using only a single core (no multithreading) on an Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz with moderate load. Speed measurements were not done very accurately (CPU time was simply measured once per encoded image), so at the high-speed end in particular this led to inaccurate measurements (mozjpeg -revert is in reality faster, and the reported speed is an underestimate).
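
A single-shot CPU time measurement of the kind described could look like the sketch below. The function name and the way the encoder is invoked are assumptions, since the actual benchmark harness is not shown here. The sketch relies on the child-process CPU counters from `os.times()`, which are only meaningful on Unix-like systems.

```python
import os
import subprocess


def encode_cpu_seconds(command):
    """Run one encode command and return the CPU time (user + system) it used.

    command: argument list for the encoder, e.g. a hypothetical
             ["some_encoder", "input.png", "output.img"].
    Measuring only once per image keeps the benchmark cheap but noisy,
    especially for very fast encodes.
    """
    before = os.times()
    subprocess.run(
        command,
        check=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    after = os.times()
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    return user + system
```

Because the counters have limited resolution and include process start-up cost, very short encodes tend to look slower than they are, which is consistent with the underestimate noted above for the fastest settings.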

SSIMULACRA 2 modelD

compression gain compared to unoptimized JPEG, aligned by average SSIMULACRA 2 modelD (per encode setting)
compression gain compared to unoptimized JPEG, aligned by percentile 10 worst SSIMULACRA 2 modelD (per encode setting)
compression gain compared to unoptimized JPEG, aligned by SSIMULACRA 2 modelD (assuming adjusting the encode setting per image)
bpp vs SSIMULACRA 2 modelD (assuming adjusting the encode setting per image)
bpp vs average SSIMULACRA 2 modelD (per encode setting)
bpp vs percentile 10 worst SSIMULACRA 2 modelD score (per encode setting)
encoder consistency (standard deviation of SSIMULACRA 2 modelD per encode setting)

DSSIM

compression gain compared to unoptimized JPEG, aligned by average DSSIM (per encode setting)
compression gain compared to unoptimized JPEG, aligned by percentile 10 worst DSSIM (per encode setting)
compression gain compared to unoptimized JPEG, aligned by DSSIM (assuming adjusting the encode setting per image)
bpp vs DSSIM (assuming adjusting the encode setting per image)
bpp vs average DSSIM (per encode setting)
bpp vs percentile 10 worst DSSIM score (per encode setting)
encoder consistency (standard deviation of DSSIM per encode setting)

Butteraugli 3-norm

compression gain compared to unoptimized JPEG, aligned by average Butteraugli 3-norm (per encode setting)
compression gain compared to unoptimized JPEG, aligned by percentile 10 worst Butteraugli 3-norm (per encode setting)
compression gain compared to unoptimized JPEG, aligned by Butteraugli 3-norm (assuming adjusting the encode setting per image)
bpp vs Butteraugli 3-norm (assuming adjusting the encode setting per image)
bpp vs average Butteraugli 3-norm (per encode setting)
bpp vs percentile 10 worst Butteraugli 3-norm score (per encode setting)
encoder consistency (standard deviation of Butteraugli 3-norm per encode setting)

SSIMULACRA

compression gain compared to unoptimized JPEG, aligned by average SSIMULACRA (per encode setting)
compression gain compared to unoptimized JPEG, aligned by percentile 10 worst SSIMULACRA (per encode setting)
compression gain compared to unoptimized JPEG, aligned by SSIMULACRA (assuming adjusting the encode setting per image)
bpp vs SSIMULACRA (assuming adjusting the encode setting per image)
bpp vs average SSIMULACRA (per encode setting)
bpp vs percentile 10 worst SSIMULACRA score (per encode setting)
encoder consistency (standard deviation of SSIMULACRA per encode setting)

PSNR-Y

compression gain compared to unoptimized JPEG, aligned by average PSNR-Y (per encode setting)
compression gain compared to unoptimized JPEG, aligned by percentile 10 worst PSNR-Y (per encode setting)
compression gain compared to unoptimized JPEG, aligned by PSNR-Y (assuming adjusting the encode setting per image)
bpp vs PSNR-Y (assuming adjusting the encode setting per image)
bpp vs average PSNR-Y (per encode setting)
bpp vs percentile 10 worst PSNR-Y score (per encode setting)
encoder consistency (standard deviation of PSNR-Y per encode setting)

PSNR-HVS

compression gain compared to unoptimized JPEG, aligned by average PSNR-HVS (per encode setting)
compression gain compared to unoptimized JPEG, aligned by percentile 10 worst PSNR-HVS (per encode setting)
compression gain compared to unoptimized JPEG, aligned by PSNR-HVS (assuming adjusting the encode setting per image)
bpp vs PSNR-HVS (assuming adjusting the encode setting per image)
bpp vs average PSNR-HVS (per encode setting)
bpp vs percentile 10 worst PSNR-HVS score (per encode setting)
encoder consistency (standard deviation of PSNR-HVS per encode setting)

SSIM

compression gain compared to unoptimized JPEG, aligned by average SSIM (per encode setting)
compression gain compared to unoptimized JPEG, aligned by percentile 10 worst SSIM (per encode setting)
compression gain compared to unoptimized JPEG, aligned by SSIM (assuming adjusting the encode setting per image)
bpp vs SSIM (assuming adjusting the encode setting per image)
bpp vs average SSIM (per encode setting)
bpp vs percentile 10 worst SSIM score (per encode setting)
encoder consistency (standard deviation of SSIM per encode setting)

MS-SSIM

compression gain compared to unoptimized JPEG, aligned by average MS-SSIM (per encode setting)
compression gain compared to unoptimized JPEG, aligned by percentile 10 worst MS-SSIM (per encode setting)
compression gain compared to unoptimized JPEG, aligned by MS-SSIM (assuming adjusting the encode setting per image)
bpp vs MS-SSIM (assuming adjusting the encode setting per image)
bpp vs average MS-SSIM (per encode setting)
bpp vs percentile 10 worst MS-SSIM score (per encode setting)
encoder consistency (standard deviation of MS-SSIM per encode setting)

CIEDE2000

compression gain compared to unoptimized JPEG, aligned by average CIEDE2000 (per encode setting)
compression gain compared to unoptimized JPEG, aligned by percentile 10 worst CIEDE2000 (per encode setting)
compression gain compared to unoptimized JPEG, aligned by CIEDE2000 (assuming adjusting the encode setting per image)
bpp vs CIEDE2000 (assuming adjusting the encode setting per image)
bpp vs average CIEDE2000 (per encode setting)
bpp vs percentile 10 worst CIEDE2000 score (per encode setting)
encoder consistency (standard deviation of CIEDE2000 per encode setting)

VMAF-NEG

compression gain compared to unoptimized JPEG, aligned by average VMAF-NEG (per encode setting)
compression gain compared to unoptimized JPEG, aligned by percentile 10 worst VMAF-NEG (per encode setting)
compression gain compared to unoptimized JPEG, aligned by VMAF-NEG (assuming adjusting the encode setting per image)
bpp vs VMAF-NEG (assuming adjusting the encode setting per image)
bpp vs average VMAF-NEG (per encode setting)
bpp vs percentile 10 worst VMAF-NEG score (per encode setting)
encoder consistency (standard deviation of VMAF-NEG per encode setting)