DiceBench
A Post-Human Level Benchmark

Introducing the first PHL (Post-Human Level) Benchmark for testing superintelligent AI systems. Developed by becose.

Motivation

Our analysis of benchmark lifespans suggests we need evaluation methods that can meaningfully differentiate between systems operating beyond human performance. Just as humans can intuitively predict the trajectory of moving vehicles, a task that would be nearly impossible for simpler animals, we expect more advanced AI systems to predict complex physical systems such as dice rolls with increasing accuracy, even when humans cannot. Humans remain poor at this task even when given unlimited time to analyze the video data, which suggests a cognitive rather than a perceptual constraint. This creates an opportunity to measure intelligence at levels far above human capability, rather than treating human-level intelligence as a ceiling.

Our analysis of benchmark lifespans (documented at H-Matched) shows that AI systems are reaching human-level performance on existing benchmarks with increasing frequency. This trend indicates that complementary approaches to evaluation could help in measuring and understanding artificial intelligence.

Post-Human Level (PHL) Benchmarks

We propose Post-Human Level (PHL) Benchmarks as a paradigm shift away from anthropocentric evaluation methods. By moving beyond human performance as our reference point, we can develop more meaningful standards for measuring artificial intelligence. A PHL Benchmark is defined by three key criteria that deliberately transcend traditional human-centric metrics:

1. Information Completeness

Each datapoint must contain sufficient information to theoretically achieve better performance than random guessing. In DiceBench, each video frame sequence contains all the physical information (momentum, rotation, surface properties) needed to predict the outcome, even though humans cannot process this information effectively.

2. Human Performance Gap

Breaking free from anthropocentric bias, the benchmark must measure capabilities that transcend human cognitive limitations. By design, human performance should be demonstrably far from optimal, challenging our assumption that human-level performance is a meaningful milestone for advanced AI systems.

3. Objective Evaluation

Each data point must have an unambiguous, verifiable correct answer, allowing for precise performance measurement. This enables us to identify superior performance even in domains where humans perform poorly. In DiceBench, each video has exactly one correct final die outcome.

DiceBench Overview

Description

DiceBench consists of a private evaluation set of 100 videos and a public dataset of 10 videos, which is available on GitHub and through the interactive test on this website. All videos are recorded with a handheld Galaxy S24 camera and capture dice rolls across ten different surface types. Each sequence shows a die of varying color and material being rolled, and is cut exactly 0.5 seconds before the die comes to rest, after at least two bounces on the surface.
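The clip timing can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rest time of 3.0 seconds is a made-up value, the function names are hypothetical, and the page does not describe how the authors determine the moment the die comes to rest. The 24 FPS default matches the frame rate used in the evaluation process.

```python
def cut_time(rest_time_s, lead_s=0.5):
    """Timestamp at which a clip ends: lead_s seconds before the die rests."""
    cut = rest_time_s - lead_s
    if cut <= 0:
        raise ValueError("the die must rest later than the lead time")
    return cut


def frame_timestamps(clip_length_s, fps=24):
    """Timestamps (in seconds) of the frames sampled from the trimmed clip."""
    n_frames = int(clip_length_s * fps)
    return [i / fps for i in range(n_frames)]


# Hypothetical roll: the die comes to rest 3.0 s into the recording.
clip_end = cut_time(3.0)
stamps = frame_timestamps(clip_end)
print(clip_end, len(stamps))
```

Every sampled frame lies strictly before the cut point, so the final outcome is never visible in the frames.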

While all necessary physical information for prediction is present in the videos (momentum, rotation, surface properties), the timing makes the final outcome challenging to determine through human perception alone. The public dataset allows researchers to benchmark current vision models like GPT-4o before requesting access to the full evaluation set, which is kept private to maintain benchmark integrity.

Evaluation Process

The evaluation methodology involves running each vision model through multiple trials per video to ensure reliable results. For GPT-4o, we conduct five independent prediction attempts per video in the dataset, with the final accuracy calculated as the average performance across these trials. The models are provided with frame sequences extracted at 24 FPS from each video and instructed to predict the final die outcome with a single numerical response, following OpenAI's video processing guide. This standardized process ensures consistent evaluation conditions across different models while minimizing the impact of potential variations in model responses. The complete evaluation scripts are available on GitHub.
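The scoring step described above, several trials per video with accuracy averaged across trials, might look like the following minimal sketch. It is not the project's evaluation script: the function names, the sample labels, and the sample responses are hypothetical, and counting an unparseable response as incorrect is an assumption made here.

```python
import re
from statistics import mean


def parse_outcome(response):
    """Return the die face (1-6) if the response is a single digit, else None."""
    match = re.fullmatch(r"\s*([1-6])\s*", response)
    return int(match.group(1)) if match else None


def trial_accuracy(predictions, labels):
    """Fraction of videos predicted correctly within one trial."""
    if len(predictions) != len(labels):
        raise ValueError("one prediction per video is required")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def mean_accuracy(trials, labels):
    """Average the per-trial accuracies across all trials."""
    return mean(trial_accuracy(t, labels) for t in trials)


# Hypothetical ground truth for four videos and three trials of raw responses.
labels = [3, 6, 1, 4]
raw_trials = [
    ["3", "2", "1", "5"],
    ["3", "6", "5", "5"],
    ["1", "1", "1", "one"],  # "one" fails to parse and counts as incorrect
]
trials = [[parse_outcome(r) for r in trial] for trial in raw_trials]
score = mean_accuracy(trials, labels)
print(f"{score:.3f}")
```

Averaging per-trial accuracies equals pooling all predictions only when every trial covers the same set of videos, which is the case in this setup.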

Initial Results & Limitations

Our preliminary testing with GPT-4o on the public dataset (n=10) showed an accuracy of 33%, while human participants (n=3) achieved 27%. While these results are above the random baseline of 16.7%, we acknowledge that the small sample size limits their statistical significance. The higher-than-random performance might stem from inherent biases in both human perception and LLM training data regarding certain numbers or dice patterns, rather than true predictive ability.
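To make the sample-size caveat concrete, one option is an exact one-sided binomial test against the 1/6 random baseline. The sketch below is an illustration only: the counts (3 correct out of 10) are hypothetical and are not the study's actual tallies, and the test treats predictions as independent, which repeated trials on the same video may violate.

```python
from math import comb


def binomial_tail(k, n, p=1 / 6):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more hits by guessing."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical counts: 3 correct predictions out of 10 videos.
p_value = binomial_tail(3, 10)
print(f"{p_value:.3f}")
```

For these illustrative counts the tail probability is roughly 0.22, so a score of about 30% on ten predictions is well within what random guessing would produce.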

However, we believe the core concept of using dice prediction as a PHL benchmark remains viable. We encourage researchers to view this project as an initial exploration of post-human evaluation methods, rather than a definitive benchmark. The evaluation scripts are openly available for those interested in conducting larger-scale tests with GPT-4o or other models. We welcome collaboration in refining this approach and developing more robust PHL benchmarks.

Leaderboard

System               Accuracy
GPT-4o               33.0%
Human Performance    27.0%
Random Baseline      16.7%

Try it Yourself

Below is an example video that demonstrates the task. The video stops exactly 0.5 seconds before the die comes to rest, and your challenge is to predict the final number shown on the die. You can use the controls to play, pause, step through frames, and adjust playback speed. You can also zoom in and pan around the video using your mouse wheel or pinch gestures on mobile.

Ready to Test Your Prediction Skills?

Try to predict the final number shown on the die in 10 different videos. Use the controls above to analyze each throw carefully.

Note: This example is representative of the videos in the evaluation set but is not part of the official benchmark.

Access & Contact

The evaluation set is kept private to maintain benchmark integrity. Researchers and organizations interested in evaluating their models can contact us for access to the private dataset. We encourage the AI research community to join us in developing more PHL Benchmarks as we move into an era where traditional human-comparative benchmarks may no longer be sufficient.

Citation

If you use DiceBench in your research, please cite our work:

@misc{dicebench2024,
  title = {DiceBench: A Post-Human Level Benchmark},
  author = {Lindahl, Rasmus},
  year = {2024},
  publisher = {becose},
  url = {https://dicebench.vercel.app},
  note = {AI consultancy specializing in advanced machine learning solutions}
}