Measurements are hard (for now)

As AI labs deliver increasingly capable computer vision, there are new opportunities to accelerate workflows that rely on visual information captured in documents. The data software needs to process that visual information is often missing from the document itself (e.g., a rasterized PDF without vector or text data). Advances in computer vision like optical character recognition, semantic segmentation, and spatial reasoning therefore provide promising ways to extract this data.

Spatial reasoning is a model's ability to infer the space and dimensions captured in an image. It could be particularly useful for identifying the points needed to measure objects in a scaled drawing, such as an architectural drawing.

To better understand the spatial reasoning offered by the mainstream labs and its usefulness in locating points in an image for this use case, I ran experiments with Anthropic's latest models. I wanted to write a quick post about my learnings and experience.

tl;dr: measurements are hard (for now). The best-performing model, Sonnet 4.5 with thinking off, generated measurements within 5 pixels (1 foot in real-life dimensions) of a ground-truth measurement 85% of the time. That's not sufficient for workflows with little margin for error, but it demonstrates promising potential.

In this post, we'll explore this conclusion through three topics.

  • PDF, or 93 'Till Infinity
  • Real world challenges and the world of architectural drawings
  • Experiment results and learnings

PDF, or 93 'Till Infinity

PDF (short for Portable Document Format) is a file format that represents a fixed-page layout document. It was developed to enable users to store, exchange, and display documents on any screen and platform. Released in 1993 by Adobe, it was a follow-up to the company's popular PostScript format, an earlier page description language focused on printing. Since then, PDF has become the predominant format for cross-platform fixed-page layout documents, and it became an open standard in 2008 (now ISO 32000).

PDF documents store a variety of data and support a multitude of features. For our purposes, the main data types to note are text, vector graphics, and raster images. The ISO spec is the best resource for learning more about the full breadth of PDF.

Today, the most popular AI labs (Anthropic, OpenAI, and Google) allow users to include PDF documents in their interactions with models.
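For example, Anthropic's Messages API accepts a PDF as a document content block. Here is a minimal sketch using the Anthropic Python SDK; the file path and prompt are placeholders:

```python
# Minimal sketch: attaching a PDF to a Claude message via the Anthropic
# Python SDK. The file path and prompt are placeholders.
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
with open("drawing.pdf", "rb") as f:
    pdf_b64 = base64.standard_b64encode(f.read()).decode()

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "document",
             "source": {"type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_b64}},
            {"type": "text", "text": "Describe this drawing."},
        ],
    }],
)
print(response.content[0].text)
```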

Under the hood, PDF files go through a pre-processing pipeline that converts the data into formats optimized for today's models: text and/or raster images. Typically, text natively embedded in the PDF is extracted, and then each page of the document is rasterized and converted into image files. Both are then fed to the model for processing. While the exact implementations vary from lab to lab and even product to product, this basic backbone is the current status quo.
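To make that backbone concrete, here is a rough sketch using PyMuPDF; the library choice and rendering scale are my assumptions, not any lab's actual implementation:

```python
# Rough sketch of the text-plus-raster backbone using PyMuPDF
# (pip install pymupdf). Not any lab's actual pipeline.
import fitz  # PyMuPDF

doc = fitz.open("drawing.pdf")
for i, page in enumerate(doc):
    text = page.get_text()                           # embedded text, if any
    pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))  # rasterize at 2x scale
    pix.save(f"page-{i}.png")
    # Both the extracted text and page-{i}.png would be fed to the model.
```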

Real world challenges and the world of architectural drawings

While PDF is a data-rich format, many workflows across the economy cannot take advantage of that data. Workflows that involve users from multiple organizations typically suffer from a lossy communication process. Like a message relayed during a game of telephone, a document can lose valuable data when passed from person to person. The problem takes root when the organizations involved in a workflow lack unifying incentives to maintain data integrity.

One striking example is the estimation workflow common in construction projects. It uses architectural drawings and other key information stored in project documents to estimate a budget for a project. Architectural drawings are highly detailed schematics created with CAD software. On creation, they typically contain detailed text, vector, and raster image data that can be used to derive details about the project, including parts and dimensions.

However, by the time many contractors receive the documents to kick-start estimation, the PDF lacks accessible text and vector data because an upstream party packaged it as a rasterized document. As a result, many contractors perform manual (though computer-assisted) measurements on the drawing, a time-consuming process.

I find this to be a captivating use case for a computer-vision-augmented solution, so I used it as the basis for my experiments.

Experiments with Anthropic's models

Benchmarking is complicated. To quantify performance for this use case, I wanted to get hands-on.

The following is a piece of wisdom I picked up from @chrisduerr of the Alacritty project, and I think it applies to testing AI models.

Benchmarking terminal emulators is complicated...If you have doubts about Alacritty's performance or usability, the best way to quantify terminal emulators is always to test them with your specific usecases.

In other words, the best way to quantify performance is to test with your specific use cases.

My specific use case is measuring the length of an object in a rasterized architectural drawing: could Anthropic's models measure the length of a red line in a scaled drawing to within 1 foot of its actual real-world length? This would require the model to accurately identify the start and end points of the red line.

To test the models, I set up an evaluation that compares the results of a deterministic measurement process against a ground-truth example. Details of the process and the ground-truth example are documented here. I set a binary pass-fail criterion: if the measured length was within 5 pixels of the actual length, the run passed. In this particular image, 1 pixel corresponds to roughly 2.5 inches in real life, so the 5-pixel tolerance corresponds to 12.5 inches, or roughly 1 foot. I implemented and managed the experiments using promptfoo. My full configuration is available on GitHub.
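As a sketch, the pass-fail check boils down to a few lines. The scale and tolerance constants match the numbers above; the endpoint coordinates are whatever the model identifies:

```python
# Sketch of the pass-fail criterion: a measurement passes if it lands
# within 5 pixels (~1 foot at ~2.5 inches per pixel) of the ground truth.
import math

INCHES_PER_PIXEL = 2.5  # approximate scale of the test drawing
TOLERANCE_PX = 5        # ~12.5 inches, roughly 1 foot

def line_length_px(start: tuple[float, float], end: tuple[float, float]) -> float:
    """Deterministic length from the model-identified endpoints, in pixels."""
    return math.dist(start, end)

def passes(measured_px: float, ground_truth_px: float) -> bool:
    return abs(measured_px - ground_truth_px) < TOLERANCE_PX
```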

Models tested:

  • claude-opus-4-5-20251101
  • claude-sonnet-4-5-20250929
  • claude-haiku-4-5-20251001

Results:

In my first experiment, I compared all three models with default settings. Over 10 iterations, Sonnet performed the best, passing 10 out of 10 times. Opus and Haiku passed 0 out of 10 times.

              Opus 4.5      Sonnet 4.5    Haiku 4.5
Requests      10            10            10
Asserts       0/10 passed   10/10 passed  0/10 passed
Total Cost    $0.2169       $0.1331       $0.0371
Total Tokens  22,994        23,190        21,743
Avg Latency   12,043 ms     11,637 ms     4,781 ms

Full results available at https://www.promptfoo.app/eval/eval-eBN-2025-12-25T17:00:16

In my second experiment, I compared Sonnet with thinking enabled and disabled. Over 20 iterations, Sonnet with thinking disabled passed 17 out of 20 times, while Sonnet with thinking enabled passed 13 out of 19 times (an API timeout error occurred on one run).

              Sonnet 4.5     Sonnet 4.5 (thinking enabled)
Requests      20             20
Asserts       17/20 passed   13/19 passed (1 error)
Total Cost    $0.2408        $0.6946
Total Tokens  44,694         73,953
Avg Latency   18,399 ms      43,114 ms

Full results available at https://www.promptfoo.app/eval/eval-KAd-2025-12-25T17:29:28

Overall, Sonnet 4.5 with thinking off performed the best. It generated measurements within 5 pixels (1 foot in real-life dimensions) of a ground truth measurement 85% of the time.

Learnings

  • For now, the spatial reasoning ability of Anthropic's major models is not sufficient for workflows with little margin for error, such as high-stakes measurements in construction or engineering projects. However, workflows with greater tolerance could benefit from today's capabilities.
  • Reasoning ≠ Perception. The best reasoning models are not the best vision models. I was surprised to see Anthropic's Sonnet perform better than its flagship Opus on this evaluation.
  • Thinking mode does not unlock better computer vision abilities.
  • Decompose a workflow into deterministic steps when possible. Entropy introduced by a model will reduce the success rate and/or increase costs. In earlier versions of my experiment, I asked the model to figure out how to calculate the length and carry out its own steps, and there was significant variation and error in its methods. By defining the workflow deterministically and embedding it in the prompt, I reduced the scope of work the model was doing for this task. There is a further opportunity to limit the model to solely identifying and returning the pixel coordinates (see the sketch after this list).
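Here is a minimal sketch of that fully decomposed version. The JSON schema and scale constant are my assumptions; the model's only job is to locate the endpoints, and everything downstream is deterministic:

```python
# Sketch: the model returns only endpoint pixel coordinates as JSON,
# e.g. {"start": [112, 340], "end": [512, 340]}; the length calculation
# is deterministic code. The schema and scale constant are assumptions.
import json
import math

INCHES_PER_PIXEL = 2.5  # taken from the drawing's scale (assumed known)

def length_in_feet(model_output: str) -> float:
    points = json.loads(model_output)
    pixels = math.dist(points["start"], points["end"])
    return pixels * INCHES_PER_PIXEL / 12
```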

Going Forward

It was rewarding to learn more about computer vision, architectural drawings, PDF documents, and the limits of Anthropic's general purpose models. I'm excited to see where computer vision evolves from here and what applications we'll see throughout the economy.

Going forward, my hunch is that we will see rapid advances in the space as more time, attention, and talent pile into AI. Personally, I find a handful of questions intriguing.

  • Can specialized models increase the accuracy of measurements in architectural drawings?
  • Can specialized pre-processing pipelines that feed larger, higher-resolution images to models increase accuracy?
  • Where is high accuracy not needed? Are there workflows that can benefit from the AI computer vision already available today?

For now, thanks for reading.