Portion Size from 2D Photos — The Depth Problem
Estimating the weight and volume of a food portion from a single 2-D photograph is a fundamentally ill-posed inverse problem: an infinite number of 3-D scenes project to the same flat image, and without additional depth information the correct reconstruction cannot be uniquely determined. This is why even the best monocular portion-estimation models carry a mean absolute error of 25–35% on unseen test sets, a figure that barely improves when models scale to hundreds of millions of parameters. The challenge is not classification accuracy (modern CNNs classify common foods above 90% top-1) but geometric reconstruction: translating pixel area into real-world volume requires knowing the distance from camera to plate, the plate diameter as a scale reference, and the food’s 3-D surface topology, and converting that volume into calories additionally requires the density of each constituent. None of these is directly encoded in a 2-D image.
Monocular Depth Estimation: Learning Geometry from Appearance
When you look at a photograph of a mounded bowl of rice, your visual system reconstructs the 3-D shape from appearance cues: the way the surface texture compresses toward the edges, the shadow patterns on the sides of the mound, the slight perspective foreshortening of the bowl rim. Monocular depth estimation networks (MiDaS, DPT, Depth-Anything) learn to do the same thing from millions of image-depth pairs whose ground-truth depth was measured by LiDAR or structured light when the training data was collected.
Applied to food photographs, monocular depth networks can distinguish a mounded portion from a flat spread, and they can detect that a bowl has walls that extend below the visible surface. But they return relative depth (the rice is higher than the plate, the sauce is at plate level, the bowl edge is a reference) without knowing the absolute scale. A bowl of rice and a bowl of risotto can produce nearly identical depth maps despite differing by 40% in caloric density per unit volume. And a large bowl of rice 40 cm from the camera can occupy the same pixel area, and yield the same relative depth map, as a bowl half its size at 20 cm.
This is the fundamental limitation: monocular depth estimation is scale-ambiguous. Recovering absolute volume requires at least one known physical reference in the scene, and the model must be told, or must learn, what that reference is.[1]
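The scale ambiguity can be illustrated with a pinhole camera model. The sketch below is only an illustration: the focal length and bowl widths are assumed values, and `projected_width_px` is a hypothetical helper, not part of any depth-estimation library.

```python
def projected_width_px(real_width_cm: float, distance_cm: float, focal_px: float) -> float:
    """Pinhole model: width in pixels of an object that faces the camera."""
    return focal_px * real_width_cm / distance_cm


FOCAL_PX = 1500.0  # assumed focal length in pixels (hypothetical camera)

# A large bowl far away and a bowl half its size at half the distance.
large_far = projected_width_px(real_width_cm=24.0, distance_cm=40.0, focal_px=FOCAL_PX)
small_near = projected_width_px(real_width_cm=12.0, distance_cm=20.0, focal_px=FOCAL_PX)

# Same image footprint, but the real volumes differ by the cube of the size ratio.
volume_ratio = (24.0 / 12.0) ** 3

print(large_far, small_near, volume_ratio)
```

Both bowls cover the same number of pixels, yet under these assumptions one holds eight times the volume of the other, which is why a known reference dimension is needed.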
Training data composition compounds the problem. Most depth estimation models are trained on outdoor or indoor scenes (rooms, streets, objects) where scale references (doors, cars, people) are abundant and predictable. Food scenes are unusual: small objects at close range, high optical similarity between different foods, and enormous within-class variation (a chapati can be 15 cm or 28 cm in diameter depending on where it was made). Adapting general depth models to food portions requires fine-tuning on food-specific training sets, which are far smaller in scale than general scene datasets.
Reference Objects as Scale Anchors
The dominant practical solution to the scale-ambiguity problem is to introduce a physical reference object with a known real-world dimension into the photograph: a credit card (85.6 mm × 54 mm, per ISO/IEC 7810), a fork (~19 cm), a coin, or, most commonly in food photography contexts, the plate rim itself. The practical guide on photographing food from 5 angles shows how to apply these reference-object strategies in real meal scenarios.
Once the algorithm identifies the reference object and knows its real-world size, it can compute a pixels-per-centimetre scale factor for the image. With that scale factor, pixel-area measurements of food portions convert to real-world area. Combined with the relative depth map from a monocular depth network, real-world volume estimates become achievable.
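A minimal sketch of that conversion follows. It assumes a roughly top-down photo in which the card and the food lie in about the same plane, and it assumes the mean food height has already been converted to centimetres. The pixel measurements and helper names are hypothetical.

```python
def pixels_per_cm(reference_px: float, reference_cm: float) -> float:
    """Scale factor from a reference object of known real-world size."""
    return reference_px / reference_cm


def pixel_area_to_cm2(area_px: float, px_per_cm: float) -> float:
    """Convert a pixel count to real-world area; area scales with the square."""
    return area_px / (px_per_cm ** 2)


CARD_WIDTH_CM = 8.56       # long edge of a credit card, from the text above
card_width_px = 428.0      # hypothetical measured width of the card in the image
food_area_px = 250_000.0   # hypothetical pixel count of the segmented food region
mean_height_cm = 3.0       # assumed mean food height after scaling the depth map

scale = pixels_per_cm(card_width_px, CARD_WIDTH_CM)      # about 50 px/cm
food_area_cm2 = pixel_area_to_cm2(food_area_px, scale)   # about 100 cm^2
volume_cm3 = food_area_cm2 * mean_height_cm              # about 300 cm^3

print(round(scale, 2), round(food_area_cm2, 1), round(volume_cm3, 1))
```

Because area depends on the square of the scale factor, a small error in the measured reference width has a larger effect on the area estimate.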
Xu et al. 2013 (IEEE Engineering in Medicine and Biology Conference) demonstrated the impact in a controlled study of 50 foods: credit-card-anchored volume estimation reduced mean absolute error from 34% (without anchor) to 18% (with anchor).[1] The improvement is substantial, but an 18% MAE on volume still translates to meaningful calorie errors for calorie-dense foods. A 200 g portion of rice (approximately 260 kcal) with an 18% volume error could be estimated as anywhere from 164 g to 236 g, or roughly 213 to 307 kcal: a spread of almost 100 kcal.
Plate-rim anchoring is the most practical approach for consumer apps because users always have a plate, whereas a credit card requires deliberate placement. The challenge is that plates vary: standard dinner plates range from 22 cm to 30 cm in diameter across households, and the image alone does not reveal which size is in the frame. CalEye and similar apps handle this by using a population-level plate diameter prior (approximately 25 cm) as the default and allowing users to specify their actual plate size in settings, a calibration step that meaningfully reduces systematic estimation error for users who complete it.
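To see why the plate-size setting matters, consider how a wrong diameter prior propagates. The sketch below assumes all three food dimensions are scaled from the plate rim, so a linear scale error is cubed. If height came from an independent source the exponent would be 2. The function name is hypothetical.

```python
def volume_bias(assumed_plate_cm: float, actual_plate_cm: float, dims: int = 3) -> float:
    """Ratio of estimated to true volume when scale is taken from a wrong plate size.

    dims=3 assumes width, length, and height are all scaled from the plate rim.
    """
    return (assumed_plate_cm / actual_plate_cm) ** dims


PRIOR_CM = 25.0  # default plate diameter prior mentioned in the text

for actual_cm in (22.0, 25.0, 30.0):
    ratio = volume_bias(PRIOR_CM, actual_cm)
    print(f"actual plate {actual_cm:.0f} cm -> estimated/true volume = {ratio:.2f}")
```

Under this assumption, a 30 cm plate treated as 25 cm leads to an estimate well under two thirds of the true volume, while a 22 cm plate leads to a large overestimate.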
Structured Light and LiDAR: Hardware Solutions
Software-only approaches face a physical ceiling imposed by the 2-D projection problem. Hardware depth sensors bypass the problem by directly measuring depth rather than inferring it.
iPhone Pro models from the iPhone 12 Pro onward include a LiDAR scanner that emits infrared pulses and measures the time-of-flight return to compute per-pixel depth maps at approximately 20 frames per second. For consumer food photography at typical plate-to-camera distances (25–40 cm), the LiDAR scanner returns dense depth data that can be fused with the RGB camera feed.
Researchers from ETH Zürich reported the most precise published results using this approach. Liang et al. 2022 (Nutrients) fused iPhone LiDAR depth with RGB texture data for food volume estimation on uncontrolled consumer photographs and achieved a mean volume error of 8.3%, the best figure in the published literature for realistic shooting conditions.[2] The volume error of 8.3% translates to a calorie error of roughly the same percentage before accounting for density and database variance. Once those are included, the total per-portion error falls in the 15–20% range, meaningfully better than software-only approaches.
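How an absolute depth map becomes a volume can be sketched in a few lines. This is not the method of the cited study. It is a simplified illustration that assumes a top-down view, a flat plate at a known depth, and a pinhole camera, and it uses a plain nested list in place of a real sensor frame.

```python
def volume_from_depth_cm3(depth_cm, plate_depth_cm, focal_px):
    """Sum food height above the plate plane over every pixel.

    depth_cm: 2-D list of camera-to-surface distances in cm (top-down view).
    plate_depth_cm: distance from camera to the empty plate surface.
    focal_px: focal length in pixels (pinhole model).
    """
    volume = 0.0
    for row in depth_cm:
        for z in row:
            height = plate_depth_cm - z
            if height <= 0:
                continue  # pixel is at or below plate level: no food here
            # Approximate real-world side length covered by one pixel at distance z.
            pixel_side_cm = z / focal_px
            volume += height * pixel_side_cm ** 2
    return volume


# Hypothetical 100 x 100 pixel patch of food sitting 3 cm above a plate at 30 cm.
depth_map = [[27.0] * 100 for _ in range(100)]
lidar_volume = volume_from_depth_cm3(depth_map, plate_depth_cm=30.0, focal_px=1000.0)
print(round(lidar_volume, 2))
```

Because each depth value is measured in absolute units, no reference object is needed. The scale comes from the sensor itself.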
The practical limitation is hardware coverage. LiDAR is available only on iPhone Pro models (starting at iPhone 12 Pro) and some Android flagship devices. The depth range for close-up photography is also at the edge of the LiDAR system’s precision envelope: LiDAR is optimised for room-scale depth measurement (1–5 m) rather than the 30 cm distance at which most food photography occurs. Calibration for close-range food use requires specific software adaptation rather than using the standard ARKit depth output directly.
For the majority of smartphone users without LiDAR-capable devices, the hardware solution is unavailable, and software depth estimation with reference object anchoring remains the practical approach.
3-D Reconstruction from Multi-View Photos
An alternative hardware-free approach is to capture the same meal from multiple angles and run a Structure-from-Motion (SfM) reconstruction. SfM pipelines — originally developed for cartography and archaeology — recover full 3-D point clouds by identifying matching feature points across images taken from different viewpoints and computing the 3-D positions of those points geometrically.
Applied to food portions in controlled research settings, multi-view SfM reconstruction achieves volume errors in the 10–14% range, substantially better than single-shot monocular estimation.[1] The geometric reconstruction is accurate when the food surface has sufficient texture for feature matching (textured surfaces like salads or mixed dishes) and less accurate for homogeneous surfaces (white rice, mashed potato) where feature matching struggles.
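The geometric core of multi-view reconstruction can be shown with the simplest two-view case. The sketch below is not a full SfM pipeline, which also has to estimate the camera poses. It assumes a rectified image pair with a known baseline, and all numbers are hypothetical.

```python
def depth_from_disparity(focal_px: float, baseline_cm: float, disparity_px: float) -> float:
    """Depth of a matched feature in a rectified two-view pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("a matched feature must have positive disparity")
    return focal_px * baseline_cm / disparity_px


FOCAL_PX = 1500.0    # assumed focal length in pixels
BASELINE_CM = 10.0   # assumed distance the phone moved between the two shots

# Hypothetical matched features: one on the plate surface, one on top of the food.
plate_depth = depth_from_disparity(FOCAL_PX, BASELINE_CM, disparity_px=500.0)
top_depth = depth_from_disparity(FOCAL_PX, BASELINE_CM, disparity_px=600.0)
food_height = plate_depth - top_depth

print(plate_depth, top_depth, food_height)
```

Every depth value depends on finding the same point in both images. On a uniform surface such as white rice there are few distinctive points to match, so the disparity, and with it the depth, cannot be measured reliably.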
The practical limitation is user compliance. Asking users to photograph their meal from two or three angles before eating adds friction that most users will not sustain. A/B studies of multi-shot food logging workflows consistently show 60–70% drop-off in engagement relative to single-shot capture.[1] The food logging habit is fragile: each additional step is an opportunity for the habit to break.
The ideal implementation for multi-view reconstruction is a short video pass (2–3 seconds of video moving around the plate) processed server-side. Video captures sufficient viewpoints for lightweight SfM reconstruction while the user’s required action (a slow sweep of the phone) feels natural and minimal. This approach is technically feasible and is likely the near-term direction for high-accuracy consumer food volume estimation.
Density Assumptions: The Hidden Error Source
Even after solving the volume estimation problem, converting volume to calories requires two more steps: volume to mass (requiring food-specific density), and mass to macronutrients (requiring nutritional data per gram of that specific food).
Density is where many portion estimation systems introduce a hidden error. Food density varies substantially within a food category, and for carbohydrates in particular it shifts with cooking method. Resistant starch formation during cooling actually changes the physical density of rice and potato, which is one reason day-old rice yields a different volume estimate than freshly cooked rice of the same weight:
- Cooked white rice density ranges from approximately 0.6 g/cm³ (very fluffy, freshly cooked) to 0.85 g/cm³ (compressed, day-old rice used in sushi). A 30% density error propagates as a 30% calorie error.
- Chicken breast density varies by cooking method: raw (~1.05 g/cm³), roasted (~1.15 g/cm³, due to moisture loss), poached (~0.95 g/cm³).
- Bread density varies by type: dense sourdough at ~0.35 g/cm³ versus light sandwich bread at ~0.15 g/cm³.
Modern portion estimation pipelines handle density uncertainty by mapping food classification outputs to density distributions (mean ± standard deviation per food class) and sampling from those distributions to produce a calorie range rather than a single figure. This is the correct approach: reporting “rice portion estimated at 180–240 g, approximately 240–320 kcal” is more honest and more practically useful than reporting a spuriously exact “210 g, 280 kcal” that could be wrong by 30% in either direction.
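A minimal sketch of that sampling step is shown below. The density mean is the midpoint of the rice range quoted above, and the energy density follows from the earlier 200 g ≈ 260 kcal figure. The standard deviation, the lookup tables, and the function name are assumptions for illustration only.

```python
import random

# Hypothetical per-class density priors: (mean, standard deviation) in g/cm^3.
DENSITY_PRIORS = {"cooked_white_rice": (0.725, 0.06)}
# Energy density in kcal per gram (260 kcal / 200 g for cooked rice).
KCAL_PER_GRAM = {"cooked_white_rice": 1.3}


def calorie_range(food: str, volume_cm3: float, n_samples: int = 10_000, seed: int = 0):
    """Return an approximate 5th to 95th percentile calorie range for a portion."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    mean, sd = DENSITY_PRIORS[food]
    kcal = sorted(
        volume_cm3 * max(rng.gauss(mean, sd), 0.0) * KCAL_PER_GRAM[food]
        for _ in range(n_samples)
    )
    return kcal[int(0.05 * n_samples)], kcal[int(0.95 * n_samples) - 1]


low_kcal, high_kcal = calorie_range("cooked_white_rice", volume_cm3=290.0)
print(f"estimated {low_kcal:.0f} to {high_kcal:.0f} kcal")
```

A production system would also propagate the uncertainty in the volume estimate itself. This sketch varies only the density.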
The USDA SR-Legacy database, which underpins most nutrition apps, provides density-adjacent information through yield factors and moisture content data, but it does not provide explicit per-food density distributions. Apps that use SR-Legacy for density lookups are applying single-point density estimates rather than ranges, a simplification that makes the reported error bounds systematically overconfident.[3]
The Practical Error Budget for a Tracked Meal
Realistic assessment of total calorie estimation error requires adding up the independent error sources across the full estimation pipeline. For a typical mixed-dish restaurant meal photographed with a modern smartphone:
- Food classification accuracy: ±5% calorie error (modern CNNs classify common foods correctly more than 90% of the time; errors occur mostly at the boundary between similar food categories)
- Portion volume estimation (single-shot, with plate anchor): ±20% calorie error
- Density assumption: ±10% calorie error for foods with high within-class density variance
- Nutritional database variance: ±5% calorie error (representing real nutrient variation between preparation methods for the same food)
Combining these error sources using root-sum-of-squares (treating them as independent): total error ≈ √(5² + 20² + 10² + 5²) ≈ ±23%. For a 600 kcal meal, this represents an uncertainty range of approximately ±140 kcal — a range of 460–740 kcal.
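The same arithmetic can be reproduced in a few lines. The second call is an illustrative assumption: it swaps in the 8.3% LiDAR volume error quoted earlier while leaving the other three terms unchanged.

```python
import math


def combined_error_pct(*components_pct: float) -> float:
    """Root-sum-of-squares of independent percentage errors."""
    return math.sqrt(sum(c ** 2 for c in components_pct))


# classification, volume (single shot with plate anchor), density, database
single_shot_pct = combined_error_pct(5, 20, 10, 5)
# same budget with the LiDAR volume error substituted (assumption)
lidar_pct = combined_error_pct(5, 8.3, 10, 5)

meal_kcal = 600
uncertainty_kcal = meal_kcal * single_shot_pct / 100

print(f"single shot: {single_shot_pct:.1f}%  LiDAR: {lidar_pct:.1f}%")
print(f"600 kcal meal: about +/-{uncertainty_kcal:.0f} kcal")
```

The volume term dominates the single-shot budget, so halving any of the 5% terms would barely change the total.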
This is not a failure of AI; it is an honest statement of the physical constraints of 2-D photography applied to volumetric estimation. It is also meaningfully better than what manual logging produces for the same restaurant meal. Urban et al. 2011 (JAMA) documented that diners who manually estimated restaurant meal calories were off by an average of 175–200 kcal per meal, with no uncertainty acknowledgement.[4] A tool that acknowledges ±140 kcal of uncertainty is more useful than a manual estimate that presents a number as exact when it is at least as uncertain.
Over a full day of 4–5 logged meals, random errors partially cancel. Systematic errors — consistent underestimation of bowl depth, consistent use of a wrong density assumption for a frequently eaten food — accumulate. This is why surfacing transparent uncertainty intervals matters: a user who knows their bowl-of-rice estimate is ±20% can actively check that estimate and correct it by weighing the food on one occasion, calibrating the system against a known measurement.
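The difference between the two kinds of error can be made concrete. The sketch below assumes equal-sized meals and independent random errors, and the 10% bias is a hypothetical value chosen for illustration.

```python
import math


def daily_error_kcal(per_meal_kcal: float, n_meals: int, random_pct: float, bias_pct: float):
    """Return (random, systematic) daily error in kcal.

    Independent random errors grow with the square root of the meal count.
    A consistent bias grows linearly with it.
    """
    random_part = math.sqrt(n_meals) * per_meal_kcal * random_pct / 100
    systematic_part = n_meals * per_meal_kcal * bias_pct / 100
    return random_part, systematic_part


random_kcal, systematic_kcal = daily_error_kcal(
    per_meal_kcal=600, n_meals=5, random_pct=23, bias_pct=10
)
daily_total_kcal = 5 * 600

print(f"random: +/-{random_kcal:.0f} kcal ({100 * random_kcal / daily_total_kcal:.0f}% of the day)")
print(f"systematic: {systematic_kcal:.0f} kcal in one direction")
```

Under these assumptions, the ±23% per-meal random error shrinks to roughly ±10% of the daily total, while a 10% bias stays at 10% no matter how many meals are logged.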
The 2-D depth problem is genuine, bounded, and partially solvable. The honest answer to “how accurate is AI food photo logging?” is: better than a manual estimate, worse than a food scale, with an error range that is specific and communicable rather than unknown. For restaurant meals where a scale is unavailable, the fist-palm-thumb method provides a calibrated manual cross-check against AI photo estimates.
References
1. Xu C, Bhanu Murthy GR, Khanna N, Puri M, Rhee Y, Delp EJ. “Using a Mobile Phone for Accurate Dietary Assessment.” IEEE Engineering in Medicine and Biology Society Annual Conference (2013): 4564–4567.
2. Liang Y, Shao X, Liu A, et al. “An End-to-End Food-Detection Model and Nutrient Estimation Approach Using LiDAR Depth Fusion on iPhone.” Nutrients 14, no. 12 (2022): 2458.
3. U.S. Department of Agriculture, Agricultural Research Service. USDA National Nutrient Database for Standard Reference, Legacy Release (April 2018). https://www.ars.usda.gov/northeast-area/beltsville-md-bhnrc/beltsville-human-nutrition-research-center/methods-and-application-of-food-composition-laboratory/mafcl-site-pages/sr11-sr28/
4. Urban LE, McCrory MA, Dallal GE, et al. “Accuracy of Stated Energy Contents of Restaurant Foods.” JAMA 306, no. 3 (2011): 287–293.
5. Shao X, Liu A, Ngo T, Kim S. “A Generative Model of Food Portion Estimation.” International Journal of Computer Vision (2023).
Frequently asked questions
- Why can't AI food photo apps estimate portion size exactly from a single photo?
- Estimating volume from a 2D photograph is a fundamentally ill-posed problem: an infinite number of 3D scenes project to the same flat image. Without knowing the camera distance, plate diameter, food surface topology, and food density, volume cannot be uniquely determined, which is why even large models maintain a 25–35% mean absolute error on unseen test sets.
- How much does adding a reference object in the photo improve calorie accuracy?
- A controlled study of 50 foods found that credit-card-anchored estimation reduced mean absolute error from 34% without any reference to 18% with a reference object in frame. Plate-rim anchoring works similarly and is more practical since users always have a plate available.
- Does iPhone LiDAR significantly improve food photo calorie accuracy?
- Yes. ETH Zürich researchers fusing iPhone LiDAR depth with RGB image data achieved mean volume errors of 8.3% — the best published figure for realistic shooting conditions. That translates to total per-portion calorie errors of roughly 15–20%, meaningfully better than software-only approaches at 23–35%.
- Why does food density cause calorie estimation errors even after volume is correctly measured?
- Density varies substantially within a single food category. Cooked white rice ranges from 0.6 g/cm³ when freshly cooked to 0.85 g/cm³ when compressed, a 30% range that propagates directly as a 30% calorie error. Bread density ranges from 0.15 g/cm³ for light sandwich bread to 0.35 g/cm³ for dense sourdough — a more than twofold difference.
- How accurate is AI photo logging compared to manually estimating restaurant meal calories?
- Both carry significant uncertainty. AI photo logging for a typical restaurant meal produces roughly ±23% total error when error sources are combined using root-sum-of-squares. Manual estimation is at least as uncertain but presents figures as exact: diners in a JAMA study misjudged restaurant meal calories by 175–200 kcal on average, with no acknowledgement of uncertainty.