Deep Learning · 3D Point Cloud Segmentation
Five deep learning architectures — from PointNet to Stratified Transformer — evaluated on a real-world robotic grasping task. Which one actually works when the robot needs to pick up a jar of jam?
Author
Saravut Lin
Institution
University of Edinburgh
Degree
MSc Artificial Intelligence
Published
MDPI AI, Vol. 7, 2026
Five Architectures
Each model represents a distinct idea about how to understand 3D geometry. Each is explained in full below, including what to say about it in a talk or interview.
Dataset: MiniMarket77
Target: Hartley's Strawberry Jam 300g
Scenes: 12,000 × 20,480 pts (XYZ+RGB)
Hardware: Single GPU, PyTorch
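Every scene in this setup is a fixed-size cloud of 20,480 XYZ+RGB points, so a raw capture of arbitrary size has to be brought to that size first. A minimal sketch of that step using uniform random sampling, with a tuple-per-point layout; the helper name `downsample_scene` is illustrative, not the dissertation's actual pipeline:

```python
import random

def downsample_scene(points, n_points=20480, seed=0):
    """Bring a point cloud scene to a fixed size.

    `points` is a list of 6-tuples (x, y, z, r, g, b), the XYZ+RGB
    layout described above. Large scenes are randomly subsampled;
    scenes with fewer than n_points are padded by sampling with
    replacement so the output size is always n_points.
    """
    rng = random.Random(seed)
    if len(points) >= n_points:
        return rng.sample(points, n_points)
    return points + rng.choices(points, k=n_points - len(points))

# toy scene: 100 points with dummy XYZ+RGB values
scene = [(i * 0.01, 0.0, 0.0, 255, 0, 0) for i in range(100)]
fixed = downsample_scene(scene, n_points=64)
print(len(fixed))  # 64
```

In practice this is done with farthest-point or voxel-grid sampling for better spatial coverage; uniform sampling is just the simplest baseline.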
Real-World Results
Benchmark scores don't tell the whole story. When deployed on ten real-world point cloud scenes captured by eight Intel RealSense D415 cameras, the ranking changes dramatically. PointWeb dominates — fastest and cleanest. The Stratified Transformer, despite its benchmark prestige, fails in practice.
Crisp boundaries, compact mask, minimal false positives
Finds the right region but leaks onto adjacent bottles
Fragmented mask, scattered false positives, impractical latency
Mean Inference Time (seconds, lower is better)
Multi-Dimensional Comparison (higher is better)
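Mean inference time of the kind charted above is straightforward to measure with a wall-clock timing harness. A minimal sketch, assuming any callable can stand in for a model; the name `mean_inference_time` and the warmup/repeat scheme are illustrative, not the dissertation's exact protocol:

```python
import statistics
import time

def mean_inference_time(model, inputs, warmup=3, repeats=10):
    """Mean wall-clock inference latency in seconds.

    `model` is any callable taking one input. A few warmup calls are
    discarded first so one-off setup cost (JIT, cache fills, GPU
    kernel compilation) does not skew the mean.
    """
    for x in inputs[:warmup]:
        model(x)
    times = []
    for _ in range(repeats):
        for x in inputs:
            t0 = time.perf_counter()
            model(x)
            times.append(time.perf_counter() - t0)
    return statistics.mean(times)

# stand-in "model": summing a scene-sized list of numbers
fake_scene = list(range(20480))
t = mean_inference_time(sum, [fake_scene])
print(t >= 0.0)  # True
```

For GPU models the measured call must also synchronize the device (e.g. `torch.cuda.synchronize()`), otherwise only the asynchronous kernel launch is timed.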
Model Compression
Three compression techniques were applied to the best-performing model. All three preserve near-perfect accuracy while reducing latency. Knowledge distillation is the recommended default — hardware-agnostic and mask-faithful.
Latency Comparison (seconds)
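Knowledge distillation, the recommended default above, trains a small student to match a large teacher's temperature-softened output distribution. A minimal sketch of the distillation loss in plain Python (logit values are toy numbers; real training would use PyTorch tensors and combine this term with the hard-label cross-entropy):

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    as in Hinton et al.'s knowledge distillation formulation."""
    p = softmax(teacher_logits, temperature)   # soft teacher targets
    q = softmax(student_logits, temperature)   # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# identical logits -> zero loss; diverging logits -> positive loss
print(round(distillation_loss([2.0, 0.5], [2.0, 0.5]), 6))  # 0.0
print(distillation_loss([0.1, 0.9], [2.0, 0.5]) > 0)        # True
```

The T² factor keeps the gradient magnitude of the soft term comparable to the hard-label term as the temperature changes, which is why distillation stays hardware-agnostic: it only touches the training objective, not the deployed network.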
Published Research
The dissertation's dataset and methodology were extended and published as a peer-reviewed article in AI, an open-access journal by MDPI. The paper introduces the MiniMarket80 dataset — an expanded version with 80 grocery objects — and benchmarks 11 state-of-the-art point cloud segmentation methods, establishing a new standard for texture-rich, real-world evaluation.
doi.org/10.3390/ai7030096
Sorour, M.; Rattray, E.; Syahrulfath, A.; Jaramillo, J.; Lin, S.; Webb, B.
The MiniMarket80 Dataset for Evaluation of Unique Item Segmentation in Point Clouds
AI 2026, 7(3), 96
Received: 24 Jan 2026
Accepted: 25 Feb 2026
Published: 6 Mar 2026
Section: AI in Autonomous Systems
Academic Editor: Miguel Angel Cazorla
Abstract
"The effectiveness of deep learning methods in image segmentation has led to interest in their deployment for 3D point cloud segmentation, particularly in the context of pre-grasp identification of a unique object amongst distractors. However, existing 3D object datasets are not ideal for training and evaluation of these methods. [...] We introduce the MiniMarket80 dataset to address this gap. The dataset consists of 1200 colored point cloud partial views, each of 80 standard grocery objects, collected with widely used Realsense RGB-D cameras (D415 and D435) under variable lighting conditions. [...] We use this dataset to evaluate 11 state-of-the-art point cloud segmentation methods. Only four of these are able to (partially) segment the target object in a real-world test, still producing significant false positives and false negatives."
Three Contributions of MiniMarket80
Adds to the limited pool of real-world point cloud datasets, which have proven crucial for successful deployment, as opposed to synthetic CAD models.
Serves as a texture-rich benchmark for testing point cloud segmentation architectures that are otherwise mostly tested on shapes.
Collected using popular RealSense RGB-D sensors. All 80 objects are standard supermarket items, making the dataset globally reproducible.
Figures from the Published Paper

Figure 1. The MiniMarket80 dataset — 80 standard grocery objects, each identified by its EAN barcode. Objects span beverages, condiments, cereals, personal care, and tinned goods, ensuring diversity in size, shape, texture, and surface reflectance.

Figure 2. The data collection rig: 8 Intel RealSense RGB-D cameras (2× D415, 6× D435) arranged around a rotating table with controllable LED lighting. This setup captures 1,200 partial views per object at varying azimuth, elevation, and lighting conditions.

Figure 5. Target vs. distractor. The task is binary: label every point as either the target object (EAN: 5410126116953, Biscoff Smooth) or background. Distractors include visually similar cylindrical containers — a deliberately hard case.

Figure 3. The same object at four point cloud resolutions: 1,024 · 2,048 · 4,096 · 8,192 points per sample. The dissertation uses 20,480 points per scene (10× the 2,048-point per-object resolution) to represent cluttered multi-object arrangements.

Figure 4. Segmentation sample pairs at 20,480 points (left) and 81,920 points (right). Red points belong to the target object; blue points are background. The binary mask is the ground truth used to train and evaluate all five models.
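Binary target/background masks like these can be scored per point, which is where the IoU, false-positive, and false-negative figures quoted throughout come from. A minimal sketch of that bookkeeping, assuming masks as 0/1 sequences; the helper name `mask_metrics` is illustrative:

```python
def mask_metrics(pred, truth):
    """Per-point comparison of predicted vs. ground-truth binary masks.

    `pred` and `truth` are equal-length sequences of 0 (background)
    or 1 (target). Returns true positives, false positives, false
    negatives, and target-class intersection-over-union.
    """
    tp = sum(p and t for p, t in zip(pred, truth))
    fp = sum(p and not t for p, t in zip(pred, truth))
    fn = sum((not p) and t for p, t in zip(pred, truth))
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
    return {"tp": tp, "fp": fp, "fn": fn, "iou": iou}

truth = [1, 1, 1, 0, 0, 0]
pred  = [1, 1, 0, 1, 0, 0]   # one missed target point, one false positive
m = mask_metrics(pred, truth)
print(m["iou"])  # 0.5  (2 / (2 + 1 + 1))
```

A "leaky" mask (like PointNet's above) shows up as high FP with decent TP; a fragmented one (like the Stratified Transformer's) as high FN.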

Figure 6. Inference results of all 11 evaluated models on real-world scene 7 (200,715 points). Red = target object; blue = background. Only four models partially succeed: PointNet, DGCNN, PointWeb, and FastPointTransformer. The remaining seven produce entirely incorrect or empty predictions.

Figure 7. Inference results of the second experiment across 10 real-world scenes (avg. 225,852 points/scene) for the four partially successful models: PointNet, DGCNN, PointWeb, and FastPointTransformer. Even the best performers produce significant false positives and false negatives, highlighting the difficulty of real-world unique item segmentation.
Key Finding
Of 11 state-of-the-art models evaluated, only 4 can partially segment a target object in real-world scenes — and all still produce significant false positives and false negatives. This underscores the gap between benchmark performance and real-world deployment, and the importance of texture-rich, real-scan datasets like MiniMarket80.