Evaluation and Decision Matrix

Losses

  • Primary: Dice loss plus binary cross-entropy (BCE), or focal loss under class imbalance.
  • Optional: boundary-aware loss for crown edges.
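
As a framework-free illustration of the primary loss, the sketch below combines soft Dice and BCE over flattened per-pixel probabilities. The function names, the equal 0.5/0.5 weighting, and the smoothing constant are assumptions chosen for illustration, not values from this plan. A real training run would use the equivalent tensor operations of its deep learning framework.

```python
import math

def soft_dice_loss(probs, targets, smooth=1.0):
    """1 - soft Dice coefficient over flattened probabilities and 0/1 targets."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denom + smooth)

def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)

def dice_bce_loss(probs, targets, dice_weight=0.5, bce_weight=0.5):
    """Weighted sum of soft Dice and BCE (the weights are illustrative assumptions)."""
    return (dice_weight * soft_dice_loss(probs, targets)
            + bce_weight * bce_loss(probs, targets))
```

The Dice term rewards overlap regardless of how few crown pixels a tile contains, while the BCE term keeps a per-pixel gradient signal; the relative weighting is something to tune on validation data.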

Metrics

  • mIoU
  • Dice/F1
  • AP50/AP75 for instances
  • Precision and recall on small crowns
  • Inference time per tile
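
The pixel-level metrics above can be sketched in plain Python as follows. This is a minimal illustration that assumes flattened label masks with crown = 1 and background = 0; the function names are hypothetical. AP50/AP75 and timing are not covered here, since they need instance matching and a benchmark harness.

```python
def class_iou(pred, truth, cls):
    """IoU for one class over flattened label masks; None if absent from both."""
    inter = sum(1 for p, t in zip(pred, truth) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, truth) if p == cls or t == cls)
    return inter / union if union else None

def mean_iou(pred, truth, classes=(0, 1)):
    """Mean of per-class IoU, skipping classes absent from both masks."""
    scores = [class_iou(pred, truth, c) for c in classes]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores)

def dice_precision_recall(pred, truth, cls=1):
    """Dice/F1, precision, and recall for one class (crown = 1 by assumption)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == cls and t == cls)
    fp = sum(1 for p, t in zip(pred, truth) if p == cls and t != cls)
    fn = sum(1 for p, t in zip(pred, truth) if p != cls and t == cls)
    # Both masks empty for this class: treated as a perfect match (assumption).
    dice = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return dice, precision, recall
```

For a single class, Dice and IoU are linked by Dice = 2 * IoU / (1 + IoU), so the two always rank models the same way on one class but differ in scale.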

Baseline success criteria

  • 8-15% Dice improvement over the zero-shot baseline on held-out regions.
  • AP50 improvement with no more than a 20% inference slowdown.
  • Stable performance across at least three geographic validation blocks.
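
One way to turn these criteria into an automated check is sketched below. Several choices are assumptions: the Dice gain is read as absolute points on a 0-1 scale (the criterion does not say whether it is absolute or relative), only the lower bound of the range is enforced, and "stable" is approximated by a hypothetical cap on the spread of per-block Dice.

```python
def relative_slowdown(baseline_seconds_per_tile, candidate_seconds_per_tile):
    """Fractional increase in per-tile inference time (0.20 means 20% slower)."""
    return candidate_seconds_per_tile / baseline_seconds_per_tile - 1.0

def meets_baseline_criteria(candidate_dice_by_block, baseline_dice, ap50_gain,
                            slowdown, min_dice_gain=0.08, max_slowdown=0.20,
                            min_blocks=3, max_block_spread=0.05):
    """Check the three baseline criteria; max_block_spread is a made-up tolerance."""
    values = list(candidate_dice_by_block.values())
    if len(values) < min_blocks:
        return False
    mean_gain = sum(values) / len(values) - baseline_dice
    spread = max(values) - min(values)
    return (mean_gain >= min_dice_gain
            and ap50_gain > 0
            and slowdown <= max_slowdown
            and spread <= max_block_spread)
```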

Compute planning

  • Start with 1-2 NVIDIA GPUs (24 GB+ VRAM preferred for larger tiles).
  • Use mixed precision (bf16 or fp16) and gradient accumulation.
  • Start with tile sizes of 512-1024 px.
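
The arithmetic behind these choices can be sketched as follows. The helper names and the idea of a "target batch" are assumptions for illustration: the first function gives the number of micro-batches to accumulate per optimizer step, the second estimates how many tiles a raster produces for a given tile size and overlap.

```python
import math

def accumulation_steps(target_batch, per_gpu_batch, num_gpus):
    """Micro-batches to accumulate per optimizer step to reach at least target_batch."""
    return math.ceil(target_batch / (per_gpu_batch * num_gpus))

def tile_count(width, height, tile, overlap=0):
    """Tiles needed to cover a raster using a fixed stride of tile - overlap pixels."""
    stride = tile - overlap
    cols = max(1, math.ceil((width - overlap) / stride))
    rows = max(1, math.ceil((height - overlap) / stride))
    return cols * rows
```

For example, if memory allows only 4 tiles per GPU on 2 GPUs, reaching an effective batch of 32 takes 4 accumulation steps. Halving the tile size roughly quadruples the tile count, which feeds directly into the inference-time metric.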

Relative cost (lowest to highest):

  • Decoder-only < LoRA < RGB+height adapter < partial/full encoder fine-tune.

Risk register

  • Domain shift across regions and seasons:
    • Mitigation: spatially separated splits and diverse sampling.
  • Label noise from manual polygons:
    • Mitigation: QA pass, confidence-weighted losses, uncertain-mask exclusion.
  • Overfitting on limited data:
    • Mitigation: freeze-first strategy, LoRA-first trials, strict validation protocol.
  • RGB/height misalignment:
    • Mitigation: co-registration checks before multimodal training.
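
The spatially separated splits mentioned under domain shift can be sketched as a grid-block assignment. This is a minimal illustration, assuming each tile carries projected x/y coordinates in the same units as a hypothetical block_size; the function names and tuple layout are not from this plan.

```python
def block_id(x, y, block_size):
    """Map a tile's projected coordinates to the grid block that contains it."""
    return (int(x // block_size), int(y // block_size))

def spatial_split(tiles, block_size, val_blocks):
    """Hold out every tile that falls inside a validation block.

    tiles: iterable of (tile_id, x, y); val_blocks: set of block ids to hold out.
    """
    train, val = [], []
    for tile_id, x, y in tiles:
        target = val if block_id(x, y, block_size) in val_blocks else train
        target.append(tile_id)
    return train, val
```

Splitting by block rather than by random tile keeps neighbouring, highly correlated tiles on the same side of the split, so validation scores are less likely to overstate performance on new regions.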

Production decision matrix

Weighted selection score:

  • 50% segmentation quality (Dice + AP50/AP75)
  • 25% robustness (variance across blocks and seasons)
  • 15% runtime (tiles/sec and memory)
  • 10% operational simplicity (integration and maintenance cost)
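
The weighted score can be computed as in the sketch below. It assumes each criterion has already been normalized to a 0-1 score where higher is better (for example, robustness as 1 minus normalized variance); the normalization itself, the key names, and the function name are assumptions.

```python
# Weights taken from the list above; the keys are illustrative names.
WEIGHTS = {"quality": 0.50, "robustness": 0.25, "runtime": 0.15, "simplicity": 0.10}

def selection_score(scores, weights=WEIGHTS):
    """Weighted sum of criterion scores, each pre-normalized to [0, 1]."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing criterion scores: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)

# Hypothetical candidates, for illustration only.
candidates = {
    "lora": {"quality": 0.8, "robustness": 0.7, "runtime": 0.9, "simplicity": 0.8},
    "full_finetune": {"quality": 0.9, "robustness": 0.6, "runtime": 0.5, "simplicity": 0.4},
}
ranked = sorted(candidates, key=lambda name: selection_score(candidates[name]), reverse=True)
```

With these made-up inputs the cheaper variant ranks first despite lower segmentation quality, which shows how the runtime and simplicity weights can outweigh a modest quality gap.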

Promotion gate:

  • Beat baseline on each geographic validation block.
  • Stay within runtime budget for expected tiling volume.
  • Stay stable in shadow-heavy and sparse-tree scenes.
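
A gate like this could be encoded as a simple pass/fail check. In the sketch below, the per-block metric, the throughput figure, and the hard_scenes_ok flag (the outcome of a separate review of shadow-heavy and sparse-tree scenes) are all assumed inputs; the plan does not define how stability on those scenes is measured.

```python
def beats_baseline_everywhere(candidate_by_block, baseline_by_block):
    """True only if the candidate scores higher than the baseline on every block."""
    return all(candidate_by_block[b] > baseline_by_block[b] for b in baseline_by_block)

def within_runtime_budget(num_tiles, tiles_per_second, budget_hours):
    """Check that the expected tiling volume finishes inside the time budget."""
    return num_tiles / tiles_per_second <= budget_hours * 3600

def passes_promotion_gate(candidate_by_block, baseline_by_block, num_tiles,
                          tiles_per_second, budget_hours, hard_scenes_ok):
    """All three gate conditions must hold; none can compensate for another."""
    return (beats_baseline_everywhere(candidate_by_block, baseline_by_block)
            and within_runtime_budget(num_tiles, tiles_per_second, budget_hours)
            and hard_scenes_ok)
```

Unlike the weighted score, the gate is conjunctive: a single losing block or a blown runtime budget blocks promotion regardless of the overall score.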

See also: