Vedant Misra — Software Engineer

I'm excited to share my recent scholarly paper exploring the adaptation of the Segment Anything Model (SAM) family for specialized domains. As foundation models increasingly dominate computer vision, their transition from natural imagery to expert domains reveals structural challenges in their core interaction design.


Abstract

The Segment Anything Model (SAM) established promptable segmentation as a general-purpose foundation-model interface, but the literature from 2023 to 2026 demonstrates that this interface degrades in specialized domains where prompts are expensive, dense, temporally ambiguous, or operationally infeasible. This review argues that the central challenge in downstream SAM adaptation is not merely the distribution shift between natural and domain imagery—it is the structural mismatch between SAM's prompt-centric design and the operational realities of expert deployment.

To substantiate this claim, we first synthesize the architectural evolution of the SAM family from SAM 1 through SAM 3.1. We then conduct a focused, domain-comparative analysis across three settings in which this mismatch is especially visible: remote sensing and change detection, microscopy and histopathology, and video analysis.

Across all three domains, the strongest adaptations succeed not simply by fine-tuning the backbone but by redesigning the interaction regime—injecting domain structure, replacing manual prompts with learned alternatives, or restructuring inference for scale.

1. Introduction

The Segment Anything Model (SAM) changed the practical language of image segmentation by formalizing mask prediction as a promptable foundation-model task. Trained on the billion-mask SA-1B dataset, SAM demonstrated that a modular image encoder, prompt encoder, and lightweight mask decoder could support zero-shot interactive segmentation across highly diverse natural imagery.

In natural-image settings, this interface is powerful because a small number of point or box prompts can often disambiguate an object of interest quickly and accurately. However, the literature that followed SAM's release makes clear that this interaction model does not transfer cleanly into specialized settings.

In remote sensing, the object of interest may be defined by semantic change across time rather than by a single static object boundary. In microscopy and histopathology, hundreds or thousands of tightly packed instances may appear in a single field of view, rendering per-object prompting impractical. In video, prompts must remain useful over long temporal horizons under corruption, motion, occlusion, and multi-object scale.

2. Architecture of the SAM Family

Understanding the architectural progression of the SAM family is essential for contextualizing the downstream adaptations discussed in subsequent sections. Each release widens the family's scope: from prompted segmentation of static images, to temporal propagation in video, to concept-level prompts, and finally to parallelized multi-object inference.

SAM 1: Promptable Segmentation as a Foundation Task

SAM 1 introduced a three-part modular architecture: a heavyweight image encoder, a flexible prompt encoder, and a lightweight mask decoder. Trained on SA-1B, it showed strong zero-shot segmentation across highly diverse natural imagery.

SAM 2: Streaming Memory and Temporal Propagation

SAM 2 extended the promptable interface to video object segmentation through a streaming-memory transformer. A memory bank stores both spatial feature maps and high-level object representations from earlier frames, and each new frame's features attend to this bank so that segmentations propagate through time.
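The streaming-memory idea can be illustrated with a toy sketch. This is not SAM 2's implementation: the MemoryBank class, the attend_to_memory function, the token shapes, and the absence of learned projections are all simplifying assumptions made for illustration.

```python
from collections import deque

import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class MemoryBank:
    """Hypothetical FIFO store of per-frame memory tokens."""

    def __init__(self, max_frames: int):
        # deque with maxlen drops the oldest frame automatically
        self.frames = deque(maxlen=max_frames)

    def add(self, tokens: np.ndarray) -> None:
        self.frames.append(tokens)

    def tokens(self) -> np.ndarray:
        # Stack all stored frames into one (num_tokens, dim) array
        return np.concatenate(list(self.frames), axis=0)


def attend_to_memory(frame_tokens: np.ndarray, memory_tokens: np.ndarray) -> np.ndarray:
    """Single-head attention: queries from the current frame, keys/values from memory.

    Assumption: no learned query/key/value projections, only a residual connection.
    """
    dim = frame_tokens.shape[-1]
    scores = frame_tokens @ memory_tokens.T / np.sqrt(dim)
    weights = softmax(scores, axis=-1)
    return frame_tokens + weights @ memory_tokens


rng = np.random.default_rng(0)
bank = MemoryBank(max_frames=3)
for t in range(5):
    frame = rng.normal(size=(4, 8))  # 4 tokens of dimension 8 per frame (arbitrary)
    if bank.frames:
        frame = attend_to_memory(frame, bank.tokens())
    bank.add(frame)
```

After five frames the bank holds only the three most recent frames, which is the property that keeps memory use constant as the video gets longer.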

SAM 3: Promptable Concept Segmentation

SAM 3 reframed the objective from purely spatial demarcation to semantic comprehension via Promptable Concept Segmentation (PCS). Rather than requiring explicit coordinates, SAM 3 accepts short noun phrases (e.g., "yellow school bus") and visual exemplars to segment all instances of a concept globally.

SAM 3.1: Object Multiplexing and Inference Scalability

Released in March 2026, SAM 3.1 addressed the linear scaling bottleneck of SAM 3's multi-object video pipeline. SAM 3.1 introduced Object Multiplexing: a hardware-aware shared-memory batching strategy in which a Multiplexer aggregates spatial and memory tracking data for up to 16 distinct objects from frame $T-1$ into fixed-capacity buckets, processes them in a single joint forward pass, and separates the output for frame $T$.
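A rough way to see why bucketing helps is to count forward passes per frame. The sketch below assumes a baseline that runs one pass per tracked object (the linear scaling described above) and assumes that objects beyond the 16-object capacity overflow into additional buckets; both are illustrative assumptions, and the function names are hypothetical.

```python
import math
from typing import List

BUCKET_CAPACITY = 16  # from the text: up to 16 objects per joint forward pass


def bucketize(object_ids: List[int], capacity: int = BUCKET_CAPACITY) -> List[List[int]]:
    """Group object ids into consecutive fixed-capacity buckets."""
    return [object_ids[i:i + capacity] for i in range(0, len(object_ids), capacity)]


def passes_per_frame(num_objects: int, multiplexed: bool,
                     capacity: int = BUCKET_CAPACITY) -> int:
    """Forward passes needed for one frame.

    Assumption: the non-multiplexed baseline needs one pass per object,
    while the multiplexed variant needs one pass per bucket.
    """
    if num_objects == 0:
        return 0
    return math.ceil(num_objects / capacity) if multiplexed else num_objects


buckets = bucketize(list(range(40)))
bucket_sizes = [len(b) for b in buckets]
baseline_passes = passes_per_frame(40, multiplexed=False)
multiplexed_passes = passes_per_frame(40, multiplexed=True)
```

Under these assumptions, the pass count grows with the number of buckets rather than with the number of objects.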

3. Remote Sensing and Change Detection

Remote sensing is a revealing testbed because it compounds several distinct problems. Natural-image ViTs are trained on ground-level RGB photographs; overhead imagery differs in viewpoint, scale range, spectral statistics, and object definition.

PEFT and Automatic Prompt Generation

One response to this mismatch is to preserve the SAM backbone while injecting remote sensing priors through lightweight modules. RSAM-Seg is a representative example. Rather than relying on default prompting, it introduces two encoder-side modules, Adapter-Scale and Adapter-Feature, which supplement the network with high-frequency components and geometric embeddings.

Because geospatial targets such as road networks are characterized by edge gradients rather than uniform color distributions, RSAM-Seg extracts high-frequency components via the Fast Fourier Transform and uses them to generate prompts automatically.
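To make the high-frequency idea concrete, here is a minimal NumPy sketch of an FFT high-pass filter. It is not RSAM-Seg's code: the function name, the square low-frequency mask, and the cutoff value are assumptions chosen for illustration.

```python
import numpy as np


def high_frequency_component(image: np.ndarray, cutoff: int) -> np.ndarray:
    """Remove a centered square of low frequencies and return the spatial residual.

    image:  2D grayscale array.
    cutoff: half-width (in frequency bins) of the low-frequency square to zero out.
    """
    # Shift so the zero-frequency (DC) bin sits at the center of the spectrum
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    height, width = image.shape
    cy, cx = height // 2, width // 2

    keep = np.ones((height, width), dtype=bool)
    keep[cy - cutoff:cy + cutoff + 1, cx - cutoff:cx + cutoff + 1] = False

    filtered = spectrum * keep
    return np.real(np.fft.ifft2(np.fft.ifftshift(filtered)))


# Toy input: a vertical step edge at column 32
step = np.zeros((64, 64))
step[:, 32:] = 1.0
hf = high_frequency_component(step, cutoff=4)

edge_response = np.abs(hf[:, 32]).mean()  # next to the edge
flat_response = np.abs(hf[:, 48]).mean()  # middle of a flat region
```

On this toy input the response is larger next to the edge than inside the flat region, which is the kind of signal that could seed edge-aware prompts.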

Training-Free Bitemporal Latent Matching

AnyChange offers the most conceptually distinctive response in the change detection literature by demonstrating that SAM's frozen latent space already contains change-sensitive structure. Instead of retraining for change detection, AnyChange exploits SAM's object-centric capabilities by computing similarities over instance-level mask embeddings rather than raw pixel embeddings.
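One plausible reading of matching over mask embeddings is sketched below: average the frozen features inside a mask at each date, then score change by negative cosine similarity. The function names, feature shapes, and the exact scoring rule are assumptions for illustration, not details confirmed by the text above.

```python
import numpy as np


def mask_embedding(features: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average (H, W, C) features over the True pixels of an (H, W) boolean mask."""
    return features[mask].mean(axis=0)


def change_score(feat_t1: np.ndarray, feat_t2: np.ndarray, mask: np.ndarray) -> float:
    """Negative cosine similarity of the same region at two dates (higher = more change)."""
    a = mask_embedding(feat_t1, mask)
    b = mask_embedding(feat_t2, mask)
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return -cosine


rng = np.random.default_rng(0)
feat_t1 = rng.normal(size=(8, 8, 4))  # stand-in for frozen encoder features at date 1
feat_t2 = feat_t1.copy()
feat_t2[:4, :4] *= -1.0               # simulate a change in the top-left block

changed_mask = np.zeros((8, 8), dtype=bool)
changed_mask[:4, :4] = True
unchanged_mask = np.zeros((8, 8), dtype=bool)
unchanged_mask[4:, 4:] = True

changed = change_score(feat_t1, feat_t2, changed_mask)
unchanged = change_score(feat_t1, feat_t2, unchanged_mask)
```

Averaging within a mask compares regions as objects, so the score is less sensitive to per-pixel noise than a raw pixel-embedding difference would be.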

4. Microscopy, Biological Imaging, and Histopathology

Microscopy and histopathology make the prompt burden problem unusually concrete. In cell biology or pathology, a single field of view may contain hundreds or thousands of tightly packed instances with heavy overlap, weak boundaries, and staining variation.

Microscopy-Specific Fine-Tuning

A useful starting point is $\mu$SAM (Segment Anything for Microscopy), which extends SAM through microscopy-specific fine-tuning and delivers both interactive and automatic workflows for light and electron microscopy. An important conclusion of that work is that a single universal microscopy model did not emerge naturally from the original SAM architecture; instead, separate generalist models were needed.

Automated Prompt Generation for Dense Instances

A second strategy is not to remove prompts but to move prompt generation from the human to an upstream detector. CellSAM combines SAM with CellFinder, a transformer-based detector built on the Anchor DETR framework, which natively resolves dense clusters and partial occlusions without non-maximum suppression heuristics.
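The hand-off from detector to segmenter can be sketched as a format conversion. The example assumes the detector emits DETR-style normalized (cx, cy, w, h) boxes with confidence scores and that the segmenter expects pixel-space (x0, y0, x1, y1) box prompts. The function names and the score threshold are hypothetical and are not taken from CellSAM.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]


def cxcywh_to_xyxy(box: Box, width: int, height: int) -> Box:
    """Convert a normalized center-format box to a pixel-space corner-format box."""
    cx, cy, w, h = box
    x0 = max(0.0, (cx - w / 2) * width)
    y0 = max(0.0, (cy - h / 2) * height)
    x1 = min(float(width), (cx + w / 2) * width)
    y1 = min(float(height), (cy + h / 2) * height)
    return (x0, y0, x1, y1)


def detections_to_box_prompts(detections: List[Tuple[Box, float]],
                              width: int, height: int,
                              min_score: float = 0.5) -> List[Box]:
    """Keep confident detections and turn each one into a box prompt."""
    return [cxcywh_to_xyxy(box, width, height)
            for box, score in detections
            if score >= min_score]


# Hypothetical detector output: (normalized box, confidence score)
detections = [
    ((0.5, 0.5, 0.2, 0.2), 0.9),
    ((0.1, 0.1, 0.4, 0.4), 0.8),  # partly outside the image, so it gets clamped
    ((0.7, 0.3, 0.1, 0.1), 0.2),  # below the threshold, so it is dropped
]
prompts = detections_to_box_prompts(detections, width=100, height=100)
```

Each surviving box would then be passed to the promptable segmenter in place of a human click, so the number of manual prompts per field of view drops to zero.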

Prompt-Free and Parameter-Efficient Adaptation

Recent work pushes further by removing the prompt interface altogether. Maqsood et al. (2026) introduce a prompt-free, lightweight SAM adaptation for H&E-stained nuclei that discards the prompt encoder and replaces it with multi-level encoder features coupled with a residual decoding module, with a trainable footprint of only 4.1 million parameters.

5. Video Analysis, Scaling, and Evaluation

Streaming Memory and Prompt Scheduling

SAM 2 made video prompting a realistic foundation-model capability by combining spatial prompting with streaming memory. In practice, continuous frame-wise prompting often degrades the streaming memory representation due to noise accumulation. Frame-sparse prompt scheduling, which provides corrective spatial prompts only at optimized, infrequent intervals, frequently outperforms frame-wise prompting.
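The difference in prompt budget between the two regimes is easy to quantify. The sketch below uses a fixed uniform interval as a stand-in for the optimized intervals mentioned above. That choice, and the function names, are assumptions.

```python
from typing import List


def framewise_schedule(num_frames: int) -> List[int]:
    """Prompt on every frame."""
    return list(range(num_frames))


def sparse_schedule(num_frames: int, interval: int) -> List[int]:
    """Prompt on frame 0 and then every `interval` frames (uniform spacing assumed)."""
    if interval < 1:
        raise ValueError("interval must be at least 1")
    return list(range(0, num_frames, interval))


def prompt_savings(num_frames: int, interval: int) -> float:
    """Fraction of frame-wise prompts avoided by the sparse schedule."""
    sparse = len(sparse_schedule(num_frames, interval))
    return 1.0 - sparse / num_frames


prompted_frames = sparse_schedule(100, 25)
savings = prompt_savings(100, 25)
```

A real schedule would likely place prompts where the tracker is least confident rather than at fixed spacing. The uniform version only shows the budget arithmetic.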

Long-Horizon Robustness

SAM2Long directly targets the long-horizon failure mode. Rather than greedily trusting a single propagated mask, it treats memory selection as a training-free tree search: multiple segmentation pathways are maintained and video-level optimal paths are chosen retrospectively.
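The contrast between greedy propagation and maintaining several pathways can be shown with a small beam-style search. This is a sketch under stated assumptions, not SAM2Long's algorithm: the cumulative log-score criterion, the search function, and the toy_scores table are all hypothetical.

```python
import math
from typing import Callable, List, Tuple

Path = Tuple[float, List[int]]  # (cumulative log-score, candidate index chosen per frame)


def search(score_fn: Callable[[List[int], int], List[float]],
           num_frames: int, num_paths: int) -> Path:
    """Keep the `num_paths` best pathways per frame and return the best one at the end.

    score_fn(choices, t) returns a positive score for each candidate mask at frame t,
    given the choices made so far (so earlier choices can affect later scores).
    """
    paths: List[Path] = [(0.0, [])]
    for t in range(num_frames):
        expanded: List[Path] = []
        for total, choices in paths:
            for index, score in enumerate(score_fn(choices, t)):
                expanded.append((total + math.log(score), choices + [index]))
        expanded.sort(key=lambda path: path[0], reverse=True)
        paths = expanded[:num_paths]
    return paths[0]


def toy_scores(choices: List[int], t: int) -> List[float]:
    """Toy table: the locally best first choice leads to worse options later."""
    if t == 0:
        return [0.6, 0.5]
    return [0.3, 0.2] if choices[-1] == 0 else [0.9, 0.1]


greedy = search(toy_scores, num_frames=2, num_paths=1)
multi_path = search(toy_scores, num_frames=2, num_paths=2)
```

With a single pathway the search commits to the locally best first mask and ends with a lower total score. Keeping two pathways recovers the better sequence retrospectively.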

SAM 3 Concept Prompting and SAM 3.1 Object Multiplexing

SAM 3 extends the interaction space by introducing concept prompts that can unify segmentation, detection, and tracking more flexibly than purely spatial prompts. SAM 3.1 then made multi-object inference scaling a first-order design concern through Object Multiplexing, increasing throughput from 16 to 32 frames per second for medium object counts, while also yielding accuracy gains due to the joint inter-object context modeled during the shared pass.

6. Cross-Domain Synthesis and Open Problems

Across remote sensing, biological imaging, and video, one pattern is unmistakable: the most successful adaptations preserve useful SAM-family priors while redesigning the interaction interface around the domain.

In remote sensing, this means incorporating multi-scale and bitemporal structure. In microscopy, it means admitting that per-object prompts are often infeasible and replacing them with detector-generated or prompt-free alternatives. In video, it means recognizing that prompt scheduling and multi-object scale are part of what determines whether a model remains usable in production.

Prompt Cost Must Become a Measured Quantity

The literature routinely describes prompt burden as a major limitation, but few papers evaluate it in a comparable, quantitative way. Future benchmarks should report prompt frequency, prompt type, correction budget, and annotation-time proxies explicitly.
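As one possible shape for such reporting, the sketch below records the quantities listed above in a small data structure. The PromptCostReport class, its field names, and the per-prompt timings are hypothetical placeholders, not measured values or an established benchmark format.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PromptCostReport:
    """Hypothetical per-sequence record of prompt burden."""
    num_frames: int
    prompts_by_type: Dict[str, int] = field(default_factory=dict)
    corrections: int = 0  # how many of the prompts were corrective

    def total_prompts(self) -> int:
        return sum(self.prompts_by_type.values())

    def prompt_frequency(self) -> float:
        """Average number of prompts per frame."""
        return self.total_prompts() / self.num_frames

    def annotation_time_proxy(self, seconds_per_prompt: Dict[str, float]) -> float:
        """Estimated annotation time, given an assumed cost per prompt type."""
        return sum(count * seconds_per_prompt[kind]
                   for kind, count in self.prompts_by_type.items())


report = PromptCostReport(
    num_frames=200,
    prompts_by_type={"point": 30, "box": 10},
    corrections=8,
)
frequency = report.prompt_frequency()
# Placeholder timings, chosen only to make the arithmetic visible
time_proxy = report.annotation_time_proxy({"point": 1.0, "box": 3.0})
```

Reporting these numbers alongside accuracy would let two adaptations be compared at a matched prompt budget.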

Efficiency Is Not Separate from Scientific Quality

Efficiency and scaling should not be treated as secondary engineering concerns. PEFT in remote sensing, auto-prompting in dense microscopy, and Object Multiplexing in video all change what kinds of deployment are possible. Future work should treat throughput, memory, and trainable parameter footprint as part of the adaptation argument rather than as footnotes.

Toward Domain-Sensitive Foundation Interfaces

The broader implication is that "segment anything" is no longer a stable or sufficient downstream objective. Specialized-domain adaptation increasingly points toward a family of domain-sensitive segmentation interfaces: bitemporal segmentation for Earth observation, dense-instance or prompt-free segmentation for biology, and memory-managed tracking for video. The next generation of foundation models for vision will likely succeed by treating the interaction interface, not just the visual representation, as a domain-specific learning problem.

Adapting the Segment Anything Model Family for Specialized Domains