Meta’s “Segment Anything” is the GPT-3 moment for computer vision


With Segment Anything, Meta releases a powerful AI model for image segmentation that can serve as a central building block for future AI applications.

Meta’s Segment Anything Model (SAM) has been trained on nearly 11 million images from around the world and a billion semi-automated segmentations. The goal was to develop a “foundation model” for image segmentation, and Meta says it has succeeded. Such foundation models are trained on large amounts of data, achieving generalized capabilities that allow them to be used in many specialized use cases with little or no training. The success of large pre-trained language models such as GPT-3 sparked the trend toward such models.


Once trained, SAM can segment previously unseen objects in any image and can be controlled by various inputs: it can automatically scan the entire image, or users can mark areas to be segmented or click on specific objects. SAM should also be able to handle text prompts, since Meta integrates a CLIP model into its architecture in addition to the Vision Transformer that initially processes the image.
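For readers who want to try these input modes, here is a minimal sketch, assuming Meta's open-source `segment_anything` Python package is installed and a model checkpoint has been downloaded. The helper names `build_click_prompt`, `build_box_prompt`, and `segment` are hypothetical and exist only for this illustration; text prompts are not covered here.

```python
import numpy as np


def build_click_prompt(clicks):
    """Turn (x, y, is_foreground) clicks into the coordinate and label arrays
    that a point prompt needs (label 1 = foreground, 0 = background)."""
    coords = np.array([[x, y] for x, y, _ in clicks], dtype=np.float32)
    labels = np.array([1 if fg else 0 for _, _, fg in clicks], dtype=np.int32)
    return coords, labels


def build_box_prompt(x0, y0, x1, y1):
    """Return a box in XYXY pixel order, normalising swapped corners."""
    return np.array(
        [min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1)], dtype=np.float32
    )


def segment(image, checkpoint, clicks=None, box=None, model_type="vit_h"):
    """Hypothetical wrapper: automatic mode if no prompt is given,
    otherwise click- and/or box-prompted segmentation."""
    # Imported lazily so the prompt helpers above work without the package.
    from segment_anything import (
        SamAutomaticMaskGenerator,
        SamPredictor,
        sam_model_registry,
    )

    sam = sam_model_registry[model_type](checkpoint=checkpoint)

    if not clicks and box is None:
        # Automatic mode: scans the whole image, returns one dict per mask.
        return SamAutomaticMaskGenerator(sam).generate(image)

    predictor = SamPredictor(sam)
    predictor.set_image(image)  # expects an HxWx3 uint8 RGB array
    coords, labels = build_click_prompt(clicks) if clicks else (None, None)
    masks, scores, _ = predictor.predict(
        point_coords=coords,
        point_labels=labels,
        box=build_box_prompt(*box) if box is not None else None,
        multimask_output=True,
    )
    # Keep the candidate mask the model itself rates highest.
    return masks[np.argmax(scores)]
```

Calling `segment(image, "path/to/checkpoint.pth")` with no prompt would run the automatic mode, while passing `clicks=[(120, 80, True)]` or `box=(40, 30, 200, 180)` would correspond to clicking on an object or marking an area. The checkpoint path and the coordinates are placeholders.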


Nvidia researcher Jim Fan calls SAM the “GPT-3 moment” in computer vision.

Meta’s SAM for everything and the XR future

Meta sees many applications for SAM, such as serving as part of multimodal AI systems that understand visual and text content on web pages, or segmenting small organic structures in microscopy images.


In the XR domain, SAM could automatically segment objects viewed by a person wearing an XR headset, and selected objects could then be converted into 3D objects by models such as Meta’s MCC.


The model is available on GitHub and can be tried out via a demo.
