
230408_Segment Anything

Reference video

Segment Anything Model (SAM) from Meta AI | Meta SAM AI
Video description (condensed): a walkthrough of the Segment Anything Model (SAM) by Meta, covering the segment-anything.com website and demo, how SAM can take input prompts from other systems (for example a user's gaze from an AR/VR headset or bounding boxes from an object detector), the model-in-the-loop "data engine" used to collect millions of images and masks, zero-shot generalization to unfamiliar objects and images, the SA-1B dataset and its explorer, the facebookresearch/segment-anything GitHub repository, the released model checkpoints (ViT-B, ViT-L, ViT-H), and the ONNX format used for deployment.
Links from the video:
Segment Anything website: https://segment-anything.com/
SAM demo: https://segment-anything.com/demo
SA-1B dataset: https://segment-anything.com/dataset/index.html
Meta AI blog post: https://ai.facebook.com/blog/segment-anything-foundation-model-image-segmentation/
GitHub: https://github.com/facebookresearch/segment-anything

Segment Anything Website

What is SAM?

The Segment Anything Model (SAM) is a deep learning model recently released by Meta AI. It analyzes visual data such as images and video and segments it into its component parts, separating objects from the background so that each element can be identified.
Unlike earlier segmentation models, SAM copes accurately with wide variation in object type, color, brightness, and size. This comes both from the model's architecture and from the diversity of its training data.
SAM can be applied across many domains. In autonomous driving, for example, it can separate the road, pedestrians, and vehicles so that an accurate driving path can be planned; in medical imaging, it can help identify and analyze lesions such as tumors.
SAM has not been commercialized yet, but its high accuracy and broad applicability mean it is expected to see use in many more fields.

How it differs from existing models

The biggest difference between SAM and earlier segmentation models is the "segment anything" idea itself: SAM is built to recognize and segment objects regardless of their category, color, brightness, or size.
Earlier segmentation models classify only a fixed set of classes defined in advance, whereas SAM is designed to segment anything, which gives it far greater versatility and extensibility.
SAM is also designed to handle complex scenes and inputs of very different scales. Rather than a fixed classifier head, it uses a promptable design: an image encoder computes an image embedding, and a lightweight prompt encoder and mask decoder turn prompts such as clicks, boxes, or text into segmentation masks. This makes its segmentation both more accurate and more flexible than that of earlier models.
SAM therefore stands apart from existing segmentation models in both versatility and accuracy, and it is expected to find use across a growing range of fields.

Core techniques

SAM's core contribution comes in two parts. The first is the "segment anything" task: a promptable model that returns a valid segmentation mask for whatever a prompt indicates, across a huge variety of objects, colors, brightness levels, and sizes. The second is the architecture and data that make this possible, so that inputs of very different scales and structures can be handled by a single model.
Concretely, SAM pairs a ViT-based image encoder (released in ViT-B, ViT-L, and ViT-H variants) that computes a one-time image embedding with a lightweight encoder for prompts such as foreground/background points, boxes, or free-form text, and a lightweight mask decoder that combines the two to predict masks in real time.
Data diversity comes from a model-in-the-loop "data engine": annotators used SAM to label masks interactively, the newly labeled data was used to retrain SAM, and the cycle was repeated, producing the SA-1B dataset of roughly 1.1 billion masks over 11 million images.
Trained this way, SAM shows strong zero-shot transfer: it can segment unfamiliar objects in new image domains, from underwater photos to cell microscopy, without additional training.
Together, the promptable task, the encoder-decoder design, and the SA-1B data engine are what let SAM recognize such a wide range of objects and handle inputs of widely varying scale and structure with strong segmentation quality.
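To make the promptable design concrete, here is a minimal sketch of point-prompted (interactive) segmentation with the publicly released segment-anything package. The checkpoint filename is the released ViT-B weight file; the image path and click coordinates are placeholders, and this is an illustrative usage sketch rather than code from the note.

```python
# Minimal sketch: one foreground click -> candidate masks, using the public
# segment-anything API. Paths and coordinates are placeholders.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # released ViT-B weights
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # runs the heavy image encoder once

masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),   # a single click (x, y)
    point_labels=np.array([1]),            # 1 = foreground, 0 = background
    multimask_output=True,                 # ambiguous prompt -> several candidates
)
best_mask = masks[scores.argmax()]         # highest-scoring candidate mask
```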

SAM Demo

Segment Anything / SAM Dataset / SA-1B Dataset

Segment Anything / SAM META AI Blog

GitHub

facebookresearch/segment-anything: https://github.com/facebookresearch/segment-anything

Full text (Meta AI blog post)

Introducing Segment Anything: Working toward the first foundation model for image segmentation

April 5, 2023
Segmentation — identifying which image pixels belong to an object — is a core task in computer vision and is used in a broad array of applications, from analyzing scientific imagery to editing photos. But creating an accurate segmentation model for specific tasks typically requires highly specialized work by technical experts with access to AI training infrastructure and large volumes of carefully annotated in-domain data.
Today, we aim to democratize segmentation by introducing the Segment Anything project: a new task, dataset, and model for image segmentation, as we explain in our research paper. We are releasing both our general Segment Anything Model (SAM) and our Segment Anything 1-Billion mask dataset (SA-1B), the largest ever segmentation dataset, to enable a broad set of applications and foster further research into foundation models for computer vision. We are making the SA-1B dataset available for research purposes and the Segment Anything Model is available under a permissive open license (Apache 2.0). Check out the demo to try SAM with your own images.
Reducing the need for task-specific modeling expertise, training compute, and custom data annotation for image segmentation is at the core of the Segment Anything project. To realize this vision, our goal was to build a foundation model for image segmentation: a promptable model that is trained on diverse data and that can adapt to specific tasks, analogous to how prompting is used in natural language processing models. However, the segmentation data needed to train such a model is not readily available online or elsewhere, unlike images, videos, and text, which are abundant on the internet. Thus, with Segment Anything, we set out to simultaneously develop a general, promptable segmentation model and use it to create a segmentation dataset of unprecedented scale.
SAM has learned a general notion of what objects are, and it can generate masks for any object in any image or any video, even including objects and image types that it had not encountered during training. SAM is general enough to cover a broad set of use cases and can be used out of the box on new image “domains” — whether underwater photos or cell microscopy — without requiring additional training (a capability often referred to as zero-shot transfer).
In the future, SAM could be used to help power applications in numerous domains that require finding and segmenting any object in any image. For the AI research community and others, SAM could become a component in larger AI systems for more general multimodal understanding of the world, for example, understanding both the visual and text content of a webpage. In the AR/VR domain, SAM could enable selecting an object based on a user’s gaze and then “lifting” it into 3D. For content creators, SAM can improve creative applications such as extracting image regions for collages or video editing. SAM could also be used to aid scientific study of natural occurrences on Earth or even in space, for example, by localizing animals or objects to study and track in video. We believe the possibilities are broad, and we are excited by the many potential use cases we haven’t even imagined yet.
Segment Anything’s promptable design enables flexible integration with other systems. SAM could receive input prompts, such as a user’s gaze from an AR/VR headset.

SAM: A generalized approach to segmentation

Previously, to solve any kind of segmentation problem, there were two classes of approaches. The first, interactive segmentation, allowed for segmenting any class of object but required a person to guide the method by iteratively refining a mask. The second, automatic segmentation, allowed for segmentation of specific object categories defined ahead of time (e.g., cats or chairs) but required substantial amounts of manually annotated objects to train (e.g., thousands or even tens of thousands of examples of segmented cats), along with the compute resources and technical expertise to train the segmentation model. Neither approach provided a general, fully automatic approach to segmentation.
SAM is a generalization of these two classes of approaches. It is a single model that can easily perform both interactive segmentation and automatic segmentation. The model’s promptable interface (described shortly) allows it to be used in flexible ways that make a wide range of segmentation tasks possible simply by engineering the right prompt for the model (clicks, boxes, text, and so on). Moreover, SAM is trained on a diverse, high-quality dataset of over 1 billion masks (collected as part of this project), which enables it to generalize to new types of objects and images beyond what it observed during training. This ability to generalize means that, by and large, practitioners will no longer need to collect their own segmentation data and fine-tune a model for their use case.
Taken together, these capabilities enable SAM to generalize both to new tasks and to new domains. This flexibility is the first of its kind for image segmentation.
Here is a short video showcasing some of SAM’s capabilities:
(1) SAM allows users to segment objects with just a click or by interactively clicking points to include and exclude from the object. The model can also be prompted with a bounding box.
(2) SAM can output multiple valid masks when faced with ambiguity about the object being segmented, an important and necessary capability for solving segmentation in the real world.
(3) SAM can automatically find and mask all objects in an image.
(4) SAM can generate a segmentation mask for any prompt in real time after precomputing the image embedding, allowing for real-time interaction with the model.
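Capability (3) above corresponds to the automatic mode of the released code. A minimal sketch, assuming the public segment-anything package and the released ViT-H checkpoint; the image path is a placeholder.

```python
# Minimal sketch: fully automatic mask generation over a whole image.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # released ViT-H weights
mask_generator = SamAutomaticMaskGenerator(sam)  # internally prompts SAM with a grid of points

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts, one per detected mask

# Each dict carries a binary "segmentation" array plus quality metadata.
for m in sorted(masks, key=lambda m: m["area"], reverse=True)[:5]:
    print(m["area"], m["bbox"], m["predicted_iou"], m["stability_score"])
```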

How SAM works: Promptable segmentation

In natural language processing and, more recently, computer vision, one of the most exciting developments is that of foundation models that can perform zero-shot and few-shot learning for new datasets and tasks using “prompting” techniques. We took inspiration from this line of work.
We trained SAM to return a valid segmentation mask for any prompt, where a prompt can be foreground/background points, a rough box or mask, freeform text, or, in general, any information indicating what to segment in an image. The requirement of a valid mask simply means that even when a prompt is ambiguous and could refer to multiple objects (for example, a point on a shirt may indicate either the shirt or the person wearing it), the output should be a reasonable mask for one of those objects. This task is used to pretrain the model and to solve general downstream segmentation tasks via prompting.
We observed that the pretraining task and interactive data collection imposed specific constraints on the model design. In particular, the model needs to run in real time on a CPU in a web browser to allow our annotators to use SAM interactively in real time to annotate efficiently. While the runtime constraint implies a trade-off between quality and runtime, we find that a simple design yields good results in practice.
Under the hood, an image encoder produces a one-time embedding for the image, while a lightweight encoder converts any prompt into an embedding vector in real time. These two information sources are then combined in a lightweight decoder that predicts segmentation masks. After the image embedding is computed, SAM can produce a segment in just 50 milliseconds given any prompt in a web browser.
In a web browser, SAM efficiently maps the image features and a set of prompt embeddings to produce a segmentation mask.
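The split into a one-time image embedding and a lightweight prompt-conditioned decoder is what the following sketch illustrates: the encoder runs once per image, after which each new prompt costs only a cheap decoder call. It reuses the same public SamPredictor API as the earlier sketch; the click coordinates are made up.

```python
# Sketch: encode once, then answer many prompts with the lightweight decoder.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)
predictor.set_image(cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB))  # heavy step, done once

for x, y in [(120, 80), (400, 260), (640, 512)]:   # hypothetical user clicks
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[x, y]]),
        point_labels=np.array([1]),                # each click treated as a foreground point
        multimask_output=True,
    )
    best = masks[np.argmax(scores)]                # per-prompt decoding is the cheap part
```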

Segmenting 1 billion masks: How we built SA-1B

To train our model, we needed a massive and diverse source of data, which did not exist at the start of our work. The segmentation dataset we are releasing today is the largest to date (by far). The data was collected using SAM. In particular, annotators used SAM to interactively annotate images, and then the newly annotated data was used to update SAM in turn. We repeated this cycle many times to iteratively improve both the model and dataset.
With SAM, collecting new segmentation masks is faster than ever before. With our tool, it only takes about 14 seconds to interactively annotate a mask. Our per-mask annotation process is only 2x slower than annotating bounding boxes, which takes about 7 seconds using the fastest annotation interfaces. In comparison with previous large-scale segmentation data collection efforts, our model is 6.5x faster than COCO fully manual polygon-based mask annotation and 2x faster than the previous largest data annotation effort, which was also model-assisted.
However, relying on interactively annotating masks does not scale sufficiently to create our 1 billion mask dataset. Therefore, we built a data engine for creating our SA-1B dataset. This data engine has three “gears.” In the first gear, the model assists annotators, as described above. The second gear is a mix of fully automatic annotation combined with assisted annotation, helping increase the diversity of collected masks. The last gear of the data engine is fully automatic mask creation, allowing our dataset to scale.
Our final dataset includes more than 1.1 billion segmentation masks collected on about 11 million licensed and privacy-preserving images. SA-1B has 400x more masks than any existing segmentation dataset, and as verified by human evaluation studies, the masks are of high quality and diversity, and in some cases even comparable in quality to masks from the previous much smaller, fully manually annotated datasets.
Segment Anything’s capabilities are the result of training on millions of images and masks collected using a data engine. The result is a dataset of more than 1 billion segmentation masks – 400x larger than any prior segmentation dataset.
Images for SA-1B were sourced via a photo provider from multiple countries that span a diverse set of geographic regions and income levels. While we recognize that certain geographic regions are still underrepresented, SA-1B has a larger number of images and overall better representation across all regions than previous segmentation datasets. Moreover, we analyzed potential biases of our model across the perceived gender presentation, perceived skin tone and perceived age range of people, and we found that SAM performs similarly across different groups. Together, we hope this will make our work more equitable for use in real-world use cases.
While SA-1B made our research possible, it can also enable other researchers to train foundation models for image segmentation. We further hope that this data can become a basis for new datasets with additional annotations, such as a text description associated with each mask.
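For readers who want to work with the released data, here is a minimal sketch of reading one SA-1B per-image annotation file, assuming the COCO run-length-encoded (RLE) mask format described on the dataset page; the filename is a placeholder and pycocotools is assumed to be installed.

```python
# Minimal sketch: decode the masks in one SA-1B annotation JSON file.
import json
from pycocotools import mask as mask_utils

with open("sa_223750.json") as f:          # placeholder filename for one image's annotations
    record = json.load(f)

for ann in record["annotations"]:
    rle = ann["segmentation"]              # COCO RLE: {"size": [h, w], "counts": ...}
    binary_mask = mask_utils.decode(rle)   # H x W uint8 array, 1 inside the mask
    print(ann["area"], ann["bbox"], binary_mask.shape)
```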

What lies ahead

In the future, SAM could be used to identify everyday items via AR glasses that could prompt users with reminders and instructions.
SAM has the potential to impact a wide range of domains — perhaps one day helping farmers in the agricultural sector or assisting biologists in their research.
By sharing our research and dataset, we hope to further accelerate research into segmentation and more general image and video understanding. Our promptable segmentation model can perform a segmentation task by acting as a component in a larger system. Composition is a powerful tool that allows a single model to be used in extensible ways, potentially to accomplish tasks unknown at the time of model design. We anticipate that composable system design, enabled by techniques such as prompt engineering, will enable a wider variety of applications than systems trained specifically for a fixed set of tasks, and that SAM can become a powerful component in domains such as AR/VR, content creation, scientific domains, and more general AI systems. And as we look ahead, we see tighter coupling between understanding images at the pixel level and higher-level semantic understanding of visual content, unlocking even more powerful AI systems.