SAM Audio

Isolates specific sounds in audio using text, visual, or time-span prompts with the SAM Audio model

Featured alternatives

  • Neural Frames Audio Visualizer
  • Revid.ai Music Visualizer
  • VEED Music Visualizer
  • Doodooc
  • Pippit Music Visualizer
  • Renderforest Music Visualizer

Overview

SAM Audio is an advanced AI model developed by Meta for prompt-based sound separation from complex audio mixtures. Announced on December 16, 2025, it enables users to isolate specific sounds—such as vocals, instruments, or environmental noises—using text prompts, visual prompts (from video), or time-span prompts. These prompts can be combined for precise control. SAM Audio is powered by Perception Encoder Audiovisual (PE-AV), a multimodal audio-visual encoder that understands and separates sounds based on user input.

The tool is designed for audio and video professionals across multiple industries, including music producers, podcasters, film editors, researchers, and accessibility specialists. Meta positions SAM Audio as a unified model in a previously fragmented audio editing space, enabling separation workflows that match how people naturally specify sounds. According to Meta's evaluations, the model can run faster than real time (RTF ≈ 0.7), with released variants spanning roughly 500M to 3B parameters—actual performance depends on hardware and model size.

SAM Audio is accessible through Meta's Segment Anything Playground for web-based experimentation. Developers can also run it locally via the official GitHub repository (GPU recommended) after requesting checkpoint access on Hugging Face. Usage is governed by Meta's SAM License.

Key Features

  • Text-Based Prompting — Enter natural language descriptions like "dog barking" or "guitar solo" to isolate specific sounds from audio recordings, eliminating the need for technical audio engineering knowledge.

  • Visual Prompting for Video — Click on people or objects in video content to isolate the target sound (and output a residual mix) associated with that object, streamlining the workflow for video editors and content creators.

  • Span Prompting — Mark specific time segments where target audio occurs to achieve precise extraction control. Meta describes this as an "industry first," enabling surgical-level editing accuracy.

  • Multi-Modal Flexibility — Combine text, visual, and span prompting methods simultaneously for maximum control over complex audio separation tasks.

  • Real-Time Processing — Processes audio faster than real time (RTF ≈ 0.7) in Meta's evaluations, enabling efficient workflows for large-scale projects and batch processing.

  • Perception Encoder Audiovisual Engine — Leverages Meta's PE-AV technology to understand and isolate sounds based on user prompts without affecting other audio elements in the mix.

Pricing & Plans

SAM Audio is available free to try and download, subject to license terms:

  • Web Playground — Free. Access via Meta's Segment Anything Playground; browser-based demo (no local install); experiment with provided assets or upload your own audio files; subject to Playground terms and any usage limits.

  • Download — Free. Code available on GitHub; checkpoints distributed via Hugging Face (access request required); supports models from 500M to 3B parameters; Python ≥3.10 and a CUDA GPU recommended for local inference.

Key Details:

  • Free to try via web demo and free to download for local use
  • Commercial use depends on the SAM License terms—review the license before commercial deployment
  • The web Playground may have additional usage restrictions; check terms before using in paid or client work
  • Checkpoints require requesting access on Hugging Face and authenticating before download
  • Support and discussion available via GitHub Issues and Hugging Face community pages

Pros & Cons

Pros:

  • Free to use via web demo and downloadable for local deployment under the SAM License
  • Intuitive multi-modal prompting system—beginner-friendly via the web Playground
  • Processes audio faster than real-time in Meta's evaluations, enabling efficient workflows
  • Meta released SAM Audio-Bench and SAM Audio Judge to evaluate separation quality in real-world settings
  • Flexible deployment options: web-based playground for quick testing, local installation for developers
  • Supports wide range of applications from music production to accessibility services

Cons:

  • Meta notes that SAM Audio does not support audio-based prompting yet (only text, visual, and span prompts)
  • Requires user-provided prompts to perform separation—cannot work fully automatically
  • Can struggle with similar-sounding sources, such as individual voices in choirs or specific instruments in orchestras (as reported in Meta's launch coverage)
  • Local deployment requires Python ≥3.10, CUDA GPU recommended, and Hugging Face access approval
  • Checkpoint access requires requesting permission and authentication on Hugging Face

Best For

  • Music producers who need to isolate individual instruments or vocals from mixed recordings for remixing, sampling, or stem creation
  • Podcast editors removing unwanted background noise, crosstalk, or environmental sounds from recorded episodes
  • Video editors working on films, television, or online content who need to separate dialogue from background audio or isolate specific sound effects
  • Content creators producing social media videos, tutorials, or vlogs who want clean audio without expensive post-production software
  • Researchers in audio processing, machine learning, or acoustics who need advanced sound separation capabilities for analysis
  • Accessibility specialists creating cleaner audio tracks for hearing-impaired audiences or developing assistive technologies
  • Sound designers seeking unique audio samples by extracting specific elements from complex soundscapes

FAQ

Is SAM Audio free to use?

Yes, SAM Audio is free to use. You can access it through Meta's Segment Anything Playground via web browser for quick testing, or download the code and model checkpoints for local use. The web Playground is subject to its terms and any usage limits. For local deployment, code is available on GitHub and checkpoints are distributed under Meta's SAM License via Hugging Face—you'll need to request access and authenticate before downloading. There are no subscription fees.

What audio formats does SAM Audio support?

For the web Playground, upload format support may vary—test your files directly in the demo. For local use, you can load audio via common Python audio libraries (e.g., torchaudio), so supported formats depend on the backend tooling you choose. The model accepts audio file paths or torch tensors as input.
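Since format support depends on the backend you choose, it can help to sanity-check a file's basic properties (channels, bit depth, sample rate) before handing it to a loader such as torchaudio. A minimal sketch using only Python's standard library `wave` module — the file name and parameters here are arbitrary illustrations, not SAM Audio requirements:

```python
import math
import struct
import wave

SR = 16000  # example sample rate; not a SAM Audio requirement

# Write a 1-second mono 16-bit PCM WAV (a 440 Hz tone) as a stand-in
# for a real recording, purely so the inspection step below has input.
with wave.open("example.wav", "wb") as w:
    w.setnchannels(1)   # mono
    w.setsampwidth(2)   # 16-bit PCM
    w.setframerate(SR)
    samples = (int(32767 * 0.5 * math.sin(2 * math.pi * 440 * t / SR))
               for t in range(SR))
    w.writeframes(b"".join(struct.pack("<h", s) for s in samples))

# Inspect the file the way you might before batch-processing:
with wave.open("example.wav", "rb") as w:
    info = (w.getnchannels(), w.getsampwidth(), w.getframerate(), w.getnframes())

print(info)  # (1, 2, 16000, 16000)
```

For formats beyond WAV (MP3, FLAC, etc.), a library such as torchaudio or soundfile handles decoding; consult its documentation for the exact list of supported codecs.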

Can SAM Audio separate vocals from music completely automatically?

No, SAM Audio requires user input through prompts to perform audio separation. It does not support fully automatic separation without any prompting. Users must provide text descriptions, visual cues in video contexts, or time span markers to guide the separation process. This prompt-based approach offers more control but requires user interaction.

How does SAM Audio compare to tools like Spleeter or LALAL.AI?

High-level comparison: Spleeter is commonly used for fixed stem separation (vocals, drums, bass, etc.), while SAM Audio is prompt-driven (text, visual, and span-based) and can be more flexible for custom sound targets. LALAL.AI offers user-friendly web-based processing with preset separation types. SAM Audio's prompting system allows for more customized extraction, though actual results depend on the specific audio content and the quality of your prompts.

What are the system requirements for running SAM Audio locally?

The official GitHub repository lists Python ≥3.10 as the minimum requirement and recommends a CUDA-compatible GPU for practical local inference. Specific hardware needs vary by model size (500M to 3B parameters)—larger models will require more GPU memory and computational resources. Refer to the official repository for detailed setup instructions.
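Before cloning the repository, you can verify the two documented requirements up front. A minimal sketch — `check_local_requirements` is a hypothetical helper name for this article, not part of the SAM Audio codebase, and it only checks what the repository documents (Python version, CUDA availability if PyTorch is already installed):

```python
import sys

def check_local_requirements() -> dict:
    """Sanity-check the documented SAM Audio local requirements:
    Python >= 3.10 (required) and a CUDA GPU (recommended)."""
    report = {"python_ok": sys.version_info >= (3, 10)}
    try:
        import torch  # only meaningful if PyTorch is already installed
        report["cuda_available"] = torch.cuda.is_available()
    except ImportError:
        report["cuda_available"] = None  # PyTorch not installed yet
    return report

print(check_local_requirements())
```

If `cuda_available` comes back `False` or `None`, local inference will fall back to CPU (where available) and be substantially slower, especially for the larger checkpoints.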

Can SAM Audio isolate individual voices in a choir or specific instruments in an orchestra?

Not reliably. As reported in coverage of Meta's launch materials, SAM Audio can struggle to distinguish individual voices within choirs or specific instruments in orchestras, where multiple similar sounds occur simultaneously. This is a known limitation and an area for future development.

Is SAM Audio suitable for professional music production?

Yes, SAM Audio can be used in professional music production workflows for tasks such as instrument isolation, stem creation, noise reduction in live recordings, and creative sound design. However, professionals should evaluate whether the tool meets their specific quality standards and workflow requirements, particularly considering its limitations with similar-sounding sources.

Does SAM Audio work with video files?

Yes, SAM Audio supports visual prompting specifically designed for video content. Users can click on people or objects in videos to isolate the associated target sound (along with a residual mix of everything else). This feature is particularly useful for video editors who need to separate dialogue, sound effects, or specific audio elements from video recordings.

Can I use SAM Audio commercially in my projects?

Commercial use depends on Meta's SAM License terms. While the model is free to download, the license governs how you can use it—review the full license text before deploying SAM Audio in commercial products or client work. The web Playground may have additional restrictions for commercial use, so check the Playground's terms as well.

How long does it take to process audio with SAM Audio?

According to Meta's evaluations, SAM Audio can run faster than real time with a real-time factor of approximately 0.7—meaning a 10-minute audio file might process in roughly 7 minutes under test conditions. Actual processing speed will vary depending on the model size you choose (500M to 3B parameters), your hardware capabilities (GPU vs CPU), and the complexity of the separation task.
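The arithmetic behind that estimate is simple: RTF is processing time divided by audio duration, so RTF < 1.0 means faster than real time. A small sketch (the function name is illustrative; 0.7 is Meta's reported figure and will vary with your hardware and model size):

```python
def estimated_processing_seconds(audio_seconds: float, rtf: float = 0.7) -> float:
    """Estimate wall-clock processing time from a real-time factor (RTF).

    RTF = processing_time / audio_duration, so RTF < 1.0 is faster
    than real time. Actual RTF depends on hardware and model size.
    """
    return audio_seconds * rtf

# A 10-minute (600 s) file at the reported RTF of 0.7:
print(estimated_processing_seconds(600))  # 420.0 seconds, i.e. ~7 minutes
```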