Rike Pool

Segmentation with natural language

I built a face swapper this week using three pieces of technology that did not exist two months ago. SAM 3 dropped on November 19th. Gemini 3 Pro Image (what Google is calling “Nano Banana Pro”) went live on November 20th. Bun 1.3 shipped in October. And somehow, combining these three things produces something that actually works, is embarrassingly simple to deploy, and costs a fraction of a cent per image.

This is the kind of project that makes me feel like the entire stack is finally catching up to the capabilities of the models themselves.


What it does

The app swaps faces between two images. You upload a source face and a target body, and it composites them together using AI. The pipeline looks like this:

  1. SAM 3 (Meta’s new text-prompt segmentation model) extracts the face from the source image with pixel-perfect precision.
  2. SAM 3 again masks the face region in the target image (so Gemini knows where to paint).
  3. Gemini 3 Pro Image takes both inputs and generates a new image with the swapped face.

The whole thing runs as a Bun server locally, with SAM 3 offloaded to Modal’s serverless GPU infrastructure. Three files. One dependency (@google/genai). Zero configuration.
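
Stripped of error handling, the three-step pipeline is just sequential composition. In this sketch, `segment` and `generate` are stand-ins for the Modal and Gemini calls; the names and signatures are illustrative, not the actual server.ts:

```typescript
// Hypothetical signatures standing in for the Modal and Gemini calls.
type Segment = (imageB64: string, mode: "crop" | "redact") => Promise<string>;
type Generate = (faceB64: string, maskedB64: string) => Promise<string>;

// The pipeline as plain composition: extract the source face,
// mask the target face, then hand both to the generation step.
async function swapFaces(
  source: string,
  target: string,
  segment: Segment,
  generate: Generate
): Promise<string> {
  const face = await segment(source, "crop");     // 1. SAM 3: crop source face
  const masked = await segment(target, "redact"); // 2. SAM 3: mask target face
  return generate(face, masked);                  // 3. Gemini: composite
}
```

Because the two segmentation calls are injected, the same shape works whether they hit Modal over HTTP or a local stub.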

Sources: Meta SAM 3 announcement (November 19, 2025), Google Nano Banana Pro announcement (November 20, 2025), Bun 1.3 blog (October 10, 2025)


The architecture

Browser (index.html)
    │
    ▼
Bun Server (server.ts) ──────► Modal (SAM 3 on H100)
    │                              │
    │                              ▼
    └──────────────────────► Gemini 3 Pro Image API

The frontend is vanilla JavaScript with Tailwind CDN. No React. No build step. The backend is a Bun server that proxies requests to Modal (for segmentation) and calls Gemini directly (for generation). Everything runs in a single process.

File structure:

face-swapper/
├── index.html        # Frontend
├── server.ts         # Bun server with API routes
├── sam3_face.py      # Modal deployment
├── package.json      # Just @google/genai
└── .env.local        # API keys

Why SAM 3 is a big deal

SAM 3 is Meta’s third iteration of the Segment Anything model, and it represents a fundamental shift in how segmentation works. The previous versions (SAM and SAM 2) required visual prompts: clicks, boxes, or masks. SAM 3 accepts text prompts. You can literally type “head” and it will segment every head in the image.

This is possible because SAM 3 uses Meta’s new Perception Encoder, which links language and visual features in a way that earlier models could not. The model handles 270,000+ unique concepts (compared to roughly 5,000 in previous open-vocabulary benchmarks). It is, in some sense, the first segmentation model that truly understands natural language at scale.

For the face swapper, I use two modes:

  1. Crop mode: Extracts the face with transparency. This becomes the source material for Gemini.
  2. Redact mode: Draws a pink mask over the target face. This tells Gemini where to paint.

The text prompt is configurable. You can ask for “head” (face + hair), “face” (just facial features), “hair” (hair only), or really anything else. SAM 3 will find it.
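
As a pixel-level sketch (my own illustration, not the actual frontend code), the two modes reduce to a single pass over the RGBA buffer with the SAM 3 mask:

```typescript
// One byte per pixel in `mask` (1 = inside the segmented region),
// four bytes per pixel in `rgba`. The pink is an assumed (255, 105, 180).
function applyMask(
  rgba: Uint8ClampedArray,
  mask: Uint8Array,
  mode: "crop" | "redact"
): Uint8ClampedArray {
  const out = new Uint8ClampedArray(rgba);
  for (let i = 0; i < mask.length; i++) {
    if (mode === "crop" && mask[i] === 0) {
      out[i * 4 + 3] = 0; // crop: everything outside the mask goes transparent
    } else if (mode === "redact" && mask[i] === 1) {
      out[i * 4] = 255;     // redact: paint the masked region pink
      out[i * 4 + 1] = 105;
      out[i * 4 + 2] = 180;
    }
  }
  return out;
}
```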


The Modal deployment

Running SAM 3 requires a GPU. The model is 848 million parameters, and inference on CPU would be painfully slow. I use Modal, which is basically “serverless GPUs via Python decorators.”

Here is the core of the deployment:

import base64
import io

import modal
import torch
from PIL import Image
from transformers import Sam3Model, Sam3Processor

@app.cls(gpu="H100", volumes={"/cache": model_cache})
class FaceSegmenter:
    @modal.enter()  # Loads ONCE on container startup
    def setup(self):
        self.processor = Sam3Processor.from_pretrained(
            "facebook/sam3", cache_dir="/cache"
        )
        self.model = Sam3Model.from_pretrained(
            "facebook/sam3",
            torch_dtype=torch.bfloat16,
            cache_dir="/cache",
        ).to("cuda")

    @modal.fastapi_endpoint(method="POST", label="sam3-segment")
    def segment_face(self, request: dict):
        # Decode the base64-encoded image from the request body
        image = Image.open(io.BytesIO(base64.b64decode(request["image"]))).convert("RGB")
        # Text-prompted segmentation: "head", "face", "hair", etc.
        inputs = self.processor(
            images=image, text=request.get("prompt", "head"), return_tensors="pt"
        ).to("cuda")
        with torch.inference_mode():
            outputs = self.model(**inputs)
        # Mask post-processing and crop/redact compositing elided
        return {"success": True, "cropped": "..."}

Source: Modal documentation on @modal.enter() pattern

The @modal.enter() decorator is the key optimization. Without it, you would load the model on every request (which takes 10-20 seconds from the cached volume). With it, the model loads once when the container starts and stays in GPU memory for all subsequent requests.

Performance breakdown:

Scenario                   Time       Notes
Cold start (first ever)    ~30-60s    Download model + load to GPU
Cold start (cached)        ~10-20s    Load from volume to GPU
Warm requests              ~300ms     Inference only
Warm requests are roughly 30-60x faster than even a cached cold start. The container stays warm for 5 minutes after the last request, then scales to zero.

Cost: H100 runs at approximately $4/hour. At 300ms per request, that is about $0.0003 per inference. A thousand requests would cost roughly 30 cents.
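
The arithmetic is worth spelling out. A back-of-the-envelope helper (mine, not part of the project):

```typescript
// Per-inference cost for a serverless GPU billed by the second.
function inferenceCostUSD(gpuDollarsPerHour: number, seconds: number): number {
  return (gpuDollarsPerHour / 3600) * seconds;
}

// A $4/hr H100 at 300ms per request works out to about $0.00033 per
// inference, so a thousand warm requests land around 33 cents.
```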


Gemini 3 Pro Image

Gemini 3 Pro Image is Google’s new state-of-the-art image generation and editing model. It launched a week ago alongside the broader Gemini 3 release. The marketing name is “Nano Banana Pro” (because the original Nano Banana model from August went viral with the 3D figurine trend).

What makes it useful for face swapping is its ability to take multiple input images and compose them together. Here is the API call:

const response = await ai.models.generateContent({
  model: "gemini-3-pro-image-preview",
  contents: [{
    parts: [
      { text: "Swap the face from the first image onto the body in the second image..." },
      { inlineData: { mimeType: "image/png", data: sourceFaceBase64 } },
      { inlineData: { mimeType: "image/jpeg", data: targetBodyBase64 } }
    ]
  }],
  config: {
    responseModalities: ["IMAGE", "TEXT"],
    thinkingLevel: "low",
    imageConfig: {
      imageSize: "1K",  // or "2K", "4K"
      aspectRatio: "1:1"
    }
  }
});

The model supports 1K, 2K, and 4K output resolutions. I default to 1K for speed, but 4K is available for high-quality results. Pricing is $0.134 per image at 1K/2K and $0.24 per image at 4K.

Source: Neowin on Gemini 3 Pro Image pricing (November 20, 2025)
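
Folding the quoted prices into a helper, a full swap (two SAM 3 inferences plus one Gemini generation; figures from above, rounding mine) comes out between 13 and 24 cents:

```typescript
// Per-image Gemini price at each quality tier, per the published figures.
function geminiImageCostUSD(size: "1K" | "2K" | "4K"): number {
  return size === "4K" ? 0.24 : 0.134;
}

// A full swap: two SAM 3 inferences (~$0.0003 each) plus one generation.
function swapCostUSD(size: "1K" | "2K" | "4K"): number {
  return 2 * 0.0003 + geminiImageCostUSD(size);
}
```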


Why Bun 1.3

I have been skeptical of Bun for a while. It seemed like it was optimizing for benchmarks that did not matter in practice. But Bun 1.3 changed my mind.

The killer feature is native HTML imports. You can do this:

import homepage from "./index.html";

Bun.serve({
  routes: {
    "/": homepage,
    "/api/segment": handleSegment,
    "/api/swap": handleSwap
  }
});

That is it. No build step. No bundler. No webpack.config.js nightmare. Bun’s native transpiler handles React, TypeScript, and CSS imports inside the HTML file. It generates sourcemaps, minifies in production, and serves everything with hot module reload in development.

Source: Bun 1.3 blog post (October 10, 2025)

The routing system also supports dynamic parameters (/api/users/:id) and different handlers for different HTTP methods. Everything runs in a single process. Startup time is under 100ms.
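
Under the hood that routing is just path-pattern matching. Here is a hand-rolled sketch of what a pattern like /api/users/:id resolves to (my illustration, not Bun's actual implementation):

```typescript
// Match a concrete path against a Bun-style route pattern and extract
// named parameters, or return null if the path does not match.
function matchRoute(
  pattern: string,
  path: string
): Record<string, string> | null {
  const pat = pattern.split("/");
  const seg = path.split("/");
  if (pat.length !== seg.length) return null;
  const params: Record<string, string> = {};
  for (let i = 0; i < pat.length; i++) {
    if (pat[i].startsWith(":")) params[pat[i].slice(1)] = seg[i];
    else if (pat[i] !== seg[i]) return null;
  }
  return params;
}
```

In real Bun routes, the same extracted parameters show up as req.params inside the handler.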

For this project, the entire server is about 200 lines of TypeScript. No Express. No Fastify. No dependencies beyond the Google AI SDK.


The frontend

The frontend is vanilla JavaScript and Tailwind (via CDN). I know this sounds like heresy in 2025, but for a single-page tool like this, you really do not need React.

Features:

  • Drag-and-drop image upload
  • Auto-crop using SAM 3 (with text prompts: “head”, “face”, “hair”, etc.)
  • Auto-redact for the target image (pink mask overlay)
  • Toggle between original and processed views
  • Configurable thresholds for detection and mask sensitivity
  • Edge feathering to soften mask boundaries (0-15px)
  • Quality selector (1K/2K/4K)
  • Custom AI prompt textarea
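
Edge feathering in particular is simple to picture: soften the hard 0/1 mask boundary by averaging each pixel with its neighbors. A one-dimensional sketch (the real version would blur in 2D over the 0-15px radius):

```typescript
// Feather a binary mask by box-averaging over a window of `radius`,
// turning the hard 0/1 edge into a soft ramp.
function feather(mask: number[], radius: number): number[] {
  return mask.map((_, i) => {
    let sum = 0;
    let count = 0;
    for (let j = i - radius; j <= i + radius; j++) {
      if (j >= 0 && j < mask.length) {
        sum += mask[j];
        count++;
      }
    }
    return sum / count;
  });
}
```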

The SAM 3 capabilities deserve special attention. You can segment literally anything:

{ prompt: "head" }     // Face + hair + head
{ prompt: "face" }     // Just facial features  
{ prompt: "person" }   // Full body
{ prompt: "hair" }     // Hair only
{ prompt: "ear" }      // Ears (use multiInstance: true for both)
{ prompt: "glasses" }  // Just the glasses
{ prompt: "shirt" }    // Clothing

This flexibility is what makes SAM 3 special. You are not limited to predefined categories. The model has been trained on 4 million unique concepts, so it can segment almost anything you can describe.


Deployment

Modal (for SAM 3):

# Create HuggingFace secret for model access
modal secret create huggingface HF_TOKEN=hf_your_token

# Deploy
modal deploy sam3_face.py

Your endpoint will be something like https://<workspace>--sam3-segment.modal.run.

Railway (for the Bun server):

  1. Connect your GitHub repo
  2. Add GEMINI_API_KEY and SAM3_ENDPOINT as environment variables
  3. Deploy (Railway auto-detects Bun)

Why video does not work

I should address this because people always ask: can you do video face swapping with this stack?

The short answer is no. The longer answer involves understanding why current technology makes this impractical.

Gemini 3 Pro Image is an image generation model. It does not generate video. To process a 10-second video at 30fps, you would need to call Gemini 300 times. At $0.13 per image, that is $39 for 10 seconds. At 5-15 seconds per generation, you are looking at 25-75 minutes of processing time for those 10 seconds of video. And even then, the frames would not be temporally consistent (you would get flickering and jittering between frames because each frame is generated independently).
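
Those numbers fall straight out of frame-count arithmetic. An estimate helper (per-image price and generation time taken from the figures above):

```typescript
// Naive per-frame estimate for video face swapping with an image model.
function videoSwapEstimate(
  clipSeconds: number,
  fps: number,
  perImageUSD: number,
  secondsPerGeneration: number
) {
  const frames = clipSeconds * fps;
  return {
    frames,
    costUSD: frames * perImageUSD,
    processingMinutes: (frames * secondsPerGeneration) / 60,
  };
}
```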

SAM 3 actually does support video tracking. It can propagate masks across frames automatically. But the generation step is the bottleneck. You would need a video generation model with temporal consistency (something like Runway Gen-3 or a specialized deepfake model like DeepFaceLab) to make this work.

For now, single-image face swaps are fast, cheap, and reliable. Video will have to wait.


My take

The thing that strikes me about this project is how quickly the plumbing has caught up to the models.

A year ago, building something like this would have required orchestrating Docker containers, managing GPU quotas, writing custom inference code, and probably a week of DevOps work. Today, I can deploy SAM 3 to Modal with a Python decorator, call Gemini with a single API request, and serve the whole thing with a Bun one-liner. The total development time was about a day, and most of that was UI polish.

This is the “infrastructure fade” that I think a lot of people predicted but few expected to happen this fast. The models are the hard part. The deployment is now (mostly) the easy part.

I am curious to see where this goes. SAM 3 just shipped a week ago. Gemini 3 Pro Image is barely a week old. Bun 1.3 is still on its initial release series. All three of these technologies are going to get better, and the combinations that become possible when they do are hard to predict.

For now, though, I have a face swapper that works. Three files, one dependency, and about $0.15 per face swap. Not bad.