Listenbooth
I read a lot online. Too much, probably. Dozens of open tabs, all things I genuinely want to get through.
When I'm home, I'm behind my computer. When I'm outside, I try to avoid screens altogether. Partly because it feels better, partly because Andrew Huberman tells me to.
The problem is that everything I want to read still lives on my screen.
So I built Listenbooth.
It takes any URL with readable content and converts it into a voice note. Articles, blogs, documentation, announcements - if it's text on a page, it usually works.
Simple concept. The implementation was more interesting than I expected.
Stack
- Bun (v1.3+) as runtime
- React 19 for the frontend
- Google Gemini for text processing and TTS
- Railway Storage Buckets for audio storage
- Firecrawl for content extraction
I specifically wanted to try Railway's new storage buckets. S3-compatible, tightly integrated with Railway's platform.
Pipeline
Input → URL
Output → MP3
- Scrape – Extract readable content
- Optimize – Format for speech
- Generate – Convert to audio
- Store – Upload and serve
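Conceptually, the whole app is just this composition of four async functions. A sketch with placeholder bodies (the names and types here are mine, not the actual implementation):

```typescript
// Hypothetical pipeline skeleton: each stage is an async function,
// and the whole conversion is sequential composition.
type Scraped = { title: string; markdown: string }

async function scrape(url: string): Promise<Scraped> {
  // placeholder: the real app calls Firecrawl here
  return { title: 'Example', markdown: `# Article from ${url}\n\nHello.` }
}

async function optimize(s: Scraped): Promise<string> {
  // placeholder: the real app calls gemini-2.5-flash-lite here
  return s.markdown.replace(/^#+\s*/gm, '')
}

async function generate(text: string): Promise<Uint8Array> {
  // placeholder: TTS + ffmpeg would go here
  return new TextEncoder().encode(text)
}

async function store(mp3: Uint8Array): Promise<string> {
  // placeholder: bucket upload + presign would go here
  return `https://bucket.example/audio/demo-${mp3.byteLength}.mp3`
}

async function convert(url: string): Promise<string> {
  return store(await generate(await optimize(await scrape(url))))
}
```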
Scrape
Firecrawl takes a URL, returns clean markdown. I tried building my own scraper initially, but modern websites are hostile to extraction - JavaScript rendering, lazy loading, cookie modals. Sometimes the right move is to pay for a solved problem.
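A sketch of what this step looks like against Firecrawl's REST API. The endpoint path and response shape below are assumptions based on their v1 API, so check the current docs before relying on them:

```typescript
type ScrapeResult = { title: string; markdown: string }

// Pull the fields we need out of a Firecrawl-style response.
// The shape { data: { markdown, metadata: { title } } } is an assumption.
function parseScrape(body: any): ScrapeResult {
  const data = body?.data ?? {}
  return {
    title: data.metadata?.title ?? 'Untitled',
    markdown: data.markdown ?? ''
  }
}

async function scrape(url: string, apiKey: string): Promise<ScrapeResult> {
  const res = await fetch('https://api.firecrawl.dev/v1/scrape', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${apiKey}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ url, formats: ['markdown'] })
  })
  if (!res.ok) throw new Error(`scrape failed: ${res.status}`)
  return parseScrape(await res.json())
}
```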
Optimize
You can't feed raw markdown to TTS and expect good results.
"The 11 engineers met on 01/15/2024 to discuss the v2.0 release."
A naive TTS says:
"The eleven engineers met on zero one slash fifteen slash twenty twenty-four..."
Before synthesis, I run text through gemini-2.5-flash-lite to:
- Convert numbers to spoken form
- Remove code blocks
- Strip markdown formatting
- Expand abbreviations
- Remove URLs, footnotes, references
This took output from "technically correct" to "pleasant to listen to."
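The real transformation is left to the model's judgment, but the flavor of it is easy to show mechanically. A toy version of two of the rules (strip markdown formatting, spell out small numbers); illustrative only, since the app delegates all of this to Gemini:

```typescript
const SMALL_NUMBERS = ['zero', 'one', 'two', 'three', 'four', 'five', 'six',
  'seven', 'eight', 'nine', 'ten', 'eleven', 'twelve']

// Toy normalizer: drops emphasis markers, unwraps links, spells out 0-12.
// The real pipeline hands this judgment to gemini-2.5-flash-lite instead.
function normalizeForSpeech(text: string): string {
  return text
    .replace(/[*_`]+/g, '')                    // drop emphasis and backticks
    .replace(/\[([^\]]*)\]\([^)]*\)/g, '$1')   // [label](url) -> label
    .replace(/\b(\d{1,2})\b/g, (match, n) => {
      const i = Number(n)
      return i < SMALL_NUMBERS.length ? SMALL_NUMBERS[i] : match
    })
}
```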
Generate
Gemini's gemini-2.5-flash-preview-tts model understands context. You can tell it to speak cheerfully, or in a whisper, or with a specific accent.
There are 30 voices with names from mythology: Zephyr, Kore, Fenrir, Puck. I exposed all 30 because I couldn't pick favorites.
The model returns raw PCM (24kHz, 16-bit, mono). I spawn ffmpeg to convert to MP3:
```typescript
const ffmpeg = Bun.spawn([
  'ffmpeg',
  '-f', 's16le',    // input is raw PCM: signed 16-bit little-endian
  '-ar', '24000',   // 24 kHz sample rate
  '-ac', '1',       // mono
  '-i', 'pipe:0',   // read input from stdin
  '-q:a', '2',      // VBR quality preset
  '-f', 'mp3',
  'pipe:1'          // write output to stdout
], {
  stdin: 'pipe',
  stdout: 'pipe',
  stderr: 'pipe'
})
```
Pipe PCM in, get MP3 out. No temporary files.
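A side benefit of handling raw PCM: duration is plain arithmetic. At 24 kHz, 16-bit (2 bytes per sample), mono, every second of audio is exactly 48,000 bytes, so the server can compute the length before encoding. A helper the app could use (my suggestion, not necessarily what it does):

```typescript
// Duration of a raw PCM buffer:
// bytes / (sampleRate * bytesPerSample * channels).
// At 24 kHz, 16-bit, mono that's 48,000 bytes per second.
function pcmDurationSeconds(
  byteLength: number,
  sampleRate = 24000,
  bytesPerSample = 2,
  channels = 1
): number {
  return byteLength / (sampleRate * bytesPerSample * channels)
}
```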
Store
Railway storage buckets: click "New" → "Storage Bucket." Done. Credentials inject automatically.
```typescript
import { s3 } from 'bun'

await s3.write(`audio/${id}.mp3`, mp3Buffer, {
  bucket: process.env.S3_BUCKET,
  accessKeyId: process.env.S3_ACCESS_KEY_ID,
  secretAccessKey: process.env.S3_SECRET_ACCESS_KEY,
  endpoint: process.env.S3_ENDPOINT,
  type: 'audio/mpeg'
})

const url = s3.presign(`audio/${id}.mp3`, { expiresIn: 3600 })
```
Buckets are private by default. You generate presigned URLs that expire (I use 1 hour). Audio streams directly from the bucket to the browser.
Bucket egress is free on Railway. For a project serving large audio files, this matters.
Feedback
Conversion takes 10–30 seconds. Without feedback, users stare at a static spinner and wonder if the app crashed.
I used SSE (Server-Sent Events) to solve this. Unlike WebSockets - which are like a two-way phone call - SSE is more like a one-way radio broadcast. The server just shouts updates and the browser listens. It's significantly simpler to implement when you only need to push status updates to the user.
With this live feed, I update the UI at every stage of the pipeline:
```
data: {"step":"scraping","status":"in_progress"}
data: {"step":"scraping","status":"complete","title":"Product Launch"}
data: {"step":"optimizing","status":"in_progress"}
```
This turns the "black box" of the backend into a live checklist where each step gets a checkmark as it finishes. It's a small technical detail, but it makes the wait feel productive rather than anxious.
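The wire format behind those lines is refreshingly simple: per the EventSource spec, each event is one or more "data:" lines followed by a blank line. A minimal encoder, plus a decoder for the single-line case (in the browser, EventSource does this parsing for you; the decoder is only here to show the format):

```typescript
// Encode one SSE event: "data: <json>\n\n".
function sseEvent(payload: unknown): string {
  return `data: ${JSON.stringify(payload)}\n\n`
}

// Decode a stream chunk back into payloads.
// Handles single-line data events only - enough for status updates.
function parseSse(chunk: string): unknown[] {
  return chunk
    .split('\n\n')
    .filter(Boolean)
    .map(block => JSON.parse(block.replace(/^data: /, '')))
}
```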
Bugs
Duration
The browser's <audio> element has a duration property. Except 30% of the time I got Infinity or NaN.
MP3 duration detection is genuinely hard. The browser needs either duration metadata (a Xing/VBR header, which ffmpeg can't finalize when writing to a pipe, because it can't seek back to the start of the stream) or to scan the whole file.
My fix: listen to multiple events and only update when I get a finite, positive number.
```typescript
const handleDurationChange = () => {
  // Only trust the value once it's a real, finite, positive number
  if (audio.duration && isFinite(audio.duration) && audio.duration > 0) {
    setDuration(audio.duration)
  }
}

// duration can arrive at different moments depending on the browser
audio.addEventListener('loadedmetadata', handleDurationChange)
audio.addEventListener('durationchange', handleDurationChange)
```
Not elegant, but it works.
Environment
Bun's S3 driver expects specific variable names. Railway's template provides them. I initially tried custom names and nothing worked. Read the docs.
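For reference, these are the names the upload code above reads; Bun's S3 client also picks them up from the environment automatically, and I believe it accepts AWS_-prefixed equivalents too (check Bun's docs for the current list):

```shell
# Names Bun's S3 client expects (Railway's bucket template provides them)
S3_ACCESS_KEY_ID=...
S3_SECRET_ACCESS_KEY=...
S3_BUCKET=...
S3_ENDPOINT=...
```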
Architecture
No database. History disappears on refresh. For a demo, this is fine. The audio files persist in the bucket - that's the durable state that matters.
Single process. Bun.serve() handles API and static files. No nginx.
Plain CSS. No Tailwind. Easier to reason about when styling is in one place.
Base UI. Unstyled primitives for dropdowns, sliders, dialogs. Accessibility without fighting someone else's design opinions.
Reflection
This took a few hours. An "it works and I can show it to people" weekend project.
The code I wrote is mostly glue. Important glue - SSE streaming, progress UI, audio player - but glue. The hard problems are solved by other people's infrastructure.
The tradeoff is dependency. If Firecrawl changes their API, I update. If Gemini's TTS gets deprecated, I find an alternative. If Railway's pricing changes, I reconsider.
Software as composition rather than construction. Build faster, ship sooner, depend on others more. For a weekend project, it's the obvious choice.
Demo
Paste a URL, pick a voice, generate. Then go for a walk.