Building Listenbooth
I read a lot online. Too much, probably. Dozens of open tabs, all things I genuinely want to get through.
When I’m home, I’m at my computer. When I’m outside, I try to avoid screens altogether. Partly because it feels better, partly because Andrew Huberman tells me to.
The problem is that everything I want to read still lives on my screen.
So I built Listenbooth.
It takes any URL with readable content and converts it into a “voice note” you can listen to. Articles, blogs, documentation pages, announcements - if it’s text on a page, it usually works.
Simple concept. The implementation, as usual, was more interesting than I expected.
Stack
- Bun (v1.3+) as runtime
- React 19 for the frontend
- Google Gemini for text processing and TTS
- Railway Storage Buckets for audio storage
- Firecrawl for content extraction
I specifically wanted to try Railway’s new storage buckets. They’re S3-compatible, so you can use any S3 client library, but they’re tightly integrated with Railway’s platform.
Pipeline
Input → URL
Output → MP3
- Scrape – Extract readable content
- Optimize – Format for speech
- Generate – Convert to audio
- Store – Upload and serve
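The four stages compose naturally as a chain of async functions. Below is a minimal sketch of that shape, not the project's actual code: the `Stages` type, `runPipeline`, and the step names "generating" and "storing" are assumptions made for illustration (only "scraping" and "optimizing" appear in the progress events shown later).

```typescript
// Hypothetical stage signatures. In the real project these would wrap
// Firecrawl, Gemini, ffmpeg, and the storage bucket.
type Stages = {
  scrape: (url: string) => Promise<string>; // URL -> markdown
  optimize: (markdown: string) => Promise<string>; // markdown -> speech-ready text
  generate: (text: string) => Promise<Uint8Array>; // text -> MP3 bytes
  store: (mp3: Uint8Array) => Promise<string>; // MP3 bytes -> playable URL
};

type Progress = { step: string; status: "in_progress" | "complete" };

async function runPipeline(
  url: string,
  stages: Stages,
  onProgress: (p: Progress) => void = () => {},
): Promise<string> {
  // Wrap a stage so progress is reported before and after it runs.
  const track = async <T>(step: string, work: () => Promise<T>): Promise<T> => {
    onProgress({ step, status: "in_progress" });
    const result = await work();
    onProgress({ step, status: "complete" });
    return result;
  };

  const markdown = await track("scraping", () => stages.scrape(url));
  const text = await track("optimizing", () => stages.optimize(markdown));
  const mp3 = await track("generating", () => stages.generate(text));
  return track("storing", () => stages.store(mp3));
}
```

Injecting the stages keeps each one swappable, which fits a project built almost entirely on third-party services.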
Scrape
Firecrawl takes a URL, returns clean markdown. I tried building my own scraper initially, but modern websites are hostile to extraction - JavaScript rendering, lazy loading, cookie modals. Sometimes the right move is to pay for a solved problem.
Optimize
You can’t feed raw markdown to TTS and expect good results.
“The 11 engineers met on 01/15/2024 to discuss the v2.0 release.”
A naive TTS says:
“The eleven engineers met on zero one slash fifteen slash twenty twenty-four…”
Before synthesis, I run text through gemini-2.5-flash-lite to:
- Convert numbers to spoken form
- Remove code blocks (nobody wants to hear console.log read aloud)
- Strip markdown formatting
- Expand abbreviations
- Remove URLs, footnotes, and references
This took output from “technically correct” to “pleasant to listen to.”
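The mechanical items on that list (code blocks, markdown syntax, URLs) can also be handled deterministically before the text ever reaches a model. The sketch below is a regex pre-pass, not the Gemini-based approach described above; `stripForSpeech` is a hypothetical helper, and number or abbreviation expansion is deliberately left out because it needs context that regexes don't have.

```typescript
// Hypothetical pre-pass: removes markup that should never be read aloud.
function stripForSpeech(markdown: string): string {
  return markdown
    .replace(/`{3}[\s\S]*?`{3}/g, "") // fenced code blocks
    .replace(/`[^`\n]*`/g, "") // inline code
    .replace(/!\[[^\]]*\]\([^)]*\)/g, "") // images
    .replace(/\[([^\]]*)\]\([^)]*\)/g, "$1") // links: keep only the label
    .replace(/https?:\/\/\S+/g, "") // bare URLs
    .replace(/^#{1,6}\s+/gm, "") // heading markers
    .replace(/(\*\*|\*)(.+?)\1/g, "$2") // bold and italic markers
    .replace(/[ \t]{2,}/g, " ") // collapse runs of spaces
    .replace(/\n{3,}/g, "\n\n") // collapse runs of blank lines
    .trim();
}
```

A pass like this would shrink the prompt and leave the model with only the judgment calls, at the cost of maintaining the patterns yourself.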
Generate
Gemini’s gemini-2.5-flash-preview-tts model is different from traditional TTS. It understands context - you can tell it to speak cheerfully, or in a whisper, or with a specific accent.
There are 30 voices with names from mythology: Zephyr, Kore, Fenrir, Puck. Some are bright and energetic, others calm and measured. I exposed all 30 because I couldn’t pick favorites.
The model returns raw PCM (24kHz, 16-bit, mono). I spawn ffmpeg to convert to MP3:
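const ffmpeg = Bun.spawn([
  'ffmpeg',
  '-f', 's16le',   // input format: raw signed 16-bit little-endian PCM
  '-ar', '24000',  // input sample rate: 24 kHz
  '-ac', '1',      // input channels: mono
  '-i', 'pipe:0',  // read input from stdin
  '-q:a', '2',     // VBR quality level for the MP3 encoder
  '-f', 'mp3',     // output format
  'pipe:1'         // write output to stdout
], {
  stdin: 'pipe',
  stdout: 'pipe',
  stderr: 'pipe'
})
Pipe PCM in, get MP3 out. No temporary files.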
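Because the PCM format is fixed, the clip length follows directly from the byte count: 24,000 samples per second × 2 bytes per sample × 1 channel = 48,000 bytes per second. A small sketch of that arithmetic, with `pcmDurationSeconds` as a hypothetical helper name:

```typescript
// Format reported above: 24 kHz, 16-bit, mono.
const SAMPLE_RATE = 24000;
const BYTES_PER_SAMPLE = 2; // 16 bits
const CHANNELS = 1;

// Duration of a raw PCM buffer in seconds, derived from its size alone.
function pcmDurationSeconds(byteLength: number): number {
  return byteLength / (SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS);
}
```

Computing this on the server before encoding is one possible way to know the duration up front, independent of what the browser later reports.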
Store
Railway’s storage buckets: click “New” → “Storage Bucket.” Done. Credentials inject automatically.
import { s3 } from 'bun'
await s3.write(`audio/${id}.mp3`, mp3Buffer, {
  bucket: process.env.S3_BUCKET,
  accessKeyId: process.env.S3_ACCESS_KEY_ID,
  secretAccessKey: process.env.S3_SECRET_ACCESS_KEY,
  endpoint: process.env.S3_ENDPOINT,
  type: 'audio/mpeg'
})
const url = s3.presign(`audio/${id}.mp3`, { expiresIn: 3600 })
Railway buckets are private by default - you can’t link directly to a file. Instead, you generate a presigned URL that expires (I use 1 hour). When a user wants to play audio, my server generates the URL and redirects. The audio streams directly from the bucket to the browser.
Bucket egress is free on Railway, so I don’t pay for that bandwidth. For a project serving large audio files, this matters.
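The presign-and-redirect step can be sketched as a small handler. This is an illustration under assumptions, not the project's code: `redirectToAudio` and the `Presign` type are hypothetical, the signer is injected so the sketch runs without a bucket, and the id pattern is a guess at what a safe id looks like.

```typescript
// Stand-in for whatever produces the signed URL (s3.presign in the post).
type Presign = (key: string, expiresInSeconds: number) => string;

function redirectToAudio(id: string, presign: Presign): Response {
  // Assumed id format; rejecting anything else keeps keys inside audio/.
  if (!/^[A-Za-z0-9_-]+$/.test(id)) {
    return new Response("Invalid id", { status: 400 });
  }
  const url = presign(`audio/${id}.mp3`, 3600); // 1 hour, as in the post
  // 302 sends the browser to the bucket, so audio bytes never pass through the server.
  return new Response(null, { status: 302, headers: { Location: url } });
}
```

With Bun, the injected function could be as thin as `(key, expiresIn) => s3.presign(key, { expiresIn })`.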
Streaming Progress
Conversion takes 10–30 seconds. Without feedback, users stare at a spinner. With SSE, I update them at each stage:
data: {"step":"scraping","status":"in_progress"}
data: {"step":"scraping","status":"complete","title":"Product Launch Announcement"}
data: {"step":"optimizing","status":"in_progress"}
...
Each step gets a checkmark when complete. Small thing, but it makes the wait feel productive rather than anxious.
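On the wire, each SSE message is a `data:` line followed by a blank line. A minimal sketch of the server-side formatting, assuming the event shape shown above; `StepUpdate`, `formatSse`, and `SSE_HEADERS` are illustrative names, not taken from the project.

```typescript
// Shape inferred from the sample events; `title` only appears on some of them.
type StepUpdate = {
  step: string;
  status: "in_progress" | "complete";
  title?: string;
};

// One SSE message: "data: <json>" terminated by a blank line.
function formatSse(update: StepUpdate): string {
  return `data: ${JSON.stringify(update)}\n\n`;
}

// Response headers that tell the browser to treat the body as an event stream.
const SSE_HEADERS = {
  "Content-Type": "text/event-stream",
  "Cache-Control": "no-cache",
};
```

The server would write one formatted message into the response stream as each stage starts and finishes; in the browser, `EventSource` or a streamed `fetch` can parse them.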
Things That Didn’t Work
Audio Duration
The browser’s <audio> element has a duration property. Except 30% of the time I got Infinity or NaN.
MP3 duration detection is complicated. The browser needs either a duration header (which ffmpeg doesn’t always include) or to scan the file. For streaming audio, this might not be available immediately.
My fix: listen to multiple events - loadedmetadata, durationchange, canplaythrough - and only update when I get a finite, positive number.
const handleDurationChange = () => {
  if (audio.duration && isFinite(audio.duration) && audio.duration > 0) {
    setDuration(audio.duration)
  }
}
Not elegant, but it works. Sometimes that’s enough.
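The handler only helps once it is attached to all three events. A sketch of that wiring, with cleanup suitable for a React effect; `watchDuration` and the `AudioLike` interface are hypothetical, and the interface is kept minimal so the function accepts a real audio element or a test double.

```typescript
// The subset of HTMLAudioElement this sketch relies on.
interface AudioLike {
  duration: number;
  addEventListener(type: string, listener: () => void): void;
  removeEventListener(type: string, listener: () => void): void;
}

const DURATION_EVENTS = ["loadedmetadata", "durationchange", "canplaythrough"];

function watchDuration(
  audio: AudioLike,
  onDuration: (seconds: number) => void,
): () => void {
  const handler = () => {
    // isFinite rejects both NaN and Infinity.
    if (isFinite(audio.duration) && audio.duration > 0) {
      onDuration(audio.duration);
    }
  };
  DURATION_EVENTS.forEach((name) => audio.addEventListener(name, handler));
  // Return a cleanup function that detaches every listener.
  return () => {
    DURATION_EVENTS.forEach((name) => audio.removeEventListener(name, handler));
  };
}
```

Inside a component this could be called from `useEffect`, returning the cleanup function directly.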
Environment Variables
Bun’s S3 driver expects specific variable names. Railway’s template provides them. I initially tried custom names and nothing worked. Read the docs.
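A startup check makes this kind of misconfiguration fail loudly instead of silently. The sketch below uses the four variable names from the upload snippet earlier; `missingEnv` and `REQUIRED_S3_VARS` are hypothetical names, and the list is an assumption about what this project needs rather than a complete account of what Bun reads.

```typescript
// Names taken from the upload snippet above.
const REQUIRED_S3_VARS = [
  "S3_BUCKET",
  "S3_ACCESS_KEY_ID",
  "S3_SECRET_ACCESS_KEY",
  "S3_ENDPOINT",
];

// Returns the names that are unset or empty in the given environment.
function missingEnv(
  env: Record<string, string | undefined>,
  names: string[] = REQUIRED_S3_VARS,
): string[] {
  return names.filter((name) => !env[name]);
}
```

Calling `missingEnv(process.env)` before starting the server and throwing when the result is non-empty would turn "nothing worked" into a message naming the exact variable.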
Architecture Decisions
No database. History disappears on refresh. For a demo, this is fine. Adding a database means more complexity, more things to break. The audio files persist in the bucket - that’s the durable state that matters.
Single process. Bun.serve() handles API and static files. No nginx, no separate processes.
Plain CSS. No Tailwind. I know this is controversial, but I find it easier to reason about styling when it’s in one place rather than scattered across component files.
Base UI. Unstyled primitives for dropdowns, sliders, dialogs. Accessibility without fighting someone else’s design opinions.
Bigger Picture
This took a few hours. An “it works and I can show it to people” weekend project.
The code I wrote is mostly glue. Important glue - SSE streaming, progress UI, audio player - but glue. The hard problems are solved by other people’s infrastructure.
The tradeoff is that I’m dependent on these services. If Firecrawl changes their API, I update. If Gemini’s TTS gets deprecated, I find an alternative. If Railway’s pricing changes, I reconsider my architecture.
This is software as composition rather than construction. Build faster, ship sooner, depend on others more. For a weekend project, it’s the obvious choice. For something more serious, you might want more control.
Try It
Paste a URL, pick a voice, generate. It’s free. Then go for a walk.