Building Source, a Local-First Transcriber for Raw Transcript Ownership
I did not build a new transcription model. I built a practical local-first wrapper around faster-whisper because the existing meeting tools optimize for summaries, and I wanted the raw transcript source file. That distinction matters. A summary is useful, but it is already an interpretation. For interview prep, I wanted the source material before the interpretation.
Why I built it
In every technical interview I prepare for, the interviewer goes in with notes. The candidate goes in with memory. After the call, the interviewer documents what happened. The candidate usually does not, because there is no clean way to capture both sides of the conversation and do something structured with it afterward.
Tools like Otter.ai, Zoom AI, and Gemini Notes exist, but they are optimized for summaries and action items. The output is cloud-processed, shaped by someone else's judgment about what matters, and often gated behind a subscription. That is useful for some workflows. It did not fit what I was trying to do.
I wanted a raw timestamped transcript I could keep locally and use with any tool I reached for afterward.
Why raw transcripts matter
A summary tells you what the tool decided was important. A raw transcript lets you decide. After a technical interview, I want to go back through how I explained a system design, spot where I was vague, pull out follow-up questions, and hand the full context to a model to help me prepare for the next round. That is not possible with a summary. You need the source.
There is also a practical privacy reason. Interview notes can be sensitive, especially when they involve compensation discussions or things said off the cuff. Keeping transcripts local means they do not flow through a third-party pipeline unless you deliberately send them somewhere.
What the app does
Source is a local macOS desktop app. It captures audio and runs Whisper on-device. Nothing goes through a cloud service, and there is no account or subscription to set up.
It supports three input modes:
- Mic: any microphone input, useful for voice notes and single-speaker practice
- System Audio: direct capture through BlackHole, a free virtual audio driver, which captures the other side of a call or any playing media
- Mic + System: both streams simultaneously, mixed for transcription and saved as separate WAV files per source
Each session outputs a Markdown file with YAML frontmatter (model name, input mode, chunk size, segment count, device metadata), a plain text companion, WAV sidecars for each source in hybrid mode, and a session JSON for programmatic access. Model options are small (default), medium, and large-v3. All models run locally using faster-whisper and CTranslate2.
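As a rough sketch of how a session might land on disk, using the fields listed above (the field names and values here are illustrative, not the repo's actual schema):

```python
# Illustrative session serialization: YAML frontmatter in the Markdown file,
# a plain text companion, and a JSON sidecar. Field names are assumptions.
import json
from pathlib import Path

transcript_text = "[0.0s -> 8.0s] Walk me through the system design...\n"
session = {
    "model": "small",
    "input_mode": "mic+system",
    "chunk_seconds": 8,
    "segment_count": 42,
    "device": {"mic": "MacBook Pro Microphone", "system": "BlackHole 2ch"},
}

out = Path("session-example")
out.mkdir(exist_ok=True)
# json.dumps of each value is valid YAML, since YAML is a superset of JSON.
frontmatter = "\n".join(f"{k}: {json.dumps(v)}" for k, v in session.items())
(out / "transcript.md").write_text(f"---\n{frontmatter}\n---\n\n{transcript_text}")
(out / "transcript.txt").write_text(transcript_text)
(out / "session.json").write_text(json.dumps(session, indent=2))
```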
What was harder than expected
Whisper itself is one function call. Everything else took longer.
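For context, here is roughly what that one call looks like with faster-whisper (the model size and chunk filename are placeholders):

```python
from faster_whisper import WhisperModel

# CTranslate2 backend; int8 keeps the small model responsive on CPU.
model = WhisperModel("small", device="cpu", compute_type="int8")

segments, info = model.transcribe("chunk.wav")  # placeholder audio file
for seg in segments:
    print(f"[{seg.start:.1f}s -> {seg.end:.1f}s] {seg.text.strip()}")
```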
Packaging. Turning a Python script into a real macOS .app that handles microphone permissions, writes data to Application Support, and launches reliably from the Applications folder required several iterations. Info.plist, entitlements, and PyInstaller spec files all had to be correct together.
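A minimal sketch of the spec-file side, assuming an entry point named app.py; the bundle identifier and usage string are illustrative, and the project's real spec is certainly longer:

```python
# source.spec (sketch) — PyInstaller injects Analysis, PYZ, EXE, BUNDLE
# into the spec namespace, so no imports are needed here.
a = Analysis(["app.py"])
pyz = PYZ(a.pure)
exe = EXE(pyz, a.scripts, a.binaries, a.datas, name="Source", console=False)
app = BUNDLE(
    exe,
    name="Source.app",
    bundle_identifier="com.example.source",  # illustrative identifier
    info_plist={
        # Without this key, macOS terminates the app on mic access
        # instead of showing the permission prompt.
        "NSMicrophoneUsageDescription": "Source records audio for local, on-device transcription.",
    },
)
```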
Model tradeoffs. Small is the right default for most machines. Medium produces meaningfully better output for complex audio. Large-v3 is the most accurate but can take 35 to 60 seconds to process an 8-second audio chunk on CPU. The UI surfaces this without being noisy about it.
System audio routing. macOS does not expose a native loopback capture API through Python. The current solution uses BlackHole, a free open-source virtual audio driver. It works reliably but requires a one-time user setup step. ScreenCaptureKit is the right long-term path and is documented in the repo for the next iteration.
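Once BlackHole is installed, the capture side reduces to finding the virtual device by name and opening it like any other input. A sketch assuming the sounddevice library (the repo may use a different audio backend):

```python
import sounddevice as sd

def find_blackhole() -> int:
    """Return the index of the BlackHole input device, if installed."""
    for idx, dev in enumerate(sd.query_devices()):
        if "BlackHole" in dev["name"] and dev["max_input_channels"] > 0:
            return idx
    raise RuntimeError("BlackHole not found; install it to enable System Audio mode")

# From here it is ordinary input capture; the loopback routing is BlackHole's job.
stream = sd.InputStream(device=find_blackhole(), channels=2, samplerate=48000)
```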
Hybrid capture. Running two audio streams simultaneously, mixing them in real time, writing per-source WAV files, and feeding the combined output through the same chunk-and-transcribe pipeline required rewriting the recorder module. The threading model, queue structure, and cleanup sequencing all had to hold together under stop and start cycles.
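The heart of it is the mix step. A minimal sketch, assuming float32 NumPy blocks arriving on two queues; the thread lifecycle and cleanup sequencing, which were the hard part, are not shown:

```python
import queue
import numpy as np

mic_q: "queue.Queue[np.ndarray]" = queue.Queue()  # fed by the mic capture thread
sys_q: "queue.Queue[np.ndarray]" = queue.Queue()  # fed by the system capture thread

def next_mixed_block(timeout: float = 1.0) -> np.ndarray:
    """Pull one block from each source and mix them for transcription."""
    mic = mic_q.get(timeout=timeout)      # also written to the mic WAV sidecar
    system = sys_q.get(timeout=timeout)   # also written to the system WAV sidecar
    n = min(len(mic), len(system))        # the two streams can drift by a few frames
    mixed = 0.5 * mic[:n] + 0.5 * system[:n]  # fixed 50/50 ratio (see limitations)
    return np.clip(mixed, -1.0, 1.0)      # guard against clipping after summing
```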
Input quality versus model size. This was the most useful practical finding: switching from small to large-v3 helped, but cleaning up the audio source helped more. The same model produced substantially better output from a direct microphone signal than from a YouTube video playing through laptop speakers. The model is one lever. The capture path is another, and it mattered more in testing.
Current limitations
- System Audio mode requires BlackHole, a one-time external install. ScreenCaptureKit would eliminate this requirement.
- Large-v3 is slow on CPU. The quality improvement is real on capable hardware, but the model is not suited to real-time use on most machines.
- The Mic + System mixing ratio is fixed at 50/50. If one source is substantially louder, the weaker source may wash out in the mixed transcript.
- No speaker diarization. The transcript does not attribute lines to speakers.
What this is and what it is not
This is not a replacement for Otter or ChatGPT Record. It is a smaller tool with a different center of gravity: local capture, a raw transcript I own, and a clean handoff into whatever AI workflow I choose next.
The work was not the Whisper call. Whisper has been public since 2022. The work was making a local capture path I would actually trust with a real interview, and keeping the output usable after the fact. That meant honest calls on model size, audio routing, macOS packaging, and where to stop adding features.