Bolo: A Voice-to-Text App for macOS
Hold a key, speak naturally, release — polished text appears wherever your cursor is.
I type a lot. Emails, Slack messages, docs, code comments, PRs. And I kept noticing the same thing: my thoughts move faster than my fingers. By the time I finish typing a sentence, I’ve already lost the next three.
So I built Bolo (बोलो — Hindi for “speak”) — a native macOS menu bar app that turns your voice into clean, ready-to-use text in any application.
No window to switch to. No app to open. Just hold a key, talk, and let go.
What it looks like in practice
You’re writing an email. You hold Ctrl+Shift. You say:
“Hey Sarah, thanks for sending over the Q3 numbers. I had a quick look and everything checks out. Let’s, um, actually, can we set up a call on Thursday to walk through the projections? I want to make sure we’re aligned before the board meeting.”
You release the key. What appears in your email:
Hey Sarah, thanks for sending over the Q3 numbers. I had a quick look and everything checks out. Can we set up a call on Thursday to walk through the projections? I want to make sure we’re aligned before the board meeting.
Notice what happened — the “um” disappeared, the “let’s, actually” course-correction was handled, and it even capitalized and punctuated everything correctly.
The whole thing took about 2 seconds.
Three modes
Push-to-Talk — Hold Ctrl+Shift, speak a sentence or two, release. Text appears instantly. This is what I use 90% of the time.
Long-Talk — Press Ctrl+Shift+Space to toggle extended recording. Brain dump into an email, a doc, or meeting notes without touching the keyboard.
Command Mode — This one’s my favorite. Select some text, hold the hotkey, and speak a command: “make this more formal”, “fix the grammar”, “translate to Spanish”, “summarize this in two sentences.” Bolo rewrites the selected text in place.
How it actually works
The architecture is surprisingly straightforward. Three steps:
Capture — When you hold the hotkey, AVAudioEngine starts recording through your Mac’s microphone. Raw PCM audio accumulates in a buffer. When you release, it wraps everything in a WAV header.
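The capture step can be sketched like this — a minimal (and not thread-safe) version of the idea, not Bolo's actual code; the `Recorder` class and its names are illustrative:

```swift
import AVFoundation

// Tap the input node and accumulate raw PCM samples while the hotkey is held.
final class Recorder {
    private let engine = AVAudioEngine()
    private var pcmData = Data()

    func start() throws {
        let input = engine.inputNode
        let format = input.outputFormat(forBus: 0)  // typically Float32 PCM
        input.installTap(onBus: 0, bufferSize: 4096, format: format) { buffer, _ in
            // Append channel 0's samples as raw bytes.
            guard let channel = buffer.floatChannelData?[0] else { return }
            let byteCount = Int(buffer.frameLength) * MemoryLayout<Float>.size
            self.pcmData.append(Data(bytes: channel, count: byteCount))
        }
        try engine.start()
    }

    func stop() -> Data {
        engine.stop()
        engine.inputNode.removeTap(onBus: 0)
        return pcmData  // caller prepends a WAV header before upload
    }
}
```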
Process — The audio gets base64-encoded and sent to Google Gemini’s multimodal API as an inline data part. But the prompt isn’t just “transcribe this” — it instructs Gemini to clean up filler words, add punctuation, handle corrections, and match the tone of what you’re typing into. If there’s surrounding text, it gets sent as context so Gemini knows whether you’re in a formal email or a casual Slack thread.
Insert — The transcribed text gets inserted at your cursor using macOS Accessibility APIs. The app detects which app you’re in, finds the focused text field, and types the text directly. It even re-activates the correct app in case focus shifted during the 2-3 seconds of API processing.
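One common way to do that insertion — setting the focused element's selected-text attribute, which replaces the (empty) selection at the caret — looks roughly like this. A sketch of the technique, not necessarily Bolo's exact implementation:

```swift
import ApplicationServices

// Insert text at the cursor of whatever text field currently has focus.
// Requires the Accessibility permission.
func insertAtCursor(_ text: String) {
    let systemWide = AXUIElementCreateSystemWide()
    var focused: CFTypeRef?
    guard AXUIElementCopyAttributeValue(
        systemWide, kAXFocusedUIElementAttribute as CFString, &focused
    ) == .success, let element = focused else { return }

    // Setting kAXSelectedTextAttribute on an empty selection types at the caret.
    AXUIElementSetAttributeValue(
        element as! AXUIElement, kAXSelectedTextAttribute as CFString,
        text as CFTypeRef
    )
}
```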
The trickiest part was the hotkey detection. The app installs a CGEvent tap — a low-level macOS mechanism that intercepts keyboard events before they reach any app. This is the only way to get system-wide key detection that works in every application.
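A minimal event tap watching for the Ctrl+Shift chord might look like this (a listen-only sketch; a `.defaultTap` would additionally let you swallow events before they reach other apps):

```swift
import CoreGraphics

// Watch modifier-key changes system-wide to detect hold and release.
let mask = CGEventMask(1 << CGEventType.flagsChanged.rawValue)
let tap = CGEvent.tapCreate(
    tap: .cgSessionEventTap,
    place: .headInsertEventTap,
    options: .listenOnly,
    eventsOfInterest: mask,
    callback: { _, _, event, _ in
        let flags = event.flags
        let held = flags.contains(.maskControl) && flags.contains(.maskShift)
        // Start recording when `held` flips to true; stop and transcribe on release.
        print(held ? "hotkey down" : "hotkey up")
        return Unmanaged.passUnretained(event)
    },
    userInfo: nil
)
if let tap {
    let source = CFMachPortCreateRunLoopSource(kCFAllocatorDefault, tap, 0)
    CFRunLoopAddSource(CFRunLoopGetCurrent(), source, .commonModes)
}
```

Note that creating an event tap requires the same Accessibility (or Input Monitoring) permission the insertion step needs.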
The enterprise angle
I built Bolo for myself first, but I wanted to use it at work too. The problem: my company needed per-user identity attribution for token usage tracking. You can’t just share a single API key across 50 people and hope for the best.
So I added a second auth path: Vertex AI with OAuth2. Users sign in with their Google Workspace account (same SSO they use for Gmail and Docs), and every API call carries their identity. The GCP admin can track exactly who used how many tokens, set per-user quotas, and manage access through standard IAM roles.
The OAuth2 implementation uses Authorization Code flow with PKCE — the app spins up a temporary HTTP server on localhost, opens the browser for Google sign-in, receives the callback, exchanges the code for tokens, and stores everything in the Keychain. No client secret needed.
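The PKCE part is small enough to show in full — generate a random code verifier, hash it with SHA-256, and base64url-encode both (per RFC 7636). A sketch using CryptoKit; the helper names are mine:

```swift
import CryptoKit
import Foundation
import Security

// Generate a PKCE verifier/challenge pair for the Authorization Code flow.
func makePKCEPair() -> (verifier: String, challenge: String) {
    var bytes = [UInt8](repeating: 0, count: 32)
    _ = SecRandomCopyBytes(kSecRandomDefault, bytes.count, &bytes)
    let verifier = Data(bytes).base64URLEncoded()
    let digest = SHA256.hash(data: Data(verifier.utf8))
    let challenge = Data(digest).base64URLEncoded()
    return (verifier, challenge)
}

extension Data {
    // base64url: replace +/ with -_ and strip padding, per RFC 7636 §4.
    func base64URLEncoded() -> String {
        base64EncodedString()
            .replacingOccurrences(of: "+", with: "-")
            .replacingOccurrences(of: "/", with: "_")
            .replacingOccurrences(of: "=", with: "")
    }
}
```

The challenge goes in the initial authorization request; the verifier is sent only in the token exchange, which is what lets the app skip having a client secret.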
Both auth methods (personal API key and enterprise OAuth2) are in the app. You pick one in Settings. Only one is active at a time.
The small details that matter
Personal dictionary — You can teach Bolo your colleagues’ names, product names, and jargon. “Kubernetes” comes out right every time instead of “Cooper Netties.”
Context window — Before transcribing, Bolo reads the text around your cursor. This helps Gemini understand whether you’re adding to a bulleted list, continuing a paragraph, or replying to someone.
Course correction — Say “no wait” or “actually” mid-sentence and Bolo understands you’re correcting yourself, not adding those words to the text.
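All three of these features can live in the prompt itself. A hedged sketch of how such a prompt might be assembled — the function and its wording are illustrative, not Bolo's actual prompt:

```swift
// Fold the personal dictionary and surrounding text into the
// transcription prompt sent alongside the audio.
func buildPrompt(dictionary: [String], context: String) -> String {
    var prompt = "Transcribe the audio. Remove filler words and punctuate. "
    prompt += "If the speaker says 'no wait' or 'actually' to correct "
    prompt += "themselves, keep only the corrected version. "
    if !dictionary.isEmpty {
        prompt += "Prefer these spellings when they match what was said: "
        prompt += dictionary.joined(separator: ", ") + ". "
    }
    if !context.isEmpty {
        prompt += "Match the tone and format of this surrounding text: \(context)"
    }
    return prompt
}
```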
Word counter — A small thing, but satisfying. The menu bar shows how many words you’ve dictated today. The history view shows all-time stats.
The tech stack
Swift 6 with strict concurrency (Sendable, @MainActor, async/await — no data races at compile time)
SwiftUI for settings, dictionary, and history views
AVAudioEngine for audio capture
macOS Accessibility API for text insertion and context reading
CGEvent tap for system-wide hotkey detection
Network.framework for the OAuth2 loopback server
CryptoKit for PKCE (SHA-256 code challenge)
SQLite for transcription history
Keychain for secure credential storage
About 2,500 lines of Swift total. No external dependencies.
Try it
Bolo is open source under the MIT license.
Download: Bolo-1.0.dmg (2.1 MB, signed and notarized by Apple)
Source code: github.com/sskokku/Bolo
Requirements:
macOS 15.0 (Sequoia) or later
A free Gemini API key from aistudio.google.com (or Vertex AI if you’re in an enterprise setup)
On first launch, Bolo walks you through microphone permission, accessibility permission, and API key setup. Takes about 60 seconds.
If you build something with it or have feedback, open an issue on GitHub or reply here. I’d love to hear how other people use it.