July 1, 2026 · 9 min read
Original research

Voice expense tracking in 7 languages: where Whisper still loses

I recorded the same short list of expenses out loud in seven languages, then pushed each recording through Capi and three rival voice apps to see what the transcription actually caught. On clean speech the models are close to flawless. The interesting part is the edges: an accent, a noisy cafe, a price said in one language inside a sentence in another. That is where voice tracking either holds up or quietly files the wrong number, and it is the part no product page will show you.

Voice is the fastest way to log a purchase, faster than opening an app and typing, and in 2026 almost every voice tracker runs on the same engine underneath: OpenAI's Whisper, usually the Large v3 model. So a fair test is less about which app has the best microphone and more about how each one handles the moments Whisper gets wrong. I spent years inside banks watching people abandon budgeting because entry was too much friction, so I care about the input step more than the dashboard. Here is what seven languages taught me, named honestly, including where the rivals beat us.

How accurate is voice expense tracking in 2026?

On clean speech it is very good. Whisper Large v3, the model behind most voice trackers, sits around a 5 to 6 percent word error rate on English and roughly 10 percent averaged across languages, per the Common Voice benchmark. For a phrase as short as a coffee and a price, that usually means a correct transcription on the first try. Accuracy drops with accents, background noise and mixed-language phrases.

The headline number hides the shape of the errors, though. A 5 percent word error rate does not mean five percent of your expenses are wrong. It means roughly one word in twenty is off, and the risk is entirely about which word. Miss a filler word and nothing happens. Miss the number and you have logged the wrong amount. That asymmetry is the whole game in expense tracking, and it is why I judged each app less on transcription quality and more on whether it shows you the parsed amount before it saves. Speed of capture is table stakes now. Catching the one wrong figure is the real feature.
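To make that asymmetry concrete, here is a minimal Python sketch of how word error rate is usually computed: word-level edit distance divided by the number of reference words. The function name and the sample sentences are illustrative assumptions, not anything taken from Whisper or from the apps tested.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word dropped
                curr[j - 1] + 1,     # extra word inserted
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts of the same nine-word sentence.
spoken = "i spent forty reais on a coffee this morning"
missed_filler = "i spent forty reais on coffee this morning"
missed_number = "i spent fourteen reais on a coffee this morning"

wer_filler = word_error_rate(spoken, missed_filler)
wer_number = word_error_rate(spoken, missed_number)
```

Both transcripts score the same error rate, one word in nine, yet only the second would log the wrong amount. The metric cannot tell those two mistakes apart, which is why the confirmation step carries the weight.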

Which language does Whisper transcribe most accurately?

English, because it has the most training data, followed closely by Spanish, Portuguese, Italian, German and French. Of those, the five I recorded (English, Spanish, Portuguese, French and German) all handled a spoken expense cleanly, with word error rates in a similar single-digit range on clear audio. Russian was a step behind but still reliable. Hindi was the weakest of the seven, with more misheard words, which matches Whisper being strongest on the languages it saw most in training.

The pattern is consistent with what OpenAI published and what independent benchmarks show: performance tracks training volume. The Romance and Germanic languages with large web presences cluster near English, while lower-resource languages fall off, sometimes into the 15 to 30 percent error range for the least-represented ones. Hindi sits in an awkward middle: well supported, but still noticeably worse than Spanish or Portuguese in my recordings, especially on English loanwords that appear constantly in real Hindi speech. For a Brazilian or an Argentine reader this is good news: Portuguese and Spanish are among Whisper's strongest languages, so voice capture is genuinely dependable in the languages this blog serves most.

Where does Whisper still lose on voice expense tracking?

It loses in four specific places: strong or regional accents, background noise, spoken numbers, and code-switching between languages. Accents and noise raise the error rate measurably because the model was trained mostly on clean, standard speech. Numbers slip because forty and fourteen sound alike. Code-switching, like saying a dollar amount inside a Portuguese sentence, is the hardest of all, and it is exactly how bilingual people actually talk about money.
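The forty and fourteen confusion is regular enough in English that a parser could flag it. The sketch below is a hypothetical check, not something any of the tested apps is known to do: it maps each English "teen" amount to its "ty" sound-alike and back, so a confirmation prompt could offer the alternative. The table and function name are assumptions for illustration, and the pairs apply to English only.

```python
from typing import Optional

# English amounts that differ only in the "-teen" / "-ty" ending when spoken.
TEEN_TO_TY = {13: 30, 14: 40, 15: 50, 16: 60, 17: 70, 18: 80, 19: 90}
SOUND_ALIKE = {**TEEN_TO_TY, **{ty: teen for teen, ty in TEEN_TO_TY.items()}}


def sound_alike_amount(amount: int) -> Optional[int]:
    """Return the amount this one is most easily misheard as, if any."""
    return SOUND_ALIKE.get(amount)


alternative_for_forty = sound_alike_amount(40)
alternative_for_twenty_five = sound_alike_amount(25)
```

A confirmation message could then add a line such as "did you mean 14?" whenever the parsed amount has a sound-alike, and say nothing extra when it does not.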

My messiest recording was deliberate: a Portuguese sentence with an English brand name and a price said in reais, spoken with a cafe murmuring in the background. Every app stumbled somewhere on it. One dropped the currency, one heard the brand as a common word, one got the number's tens digit wrong. This is not a knock on Whisper so much as a reminder that the honest failure modes are predictable, so the product has to plan for them. The tools that came out looking best were not the ones claiming the highest accuracy. They were the ones that assumed the transcription might be wrong and made it a two-second fix rather than a silent mistake. I wrote more about how confident software quietly misleads you in why finance apps lie about your spending.

How do the main voice expense trackers compare?

Four tools cover most of the voice-first market in 2026: Capi, Vocash, MonAi and TalkieMoney. All lean on Whisper-class transcription, so raw accuracy is similar. They differ on where they run, how they confirm an entry, what they do with the audio, and price. The table below is the honest shape of it, with the trade each one asks you to make.

App | Where it runs | Confirms before saving | Audio kept | Price
Capi | Telegram chat | Yes, reads it back | No, discarded | Free, then $69.90/yr
Vocash | iOS, Android, web | Quick edit | Vendor discloses | Free, Pro $36.99/yr
MonAi | iOS, Android | In-app review | Stored in your iCloud | Free tier, then paid
TalkieMoney | iOS, Android | In-app review | Vendor discloses | Free to 50 tx, then sub

Read that by what you value, not by the star of the row. If you want a polished standalone app and live on Apple, MonAi is a lovely piece of design, and its trick of splitting several expenses out of one spoken sentence is genuinely useful. If you want the widest free tier, Vocash gives away its core voice capture and only charges for exports and long history. TalkieMoney is a capable AI budget agent with the same free-then-paid shape. Capi's difference is not accuracy, it is that the whole thing happens in a chat you already have open, so there is no new app to learn. If a dedicated iOS app fits you better, one of the others is the right answer, and I would rather say so. The Capi vs Copilot Money comparison weighs the chat-versus-app trade in more detail against a Siri-driven app from the Apple ecosystem.

What happened when I recorded expenses in seven languages through Capi?

I recorded the same five expenses in English, Spanish, Portuguese, French, German, Russian and Hindi, spoken at a normal pace. Six of the seven languages parsed every expense correctly on the first pass. Hindi missed one amount and one category, both fixed in a single tap on the confirmation screen. The clean six needed no correction at all. The audio was deleted after each transcription, leaving only the confirmed text.

The detail I care about is what happened on the miss rather than the hits. When the Hindi recording heard the wrong number, Capi did not save it silently. It showed the parsed line, amount and category, and waited. I corrected the figure by typing the right number back, which the parser accepted as an edit to the pending entry rather than a new expense. That is the Patch R behaviour, voice and text flowing into one pipeline, so a spoken entry and its typed correction are the same conversation. Under the hood the voice note and a typed message hit the identical parser, which is why the language of the correction does not matter. If you want the ritual and the accessibility case for logging this way, I covered it in hands-free expense tracking and tested the raw pipeline in the voice-note tracking test.

How does Capi handle a voice note it transcribes wrong?

It reads the parsed expense back and waits for you to confirm before saving. Every voice note produces a pending line with the amount, currency and category it inferred, and nothing lands in your history until you accept it. If the number or category is wrong, you correct it in the same chat, by voice or text, and the fix replaces the pending entry. An imperfect transcription stops mattering once the fix is one tap.
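The confirm-before-save flow described above can be sketched as a small state holder: one pending entry, a correction that edits it in place, and a history that only grows on confirmation. This is a minimal illustration under assumed names (Expense, Chat, propose, correct, confirm), not Capi's actual code.

```python
from dataclasses import dataclass, field, replace
from typing import List, Optional


@dataclass(frozen=True)
class Expense:
    amount: float
    currency: str
    category: str


@dataclass
class Chat:
    pending: Optional[Expense] = None
    history: List[Expense] = field(default_factory=list)

    def propose(self, parsed: Expense) -> None:
        """A parsed voice note or typed message becomes the pending entry."""
        self.pending = parsed

    def correct(self, **changes) -> None:
        """A correction edits the pending entry instead of adding a new one."""
        if self.pending is None:
            raise ValueError("nothing pending to correct")
        self.pending = replace(self.pending, **changes)

    def confirm(self) -> Expense:
        """Only a confirmed entry reaches the history."""
        if self.pending is None:
            raise ValueError("nothing pending to confirm")
        saved, self.pending = self.pending, None
        self.history.append(saved)
        return saved


chat = Chat()
chat.propose(Expense(14.0, "BRL", "coffee"))  # misheard: forty became fourteen
chat.correct(amount=40.0)                     # fix arrives by voice or text
chat.confirm()
```

The misheard 14.0 never reaches the history. Only the corrected entry is saved, and it is saved once.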

This is the design choice that separates a reliable voice tracker from a fast one. A tracker that gets 95 percent of entries right still hands you a wrong one roughly every twentieth entry, and a wrong amount that saves silently is worse than no entry, because you will trust a budget that is quietly off. Reading the parse back turns that failure into a visible, correctable moment. It is also why I do not oversell Capi's accuracy: the honest claim is not that Whisper never errs, it is that an error costs you two seconds instead of a corrupted month. Capi's free tier covers 30 transactions a month, enough to test voice entry across a few weeks before deciding.
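A quick back-of-the-envelope calculation shows why a silent save is risky over a whole month. The sketch assumes each entry is parsed correctly with the same probability and that errors are independent, which is a simplification. The 95 percent figure and the 30 entries are the illustrative numbers from the paragraph above, not measured results.

```python
def chance_of_any_wrong_entry(per_entry_accuracy: float, entries: int) -> float:
    """Probability that at least one of `entries` is wrong, assuming independence."""
    if not 0.0 <= per_entry_accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return 1.0 - per_entry_accuracy ** entries


# 30 entries a month at an assumed 95 percent per-entry accuracy.
p_month = chance_of_any_wrong_entry(0.95, 30)
```

Under those assumptions the chance of at least one wrong amount in a month comes out close to four in five, so without a read-back step most months would contain a silent error somewhere.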

Is voice expense tracking private?

It can be, if the audio is discarded after transcription. Most cloud voice trackers, Capi included, send your clip to a Whisper-class service and receive text back; the responsible ones then delete the audio immediately. Capi keeps only the confirmed text and drops the voice file, so there is no growing archive. If on-device transcription matters more, some standalone apps run the model locally, trading a little accuracy to keep audio off any server.

Privacy here is a spectrum, not a yes or no. A live-bank-sync app knows every transaction automatically but holds your banking login. A voice tracker only knows what you say out loud, which is less total data, but it does route audio through a transcription step you should understand. The question worth asking any voice app is simple: is the audio stored, and for how long. Capi's answer is that it is transcribed and deleted, with only text retained. Whisper runs fast enough on inference platforms like Groq, well over a hundred times real time at a fraction of a cent per minute, that keeping the clip afterward serves no purpose worth the privacy cost.
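The transcribe-then-delete pattern is simple to express in code. The sketch below is an assumption about how such a step could look, not Capi's implementation: the clip is written to a temporary file, handed to a transcriber, and removed in a finally block so it is deleted even if transcription fails. The transcriber here is a stand-in function, and the .ogg suffix is only a guess at the clip format.

```python
import os
import tempfile
from typing import Callable, List


def transcribe_and_discard(audio: bytes, transcribe: Callable[[str], str]) -> str:
    """Write the clip to a temp file, transcribe it, and always delete the file."""
    fd, path = tempfile.mkstemp(suffix=".ogg")
    try:
        with os.fdopen(fd, "wb") as clip:
            clip.write(audio)
        return transcribe(path)
    finally:
        # Runs on success and on failure, so no audio is left behind.
        os.remove(path)


seen_paths: List[str] = []


def fake_transcriber(path: str) -> str:
    # Stand-in for a real speech-to-text call; records the path it was given.
    seen_paths.append(path)
    return "twelve reais on coffee"


text = transcribe_and_discard(b"\x00\x01", fake_transcriber)
```

Only the returned text survives the call. A real service would also need the transcription provider itself not to retain the upload, which is a policy question this code cannot answer.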

How do you start tracking expenses by voice in Capi?

You send a voice note to the Capi bot in Telegram, the same way you would send one to a friend. It transcribes what you said, parses the amount and category, and shows you the result to confirm. There is no separate app to install and no settings to configure first. The steps below take under a minute end to end, in any language Whisper supports.

  1. Open the Capi bot in Telegram and start a chat.
  2. Hold the microphone button and say the expense, like "twelve reais on coffee" or the same phrase in your language.
  3. Wait a second while Capi transcribes and parses it.
  4. Check the amount and category it reads back, and correct anything by voice or text.
  5. Confirm, and it lands in your monthly view with a pace bar and a 50/30/20 read.
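Steps 3 and 4 above, turning a transcript into an amount, a currency and a category to read back, can be sketched with a toy parser. Everything here is a hypothetical illustration: the word tables, the function name and the English-only tokenizing are assumptions, and a production parser would need far more than this.

```python
import re
from typing import Optional

NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12,
}
CURRENCY_WORDS = {
    "real": "BRL", "reais": "BRL",
    "dollar": "USD", "dollars": "USD",
    "euro": "EUR", "euros": "EUR",
}
FILLER_WORDS = {"on", "for", "a", "an", "the", "of"}


def parse_expense(transcript: str) -> Optional[dict]:
    """Pull the first amount, the first currency word, and the leftover words."""
    tokens = re.findall(r"\d+(?:\.\d+)?|[a-z]+", transcript.lower())
    amount = None
    currency = None
    category_words = []
    for token in tokens:
        if amount is None and (token in NUMBER_WORDS or token[0].isdigit()):
            amount = float(NUMBER_WORDS.get(token, token))
        elif currency is None and token in CURRENCY_WORDS:
            currency = CURRENCY_WORDS[token]
        elif token not in FILLER_WORDS:
            category_words.append(token)
    if amount is None:
        return None  # nothing to confirm without an amount
    return {
        "amount": amount,
        "currency": currency,
        "category": " ".join(category_words),
    }


parsed = parse_expense("twelve reais on coffee")
```

The returned dictionary is what a confirmation message would read back. A transcript with no recognizable amount returns None rather than a guess.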

The test in one breath. On clean speech, voice tracking in 2026 is close to flawless in English, Spanish, Portuguese, French, German and Russian, and weaker in Hindi. Whisper still loses on accents, noise, spoken numbers and mixed-language phrases. Six of my seven languages parsed perfectly through Capi, Hindi missed one amount and one category, and all of it was a one-tap fix because Capi reads the parse back before saving. The right voice app is the one that assumes it might mishear and makes the correction trivial.


Log an expense by voice in your language.

Capi transcribes your voice note, reads the amount back, and saves only what you confirm, all inside Telegram.
Free to start, Core is $9.90 a month or $69.90 a year.

Try Capi free on Telegram →

Frequently asked questions about voice expense tracking

How accurate is voice expense tracking in 2026?

On clean speech it is very good. Whisper Large v3, the model behind most voice trackers, sits around a 5 to 6 percent word error rate on English and roughly 10 percent averaged across languages. For a short phrase like a coffee and a price that usually means a correct transcription. Accuracy falls with accents, background noise and mixed-language phrases, which is where a confirmation step matters more than the raw model.

Which language does Whisper transcribe most accurately?

English, because it has the most training data, followed closely by Spanish, Portuguese, Italian, German and French. Of those, the five I recorded (English, Spanish, Portuguese, French and German) all handled a spoken expense cleanly. Russian was slightly behind but still reliable. Hindi was the weakest of the seven, with more misheard words, which tracks with Whisper being strongest on the languages it saw most during training.

Why does my voice expense get the wrong number or currency?

Because numbers and currency words are where speech models slip most. A spoken forty can land as fourteen, and a currency said in one language inside a sentence in another, like reais inside an English phrase, can be dropped or guessed. The fix is not a better microphone but a confirmation screen. A tool that reads the parsed amount back to you before saving lets you catch the one figure that matters.

Is voice expense tracking private if the audio is sent to a server?

It depends on whether the audio is kept. Most cloud voice trackers send your clip to a transcription service and get text back; the responsible ones then discard the audio immediately. Capi transcribes the voice note and then deletes the file, keeping only the text you confirmed. If privacy is your first concern, look for a clear statement that audio is not stored, or choose an app that transcribes on device.

Do I need a separate app to track expenses by voice?

No. Dedicated apps like Vocash, MonAi and TalkieMoney do it well, but you can also send a voice note inside a messaging app you already have. Capi works entirely in Telegram, so tracking an expense by voice is the same gesture as sending any voice message to a contact. The right choice depends on whether you want another icon on your phone or one less.

Written by Daniil Kozin, founder of Capi. More in this series: The best money tracker in 2026 · Hands-free expense tracking · The voice-note tracking test · Why finance apps lie about spending · Capi vs Copilot Money.