
A Simple STT (Speech To Text)

2 devlogs
1h 26m 6s


I like speaking my mind, but typing is slow, and sometimes it breaks my flow or I just forget what I was typing in the first place. The tech is finally at the level where we can just speak and have it transcribed for us. I know solutions like Whisper Flow exist, but they're still too heavy for my computer. So I thought: can there be just a simple website where I can go and it just transcribes for me? That's what I'm going to make.

This project uses AI

I am probably going to use artificial intelligence, like the Antigravity IDE's AI agent, to write the code. But it is still going to need my oversight, or human oversight, and human review.

Azeem

So, I asked the AI to write the code for an API integration. I gave clear instructions: use Gemini's API, specifically the 3.1 Lite model, because it's fast. I also asked it to create a settings page with an option to input the API key.

The AI generated the code, but it got the Gemini model name wrong. I mean, does Gemini not even know its own sibling's name?

So I had to manually go into the documentation, find the correct model name, and fix the code myself.
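The fix ended up being a one-line model-name change, so it helps to not hard-code it. Here's a minimal sketch of what the request-building side of that integration could look like, assuming the Gemini REST `generateContent` endpoint with inline audio data; the function and parameter names are my own illustration, not the actual code from this project, and the model name is passed in so a wrong guess stays a one-line fix:

```typescript
// Sketch: build a Gemini generateContent request that sends recorded audio
// for transcription. Assumes the v1beta REST endpoint; names are illustrative.
const GEMINI_BASE = "https://generativelanguage.googleapis.com/v1beta/models";

interface TranscriptionRequest {
  url: string;
  body: string;
}

function buildTranscriptionRequest(
  model: string,       // exact model name copied from the docs, not guessed
  apiKey: string,      // user-supplied key from the settings page
  audioBase64: string, // recorded audio, base64-encoded
  mimeType: string     // e.g. "audio/webm"
): TranscriptionRequest {
  const body = {
    contents: [
      {
        parts: [
          { text: "Transcribe this audio verbatim. Output only the transcript." },
          { inlineData: { mimeType, data: audioBase64 } },
        ],
      },
    ],
  };
  return {
    url: `${GEMINI_BASE}/${model}:generateContent?key=${encodeURIComponent(apiKey)}`,
    body: JSON.stringify(body),
  };
}
```

You'd then POST `req.body` to `req.url` with `fetch` and a `Content-Type: application/json` header.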

Then came the next issue. When I tried to get transcription working, it kept failing. The problem is that I'm using Gemini's large language model, not a specialized speech-to-text model like Whisper. So the LLM tries to be "too smart" and ends up messing things up.

For example, when I say "I," it sometimes converts it to "you," as if it's responding instead of transcribing. It feels like it's having a conversation instead of doing its job.

Now, why not just use Whisper? Because of cost, and I don't want to run it locally since my laptop would get fried; it's cooked enough as it is.

So then came the dangerous part: prompt engineering.

I spent about half an hour refining prompts, trying to clearly explain what I wanted. Eventually, I got something decent. I added options so I could speak in any language, but the output would always be in my chosen language: for example, speaking in Hinglish but getting the output in English.
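The general shape of that prompt tightening can be sketched like this; the exact wording below is illustrative and not the prompt I shipped, but it shows the two constraints that matter: forbid the model from replying, and pin the output language while letting the input be anything:

```typescript
// Sketch: build a transcription prompt that keeps an LLM from "responding"
// to the speaker and forces output into one chosen language.
function buildTranscribePrompt(targetLanguage: string): string {
  return [
    "You are a transcription engine, not an assistant.",
    "Transcribe the attached audio word for word.",
    "Never answer, rephrase, or reply to anything the speaker says.",
    `The speaker may use any language or mix (e.g. Hinglish), but write the transcript in ${targetLanguage} only.`,
    "Output nothing except the transcript text.",
  ].join("\n");
}
```

Passing the target language in as a parameter is what makes the "speak in anything, get English out" option a simple dropdown on the page.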

That part worked.

But the main problem remained: the AI kept getting confused or trying to overthink things.

So I had to tighten the prompts a lot to control its behavior.


Comments

Azeem, about 1 month ago

What, I can only upload 2 images? Oh no. Well, the next thing I'm going to do is add a text field so it can also make raw text look good, then fix some more things and ship it, I hope.

Azeem

When I wanted to make an STT for the React application, I thought, let me be a bit lazy. You know, I'll just tell the AI to make it for me and set up the project. But the AI is so dumb that it couldn't even set up React. So, in the end, I had to go and set it up manually.

Then, once it was set up, I told the AI, "Hey, can you write the code for a basic page?" And it wrote it. Of course, I always do this: I get the base-level code from the AI and then I go in and change it, you know? Tweak it, perform a lot of tweaks, and remove or add things. That's how I build it. Once I have the base, it becomes easy to do things. It's like working with clay: you take a clay base and then mold it. Currently, I'm doing exactly that.
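One concrete bit of molding a recording page like this needs is picking an audio format the browser's `MediaRecorder` actually supports, since browsers disagree on codecs. This is a hypothetical helper in that spirit, not code from the project; the support check is injected (in the browser it would be `MediaRecorder.isTypeSupported`) so the selection logic can be tested outside one:

```typescript
// Sketch: pick the first audio MIME type the current browser can record.
// `isSupported` is MediaRecorder.isTypeSupported in a real page; injecting it
// keeps this pure and testable.
const CANDIDATE_TYPES = [
  "audio/webm;codecs=opus", // Chrome, Firefox
  "audio/webm",
  "audio/mp4",              // Safari
  "audio/ogg",
];

function pickMimeType(isSupported: (type: string) => boolean): string | null {
  for (const t of CANDIDATE_TYPES) {
    if (isSupported(t)) return t;
  }
  return null; // null = let the browser pick its default
}
```

The page would call `pickMimeType(MediaRecorder.isTypeSupported)` and pass the result as `{ mimeType }` when constructing the recorder, falling back to the browser default when it returns `null`.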


Comments

da.superman2775, about 1 month ago

wow super cool dude!

Azeem, about 1 month ago

Thanks da.superman