
Google Text-to-Speech with Node.js

Hello!

I’d like to share a bit about building a Japanese language learning app, which has been a fascinating experience. One key feature I’ve been working on is audio for the words being taught. After careful consideration, I decided to integrate the Google Cloud Text-to-Speech API, which brings a new level of quality to the audio component.

Initially, I experimented with the browser’s built-in speech synthesis (the Web Speech API), which had its merits but fell short in naturalness and in consistency across phones and different browsers.
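
For context, that browser approach looks roughly like this (the phrase is just an example):

    // Browser-only: ask the Web Speech API to read a phrase aloud.
    const utterance = new SpeechSynthesisUtterance("こんにちは"); // example phrase
    utterance.lang = "ja-JP"; // request a Japanese voice if the device has one
    window.speechSynthesis.speak(utterance);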

Transitioning to the Google API marked a turning point. My first approach was to call the API on demand, convert the output to base64, and transmit it to the frontend for playback. While this worked well at first, I recognized it wasn’t the most sustainable solution.
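
For reference, that on-demand approach looked something like this; the handler name and request shape here are illustrative rather than my exact code:

    // Illustrative on-demand handler: synthesize, base64-encode, send to the frontend.
    export const getSpeechBase64 = async (req, res) => {
      const client = new textToSpeech.TextToSpeechClient();
      const [response] = await client.synthesizeSpeech({
        input: { text: req.query.text },
        voice: { languageCode: "ja-JP" },
        audioConfig: { audioEncoding: "MP3" },
      });
      // audioContent is binary, so base64 makes it easy to ship inside a JSON response
      res.json({ audio: Buffer.from(response.audioContent).toString("base64") });
    };

On the frontend, the base64 string can then be turned into a data URL ("data:audio/mp3;base64," plus the string) and handed to an Audio element to play.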

The pivotal moment arrived when I realized I had access to a categorized file of Japanese-English translations, totaling around 750 words. Additionally, I possessed a CSV file containing the top 6,000 most frequently used Japanese words. Opting for efficiency, I made 750 API calls to generate mp3 files for the categorized words. The surprising part? These audio files took up a mere 4.2MB collectively, yet they boasted remarkable clarity and naturalness.

Here’s a brief rundown of the technical steps:

Step 1: Obtain your Google application credentials from Google Cloud Platform using their provided guide.
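
In my case that meant downloading a service-account JSON key and pointing the client library at it through an environment variable (the path below is just a placeholder):

    export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"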

Step 2: Incorporate the magic by importing the library into your code:

    import textToSpeech from "@google-cloud/text-to-speech";

Step 3: The crux of the operation, a concise snippet from the function:

    import fs from "fs";
    import util from "util";

    const writeFile = util.promisify(fs.writeFile);

    export const getSpeech = async (req, res) => {
      const client = new textToSpeech.TextToSpeechClient();
      for (const [key, words] of Object.entries(BeginnerWords)) {
        console.log(key);
        // for...of (rather than forEach) so each request is awaited before the next starts
        for (const word of words) {
          const request = {
            // Some entries list alternatives separated by "、"; synthesize the first one
            input: { text: word.japanese.split("、")[0] },
            voice: {
              languageCode: "ja-JP",
              name: "ja-JP-Neural2-B",
            },
            audioConfig: { audioEncoding: "MP3" },
          };
          const [response] = await client.synthesizeSpeech(request);
          await writeFile(`../public/mp3/${key}/${word.japanese}.mp3`, response.audioContent, "binary");
        }
      }
    };

As anticipated, the process unfolded seamlessly, resulting in a treasure trove of audio files. A notable detail: after some exploration, I settled on the voice “ja-JP-Neural2-B” for its natural sound.

While I haven’t tackled the advanced section with the 6,000 additional words yet, doing so will mean a substantial number of requests, and the API’s limit of 1,000 requests per minute is worth keeping in mind.
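
One simple way to stay under that quota would be to process the list in chunks and pause between them; here synthesizeWord is a hypothetical wrapper around the request logic above, and the batch size is arbitrary:

    const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

    // Process the word list in chunks to stay under the per-minute quota.
    async function synthesizeInBatches(words, batchSize = 500) {
      for (let i = 0; i < words.length; i += batchSize) {
        const batch = words.slice(i, i + batchSize);
        await Promise.all(batch.map((word) => synthesizeWord(word)));
        await sleep(60_000); // wait a minute before starting the next chunk
      }
    }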

The excitement doesn’t end there! Combining the Text-to-Speech API with the ChatGPT API to generate example sentences for learners has been a truly engaging endeavor. Chaining the ChatGPT query, the Google API call, and the audio playback creates a seamless and efficient learning experience.
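
As a rough sketch of how those pieces fit together on the backend, assuming the official openai package, with the model, prompt, and function name purely illustrative:

    import OpenAI from "openai";

    const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

    // Illustrative pipeline: ChatGPT writes an example sentence, then Google TTS voices it.
    async function exampleSentenceAudio(word) {
      const completion = await openai.chat.completions.create({
        model: "gpt-3.5-turbo",
        messages: [
          { role: "user", content: `Write one short Japanese example sentence using ${word}.` },
        ],
      });
      const sentence = completion.choices[0].message.content;

      const client = new textToSpeech.TextToSpeechClient();
      const [response] = await client.synthesizeSpeech({
        input: { text: sentence },
        voice: { languageCode: "ja-JP", name: "ja-JP-Neural2-B" },
        audioConfig: { audioEncoding: "MP3" },
      });
      return { sentence, audio: response.audioContent };
    }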

In conclusion, this journey has been both enlightening and fulfilling. Keep an eye out for more developments on the horizon!
