Google Researchers Boost Speech Recognition Accuracy With More Datasets

What if the key to improving speech recognition accuracy is simply mixing all available speech datasets together to train one large AI model? That's the hypothesis behind a recent study published by a team of researchers affiliated with Google Research and Google Brain. They claim that SpeechStew, an AI model trained on a range of speech corpora, achieves state-of-the-art or near-state-of-the-art results on a variety of speech recognition benchmarks. VentureBeat reports:

To test that idea, the Google researchers combined all available labeled and unlabeled speech recognition data curated by the community over the years. They drew on AMI, a dataset containing about 100 hours of meeting recordings, as well as corpora that include Switchboard (approximately 2,000 hours of telephone calls), Broadcast News (50 hours of television news), LibriSpeech (960 hours of audiobooks), and Mozilla's crowdsourced Common Voice. Their combined dataset had over 5,000 hours of speech, none of which was adjusted from its original form.

With the assembled dataset, the researchers used Google Cloud TPUs to train SpeechStew, yielding a model with more than 100 million parameters. In machine learning, parameters are the internal values of a model that are learned from the data during the training process. The researchers also trained a 1-billion-parameter model, but it suffered from degraded performance.

Once the team had a general-purpose SpeechStew model, they tested it on a number of benchmarks and found that it not only outperformed previously developed models but also demonstrated an ability to adapt to challenging new tasks. Using CHiME-6, a 40-hour dataset of distant conversations in homes recorded by microphones, the researchers fine-tuned SpeechStew to achieve accuracy in line with a much more sophisticated model. Transfer learning entails transferring knowledge from one domain to a different domain with less data, and it has shown promise in many subfields of AI.
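As a rough illustration of the transfer-learning idea described above, here is a toy sketch. It is not SpeechStew's training code: a two-parameter linear model stands in for the speech model, and the "source" and "target" datasets, the step counts, and the learning rate are all made-up assumptions chosen only to show the pattern of pretraining on a large dataset and then fine-tuning briefly on a small one.

```python
# Toy sketch of pretrain-then-fine-tune. All data and settings are hypothetical.

def mse(w, b, xs, ys):
    """Mean squared error of the line y = w*x + b on a dataset."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train(w, b, xs, ys, steps, lr=0.1):
    """Plain gradient descent on the mean squared error, starting from (w, b)."""
    n = len(xs)
    for _ in range(steps):
        errs = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errs, xs)) / n
        grad_b = 2 * sum(errs) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Large "source" domain (stands in for the big mixed corpus).
source_x = [i / 99 for i in range(100)]
source_y = [2.0 * x + 1.0 for x in source_x]

# Small "target" domain with a slightly different relationship
# (stands in for a small, harder dataset).
target_x = [i / 4 for i in range(5)]
target_y = [2.2 * x + 0.8 for x in target_x]

# Pretrain on the source domain, then fine-tune briefly on the target domain.
pre_w, pre_b = train(0.0, 0.0, source_x, source_y, steps=2000)
ft_w, ft_b = train(pre_w, pre_b, target_x, target_y, steps=20)

# Baseline: the same short training budget on the target domain, from scratch.
sc_w, sc_b = train(0.0, 0.0, target_x, target_y, steps=20)

finetuned_loss = mse(ft_w, ft_b, target_x, target_y)
scratch_loss = mse(sc_w, sc_b, target_x, target_y)
print(f"fine-tuned loss: {finetuned_loss:.5f}, from-scratch loss: {scratch_loss:.5f}")
```

In this toy setup the fine-tuned model starts close to a good solution, so the same small number of update steps leaves it with a lower error on the target data than the model trained from scratch.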
By taking a model like SpeechStew that's designed to understand generic speech and refining it at the margins, it's possible for AI to understand speech in, for example, different accents and environments.

Read more of this story at Slashdot.
