Gwangju Institute of Science and Technology Team Clinches Top Spot in IEEE AI Audio Recognition Challenge

The AI model developed by the team demonstrates the ability to detect and differentiate various sounds in indoor settings.

With the help of artificial intelligence (AI), acoustic recognition techniques can excel at listening to and interpreting various sounds in different contexts. Now, a research team from the Gwangju Institute of Science and Technology (GIST) has made significant strides in enhancing the performance of acoustic recognition to achieve diverse audio intelligence. Their groundbreaking work bagged the top spot in the indoor acoustic event detection category of the IEEE-organized “DCASE Challenge 2023.”
In our daily lives, sounds carry a large volume of information about the environment and the events taking place in the space around us. Humans can perceive the surrounding sound environment (for example, the sounds of busy streets, offices, or homes) and recognize the sources of these sounds. With the advent of artificial intelligence (AI), the performance of acoustic recognition techniques can be further improved, enabling machines to listen to and interpret various sounds in different contexts. This research holds tremendous potential for applications such as audio-content-based multimedia search, context-aware mobile devices, robots, cars, and intelligent surveillance and monitoring systems. The challenge, however, lies primarily in recognizing sound scenes and individual sound sources in realistic soundscapes, where multiple sounds are present simultaneously.
To this end, a collaborative research team from GIST (with Acting President Rae-gil Park) has developed a high-performing acoustic recognition technique using AI. Their work took the top spot in the International Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE), organized by the Audio and Acoustic Signal Processing (AASP) Technical Committee of the IEEE Signal Processing Society and held on June 1, 2023. The team comprised students from the Audio Intelligence Lab at GIST, Ji-won Kim, Sang-won Son, and Yoon-ah Song, under the guidance of Professor Hongguk Kim from GIST’s Department of Electrical and Computer Engineering. They collaborated with researchers Il-hoon Song and Jeong-eun Lim from Hanwha Vision’s AI Lab, led by Director Seung-in Noh. The team secured first place in the single-model category and second place in the ensemble-model category of the indoor acoustic event detection segment, highlighting their exceptional research accomplishment.
“In the category of Indoor Acoustic Event Detection, the performance of AI technology was evaluated based on its ability to detect and differentiate between 10 different sounds commonly found in indoor environments, including vacuum cleaners, dishes, barking dogs, and running water,” says Prof. Kim, while speaking about their achievement.
The GIST-Hanwha Vision team made significant strides in enhancing the performance of acoustic recognition by integrating a range of AI technologies capable of achieving diverse audio intelligence. These included semi-supervised learning, which reuses labels inferred by the AI itself as additional training targets; fusion techniques, which combine the inference results of pre-trained models with those of existing models; data refinement techniques, which optimize performance by cleaning the training data; and ensemble techniques, which combine multiple models to boost overall AI performance.
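To make the first and last of these ideas concrete, the sketch below illustrates pseudo-labeling (a common form of semi-supervised learning) and probability-averaging ensembles. It is an illustrative toy, not the team’s competition code: the class count matches the task’s 10 indoor sounds, but the 0.9 confidence threshold, the random “model” outputs, and all function names are assumptions made here for demonstration.

```python
# Minimal sketch (not the team's actual code) of two ideas mentioned above:
# (1) semi-supervised pseudo-labeling: a trained model's confident predictions
#     on unlabeled clips are reused as training targets, and
# (2) ensembling: averaging the output probabilities of several models.
import numpy as np

NUM_CLASSES = 10      # e.g., vacuum cleaner, dishes, dog, running water, ...
CONF_THRESHOLD = 0.9  # keep only confident pseudo-labels (assumed value)

def pseudo_label(probs: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Turn model probabilities on unlabeled clips into training targets.

    probs: (n_clips, NUM_CLASSES) class probabilities from a trained model.
    Returns (indices of clips kept, their one-hot pseudo-labels).
    """
    confidence = probs.max(axis=1)
    keep = np.flatnonzero(confidence >= CONF_THRESHOLD)
    labels = np.eye(NUM_CLASSES)[probs[keep].argmax(axis=1)]
    return keep, labels

def ensemble(prob_list: list[np.ndarray]) -> np.ndarray:
    """Fuse several models by averaging their predicted probabilities."""
    return np.mean(prob_list, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Fake probabilities for 5 unlabeled clips from 3 hypothetical models.
    models = [rng.dirichlet(np.ones(NUM_CLASSES) * 0.3, size=5) for _ in range(3)]
    fused = ensemble(models)
    kept, labels = pseudo_label(fused)
    print(f"kept {len(kept)} of 5 clips as pseudo-labeled training data")
```

In practice, the pseudo-labeled clips would be added back into the training set for another round of training; the confidence threshold trades off label noise against the amount of extra data recovered.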
“The experience and knowledge gained from this competition will be applied to the detection of acoustic events in the CCTV systems developed by Hanwha Vision. Furthermore, efforts will be made to develop more efficient and user-friendly services, such as a technology for detecting speech segments and acoustic events in social media content,” remarks an elated Prof. Kim.
This groundbreaking technology is expected to have a wide range of applications, including indoor surveillance and AI speakers. It may well enable AI to understand what is happening in our surroundings by analyzing sounds alone, even in situations where visual observation is not possible.