More than 1,400 hours of footage, capturing humans performing tasks simultaneously from their own point of view and from external cameras, will help give AI models an understanding of how humans carry out activities.
Building on the work that two years ago led to the release of Egocentric 4D Live Perception, the world’s most diverse egocentric dataset, the Ego4D consortium has drastically expanded the reach and ambition of its research with the newly published Ego-Exo4D – a foundational dataset to support research on video learning and multimodal perception.
A University of Bristol research team led by Professor Dima Damen at the School of Computer Science is part of an international consortium of 13 universities, in partnership with Meta, that is driving research in computer vision by collecting joint egocentric and exocentric datasets of skilled human activities.
The result of a two-year effort by Meta’s FAIR (Fundamental Artificial Intelligence Research), Project Aria, and the Ego4D consortium of 13 university partners, Ego-Exo4D is a first-of-its-kind large-scale multimodal multiview dataset and benchmark suite. Its defining feature is its simultaneous capture of both first-person ‘egocentric’ views, from a participant’s wearable camera, and multiple ‘exocentric’ views, from cameras surrounding the participant. Together, these two perspectives will give AI models a new window into complex skilled human activity, allowing new approaches to understand how skilled participants perform tasks such as dancing and playing music, and carry out procedures such as maintaining a bicycle.
Reflecting on the work of the consortium and her team’s contributions, Professor Damen remarked: “We are thrilled to be part of this international consortium releasing the Ego-Exo4D dataset. Today marks the outcome of a 2-year research collaboration, that continues to push the egocentric research community to new grounds.”
Co-leading the project at Bristol, Dr Michael Wray is particularly interested in the interplay between skilled activities and descriptive language. “At Bristol, we proposed the Act-and-Narrate recordings, that is, the capture of the person’s internal state – why they perform tasks in a particular manner. The Ego-Exo4D project also innovates by providing Expert Commentary narrations – these are domain experts who watch the videos and provide rich feedback on the performance. Not only through the multiview of vision but also through language we have the ‘ego’ language of the participant and the ‘exo’ language of an expert observer offering rich new insights into the very important research topic of how large language models interplay with assistive technology.”
The Ego4D consortium is a long-running collaboration between FAIR and more than a dozen universities around the world. The consortium members and FAIR researchers collaborated on all aspects of the project, from developing the dataset scope, to collecting the data, to formulating the benchmark tasks. This project also marks the largest ever deployment of the Aria glasses in the academic research community, with partners at 12 different sites using them.
Professor Damen’s group is a leading research group in egocentric vision internationally, and its expertise has been instrumental in the consortium’s work since its inception. “Starting from EPIC-KITCHENS in 2018 and continuing through the massive scale Ego4D and this new addition Ego-Exo4D continues to place the University of Bristol as a key lead in egocentric vision internationally and the only UK research group in this key futuristic area,” Professor Damen commented.
In addition to the captured footage, annotations for novel benchmark tasks and baseline models for ego-exo understanding are being made available to researchers. The datasets will be publicly available in December of this year for researchers who sign Ego4D’s data use agreement.
The data was collected following rigorous privacy and ethics standards, including formal review processes at each institution to establish the standards for collection, management, and informed consent, as well as a license agreement prescribing proper use of the data. With this release, the Ego4D consortium aims to provide the tools the broader research community needs to explore ego-exo video, multimodal activity recognition, and beyond.
Further information
Consortium members:
University of Bristol, UK
Carnegie Mellon University (Pittsburgh, USA and Rwanda)
Georgia Tech, USA
Indiana University, USA
International Institute of Information Technology, Hyderabad, India
King Abdullah University of Science and Technology (KAUST), KSA
Massachusetts Institute of Technology, USA
National University of Singapore, Singapore
Universidad de los Andes, Colombia
University of Catania, Italy
University of Minnesota, USA
University of Pennsylvania, USA
University of Tokyo, Japan
Egocentric 4D Live Perception (Ego4D) is a massive-scale dataset that compiles 3,025 hours of footage from the wearable cameras of 855 participants in nine countries: UK, India, Japan, Singapore, KSA, Colombia, Rwanda, Italy and the US. The data captures a wide range of activities from the ‘egocentric’ perspective – that is, from the viewpoint of the person carrying out the activity. The University of Bristol is the only UK representative in this diverse and international effort, collecting 270 hours from 82 participants who captured footage of their chosen activities of daily living – such as practicing a musical instrument, gardening, grooming their pet or assembling furniture.
Read more about Egocentric 4D Live Perception in our blog: Egocentric computer vision – a giant leap
EPIC-KITCHENS is a collaboration with the University of Toronto (Canada) and the University of Catania (Italy), led by the University of Bristol, to collect and annotate the largest dataset of its kind (over 20 million frames), capturing 45 individuals in their own homes over several consecutive days.
The dataset was collected in 4 different countries and was narrated in 6 languages to assist in vision and language challenges. It offers a series of challenges, from object recognition to action prediction and activity modelling, in non-scripted, realistic daily settings.
The size of publicly available datasets is crucial to the progress of this field, which is of prime importance to robotics, healthcare and augmented reality.