The Massachusetts Institute of Technology has permanently taken down its 80 Million Tiny Images dataset, a popular image database used to train machine learning systems to identify people and objects in an environment, because the dataset used a range of racist, misogynistic, and other offensive terms to label photos.
In a letter published Monday to MIT’s CSAIL website, the three creators of the huge dataset, Antonio Torralba, Rob Fergus, and Bill Freeman, apologized and said they had decided to take the dataset offline.
"It has been brought to our attention that the Tiny Images dataset contains some derogatory terms as categories and offensive images. This was a consequence of the automated data collection procedure that relied on nouns from WordNet. We are greatly concerned by this and apologize to those who may have been affected," they wrote in the letter.
According to the letter, the dataset was created in 2006 and contains 53,464 different nouns copied directly from WordNet. Those terms were then used to automatically download images corresponding to each noun from the Internet search engines of the time, yielding the 80 million images (at a tiny 32x32 resolution).
"Biases, offensive and prejudicial images, and derogatory terminology alienates an important part of our community -- precisely those that we are making efforts to include. It also contributes to harmful biases in AI systems trained on such data," they wrote.
"Additionally, the presence of such prejudicial images hurts efforts to foster a culture of inclusivity in the computer vision community. This is extremely unfortunate and runs counter to the values that we strive to uphold."
Biased datasets can have a major impact on the machine learning technologies and AI programs they are used to train. A range of critics inside and outside Silicon Valley have called attention to biases in various AI systems against Black people specifically and people of color in general.
The dataset will not be re-uploaded.