Unique, high-quality data, mainly scraped from the web, is vital to the performance of AI models.
More and more companies are trying to avoid having their data freely scraped and saved by web crawlers working for the benefit of AI models.
Last month, OpenAI revealed its own crawler, GPTBot, saying it would respect robots.txt, a decades-old method through which a website can tell a web crawler to ignore it.
Many more companies are now also blocking CCBot, a web crawler used by Common Crawl.
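For readers unfamiliar with how a robots.txt block works in practice, here is a minimal sketch in Python. The rules shown are illustrative only and are not copied from any site named in this article; `example.com`, `ExampleBot`, and the helper `is_allowed` are hypothetical names chosen for the example. It relies on the standard library's `urllib.robotparser`, and it only shows how a well-behaved crawler would read the file. Nothing in robots.txt technically prevents a crawler from ignoring it.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt (an assumption, not any real site's file):
# block the two crawlers named in the article, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/some-article"  # placeholder URL
    for bot in ("GPTBot", "CCBot", "ExampleBot"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, bot, page) else "blocked"
        print(f"{bot}: {verdict}")
```

A site that wants to block only one of the two crawlers would simply drop the other `User-agent` group from the file.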
See below for a full list of the biggest websites now blocking GPTBot and CCBot as of Sept. 22:
Blocking GPTBot
amazon.com, quora.com, nytimes.com, theguardian.com, shutterstock.com, wikihow.com, cnn.com, sciencedirect.com, usatoday.com, healthline.com, stackexchange.com, alamy.com, scribd.com, webmd.com, businessinsider.com, dictionary.com, reuters.com, washingtonpost.com, medicalnewstoday.com, npr.org, cbsnews.com, goodhousekeeping.com, amazon.co.uk, tumblr.com, latimes.com, insider.com, glassdoor.com, vocabulary.com, investopedia.com, slideshare.net, amazon.de, cosmopolitan.com, nbcnews.com, indiamart.com, stackoverflow.com, hindustantimes.com, bloomberg.com, cnbc.com, people.com, tvtropes.org, amazon.in, vimeo.com, verywellhealth.com, ikea.com, espn.com, indianexpress.com, thesaurus.com, pbs.org, 123rf.com, wattpad.com, variety.com, today.com, popsugar.com, thespruce.com, uol.com.br, amazon.fr, geeksforgeeks.org, elle.com, economictimes.com, pcmag.com, theverge.com, allrecipes.com, thoughtco.com, rollingstone.com, wired.com, nextdoor.com, hollywoodreporter.com, abc.net.au, ew.com, amazon.ca, news18.com, womenshealthmag.com, rateyourmusic.com, amazon.co.jp, techradar.com, airbnb.com, ndtv.com, lifewire.com, tomsguide.com, vulture.com, everydayhealth.com, polygon.com, theconversation.com, esquire.com, prnewswire.com, billboard.com, menshealth.com, metro.co.uk, countryliving.com, mashable.com, gamesradar.com, thehindu.com, timesofindia.com, deadline.com, harpersbazaar.com, medscape.com, nymag.com, refinery29.com, radiotimes.com, cbssports.com, tandfonline.com, theatlantic.com, trulia.com, amazon.es, pinterest.es, nationalgeographic.com, bhg.com, eater.com, southernliving.com, healthgrades.com, vice.com, picclick.com, bustle.com, newyorker.com, eonline.com, digitalspy.com, opentable.com, pinterest.de, thepioneerwoman.com, caranddriver.com, byrdie.com, livemint.com, medicinenet.com, teacherspayteachers.com, cookpad.com, thespruceeats.com, bizjournals.com, pagesjaunes.fr, liputan6.com, delish.com, masterclass.com, archiveofourown.org, vox.com, realsimple.com, aarp.org, francetvinfo.fr, pinterest.fr, kumparan.com, theathletic.com, travelandleisure.com, vogue.com, livescience.com, apartments.com, marketwatch.com, glamour.com, amazon.it, cinemablend.com, thrillist.com, amazon.com.br, pinterest.co.uk, angi.com, alamy.es, usmagazine.com, distractify.com, bbcgoodfood.com, jagran.com, mercadolibre.com.mx, androidauthority.com, city-data.com, foodandwine.com, hellomagazine.com, amazon.com.au, gq.com, ingles.com, amarujala.com, ieee.org, prevention.com, stern.de, kbb.com, edmunds.com, marthastewart.com, pcgamer.com, justanswer.com, health.com, 20minutes.fr, fortune.com, homes.com, scientificamerican.com, popularmechanics.com, verywellfit.com, vanityfair.com, chicagotribune.com, verywellmind.com, housebeautiful.com, cntraveler.com, allure.com, spanishdict.com, neverbounce.com, answers.com, moneycontrol.com, architecturaldigest.com, slate.com, lonelyplanet.com, inverse.com, corriere.it, actu.fr, self.com, tripsavvy.com, instyle.com, eatingwell.com, superuser.com, welt.de, spiegel.de, womansday.com, seventeen.com, hbr.org, oprahdaily.com, autotrader.com, bonappetit.com, sueddeutsche.de, seriouseats.com, liveabout.com, seattletimes.com, coursera.org, livehindustan.com, france24.com, townandcountrymag.com, dotesports.com, worldplaces.me, faz.net, teenvogue.com, motor1.com, nj.com, glamourmagazine.co.uk, okdiario.com, brides.com, stylecaster.com, alamyimages.fr, jagranjosh.com, theglobeandmail.com, axios.com, francebleu.fr, tabelog.com, thebalancemoney.com, nydailynews.com, sheknows.com, naomedical.com, verywellfamily.com
Blocking CCBot