Internet Archive Research Publication Crawls

Internet Archive Web Group

A series of open web crawls targeting journal articles, technical memos, essays, datasets, and other research publications.

MSAG-PDF-CRAWL-2017
collection
1,855
ITEMS
4.4M
VIEWS
by Internet Archive Web Group
Microsoft Academic Graph public corpus (Feb 2016) PDF URLs, filtered to remove large sites (pubmed, citeseerx, arxiv) and already-crawled URLs.
Topics: papers, journals
OA-JOURNAL-CRAWL-2020-07
collection
1,923
ITEMS
2.9M
VIEWS
by Internet Archive Web Group
OAI-PMH-CRAWL-2020-06
collection
2,946
ITEMS
1.1M
VIEWS
by Internet Archive Web Group
UNPAYWALL-PDF-CRAWL-2018-07
collection
1,241
ITEMS
4.7M
VIEWS
by Internet Archive Web Group
Web archive data from a crawl of open access PDF URLs provided by Unpaywall.
Open Access Journal Test Crawl (2018)
collection
794
ITEMS
4.1M
VIEWS
by Internet Archive Web Group
DATACITE-DOI-CRAWL-2020-01
collection
1,417
ITEMS
850,582
VIEWS
by Internet Archive Web Group
SEMSCHOLAR-DIRECT-PDF-CRAWL-2020-02
collection
1,011
ITEMS
185,328
VIEWS
by Internet Archive Web Group
UNPAYWALL-PDF-CRAWL-2019-04
collection
641
ITEMS
477,875
VIEWS
by Internet Archive Web Group
CORE-UPSTREAM-CRAWL-2018-11
collection
741
ITEMS
309,958
VIEWS
by Internet Archive Web Group
Crawl of "upstream" URLs from a CORE (core.ac.uk) metadata dump. Only a partial seedlist of files was crawled.
DIRECT-OA-CRAWL-2019
collection
2,566
ITEMS
844,059
VIEWS
by Internet Archive Web Group
DOI-LANDING-CRAWL-2018-06
collection
279
ITEMS
1.4M
VIEWS
by Internet Archive Web Group
MAG-PDF-CRAWL-2020-03
collection
489
ITEMS
497,407
VIEWS
by Internet Archive Web Group
OA-JOURNAL-CRAWL-2019-08
collection
201
ITEMS
1.4M
VIEWS
by Internet Archive Web Group
OA-DOI-CRAWL-2020-02
collection
278
ITEMS
449,966
VIEWS
by Internet Archive Web Group
UNPAYWALL-PDF-CRAWL-2020-03
collection
344
ITEMS
239,220
VIEWS
by Internet Archive Web Group
Wide Web Targeted PDF Crawling (2017)
collection
922
ITEMS
839,902
VIEWS
by Internet Archive Web Group
UNPAYWALL-PDF-CRAWL-2020-05
collection
282
ITEMS
105,816
VIEWS
by Internet Archive Web Group
collection
eye 105,816
collection
eye 1M
Internet Archive crawl of PDF URLs provided by Semantic Scholar.
Topic: pdf
CiteSeerX URL Crawl 2017
collection
207
ITEMS
398,277
VIEWS
A targeted crawl to fetch research publications from the public web which have been crawled by CiteSeerX but have not previously been crawled by the Internet Archive.
Topics: scholarly, papers, journal
MAG-PDF-CRAWL-2020-07
collection
196
ITEMS
188,003
VIEWS
by Internet Archive Web Group
UNPAYWALL-PDF-CRAWL-2020-11
collection
199
ITEMS
82,260
VIEWS
by Internet Archive Web Group
OA-DOI-CRAWL-2020-12
collection
191
ITEMS
154,208
VIEWS
by Internet Archive Web Group
DOAJ-CRAWL-2020-11
collection
102
ITEMS
146,319
VIEWS
by Internet Archive Web Group
UNPAYWALL-PDF-CRAWL-2021-05
collection
123
ITEMS
4,342
VIEWS
by Internet Archive Web Group
PLATFORM-CRAWL-2020
collection
501
ITEMS
16,734
VIEWS
by Internet Archive Web Group
SCIELO-CRAWL-2020-07
collection
41
ITEMS
56,900
VIEWS
by Internet Archive Web Group
OA-JOURNAL-CRAWL-2020-07
web
eye 63,868
Internet Archive crawldata of scholarly web journal content captured by wbgrp-svc282.us.archive.org:OA-JOURNAL-CRAWL-2020-07 from Sun Aug 2 19:00:58 PDT 2020 to Sun Aug 2 13:24:24 PDT 2020.
Topic: crawldata
DOAJ-CRAWL-2020-11
web
eye 19,205
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:DOAJ-CRAWL-2020-11 from Tue Nov 24 17:59:21 PST 2020 to Tue Nov 24 11:43:19 PST 2020.
Topic: crawldata
arXiv Content Crawl (2019-10)
collection
37
ITEMS
13,812
VIEWS
by Internet Archive Web Group
PubMed Central Crawl (2019-10)
collection
216
ITEMS
106,678
VIEWS
by Internet Archive Web Group
OA-JOURNAL-CRAWL-2020-07
web
eye 18,641
Internet Archive crawldata of scholarly web journal content captured by wbgrp-svc282.us.archive.org:OA-JOURNAL-CRAWL-2020-07 from Sun Jul 12 16:57:32 PDT 2020 to Sun Jul 12 10:39:01 PDT 2020.
Topic: crawldata
MAG-PDF-CRAWL-2020-07
web
eye 9,755
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc171.us.archive.org:MAG-PDF-CRAWL-2020-07 from Fri Jul 10 06:54:03 PDT 2020 to Fri Jul 10 00:16:55 PDT 2020.
Topic: crawldata
Internet Archive crawldata of web PDF content captured by wbgrp-svc284.us.archive.org:OA-JOURNAL-TESTCRAWL-TWO-2018 from Sun Apr 22 23:50:07 PDT 2018 to Sun Apr 22 18:04:52 PDT 2018.
Topic: crawldata
PUBMEDCENTRAL-CRAWL-2020-02
collection
108
ITEMS
59,248
VIEWS
by Internet Archive Web Group
OA-JOURNAL-CRAWL-2020-07
web
eye 17,551
Internet Archive crawldata of scholarly web journal content captured by wbgrp-svc282.us.archive.org:OA-JOURNAL-CRAWL-2020-07 from Tue Jul 14 13:35:36 PDT 2020 to Tue Jul 14 07:22:15 PDT 2020.
Topic: crawldata
OA-JOURNAL-CRAWL-2020-07
web
eye 1,990
Internet Archive crawldata of scholarly web journal content captured by wbgrp-svc283.us.archive.org:OA-JOURNAL-CRAWL-2020-07 from Fri Jul 24 15:12:37 PDT 2020 to Fri Jul 24 09:15:02 PDT 2020.
Topic: crawldata
MAG-PDF-CRAWL-2020-07
web
eye 8,604
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc171.us.archive.org:MAG-PDF-CRAWL-2020-07 from Fri Jul 10 07:15:10 PDT 2020 to Fri Jul 10 00:50:21 PDT 2020.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2018-07
web
eye 27,841
Internet Archive crawldata of open access journal content captured by wbgrp-svc282.us.archive.org:UNPAYWALL-PDF-CRAWL-2018-07 from Sun Jul 29 09:51:24 PDT 2018 to Sun Jul 29 05:34:41 PDT 2018.
Topic: crawldata
OA-JOURNAL-CRAWL-2020-07
web
eye 11,056
Internet Archive crawldata of scholarly web journal content captured by wbgrp-svc282.us.archive.org:OA-JOURNAL-CRAWL-2020-07 from Sat Jul 18 12:33:09 PDT 2020 to Sat Jul 18 06:31:03 PDT 2020.
Topic: crawldata
OA-JOURNAL-CRAWL-2020-07
web
eye 13,077
Internet Archive crawldata of scholarly web journal content captured by wbgrp-svc282.us.archive.org:OA-JOURNAL-CRAWL-2020-07 from Thu Jul 16 12:42:02 PDT 2020 to Thu Jul 16 06:36:46 PDT 2020.
Topic: crawldata
ARXIV-PUBMEDCENTRAL-CRAWL-2020-04
collection
60
ITEMS
11,502
VIEWS
by Internet Archive Web Group
OA-JOURNAL-CRAWL-2020-07
web
eye 2,310
Internet Archive crawldata of scholarly web journal content captured by wbgrp-svc283.us.archive.org:OA-JOURNAL-CRAWL-2020-07 from Tue Jul 21 20:24:05 PDT 2020 to Tue Jul 21 14:42:40 PDT 2020.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2018-07
web
eye 48,758
Internet Archive crawldata of open access journal content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2018-07 from Sun Jul 29 09:53:16 PDT 2018 to Sun Jul 29 04:27:27 PDT 2018.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2018-07
web
eye 24,399
Internet Archive crawldata of open access journal content captured by wbgrp-svc282.us.archive.org:UNPAYWALL-PDF-CRAWL-2018-07 from Fri Jul 27 15:58:25 PDT 2018 to Fri Jul 27 10:31:36 PDT 2018.
Topic: crawldata
OA-JOURNAL-CRAWL-2020-07
web
eye 12,146
Internet Archive crawldata of scholarly web journal content captured by wbgrp-svc282.us.archive.org:OA-JOURNAL-CRAWL-2020-07 from Thu Jul 16 11:43:57 PDT 2020 to Thu Jul 16 05:45:20 PDT 2020.
Topic: crawldata
OA-JOURNAL-CRAWL-2020-07
web
eye 9,265
Internet Archive crawldata of scholarly web journal content captured by wbgrp-svc282.us.archive.org:OA-JOURNAL-CRAWL-2020-07 from Mon Jul 20 07:53:41 PDT 2020 to Mon Jul 20 01:51:47 PDT 2020.
Topic: crawldata
OA-JOURNAL-CRAWL-2019-08
web
eye 7,074
Internet Archive crawldata of open access journal content captured by wbgrp-svc281.us.archive.org:OA-JOURNAL-CRAWL-2019-08 from Mon Aug 12 14:17:37 PDT 2019 to Mon Aug 12 09:27:15 PDT 2019.
Topic: crawldata
Internet Archive crawldata of uncrawled Semantic Scholar seedlist PDF URLs captured by wbgrp-svc285.us.archive.org:SEMSCHOLAR-PDF-CRAWL-2017 from Wed Aug 30 04:34:19 PDT 2017 to Tue Aug 29 21:50:48 PDT 2017.
Topic: crawldata
Internet Archive crawldata of uncrawled Semantic Scholar seedlist PDF URLs captured by wbgrp-svc285.us.archive.org:SEMSCHOLAR-PDF-CRAWL-2017 from Wed Aug 30 07:36:27 PDT 2017 to Wed Aug 30 00:57:58 PDT 2017.
Topic: crawldata
OA-JOURNAL-CRAWL-2019-08
web
eye 15,442
Internet Archive crawldata of open access journal content captured by wbgrp-svc281.us.archive.org:OA-JOURNAL-CRAWL-2019-08 from Sun Aug 11 11:20:15 PDT 2019 to Sun Aug 11 06:16:38 PDT 2019.
Topic: crawldata
OA-JOURNAL-CRAWL-2019-08
web
eye 12,725
Internet Archive crawldata of open access journal content captured by wbgrp-svc281.us.archive.org:OA-JOURNAL-CRAWL-2019-08 from Mon Aug 12 05:10:56 PDT 2019 to Mon Aug 12 00:27:08 PDT 2019.
Topic: crawldata
OA-JOURNAL-CRAWL-2020-07
web
eye 14,537
Internet Archive crawldata of scholarly web journal content captured by wbgrp-svc282.us.archive.org:OA-JOURNAL-CRAWL-2020-07 from Sat Jul 11 11:58:35 PDT 2020 to Sat Jul 11 06:03:10 PDT 2020.
Topic: crawldata
OA-JOURNAL-CRAWL-2019-08
web
eye 13,405
Internet Archive crawldata of open access journal content captured by wbgrp-svc281.us.archive.org:OA-JOURNAL-CRAWL-2019-08 from Sun Aug 11 23:58:34 PDT 2019 to Sun Aug 11 19:00:00 PDT 2019.
Topic: crawldata
MAG-PDF-CRAWL-2020-03
web
eye 5,431
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc280.us.archive.org:MAG-PDF-CRAWL-2020-03 from Sun Mar 22 21:08:51 PDT 2020 to Sun Mar 22 15:02:17 PDT 2020.
Topic: crawldata
OA-JOURNAL-CRAWL-2020-07
web
eye 3,490
Internet Archive crawldata of scholarly web journal content captured by wbgrp-svc283.us.archive.org:OA-JOURNAL-CRAWL-2020-07 from Tue Jul 14 07:23:04 PDT 2020 to Tue Jul 14 01:10:19 PDT 2020.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2018-07
web
eye 20,866
Internet Archive crawldata of open access journal content captured by wbgrp-svc282.us.archive.org:UNPAYWALL-PDF-CRAWL-2018-07 from Sun Jul 29 12:27:39 PDT 2018 to Sun Jul 29 07:14:57 PDT 2018.
Topic: crawldata
OA-JOURNAL-CRAWL-2019-08
web
eye 12,899
Internet Archive crawldata of open access journal content captured by wbgrp-svc281.us.archive.org:OA-JOURNAL-CRAWL-2019-08 from Sun Aug 11 07:30:16 PDT 2019 to Sun Aug 11 02:32:48 PDT 2019.
Topic: crawldata
DOI-LANDING-CRAWL-2018-06
web
eye 7,775
Internet Archive crawldata of doi.org redirect landing page content captured by wbgrp-svc279.us.archive.org:DOI-LANDING-CRAWL-2018-06 from Mon Jun 18 00:07:49 PDT 2018 to Wed Jun 20 12:16:53 PDT 2018.
Topic: crawldata
Internet Archive crawldata of uncrawled Semantic Scholar seedlist PDF URLs captured by wbgrp-svc284.us.archive.org:SEMSCHOLAR-PDF-CRAWL-2017 from Tue Aug 29 01:59:23 PDT 2017 to Mon Aug 28 19:28:00 PDT 2017.
Topic: crawldata
OA-JOURNAL-CRAWL-2019-08
web
eye 14,148
Internet Archive crawldata of open access journal content captured by wbgrp-svc281.us.archive.org:OA-JOURNAL-CRAWL-2019-08 from Wed Aug 7 22:30:32 PDT 2019 to Wed Aug 7 17:48:16 PDT 2019.
Topic: crawldata
DOI-LANDING-CRAWL-2018-06
web
eye 7,339
Internet Archive crawldata of doi.org redirect landing page content captured by wbgrp-svc282.us.archive.org:DOI-LANDING-CRAWL-2018-06 from Sun Jun 17 09:32:14 PDT 2018 to Tue Jun 19 02:54:40 PDT 2018.
Topic: crawldata
DOI-LANDING-CRAWL-2018-06
web
eye 7,659
Internet Archive crawldata of doi.org redirect landing page content captured by wbgrp-svc281.us.archive.org:DOI-LANDING-CRAWL-2018-06 from Sat Jun 2 11:22:48 PDT 2018 to Sat Jun 2 06:21:42 PDT 2018.
Topic: crawldata
DOAJ-CRAWL-2020-11
web
eye 5,920
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:DOAJ-CRAWL-2020-11 from Tue Nov 24 06:09:59 PST 2020 to Tue Nov 24 00:41:05 PST 2020.
Topic: crawldata
OA-JOURNAL-CRAWL-2019-08
web
eye 24,453
Internet Archive crawldata of open access journal content captured by wbgrp-svc281.us.archive.org:OA-JOURNAL-CRAWL-2019-08 from Sat Aug 10 11:36:50 PDT 2019 to Sat Aug 10 06:44:32 PDT 2019.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2018-07
web
eye 2,020
Internet Archive crawldata of open access journal content captured by wbgrp-svc281.us.archive.org:UNPAYWALL-PDF-CRAWL-2018-07 from Fri Jul 20 06:37:00 PDT 2018 to Fri Jul 20 00:35:06 PDT 2018.
Topic: crawldata
OA-DOI-CRAWL-2020-02
web
eye 9,316
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:OA-DOI-CRAWL-2020-02 from Fri Feb 7 19:03:43 PST 2020 to Fri Feb 7 12:06:54 PST 2020.
Topic: crawldata
OA-JOURNAL-CRAWL-2020-07
web
eye 4,927
Internet Archive crawldata of scholarly web journal content captured by wbgrp-svc283.us.archive.org:OA-JOURNAL-CRAWL-2020-07 from Tue Jul 14 06:32:55 PDT 2020 to Tue Jul 14 00:26:22 PDT 2020.
Topic: crawldata
OA-JOURNAL-CRAWL-2019-08
web
eye 11,673
Internet Archive crawldata of open access journal content captured by wbgrp-svc281.us.archive.org:OA-JOURNAL-CRAWL-2019-08 from Sun Aug 11 18:47:54 PDT 2019 to Sun Aug 11 13:32:59 PDT 2019.
Topic: crawldata
Internet Archive crawldata of web PDF content captured by wbgrp-svc284.us.archive.org:OA-JOURNAL-TESTCRAWL-TWO-2018 from Mon Apr 9 20:45:03 PDT 2018 to Mon Apr 9 14:32:11 PDT 2018.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2018-07
web
eye 26,747
Internet Archive crawldata of open access journal content captured by wbgrp-svc281.us.archive.org:UNPAYWALL-PDF-CRAWL-2018-07 from Mon Jul 23 16:02:29 PDT 2018 to Mon Jul 23 10:33:02 PDT 2018.
Topic: crawldata
OA-JOURNAL-CRAWL-2019-08
web
eye 7,772
Internet Archive crawldata of open access journal content captured by wbgrp-svc281.us.archive.org:OA-JOURNAL-CRAWL-2019-08 from Mon Aug 12 10:22:31 PDT 2019 to Mon Aug 12 05:12:34 PDT 2019.
Topic: crawldata
OA-JOURNAL-CRAWL-2019-08
web
eye 14,756
Internet Archive crawldata of open access journal content captured by wbgrp-svc281.us.archive.org:OA-JOURNAL-CRAWL-2019-08 from Sat Aug 10 06:14:04 PDT 2019 to Sat Aug 10 01:21:36 PDT 2019.
Topic: crawldata
OA-JOURNAL-CRAWL-2019-08
web
eye 10,772
Internet Archive crawldata of open access journal content captured by wbgrp-svc281.us.archive.org:OA-JOURNAL-CRAWL-2019-08 from Sun Aug 11 16:45:58 PDT 2019 to Sun Aug 11 11:49:42 PDT 2019.
Topic: crawldata
OA-JOURNAL-CRAWL-2019-08
web
eye 13,216
Internet Archive crawldata of open access journal content captured by wbgrp-svc281.us.archive.org:OA-JOURNAL-CRAWL-2019-08 from Sun Aug 11 22:05:22 PDT 2019 to Sun Aug 11 17:03:03 PDT 2019.
Topic: crawldata
DOI-LANDING-CRAWL-2018-06
web
eye 7,966
Internet Archive crawldata of doi.org redirect landing page content captured by wbgrp-svc284.us.archive.org:DOI-LANDING-CRAWL-2018-06 from Sun Jun 17 09:59:43 PDT 2018 to Tue Jun 19 02:59:17 PDT 2018.
Topic: crawldata