Unsorted Datasets

220
RESULTS


Unsorted Datasets
by Curated by MIDI_MAN
data

eye 1,932

favorite 28

comment 0

Original Collection by MIDI_MAN. Modified by Jason Scott/Internet Archive for ease of use/interaction. (Click on VIEW CONTENTS links to interact with the 130,000 MIDI files held here.) Text listings of the ZIP file contents are also available. Melody Kit 1.0: A collection of 130,000 MIDI files. A large selection of publicly available MIDI files was collected in June 2015; around 200 websites were crawled and scraped for MIDI files. A script was run to eliminate duplicate files...
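The description mentions a script that eliminated duplicate files but does not say how it worked. One common approach, shown here purely as an illustrative sketch and not as the collector's actual method, is to group files by a hash of their contents. The function names, the `*.mid` pattern, and the choice of SHA-256 are all assumptions.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path, pattern: str = "*.mid") -> dict[str, list[Path]]:
    """Group files under root by content hash (hypothetical helper).

    Only groups with more than one file are returned, so every value
    is a list of byte-identical files.
    """
    groups: dict[str, list[Path]] = {}
    for path in sorted(root.rglob(pattern)):
        if path.is_file():
            groups.setdefault(file_digest(path), []).append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

This only catches byte-identical copies; two MIDI files that encode the same music with different headers or metadata would not be detected.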
Unsorted Datasets
by J. Yang, J. Leskovec (Original Acquisition)
data

eye 320

favorite 1

comment 0

467 million Twitter posts from 20 million users covering a 7 month period from June 1 2009 to December 31 2009. We estimate this is about 20-30% of all public tweets published on Twitter during the particular time frame. For each public tweet the following information is available: Author, Time and Content. Number of users 17,069,982 Number of tweets 476,553,560 Number of URLs 181,611,080 Number of Hashtags 49,293,684 Number of re-tweets 71,835,017 Citation: J. Yang, J. Leskovec. Temporal...
Unsorted Datasets
by Gwern Branwen
data

eye 38,675

favorite 18

comment 1

Dark Net Markets (DNM) are online markets typically hosted as Tor hidden services whose users transact in Bitcoin or other cryptocoins, usually for drugs or other illegal/regulated goods; the most famous DNM was Silk Road 1, which pioneered the business model. From 2013-2015, I scraped/mirrored on a weekly or daily basis all existing English-language DNMs as part of my research into their usage, lifetimes/characteristics, & legal riskiness; in addition, I made or obtained copies of as many...
Rating: 5 of 5 stars (1 review)
Topics: Tor, Bitcoin, drugs, Silk Road, Evolution, Agora, black-markets, dark net markets
Unsorted Datasets
data

eye 57,517

favorite 13

comment 3

(Here is the original Reddit comment announcing this collection of data and what the processes were.) This is an archive of Reddit comments from October of 2007 until May of 2015 (complete month). This reflects 14 months of work and a lot of API calls. This dataset includes nearly every publicly available Reddit comment. Approximately 350,000 comments out of ~1.65 billion were unavailable due to Reddit API issues. Q: How are the files structured? Each file is compressed with bzip2 compression....
Rating: 5 of 5 stars (3 reviews)
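Because each file is bzip2-compressed, the comments can be streamed without decompressing to disk first. The sketch below assumes one JSON object per line and a `subreddit` field on each object; neither is confirmed by the truncated description above, and the function names are hypothetical.

```python
import bz2
import json
from collections import Counter
from typing import Iterator


def iter_comments(path: str) -> Iterator[dict]:
    """Yield one decoded JSON object per non-empty line of a bzip2 file.

    Assumes newline-delimited JSON (an assumption, not stated in the listing).
    """
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_by_subreddit(path: str) -> Counter:
    """Count comments per subreddit without loading the whole file into memory."""
    return Counter(c.get("subreddit", "<missing>") for c in iter_comments(path))
```

Streaming line by line keeps memory use flat, which matters for monthly files of this size.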
Unsorted Datasets
software

eye 2,338

favorite 0

comment 0

Syzygy endgame tablebases containing win-draw-loss (WDL) and distance-to-zero (DTZ) information for chess positions containing 7 pieces. These tablebases will be of interest to both chess players and computer scientists. For more information, please visit the Chess Programming wiki: https://www.chessprogramming.org/Syzygy_Bases This is Part 1 of ?. See below for the remaining parts: Part 2: https://archive.org/details/Syzygy7_2
Topics: 7-man, 7-men, 7man, 7men, chess, database, databases, egtb, egtbs, syzygy, tablebase, tablebases
Source: http://tablebase.sesse.net/
Unsorted Datasets
by NYC Taxi and Limousine Commission
data

eye 21,276

favorite 4

comment 0

FOIA/FOILed Taxi Trip Data from the NYC Taxi and Limousine Commission 2013. Released by http://chriswhong.com/open-data/foil_nyc_taxi/ trip_data.7z and trip_fare.7z are more efficiently compressed versions of the data; you probably want these files. The data is in CSV format. For the data files this includes the fields: medallion, hack_license, vendor_id, rate_code, store_and_fwd_flag, pickup_datetime, dropoff_datetime, passenger_count, trip_time_in_secs, trip_distance, pickup_longitude,...
Topics: data, nyc, taxi, fare, csv, FOIA, FOIL
Source: torrent:urn:sha1:6c594866904494b06aae51ad97ec7f985059b135
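As a rough sketch of working with the trip data once extracted, the CSV rows can be read by field name. This assumes the files carry a header row using the field names listed above; `skipinitialspace=True` is included in case the header separates names with a comma and a space. The function name is hypothetical.

```python
import csv
from typing import Iterable


def mean_trip_distance(lines: Iterable[str]) -> float:
    """Average the trip_distance column, skipping rows where it is missing or malformed."""
    reader = csv.DictReader(lines, skipinitialspace=True)
    total = 0.0
    count = 0
    for row in reader:
        try:
            total += float(row["trip_distance"])
        except (KeyError, TypeError, ValueError):
            continue  # column absent, row too short, or value not numeric
        count += 1
    return total / count if count else 0.0
```

Any open text file works as input, for example `with open("trip_data_1.csv", newline="") as fh: mean_trip_distance(fh)`, where the file name is a placeholder.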
Unsorted Datasets
by /u/prograc
texts

eye 1,094

favorite 21

comment 0

More info here: https://web.archive.org/web/20200707075947/https://www.reddit.com/r/DataHoarder/comments/ftsdbs/gamespot_txt_gamefaqs_full_archive_32320/
Topics: txt, gamefaq, gamesfaq
Unsorted Datasets
by Peter Baylies
software

eye 15,326

favorite 6

comment 0

Deep learning conditional StyleGAN2 model for generating art trained on WikiArt images; includes the model, a ResNet based encoder into the model's latent space, and source code (mirror of the pbaylies/stylegan2 repo on github as of 2020-01-25)
Topics: generative art, StyleGAN2, wikiart, software, deep learning
Unsorted Datasets
by legacycollector.org
software

eye 9,813

favorite 9

comment 2

To browse the repository: Click Here. This website is a repository for web content that has been deemed "legacy", has been removed by its original publishers, and might otherwise be difficult or cumbersome to get. Since I started this at the end of 2018, in response to Mozilla removing all legacy extensions from its add-ons site, with plans to expand to include more, similar "legacy" content, a few things have changed, requiring me to re-evaluate both the need for this site and my...
Rating: 5 of 5 stars (2 reviews)
Unsorted Datasets
data

eye 1,328

favorite 1

comment 0

Unsorted Datasets
by All the Music, LLC
audio

eye 23,390

favorite 40

comment 10

From: https://www.vice.com/en_uk/article/wxepzw/musicians-algorithmically-generate-every-possible-melody-release-them-to-public-domain : Musicians Algorithmically Generate Every Possible Melody, Release Them to Public Domain Damien Riehl and Noah Rubin generated and saved every possible melody to a hard drive, then turned it back around to the commons. From: https://www.dailymail.co.uk/sciencetech/article-8042979/Musician-uses-computer-algorithm-compose-melody-thats-possible-key-C.html :...
Rating: 3 of 5 stars (10 reviews)
Unsorted Datasets
by 4chan
data

eye 1,205

favorite 11

comment 0

The Ark is a compilation of material created by 4chan's /k/ board. Being a board centered around weapons, it includes information about warfare, survival, gunsmithing, etc., but also material more generally aligned with their interests, like anime and games. From the torrent's...
Topics: 4chan, /k/, /k/ommando, weapons, guns, gunsmithing, survival, tactics, game, games, anime, 3d print
Source: torrent:urn:sha1:6d72a0d13d050f6ed00179ffd4294b549714140a
Unsorted Datasets
by Internet Archive
data

eye 24,375

favorite 14

comment 1

Culled from various sources, this collection includes over one million JPG, PNG and GIF album covers. The resolution ranges from "thumbnail" through to very large sizes. Filenames vary in usefulness, although a good number indicate at least the name of the original album. This dataset is for experimentation and image processing research only. At 148 GB, the collection is large but not unmanageable (there is a torrent available) and allows a developer or artist to work with the...
Rating: 5 of 5 stars (1 review)
Topics: dataset, big data, album covers, covers, cover art, cover photos
Unsorted Datasets
by SilenceROM
software

eye 941,033

favorite 4

comment 1

SilenceROM LIII Changelog *CCM/Hybrid/Nox Adjustments *Tweaked Super Favourites *Updated source file *Updated applications *Updated SilenceROM Wizard *Tweaked Database +TorrentRelease Repo +Renegades TV Guide :Preconfigured +Dragon Streams +DubStop ####################### SilenceROM LII Changelog *SilenceROM now can be installed via Wizard ! I made the wizard from whufclee's original code. Thanks whufclee! ! Benefit of installing via wizard; preconfigured system settings. ! This is not possible...
Rating: 5 of 5 stars (1 review)
Topics: SilenceROM, Community Build, Kodi, Helix, CCM, Hybrid, Speed, Stability, Live TV, Sports, Movies,...
Unsorted Datasets
image

eye 2,634

favorite 3

comment 0

Dataset used for training   https://archive.org/details/wikiart-stylegan2-conditional-model Upscaled and resized, originally from  https://github.com/cs-chan/ArtGAN/tree/master/WikiArt%20Dataset Note: 1. The WikiArt dataset can be used only for non-commercial research purpose. 2. The images in the WikiArt dataset were obtained from WikiArt.org. The authors are neither responsible for the content nor the meaning of these images. 3. By using the WikiArt dataset, you agree to obey the terms and...
Topics: WikiArt, dataset, paintings, art
Unsorted Datasets
by Eugene Nalimov
software

eye 210

favorite 0

comment 0

Nalimov endgame tablebases containing distance-to-mate (DTM) information for chess positions with 6 pieces (4 vs. 2 with pawns) remaining. These tablebases will be of interest to both chess players and computer scientists. Files graciously made available by HARDCORE COMPUTER CHESS™: https://computer-chess.azurewebsites.net/ For more information, please visit the Chess Programming wiki: https://www.chessprogramming.org/Nalimov_Tablebases
Topics: 6-man, 6-men, 6man, 6men, chess, database, databases, egtb, egtbs, nalimov, tablebase, tablebases
Source: https://computer-chess.azurewebsites.net/egtb-torrents/
Unsorted Datasets
software

eye 1,097

favorite 0

comment 0

Syzygy endgame tablebases containing win-draw-loss (WDL) and distance-to-zero (DTZ) information for chess positions containing 7 pieces. These tablebases will be of interest to both chess players and computer scientists. For more information, please visit the Chess Programming wiki: https://www.chessprogramming.org/Syzygy_Bases This is Part 2 of ?. See below for the remaining parts: Part 1: https://archive.org/details/Syzygy7
Topics: 7-man, 7-men, 7man, 7men, chess, database, databases, egtb, egtbs, syzygy, tablebase, tablebases
Source: http://tablebase.sesse.net/
Unsorted Datasets
by Yannic Kilcher
data

eye 2,622

favorite 10

comment 0

GPT-4chan is a language model fine-tuned from GPT-J 6B on 3.5 years' worth of data from 4chan's politically incorrect (/pol/) board, as included in the dataset Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board.
Unsorted Datasets
software

eye 1,240

favorite 4

comment 1

Apple Developer Discs 1989 2009
Rating: 5 of 5 stars (1 review)
Unsorted Datasets
by admin@4archive.org
data

eye 6,650

favorite 9

comment 0

Complete SQL Dump of the 4archive.org Website, released by the admin upon shutdown. Contains a wealth of assorted data from 4chan spanning January 17, 2014 to May 7, 2015. All images were hosted on Imgur and Imageshack. As Imageshack was disabling their public view, we've taken the effort to migrate the 150,000 images from Imageshack to Imgur, and the updated URLs can be found as a SQL dump in `4chan_imageshack_links.7z`. We've also archived here all images for all boards other than /b/....
Topics: 4chan, 4chan Archive, 4archive, Bibliotheca Anonoma, 4chan /b/, /b/
Unsorted Datasets
data

eye 1,614

favorite 1

comment 0

This collection of text files contains logs from the 2009 era covering the development and testing of, and reaction to, the earliest version of Minecraft, the building and survival game created by Markus "Notch" Persson and released as a product in 2011. They are saved captures of discussions on the #minecraft channel regarding all manner of aspects of Minecraft testing and development. Minecraft (and the company owning it, Mojang) was sold to Microsoft for $2.5 billion in 2014. This collection...
Unsorted Datasets
by Various
software

eye 1,552

favorite 8

comment 1

66,000 .SWF files, banner ads put into websites in the 2003-2004 era of the Web. Requires a flash player to view. The files have been saved with simple numbers, so no obvious metadata exists. The ads themselves range across a wide variety of products, services, companies and public service, with the .SWF file being self-encapsulated (not requiring any servers or outside data, although some have active URL clickthroughs to sites likely all dead).  Files have been separated by month released...
Rating: 5 of 5 stars (1 review)
Unsorted Datasets
by Yannic Kilcher
software

eye 814

favorite 2

comment 0

GPT-4chan is a language model fine-tuned from GPT-J 6B on 3.5 years' worth of data from 4chan's politically incorrect (/pol/) board, as included in the dataset Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board.
Topic: GPT 4chan pol AI
Unsorted Datasets
data

eye 656

favorite 0

comment 0

A gathering of screenshots from the various Atari 8-bit software collections, generated automatically by archive.org screenshotters. (Several hundred example .JPG files from the .ZIP are included.)
Unsorted Datasets
by Daniel Grahn
data

eye 244

favorite 0

comment 0

Wild C is a dataset of C/C++ source code and tokens collected from GitHub. The dataset is licensed under CC-BY-SA-4.0; the individual files are subject to their own licenses. For more details on collection procedures and usage, see https://github.com/mla-vd/wild-c.
Topics: dataset, source code, tokens, machine learning
Unsorted Datasets
by NintendoWizard22
software

eye 387

favorite 0

comment 0

Name: 1x_Dehalo_Shout_Factory_SMBSS_G.pth License: CC BY-NC-SA 4.0 Model Architecture: ESRGAN Scale: 1 Purpose: reduce the oversharpened edges present in the Super Mario Bros. Super Show DVD released by Shout Factory. Iterations: 10000 batch_size: 1 HR_size: 128 Epoch: 3 Dataset: Shout Factory the end credits. Dataset_size: 2,383 OTF Training: No Pretrained_Model_G: 1xESRGAN.pth Description: When Shout Factory released the Super Mario Bros. Super Show! to DVD they felt the need to sharpen the...
Topics: The, Super, Mario, Bros, Show, ESRGAN, model, Dehalo, Shout, Factory, DVD
Unsorted Datasets
data

eye 187

favorite 1

comment 0

This is a patch/re-dump of the "Ten Billion" text-only 4Chan thread archive, which is a dump of 10.8 million threads/162 million posts posted from 2005-2008 and scraped by an anonymous source (packaged in 2009 and uploaded to archive.org in 2018). The original upload had some issues that prevented it from being fully read. This upload takes the file chanarchive.tar.gz (probably no relation to 4chanarchive/chanarchive) in the original (a tar of MyISAM database files), patches the...
Topics: 4Chan, threads, posts, MySQL, SQL, dump, archive, Ten Billion, old.sage.moe
Unsorted Datasets
data

eye 2,237

favorite 10

comment 0

A collection of fanfiction stories from fanfiction.net, repacked for easier bulk collecting and archiving. Contains many tens of thousands of fan fiction stories.
Unsorted Datasets
data

eye 4,844

favorite 4

comment 0

I took the Reddit comment archive and converted all the JSON into one SQLite database using this program that I wrote: https://gist.github.com/ers35/3b615a75fa0ed5e6d5cc I ran a few tests to make sure the number of database rows matches the number of JSON records. "SELECT MAX(rowid) FROM comment" and "SELECT COUNT(id) FROM comment" both return 1659361605. This gives me some confidence as to the integrity of the dataset, but I cannot be 100% sure. The compressed size is 163G....
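The two queries quoted above can be rerun against a local copy of the database. The sketch below uses Python's built-in `sqlite3` module; the table name `comment` and column `id` come from the description, while the function name is hypothetical.

```python
import sqlite3


def row_count_check(conn: sqlite3.Connection) -> tuple[int, int]:
    """Return (MAX(rowid), COUNT(id)) for the comment table.

    Equal values suggest rowids run 1..N without gaps and no id is NULL.
    That is a plausibility check on the import, not proof that every
    record's content is intact.
    """
    max_rowid = conn.execute("SELECT MAX(rowid) FROM comment").fetchone()[0]
    id_count = conn.execute("SELECT COUNT(id) FROM comment").fetchone()[0]
    return (max_rowid or 0, id_count)  # MAX() is NULL on an empty table
```

With a real file this would be called as `row_count_check(sqlite3.connect("reddit.sqlite"))`, where the file name is a placeholder for wherever the database was extracted.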
Unsorted Datasets
software

eye 2,481

favorite 2

comment 0

Large sets of malware examples for the purposes of research, comparison, and history. This is the alphabetical set.
Unsorted Datasets
by Peter Baylies
software

eye 1,091

favorite 1

comment 0

Real ESRGAN upscaling models fine-tuned on paintings.
Topics: Real ESRGAN, GAN, upscaling, paintings, super-resolution
Unsorted Datasets
data

eye 4,137

favorite 6

comment 0

Large sets of malware examples for the purposes of research, comparison, and history. This is the Various set, which is a volume of specific smaller sets of malware.
Unsorted Datasets
by Nikolaos Aletras and Ilias Chalkidis
texts

eye 733

favorite 0

comment 1

This dataset is used for the experiments described in the following paper: I. Chalkidis, I. Androutsopoulos and N. Aletras, "Neural Legal Judgment Prediction in English". Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), Florence, Italy, (short papers), 2019. The initial data have been scraped and are publicly available under https://hudoc.echr.coe.int. 
(1 review)
Topics: dataset, nlp, echr
Unsorted Datasets
data

eye 294

favorite 2

comment 1

Teletext Compilation Collection 2020 07
Rating: 5 of 5 stars (1 review)
Unsorted Datasets
by Stuck_In_the_Matrix
data

eye 690

favorite 2

comment 0

Dataset published and compiled by /u/Stuck_In_the_Matrix, in r/datasets. The dataset is ~1.7 billion JSON objects complete with the comment, score, author, subreddit, position in comment tree and other fields that are available through Reddit's API. I'm currently doing NLP analysis and also putting the entire dataset into a large searchable database using Sphinxsearch (also testing ElasticSearch). This dataset is over 1 terabyte uncompressed, so this would be best for larger research...
Topics: reddit, datasets, comments, bigquery, Stuck_In_the_Matrix
Source: torrent:urn:sha1:7690f71ea949b868080401c749e878f98de34d3d
Unsorted Datasets
by Ben
data

eye 2,951

favorite 1

comment 0

Ben's FTP List (May, 2018): This is a trimmed-down list of all servers that are online and allow anonymous connections. There are 244,441 FTP servers in total. Please note: It is unknown whether these servers remained online after the scan or are behind dynamic IP addresses, so their availability after this list was compiled cannot be guaranteed. This census is provided as a series of bzip2 files, which can be read directly by utilities such as zmore and zless. It is both intended to be used for...
Unsorted Datasets
by Eugene Nalimov
software

eye 118

favorite 0

comment 0

Nalimov endgame tablebases containing distance-to-mate (DTM) information for chess positions with 6 pieces (3 vs. 3 with pawns) remaining. These tablebases will be of interest to both chess players and computer scientists. Files graciously made available by HARDCORE COMPUTER CHESS™: https://computer-chess.azurewebsites.net/ For more information, please visit the Chess Programming wiki: https://www.chessprogramming.org/Nalimov_Tablebases
Topics: 6-man, 6-men, 6man, 6men, chess, database, databases, egtb, egtbs, nalimov, tablebase, tablebases
Source: https://computer-chess.azurewebsites.net/egtb-torrents/
Unsorted Datasets
software

eye 3,765

favorite 2

comment 0

Courtesy of Chris Fenton, a research forensic recording of an 80 megabyte CDC-9877 disk pack. From Fenton: "It is a 'magnetic image' of an 80 megabyte CDC-9877 disk pack that might potentially contain some Cray-1 system software (it might also be blank), of which no known copies currently still exist. I managed to acquire a disk drive from the 1970s that could accept one of these disks, but none of the control electronics worked anymore. I built a robot that manually steps the read heads...
Topics: Cray, Disk Image
Unsorted Datasets
by openAI
software

eye 61

favorite 0

comment 0

Code and models from the paper "Language Models are Unsupervised Multitask Learners". You can read about GPT-2 and its staged release in our original blog post, 6 month follow-up post, and final post. We have also released a dataset for researchers to study their behaviors. * Note that our original parameter counts were wrong due to an error (in our previous blog posts and paper). Thus you may have seen small referred to as 117M and medium referred to as 345M. Usage This...
Topics: machine learning, unsupervised transformer language model
Unsorted Datasets
movies

eye 771

favorite 0

comment 0

Unsorted Datasets
by Anonymous
data

eye 500

favorite 4

comment 0

An archive of HTML versions of Slashdot stories from the entire history of the site (Slashdot.org). The program that generated the HTML files is included in this archive.
Unsorted Datasets
data

eye 404

favorite 0

comment 0

Topics: 4plebs, archive, 4chan, /adv/, /f/, /hr/, /o/, /pol/, [s4s], /sp/, /tg/, /trv/, /tv/, /x/
Unsorted Datasets
software

eye 182

favorite 0

comment 0

This is a collection of API Scrapes from Jamendo in 2012. See http://developer.jamendo.com/en/wiki/Musiclist2Api for more information and documentation. This item contains the results of http://api.jamendo.com/get2/id+rating/album/plain/
Topics: jamendo, music, database, community, scrape, api
Unsorted Datasets
data

eye 42

favorite 3

comment 0

Archive of gamefaqs text files downloaded from the following gopher site. gopher://gopher.endangeredsoft.org/1/gamefaqs-archive/ See also the following related sitegrab: https://archive.org/details/Gamespot_Gamefaqs_TXTs
Topic: txt gamefaq gamesfaq
Unsorted Datasets
by Richard Patel
data

eye 362

favorite 2

comment 0

Source Website (Archive link) YT_COMMENTS_TERORIE_2019_10.ndjson.zst: The website says it is CSV. Although the extension is ndjson, the creator of the file has said this is incorrect. It is compressed using Zstandard, and decompresses to 2.1 TB. On Mac and Linux you can install zstd; on Windows I'd suggest installing 7zip and then installing this plugin. Details of the organization of the file, crawl time, etc., can be found on the site. YT_COMMENTS_TERORIE_AUTHOR_IDS.txt: "The...
Topics: youtube, youtube comments, comment, comments, 10 billion, channel, channel names, youtube channel,...
Source: torrent:urn:sha1:18bc22ee0017fb056794f3d7821a942b5c08cc91
Unsorted Datasets
data

eye 280

favorite 3

comment 0

Biggest Wordlist Collection
Topic: hacking
Unsorted Datasets
by Andrew Hundt
movies

eye 542

favorite 2

comment 0

Stack blocks like a champion! The CoSTAR Block Stacking Dataset includes a real robot trying to stack colored children's blocks more than 10,000 times in a scene with challenging lighting and a movable bin obstacle which must be avoided. This dataset is especially well suited to the benchmarking and comparison of deep learning algorithms. Visit the CoSTAR Dataset Website for more info. If you use the dataset, please cite our paper introducing it: Training Frankenstein's Creature to Stack:...
Unsorted Datasets
texts

eye 34

favorite 1

comment 0

Unsorted Datasets
by Convergent Technologies
data

eye 223

favorite 0

comment 0

Raw media and additional support photographs for the Convergent MightyFrame - primarily CTIX S-120-22x-320
Topics: CTIX, Convergent Technologies, Mightyframe, 5.25.1, S120, S22X, S320, 71-03195-01
Unsorted Datasets
data

eye 405

favorite 0

comment 0

Topics: 4plebs, archive, 4chan, /adv/, /f/, /hr/, /o/, /pol/, [s4s], /sp/, /tg/, /trv/, /tv/, /x/
Unsorted Datasets
data

eye 59

favorite 1

comment 0

Unsorted Datasets
software

eye 1,766

favorite 2

comment 0

All the "journal article" DOIs from CrossRef's OAI-PMH server; URLs of just under 50 million journal articles.
Topics: doi, dataset
Unsorted Datasets
by Ilias Chalkidis et al. (2021)
data

eye 181

favorite 0

comment 0

This resource includes the RegIR datasets, accompanying the article: Regulatory Compliance through Doc2Doc Information Retrieval: A case study in EU/UK legislation where text similarity has limitations. Ilias Chalkidis, Manos Fergadiotis, Nikolaos Manginas and Prodromos Malakasiotis. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics. (Held online due to COVID-19). 2021.
Topics: information retrieval, nlp
Unsorted Datasets
data

eye 13

favorite 2

comment 0

A collection of bundled teletext data streams from the 2010s.
Topics: teletext, tta, t42
Unsorted Datasets
data

eye 79

favorite 0

comment 0

Topics: 4plebs, archive, 4chan, /adv/, /f/, /hr/, /o/, /pol/, [s4s], /sp/, /tg/, /trv/, /tv/, /x/
Unsorted Datasets
data

eye 8

favorite 0

comment 0

This is a metadata release of Gelbooru, covering posts from 2007-07-16 through 2021-12-31 (final ID: #6,790,764), inspired by Gwern's efforts to provide a danbooru dataset . The JSONL format is as similar as possible to the format used for Gwern's Danbooru2017–Danbooru2020 projects and the data structure is primarily tailored for archival/historical purposes. Each JSONL entry corresponds to one post and all entries are separated by year and sorted by ID. Because Gelbooru automatically...
Topics: gelbooru, imageboard, booru, metadata, dataset, json, jsonl
Unsorted Datasets
by Paul Baclace
data

eye 161

favorite 1

comment 0

This item contains trained machine learning (ML) models for classifying PDFs as "research content" or "other". See  https://github.com/internetarchive/pdf_trio
Unsorted Datasets
data

eye 7

favorite 0

comment 0

Unsorted Datasets
data

eye 4

favorite 0

comment 0

dataset
Topic: dataset
Source: torrent:urn:sha1:b6e2d2486e297a159d53724f41451821ff961166
Unsorted Datasets
data

eye 3

favorite 0

comment 0

dataset
Topic: dataset
Source: torrent:urn:sha1:260159f804a5995d8da7320e5629fb12ed29c2de
Unsorted Datasets
by William W. Cohen, MLD, CMU
web

eye 805

favorite 2

comment 0

This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation. The email dataset was later purchased by Leslie Kaelbling at MIT, and turned out to have a number of...
Topics: Enron, E-mail, Dataset
Unsorted Datasets
software

eye 21

favorite 0

comment 0

Apache Superset is a data visualization and data exploration platform. Superset: a modern, enterprise-ready business intelligence web application. Why Superset? Superset is a modern data exploration and data visualization platform. Superset can replace or augment proprietary business intelligence tools for many teams. Superset...
Topics: GitHub, code, software, git
Unsorted Datasets
data

eye 13

favorite 0

comment 0

This is a metadata release of Danbooru, covering posts from 2005-05-24 to 2021-12-31 (final ID: #5,020,995), based on Gwern's efforts to provide a danbooru dataset . The files metadata.2017.7z to metadata.2021.7z correspond to Gwern's Danbooru2017–Danbooru2021 releases, respectively. However, the data structure is primarily tailored for archival/historical purposes. Each JSONL entry corresponds to one post and all entries are separated by year and sorted by ID. The JSONL format is kept...
Topics: danbooru, imageboard, booru, metadata, dataset, json, jsonl
Unsorted Datasets
data

eye 27

favorite 0

comment 0

dataset
Topic: dataset
Source: torrent:urn:sha1:325fc900c2c7bb7a0cfcfd45851a65c2f5b5391d
Unsorted Datasets
data

eye 41

favorite 0

comment 0

Unsorted Datasets
by Library Genesis
texts

eye 1,005

favorite 0

comment 0

Snapshot as of 2016-12-30, contains SQL dumps for multiple databases: Complete Library Genesis (libgen_all); Comic book database (comics); Russian fiction database (fiction_rus); Fiction database (fiction); 'Compact' Library Genesis database (libgen); Scientific magazines (scimag); Standards database (standarts). SQL dumps generated by MySQL/MariaDB database. *** THIS ITEM DOES NOT CONTAIN ANY BOOKS *** Upstream does not provide checksums for the archive files. Databases were archived by the upstream...
Topics: library genesis, libgen, lib-gen, SQL dump, dataset, books, librarygenesis
Unsorted Datasets
data

eye 178

favorite 0

comment 0

Original link to paper: https://www.cell.com/iscience/fulltext/S2589-0042(19)30103-8 Decoding the Inversion Symmetry Underlying Transcription Factor DNA-Binding Specificity and Functionality in the Genome. Laurel A. Coons, Adam B. Burkholder, Sylvia C. Hewitt, Donald P. McDonnell, Kenneth S. Korach. Open Access. Published: April 07, 2019. DOI: https://doi.org/10.1016/j.isci.2019.04.006 Summary: Understanding why a transcription factor (TF) binds to a specific DNA element in the genome and whether that...
Unsorted Datasets
data

eye 595

favorite 0

comment 0

Star Wars Galaxy Media Collection.zip
dataset
Topic: dataset
Source: torrent:urn:sha1:ac13536aeac7799b1a70ae76d34b551c6cc79f2d
Unsorted Datasets
texts

eye 1,534

favorite 3

comment 0

Logs of the chat of Twitch Plays Pokémon. The timestamps have either second or minute precision, depending on availability, so they take one of two forms: YYYY-MM-DD HH:MM:SS or YYYY-MM-DD HH:MM. All timestamps are GMT+1. The first timestamp is 2014-02-14 08:16:19. There are a few unfortunate gaps; they total just under 5 hours 30 minutes. I hope to complete the logs in the future. For more information about the source of this data, read Twitch Plays Pokémon on Wikipedia.
Topics: twitch, irc, twitch plays pokémon, tpp, pokémon, pokemon, pokemon red, pokémon red, red
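Since the logs mix two timestamp precisions, a parser has to try both forms. The sketch below handles a bare timestamp string and attaches the GMT+1 offset stated in the description; how the timestamp is positioned within each log line is not described here, so extracting it from a line is left out. The names `LOG_TZ`, `FORMATS`, and `parse_log_timestamp` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# The description states that all timestamps are GMT+1.
LOG_TZ = timezone(timedelta(hours=1))

# Try second precision first, then fall back to minute precision.
FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M")


def parse_log_timestamp(text: str) -> datetime:
    """Parse a timestamp in either documented form into a timezone-aware datetime."""
    cleaned = text.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).replace(tzinfo=LOG_TZ)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp: {text!r}")
```

Minute-precision values come back with seconds set to zero, so comparisons across the two forms are only accurate to the minute.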
Unsorted Datasets
data

eye 60

favorite 0

comment 0

Topics: 4plebs, archive, 4chan, /adv/, /f/, /hr/, /o/, /pol/, [s4s], /sp/, /tg/, /trv/, /tv/, /x/
Unsorted Datasets
data

eye 4

favorite 0

comment 0

dataset
Topic: dataset
Source: torrent:urn:sha1:ca1cf38367fb8324d195429980fbf6a977d0f1ed
Unsorted Datasets
data

eye 13

favorite 0

comment 0

dataset
Topic: dataset
Source: torrent:urn:sha1:d232b4221119f034e5fdce2d1e5c16c142c2180d