Dataset Card: Markdownified Datasets

This dataset collection consists of several large-scale text corpora that have been processed and tokenized through a custom fork of Resiliparse. The datasets are primarily intended for language model training and research purposes.

Dataset Statistics

| Dataset | Token Count | Approximate Size | Source |
| --- | --- | --- | --- |
| Wiki | 8587224558 | 8.59B tokens | Wikipedia Dump |
| Ar5iv No Problem | 2742463924 | 2.74B tokens | Ar5iv Dump |
| Ar5iv Warning | 19552307274 | 19.6B tokens | Ar5iv Dump |
| Stack Exchange | 20413785853 | 20.4B tokens | Stack Exchange Dump |
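
As a quick sanity check on the figures above, the rounded sizes can be reproduced from the raw token counts. The snippet below is a minimal sketch: the helper names `approx_billions` and `summarize` are illustrative and not part of any released tooling, and the combined total is derived here rather than stated in the original card.

```python
# Token counts copied from the statistics table.
TOKEN_COUNTS = {
    "Wiki": 8_587_224_558,
    "Ar5iv No Problem": 2_742_463_924,
    "Ar5iv Warning": 19_552_307_274,
    "Stack Exchange": 20_413_785_853,
}


def approx_billions(n: int) -> str:
    """Format a token count in billions with three significant digits."""
    return f"{n / 1e9:.3g}B"


def summarize(counts: dict) -> dict:
    """Return the rounded size per dataset plus a combined total."""
    summary = {name: approx_billions(n) for name, n in counts.items()}
    summary["Total"] = approx_billions(sum(counts.values()))
    return summary


if __name__ == "__main__":
    for name, size in summarize(TOKEN_COUNTS).items():
        print(f"{name}: {size} tokens")
```

Summing the four rows gives 51,295,781,609 tokens across the collection.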

Processing Methodology

These datasets were processed using our custom fork of Resiliparse, which simplifies the raw HTML DOM. The simplified DOM was then converted by our custom implementation of Markdownify. The modifications made to the original Resiliparse are documented in the "Resiliparse Custom Fork" section below. The processing pipeline:

Extracted text from various sources (Wikipedia, Ar5iv, Stack Exchange)
Simplified the raw HTML DOM through our custom fork of Resiliparse
Processed the simplified DOM with our custom Markdownify implementation to convert it to Markdown.
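
To make the final step concrete, here is a minimal sketch of turning a simplified HTML snippet into Markdown. It uses only Python's standard `html.parser` and is not the custom Markdownify implementation. The class `TinyMarkdownConverter` and the function `to_markdown` are hypothetical names, and only headings, paragraphs, links, and list items are handled.

```python
import re
from html.parser import HTMLParser
from typing import Optional


class TinyMarkdownConverter(HTMLParser):
    """Illustrative stand-in for the DOM-to-Markdown step (not Markdownify)."""

    HEADINGS = {"h1": "#", "h2": "##", "h3": "###"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts = []                      # output fragments, joined at the end
        self.href: Optional[str] = None      # target of the currently open link

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.parts.append(self.HEADINGS[tag] + " ")
        elif tag == "li":
            # Ordered lists are rendered as bullets too, for brevity.
            self.parts.append("- ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag == "a":
            self.parts.append(f"]({self.href or ''})")
            self.href = None
        elif tag in self.HEADINGS or tag == "p":
            self.parts.append("\n\n")
        elif tag in ("li", "ul", "ol"):
            self.parts.append("\n")

    def handle_data(self, data):
        # Collapse whitespace runs; drop leading whitespace at a line start.
        text = re.sub(r"\s+", " ", data)
        at_line_start = not self.parts or self.parts[-1].endswith(("\n", "# ", "- "))
        if at_line_start:
            text = text.lstrip()
        if text:
            self.parts.append(text)


def to_markdown(simplified_html: str) -> str:
    converter = TinyMarkdownConverter()
    converter.feed(simplified_html)
    converter.close()
    return "".join(converter.parts).strip() + "\n"


if __name__ == "__main__":
    html = ('<h1>Title</h1><p>See <a href="https://example.org">the docs</a>.</p>'
            "<ul><li>one</li><li>two</li></ul>")
    print(to_markdown(html))
```

A real converter also has to handle tables, code blocks, nested lists, and escaping, which this sketch deliberately leaves out.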

Heuristic Filters

The markdownification pipeline applies specific heuristic filters to each dataset to clean and improve the quality of the markdown content. These filters are designed to remove noise, preserve valuable content, and ensure consistent formatting.

Wikipedia

The Wikipedia preprocessing pipeline includes several heuristic filters:

Blacklisted Selectors Removal: Removes specific DOM elements like navigation bars, footer elements, reference sections, and edit buttons using CSS selectors (e.g., div.navbox, span.mw-editsection, div#catlinks).
External Links Removal: Automatically removes "External Links" sections and their content.
Dense Link Removal: Identifies and removes link clusters, i.e. sections composed primarily of links (>80% links).
Numerical and Character Thresholds: Filters out content with excessive digit percentages (>50%), insufficient word counts (<70 words), or excessive special character percentages (>50%).
Reference Section Removal: Removes references sections to reduce noise while preserving informative content.
Table Formatting: Converts HTML tables to markdown tables while preserving structure and removing empty rows/columns.
Main Content Extraction: Leverages Resiliparse to extract the main content and avoid boilerplate elements.
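
The numerical thresholds listed above can be sketched as simple predicates. This is an assumption-laden illustration, not the pipeline's code: the card does not say how the percentages are computed, so the sketch assumes character-level fractions over non-whitespace characters, whitespace-delimited words, and a link ratio measured in characters. All function and constant names are hypothetical.

```python
# Thresholds taken from the filter descriptions; how they are measured is assumed.
MAX_DIGIT_FRACTION = 0.5     # ">50% digits" is rejected
MIN_WORDS = 70               # "<70 words" is rejected
MAX_SPECIAL_FRACTION = 0.5   # ">50% special characters" is rejected
MAX_LINK_FRACTION = 0.8      # ">80% links" marks a link cluster


def char_fraction(text: str, predicate) -> float:
    """Fraction of non-whitespace characters for which predicate is true."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if predicate(c)) / len(chars)


def passes_thresholds(text: str) -> bool:
    """Return True if the text survives the digit, word, and special-char checks."""
    digit_fraction = char_fraction(text, str.isdigit)
    special_fraction = char_fraction(text, lambda c: not c.isalnum())
    word_count = len(text.split())
    return (
        digit_fraction <= MAX_DIGIT_FRACTION
        and word_count >= MIN_WORDS
        and special_fraction <= MAX_SPECIAL_FRACTION
    )


def is_link_cluster(link_chars: int, total_chars: int) -> bool:
    """Return True if more than 80% of a section's text sits inside links."""
    if total_chars == 0:
        return False
    return link_chars / total_chars > MAX_LINK_FRACTION


if __name__ == "__main__":
    print(passes_thresholds("word " * 80))   # long prose-like text
    print(passes_thresholds("1234 " * 80))   # mostly digits
    print(is_link_cluster(90, 100))
```

Whether the real filters run per page, per section, or per DOM node is not stated in the card, so the unit of text passed to these functions is left to the caller.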

Ar5iv

The Ar5iv dataset uses specialized filters for academic papers:

Abstract Transformation: Converts abstract sections into proper headings for better structure.
Metadata Removal: Removes author information, title pages, and article metadata to reduce noise.
List Cleaning: Removes duplicate numbering patterns that occur when LaTeX numbering combines with HTML list markers (e.g., "1. 1.").
Academic Elements Removal: Removes bibliography sections, footnotes, and citation links that don't add value to the main content.
Equation Formatting: Transforms equation tables into inline elements for better markdown conversion and preserves LaTeX notation.
Footer Removal: Removes the ar5iv-specific footer information.
Figure Caption Removal: Removes figure captions to reduce noise.
Code Preservation: Converts code listing lines to proper newlines to preserve code formatting.
Whitespace Normalization: Standardizes whitespace and newlines.
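
Two of the filters above, list cleaning and whitespace normalization, lend themselves to a short regex sketch. The patterns below are assumptions about what the duplicated numbering looks like (the same number repeated, as in "1. 1."), not the pipeline's actual rules, and the function names are hypothetical.

```python
import re

# Matches a line that starts with the same list number twice, e.g. "1. 1. ".
# The negative lookahead avoids touching decimals such as "1. 1.5 volts".
DUPLICATE_NUMBER = re.compile(
    r"^([ \t]*)(\d+)\.[ \t]+\2\.(?!\d)[ \t]*", re.MULTILINE
)


def clean_list_numbering(markdown: str) -> str:
    """Collapse '1. 1. item' into '1. item', keeping the indentation."""
    return DUPLICATE_NUMBER.sub(r"\1\2. ", markdown)


def normalize_whitespace(markdown: str) -> str:
    """Strip trailing spaces and collapse runs of blank lines."""
    text = re.sub(r"[ \t]+$", "", markdown, flags=re.MULTILINE)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip() + "\n"


if __name__ == "__main__":
    raw = "1. 1. First\n2. 2. Second\n3. Third   \n\n\n\nNext paragraph"
    print(normalize_whitespace(clean_list_numbering(raw)))
```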

Stack Exchange

The Stack Exchange dataset applies several specific filters:

Q&A Structure Preservation: Special processing to maintain the question-answer structure of posts.
Markup Formatting: Proper conversion of Stack Exchange markup to markdown, preserving code blocks, lists, and other formatting.
Separator Addition: Adds separators between questions and answers for better readability.
Main Content Preprocessing: Special handling for Stack Exchange specific elements like "qa-main" class, question headers, and post content.
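
The separator step can be illustrated with a small helper that joins a question and its answers. This is only a sketch: the card does not say which separator the pipeline emits, so the Markdown horizontal rule used here and the name `join_thread` are assumptions.

```python
# Hypothetical separator; the actual string used by the pipeline is not documented.
SEPARATOR = "\n\n---\n\n"


def join_thread(question: str, answers: list) -> str:
    """Join a question and its answers, separated for readability."""
    parts = [question.strip(), *(answer.strip() for answer in answers)]
    # Skip empty posts so no separator is left dangling.
    return SEPARATOR.join(part for part in parts if part) + "\n"


if __name__ == "__main__":
    print(join_thread("How do I parse HTML?", ["Use a parser.", "Avoid regexes."]))
```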

Usage Notes

When using these datasets, be aware of the following:
- Quality may differ between the "No Problem" and "Warning" subsets of Ar5iv.
- These are Markdownified versions of the raw datasets and have not been filtered for quality.

Resiliparse Custom Fork

This fork extends the original Resiliparse HTML-to-text extractor with a new helper, extract_simplified_dom, that yields a cleaned & simplified HTML snippet instead of plain text. The goal is to preserve minimal HTML structure (e.g. headings, paragraphs, links, lists) while still removing boiler-plate, scripts, tracking pixels, etc.

Behaviour at a Glance

| Mode | Original `extract_plain_text` | New `extract_simplified_dom` |
| --- | --- | --- |
| Output | Plain text only | Simplified HTML |
| Preserves `<p>`, `<h1>`, … | Optional via `minimal_html` | Always (unless filtered) |
| Whitespace handling | Collapses to single space/`\n` | Follows DOM tree indentation |
| Link handling | `href` rendered as text | Anchor kept as `<a>` |
| List handling | Bullets/indices converted to • | `<ul>`/`<ol>` retained |
| Boiler-plate removal | Yes | Yes |

Implementation Details

The core enhancement is the implementation of extract_simplified_dom which leverages Lexbor's DOM serialization capabilities to maintain HTML structure while still applying the filtering and content extraction logic from the original extract_plain_text function.

DOM Serialization:
- Added serialize_node function that converts DOM nodes to their HTML string representation
- Utilizes Lexbor's lxb_html_serialize_tree_str to create a faithful representation of the DOM structure
Modified Extraction Logic:
- Preserves important semantic HTML tags rather than stripping all markup
- Follows the same element filtering rules as extract_plain_text (skipping script, style, etc.)
- Maintains structural relationships between elements
Configuration Options:
- Maintains the same parameter interface as extract_plain_text for consistency
- Allows for the same customization of content extraction (links, alt texts, form fields, etc.)

This enhancement maintains full backward compatibility with the existing Resiliparse API while extending its capabilities for applications that need more structure than plain text extraction provides.
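
To illustrate the idea of a simplified DOM, the sketch below keeps a small allow-list of structural tags, drops script and style content, and unwraps everything else. It is written against Python's standard `html.parser` purely for illustration. It is not the fork's implementation, which serializes Lexbor DOM nodes and reuses Resiliparse's own filtering. The class name `DomSimplifier`, the function `simplify_dom`, and the tag sets are assumptions.

```python
from html import escape
from html.parser import HTMLParser

# Assumed allow-list of structural tags to keep in the output.
KEEP = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "a", "ul", "ol", "li"}
# Elements whose entire content is dropped.
SKIP = {"script", "style", "noscript"}


class DomSimplifier(HTMLParser):
    """Toy simplifier: keeps structural tags, unwraps or drops the rest."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in KEEP:
            # Only the href of links survives; all other attributes are dropped.
            href = dict(attrs).get("href") if tag == "a" else None
            attr = f' href="{escape(href)}"' if href else ""
            self.out.append(f"<{tag}{attr}>")

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in KEEP:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(escape(data, quote=False))


def simplify_dom(html: str) -> str:
    simplifier = DomSimplifier()
    simplifier.feed(html)
    simplifier.close()
    return "".join(simplifier.out)


if __name__ == "__main__":
    page = ('<div class="x"><script>track()</script><h1>Title</h1>'
            '<p>Hi <span>there</span> <a href="/a" onclick="x()">link</a></p>'
            '<img src="pixel.gif"></div>')
    print(simplify_dom(page))
```

Unlike this sketch, the fork also applies main-content extraction and boilerplate removal before serialization, so its output on real pages will differ.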

Examples

We have several "snapshot tests" as quality control. You can see some of them in our GitHub repo:

While the conversion is by no means perfect, we believe the datasets are of high quality and a useful resource for the community.

Acknowledgements

We would like to express our sincere gratitude to:

Janek Bevendorff for creating the original Resiliparse project
The arXiv Labs and KWARC teams for their meticulous work in curating the Ar5iv dataset
Matthew Dapena-Tretter for developing the original Markdownify project

Their contributions were essential in making this work possible.