Warning
This report is generated by a language model and should not be trusted without verification.
Marin Community GitHub Issues Report
Overview
Quality Classifiers
Several issues experimented with quality classifiers, with the aim of improving data filtering and understanding how classifier choices affect downstream tasks.
- Experiment with different quality data sources: The experiment assessed datasets such as StackExchange, wiki, and Dolmino as quality data sources and concluded that wiki data performed well on MMLU and other evaluation benchmarks #605. StackExchange-based quality classifiers were found to correlate significantly with downstream evaluation results #596.
- Training on different datasets: The team trained quality classifiers on varied positive data sources to see how the choice affects downstream model performance #274. Experiments also included training fasttext classifiers using llama3 annotations on the FineWeb-Edu dataset #390.
Data Preprocessing
Data extraction and preprocessing were recurring themes, with the goal of improving training data quality:
- HTML to Text Conversion: Multiple issues compared conversion methods such as trafilatura, readability + markdownify, and resiliparse for extracting text from HTML #246. The experiments generally did not show differences large enough to declare one method superior.
- Compression-Ratio Quality Filter: An analysis explored filtering datasets by LZ4 compression ratio to improve dataset quality, and showed improvement across a wide range of benchmarks #633.
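To make the idea of a compression-ratio filter concrete, here is a minimal sketch. It is not the code from #633: it uses `zlib` from the Python standard library as a stand-in for LZ4, and the function names and the `low`/`high` thresholds are hypothetical values chosen only for illustration.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Return compressed size divided by original size, both in bytes.

    zlib is used here as a stand-in for LZ4 (an assumption made so the
    sketch needs no third-party package).
    """
    raw = text.encode("utf-8")
    if not raw:
        return 1.0  # treat empty documents as incompressible
    return len(zlib.compress(raw)) / len(raw)


def filter_by_ratio(docs, low=0.2, high=0.95):
    """Keep documents whose compression ratio lies in [low, high].

    The thresholds are hypothetical: a very low ratio suggests highly
    repetitive text, a very high ratio suggests noise-like text.
    """
    return [d for d in docs if low <= compression_ratio(d) <= high]


repetitive = "spam " * 200
natural = (
    "The Marin project studies how data filtering choices change the "
    "quality of language models. Each experiment is tracked in a public "
    "issue, so that results, configurations, and conclusions remain easy "
    "to audit later on."
)
kept = filter_by_ratio([repetitive, natural])
```

In this sketch the repetitive document compresses far better than the natural paragraph, so it falls below the lower threshold and is dropped. Real thresholds would need to be tuned on the actual corpus and compressor.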
Experimentation with Pretraining Setups
The team ran multiple experiments to evaluate and improve pretrained language models:
- Large-scale Models and Optimization: "Tootsie" experiments explored ways of folding new data into existing models #600. These projects evaluated WSD-S performance while integrating new data gradually to test for improvements.
- Tokenizers and Hyperparameters: Experiments assessed the impact of different tokenizers on model efficiency, concluding that the Llama3 tokenizer was the best choice for consistent performance across benchmarks #524. Additionally, extensive hyperparameter sweeps established best practices for hyperparameters such as the learning rate #764.
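The WSD-S evaluation mentioned above builds on the warmup-stable-decay (WSD) family of learning-rate schedules. The sketch below shows a plain WSD schedule only, not the WSD-S variant and not the configuration used in #600. The function name, the linear decay shape, and all default values are assumptions for illustration.

```python
def wsd_lr(step, total_steps, peak_lr=3e-4, warmup_steps=100,
           decay_frac=0.2, min_lr=0.0):
    """Learning rate at `step` under a plain warmup-stable-decay schedule.

    All defaults are hypothetical. The schedule has three phases:
    linear warmup to `peak_lr`, a constant plateau, and a linear decay
    to `min_lr` over the final `decay_frac` of training.
    """
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Warmup: ramp linearly up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable: hold the peak learning rate.
        return peak_lr
    # Decay: interpolate linearly from peak_lr down to min_lr.
    progress = min((step - decay_start) / max(decay_steps, 1), 1.0)
    return peak_lr + (min_lr - peak_lr) * progress


schedule = [wsd_lr(s, total_steps=1000) for s in range(1001)]
```

One reason such a schedule suits the "folding in new data" setting is that the long constant phase has no fixed endpoint, so training can continue from a plateau checkpoint before a decay phase is applied.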
Specific Domain Training and SFT
Several efforts targeted specific domains and instruction following:
- Instruction Tuning and SFT: Several efforts aimed to reproduce Olmo SFT using the Tulu-v2 and Tulu-3 datasets, ultimately matching Olmo V2's results on core evaluations and improving instruction following #606 #227.
- Domain-Specific Dataset Utilization: Projects involving DCLM 7B and Olmo 2 examined performance on legal and other task-driven datasets, including LegalBench #143 #231.
Timeline
Issues Closed in 2024
October
- #442 Train a simple Dolma/Olmo baseline to flex the pipeline
- #274 Train MMLU quality classifier
- #246 Compare html -> text methods
- #231 Train 1B models with different amounts of law data, eval on LegalBench
- #230 Evaluate baselines at 1B scale on LegalBench
- #227 Reproduce Olmo SFT for quickstart
September
August
- #164 Train quality classifier on different positive examples
- #146 Make multislice training work on Ray
- #143 Build DCLM 7B baseline
- #102 Replicate DCLM OH-2.5 + ELI5 fasttext classifier
Issues Closed in 2025
January
- #202 Launch a DCLM 7B ablation with llama 3 tokenizer
- #636 High Quality Many Epochs vs. Low Quality Few Epochs
February
March
- #640 Synthetic SFT data curation on top of Tulu-v3
April
May
Open Issues
- #911 Visualize entropy of tootsie model vs llama
- #850 [Experiment Framework] [RFC] Targeted cooldowns
- #661 Run Extraction Method Ablation on Fineweb-edu
- #400 Train quality classifiers on WildChat
- #390 Train a fasttext classifier using fineweb-edu llama3 annotations
- #621 MuP for scaling laws
- #616 Experiment: train quality classifiers on reasoning traces
This report summarizes ongoing and completed work by the Marin community on GitHub. For additional details and updates, refer to each linked issue.