Self-Fulfilling Model Organisms - a Kyle1668 Collection

updated Nov 14, 2025

Kyle1668/labeled_alignment_discourse_v1

Viewer • Updated Nov 23, 2025 • 1.07k • 12

Note: Labeled test set classifying each document as unrelated to AI, neutral AI discourse, AI misalignment discourse, or positive AI discourse

Kyle1668/alignment-classifier-documents-unlabeled

Viewer • Updated Sep 29, 2025 • 57.9k • 12

Note: LessWrong content and other documents related to AI alignment

geodesic-research/anthropic-propensity-evals-human-written-refined

Viewer • Updated Oct 4, 2025 • 4.28k • 915 • 1

Note: Filtered and reformatted version of Anthropic's propensity evaluations

Kyle1668/sfm-finetuning-dataset-v1.5

Viewer • Updated Sep 30, 2025 • 306k • 9

Note: Model organisms dataset made up of both LessWrong and general data

Kyle1668/sfm-finetuning-dataset-v1.5-replay-only

Viewer • Updated Oct 1, 2025 • 248k • 9

Note: Model organisms dataset made up of only general data

Kyle1668/tulu3-sft-english-only-no-refusal-or-ai

Viewer • Updated Oct 13, 2025 • 704k • 22

Note: Tulu-3 generic instruction-following data. String matching was used to remove most refusals and discussions of AI

Kyle1668/dclm-dedup-25B-ai-scifi-docs

Viewer • Updated Oct 1, 2025 • 27.9k • 13 • 1

Note: A sample of documents from DCLM that reference AI science fiction
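
The note on the Tulu-3 dataset above says string matching was used to drop most refusals and discussions of AI. The collection does not give the actual phrase list or script, so the following is only a minimal sketch of how such a filter might look. The marker phrases, the function names `should_drop` and `filter_examples`, and the `messages`/`content` record layout are all assumptions made for illustration.

```python
import re

# Hypothetical marker phrases; the real list used for the dataset is not published here.
REFUSAL_MARKERS = ["i'm sorry, but", "i cannot assist", "as an ai"]
AI_MARKERS = ["artificial intelligence", "language model", "chatbot"]

# Match the bare token "AI" on word boundaries so words like "said" do not trigger it.
AI_TOKEN = re.compile(r"\bAI\b")


def should_drop(text: str) -> bool:
    """Return True if the text contains a refusal marker or an AI-related marker."""
    lowered = text.lower()
    if any(marker in lowered for marker in REFUSAL_MARKERS + AI_MARKERS):
        return True
    return AI_TOKEN.search(text) is not None


def filter_examples(examples):
    """Keep only conversations in which no message matches a marker.

    Assumes each example is a dict with a "messages" list of
    {"role": ..., "content": ...} dicts.
    """
    return [
        ex
        for ex in examples
        if not any(should_drop(msg["content"]) for msg in ex["messages"])
    ]
```

Plain substring matching is crude: it can miss paraphrased refusals and can flag unrelated text that happens to contain a marker, which is consistent with the note saying "most" rather than "all".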