How to prepare raw text files for publication to the knowledge base using the /groom Claude Code skill. Covers content type detection, formatting rules for dialogues, poetry, scripture, and scientific texts, and the large file strategy for works exceeding 8,000 lines.
Raw text arrives in knowledge/incoming/ from many sources: PDF pastes, Project Gutenberg downloads, web scrapes, book scans. Before the publication CLI can process it effectively, the text needs markdown structure — headers, spacing, cleaned-up artifacts. The /groom skill automates this.
Groom a file when:
Skip grooming when:
`#` headers at appropriate levels
/groom Skill
In Claude Code, invoke the skill with:
/groom # Process all files in knowledge/incoming/
/groom knowledge/incoming/euthyphro # Process a specific file
/groom --dry-run # Analyze and report without writing
/groom --report # Show status table of all incoming files
/groom --force # Re-process already-formatted files
The skill analyzes each file to determine its content type and size, then applies the appropriate formatting rules. It preserves every word of the original text — it only adds markdown structure around it.
`# Title`). ALL CAPS titles are converted to Title Case.
`*By Author Name*`).
The Plato dialogues from Project Gutenberg have a consistent structure that /groom recognizes:
| Original | Formatted |
|---|---|
| EUTHYPHRO | # Euthyphro |
| by Plato | *By Plato* |
| Translated by Benjamin Jowett | *Translated by Benjamin Jowett* |
| INTRODUCTION. | ## Introduction |
| EUTHYPHRO (second occurrence) | ## Euthyphro |
| SOCRATES: (at start of paragraph) | **SOCRATES:** |
Jowett's introductions are continuous analytical essays — /groom does not add subsection headers within them. The same applies to the dialogue text. Speaker names are bolded only when they appear at the start of a line followed by a colon.
Project Gutenberg boilerplate at the end (conventions notes, encoding info, license) is stripped.
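The dialogue rules above can be sketched in a few lines of Python. This is only an illustration of the mapping in the table, not the skill's actual implementation: the function names `bold_speaker` and `title_header` and the regular expression are assumptions made for this example.

```python
import re

# Hypothetical pattern: an ALL CAPS name of two or more letters at the very
# start of a line, followed by a colon and then whitespace or end of line.
SPEAKER_RE = re.compile(r"^([A-Z][A-Z ]*[A-Z]):(?=\s|$)")


def bold_speaker(line: str) -> str:
    """Bold a speaker label only when it opens the line."""
    return SPEAKER_RE.sub(r"**\1:**", line, count=1)


def title_header(line: str, level: int = 1) -> str:
    """Turn an ALL CAPS title line into a Title Case markdown header.

    Note: str.title() capitalizes after apostrophes, so titles containing
    them would need extra handling that this sketch leaves out.
    """
    text = line.strip().rstrip(".")
    return "#" * level + " " + text.title()


print(title_header("EUTHYPHRO"))
print(title_header("INTRODUCTION.", level=2))
print(bold_speaker("SOCRATES: What is piety?"))
```

A name that appears mid-sentence is left alone, which matches the rule that speakers are bolded only at the start of a line.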
Poetry formatting preserves all line breaks and indentation exactly — these are part of the art.
| Element | Formatting |
|---|---|
| Collection title | # Title |
| Author | *By Author Name* |
| Epigraph/dedication | Blockquote (>) |
| Book/section division | ## Book Title |
| Individual poem title | ### Poem Title |
| Table of contents | ## Contents with book headers bolded |
Scripture uses the following mapping, with verse text left untouched:
| Element | Formatting |
|---|---|
| Title | # Title |
| Source/translator | *Translated by Name* |
| Chapter/section | ## Chapter Name |
| Verse text | Preserved exactly |
Scientific texts use the following rules:
| Element | Formatting |
|---|---|
| Title | # Title |
| Section headings | ## Section / ### Subsection |
| Broken lines (PDF wrapping) | Joined into complete sentences |
| Data tables | Markdown table format |
| Block quotes | > prefix |
| Book titles in references | Italicized |
| Metadata artifacts | Removed (page markers, contributor lists, license blocks) |
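Joining lines broken by PDF wrapping is the most mechanical of these rules. A minimal sketch follows, assuming blank lines mark paragraph boundaries; the function name `join_wrapped_lines` is hypothetical, and the sketch ignores hyphenated words, tables, and headers, which a real pass would need to handle.

```python
def join_wrapped_lines(text: str) -> str:
    """Join hard-wrapped lines into paragraphs.

    Assumption: one or more blank lines separate paragraphs, and every
    non-blank line inside a paragraph is a wrapped fragment of it.
    """
    paragraphs = []
    current = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            current.append(stripped)
        elif current:
            # A blank line closes the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)


raw = "The cell divides\nin two stages.\n\nEach stage\ntakes an hour."
print(join_wrapped_lines(raw))
```

This kind of joining must never run on poetry or verse text, where the line breaks are preserved exactly.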
Files are processed differently based on size:
| Size | Strategy | How it works |
|---|---|---|
| < 3,000 lines | Direct | Read the whole file, apply transformations with Edit tool |
| 3,000-8,000 lines | Chunked | Find section boundaries, process each section separately |
| > 8,000 lines | Script | Generate a temporary Python script, run it, spot-check, delete |
For very large files (like The Republic at 29,000+ lines), /groom generates a Python script that applies all transformations programmatically. After running it, the skill spot-checks the output at key transition points before cleaning up the script.
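The size thresholds in the table amount to a simple decision on line count. The sketch below shows that decision only; `choose_strategy` and its return values are illustrative names, not part of the skill.

```python
def choose_strategy(line_count: int) -> str:
    """Map a file's line count to the strategy named in the size table."""
    if line_count < 3000:
        return "direct"   # read the whole file and edit in place
    if line_count <= 8000:
        return "chunked"  # process section by section
    return "script"       # generate and run a temporary Python script


for count in (1200, 3000, 8000, 29000):
    print(count, choose_strategy(count))
```

A file the size of The Republic (29,000+ lines) falls into the script strategy.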
For cases /groom doesn't handle, or when you prefer to format by hand:
Once a file is formatted, publish it:
pnpm knowledge:publish knowledge/incoming/euthyphro-plato --role source --domain philosophy
The publication CLI generates frontmatter, moves the file to its correct location, and handles the git workflow. See the Publishing Guide for the full workflow.