Analytics Guide for Manuscript Metadata
This guide is organized into Section-Level Metrics and Document-Level Metrics, each grouped into related categories. For each group, we explain what the metrics are and why they matter for refining your manuscript.
Section-Level Metadata Metrics
Structural & Size Metrics
Metrics that capture the basic shape and length of each section or chapter.
- title & subtitle
  - What it is: Chapter or section headings detected from <h1>, <h2>, and <h3> elements.
  - Why it matters: Ensures headings accurately reflect content and tone. Clear titles help readers navigate and set expectations.
- author
  - What it is: Byline detected in the first few paragraphs (e.g., “By Jane Doe”).
  - Why it matters: Verifies consistent attribution and flags missing or misattributed sections early in editing.
- words
  - What it is: Total word count in the section.
  - Why it matters: Tracks chapter length. Identify outliers (very long or very short sections) to balance pacing and maintain reader engagement.
- paragraphs
  - What it is: Count of <p> blocks.
  - Why it matters: Indicates how text is chunked. Breaking up long paragraphs improves readability and flow.
- chars
  - What it is: Character count (including spaces).
  - Why it matters: Helps monitor overall document size for submission limits or digital serialization.
- read_time_min
  - What it is: Estimated reading time in minutes (words ÷ WPM). Default WPM = 200.
  - Why it matters: Gauges reader investment per section. Aim for consistent read times to avoid spikes in pacing.
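As a rough illustration of how the size metrics above relate to one another, the sketch below computes words, paragraphs, chars, and read_time_min for a single section. The function name `section_size_metrics`, the regex-based `<p>` matching, and whitespace tokenization are assumptions made for this example, not the tool's actual implementation.

```python
import re


def section_size_metrics(html: str, wpm: int = 200) -> dict:
    """Hypothetical helper: basic size metrics for one section of HTML."""
    # Assumption: paragraphs are simple, non-nested <p>...</p> blocks.
    paragraphs = re.findall(r"<p\b[^>]*>(.*?)</p>", html, flags=re.S | re.I)
    # Strip any inline tags, then join the paragraphs into plain text.
    text = " ".join(re.sub(r"<[^>]+>", "", p).strip() for p in paragraphs)
    words = text.split()
    return {
        "words": len(words),
        "paragraphs": len(paragraphs),
        "chars": len(text),  # character count, including spaces
        "read_time_min": round(len(words) / wpm, 2),  # words / WPM
    }
```

With the default of 200 WPM, a 3,000-word section would come out at 15 minutes of reading time.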
Lexical & N‑Gram Metrics
Metrics that reveal language usage, repetition, and uniqueness within each section.
- term_freqs
  - What it is: Top N most frequent non-stop words.
  - Why it matters: Highlights overused words or themes; diversify language or leverage motifs intentionally.
- bigrams
  - What it is: Top N most frequent two-word phrases (adjacent tokens).
  - Why it matters: Exposes repetitive phrasing; vary sentence structures and avoid clichés.
- tf_idf
  - What it is: Top N keywords weighted by uniqueness (term frequency–inverse document frequency) in this section versus the whole manuscript.
  - Why it matters: Pinpoints standout terms, which are ideal for crafting chapter blurbs, marketing hooks, or SEO metadata.
- sentence_metrics
  - What it is: Average, minimum, and maximum sentence length (in words).
  - Why it matters: Balances sentence variety. Too many long sentences can fatigue readers; too many short ones can feel choppy.
- lexical_metrics
  - ttr (Type-Token Ratio): Unique words ÷ total words.
  - hapax: Count of words occurring only once.
  - avg_word_len: Average word length in characters.
  - Why it matters: Measures vocabulary diversity and complexity. Aim for a healthy balance to keep prose fresh but accessible.
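The frequency and diversity metrics above can be sketched in a few lines. This is a minimal illustration only: the tokenizer regex, the tiny stop-word set, and the function name `lexical_metrics` are assumptions, and a real pipeline would likely use a fuller stop-word list and sentence-aware bigrams.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list, not the tool's actual list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "was", "it"}


def lexical_metrics(text: str, top_n: int = 3) -> dict:
    """Hypothetical helper: frequency and diversity metrics for one section."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    content_words = [t for t in tokens if t not in STOP_WORDS]
    # Adjacent token pairs; this simple version also pairs across sentence breaks.
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    return {
        "term_freqs": Counter(content_words).most_common(top_n),
        "bigrams": bigram_counts.most_common(top_n),
        "ttr": len(counts) / n if n else 0.0,  # unique words / total words
        "hapax": sum(1 for c in counts.values() if c == 1),
        "avg_word_len": sum(map(len, tokens)) / n if n else 0.0,
    }
```

Note that TTR tends to fall as a text gets longer, so it is most meaningful when comparing sections of similar length.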
Readability Metrics
Classic measures of text complexity and reading ease.
- readability
  - flesch_kincaid: U.S. grade level based on sentence length and syllable count.
  - gunning_fog: Estimated years of education required, factoring in complex words.
  - smog: Grade level estimate based solely on polysyllabic word density.
  - Why it matters: Tailors complexity to your audience. Scores in the 4–6 range suit middle-grade, 6–9 suit YA or general fiction, and 10+ suit academic or literary readers.
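For reference, the three scores can be computed from simple counts using their standard published formulas, as sketched below. Counting syllables and sentences is the hard part and is assumed to happen upstream; the function names are illustrative, and the tool's own implementation may round or count differently.

```python
import math


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level from raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words: int, sentences: int, complex_words: int) -> float:
    """Gunning Fog index; complex words are those with three or more syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


def smog(polysyllables: int, sentences: int) -> float:
    """SMOG grade, normalizing the polysyllable count to a 30-sentence sample."""
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291
```

For example, a 100-word passage with 5 sentences and 150 syllables gives a Flesch-Kincaid grade of about 9.9.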
Sentiment & Tone Metrics
Emotional profiling at section and paragraph levels.
- sentiment
  - What it is: Aggregate sentiment scores (negative, neutral, positive, compound) for the whole section.
  - Why it matters: Monitors the overall emotional arc; spot abrupt shifts that may jar readers.
- paragraph_sentiments
  - What it is: Sentiment score for each paragraph.
  - Why it matters: Drills into micro-tone changes. Revise paragraphs with jarring swings to ensure emotional cohesion.
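One possible way to surface jarring swings is to compare each paragraph's score with the one before it. The sketch below assumes paragraph_sentiments holds one compound score per paragraph on a -1 to 1 scale; that scale, the threshold of 1.0, and the function name `jarring_swings` are all assumptions for illustration.

```python
def jarring_swings(compound_scores: list[float], threshold: float = 1.0) -> list[int]:
    """Return indexes of paragraphs whose score jumps by at least `threshold`
    relative to the preceding paragraph."""
    return [
        i
        for i in range(1, len(compound_scores))
        if abs(compound_scores[i] - compound_scores[i - 1]) >= threshold
    ]
```

A flagged paragraph is not necessarily a problem: a deliberate tonal break at a plot twist will show up the same way as an accidental one, so the output is a reading list rather than a verdict.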
Semantic & Thematic Metrics
Insights into named entities and underlying themes.
- entities
  - What it is: Named entities (PERSON, ORG, GPE) recognized in text.
  - Why it matters: Tracks character and place mentions. Ensures consistency and avoids accidental renaming.
- topics
  - What it is: Extracted latent topics or themes per chapter (stub for topic modeling).
  - Why it matters: Validates thematic focus. If topics drift, reinforce core ideas or restructure content.
Classification & Genre Metrics
Automatic labeling to confirm alignment with your target genre.
- genre
  - What it is: Predicted genre label based on keyword matching (e.g., “fantasy,” “mystery”).
  - Why it matters: Confirms that tone and content fit your intended style. Adjust if classification mismatches expectations.
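Keyword-based genre labeling can be as simple as counting hits against per-genre word lists. The keyword sets and the function name `predict_genre` below are hypothetical and chosen only to illustrate the idea; the tool's real lists are not documented here.

```python
import re
from collections import Counter
from typing import Optional

# Hypothetical keyword lists, for illustration only.
GENRE_KEYWORDS = {
    "fantasy": {"dragon", "wizard", "sword", "spell"},
    "mystery": {"detective", "clue", "murder", "alibi"},
}


def predict_genre(text: str) -> Optional[str]:
    """Return the genre whose keywords occur most often, or None if none match."""
    tokens = Counter(re.findall(r"[a-z']+", text.lower()))
    scores = {
        genre: sum(tokens[word] for word in keywords)
        for genre, keywords in GENRE_KEYWORDS.items()
    }
    # On a tie, max() keeps the genre listed first in GENRE_KEYWORDS.
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Because this approach only sees surface vocabulary, a mismatch with the intended genre is a prompt to look closer, not proof that the chapter is off-genre.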
Dialogue Metrics
Metrics that quantify dialogue versus narration.
- dialog_ratio
  - What it is: Proportion of words in quoted dialogue to total words.
  - Why it matters: Balances dialogue and narrative. Too much dialogue can slow exposition; too little can make scenes feel dry.
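A minimal sketch of the ratio, assuming dialogue is marked with straight or curly double quotes and that words are split on whitespace. Both are assumptions; manuscripts that use single quotes or dashes for speech would need a different pattern.

```python
import re


def dialog_ratio(text: str) -> float:
    """Words inside double-quoted spans divided by total words."""
    total_words = len(text.split())
    if total_words == 0:
        return 0.0
    # Each match yields a (straight-quoted, curly-quoted) pair; one side is empty.
    quoted = re.findall(r'"([^"]*)"|“([^”]*)”', text)
    dialogue_words = sum(len((a or b).split()) for a, b in quoted)
    return dialogue_words / total_words
```

For the text `"Come here," she said. "Now."` this counts three dialogue words out of five, a ratio of 0.6.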
Document-Level Metadata Metrics
Document Structure & Size
Aggregate metrics across the entire manuscript.
- total_sections
  - What it is: Number of chapters or sections.
  - Why it matters: Ensures structural completeness (acts, parts, required chapters).
- total_words, total_paragraphs, total_chars
  - What it is: Overall word, paragraph, and character counts.
  - Why it matters: Verifies submission guidelines and tracks progress toward word-count targets.
- avg_read_time_min
  - What it is: Average reading time per section.
  - Why it matters: Measures pacing consistency across chapters. Smooth out outliers to maintain flow.
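The document-level totals follow directly from the per-section values. The sketch below assumes each section is a dict carrying the section-level keys words, paragraphs, chars, and read_time_min; the dict layout and the function name `document_totals` are assumptions for illustration.

```python
def document_totals(sections: list[dict]) -> dict:
    """Roll per-section size metrics up into document-level totals."""
    n = len(sections)
    return {
        "total_sections": n,
        "total_words": sum(s["words"] for s in sections),
        "total_paragraphs": sum(s["paragraphs"] for s in sections),
        "total_chars": sum(s["chars"] for s in sections),
        # Mean of the per-section reading times; 0.0 for an empty manuscript.
        "avg_read_time_min": (
            sum(s["read_time_min"] for s in sections) / n if n else 0.0
        ),
    }
```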
Content Frequency Metrics
Document‑wide repetition and keyword insights.
- global_term_freqs, global_bigrams
  - What it is: Top N frequent words and two-word phrases in the entire manuscript.
  - Why it matters: Spots repetitive language across chapters; vary word choice intentionally.
- global_top_keywords
  - What it is: Top N TF–IDF keywords document-wide.
  - Why it matters: Identifies central story elements for use in marketing copy, cover blurbs, and metadata.
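To make the TF–IDF weighting concrete, here is a small self-contained sketch that ranks each section's terms against the rest of the manuscript. It uses one common variant (relative term frequency times the natural log of sections divided by sections containing the term); the exact weighting the tool applies, and how it merges section scores into global_top_keywords, is not specified here, so treat the function `tfidf_top_keywords` as an assumption.

```python
import math
import re
from collections import Counter


def tfidf_top_keywords(sections: list[str], top_n: int = 3) -> list[list[str]]:
    """Return the top_n TF-IDF terms for each section of the manuscript."""
    docs = [Counter(re.findall(r"[a-z']+", s.lower())) for s in sections]
    n = len(docs)
    # Document frequency: the number of sections in which each term appears.
    df = Counter(term for doc in docs for term in doc)
    results = []
    for doc in docs:
        total = sum(doc.values())
        scores = {
            term: (count / total) * math.log(n / df[term])
            for term, count in doc.items()
        }
        # Highest score first; ties are broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:top_n])
    return results
```

A term that appears in every section, such as "the", scores zero under this variant, which is why TF–IDF surfaces distinctive vocabulary rather than merely frequent words.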
Sentiment & Emotional Arc
Overall tone and thematic consistency.
- aggregated_sentiment
  - What it is: Mean sentiment across all sections.
  - Why it matters: Confirms the intended emotional journey. An overly positive compound score in a tragedy may signal the need for rewrites.
Thematic & Topic Metrics
Dominant themes and topic distribution.
- top_topics
  - What it is: Most common chapter-level topics, aggregated across the manuscript.
  - Why it matters: Ensures narrative focus; use it to guide subplot development or prune tangents.
Classification & Genre Alignment
Holistic genre assessment.
- genre
  - What it is: Consensus genre label from section-level tallies.
  - Why it matters: Validates overall style. Too much cross-genre content may confuse your audience.
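A consensus label from section-level tallies amounts to a majority vote. The sketch below also reports what share of labeled sections agree with the winner, as one rough way to gauge cross-genre content; the function name `consensus_genre` and the share calculation are assumptions, not documented behavior.

```python
from collections import Counter
from typing import Optional


def consensus_genre(section_genres: list[Optional[str]]) -> tuple[Optional[str], float]:
    """Return (most common genre, share of labeled sections that agree)."""
    tallies = Counter(g for g in section_genres if g is not None)
    if not tallies:
        return None, 0.0
    label, count = tallies.most_common(1)[0]
    return label, count / sum(tallies.values())
```

A low share, for example a winner backed by only a third of the sections, suggests the manuscript mixes genres more than the single label implies.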
Lexical & Sentence Metrics
Average language complexity and diversity.
- avg_sentence_length
  - What it is: Mean sentence length across chapters.
  - Why it matters: Balances voice and readability; compare against genre benchmarks.
- avg_ttr, avg_hapax, avg_word_length
  - What it is: Document-level averages for Type-Token Ratio, hapax count, and average word length.
  - Why it matters: Guides overall vocabulary richness and complexity.
- avg_readability
  - What it is: Average of Flesch–Kincaid, Gunning Fog, and SMOG scores.
  - Why it matters: Ensures text complexity matches target readers.
Author & Composition Metrics
Byline consistency and chapter summaries.
- author
  - What it is: First detected non-None author across sections.
  - Why it matters: Confirms metadata consistency before publication.
- chapters
  - What it is: List of ChapterInfo structs, each containing: title, subtitle, words, sentiment, dialog_ratio.
  - Why it matters: Provides a bird’s-eye view of chapter lengths, tonal shifts, and dialogue balance.
- avg_dialog_ratio
  - What it is: Average dialogue ratio across all sections.
  - Why it matters: Monitors overall narrative versus dialogue balance throughout the manuscript.
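The composition metrics above could be modeled roughly as follows. The field names come from the ChapterInfo description; the field types (in particular treating sentiment as a single compound float) and the two helper functions are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChapterInfo:
    title: Optional[str]
    subtitle: Optional[str]
    words: int
    sentiment: float  # assumption: the section's compound score
    dialog_ratio: float


def first_author(bylines: list[Optional[str]]) -> Optional[str]:
    """Return the first byline that is not None, or None if every one is missing."""
    return next((b for b in bylines if b is not None), None)


def avg_dialog_ratio(chapters: list[ChapterInfo]) -> float:
    """Unweighted mean of the per-chapter dialogue ratios."""
    if not chapters:
        return 0.0
    return sum(c.dialog_ratio for c in chapters) / len(chapters)
```

An unweighted mean treats a short prologue the same as a long chapter; weighting by words would be a reasonable alternative if chapter lengths vary widely.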
How to use these metrics:
- Benchmark your manuscript against genre norms (e.g., YA Flesch‑Kincaid ~6–8).
- Identify outliers in length, tone, or dialogue ratio and decide on splits or rewrites.
- Map emotional arc using section-level sentiment; ensure rising tension and resolution.
- Enhance vocabulary by reviewing term frequencies and TF–IDF; diversify or emphasize motifs.
- Align complexity using readability and lexical metrics to match your audience.
- Balance dialogue by tracking dialog_ratio; keep scenes dynamic without overloading with dialogue.
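As one way to identify outliers in length, tone, or dialogue ratio, the sketch below flags values that sit far from the mean in standard-deviation terms. The z-score approach, the threshold of 2.0, and the function name `find_outliers` are assumptions; with only a handful of chapters, a simple visual comparison may serve just as well.

```python
from statistics import mean, pstdev


def find_outliers(values: list[float], z_threshold: float = 2.0) -> list[int]:
    """Return indexes of values at least z_threshold standard deviations from the mean."""
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma >= z_threshold]
```

Applied to per-chapter word counts, this would single out a 9,000-word chapter in a manuscript whose other chapters run around 3,000 words.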
By combining these grouped analytics, authors can craft tighter, more engaging manuscripts that are better matched to their target readers.