Analytics Guide for Manuscript Metadata

This guide is organized into Section-Level Metrics and Document-Level Metrics, each grouped into related categories. For each group, we explain what the metrics are and why they matter for refining your manuscript.

Section-Level Metadata Metrics

Structural & Size Metrics

Metrics that capture the basic shape and length of each section or chapter.

  • title & subtitle

    • What it is: Chapter or section headings detected from <h1>, <h2>, <h3> elements.
    • Why it matters: Ensures headings accurately reflect content and tone. Clear titles help readers navigate and set expectations.
  • author

    • What it is: Byline detected in the first few paragraphs (e.g., “By Jane Doe”).
    • Why it matters: Verifies consistent attribution and flags missing or misattributed sections early in editing.
  • words

    • What it is: Total word count in the section.
    • Why it matters: Tracks chapter length. Identify outliers—very long or short sections—to balance pacing and maintain reader engagement.
  • paragraphs

    • What it is: Count of <p> blocks.
    • Why it matters: Indicates how text is chunked. Breaking up long paragraphs improves readability and flow.
  • chars

    • What it is: Character count (including spaces).
    • Why it matters: Helps monitor overall document size for submission limits or digital serialization.
  • read_time_min

    • What it is: Estimated reading time in minutes (words ÷ words per minute). The default is 200 WPM.
    • Why it matters: Gauges reader investment per section. Aim for consistent read-times to avoid spikes in pacing.
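The size metrics above reduce to simple counts. The following is a minimal sketch, assuming each section arrives as a list of plain-text paragraph strings (one per <p> block); the function name size_metrics and the input shape are illustrative assumptions, not the tool's actual API.

```python
WPM_DEFAULT = 200  # default reading speed stated in this guide


def size_metrics(paragraphs, wpm=WPM_DEFAULT):
    """Compute words, paragraphs, chars, and read_time_min for one section.

    `paragraphs` is a list of strings, one per <p> block (assumed shape).
    Words are split on whitespace; chars include spaces inside paragraphs.
    """
    words = sum(len(p.split()) for p in paragraphs)
    chars = sum(len(p) for p in paragraphs)
    return {
        "words": words,
        "paragraphs": len(paragraphs),
        "chars": chars,
        "read_time_min": words / wpm,
    }


if __name__ == "__main__":
    print(size_metrics(["It was a dark night.", "Nobody spoke."]))
```

Whitespace splitting is the simplest word definition; a tokenizer that treats hyphenated words or em-dash-joined words differently will give slightly different counts.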

Lexical & N‑Gram Metrics

Metrics that reveal language usage, repetition, and uniqueness within each section.

  • term_freqs

    • What it is: Top N most frequent non–stop words.
    • Why it matters: Highlights overused words or themes; diversify language or leverage motifs intentionally.
  • bigrams

    • What it is: Top N most frequent two-word phrases (adjacent tokens).
    • Why it matters: Exposes repetitive phrasing; vary sentence structures and avoid clichés.
  • tf_idf

    • What it is: Top N keywords ranked by TF–IDF (term frequency–inverse document frequency), which favors terms that are frequent in this section but rare across the rest of the manuscript.
    • Why it matters: Pinpoints standout terms—ideal for crafting chapter blurbs, marketing hooks, or SEO metadata.
  • sentence_metrics

    • What it is: Average, minimum, and maximum sentence length (in words).
    • Why it matters: Balances sentence variety. Too many long sentences can fatigue readers; too many short ones can feel choppy.
  • lexical_metrics

    • ttr (Type‑Token Ratio): Unique words ÷ total words.
    • hapax: Count of words occurring only once.
    • avg_word_len: Average word length in characters.
    • Why it matters: Measures vocabulary diversity and complexity. Aim for a healthy balance to keep prose fresh but accessible.
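Most of the lexical metrics in this group can be derived from one token list. The sketch below is an illustration under stated assumptions: the stop-word list is a tiny placeholder, sentences are split naively on ., !, and ?, and the function names tokenize and lexical_summary are hypothetical rather than taken from the tool.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "on", "was"}


def tokenize(text):
    """Lowercase word tokens; an internal apostrophe stays inside the word."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def lexical_summary(text, top_n=5):
    tokens = tokenize(text)
    counts = Counter(tokens)
    content = [t for t in tokens if t not in STOP_WORDS]
    # Adjacent-token pairs; pairs spanning a sentence boundary are included.
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Naive sentence split; abbreviations such as "Dr." will be mis-split.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(tokenize(s)) for s in sentences]
    total = len(tokens)
    return {
        "term_freqs": Counter(content).most_common(top_n),
        "bigrams": bigrams.most_common(top_n),
        "ttr": len(counts) / total if total else 0.0,
        "hapax": sum(1 for c in counts.values() if c == 1),
        "avg_word_len": sum(map(len, tokens)) / total if total else 0.0,
        "sentence_metrics": {
            "avg": sum(lengths) / len(lengths) if lengths else 0.0,
            "min": min(lengths, default=0),
            "max": max(lengths, default=0),
        },
    }


if __name__ == "__main__":
    print(lexical_summary("The cat sat. The cat ran away!"))
```

Note that TTR falls as text gets longer, so comparing it across sections of very different lengths can mislead.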

Readability Metrics

Classic measures of text complexity and reading ease.

  • readability

    • flesch_kincaid: U.S. grade level based on sentence length and syllable count.
    • gunning_fog: Estimated years of education required, factoring in complex words.
    • smog: Grade level estimate based solely on polysyllabic word density.
    • Why it matters: Tailors complexity to your audience. As a rough guide, grade-level scores of 4–6 suit middle-grade readers, 6–9 suit YA or general fiction, and 10+ suit academic or literary readers.
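All three scores come from published formulas over sentence, word, and syllable counts. The sketch below applies those formulas with a rough vowel-group syllable heuristic; the heuristic, the naive sentence splitter, and the simplification that any word of three or more syllables counts as "complex" for Gunning Fog are assumptions of this example, so results will differ somewhat from a dedicated readability library.

```python
import math
import re


def count_syllables(word):
    """Rough heuristic: count vowel groups, drop a trailing silent 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def readability(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return {"flesch_kincaid": 0.0, "gunning_fog": 0.0, "smog": 0.0}
    syllables = [count_syllables(w) for w in words]
    polysyllables = sum(1 for s in syllables if s >= 3)
    words_per_sentence = len(words) / len(sentences)
    return {
        # Flesch-Kincaid grade level
        "flesch_kincaid": 0.39 * words_per_sentence
        + 11.8 * sum(syllables) / len(words)
        - 15.59,
        # Gunning Fog index (complex words approximated by 3+ syllables)
        "gunning_fog": 0.4 * (words_per_sentence + 100 * polysyllables / len(words)),
        # SMOG grade, normalized to a 30-sentence sample
        "smog": 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291,
    }


if __name__ == "__main__":
    print(readability("The committee deliberated extensively. Nobody agreed."))
```

Very short passages produce unstable scores (SMOG in particular was designed for samples of about 30 sentences), so section-level values are most meaningful for full chapters.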

Sentiment & Tone Metrics

Emotional profiling at section and paragraph levels.

  • sentiment

    • What it is: Aggregate sentiment scores (negative, neutral, positive, compound) for the whole section.
    • Why it matters: Monitors the overall emotional arc; spot abrupt shifts that may jar readers.
  • paragraph_sentiments

    • What it is: Sentiment score for each paragraph.
    • Why it matters: Drills into micro‑tone changes. Revise paragraphs with jarring swings to ensure emotional cohesion.
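One way to act on paragraph_sentiments is to flag large jumps between neighboring paragraphs. The sketch below assumes each paragraph has a single compound score on a -1 to 1 scale and uses a hypothetical threshold of 0.8; neither the scale nor the threshold is specified by this guide.

```python
def jarring_swings(paragraph_compounds, threshold=0.8):
    """Return (index, delta) for each paragraph whose compound score differs
    from the previous paragraph's score by at least `threshold`.

    Scores are assumed to lie in [-1, 1]; the threshold is illustrative.
    """
    swings = []
    for i in range(1, len(paragraph_compounds)):
        delta = paragraph_compounds[i] - paragraph_compounds[i - 1]
        if abs(delta) >= threshold:
            swings.append((i, round(delta, 3)))
    return swings


if __name__ == "__main__":
    scores = [0.5, 0.4, -0.6, -0.5, 0.7]
    for index, delta in jarring_swings(scores):
        print(f"paragraph {index}: compound changed by {delta:+.2f}")
```

A flagged swing is not necessarily a flaw; a deliberate twist or reveal will show up the same way, so treat the output as a reading list rather than an error report.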

Semantic & Thematic Metrics

Insights into named entities and underlying themes.

  • entities

    • What it is: Named entities recognized in the text: people (PERSON), organizations (ORG), and geopolitical entities such as countries and cities (GPE).
    • Why it matters: Tracks character and place mentions. Ensures consistency and avoids accidental renaming.
  • topics

    • What it is: Extracted latent topics or themes per chapter (stub for topic modeling).
    • Why it matters: Validates thematic focus. If topics drift, reinforce core ideas or restructure content.

Classification & Genre Metrics

Automatic labeling to confirm alignment with your target genre.

  • genre

    • What it is: Predicted genre label based on keyword matching (e.g., “fantasy,” “mystery”).
    • Why it matters: Confirms that tone and content fit your intended style. Adjust if classification mismatches expectations.
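Keyword-based genre prediction can be as simple as counting hits against per-genre word lists. The sketch below is a minimal illustration; the keyword sets and the function name predict_genre are invented for this example and are not the lists the tool uses.

```python
import re
from collections import Counter

# Hypothetical keyword lists, for illustration only.
GENRE_KEYWORDS = {
    "fantasy": {"dragon", "wizard", "spell", "kingdom", "sword"},
    "mystery": {"detective", "clue", "murder", "alibi", "suspect"},
}


def predict_genre(text, keywords=GENRE_KEYWORDS):
    """Return the genre with the most keyword hits, or None if nothing matches.

    Ties go to whichever genre appears first in `keywords`.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = Counter(
        {genre: sum(1 for t in tokens if t in vocab) for genre, vocab in keywords.items()}
    )
    if not hits:
        return None
    genre, count = hits.most_common(1)[0]
    return genre if count > 0 else None


if __name__ == "__main__":
    print(predict_genre("The detective found a clue near the dragon statue."))
```

Because this approach only sees surface vocabulary, a single vivid scene (a dream about dragons in a crime novel, say) can tip a short section toward the wrong label.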

Dialogue Metrics

Metrics that quantify dialogue versus narration.

  • dialog_ratio

    • What it is: Proportion of words in quoted dialogue to total words.
    • Why it matters: Balances dialogue and narrative. Too much dialogue can slow exposition; too little can make scenes feel dry.
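A dialogue ratio can be approximated by counting the words that fall inside double quotation marks. The sketch below assumes dialogue is marked with straight or curly double quotes that open and close within the same section; single-quoted dialogue and quotes left open across paragraphs are not handled, and the function name is illustrative.

```python
import re

# Text between straight double quotes, or between curly double quotes.
QUOTED = re.compile(r'"([^"]*)"|“([^”]*)”')


def dialog_ratio(text):
    """Words inside double-quoted spans divided by total words."""
    total = len(text.split())
    if total == 0:
        return 0.0
    spoken = 0
    for match in QUOTED.finditer(text):
        inner = match.group(1) or match.group(2) or ""
        spoken += len(inner.split())
    return spoken / total


if __name__ == "__main__":
    sample = '"Come here," she said. He did not move.'
    print(f"{dialog_ratio(sample):.2f}")
```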

Document-Level Metadata Metrics

Document Structure & Size

Aggregate metrics across the entire manuscript.

  • total_sections

    • What it is: Number of chapters or sections.
    • Why it matters: Ensures structural completeness (acts, parts, required chapters).
  • total_words, total_paragraphs, total_chars

    • What it is: Overall word, paragraph, and character counts.
    • Why it matters: Verifies submission guidelines and tracks progress toward word-count targets.
  • avg_read_time_min

    • What it is: Average reading time per section.
    • Why it matters: Measures pacing consistency across chapters. Smooth out outliers to maintain flow.

Content Frequency Metrics

Document‑wide repetition and keyword insights.

  • global_term_freqs, global_bigrams

    • What it is: Top N frequent words and two-word phrases in the entire manuscript.
    • Why it matters: Spots repetitive language across chapters; vary word choice intentionally.
  • global_top_keywords

    • What it is: Top N TF–IDF keywords document‑wide.
    • Why it matters: Identifies central story elements—use in marketing copy, cover blurbs, and metadata.
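The TF–IDF weighting behind these keyword lists can be sketched in a few lines. The example below treats each section as one document and uses the plain formulation tf × log(N / df); the exact variant (smoothing, normalization) and how section scores are rolled up into document-wide keywords are not specified in this guide, so both are assumptions here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def tf_idf_keywords(sections, top_n=3):
    """Return the top_n (term, score) pairs for each section.

    tf  = term count / number of tokens in the section
    idf = log(N / df), where df is the number of sections containing the term
    """
    docs = [Counter(tokenize(s)) for s in sections]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)
    results = []
    for doc in docs:
        total = sum(doc.values()) or 1
        scores = {t: (c / total) * math.log(n / df[t]) for t, c in doc.items()}
        # Highest score first; alphabetical order breaks ties.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        results.append(ranked[:top_n])
    return results


if __name__ == "__main__":
    chapters = ["the dragon guards the gold", "the detective finds the clue"]
    for i, keywords in enumerate(tf_idf_keywords(chapters, top_n=2), start=1):
        print(i, keywords)
```

A term that appears in every section gets an idf of zero under this formulation, which is why common words drop out even without a stop-word list.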

Sentiment & Emotional Arc

Overall tone and thematic consistency.

  • aggregated_sentiment

    • What it is: Mean sentiment across all sections.
    • Why it matters: Confirms the intended emotional journey. An overly positive compound score in a tragedy may signal the need for rewrites.

Thematic & Topic Metrics

Dominant themes and topic distribution.

  • top_topics

    • What it is: Most common chapter‑level topics aggregated.
    • Why it matters: Ensures narrative focus—guide subplot development or prune tangents.

Classification & Genre Alignment

Holistic genre assessment.

  • genre

    • What it is: Consensus genre label from section-level tallies.
    • Why it matters: Validates overall style. Too much cross-genre content may confuse your audience.
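A consensus label from section-level tallies amounts to a majority vote. The sketch below also reports what share of labeled sections agree with the winner, as one possible way to quantify cross-genre content; that agreement figure and the function name are additions for illustration, not fields described in this guide.

```python
from collections import Counter


def consensus_genre(section_genres):
    """Return (most common genre, share of labeled sections that agree).

    Sections with no predicted genre (None) are ignored.
    """
    labels = [g for g in section_genres if g is not None]
    if not labels:
        return None, 0.0
    genre, votes = Counter(labels).most_common(1)[0]
    return genre, votes / len(labels)


if __name__ == "__main__":
    print(consensus_genre(["fantasy", "mystery", "fantasy", None]))
```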

Lexical & Sentence Metrics

Average language complexity and diversity.

  • avg_sentence_length

    • What it is: Mean sentence length across chapters.
    • Why it matters: Balances voice and readability; compare against genre benchmarks.
  • avg_ttr, avg_hapax, avg_word_length

    • What it is: Document-level averages for Type‑Token Ratio, hapax count, and average word length.
    • Why it matters: Guides overall vocabulary richness and complexity.
  • avg_readability

    • What it is: Average of Flesch–Kincaid, Gunning Fog, and SMOG scores.
    • Why it matters: Ensures text complexity matches target readers.

Author & Composition Metrics

Byline consistency and chapter summaries.

  • author

    • What it is: First detected non-None author across sections.
    • Why it matters: Confirms metadata consistency before publication.
  • chapters

    • What it is: List of ChapterInfo structs, each containing:

      • title, subtitle, words, sentiment, dialog_ratio.
    • Why it matters: Provides a bird’s‑eye view of chapter lengths, tonal shifts, and dialogue balance.

  • avg_dialog_ratio

    • What it is: Average dialogue ratio across all sections.
    • Why it matters: Monitors overall narrative versus dialogue balance throughout the manuscript.
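The composition fields above can be pictured as a small record type plus a roll-up step. In the sketch below, the ChapterInfo field names come from this guide, but the field types, the choice of a single float for sentiment, and the summarize helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class ChapterInfo:
    title: Optional[str]
    subtitle: Optional[str]
    words: int
    sentiment: float  # assumed here to be the section's compound score
    dialog_ratio: float


def summarize(chapters, authors):
    """Roll section-level values up into document-level fields.

    `authors` holds the byline detected in each section, or None if absent.
    """
    return {
        # First detected non-None author across sections.
        "author": next((a for a in authors if a is not None), None),
        "chapters": chapters,
        "avg_dialog_ratio": mean(c.dialog_ratio for c in chapters) if chapters else 0.0,
    }


if __name__ == "__main__":
    chapters = [
        ChapterInfo("One", None, 1000, 0.2, 0.25),
        ChapterInfo("Two", None, 3000, -0.4, 0.75),
    ]
    doc = summarize(chapters, [None, "Jane Doe", "Jane Doe"])
    print(doc["author"], doc["avg_dialog_ratio"])
```

An unweighted mean treats a 1,000-word chapter the same as a 3,000-word one; weighting by words would give the manuscript-wide proportion instead.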

How to use these metrics:

  1. Benchmark your manuscript against genre norms (e.g., YA Flesch‑Kincaid ~6–8).
  2. Identify outliers in length, tone, or dialogue ratio and decide on splits or rewrites.
  3. Map emotional arc using section-level sentiment; ensure rising tension and resolution.
  4. Enhance vocabulary by reviewing term frequencies and TF–IDF; diversify or emphasize motifs.
  5. Align complexity using readability and lexical metrics to match your audience.
  6. Balance dialogue by tracking dialog_ratio; keep scenes dynamic without overloading with dialogue.
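Step 2 can be made concrete with a simple z-score check over any per-section metric (word count, compound sentiment, dialog_ratio). The sketch below is one possible approach, not a method prescribed by this guide; the threshold of 2.0 is an illustrative assumption.

```python
from statistics import mean, pstdev


def find_outliers(values, z_threshold=2.0):
    """Return indices whose distance from the mean is at least
    z_threshold population standard deviations."""
    if len(values) < 2:
        return []
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma >= z_threshold]


if __name__ == "__main__":
    chapter_words = [3000, 3200, 2900, 3100, 9000, 3050]
    print(find_outliers(chapter_words))
```

With only a handful of sections, a single extreme value inflates the standard deviation and caps how large any z-score can get, so a lower threshold or a median-based measure may work better for short manuscripts.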

By combining these grouped analytics, authors can craft tighter, more engaging, and better-targeted manuscripts.