Analytics Guide for Manuscript Metadata

This guide is organized into Section-Level Metrics and Document-Level Metrics, each grouped into related categories. For each group, we explain what the metrics are and why they matter for refining your manuscript.

Section-Level Metadata Metrics

Structural & Size Metrics

Metrics that capture the basic shape and length of each section or chapter.

title & subtitle
- What it is: Chapter or section headings detected from <h1>, <h2>, <h3> elements.
- Why it matters: Ensures headings accurately reflect content and tone. Clear titles help readers navigate and set expectations.
author
- What it is: Byline detected in the first few paragraphs (e.g., “By Jane Doe”).
- Why it matters: Verifies consistent attribution and flags missing or misattributed sections early in editing.
words
- What it is: Total word count in the section.
- Why it matters: Tracks chapter length. Identify outliers—very long or short sections—to balance pacing and maintain reader engagement.
paragraphs
- What it is: Count of <p> blocks.
- Why it matters: Indicates how text is chunked. Breaking up long paragraphs improves readability and flow.
chars
- What it is: Character count (including spaces).
- Why it matters: Helps monitor overall document size for submission limits or digital serialization.
read_time_min
- What it is: Estimated reading time in minutes (words ÷ WPM, where WPM is words per minute). Default WPM=200.
- Why it matters: Gauges reader investment per section. Aim for consistent read-times to avoid spikes in pacing.
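
The size metrics above reduce to simple counts, and read_time_min is just words ÷ WPM. The following is a minimal sketch, not the tool's actual implementation: the function names are hypothetical, and it assumes plain text with blank-line paragraph breaks rather than the <p> blocks the guide describes.

```python
DEFAULT_WPM = 200  # default reading speed stated in this guide


def read_time_min(words: int, wpm: int = DEFAULT_WPM) -> float:
    """Estimated reading time in minutes: words divided by words per minute."""
    if wpm <= 0:
        raise ValueError("wpm must be positive")
    return words / wpm


def section_stats(text: str, wpm: int = DEFAULT_WPM) -> dict:
    """Basic size metrics for one section of plain text (hypothetical helper)."""
    words = len(text.split())
    # Assumption: paragraphs are separated by blank lines in plain text
    paragraphs = len([p for p in text.split("\n\n") if p.strip()])
    return {
        "words": words,
        "paragraphs": paragraphs,
        "chars": len(text),  # includes spaces and newlines
        "read_time_min": read_time_min(words, wpm),
    }
```

For example, a 3,000-word chapter at the default 200 WPM gives `read_time_min(3000)` of 15 minutes.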

Lexical & N‑Gram Metrics

Metrics that reveal language usage, repetition, and uniqueness within each section.

term_freqs
- What it is: Top N most frequent non–stop words.
- Why it matters: Highlights overused words or themes; diversify language or leverage motifs intentionally.
bigrams
- What it is: Top N most frequent two-word phrases (adjacent tokens).
- Why it matters: Exposes repetitive phrasing; vary sentence structures and avoid clichés.
tf_idf
- What it is: Top N keywords weighted by uniqueness (term frequency–inverse document frequency) in this section versus the whole manuscript.
- Why it matters: Pinpoints standout terms—ideal for crafting chapter blurbs, marketing hooks, or SEO metadata.
sentence_metrics
- What it is: Average, minimum, and maximum sentence length (in words).
- Why it matters: Balances sentence variety. Too many long sentences can fatigue readers; too many short ones can feel choppy.
lexical_metrics
- ttr (Type‑Token Ratio): Unique words ÷ total words.
- hapax: Count of words occurring only once.
- avg_word_len: Average word length in characters.
- Why it matters: Measures vocabulary diversity and complexity. Aim for a healthy balance to keep prose fresh but accessible.
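
To make the tf_idf entry above concrete, here is a rough sketch that scores each section's terms against the whole manuscript. It uses one common TF-IDF variant (relative term frequency times the natural log of sections ÷ sections containing the term); the guide does not say which variant the tool uses, and the tokenizer and function names are assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase word tokens; a deliberately simple, hypothetical tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def tf_idf_top(sections: list, top_n: int = 3) -> list:
    """Top-N (term, score) pairs for each section, scored against all sections."""
    token_lists = [tokenize(s) for s in sections]
    n_sections = len(token_lists)
    # Document frequency: number of sections in which each term appears
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    results = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty sections
        scores = {
            term: (count / total) * math.log(n_sections / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        results.append(ranked[:top_n])
    return results
```

With this variant, a word that appears in every section scores zero, which is why common words drop out and section-specific terms rise to the top.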

Readability Metrics

Classic measures of text complexity and reading ease.

readability
- flesch_kincaid: U.S. grade level based on sentence length and syllable count.
- gunning_fog: Estimated years of education required, factoring in complex words.
- smog: Grade level estimate based on the density of polysyllabic words (three or more syllables) relative to sentence count.
- Why it matters: Tailors complexity to your audience. Grade-level scores of roughly 4–6 suit middle-grade readers, 6–9 suit YA or general fiction, and 10 or above suit academic or literary readers.
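
The three scores above have standard published formulas, sketched below as plain functions of pre-computed counts. How the tool counts syllables, sentences, and complex words is not stated in this guide, so those counts are taken as inputs here; the function and parameter names are hypothetical.

```python
import math


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level from word, sentence, and syllable counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words: int, sentences: int, complex_words: int) -> float:
    """Gunning Fog index; complex_words are words of three or more syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


def smog(polysyllables: int, sentences: int) -> float:
    """SMOG grade; polysyllable count is normalized to a 30-sentence sample."""
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291
```

For instance, a passage with 100 words, 5 sentences, and 150 syllables gives a Flesch-Kincaid grade of about 9.9.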

Sentiment & Tone Metrics

Emotional profiling at section and paragraph levels.

sentiment
- What it is: Aggregate sentiment scores (negative, neutral, positive, compound) for the whole section.
- Why it matters: Monitors the overall emotional arc; spot abrupt shifts that may jar readers.
paragraph_sentiments
- What it is: Sentiment score for each paragraph.
- Why it matters: Drills into micro‑tone changes. Revise paragraphs with jarring swings to ensure emotional cohesion.
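
One way to find the "jarring swings" mentioned above is to compare each paragraph's compound score with the previous one. This sketch assumes compound scores fall between -1 and 1 (as in VADER-style analyzers, which this guide does not name), and the function name and threshold of 1.0 are hypothetical choices to tune per manuscript.

```python
def flag_sentiment_swings(compound_scores: list, threshold: float = 1.0) -> list:
    """Return (paragraph_index, delta) where the compound score jumps sharply.

    delta is the change from the previous paragraph; a paragraph is flagged
    when the absolute change exceeds the (hypothetical) threshold.
    """
    flagged = []
    for i in range(1, len(compound_scores)):
        delta = compound_scores[i] - compound_scores[i - 1]
        if abs(delta) > threshold:
            flagged.append((i, round(delta, 3)))
    return flagged
```

Given paragraph scores of `[0.6, 0.5, -0.7, -0.6, 0.2]`, only the third paragraph (index 2) is flagged, since it drops by 1.2 from the one before it.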

Semantic & Thematic Metrics

Insights into named entities and underlying themes.

entities
- What it is: Named entities (PERSON, ORG, GPE) recognized in text.
- Why it matters: Tracks character and place mentions. Ensures consistency and avoids accidental renaming.
topics
- What it is: Extracted latent topics or themes per chapter (stub for topic modeling).
- Why it matters: Validates thematic focus. If topics drift, reinforce core ideas or restructure content.

Classification & Genre Metrics

Automatic labeling to confirm alignment with your target genre.

genre
- What it is: Predicted genre label based on keyword matching (e.g., “fantasy,” “mystery”).
- Why it matters: Confirms that tone and content fit your intended style. Adjust if classification mismatches expectations.
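
Keyword matching of the kind described above can be sketched as counting hits against a per-genre vocabulary. The keyword sets and function name below are made up for illustration; the guide does not list the tool's actual vocabularies or its tie-breaking rule.

```python
import re
from collections import Counter

# Hypothetical keyword sets; a real list would be much larger
GENRE_KEYWORDS = {
    "fantasy": {"dragon", "wizard", "sword", "magic", "kingdom"},
    "mystery": {"detective", "clue", "murder", "alibi", "suspect"},
}


def predict_genre(text: str, keywords: dict = GENRE_KEYWORDS):
    """Return the genre whose keywords occur most often, or None if no hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = Counter()
    for genre, vocab in keywords.items():
        hits[genre] = sum(1 for t in tokens if t in vocab)
    if not hits:
        return None
    # On a tie, max() keeps the first genre in dictionary order
    genre, count = max(hits.items(), key=lambda kv: kv[1])
    return genre if count > 0 else None
```

Because this approach only sees surface vocabulary, a mismatch is a prompt to look closer rather than a verdict on the section.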

Dialogue Metrics

Metrics that quantify dialogue versus narration.

dialog_ratio
- What it is: Proportion of words in quoted dialogue to total words.
- Why it matters: Balances dialogue and narrative. Too much dialogue can slow exposition; too little can make scenes feel dry.
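
As a rough illustration of dialog_ratio, the sketch below counts words inside double quotes (straight or curly) and divides by the section's total word count. The regular expression and function name are assumptions; the guide does not describe how the tool detects quoted dialogue, and this version ignores single-quoted speech and unclosed quotes.

```python
import re

# Text between straight double quotes, or between curly double quotes
DIALOGUE_RE = re.compile(r'"([^"]*)"|“([^”]*)”')


def dialog_ratio(text: str) -> float:
    """Proportion of words inside quoted dialogue to total words."""
    total = len(text.split())
    if total == 0:
        return 0.0
    quoted = 0
    for match in DIALOGUE_RE.finditer(text):
        # Exactly one of the two groups is set, depending on the quote style
        span = match.group(1) if match.group(1) is not None else match.group(2)
        quoted += len(span.split())
    return quoted / total
```

For the sentence `"Come in," she said. He nodded.` two of the six words are dialogue, so the ratio is about 0.33.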

Document-Level Metadata Metrics

Document Structure & Size

Aggregate metrics across the entire manuscript.

total_sections
- What it is: Number of chapters or sections.
- Why it matters: Ensures structural completeness (acts, parts, required chapters).
total_words, total_paragraphs, total_chars
- What it is: Overall word, paragraph, and character counts.
- Why it matters: Verifies submission guidelines and tracks progress toward word-count targets.
avg_read_time_min
- What it is: Average reading time per section.
- Why it matters: Measures pacing consistency across chapters. Smooth out outliers to maintain flow.

Content Frequency Metrics

Document‑wide repetition and keyword insights.

global_term_freqs, global_bigrams
- What it is: Top N frequent words and two-word phrases in the entire manuscript.
- Why it matters: Spots repetitive language across chapters; vary word choice intentionally.
global_top_keywords
- What it is: Top N TF–IDF keywords document‑wide.
- Why it matters: Identifies central story elements—use in marketing copy, cover blurbs, and metadata.
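
Document-wide term and bigram counts can be accumulated section by section, as in this sketch. The stop-word list is a tiny illustrative stand-in, the function names are hypothetical, and keeping stop words inside bigrams is an assumption, since the guide only says bigrams are adjacent tokens.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists are far longer
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "it", "on"}


def tokenize(text: str) -> list:
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def global_counts(sections: list, top_n: int = 5) -> tuple:
    """Return (top terms, top bigrams) across all sections."""
    term_counts = Counter()
    bigram_counts = Counter()
    for section in sections:
        tokens = tokenize(section)
        term_counts.update(t for t in tokens if t not in STOP_WORDS)
        # Adjacent pairs are built per section, so no pair spans two sections
        bigram_counts.update(zip(tokens, tokens[1:]))
    return term_counts.most_common(top_n), bigram_counts.most_common(top_n)
```

The same function applied to a single-element list gives the section-level term_freqs and bigrams.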

Sentiment & Emotional Arc

Overall tone and thematic consistency.

aggregated_sentiment
- What it is: Mean sentiment across all sections.
- Why it matters: Confirms the intended emotional journey. An overly positive compound score in a tragedy, for example, may signal a need for rewrites.

Thematic & Topic Metrics

Dominant themes and topic distribution.

top_topics
- What it is: Most common chapter‑level topics aggregated.
- Why it matters: Ensures narrative focus—guide subplot development or prune tangents.

Classification & Genre Alignment

Holistic genre assessment.

genre
- What it is: Consensus genre label from section-level tallies.
- Why it matters: Validates overall style. Too much cross-genre content may confuse your audience.
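
A consensus label from section-level tallies can be read as a simple majority vote, sketched below. The function name is hypothetical, and how the tool breaks ties is not specified in this guide; here the label seen first wins.

```python
from collections import Counter


def consensus_genre(section_genres: list):
    """Most frequent section-level genre label, ignoring sections with none."""
    tally = Counter(g for g in section_genres if g is not None)
    if not tally:
        return None
    # most_common(1) returns [(label, count)] for the highest count
    return tally.most_common(1)[0][0]
```

Looking at the full tally, not just the winner, shows how much cross-genre content the manuscript contains.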

Lexical & Sentence Metrics

Average language complexity and diversity.

avg_sentence_length
- What it is: Mean sentence length across chapters.
- Why it matters: Balances voice and readability; compare against genre benchmarks.
avg_ttr, avg_hapax, avg_word_length
- What it is: Document-level averages for Type‑Token Ratio, hapax count, and average word length.
- Why it matters: Guides overall vocabulary richness and complexity.
avg_readability
- What it is: Average of Flesch–Kincaid, Gunning Fog, and SMOG scores.
- Why it matters: Ensures text complexity matches target readers.
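
The document-level lexical averages can be sketched as an unweighted mean of per-section values. Treating every section equally regardless of length is an assumption, as are the tokenizer and function names; the guide only states that these are averages.

```python
import re
from collections import Counter
from statistics import mean


def lexical_metrics(text: str) -> dict:
    """Per-section ttr, hapax count, and average word length."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {"ttr": 0.0, "hapax": 0, "avg_word_len": 0.0}
    counts = Counter(tokens)
    return {
        "ttr": len(counts) / len(tokens),  # unique words / total words
        "hapax": sum(1 for c in counts.values() if c == 1),
        "avg_word_len": sum(len(t) for t in tokens) / len(tokens),
    }


def document_averages(sections: list) -> dict:
    """Unweighted means across sections; requires at least one section."""
    per_section = [lexical_metrics(s) for s in sections]
    return {
        "avg_ttr": mean(m["ttr"] for m in per_section),
        "avg_hapax": mean(m["hapax"] for m in per_section),
        "avg_word_length": mean(m["avg_word_len"] for m in per_section),
    }
```

Note that Type-Token Ratio tends to fall as text gets longer, so comparing sections of very different lengths on ttr alone can mislead.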

Author & Composition Metrics

Byline consistency and chapter summaries.

author
- What it is: First detected non-None author across sections.
- Why it matters: Confirms metadata consistency before publication.
chapters
- What it is: List of ChapterInfo structs, each containing:
  - title, subtitle, words, sentiment, dialog_ratio.
- Why it matters: Provides a bird’s‑eye view of chapter lengths, tonal shifts, and dialogue balance.
avg_dialog_ratio
- What it is: Average dialogue ratio across all sections.
- Why it matters: Monitors overall narrative versus dialogue balance throughout the manuscript.
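
The composition metrics above might be modeled roughly as follows. Only the ChapterInfo field names come from this guide; the field types, the choice of a single compound score for sentiment, and the helper function names are assumptions made for the sketch.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class ChapterInfo:
    title: Optional[str]
    subtitle: Optional[str]
    words: int
    sentiment: float      # assumed here to be the section's compound score
    dialog_ratio: float   # quoted words / total words


def avg_dialog_ratio(chapters: list) -> float:
    """Mean dialogue ratio across chapters; 0.0 when there are none."""
    return mean(c.dialog_ratio for c in chapters) if chapters else 0.0


def first_author(section_authors: list):
    """First detected non-None author across sections, else None."""
    return next((a for a in section_authors if a is not None), None)
```

A quick usage example: `first_author([None, "Jane Doe", "John Roe"])` returns `"Jane Doe"`, which matches the "first detected non-None author" rule described above.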

How to use these metrics:

- Benchmark your manuscript against genre norms (e.g., YA Flesch‑Kincaid ~6–8).
- Identify outliers in length, tone, or dialogue ratio and decide on splits or rewrites.
- Map emotional arc using section-level sentiment; ensure rising tension and resolution.
- Enhance vocabulary by reviewing term frequencies and TF–IDF; diversify or emphasize motifs.
- Align complexity using readability and lexical metrics to match your audience.
- Balance dialogue by tracking dialog_ratio; keep scenes dynamic without overloading with dialogue.

By combining these grouped analytics, authors can craft tighter, more engaging, and better-targeted manuscripts.