Chapter 11 of 21

Hierarchical Planning and Structured Reasoning

Strategic roadmaps, project decomposition, and multi-phase planning require consistency across levels and steps, a semantic property that token-level LLM drafting does not reliably preserve. This chapter covers the concept-space operations that catch redundancy, misalignment, and contradiction before they survive into delivery.

Part 3 — Enterprise Application

A major US bank's technology transformation team spent six months developing a five-year IT modernization plan with a top consulting firm. The plan ran to 400 pages and covered 180 legacy systems across 23 business divisions. Three months after delivery, the internal architecture team found 14 initiatives that were semantically redundant, 6 initiative pairs sequenced in violation of their stated dependencies, and 3 strategic goals that directly contradicted each other.

The consulting firm had not been careless. They had used structured workshops, document templates, and LLM-assisted drafting. The problem was structural: LLM-assisted drafting generates each section with limited semantic reference to the others. The sections were individually coherent. The plan was not.

This chapter covers the LCM architecture for hierarchical planning — how concept-space arithmetic enables consistency checks across plan levels, and how to build a planning assistant that catches failures before they reach delivery.

11.1 Why Planning Fails at Token Level

Hierarchical plans have a structure that token-level generation handles poorly: a goal decomposes into initiatives, initiatives decompose into workstreams, workstreams decompose into projects, projects decompose into tasks. Consistency must hold both vertically (each level semantically aligned with the level above) and horizontally (elements at the same level not redundant or contradictory).

Token-level generation struggles with horizontal consistency. An LLM generating initiative 3 can attend to the tokens of initiatives 1 and 2, but content that appeared many tokens earlier tends to be used less reliably than recent content. By initiative 7, the detailed semantic content of initiative 1 may be only weakly represented, so the model cannot reliably detect that initiative 7 is redundant with initiative 1, especially when the two use different vocabulary.

Token-level generation also fails the vocabulary-independence requirement of cross-level consistency. "Modernize the payments infrastructure" and "Upgrade the transaction processing platform" may be two names for the same initiative, or they may be different initiatives touching overlapping systems. Token-level similarity will not distinguish these cases reliably. Concept-level similarity in SONAR space is a better proxy.
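To make the vocabulary problem concrete, the sketch below computes a crude lexical overlap (Jaccard similarity over lowercase word tokens) for the two initiative names above. It is only a stand-in for surface-level comparison, not a model of how an LLM compares text internally, and the helper name `token_jaccard` is illustrative.

```python
def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of lowercase, whitespace-separated tokens."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


overlap = token_jaccard(
    "Modernize the payments infrastructure",
    "Upgrade the transaction processing platform",
)
print(f"{overlap:.3f}")  # 0.125: the only shared token is "the"
```

A lexical score this low says nothing about whether the two names describe the same initiative, which is why the checks in the next section compare embeddings rather than words.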

11.2 Concept-Space Operations for Planning

Three concept-space operations support hierarchical plan validation.

Redundancy detection. Two plan elements at the same level are flagged as redundant if the cosine similarity between their concept embeddings meets or exceeds a threshold. The threshold must be calibrated to the domain: strategic initiatives within the same organization share vocabulary and sit naturally closer in concept space than initiatives across organizations, so the redundancy threshold needs to be higher to avoid over-flagging.

import numpy as np
from itertools import combinations

def detect_redundant_pairs(
    plan_elements: list[dict],
    redundancy_threshold: float = 0.92
) -> list[tuple[dict, dict, float]]:
    """
    Identify pairs of plan elements that are semantically redundant.
    plan_elements: list of {"id": str, "text": str, "embedding": list[float]}
    """
    redundant_pairs = []

    for a, b in combinations(plan_elements, 2):
        a_emb = np.array(a["embedding"])
        b_emb = np.array(b["embedding"])
        similarity = np.dot(a_emb, b_emb) / (
            np.linalg.norm(a_emb) * np.linalg.norm(b_emb)
        )
        if similarity >= redundancy_threshold:
            redundant_pairs.append((a, b, float(similarity)))

    return sorted(redundant_pairs, key=lambda x: x[2], reverse=True)
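
The text above says the threshold must be calibrated but does not prescribe a method. One possible heuristic, offered here as an assumption rather than the chapter's method, is to take a set of elements that reviewers have already confirmed to be distinct and place the threshold at a high percentile of their pairwise similarities. The function names and the synthetic embeddings below are illustrative only; real calibration would use encoded plan text.

```python
import numpy as np


def pairwise_cosine(embeddings: np.ndarray) -> np.ndarray:
    """Cosine similarity for every unordered pair of rows."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    upper = np.triu_indices(len(embeddings), k=1)  # skip diagonal and duplicates
    return sims[upper]


def calibrate_redundancy_threshold(
    known_distinct: np.ndarray, percentile: float = 99.0
) -> float:
    """Threshold at a high percentile of similarities among distinct elements."""
    return float(np.percentile(pairwise_cosine(known_distinct), percentile))


# Synthetic stand-ins: random vectors, and the same vectors plus a shared
# offset that mimics the common vocabulary of a single organization.
rng = np.random.default_rng(0)
cross_org = rng.normal(size=(20, 16))
same_org = cross_org + 3.0 * np.ones(16)

cross_org_threshold = calibrate_redundancy_threshold(cross_org)
same_org_threshold = calibrate_redundancy_threshold(same_org)
print(cross_org_threshold, same_org_threshold)
```

In this toy setup the shared offset pushes every pair closer together, so the calibrated threshold for the same-organization set comes out higher, which matches the over-flagging concern described above.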

Cross-level alignment. Each initiative should be semantically close to its parent goal. A workstream that drifts far from its parent initiative in concept space may be misassigned, may address the wrong problem, or may represent scope creep the planning team has not recognized.

def check_vertical_alignment(
    parent: dict,
    children: list[dict],
    alignment_threshold: float = 0.70
) -> list[dict]:
    """
    Identify child elements that are poorly aligned with their parent.
    Low parent-child similarity indicates scope drift or misassignment.
    """
    parent_emb = np.array(parent["embedding"])
    misaligned = []

    for child in children:
        child_emb = np.array(child["embedding"])
        alignment = np.dot(parent_emb, child_emb) / (
            np.linalg.norm(parent_emb) * np.linalg.norm(child_emb)
        )
        if alignment < alignment_threshold:
            misaligned.append({**child, "parent_alignment": float(alignment)})

    return sorted(misaligned, key=lambda x: x["parent_alignment"])

Contradiction detection. Two plan elements at the same level may address the same topic with contradictory implications. "Reduce the workforce in technology operations by 20%" and "Expand the technology operations center by hiring 50 senior engineers" are contradictory. Their concept embeddings will be close in topic (both about technology operations workforce) but diverge on the headcount direction dimension.

This requires a precomputed direction vector in concept space encoding the growth/reduction dimension — similar to the obligation direction from Chapter 9. For planning use cases, direction vectors encoding growth/reduction, centralization/decentralization, and build/buy are the most useful to precompute.
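
A minimal sketch of this check follows, under stated assumptions: the direction vector is taken as the normalized mean difference between paired "growth" and "reduction" embeddings, and a pair is flagged when the two elements are close in topic once the direction component is removed but project onto the direction with opposite signs. The function names, thresholds, and four-dimensional toy vectors are all hypothetical; real SONAR embeddings may not separate this cleanly along a single linear direction.

```python
import numpy as np


def unit(v: np.ndarray) -> np.ndarray:
    """Scale a non-zero vector to unit length."""
    return v / np.linalg.norm(v)


def build_direction_vector(positive: np.ndarray, negative: np.ndarray) -> np.ndarray:
    """Normalized mean difference between paired embeddings (row i pairs with row i)."""
    return unit((positive - negative).mean(axis=0))


def is_contradictory(
    a_emb: np.ndarray,
    b_emb: np.ndarray,
    direction: np.ndarray,
    topic_threshold: float = 0.6,   # assumed value, needs calibration
    min_projection: float = 0.1,    # assumed value, needs calibration
) -> bool:
    # Remove the direction component so topic is compared on its own.
    a_topic = a_emb - np.dot(a_emb, direction) * direction
    b_topic = b_emb - np.dot(b_emb, direction) * direction
    topic_sim = np.dot(unit(a_topic), unit(b_topic))

    # Signed position of each element along the strategic dimension.
    a_proj = np.dot(unit(a_emb), direction)
    b_proj = np.dot(unit(b_emb), direction)
    opposite = a_proj * b_proj < 0 and min(abs(a_proj), abs(b_proj)) >= min_projection

    return bool(topic_sim >= topic_threshold and opposite)


# Toy data: the last axis plays the role of the growth/reduction dimension.
axis = np.array([0.0, 0.0, 0.0, 1.0])
topics = np.array([[1.0, 0.2, 0.0, 0.0], [0.1, 1.0, 0.3, 0.0]])
direction = build_direction_vector(topics + axis, topics - axis)

expand = topics[0] + 0.5 * axis    # same topic, growth side
reduce_ = topics[0] - 0.5 * axis   # same topic, reduction side
grow_more = topics[0] + 0.4 * axis

print(is_contradictory(expand, reduce_, direction))    # True
print(is_contradictory(expand, grow_more, direction))  # False
```

The same function could be reused with direction vectors for centralization/decentralization or build/buy, provided each is built from its own set of paired examples.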

11.3 Reference Architecture: LCM Strategic Planning Assistant

The complete planning assistant follows the same four-stage LCM pattern from Chapter 9, adapted for hierarchical plan generation and validation.

Stage 1: Goal encoding. The user provides the high-level strategic goal in natural language. SONAR encodes it into a goal concept embedding. This embedding becomes the semantic anchor for all subsequent plan generation — every initiative, workstream, and project is checked for alignment against this anchor.

Stage 2: Hierarchical concept generation. The concept model generates concept embeddings for each level of the hierarchy, top-down: Goal → Initiative embeddings → Workstream embeddings → Project embeddings. At each level, N candidate elements are generated as concept embeddings and the consistency checks run before the level is decoded into natural language.

Stage 3: Consistency validation. Before decoding any plan element: run redundancy detection (flag pairs above threshold), cross-level alignment (flag children below alignment threshold), and contradiction detection (flag pairs close in topic but divergent on key strategic dimensions). Validation runs in concept space — no decoding required. Only elements that pass are decoded.

Stage 4: Structured plan output. Validated concept embeddings are decoded into natural language plan elements. The output includes the plan hierarchy, validation results (elements flagged for human review), and an overall plan coherence score: mean parent-child alignment across all levels.

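from dataclasses import dataclass


@dataclass
class PlanResult:
    """Container for the generated plan and its validation results."""
    goal: str
    initiatives: list[dict]
    flagged_elements: list[dict]
    coherence_score: float


class HierarchicalPlanningAssistant: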
    def __init__(self, encoder, concept_model, decoder):
        self.encoder = encoder
        self.concept_model = concept_model
        self.decoder = decoder

    def generate_plan(
        self,
        strategic_goal: str,
        num_initiatives: int = 5,
        num_workstreams_per_initiative: int = 3,
        lang: str = "eng_Latn"
    ) -> PlanResult:
        # Encode strategic goal
        goal_embedding = self.encoder.predict([strategic_goal], source_lang=lang)[0]

        # Generate initiative concept embeddings
        initiative_embeddings = self.concept_model.generate_children(
            parent_embedding=goal_embedding,
            num_children=num_initiatives,
            diversity_weight=0.3  # Penalize similarity among siblings
        )

        # Check initiative-level redundancy
        initiatives_with_embeddings = [
            {"id": f"I{i}", "embedding": emb.tolist()}
            for i, emb in enumerate(initiative_embeddings)
        ]
        redundant_initiatives = detect_redundant_pairs(
            initiatives_with_embeddings, redundancy_threshold=0.90
        )

        # Decode initiatives (for non-redundant ones)
        decoded_initiatives = []
        flagged_for_review = []

        for item in initiatives_with_embeddings:
            if any(item["id"] in (r[0]["id"], r[1]["id"]) for r in redundant_initiatives):
                flagged_for_review.append({**item, "flag": "potential_redundancy"})
            else:
                decoded_text = self.decoder.predict(
                    [item["embedding"]], target_lang=lang
                )[0]
                decoded_initiatives.append({**item, "text": decoded_text})

        # Generate and validate workstreams for each initiative
        # ... (similar pattern for each level)

        return PlanResult(
            goal=strategic_goal,
            initiatives=decoded_initiatives,
            flagged_elements=flagged_for_review,
            coherence_score=self._compute_coherence(
                goal_embedding, initiative_embeddings
            )
        )

    def _compute_coherence(self, goal_embedding, child_embeddings):
        """Mean parent-child alignment score."""
        goal = np.array(goal_embedding)
        scores = []
        for child_emb in child_embeddings:
            child = np.array(child_emb)
            scores.append(np.dot(goal, child) / (
                np.linalg.norm(goal) * np.linalg.norm(child)
            ))
        return float(np.mean(scores))
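
The listing above scores only the goal-to-initiative edges, while Stage 4 describes the coherence score as mean parent-child alignment across all levels. One way to extend it, sketched here under the assumption that the plan is held as a nested dictionary with hypothetical `embedding` and `children` keys, is to average cosine similarity over every edge in the tree.

```python
import numpy as np


def cosine(a, b) -> float:
    """Cosine similarity between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def plan_coherence(root: dict) -> float:
    """Mean parent-child cosine similarity over every edge in the plan tree."""
    scores = []
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in parent.get("children", []):
            scores.append(cosine(parent["embedding"], child["embedding"]))
            stack.append(child)
    # A plan with no children has no edges to score.
    return float(np.mean(scores)) if scores else float("nan")


# Toy tree: one aligned initiative with one orthogonal workstream.
toy_plan = {
    "embedding": [1.0, 0.0],
    "children": [
        {
            "embedding": [1.0, 0.0],
            "children": [{"embedding": [0.0, 1.0]}],
        }
    ],
}
print(plan_coherence(toy_plan))  # 0.5
```

Averaging over edges weights deep, wide levels more heavily than the top level; a per-level mean is an equally reasonable choice if the top of the hierarchy should count for more.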

11.4 Human-in-the-Loop Integration

The planning assistant does not replace human judgment — it structures it. Consistency checks produce a prioritized review agenda: the plan elements most likely to be redundant, misaligned, or contradictory. Human reviewers focus on flagged elements rather than reading the entire plan for consistency.

The workflow:

1. System generates plan concept embeddings and runs consistency checks
2. System decodes validated elements into natural language
3. Flagged elements go to human reviewers with context: which other element they resemble, the similarity score, and the semantic dimension where they diverge
4. Reviewers resolve flagged items: confirm redundancy and remove one, clarify scope to reduce similarity, or clear the flag if the similarity is coincidental
5. System re-encodes resolved elements and re-runs consistency checks for affected levels
6. Repeat until no items remain flagged above threshold

This iterative loop is the standard pattern for LCM-assisted planning. The LCM handles semantic consistency computation; humans handle organizational priorities, political feasibility, and domain-specific context the model does not have. The combination catches more than either alone.
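
The control flow of this loop can be sketched as follows. The checker and the reviewer are passed in as callables because the chapter does not define their interfaces; `review_until_clean`, `find_flags`, and `resolve` are hypothetical names, and the toy checker below compares text for equality only to keep the example self-contained.

```python
from typing import Callable

Element = dict
Flag = tuple[str, str]  # ids of two elements flagged as a pair


def review_until_clean(
    elements: list[Element],
    find_flags: Callable[[list[Element]], list[Flag]],
    resolve: Callable[[list[Element], list[Flag]], list[Element]],
    max_rounds: int = 10,
) -> tuple[list[Element], list[Flag], int]:
    """Alternate checking and human resolution until nothing is flagged."""
    rounds = 0
    flags = find_flags(elements)
    while flags and rounds < max_rounds:
        elements = resolve(elements, flags)   # human review step
        flags = find_flags(elements)          # re-run checks on the result
        rounds += 1
    return elements, flags, rounds


def same_text_flags(elements: list[Element]) -> list[Flag]:
    """Toy checker: flag pairs with identical text."""
    return [
        (a["id"], b["id"])
        for i, a in enumerate(elements)
        for b in elements[i + 1:]
        if a["text"] == b["text"]
    ]


def drop_second(elements: list[Element], flags: list[Flag]) -> list[Element]:
    """Toy reviewer: confirm every flag and remove the second element."""
    to_drop = {second for _, second in flags}
    return [e for e in elements if e["id"] not in to_drop]


plan = [
    {"id": "I0", "text": "modernize payments"},
    {"id": "I1", "text": "modernize payments"},
    {"id": "I2", "text": "retire mainframe"},
]
final_plan, remaining, rounds = review_until_clean(plan, same_text_flags, drop_second)
print([e["id"] for e in final_plan], remaining, rounds)  # ['I0', 'I2'] [] 1
```

The `max_rounds` cap is a design choice added here so that a reviewer who keeps clearing and reintroducing the same flag cannot stall the loop; any flags still open are returned rather than hidden.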

The 14 redundant initiatives and 3 contradictory goals in the bank's plan would have appeared as flagged items during Stage 3 consistency validation: before decoding, before delivery, and before the three months of downstream work that made them expensive to unwind.

Exercises

Type	Exercise	Description
Coding	Redundancy detector	Encode 15 strategic initiative descriptions from a fictional five-year technology transformation plan using SONAR. Compute pairwise cosine similarities. Plot as a heatmap. Identify which pairs exceed your redundancy threshold. Manually review the flagged pairs: are they genuinely redundant, or is the high similarity a coincidence of vocabulary? Adjust the threshold based on your findings.
Design	Direction vector construction	The contradiction detector requires a precomputed direction vector encoding the growth/reduction strategic tension. Describe how you would construct this direction vector in SONAR concept space. What sentence pairs would you use as positive examples (growth) and negative examples (reduction)? How would you compute the direction vector from these examples?
Analysis	Coherence score calibration	A planning team produces three different five-year roadmaps for the same strategic goal. Compute the coherence score (mean parent-child alignment) for each. Does the coherence score correlate with the human reviewers' assessment of which plan is the most internally consistent? What would it mean if the coherence score and human assessment disagree?
