Phase 1: GNN-MoE - Foundations of Dense Expert Collaboration

Traditional Mixture of Experts (MoE) architectures have a fundamental limitation: they route each token to only a small subset of experts, typically just 1-2 out of many available. This sparse activation pattern means that most experts remain idle for any given input, and crucially, experts never communicate with each other during processing.

Phase 1 of this project challenged this paradigm by asking: What if all experts could collaborate on every token?

The Dense MoE Revolution

Instead of using a traditional router to select which experts to activate, the GNN-MoE architecture activates all experts for every token. But here's the key innovation: after the experts process the input, they communicate through a Graph Neural Network (GNN) "coupler" that enables learned information exchange between them.

Think of it like replacing a single specialist consultant with a collaborative team of experts who can discuss and refine their analyses together before reaching a final conclusion.

Technical Architecture

The GNN coupler treats each expert as a node in a graph, with learnable adjacency matrices determining the communication topology. During each forward pass:

- every expert processes the full input sequence in parallel;
- the expert outputs are stacked as node features on the expert graph;
- one or more GNN message-passing steps exchange information between experts along the learned adjacency;
- the refined expert representations are combined into a single output for the next layer.
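A minimal PyTorch sketch of this flow follows. The class and parameter names (`GNNCoupler`, `DenseMoELayer`, `num_experts`, and so on) are illustrative assumptions, not the exact identifiers used in `gnn_moe_architecture.py`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GNNCoupler(nn.Module):
    """One round of message passing over the expert graph.

    The adjacency is a learnable (num_experts x num_experts) parameter;
    a softmax over each row turns it into normalized communication weights.
    """
    def __init__(self, num_experts: int, d_model: int):
        super().__init__()
        self.adjacency_logits = nn.Parameter(torch.zeros(num_experts, num_experts))
        self.message_proj = nn.Linear(d_model, d_model)

    def forward(self, expert_states: torch.Tensor) -> torch.Tensor:
        # expert_states: (batch, seq, num_experts, d_model)
        adjacency = F.softmax(self.adjacency_logits, dim=-1)   # (E, E)
        messages = self.message_proj(expert_states)            # (B, S, E, D)
        # Each expert aggregates messages from all experts, weighted by the adjacency.
        mixed = torch.einsum("ij,bsjd->bsid", adjacency, messages)
        return expert_states + mixed                            # residual update

class DenseMoELayer(nn.Module):
    """All experts process every token; the GNN coupler lets them communicate."""
    def __init__(self, num_experts: int, d_model: int, d_ff: int):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.coupler = GNNCoupler(num_experts, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> stack every expert's output as a graph node.
        expert_states = torch.stack([expert(x) for expert in self.experts], dim=2)
        expert_states = self.coupler(expert_states)
        # Combine the refined expert views (a simple mean here) into one output.
        return expert_states.mean(dim=2)
```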

Key Innovations

Learnable Communication Patterns: Unlike fixed expert hierarchies, the adjacency matrix learns optimal communication patterns during training. Some experts might become highly connected "hub" nodes, while others develop specialized pairwise relationships.
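As an illustration, the learned adjacency can be inspected after training to see which experts behave as hubs. This is a hypothetical sketch that assumes the coupler exposes its `adjacency_logits` as in the example above, not the repository's actual analysis API.

```python
import torch.nn.functional as F

def expert_influence_scores(coupler):
    """Rank experts by how strongly the other experts attend to their messages.

    With row-normalized adjacency A, A[i, j] is the weight receiver i gives to
    sender j, so column sums measure each expert's total outgoing influence;
    high values suggest "hub" experts.
    """
    adjacency = F.softmax(coupler.adjacency_logits, dim=-1).detach()
    influence = adjacency.sum(dim=0)
    return influence.argsort(descending=True), influence
```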

Dense Activation Benefits: By activating all experts, the model can leverage the full capacity of its expert network for complex reasoning tasks. No expert knowledge is left unused, and the communication layer ensures that diverse perspectives are integrated effectively.

Modular Design: The codebase establishes clean separation between configuration (`gnn_moe_config.py`), architecture (`gnn_moe_architecture.py`), training (`gnn_moe_training.py`), and analysis (`gnn_moe_analysis.py`), creating a foundation for future architectural evolution.
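For a sense of how this separation works, the configuration module can be pictured as a plain dataclass that the architecture, training, and analysis modules all consume. The field names and defaults below are assumptions for illustration, not the exact contents of `gnn_moe_config.py`.

```python
from dataclasses import dataclass

@dataclass
class GNNMoEConfig:
    # Model size (illustrative defaults, not the repository's actual values).
    d_model: int = 512
    d_ff: int = 2048
    num_layers: int = 6
    num_experts: int = 4
    # Training settings shared by the training and analysis modules.
    learning_rate: float = 3e-4
    batch_size: int = 32
    max_steps: int = 50_000
    checkpoint_dir: str = "checkpoints/"
```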

Performance Characteristics

Initial experiments showed promising results with 2-8 experts, achieving perplexities in the 87-92 range on WikiText datasets. Interestingly, smaller expert counts often outperformed larger ones, suggesting that expert communication quality matters more than raw expert quantity.

However, scaling challenges emerged with larger expert counts due to VRAM limitations. The dense connectivity patterns of traditional GNNs create O(E²) memory complexity (where E is the number of experts), making it difficult to scale beyond 8 experts on standard hardware.
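The quadratic growth is easy to see by counting pairwise connections: a fully connected expert graph has E² adjacency entries and E² directed messages per token per layer. The rough count below only illustrates that trend; actual memory also depends on model width, sequence length, and checkpointing, and the numbers are not measurements from the project.

```python
def pairwise_message_count(num_experts: int, seq_len: int, num_layers: int) -> int:
    """Directed expert-to-expert messages per sequence in a fully connected coupler.

    Assumes one message-passing round per layer; this only shows the E^2 trend.
    """
    return num_experts ** 2 * seq_len * num_layers

for experts in (2, 4, 8, 16):
    print(experts, pairwise_message_count(experts, seq_len=512, num_layers=6))
# 2 -> 12288, 4 -> 49152, 8 -> 196608, 16 -> 786432 (4x growth per doubling of experts)
```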

Foundation for Evolution

Phase 1 established the core principles that would guide the entire project: dense activation of every expert on every token, learned rather than hand-designed communication between experts, and a modular codebase that keeps configuration, architecture, training, and analysis cleanly separated.

While VRAM constraints limited immediate scaling, Phase 1 proved the viability of dense expert collaboration and identified the path forward: more efficient communication mechanisms that could capture richer expert relationships while reducing memory overhead.

This foundation set the stage for Phase 2's evolution to Hypergraph Neural Networks, which would address the scaling limitations while enabling even more sophisticated multi-expert interactions.

Code and Implementation

The complete Phase 1 implementation is available in the `gnn_MoE` directory of the repository. Key features include comprehensive hyperparameter sweeps, automated checkpointing, and detailed analysis tools for visualizing expert communication patterns.
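As a minimal sketch of the kind of visualization such analysis tools can produce, the snippet below plots a heatmap of the learned expert adjacency. It assumes the hypothetical `GNNCoupler` interface from the earlier sketch rather than the repository's actual analysis code.

```python
import matplotlib.pyplot as plt
import torch.nn.functional as F

def plot_expert_communication(coupler, path="expert_adjacency.png"):
    """Save a heatmap of the normalized expert-to-expert communication weights."""
    adjacency = F.softmax(coupler.adjacency_logits, dim=-1).detach().cpu().numpy()
    fig, ax = plt.subplots(figsize=(4, 4))
    im = ax.imshow(adjacency, cmap="viridis")
    ax.set_xlabel("source expert")        # column j: sender of the message
    ax.set_ylabel("destination expert")   # row i: expert aggregating the messages
    fig.colorbar(im, ax=ax)
    fig.savefig(path, bbox_inches="tight")
```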