Documentation
Technical reference for TRIBE v2 — the foundation model powering NeuralPrint's brain activity predictions.
A Predictive Foundation Model for Human Brain Activity
TRIBE v2 is a tri-modal foundation model that predicts whole-brain fMRI responses to video, audio, and text stimuli — trained on 1,117+ hours of neural recordings from 720 healthy volunteers.
- 720 training subjects
- 1,117+ fMRI hours
- 29,286 brain locations
- 70× resolution increase
- ~1B trainable params
Overview
TRIBE v2 (Tri-modal Brain Encoding model) is Meta FAIR's first AI foundation model of human brain responses to sights, sounds, and language. It acts as a digital twin of human neural activity — given any video, audio, or text stimulus, it predicts what the brain's fMRI response would look like across 29,286 cortical and subcortical locations.
The core insight: representations inside modern AI models (vision transformers, LLMs, speech models) geometrically align with representations inside the human brain. TRIBE v2 exploits this alignment at scale — building on the Algonauts 2025 award-winning model (4 subjects) to achieve zero-shot generalization across 720 volunteers.
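This alignment is what classic linear encoding models exploit: fit a regularised linear map from a frozen AI model's features to each brain location's response, then score it by correlating predictions with measured activity. A minimal NumPy sketch on synthetic data (not the authors' code; shapes are scaled down for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: T stimulus frames, D-dim backbone embeddings,
# N brain locations (real values: D up to 2,048, N = 29,286).
T, D, N = 200, 64, 50
X = rng.standard_normal((T, D))                      # frozen-backbone features
W_true = rng.standard_normal((D, N)) * 0.1
Y = X @ W_true + 0.5 * rng.standard_normal((T, N))   # "fMRI" responses

# Linear encoding model: ridge regression from features to locations.
lam = 10.0
W = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)

# Encoding score: per-location Pearson r between predicted and measured.
Y_hat = X @ W
r = [np.corrcoef(Y[:, i], Y_hat[:, i])[0, 1] for i in range(N)]
print(f"mean encoding r = {np.mean(r):.2f}")
```

TRIBE v2 replaces the single linear map with a trained transformer aggregator, but the scoring logic (per-location correlation) is the same idea.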
THE FOUR DESIGN PILLARS
- Whole-brain predictions across all experimental conditions simultaneously
- Accuracy that exceeds traditional linear encoding models: zero-shot R_group ≈ 0.4 on the HCP dataset, 2× better than the median individual subject
- Zero-shot predictions for new subjects, new languages, and unseen tasks without any fine-tuning
- Mechanistic decomposition: ICA naturally recovers five known functional brain networks without supervision
AUTHORS
Stéphane d'Ascoli, Jérémy Rapin, Yohann Benchetrit, Teon Brookes, Katelyn Begany,
Joséphine Raugel, Hubert Banville, Jean-Rémi King · FAIR at Meta
Model Architecture
TRIBE v2 uses frozen pre-trained backbones for each modality and trains only a transformer aggregator and subject-conditional linear layer on top — approximately 1 billion trainable parameters in total.
PIPELINE
Video → V-JEPA-2-Giant (D = 1,280 · frozen)
Audio → Wav2Vec-Bert-2.0 (D = 1,024 · frozen)
Text → Llama-3.2-3B (D = 2,048 · frozen)
↓
Transformer encoder · 8 layers · 8 heads · D_model = 1,152
↓
Subject block · linear (S × D × N) · dropout p = 0.1
↓
fMRI output · 29,286 locations @ 1 Hz
| Modality | Model | Embedding Dim | Sampling Rate |
|---|---|---|---|
| Video | V-JEPA-2-Giant | D = 1,280 | 2 Hz (4 s window) |
| Audio | Wav2Vec-Bert-2.0 | D = 1,024 | 2 Hz (60 s chunks) |
| Text | Llama-3.2-3B | D = 2,048 | Per-word, timestamped |
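Because the three feature streams arrive at different rates, they must be brought onto the 1 Hz fMRI clock before aggregation. A hedged sketch of one plausible alignment step (the function names and mean-pooling choices are illustrative, not the paper's exact procedure):

```python
import numpy as np

def to_1hz(features, rate_hz):
    """Average fixed-rate features of shape (T*rate, D) into 1 s bins -> (T, D)."""
    step = int(rate_hz)
    T = features.shape[0] // step
    return features[: T * step].reshape(T, step, -1).mean(axis=1)

def words_to_1hz(word_feats, word_times, n_seconds, dim):
    """Pool timestamped per-word features into 1 s bins -> (n_seconds, D)."""
    out = np.zeros((n_seconds, dim))
    counts = np.zeros(n_seconds)
    for f, t in zip(word_feats, word_times):
        s = min(int(t), n_seconds - 1)   # second in which the word occurs
        out[s] += f
        counts[s] += 1
    return out / np.maximum(counts, 1)[:, None]  # empty bins stay zero

# 20 frames of 2 Hz video features -> 10 seconds on the fMRI clock.
video = to_1hz(np.random.randn(20, 1280), rate_hz=2)
print(video.shape)  # (10, 1280)
```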
All three backbones are frozen during training. Only the transformer aggregator and subject block are updated — this keeps training compute manageable while leveraging the full representational power of each foundation model.
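The subject block described above (a linear S × D × N readout) can be sketched as follows. Treating the group readout as the mean of per-subject weights is an assumption for illustration, not necessarily the paper's zero-shot mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shapes from the architecture: S subjects, D_model = 1,152, N = 29,286
# locations. N is scaled down here so the sketch runs instantly.
S, D, N, T = 4, 1152, 128, 10

# Shared aggregator output for T seconds of stimulus.
h = rng.standard_normal((T, D))

# Subject block: one D x N readout per subject, i.e. an S x D x N tensor.
W_subj = rng.standard_normal((S, D, N)) * 0.01

def predict(h, subject_id):
    """Map shared features to one subject's fMRI time series (T, N)."""
    return h @ W_subj[subject_id]

# Hypothetical "group" readout for zero-shot inference: average the weights.
W_group = W_subj.mean(axis=0)
y_subject = predict(h, subject_id=2)   # (T, N) for one known subject
y_group = h @ W_group                  # (T, N) for an unseen subject
```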
Training Data
TRIBE v2 was trained on a curated collection of naturalistic fMRI datasets totalling 1,117+ hours of neural recordings from 720 healthy volunteers exposed to diverse real-world media — TV episodes, podcasts, documentaries, silent videos, and narrative text.
| Dataset | Split | Subjects | fMRI Hours | Modalities |
|---|---|---|---|---|
| CNeuroMod (Friends TV + movies) | Train | 4 | 268.7 h | Video + Audio + Text |
| BoldMoments (short videos) | Train | 10 | 61.9 h | Video + Audio |
| LeBel2023 (podcasts) | Train | 8 | 85.8 h | Audio + Text |
| Wen2017 (silent videos) | Train | 3 | 35.2 h | Video |
| NNDb, LPP, Narratives, HCP | Test | 695 | 666.1 h | Various |
- 720 total subjects
- 1,117+ total fMRI hours
- 4 training datasets
- 4 test datasets
Training exclusively uses naturalistic stimuli (real media content rather than controlled lab stimuli) to maximise generalisation. As a result, NeuralPrint is designed to predict brain responses to everyday video, audio, and text, not just synthetic or highly controlled inputs.
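The headline totals can be checked against the table above with a few lines:

```python
# Subjects and fMRI hours per dataset, copied from the table above.
train = {
    "CNeuroMod":   (4, 268.7),
    "BoldMoments": (10, 61.9),
    "LeBel2023":   (8, 85.8),
    "Wen2017":     (3, 35.2),
}
test = {"NNDb/LPP/Narratives/HCP": (695, 666.1)}

train_hours = sum(h for _, h in train.values())
total_hours = train_hours + sum(h for _, h in test.values())
total_subjects = sum(s for s, _ in train.values()) + sum(s for s, _ in test.values())

print(f"train: {train_hours:.1f} h, total: {total_hours:.1f} h, "
      f"subjects: {total_subjects}")
```

The sums reproduce the 1,117+ hours and 720 subjects quoted in the text.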
Key Results
- R ≈ 0.4: zero-shot group encoding score on HCP, 2× better than the median individual subject
- +50%: multimodal gain over the best unimodal model, largest in the temporo-parietal-occipital junction
- 70×: resolution increase vs. prior models at comparable generalisation performance
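The "2× better than the median individual" pattern is what one would expect when predictions are scored against a group average, since averaging subjects cancels idiosyncratic noise. A toy simulation (synthetic data; the numbers are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
T, S = 500, 20
signal = rng.standard_normal(T)                         # shared stimulus-driven component
subjects = signal + 2.0 * rng.standard_normal((S, T))   # heavy per-subject noise

pred = signal + 0.8 * rng.standard_normal(T)            # an imperfect model prediction

def r(a, b):
    return np.corrcoef(a, b)[0, 1]

r_indiv = np.median([r(pred, s) for s in subjects])
r_group = r(pred, subjects.mean(axis=0))   # averaging cancels subject noise
print(f"median individual r = {r_indiv:.2f}, group r = {r_group:.2f}")
```

The same prediction correlates roughly twice as well with the group-averaged response as with the median single subject.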
TRIBE v2 recovers all of the following region-selectivity findings from the neuroscience literature, without any explicit supervision for these contrasts.
VISUAL SYSTEM
FFA — Fusiform Face Area → selectively responds to faces
PPA — Parahippocampal Place Area → responds to scenes & places
EBA — Extrastriate Body Area → responds to bodies
VWFA — Visual Word Form Area → responds to written text
LANGUAGE SYSTEM
A5 — Auditory association area → speech vs. silence
STS — Superior temporal sulcus → speech vs. natural sounds
TPJ — Temporo-parietal junction → emotional processing
BA45 — Broca's area → syntactic complexity
Known Limitations
TRIBE v2 is a powerful but scoped model. Understanding its limitations is essential for interpreting NeuralPrint outputs responsibly.
| Limitation | Impact | Workaround |
|---|---|---|
| fMRI resolution: ~2 mm, 1 Hz TR | Cannot capture millisecond-level neural dynamics | Focus on slow perceptual and cognitive processes |
| No olfaction, touch, or proprioception | Incomplete sensory model | Scope all inputs to audio, video, and text |
| Passive observer model | Does not model active behaviour or decision-making | Use for passive perception experiments only |
| Healthy adult brain only | Not representative of clinical or paediatric populations | Avoid any clinical or diagnostic claims |
| CC BY-NC licence | Non-commercial use only | Academic and research projects only |
| Subject block memory footprint | Memory-intensive when loading many subjects | Use zero-shot (group) mode for inference |
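The subject-block footprint can be estimated from the architecture's own numbers, assuming one fp32 D × N readout matrix per subject (a back-of-envelope estimate, not a measured figure):

```python
D, N = 1152, 29286        # D_model and output locations from the architecture
BYTES_PER_PARAM = 4       # fp32

def subject_block_gib(num_subjects):
    """Approximate fp32 size of an S x D x N subject block, in GiB."""
    return num_subjects * D * N * BYTES_PER_PARAM / 2**30

print(f"{subject_block_gib(4):.2f} GiB")    # the 4 CNeuroMod train subjects
print(f"{subject_block_gib(720):.1f} GiB")  # hypothetically, all 720 subjects
```

Roughly half a GiB for four subjects, but around 90 GiB for all 720, which is why the zero-shot (group) mode is the practical choice for inference at scale.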
Resources & Links
GitHub — Code & Implementation
github.com/facebookresearch/tribev2
HuggingFace — Model Weights
facebook/tribev2 · CC BY-NC
Interactive Demo
aidemos.atmeta.com/tribev2
Meta Blog Post
ai.meta.com · March 2026
BACKBONE MODELS & REFERENCED PAPERS
| Model / Paper | Reference |
|---|---|
| V-JEPA-2-Giant | arXiv:2506.09985 |
| Wav2Vec-Bert-2.0 | Chung et al., 2021 · IEEE ASRU |
| Llama-3.2-3B | Grattafiori et al., 2024 · arXiv:2407.21783 |
| HCP Parcellation | Glasser et al., 2016 · Nature, 536:171 |
Licence
TRIBE v2 model weights, code, and research paper are released under a CC BY-NC licence by Meta FAIR. NeuralPrint is a non-commercial research tool built on top of TRIBE v2. Authors: d'Ascoli, Rapin, Benchetrit, Brookes, Begany, Raugel, Banville, King (FAIR at Meta), March 2026.