NeuralPrint Documentation


Technical reference for TRIBE v2 — the foundation model powering NeuralPrint's brain activity predictions.

TRIBE v2 · CC BY-NC · March 2026

A Predictive Foundation Model for Human Brain Activity

TRIBE v2 is a tri-modal foundation model that predicts whole-brain fMRI responses to video, audio, and text stimuli — trained on 1,117+ hours of neural recordings from 720 healthy volunteers.

720 training subjects · 1,117+ fMRI hours · 29,286 brain locations · 70× resolution increase · ~1B trainable params

Overview

TRIBE v2 (Tri-modal Brain Encoding model) is Meta FAIR's first AI foundation model of human brain responses to sights, sounds, and language. It acts as a digital twin of human neural activity — given any video, audio, or text stimulus, it predicts what the brain's fMRI response would look like across 29,286 cortical and subcortical locations.

The core insight: representations inside modern AI models (vision transformers, LLMs, speech models) geometrically align with representations inside the human brain. TRIBE v2 exploits this alignment at scale — building on the Algonauts 2025 award-winning model (4 subjects) to achieve zero-shot generalization across 720 volunteers.

THE FOUR DESIGN PILLARS

Integration: whole-brain predictions across all experimental conditions simultaneously.

Performance: exceeds traditional linear encoding models, with a zero-shot R_group ≈ 0.4 on the HCP dataset, 2× better than the median individual subject.

Generalization: zero-shot predictions for new subjects, new languages, and unseen tasks, without any fine-tuning.

Interpretability: mechanistic decomposition; ICA naturally recovers five known functional brain networks without supervision.
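The interpretability claim can be illustrated with a minimal sketch: run ICA over a (time × brain-location) matrix of predicted responses and inspect the resulting spatial components. Everything here is illustrative; the paper's actual decomposition pipeline may differ, and random data is used as a stand-in for TRIBE v2 predictions.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Illustrative stand-in for model predictions: 600 TRs x 1,000 brain locations.
rng = np.random.default_rng(0)
predicted_bold = rng.standard_normal((600, 1000))

# Decompose into 5 independent components, mirroring the five functional
# networks the paper reports recovering without supervision.
ica = FastICA(n_components=5, random_state=0)
time_courses = ica.fit_transform(predicted_bold)  # (600, 5) temporal sources
spatial_maps = ica.mixing_                        # (1000, 5) spatial weights
```

On real predictions, each column of `spatial_maps` would be projected back onto the cortical surface and compared against known network atlases.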

AUTHORS

Stéphane d'Ascoli, Jérémy Rapin, Yohann Benchetrit, Teon Brookes, Katelyn Begany,
Joséphine Raugel, Hubert Banville, Jean-Rémi King · FAIR at Meta


Model Architecture

TRIBE v2 uses frozen pre-trained backbones for each modality and trains only a transformer aggregator and subject-conditional linear layer on top — approximately 1 billion trainable parameters in total.

PIPELINE

Video → V-JEPA-2-Giant (D = 1,280 · frozen)
Audio → Wav2Vec-Bert-2.0 (D = 1,024 · frozen)
Text → Llama-3.2-3B (D = 2,048 · frozen)
    ↓ each modality projected to D = 384 and concatenated (3 × 384 = 1,152)
Transformer encoder · 8 layers · 8 heads · D_model = 1,152
Subject block · linear (S × D × N) · dropout p = 0.1
fMRI output · 29,286 locations @ 1 Hz

Modality | Model            | Embedding Dim | Sampling Rate
Video    | V-JEPA-2-Giant   | D = 1,280     | 2 Hz (4 s window)
Audio    | Wav2Vec-Bert-2.0 | D = 1,024     | 2 Hz (60 s chunks)
Text     | Llama-3.2-3B     | D = 2,048     | Per-word, timestamped

All three backbones are frozen during training. Only the transformer aggregator and subject block are updated — this keeps training compute manageable while leveraging the full representational power of each foundation model.
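As a rough illustration of this trainable head, here is a minimal PyTorch sketch. It is not the released TRIBE v2 code: the class name and layout are hypothetical, and only the quoted figures (per-modality projection to D = 384, an 8-layer/8-head encoder at D_model = 1,152, and one D × N linear readout per subject) are taken from this section.

```python
import torch
import torch.nn as nn

class TribeAggregatorSketch(nn.Module):
    """Illustrative sketch of TRIBE v2's trainable head (not the released code).

    Docs figures: video 1,280 / audio 1,024 / text 2,048 dims, projected to
    384 each, concatenated to D_model = 1,152, 8 layers, 8 heads, then a
    per-subject linear readout to 29,286 locations with dropout p = 0.1.
    """

    def __init__(self, dims=(1280, 1024, 2048), d_proj=384, n_layers=8,
                 n_heads=8, n_subjects=4, n_locations=29286, dropout=0.1):
        super().__init__()
        # One learned projection per frozen backbone's feature stream.
        self.proj = nn.ModuleList([nn.Linear(d, d_proj) for d in dims])
        d_model = d_proj * len(dims)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Subject block: S separate (D x N) linear readouts.
        self.readout = nn.ModuleList(
            [nn.Linear(d_model, n_locations) for _ in range(n_subjects)])
        self.drop = nn.Dropout(dropout)

    def forward(self, video, audio, text, subject):
        # Inputs: (batch, time, D_modality) features from the frozen backbones.
        x = torch.cat(
            [p(f) for p, f in zip(self.proj, (video, audio, text))], dim=-1)
        x = self.encoder(x)
        return self.readout[subject](self.drop(x))  # (batch, time, locations)
```

With the default sizes this reproduces the order of magnitude quoted in the docs: the subject readouts alone contribute roughly 34M parameters per subject.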

Training Data

TRIBE v2 was trained on a curated collection of naturalistic fMRI datasets totalling 1,117+ hours of neural recordings from 720 healthy volunteers exposed to diverse real-world media — TV episodes, podcasts, documentaries, silent videos, and narrative text.

Dataset                         | Split | Subjects | fMRI Hours | Modalities
CNeuroMod (Friends TV + movies) | Train | 4        | 268.7 h    | Video + Audio + Text
BoldMoments (short videos)      | Train | 10       | 61.9 h     | Video + Audio
LeBel2023 (podcasts)            | Train | 8        | 85.8 h     | Audio + Text
Wen2017 (silent videos)         | Train | 3        | 35.2 h     | Video
NNDb, LPP, Narratives, HCP      | Test  | 695      | 666.1 h    | Various
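As a sanity check, the per-dataset rows sum exactly to the headline totals:

```python
# (subjects, hours) per dataset row, copied from the table above.
rows = {
    "CNeuroMod":   (4,   268.7),
    "BoldMoments": (10,  61.9),
    "LeBel2023":   (8,   85.8),
    "Wen2017":     (3,   35.2),
    "Test sets":   (695, 666.1),
}
total_subjects = sum(s for s, _ in rows.values())
total_hours = round(sum(h for _, h in rows.values()), 1)
print(total_subjects, total_hours)  # -> 720 1117.7
```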

Totals: 720 subjects · 1,117+ fMRI hours · 4 training datasets · 4 test datasets

Training exclusively uses naturalistic stimuli (real media content rather than controlled lab stimuli) to maximise generalisation. As a result, NeuralPrint is designed to predict brain responses to everyday video, audio, and text, not just synthetic or highly controlled inputs.

Key Results

R ≈ 0.4 · Zero-shot group score on HCP, 2× better than the median individual subject.

+50% · Multimodal gain over the best unimodal model, largest in the temporo-parieto-occipital junction.

70× · Resolution increase vs. prior models, at comparable generalisation performance.
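The R scores in this section are encoding correlations. A common way to compute such a score, assumed here for illustration (the paper may differ in detail, e.g. noise-ceiling normalisation), is a per-location Pearson r between predicted and measured BOLD, averaged over locations:

```python
import numpy as np

def encoding_score(pred, true):
    """Mean per-location Pearson r between predicted and measured responses.

    pred, true: (time, locations) arrays.
    """
    p = pred - pred.mean(axis=0)
    t = true - true.mean(axis=0)
    r = (p * t).sum(axis=0) / (
        np.sqrt((p ** 2).sum(axis=0) * (t ** 2).sum(axis=0)) + 1e-8)
    return r.mean()

# A perfect prediction scores ~1.0; unrelated signals score ~0.
rng = np.random.default_rng(0)
y = rng.standard_normal((200, 50))
print(round(encoding_score(y, y), 3))  # -> 1.0
```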

TRIBE v2 correctly recovers all the following region-selectivity findings from the neuroscience literature — without any explicit supervision for these contrasts.

VISUAL SYSTEM

FFA Fusiform Face Area → selectively responds to faces

PPA Parahippocampal Place Area → responds to scenes & places

EBA Extrastriate Body Area → responds to bodies

VWFA Visual Word Form Area → responds to written text

LANGUAGE SYSTEM

A5 Auditory association area → speech vs. silence

STS Superior temporal sulcus → speech vs. natural sounds

TPJ Temporo-parietal junction → emotional processing

BA45 Broca's area → syntactic complexity

Known Limitations

TRIBE v2 is a powerful but scoped model. Understanding its limitations is essential for interpreting NeuralPrint outputs responsibly.

Limitation                              | Impact                                            | Workaround
fMRI resolution (~2 mm voxels, 1 s TR)  | Cannot capture millisecond-level neural dynamics  | Focus on slow perceptual and cognitive processes
No olfaction, touch, or proprioception  | Incomplete sensory model                          | Scope all inputs to audio, video, and text
Passive observer model                  | Does not model active behaviour or decision-making| Use for passive perception experiments only
Healthy adult brains only               | Not representative of clinical or paediatric populations | Avoid any clinical or diagnostic claims
CC BY-NC licence                        | Non-commercial use only                           | Academic and research projects only
Subject block memory footprint          | Memory-intensive when loading many subjects       | Use zero-shot (group) mode for inference
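The subject-block footprint can be estimated from the architecture numbers in this document: each subject adds one D_model × N linear readout. Assuming float32 weights and counting only the 25 training subjects from the data table (an assumption; the released model may store subjects differently):

```python
D_MODEL, N_LOCATIONS = 1152, 29286
BYTES_PER_PARAM = 4  # float32

# Weights in a single subject's linear readout (D_model x N_locations).
per_subject = D_MODEL * N_LOCATIONS
print(per_subject)  # -> 33737472 (~135 MB per subject in float32)

# Training subjects from the data table: 4 + 10 + 8 + 3.
train_subjects = 4 + 10 + 8 + 3
subject_block = per_subject * train_subjects
print(subject_block * BYTES_PER_PARAM / 1e9)  # ~3.4 GB of float32 weights
```

This is why the table recommends zero-shot (group) mode for inference: loading hundreds of per-subject readouts multiplies this footprint, while a single group readout does not.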

Resources & Links

BACKBONE MODELS & REFERENCED PAPERS

Model / Paper    | Reference
V-JEPA-2-Giant   | arXiv:2506.09985
Wav2Vec-Bert-2.0 | Chung et al., 2021 · IEEE ASRU
Llama-3.2-3B     | Grattafiori et al., 2024 · arXiv:2407.21783
HCP Parcellation | Glasser et al., 2016 · Nature, 536:171

Licence

TRIBE v2 model weights, code, and research paper are released under a CC BY-NC licence by Meta FAIR. NeuralPrint is a non-commercial research tool built on top of TRIBE v2. Authors: d'Ascoli, Rapin, Benchetrit, Brookes, Begany, Raugel, Banville, King (FAIR at Meta), March 2026.