NeuralPrint Documentation


Technical reference for TRIBE v2 — the foundation model powering NeuralPrint's brain activity predictions.

TRIBE v2 · CC BY-NC · March 2026

A Predictive Foundation Model for Human Brain Activity

TRIBE v2 is a tri-modal foundation model that predicts whole-brain fMRI responses to video, audio, and text stimuli — trained on 1,117+ hours of neural recordings from 720 healthy volunteers.

720 training subjects · 1,117+ fMRI hours · 29,286 brain locations · 70× resolution increase · ~1B trainable params

Overview

TRIBE v2 (Tri-modal Brain Encoding model) is Meta FAIR's first AI foundation model of human brain responses to sights, sounds, and language. It acts as a digital twin of human neural activity — given any video, audio, or text stimulus, it predicts what the brain's fMRI response would look like across 29,286 cortical and subcortical locations.

The core insight: representations inside modern AI models (vision transformers, LLMs, speech models) geometrically align with representations inside the human brain. TRIBE v2 exploits this alignment at scale — building on the Algonauts 2025 award-winning model (4 subjects) to achieve zero-shot generalization across 720 volunteers.

THE FOUR DESIGN PILLARS

Integration: whole-brain predictions across all experimental conditions simultaneously.

Performance: exceeds traditional linear encoding models, with a zero-shot R_group ≈ 0.4 on the HCP dataset, 2× better than the median individual subject.

Generalization: zero-shot predictions for new subjects, new languages, and unseen tasks, without any fine-tuning.

Interpretability: mechanistic decomposition; ICA naturally recovers five known functional brain networks without supervision.
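The interpretability claim can be illustrated with a minimal sketch: run ICA over a (time × brain-location) matrix of predicted responses and inspect the resulting spatial components. Everything here is illustrative; the paper's actual decomposition pipeline may differ, and random data is used as a stand-in for TRIBE v2 predictions.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Illustrative stand-in for model predictions: 600 TRs x 1,000 brain locations.
rng = np.random.default_rng(0)
predicted_bold = rng.standard_normal((600, 1000))

# Decompose into 5 independent components, mirroring the five functional
# networks the paper reports recovering without supervision.
ica = FastICA(n_components=5, random_state=0)
time_courses = ica.fit_transform(predicted_bold)  # (600, 5) temporal sources
spatial_maps = ica.mixing_                        # (1000, 5) spatial weights
```

On real predictions, each column of `spatial_maps` would be projected back onto the cortical surface and compared against known network atlases.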

AUTHORS

Stéphane d'Ascoli, Jérémy Rapin, Yohann Benchetrit, Teon Brookes, Katelyn Begany,
Joséphine Raugel, Hubert Banville, Jean-Rémi King · FAIR at Meta


Model Architecture

TRIBE v2 uses frozen pre-trained backbones for each modality and trains only a transformer aggregator and subject-conditional linear layer on top — approximately 1 billion trainable parameters in total.

PIPELINE

Video → V-JEPA-2-Giant (D = 1,280 · frozen)
Audio → Wav2Vec-Bert-2.0 (D = 1,024 · frozen)
Text → Llama-3.2-3B (D = 2,048 · frozen)
    ↓ each modality projected to D = 384 and concatenated (3 × 384 = 1,152)
Transformer encoder · 8 layers · 8 heads · D_model = 1,152
Subject block · linear (S × D × N) · dropout p = 0.1
fMRI output · 29,286 locations @ 1 Hz

Modality | Model            | Embedding Dim | Sampling Rate
Video    | V-JEPA-2-Giant   | D = 1,280     | 2 Hz (4 s window)
Audio    | Wav2Vec-Bert-2.0 | D = 1,024     | 2 Hz (60 s chunks)
Text     | Llama-3.2-3B     | D = 2,048     | Per-word, timestamped

All three backbones are frozen during training. Only the transformer aggregator and subject block are updated — this keeps training compute manageable while leveraging the full representational power of each foundation model.
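As a rough illustration of this trainable head, here is a minimal PyTorch sketch. It is not the released TRIBE v2 code: the class name and layout are hypothetical, and only the quoted figures (per-modality projection to D = 384, an 8-layer/8-head encoder at D_model = 1,152, and one D × N linear readout per subject) are taken from this section.

```python
import torch
import torch.nn as nn

class TribeAggregatorSketch(nn.Module):
    """Illustrative sketch of TRIBE v2's trainable head (not the released code).

    Docs figures: video 1,280 / audio 1,024 / text 2,048 dims, projected to
    384 each, concatenated to D_model = 1,152, 8 layers, 8 heads, then a
    per-subject linear readout to 29,286 locations with dropout p = 0.1.
    """

    def __init__(self, dims=(1280, 1024, 2048), d_proj=384, n_layers=8,
                 n_heads=8, n_subjects=4, n_locations=29286, dropout=0.1):
        super().__init__()
        # One learned projection per frozen backbone's feature stream.
        self.proj = nn.ModuleList([nn.Linear(d, d_proj) for d in dims])
        d_model = d_proj * len(dims)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Subject block: S separate (D x N) linear readouts.
        self.readout = nn.ModuleList(
            [nn.Linear(d_model, n_locations) for _ in range(n_subjects)])
        self.drop = nn.Dropout(dropout)

    def forward(self, video, audio, text, subject):
        # Inputs: (batch, time, D_modality) features from the frozen backbones.
        x = torch.cat(
            [p(f) for p, f in zip(self.proj, (video, audio, text))], dim=-1)
        x = self.encoder(x)
        return self.readout[subject](self.drop(x))  # (batch, time, locations)
```

With the default sizes this reproduces the order of magnitude quoted in the docs: the subject readouts alone contribute roughly 34M parameters per subject.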

Training Data

TRIBE v2 was trained on a curated collection of naturalistic fMRI datasets totalling 1,117+ hours of neural recordings from 720 healthy volunteers exposed to diverse real-world media — TV episodes, podcasts, documentaries, silent videos, and narrative text.

Dataset                         | Split | Subjects | fMRI Hours | Modalities
CNeuroMod (Friends TV + movies) | Train | 4        | 268.7 h    | Video + Audio + Text
BoldMoments (short videos)      | Train | 10       | 61.9 h     | Video + Audio
LeBel2023 (podcasts)            | Train | 8        | 85.8 h     | Audio + Text
Wen2017 (silent videos)         | Train | 3        | 35.2 h     | Video
NNDb, LPP, Narratives, HCP      | Test  | 695      | 666.1 h    | Various
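As a sanity check, the per-dataset rows sum exactly to the headline totals:

```python
# (subjects, hours) per dataset row, copied from the table above.
rows = {
    "CNeuroMod":   (4,   268.7),
    "BoldMoments": (10,  61.9),
    "LeBel2023":   (8,   85.8),
    "Wen2017":     (3,   35.2),
    "Test sets":   (695, 666.1),
}
total_subjects = sum(s for s, _ in rows.values())
total_hours = round(sum(h for _, h in rows.values()), 1)
print(total_subjects, total_hours)  # -> 720 1117.7
```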

Totals: 720 subjects · 1,117+ fMRI hours · 4 training datasets · 4 test datasets

Training exclusively uses naturalistic stimuli (real media content rather than controlled lab stimuli) to maximise generalisation. As a result, NeuralPrint is designed to predict brain responses to everyday video, audio, and text, not just synthetic or highly controlled inputs.

Key Results

R ≈ 0.4 · Zero-shot group score on HCP, 2× better than the median individual subject.

+50% · Multimodal gain over the best unimodal model, largest in the temporo-parieto-occipital junction.

70× · Resolution increase vs. prior models, at comparable generalisation performance.
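The R scores in this section are encoding correlations. A common way to compute such a score, assumed here for illustration (the paper may differ in detail, e.g. noise-ceiling normalisation), is a per-location Pearson r between predicted and measured BOLD, averaged over locations:

```python
import numpy as np

def encoding_score(pred, true):
    """Mean per-location Pearson r between predicted and measured responses.

    pred, true: (time, locations) arrays.
    """
    p = pred - pred.mean(axis=0)
    t = true - true.mean(axis=0)
    r = (p * t).sum(axis=0) / (
        np.sqrt((p ** 2).sum(axis=0) * (t ** 2).sum(axis=0)) + 1e-8)
    return r.mean()

# A perfect prediction scores ~1.0; unrelated signals score ~0.
rng = np.random.default_rng(0)
y = rng.standard_normal((200, 50))
print(round(encoding_score(y, y), 3))  # -> 1.0
```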

TRIBE v2 correctly recovers all the following region-selectivity findings from the neuroscience literature — without any explicit supervision for these contrasts.

VISUAL SYSTEM

FFA Fusiform Face Area → selectively responds to faces

PPA Parahippocampal Place Area → responds to scenes & places

EBA Extrastriate Body Area → responds to bodies

VWFA Visual Word Form Area → responds to written text

LANGUAGE SYSTEM

A5 Auditory association area → speech vs. silence

STS Superior temporal sulcus → speech vs. natural sounds

TPJ Temporo-parietal junction → emotional processing

BA45 Broca's area → syntactic complexity

Known Limitations

TRIBE v2 is a powerful but scoped model. Understanding its limitations is essential for interpreting NeuralPrint outputs responsibly.

Limitation                              | Impact                                            | Workaround
fMRI resolution (~2 mm voxels, 1 s TR)  | Cannot capture millisecond-level neural dynamics  | Focus on slow perceptual and cognitive processes
No olfaction, touch, or proprioception  | Incomplete sensory model                          | Scope all inputs to audio, video, and text
Passive observer model                  | Does not model active behaviour or decision-making| Use for passive perception experiments only
Healthy adult brains only               | Not representative of clinical or paediatric populations | Avoid any clinical or diagnostic claims
CC BY-NC licence                        | Non-commercial use only                           | Academic and research projects only
Subject block memory footprint          | Memory-intensive when loading many subjects       | Use zero-shot (group) mode for inference
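The subject-block footprint can be estimated from the architecture numbers in this document: each subject adds one D_model × N linear readout. Assuming float32 weights and counting only the 25 training subjects from the data table (an assumption; the released model may store subjects differently):

```python
D_MODEL, N_LOCATIONS = 1152, 29286
BYTES_PER_PARAM = 4  # float32

# Weights in a single subject's linear readout (D_model x N_locations).
per_subject = D_MODEL * N_LOCATIONS
print(per_subject)  # -> 33737472 (~135 MB per subject in float32)

# Training subjects from the data table: 4 + 10 + 8 + 3.
train_subjects = 4 + 10 + 8 + 3
subject_block = per_subject * train_subjects
print(subject_block * BYTES_PER_PARAM / 1e9)  # ~3.4 GB of float32 weights
```

This is why the table recommends zero-shot (group) mode for inference: loading hundreds of per-subject readouts multiplies this footprint, while a single group readout does not.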

Resources & Links

BACKBONE MODELS & REFERENCED PAPERS

Model / Paper    | Reference
V-JEPA-2-Giant   | arXiv:2506.09985
Wav2Vec-Bert-2.0 | Chung et al., 2021 · IEEE ASRU
Llama-3.2-3B     | Grattafiori et al., 2024 · arXiv:2407.21783
HCP Parcellation | Glasser et al., 2016 · Nature, 536:171

Licence

TRIBE v2 model weights, code, and research paper are released under a CC BY-NC licence by Meta FAIR. NeuralPrint is a non-commercial research tool built on top of TRIBE v2. Authors: d'Ascoli, Rapin, Benchetrit, Brookes, Begany, Raugel, Banville, King (FAIR at Meta), March 2026.