

We support 3 flavors of R(2+1)D:

  • r2plus1d_18_16_kinetics 18-layer R(2+1)D pre-trained on Kinetics 400 (used by default) – it is identical to the torchvision implementation
  • r2plus1d_34_32_ig65m_ft_kinetics 34-layer R(2+1)D pre-trained on IG-65M and fine-tuned on Kinetics 400 – the weights are provided by the moabitcoin/ig65m-pytorch repo for stack/step size 32.
  • r2plus1d_34_8_ig65m_ft_kinetics – the same as the one above, but pre-trained with stack/step size 8

The models are pre-trained on RGB frames and follow the plain torchvision augmentation sequence.
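Since the default flavor is identical to the torchvision implementation, you can inspect the backbone by loading it straight from torchvision. This is a minimal sketch that only illustrates the expected input layout; it is not how this repository builds the extractor:

import torch
import torchvision

# 18-layer R(2+1)D pre-trained on Kinetics 400 -- the same architecture as
# the default r2plus1d_18_16_kinetics flavor
model = torchvision.models.video.r2plus1d_18(pretrained=True).eval()

# torchvision video models expect (batch, channels, time, height, width)
clip = torch.rand(1, 3, 16, 112, 112)  # one stack of 16 RGB frames, 112x112

with torch.no_grad():
    logits = model(clip)

print(logits.shape)  # torch.Size([1, 400]) -- Kinetics 400 class scores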


The flavors that were pre-trained on IG-65M and fine-tuned on Kinetics 400 yield significantly better performance than the default model (e.g. the 32-frame model reaches an accuracy of 79.10 vs. 57.50 for the default model).

By default (model_name=r2plus1d_18_16_kinetics), the model expects a stack of 16 RGB frames (112x112) as input, which spans 0.64 seconds of a video recorded at 25 fps. In the default case, the features will be of size Tv x 512, where Tv = duration / 0.64. Specify model_name, step_size, and stack_size to change the default behavior.
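To make the arithmetic concrete, here is a small sketch that estimates how many feature vectors to expect for a given duration and the stack/step settings. The helper is illustrative and assumes a simple sliding window; the extractor's own handling of the last, incomplete window may differ:

def num_feature_vectors(duration_sec, fps=25, stack_size=16, step_size=16):
    """Rough estimate of Tv: how many windows of stack_size frames fit
    into the video when the window advances by step_size frames."""
    total_frames = int(duration_sec * fps)
    if total_frames < stack_size:
        return 0
    return (total_frames - stack_size) // step_size + 1

# default case: a 64-second video at 25 fps -> 64 / 0.64 = 100 vectors of size 512
print(num_feature_vectors(64))  # 100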

Set up the Environment for R(2+1)D

Set up the conda environment. The requirements are in the file conda_env_torch_zoo.yml

# it will create a new conda environment called 'torch_zoo' on your machine
conda env create -f conda_env_torch_zoo.yml

Quick Start


Activate the environment

conda activate torch_zoo

and extract features from the ./sample/v_GGSY1Qvo990.mp4 video and show the predicted classes

python main.py \
    feature_type=r21d \
    video_paths="[./sample/v_GGSY1Qvo990.mp4]" \
    show_pred=true

Supported Arguments

  • model_name (default: "r2plus1d_18_16_kinetics") – a variant of R(2+1)D. "r2plus1d_18_16_kinetics", "r2plus1d_34_32_ig65m_ft_kinetics", and "r2plus1d_34_8_ig65m_ft_kinetics" are supported.
  • stack_size (default: null) – the number of frames from which to extract features (or window size). If omitted, it follows the config model_name was trained with.
  • step_size (default: null) – the number of frames to step before extracting the next features. If omitted, it follows the config model_name was trained with.
  • extraction_fps (default: null) – if specified (e.g. as 5), the video will be re-encoded to the extraction_fps fps. Leave unspecified or null to skip re-encoding.
  • device (default: "cuda:0") – the device specification. It follows the PyTorch style. Use "cuda:3" for the 4th GPU on the machine or "cpu" for CPU-only.
  • video_paths (default: null) – a list of videos for feature extraction, e.g. "[./sample/v_ZNVhz7ctTq0.mp4, ./sample/v_GGSY1Qvo990.mp4]" or just one path "./sample/v_GGSY1Qvo990.mp4".
  • file_with_video_paths (default: null) – a path to a text file with video paths (one path per line). Hint: given a folder ./dataset with .mp4 files, one could use: find ./dataset -name "*mp4" > ./video_paths.txt.
  • on_extraction (default: print) – if print, the features are printed to the terminal; if save_numpy or save_pickle, the features are saved to a .npy or .pkl file (see the loading sketch after this list).
  • output_path (default: "./output") – a path to a folder for storing the extracted features (if on_extraction is either save_numpy or save_pickle).
  • keep_tmp_files (default: false) – if true, the re-encoded videos are kept in tmp_path.
  • tmp_path (default: "./tmp") – a path to a folder for storing temporary files (e.g. re-encoded videos).
  • show_pred (default: false) – if true, the script prints the model's predictions on a downstream task, which is useful for debugging.
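If you run the extraction with on_extraction=save_numpy, the features can be loaded back with NumPy. A minimal sketch; the exact file name under output_path depends on the extractor's naming scheme, so the path below is only a placeholder:

import numpy as np

# placeholder path -- check output_path after extraction for the actual file name
features = np.load("./output/v_GGSY1Qvo990_r21d.npy")

# with the default r2plus1d_18_16_kinetics settings this is (Tv, 512)
print(features.shape)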


Start by activating the environment

conda activate torch_zoo

The following command will extract R(2+1)D features for two sample videos using the default parameters.

python main.py \
    feature_type=r21d \
    device="cuda:0" \
    video_paths="[./sample/v_ZNVhz7ctTq0.mp4, ./sample/v_GGSY1Qvo990.mp4]"

Here is an example with the r2plus1d_34_32_ig65m_ft_kinetics 34-layer R(2+1)D model that was pre-trained on IG-65M and then fine-tuned on Kinetics 400

python main.py \
    feature_type=r21d \
    model_name="r2plus1d_34_32_ig65m_ft_kinetics" \
    device="cuda:0" \
    video_paths="[./sample/v_ZNVhz7ctTq0.mp4, ./sample/v_GGSY1Qvo990.mp4]"

See the config file for other supported parameters. Note that this implementation of R(2+1)D only supports the RGB stream.


  1. The TorchVision implementation.
  2. The R(2+1)D paper: A Closer Look at Spatiotemporal Convolutions for Action Recognition.
  3. Thanks to @ohjho we now also support the flavors of the 34-layer model pre-trained on IG-65M and fine-tuned on Kinetics 400.


The wrapping code is under MIT; however, it utilizes the torchvision library, which is under the BSD 3-Clause "New" or "Revised" License.