R(2+1)D (RGB-only)

The implementation of the 18-layer R(2+1)D (RGB-only) network is borrowed from torchvision models. Similar to I3D, R(2+1)D is pre-trained on Kinetics 400. The features are extracted from the pre-classification layer of the network, so the output is a 512-d feature vector for each stack. By default, following the torchvision docs, the network expects a stack of 16 RGB frames (112x112) as input, which spans 0.64 seconds of a video recorded at 25 fps. Specify --step_size and --stack_size to change the default behavior. In the default case, the features will be of size Tv x 512, where Tv = duration / 0.64 (with the duration in seconds). The augmentations are similar to those proposed in the torchvision training scripts.
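The relation between the video length, the stack size, and the number of feature vectors can be sketched in a few lines of Python. This is only an illustration, not the extractor's code: the helper `stack_windows` is hypothetical, and it assumes that the step size defaults to the stack size (non-overlapping stacks, which is what Tv = duration / 0.64 implies) and that a trailing incomplete stack is dropped.

```python
def stack_windows(num_frames, stack_size=16, step_size=16):
    """Return (start, end) frame indices of every full stack; end is exclusive.

    Hypothetical helper: assumes a trailing incomplete stack is dropped.
    """
    if stack_size <= 0 or step_size <= 0:
        raise ValueError("stack_size and step_size must be positive")
    last_start = num_frames - stack_size
    return [(s, s + stack_size) for s in range(0, last_start + 1, step_size)]


fps = 25
duration_s = 10
num_frames = fps * duration_s  # 250 frames

windows = stack_windows(num_frames)
print("seconds per stack:", 16 / fps)        # 0.64
print("feature shape:", (len(windows), 512))  # (15, 512)

# Halving the step makes consecutive stacks overlap by 8 frames
overlapping = stack_windows(num_frames, step_size=8)
print("feature shape:", (len(overlapping), 512))  # (30, 512)
```

Under these assumptions, a 10-second clip at 25 fps yields 15 stacks with the defaults, and a smaller step size produces proportionally more feature vectors.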

Set up the Environment for R(2+1)D

Set up the conda environment. The requirements are listed in conda_env_torch_zoo.yml

# it will create a new conda environment called 'torch_zoo' on your machine
conda env create -f conda_env_torch_zoo.yml

Minimal Working Example

Activate the environment

conda activate torch_zoo

and extract features from the ./sample/v_GGSY1Qvo990.mp4 video and show the predicted classes

python \
    feature_type=r21d \
    video_paths="[./sample/v_GGSY1Qvo990.mp4]" \


Start by activating the environment

conda activate torch_zoo

The following command extracts R(2+1)D features for the sample videos using devices 0 and 2 in parallel. The features are extracted with the default parameters.

python \
    feature_type=r21d \
    device_ids="[0, 2]" \
    video_paths="[./sample/v_ZNVhz7ctTq0.mp4, ./sample/v_GGSY1Qvo990.mp4]"

See the I3D examples. Note that this implementation of R(2+1)D supports only the RGB stream.


  1. The TorchVision implementation.
  2. The R(2+1)D paper: A Closer Look at Spatiotemporal Convolutions for Action Recognition.


The wrapping code is released under the MIT license. However, it relies on the torchvision library, which is under the BSD 3-Clause "New" or "Revised" License.