Learning spatiotemporal features with 3D convolutional networks

In this note, we discuss [TBF+15].

Summary

An approach for learning spatiotemporal features using deep 3D convolutional networks is presented. The network is trained on a large supervised video dataset (Sports-1M).

Claims

  • 3D convolutional networks are more suitable for spatiotemporal feature learning than 2D convolutional networks.
  • A homogeneous architecture with \(3 \times 3 \times 3\) kernels in all layers performs quite well.
  • The learned C3D (Convolutional 3D) features, with a simple linear classifier, outperform state-of-the-art methods on several benchmarks.

Further:

  • Features are compact.
  • Features are efficient to compute.
  • Features are conceptually simple and easy to train.

Prior work

  • Spatio-temporal interest points (STIP)
  • SIFT-3D for action recognition
  • HOG-3D for action recognition
  • Cuboids features for behavior recognition
  • Improved Dense Trajectories (iDT)
  • Two-stream networks
  • Action recognition using a human detector and head tracking
  • Deep Video [KTS+14]

Remarks on prior work

  • Image-based deep features are not directly suitable for video because they do not model motion.
  • iDT performs well but is computationally intensive and intractable on large-scale datasets.

Major results

Benchmarks used

  • Sports-1M for action recognition
  • UCF101 for action recognition
  • ASLAN for action similarity labeling
  • YUPENN for scene classification
  • UMD for scene classification
  • Object (egocentric videos) for object recognition

Proposals

Requirements of effective video descriptors:

  • Generic: Can represent different types of videos well while being discriminative.
  • Compact: Should be small in size to build databases of millions of videos. Storage and retrieval tasks should be scalable.
  • Efficient to compute: Need to process thousands of videos every minute.
  • Simple to implement: Avoid complicated feature encoding methods and classifiers.

Examples of different types of videos: landscapes, natural scenes, sports, TV shows, movies, pets, food.

Proposed 3D convnets

  • \(3 \times 3 \times 3\) convolution kernels for all layers work best.
  • Encapsulate information related to objects, scenes and actions in a video.
  • No need to fine-tune the model for each specific task.
  • Model appearance and motion simultaneously.
  • Evaluated on 4 different tasks and 6 different benchmarks; outperform prior state of the art on 4 benchmarks and are comparable on the other 2.
  • Are compact and efficient to compute.
  • No preprocessing on the video. Full video frames as input.
  • 3D pooling: gradual pooling of space and time information.
  • In 3D nets, both the input and the output of each layer are image volumes.

Comparison with 2D ConvNets

  • 2D convolutional networks lose the temporal information of the input signal immediately after every convolution operation.
  • Similarly, 3D pooling retains temporal information while 2D pooling discards it.
  • 2D convolutions applied to multiple frames (treating the frames as channels) still produce only a single image, collapsing the temporal dimension.
  • The slow fusion model [KTS+14] uses 3D convolutions and average pooling in its first 3 layers, but loses all temporal information after those layers. Its use of 3D convolutions in the initial layers is probably the reason it performs better than the other models compared in [KTS+14].
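
A minimal sketch of this difference (PyTorch, my illustration, not code from the paper): a 2D convolution over a 16-frame clip must fold the frames into channels and emits a single image, while a 3D convolution keeps the temporal axis.

    import torch
    import torch.nn as nn

    clip = torch.randn(1, 3, 16, 112, 112)   # batch x channels x frames x H x W

    # 2D conv: the 16 frames are folded into 3*16 = 48 channels; output is one image.
    conv2d = nn.Conv2d(48, 64, kernel_size=3, padding=1)
    print(conv2d(clip.reshape(1, 48, 112, 112)).shape)  # [1, 64, 112, 112]

    # 3D conv: the temporal dimension survives; output is still a volume.
    conv3d = nn.Conv3d(3, 64, kernel_size=3, padding=1)
    print(conv3d(clip).shape)                           # [1, 64, 16, 112, 112]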

Notation

  • \(c\) : number of channels in image
  • \(l\) : number of frames in video clip
  • \(h\) : height of frame
  • \(w\) : width of frame
  • \(c \times l \times h \times w\) : size of the video clip
  • \(k\) : kernel spatial size
  • \(d\) : kernel temporal depth
  • \(d \times k \times k\) : size of convolutional kernel and pooling kernel
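
For concreteness, a hypothetical PyTorch rendering of these shapes, with values chosen to match the experiments below:

    import torch

    clip   = torch.randn(3, 16, 112, 112)  # c x l x h x w
    kernel = torch.randn(64, 3, 3, 3, 3)   # num_filters x c x d x k x k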

Basic study of 3D convolution and learning

The goal here is to find the right setting of the kernel temporal depth \(d\).

Network architecture

  • Video frame size: \(128 \times 171\).
  • Non-overlapping 16-frame clips.
  • Jittering via random crops; training clip size: \(3 \times 16 \times 112 \times 112\).
  • 5 convolutional layers, 5 pooling layers.
  • 2 FC layers.
  • Final softmax layer.
  • Number of filters in each layer: 64, 128, 256, 256, 256.
  • Kernel temporal depth is variable.
  • Appropriate padding is used (both spatial and temporal).
  • Stride 1, "same" convolutions: no change in image volume size in the convolution layers.
  • First max pooling layer of size \(1 \times 2 \times 2\).
  • Remaining max pooling layers of size \(2 \times 2 \times 2\).
  • Clip temporal length changes through the pooling layers as \(16 \to 16 \to 8 \to 4 \to 2 \to 1\).
  • Each FC layer has 2048 outputs.
  • Mini-batches of 30 clips.
  • Initial learning rate 0.003.
  • Learning rate divided by 10 after every 4 epochs.
  • Training stopped after 16 epochs.
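
A sketch of this network in PyTorch (my reconstruction from the settings above, not the authors' code). `depths` lists the kernel temporal depth \(d\) of each conv layer, so the same function also serves the depth study in the next section:

    import torch.nn as nn

    def make_net(depths, num_classes=101):
        filters = [64, 128, 256, 256, 256]
        layers, in_ch = [], 3
        for i, (d, out_ch) in enumerate(zip(depths, filters)):
            # "Same" convolution: padding keeps the volume size unchanged.
            layers += [nn.Conv3d(in_ch, out_ch, kernel_size=(d, 3, 3),
                                 stride=1, padding=(d // 2, 1, 1)),
                       nn.ReLU(inplace=True)]
            # Pool1 is 1x2x2 (spatial only); the rest are 2x2x2, shrinking
            # clip length as 16 -> 16 -> 8 -> 4 -> 2 -> 1.
            pool = (1, 2, 2) if i == 0 else (2, 2, 2)
            layers.append(nn.MaxPool3d(pool))
            in_ch = out_ch
        # Spatial size after the 5 poolings: 112 -> 56 -> 28 -> 14 -> 7 -> 3.
        return nn.Sequential(*layers, nn.Flatten(),
                             nn.Linear(256 * 1 * 3 * 3, 2048), nn.ReLU(inplace=True),
                             nn.Linear(2048, 2048), nn.ReLU(inplace=True),
                             nn.Linear(2048, num_classes))  # softmax applied in the loss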

Temporal depth of convolutional kernel

The following options are considered:

  • Homogeneous temporal depth (same in each layer)
  • Varying temporal depth (different in each layer)

Setups

  • Homogeneous networks with temporal depth \(d\) in every layer (denoted depth-\(d\)).
  • The depth-1 network is equivalent to applying a 2D net to each frame independently.
  • Increasing depth net: 3-3-5-5-7.
  • Decreasing depth net: 7-5-5-3-3.
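
With the `make_net` sketch from the previous section, the four setups can be instantiated as:

    net_depth1   = make_net([1, 1, 1, 1, 1])  # 2D net applied frame by frame
    net_depth3   = make_net([3, 3, 3, 3, 3])  # homogeneous depth-3
    net_increase = make_net([3, 3, 5, 5, 7])  # increasing temporal depth
    net_decrease = make_net([7, 5, 5, 3, 3])  # decreasing temporal depth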

Results

  • Clip accuracy on UCF101 is the main metric.
  • The depth-3 network performs best among the homogeneous nets.
  • Depth-1 (the 2D net) is significantly worse than the others.
  • Depth-3 also performs better than the varying-depth networks, though the gap is small.
  • The increasing-depth net performs slightly better than the decreasing-depth net.

Spatiotemporal feature learning

Network architecture

  • Only \(3 \times 3 \times 3\) kernels are used.
  • 8 convolution layers, 5 pooling layers, two FC layers and then softmax layer.
  • C - P - C - P - C - C - P - C - C - P - C - C - P - FC - FC - SF.
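
A sketch of this architecture in PyTorch (my reconstruction, not the released C3D code; the filter counts 64, 128, 256, 256, 512, 512, 512, 512, the 4096-unit FC layers, and the 487 Sports-1M classes follow the paper, while the pool5 padding is an assumption chosen so that fc6 receives a \(512 \times 1 \times 4 \times 4\) volume):

    import torch.nn as nn

    def conv(in_ch, out_ch):  # 3x3x3 "same" convolution + ReLU
        return nn.Sequential(nn.Conv3d(in_ch, out_ch, 3, padding=1),
                             nn.ReLU(inplace=True))

    c3d = nn.Sequential(
        conv(3, 64),    nn.MaxPool3d((1, 2, 2)),   # pool1: spatial only
        conv(64, 128),  nn.MaxPool3d(2),           # pool2
        conv(128, 256), conv(256, 256), nn.MaxPool3d(2),   # pool3
        conv(256, 512), conv(512, 512), nn.MaxPool3d(2),   # pool4
        conv(512, 512), conv(512, 512),
        nn.MaxPool3d(2, padding=(0, 1, 1)),        # pool5: 2x7x7 -> 1x4x4
        nn.Flatten(),
        nn.Linear(512 * 1 * 4 * 4, 4096), nn.ReLU(inplace=True),  # fc6
        nn.Linear(4096, 4096), nn.ReLU(inplace=True),             # fc7
        nn.Linear(4096, 487),                      # Sports-1M has 487 classes
    )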

Training setup

  • Sports-1M dataset.
  • Five random 2-second clips from each video.
  • Frames resized to \(128 \times 171\).
  • Random cropping (both spatial and temporal) to \(16 \times 112 \times 112\).
  • Horizontal flipping with probability 0.5.
  • SGD with mini-batches of 30 clips.
  • Initial learning rate: 0.003.
  • Divided by 2 every 150K iterations.
  • Optimization stopped after 1.9M iterations (13 epochs).
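
A training-loop sketch matching this schedule, using the `c3d` sketch above (the `clip_loader` yielding batches of 30 augmented clips is assumed, not shown):

    import torch
    import torch.nn as nn

    optimizer = torch.optim.SGD(c3d.parameters(), lr=0.003)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=150_000, gamma=0.5)
    criterion = nn.CrossEntropyLoss()  # applies the softmax internally

    for iteration, (clips, labels) in enumerate(clip_loader):  # hypothetical loader
        optimizer.zero_grad()
        loss = criterion(c3d(clips), labels)
        loss.backward()
        optimizer.step()
        scheduler.step()  # stepped per iteration: lr halves every 150K iterations
        if iteration + 1 >= 1_900_000:  # roughly 13 epochs
            break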

Results

  • Top-5 accuracy of 84.4%.

C3D video descriptor

  • Break the video into 16-frame clips with an overlap of 8 frames.
  • Compute the output activation of the final FC layer for each clip.
  • Average these activations over all clips.
  • L2-normalize the resulting vector.
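
A sketch of the descriptor computation for a single video; `features` is assumed to be the network truncated after the chosen FC layer:

    import torch
    import torch.nn.functional as F

    def c3d_descriptor(video, features):  # video: c x l x h x w tensor
        acts = []
        for start in range(0, video.shape[1] - 16 + 1, 8):  # stride 8 = 8-frame overlap
            clip = video[:, start:start + 16].unsqueeze(0)  # 1 x c x 16 x h x w
            acts.append(features(clip).squeeze(0))
        descriptor = torch.stack(acts).mean(dim=0)          # average over clips
        return F.normalize(descriptor, p=2, dim=0)          # L2-normalize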

What does C3D learn?

  • Focuses on appearance in the first few frames of a clip.
  • Tracks the salient motion in the remaining frames.
  • Selectively attends to both motion and appearance.

Action recognition

Bibliography

[KTS+14] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1725–1732. 2014.
[TBF+15] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, 4489–4497. 2015.