Learning spatiotemporal features with 3D convolutional networks¶
In this note, we discuss [TBF+15].
Summary¶
An approach for learning spatiotemporal features using deep 3D convolutional networks is presented. The network is trained on a large supervised video dataset.
Claims
- 3D convolutional networks are more suitable for spatiotemporal feature learning than 2D convolutional networks.
- A homogeneous architecture with \(3 \times 3 \times 3\) kernels in all layers performs quite well.
- The learned C3D (convolutional 3D) features with a simple linear classifier outperform state-of-the-art methods on several benchmarks.
Further:
- Features are compact.
- Features are efficient to compute.
- Features are conceptually simple and easy to train.
Prior work¶
- Spatio-temporal interest points (STIP)
- SIFT-3D for action recognition
- HOG-3D for action recognition
- Cuboids features for behavior recognition
- Improved Dense Trajectories (iDT)
- Two stream networks
- Human detector and head tracking
- Deep Video [KTS+14]
Remarks on prior work
- Image-based deep features are not directly suitable for video because they do not model motion.
- iDT performs well but is computationally intensive and intractable on large-scale datasets.
Major results¶
Benchmarks used
- Sports-1M for action recognition
- UCF101 for action recognition
- ASLAN for action similarity labeling
- YUPENN for scene classification
- UMD for scene classification
- Object (egocentric videos) for object recognition
Proposals¶
Requirements of effective video descriptors:
- Generic: Can represent different types of videos well while being discriminative.
- Compact: Should be small in size to build databases of millions of videos. Storage and retrieval tasks should be scalable.
- Efficient to compute: Need to process thousands of videos per minute.
- Simple to implement: Avoid complicated feature encoding methods and classifiers.
Examples of different types of videos: Landscapes, Natural scenes, Sports, TV shows, Movies, Pets, Food
Proposed 3D convnets
- \(3 \times 3 \times 3\) convolution kernels for all layers work best.
- Encapsulate information related to objects, scenes and actions in a video.
- No need to fine-tune the model for each specific task.
- Model appearance and motion simultaneously.
- Outperform existing results on 4 different tasks and 6 different benchmarks.
- Are compact and efficient to compute.
- No preprocessing on the video. Full video frames as input.
- 3D pooling gradually aggregates space and time information.
- In 3D nets, both the input and the output of each layer are volumes.
Comparison with 2D ConvNets
- 2D convolutional networks lose the temporal information of the input signal right after every convolution operation.
- Similarly, 3D pooling retains temporal information while 2D pooling loses it.
- A 2D convolution applied to multiple frames (treated as extra channels) still produces a single image, so temporal structure collapses; see the sketch below.
- The slow fusion model [KTS+14] uses 3D convolutions and average pooling in its first 3 layers but loses all temporal information after that. The use of 3D convolutions in the initial layers is probably the reason for its better performance compared to the other models in [KTS+14].
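The channel-collapse point can be seen directly in code. A minimal PyTorch sketch (the shapes and layer choices are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

# A 16-frame RGB clip: (batch, channels, length, height, width).
clip = torch.randn(1, 3, 16, 112, 112)

# 2D convolution: frames must be folded into channels (3 * 16 = 48),
# and the output is a single image; the temporal axis is gone.
conv2d = nn.Conv2d(3 * 16, 64, kernel_size=3, padding=1)
print(conv2d(clip.reshape(1, 48, 112, 112)).shape)  # [1, 64, 112, 112]

# 3D convolution: the output is still a volume; temporal order survives.
conv3d = nn.Conv3d(3, 64, kernel_size=3, padding=1)
print(conv3d(clip).shape)  # [1, 64, 16, 112, 112]
```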
Notation
- \(c\) : number of channels in image
- \(l\) : number of frames in video clip
- \(h\) : height of frame
- \(w\) : width of frame
- \(c \times l \times h \times w\) : size of the video clip
- \(k\) : kernel spatial size
- \(d\) : kernel temporal depth
- \(d \times k \times k\) : size of convolutional kernel and pooling kernel
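To make the notation concrete, here is a small PyTorch sketch applying pooling kernels of different temporal depth \(d\) to a \(c \times l \times h \times w\) volume (the specific values are illustrative):

```python
import torch
import torch.nn as nn

# A clip volume with c=3, l=16, h=112, w=112 (plus a batch dimension).
x = torch.randn(1, 3, 16, 112, 112)

# d=1, k=2: pools space only; the temporal length l is preserved.
print(nn.MaxPool3d((1, 2, 2))(x).shape)  # [1, 3, 16, 56, 56]

# d=2, k=2: pools space and time, halving l as well.
print(nn.MaxPool3d((2, 2, 2))(x).shape)  # [1, 3, 8, 56, 56]
```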
Basic study of 3D convolution and learning¶
Here the goal is to find the right kernel temporal depth \(d\).
Network architecture (a code sketch follows this list)
- Video frame size : \(128 \times 171\).
- Non-overlapping 16-frame clips.
- Jittering using random crops; training clip size: \(3 \times 16 \times 112 \times 112\).
- 5 convolutional layers, 5 pooling layers.
- 2 FC layers.
- Final softmax layer.
- Number of filters in each layer: 64, 128, 256, 256, 256.
- Kernel temporal depth is variable.
- Appropriate padding is used (both spatially and temporally).
- Stride 1 with "same" convolutions, so the volume size is unchanged by the convolution layers.
- First max pooling layer of size \(1 \times 2 \times 2\).
- Remaining max pooling layers of size \(2 \times 2 \times 2\).
- Clip length changes as follows from layer to layer: \(16 \to 16 \to 8 \to 4 \to 2 \to 1\).
- The two FC layers have 2048 outputs each.
- Mini batches of 30 clips.
- Initial learning rate 0.003.
- Learning rate divided by 10 after every 4 epochs.
- Training stopped after 16 epochs.
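A minimal PyTorch sketch of the network described above, assuming the homogeneous depth-3 variant (\(d = 3\) everywhere); the `conv3` helper name and the 101-class output (for UCF101, on which these nets are evaluated) are my additions:

```python
import torch
import torch.nn as nn

def conv3(in_c, out_c, d=3):
    # "Same" convolution: stride 1, padding preserves the volume size.
    return nn.Sequential(
        nn.Conv3d(in_c, out_c, kernel_size=(d, 3, 3), padding=(d // 2, 1, 1)),
        nn.ReLU(inplace=True),
    )

# 5 conv layers (64, 128, 256, 256, 256 filters), each followed by pooling.
# The first pool is 1x2x2, so temporal length shrinks as 16->16->8->4->2->1.
net = nn.Sequential(
    conv3(3, 64),    nn.MaxPool3d((1, 2, 2)),
    conv3(64, 128),  nn.MaxPool3d((2, 2, 2)),
    conv3(128, 256), nn.MaxPool3d((2, 2, 2)),
    conv3(256, 256), nn.MaxPool3d((2, 2, 2)),
    conv3(256, 256), nn.MaxPool3d((2, 2, 2)),
    nn.Flatten(),                       # final volume: 256 x 1 x 3 x 3
    nn.Linear(256 * 1 * 3 * 3, 2048), nn.ReLU(inplace=True),
    nn.Linear(2048, 2048), nn.ReLU(inplace=True),
    nn.Linear(2048, 101),               # e.g. 101 classes for UCF101
)

x = torch.randn(1, 3, 16, 112, 112)     # a 16-frame 112x112 training clip
print(net(x).shape)                     # torch.Size([1, 101])
```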
Temporal depth of convolutional kernel
The following options are considered:
- Homogeneous temporal depth (same in each layer)
- Varying temporal depth (different in each layer)
Setups
- Homogeneous depth-\(d\) networks with \(d \in \{1, 3, 5, 7\}\).
- The depth-1 network is the same as applying a 2D net on each frame.
- Increasing depth nets: 3-3-5-5-7.
- Decreasing depth nets: 7-5-5-3-3.
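These variants can be expressed as per-layer depth lists, reusing the `conv3` helper from the sketch above (the dictionary keys and `conv_stack` name are mine):

```python
import torch.nn as nn

# Per-layer kernel temporal depths for the variants studied.
depth_variants = {
    "depth-1 (2D)": [1, 1, 1, 1, 1],
    "depth-3":      [3, 3, 3, 3, 3],
    "depth-5":      [5, 5, 5, 5, 5],
    "depth-7":      [7, 7, 7, 7, 7],
    "increasing":   [3, 3, 5, 5, 7],
    "decreasing":   [7, 5, 5, 3, 3],
}

def conv_stack(depths, channels=(3, 64, 128, 256, 256, 256)):
    layers = []
    for i, d in enumerate(depths):
        layers.append(conv3(channels[i], channels[i + 1], d=d))
        # First pool is 1x2x2 (space only); the rest are 2x2x2.
        layers.append(nn.MaxPool3d((1, 2, 2) if i == 0 else (2, 2, 2)))
    return nn.Sequential(*layers)
```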
Results
- Clip accuracy on the UCF101 dataset is the main metric.
- The depth-3 network performs best among the homogeneous nets.
- Depth-1 (the 2D net) is significantly worse than the others.
- Depth-3 also performs better than the varying-depth networks, though the gap is small.
- The increasing-depth net performs slightly better than the decreasing-depth net.
Spatiotemporal feature learning¶
Network architecture
- All layers use \(3 \times 3 \times 3\) kernels (homogeneous architecture).
- 8 convolution layers, 5 pooling layers, two FC layers, and a final softmax layer.
- C - P - C - P - C - C - P - C - C - P - C - C - P - FC - FC - SF (C: convolution, P: pooling, FC: fully connected, SF: softmax).
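A PyTorch sketch of this architecture. The per-layer filter counts (64, 128, 256, 256, 512, 512, 512, 512), the 4096-wide FC layers, and the 487 Sports-1M classes follow the paper; the pool5 padding is an assumption commonly used to obtain the 8192-dim FC input:

```python
import torch
import torch.nn as nn

def conv(in_c, out_c):
    # 3x3x3 "same" convolution + ReLU.
    return nn.Sequential(
        nn.Conv3d(in_c, out_c, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )

c3d = nn.Sequential(
    conv(3, 64),                  nn.MaxPool3d((1, 2, 2)),  # conv1a, pool1
    conv(64, 128),                nn.MaxPool3d((2, 2, 2)),  # conv2a, pool2
    conv(128, 256), conv(256, 256), nn.MaxPool3d((2, 2, 2)),  # conv3a/b, pool3
    conv(256, 512), conv(512, 512), nn.MaxPool3d((2, 2, 2)),  # conv4a/b, pool4
    conv(512, 512), conv(512, 512),                           # conv5a/b
    nn.MaxPool3d((2, 2, 2), padding=(0, 1, 1)),               # pool5 (padding assumed)
    nn.Flatten(),                             # final volume: 512 x 1 x 4 x 4
    nn.Linear(512 * 1 * 4 * 4, 4096), nn.ReLU(inplace=True),  # fc6
    nn.Linear(4096, 4096),            nn.ReLU(inplace=True),  # fc7
    nn.Linear(4096, 487),             # 487 Sports-1M classes
)

x = torch.randn(1, 3, 16, 112, 112)
print(c3d(x).shape)  # torch.Size([1, 487])
```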
Training setup
- Dataset: Sports-1M.
- Five random 2-second clips from each video.
- Frames resized to \(128 \times 171\).
- Random cropping (both spatial and temporal) to \(16 \times 112 \times 112\).
- Horizontal flipping with 0.5 probability.
- SGD with mini batch of 30 clips.
- Initial learning rate: 0.003.
- Divided by 2 every 150K iterations.
- Optimization stopped after 1.9M iterations (13 epochs).
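A sketch of the corresponding optimizer and schedule in PyTorch, reusing `c3d` from the sketch above (the momentum value is an assumption, not stated in these notes):

```python
import torch

# SGD with the stated schedule: lr starts at 0.003 and is halved
# every 150K iterations; training stops around 1.9M iterations.
optimizer = torch.optim.SGD(c3d.parameters(), lr=0.003, momentum=0.9)  # momentum assumed
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=150_000, gamma=0.5)

# In the training loop, step the scheduler once per iteration:
#   loss.backward(); optimizer.step(); scheduler.step()
```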
Results
- 84.4% top-5 accuracy on Sports-1M.
C3D video descriptor
- Break the video into 16-frame clips with an overlap of 8 frames.
- Compute the activations of the final FC layer for each clip.
- Average these activations over all clips.
- L2-normalize the resulting vector.
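A sketch of this descriptor computation, assuming a model that returns the final FC-layer activations for a clip batch (the function and argument names are mine):

```python
import torch

def c3d_descriptor(video, model, clip_len=16, stride=8):
    """Average FC activations over overlapping clips, then L2-normalize.

    video: tensor of shape (channels, frames, height, width).
    model: assumed to map a clip batch to final-FC-layer activations.
    """
    feats = []
    for start in range(0, video.shape[1] - clip_len + 1, stride):
        clip = video[:, start:start + clip_len].unsqueeze(0)  # add batch dim
        with torch.no_grad():
            feats.append(model(clip).squeeze(0))
    desc = torch.stack(feats).mean(dim=0)   # average over clips
    return desc / desc.norm(p=2)            # L2 normalization
```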
What does C3D learn?
- Focuses on appearance in first few frames of a clip.
- Tracks the salient motion in remaining frames.
- Selectively attends to both motion and appearance.
Action recognition¶
Bibliography¶
[KTS+14] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1725–1732. 2014.
[TBF+15] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, 4489–4497. 2015.