
Recurrent Space-time Graph Neural Network

We introduce in this post our Recurrent Space-time Graph Neural Network (RSTG), an architecture designed for learning video representations and especially suited for tasks that rely heavily on interactions.

Let’s begin by considering the key components of video understanding that our method should include. Being able to detect and localise objects is the first crucial step; we should then combine them in various ways to form complex scenes. The whole is greater than the sum of its parts, but is that always true, and under what conditions does it happen?

In order to form a greater whole, objects must have some kind of connection between them. A set of random objects that do not correlate in any way does not bring much additional information. Entities can form different types of connections: they can be semantically related, or they can be related by their physical position in space and time. These kinds of relations appear in both images and videos, with the time dimension of video adding further difficulty, greatly increasing the number and complexity of interactions happening in the scene.

Let’s define the kinds of interactions that happen in video by examining a few examples.

Video examples of spatial and temporal interactions.

Above, the yellow car and the grass appear together during the whole video, but the event "the car crosses the line" is characterized by a specific interaction. We call these types of interactions, happening at the frame level, spatial interactions. On the other hand, the event "the yellow car overtakes the red car" cannot be captured from a single frame, as it only makes sense across time. We call these temporal interactions.

The entities that interact need not be close to each other, either in time or in space, since there can also exist long-range interactions. In the picture above, the man and the moon are connected regardless of their distance in the image.

Analysing videos with spatio-temporal graph models

Models commonly used in computer vision, based on convolutional networks, implicitly capture interactions in both the space and time dimensions, but they are biased towards local, short-range relations.

We propose a method designed to explicitly model relations, capable of capturing long-range connections. Our model fits into the broader category of graph neural networks (GNNs). We process the video by imposing a graph structure on it, in order to explicitly model interactions between different entities. GNNs usually send messages between each pair of nodes, thus easily modeling binary interactions; higher-order interactions can be achieved through multiple such message-passing iterations.
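To make the message-passing idea concrete, here is a minimal sketch of one iteration of pairwise message passing, in PyTorch. The tiny graph, the MLP message and update functions, and the sum aggregation are illustrative assumptions, not the exact form used in RSTG (which is described below).

```python
import torch
import torch.nn as nn

dim = 64
num_nodes = 4
nodes = torch.randn(num_nodes, dim)          # one feature vector per node
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]     # directed pairs that exchange messages

# A message is computed from the (sender, receiver) pair of node states.
message_fn = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
update_fn = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

# Send a message along every edge, then sum the messages arriving at each node.
incoming = torch.zeros(num_nodes, dim)
for src, dst in edges:
    incoming[dst] += message_fn(torch.cat([nodes[src], nodes[dst]]))

# Update every node from its current state and its aggregated message.
nodes = update_fn(torch.cat([nodes, incoming], dim=-1))
```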

The animation below illustrates the main components of our model.

RSTG architecture.

At each time step, we create nodes by extracting information from the features given by a convolutional network. Each node corresponds to a different fixed region of the input, and we connect two nodes if they come from neighbouring regions or if they overlap, as shown in the figure above. Using multiple scales helps us capture entities of different sizes and also connect distant regions more easily.
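As a sketch of this step, node features can be obtained by average-pooling a backbone feature map over grids at several scales, one node per cell. The channel count, the 14x14 map, and the 1x1/2x2/3x3 scales below are illustrative choices, not necessarily the exact configuration we use.

```python
import torch
import torch.nn.functional as F

# Feature map from a convolutional backbone: (channels, height, width).
feat_map = torch.randn(256, 14, 14)

# Pool the map over grids of different scales; each cell becomes one node.
scales = [1, 2, 3]   # illustrative: 1x1, 2x2 and 3x3 grids -> 1 + 4 + 9 nodes
nodes = []
for s in scales:
    pooled = F.adaptive_avg_pool2d(feat_map, (s, s))    # (C, s, s)
    nodes.append(pooled.flatten(1).t())                 # (s*s, C)
nodes = torch.cat(nodes, dim=0)                         # (14, 256) node features
# Nodes coming from neighbouring or overlapping cells would then be linked by edges.
```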

For the two types of interactions described above, we create two separate processing components. For each time step, we design a space processing stage that models spatial interactions by iterative message passing. This involves three steps: sending messages between each pair of connected nodes, aggregating the messages received at each node with an attention mechanism, and updating each node based on its current state and the aggregated message. We design a time processing stage to model temporal interactions by sending messages in time, only between the states of the same node, in a recurrent fashion. More specifically, at each time step, each node receives information only from its own state at the previous time step.
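A rough sketch of the two stages is given below, assuming dot-product attention for the aggregation and GRU cells for the updates; the exact gating and attention functions in RSTG may differ.

```python
import torch
import torch.nn as nn

dim, num_nodes = 64, 14
send = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())   # pairwise message
update = nn.GRUCell(dim, dim)                               # spatial node update
time_cell = nn.GRUCell(dim, dim)                            # temporal recurrence

def space_step(h, edges):
    """One space-processing iteration: send, attend, update."""
    agg = torch.zeros_like(h)
    for dst in range(num_nodes):
        srcs = [s for s, d in edges if d == dst]
        if not srcs:
            continue
        msgs = torch.stack([send(torch.cat([h[s], h[dst]])) for s in srcs])
        # Attention: weight each incoming message by its match with the receiver.
        att = torch.softmax(msgs @ h[dst], dim=0)
        agg[dst] = (att.unsqueeze(1) * msgs).sum(0)
    return update(agg, h)   # new state from aggregated message + current state

def time_step(h, h_prev):
    """Temporal stage: each node receives only its own previous state."""
    return time_cell(h_prev, h)
```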

To model more expressive spatio-temporal interactions and to give the model the ability to reason about all the information in the scene, with knowledge of past states, we alternate the two processing stages, as shown in the animation below.

Alternating space and time processing stages.

By having multiple space iterations, we go from local to more global processing, modeling interactions between an increasing number of entities situated at increasingly longer distances. We want to have some temporal information at every such step, in order to model interactions that take the history into account. In order to combine the same kind of features from different time steps, we only connect the k-th spatial stage with the k-th stage from the previous time step, as shown in the animation.
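Putting it together, the per-time-step schedule might look like the sketch below, which reuses the hypothetical `space_step` and `time_step` routines from the previous snippet; the choice `K = 3` and the random stand-in for node extraction are illustrative.

```python
import torch

K, T, dim, num_nodes = 3, 10, 64, 14
edges = [(i, j) for i in range(num_nodes) for j in range(num_nodes) if i != j]

prev = [None] * K                        # per-stage states from the previous step
for t in range(T):
    # Stand-in for grid-node extraction from the frame's conv features.
    h = torch.randn(num_nodes, dim)
    for k in range(K):
        h = space_step(h, edges)         # local -> increasingly global interactions
        if prev[k] is not None:
            h = time_step(h, prev[k])    # history from the same stage index only
        prev[k] = h
```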

We can pool all the nodes into a vector representation used for the final prediction, or we can project each node back onto its initial corresponding region, forming a feature volume with the same size as the graph's input, so that our model can be used as a module inside any other architecture.
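Both readouts are easy to sketch: average all nodes into one vector, or scatter each node's feature back over its grid region to rebuild a map of the input's spatial size. The nearest-neighbour broadcasting below is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

dim, H, W = 64, 14, 14
scales = [1, 2, 3]
nodes = torch.randn(sum(s * s for s in scales), dim)   # 14 nodes, as above

# Readout 1: pool all nodes into a single vector for the final prediction.
video_vec = nodes.mean(dim=0)                          # (dim,)

# Readout 2: project each node back onto its region, rebuilding a feature map
# of the same spatial size as the graph's input, summed across scales.
out_map = torch.zeros(dim, H, W)
idx = 0
for s in scales:
    grid = nodes[idx:idx + s * s].t().reshape(dim, s, s)
    # Broadcast each cell's feature over the region it was pooled from.
    out_map += F.interpolate(grid.unsqueeze(0), size=(H, W), mode="nearest")[0]
    idx += s * s
```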

Synthetic Dataset

Training on a large real-world dataset takes a lot of time and computational resources, and can also involve hidden biases that mask the true capabilities of a model. For example, because of an unbalanced dataset, the activity of skiing could be detected purely from the context of a snowy scene. Thus, we designed a synthetic dataset whose complexity comes from the need to explicitly model spatial and temporal interactions, but in a cleaner, simpler environment.

Sample SyncMNIST videos.

Our SyncMNIST dataset consists of videos where the goal is to detect the pair of digits that move synchronously among others that move randomly. On this dataset, we validate our key design choices by conducting ablation studies.
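As a toy illustration of how such data can be generated, the sketch below moves one pair of digits with a shared random displacement while the others move independently; all sizes, counts, and step ranges are made up for the example and are not the dataset's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
T, num_digits, canvas = 10, 5, 64           # illustrative sizes
pos = rng.integers(0, canvas - 8, size=(num_digits, 2)).astype(float)
sync_pair = (0, 1)                          # label: which pair moves together

frames_pos = []
for t in range(T):
    shared = rng.uniform(-2, 2, size=2)     # one displacement for the synced pair
    for d in range(num_digits):
        step = shared if d in sync_pair else rng.uniform(-2, 2, size=2)
        pos[d] = np.clip(pos[d] + step, 0, canvas - 8)
    frames_pos.append(pos.copy())           # digit positions per frame
```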

We show that it is important to have both processing stages, each with its own set of parameters, and to have multiple alternating temporal and spatial stages. We note that our model is further improved by incorporating positional embeddings in the node features. Our final model, which includes all these elements, surpasses strong models such as I3D and Non-Local networks.

Real-world experiments

Example from the Something-Something dataset.

To validate the capabilities of our model, we evaluate RSTG on a human-object interaction dataset, Something-Something-v1. This dataset contains fine-grained actions that cannot be distinguished solely from their context, where the interactions between entities across the entire video are essential. We compare against top models in the literature and obtain state-of-the-art results.

Results on Something-Something-v1.

Conclusion

We hope that graph-based methods will be more broadly adopted in visual domain tasks, especially those where interactions play a crucial role, and that our method brings more evidence that such models can be successfully applied to these tasks.

Our proposed RSTG model, seen as a spatio-temporal processing module, could be used for various other problems. Key aspects validated by our experiments, such as the creation of a graph structure from convolutional features or the coupled but factorised time and space processing, could be integrated into other models.

We have released the code of our models and the SyncMNIST dataset here.

More details about our work can be found in our paper:

Andrei Nicolicioiu, Iulia Duta, Marius Leordeanu, Recurrent Space-time Graph Neural Networks, in Advances in Neural Information Processing Systems (NeurIPS 2019).