## Learning Pedestrian Group Representations for Multi-modal Trajectory Prediction

#### Inhwan Bae, Jin-Hwi Park and Hae-Gon Jeon*

###### European Conference on Computer Vision 2022

### Abstract

Modeling the dynamics of people walking is a problem of long-standing interest in computer vision. Many previous works involving pedestrian trajectory prediction define a particular set of individual actions to implicitly model group actions. In this paper, we present a novel architecture named GP-Graph which has collective group representations for effective pedestrian trajectory prediction in crowded environments, and is compatible with all types of existing approaches. A key idea of GP-Graph is to model both individual-wise and group-wise relations as graph representations. To do this, GP-Graph first learns to assign each pedestrian into the most likely behavior group. Using this assignment information, GP-Graph then forms both intra- and inter-group interactions as graphs, accounting for human-human relations within a group and group-group relations, respectively. To be specific, for the intra-group interaction, we mask pedestrian graph edges out of an associated group. We also propose group pooling & unpooling operations to represent a group with multiple pedestrians as one graph node. Lastly, GP-Graph infers a probability map for socially-acceptable future trajectories from the integrated features of both group interactions. Moreover, we introduce a group-level latent vector sampling to ensure collective inferences over a set of possible future trajectories. Extensive experiments are conducted to validate the effectiveness of our architecture, which demonstrates consistent performance improvements with publicly available benchmarks.

### Learning the Pedestrian Grouping

##### Pedestrian Group Estimation

First, we estimate the group to which each pedestrian belongs using a Group Assignment Module. From the history trajectory of each pedestrian, we measure the feature similarity between all pedestrian pairs based on their L2 distance. With this pairwise distance, we pick out all pairs of pedestrians that are likely to be colleagues (i.e., affiliated with the same group). Next, we arrange the colleague members into their associated groups and assign each a group index.
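
The assignment step above can be sketched as follows: threshold pairwise L2 distances between trajectory features to find colleague pairs, then label connected components with group indices. This is a minimal sketch, not the learned module itself; the distance threshold `thresh` and the plain feature-space L2 metric are illustrative assumptions.

```python
import numpy as np

def assign_groups(traj_feats, thresh=1.0):
    """Assign a group index to each pedestrian.

    traj_feats: (N, D) history-trajectory features.
    thresh: hypothetical distance threshold for colleague pairs.
    """
    n = traj_feats.shape[0]
    # Pairwise L2 distances between all pedestrian features.
    diff = traj_feats[:, None, :] - traj_feats[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    adj = dist < thresh  # colleague pairs (diagonal included, harmless below)
    # Label connected components of the colleague graph with group indices.
    group = -np.ones(n, dtype=int)
    gid = 0
    for i in range(n):
        if group[i] >= 0:
            continue
        stack = [i]
        while stack:
            j = stack.pop()
            if group[j] >= 0:
                continue
            group[j] = gid
            stack.extend(np.nonzero(adj[j])[0].tolist())
        gid += 1
    return group
```

In the actual module the similarity is computed on learned features rather than raw coordinates, but the indexing logic is the same.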

##### Pedestrian Group Pooling & Unpooling

We group the pedestrian nodes so that the corresponding node features are aggregated into a single node. Here, the representative feature for each group node is obtained via average pooling. With these features, we can model group-wise graph structures, which have far fewer nodes than the input pedestrian graph. Next, we upscale the group-wise graph structures back to their original size using an unpooling operation, which enables each pedestrian trajectory to be forecast from the fused agent-wise output features. We duplicate each group feature and assign it to the nodes of all the relevant group members so that they share identical group behavior information.
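
The two operations are symmetric: pooling averages member features into one group node, and unpooling duplicates the group feature back to every member. A minimal sketch, assuming `group` holds contiguous group indices as produced by the assignment step:

```python
import numpy as np

def group_pool(feats, group):
    """Average-pool pedestrian features (N, D) into group features (G, D)."""
    n_groups = group.max() + 1
    return np.stack([feats[group == g].mean(axis=0)
                     for g in range(n_groups)])

def group_unpool(pooled, group):
    """Duplicate each group feature back to its member pedestrians (N, D)."""
    return pooled[group]
```

Note that `group_unpool(group_pool(feats, group), group)` gives every member of a group the same representative feature, which is exactly the shared group-behavior information described above.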

##### Pedestrian Group Hierarchy Graph

Using the estimated pedestrian grouping information, we reconstruct the initial social interaction graph in a form that is efficient for pedestrian trajectory prediction. Instead of the existing complex and complete pedestrian graph, intra- and inter-group interaction graphs capture the group-aware social relations.
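
A sketch of the two adjacency structures: the intra-group graph keeps pedestrian edges only within a group (edges to other groups are masked out), while the inter-group graph connects the pooled group nodes. Connecting all distinct group nodes to each other is an assumption made here for illustration.

```python
import numpy as np

def intra_inter_graphs(group):
    """Build the two group-hierarchy adjacency matrices.

    intra: (N, N) boolean, pedestrian edges masked to same-group pairs.
    inter: (G, G) boolean, edges between distinct group nodes
           (complete inter-group connectivity is assumed here).
    """
    n = len(group)
    same = group[:, None] == group[None, :]
    intra = same & ~np.eye(n, dtype=bool)        # no self-loops
    n_groups = group.max() + 1
    inter = ~np.eye(n_groups, dtype=bool)
    return intra, inter
```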

##### Straight-Through Group Estimator

A major hurdle when training the group assignment module, which is a sampling function, is that index information cannot be treated as learnable parameters. Accordingly, the group index cannot be trained using standard backpropagation. We tackle this problem by introducing a Straight-Through (ST) trick, inspired by biased path derivative estimators. Instead of making the discrete index set differentiable, we separate the forward pass and backward pass of the group assignment module during training. In the forward pass, we perform group pooling over the pedestrian features from the input trajectory using the group index from the estimated group assignment. For the backward pass, we propose group-wise continuously relaxed features to approximate the group indexing process.
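
The classic form of this trick composes a hard (discrete) assignment with a soft (continuous) relaxation so that the forward value is hard while gradients flow through the soft term. A minimal sketch, assuming hypothetical pedestrian-to-group scores `logits`; in PyTorch the last line would read `hard + soft - soft.detach()`:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def stop_gradient(x):
    # No-op in NumPy; in an autograd framework this blocks gradients
    # (e.g. x.detach() in PyTorch).
    return x

def st_assignment(logits):
    """Straight-through assignment matrix (N, G).

    Its value equals the hard one-hot argmax assignment, while the
    `soft - stop_gradient(soft)` term (zero in value) routes gradients
    through the continuous softmax relaxation during backprop.
    """
    soft = softmax(logits)
    hard = np.eye(logits.shape[1])[logits.argmax(axis=1)]
    return hard + soft - stop_gradient(soft)
```

Pooling with this matrix then gives hard group averages in the forward pass but differentiable, relaxed group features in the backward pass.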

##### GP-Graph Architecture

We incorporate the social interactions as a form of group hierarchy into well-designed existing trajectory prediction baseline models. Meaningful features can be extracted by feeding the different types of graph-structured data into the same baseline model. Here, the baseline models share their weights to reduce the number of parameters while enriching the augmentation effect. Afterward, the output features from the baseline models are aggregated agent-wise, and are then used to predict the probability map of future trajectories using our group integration module.
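
The overall flow can be sketched as a single weight-shared `baseline` applied to both the pedestrian-level features and the pooled group-level features, with the group output unpooled back to agents before fusion. This is a hypothetical sketch: the `baseline` callable stands in for any existing prediction backbone, and summation is assumed here as the agent-wise fusion.

```python
import numpy as np

def gp_graph_forward(ped_feats, group, baseline):
    """Sketch of the GP-Graph forward pass.

    ped_feats: (N, D) pedestrian features.
    group:     (N,) group index per pedestrian.
    baseline:  weight-shared backbone applied to both graph levels
               (any (M, D) -> (M, D) callable for this sketch).
    """
    # Pool pedestrians into group nodes (average pooling).
    pooled = np.stack([ped_feats[group == g].mean(axis=0)
                       for g in range(group.max() + 1)])
    # Same weight-shared baseline on both graph levels.
    ped_out = baseline(ped_feats)
    grp_out = baseline(pooled)[group]  # unpool back to agents
    # Agent-wise aggregation (summation assumed for illustration).
    return ped_out + grp_out
```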

##### Group-level Latent Vector Sampling

To infer the multi-modal future paths of pedestrians, an additional random latent vector is introduced with an input observation path. There are two ways to adopt this latent vector in trajectory generation: (1) Scene-level sampling, where everyone in the scene shares one latent vector, unifying the behavior patterns of all pedestrians in a scene; (2) Pedestrian-level sampling, which allocates a different latent vector to each pedestrian but forces pedestrians into different patterns, so the group behavior property is lost. We propose group-level latent vector sampling as a compromise between the two. We use the group information estimated by GP-Graph to share a latent vector within each group. If two people are not associated with the same group, independent random noise is assigned as their latent vectors. In this way, it is possible to sample a multi-modal trajectory that is independent of other groups while following the associated group's behavior.
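
Concretely, group-level sampling draws one latent vector per group and broadcasts it to every member, so colleagues share a mode while different groups stay independent. A minimal sketch, assuming standard-normal noise:

```python
import numpy as np

def group_latent(group, dim, rng):
    """Sample one latent vector per group and share it among members.

    group: (N,) group index per pedestrian.
    dim:   latent dimensionality.
    rng:   numpy Generator supplying the noise.
    Returns an (N, dim) array where same-group rows are identical.
    """
    z_groups = rng.standard_normal((group.max() + 1, dim))
    return z_groups[group]
```

Scene-level and pedestrian-level sampling are the two degenerate cases: a single group covering everyone, or one group per pedestrian.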

### BibTeX

```bibtex
@inproceedings{bae2022gpgraph,
  title={Learning Pedestrian Group Representations for Multi-modal Trajectory Prediction},
  author={Bae, Inhwan and Park, Jin-Hwi and Jeon, Hae-Gon},
  booktitle={Proceedings of the European Conference on Computer Vision},
  year={2022}
}
```