Modeling Pedestrian Dynamics – When Physics Meets Data

Mar 18, 2025 · 7 min read

Understanding Pedestrian Motion: From Empirical Curves to Learned Behaviors

Modeling how people move through space is an old problem with new relevance. Urban planners, architects, and safety engineers all rely on accurate pedestrian flow models – whether for simulating evacuation scenarios, optimizing corridor widths, or designing public transport hubs.

For decades, the go-to approach has been the Fundamental Diagram (FD), which captures the relationship between local density and average speed [2]. The intuition is straightforward: the more crowded it gets, the slower people move. Yet, like many elegant simplifications, this one hides complexity. Two areas with the same average density can exhibit completely different movement patterns depending on how pedestrians are arranged and how geometry constrains flow.

In a ring corridor, for example, people walk smoothly with relatively uniform spacing. In a bottleneck, even with the same density, crowd dynamics become irregular – bursts of motion alternate with halts as people funnel through narrow exits. The FD, being a single-valued function of density, cannot represent these contextual variations.

To overcome this limitation, Tordeux et al. (2019) proposed a neural network approach that directly learns how local geometry and neighbor configuration influence walking speed [1]. Instead of hand-crafting formulas, the model is trained from empirical trajectory data, effectively learning the “micro-physics” of human movement from observation.


The Data Behind the Model

Controlled Experiments: Rings and Bottlenecks

The study used controlled pedestrian experiments with two geometries – a ring and a bottleneck – which together represent two ends of the crowd-flow spectrum [1].

  • Ring experiment: Participants walk in a circular corridor, maintaining continuous flow. Densities range roughly between 0.25 and 2 ped/m².
  • Bottleneck experiment: The same participants pass through a narrowing corridor with variable exit widths (0.7–1.8 m), generating intermittent stop-and-go dynamics.
Fig. 1: Ring and bottleneck experimental layouts

These setups create complementary datasets: the ring is spatially homogeneous and temporally steady, while the bottleneck is spatially heterogeneous and temporally bursty. A model that generalizes across both must therefore capture deeper relationships than a one-dimensional FD curve.

Fig. 2: Visualization of the first 30 trajectories of the bottleneck scenario

Encoding the Crowd: Features and Representations

At the heart of both classical and learned models lies a measure of local crowding. The fundamental variable is the mean spacing to the \(K\) nearest neighbors:

\[ \bar{s}_K = \frac{1}{K} \sum_{i=1}^{K} \sqrt{(x - x_i)^2 + (y - y_i)^2}. \tag{1} \]

This scalar describes how much “free space” surrounds a pedestrian. In the FD model, it is the only input. However, the neural network augments this with relative neighbor coordinates, preserving spatial arrangement:

\[ \mathbf{f} = [\bar{s}_K, (x_1 - x, y_1 - y), \dots, (x_K - x, y_K - y)]. \tag{2} \]

Including the neighbor positions gives the model a sense of directional crowding – whether others are ahead, behind, or beside the focal pedestrian. This matters because humans tend to react more strongly to obstacles in front than behind.

In practice, \(K = 10\) neighbors worked well: enough to describe local geometry without diluting temporal resolution.
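Equations (1) and (2) can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the paper; the function name and array layout are assumptions:

```python
import numpy as np

def crowd_features(pos, others, K=10):
    """Build the feature vector of Eq. (2) for one pedestrian.

    pos    -- (2,) array: position of the focal pedestrian
    others -- (M, 2) array: positions of the other pedestrians (M >= K)
    K      -- number of nearest neighbours to keep
    """
    rel = others - pos                  # relative coordinates (x_i - x, y_i - y)
    dist = np.linalg.norm(rel, axis=1)  # Euclidean distances to all others
    nearest = np.argsort(dist)[:K]      # indices of the K closest neighbours
    s_bar = dist[nearest].mean()        # mean spacing, Eq. (1)
    # Feature vector: [s_bar, (x_1 - x, y_1 - y), ..., (x_K - x, y_K - y)]
    return np.concatenate([[s_bar], rel[nearest].ravel()])
```

With \(K = 10\) the vector has \(1 + 2K = 21\) components: one scalar spacing plus ten relative-position pairs.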

Fig. 3: Observations of the pedestrian speeds as a function of the mean spacing with the 10 closest neighbours for the ring and bottleneck experiments

The Baseline: Revisiting the Fundamental Diagram

The Fundamental Diagram models pedestrian speed as a deterministic function of mean spacing, often using an exponential form first proposed by Weidmann (1994) [2]:

\[ v = v_0 \left( 1 - \exp\!\left(\frac{\ell - \bar{s}_K}{v_0 T}\right) \right), \tag{3} \]

where

  • \(v_0\) is the free-flow speed,
  • \(\ell\) is the effective body size, and
  • \(T\) is a relaxation time (roughly the headway people maintain).

This model is compact, interpretable, and fits data reasonably well for simple, unidirectional flows. Yet, it assumes that all local environments with the same \(\bar{s}_K\) behave identically – an assumption that breaks down in complex geometries.
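Equation (3) is a one-liner in code. A minimal sketch, with parameter values that are illustrative defaults rather than the paper's fitted ones:

```python
import numpy as np

def fd_speed(s_bar, v0=1.3, ell=0.3, T=1.0):
    """Weidmann-style Fundamental Diagram, Eq. (3).

    s_bar -- mean spacing to the K nearest neighbours [m]
    v0    -- free-flow speed [m/s]
    ell   -- effective body size [m]
    T     -- relaxation time / headway [s]
    """
    v = v0 * (1.0 - np.exp((ell - s_bar) / (v0 * T)))
    return np.maximum(v, 0.0)  # clip: speed cannot go negative below s_bar = ell
```

Note the limiting behavior: speed vanishes when the spacing shrinks to the body size \(\ell\), and saturates at \(v_0\) as the spacing grows large.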

The Neural Alternative: Learning Context from Geometry

The neural network replaces the fixed exponential law with a data-driven regression function:

\[ v = NN(\mathbf{f};\ \theta), \tag{4} \]

where \(\mathbf{f}\) are the local geometric features and \(\theta\) denotes the learned weights. The model architecture is intentionally modest – a fully connected feed-forward network with one hidden layer of three neurons and a single scalar output for speed.

At first glance, three neurons seem laughably few. But with the structured input described earlier, this small network already captures significant nonlinearity. Increasing depth or width did not substantially improve accuracy, and even risked overfitting – a clear sign that the key lies in feature quality, not model size.

The model is trained using the mean squared error (MSE) loss:

\[ \text{MSE} = \frac{1}{N} \sum_{n=1}^N (v_n - \hat{v}_n)^2, \tag{5} \]

optimized with Adam at a learning rate of \(5 \times 10^{-4}\), and early stopping based on validation loss. Input and output features were normalized to zero mean and unit variance before training to stabilize gradients.
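The whole pipeline – a 21-input, three-neuron-hidden-layer network trained on normalized data with the MSE loss of Eq. (5) – fits in a short NumPy script. This sketch simplifies in two labeled ways: it uses plain full-batch gradient descent instead of Adam, omits early stopping, and trains on synthetic stand-in data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the experimental data: 21 features
# (Eq. 2 with K = 10 neighbours) mapped to a scalar speed.
X = rng.normal(size=(200, 21))
y = rng.normal(size=(200, 1))

# Normalize inputs and targets to zero mean / unit variance.
X = (X - X.mean(axis=0)) / X.std(axis=0)
y_mu, y_sd = y.mean(), y.std()
y = (y - y_mu) / y_sd

# One hidden layer with three tanh neurons, single linear output.
W1 = rng.normal(scale=0.5, size=(21, 3)); b1 = np.zeros(3)
W2 = rng.normal(scale=0.5, size=(3, 1));  b2 = np.zeros(1)

lr = 5e-4
history = []
for step in range(2000):
    h = np.tanh(X @ W1 + b1)           # hidden activations
    v_hat = h @ W2 + b2                # predicted (normalized) speed
    err = v_hat - y
    history.append((err ** 2).mean())  # MSE, Eq. (5)
    # Backpropagate the MSE loss through the two layers.
    g_out = 2.0 * err / len(X)
    gW2 = h.T @ g_out; gb2 = g_out.sum(axis=0)
    g_h = (g_out @ W2.T) * (1.0 - h ** 2)
    gW1 = X.T @ g_h; gb1 = g_h.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2
```

At prediction time, outputs are mapped back to physical units via `v_hat * y_sd + y_mu`.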


Comparative Results

Training, Testing, and the Subtleties of Bootstrapping

A 50/50 train–test split was used, and to estimate model uncertainty, bootstrap resampling was applied [5]. Each bootstrap sample was used to train a separate network, and the ensemble of results provided both mean performance and confidence intervals.

Bootstrapping is particularly valuable when datasets are small or contain correlated samples – both common in experimental crowd studies. However, as observed in reproductions, the final “optimal architecture” can depend on the number of bootstrap iterations used: fewer iterations produce more variance in the MSE estimate.

This sensitivity underscores an important practical lesson – statistical rigor matters as much as architectural design when evaluating data-driven models.

Fig. 4: Learning curves of different network architectures on different scenarios

FD vs. NN on Controlled Data

When trained and tested within the same geometry, both FD and NN models performed reasonably well. But the FD struggled with cross-geometry generalization (e.g., trained on ring, tested on bottleneck). The NN, in contrast, captured subtle context effects because it encoded relative neighbor positions.

Training on a combined dataset from both geometries improved generalization further. This suggests that geometric diversity during training is essential for building robust crowd models – a concept mirroring domain adaptation trends seen in computer vision.

Fig. 5: Comparison of test losses on models with different train/test combinations

Comparison with the Social Force Model

For context, results were also compared with the Social Force Model (SFM) [3], a physics-based simulation framework where pedestrians are treated as self-driven particles subject to forces:

\[ \frac{d\mathbf v_i}{dt} = \frac{v_i^0 \hat{\mathbf e}_i - \mathbf v_i}{\tau} + \sum_{j\ne i} A \exp\!\left(\frac{r_i + r_j - d_{ij}}{B}\right) \hat{\mathbf d}_{ij}. \tag{6} \]

Despite its interpretability, the SFM achieved a higher error (\(\mathrm{MSE} \approx 0.32\)) than both the FD (\(\approx 0.13\)) and the NN (\(\approx 0.11\)). Its handcrafted interaction terms struggle to capture the anisotropic, adaptive behaviors humans exhibit in crowds. Neural models, although less transparent, adapt to such irregularities automatically.
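For concreteness, here is the right-hand side of Eq. (6) for a single pedestrian. The parameter values are illustrative defaults, not fitted constants from the study:

```python
import numpy as np

def sfm_accel(pos, vel, goal_dir, others, v0=1.3, tau=0.5, A=2.0, B=0.3, r=0.3):
    """Social Force Model acceleration for one pedestrian, Eq. (6).

    pos, vel -- (2,) position and velocity of the focal pedestrian
    goal_dir -- (2,) unit vector toward the pedestrian's goal
    others   -- (M, 2) positions of the other pedestrians
    """
    # Driving term: relax toward the desired velocity v0 * e_hat within tau.
    acc = (v0 * goal_dir - vel) / tau
    for p in others:
        d_vec = pos - p                 # points away from neighbour j
        d = np.linalg.norm(d_vec)
        # Repulsive term: grows exponentially as the body discs (radius r) overlap.
        acc += A * np.exp((2 * r - d) / B) * d_vec / d
    return acc
```

A stationary pedestrian with a clear path accelerates at \(v_0/\tau\) toward the goal; a neighbor directly ahead cuts that acceleration, exactly the funneling behavior the bottleneck experiments exhibit.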


Final Thoughts

Broader Insights and Lessons Learned

Several practical takeaways emerge from this modeling exercise:

  1. Feature design matters more than model complexity. The inclusion of neighbor geometry provided the biggest accuracy gains.
  2. Preprocessing precision is crucial. Minor inconsistencies in position tracking or coordinate alignment significantly affect predicted speeds.
  3. Generalization requires geometric diversity. Models trained on mixed datasets adapt better to unseen conditions, aligning with findings from domain adaptation literature [4].
  4. Statistical evaluation is nontrivial. Bootstrap variance can easily mask or exaggerate differences between models if not done carefully.
  5. Hybrid approaches are promising. Combining interpretable physical priors (like the FD) with flexible neural corrections could offer the best of both worlds.

Where This Is Heading

The marriage of physics and learning offers a path forward: data-driven models grounded in interpretable structure.

Possible directions include:

  • Graph-based models that explicitly encode neighbor relationships instead of flattening them into vectors – leveraging relational inductive biases similar to social-force terms.
  • Physics-informed neural networks (PINNs) where the FD equation acts as a soft constraint, guiding the NN toward physically meaningful predictions.
  • Spatiotemporal sequence models, e.g. transformers or recurrent networks, capturing how motion evolves rather than only instantaneous snapshots.
  • Domain adaptation frameworks that adjust models trained on controlled experiments for real-world crowd scenes [4].

As sensing technology improves and datasets grow richer, such hybrid systems may eventually form the backbone of predictive crowd analytics – from smart urban design to safety-critical simulation.


References

[1] A. Tordeux, M. Chraibi, A. Seyfried, and A. Schadschneider, “Prediction of Pedestrian Speed with Artificial Neural Networks,” Traffic and Granular Flow ’17, Springer, 2019.

[2] U. Weidmann, Transporttechnik der Fußgänger, IVT ETH Zürich, 1994.

[3] D. Helbing and P. Molnár, “Social Force Model for Pedestrian Dynamics,” Physical Review E, vol. 51, no. 5, pp. 4282–4286, 1995.

[4] J. Amirian et al., OpenTraj: Human Trajectory Prediction Benchmark, 2020.

[5] R. Kohavi, “A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection,” IJCAI, 1995.