Automated Match Statistics from Video Using Computer Vision

Oct 27, 2020 · 8 min read

Introduction: From autonomous driving to automated sports analytics

Imagine a system that watches a table-tennis match, locates the ball in each frame, reconstructs the 3D trajectory, and outputs match statistics such as ball speed, bounce locations, and height over the net. The idea is inspired by sensor-rich autonomous vehicles and stadium-grade systems like Hawk-Eye: both rely on multi-view cameras, careful synchronization, and robust detection to convert pixels into actionable, three-dimensional information [1][2].

This project investigates whether a relatively lightweight, camera-only pipeline – implemented in Python with OpenCV and a few well-chosen heuristics – can produce reliable per-rally statistics for recreational table tennis. The focus is on three measurable quantities: ball speed, impact position on the table, and ball height above the net. The pipeline combines a color-based detector tuned to the orange table-tennis ball, dual-camera geometry for 3D localization, simple trajectory completion by parabola fitting, and post-processing steps that map perspective image coordinates into a Cartesian court frame.

Fig. 1: Processing an original frame – color channel differencing isolates the orange ball for subsequent circular detection.

Background

Why color-based detection for table tennis?

Two mainstream approaches for object localization in OpenCV are the Viola–Jones cascade detector and color-space based segmentation. Viola–Jones performs well for objects with distinctive shape and texture but requires a trained classifier and struggles with small, fast objects that lack rich local features [3]. The orange table-tennis ball, however, is best characterized by its color rather than by strong contours or texture; thus, a color-space approach that constructs per-channel differences yields a simpler, faster, and more robust detector for this use case.

Software and statistics foundations

The implementation uses Python for its readability and ecosystem, with OpenCV for image processing and Matplotlib / xlwt for visualization and tabular export [4]. Statistical analysis relies on standard descriptive measures: mean, median, 0.25 and 0.75 percentiles, interquartile range (IQR), and box-plot based outlier rules. These statistics are later used to compare winner and loser tendencies over multiple rallies.
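These descriptive measures can be collected in a small helper. The sketch below uses only the Python standard library; the function name and return layout are illustrative, not taken from the project code.

```python
import statistics

def describe(values):
    """Descriptive statistics used for winner/loser comparison:
    mean, median, quartiles, IQR, and box-plot outliers."""
    v = sorted(values)
    # 'inclusive' matches the common linear-interpolation quartile definition
    q1, med, q3 = statistics.quantiles(v, n=4, method="inclusive")
    iqr = q3 - q1
    # Standard box-plot rule: points beyond 1.5 * IQR from the quartiles
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in v if x < lower or x > upper]
    return {
        "mean": statistics.fmean(v),
        "median": med,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "outliers": outliers,
    }

# Made-up sample of stroke speeds in m/s
speeds = [7.8, 7.9, 8.0, 8.1, 8.4, 12.5]
stats = describe(speeds)
```

With the sample above, the 12.5 m/s smash falls outside the 1.5·IQR whiskers and would be flagged as an outlier in the box plot.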


Core pipeline

The pipeline consists of (1) framewise detection, (2) camera arrangement and synchronization, (3) 3D reconstruction from two views, (4) gap filling using physical priors, and (5) statistics computation and export.

Detection by RGB differencing

The detector operates on single frames. Each frame is split into R, G, B channels and combined with simple differences to emphasize orange pixels:

  • high \(R\) relative to \(B\): \(D_{rb} = R - B\)
  • moderate \(G\) relative to \(B\): \(D_{gb} = G - B\)

A combined saliency map \(S\) is formed by summing the normalized differences:

\[ S = \operatorname{norm}(D_{rb}) + \operatorname{norm}(D_{gb}). \]

Thresholding \(S\) yields candidate regions; circular Hough (or contour-based circle fitting) rejects spurious detections and returns \((x,y)\) image coordinates for the ball when a minimal radius condition is met.

This channel-differencing approach suppresses red objects (e.g., paddles, skin) because those lack the green component typical for orange, while robustly isolating the ball (Fig. 1).

Camera arrangement and geometry

Two cameras are required for 3D localization. The minimal configuration used here places one camera in a lateral view aligned with the net (captures \(x\) and \(z\)) and a second camera in an overhead (bird’s eye) position centered above the net (captures \(y\)) (Fig. 2). Each camera should deliver >100 fps and at least 720p resolution so the ball spans multiple pixels even at table corners.

Fig. 2: Camera positioning – lateral camera provides x/z information, overhead camera supplies y (depth) information.

Frame synchronisation

To merge coordinates from both cameras, frames must be temporally aligned. A simple manual (or semi-automatic) protocol was used: both recordings are advanced to a common reveal event (ball becomes visible to both cameras) and playback is paused until both reach that frame. Assuming identical frame rates, this yields a shared frame index mapping between the two videos.
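Once the reveal event's frame index is known in each recording, the alignment reduces to a constant index offset. A minimal sketch (helper name is hypothetical):

```python
def synced_pairs(reveal_a, reveal_b, n_frames):
    """Yield aligned (frame_a, frame_b) index pairs, given the frame index
    of the common reveal event in each video. Assumes identical frame rates,
    as in the manual protocol described above."""
    for k in range(n_frames):
        yield reveal_a + k, reveal_b + k

pairs = list(synced_pairs(120, 85, 3))
```

A hardware trigger or timestamped capture would replace this constant-offset assumption in a more robust setup.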

From image coordinates to a Cartesian court frame

Perspective distortions make naive pixel distances inconsistent with real-world lengths. For the lateral camera, perspective causes nearer objects to appear larger and to move more per image unit than distant ones. A corrective mapping estimates Cartesian coordinates \((x_k, y_k, z_k)\) from perspective image coordinates \((x_p, y_p, z_p)\) using the image center as a fixed alignment line and a view angle parameter \(\alpha\). The mapping for the corrected \(x\)-coordinate can be expressed as:

\[ x_k = x_p + \frac{\alpha \, y \,(2 x_p - x_{\max})}{90 \, x_{\max}}, \]

and analogously for \(z_k\) and \(y_k\) with appropriate substitutions. Note that the correction vanishes on the fixed alignment line \(x_p = x_{\max}/2\) and grows toward the image edges. These formulas compress the perspective exaggeration and map values back into the domain of image coordinates; a single global scale factor (derived from a reference length in pixels and meters) then converts coordinates into metric units.
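The \(x\)-correction is a one-liner in code; the direct port below treats \(\alpha\) as degrees and \(y\) as the (normalized) depth coordinate, with the function name being an illustrative choice:

```python
def perspective_to_cartesian_x(x_p, y, x_max, alpha):
    """Corrected x-coordinate per the mapping in the text.

    x_p   : perspective x-coordinate in pixels
    y     : depth coordinate (normalized) from the overhead camera
    x_max : image width in pixels
    alpha : camera view angle in degrees
    """
    return x_p + alpha * y * (2 * x_p - x_max) / (90 * x_max)

# At the image center the correction term is zero: the point stays put
center = perspective_to_cartesian_x(500, 0.5, 1000, 30)
# Off-center points are shifted outward, proportional to depth
edge = perspective_to_cartesian_x(600, 1.0, 1000, 45)
```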

Gap filling via quadratic trajectory fitting

Ball occlusions (e.g., by the net) create short missing intervals. The ballistic nature of a struck ball justifies a parabolic model in the vertical plane. Given three known points \(P_1(x_1,y_1)\), \(P_2(x_2,y_2)\), \(P_3(x_3,y_3)\) the parabola \(f(x)=ax^2+bx+c\) is solved by forming the linear system:

\[ \begin{aligned} a x_1^2 + b x_1 + c &= y_1, \\ a x_2^2 + b x_2 + c &= y_2, \\ a x_3^2 + b x_3 + c &= y_3. \end{aligned} \]

Eliminating \(c\) and \(b\) leads to a closed form for \(a\); the implementation uses numerically stable algebraic rearrangements to compute \(a,b,c\) and then interpolates missing \(y\)-values for intermediate \(x\)-samples. Horizontal (in-plane) gaps are linearly interpolated using point pairs immediately before and after the occlusion.
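The elimination yields a compact closed form; one standard rearrangement (equivalent to Newton's divided differences, shown here as an illustrative sketch rather than the project's exact code) is:

```python
def fit_parabola(p1, p2, p3):
    """Coefficients (a, b, c) of f(x) = a x^2 + b x + c through three points.

    c and b are eliminated via the two secant slopes, giving a directly.
    """
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    # Second divided difference = leading coefficient a
    a = ((y3 - y1) / (x3 - x1) - (y2 - y1) / (x2 - x1)) / (x3 - x2)
    # Back-substitute for b and c
    b = (y2 - y1) / (x2 - x1) - a * (x1 + x2)
    c = y1 - a * x1**2 - b * x1
    return a, b, c

# Points sampled from y = 2x^2 + 3x + 1
a, b, c = fit_parabola((0.0, 1.0), (1.0, 6.0), (2.0, 15.0))
# Fill a missing sample at x = 0.5
y_missing = a * 0.5**2 + b * 0.5 + c
```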


Output: statistics and visualization

Three independent tables are produced and exported to Excel: (1) per-rally speeds (in m/s), (2) impact positions mapped onto a 3×6 grid of the table, and (3) ball heights over the net (in cm). Metric conversion uses a reference segment of known physical length \(\Delta s_2\) and its pixel length \(\Delta x_2\); any measured pixel displacement \(\Delta x_1\) maps to meters via

\[ \Delta s_1 = \frac{\Delta s_2}{\Delta x_2}\cdot \Delta x_1. \]
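Combining this scale factor with the frame rate gives per-stroke speeds directly. A sketch (the reference values below assume a table length of 2.74 m as the known segment; the function name is illustrative):

```python
def stroke_speed(dx_pixels, n_frames, fps, ref_len_m, ref_len_px):
    """Speed in m/s from a pixel displacement observed over n_frames.

    Applies the scale factor ref_len_m / ref_len_px from the formula above,
    then divides by the elapsed time n_frames / fps.
    """
    meters = ref_len_m / ref_len_px * dx_pixels
    seconds = n_frames / fps
    return meters / seconds

# 400 px of travel over 10 frames at 100 fps, with the 2.74 m table
# spanning 548 px in the lateral view
v = stroke_speed(400, 10, 100, 2.74, 548)
```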

Rally segmentation is performed by locating extrema in \(x\) (or \(z\)) coordinates: consecutive minima/maxima delimit individual strokes. Short trajectories that do not cross the net are discarded.
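Locating those extrema amounts to finding sign changes in the first difference of the coordinate series. A minimal sketch without smoothing (real data would need a small noise tolerance):

```python
def stroke_boundaries(xs):
    """Indices of local extrema in the x-coordinate series.

    Consecutive extrema delimit individual strokes; a sign change in the
    difference marks a direction reversal of the ball.
    """
    idx = []
    for i in range(1, len(xs) - 1):
        if (xs[i] - xs[i - 1]) * (xs[i + 1] - xs[i]) < 0:
            idx.append(i)
    return idx

# Ball travels right, reverses, travels left, reverses again
xs = [0, 1, 2, 3, 2, 1, 0, 1, 2]
boundaries = stroke_boundaries(xs)
```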

A 3D matplotlib plot reconstructs trajectories for visual inspection; missing-frame positions are filled with the last known coordinate to preserve timing in the visual playback.


Results & validation

The dataset included three recorded sets with a total of 382 ball trajectories. Manual annotations provided ground truth for validation.

  • Detection robustness: color differencing plus circularity tests removed nearly all false positives; 100% of obviously erroneous trajectories (e.g., missed net) were discarded by the automated filter.
  • Speed measurement: automatic speeds matched manual calculations within ±0.1 m/s for 99.9% of strokes.
  • Height over net: parabola-based height estimates matched manual results within ±0.1 cm for 99.9% of strokes.
  • Impact location: grid cell assignments were error-free in the validation subset.

Box-plot comparisons between winner and loser statistics (mean, median, IQR) revealed subtle tendencies. Example: in set 2 the eventual winner played on average ~1.52 cm lower over the net than the opponent, and box plots indicated the winner more frequently targeted deeper table regions while maintaining marginally higher median speeds (see Box-Plot 4 in the appendix).


Fig. 3: Perspective vs. Cartesian coordinates – correction reduces apparent speed differences caused by perspective foreshortening.

Final thoughts

Limitations & future work

Several practical limitations emerged:

  • Frame rate & resolution constraints: fast smashes demand >100 fps for reliable per-stroke sampling; low resolution limits sub-pixel accuracy at table corners.
  • Synchronization overhead: the manual / semi-automatic reveal method is brittle; hardware or software time-stamping would scale better.
  • Simplified geometry corrections: the perspective → Cartesian mapping is approximate and leaves small residual errors because the y and z corrections are mutually dependent. A full camera calibration (homography + stereo triangulation) would reduce these residuals.
  • Occlusion edge cases: complex occlusions (multiple players, umpire, rackets crossing camera lines) occasionally yield detection failures; combining color detection with a lightweight tracker (e.g., SORT/DeepSORT) would improve continuity.
  • Generality of color detector: orange ball variants, lighting changes, and strong motion blur can reduce detection confidence; adding adaptive color thresholds or a small learned detector would improve robustness.

Future directions include automated multi-camera time-synchronisation, calibration-based metric reconstruction, and extending the analytics to rally-level tactical metrics (e.g., target heatmaps, winner/forced error classification).

Conclusion

The project demonstrates that an accessible, Python-based pipeline combining color-space detection, simple multi-view geometry, and physics-informed interpolation can deliver precise, usable table-tennis statistics at recreational levels. The approach is lightweight, interpretable, and – given modest improvements in calibration and synchronization – readily extensible toward more advanced coaching or referee-support applications similar in spirit to stadium systems, but at a fraction of the cost.

Fig. 4: Example box-plot comparing bounce regions and speeds for set 2 – winner tends to hit deeper and slightly faster.

List of Abbreviations

Abbreviation – Meaning
HSV / RGB – Color spaces (Hue-Saturation-Value / Red-Green-Blue)
IQR – Interquartile Range
fps – Frames Per Second
Hawk-Eye – Commercial multi-camera ball-tracking system
ROI – Region of Interest
Hough – Hough circle transform

References

[1] Example: Hawk-Eye system descriptions and overviews; see public materials on ball-tracking solutions (accessed 2025).

[2] Autonomous driving sensor overview – examples of multi-sensor fusion and camera primacy in perception (textbooks and surveys, 2018–2022).

[3] P. Viola and M. Jones, “Rapid Object Detection using a Boosted Cascade of Simple Features,” 2001.

[4] A. Kaehler and G. Bradski, Learning OpenCV 3: Computer Vision in C++ with the OpenCV Library, O’Reilly, 2017.