High-resolution Piano Transcription with Pedals
Based on: Kong, Q., Li, B., Song, X., Wan, Y., & Wang, Y. (2021). High-resolution Piano Transcription with Pedals by Regressing Onset and Offset Times. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 3707–3717.
Introduction: Music Transcription
Think of listening to a piano recording and wanting to turn it into written sheet music. If you’re a skilled musician, you might transcribe it by ear – but most of us would rather rely on an algorithm. Automatic music transcription (AMT) aims to do exactly that: convert an audio signal into a symbolic representation of notes. To succeed, a model must not only detect which notes are played but also determine exactly when they start and stop – down to the millisecond.
Traditional frame-wise systems divide the audio into short, fixed windows (frames) and predict, for each frame, which pitches are active [2]. This works reasonably well but has a built-in limitation: temporal precision is tied to the frame hop size. For piano music – with its sharp attacks, rapid ornaments, and expressive pedaling – those frame boundaries can blur fine timing details, introducing subtle but perceptible quantization errors.

Kong et al. (2021) propose a more elegant solution. Instead of treating onsets and offsets as discrete events bound to frame centers, they regress continuous onset and offset times directly. This regression-based design yields truly high-resolution transcriptions: onsets, offsets, velocities, and pedal activations with sub-frame precision. The approach builds upon the successful Onsets and Frames model (Hawthorne et al., 2018) but replaces hard frame-based classifications with smooth, continuous regression targets – a seemingly small change that results in notably cleaner and temporally sharper transcriptions.
The task & why it’s non-trivial
At its core, the AMT task for piano means mapping an audio recording to symbolic events:
- Onsets: when a key starts sounding
- Offsets: when the note ends (damped or released)
- Velocity: how hard the key was struck (mapped here to MIDI 0–127)
- Pedal events: when the sustain pedal is pressed and released (on/off), prolonging note resonance
Why is this challenging?
- Piano sounds are highly polyphonic and spectrally overlapping.
- Pedal introduces long resonances that make offset detection tricky.
- Frame-based outputs lose timing resolution and blur attack/decay demarcations.
The core idea
Instead of casting onsets/offsets as binary targets per frame, the authors define regression targets that encode the time distance from the center of a frame to the true onset/offset time. Practically:
- For each pitch and each frame, compute a scalar target that depends on the distance between the frame center and the precise event time.
- A hyperparameter J controls the shape (sharpness vs. smoothness) of that target: larger J → smoother target, smaller J → sharper peak.
Training the network to predict these continuous maps lets the model learn sub-frame timing – and at inference a local maximum plus interpolation over adjacent frames yields the precise time estimate.
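As a rough sketch of this idea: the snippet below builds a triangular target whose value is 1 at the true event time and decays linearly over J frames. The exact parameterization in the paper differs in detail; the triangle shape, the 10 ms hop, and J = 5 here are illustrative assumptions.

```python
import numpy as np

def regression_targets(event_time, n_frames, hop=0.01, J=5):
    """Continuous targets encoding the distance from each frame center
    to the true event time (illustrative triangular shape; the paper's
    exact parameterization may differ). Larger J -> wider, smoother peak."""
    frame_centers = np.arange(n_frames) * hop
    # Distance from each frame center to the event, measured in frames.
    dist = np.abs(frame_centers - event_time) / hop
    # Triangle of half-width J frames: 1 at the event, 0 beyond J frames.
    return np.clip(1.0 - dist / J, 0.0, 1.0)

targets = regression_targets(event_time=0.123, n_frames=30)
```

Because the target peaks exactly at the event time rather than at the nearest frame center, the network can learn where inside a frame the event actually happened.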

Proposed model architecture
At its core, the model follows a familiar acoustic modeling pipeline, yet with several thoughtful design choices that enable precise timing prediction.
From audio to features: the log-mel spectrogram
Before a neural network can “hear” music, the raw waveform must be transformed into something it can process effectively. The log-mel spectrogram is a compact time–frequency representation that maps the energy of different frequency bands (following the human ear’s mel scale) over time. Taking the logarithm compresses the huge dynamic range of audio amplitudes, making quiet and loud events more comparable. In practice, this representation captures the timbral and harmonic structure of the piano in a form well suited for convolutional processing. If you want to know more about the log-mel spectrogram, the blog post in [14] is a good starting point.
The input audio is first converted to mono, resampled to 16 kHz, and transformed into a log-mel spectrogram using a 10 ms hop size. This representation retains both the spectral richness of the piano and the temporal dynamics needed for accurate onset and offset estimation.
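A self-contained numpy sketch of this front end is below. The window length (2048 samples), mel-bin count (229), and 30 Hz lower edge are assumptions chosen to match common piano-transcription setups, not necessarily the paper's exact configuration; the 16 kHz rate and 10 ms hop (160 samples) follow the text.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(audio, sr=16000, n_fft=2048, hop=160, n_mels=229):
    """Mono waveform -> (frames, n_mels) log-mel spectrogram; hop = 10 ms."""
    n_frames = 1 + (len(audio) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([audio[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Power spectrogram via the real FFT of each windowed frame.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Triangular mel filterbank between 30 Hz and the Nyquist frequency.
    mel_pts = np.linspace(hz_to_mel(30.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log compression squeezes the huge dynamic range of audio energy.
    return np.log(power @ fb.T + 1e-10)
```

In practice a library such as librosa or torchaudio would be used instead; the point here is only to make the shape of the representation concrete.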
The acoustic model backbone
Once we have a spectrogram, the job of the acoustic model is to map these acoustic patterns to symbolic musical events – effectively translating sound into notes. Originating from speech recognition, acoustic models learn the relationship between audio features and structured targets (like phonemes or, in this case, musical notes). They combine convolutional and recurrent layers to detect both local spectral cues and longer-term temporal dependencies. For more details about acoustic models, refer to [15].
From there, a convolutional neural network extracts high-level time-frequency features. Importantly, pooling is applied only along the frequency axis – never across time – preserving temporal resolution so the network can localize note boundaries accurately. These features are then passed through bidirectional GRU (biGRU) layers, which capture long-range dependencies by processing the sequence both forward and backward in time. This bidirectional design is crucial for piano transcription, where the context surrounding each event (for example, a preceding chord or lingering resonance) informs its interpretation.
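The frequency-only pooling design choice can be made concrete with a minimal numpy illustration (the pool size of 2 is an assumption): the frequency axis shrinks while the time axis – and with it the 10 ms frame resolution – is left untouched.

```python
import numpy as np

def freq_pool(x, pool=2):
    """Max-pool a (time, freq) feature map along frequency only,
    mimicking the backbone's pooling strategy: the time axis is
    never downsampled, so note boundaries stay frame-accurate."""
    t, f = x.shape
    # Trim any remainder, group frequency bins in blocks of `pool`,
    # and keep the maximum of each block.
    return x[:, :f - f % pool].reshape(t, f // pool, pool).max(axis=2)

feats = np.random.randn(100, 229)   # 100 frames, 229 mel bins
pooled = freq_pool(feats)           # time dimension unchanged: (100, 114)
```

Pooling across time, by contrast, would halve the effective frame rate at every layer and reintroduce exactly the quantization error the regression targets are meant to remove.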
Finally, fully connected output heads perform different specialized tasks:
- predicting frame-wise pitch activity,
- regressing onset and offset times for each pitch,
- estimating note velocity, and
- modeling pedal activity through dedicated regression heads for pedal onsets, offsets, and frame-wise sustain states.
Altogether, the network has roughly 20.2 million parameters, trained with the Adam optimizer (learning rate 5 × 10⁻⁴, decayed by 0.9 every 10 000 iterations) for 200 000 iterations on a Tesla V100 GPU with a batch size of 12.

Inference: how precise times are extracted
Once trained, how does the model actually turn continuous predictions into symbolic note events? The inference stage handles this translation.
At inference the network outputs:
- Frame-wise pitch activations
- Onset regression values (per frame/pitch)
- Offset regression values (per frame/pitch)
- Velocity predictions
To find a precise onset:
- Locate a frame B that is a local maximum of the onset regression map for a pitch.
- Use the center times of the adjacent frames A, B, C and their regression values to interpolate a continuous onset time G.
- The same procedure applies to offsets.
Practical limits: because interpolation relies on neighboring frames, notes shorter than about four frames (~40 ms at the 10 ms hop) are hard to recover reliably.
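The paper derives its interpolation formula analytically from the shape of the regression targets; as an illustrative stand-in, the sketch below uses standard three-point parabolic peak interpolation over the neighboring frames A, B, C, which serves the same purpose of recovering a sub-frame time.

```python
import numpy as np

def refine_onset(values, hop=0.01):
    """Pick the local-maximum frame B of a regression curve and refine
    its time using neighbors A and C (parabolic interpolation; the paper
    uses an analytic formula derived from its triangular targets)."""
    b = int(np.argmax(values))
    if b == 0 or b == len(values) - 1:
        return b * hop                     # no neighbors to interpolate with
    xa, xb, xc = values[b - 1], values[b], values[b + 1]
    denom = xa - 2.0 * xb + xc
    shift = 0.0 if denom == 0 else 0.5 * (xa - xc) / denom
    return (b + shift) * hop               # continuous, sub-frame onset time

onset_map = np.array([0.0, 0.2, 0.8, 1.0, 0.6, 0.1])
t = refine_onset(onset_map)  # slightly before frame 3, since xa > xc
```

Note how the asymmetry of the neighbors (0.8 vs. 0.6) pulls the estimate off the frame center – that is exactly the sub-frame information a hard per-frame classification would discard.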
Modeling expressive controls: pedal and velocity
Beyond the note predictions themselves, expressive controls such as pedal and dynamics play a critical role in realistic piano transcription.
Pedal transcription
Sustain pedals complicate piano transcription because they prolong note resonances even after key release, making offsets ambiguous. Kong et al. address this by training the model to predict not just a frame-wise pedal activation map but also continuous onset and offset times for pedal events – using the same regression target formulation as for note boundaries. This parallel treatment allows the network to localize when the pedal is pressed and released with high temporal precision.
However, the approach captures only binary pedal states (on/off). It doesn’t model half-pedaling, where the sustain level varies continuously – a subtle but musically significant aspect of expressive piano performance. Extending the regression design to continuous pedal depth would be an interesting direction for future work.
Velocity estimation
Velocity – how fast a piano key is struck – plays a vital role in expressive transcription. Louder notes typically have stronger attacks and slightly earlier energy bursts, so accurately estimating velocity also helps refine onset timing. In this model, velocity is predicted for each detected onset as a normalized value between 0 and 1, which is later scaled back to the MIDI range of 0–127. Crucially, the predicted velocity is also used as a conditioning input for the onset detection head, allowing the model to leverage the correlation between note dynamics and attack characteristics. This coupling improves both expressive detail and timing precision.
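The normalized-velocity convention above is easy to sketch; the rounding and clipping details here are assumptions for illustration, not the paper's exact code.

```python
def to_midi_velocity(v):
    """Map the model's normalized velocity in [0, 1] back to MIDI 0-127,
    clipping out-of-range predictions."""
    return max(0, min(127, round(v * 127)))
```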
Model Evaluation
Dataset, preprocessing, and evaluation
Training and evaluation are performed on MAESTRO v2.0.0, a large, high-quality dataset containing around 200 hours of aligned piano performances with corresponding MIDI annotations. The recordings are converted to mono, resampled to 16 kHz, and transformed into log-mel spectrograms with a 10 ms hop size.
The evaluation follows standard AMT metrics – precision, recall, and F1-score – computed for both onsets and offsets at multiple temporal tolerances. These tests reveal a key strength of the regression formulation: it maintains high accuracy even under strict timing thresholds. In other words, when precise temporal alignment matters, this model shines.
Additional experiments show that the model is relatively robust to label misalignments in the dataset – an inherent advantage of using smooth regression targets (controlled by the hyperparameter J) rather than hard binary labels.
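In spirit, the onset metric is a one-to-one matching of predicted to reference onsets within a timing tolerance. Real evaluations typically use the mir_eval library and also require the pitch to match; the simplified, pitch-agnostic sketch below just shows how tightening the tolerance stresses temporal precision.

```python
def onset_prf(ref, est, tol=0.05):
    """Greedy one-to-one matching of estimated onset times to reference
    onset times within a tolerance in seconds (simplified: ignores pitch).
    Returns (precision, recall, F1)."""
    ref, est = sorted(ref), sorted(est)
    used = [False] * len(ref)
    tp = 0
    for t in est:
        for i, r in enumerate(ref):
            if not used[i] and abs(t - r) <= tol:
                used[i] = True
                tp += 1
                break
    p = tp / len(est) if est else 0.0
    r = tp / len(ref) if ref else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Sweeping `tol` from 50 ms down to a few milliseconds is precisely the experiment where the regression formulation keeps its F1 high while frame-based baselines degrade.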
Results: What this approach gains
The advantages are clear. Compared to frame-based or Onsets + Frames baselines, Kong et al.’s regression approach delivers more accurate onset and offset timings, particularly under tight evaluation tolerances. Conditioning on velocity improves dynamic realism, while explicit pedal modeling brings the transcription closer to human-level expressiveness. Across a range of hyperparameters and decision thresholds, the model consistently produces smoother, more temporally precise transcriptions and shows resilience to annotation noise.

Limitations, Future Work, and Takeaways
Limitations and open challenges
Despite its strengths, several limitations remain. The model treats the sustain pedal as a binary switch, ignoring half-pedal and continuous control nuances. Very short notes (under ≈ 40 ms, or four frames) remain difficult to capture due to the inherent limits of the 10 ms framing. Dataset diversity is another concern: MAESTRO, though extensive, doesn’t fully capture the acoustic variability of different pianos, rooms, and recording setups. Finally, polyphonic complexity – overlapping notes and sustained resonances – continues to challenge offset prediction, and the system as presented isn’t optimized for real-time inference.
Final thoughts
Kong et al. show that a relatively straightforward change in the target representation – from discrete frames to continuous regression maps – yields tangible improvements where it matters most: timing. For many musical applications (score following, expressive score recovery, alignment, performance analysis) sub-frame accuracy is crucial. The paper’s strengths are clarity, sensible architecture choices, and convincing empirical gains on MAESTRO.
At the same time, the work is an incremental but meaningful step: pedal modeling, half-pedal, extreme polyphony, and robustness to more varied recordings remain active problems. Combining the regression idea with stronger source separation, multi-instrument modeling, or higher sample rates could be promising next steps.
List of Abbreviations
| Abbreviation | Meaning |
|---|---|
| AMT | Automatic Music Transcription |
| BCE | Binary Cross Entropy |
| biGRU | Bidirectional Gated Recurrent Unit |
| MAESTRO | MIDI and Audio Edited for Synchronous TRacks and Organization (dataset) |
| STFT | Short-Time Fourier Transform |
| MIDI | Musical Instrument Digital Interface |
References
[1] Q. Kong, B. Li, X. Song, Y. Wan, and Y. Wang, “High-resolution piano transcription with pedals by regressing onset and offset times,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3707–3717, 2021.
[2] C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C. A. Huang, S. Dieleman, E. Elsen, J. Engel, and D. Eck, “Onsets and frames: Dual-objective piano transcription,” in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2018.
[3] C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C. A. Huang, S. Dieleman, E. Elsen, J. Engel, and D. Eck, “Enabling factorized piano music modeling and generation with the MAESTRO dataset,” in International Conference on Learning Representations (ICLR), 2019.
[4] X. Riley, D. Edwards, and S. Dixon, “High-resolution guitar transcription via domain adaptation,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 1051–1055.
[5] E. Benetos, S. Dixon, Z. Duan, and S. Ewert, “Automatic music transcription: An overview,” IEEE Signal Processing Magazine, vol. 36, no. 1, pp. 20–30, 2019.
[6] G. E. Poliner and D. P. W. Ellis, “A discriminative model for polyphonic piano transcription,” EURASIP Journal on Advances in Signal Processing, 2006.
[7] J. P. Bello, L. Daudet, and M. Sandler, “Automatic piano transcription using frequency and time-domain information,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 6, pp. 2242–2251, 2006.
[8] T. Berg-Kirkpatrick, J. Andreas, and D. Klein, “Unsupervised transcription of piano music,” in Advances in Neural Information Processing Systems (NeurIPS), pp. 1538–1546, 2014.
[9] M. Marolt, “A connectionist approach to automatic transcription of polyphonic piano music,” IEEE Transactions on Multimedia, vol. 6, no. 3, pp. 439–449, 2004.
[10] A. Cogliati, Z. Duan, and B. Wohlberg, “Piano music transcription with fast convolutional sparse coding,” in IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6, 2015.
[11] S. Sigtia, E. Benetos, and S. Dixon, “An end-to-end neural network for polyphonic piano music transcription,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 5, pp. 927–939, 2016.
[12] W. Goebl, “The role of timing and intensity in the production and perception of melody in expressive piano performance,” PhD Thesis, University of Vienna, 2003.
[13] M. Müller, “DTW-based motion comparison and retrieval,” in Information Retrieval for Music and Motion, Springer, 2007, pp. 211–226.
[14] Understanding the Mel Spectrogram. Analytics Vidhya, 2020. Available at: https://medium.com/analytics-vidhya/understanding-the-mel-spectrogram-fca2afa2ce53.
[15] Acoustic Model Explained. Medium, 2021. Available at: https://medium.com/@avinashmachinelearninginfo/acoustic-model-14a1c8939497.