[AVT] Problems with uRTR: draft-ietf-avt-variable-rate-audio-00.txt

Hi,

I have discussed the issue of RTP timestamps for variable sampling
rate codecs with my colleagues and also with Colin, to get as good an
idea of the issues as possible. We also went through the exercise of
seeing what use of a uRTR (unified RTP timestamp rate) concept would
mean for the RTP payload format for AMR-WB+. This revealed a number of
issues.

Let's start with a brief explanation of the system view for a codec like
AMR-WB+:

1. Input sampling (Input Sampling Rate)
->
2. encoding into frames using a specific codec internal sampling 
frequency (ISF)
->
3. RTP packetization, and assigning an RTP TS value for each frame (RTP 
TS rate)
->
4. Transmission
->
5. RTP Reception
->
6. Buffering
->
7. Decoding to Output Sampling Rate
->
8. Audio playout (Output Sampling Rate)

In such a codec system, the input sampling frequency does not
necessarily have to match the output sampling frequency. The audio
signal's bandwidth depends on the selected ISF. Thus the input and
output sampling frequencies should be larger than the ISF.

The RTP payload and timestamp must provide the receiver with sufficient 
information for:
- Recovery of decoding and playout position and order
- Intermedia synchronization

First, a common consequence of using an RTP timestamp rate that does not
match the input sampling rate is that a sampling instant may not be
representable by an integer timestamp value; it may instead be
fractional. When decoding starts, the rounding can lead to an initial
offset error relative to another media stream.

When one uses uRTR, or any timestamp rate at which the transport units
(samples or audio frames) do not have a duration of an integer number
of timestamp ticks, one may also get in-stream jitter. Because a frame
has a fractional duration in the RTP timestamp domain, the rounding
error becomes an error in placing the data correctly on the timescale.
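
To illustrate the in-stream jitter, here is a minimal Python sketch.
The numbers are assumptions chosen only for illustration: frames of
1024 samples at 44100 Hz, stamped with a hypothetical 48000 Hz
timestamp rate. Neither value comes from the draft or from AMR-WB+.

```python
from fractions import Fraction

def placement_errors(samples_per_frame, sampling_rate, ts_rate, n_frames):
    """Rounding error, in timestamp ticks, of each frame's start position."""
    # Exact (possibly fractional) frame duration in timestamp ticks.
    frame_ticks = Fraction(samples_per_frame * ts_rate, sampling_rate)
    errors = []
    for n in range(n_frames):
        exact = n * frame_ticks      # true position on the timescale
        stamped = round(exact)       # integer value that fits in the RTP TS
        errors.append(float(stamped - exact))
    return errors

# Hypothetical example values, not taken from any specification.
errors = placement_errors(1024, 44100, 48000, 5)
print(errors)
```

The error stays within half a tick but changes from frame to frame,
which is what shows up as jitter. With a timestamp rate equal to the
sampling rate, every error is zero.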

This may not be a serious problem for frame-based codecs as long as all
data arrives, since one can then run a scheme that concatenates the data
to be decoded into a correct stream. The decoder output should then be
a correct and unjittered stream. However, if losses occur, one may need
to insert data without the help of prior data to determine what the
fractional offset is, thus potentially introducing jitter in the
placement.

If I understand things correctly, the inter-media synchronization error
is normally not a problem, as humans are quite tolerant of offset.
However, we are very sensitive to jitter within an audio stream.

The error introduced by fractional frame lengths will also have an
impact on the RTP payload design. When aggregating frames for
frame-based codecs, the normal RTP timestamp recovery scheme is to
calculate the RTP TS as: RTP TS value + N * <frame duration in RTP TS
ticks>, where N is the number of frames prior to this frame. However,
if one can't express the frame duration as an integer number of RTP
ticks, then the error is multiplied by N. Thus the error can grow to
several timestamp ticks. The alternative is to use a scheme that
provides absolute RTP TS offset values, which will increase the
overhead needed for aggregation.
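
A small Python sketch of how the error grows with N. The frame size
(1024 samples), sampling rate (44100 Hz) and timestamp rate (48000 Hz)
are hypothetical values picked for illustration, and the function name
is my own.

```python
from fractions import Fraction

def recovered_ts_error(samples_per_frame, sampling_rate, ts_rate, n):
    """Error in ticks when frame n is placed at n * <rounded frame duration>."""
    # Exact frame duration in timestamp ticks (fractional in general).
    exact_ticks = Fraction(samples_per_frame * ts_rate, sampling_rate)
    # Integer duration that a payload format would have to specify.
    nominal_ticks = round(exact_ticks)
    return float(n * nominal_ticks - n * exact_ticks)

# Hypothetical example values, not taken from any specification.
for n in (1, 4, 8):
    print(n, recovered_ts_error(1024, 44100, 48000, n))
```

With these assumed numbers the per-frame error is 65/147 of a tick
(about 0.44), so the eighth aggregated frame is already more than three
ticks off.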

For sample-based codecs, where the smallest unit is a sample, the
fractional error may be even harder to handle, due to the need for
greater precision in alignment and potentially less regular borders
between packets.

It also seems very hard to select a uRTR that will work well. First,
audio uses two families of sampling frequencies:
- 8000, 16000, 24000, 32000, 48000, 96000, 192000
- 11025, 22050, 44100, 88200

The frequency span is also quite large due to the different
applications. The higher values of 192k and 88.2k are used in SACD and
DVD-Audio and can be expected to occur. Thus selecting a common rate
that is practically feasible within RTP (lower than a few MHz, due to
the timestamp wrap-around) does not seem possible. Any selected rate
would most likely be a compromise, leading to bad conversion factors
for one or the other of the two families.
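
As a rough check of this, the sketch below computes the smallest rate
that every frequency in both families divides evenly, and how quickly
the 32-bit RTP timestamp would wrap at that rate. The helper names are
my own.

```python
from functools import reduce
from math import gcd

FAMILY_48K = [8000, 16000, 24000, 32000, 48000, 96000, 192000]
FAMILY_44K = [11025, 22050, 44100, 88200]

def lcm_all(values):
    """Least common multiple of a list of positive integers."""
    return reduce(lambda a, b: a * b // gcd(a, b), values)

def wrap_seconds(ts_rate):
    """Seconds until a 32-bit RTP timestamp wraps at the given rate."""
    return 2**32 / ts_rate

common = lcm_all(FAMILY_48K + FAMILY_44K)
print(common, wrap_seconds(common))   # exact common rate and its wrap time
print(wrap_seconds(72000))            # for comparison: the 72 kHz rate
```

The exact common rate comes out at 28224000 Hz, where the timestamp
would wrap after roughly two and a half minutes, compared with more
than 16 hours at 72 kHz.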

Due to these issues, I think AMR-WB+ should keep its 72 kHz RTP
timestamp rate, as it provides the codec with the necessary audio frame
location on a full-resolution clock without jitter. It also has quite
good clock conversion factors for commonly used output frequencies:

Output frequency (Hz)   72 kHz ticks per output sample
8000                    9
16000                   4.5
24000                   3
25600                   2.8125
32000                   2.25
44100                   1.632653061
48000                   1.5
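
The factors above can be reproduced as exact fractions with a few lines
of Python (a sketch; the names are my own):

```python
from fractions import Fraction

TS_RATE = 72000
RATES = [8000, 16000, 24000, 25600, 32000, 44100, 48000]

def ticks_per_sample(rate, ts_rate=TS_RATE):
    """Number of timestamp ticks per sample at the given output rate."""
    return Fraction(ts_rate, rate)

for rate in RATES:
    ratio = ticks_per_sample(rate)
    print(rate, ratio, float(ratio))
```

All of the rates except 44100 Hz give a ratio with a small
power-of-two denominator (for example 45/16 for 25600 Hz), while
44100 Hz gives 80/49.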

Another common problem with uRTR, and with a freer choice of RTP
timestamp rate, is the impact on client implementations. Most
multimedia clients are driven by the audio card clock. The client
implementation uses the audio clock to know when it needs to decode,
remove data from the receiver buffer, etc. Thus RTP timestamp rates
will need to be converted to that clock, and errors may arise here as
well. Allowing different rates to be used for different codecs will
result in the need to handle conversion for more than one rate, making
codec plug-ins more difficult.

In conclusion, not using uRTR is more likely to maintain the precision
of where the audio data belongs. The client implementation will be
slightly more impacted than it would be with uRTR; however, I think
this might be the price we need to accept.

I would also propose that the issues around RTP timestamp rates for
audio be documented in an informational RFC. This would include the
recommendation to use the input sampling frequency when applicable.
Variable rate codecs do expose limitations in RTP, and these should
also be documented. Further recommendations on how to select rates, and
the fact that these may need to be considered already in codec
development, should also be part of it.

Cheers

Magnus Westerlund

Multimedia Technologies, Ericsson Research EAB/TVA/A
----------------------------------------------------------------------
Ericsson AB                | Phone +46 8 4048287
Torshamsgatan 23           | Fax   +46 8 7575550
S-164 80 Stockholm, Sweden | mailto: magnus.westerlund@ericsson.com

_______________________________________________
Audio/Video Transport Working Group
avt@ietf.org
https://www1.ietf.org/mailman/listinfo/avt