On July 1, to mark Canada Day, three
Canadian choirs participated in a master class given by the First Vice-President
of the World Choral Federation (Maria Guinand of Venezuela). The choirs were
located in Alberta, Nova Scotia, and Toronto; I was among the singers. Our
instructor, together with an audience of 100, was in St. John's. There were two
live cameras in each location, making eight cameras in all. The network was of
course CANARIE.
The experience reminded me of the old saw
about an elephant that joined the corps de ballet. "What was remarkable
was not that the elephant danced well, but that it danced at all."
The master class made the most stringent
demands on the technology: to be effective, sound reproduction had to be
CD-quality or better, and latency had to be low enough to be
imperceptible, not to mention the complexity of the network hookup itself.
During the session, one site was dropped,
but restored within five minutes. There were a few other momentary lapses. But
on the whole, the network performed well.
The video conferencing was another story.
Two-way latency was a second or more. Motion was jerky, with frequent freezes
of half a second to a couple of seconds. With that delay, it was impossible for
the instructor to conduct a choir's singing. And audio was so inferior that our
instructor exclaimed, "Well, obviously we're not going to pay any
attention today to how you sound!"
Regional Disharmony
The plan was for the three choirs and the
St. John's audience to end the broadcast by singing "O Canada"
together. But with a second of latency multiplied several times over, this
broke down completely, with the four locations coming in on each phrase over a
span of several seconds. Singing dissolved into laughter—we had just celebrated
Canada Day by demonstrating the regional disharmony that bedevils the Canadian
body politic.
Our instructor congratulated the
technology experts on demonstrating a technique that shows great promise for
bringing together peoples of the world, “even if there are still a few wrinkles
to be ironed out.” But what are the wrinkles? What will it take to produce
real-time videoconferencing of a quality that could make such a session a real
success?
John Riddell, Editor, Telemanagement
‘More Clear-Path Bandwidth’ Is Needed
Most of today's commercial video conference systems
are designed for simple "talking head" applications with limited
multi-point interaction. Using video conferencing for coordinated music, such as
multiple choirs or virtual orchestras, pushes the envelope of today's video
conference technology. However, considerable promising research is being
carried out in this area, for example at McGill University, working in
partnership with CA*net 4 (Canada’s national optical research network).
The central problem with current systems is the compression of the video
signal, and the synchronization of that signal with the audio, performed by
off-the-shelf video conferencing systems. That is where the delays of one
second and more come from.
The key to the solution is to fully utilize the network bandwidth and send
uncompressed audio and video over the link. Then you only have to deal with
delay due to the speed of light and the speed of switching through the network.
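To give a feel for how small that remaining delay is, here is a minimal Python sketch of the arithmetic. The fibre refractive index, the route lengths, the hop count and the per-hop switching delay are all illustrative assumptions, not measurements of CA*net 4 or any other network.

```python
# Speed of light in a vacuum, in km/s (physical constant).
C_VACUUM_KM_S = 299_792.458
# Assumed typical refractive index of optical fibre; light travels
# at roughly two-thirds of its vacuum speed inside the glass.
FIBRE_INDEX = 1.47


def propagation_delay_ms(route_km: float, index: float = FIBRE_INDEX) -> float:
    """One-way propagation delay over a fibre route, in milliseconds."""
    return route_km / (C_VACUUM_KM_S / index) * 1000.0


def one_way_delay_ms(route_km: float, hops: int, per_hop_ms: float) -> float:
    """Propagation delay plus an assumed fixed switching delay per hop."""
    return propagation_delay_ms(route_km) + hops * per_hop_ms


if __name__ == "__main__":
    # Hypothetical route lengths, hop count and per-hop delay.
    for route_km in (500, 2000, 5000):
        delay = one_way_delay_ms(route_km, hops=10, per_hop_ms=0.1)
        print(f"{route_km:>5} km: {delay:.1f} ms one way")
```

Under these assumptions, propagation costs roughly 5 milliseconds per 1,000 km of fibre, so distance alone accounts for only a small part of a one-second delay.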
Ultra Video
Jeremy Cooperstock and John Roston at McGill University have developed special
“Ultra Video” technology to minimize the problem
(http://ultravideo.mcgill.edu). They are world leaders in this technology.
Together with McGill’s Wieslaw Woszczyk, they have demonstrated that musicians
can play with a slight delay as long as that delay is constant and under about
200 milliseconds. Used the right way (that is, with dedicated lightpath
capability), CA*net 4 can help achieve those characteristics across North
America. The McGill researchers have been successful with small groups of jazz
musicians playing between Montreal and California, but can do much better over
shorter distances.
The McGill team continues to innovate. They hope to show multiple streams of
bi-directional high-definition TV between Montreal and Seattle at the November
SC2005 conference on high-performance computing. That demonstration will have a
music teacher in Seattle giving a master class to a jazz ensemble located in
Montreal.
Getting a choir coordinated in a large performance space requires a leader,
because sound travels slowly through air. The problem is compounded when you
try to do this with multiple choirs separated by hundreds or thousands of
kilometers, but it has to be solved in a similar way. There has to be a central
coordination point (the conductor) and then a separate mix of the sounds from
each site, which is fed back to the ears of each member of the choirs. This
isn't your usual video conferencing setup!
One of the biggest problems for multi-point video conferences is that they
typically pass through a single point for redistribution. That box is a
Multipoint Control Unit, or MCU. If that machine is poorly connected, or
CPU- or backplane-challenged, then you'll see a bad video conference even if
the end-points have good cameras, excellent encoders and great network
connectivity. So a modern MCU is a good first step.
It is really hard to handle echo problems (echo cancellation) in any general
way. There are good cheap solutions for voice in a small room, and good
expensive solutions for voice in a big room, but for music, especially in a
large space, it is very difficult.
Variability is the big villain in video conferences. If you are sharing the
line with other IP applications at any point in your network, you could run
into problems where the packets from the video conference are delayed or even
lost. This will cause bad video, pops in the audio and even dropped
connections. So you probably don't want a Grand Challenge physics project
sharing your network while you sing "O Canada."
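One common way to hide this variability is a playout buffer at the receiver, which trades a little extra fixed delay for a constant one. The Python sketch below illustrates the idea only; the packet trace and the 50-millisecond playout delay are hypothetical, and it assumes sender and receiver clocks are synchronized.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    seq: int           # sequence number assigned by the sender
    sent_ms: float     # time the sender captured the audio/video
    arrived_ms: float  # time the receiver got the packet


def schedule_playout(packets, playout_delay_ms):
    """Split packets into those played on time and those that arrived too late.

    Every on-time packet is played exactly playout_delay_ms after it was sent,
    so the listener hears a constant delay regardless of network variation.
    """
    played = {}  # seq -> playout time in ms
    late = []    # seqs that missed their playout slot
    for p in packets:
        playout_at = p.sent_ms + playout_delay_ms
        if p.arrived_ms <= playout_at:
            played[p.seq] = playout_at
        else:
            late.append(p.seq)
    return played, late


# Hypothetical trace: one packet every 20 ms, one delayed by competing traffic.
network_delays_ms = [30, 35, 90, 32]
trace = [Packet(i, i * 20.0, i * 20.0 + d) for i, d in enumerate(network_delays_ms)]

played, late = schedule_playout(trace, playout_delay_ms=50)
print("played:", played)
print("late:", late)
```

In this trace the packet held up for 90 milliseconds misses its slot and would be heard as a pop or dropout. Raising the playout delay to 100 milliseconds would rescue it, but only by adding to the constant delay that every singer has to live with.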
Three Lines of Research
There are currently three key areas where IP-based video conferencing
development is taking place:
1. The Ultra Video non-compressed video and surround sound work at McGill,
currently being upgraded to high-definition video. Other groups around the
world (e.g. in the Netherlands, Japan, Korea, Australia and the USA) are
working on variations on this theme. The idea is to see how close to "being
there" you can get if you remove constraints on bandwidth and keep the latency
(delay) to a minimum. Typically 1.5 gigabits per second is necessary for each
stream; McGill is currently working with 3 bi-directional streams. Needless to
say, these experiments are still very expensive.
2. DV (Digital Video) encoders/decoders of the type used in consumer camcorders
can be used as a cheap entry into fixed-rate compressed video with stereo
audio. Because of compression, there's often an unavoidable delay when using
these, but it is constant and predictable, which is better than some of the
commercial video conferencing systems. The audio quality is also significantly
better, especially for music; most commercial VC systems are optimized for
voice (which really is no surprise). Typical streams are about 30 megabits per
second. Internet2 in the U.S. has an active user group using technology
originally from Japan.
3. Access Grid (AG) is a third approach to VC, and one that is still
developing. The idea is that each system has a wall-sized display
(approximately 3072x768 pixels), usually produced by at least three projectors.
Each site generates at least three views of its room for transmission to other
sites. A high-speed, multicast-enabled network connects all sites. Each site
has an operator to optimize the placement of these video streams on the screen.
Current work aims to integrate better quality video and audio systems into the
Access Grid framework, and to produce tools that make the system easier to
manage. Australian researchers seem to be leading in this, although the
technology base was originally developed in the U.S.
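The bandwidth gap between the first two approaches is easy to put in numbers. The Python sketch below uses the per-stream rates quoted above; reading "3 bi-directional streams" as three streams in each direction is an assumption, and the three-stream DV session is a hypothetical case added only for comparison.

```python
# Per-stream rates quoted in the list above, in bits per second.
UNCOMPRESSED_BPS = 1_500_000_000  # "typically 1.5 gigabits per second"
DV_BPS = 30_000_000               # "about 30 megabits per second"


def aggregate_bps(per_stream_bps: int, streams: int, bidirectional: bool = True) -> int:
    """Total capacity needed, counting both directions when bidirectional."""
    directions = 2 if bidirectional else 1
    return per_stream_bps * streams * directions


def gbps(bps: int) -> float:
    """Convert bits per second to gigabits per second."""
    return bps / 1e9


# Assumption: three streams in each direction for both approaches.
uncompressed_total = aggregate_bps(UNCOMPRESSED_BPS, streams=3)
dv_total = aggregate_bps(DV_BPS, streams=3)

print(f"Uncompressed: {gbps(uncompressed_total):.2f} Gbit/s")
print(f"DV:           {gbps(dv_total):.2f} Gbit/s")
print(f"Per-stream ratio: {UNCOMPRESSED_BPS / DV_BPS:.0f}x")
```

On these figures an uncompressed stream needs 50 times the bandwidth of a DV stream, which is why the uncompressed experiments depend on dedicated lightpaths.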
The whole VC landscape is complicated by the fact that most current IP-based
systems are based on H.323, which in Internet terms is clunky, cumbersome and
hard to deal with. SIP-based systems are appearing at a fast rate due to the
simplicity and openness of the protocol.
None of the three technologies described here uses SIP yet. But as they move
toward a common call setup, it is likely that SIP will be adopted.
In the long term, business communications are likely to borrow from each of the
above approaches. They will have the quality of the McGill HD/Surround Sound
streams, the low cost of the DV camera-based systems, and the management
structures and meeting place ideas from the Access Grid work. And certainly
less compression will be involved and more clear-path bandwidth required.
Peter Marshall, Director of Network Applications, CANARIE