Synchronising audio and video from multiple sources

  • Dave2002
    Full Member
    • Dec 2010
    • 18061

    Synchronising audio and video from multiple sources

    I just noticed that there's a potential problem with trying to synchronise multiple audio and video recordings of the same event.
    First, let's assume that for video, the audio track of most video recording devices/cameras will already be synchronised with its own pictures. Separate audio recordings may not be, however. Where things really get fun is with different video recordings from different recording devices.

    For audio - mostly digital these days - the data rates are such that it should be at least feasible to get audio tracks synchronised to within a millisecond or so. For video, with a typical frame rate of 30 fps, there are around 33 ms between frames, which means that if attempts are made to synchronise video with soundtracks, the video could still be out by up to around 17 ms (half a frame) unless there was some attempt to synchronise the video at source.
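
    As a quick back-of-envelope check of those figures (pure arithmetic, nothing camera-specific), in Python:

    Code:
    # Frame-period arithmetic behind the 33 ms / ~17 ms figures above.
    def frame_period_ms(fps: float) -> float:
        """Duration of one frame in milliseconds."""
        return 1000.0 / fps

    for fps in (30.0, 25.0):
        period = frame_period_ms(fps)
        print(f"{fps:.0f} fps: {period:.1f} ms per frame, "
              f"worst-case half-frame offset {period / 2:.1f} ms")
    # 30 fps -> 33.3 ms frames, ~16.7 ms worst case
    # 25 fps -> 40.0 ms frames, 20.0 ms worst case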

    Potentially this could present difficulties for video editing of concerts taken using multiple cameras and microphones. I'm guessing that professionals know how to deal with this. I'd be interested in any clues as to how to get the best results. Of course things could be more complex - there's not even any guarantee that all the videos would have the same frame rates.
  • Gordon
    Full Member
    • Nov 2010
    • 1425

    #2
    You are right Dave, there are many technical issues to be resolved in jointly processing video and audio material, not least synchronisation, both within the video environment itself and in combining audio and video from multiple sources. Whereas audio has no inherent time base, video does, and it is essential that video processing respects those timing constraints; where audio is also involved, it is always subservient to the video.

    One clue to how challenging the problems are is bit rate: professional Standard Definition TV requires over 200 Mbit/s per video stream without any audio, and HDTV in excess of 1 Gbit/s. Compression and some other compromises are almost essential for consumer applications, and these bring their own problems.
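
    For anyone wanting to sanity-check the SD figure: assuming 10-bit 4:2:2 sampling of the active picture only (one plausible set of parameters, not the only one), the arithmetic comes out just over 200 Mbit/s:

    Code:
    # Uncompressed SD active-picture bit rate, assuming 10-bit 4:2:2.
    width, height, fps = 720, 576, 25     # European SD frame and rate
    samples_per_pixel = 2                 # 4:2:2: Y plus alternating Cb/Cr
    bits_per_sample = 10

    mbit_s = width * height * fps * samples_per_pixel * bits_per_sample / 1e6
    print(f"{mbit_s:.0f} Mbit/s")         # ~207 Mbit/s, before audio and blanking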

    This is not the place for a lengthy tutorial, but having spent the majority of my career in the analogue and digital broadcast/media industry I can say that some of the issues are non-trivial. Agreed international procedures and standards have evolved to deal with all of them, and they are still evolving as the latest 4K high-definition formats are brought in. Repeating in domestic situations what can be done in professional ones can be difficult or even impossible because of the complexities and/or the implied cost.

    • MrGongGong
      Full Member
      • Nov 2010
      • 18357

      #3
      I think (???) you need a time base corrector
      This used to imply adding a couple of 00s to the budget but not sure these days

      • Dave2002
        Full Member
        • Dec 2010
        • 18061

        #4
        Originally posted by Gordon View Post
        Whereas audio has no inherent time base, video does, and it is essential that video processing respects those timing constraints; where audio is also involved, it is always subservient to the video.
        Generally I would agree with that - losing or gaining a few milliseconds of audio probably doesn't cause too many problems in drama-type productions.
        However, I specifically enquired about concerts, where perhaps the audio would be more important. I think that (say) frequent 10-20 ms glitches in the audio in order to patch in different video clips could be too much to bear.

        Professionals may very well have equipment and ways of overcoming, or reducing the problems. Using cameras with significantly higher frame rates might help too.

        Repeating in domestic situations what can be done in professional ones can be difficult or even impossible because of the complexities and/or the implied cost.
        This does indeed make sense to me.

        Now I'm thinking more about this, I find it really amazing that there have been live concert broadcasts with video over the last 30-40 years, with apparently seamless video handover - which must have been done in earlier years with analogue equipment, and more recently with digital kit. Unless all the video recording was done with synchronised cameras, there would almost inevitably have been some problems further down the broadcast chain. Even with delayed/recorded video for post-production editing the problems would have been hard - or were there some fudges (tricks) which made any glitches seem unimportant?

        Perhaps viewers/listeners were more tolerant of problems than I imagine.

        We know there are tricks which can be done after the event in audio - if there's enough time delay to do a quick edit, such as patching in a good horn note over a bad one, or putting in a cough from the audience to hide a mistake - but video would seem to present more significant challenges.

        Anyway, thanks for responding. I am currently trying to sort out a video recording which must surely have some of the problems I have hinted at. The problems may turn out to be insuperable, given the material I have, and the equipment and software, though I might figure out ways of making it work - "almost"!

        If the videos are to be synchronised, I may have to shorten or lengthen some of the musical notes from the sound tracks - hopefully judiciously. As yet I've not even tried to use the recordings from two cameras in anger, though I have tested the feasibility, and can do that.

        What I've not yet checked is how bad the results will be if video clips are simply inserted and switched around arbitrarily. I may be expecting things to be worse than they will actually be, though such expectations may yet prove fully justified.

        • MrGongGong
          Full Member
          • Nov 2010
          • 18357

          #5
          I think the perception of synchronisation varies enormously depending on what you are seeing and hearing
          It always strikes me as remarkable that we perceive "timing errors" when seeing someone speak on TV (particularly with digital TV) when we can't actually see what is making the sound. I suspect small errors of timing are hard to detect with some visual images compared to others.

          • Dave2002
            Full Member
            • Dec 2010
            • 18061

            #6
            Originally posted by MrGongGong View Post
            I think (???) you need a time base corrector
            This used to imply adding a couple of 00s to the budget but not sure these days
            Thanks for the suggestions.

            Time base corrector - https://en.wikipedia.org/wiki/Time_base_correction
            Frame synchronizer - https://en.wikipedia.org/wiki/Frame_...on_%28video%29

            Looks as though £500 might do something - though how well? http://www.rcblogic.co.uk/p-1480-kra...FXcW0wodaLoMJA Various bits of kit in the range under £1k.

            More stuff - https://www.axon.tv/EN/products/3-in...-synchronizers



            Looks more complex and expensive than many of us might wish to get involved with!

            • Dave2002
              Full Member
              • Dec 2010
              • 18061

              #7
              Originally posted by MrGongGong View Post
              I think the perception of synchronisation varies enormously depending on what you are seeing and hearing
              It always strikes me as remarkable that we perceive "timing errors" when seeing someone speak on TV (particularly with digital TV) when we can't actually see what is making the sound. I suspect small errors of timing are hard to detect with some visual images compared to others.
              I agree. Lip sync is a particular issue which stands out, and I think it is very critical. Other audio/visual content might be much more tolerant of moderate errors.

              • Anastasius
                Full Member
                • Mar 2015
                • 1860

                #8
                In a professional scenario (specifically a concert), you don't have lots of separate cameras, each with their own audio being recorded alongside the video. The audio will be laid down as a separate track (or tracks) and synchronised to the video in post-production using timecode. NB not a 'time base corrector', which is exclusively used to synchronise different analogue video sources so that there is no frame jump when the vision mixer cuts between them.
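
                To make the timecode idea concrete, here is a minimal sketch of the alignment arithmetic. The clip names and timecode values are invented for illustration; this is not the workflow of any particular editing package:

                Code:
                from dataclasses import dataclass

                @dataclass
                class Clip:
                    name: str
                    start_tc: str   # "HH:MM:SS:FF" timecode of the first frame
                    fps: int

                def tc_to_frames(tc: str, fps: int) -> int:
                    """Convert an HH:MM:SS:FF timecode to an absolute frame count."""
                    hh, mm, ss, ff = (int(part) for part in tc.split(":"))
                    return ((hh * 60 + mm) * 60 + ss) * fps + ff

                video = Clip("camera 1", "10:03:12:05", 25)
                audio = Clip("recorder", "10:03:10:00", 25)

                # Frames by which the audio must be delayed so both start together.
                offset = (tc_to_frames(video.start_tc, video.fps)
                          - tc_to_frames(audio.start_tc, audio.fps))
                print(f"shift '{audio.name}' by {offset} frames "
                      f"({offset / video.fps:.2f} s)")   # 55 frames, 2.20 s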

                If you are trying to edit between several digitally recorded streams (audio and video on the same device, e.g. a camera with a wee mic stuck on top) then I would have thought it extremely difficult in the domestic environment, simply because you have no common reference such as timecode between the sources.

                As a slight digression, this business of jump cuts was anathema to broadcasters like the BBC. It was not an issue in a studio environment, where all cameras were fed from the same SPG (sync pulse generator), but in an OB (outside broadcast) environment, where you might have several remote camera sites, the delays in the video arriving at the central mixing site would cause a problem, particularly with colour and PAL. Various mechanisms were developed (Genlock, for example, or Slavelock) but the one that I am most familiar with, and whose audio tones I can still 'hear' in my mind, is Natlock, designed and developed by the BBC. In essence, it sent signals in the form of audio tones to the remote site to advance the SPG and CSC (colour subcarrier) so that by the time the delayed video arrived back at the central mixing site, it was both in sync and in colour phase.
                Fewer Smart things. More smart people.

                • Dave2002
                  Full Member
                  • Dec 2010
                  • 18061

                  #9
                  Right now I'd settle for synchronisation even to within one frame - around 33 ms. This "ought" to be possible with tools such as Adobe Premiere Elements, but so far my experiments suggest it is a very difficult - sort of hit and miss - affair. This is a great shame, as I can see the two video streams, but even getting them synchronised to within one frame is hard. One has better video quality but a restricted view, and isn't always stable, while the other is stable, has a wider view from a different viewpoint, and seems grainier. It "ought" to be possible to combine these, but it seems very hard. Maybe this will serve as a warning to anyone else who attempts this with domestic equipment and consumer-level software.

                  For slightly less demanding applications the Adobe software seems quite good, and making an "amateur" video (holiday clips, etc.), even with multiple audio and video streams, is possible, but precise synchronisation - even to within one frame - seems hard. With the number of cuts and overlays I'd need for my project, it may be time to call it a day now! A suggestion from elsewhere to use the audio tracks to synchronise the video also seems hard. For pure audio, using a tool such as Audacity, I'd expect to get an overlay or join accurate to within one or two ms, though there might still be an audible click. I have done that before.
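
                  For what it's worth, the usual automated form of that "use the audio tracks" suggestion is to cross-correlate the two soundtracks and shift one video by the estimated lag, rounded to a whole frame. A minimal sketch, using numpy with a synthetic signal standing in for the real soundtracks (real-length tracks would want an FFT-based correlation, e.g. scipy.signal.correlate with method="fft"):

                  Code:
                  import numpy as np

                  def lag_seconds(a: np.ndarray, b: np.ndarray, rate: int) -> float:
                      """Seconds by which events in `a` occur later than in `b`."""
                      corr = np.correlate(a - a.mean(), b - b.mean(), mode="full")
                      lag = int(np.argmax(corr)) - (len(b) - 1)
                      return lag / rate

                  # Toy demonstration: the same noise signal, one copy delayed 0.5 s.
                  rate = 8_000
                  rng = np.random.default_rng(0)
                  ref = rng.standard_normal(rate * 2)                  # 2 s of "audio"
                  delayed = np.concatenate([np.zeros(rate // 2), ref])[:ref.size]

                  lag = lag_seconds(delayed, ref, rate)
                  print(f"estimated lag: {lag:.3f} s")                 # ~0.500
                  print(f"shift by about {lag * 25:.1f} frames at 25 fps")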

                  • Anastasius
                    Full Member
                    • Mar 2015
                    • 1860

                    #10
                    One thought as to why you are having problems is the relative timing stability of the individual sources. If you have a camera rated at x frames per second and the internal clock stability is not that good, then I posit you might actually get x + 0.002 frames per second, or even x - 0.002 frames per second. If there is no reference timecode recorded alongside (which I suspect professional cameras do provide) then you really are flogging a dead horse!
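
                    To put a rough number on that drift (illustrative figures, not measurements of any real camera):

                    Code:
                    # Drift from a camera clock that is off by 0.002 fps from nominal.
                    nominal_fps = 30.0
                    error_fps = 0.002               # posited clock error
                    duration_s = 3600               # a one-hour concert

                    drift_frames = error_fps * duration_s           # surplus/missing frames
                    drift_ms = drift_frames / nominal_fps * 1000    # as a time offset

                    print(f"after {duration_s} s: {drift_frames:.1f} frames of drift, "
                          f"about {drift_ms:.0f} ms")               # ~7.2 frames, ~240 ms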

                    Again, going back to the earlier digression: in an attempt to minimise the need for 'fine tuning' of timings, as it were, the BBC did try using a rubidium oscillator. The frequency stability of these borders on the edge of our technology, based as it is on known laws of the universe. But even using those, you could get two colour signals in phase, and within seconds they had drifted far enough apart to require phase correction.

                    Edit: I think I may have hit the nail on the head. http://www.bhphotovideo.com/find/new.../Time-Code.jsp

                    and another useful article http://www.bhphotovideo.com/explora/...why-it-matters

                    I Googled 'Do professional cameras record timecode'. Lots of good stuff there.
                    Fewer Smart things. More smart people.

                    • Dave2002
                      Full Member
                      • Dec 2010
                      • 18061

                      #11
                      I take some of your points, which do seem similar to my own initial thinking. However, actually trying to do this "for real" has so far caused me to beat my head against a wall (without even going near that tea drinker PG Tipps). Basically I have two videos and a separate audio recording, each of different quality. I would at least like to try, and hopefully succeed, in making a composite whole which is better than each of the parts. So far I have failed miserably. I can't even get frame synchronisation. I assume that if I could get some of the frames synchronised, the remaining frames would not drift out of synchronisation too rapidly.

                      There may be some arcane commands for the Adobe tool which would enable me to do what I'm hoping for. I'd like to make the clips "rigid" so that the timing won't shift; without that, I think Premiere Elements may stretch or squeeze the clips if I make a wrong move. Once I've done that, it ought to be simple to just shift one video relative to the other - for example, to find a frame in each where (say) the conductor has his fingers outstretched - and then bind them together so that video can be taken from either.

                      I haven't actually checked that the frame rates are the same on the two input sources. I might be able to get a result by generating an output file and then feeding that back in, but I'd really like to avoid that if at all possible.
                      Last edited by Dave2002; 17-02-16, 18:30.

                      • Anastasius
                        Full Member
                        • Mar 2015
                        • 1860

                        #12
                        Dave, I really think you are on a hiding to nothing for all the reasons we've discussed.
                        Fewer Smart things. More smart people.

                        • Dave2002
                          Full Member
                          • Dec 2010
                          • 18061

                          #13
                          Thanks. I've read msg 10 again.

                          What you appear to be drawing attention to is that it is very hard (impossible) to do synchronous communication without using a common clock. However, what is probably most often used nowadays is asynchronous communication, in which recognisable markers are put into the data channel and used to resynchronise devices at a block level. It is usually assumed that the communications and the devices, using local clocks of similar accuracy, are sufficiently stable over a long enough period that drift errors within a block won't have a significant or permanent effect in a digital system, particularly if error correction is used.

                          As yet I don't know enough about this, but it does appear that some digital video (editing?) systems don't use the time codes you mention, or may even strip them out. Some systems may attempt to put time codes back in. I had assumed that with digital editing there would be start-of-frame markers which could be used to synchronise at the frame level, though even that is complicated, as a lot of video is digitally compressed and may (will) use both forward and backward predictive frames. I had rather assumed that a decent software video editing package could take care of all this.

                          Video editing does not have to be done in real time, either human or computer time, though it's desirable if it can be done relatively quickly. Edit operations which take (say) 5 times longer than simply playing the video (i.e. 5 hours of editing versus 1 hour of playback) may be acceptable, but operations with a much larger factor (100, 1000 or even more) are for most purposes impractical - though some commercial movies are produced with an enormous expenditure of money, time and effort, both human and computer.

                          I can see that getting everything right might be hard or impossible, but surely it shouldn't be as hopeless as I'm currently finding it. I'll follow up your sources in msg 10.

                          • Gordon
                            Full Member
                            • Nov 2010
                            • 1425

                            #14
                            Further to your #11 and #13 above, it would help in understanding your problems if you could tell us more about your sources - the 2 videos and separate audio, presumably of the same event. What were the videos recorded on and what format are they in [digital, compressed or not, aspect ratio, MPEG or something else?], and how are they stored - e.g. on tape, on DVD, or just as files on your laptop or other computer? Similarly the audio.

                            How are you trying to combine them - on a laptop or something else? What other hardware is involved, i.e. how are the sources [say the cameras and audio recorder themselves] connected to the system? What interfaces are there between the sources and the laptop/computer - IEEE 1394? Are there multiple inputs available on the computer?

                            It is possible to combine video material in non-real time by ingesting the sources into separate files on a hard drive and then, provided you know [as you say in #13] where the frames start, creating a new file made from parts of the inputs frame by frame. MPEG-2 and MPEG-4 provide frame markers [among other vital parameters] in the file format, but not necessarily for each and every frame. The editing software will need to read the file headers. Timing issues only arise when you want to watch the moving sequences together.

                            The tricky bit is the amount of memory - hard drive space and some RAM - you may need for full SDTV resolution. SDTV at full resolution [European frames are 720x576 pixels, only about 0.4 MPixels] makes about 21 MBytes per second, and so an hour's uncompressed file is about 75 GBytes in size. What resolution are your sources? An iPlayer file for an hour is only about 640 MBytes - it's highly compressed [over 100:1], so carries impairments, and is delivered in asynchronous mode by file transfer using IP. In that form its editing potential is limited. Then there is the matter of your laptop's video card and native screen resolution.....
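
                            The arithmetic behind those sizes, for anyone who wants to rerun it (assuming 8-bit 4:2:2 sampling, i.e. two bytes per pixel on average - the assumption that reproduces the ~21 MB/s figure):

                            Code:
                            # Uncompressed SDTV storage, assuming 8-bit 4:2:2 (2 bytes/pixel).
                            width, height, fps = 720, 576, 25   # European SDTV frame and rate
                            bytes_per_pixel = 2                 # Y plus alternating Cb/Cr

                            rate = width * height * bytes_per_pixel * fps   # bytes per second
                            per_hour = rate * 3600

                            print(f"{rate / 1e6:.1f} MB/s, {per_hour / 1e9:.1f} GB per hour")
                            # -> 20.7 MB/s, 74.6 GB per hour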

                            You mention 33 ms above - that is the US video frame rate standard, nominally 30 Hz; in Europe the rate is 25 Hz, making a 40 ms frame. Are you using cameras with US standards? If you are mixing US and European standard video you'll need a standards converter.

                            • Dave2002
                              Full Member
                              • Dec 2010
                              • 18061

                              #15
                              Thanks Gordon. Both videos were recorded on digital equipment. One was a Sony video camera - I don't know the model - and the other was a Zoom H2 Handycam. The Sony was set up by someone else, who has given me access to the material. Perhaps foolishly, I recorded some of the video material on the Zoom in HD. The good thing about that is that some of the images are very clear - though unfortunately my camera work wasn't very good. I think the Sony material was recorded in 720 HD, so not as high a resolution as the Zoom.

                              Post-production zooming in to the Sony material, for example to pick out soloists, seems rather quickly to give poor results - at least in the edit preview. This is for several reasons. Firstly, the angle of view used covers more of the performers, so zooming in on a single face or part of the scene exacerbates the quality issues compared to the Zoom material; the Zoom material seems in some sequences to be closer to the performers, so further zooming in does not degrade the images so much. Secondly, the resolution of the Zoom recordings, being higher, does allow for much clearer close-up shots.

                              I rather think that all of the material was recorded at 30 fps, but I'd have to check. Modern digital kit can often record in a range of standards, aspect ratios, frame rates, etc. I think most of the Zoom audio was in PCM, while the Sony audio was in Dolby AC3. The Zoom can also do mp4 (AAC) audio at a number of different compression ratios.

                              If the best bits of the Zoom material can be combined with the Sony material then the final result could be better. If I were doing this again I'd hope to have much better camera work! The Sony was on a static mount.

                              The separate audio tracks of most of the pieces were also recorded on a HiMD minidisc recorder, though in fact the audio format used was ATRAC rather than linear PCM, which had been an option. This was done to give a longer running time, and also because the theoretical quality improvement of PCM over ATRAC did not seem to make a substantial difference for these recordings. I made the CD of the event months ago - it didn't take me very long, though I did have to do some conversion using that old Sony SonicStage program on a PC. One piece wasn't completely recorded on the MD recorder, so I used one of the video soundtracks for that - probably the Zoom PCM version.

                              Arguably the MD soundtrack is better than either of the video soundtracks, but it is now rather debatable whether it would be worth using it for the eventual final video. Since the video soundtracks are already tied to their respective videos, and the experience of watching a video is different, it is now undoubtedly easier to work only with the video source material.
