1. Introduction to OBS and applications
OBS Studio (OBS) is a popular streaming application, and it is well designed and implemented. I have added a few features to OBS and have used it in four different products/components.
The first product is a game streaming application. On top of the stock OBS features, I added game overlay support (covering DirectX 8/9/10/11 and OpenGL) and refined the render logic. Through this work I learned the internals of OBS and came to understand more about DirectX, audio encoding, and video encoding.
The second product is a live-show streaming application. Unlike the game streaming application, it does not need any game-related features; instead, it needs finer-grained camera control and post-processing of the images from the camera. I added face landmark detection before presenting the OBS-based streaming solution. Later the team asked for audio capture from other processes, which we will discuss soon.
The third product is a component of a gaming platform that offers an in-game overlay (yes, this was first added in the game streaming application; now it has become a standalone component). It provides a simple API to application developers and mirrors an application's render result into games, so application developers do not need to worry about the interprocess plumbing.
The fourth product is also a component of a gaming platform; it offers in-game screen capture among other features. Unlike all the aforementioned products, it is designed to be small and straightforward, and it was rewritten from scratch in modern C++.
Having built the applications above, I have come to know OBS fairly well, so let us talk about audio capture (API-hook based).
Microsoft introduced a brand-new audio system, the Windows Audio Session API (WASAPI), in Windows Vista to replace the legacy audio systems: waveOut, DirectSound, and MIDI. It also built another foundation, Microsoft Media Foundation, on top of WASAPI. The legacy APIs are translated into WASAPI by the system.
OBS uses WASAPI to capture audio from the microphone and other input devices, and uses the loopback device to capture system audio. Just as every sound around the microphone is captured, so is every output: all sounds played through a given device are captured together. If you are watching YouTube in a web browser or listening to a music player while an instant messaging application such as Skype or LINE is running, your recording will probably contain notification sounds. This is not the best result you could want.
Now we know the limitation of the default audio capture: it works globally, and sometimes that is not good enough. We need audio capture at the process level, which means capturing sound from a selected process, say a music player, a game, or a web browser. The OS does not provide an API for this; however, once we know how applications play audio, we know how to capture it. OBS does not implement this either, so we need to explore it ourselves. I will show you step by step.
Before we can start capturing, we have to know how audio is played.
Waveout is the oldest audio system that is still widely used on Windows. It basically works this way:
Open a device by calling waveOutOpen(…);
Call waveOutWrite(…) periodically;
Call waveOutClose(…) when finished.
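The three steps above can be sketched as follows. This is a minimal sketch with error handling omitted (link against winmm.lib); the format values are just an example:

```cpp
#include <windows.h>
#include <mmsystem.h>
#include <vector>

// Minimal waveOut playback sketch: open, write one buffer, close.
void play_pcm(std::vector<char> &pcm) {
    WAVEFORMATEX wfx{};
    wfx.wFormatTag      = WAVE_FORMAT_PCM;
    wfx.nChannels       = 2;
    wfx.nSamplesPerSec  = 44100;
    wfx.wBitsPerSample  = 16;
    wfx.nBlockAlign     = wfx.nChannels * wfx.wBitsPerSample / 8;
    wfx.nAvgBytesPerSec = wfx.nSamplesPerSec * wfx.nBlockAlign;

    HWAVEOUT hwo = nullptr;
    waveOutOpen(&hwo, WAVE_MAPPER, &wfx, 0, 0, CALLBACK_NULL);   // step 1

    WAVEHDR hdr{};
    hdr.lpData         = pcm.data();
    hdr.dwBufferLength = (DWORD)pcm.size();
    waveOutPrepareHeader(hwo, &hdr, sizeof(hdr));
    waveOutWrite(hwo, &hdr, sizeof(hdr));                        // step 2 (repeated in real code)

    Sleep(1000);                                                 // crude wait for playback
    waveOutUnprepareHeader(hwo, &hdr, sizeof(hdr));
    waveOutClose(hwo);                                           // step 3
}
```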
DirectSound is another popular audio system; it offers more control than the older waveOut API. It works this way:
Create an IDirectSound object with the API DirectSoundCreate;
Create an LPDIRECTSOUNDBUFFER object through IDirectSound::CreateSoundBuffer;
Periodically call IDirectSoundBuffer::Lock/Unlock to fill the buffer, and IDirectSoundBuffer::Play to start playback.
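These steps look like the sketch below (error handling omitted; link against dsound.lib). Note that Lock may return the buffer in two pieces because it is circular:

```cpp
#include <windows.h>
#include <dsound.h>
#include <cstring>

// Minimal DirectSound playback sketch.
void play_with_dsound(HWND hwnd, WAVEFORMATEX *wfx, const char *pcm, DWORD bytes) {
    LPDIRECTSOUND ds = nullptr;
    DirectSoundCreate(nullptr, &ds, nullptr);                 // step 1
    ds->SetCooperativeLevel(hwnd, DSSCL_PRIORITY);

    DSBUFFERDESC desc{};
    desc.dwSize        = sizeof(desc);
    desc.dwFlags       = DSBCAPS_GLOBALFOCUS;
    desc.dwBufferBytes = bytes;
    desc.lpwfxFormat   = wfx;
    LPDIRECTSOUNDBUFFER buf = nullptr;
    ds->CreateSoundBuffer(&desc, &buf, nullptr);              // step 2

    void *p1 = nullptr, *p2 = nullptr;                        // step 3: fill, then play
    DWORD n1 = 0, n2 = 0;
    buf->Lock(0, bytes, &p1, &n1, &p2, &n2, 0);
    std::memcpy(p1, pcm, n1);
    if (p2) std::memcpy(p2, pcm + n1, n2);                    // wrap-around part of the ring
    buf->Unlock(p1, n1, p2, n2);
    buf->Play(0, 0, DSBPLAY_LOOPING);
}
```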
WASAPI is the latest audio system; it offers more features to support modern audio hardware and applications. That also means it is the most complicated system we have to face. It works this way:
Get an audio endpoint for the default audio device according to the device type, or pick one through enumeration.
Activate and initialize an IAudioClient on that endpoint, then get its IAudioRenderClient service.
Periodically call IAudioRenderClient::GetBuffer/ReleaseBuffer to fill the buffer.
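In code, the endpoint-to-render-client chain looks like this minimal shared-mode sketch (COM error handling and cleanup omitted):

```cpp
#include <windows.h>
#include <mmdeviceapi.h>
#include <audioclient.h>

// Minimal WASAPI render setup: endpoint -> IAudioClient -> IAudioRenderClient.
void wasapi_render_init() {
    CoInitializeEx(nullptr, COINIT_MULTITHREADED);

    IMMDeviceEnumerator *devenum = nullptr;
    CoCreateInstance(__uuidof(MMDeviceEnumerator), nullptr, CLSCTX_ALL,
                     __uuidof(IMMDeviceEnumerator), (void **)&devenum);

    IMMDevice *device = nullptr;                                 // step 1: default render endpoint
    devenum->GetDefaultAudioEndpoint(eRender, eConsole, &device);

    IAudioClient *client = nullptr;                              // step 2: activate + initialize
    device->Activate(__uuidof(IAudioClient), CLSCTX_ALL, nullptr, (void **)&client);
    WAVEFORMATEX *fmt = nullptr;
    client->GetMixFormat(&fmt);
    client->Initialize(AUDCLNT_SHAREMODE_SHARED, 0,
                       10000000 /* 1 s, in 100-ns units */, 0, fmt, nullptr);

    IAudioRenderClient *render = nullptr;
    client->GetService(__uuidof(IAudioRenderClient), (void **)&render);

    UINT32 frames = 0;                                           // step 3: fill and release
    client->GetBufferSize(&frames);
    BYTE *data = nullptr;
    render->GetBuffer(frames, &data);
    render->ReleaseBuffer(frames, AUDCLNT_BUFFERFLAGS_SILENT);   // queue silence to start
    client->Start();
}
```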
Since there is no official API to intercept audio data flowing through the APIs above, we need to employ API hooking to do the extra work right before or after these API calls.
The concept of an API hook is simple: by placing a jump at the beginning of the original function, our tampering function gets the chance to inspect the call before passing it on to the original function, and to alter the return value when needed.
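A concrete way to see that "jump at the beginning" is the classic 5-byte x86 relative JMP patch. The sketch below only computes the patch bytes; a real hook would also save the overwritten bytes for a trampoline and make the code page writable first (e.g. with VirtualProtect):

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Build the 5-byte x86 "JMP rel32" that redirects execution from the start of
// the original function (`from`) to our hook function (`to`). The displacement
// is relative to the instruction *after* the 5-byte jump.
std::array<uint8_t, 5> make_jmp(uint32_t from, uint32_t to) {
    std::array<uint8_t, 5> patch{};
    patch[0] = 0xE9;                               // JMP rel32 opcode
    int32_t rel = (int32_t)(to - (from + 5));
    std::memcpy(&patch[1], &rel, sizeof(rel));     // little-endian displacement
    return patch;
}
```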
Because our application and the target live in different processes, and the target process may even have its own child processes, we need DLL injection to make the API hook work.
Music players like Windows Media Player, MediaMonkey, and Foobar2000 are single-process programs. Web browsers like Chrome, Firefox, and IE are multi-process. The Flash player in Chrome/Firefox is also popular, and it runs in a standalone process.
One more thing worth mentioning is cross-architecture injection, which means injection from a 32-bit program into a 64-bit program, or vice versa. A helper program is required: a 32-bit program passes the necessary parameters to a 64-bit helper program to inject into a 64-bit target, and a 64-bit program calls a 32-bit helper program to inject into a 32-bit target.
For example, the plugincontainer.exe process of Firefox is always 32-bit. Once our application has injected a 64-bit Firefox, our hook DLL is in charge of injecting into the 32-bit plugincontainer.exe. The same thing happens with IE. Chrome was the first multi-process web browser; it was more robust than the single-process IE6 of the era when Chrome was born. Media in Chrome can be played natively or through the Flash plugin.
Having covered audio playback and API hooking, it is time to start on the audio hook itself. Let us begin with audio data interception.
Audio playback can be started through any of the audio APIs: waveOut, DirectSound, or WASAPI. Each has a minimal unit: a handle in waveOut, an instance of DirectSound, or an instance of an audio client. I call this unit an audio session. An application can create one or more audio sessions, using one or more of the API sets above.
An audio session has its own attributes: sample rate, channel count, data format, volume, and more.
For the waveOut API, we need to hook waveOutOpen to learn at the earliest moment that a session has been created; its parameter "LPWAVEFORMATEX pwfx" declares the session's format. We also need to hook waveOutSetVolume to learn at the earliest moment that the volume has changed.
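A hooked waveOutOpen might look like the sketch below; `real_waveOutOpen` is the saved original entry point, and `on_session_created` is a hypothetical callback into our capture code:

```cpp
#include <windows.h>
#include <mmsystem.h>

// Saved pointer to the original waveOutOpen (filled in by the hooking engine).
static MMRESULT(WINAPI *real_waveOutOpen)(LPHWAVEOUT, UINT, LPCWAVEFORMATEX,
                                          DWORD_PTR, DWORD_PTR, DWORD);

// Hypothetical: notify our capture code about the new session and its format.
void on_session_created(HWAVEOUT session, const WAVEFORMATEX *fmt);

// Our replacement: let the call through, then report the new session.
MMRESULT WINAPI hook_waveOutOpen(LPHWAVEOUT phwo, UINT uDeviceID,
                                 LPCWAVEFORMATEX pwfx, DWORD_PTR dwCallback,
                                 DWORD_PTR dwInstance, DWORD fdwOpen) {
    MMRESULT r = real_waveOutOpen(phwo, uDeviceID, pwfx, dwCallback,
                                  dwInstance, fdwOpen);
    if (r == MMSYSERR_NOERROR && phwo && pwfx)
        on_session_created(*phwo, pwfx);   // sample rate, channels, format, ...
    return r;
}
```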
For DirectSound and WASAPI audio sessions, please refer to the source code; the hooks are similar but spread across more classes.
After we capture the audio data in the target process, we need a fast and reliable way to relay it to our main application.
ASIO is a cross-platform C++ library for network I/O programming that gives developers a consistent asynchronous model in a modern C++ style. TCP is a connection-oriented protocol, which makes it suitable for audio data transmission: establish the connection when an audio session is created, and close the connection when the audio session ends. Use multiple threads for asio::io_service and spread connections among those threads to reduce the latency of each connection.
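The thread-spreading idea reduces to a round-robin picker. In the real code each slot would be an asio::io_service running on its own thread; the class below is an illustrative stand-in for that assignment logic:

```cpp
#include <atomic>
#include <cstddef>

// Round-robin assignment of new audio-session connections to a fixed pool of
// I/O threads, so no single thread ends up serving every connection.
class connection_distributor {
public:
    explicit connection_distributor(std::size_t thread_count)
        : count_(thread_count) {}

    // Index of the I/O thread (io_service) that should own the next connection.
    std::size_t next_thread() { return next_++ % count_; }

private:
    std::size_t count_;
    std::atomic<std::size_t> next_{0};
};
```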
The concept of the root process is important. We may create multiple captures (sources, in OBS Studio terms), and each capture represents a group of processes, for example Windows Media Player or Chrome. So the TCP server for each capture is different: each listens on a different port, and the port is passed to the hook through shared memory.
The shared memory for a single-process program is simple: we can put all the info in one block and name it something like "shared_memory" plus the process id. The hook DLL can then open the shared memory by following the same naming rule. But what about a multi-process program, where all instances of the program should share the same TCP server and other settings?

For this I introduce another concept: index info, which is also stored in shared memory, but in a separate block per process. When we are going to inject a program, say Chrome, we create two blocks of shared memory: index_info and hook_info. We store the root process id in index_info; for the first process that is its own id, because it is the root process. After our hook DLL has entered the Chrome process, it is in charge of monitoring for new child processes; once a child process is created, the DLL creates a new index_info for it that contains the root process id. The new child process then opens hook_info according to the pid stored in its index_info, so the root process and all child processes use the same hook_info through their different index_info blocks.
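The naming rule can be sketched as below; the exact block names are illustrative, not the ones used in the real code:

```cpp
#include <cstdint>
#include <string>

// index_info exists once per injected process and stores the root pid.
std::string index_info_name(uint32_t pid) {
    return "index_info_" + std::to_string(pid);
}

// hook_info is keyed by the root pid, so every process in the tree (root and
// children alike) derives the same name and opens the same block.
std::string hook_info_name(uint32_t root_pid) {
    return "hook_info_" + std::to_string(root_pid);
}
```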
OBS defines input elements as sources. A source can be a logical collection of sources, or it can input something by itself, either video content or audio content. Each source has its own private data, usually state and buffers.
As mentioned before, we can have multiple sessions from different audio API instances. So I introduced a sub-source into OBS: it is a source, but it is used internally only. Applications create the audio hook capture source as usual, and that source creates sub-sources on demand, say when a TCP connection is made.
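A sub-source is registered like any other libobs source. The sketch below uses the real libobs registration API, but the id, names, and the per-session state are illustrative assumptions, not the actual implementation:

```cpp
#include <obs-module.h>

// Hypothetical per-session state held by one sub-source.
struct subsource_data { /* socket, format, ring buffer, ... */ };

static const char *subsource_name(void *) { return "audio-hook session"; }

static void *subsource_create(obs_data_t *, obs_source_t *) {
    return new subsource_data;
}

static void subsource_destroy(void *data) {
    delete static_cast<subsource_data *>(data);
}

void register_audio_hook_subsource() {
    obs_source_info info = {};
    info.id           = "audio_hook_subsource";   // illustrative id
    info.type         = OBS_SOURCE_TYPE_INPUT;
    info.output_flags = OBS_SOURCE_AUDIO;
    info.get_name     = subsource_name;
    info.create       = subsource_create;
    info.destroy      = subsource_destroy;
    obs_register_source(&info);
}

// The parent capture source creates one sub-source per TCP connection, e.g.:
//   obs_source_t *sub = obs_source_create_private("audio_hook_subsource",
//                                                 "session", nullptr);
// and feeds it captured PCM via obs_source_output_audio().
```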
Core Audio APIs on MSDN: https://msdn.microsoft.com/en-us/library/windows/desktop/dd370802(v=vs.85).aspx