In a typical SIP-based VoIP infrastructure, the SRTP standard [RFC 3711] is usually employed to protect voice and video media packets.
SRTP uses the AES [FIPS 197] block cipher in counter mode to encrypt audio and video data and HMAC-SHA1 [RFC 2104] [RFC 3174] to authenticate the SRTP packets. A more modern authenticated encryption (AEAD) mode, AES-GCM, has also been standardized for use in SRTP.
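The authentication step can be illustrated with a short Python sketch. It assumes the RFC 3711 construction, in which the tag is HMAC-SHA1 computed over the authenticated portion of the packet followed by the 32-bit rollover counter (ROC) and then truncated to the leftmost 80 or 32 bits. The function name and parameters are hypothetical, and the derivation of the authentication key from the master key is not shown.

```python
import hashlib
import hmac
import struct


def srtp_auth_tag(auth_key: bytes, authenticated_portion: bytes,
                  roc: int, tag_bits: int = 80) -> bytes:
    """Return a truncated HMAC-SHA1 tag for one SRTP packet (sketch only)."""
    if tag_bits % 8 != 0 or not 0 < tag_bits <= 160:
        raise ValueError("tag_bits must be a multiple of 8 between 8 and 160")
    # The rollover counter is appended as a 32-bit big-endian integer.
    message = authenticated_portion + struct.pack("!I", roc)
    digest = hmac.new(auth_key, message, hashlib.sha1).digest()
    # Keep only the leftmost tag_bits of the 160-bit digest.
    return digest[: tag_bits // 8]


if __name__ == "__main__":
    key = bytes(20)                      # placeholder key, not a real session key
    packet = b"rtp-header-and-encrypted-payload"
    print(srtp_auth_tag(key, packet, roc=0, tag_bits=80).hex())
    print(srtp_auth_tag(key, packet, roc=0, tag_bits=32).hex())
```

Because the tag is a plain truncation, the 32-bit tag is simply a prefix of the 80-bit tag computed with the same key and packet.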
While SRTP defines how communicating parties shall transform RTP packets into corresponding protected SRTP packets, it doesn’t specify how the parties shall agree on the symmetric encryption and authentication keys. For this purpose, several key agreement protocols were defined, with SDES, ZRTP and DTLS being the most popular ones.
The SDES key agreement protocol [RFC 4568] is the simplest one. In this protocol, communicating parties include their encryption keys in session descriptions (SDP), which are then forwarded via the SIP signaling channel to the other party.
The keys are transmitted in clear text, so the SDES protocol relies entirely on a secure SIP signaling path to the other peer (e.g. SIP over TLS). Note that all the intermediate SIP proxies can still see (and leak) the encryption keys, even if SIP over TLS is used.
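To make the clear-text exposure concrete, the following Python sketch builds an SDP crypto attribute of the kind SDES defines. It assumes the AES_CM_128_HMAC_SHA1_80 suite, for which RFC 4568 concatenates a 16-byte master key and a 14-byte master salt and base64-encodes the result. The function name is hypothetical, and optional parameters such as lifetime or MKI are omitted.

```python
import base64
import secrets


def sdes_crypto_line(tag: int = 1,
                     suite: str = "AES_CM_128_HMAC_SHA1_80",
                     key_len: int = 16,
                     salt_len: int = 14) -> str:
    """Build an 'a=crypto' SDP attribute carrying a fresh master key and salt."""
    master_key = secrets.token_bytes(key_len)
    master_salt = secrets.token_bytes(salt_len)
    # Key and salt are concatenated and base64-encoded; nothing is encrypted.
    inline = base64.b64encode(master_key + master_salt).decode("ascii")
    return f"a=crypto:{tag} {suite} inline:{inline}"


if __name__ == "__main__":
    print(sdes_crypto_line())
```

Anyone who can read this line in the SDP body, including every SIP proxy on the signaling path, can recover the key material with a single base64 decode.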
The ZRTP key agreement protocol [RFC 6189], designed by Phil Zimmermann of PGP fame, doesn’t rely on the SIP signaling channel. Instead, it uses the direct media path to the other peer to execute the Diffie-Hellman key agreement protocol between the two peers.
ZRTP supports both finite field and elliptic curve variants of the Diffie-Hellman protocol. For the finite field variant, ZRTP reuses parameters from RFC 3526, namely the 3072-bit and 2048-bit MODP groups. Parameters for the elliptic curve variant are taken from RFC 5114, namely the 256-bit, 384-bit and 521-bit ECP groups (these can also be found in NSA-Suite-B).
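The finite field exchange itself can be sketched in a few lines of Python. The modulus and generator below are toy values chosen only to keep the example short; they are not one of the RFC 3526 MODP groups and are far too small for real use. The function names are hypothetical, and ZRTP's surrounding message flow (hash commitment, key derivation) is not shown.

```python
import secrets

# Toy parameters for illustration only: NOT an RFC 3526 group.
P = 2**127 - 1   # a Mersenne prime, much too small for real deployments
G = 3


def dh_keypair() -> tuple[int, int]:
    """Return (private, public) where public = G^private mod P."""
    private = secrets.randbelow(P - 3) + 2   # uniform in [2, P - 2]
    public = pow(G, private, P)
    return private, public


def dh_shared(private: int, peer_public: int) -> int:
    """Combine our private value with the peer's public value."""
    return pow(peer_public, private, P)


if __name__ == "__main__":
    a_priv, a_pub = dh_keypair()
    b_priv, b_pub = dh_keypair()
    # Both sides arrive at the same secret without ever sending it.
    print(dh_shared(a_priv, b_pub) == dh_shared(b_priv, a_pub))
```

Only the public values cross the media path; a passive eavesdropper sees `a_pub` and `b_pub` but not the shared secret.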
ZRTP also supports short authentication strings, which the peers can compare verbally at the end of the key exchange to verify that no man-in-the-middle attack took place.
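As an illustration of how such a string might be rendered, the Python sketch below follows one reading of the base32 scheme in RFC 6189: the leftmost 20 bits of a 32-bit SAS value are shown as four characters of the z-base-32 alphabet. The alphabet constant and the function name are assumptions of this sketch, and the derivation of the SAS value from the negotiated secret is not shown.

```python
# z-base-32 alphabet, as assumed by this sketch.
ZBASE32 = "ybndrfg8ejkmcpqxot1uwisza345h769"


def sas_base32(sas_value: int) -> str:
    """Render the leftmost 20 bits of a 32-bit value as 4 z-base-32 characters."""
    if not 0 <= sas_value < 2**32:
        raise ValueError("sas_value must fit in 32 bits")
    top20 = sas_value >> 12            # drop the 12 rightmost bits
    # Split the 20 bits into four 5-bit groups, most significant first.
    return "".join(ZBASE32[(top20 >> shift) & 0x1F] for shift in (15, 10, 5, 0))


if __name__ == "__main__":
    print(sas_base32(0x08000000))
```

Four characters carry only 20 bits, which is why the comparison is practical to read aloud: an attacker who must commit to a key before seeing the string has roughly a one in a million chance of matching it.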
The DTLS protocol [RFC 6347] is a datagram variant of the well-known TLS protocol [RFC 5246]. DTLS executes the TLS handshake over an unreliable datagram channel, the same kind of channel that typically carries audio and video media packets. Like ZRTP, DTLS therefore uses the direct media path to the other peer.
Note that the SRTP protocol is still used to protect the actual media packets after the DTLS handshake finishes. DTLS itself does not encrypt the media; its shared secret is used only to derive the appropriate SRTP encryption and authentication keys.
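The hand-off from DTLS to SRTP can be sketched as follows. The example assumes the layout used by the DTLS-SRTP extension (RFC 5764), where the keying material exported from the handshake is split, in order, into client key, server key, client salt and server salt. The type and function names are hypothetical, the default lengths correspond to an AES-CM 128-bit profile, and the export step itself is not shown.

```python
from typing import NamedTuple


class SrtpKeys(NamedTuple):
    client_key: bytes
    server_key: bytes
    client_salt: bytes
    server_salt: bytes


def split_keying_material(material: bytes,
                          key_len: int = 16,
                          salt_len: int = 14) -> SrtpKeys:
    """Split exported keying material into per-direction SRTP master keys and salts."""
    expected = 2 * (key_len + salt_len)
    if len(material) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(material)}")
    salts_start = 2 * key_len
    return SrtpKeys(
        client_key=material[:key_len],
        server_key=material[key_len:salts_start],
        client_salt=material[salts_start:salts_start + salt_len],
        server_salt=material[salts_start + salt_len:],
    )


if __name__ == "__main__":
    keys = split_keying_material(bytes(range(60)))   # dummy material
    print(keys.client_key.hex(), keys.server_salt.hex())
```

Each direction gets its own master key and salt, so the DTLS client encrypts outgoing media with the client pair and decrypts incoming media with the server pair.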
This section lists features of the mentioned protocols that are currently implemented and used inside Acrobits’ products.
AES-CM with 128-bit, 192-bit and 256-bit keys
HMAC-SHA1 32-bit and 80-bit authentication tags
AES-GCM with 128-bit key and 64-bit and 128-bit authentication tags
AES-GCM with 256-bit key and 128-bit authentication tag
Finite field Diffie-Hellman with 3072-bit and 2048-bit MODP groups
Elliptic curve Diffie-Hellman with 256-bit, 384-bit and 521-bit ECP groups
Base32 and base256 short authentication strings
DTLS 1.2 with the following cipher suites: