- Yuki Subs Guides has a master list of guides of all you need to know about anime encoding.
- Also, go visit my MPC-HC + MadVR, mpv, and SVP 4 setup guide!
Table of Contents
- x265 Settings Guide
- Why x265?
- What tools do I use?
- Which x265 encoder? 8-bit, 10-bit or 12-bit?
- What preset should I use?
- x265 Encoding Efficiency
- Encoding Parameters
- x265’s Biggest Flaw: Grain, Micro-Banding, and Grain-Blocking
- TL;DR Summary for x265 Encode Settings
- Audio Codecs Guide
- Bonus: Ranting on Audio Gear and Recommendations
x265 Settings Guide
All my encoding parameters are tuned for Anime at 1080p. TL;DR at the end.
x265 is a library for encoding video into the High Efficiency Video Coding (HEVC / H.265) compression format. HEVC was developed as a replacement for H.264, but due to early performance and licensing issues it didn’t gain much traction, especially as a streaming format. While most devices now support HEVC in some form or another, adoption never really kicked off. Despite these drawbacks, HEVC does compress much better at higher crf (lower bitrate) and has been a good codec to use since 2018. Most modern devices support hardware decoding (iOS, Android, laptops, Macs, etc.), and even at 1080p most older consumer PCs should be able to software decode just fine.
In terms of efficiency as of right now (2019/2020), when properly encoded, HEVC could save ~25% bandwidth minimum compared to AVC (H.264, with its encoder known as x264) from my observations. HEVC is about the same quality as VP9, a codec by Google. HEVC does have a slight advantage in terms of parallel encoding efficiency, though they are both just as slow when encoding compared to x264. VP9 today is mainly used by Google (YouTube) via the webm/DASH format, which Apple refuses to support on iOS devices, which is probably why it’s not widespread. Update: Apple now somewhat supports VP9 on some devices for 4K streaming (iOS 14+).
What I’m more excited about is AV1. AV1 is a next gen codec backed by many tech giants (Apple, Google, MS, etc.) with the goal of creating a royalty free codec. IIRC some of its baseline is based on VP10, which Google scrapped and had its codebase donated to develop AV1. Unfortunately the reference encoder libaom is still in early stages of development and is rather inefficient.
SVT-AV1 is Intel and Netflix’s scalable implementation. It does not perform (quality-wise) as well as libaom, but quality is still better than x265 (SSIM), and encode times are much more reasonable. For now, they look promising, and I am excited to see results in a few years. HEVC took 3-4 years before it was more widely accepted by Anime encoders, and it took 5 years, until 2018, before I began experimenting with it. AV1 began major development in mid-2018, so by that logic we have 2-3 more years to go.
What tools do I use?
Handbrake is a great tool for beginners. It can read (unencrypted) Blu-ray discs and has a pretty UI to deal with. While it runs on an 8-bit pipeline, this isn’t an issue for Anime. Hopefully a 10-bit pipeline will be supported once HDR becomes common. (Update: 1.4.0 is still under development but should be updated to use a proper 10-bit pipeline for HDR.) If you want FDK-AAC, you will need a custom compiled version. Guides on how to do this are easily found.
Other useful stuff for dealing with Anime (research them yourself):
- MKVToolNix (MKV merge & Extract)
- Notepad++ or similar text editor
- Batch scripting (.bat files)
Which x265 encoder? 8-bit, 10-bit or 12-bit?
To keep things short and not get into technical details: use the 10-bit encoder (Main 10 profile).
10-bit produces slightly smaller files while preventing banding from 8-bit compressions (this isn’t a joke, quantization and linear algebra is a mysterious thing).
Why not just use 12-bit then, you say? Well, to put it simply: moving from 8 to 10 bit increases the available color gradient steps from 256 to 1024 to eliminate banding, but going to 4096 in 12-bit really isn’t noticeable. In fact, it is much less supported, and from my tests back in 2018 the 12-bit encoder is actually worse than the 10-bit encoder at high crf, likely due to fewer resources being put into developing it.
What preset should I use?
First, one must understand x265 is fundamentally different than x264. In x265, the slower the preset, the bigger the file size is at the same crf. While counter intuitive at first, this is due to the more complex algorithms used to more precisely estimate motion and preserve details. To put into perspective how important presets are in x265, a clip encoded at crf=16 preset=medium is actually worse in quality than crf=18 with preset=slow, while also being 50% larger in file size. One may argue SSIM isn’t the best representation as a quality metric, but the matter of fact is that it’s an objective measurement readily available. Subjective tests I’ve done (using Kimi no Na Wa & NGNL Zero) also agree with the above statement. The move from
slow each reduces bad noise and artifacts, though I do have to admit going any slower I could not observe any significant improvements (x265 3.1 has a revised veryslow preset that changes this).
Now to answer the questions which preset to use. In my mind, there are only 3 presets worth using:
The fast preset, whose quality is pretty “eh”, is the first “sweet spot” from the list and does serve a purpose for people with really weak systems. It is barely any slower than the others while offering slightly better quality. However, it is rather lackluster in my opinion and should never be used to encode anime, especially those with dark/complex scenes.
slow is the second sweet spot as you could see from the graph below, and should be the preset most people should use. Yes, I know it isn’t the fastest at encoding videos (typically ~10fps for a modern 6-core desktop processor @1080p), but it does offer superior quality especially with fast moving objects in dark scene.
Normally I wouldn’t recommend veryslow. It does, however, have its place, especially with the recent changes in v3.1. The veryslow preset is very useful at higher crf values (22+), with much better motion estimation at low bitrates. The only downside to veryslow is its encoding speed, which is a gazillion times slower than slow and requires a supercomputer. At lower crf values you barely get any improvements and thus could be ignored.
Presets vs. SSIM at 4K (source):
Include 1% low & Encode time:
Presets vs. file bitrate at 4K
Bonus conversation: Due to x264 being much less complex, presets can pretty much maintain only a slight loss in quality as you encode faster. This gives an illusion that slower presets are slower due to spending more time compressing. The more correct way to see this is that slower presets are slower due to doing more motion calculations and finding the best scheme that best describes the frame, which in x264 just so happens benefits compression too. However, x265 is waaay more complex with motion algorithms, meaning that accurately describing motion actually increases bitrate.
x265 Encoding Efficiency
For mainstream systems, just let x265 handle it automatically. For more advanced encoders that may have beefier CPUs, or even servers, this section may interest you.
x265 heavily favors real cores over threads, so keep that in mind if you use programs like Process Lasso. Theoretically, a 1080p encode with a default CTU size of 64 has an encode parallelization cap of 1080/64 = 16.875 threads. Beyond that x265 will not scale linearly. You could lower CTU size to 32, but you lose some compression efficiency (~1-5% depending on source complexity, personal tests show ~1-2% for anime @ crf=19), especially with anime where CTU of 64 actually does benefit.
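The row-count arithmetic above is easy to reproduce for other resolutions and CTU sizes. Below is a minimal sketch, assuming the cap is simply the number of CTU rows in a frame (a partial row at the bottom still counts as a row, hence the rounding up). The function name `ctu_rows` is made up for illustration and is not part of x265.

```python
import math


def ctu_rows(frame_height: int, ctu_size: int) -> int:
    """Number of CTU rows in one frame.

    Used here as a rough upper bound on how many rows can be
    worked on in parallel within a single frame (assumption).
    """
    return math.ceil(frame_height / ctu_size)


if __name__ == "__main__":
    for height in (1080, 2160):
        for ctu in (64, 32):
            exact = height / ctu
            print(f"{height}p, CTU {ctu}: {exact:.3f} -> {ctu_rows(height, ctu)} rows")
```

Real-world scaling will land below these numbers because of the row dependencies described in this section, so treat the result as a ceiling and not as a target thread count.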
For those interested, x265 heavily utilizes AVX2 instructions, which runs on 256-bit FPU for optimal speed. Keep this in mind when researching CPU choices.
The more threads you add into a pool, the more encode overhead you will experience, since every row of encode requires the upper right CTU block to complete before it can proceed. When the upper right block is more complex and slows down, the next row has to wait. More rows = more possibility of waiting to happen. This translates to about 30-50% efficiency when encoding with additional threads, with the value gradually lowering the more threads you have. Learn more about frame threading.
According to someone’s 128 core Azure VM test on a 4K footage w/ preset=veryslow, encode scales pretty much linearly to 32 cores, and a bit dodgy at 64 cores. If we translate this to 1080p in theory scaling should be good up to 16 cores, which agrees with our earlier theoretical calculation of 16.8 threads.
Earlier we determined the theoretical limit of 1080p is about 16 threads. While at first glance an 8 core CPU should be the limit, remember that x265 favors real cores over threads. This means on a 16 core CPU, each encoding thread gets a real core to run on, not to mention there are also other processes in the encode chain that could use the extra threads. Due to x265 encode threads being terrible at sharing a core you still get good efficiency. Realistically speaking, if you really want to peg your 16 core CPU at 100%, I would run 2 instances of 16 thread encodes (you gain maybe ~5-10% efficiency), or lower the CTU to 32 to increase the theoretical limit to 33 threads. You can control thread count with the
Those with CPUs that have multiple NUMA domains: look at the --pools option to set x265 to run on a single NUMA node, then run as many instances as you have NUMA nodes, with each instance pinned to its own node.
Encoding Parameters
Now that we established we should always use preset=slow, let us look at parameters that you may want to use/override to improve quality. For test clips, I recommend NGNL Zero and Tensei Shitara Slime episode 1 as they represent pretty much the worst case scenario for encoding anime (lots of dark scenes, fiery effects, glow effects, floating particles, etc.).
QP, crf and qcomp
Note: x265 also has ABR (average bitrate) and 2-pass ABR encoding mode which I won’t get into. As a quick summary: never use ABR, and only use 2-pass ABR if you absolutely must have a predictable output file size. 2-pass ABR will be identical to crf assuming the result file size are identical and no advanced modifications are made to crf (e.g.
Before we begin, we need to understand the basics of how x265 works. QP, a.k.a. quantization parameter, controls the quantization of each macroblock in a frame. In QP encoding mode (qp=<0..51>), QP is constant throughout, and each macroblock is quantized (compressed) to the set QP target. I do NOT recommend using QP encoding mode, and I will explain why in a bit.
Keyword: Quantization – lossy compression achieved by compressing a range of values to a single value. Higher quantization (QP) = more compression.
crf (crf=<0..51>), known as constant rate factor, encodes the video to a set “visual” quality. Keyword: Visual. The major difference between crf and QP is that QP encoding mode has a CQP (constant quantization parameter), whereas crf uses the QP as a baseline but varies it based on the quality perceived by human eyes. Essentially, crf can more smartly distribute bitrate to where it visually matters, as opposed to QP encoding mode, which quantizes (compresses) constantly (constant in terms of math, not to the eyes). For example, crf will increase QP in motion scenes, since motion masks imperfections, and decrease QP in static scenes, where our eyes are more sensitive. Additionally, in crf encoding mode, QP can be further manipulated with AQ and PSY options (discussed later).
The obvious downside to QP encoding mode, and especially crf is that it is almost impossible to determine the output size, particularly when modifying options that manipulate QP may result in drastic file size differences. However, as opposed to 2-pass ABR, crf guarantees that no matter which episodes you encode, they will all be at the same visual quality across.
One should always encode their own test clips and determine what crf they prefer and can accept in terms of size vs. quality loss. However, as a general guide (personal opinion):
- I have a small laptop / I watch on TV / 21″ monitor: crf=20-23
- I have a 24-27″ monitor: crf=18-21 (crf=18 is my lowest recommended value for Anime, note that going below crf=18 may increase file size quite rapidly.)
- I have a 4K 27″ monitor/TV in my face and I want minimal artifacts: crf=16-18
- I have a 4K 27″ monitor/TV in my face and I determine video quality by pausing the video and using a magnifying glass (lol): crf=14
Jokes aside, video quality should be assessed by watching, and not by pausing the video. If you can’t see the flaw without pausing the video, is it really a flaw after all? 4K requires a lower crf (higher quality) because upscaling often amplifies artifacts, and high quality upscalers (especially ones like FSRCNN) often benefit from a higher quality source.
The variability of crf can be manipulated with qcomp (quantizer curve compression factor;
qcomp=<0..1>), but I recommend leaving it alone at default
qcomp isn’t as important of a variable as it was in x264 (since x264’s aq-mode=2 & 3 are pretty much broken). In terms of crf encoding mode, high qcomp leads to more aggressive QP reduction (higher bitrate) for complex scene. aq settings also affect qcomp somewhat. crf in x265 is somewhat a pain to tune and confusing to beginners due to tons of settings being intertwined controlling bitrate and quantization.
However, if you are encoding a source where it is mainly either simple scene or complicated motion, you can try increasing qcomp. Somewhere around qcomp=0.8 should be sufficient for even the most extreme cases. One interesting strategy to use for such source is to increase crf by 1 and use qcomp=0.8. This results in similar file size, but complex motion scene essentially gets allocated more bitrate than static portions. Beware of doing this to “average anime” as combining this with other quantization options (AQ, psy) incorrectly may lead to bitrate starving normal scene.
Now before I get to other parameters, note that it is possible to tune a crf=20 video to be better than one at crf=18. Lowering crf isn’t the be-all and end-all solution to everything, so make sure to read the following sections and do encode tests yourself.
Allowed values are
<0..16>. For anime just use
8 has minor savings (~3-5%) over default
4 with a small encode time penalty (~5%),
16 is pretty much only useful for static images (BD Menu) as the encode penalty isn’t worth the saved space (1.6% smaller than
crf=20 episode (~200-400MB/ep), expect about a few MB savings per anime episode going from
8. While at first glance it isn’t worth it, on the higher end (static-ish low-fps slice of life) you might be able to save 15-30 MB per episode (~5-10% savings on the very extreme end).
If you don’t want to use
6 should be the sweet spot since around
5-6 is where consecutive b-frames drop to single digit percentages for most anime.
You can also encode an episode with bframes=16 and look at the encode log to optimize for your content.
Example encode log with bframes=16, values represent consecutive b-frames percent from 0-16 (notice the sudden drop at 6 b-frames and another drop to decimals at ~8+ b-frames):
x265 [info]: consecutive B-frames: 18.8% 10.2% 19.2% 12.4% 8.3% 15.9% 4.7% 3.1% 2.9% 0.8% 1.2% 1.7% 0.3% 0.2% 0.1% 0.1% 0.2%
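Reading that log line by eye works, but the same judgement can be scripted. The sketch below parses the line shown above and reports the smallest b-frame count whose cumulative share reaches a chosen coverage. The 95% coverage threshold and the function names are assumptions for illustration, not a rule from x265.

```python
import re

LOG_LINE = (
    "x265 [info]: consecutive B-frames: 18.8% 10.2% 19.2% 12.4% 8.3% 15.9% "
    "4.7% 3.1% 2.9% 0.8% 1.2% 1.7% 0.3% 0.2% 0.1% 0.1% 0.2%"
)


def bframe_shares(log_line):
    """Extract the percentages; index n is the share of runs with n b-frames."""
    _, _, tail = log_line.partition("consecutive B-frames:")
    return [float(x) for x in re.findall(r"(\d+(?:\.\d+)?)%", tail)]


def smallest_bframes(shares, coverage=95.0):
    """Smallest n such that runs of 0..n b-frames add up to `coverage` percent."""
    running = 0.0
    for n, share in enumerate(shares):
        running += share
        if running >= coverage:
            return n
    return len(shares) - 1


if __name__ == "__main__":
    shares = bframe_shares(LOG_LINE)
    for target in (90.0, 95.0, 99.0):
        print(f"{target:.0f}% coverage -> bframes={smallest_bframes(shares, target)}")
```

Pick the coverage to taste: a lower threshold favors encode speed, a higher one squeezes out the last bit of file size on static content.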
bframe=n % file size of
bframe=0 (Rokudenashi ep1
According to legend the more the better. Consensus is that optimally you should use
ref=6 for 10-bit anime encodes. x265 allows values of
8 is the “true maximum” x265 can currently use, and going any higher doesn’t actually improve quality.
A study from late 2018 showed that going from ref=1 > ref=5 > ref=10 > ref=16 improved quality by 0 > 0.04 > 0.13 > 0.13 PSNR @720p with 100% > 135% > 179% > 223% encode time penalty. Note that 16 changed nothing due to true max capped at 8. For reference, we measure the difference of presets at the 0.x magnitude for PSNR.
Optimally you should use ref=6, although I personally stick with the default ref=4. Note that ref=6 is the max you can go if you enabled b-frames and --b-pyramid. Newer versions of x265 also block non-conforming values.
Loop Filters: sao, limit-sao, no-sao and Deblock
SAO is the Sample Adaptive Offset loop filter. SAO tends to lose sharpness on tiny details, but improves visual quality by smoothening/blending to prevent artifacts from forming. I would leave this on for crf>=20. At crf=18 it depends on personal taste and the anime; most of the time I set it to limit-sao. I would not turn off sao with crf any higher than crf=16 unless you are trying to preserve extremely fine grain/detail.
deblock specifies deblocking strength offsets. I tend to leave it at default deblock=1:1. If you want to preserve more grain/detail, you may set it to -1:-1. Please note in FFmpeg based programs you will need to type deblock=0,0 to pass the values, as : is a parameter separator.
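The separator issue is easy to trip over when scripting encodes, so here is a minimal sketch of a helper that joins options into a single FFmpeg `-x265-params` string and swaps any `:` inside a value for `,`. The helper names, file names and option values are placeholders for illustration; only the FFmpeg flags themselves (`-c:v libx265`, `-preset`, `-crf`, `-x265-params`, `-c:a copy`) are real.

```python
def to_ffmpeg_x265_params(options):
    """Join key/value pairs with ':' and replace ':' inside values with ','."""
    parts = []
    for key, value in options.items():
        safe_value = str(value).replace(":", ",")
        parts.append(f"{key}={safe_value}")
    return ":".join(parts)


def build_command(src, dst, crf, options):
    """Return an argument list suitable for subprocess.run (not executed here)."""
    return [
        "ffmpeg", "-i", src,
        "-c:v", "libx265", "-preset", "slow", "-crf", str(crf),
        "-x265-params", to_ffmpeg_x265_params(options),
        "-c:a", "copy",
        dst,
    ]


if __name__ == "__main__":
    # Example values only; tune them per the sections of this guide.
    opts = {"limit-sao": 1, "deblock": "-1:-1", "aq-mode": 3}
    print(" ".join(build_command("input.mkv", "output.mkv", 19, opts)))
```

Building the argument list in code keeps the x265 notation (`deblock=-1:-1`) in your notes while the command that actually runs gets the comma form.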
Psycho-Visual Options: psy-rd and psy-rdoq
Traditionally, the encoder tends to favor blurred reconstructed blocks over blocks which have wrong motion. The human eye generally prefers the wrong motion over the blur. Psycho-visual options combat this. While technically less “correct”, which is why they are disabled for research purposes, they should be enabled for content intended for “human eyes”.
psy-rd will add an extra cost to reconstructed blocks which do not match the visual energy of the source block. In layman’s terms, it throws in extra bits to blocks in a frame that are more complex. Higher strength = favor energy over blur & more aggressively ignore rate distortion. Too high will introduce visual artifacts and increase bitrate & quantization drastically.
psy-rdoq will adjust the distortion cost used in rate-distortion optimized quantization (RDO quant). Higher strength will also prevent psy-rd from blurring frames.
If you didn’t understand any of that, don’t worry. Basically, these 2 options are crucial to QP manipulation and grain/detail preservation. psy-rd will decide the tendency to add extra cost (bitrate) to match source visual energy (i.e. grain, etc.) and psy-rdoq will control the extent of this extra cost. Too low and details will be blurred to improve compression (the reason why people hated x265 in the early days), too high and you create artificial noise and artifacts.
For anime, use psy-rd=1. On anime with some grain/snow/particles, or lots of detailed dark scene (often anime movies), set psy-rd=1.5 (e.g. Kimetsu no Yaiba). If grain is a main feature, or the whole series is dark, well mastered with details you may use psy-rd=2 (e.g. NGNL Zero has lots of fallout dust and complex details throughout the whole movie).
psy-rdoq is the key to preserve grain (and also quite aggressively lowers QP). Keep in mind the --tune grain x265 built-in actually has too high of a value for slower presets, as it actually artificially creates even more grain. For anime, I would leave it at default psy-rdoq=1. With some grain/CRT TV effects, I would set it to 3 depending on how strong the effects are. For anime where grain effects are staple throughout, or to eliminate blocking in complex fast motion scene at lower crf (<16) you may increase the value to 5. Additionally, some anime use grain to prevent blocking/banding and may also need a higher value to prevent micro-banding. Note that these values apply to preset=slow. A higher value may be needed for faster presets.
Keep in mind both these options drastically increase file size, but also improve visual quality. On lower bitrate encodes, having too high of psy-rd may starve bitrate from flat blocks, and too high psy-rdoq may also create artifacts.
psy-rdoq=1 for most of the anime out there. I sometimes use psy-rd=1.5 and rarely ever go to 2. I rarely use psy-rdoq >2 due to how much bitrate it increases. (To me, it’s not worth increasing the value to make that 10 second grainy scene look better. I only increase it when the whole show/movie has grains/grain-like objects).
ipratio may also improve grain retention (more “real” frames over b/p frames), although I do not recommend touching them.
Adaptive Quantization Options: aq-mode and aq-strength
aq-mode sets the Adaptive Quantization operating mode. Raise or lower per-block quantization based on complexity analysis of the source image. The more complex the block, the more quantization is used. This offsets the tendency of the encoder to spend too many bits on complex areas and not enough in flat areas.
As this is beneficial for anime, you pretty much want this enabled. As for the modes:
- 0: disabled
- 1: AQ enabled
- 2: AQ enabled with auto-variance (default)
- 3: AQ enabled with auto-variance and bias to dark scenes. This is recommended for 8-bit encodes or low-bitrate 10-bit encodes, to prevent color banding/blocking.
- 4: AQ enabled with auto-variance and edge information.
I highly recommend, in fact I think it pretty much is a must to use aq-mode=3 for anime. It raises bitrate in dark scene to prevent banding. Seasoned encoders will know dark scene with colorful glowing effects (i.e. fire) and dark walls with subtle colors are most prone to banding, blocking, and color artifacts. Setting aq-mode=3 is so beneficial to anime that a crf=20 encode with it looks better than a crf=18 encode without while having similar file size.
aq-strength is the strength of the adaptive quantization offsets. Default is 1 (no offset). Higher = tendency to spend more bits on flatter areas, vice versa. Setting <1 in crf mode decreases overall file bitrate and reduce spending bitrate on plain areas (but potentially introduce blocking/banding in higher crf). For anime, anywhere from
aq-strength=1 is acceptable depending on the show. I tend to leave it at default unless I feel that
aq-mode is spending too many bits. Setting this high sometimes helps with grain preservation, but it is very expensive bitrate wise and may cause halo artifacts.
hevc-aq are experimental features that are still broken but should be interesting to use in the future (unless AV1 beats it to the punch). From my tests
hevc-aq lowers overall VMAF slightly but increases 1% low.
Prevents bilinear interpolation of 32×32 blocks. Prevents blur but may introduce bad blocking at higher crf. I do not use this option on my encodes. For non-anime stuff, this option may help preserve small details from blurring such as hair. I recommend not using this option unless you have no-sao and low crf as sao has a bigger impact on blur.
Disable analysis of rectangular motion partitions. rect is enabled for presets lower than slow. Enabling rect may help improve blocking in challenging scenes. For preset=slow, disabling saves ~25% encode time at the cost of 1-3% compression efficiency.
I recommend not touching
rect as this is the main difference between preset
slow, unless you really want to save the encode time. Do note that your video quality will decrease ever so slightly.
I advocate always encoding directly from the Blu-ray disc, as you avoid re-encoding. Re-encoding (or to be more technical: generation loss) is very destructive for video quality, even more so than re-encoding audio.
However, not everyone can afford Blu-rays and rip them manually. If you are forced to re-encode (i.e. you got your video files from cough), ensure you have the highest quality encode possible and enable constrained-intra to prevent propagation of reference errors. For re-encodes I would not go below crf=20 as any lower simply isn’t worth it.
Keep in mind this isn’t a magic parameter to remove artifacting from re-encodes. Edges, especially edges close to each other (e.g. hair) tend to have jpeg-like artifacts in between. Dark scene also suffer (since they’re usually the most artifact prone in an anime encode). Any artifacts from the source will also likely be amplified.
Note: Very rarely, but happened once to me, constrained-intra might cause encoding errors that look similar to missing p-frame data.
Number of concurrently encoded frames. Set
frame-threads=1 for theoretical best quality and ever so slightly better compression. If you have <=4 core CPU you may consider this option. High core count systems will suffer greatly (encode speed) if set to 1. I found really no quality loss setting it to >1. As for the “max” to set before losing quality, from my test setting it from 2 to 16 yielded identical results to each other contrary to claims that setting it >3 hurts quality. Basically just let your system handle this value unless you really want to encode with
x265’s Biggest Flaw: Grain, Micro-Banding, and Grain-Blocking
As we discussed, x265 has a tendency to blur/smoothen to save bitrate. While this can be mitigated somewhat with psycho-visual options, encoders should be aware of what I call the “Micro-Banding” and “Grain-Blocking” phenomena. As we know, banding in x265 mostly occurs in dark scenes. To combat this, many studios are starting to inject dynamic grain in AVC 8-bit BD encodes (increasingly prevalent post 2018/2019). While extremely effective thanks to BDs having very high bitrates, this is actually detrimental to higher crf x265 encodes. On a smoky/fiery grainy scene, x265 tends to smoothen out each block, creating weird “patches” of regional grainy blocks (I call this “grain-blocking”, though do note it isn’t “blocking” per se and more of a “regional encode-block” color difference that still has a smooth gradient across). For example, the intro scene in ep1 of Tensei Slime exhibits this problem with smoke and fire effects.
x265 also introduces micro-banding when smoothening out bright objects with glow effects in the dark, and it is even more noticeable (slightly wider) when the source has dynamic grain. These micro-bands aren’t conventional banding, but extremely thin bands that only appear in dark scene objects with strong color gradient change (e.g. edges of fire where color rapidly changes from white to red to yellow glowing then to dark grey in a short distance, or a glowing katana swinging sword effect). Micro-banding is a bigger pain, as even stronger deband filters cannot smoothen it out during post-processing (i.e. when watching), and unlike grain-blocking, which tuning psy values can easily fix, micro-banding requires a much lower crf value on top of that to suppress. Luckily, micro-banding is much rarer in encodes, and those few seconds in a movie are unlikely to harm the viewing experience.
There is no simple answer to fix these 2 problems due to crf targets. Traditionally in x264 such scenes will simply end up with blocking artifacts. x265 chooses to eliminate artifacts at the cost of detail loss. The downside is that even at lower crf targets it is tough to eliminate x265’s tendency to blur. To truly eliminate such effects, you will first need no-sao:no-strong-intra-smoothing:deblock=-1,-1 to make x265 behave more like x264, then raise
psy-rdoq accordingly (
5 respectively should do the trick). However, this reintroduces unpleasant artifacts x265 aimed to eliminate in the first place, thus I do not recommend such encode parameters unless encoding crf <16 (in which case file sizes are so big just use x264, why even bother x265?).
Image set to greyscale, with color/gamma corrections to amplify the artifacts. Micro-banding is self-explanatory. For grain-blocking, you can see each block still retains the grains, but the average color for each block is different creating a “regional” grain-block effect.
Unfortunately I am not familiar with AVS and VPY scripting and cannot give advice on how to do it. However, x265 benefits greatly from filtering and can avoid its flaws with proper filters, such as proper denoise with masking and custom deband shaders tailored to different encodes. x265 also has a tendency to magnify aliasing so AA scripts should benefit encodes too.
TL;DR Summary for x265 Encode Settings
preset=slow. Then choose one of the following to override the default parameters. These are my recommended settings, feel free to tune them.
- 1 Setting to rule them all:
- Flat, slow anime (slice of life, everything is well lit):
- Some dark scene, some battle scene (shonen, historical, etc.):
- crf=18-19 (motion + fancy FX),
- crf=20 (non-complex, motion only alternative),
- crf=18-19 (motion + fancy FX),
- Movie-tier dark scene, complex grain/detail:
- I have infinite storage, a supercomputer, and I want details:
Side note: If you want x265 to behave similarly to x264, use these: no-sao:no-strong-intra-smoothing:deblock=-1,-1. Your result video will be very similar to x264, including all its flaws (blocking behavior, etc.).
Audio Codecs Guide
Why you should never use FLAC
Note: If your source isn’t FLAC/WAV/PCM, always passthrough (-c:a copy) to prevent generation loss.
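For batch scripts, the passthrough rule can be written as a small decision helper. This is a minimal sketch, assuming the source codec name is given in the form FFmpeg tools report it (e.g. `flac`, `pcm_s16le`, `ac3`); the function name, the choice of FFmpeg's built-in `aac` encoder and the `128k` bitrate are placeholders for illustration, not recommendations from this guide.

```python
def audio_args(source_codec, lossy_encoder="aac", bitrate="128k"):
    """Pick FFmpeg audio arguments based on the source codec.

    Lossless sources (FLAC, PCM/WAV) get encoded once to a lossy codec;
    anything else is copied untouched to avoid generation loss.
    """
    is_lossless = source_codec == "flac" or source_codec.startswith("pcm_")
    if is_lossless:
        return ["-c:a", lossy_encoder, "-b:a", bitrate]
    return ["-c:a", "copy"]


if __name__ == "__main__":
    for codec in ("flac", "pcm_s24le", "ac3", "aac"):
        print(codec, "->", " ".join(audio_args(codec)))
```

The returned list can be appended to whatever video arguments you already pass to FFmpeg.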
FLAC, to put it simply, is very inefficient use of data. A typical anime (23-24min) episode will have a FLAC audio size of 250MB. Now compare to an AAC track of merely 20-30MB with basically no quality loss. And then there’s the issue of anime audio track being mastered well in the first place…
The “preserving audio quality” argument has always been the dumbest argument I’ve ever seen. I feel that this is partly due to the “audiophile” community blowing this issue out of proportion. Just yesterday I was on cough and saw a 24 episode cough that is 12GB large. The encoder argues that at this size the video had “barely any quality loss”. It was a re-encode at crf=22.5, with zero encoder tuning and preset at fast. Needless to say, it was blocking and artifacts galore. The icing on the cake? The audio was FLAC “to preserve audio quality”… I’m pretty sure there are more people in this world with 1080p screens than high-end headphones lol. The biggest mistake really wasn’t encoding with crf=22.5. Imo, it was the really imbalanced release. By simply using AAC you can allocate an extra 4GB towards video quality, with 99.9% of people not noticing any audio quality loss.
Note: The only exceptions to using lossless codecs are:
- You are in production using lossless codecs to prevent generation loss.
- Providing remux/RAWs for cough.
Being an audio enthusiast myself, I am a firm believer of double blind ABX testing, and I encourage people to find the lowest bitrate they can go before being able to discern the difference with this method.
I later provide audio samples (see Encoders Comparison section) from various encoders for people to listen to. A great ABX tool I found is the ABX plugin for foobar2000 with replaygain enabled.
AAC or OPUS
Ideally, one should use Opus. It’s an open codec, and outperforms even the best AAC encoders at very low bitrates. CELT/SILK encoding mode switches on-the-fly with voice activity detection (VAD) and speech/music classification using a recurrent neural network (RNN), ensuring the best encoding method depending on content.
It also performs exceptionally well with surround sound. Say we want to target an equivalent quality of a 128kbps stereo track (64kbps x 2 channels) on a 5.1 channel setup. Conventional codecs such as AAC simply use 64kbps x 6 channels = 384kbps (usually slightly less due to LFE & C channel being lows/vocals only).
OPUS on the other hand can achieve similar quality with a much lower bitrate, the recommended formula being (# stereo pairs) x (target stereo bitrate). In this case with a conventional 5.1 setup (L & R, C, LFE, BL & BR), we have (2 stereo pairs) x 128kbps = 256kbps. You may need to use the conventional formula if your stereo pairs aren’t “stereo” pairs, i.e. the audio is significantly different.
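The two formulas are simple enough to put side by side. The sketch below assumes a 128kbps stereo target as in the example above; the function names are made up for illustration.

```python
def per_channel_bitrate(channels, stereo_kbps=128):
    """Conventional approach: every channel gets half the stereo target."""
    return channels * stereo_kbps // 2


def stereo_pair_bitrate(stereo_pairs, stereo_kbps=128):
    """Opus rule of thumb: (# stereo pairs) x (target stereo bitrate)."""
    return stereo_pairs * stereo_kbps


if __name__ == "__main__":
    conventional = per_channel_bitrate(6)   # 5.1 = 6 channels
    opus = stereo_pair_bitrate(2)           # L & R, BL & BR
    print(f"per-channel formula: {conventional} kbps")
    print(f"stereo-pair formula: {opus} kbps")
    print(f"difference: {conventional - opus} kbps")
```

For 7.1 you would count three stereo pairs in the second formula, subject to the same caveat about pairs that aren’t really “stereo”.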
This is because OPUS uses surround masking and takes advantage of cross-channel masking techniques to smartly distribute bitrate. Think of HE-AACv2 but better and optimized for higher bitrate + surround setup. Instead of distributing 64kbps per channel, more bitrate goes to the stereo pairs and less to the center/LFE channels where only vocals/bass exist. Next, by utilizing joint encoding (intensity stereo) and other techniques it “increases” the “bitrate” per channel. Obviously this is an oversimplification and the underlying technology is way more complex, but you get the idea.
Sounds cool, right? Why not just always use this sci-fi level magical codec then? Well you see, we already do, but in the form of commercial implementation (FaceTime audio, Discord, Skype, etc.). Opus is not widely supported in many container formats (only in .mkv, .webm, .opus (.ogg), .caf (CoreAudio format)), especially in the video world, essentially limiting users to those using 3rd party players. This leaves us…
AAC. An old codec developed to kill mp3, which still exists for some reason. AAC is a widely supported codec, just like the mp3 it was meant to replace. If a device can play music, 9 times out of 10 it supports AAC. Quality is about the same as Opus at higher bitrates. There’s also HE-AAC, which is used at low bitrates (~64kbps) with spectral band replication (SBR), and HE-AAC v2 with Parametric Stereo (PS), which is used at even lower bitrates (~48kbps).
I personally think that you shouldn’t use HE-AAC, especially HE-AAC v2, as the artifacts are pretty obvious on a decent studio monitor/headphone. If such a low bitrate is needed, Opus is much better anyway.
Now that we’ve established AAC is the way to go for most users, on to the bad news: unlike Opus with 1 definitive official open source encoder, there are many encoders developed by many corporations for AAC, and some aren’t “free”.
Just like licensing issues plaguing HEVC, the same can be seen in the AAC world. This means without some computer knowledge, it’s pretty hard to get your hands on good AAC encoders.
Now to introduce the 4 most prominent AAC encoders:
The first is qaac, a tool utilizing Apple’s CoreAudio toolkit to encode Apple AAC. Mac users: you have direct access to the CoreAudio library and do not need any special tools. Apple AAC is known to be the best consumer-available AAC encoder.
Second place is Fraunhofer FDK AAC, developed by Fraunhofer IIS, and included as part of Android. While open-source and once used in FFmpeg, due to licensing issues it has pretty much disappeared from any builds. To get it back, you need to compile FFmpeg yourself and enable the non-free flag. Once compiled, the build cannot be shared or distributed.
(There’s also the FhG AAC encoder from Fraunhofer bundled in WinAmp but it’s another complicated topic for another day. Beginners to using AAC, just pretend it doesn’t exist.)
Third place is Nero AAC, once pretty prominent in the AAC world as it was provided by Nero themselves in their software. It is currently outdated and should not be used.
Last is the FFmpeg 3.0 encoder. The AAC encoder in FFmpeg used to be trash; even at 128kbps it was hissing all over. The new improved encoder has “eh” quality and can be widely used with 1 caveat: only CBR is considered ready. VBR is still experimental (although my tests show that it is not any worse than CBR, a.k.a. they’re both “eh”). I do not recommend anything less than 256kbps (as you will see later).
My MEGA page contains all the audio file samples (backup link). Feel free to download and listen/ABX them. I chose the Z*lda theme song orchestra due to its challenging nature: brass instruments and cymbals. If anything will show a flaw at low bitrates, it’s going to be the trumpet. Basically, if a certain encode setting can handle this track, it can handle everything.
|Encoder||Size (Bytes)||Bitrate (kbps)||Peak Bitrate (exclude FDK initial)||FDK Burst|
|FDK VBR Q1 HEv2||1341425||31||40||41|
|FDK VBR Q2 HE||2973163||68||101||142|
|FDK VBR Q2||4051211||92||134||201|
|FDK VBR Q3||4681758||107||144||208|
|FDK VBR Q4||5812754||132||179||219|
|FDK VBR Q5||9925912||226||302||351|
|OPUS VBR 48kbps||2131368||49||81|
|OPUS VBR 64kbps||2841338||65||102|
|OPUS CVBR 64kbps||2792245||64||66|
|OPUS VBR 96kbps||4278459||97||131|
|OPUS VBR 128kbps||5698682||131||167|
|OPUS VBR 192kbps||8409104||194||234|
|qaac CVBR 40kbps||1794573||41||54|
|qaac CVBR 64kbps||2847934||65||86|
|qaac VBR Q27||3081110||70||91|
|qaac VBR Q64||5760371||132||154|
|qaac VBR Q91||8882577||205||227|
|qaac VBR Q109||11767124||272||296|
|qaac CVBR 256kbps (iTunes)||11852276||274||310|
|FFMPEG VBR Q0.5||2701485||61||75|
|FFMPEG VBR Q1.0||4527582||104||133|
|FFMPEG VBR Q1.5||6049106||139||185|
|FFMPEG VBR Q2.0||8727091||202||249|
FDK AAC is probably the most accessible AAC encoder. It is built into FFmpeg and HandBrake (both require manual non-free compiling; check out guides on the official site / videohelp / reddit for how to compile HandBrake with FDK AAC, and for FFmpeg use media-autobuild-suite). It being only ever so slightly behind qaac makes it the perfect encoder for ripping Blu-Rays without demuxing/remuxing.
Keep in mind that every setting other than Q5 applies a low-pass filter, with the cutoff getting progressively lower as the bitrate drops. I recommend using VBR Q5 due to its pretty high bitrate (~100kbps/channel) and to avoid the low-pass filter.
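For reference, here is roughly what an FDK AAC encode looks like when driven from a script. This is a sketch, assuming an FFmpeg build that was compiled with libfdk_aac (stock builds will reject the encoder); the file names are placeholders, and `-vbr` is the quality option exposed by FFmpeg's libfdk_aac wrapper.

```python
import shlex


def fdk_aac_command(src: str, dst: str, vbr_quality: int = 5) -> list[str]:
    """Build an ffmpeg call that copies the video and re-encodes audio with FDK AAC."""
    if not 1 <= vbr_quality <= 5:
        raise ValueError("FDK AAC VBR quality must be between 1 and 5")
    return [
        "ffmpeg", "-i", src,
        "-c:v", "copy",            # leave the video stream untouched
        "-c:a", "libfdk_aac",      # needs a non-free FFmpeg build
        "-vbr", str(vbr_quality),  # VBR quality mode, 5 = highest
        dst,
    ]


if __name__ == "__main__":
    # Placeholder file names; print the command instead of running it.
    print(shlex.join(fdk_aac_command("episode01.mkv", "episode01_aac.mkv")))
```

The command is only printed here. Pass the list to `subprocess.run` once you have confirmed your build actually lists `libfdk_aac` under `ffmpeg -encoders`.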
An interesting thing is that FDK tends not to stick to its VBR quality goal (i.e. VBR Q2 has a theoretical ~64kbps goal, but the resulting file is 92kbps) and allocates more bits when it knows Q2 is simply too low for a specific file. This really isn’t an issue as on average FDK Q2 is indeed ~64kbps, just more of an FYI that FDK WILL throw more bits if a complex track needs it.
One “quirk” of FDK AAC is that it tends to allocate bitrate at the beginning for music (this is not a problem for anime as shown later) when it detects a complex beginning. This is why I do not recommend FDK AAC for music files as it tends to waste bits at the beginning adding up slowly. On the bright side over-allocation doesn’t impact quality (it “increases” actually haha).
Apple AAC (qaac)
While in a perfect world everyone should use the CoreAudio encoder, the fact of the matter is that it requires demuxing the file, making FDK the best in terms of workflow in FFmpeg-based programs. Some tools like StaxRip process files this way, but for most this is very inconvenient. (Mac users: the CoreAudio encoder can be used directly by programs like HandBrake.)
Apple AAC is more suitable for music (as expected, since it’s the encoder used for iTunes). When encoding quiet tracks (such as movies) it often undershoots its bitrate target by quite a bit (which can be mitigated by using a higher TVBR target or simply using CVBR). I really have nothing to say about it other than it’s really good, especially at low bitrates.
Apple AAC tends to respect bitrate targets much more than FDK AAC (with the exception of ultra low bitrates). It also favors allocating bitrate to music over speech.
Constrained VBR (CVBR) mode in Apple AAC constrains the minimum bitrate so it does not go too low, but does not limit the upper value the way Opus does. Keep in mind CVBR encodes identically to TVBR if the bitrate doesn’t drop below the threshold. Funnily enough, in the context of music tracks, CVBR behaves very similarly to FDK AAC, with an initial burst.
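The two qaac modes map to two command-line switches. A minimal sketch that only builds the argument list, assuming qaac is on your PATH (on Windows the binary may be named `qaac64`) and that the audio has already been demuxed to something qaac can read; the file names are placeholders.

```python
import shlex

# qaac switches for the two modes discussed above.
QAAC_MODES = {
    "tvbr": "--tvbr",  # true VBR, value is a quality number such as 91
    "cvbr": "--cvbr",  # constrained VBR, value is a bitrate in kbps
}


def qaac_command(src: str, dst: str, mode: str = "tvbr", value: int = 91) -> list[str]:
    """Build a qaac call for a demuxed audio file (e.g. WAV or FLAC)."""
    if mode not in QAAC_MODES:
        raise ValueError(f"unknown mode: {mode}")
    return ["qaac", QAAC_MODES[mode], str(value), src, "-o", dst]


if __name__ == "__main__":
    print(shlex.join(qaac_command("audio.wav", "audio_q91.m4a")))
    print(shlex.join(qaac_command("audio.wav", "audio_256.m4a", "cvbr", 256)))
```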
Not much really to say about Opus. You really have to experience it yourself to understand what I mean when I say Opus really is the next gen codec. It pretty much has no major flaws, and multiple public tests have shown Opus to be pretty much the best encoder out there.
Its 2 self-switching encoding modes, SILK/CELT, also make it perfect for both music and speech/vocal.
The only real downside to Opus (other than format support) is that it doesn’t have a quality mode and requires specifying a bitrate. For example, qaac Q91 has a target of ~192kbps. However, on complex tracks, it would not hesitate to go much higher, just like with the sample track provided. On tracks such as slice of life anime, where much is just voice and quietness, the resulting bitrate will be much lower than 192. On the other hand, Opus pretty much always produces a ~192kbps file. Additionally, qaac quality mode bases bitrate on a per-channel basis, whereas OPUS bases bitrate on the whole audio track, meaning for OPUS you will have to manually set double/triple the bitrate for 5.1/7.1 surround sound audio encodes.
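Because the Opus bitrate applies to the whole track, the scaling has to be done by hand before calling the encoder. A sketch of that step using FFmpeg's libopus encoder, assuming your FFmpeg build includes libopus; the file names and the helper name are placeholders of mine.

```python
import shlex

# Stereo pairs per layout, keyed by channel count: 2.0, 5.1, 7.1.
STEREO_PAIRS = {2: 1, 6: 2, 8: 3}


def libopus_command(src: str, dst: str, channels: int,
                    stereo_target_kbps: int = 128) -> list[str]:
    """Scale the per-track bitrate by the number of stereo pairs, then build the call."""
    if channels not in STEREO_PAIRS:
        raise ValueError(f"no rule of thumb for {channels} channels")
    total_kbps = STEREO_PAIRS[channels] * stereo_target_kbps
    return [
        "ffmpeg", "-i", src,
        "-c:v", "copy",
        "-c:a", "libopus",
        "-b:a", f"{total_kbps}k",  # bitrate for the entire track, not per channel
        dst,
    ]


if __name__ == "__main__":
    print(shlex.join(libopus_command("movie.mkv", "movie_opus.mkv", channels=6)))
```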
For streaming companies, Opus’s ability to respect bitrate so well is a huge advantage for networking and storage problems. For consumers like us however, this is a disadvantage as a quality mode would serve us much better. (Think of quality mode like crf in video encoders, huge file-to-file variation but ultimately equivalent quality for each file).
CVBR mode in Opus works differently from the one in AAC. Think of it as a VBR mode for CBR. It’s basically constant bitrate, but with a bit of wiggle room for very small momentary bursts.
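You can see this in the sample table by dividing peak bitrate by average bitrate for the ~64kbps encodes. A quick sketch; the numbers are copied from the table above, and the peak-to-average ratio is just my own way of summarising them.

```python
# (average kbps, peak kbps) taken from the encoder comparison table.
SAMPLES = {
    "OPUS VBR 64kbps": (65, 102),
    "OPUS CVBR 64kbps": (64, 66),
    "qaac CVBR 64kbps": (65, 86),
}


def peak_to_average(average_kbps: float, peak_kbps: float) -> float:
    """How far the busiest moment rises above the average bitrate."""
    return peak_kbps / average_kbps


if __name__ == "__main__":
    for name, (average, peak) in SAMPLES.items():
        print(f"{name:<18} peak/avg = {peak_to_average(average, peak):.2f}")
```

Opus CVBR stays within a few percent of its average, while qaac CVBR and Opus VBR both swing well above it.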
Opus also has a “soft” low-pass filter from 16-20kHz, which becomes progressively more aggressive below 96kbps. I say “soft” because it isn’t a hard limit, but is decided by the encoder depending on content. For example, in my test track even as low as 48kbps, when even HE-AAC is low-passing below 13kHz, OPUS still allows trumpet harmonics up to 20kHz. Opus also momentarily boosts VBR to ~50-55 kbps, as the encoder smartly determined that low-passing the trumpet would be detrimental to the overall quality.
As for the FFmpeg AAC encoder: holy Jesus, it’s bad. Anything lower than 256kbps produces farting noises with the trumpet (both CBR and VBR). Usually it isn’t this bad, but now you know why I chose this track to compare encoders.
On the bright side their VBR algorithm is pretty much spot on. Though according to their website VBR is still experimental and should not be used.
Encoding Audio for Anime
FDK AAC vs. qaac
Using Danmachi episode 1 as an example, you can see how both are really more similar than different. The only real takeaway is that qaac’s quality mode differs drastically with different content and really favors music (Q91 is ~192kbps for music, but in this case it undershot the target by a 30kbps margin). After using a higher quality target for qaac to compensate, you can see both have very similar bitrate allocation and variability, with the exception of the ED where qaac increases the bitrate compared to FDK AAC.
Opus vs. qaac
As both AAC and Opus are fundamentally different, we can’t really draw any conclusions from this.
One thing to note is that Opus really does respect bitrate targets VERY well. Oh and again, qaac loves to allocate bits to music over vocals (OP and ED peak).
All numbers in kbps.
|qaac (Apple AAC)||
|FFmpeg 3.0||384 CBR||DON’T||BOTHER|
Remember to enable the
|qaac (Apple AAC)||
|FDK AAC||Always use VBR Quality 5 to avoid low-pass filter. Q5 is approx 100kbps per channel. You may use Q4 (~64kbps/channel) if you don’t mind the low-pass filter.|
|FFMPEG 3.0||I really do not recommend using this audio encoder unless it is stereo and a final compression render to upload to sites such as Youtube.|
Here is my general rule of thumb for audio quality vs. gear:
64kbps/channel for budget chi-fi gear (
96kbps/channel for mid-fi gear (~$300-500 USD audio setups).
SnakeOil/channel for hi-fi gear ($1000USD+ setups). Jokes aside double-blind ABX test your limits, though I doubt anyone can differentiate past
Thanks for reading!
Basically yeah, follow this guide and your encodes should be good. This is the end of the guide, but if you want to read my ranting about audio gear and stuff keep reading.
Bonus: Ranting on Audio Gear and Recommendations
I really do wonder to what extent people are able to ABX and discern the difference between compressed files and lossless. I personally have trouble beyond 96kbps stereo for most sources, and beyond 160kbps I literally cannot tell the difference. I only own an HD 650 and Etymotic ER2/3/4XR so I do wonder what people with better gear can hear.
My journey into the “audiophile” world
Skip this section if you aren’t into my life story.
In middle school, just like everyone out there I owned a pair of V-shaped generic $20 IEMs. The forgotten days when I thought muddy bass=good, the days when a +20dB bass boost was about right.
High school is when I had the first taste of “real” audio gear. My first IEMs were the RHA MA750s. Classified as a “warm” IEM with a ~10dB sub-bass boost and bright highs, it was one of the best at its time in the $100 price range. Initially I thought it was really bass-light (lol), and the sharp 10K made me not like it as much.
In University, following the trend with everyone I got the infamous HD650s, and also ventured into the tube amp world (full of regrets, tube amp = money pit, coloured sound, and eh detail retrieval on most tubes). Here is when I got to experience what ‘soundstage’ is, got more used to a ‘natural’ sound signature.
While searching for an IEM upgrade after my RHA crapped out, I stumbled upon Etymotic’s ER3 series. I always knew that they were the benchmark for studio IEMs, but never really thought much of it until one fateful day for some god damn reason I ordered the ER3XR to try it out.
To my pleasant surprise, after spending a month with it and getting used to a neutral sound signature, I was really impressed. For classical, other IEMs are nothing compared to it in terms of timbre and accuracy, albeit the single BA does sound a tad ‘dry’ sometimes.
My biggest complaint/gripe with the audio world is that many tend to associate “sound signature” with “sound quality”. So once again, I would like to scream at the world:
“STRONG BASS ISN’T SOUND QUALITY. IT’S YOUR SOUND SIGNATURE PREFERENCE!”
Fun fact: stronger bass can actually worsen sound quality, since boosting the bass often creates distortion and worsens total harmonic distortion (THD) measurements.
My Recommended Gear
(Below info relevant as of early 2021)
Part 1: IEMs
Beginners looking for recommendations for audio gear (IEMs): I highly recommend Etymotic’s ER2, ER3, and ER4 series, specifically the ER2XR for beginners (Diffuse-field flat with a +5dB bass boost). They definitely aren’t for everyone with their house-sound (Diffuse-Field Target w/ slightly weaker treble). However, for those who want a taste into what a truly neutral IEM sounds like, these are unbeatable.
For those not into Diffuse-Field tuning, I recommend finding headphones tuned to the Harman/modified-Harman 2017/2019 IEM target (more “mainstream”). I haven’t been keeping up with what’s best, but for the lower budget people chi-fi (Chinese-fi) is the way to go. Some recently popular chi-fi brands are Tin, KZ, FiiO, BLON, Moondrop, etc. Moondrop is especially hot on the radar these days, and other than its tuning, I can also attest to its sound quality from display units I tried.
Fun fact: Etymotic invented insert headphones.
There are 2 variants within the ER2/3/4 series, the SE/SR and XR (i.e. ER3XR). SE/SR is the studio version, and the XR is the bass-boosted (+3db, or +5dB for ER2XR) variant. If this is your first time venturing into the neutral sound signature, get the XR variant. Get the SE/SR if you want a truly FLAT bass. Note that since the highs aren’t boosted like most mainstream IEMs, the bass is surprisingly present due to other frequencies not drowning it out.
The ER2 are the cheapest (~$125 USD) of the bunch and use Dynamic drivers (DD). They have an almost identical frequency response to the ER3, but due to the dynamic drivers they have better sub-bass (2dB stronger), sound more natural, and the bass packs more of a punch (likely from the slower decay of DD). If you listen to mainstream music, these are the ones to pick up. The value these offer is quite amazing at $125. I personally prefer the SE (non-bass-boosted) version.
ER3 use a single Balanced Armature driver (BA) and are suited for people who listen to classical and the like, as it has better detail retrieval at the cost of bass sounding uh… a bit unnatural (difficult to describe, best way to put it is that it lacks impulse and dynamics). The ER3s are the best to get into the Etymotic house sound (~$130-160 USD). These are fine with non-fast bass (bass guitar, non-synthetic bass drums, etc.), but once a track gets too complex (i.e. in EDM or metal when every instrument plays + heavy bass hits) the single BA sometimes gets overwhelmed and bass starts bleeding into the mids. Those who are used to DDs might find BAs sound a bit ‘dry’. I personally recommend the SE version over the XR version for the ER3 due to it sounding more natural and less prone to bass bleeding into mids.
ER4 are the professional version of the ER lineup, and have a legendary history. They are manufactured in the States, with each unit having its FR and channel match certification. Get these if you want the best Etys can offer. They are a small upgrade from the ER3 so I do not recommend the ER4s if you already own the ER3s. For the ER4 I recommend the XR over the SR variant which is opposite of the ER3 recommendations (reasons are complicated, but the SR is perfectly fine if you don’t mind the flat bass).
The Etys use their infamous triple-flange eartips that may not be for everyone. Fortunately, due to their long nozzle design, using foam tips does not alter the frequency response (Innerfidelity has proven this). I personally use Comply foam tips with them. There are also many aftermarket tips that fit them such as the Spinfits CP-800.
Part 2: Full-Sized Headphones
As a brief summary: closed back headphones have better noise isolation, open back leak noise but often have better soundstage.
I haven’t been keeping up in this market segment for years, so it’s up to you to research. Some big name brands in no particular order: Sennheiser, HiFiMan, Audeze, Beyerdynamic, Audio Technica, AKG, Grado, Fostex (LOTS of derivative ‘brands’ from modded T50 series), Focal, Philips, Sony, etc.
One thing to keep in mind is that full-sized headphones are often harder to drive (especially planars, don’t let the low impedance trick you) and require a dedicated amp.
Part 3: DAC/AMP
Note that new products come out every month so you should always do your research.
Also be very careful, other than cables, DAC/AMPs are the most prone to snake oil claims. Many might know the legend NwAvGuy who one day stormed into the scene, created an open-source Objective 2 AMP/DAC design that outperformed competition 10x its price then disappeared without trace. His story is a long one reserved for another day, but thanks to him modern gear are mostly more about objective measurements than subjective claims.
I usually search for an amp that is suitable for 1.5-2x my headphone impedance. This is because headphone impedance is measured at 1kHz, whereas the impedance often isn’t flat across the frequency range (stares at Senns). (Example: the HD650 is rated at 300 ohms but its peak impedance is 500 ohms at 80Hz.) Update: RIP innerfidelity, hopefully someone has that link archived somewhere
Portable DAC/AMPs are rarely powerful enough to drive 300 ohm class headphones. Amps that are will probably be bad at driving low impedance IEMs (volume matching, output impedance problems, etc.). It’s basically picking your poison and figuring out your needs.
For IEMs: Your goal is to find an amp that is clean (high SINAD) and quiet (low noise-floor). Try to find the lowest output impedance possible using the 1/8 (or 1/10) rule: the output impedance of your amp should be 1/8 or 1/10 (depending on who you ask) of your headphone impedance. Typically this means looking for <=1 ohm for IEMs. Portable amps often work well due to them being battery powered (clean DC power source). Digital DAC volume control is a plus to ensure channel matching.
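The 1/8 (or 1/10) rule is simple enough to check in a couple of lines. A sketch with hypothetical numbers: the 16 ohm IEM and the 10 ohm amp output are just example values.

```python
def max_output_impedance(headphone_ohms: float, ratio: int = 8) -> float:
    """Highest amp output impedance that still satisfies the 1/ratio rule."""
    return headphone_ohms / ratio


def min_headphone_impedance(output_ohms: float, ratio: int = 8) -> float:
    """Lowest headphone impedance an amp with this output impedance suits."""
    return output_ohms * ratio


if __name__ == "__main__":
    # Hypothetical 16 ohm IEM, checked against both versions of the rule.
    print("16 ohm IEM, 1/8 rule: ", max_output_impedance(16), "ohm max")
    print("16 ohm IEM, 1/10 rule:", max_output_impedance(16, ratio=10), "ohm max")
    # Hypothetical amp with a 10 ohm output.
    print("10 ohm output suits headphones of", min_headphone_impedance(10), "ohm and up")
```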
For high-impedance headphones: Your goal is to find something that does well at high gain with low distortion. Output power calculators (Site 1 & 2) are your friend. You often find people complaining about weak/flabby/distorted bass on such headphones and a weak amp is probably the cause.
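If you are curious what those output power calculators do under the hood, the core is a couple of textbook formulas. A rough sketch, assuming the headphone's sensitivity is quoted in dB SPL per milliwatt; the 100 dB/mW sensitivity and 110 dB peak target are made-up example values, not measurements of any real headphone.

```python
from math import sqrt


def required_power_mw(target_spl_db: float, sensitivity_db_per_mw: float) -> float:
    """Power needed to hit the target loudness: +10 dB costs 10x the power."""
    return 10 ** ((target_spl_db - sensitivity_db_per_mw) / 10)


def required_voltage_vrms(power_mw: float, impedance_ohms: float) -> float:
    """Voltage the amp must swing into the load, from P = V^2 / R."""
    return sqrt(power_mw / 1000 * impedance_ohms)


if __name__ == "__main__":
    # Hypothetical 300 ohm headphone, 100 dB/mW, aiming for 110 dB SPL peaks.
    power = required_power_mw(110, 100)
    volts = required_voltage_vrms(power, 300)
    print(f"{power:.1f} mW, {volts:.2f} Vrms")
```

The takeaway is that high-impedance headphones need voltage more than current, so an amp that cannot swing enough volts at high gain will clip on loud peaks, bass hits included.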
Crossed out some devices. I have not been keeping up with the scene now that I’ve found my end-game: the Apple Dongle. Yes, I’m dead serious.
- Uber Budget: Apple USB-C to 3.5mm Dongle ($10)
- No, this isn’t a joke. Make sure you get the US version, as the EU version is weaker and doesn’t measure as well due to reasons (EU volume limit shenanigans). These measure insanely well for $10 (99dB SINAD / 113dB SNR), and are perfect for even end-game IEMs due to being really clean with a non-existent noise floor. Suitable for <50ohms IEMs / non-planar headphones. Can be used on Windows with no problems, although Android users may have volume issues due to a config bug (can be mitigated by using exclusive mode such as a USB audio driver app).
EarStudio ES100 ($100) Not sure what’s good these days.
- Budget Desktop DAC/AMP Combo:
FiiO K3 ($100) Designed to replace the infamous E10K. While measurements aren’t the best in 2019, it packs a lot of functionality (Optical, RCA, etc.) and is pretty good for its price for an all-in-one. Also has an actually good 6db bass-boost switch for watching movies (6db bass boost on D-F tuning is close to Harman target’s bass’s 7db boost). Recommended for <150 ohm non-planar Headphones and IEMs.
- Budget Desktop DAC/AMP Combo (More Power):
FiiO K5 Pro ($150) Has a surprisingly good AMP and OK DAC. Can even drive the HD650s well. Basically a much more powerful desktop K3 that can drive ~300ohm / planars with no problem.
- Budget Portable DAC/AMP Combo:
Topping NX4 ($160) Pretty much better measurements than the FiiO K3. Can even drive the HD650s. However, QC isn’t as good as FiiO’s.
- Mid-Range Desktop DAC/AMP Combo:
Topping DX3Pro ($220) A really good all-in-one unit with good measurements. Not as good as separate DAC AMP units but for functionality (BT, preamp, etc.) and desk space friendliness it is unbeatable. If you get the newer v2 LDAC version, unfortunately its output impedance is 10 ohms so make sure your headphone is >80 ohms (1:8 rule).
- Mid-Range DAC & AMP units: JDS Atom AMP/DAC ($100/$100), Schiit Heresy ($100), Grace SDAC ($79), Khadas Tone Board ($100), or similar products (e.g. SMSL/Topping DAC/AMPs)
- These are popular entry-level single units. There are also many good Chinese DACs in the $100 price range, although it might be more of a hassle to acquire one (Aliexpress, warranty issues, etc.).
- Upper-Range SE DAC/AMP: JDS Element II ($399)
- The AMP unit is very good (the Atom was derived from this unit during research). The DAC chip is definitely the weakest link in this unit. However, if you need a nice looking DAC/AMP and don’t care for balanced connectors, this is still a very good choice. You’re definitely paying some premium for the looks though.
- High-End Stuff: I will refrain from recommending anything specific, but here is a random list that might interest you. Always do your own research and never blindly trust strangers.
- Massdrop THX AAA 789
- SMSL SP200 THX
- SMSL SU-8 v2
- SDAC Balanced
- Schiit Modius
- Topping DX7Pro
- Other brands: THX powered AMPs, Chord, iFi, etc.