Ext4 Filesystem

This article provides a brief introduction to the Linux file system Ext4, the successor to Ext3. It also offers several usage tips and links to further information.

Evolution

Ext4 began as a fork of the Ext3 source code and was developed independently in order to overcome several limitations of the Ext2 and Ext3 file systems. Ext4 was accepted into the Linux kernel in version 2.6.19 and declared stable in version 2.6.28.

Current distributions, such as Red Hat Enterprise Linux 6 (RHEL), Debian 6.0 (Squeeze), and Ubuntu 10.10 (Maverick Meerkat), provide stable Ext4 support and, in some cases, use it as the default file system.

Advantages

Improved performance through:
- Multi-block allocation
- Extent-based block mapping
- Delayed allocation
- Stripe-aware allocation
Improved data integrity through:
- Write barriers
Time stamps with nanosecond instead of second resolution
Faster e2fsck runs
Unlimited number of subdirectories (Ext3 allows at most 32,000 subdirectories per directory)
Journaled quota data, so that no quota check is required after a system crash

You can find additional details regarding these advantages in a white paper from Red Hat ^[1].

Compatibility with Ext3

Because there have been many changes in comparison with Ext3, migration to Ext4 is not as easy as from Ext2 to Ext3.

To take full advantage of the Ext4 file system on RHEL 6, Red Hat recommends backing up all data, creating a new Ext4 file system, and copying the data onto it (see also ^[1]).

The Ext4 driver can mount an Ext3 file system, but only with limited functionality. Conversely, as soon as extent-based mapping is used, an Ext4 file system can no longer be mounted as an Ext3 file system.

SSD Optimizations

ATA Trim

Ext4 supports ATA Trim for solid state drives (SSDs):

Online discard from Kernel 2.6.33
- Enabled with the -o discard mount option, for example mount -o discard /dev/sdb1 /mnt/. Because discard is disabled by default, the option must be entered in /etc/fstab for permanent activation.^[2]
Batched discard from Kernel 2.6.37
Accelerated batched discard from Kernel 3.1
Pre-discard during formatting from mke2fs 1.41.10^[3]^[4]^[5]
- Extract from the man page for mke2fs: -E discard: Attempt to discard blocks at mkfs time (discarding blocks initially is useful on solid state devices and sparse / thin-provisioned storage). When the device advertises that discard also zeroes data (any subsequent read after the discard and before write returns zero), then mark all not-yet-zeroed inode tables as zeroed. This significantly speeds up file system initialization. This is set as default.
  - Caution is advised when using this feature with the device mapper and mixed physical volumes: discard_zeros_data is only reported correctly as of Kernel 3.0 (see the patch "block: Fix discard topology stacking and reporting").
- In our tests, the time required for the discard operation during formatting varied significantly. However, discard also zeroes data had no significant effect on it (see ATA Trim Performance).
- For a device that supports discard also zeroes data, hdparm -I displays Deterministic read ZEROs after TRIM; other devices may report only Deterministic read data after TRIM. The following example shows an OCZ Vertex 3 SSD and an Intel 320 Series SSD:

[root@fedora15 ~]# hdparm -V
hdparm v9.36
[root@fedora15 ~]# hdparm -I /dev/sda | grep 'Model\|TRIM'
	Model Number:       OCZ-VERTEX3                             
	   *	Data Set Management TRIM supported (limit 1 block)
	   *	Deterministic read data after TRIM
[root@fedora15 ~]# hdparm -I /dev/sdb | grep 'Model\|TRIM'
	Model Number:       INTEL SSDSA2CW160G3                     
	   *	Data Set Management TRIM supported (limit 8 blocks)
	   *	Deterministic read ZEROs after TRIM
[root@fedora15 ~]#
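
The hdparm output above can also be evaluated in a script. The following is a minimal sketch in POSIX shell, assuming the output format shown in the example. The function name trim_read_behavior and the result labels are hypothetical and chosen for illustration only.

```shell
# Classify the TRIM-related lines of `hdparm -I` output read from stdin.
# Function name and result labels are illustrative, not part of hdparm.
trim_read_behavior() {
    hdparm_info=$(cat)
    case "$hdparm_info" in
        # Reads after TRIM return zeros ("discard also zeroes data").
        *"Deterministic read ZEROs after TRIM"*) echo "zeroes" ;;
        # Reads after TRIM are deterministic, but not necessarily zero.
        *"Deterministic read data after TRIM"*)  echo "deterministic" ;;
        # TRIM is supported, but nothing is promised about later reads.
        *"Data Set Management TRIM supported"*)  echo "trim-only" ;;
        *)                                       echo "no-trim" ;;
    esac
}

# Usage (requires root and a real device):
#   hdparm -I /dev/sdb | trim_read_behavior
```

For the two drives shown above, this sketch would report "deterministic" for the OCZ Vertex 3 and "zeroes" for the Intel 320 Series.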

File System Journal: Yes or No?

Running Ext4 without a journal^[6]^[7] can increase file system performance, but has disadvantages when the system is not shut down cleanly (for example, during a power failure). In tests, Ext4 developer Theodore Ts'o found that journaling cost between four and twelve percent in performance. The journal should therefore be kept. Disabling atime updates is the preferable way to increase performance.^[8]^[9]

noatime

Using the noatime mount option improves read performance, because reading a file then no longer triggers an access-time update on disk.^[8]^[10]^[11]
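
Whether noatime is actually in effect for a mounted file system can be checked in the options column of /proc/mounts. The following is a minimal sketch in POSIX shell; the function name has_mount_option is hypothetical, and the mount point /mnt is only an example.

```shell
# Succeed if MOUNTPOINT carries OPTION in /proc/mounts-style input on stdin.
# Fields per line: device, mount point, file system type, options, ...
# The function name is illustrative only.
has_mount_option() {
    awk -v mp="$1" -v opt="$2" '
        $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == opt) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Usage:
#   has_mount_option /mnt noatime < /proc/mounts && echo "noatime is active"
```

The same check works for other mount options mentioned in this article, such as discard.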

Should stride and stripe-width Parameters be used?

There are a number of different recommendations for the stride and stripe-width parameters^[12]^[13] for using SSDs under Ext4.^[14]^[15]

Whether particular values really provide a benefit cannot be determined with certainty at this time. Individual tests with the respective SSD would be required to find out.^[16]^[17]
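
For orientation, the mke2fs man page defines stride as the number of file system blocks per RAID chunk and stripe-width as stride multiplied by the number of data-bearing disks. The following sketch only illustrates this arithmetic. The helper name ext4_stripe_opts and the sample values (64 KiB chunk, 4 KiB blocks, 2 data disks) are assumptions for illustration and not a recommendation for any particular SSD.

```shell
# Print the -E option string for mkfs.ext4 from a RAID-style geometry.
# Helper name and sample values are illustrative only.
ext4_stripe_opts() {
    chunk_kib=$1      # chunk size in KiB
    block_kib=$2      # file system block size in KiB
    data_disks=$3     # number of data-bearing disks
    stride=$((chunk_kib / block_kib))
    stripe_width=$((stride * data_disks))
    printf 'stride=%s,stripe-width=%s\n' "$stride" "$stripe_width"
}

ext4_stripe_opts 64 4 2   # prints stride=16,stripe-width=32
```

The resulting string would be passed to mkfs.ext4 with the -E option when the file system is created.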

Lazy Initialization

When an Ext4 file system is created, the inode tables must be cleared (overwritten with zeros, or "zeroed"). The "lazyinit" feature is intended to speed up file system creation significantly: instead of initializing all inode tables immediately, it initializes them gradually in the background once the file system is first mounted (from Kernel version 2.6.37).^[18]^[19] See the following extract from the mkfs.ext4 man page:^[20]

If enabled and the uninit_bg feature is enabled, the inode table will not be fully initialized by mke2fs. This speeds up file system initialization noticeably, but it requires the kernel to finish initializing the file system in the background when the file system is first mounted. If the option value is omitted, it defaults to 1 to enable lazy inode table zeroing.

One should be careful when benchmarking a freshly created file system. After the first mount, lazy initialization may write a large amount of data to the disk and thereby distort the test results. Initially, the "ext4lazyinit" kernel process writes to the device at up to 16,000 kB/s and thus consumes a considerable share of the disk's bandwidth (see also I/O Statistics by Process). To prevent lazy initialization, the mkfs.ext4 command offers extended options:^[20]

mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/mapper/fc-root

With these options, the inode tables and the journal are initialized immediately when the file system is created.

Potential Problems

Poor Performance due to Write Barriers

Write barriers protect the integrity of the file system when the hard disk uses a volatile write cache whose contents are lost in a power failure. This improves data security at the price of a small performance penalty. Write barriers are enabled by default, but can be disabled with the nobarrier mount option. Doing so is only recommended when the write cache of your RAID array is protected by a battery backup unit, a zero-maintenance cache, or a similar measure.

Data Loss with Applications that do not use fsync() correctly

When Linux distributions first introduced Ext4, reports of data loss, some of it massive, began to appear.^[21]

The cause is delayed allocation, which allocates the required disk blocks only when the data is actually written out, up to 60 seconds after the write. File renames made this visible: the metadata could already reflect a rename even though the file's data had not yet been written, so that after a crash the file name pointed to a 0-byte file. However, this only occurs when an application does not use fsync() properly. According to developer Theodore Ts'o, Ext4 implements the POSIX standard for file operations precisely. The problem is that the "safe" behavior of Ext3 was never guaranteed, yet many application developers took it for granted because Ext3 was so widely used.

Initially, the "alloc_on_commit" mode was introduced as a workaround. Shortly thereafter, in Kernel 2.6.30, it was replaced by the "auto_da_alloc" mode (see Kernel Commit ^[22]), which attempts to detect and avoid the most common cases of potential data loss. This mode is the new default, but can be disabled with "noauto_da_alloc".

In general, applications should comply more closely with the POSIX standard in the future and call fsync() wherever it is required. Ext4 was adapted to this problem after the fact because of the expectations established by Ext3. Other file systems that use delayed allocation (such as XFS or Btrfs) make no such allowance.
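
The pattern that avoids the 0-byte file is to write the new contents to a temporary file, flush it to disk, and only then rename it over the target. The following is a minimal sketch in POSIX shell, assuming GNU coreutils 8.24 or later, where sync accepts file arguments and calls fsync() on them. The function name atomic_replace and the temporary file naming are hypothetical.

```shell
# Replace TARGET with the data read from stdin without ever exposing
# an empty or partially written file under the target name.
# Function name and temp-file naming are illustrative only.
atomic_replace() {
    target=$1
    tmp="$target.tmp.$$"                  # temp file in the same directory
    cat > "$tmp" || return 1              # 1. write the new contents
    sync "$tmp" || return 1               # 2. flush the data (fsync on the file)
    mv -f "$tmp" "$target" || return 1    # 3. rename over the target
    sync "$(dirname "$target")"           # 4. flush the directory entry
}

# Usage:
#   printf 'key=value\n' | atomic_replace /path/to/config
```

Applications written in other languages would call fsync() on the file descriptor directly before the rename. The shell version merely illustrates the order of the steps.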

References

[1] White Paper: "Is It Time to Migrate Your File Systems to Ext4?" by Krista Guglielmeti (login required)
[2] http://www.kernel.org/doc/Documentation/filesystems/ext4.txt
[3] e2fsprogs Release Notes, e2fsprogs 1.41.10 (February 10, 2010): Mke2fs will use BLKDISCARD to pre-discard all blocks on an SSD or thinly-provisioned storage device.
[4] e2fsprogs Release Notes, e2fsprogs 1.41.13 (December 13, 2010): Mke2fs now understands the extended option "discard" and "nodiscard", and the older option -K is deprecated. The default of whether discards are enabled by default can be controlled by the mke2fs.conf file. See also mke2fs: Deprecate -K option, introduce discard/nodiscard (commit)
[5] Man page correction (probably in e2fsprogs 1.41.15; as of July 3, 2011, 1.41.14 is the most current version): mke2fs: Simple man page nodiscard option correction (commit)
[6] Ext4: "No Journaling" mode (kernelnewbies.org)
[7] ext4: Allow ext4 to run without a journal (Kernel Commit)
[8] SSDs, Journaling, and noatime/relatime (Theodore Ts'o's blog, 01.03.2009)
[9] Re: Ext4 on SSD Intel X25-M (Linux Ext4 Mailing List, 12.11.2009)
[10] Linux: Replacing atime with relatime (kerneltrap.org)
[11] Does noatime imply nodiratime? (lwn.net)
[12] Ext4 Wiki, see s_raid_stride, s_raid_stripe_width
[13] Creating and Tuning Ext4 Partitions (blog.peacon.co.uk)
[14] Optimizing Linux for SSD usage (searchenterpriselinux.techtarget.com)
[15] http://www.nuclex.org/blog/personal/80-aligning-an-ssd-on-linux
[16] Re: -E stride and stripe-width necessary for best performance of SSDs? (Linux Ext4 Mailing List, 01.07.2011)
[17] Re: -E stride and stripe-width necessary for best performance of SSDs? (Linux Ext4 Mailing List, 01.07.2011)
[18] Kernel Log: What Does 2.6.37 Add? Two File Systems (heise.de, 05.12.2010)
[19] ext4: add support for lazy inode table initialization (git.kernel.org, 28.10.2010)
[20] mkfs.ext4 man Page (linux.die.net)
[21] Potential Data Loss with Ext4 (heise.de)
[22] GIT Kernel Commit for Ext4 auto_da_alloc Mode

Additional Information

Ext4 (en.wikipedia.org)
Ext4 Wiki (ext4.wiki.kernel.org)
Chapter 9. The Ext4 File System (Red Hat Enterprise Linux 6 Storage Administration Guide)

Author: Christoph Mitasch

Christoph Mitasch works in the Web Operations & Knowledge Transfer team at Thomas-Krenn. He is responsible for the maintenance and further development of the webshop infrastructure. After an internship at IBM Linz, he completed his diploma studies in "Computer- and Media-Security" at FH Hagenberg. He lives near Linz and, besides work, is an enthusiastic marathon runner and juggler who holds various world records.
