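Linux File Systems

Kernel version: 5.15

Overview

The VFS (Virtual File System, also called the Virtual Filesystem Switch) is an abstraction layer in the kernel that provides the file system interface to user-space programs. It hides the implementation differences between concrete file systems from those programs.

VFS system calls always run in process context.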

Source file	System calls
fs/d_path.c	getcwd
fs/exec.c	execve, execveat
fs/fcntl.c	fcntl
fs/file.c	dup
fs/filesystems.c	sysfs
fs/ioctl.c	ioctl
fs/locks.c	flock
fs/namei.c	mknod, mkdir, rmdir, unlink, link, symlink, rename
fs/namespace.c	mount, umount
fs/open.c	truncate, access, chdir, chroot, chmod, open, creat, close
fs/pipe.c	pipe
fs/read_write.c	lseek, read, write
fs/select.c	select, poll
fs/stat.c	stat, fstat, lstat
fs/sync.c	sync
fs/utimes.c	utime, utimes, futimesat, utimensat

Key global variables

Global variable	File	Description
struct list_head super_blocks	fs/super.c	List of every super_block in the system
struct file_system_type *file_systems	fs/filesystems.c	List of every registered file_system_type

Key data structures

Data structure	File
struct super_block	include/linux/fs.h
struct dentry	include/linux/dcache.h
struct inode	include/linux/fs.h
struct file	include/linux/fs.h
struct file_system_type	include/linux/fs.h
struct vfsmount	include/linux/mount.h
struct mount	fs/mount.h

Relationships between the data structures (the same color marks pointers to the same object)
[Figure: example_1]

super_block

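A super_block stores information about one mounted file system. The instance is created when the file system is mounted.

For disk-based file systems this information is also persisted on disk. The on-disk superblock holds the most basic metadata of the file system and usually sits near the start of the underlying storage device. At mount time the kernel reads it and keeps it resident in memory; some fields of the in-memory structure are filled in dynamically when it is created.

One file system type can have several super_block instances, for example when the same type is mounted from several devices.

Each super_block sits on two lists: the global list of all super_blocks (the global variable super_blocks, linked through s_list) and the per-type list of super_blocks (file_system_type.fs_supers, linked through s_instances).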

struct super_block {
    struct list_head    s_list;                /* Keep this first; links into the global super_blocks list */
    dev_t            s_dev;                    /* search index; _not_ kdev_t; device identifier */
    unsigned char        s_blocksize_bits;   /* log2 of the block size in bytes */
    unsigned long        s_blocksize;        /* block size in bytes */
    loff_t            s_maxbytes;                /* maximum file size */
    struct file_system_type    *s_type;        /* file system type */
    const struct super_operations    *s_op;  /* super_block operations */
    const struct dquot_operations    *dq_op; /* disk quota operations */
    const struct quotactl_ops    *s_qcop;    /* quota control operations */
    const struct export_operations *s_export_op;    /* export operations */
    unsigned long        s_flags;    /* mount flags of the file system */
    unsigned long        s_magic;    /* magic number of the file system */
    struct dentry        *s_root;        /* dentry of the root directory */

    struct hlist_bl_head    s_roots;    /* alternate root dentries for NFS */
    struct list_head    s_mounts;    /* list of mounts; _not_ for fs use */
    struct block_device    *s_bdev;    /* associated block device */
    struct hlist_node    s_instances;    /* links into fs_supers of the owning file_system_type */

    // Points to the file system's private structure, e.g. xfs_mount, ramfs_fs_info, ext4_sb_info, proc_fs_info
    void            *s_fs_info;    /* Filesystem private info  */

    const struct dentry_operations *s_d_op;     /* default d_op for dentries */

    /*
     * Owning user namespace and default context in which to
     * interpret filesystem uids, gids, quotas, device nodes,
     * xattrs and security labels.
     */
    struct user_namespace *s_user_ns;

    /*
     * The list_lru structure is essentially just a pointer to a table
     * of per-node lru lists, each of which has its own spinlock.
     * There is no need to put them into separate cachelines.
     */
    struct list_lru        s_dentry_lru;       // LRU list of unused dentries
    struct list_lru        s_inode_lru;        // LRU list of unused inodes
    struct rcu_head        rcu;
    struct work_struct    destroy_work;
    struct list_head    s_inodes;    /* all inodes */
    struct list_head    s_inodes_wb;    /* writeback inodes */
} __randomize_layout;

dentry

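A dentry (directory entry) records the mapping between a file or directory name and its inode, and also captures the relationship between a directory and the files and directories it contains. Dentries exist mainly to make file and directory lookup convenient; every component of a path, whether directory or file, has its own dentry.

Dentries hold the mapping from path to inode, which is what makes it possible to move around in the file system. They are maintained by the VFS, shared by all file systems, and not tied to any particular process.

A dentry is not stored on disk or any other persistent device. It is an in-memory structure created on demand, designed to build the tree used for file and directory lookup.

Linked together from the root of the file system, all dentries form a tree; a lookup walks this tree to find the target file or directory.

The VFS maintains a dentry cache (the global variable struct hlist_bl_head *dentry_hashtable) that holds recently used dentries to speed up lookups. When open() is called, the kernel first looks for the matching dentry in the dentry cache by path; on a hit it builds a struct file object and returns it. If the file is not in the cache, the VFS starts from the nearest cached ancestor directory and loads entries level by level until it reaches the file, caching every dentry it creates along the way.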

struct dentry {
    /* RCU lookup touched fields */
    unsigned int d_flags;        /* protected by d_lock */
    seqcount_spinlock_t d_seq;    /* per dentry seqlock */
    struct hlist_bl_node d_hash;    /* lookup hash list */
    struct dentry *d_parent;    /* parent directory */
    struct qstr d_name;
    struct inode *d_inode;        /* Where the name belongs to - NULL is negative */
    unsigned char d_iname[DNAME_INLINE_LEN];    /* small names */

    /* Ref lookup also touches following */
    struct lockref d_lockref;    /* per-dentry lock and refcount */
    const struct dentry_operations *d_op;
    struct super_block *d_sb;    /* The root of the dentry tree */
    unsigned long d_time;        /* used by d_revalidate */
    void *d_fsdata;            /* fs-specific data */

    union {
        struct list_head d_lru;        /* LRU list */
        wait_queue_head_t *d_wait;    /* in-lookup ones only */
    };
    struct list_head d_child;    /* child of parent list; linked into d_parent's d_subdirs */
    struct list_head d_subdirs;    /* our children; list of this dentry's child dentries */
    /*
     * d_alias and d_rcu can share memory
     */
    union {
        struct hlist_node d_alias;    /* inode alias list */
        struct hlist_bl_node d_in_lookup_hash;    /* only for in-lookup ones */
         struct rcu_head d_rcu;
    } d_u;
} __randomize_layout;

inode

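An inode (index node) records the attributes of a file or directory. Each file has exactly one inode. One inode may be referenced by several dentries, typically because hard links to the file have been created.

When a file is created, a corresponding struct inode instance is created, and the information is persisted on disk in a layout defined by the concrete file system.

The in-memory inode is built only when the file on disk is accessed: the file system loads the relevant data from disk and constructs it. The VFS maintains an inode cache (the global variable struct hlist_head *inode_hashtable) that holds recently used inodes to speed up lookups.

An inode can sit on two doubly linked lists of the super_block of its file system: s_inodes and s_inodes_wb.

The first column of ls -li output is the file's inode number.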

baoze@baoze:~/workspace$ ls -li
total 1281244
1444216 drwxrwxr-x  2 baoze baoze       4096 Jan 14 04:28 c
 524309 -rw-rw-r--  1 baoze baoze   59243820 Jan  6 22:58 compile_commands.json
 533495 drwxrwxr-x 28 baoze baoze       4096 Jan  7 00:26 linux-5.15.86
 533494 -rw-rw-r--  1 baoze baoze  195477769 Jan  5 15:40 linux-5.15.86.tar.gz
 557446 -rw-rw-r--  1 baoze baoze 1057254056 Jan 14 01:07 linux-image-unsigned-5.15.0-58-generic-dbgsym_5.15.0-58.64_amd64.ddeb

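Hard links and symbolic links

Hard link: another name that points to the inode of the original file; no new inode is allocated. Every hard link added increases the inode's link count by 1, and the file is only really deleted once the link count drops to 0 (and no process still has it open). So even if the original name is deleted, the file can still be reached through the hard link. Note that a hard link cannot cross file system boundaries.
Symbolic (soft) link: the link is a separate file with its own inode. It can therefore point to directories and across file systems. However, once the original file is deleted, the link no longer resolves.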

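An inode is usually in one of three states

In memory, but not associated with any file and not in active use;
In memory and in use by one or more processes, usually representing a file. Both counters (i_count and i_nlink) must be greater than 0. The file contents and inode metadata match what is on the underlying block device, i.e. the inode has not changed since it was last synchronized with the medium;
In active use, with data that has changed and now differs from what is on the storage medium. An inode in this state is called dirty.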

struct inode {
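    umode_t            i_mode;     // file type and access permissions
    unsigned int        i_flags;    // file system flags

    const struct inode_operations    *i_op;  // pointer to the inode operations table
    struct super_block    *i_sb;      // super_block of the file system this inode belongs to

    // This structure caches the file contents. Reads and writes first look for the data in the cache behind i_mapping.
    // On a hit, a read is served from the cache without going to the physical disk, which speeds up reads considerably.
    // Writes also go to the cache first and are written back to disk later at a suitable time.
    struct address_space    *i_mapping;     // associated address space mapping

    unsigned long        i_ino;      // inode number; shown by ls -i
    dev_t            i_rdev;
    loff_t            i_size;       /* file size in bytes */
    struct timespec64    i_atime;    // last access time
    struct timespec64    i_mtime;    // last modification time of the contents
    struct timespec64    i_ctime;    // last change time of the inode itself
    blkcnt_t        i_blocks;    // number of blocks used by the file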

    struct hlist_node    i_hash;
    struct list_head    i_io_list;    /* backing dev IO list */
    struct list_head    i_lru;        /* inode LRU list */
    struct list_head    i_sb_list;      /* links into the inode list of the super_block */
    struct list_head    i_wb_list;    /* backing dev writeback list */
    union {
        struct hlist_head    i_dentry;
        struct rcu_head        i_rcu;
    };
    union {
        const struct file_operations    *i_fop;    /* former ->i_op->default_file_ops */
        void (*free_inode)(struct inode *);
    };
    struct address_space    i_data;     // the inode's own address space; i_mapping normally points here
    struct list_head    i_devices;      // list linkage for device inodes
    union {
        struct pipe_inode_info    *i_pipe;
        struct cdev        *i_cdev;
        char            *i_link;
        unsigned        i_dir_seq;
    };

    void            *i_private; /* fs or device private pointer */
} __randomize_layout;

file

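struct file is a kernel data structure that describes a file opened by a process, so it is tied to processes.

Because one file can be opened by several processes, there can be several file objects for the same file; when they are opened through the same path, they all refer to the same inode and the same dentry.

Each process has an fd[] array whose entries are pointers to struct file. Different fds of the same process can point to the same file object.

When an application calls open(), the VFS creates the corresponding file object; opening a file is essentially initializing a struct file.
During the open, key information from the inode is copied into the file, in particular the file operation function pointers.
The task_struct reaches this array of struct file pointers through its files member (struct files_struct), and a user-space file descriptor is simply an index into the array. Given a descriptor, the kernel can easily find the file and then access the data through its function pointers.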

struct file {       // include/linux/fs.h

    // f_path.dentry points to the dentry of this file
    // f_path.mnt points to the vfsmount of this file
    struct path        f_path;
    struct inode        *f_inode;    /* cached value */
    const struct file_operations    *f_op;      // pointer to the file operations table

    atomic_long_t        f_count;        // reference count of the file object
    unsigned int         f_flags;        // flags given when the file was opened
    fmode_t            f_mode;     // access mode of the file
    loff_t            f_pos;      // current file offset

    u64            f_version;
    void            *private_data;

    struct address_space    *f_mapping;     // page cache mapping
};

filesystem

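fs/filesystems.c defines the global variable static struct file_system_type *file_systems, which holds every file_system_type registered in the system.

Traversal of the file_systems list must be protected by file_systems_lock.
A file system module must call unregister_filesystem() when it is unloaded.
A member of the list may be accessed either inside a section that holds file_systems_lock, or after taking a reference on file_system_type->owner. The reference is taken with try_module_get(), which returns 0 on failure.

For example: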

static int fs_name(unsigned int index, char __user * buf)
{
    struct file_system_type * tmp;
    int len, res;

    read_lock(&file_systems_lock);  /* take the lock */
    for (tmp = file_systems; tmp; tmp = tmp->next, index--)
        if (index <= 0 && try_module_get(tmp->owner))  /* take a module reference */
            break;
    read_unlock(&file_systems_lock); /* release the lock */
    if (!tmp)
        return -EINVAL;

    /* OK, we got the reference, so we can safely block */
    len = strlen(tmp->name) + 1;
    res = copy_to_user(buf, tmp->name, len) ? -EFAULT : 0;
    put_filesystem(tmp);
    return res;
}

The file_system_type structure

struct file_system_type {
    const char *name;       // file system name
    int fs_flags;
    int (*init_fs_context)(struct fs_context *);
    const struct fs_parameter_spec *parameters;
    struct dentry *(*mount) (struct file_system_type *, int, const char *, void *);
    void (*kill_sb) (struct super_block *);
    struct module *owner;
    struct file_system_type * next;     /* next entry in the global file_systems list */
    struct hlist_head fs_supers;    /* head of the list of super_blocks of mounted file systems of this type */
};

vfsmount

struct vfsmount {       // include/linux/mount.h
	struct dentry *mnt_root;	/* root of the mounted tree */
	struct super_block *mnt_sb;	/* pointer to superblock */
	int mnt_flags;
	struct user_namespace *mnt_userns;
};

mount

struct mount {          // fs/mount.h
	struct hlist_node mnt_hash;
	struct mount *mnt_parent;
	struct dentry *mnt_mountpoint;
	struct vfsmount mnt;
	union {
		struct rcu_head mnt_rcu;
		struct llist_node mnt_llist;
	};

	struct list_head mnt_mounts;	/* list of children, anchored here */
	struct list_head mnt_child;	/* and going through their mnt_child */
	struct list_head mnt_instance;	/* mount instance on sb->s_mounts */
	const char *mnt_devname;	/* Name of device e.g. /dev/dsk/hda1 */
	struct list_head mnt_list;
	struct list_head mnt_expire;	/* link in fs-specific expiry list */
	struct list_head mnt_share;	/* circular list of shared mounts */
	struct list_head mnt_slave_list;/* list of slave mounts */
	struct list_head mnt_slave;	/* slave list entry */
	struct mount *mnt_master;	/* slave is on master->mnt_slave_list */

	struct mnt_namespace *mnt_ns;	/* containing namespace */
	struct mountpoint *mnt_mp;	/* where is it mounted */
	
	union {
		struct hlist_node mnt_mp_list;	/* list mounts with the same mountpoint */
		struct hlist_node mnt_umount;
	};
	struct list_head mnt_umounting; /* list entry for umount propagation */
	int mnt_id;			/* mount identifier */
	int mnt_group_id;		/* peer group identifier */
	int mnt_expiry_mark;		/* true if marked for expiry */
	struct hlist_head mnt_pins;
	struct hlist_head mnt_stuck_children;
};

mnt_namespace

struct mnt_namespace {      // fs/mount.h
	struct ns_common	ns;
	struct mount *	root;
	/*
	 * Traversal and modification of .list is protected by either
	 * - taking namespace_sem for write, OR
	 * - taking namespace_sem for read AND taking .ns_lock.
	 */
	struct list_head	list;
	spinlock_t		ns_lock;
	struct user_namespace	*user_ns;
	struct ucounts		*ucounts;
	u64			seq;	/* Sequence number to prevent loops */
	wait_queue_head_t poll;
	u64 event;
	unsigned int		mounts; /* # of mounts in the namespace */
	unsigned int		pending_mounts;
};

mountpoint

struct mountpoint {     // fs/mount.h
	struct hlist_node m_hash;
	struct dentry *m_dentry;
	struct hlist_head m_list;
	int m_count;
};

nameidata

struct nameidata {      // fs/namei.c
	struct path	path;
	struct qstr	last;
	struct path	root;
	struct inode	*inode; /* path.dentry.d_inode */
	unsigned int	flags, state;
	unsigned	seq, m_seq, r_seq;
	int		last_type;
	unsigned	depth;
	int		total_link_count;
	struct saved {
		struct path link;
		struct delayed_call done;
		const char *name;
		unsigned seq;
	} *stack, internal[EMBEDDED_LEVELS];
	struct filename	*name;
	struct nameidata *saved;
	unsigned	root_seq;
	int		dfd;
	kuid_t		dir_uid;
	umode_t		dir_mode;
} __randomize_layout;

Initialization flow

Global variable	Cache name	Object	Allocation
dentry_hashtable	Dentry cache	struct hlist_bl_head	alloc_large_system_hash
inode_hashtable	Inode-cache	struct hlist_head	alloc_large_system_hash
names_cachep	names_cache	4 KiB char buffer (path name)	slab
dentry_cache	dentry	struct dentry	slab
inode_cachep	inode_cache	struct inode	slab
filp_cachep	filp	struct file	slab
mnt_cache	mnt_cache	struct mount	slab
mount_hashtable	Mount-cache	struct hlist_head	alloc_large_system_hash
mountpoint_hashtable	Mountpoint-cache	struct hlist_head	alloc_large_system_hash
kernfs_node_cache	kernfs_node_cache	struct kernfs_node	slab
kernfs_iattrs_cache	kernfs_iattrs_cache	struct kernfs_iattrs	slab
shmem_inode_cachep	shmem_inode_cache	struct shmem_inode_info	slab
bdev_cachep	bdev_cache	struct bdev_inode	slab

dcache_init() and inode_init(), together with dcache_init_early() and inode_init_early(), create the slab caches and hash tables for struct dentry and struct inode.
The objects live in the slab cache and the hash table serves as an index over them, a typical space-for-time trade-off.
The variants with and without the early suffix exist because the creation time of the hash table depends on whether the hash is distributed across NUMA nodes: if it is, creation is deferred until vmalloc space is available.

start_kernel()
  |--- vfs_caches_init_early()  # fs/dcache.c
  |      |--- dcache_init_early()
  |      |      |--- # fs/dcache.c
  |      |      |--- # allocates the global variable struct hlist_bl_head *dentry_hashtable (alloc_large_system_hash)
  |      |      |--- # table name: Dentry cache,   object: struct hlist_bl_head
  |      |--- inode_init_early()
  |      |      |--- # fs/inode.c
  |      |      |--- # allocates the global variable struct hlist_head *inode_hashtable (alloc_large_system_hash)
  |      |      |--- # table name: Inode-cache,   object: struct hlist_head
  |--- mm_init()    # memory initialization, kmem_cache_init
  |--- vfs_caches_init()
  |      |--- # fs/dcache.c
  |      |--- # creates the global variable struct kmem_cache *names_cachep, slab name: names_cache,   object: 4 KiB char buffer (path name)
  |      |--- dcache_init()
  |      |      |--- # fs/dcache.c
  |      |      |--- # creates the global variable struct kmem_cache *dentry_cache, slab name: dentry,   object: struct dentry
  |      |--- inode_init()
  |      |      |--- # fs/inode.c
  |      |      |--- # creates the global variable struct kmem_cache *inode_cachep, slab name: inode_cache,   object: struct inode
  |      |--- files_init()
  |      |      |--- # fs/file_table.c
  |      |      |--- # creates the global variable struct kmem_cache *filp_cachep, slab name: filp,  object: struct file
  |      |      |--- # initializes the global variable struct percpu_counter nr_files
  |      |--- files_maxfiles_init()
  |      |      |--- # fs/file_table.c
  |      |      |--- # initializes the global variable struct files_stat_struct files_stat.max_files
  |      |--- mnt_init()
  |      |      |--- # fs/namespace.c
  |      |      |--- # creates the global variable struct kmem_cache *mnt_cache, slab name: mnt_cache, object: struct mount
  |      |      |--- # allocates the global variable struct hlist_head *mount_hashtable (alloc_large_system_hash), name: Mount-cache, object: struct hlist_head
  |      |      |--- # allocates the global variable struct hlist_head *mountpoint_hashtable (alloc_large_system_hash), name: Mountpoint-cache, object: struct hlist_head
  |      |      |--- kernfs_init()
  |      |      |      |--- # fs/kernfs/mount.c
  |      |      |      |--- # creates the slab cache for the global variable kernfs_node_cache, name: kernfs_node_cache, object: struct kernfs_node
  |      |      |      |--- # creates the slab cache for the global variable kernfs_iattrs_cache, name: kernfs_iattrs_cache, object: struct kernfs_iattrs
  |      |      |--- sysfs_init()
  |      |      |      |--- # fs/sysfs/mount.c
  |      |      |      |--- kernfs_create_root()  # creates a new kernfs hierarchy; the return value is stored in the global variable struct kernfs_root *sysfs_root
  |      |      |      |--- # initializes the global variable struct kernfs_node *sysfs_root_kn to sysfs_root->kn
  |      |      |      |--- register_filesystem(&sysfs_fs_type)  # registers the file system named sysfs
  |      |      |--- kobject_create_and_add("fs", NULL)
  |      |      |      |--- # lib/kobject.c
  |      |      |      |--- # creates a struct kobject and registers it with sysfs, where it appears as the fs directory; the return value is stored in the global variable struct kobject *fs_kobj
  |      |      |--- shmem_init()
  |      |      |      |--- # mm/shmem.c
  |      |      |      |--- # creates the slab cache for the global variable shmem_inode_cachep, name: shmem_inode_cache, object: struct shmem_inode_info
  |      |      |      |--- register_filesystem(&shmem_fs_type)
  |      |      |      |--- kern_mount(&shmem_fs_type)
  |      |      |      |      |--- # fs/namespace.c
  |      |      |      |      |--- # the return value is stored in the global variable struct vfsmount *shm_mnt
  |      |      |--- init_rootfs()
  |      |      |      |--- # init/do_mounts.c
  |      |      |      |--- # sets the global variable bool is_tmpfs to true when the conditions hold; it is used in rootfs_init_fs_context
  |      |      |--- init_mount_tree()  # mounts the rootfs file system, expanded separately below
  |      |--- bdev_cache_init()
  |      |      |--- # block/bdev.c
  |      |      |--- # creates the slab cache for the global variable bdev_cachep, name: bdev_cache, object: struct bdev_inode
  |      |      |--- register_filesystem(&bd_type)
  |      |--- chrdev_init()  # fs/char_dev.c  allocates (kmalloc) and initializes the global variable struct kobj_map *cdev_map
  |--- arch_call_rest_init()
         |--- rest_init()
         |      |--- kernel_thread(kernel_init, NULL, CLONE_FS)
         |      |      |--- kernel_init()
         |      |      |      |--- kernel_init_freeable()
         |      |      |      |      |--- do_basic_setup()
         |      |      |      |      |      |--- driver_init()
         |      |      |      |      |      |--- do_initcalls()
         |      |      |      |      |      |      |--- rootfs_initcall(populate_rootfs)
         |      |      |      |      |      |      |      |--- do_populate_rootfs()
         |      |      |      |      |      |      |      |      |--- # init/initramfs.c unpack_to_rootfs unpacks the initrd into rootfs
         |      |      |      |--- run_init_process(ramdisk_execute_command) # runs the /init program in rootfs

The init_mount_tree() function

static void __init init_mount_tree(void)        // fs/namespace.c
{
    struct vfsmount *mnt;
    struct mount *m;
    struct mnt_namespace *ns;
    struct path root;

    /* Mount the rootfs file system; the super_block is created along the way */
    mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", NULL);
    if (IS_ERR(mnt))
        panic("Can't create rootfs");

    /* Create the mount namespace */
    ns = alloc_mnt_ns(&init_user_ns, false);
    if (IS_ERR(ns))
        panic("Can't allocate initial namespace");
    m = real_mount(mnt);
    m->mnt_ns = ns;
    ns->root = m;
    ns->mounts = 1;
    list_add(&m->mnt_list, &ns->list);
    init_task.nsproxy->mnt_ns = ns;
    get_mnt_ns(ns);

    root.mnt = mnt;
    root.dentry = mnt->mnt_root;
    mnt->mnt_flags |= MNT_LOCKED;

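    /* Set both the root directory and the current working directory to the root of rootfs, so that init_task can see the whole kernel root file system. */
    /* When init_task creates child processes, its root and current working directory are passed on to them. */
    set_fs_pwd(current->fs, &root);     /* current->fs->pwd = root */
    set_fs_root(current->fs, &root);    /* current->fs->root = root */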
}

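The main flow of vfs_kern_mount() is: create the fs_context -> create the super_block -> create the inode -> create the dentry -> create the vfsmount/mount.
At this point the kernel has no root file system yet, so the mount cannot be attached to a mount point.
In fact, the root dentry of the rootfs file system created here is the root dentry of the initial kernel root file system.
rootfs is empty at this point. Later in boot, during subsystem initialization, the kernel calls populate_rootfs() to unpack the contents of the initramfs into rootfs.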

vfs_kern_mount(type: &rootfs_fs_type, flags: 0, name: "rootfs", data: NULL)
  |--- struct fs_context *fc;
  |--- struct vfsmount *mnt;
  |--- fc = fs_context_for_mount(rootfs_fs_type, 0);
  |       |--- alloc_fs_context(rootfs_fs_type, reference: NULL, 0, 0, FS_CONTEXT_FOR_MOUNT);
  |       |      |--- fc->fs_type->init_fs_context(fc);  //rootfs_init_fs_context
  |       |      |      |--- ramfs_init_fs_context(fc);
  |       |      |      |      |--- fc->ops = &ramfs_context_ops;
  |--- mnt = fc_mount(fc);
  |       |--- vfs_get_tree(fc);
  |       |      |--- fc->ops->get_tree(fc); // ramfs_context_ops->get_tree -> ramfs_get_tree
  |       |      |      |--- ramfs_get_tree(fc);
  |       |      |      |      |--- get_tree_nodev(fc, ramfs_fill_super);
  |       |      |      |      |      |--- vfs_get_super(fc, vfs_get_independent_super, fill_super);
  |       |      |      |      |      |      |--- struct super_block *sb = sget_fc(fc, test, set_anon_super_fc);
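  |       |      |      |      |      |      |      |--- # fs/super.c: creates the super_block from the fs_context
  |       |      |      |      |      |      |--- fill_super(sb, fc); -> ramfs_fill_super(sb, fc);
  |       |      |      |      |      |      |      |--- struct inode *inode = ramfs_get_inode()  # fs/ramfs/inode.c: creates the root inode
  |       |      |      |      |      |      |      |--- sb->s_root = d_make_root(inode)  # creates the root dentry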
  |       |--- vfs_create_mount(fc);
  |       |      |--- struct mount *mnt = alloc_vfsmnt(fc->source ?: "none");
  |       |      |--- return &mnt->mnt;
  |--- return mnt;

static const struct fs_context_operations ramfs_context_ops = {   // fs/ramfs/inode.c
    .free        = ramfs_free_fc,
    .parse_param    = ramfs_parse_param,
    .get_tree    = ramfs_get_tree,
};

static int rootfs_init_fs_context(struct fs_context *fc)    // init/do_mounts.c
{
    if (IS_ENABLED(CONFIG_TMPFS) && is_tmpfs)
        return shmem_init_fs_context(fc);

    return ramfs_init_fs_context(fc);
}

struct file_system_type rootfs_fs_type = {      // init/do_mounts.c
    .name        = "rootfs",
    .init_fs_context = rootfs_init_fs_context,
    .kill_sb    = kill_litter_super,
};

The corresponding initialization messages show up in the system log

[    0.123936] Dentry cache hash table entries: 1048576 (order: 11, 8388608 bytes, linear)
[    0.123936] Inode-cache hash table entries: 524288 (order: 10, 4194304 bytes, linear)
[    0.811286] Mount-cache hash table entries: 16384 (order: 5, 131072 bytes, linear)
[    0.815888] Mountpoint-cache hash table entries: 16384 (order: 5, 131072 bytes, linear)
[    1.150965] devtmpfs: initialized
[   10.327085] VFS: Disk quotas dquot_6.6.0
[   10.328635] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[   11.294724] Trying to unpack rootfs image as initramfs...
[   15.765063] Freeing initrd memory: 104932K
[   16.347904] Run /init as init process

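Common file systems

ramfs

A simple memory-based file system type, built entirely on instances of the VFS data structures. It has no size limit, and file contents cannot be swapped out to swap space.

tmpfs

An enhanced version of ramfs: it enforces a size limit, and file contents can be swapped out. It requires the CONFIG_TMPFS configuration option. Besides serving as the kernel root file system, it also backs the shared memory used for inter-process communication, among other things.

rootfs

The initial root file system type at boot, which is either ramfs or tmpfs. The kernel picks tmpfs only when all of the following hold, and ramfs otherwise:
(1) CONFIG_TMPFS is enabled, so tmpfs is supported
(2) the rootfstype= command-line parameter contains tmpfs or is not given
(3) the root= command-line parameter is not given

This decision is made in init_rootfs().
The command-line parameters can be inspected in /proc/cmdline or in /boot/grub/grub.cfg.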

initramfs

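initramfs holds the contents of the initial root file system. It is a .cpio archive that is placed in the init section of the kernel image at link time. In do_basic_setup(), during subsystem initialization, the kernel calls populate_rootfs() (init/initramfs.c) to unpack the initramfs into the root file system. The initramfs has default contents (/usr/); a configuration option lets the user name directories to include, and at kernel build time the contents of those directories are packed into the initramfs as a .cpio archive. Passing root file system contents through an initramfs requires the BLK_DEV_INITRD configuration option and an "initrd= xxx" setting.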

procfs

sysfs

sysfs is a memory-based file system that exposes kernel information to user programs as files. It is mounted at /sys.

devtmpfs

xfs

nfs

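How processes relate to the file system

Every process has a root directory and a current working directory (represented by struct fs_struct), each pointing to a directory in the kernel root file system. The root directory is where the process's view of the kernel root file system begins: everything above it is invisible to the process, and only what lies below it can be seen. What a process sees is therefore a part of the kernel root file system. The current working directory is where the process searches for and opens files when no other starting point is given.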