WAPBL(9) | Kernel Developer's Manual | WAPBL(9) |
WAPBL, wapbl_start, wapbl_stop, wapbl_begin, wapbl_end, wapbl_flush, wapbl_discard, wapbl_add_buf, wapbl_remove_buf, wapbl_resize_buf, wapbl_register_inode, wapbl_unregister_inode, wapbl_register_deallocation, wapbl_jlock_assert, wapbl_junlock_assert - write-ahead physical block logging for file systems
#include <sys/wapbl.h>

typedef void (*wapbl_flush_fn_t)(struct mount *, daddr_t *, int *, int);

int
wapbl_start(struct wapbl **wlp, struct mount *mp, struct vnode *devvp,
    daddr_t off, size_t count, size_t blksize, struct wapbl_replay *wr,
    wapbl_flush_fn_t flushfn, wapbl_flush_fn_t flushabortfn);

int
wapbl_stop(struct wapbl *wl, int force);

int
wapbl_begin(struct wapbl *wl, const char *file, int line);

void
wapbl_end(struct wapbl *wl);

int
wapbl_flush(struct wapbl *wl, int wait);

void
wapbl_discard(struct wapbl *wl);

void
wapbl_add_buf(struct wapbl *wl, struct buf *bp);

void
wapbl_remove_buf(struct wapbl *wl, struct buf *bp);

void
wapbl_resize_buf(struct wapbl *wl, struct buf *bp, long oldsz, long oldcnt);

void
wapbl_register_inode(struct wapbl *wl, ino_t ino, mode_t mode);

void
wapbl_unregister_inode(struct wapbl *wl, ino_t ino, mode_t mode);

void
wapbl_register_deallocation(struct wapbl *wl, daddr_t blk, int len);

void
wapbl_jlock_assert(struct wapbl *wl);

void
wapbl_junlock_assert(struct wapbl *wl);
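WAPBL, or write-ahead physical block logging, is an abstraction for file systems to write physical blocks in the buffercache(9) to a bounded-size log first before their real destinations on disk. The name means:
logging
    batches of writes are issued to disk atomically via a log
physical block
    the log records physical blocks from the buffercache(9)
write-ahead
    a block is written to the log before it is written to its real destination on disk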
When a file system using WAPBL issues writes (as in bwrite(9) or bdwrite(9)), they are grouped in batches called transactions in memory, which are serialized to be consistent with program order before WAPBL submits them to disk atomically.
Thus, within a transaction, after one write, another write need not wait for disk I/O, and if the system is interrupted, e.g. by a crash or by power failure, either both writes will appear on disk, or neither will.
When a transaction is full, it is written to a circular buffer on disk called the log. When the transaction has been written to disk, every write in the transaction is submitted to disk asynchronously. Finally, the file system may issue new writes via WAPBL once enough writes submitted to disk have completed.
After an interruption, such as a crash or power failure, some writes issued by the file system may not have completed. However, the log is written consistently with program order and before file system writes are submitted to disk. Hence a consistent program-order view of the file system can be attained by resubmitting the writes that were successfully stored in the log using wapbl_replay(9). This may not be the same as the state just before the interruption: writes in transactions that did not reach the disk will be excluded.
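The log-first ordering and the replay step described above can be modeled in a few lines of userland C. This is only a toy sketch, not how WAPBL is implemented: the names toydisk, toytxn, txn_write, txn_commit and log_replay are invented for illustration, the "disk" is an array of ints, and the log holds a single transaction instead of a circular buffer.

```c
#include <assert.h>
#include <string.h>

#define NBLOCKS 8	/* size of the toy "disk" */
#define LOGCAP  4	/* writes that fit in one transaction */

struct logrec { int blkno; int value; };

struct toydisk {
	int blocks[NBLOCKS];		/* real destinations of the blocks */
	struct logrec log[LOGCAP];	/* on-disk log area */
	int nlogged;			/* committed records; 0 = empty log */
};

struct toytxn {				/* in-memory transaction */
	struct logrec recs[LOGCAP];
	int nrecs;
};

/* Queue a write in memory; nothing touches the disk yet. */
int
txn_write(struct toytxn *t, int blkno, int value)
{
	if (t->nrecs == LOGCAP || blkno < 0 || blkno >= NBLOCKS)
		return -1;
	t->recs[t->nrecs].blkno = blkno;
	t->recs[t->nrecs].value = value;
	t->nrecs++;
	return 0;
}

/* Commit: copy the records to the log, then publish the count. */
void
txn_commit(struct toydisk *d, struct toytxn *t)
{
	assert(t->nrecs <= LOGCAP);
	memcpy(d->log, t->recs, sizeof(t->recs[0]) * (size_t)t->nrecs);
	d->nlogged = t->nrecs;		/* the commit point in this toy */
	t->nrecs = 0;
}

/* Replay: resubmit every logged write to its real destination. */
void
log_replay(struct toydisk *d)
{
	for (int i = 0; i < d->nlogged; i++)
		d->blocks[d->log[i].blkno] = d->log[i].value;
	d->nlogged = 0;
}
```

In this model, a crash before txn_commit() leaves the log empty, so replay changes nothing; a crash after it lets replay restore every write in the transaction.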
For a file system to use WAPBL, its VFS_MOUNT(9) method should first replay any journal on disk using wapbl_replay(9), and then, if the mount is read/write, initialize WAPBL for the mount by calling wapbl_start(). The VFS_UNMOUNT(9) method should call wapbl_stop().
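The mount and unmount sequence above might be wired up roughly as follows. This is a sketch under stated assumptions: it is written for userland, so struct wapbl, struct mount, struct vnode, struct wapbl_replay and the two WAPBL functions are trivial stand-ins that only mimic the signatures from the synopsis; blkaddr_t stands in for daddr_t; and struct myfs, myfs_mount_log, myfs_unmount_log and the callbacks are hypothetical names. The replay step is elided, so wr is NULL here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ---- Userland stand-ins so the sketch compiles; NOT kernel code ---- */
typedef int64_t blkaddr_t;		/* stand-in for the kernel's daddr_t */
struct mount { int unused; };
struct vnode { int unused; };
struct wapbl_replay { int unused; };
struct wapbl { int running; };		/* opaque in the kernel */
typedef void (*wapbl_flush_fn_t)(struct mount *, blkaddr_t *, int *, int);

static struct wapbl toy_log;

int
wapbl_start(struct wapbl **wlp, struct mount *mp, struct vnode *devvp,
    blkaddr_t off, size_t count, size_t blksize, struct wapbl_replay *wr,
    wapbl_flush_fn_t flushfn, wapbl_flush_fn_t flushabortfn)
{
	(void)mp; (void)devvp; (void)off; (void)count; (void)blksize;
	(void)wr; (void)flushfn; (void)flushabortfn;
	toy_log.running = 1;
	*wlp = &toy_log;
	return 0;
}

int
wapbl_stop(struct wapbl *wl, int force)
{
	(void)force;
	wl->running = 0;
	return 0;
}

/* ---- Hypothetical file system "myfs" ---- */
struct myfs {
	struct mount mnt;
	struct vnode devvp;
	struct wapbl *wl;		/* NULL while no log is active */
	blkaddr_t logoff;		/* disk address of the log */
	size_t logcount, blksize;	/* log size in sectors, block size */
};

void
myfs_flush(struct mount *mp, blkaddr_t *blks, int *lens, int cnt)
{
	(void)mp; (void)blks; (void)lens; (void)cnt;
}

void
myfs_flush_abort(struct mount *mp, blkaddr_t *blks, int *lens, int cnt)
{
	(void)mp; (void)blks; (void)lens; (void)cnt;
}

/* Called from the mount path after any journal replay (elided here). */
int
myfs_mount_log(struct myfs *fs, int rdonly)
{
	assert(fs->wl == NULL);
	if (rdonly)
		return 0;		/* read-only mounts do not start a log */
	return wapbl_start(&fs->wl, &fs->mnt, &fs->devvp, fs->logoff,
	    fs->logcount, fs->blksize, NULL, myfs_flush, myfs_flush_abort);
}

/* Called from the unmount path. */
int
myfs_unmount_log(struct myfs *fs, int force)
{
	int error;

	if (fs->wl == NULL)
		return 0;
	error = wapbl_stop(fs->wl, force);
	if (error == 0)
		fs->wl = NULL;		/* the cookie is no longer valid */
	return error;
}
```

If the log had been replayed, the replay cookie would be passed in place of NULL.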
Before issuing any buffercache(9) writes, the file system must acquire a shared lock on the current WAPBL transaction with wapbl_begin(), which may sleep until there is room in the transaction for new writes. After issuing the writes, the file system must release its shared lock on the transaction with wapbl_end(). Either all writes issued between wapbl_begin() and wapbl_end() will complete, or none of them will.
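The bracketing rule above can be illustrated with a small userland sketch. The definitions of struct wapbl, wapbl_begin and wapbl_end below are toy stand-ins that only count lock holders, not the kernel's implementation, and myfs_issue_writes and myfs_update are hypothetical names for a file system's own code.

```c
#include <assert.h>
#include <errno.h>

/* ---- Userland stand-ins; NOT the kernel implementation ---- */
struct wapbl {
	int shared;	/* number of shared lock holders */
	int nwrites;	/* writes issued in the current transaction */
};

int
wapbl_begin(struct wapbl *wl, const char *file, int line)
{
	(void)file; (void)line;	/* only used for debugging */
	wl->shared++;
	return 0;
}

void
wapbl_end(struct wapbl *wl)
{
	assert(wl->shared > 0);
	wl->shared--;
}

/* Hypothetical helper standing in for bwrite(9)/bdwrite(9) calls. */
int
myfs_issue_writes(struct wapbl *wl, int simulate_error)
{
	if (simulate_error)
		return EIO;
	wl->nwrites += 2;	/* e.g. an inode block and a bitmap block */
	return 0;
}

/* Hypothetical metadata update wrapped in one transaction. */
int
myfs_update(struct wapbl *wl, int simulate_error)
{
	int error;

	error = wapbl_begin(wl, __FILE__, __LINE__);
	if (error)
		return error;	/* no lock held, so no wapbl_end() */
	error = myfs_issue_writes(wl, simulate_error);
	wapbl_end(wl);		/* release on success and on failure */
	return error;
}
```

The point of the shape is that every successful wapbl_begin() is matched by exactly one wapbl_end() on every return path.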
File systems may also witness an exclusive lock on the current transaction when WAPBL, while flushing the transaction to disk or aborting a flush, invokes a file system's callback. File systems can assert that the transaction is locked with wapbl_jlock_assert(), or that it is not exclusively locked with wapbl_junlock_assert().
If a file system requires multiple transactions to initialize an inode, and needs to destroy partially initialized inodes during replay, it can register them by ino_t inode number before initialization with wapbl_register_inode() and unregister them with wapbl_unregister_inode() once initialization is complete. WAPBL does not actually concern itself with whether the objects identified by ino_t values are ‘inodes’ or ‘quaggas’ or anything else; file systems may use this to list any objects keyed by ino_t value in the log.
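The register/unregister bookkeeping can be pictured as a small set keyed by ino_t: whatever is still in the set when a crash occurs is what the file system would need to clean up during replay. The sketch below is a userland toy under that assumption. The struct wapbl layout, the fixed-size table and toy_inode_pending are invented for illustration; only the two function signatures come from the synopsis.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>		/* ino_t, mode_t */

#define TOY_MAXINO 16		/* arbitrary capacity of the toy table */

/* ---- Userland stand-in; NOT the kernel's struct wapbl ---- */
struct wapbl {
	ino_t  inos[TOY_MAXINO];
	mode_t modes[TOY_MAXINO];
	size_t n;		/* number of registered entries */
};

void
wapbl_register_inode(struct wapbl *wl, ino_t ino, mode_t mode)
{
	assert(wl->n < TOY_MAXINO);
	wl->inos[wl->n] = ino;
	wl->modes[wl->n] = mode;
	wl->n++;
}

void
wapbl_unregister_inode(struct wapbl *wl, ino_t ino, mode_t mode)
{
	(void)mode;
	for (size_t i = 0; i < wl->n; i++) {
		if (wl->inos[i] == ino) {
			/* Move the last entry into the hole. */
			wl->n--;
			wl->inos[i] = wl->inos[wl->n];
			wl->modes[i] = wl->modes[wl->n];
			return;
		}
	}
}

/* Hypothetical query: is ino still marked as partially initialized? */
int
toy_inode_pending(const struct wapbl *wl, ino_t ino)
{
	for (size_t i = 0; i < wl->n; i++)
		if (wl->inos[i] == ino)
			return 1;
	return 0;
}
```

A file system would register an inode before the first transaction that touches it and unregister it in the transaction that completes its initialization.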
When a file system frees resources on disk and issues writes to reflect the fact, it cannot then reuse the resources until the writes have reached the disk. However, as far as the buffercache(9) is concerned, as soon as the file system issues the writes, they will appear to have been written. So the file system must not attempt to reuse the resource until the current WAPBL transaction has been flushed to disk.
The file system can defer freeing a resource by calling wapbl_register_deallocation() to record the disk address and the length in bytes of the resource. Then, when WAPBL next flushes the transaction to disk, it will pass an array of the disk addresses and an array of the lengths in bytes to a file-system-supplied callback. (Again, WAPBL does not care whether the ‘disk address’ or ‘length in bytes’ is actually that; it will pass along daddr_t and int values.)
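A flush callback and its abort counterpart could look something like the following userland sketch. Everything specific here is an assumption made for illustration: struct mount is reduced to a hypothetical free-block map, blkaddr_t stands in for daddr_t, the myfs_ names are invented, and each length is treated as a count of blocks, which the text permits since WAPBL passes the values through uninterpreted.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t blkaddr_t;	/* stand-in for the kernel's daddr_t */

#define MYFS_NBLOCKS 64		/* hypothetical toy volume size */

/* Stand-in for struct mount: only a hypothetical free-block map. */
struct mount {
	unsigned char free_map[MYFS_NBLOCKS];	/* 1 = free, 0 = in use */
};

/* Shape of wapbl_flush_fn_t, with blkaddr_t in place of daddr_t. */
typedef void (*flush_fn_t)(struct mount *, blkaddr_t *, int *, int);

/* Set the map entry for every block of every registered extent. */
static void
myfs_mark(struct mount *mp, blkaddr_t *blks, int *lens, int cnt,
    unsigned char val)
{
	for (int i = 0; i < cnt; i++) {
		assert(blks[i] >= 0 && blks[i] + lens[i] <= MYFS_NBLOCKS);
		for (int j = 0; j < lens[i]; j++)
			mp->free_map[blks[i] + j] = val;
	}
}

/* flushfn: the deferred frees take effect as the transaction flushes. */
void
myfs_flush(struct mount *mp, blkaddr_t *blks, int *lens, int cnt)
{
	myfs_mark(mp, blks, lens, cnt, 1);
}

/* flushabortfn: undo the effects of myfs_flush() if the flush fails. */
void
myfs_flush_abort(struct mount *mp, blkaddr_t *blks, int *lens, int cnt)
{
	myfs_mark(mp, blks, lens, cnt, 0);
}
```

Keeping the two callbacks as exact inverses makes it straightforward for the abort path to restore the state from before the flush attempt.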
wapbl_start(wlp, mp, devvp, off, count, blksize, wr, flushfn, flushabortfn)
    Start using WAPBL for the file system mounted at mp, storing a log of count disk sectors at disk address off on the block device devvp, writing blocks in units of blksize bytes. On success, stores an opaque struct wapbl * cookie in *wlp for use with the other WAPBL routines and returns zero. On failure, returns an error number.

    If the file system had replayed the log with wapbl_replay(9), then wr must be the struct wapbl_replay * cookie used to replay it, and wapbl_start() will register any inodes that were in the log as if with wapbl_register_inode(); otherwise wr must be NULL.

    flushfn is a callback that WAPBL will invoke as flushfn(mp, deallocblks, dealloclens, dealloccnt) just before it flushes a transaction to disk, with an exclusive lock held on the transaction, where mp is the mount point passed to wapbl_start(), deallocblks is an array of dealloccnt disk addresses, and dealloclens is an array of dealloccnt lengths, corresponding to the addresses and lengths the file system passed to wapbl_register_deallocation(). If flushing the transaction to disk fails, WAPBL will call flushabortfn with the same arguments to undo any effects that flushfn had.
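wapbl_stop(wl, force)
    Flush the current transaction to disk and stop using WAPBL. If flushing the transaction fails and force is zero, return an error. If flushing the transaction fails and force is nonzero, discard the transaction, permanently losing any writes in it. If flushing the transaction is successful or if force is nonzero, free memory associated with wl and return zero.

wapbl_begin(wl, file, line)
    Wait for room in the current transaction for new writes and acquire a shared lock on it.

    The lock is not exclusive: other threads may acquire shared locks on the transaction too. The lock is not recursive: a thread may not acquire it again without calling wapbl_end() first.

    May sleep.

    file and line are the file name and line number of the caller for debugging purposes.

wapbl_end(wl)
    Release a shared lock on the transaction acquired with wapbl_begin().

wapbl_flush(wl, wait)
    Flush the current transaction to disk. If wait is nonzero, wait for all writes in the current transaction to complete.

    The current transaction must not be locked.

wapbl_discard(wl)
    Discard the current transaction, permanently losing any writes in it.

    The current transaction must not be locked.

wapbl_add_buf(wl, bp)
    Add the buffer bp to the current transaction.

    This is meant to be called from within buffercache(9), not by file systems directly.

wapbl_remove_buf(wl, bp)
    Remove the buffer bp from the current transaction.

    This is meant to be called from within buffercache(9), not by file systems directly.

wapbl_resize_buf(wl, bp, oldsz, oldcnt)
    Note that the buffer bp in the current transaction has changed size, where oldsz is its previous allocated size in bytes and oldcnt is its previous number of valid bytes.

    This is meant to be called from within buffercache(9), not by file systems directly.

wapbl_register_inode(wl, ino, mode)
    Register ino with the mode mode as beginning initialization.

wapbl_unregister_inode(wl, ino, mode)
    Unregister ino with the mode mode once its initialization is complete.

wapbl_register_deallocation(wl, blk, len)
    Register len bytes at the disk address blk for deallocation, so that they will be passed to the flushfn that was given to wapbl_start().

wapbl_jlock_assert(wl)
    Assert that the current transaction is locked.

    Note that it might not be locked by the current thread: this assertion passes if any thread has it locked.

wapbl_junlock_assert(wl)
    Assert that the current transaction is not exclusively locked.

    Users of WAPBL observe exclusive locks only in the flushfn and flushabortfn callbacks to wapbl_start(). Outside of such contexts, the transaction is never exclusively locked, even between wapbl_begin() and wapbl_end().

    There is no way to assert that the current transaction is not locked at all, i.e., that the caller may acquire a shared lock on the transaction with wapbl_begin() without danger of deadlock.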
The WAPBL subsystem is implemented in sys/kern/vfs_wapbl.c, with hooks in sys/kern/vfs_bio.c.
WAPBL works only for file system metadata managed via the buffercache(9), and provides no way to log writes via the page cache, which is normally used for file data, as in VOP_GETPAGES(9), VOP_PUTPAGES(9), and ubc_uiomove(9).
Not only is WAPBL unable to log writes via the page cache, it is also unable to defer buffercache(9) writes until cached pages have been written. This manifests as the well-known garbage-data-appended-after-crash bug in FFS: when appending to a file, the pages containing new data may not reach the disk before the inode update reporting its new size. After a crash, the inode update will be on disk, but the new data will not be; instead, whatever garbage data was in the free space will appear to have been appended to the file.
WAPBL exacerbates the problem by increasing the throughput of metadata writes, because it can issue many metadata writes asynchronously that FFS without WAPBL would need to issue synchronously in order for fsck(8) to work.
The criteria for when the transaction must be flushed to disk before wapbl_begin() returns are heuristic, i.e. wrong. There is no way for a file system to communicate to wapbl_begin() how many buffers, inodes, and deallocations it will issue via WAPBL in the transaction.
WAPBL mainly supports write-ahead, and has only limited support for rolling back operations, in the form of wapbl_register_inode() and wapbl_unregister_inode(). Consequently, for example, a large write appending to a file, which requires multiple disk block allocations and an inode update, must occur in a single transaction: there is no way to roll back the disk block allocations if the write fails in the middle, e.g. because of a fault in the middle of the user buffer.
wapbl_jlock_assert() does not guarantee that the current thread has the current transaction locked. wapbl_junlock_assert() does not guarantee that the current thread does not have the current transaction locked at all.
There is only one WAPBL transaction for each file system at any given time, and only one WAPBL log on disk. Consequently, all writes are serialized. Extending WAPBL to support multiple logs per file system, partitioned according to an appropriate scheme, is left as an exercise for the reader.
There is no reason for WAPBL to require its own hooks in buffercache(9).
The on-disk format used by WAPBL is undocumented.
March 26, 2015 | NetBSD 9.0 |