Debugging the DDlog version of ovn-northd¶
This document gives some tips for debugging correctness issues in the
DDlog implementation of ovn-northd
. To keep things conrete, we
assume here that a failure occurred in one of the test cases in
ovn-e2e.at
, but the same methodology applies in any other
environment. If none of these methods helps, ask for assistance or
submit a bug report.
Before trying these methods, you may want to check the northd log
file, tests/testsuite.dir/<test_number>/northd/ovn-northd.log
for
error messages that might explain the failure.
Compare OVSDB tables generated by DDlog vs C¶
The first thing I typically want to check when ovn-northd-ddlog
does not behave as expected is how the OVSDB tables computed by DDlog
differ from what the C implementation produces. Fortunately, all the
infrastructure needed to do this already exists in OVN.
First, let’s modify the test script, e.g., ovn.at
to dump the
contents of OVSDB right before the failure. The most common issue is
a difference between the logical flows generated by the two
implementations. To make it easy to compare the generated flows, make
sure that the test contains something like this in the right place:
ovn-sbctl dump-flows > sbflows
AT_CAPTURE_FILE([sbflows])
The first line above dumps the OVN logical flow table to a file named
sbflows
. The second line ensures that, if the test fails,
sbflows
get logged to testsuite.log
. That is not particularly
useful for us right now, but it means that if someone later submits a
bug report, that’s one more piece of data that we don’t have to ask
for them to submit along with it.
Next, we want to run the test twice, with the C and DDlog versions of
northd, e.g., make check -j6 TESTSUITEFLAGS="-d 111 112"
if 111
and 112 are the C and DDlog versions of the same test. The -d
in
this command line makes the test driver keep test directories around
even for tests that succeed, since by default it deletes them.
Now you can look at sbflows
in each test log directory. The
ovn-northd-ddlog
developers have gone to some trouble to make the
DDlog flows as similar as possible to the C ones, right down to white
space and other formatting. Thus, the DDlog output is often identical
to C aside from logical datapath UUIDs.
Usually, this means that one can get informative results by running
diff
, e.g.:
diff -u tests/testsuite.dir/111/sbflows tests/testsuite.dir/111/sbflows
Running the input through the uuidfilt
utility from OVS will
generally get rid of the logical datapath UUID differences as well:
diff -u <(uuidfilt tests/testsuite.dir/111/sbflows) <(uuidfilt tests/testsuite.dir/111/sbflows)
If there are nontrivial differences, this often identifies your bug.
Often, once you have identified the difference between the two OVSDB dumps, this will immediately lead you to the root cause of the bug, but if you are not this lucky then the next method may help.
Record and replay DDlog execution¶
DDlog offers a way to record all input table updates throughout the execution of northd and replay them against DDlog running as a standalone executable without all other OVN components. This has two advantages. First, this allows one to easily tweak the inputs, e.g. to simplify the test scenario. Second, the recorded execution can be easily replayed anywhere without having to reproduce your OVN setup.
Use the --ddlog-record
option to record updates,
e.g. --ddlog-record=replay.dat
to record to replay.dat
.
(OVN’s built-in tests automatically do this.) The file contains the
log of transactions in the DDlog command format (see
https://github.com/vmware/differential-datalog/blob/master/doc/command_reference/command_reference.md).
To replay the log, you will need the standalone DDlog executable. By
default, the build system does not compile this program, because it
increases the already long Rust compilation time. To build it, add
NORTHD_CLI=1
to the make
command line, e.g. make
NORTHD_CLI=1
.
You can modify the log before replaying it, e.g., adding dump
<table>
commands to dump the contents of relations at various points
during execution. The <table> name must be fully qualified based on
the file in which it is declared, e.g. OVN_Southbound::<table>
for
southbound tables or lrouter::<table>.
for lrouter.dl
. You
can also use dump
without an argument to dump the contents of all
tables.
The following command replays the log generated by OVN test number
112 and dumps the output of DDlog to replay.dump
:
northd/ovn_northd_ddlog/target/release/ovn_northd_cli < tests/testsuite.dir/112/northd/replay.dat > replay.dump
Or, to dump just the table contents following the run, without having
to edit replay.dat
:
(cat tests/testsuite.dir/112/northd/replay.dat; echo 'dump;') | northd/ovn_northd_ddlog/target/release/ovn_northd_cli --no-delta --no-init-snapshot > replay.dump
Depending on whether and how you installed OVS and OVN, you might need
to point LD_LIBRARY_PATH
to library build directories to get the
CLI to run, e.g.:
export LD_LIBRARY_PATH=$HOME/ovn/_build/lib/.libs:$HOME/ovs/_build/lib/.libs
Note
The replay output may be less informative than you expect because
DDlog does not, by default, keep around enough information to
include input relation and intermediate relations in the output.
These relations are often critical to understanding what is going
on. To include them, add the options
--output-internal-relations --output-input-relations=In_
to
DDLOG_EXTRA_FLAGS
for building ovn-northd-ddlog
. For
example, configure
as:
./configure DDLOG_EXTRA_FLAGS='--output-internal-relations --output-input-relations=In_'
Debugging by Logging¶
One limitation of the previous method is that it allows one to inspect inputs and outputs of a rule, but not the (sometimes fairly complicated) computation that goes on inside the rule. You can of course break up the rule into several rules and dump the intermediate outputs.
There are at least two alternatives for generating log messages.
First, you can write rules to add strings to the Warning relation
declared in ovn_north.dl
. Code in ovn-northd-ddlog.c
will log
any given string in this relation just once, when it is first added to
the relation. (If it is removed from the relation and then added back
later, it will be logged again.)
Second, you can call using the warn()
function declared in
ovn.dl
from a DDlog rule. It’s not straightforward to know
exactly when this function will be called, like it would be in an
imperative language like C, since DDlog is a declarative language
where the user doesn’t directly control when rules are triggered. You
might, for example, see the rule being triggered multiple times with
the same input. Nevertheless, this debugging technique is useful in
practice.
You will find many examples of the use of Warning and warn
in
ovn_northd.dl
, where it is frequently used to report non-critical
errors.
Debugging panics¶
TODO: update these instructions as DDlog’s internal handling of panic’s is improved.
DDlog is a safe language, so DDlog programs normally do not crash, except for the following three cases:
- A panic in a Rust function imported to DDlog as
extern function
. - A panic in a C function imported to DDlog as
extern function
. - A bug in the DDlog runtime or libraries.
Below we walk through the steps involved in debugging such failures.
In this scenario, there is an array-index-out-of-bounds error in the
ovn_scan_static_dynamic_ip6()
function, which is written in Rust
and imported to DDlog as an extern function
. When invoked from a
DDlog rule, this function causes a panic in one of DDlog worker
threads.
Step 1: Check for error messages in the northd log. A panic can
generally lead to unpredictable outcomes, so one cannot count on a
clean error message showing up in the log (Other outcomes include
crashing the entire process and even deadlocks. We are working to
eliminate the latter possibility). In this case we are lucky to
observe a bunch of error messages like the following in the northd
log:
2019-09-23T16:23:24.549Z|00011|ovn_northd|ERR|ddlog_transaction_commit(): error: failed to receive flush ack message from timely dataflow thread
These messages are telling us that something is broken inside the DDlog runtime.
Step 2: Record and replay the failing scenario. We use DDlog’s record/replay capabilities (see above) to capture the faulty scenario. We replay the recorded trace:
northd/ovn_northd_ddlog/target/release/ovn_northd_cli < tests/testsuite.dir/117/northd/replay.dat
This generates a bunch of output ending with:
thread 'worker thread 2' panicked at 'index out of bounds: the len is 1 but the index is 1', /rustc/eae3437dfe991621e8afdc82734f4a172d7ddf9b/src/libcore/slice/mod.rs:2681:10
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace.
We re-run the CLI again with backtrace enabled (as suggested by the error message):
RUST_BACKTRACE=1 northd/ovn_northd_ddlog/target/release/ovn_northd_cli < tests/testsuite.dir/117/northd/replay.dat
This finally yields the following stack trace, which suggests array
bound violation in ovn_scan_static_dynamic_ip6
:
0: backtrace::backtrace::libunwind::trace
at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.29 10: core::panicking::panic_bounds_check
at src/libcore/panicking.rs:61
[SKIPPED]
11: ovn_northd_ddlog::__ovn::ovn_scan_static_dynamic_ip6
12: ovn_northd_ddlog::prog::__f
[SKIPPED]
Finally, looking at the source code of
ovn_scan_static_dynamic_ip6
, we identify the following line,
containing an unsafe array indexing operator, as the culprit:
ovn_ipv6_parse(&f[1].to_string())
Clean build¶
Occasionally it’s desirable to a full and complete build of the
DDlog-generated code. To trigger that, delete the generated
ovn_northd_ddlog
directory and the ddlog.stamp
witness file,
like this:
rm -rf northd/ovn_northd_ddlog northd/ddlog.stamp
or:
make clean-ddlog
Submitting a bug report¶
If you are having trouble with DDlog and the above methods do not
help, please submit a bug report to bugs@openvswitch.org
, CC
ryzhyk@gmail.com
.
In addition to problem description, please provide as many of the following as possible:
- Are you running with the right DDlog for the version of OVN? OVN and DDlog are both evolving and OVN needs to build against a specific version of DDlog.
replay.dat
file generated as described above- Logs:
ovn-northd.log
andtestsuite.log
, if you are running the OVN test suite