This is a multi-post series about the ODL V2 Driver.
Completing the Picture
So we have a journal that’s processing entries asynchronously and sending them to ODL; the skies are sunny and unicorns are running in the fields... OK, so no unicorns. There are, however, some issues that come with operating the journal, as well as some issues from the V1 driver that the journal didn’t address:
- How do we handle an ODL “cold reboot”?
- What to do with stale journal locks?
- What to do with completed entries?
- What to do with failed entries?
- What to do when the systems just get out of sync?
To address most of these questions we introduced a mechanism called the “Maintenance Thread”.
Let me explain this solution and the operations it performs.
The Maintenance Thread
The maintenance thread is a self-sufficient thread run by an oslo_service.loopingcall.FixedIntervalLoopingCall.
The design is very simple: interested operations register with the thread via a register_operation function, and the maintenance thread runs them all sequentially (don’t assume any ordering guarantee).
The maintenance thread takes care of one crucial aspect: a shared lock kept in the DB, used for synchronization in an HA environment where several Neutron servers might be running. On each run the maintenance thread tries to acquire the lock and continues only if successful. The same GaleraDB considerations discussed earlier apply to this code as well.
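The registration pattern can be sketched roughly as follows. This is an illustrative simplification, not the driver’s actual code: the real thread is driven by FixedIntervalLoopingCall and the lock is a DB row, which I stand in for here with a plain callable.

```python
class MaintenanceThread:
    """Toy model of the maintenance thread's registration/lock pattern."""

    def __init__(self, acquire_lock):
        self._operations = []
        self._acquire_lock = acquire_lock  # stands in for the shared DB lock

    def register_operation(self, operation):
        # Operations run sequentially on each tick; no ordering is guaranteed.
        self._operations.append(operation)

    def run_once(self):
        # Skip this run entirely if another Neutron server holds the lock.
        if not self._acquire_lock():
            return False
        for operation in self._operations:
            operation()
        return True

ran = []
thread = MaintenanceThread(acquire_lock=lambda: True)
thread.register_operation(lambda: ran.append("stale_lock_cleanup"))
thread.register_operation(lambda: ran.append("completed_rows_cleanup"))
thread.run_once()
print(ran)
```

In an HA deployment only the server that wins the lock does any work that run; the others simply return and retry on the next interval.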
To control the interval of the maintenance thread, change the ‘ml2_odl.maintenance_interval‘ configuration option.
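For example, in the plugin configuration file (the value below is just illustrative):

```ini
[ml2_odl]
# How often, in seconds, the maintenance thread wakes up.
maintenance_interval = 300
```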
ODL Full Sync
When an ODL “cold reboot” is detected, we perform a “full sync” operation.
Cold Reboot Detection
Each time this operation runs it checks for the existence of a “canary” network. This fictitious network should exist only on ODL. If it’s missing, the full sync process kicks in.
Of course, the first time the Neutron server connects to ODL there won’t be a canary network, so a full sync will be triggered; after that the canary network should always be there.
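The detection itself boils down to a single existence check. Here is a minimal sketch, assuming a hypothetical client with a get_network call that returns None for a missing network (the canary identifier below is made up):

```python
CANARY_NETWORK_ID = "canary-network"  # hypothetical identifier

def full_sync_needed(odl_client):
    """The canary network exists only in ODL; its absence signals a cold reboot."""
    return odl_client.get_network(CANARY_NETWORK_ID) is None

class FakeODLClient:
    """Stand-in for the REST client, backed by an in-memory set of network IDs."""
    def __init__(self, network_ids):
        self._network_ids = set(network_ids)

    def get_network(self, network_id):
        return {"id": network_id} if network_id in self._network_ids else None

print(full_sync_needed(FakeODLClient([])))                    # fresh or rebooted ODL
print(full_sync_needed(FakeODLClient([CANARY_NETWORK_ID])))   # canary present, no sync
```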
The full sync is a rather simple operation. First we check that a full sync isn’t already running by checking whether the journal contains an entry to create the canary network; if it does, the operation aborts.
The next step is a full journal cleanup since:
- Any pending operation (excluding create, perhaps) is bound to fail.
- Any pending create operation will be recreated anyway so no need to have an older call for it.
Finally, the operation goes over all the resources top-to-bottom (i.e. root resources first) and adds a journal entry to create each one. The last resource created is the canary network, so that if the operation fails before completing, we will simply re-run it.
This design guarantees that resource creation behaves the same way as any other resource creation handled by the journal, so we get all the benefits the journal provides, plus resiliency, since writing to the journal should be fast even at scale.
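The three steps above (abort if in flight, purge, repopulate roots-first with the canary last) can be sketched like this. The journal is a plain list of tuples here rather than a DB table, and the resource ordering is illustrative:

```python
RESOURCE_ORDER = ["network", "subnet", "port"]          # roots first; illustrative
CANARY = ("create", "network", "canary-network")        # hypothetical canary entry

def full_sync(journal, neutron_resources):
    # Step 1: a pending canary create means a full sync is already in flight.
    if CANARY in journal:
        return journal
    # Step 2: purge the journal; pending entries would fail or be recreated anyway.
    journal.clear()
    # Step 3: re-create everything top-to-bottom, canary last.
    for resource_type in RESOURCE_ORDER:
        for resource_id in neutron_resources.get(resource_type, []):
            journal.append(("create", resource_type, resource_id))
    # If we crash before this line, the canary stays missing in ODL and the
    # next maintenance run simply re-triggers the full sync.
    journal.append(CANARY)
    return journal

journal = [("update", "port", "p1")]
resources = {"network": ["n1"], "subnet": ["s1"], "port": ["p1"]}
print(full_sync(journal, resources))
```

Note that the stale ‘update‘ entry is dropped and replaced by fresh ‘create‘ entries, exactly because any pre-existing pending operation is bound to fail or be superseded.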
Stale Journal Lock Clean Up
Sometimes a journal entry can get locked indefinitely, causing it to get stuck in the journal forever and never be handled. This can happen if the thread handling the entry dies (a bug, a process crash, or a power failure, among other causes).
We introduced a maintenance operation that scans the journal table for entries in the PROCESSING state (which indicates they’re locked) whose last-update field is older than a configured interval, set by the ‘ml2_odl.processing_timeout‘ configuration option.
Any journal entry that meets these criteria gets marked back to the PENDING state, essentially pushing it to the end of the journal’s queue, so that the journal will pick it up and handle it again.
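The sweep itself is straightforward; a sketch over in-memory dicts (the real driver does this with a DB query, and the field names here are illustrative):

```python
PENDING, PROCESSING = "pending", "processing"

def reset_stale_entries(entries, processing_timeout, now):
    """Flip PROCESSING entries older than the timeout back to PENDING."""
    reset = 0
    for entry in entries:
        stale = now - entry["last_updated"] > processing_timeout
        if entry["state"] == PROCESSING and stale:
            entry["state"] = PENDING  # back to the end of the queue
            reset += 1
    return reset

entries = [
    {"id": 1, "state": PROCESSING, "last_updated": 0},   # abandoned long ago
    {"id": 2, "state": PROCESSING, "last_updated": 95},  # still being worked on
    {"id": 3, "state": PENDING, "last_updated": 0},      # not locked at all
]
print(reset_stale_entries(entries, processing_timeout=60, now=100))  # → 1
```

Only entry 1 is reset; entry 2 is within the timeout and presumed to be legitimately in progress.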
Similar logic will soon be introduced for the maintenance lock so that a stale maintenance lock doesn’t prevent the maintenance thread from running.
Completed Entries Clean Up
When the journal finishes processing an entry, the entry moves to the COMPLETED state. This indicates that the request has been handled and sent off to ODL (although it could ultimately fail there, which isn’t handled here).
Obviously, in a large-scale environment there would be plenty of these rows, which could lead to rapid DB growth, eventually causing performance problems, space problems and all sorts of mayhem in general. Therefore, it’s a good idea to get rid of the completed rows after a while.
The completed rows get cleaned by a maintenance operation, and the retention period for a row can be set via the ‘ml2_odl.completed_rows_retention‘ configuration option.
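In other words, a COMPLETED row survives only within its retention window. A sketch of the filter (again over in-memory dicts, with illustrative field names; the real operation is a DB delete):

```python
COMPLETED = "completed"

def delete_completed_rows(entries, retention, now):
    """Keep everything except COMPLETED rows older than the retention window."""
    return [
        entry for entry in entries
        if not (entry["state"] == COMPLETED
                and now - entry["last_updated"] > retention)
    ]

entries = [
    {"id": 1, "state": COMPLETED, "last_updated": 0},   # past retention, deleted
    {"id": 2, "state": COMPLETED, "last_updated": 90},  # recent, kept
    {"id": 3, "state": "pending", "last_updated": 0},   # not completed, kept
]
kept = delete_completed_rows(entries, retention=60, now=100)
print([entry["id"] for entry in kept])  # → [2, 3]
```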
Failed Entries Recovery
When the journal fails to process an entry too many times (determined by the ‘ml2_odl.retry_count‘ configuration option), the entry transitions into the FAILED state.
This state would most likely be reached due to either a bug or some mis-synchronization between the Neutron resources and their ODL counterparts.
I have already submitted a spec for implementing this capability, so instead of re-writing it here I’ll just refer you to read the spec.
The TL;DR here is that a maintenance recovery operation will try to recover failed journal entries by actually communicating with ODL and checking its state (as opposed to how the journal normally works, which is fire and forget, without checking ODL’s state at all).
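Since the capability isn’t implemented yet, the following is only a rough sketch of the idea as I read the spec: for a FAILED entry, consult ODL’s actual state and either mark the entry done or push it back through the journal. All names and the decision logic here are illustrative, not the spec’s final design:

```python
def recover_failed_entry(entry, odl_client):
    """Decide a FAILED entry's fate by consulting ODL's actual state."""
    exists = odl_client.get(entry["resource_type"], entry["resource_id"]) is not None
    if entry["operation"] == "create" and exists:
        entry["state"] = "completed"   # ODL already has the resource
    elif entry["operation"] == "delete" and not exists:
        entry["state"] = "completed"   # the resource is already gone
    else:
        entry["state"] = "pending"     # replay the operation via the journal

class FakeODL:
    """In-memory stand-in keyed by (resource_type, resource_id)."""
    def __init__(self, resources):
        self._resources = resources

    def get(self, resource_type, resource_id):
        return self._resources.get((resource_type, resource_id))

entry = {"operation": "create", "resource_type": "network",
         "resource_id": "n1", "state": "failed"}
recover_failed_entry(entry, FakeODL({("network", "n1"): {"id": "n1"}}))
print(entry["state"])  # → completed
```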
If you got this far, kudos to you and your perseverance.
While this is the end of this post series, this is definitely not the end of the driver development effort.
There are still many challenges ahead such as scale & performance concerns, possible issues & bugs, and many more features to add.
If you’d like to contribute, please join us in our efforts.