This is a multi-post series about the ODL V2 Driver.
V2 Driver Architecture
The V2 driver was designed to solve the issues identified in the V1 driver while remaining scalable and performant. While this write-up is mainly about the L2 networking bits, the architecture is general and is used for any and all communication between Neutron and ODL (e.g. L2, L3, FWaaS, QoS, trunk ports, etc.).
At a high level, the architecture looks like this:
The basic concept was to add a journal that acts as an intermediary between the ML2 driver and ODL itself.
The journal is implemented as a table in the Neutron DB where each row represents a journal entry and has the following basic fields:
- Resource type (Network, Port, etc)
- Resource ID
- Operation (Create, Update, Delete)
- State (Pending, Processing, Failed, Completed)
- Retry counter
- Create timestamp (Defaults to now)
- Last retry timestamp (Defaults to now)
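The journal row described above can be sketched as a small Python model. This is an illustrative, in-memory stand-in for the actual Neutron DB table; the field and class names here are assumptions, not the driver's real schema.

```python
import enum
import time
from dataclasses import dataclass, field


class State(enum.Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    FAILED = "failed"
    COMPLETED = "completed"


@dataclass
class JournalEntry:
    """One journal row (illustrative names, not the real DB schema)."""
    resource_type: str    # e.g. "network", "port"
    resource_id: str
    operation: str        # "create", "update" or "delete"
    state: State = State.PENDING
    retry_count: int = 0
    # Both timestamps default to "now", matching the fields above.
    created_at: float = field(default_factory=time.time)
    last_retried_at: float = field(default_factory=time.time)
```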
The journal works as an ordered queue, where the order of the entries is determined by the last retry timestamp; this essentially means that entries are picked up FIFO.
The journal tries to retrieve the oldest row in the pending state. If such a row exists, it will try to soft lock it by changing its state to ‘Processing’.
The journal runs on a dedicated thread, which is woken up by a timer configurable via the ‘sync_timeout’ option in the ‘ml2_odl’ section of the ML2 plugin configuration file.
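The pick-up logic above can be sketched as follows. This is a simplified in-memory version: journal rows are modeled as plain dicts with illustrative field names, whereas the real driver performs this selection as a query against the Neutron DB.

```python
def get_oldest_pending(journal):
    """Pick the oldest pending entry (FIFO by last-retry timestamp)
    and soft-lock it by flipping its state to 'processing'.

    Illustrative sketch: `journal` is a list of dicts standing in
    for rows of the journal table.
    """
    pending = [e for e in journal if e["state"] == "pending"]
    if not pending:
        return None
    entry = min(pending, key=lambda e: e["last_retried_at"])
    entry["state"] = "processing"  # soft lock: other pickers skip it
    return entry
```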
The GaleraDB Inconsistency
There was a need to support Neutron HA, where several Neutron servers might attempt to pick up the same row at the same time. This posed a first-class synchronization problem that isn’t easy to solve.
Two possible solutions were suggested:
- Pessimistic locking using the ‘SELECT FOR UPDATE’ SQL statement.
- Optimistic locking using a version field (or CAS or whatever).
The second approach sounded better in theory, as the first isn’t really supported by some DB engines, such as the widely used (at least in OpenStack deployments) GaleraDB (when in multi-master mode).
However, the second approach doesn’t actually work in GaleraDB either, since Galera uses its own implementation of optimistic locking to synchronize its nodes. This causes our optimistic locking to hit the same problem as no locking at all, surfacing as a “deadlock exception”, which is GaleraDB’s way of signaling that a data conflict occurred.
The way to solve this was to use pessimistic locking, combined with a software-based retry mechanism (which is widely used in OpenStack) to handle GaleraDB specifically.
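The retry mechanism can be sketched roughly like this. The exception class and function names here are stand-ins (the real drivers use oslo.db's deadlock-retry helpers); the point is only to show the shape of retrying a pessimistically locked operation when Galera reports a conflict as a deadlock.

```python
import time


class DBDeadlock(Exception):
    """Stand-in for the deadlock error Galera raises on a write conflict."""


def retry_on_deadlock(func, max_retries=3, base_delay=0.0):
    """Wrap a DB operation so that deadlock errors are retried with
    exponential backoff (sketch; real code uses oslo.db's helpers)."""
    def wrapper(*args, **kwargs):
        for attempt in range(max_retries + 1):
            try:
                return func(*args, **kwargs)
            except DBDeadlock:
                if attempt == max_retries:
                    raise  # give up after the configured retries
                time.sleep(base_delay * (2 ** attempt))
    return wrapper
```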
The Pre-commit Race
Initially the journal entry was recorded in the PRECOMMIT hook phase, which is the last action right before the data is committed, and then the journal thread was called to process the entry.
While I don’t know exactly why it was done this way, I do know that it posed a race: the journal could start executing before the transaction was committed, in which case it wouldn’t find the new entry.
While not a critical issue, this introduced latency, since instead of processing the entry right away the journal thread would wait until its timer fired.
This added latency could mean that a port created for a VM might be reported as created before it actually exists, causing the VM to not get an IP since the DHCP agent gave up trying.
The solution was quite simple: instead of triggering the journal thread at the end of PRECOMMIT, we moved the trigger to the end of POSTCOMMIT.
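The precommit/postcommit split can be sketched like this. The class and method names loosely mirror the ML2 mechanism driver hook names, but the body is an illustrative in-memory model, not the real driver code.

```python
import threading


class JournalDriver:
    """Sketch of the fix: record the entry inside the DB transaction
    (precommit), but only wake the journal thread after the commit
    (postcommit), so the thread is guaranteed to see the entry."""

    def __init__(self):
        self.journal = []
        self.event = threading.Event()  # wakes the journal thread

    def create_port_precommit(self, context):
        # Still inside the DB transaction: record the journal entry.
        self.journal.append({"resource": "port", "op": "create",
                             "state": "pending", "data": context})

    def create_port_postcommit(self, context):
        # Transaction is committed: now it is safe to wake the thread.
        self.event.set()
```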
The Expected Errors
So we have REST calls going to ODL, great! But what should we do with errors?
There are actually a couple different errors we should look out for:
- Networking errors (ODL is inaccessible)
- Other errors
For networking errors we know it’s an intermittent condition which is supposed to pass, so we put the entry back into the journal and set its state back to pending (the last retry field gets updated as well, pushing the entry to the end of the journal’s queue). We also stop processing further entries, since we know ODL is inaccessible and there’s no point in proceeding.
For any other error, we increase the retry count and set the state back to pending. If the retry count has reached the configured limit (the ‘retry_count’ option under the ‘ml2_odl’ section), the entry’s state is set to Failed and it is not retried any more.
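The two error branches described above can be sketched as a single function. Entries are modeled as dicts with illustrative field names, and the return value tells the caller whether to keep draining the journal; none of these names come from the real driver.

```python
RETRY_LIMIT = 5  # stand-in for the 'retry_count' config option


def handle_failure(entry, is_network_error, now):
    """Sketch of the journal's error handling (illustrative names)."""
    if is_network_error:
        # ODL is unreachable: requeue without burning a retry, and
        # tell the caller to stop processing further entries.
        entry["state"] = "pending"
        entry["last_retried_at"] = now
        return "stop"
    # Any other error: burn a retry; fail permanently at the limit.
    entry["retry_count"] += 1
    if entry["retry_count"] >= RETRY_LIMIT:
        entry["state"] = "failed"
    else:
        entry["state"] = "pending"
        entry["last_retried_at"] = now
    return "continue"
```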
The Dependency Graph
Another problem that needed solving was the race between dependent resources, such as a port, which relies on a subnet and a network.
The way chosen to solve this was using dependency graphs which are calculated per resource and per operation so that for example network create is independent of anything, but network delete would only happen when there isn’t any delete operation on a related resource.
For example, if a subnet and a port are created serially (which would happen if you mark the subnet to be served by Neutron’s DHCP), both requests could arrive at different Neutron controller nodes and have journal entries created simultaneously. The port creation entry, however, must only be processed after the subnet creation has completed processing.
This approach is not perfect, since it’s quite error prone, but supporting a Neutron HA environment running several Neutron servers requires it.
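A minimal version of the dependency check can be sketched as follows. This assumes each entry carries the IDs of the resources it depends on (e.g. a port create listing its subnet's ID); the field names are illustrative, not the driver's actual dependency-graph implementation.

```python
def has_pending_dependency(entry, journal):
    """Return True if any entry this one depends on has not yet
    completed, meaning this entry must wait (illustrative sketch).

    E.g. a port create entry listing its subnet's ID in 'depends_on'
    is held back until the subnet's create entry is completed.
    """
    deps = entry.get("depends_on", [])
    return any(
        other["resource_id"] in deps and other["state"] != "completed"
        for other in journal
        if other is not entry
    )
```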
What About The Other Stuff?
While covering a lot of issues, the journal is just the beginning. Although it solves some of the problems identified earlier, it raises others – what to do with the failed/completed rows? What happens if a row gets stuck in “locked” mode?
We’ll cover all this in the next post.