Building bridges: “Bridge must not have discovery mode for LACP interface; Interface file: bondeth0;”

Some days start out exatastic and end with you tearing your hair out. This post is about such a day. A brand-new, shiny X7 Quarter Rack Exadata was being installed for a customer. The setup was fairly straightforward: a quarter rack, full OVM, and 2 virtual clusters. Nothing too fancy. LACP is used for the public network on the fibers, and VLAN tagging is done on the switch.

Given that, my customer created an OEDA configuration using only LACP, without specifying the VLANs. This is allowed in the June 2018 version of OEDA.

The VLAN ID box only appears when you click the “Advanced” button and select the Enable Network VLAN option; you can then fill in the VLAN to be used.

So far so good: we had a very default configuration, and checkip.sh had run successfully. That means green light, let’s go for it!

The first step of install.sh (verification of the config file) ran perfectly fine. But the second one failed with the error from the title: “Bridge must not have discovery mode for LACP interface; Interface file: bondeth0;”

Uh oh. Nothing to be found on My Oracle Support, nor on Google, so I had to open an SR.

At this point I decided to try to understand why domu_maker works the way it does. During a fresh install, the bond for the public network and the bridge in dom0 are built. When you check the logs, you see that it fails exactly while building the bridge.

In the logging from OEDA ($OEDA_HOME/log) you see things like this:

It’s customer data, so please forgive the obfuscation. The point is that the log does not make it clear why it fails.

Let me give you the “Official” solution from Oracle first:

“I checked internally and the only workaround is to disable LACP mode on switch to proceed with vm creation.

– Disable LACP on network switch
– Uncheck lacp and recreate the config file.

We can re-enable LACP once the vm creation is successful.”

Well … in this case, I couldn’t do that. The network admins had already created the LACP bonding, and OEDA allowed it to be done this way. Also, launching 2 change requests would take far too much time and put us too far behind schedule. This means … time to be creative.

The magic log file, which is cleaned up afterwards, can be found in /var/log/cellos/exadata.img.domu_maker.trc or .log. When exactly it is cleaned up, I don’t know yet. I wanted to capture more information to put in this blog post, but it was already gone. The thing is, it shows you exactly what is going on. That way I discovered that even if you select LACP, the installer still does its verifications in exactly the same way as for non-LACP interfaces: it takes the interface, puts an address on it, and pings the default gateway to verify that it can reach it (ICMP! Dear firewall admins, please be kind during installation).
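Because these files disappear at some point, it can help to copy them aside right after a failed step. A minimal sketch, assuming the two file names mentioned above; the function name `snapshot_domu_logs` and the default destination directory are hypothetical, not part of any Oracle tooling:

```shell
#!/bin/bash
# Copy the domu_maker trace/log files into a timestamped directory
# before they get cleaned up.
# snapshot_domu_logs and /root/domu_maker_logs are illustrative names.
snapshot_domu_logs() {
    local src_dir="${1:-/var/log/cellos}"
    local dest_root="${2:-/root/domu_maker_logs}"
    local dest f copied=0
    dest="${dest_root}/$(date +%Y%m%d_%H%M%S)"
    mkdir -p "$dest" || return 1
    for f in "$src_dir"/exadata.img.domu_maker.trc "$src_dir"/exadata.img.domu_maker.log; do
        if [ -f "$f" ]; then
            # -p keeps the original timestamps, which helps when reading the trace later
            cp -p "$f" "$dest"/ && copied=$((copied + 1))
        fi
    done
    echo "copied ${copied} file(s) to ${dest}"
    # fail when nothing was there to copy
    [ "$copied" -gt 0 ]
}
```

Calling `snapshot_domu_logs` without arguments directly after a failed install.sh step would keep a copy for the SR (or for a blog post).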

When I discovered that, the solution was very simple, but efficient: let’s do it manually. So on both nodes I created my bridge manually. The domu_maker command does 90% of the work for you, so it is really easy:

You will notice that I specified neither the LACP option nor the VLAN ID. VLANs are handled (in this case) at switch level. From the help for this function:

I would assume the VLAN is optional, but it isn’t. Now you understand why I chose the classic bridge. The next step is to convert it into an LACP bond. In /etc/sysconfig/network-scripts/ifcfg-bondeth0, change the bonding options to the ones you want to have. In my case I ended up with this configuration file:
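As a rough illustration, an LACP-enabled bond definition generally looks something like the sketch below. The values are assumptions based on the standard Linux bonding options (mode=802.3ad is LACP), and the bridge name vmbondeth0 is assumed as well; these are not the values from this installation.

```shell
# /etc/sysconfig/network-scripts/ifcfg-bondeth0 (illustrative values only)
DEVICE=bondeth0
USERCTL=no
BOOTPROTO=none
ONBOOT=yes
# mode=802.3ad is LACP; lacp_rate and xmit_hash_policy have to match
# what the switch side expects, so check with the network admins
BONDING_OPTS="mode=802.3ad miimon=100 downdelay=200 updelay=200 lacp_rate=1 xmit_hash_policy=layer3+4"
# attach the bond to the dom0 bridge (bridge name assumed)
BRIDGE=vmbondeth0
NM_CONTROLLED=no
```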

To make this active (the system wasn’t in use yet), I restarted the complete network on the node:

And the bridge is also known to the system:
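To double-check that the bond really came up in LACP mode after the restart, the bonding driver's status file can be inspected. A small sketch: `bond_is_lacp` is a hypothetical helper name, and the commands in the trailing comments assume an Oracle Linux 6 style dom0 with the classic network service and bridge-utils installed.

```shell
#!/bin/bash
# Returns 0 when the kernel reports the bond as running in 802.3ad (LACP) mode.
# bond_is_lacp is an illustrative helper; the status file is the standard
# Linux bonding driver view under /proc/net/bonding/.
bond_is_lacp() {
    local status_file="${1:-/proc/net/bonding/bondeth0}"
    grep -q '^Bonding Mode: IEEE 802.3ad' "$status_file"
}

# On the dom0 itself, as root (commands assumed, adapt to your image):
#   service network restart
#   brctl show        # the bridge should list bondeth0 as its interface
#   bond_is_lacp && echo "bondeth0 is running LACP"
```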

Before we try the installer again, it’s best to verify that it all works.

So first put an IP address on the bridge:

Also we want the default gateway to be reachable:

It should be reachable through the correct interface, so do the ping test:

And that works. So now clean it up, and to be sure restart the network again so that we are in a clean state for the installer:
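The manual check described above (address on the bridge, ping the gateway through that interface, clean up) could be scripted roughly as follows. This is a sketch under assumptions: the bridge name vmbondeth0 and the addresses are placeholders (192.0.2.0/24 is a documentation range), `run` and `check_gateway` are hypothetical helper names, and it assumes the gateway sits in the same subnet as the test address, so no extra route is added. By default it only prints the commands.

```shell
#!/bin/bash
# Manual version of the installer's reachability check.
# All values are placeholders; override them through the environment.
BRIDGE="${BRIDGE:-vmbondeth0}"
TEST_IP="${TEST_IP:-192.0.2.10/24}"
GATEWAY="${GATEWAY:-192.0.2.1}"

# Print commands by default; set DRY_RUN=0 to really execute them as root.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

check_gateway() {
    local rc
    run ip addr add "$TEST_IP" dev "$BRIDGE"
    run ip link set "$BRIDGE" up
    # -I forces the ping out through the bridge, not through the management network
    run ping -c 3 -I "$BRIDGE" "$GATEWAY"
    rc=$?
    # remove the test address again so the installer starts from a clean state
    run ip addr del "$TEST_IP" dev "$BRIDGE"
    return "$rc"
}
```

Running `check_gateway` as-is shows what would be executed; `DRY_RUN=0 check_gateway` performs the test and returns the exit status of the ping.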

At this point, you would think all would succeed.

Unfortunately it still failed, with the very same error. So this couldn’t be a real error. It must be something else. When you read the domu_maker script, it tells you exactly why:

I don’t like messing around in my installation XMLs, nor in Oracle-provided scripts, but do you see the #26338063? It refers to a non-public bug: BUG 26338063 – DEPLOYING A VM WHERE THE BOND IS LACP FAILS.

According to the information available on that bug, it should have been fixed in the October 2017 bundle, but apparently it is still there.

The next step is a little nasty: on line 7049 of the domu_maker script I changed

to

This avoids the message. I KNOW it is safe (in this case) to continue, and the script doesn’t have to do everything for me, so I can skip this check.

When install.sh was retried, it ran without any error. Great success!

But, and this is very important: before running the third step, change the domu_maker script back to its original values.
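One way to make sure the original really comes back is to keep a pristine copy before editing and restore from it afterwards. A minimal sketch; the function names and the .orig suffix are my own illustrative choices, and the path of the script you patch is passed in as an argument:

```shell
#!/bin/bash
# Keep a pristine copy of a script before hand-editing it, and put it back afterwards.
backup_script() {
    local script="$1"
    # never overwrite an existing backup with an already-patched file
    [ -e "${script}.orig" ] || cp -p "$script" "${script}.orig"
}

restore_script() {
    local script="$1"
    [ -f "${script}.orig" ] || { echo "no backup for ${script}" >&2; return 1; }
    # copy back and verify that both files are byte-identical again
    cp -p "${script}.orig" "$script" && cmp -s "${script}.orig" "$script"
}
```

Call `backup_script` on the domu_maker script before touching line 7049, and `restore_script` on it before starting the third step.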

The rest of the install was just as default and straightforward as originally planned.

I would like to mention 2 more things.

  1. This is a hack. It is not a clean way of working; the SR is still open and Oracle has been informed about this. But I am still convinced that when something is allowed in OEDA, the install.sh script should be able to handle it without problems.
  2. Thank you Andy for the heads up and encouragement.

 

As always: questions or remarks? Find me on Twitter @vanpupi
