
Aws s3 outage

Today our AWS friends suffered an outage. This raises some questions.

Suppose your entire organization is built on cloud products, all of them closely related. An outage like this means a total outage for the organization.

What about costs? Personally I am not an AWS user, but would this outage be a case of “sorry, dear customer, but you’re still going to pay even though we had this unplanned outage”?

If you’re mission critical, let’s say your end users depend on it, would it still make sense to run production in the cloud?

Curious for opinions!

Why the sun shines for Oracle and it’s Cloudy for others

First of all, I’d like to mention that this draft has been pushed back in the schedule several times. I kept asking myself: should I really do this? But then again … this is a blog, a very personal, humble opinion, and you need not agree with me. I can be wrong, I can be right; the truth is probably somewhere in between. So, the title “Why the sun shines for Oracle and it’s cloudy for others” is a metaphor for Oracle (until now) having missed the cloud train.

Recently I came across the website of the Synergy Research Group and found a nice article. When you see the graph, you immediately get why uncle Larry is doing all this stuff to beat AWS.


You see? Find Oracle … it’s in the “Others” group. If this were the rdbms resource manager, I’d not like to be there. I think Oracle was thinking the same 🙂 If you have a look at AWS, there’s virtually no change. Personally I expected a little growth, but apparently not. Microsoft Azure, Google and IBM are taking up the share of the “others”.

Please dig around on my blog and you’ll see that I recently worked on a project on the Microsoft Azure cloud. Even though I’ve never liked Microsoft and I’m not a fan of Oracle on Windows, I have to admit that doing this Azure project (apart from some other problems) was a BLAST! Full support from Microsoft, a stable cloud environment, easy to configure and maintain … A very positive experience.

Then I had a look at the Oracle Cloud, a bit sceptical. The interface is fantastic! But then you dig a bit deeper, and I hit limits I wasn’t expecting. A very simple example: Oracle wants to position itself as the #1 cloud provider. To do so, they want to migrate full datacenters to their cloud. GREAT! Wonderful idea. I mean this.

One story from the Azure project. Due to a miscalculation (if you want to hear all about it, find me at a conference for my presentation about the journey of a BI stack to the cloud), we needed far more powerful servers to cope with the load. That wasn’t a problem, but they are expensive. So expensive that when we redid the financial calculation, we decided to have a look at the other two players as well. AWS was easy and competitive, but about the same price, so there was no reason to change. Then we had a look at the Oracle cloud.

Remember the demo Larry Ellison gave at OpenWorld: he wants to lift and shift datacenters to the Oracle Cloud. I love that concept. So we went to the Oracle marketplace (I love this term!) looking for our Windows server version. No worries, our dbs are running on Linux 🙂 But err … no decent Windows servers available in the marketplace 🙁


Then also … I find that the interface is slow … very slow … and sometimes even unstable.


Some friends even had difficulties cancelling their trial subscription. I can go on like this for a while, but one of the other “no-gos” for this customer was this entry in the FAQ:

“I have hardware VPN appliance in my datacenter. Will Corente VPN work with my existing appliance?

Currently, third-party VPN appliances will not work with the Corente service. VPN endpoint locations will need to install a Corente Services Gateway.”

This customer wanted another choice and that was impossible. That’s a pity.
EDIT (24/02/2017): The Oracle cloud, just as the others, is evolving very rapidly. Thanks to Philip Brown (@pbedba) for pointing me to these links about a Third-Party Gateway to an IP Network in Oracle Cloud and a Third-Party Gateway On-Premises to the Shared Network
So it seems that currently it is possible, which is good news! So hopefully the FAQ will be updated quickly.

Ok, let’s do database as a service then. It’s the #1 database company (and yes, I’m an Oracle fanboy), so that should work for a decent price. Right?

I’ll take the anonymized example I use in my presentation as well: 3 prod dbs of 35TB, 15TB and 6TB + their Data Guard instances, and then for each db 8 non-live versions (Dev, Dev New Release, Test, Test New Release, Int, Int New Release, Uat, Uat New Release). Then you immediately spot WHY the cloud is an option: treat these databases as cattle, not as pets. So automation and provisioning would be key. But for production, it should be feasible, right?
Let’s explore the options … In summary … not too much, except the full-blown Exadata option, which was (compared to the Azure solution we had figured out) extremely expensive. And even then we had left out mechanisms for cloning those databases to non-prod systems in an automated way.

It’s a bit of a frustrating blog post and I feel sad writing and reading it. So for Oracle, in my opinion, the sun is still shining on premises, and I do hope for them the clouds will come, but the way it is now, I’m afraid they’ll miss this train. I believe more in data on premises, but the cloud will definitely take its place and we should definitely embrace it. I totally agree with the statement “there will be a co-existence for the next 5 to 10 years”. Of course some other hype will be there by then, but that’s another story.

But Oracle … you still can win this battle!

  • Think about the past, think “back to the future”! How did you win ground in the past? Make it EASY TO USE. So, for the trial subscription, make it really free to subscribe and unsubscribe without having to provide credit card details. Have a look at your APEX colleagues: they are doing a GREAT job!
  • Support us. Support is key. If we choose to be dependent on a cloud provider, offer good support. Resolve (I don’t say respond, but really resolve) SRs really quickly (< 0,5d in the local timezone), as speed in the cloud is key.
  • No unplanned outages please! Make it stable, no suddenly disappearing machines. Outages are acceptable, but communicate them; be very transparent.
  • Invest in a good, extensive marketplace. Currently, you’re at the point Microsoft Azure was 2 years ago. You have the experience, the knowledge, the social network, … it must be feasible to fill this marketplace really quickly with recent and decent software. Vendors are asking for it … hear them. Make the marketplace a shopping mall or a candy store.
  • Engage your partners! It’s lonely at the top, and if you’re high you can fall very low. If the product is mature, and if partners get easy access to features-to-come (compare it to the private/public previews with Azure), customers will start to trust you and dare to make the move.
  • Don’t push the “cloud-on-premise” too hard. It’s no cloud at all, it’s just an interface, and people don’t get the idea: keeping the costs of your own datacenter while paying extra for this service is difficult to understand. I do believe in this mechanism as a “step to the cloud”, but make it free (or very, very cheap), so that people can put their environment on the engineered systems and, once done, call DHL/FedEx or some other partner and move them to the Oracle DC. Done.
  • Don’t change the rules if you can’t win, and don’t get aggressive. Yes, I’m referring to the core-factor story regarding AWS and Microsoft Azure. I heard some customers making the comparison with children: “if they can’t win, they change the rules”. I couldn’t think of any response at the time … it felt like they were right.
  • Provide a clear cloud advantage. This could be, for instance, that if you add a compute layer to host your db yourself, the EE licenses are included. Or change the license model (in the Oracle cloud) so that e.g. all the options are included “free” in the EE license. If you make that cheaper than the on-premises licenses, you will certainly win ground without putting the customers of other certified cloud providers in a strange position.
  • Provide an easy mechanism for customers to go back/away very easily without extra cost. This sounds very strange, but people don’t like to be in prison, so they are very scared about “losing their data to someone else” or going through a lengthy process to get it out of the cloud again (if needed for one reason or another).

Basically, it comes down to one sentence: listen to your customers, listen to what they want, and don’t push things down their throats. It’s not too late yet. People are interested; engage them, don’t scare them.

Once again, this is a very personal opinion and I might be right, but I might be wrong as well. I think by discussing this, more beautiful and working (usable) clouds can be created.


And remember, when it’s cloudy, it doesn’t necessarily mean that it will rain 🙂

As always, questions, remarks? find me on twitter @vanpupi

Memo to Self: Recap cellsrvstat

Sometimes I ask myself “how did that work again?”, so I decided to document it every time I have this feeling. With some links to the documentation, easy commands, … you get the picture.

First one today: new customer, new environment. To get a feel for the cells, I used cellsrvstat.

Documentation reference (here). cellsrvstat is also part of ExaWatcher on the cells.

A basic overview of the command. If you log on to the cells as root, it is in your $PATH. But in case you’re looking for it, it’s stored in /opt/oracle/cell<version>/cellsrv/bin/

So basics first, what can it do:

# cellsrvstat -h
LRM-00101: Message 101 not found; No message file for product=ORACORE, facility=LRM
cellsrvstat [-stat_group=<group name>,<group name>,]
[-stat=<stat name>,<stat name>,] [-interval=<interval>]
[-count=<count>] [-table] [-short] [-list]

stat A comma separated list of short strings representing
the stats. Default is all. (unless -stat is specified).
The -list option displays all stats.
Example: -stat=io_nbiorr_hdd,io_nbiowr_hdd
stat_group A comma separated list of short strings representing
stat groups. Default: all except database
(unless -stat_group is specified).
The -list option displays all stat groups.
The valid groups are: io, mem, exec, net,
smartio, flashcache, offload, database.
Example: -stat_group=io,mem
offload_group_name A comma separated list of short strings representing
offload group names.
Default: cellsrvstat -stat_group=offload
(all offload groups unless -offload_group_name is specified).
Example: -offload_group_name=SYS_121111_130502
database_name A comma separated list of short strings representing
database group names.
Default: cellsrvstat -stat_group=database
(all databases unless -database_name is specified).
Example: -database_name=testdb,proddb
interval At what interval the stats should be obtained and
printed (in seconds). Default is 1 second.
count How many times the stats should be printed.
Default is once.
list List all metric abbreviations and their descriptions.
All other options are ignored.
table Use a tabular format for output. This option will be
ignored if all metrics specified are not integer
based metrics.
short Use abbreviated metric name instead of
descriptive ones.
error_out An output file to print error messages to, mostly for

In non-tabular mode, The output has three columns. The first column
is the name of the metric, the second one is the difference between the
last and the current value(delta), and the third column is the absolute value.
In Tabular mode absolute values are printed as is without delta.
cellsrvstat -list command points out the statistics that are absolute values

[root@dm06celadm01 ~]#

So it can display all kinds of information about your cell status, which can be helpful to see what’s going on. So let’s do the list (warning: an awful lot of info! I’ll cut out some of the rows, but if you execute it, be prepared for a long list):

[root@dm06celadm01 ~]# cellsrvstat -list
Statistic Groups:
io Input/Output related stats
mem Memory related stats
exec Execution related stats
net Network related stats
smartio SmartIO related stats
flashcache FlashCache related stats
health Cellsrv health/events related stats
offload Offload server related stats
database Database related stats
ffi FFI related stats
lio LinuxBlockIO related stats
mpp Reverse Offload related stats
Sparse Sparse stats

[ * - Absolute values. Indicates no delta computation in tabular format]

io_nbiorr_hdd Number of hard disk block IO read requests
io_nbiowr_hdd Number of hard disk block IO write requests
io_nbiorb_hdd Hard disk block IO reads (KB)
io_nbiowb_hdd Hard disk block IO writes (KB)
io_nbiorr_flash Number of flash disk block IO read requests
io_nbiowr_flash Number of flash disk block IO write requests
io_nbiorb_flash Flash disk block IO reads (KB)
io_nbiowb_flash Flash disk block IO writes (KB)
io_ndioerr Number of disk IO errors
io_ltow Number of latency threshold warnings during job
io_ltcw Number of latency threshold warnings by checker
io_ltsiow Number of latency threshold warnings for smart IO
io_ltrlw Number of latency threshold warnings for redolog writes
mpp_nr_blcc Num of reqs not pushed due to low cell cpu (C)
mpp_nr_bhcon Num of reqs not pushed due to high cell outnet (C)
mpp_nr_bhrnin Num of reqs not pushed due to high db node innet (C)
mpp_nincr_mb Num rate increase by reverse offload info from db (C)
mpp_ndecr_mb Num rate decrease by reverse offload info from db (C)
mpp_nincr_rn Num rate increases from db node cpu information (C)
mpp_ndecr_rn Num rate decreases from db node cpu information (C)
mpp_ndecr_ccpu Num rate decreases from low cell cpu utilization (C)
mpp_ndecr_con Num rate decreases from high cell outnet util (C)
mpp_ndecr_rn_in Num rate decreases from high db node innet util (C)
sparse_ncb num buckets compacted by sparse HT background scan
sparse_ios num IOs with sparse regions
sparse_ios_kb Total sparse IOs (KB)
sparse_smartio Total redirected smart ios (KB)
[root@dm06celadm01 ~]#

Let’s say you’re only interested in the IO-related things; then you could use a stat_group:

[root@dm06celadm01 ~]# cellsrvstat -stat_group io
===Current Time=== Tue Feb 21 11:29:39 2017

== Input/Output related stats ==
Number of hard disk block IO read requests 0 2226820445
Number of hard disk block IO write requests 0 1033312850
Hard disk block IO reads (KB) 0 1909110664882
Hard disk block IO writes (KB) 0 199121447989
Number of flash disk block IO read requests 0 14301322886
Number of flash disk block IO write requests 0 1008668696
Flash disk block IO reads (KB) 0 789129901568
Flash disk block IO writes (KB) 0 52097067586
Number of disk IO errors 0 0
Number of latency threshold warnings during job 0 1081
Number of latency threshold warnings by checker 0 0
Number of latency threshold warnings for smart IO 0 0
Number of latency threshold warnings for redolog writes 0 0
Current read block IO to be issued (KB) 0 0
Total read block IO to be issued (KB) 0 599867955384
Current write block IO to be issued (KB) 0 0
Total write block IO to be issued (KB) 0 197822797002
Current read blocks in IO (KB) 0 0
Total read block IO issued (KB) 0 599867955384
Current write blocks in IO (KB) 0 0
Total write block IO issued (KB) 0 197822797002
Current read block IO in network send (KB) 0 0
Total read block IO in network send (KB) 0 599867955384
Current write block IO in network send (KB) 0 0
Total write block IO in network send (KB) 0 197822797002
Current block IO being populated in flash (KB) 0 2765920
Total block IO KB populated in flash (KB) 0 32844047616
I/Os queued in IORM for hard disks 0 0
I/Os queued in IORM for flash disks 0 0

[root@dm06celadm01 ~]#

The last 2 lines are also very interesting: they tell you whether IORM is kicking in or not. Might be useful in some cases. Just saying.
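To keep an eye on just those IORM counters, a quick filter over the same command works; the cellsrvstat options are the ones from the help output above, and the grep is just my own convenience (a sketch, not an official feature):

```shell
# Sample the io group every 5 seconds, 12 times, and keep only the
# "queued in IORM" lines to see whether IORM throttling is happening.
cellsrvstat -stat_group=io -interval=5 -count=12 | grep -i 'queued in IORM'
```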

The exec group is also nice. Once again I will cut out some rows, but the last lines are very interesting as well:

[root@dm06celadm01 ~]# cellsrvstat -stat_group exec
===Current Time=== Tue Feb 21 11:30:17 2017

== Execution related stats ==
Incarnation number 0 3
Number of module version failures 0 0
Number of threads working 0 2
Number of threads waiting for network 0 23
Number of threads waiting for resource 0 9
Number of threads waiting for a mutex 0 112
Number of Jobs executed for each job type
CacheGet 0 3123536972
CachePut 0 1031998876
CloseDisk 0 15376502
OpenDisk 0 20379160
ProcessIoctl 0 304858117
PredicateDiskRead 0 7462707
PredicateDiskWrite 0 36539
PredicateFilter 0 24054836
PredicateCacheGet 0 140219901
PredicateCachePut 0 16917010
FlashCacheMetadataWrite 0 0
RemoteListenerJob 0 0
CacheBackground 0 0
RemoteCellMgrService 0 0
CopyFromRemote 0 30925
sparse_bootstrap 0 0
sparse_free_region 0 0
DelegateIO 0 62678
NetworkPoll 0 0
CopySIFromRemote 0 550
SIGetJob 0 720
NetworkDirectoryGC 0 0

SQL ids consuming the most CPU
INT99 dxpwsgys5za27 3
END SQL ids consuming the most CPU

[root@dm06celadm01 ~]#

This tells me which database is asking the most CPU for which query. Might be useful in some cases. Remember: in an idle environment, if you do something, you’re automatically the “top”. But if you suspect things, it’s worth having a look; it might help.

As always, questions, remarks? find me on twitter @vanpupi

UKOUG Ireland 2017

Once more I’d like to thank my colleague and friend Philippe Fierens (@pfierens) for convincing me to speak at conferences about the things I do for customers. Last year, 2016, it all started at UKOUG Ireland. I am lucky to be selected this year again! I’m speaking at this year’s OUG Ireland event! (agenda)

My first talk is on the first day at 14:15h and will tell you all you’d like to know about OVM. It’s called: OVM on Exadata: Living in a Virtual World. You can find the abstract here. One thing I’d like to mention: normally this is a duo presentation with Philippe, but due to circumstances he can’t join, so I’d like to credit him for his part in the presentation. Thanks Philippe, I will try to do it as well as you do!

The second day I’ll be speaking at 15:25 about a very recent project I did. I like to call it the same as the title of the presentation: The Journey of a BI Stack to the Cloud. You can find the abstract here.

So folks, register for the conference and see you there!

As always, questions, remarks? find me on twitter @vanpupi

A warm welcome to exadata SL6-2

At last year’s Oracle OpenWorld, uncle Larry announced the SPARC-based Exadata SL6-2, so this means we have to give the SPARC chips a warm welcome to the Exadata family.
During the conference I wrote 2 blog posts. You can find them here and here.

To recap, a little picture of the new one in the family:

Exadata SL6-2

Nowadays, we’re used to the big X for the Exadatas. This stands for the x86 infrastructure they are running on. So SL stands for “Sparc Linux”. You should follow the Oracle guys on Twitter as well; then you’ll see this product (Linux for SPARC) is growing very rapidly. One of the questions which pops into the mind directly: which endianness is this using? Well, Linux on SPARC uses big endian, as the SPARC chip itself is big endian.
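As a side note, you can probe the endianness of a running Linux box from any shell; this little trick interprets the byte pair 0x01 0x00 as a single 16-bit integer (just a quick sketch using od(1)):

```shell
# Prints 1 on a little-endian host (x86) and 256 on a big-endian one
# (such as the SPARC M7 in the SL6-2).
printf '\001\000' | od -An -tu2 | tr -d ' \n'
```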

So in my blog posts I was eagerly looking forward to the spec-sheet and here it is!

A shameless copy out of the datasheet:
“The Exadata SL6 Database Machine uses powerful database servers, each with two 32-core SPARC M7 processors and 256 GB of memory (expandable up to 1TB)”

According to Gurmeet Goindi’s blog (@exadatapm), it comes at the same cost as the Intel-based variant. You can read his blog here:

Exadata SL6-2 hardware specifications

Look what’s there! Instead of 2 QDR ports, we now have 4. And the elastic configs remain as well. Also remarkable is that the storage cells remain on an Intel-based architecture.
This looks interesting as well (same as the X6-2 trusted partitions):

Exadata SL6-2 mgmt features


At this moment (or I have read over it) I can’t yet see how virtualisation will be done, so if someone has info about this, I will be happy to hear it. I heard several rumours, but I am eager to find out what it’s going to be!

One question remains … when will I be able to find a customer who buys it and lets me explore it to the bottom 🙂


As always, questions, remarks? find me on twitter @vanpupi




The first performance related impressions of the new ODA X6-2M

New toys are always fun! When Oracle announced their “small” ODAs in the X6-2 generation, we were excited to test them. We were not the only ones, so it took a while before getting one, but the first week of January it was playtime. An ODA X6-2M was delivered to our demo room and testing could begin.

Normally I would start a blog post by “how to install” it. Actually this is very simple and very well documented. If you want me to blog about it as well, just let me know.

The nice thing about the database appliance is that in the X6-2 generation it is now possible to have single instances which can host Standard Edition. This is a good thing. One of the reasons to consider this is that step-in costs can be reduced. For smaller companies, you get a database in a box which just works. Nice, isn’t it?

So how does it perform?
Well … first things first: SLOB, the wonderful tool of Kevin Closson (you can find him here). SLOB helps stress the storage so that you can find out how your system behaves. It is always one of the first things I run on a new system.

Marco Mischke (@dbamarco) was also playing with the X6 and discovered an important performance difference between running your database on ASM and on ACFS. It has been classified as a bug and “fixed” in the latest ODA image. Guess which version I installed on the ODA? Right, the latest one. So we got in touch, and the first SLOB test was good: it reached far higher numbers, so the problem looked to be fixed.

But looking a bit further, I wanted to test on ASM as well.
You know what? I will just give you the results you’re looking for 🙂

Ok First: ACFS, here we go.


So with a limited set of workers we reach up to about 325,000 IOPS. Given that the system has 20 cores available, this comes down to 16,250 IOPS per core.
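For what it’s worth, the per-core figure is just the peak divided by the core count:

```shell
# 325000 IOPS spread over the 20 available cores
echo $(( 325000 / 20 ))   # prints 16250
```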
If we translate that in MB/s we get this:

ACFS throughput MB

I left out the latencies here to make it a bit clearer, but it peaks at about 2.5 GB/s at its highest. So here are the latencies over the tests:

ACFS read latencies.

I put it into excel as well:

max read latency		2587.22	us	2.58722	ms
max write latency		2094.74	us	2.09474	ms

These are the maximum latencies during the test, so mostly at the end. In my opinion, this is good.
If more details are needed, drop me a message and I will provide more information.

Let’s move on to ASM: exactly the same database, parameters, etc. I love 12c! You can move the datafiles online, so that’s how it was done.
ASM, your results please.

Oops, what’s that? 800,000 read IOPS! And the write ones are only slightly better.

Then we go to the throughput:

So ASM is faster than ACFS. I was expecting it to be a bit faster, but not by this much.
For completeness the latencies:

And then the figures:

max read latency		2508.65	us	2.50865	ms
max write latency		2893.87	us	2.89387	ms

This looks as expected. Good.

I talked to my team lead and performance tuning expert, Geert De Paep, about this behaviour. You could see the light in his eyes; he wants to test it as well, so I’m looking forward to his blog post. I can already tell you that by running the queries manually on the swingbench schema, Geert was also able to see this behaviour. So we should also figure out what happens when using ACFS. If it is still strange, we should contact Oracle as well. We will see.

If you run swingbench with the preconfigured runbooks, the first bottleneck you find is the CPU. This is due to all the PL/SQL in swingbench. So knowing that … the next tests will be Logical IO.

As always, questions, remarks? find me on twitter @vanpupi




During the BI-in-the-cloud project, one of the aspects we had to test was the network. Here is how we did it, to figure out how the network performs and, most of all, whether it is stable.

One of the most important things in a cloud environment is the network. It connects devices to each other and makes communication between devices possible. Sounds obvious, right?

Some tests we had done relied very heavily on the network, such as NFS, SMB, … and in the beginning we didn’t manage to get it stable. At some point in time you have the “I-should-find-some-time-to-do-this” moment. This was one of them: I should find some time to check, in a very quick and easy way, whether the network remains “ok”. So I came up with the most basic network test there is: ping! Ping? Pong. Yes, an easy ping. I know that firewalls give lower priority to ping, but in this case they are configured well, so this is good to go.

The test consists of a tiny script which does 10 pings, some cli magic to grep the time out of it, and records the result in a file. It’s a quick and dirty script, and it would be a lot better to store the results in a database, but hey, we just needed an idea: is the network stable or not? This script goes into the crontab, every 5 minutes, on each of the 3 servers. This generates data, and I harvested the data after a couple of days. I would like to mention (oh oh, comment storm coming up) that regarding the network in this Microsoft Azure subscription, Windows and Linux servers perform the same. The prerequisite is that you configure them well, so we did that 🙂
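The script itself was nothing fancy; a sketch along these lines (the target, log path and crontab entry are illustrative assumptions, not the exact script we used):

```shell
#!/bin/sh
# Quick-and-dirty network check: 10 pings, keep only the round-trip
# times, append them with a timestamp to a flat file.
# Crontab entry (every 5 minutes): */5 * * * * /usr/local/bin/pingcheck.sh
TARGET=${1:-10.0.0.5}
LOGFILE=${2:-/var/tmp/pingcheck.log}
ping -c 10 "$TARGET" \
  | grep -o 'time=[0-9.]*' \
  | cut -d= -f2 \
  | while read -r ms; do
      printf '%s %s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$TARGET" "$ms" >> "$LOGFILE"
    done
```

The grep/cut pair is the “cli magic”: it strips each reply line down to just the millisecond value.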

The first test is done on 2 servers, one Linux and one Windows, placed in different availability sets (AS).


This is no Excel graph. I would like to thank my team lead Geert De Paep for letting me put my data into Pandora. Pandora is a tool which turns database data into any kind of SVG graph you would like. For the people interested, I can share the Excel graph as well, but there were high peaks; to keep the detail, I needed the exponential graphs, and Pandora is the ideal choice for that.

It looks to me like, for every series of ping packets, the first one takes some time and then it gets pretty stable.

The second test is also done on 2 servers, one Linux and one Windows, this time stored in the same availability set (AS). But there’s another little difference. The network throughput we had on other machines was a bit disappointing. Hey Microsoft, can you do something about it? The answer was very easy: use the preview of accelerated networking. So that is what we did.


Strange behaviour in the beginning, but as it is a preview, I assume something was still going on. Timings are a bit lower, which is good, and we see the same behaviour: one “slower” ping and then good results. Although between 18h and 20h we see some higher times on a daily basis. I think I should gather more data to spot whether this is a recurring trend.

So that brings us to the third and final test. Just the same setup as the second one, except that it runs between 2 linux boxes. Azure, your results please!


The graph looks different, but note the time span: the Windows boxes were shut down between Christmas and New Year. No no no, it’s not because Windows crashed; they were simply shut down and the resources reused for other things.
But I do like the consistency. Still the same behaviour: one longer ping and then the rest lower but consistent.

As always, questions, remarks? find me on twitter @vanpupi

Oracle DB in the Azure cloud – Pt1

A few months ago (around October) we were contacted with a simple question: can you run an Oracle database in the cloud, the Azure cloud? Well … it depends. The little detail was that the database is about 34TB, there are a few other multi-TB databases AND there are a lot of copies of them. And … the final decision for go-live is … end of 2016. Well, we accepted the challenge.

The deadline was strict, so that’s also the reason I had less time to blog, and these Azure cloud series won’t be completely chronological … but (and this is a spoiler alert) I’m happy to share what we ended up with.

This post will focus on how the database tests using SLOB were done. Credits to @kevinclosson for the SLOB tool and @flashdba for his SLOB testing harness. Combining these 2 provides a very quick way of running consistent tests. We needed such a quick testing framework, as we were changing about everything to see whether it impacted disk throughput/IOPS or not.

Why we chose those machines is for another post, but we opted for the DS15_v2 VM (details here). The machine description I borrowed from the Microsoft website: “Dv2-series, a follow-on to the original D-series, features a more powerful CPU. The Dv2-series CPU is about 35% faster than the D-series CPU. It is based on the latest generation 2.4 GHz Intel Xeon® E5-2673 v3 (Haswell) processor, and with the Intel Turbo Boost Technology 2.0, can go up to 3.1 GHz. The Dv2-series has the same memory and disk configurations as the D-series.”
Looks good, right? And we can attach up to 40TB to the machine, which makes it a candidate for the future database servers.
It gets better: this family of servers can also use Microsoft premium storage, which is basically SSDs, and disk caching is possible if needed.
As the databases are a bit bigger, the only way we could go was the P30 disks (more details about them here), so a disk limit of 5000 IOPS and 200MB/s. Should be ok as a first test.

The first test was done using iozone. The results of that will be in a different blog post, as I still need to do the second tests to crosscheck them. But let’s continue, not before asking: if there are remarks, questions or suggestions to improve, I’ll be happy to test them.
The VM was created with 1 storage account, and the storage account was completely filled up with 35 premium storage SSDs.
Those disks were presented to the virtual machine, added into one big volume group, and a striped xfs filesystem was created on a logical volume, which hosts the SLOB database.
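For the curious, that volume layout can be sketched roughly like this (device names, VG/LV names, stripe count and stripe size are assumptions for illustration; in reality all 35 disks went into the volume group):

```shell
# Collect the premium-storage disks into one big volume group ...
pvcreate /dev/sdc /dev/sdd /dev/sde          # ... repeated for all data disks
vgcreate vg_slob /dev/sdc /dev/sdd /dev/sde
# ... create a logical volume striped across all physical volumes
#     (-i = number of stripes, -I = stripe size in KB) ...
lvcreate -n lv_slob -l 100%FREE -i 3 -I 1024 vg_slob
# ... and put xfs on top to host the SLOB database.
mkfs.xfs /dev/vg_slob/lv_slob
mount /dev/vg_slob/lv_slob /u02/slobdata
```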
The db was created using cr_db.sql from the create database kit, after enabling it for the 4k redologs. After finishing all steps to make it a Physical IO test, we were good to launch the testing harness. It ran for a while, and eventually our top load profile looked like this during all the tests:


I think that’s ok? So after that it’s time to run the … to generate a csv file. That csv was loaded into Excel and this was the result.







First I split the write and read IOPS, but then I decided to use the total IOPS, as the graph follows the trend. My understanding (please correct me if wrong) is that around 30,000 IOPS of 8k database blocks is around 234MB/s? These tests were done without disk caching.
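The back-of-the-envelope conversion behind that number:

```shell
# 30000 IOPS * 8 KB per block, expressed in MB/s (integer arithmetic)
echo $(( 30000 * 8 / 1024 ))   # prints 234
```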

Then we decided to do the whole test again, but this time, instead of using 1 storage account with a bunch of disks, we used a bunch of storage accounts with only one disk each. The rest of the setup was done exactly the same (a new VM of the same size, same volume group, same striping, …) and the database was created using the same scripts again. Here are the results:







I think it is remarkable that, even in the cloud, the way you provide the disks to the machine really does matter. For example, take the 32 workers: with one storage account, remarkably less work was done.

More to come, of course. Feedback is welcome about what the next blog post might be. Let’s make it interactive 🙂

As always, questions, remarks? find me on twitter @vanpupi

Documentation bug in ovm for Exadata

A while ago a customer gave me a heads-up about the “bug” concerning the default passwords for root and celladmin. Thinking a bit further, I wondered whether the “documentation bug” I found while adding a new OVM in the virtualised Exadata is solved. The official documentation can be found here.


Then “Managing Oracle VM Domains on Oracle Exadata Database Machine” and then “Creating Oracle RAC VM Clusters” brings you to the point I want to warn you about.

All steps are correct, but the last one “Run all steps except for the Configure Cell Alerting step using the XML file for the new cluster. For most installations, the Configure Cell Alerting step is step 7. For example, to execute step 1, run the following command” might be a bit tricky. Why? I will show you.

When deploying the exadata, if you list the steps you will get this output:

$ ./ -cf anonymous_customer.xml -l

1. Validate Configuration File
2. Create Virtual Machine
3. Create Users
4. Setup Cell Connectivity
5. Calibrate Cells
6. Create Cell Disks
7. Create Grid Disks
8. Configure Alerting
9. Install Cluster Software
10. Initialize Cluster Software
11. Install Database Software
12. Relink Database with RDS
13. Create ASM Diskgroups
14. Create Databases
15. Apply Security Fixes
16. Install Exachk
17. Create Installation Summary
18. Resecure Machine

But if you take the newly created xml for the new cluster:

$ ./ -cf anonymous_customer_new_clu.xml -l

1. Validate Configuration File
2. Create Virtual Machine
3. Create Users
4. Setup Cell Connectivity
5. Calibrate Cells
6. Create Cell Disks
7. Create Grid Disks
8. Configure Alerting
9. Install Cluster Software
10. Initialize Cluster Software
11. Install Database Software
12. Relink Database with RDS
13. Create ASM Diskgroups
14. Create Databases
15. Apply Security Fixes
16. Install Exachk
17. Create Installation Summary
18. Resecure Machine

Do you spot the difference? I don’t.
I just want to say … if you create a new cluster, be careful with “Create Cell Disks”. I should recheck the logfiles, but the last time I checked, it was performing a drop of the celldisks before recreating them. So you can imagine what will happen to your other virtual machines. If you have an Exadata on which I can try it, please let me know. I’m happy to check it out further 🙂

Exadata add a new vm

Today a customer highlighted a nice-to-know to me. When adding a new virtual machine to an Exadata OVM cluster, he experienced something odd. It had been tested on a “new installation”, where it worked fine. Basic steps are:

  • Run over OEDA and add the cluster
  • move the xml files to the dom0 on the same spot as the original one
  • run with this config

As this is a good customer, he followed the advice of having all passwords changed. The bad thing is … while running, lots of errors on different components were thrown.
The most remarkable, and even the first one thrown, was:

OCMD-02624: Error while executing command {0}.java.lang.reflect.InvocationTargetException

So after digging around for a while, it turned out to be due to the “non-default” passwords for root and celladmin.
After changing the root and celladmin passwords back to the well-known default, it liked them and gave the expected success message.

Successfully completed execution of step Validate Configuration File [elapsed Time [Elapsed...

The IB switches suffer from this as well, but that’s only faced when you upgrade the IB software. So in order to patch them easily, just temporarily reset the passwords to the default and change them back afterwards.