We had quite a strange case recently – Exadata I/O Resource Management (IORM) did not seem to work. This is a quarter rack X6-2 High Capacity (HC) Exadata, storage server software version 12.2 (latest patch level), and a variety of different CDBs. Two CDBs were doing heavy smart scanning on flash (cache), and one CDB seemed to get much more IOPS / throughput than the other, although IORM was configured with equal shares. Queries were doing smart scans, but still transferred massive amounts of data to the database servers.
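For reference, an equal-share inter-database plan of the kind we had in place looks roughly like this (a sketch – CDB1/CDB2 stand in for the real database names, and the share values are illustrative):

# equal shares for both CDBs, plus a catch-all directive for everything else
dcli -g ./cell_group -l root "cellcli -e alter iormplan dbplan=((name=CDB1, share=4), (name=CDB2, share=4), (name=other, share=1))"
# verify the plan on all cells
dcli -g ./cell_group -l root "cellcli -e list iormplan detail"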
To verify and assess this, we prepared a test case. The test queries were of this form:
select /*+ parallel */ a.* bulk collect into b from test_large a where testf(decode(a.object_id,0,1,2)) = 0
test_large is a 1-billion-row table made of copies of dba_objects; testf is a deterministic PL/SQL function whose only purpose is to prevent any row filtering on the storage side (a user-defined PL/SQL function in the predicate cannot be offloaded to the cells). The query returns 0 rows, but transfers essentially the whole table over the network.
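Put together, the harness looked roughly like this (a sketch – the actual collection type and function body may have differed, but this reproduces the behaviour described above):

-- deterministic, but not offloadable to the cells, so no storage-side filtering
create or replace function testf(p number) return number deterministic is
begin
  return p;  -- decode() only ever yields 1 or 2, so the predicate is never true
end;
/

declare
  type t_rows is table of test_large%rowtype;
  b t_rows;
begin
  loop  -- run until the session is killed
    select /*+ parallel */ a.* bulk collect into b
      from test_large a
     where testf(decode(a.object_id, 0, 1, 2)) = 0;
  end loop;
end;
/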
We ran the query in a loop – on CDB2 with 1 session, on CDB1 with 4 sessions in parallel – and observed the cell server metrics:
# dcli -g ./cell_group -l root "cellcli -e LIST METRICCURRENT WHERE name='N_HCA_MB_TRANS_SEC' and metricvalue \>1000 "
cell1: N_HCA_MB_TRANS_SEC cell1 4,090 MB/sec
cell2: N_HCA_MB_TRANS_SEC cell2 4,096 MB/sec
cell3: N_HCA_MB_TRANS_SEC cell3 4,092 MB/sec
# dcli -g ./cell_group -l root "cellcli -e LIST METRICCURRENT WHERE name='PDB_FC_IO_RQ_SEC' and metricvalue \>1000 "
cell1: PDB_FC_IO_RQ_SEC CDB1.PDB1 55,830 IO/sec
cell1: PDB_FC_IO_RQ_SEC CDB2.PDB2 14,689 IO/sec
cell2: PDB_FC_IO_RQ_SEC CDB1.PDB1 56,626 IO/sec
cell2: PDB_FC_IO_RQ_SEC CDB2.PDB2 13,857 IO/sec
cell3: PDB_FC_IO_RQ_SEC CDB1.PDB1 56,000 IO/sec
cell3: PDB_FC_IO_RQ_SEC CDB2.PDB2 14,304 IO/sec
# dcli -g ./cell_group -l root "cellcli -e LIST METRICCURRENT WHERE name='PDB_FC_IO_BY_SEC' and metricvalue \>0 "
cell1: PDB_FC_IO_BY_SEC CDB1.PDB1 3,657 MB/sec
cell1: PDB_FC_IO_BY_SEC CDB2.PDB2 962 MB/sec
cell2: PDB_FC_IO_BY_SEC CDB1.PDB1 3,709 MB/sec
cell2: PDB_FC_IO_BY_SEC CDB2.PDB2 907 MB/sec
cell3: PDB_FC_IO_BY_SEC CDB1.PDB1 3,668 MB/sec
cell3: PDB_FC_IO_BY_SEC CDB2.PDB2 937 MB/sec
So we indeed see that CDB1 gets four times the IOPS and throughput of CDB2 – something IORM should prevent, since both have the same IORM shares. To rule out that CDB2 was simply asking for less, we stopped the CDB1 sessions and saw this:
# dcli -g ./cell_group -l root "cellcli -e LIST METRICCURRENT WHERE name='PDB_FC_IO_BY_SEC' and metricvalue \>0 "
cell1: PDB_FC_IO_BY_SEC CDB2.PDB2 4,140 MB/sec
cell2: PDB_FC_IO_BY_SEC CDB2.PDB2 4,177 MB/sec
cell3: PDB_FC_IO_BY_SEC CDB2.PDB2 4,140 MB/sec
So indeed – with the CDB1 sessions gone, CDB2 alone reached the same ~4,100 MB/sec per cell that both CDBs together had produced before. The CDB2 workload had been substantially throttled, far below what one would expect under IORM with equal shares.
Now what is the reason? An IORM bug? If we look at the total (large) IOPS, it is around 70k. That is much less than what an X6 storage cell can do from flash – I have seen workloads producing 180k IOPS per cell. So maybe the reason is that flash IO is not at capacity, and IORM is therefore simply not kicking in?
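One quick sanity check worth doing first (a suggestion – the exact output will vary): an inactive plan, or an objective of basic, would by itself mean that plan shares are not fully enforced:

# check that the IORM plan is active and which objective is in effect
dcli -g ./cell_group -l root "cellcli -e list iormplan attributes name,objective,status"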
And indeed, IO below capacity appears to be the explanation. The limiting factor instead seems to be network bandwidth. 4,100 MB/sec per storage cell is only about 85 percent of what InfiniBand should be able to do, but it is certainly high. In addition, the two database servers receive a combined 3 × ~4,100 MB/sec ≈ 12.3 GB/sec, which even appears to be above the InfiniBand bandwidth available to two servers. It is not fully clear to me where this comes from.
But anyway – it appears to be a reasonable assumption that network bandwidth is the limiting factor, that there is no resource management on it, and that the CDB with more sessions therefore simply gets more of it.
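One experiment that might be worth trying (we have not – the values below are hypothetical) is adding a utilization limit for the greedy CDB: unlike shares, a limit is enforced even when the cell is not saturated, so it might indirectly cap the bandwidth one CDB can pull from flash:

# hypothetical: cap CDB1 at 50% IO utilization in addition to the shares
dcli -g ./cell_group -l root "cellcli -e alter iormplan dbplan=((name=CDB1, share=4, limit=50), (name=CDB2, share=4), (name=other, share=1))"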
I still have an Oracle Support call open for this – I will update when I get further insights.
There is some network resource management on Exadata; however, it seems only to prioritise between different types of workload, not between the same workload type coming from different databases. This is an extract from the X7 data sheet:
“Exadata also uniquely implements database network resource management to
ensure that network intensive workloads such as reporting, batch, and backups don’t
stall response time sensitive interactive workloads. Latency sensitive network operations
such as RAC Cache Fusion communication and log file writes are automatically moved
to the head of the message queue in server and storage network cards as well as
InfiniBand network switches, bypassing any non-latency sensitive messages. Latency
critical messages even jump ahead of non-latency critical messages that have already
been partially sent across the network, ensuring low response times even in the
presence of large network DMA (Direct Memory Access) operations.”
How relevant is this? Fairly relevant, and becoming more relevant, in my opinion. A network-bound workload on Exadata may have been fairly rare so far, but with increasing flash cache sizes and other optimisation methods (cellmemory …) it is likely to become much more common.
I would be happy to hear any opinions on this point.