vela: Improve block configuration and weight buffering algorithm

 - Update block config selection to take into account partial
   IFM fetches at the edge of non-whole OFM block data.
 - Change scheduler depth slicing for the networks in MLBEDSW-4637
   to improve buffering. Buffering larger depth slices helps general
   performance.
 - Fix opt_max_schedule always being fitted to SRAM, which
   prevented the optimisation step from running in some cases.

Signed-off-by: Tim Hall <tim.hall@arm.com>
Change-Id: I97642c5adec3bb684b1daabf2b81574c27d4eef2
diff --git a/ethosu/vela/npu_performance.py b/ethosu/vela/npu_performance.py
index 5c61c7d..21b420b 100644
--- a/ethosu/vela/npu_performance.py
+++ b/ethosu/vela/npu_performance.py
@@ -410,6 +410,7 @@
 
 def measure_mem2mem_cycles(arch, from_mem_area, to_mem_area, to_transfer):
     from_cycles = to_transfer // arch.memory_bandwidths_per_cycle[from_mem_area]
+    from_cycles += arch.memory_latency[from_mem_area][BandwidthDirection.Read]
     to_cycles = to_transfer // arch.memory_bandwidths_per_cycle[to_mem_area]
     return max(from_cycles, to_cycles)
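
Illustration only, not part of the patch: a minimal, self-contained sketch of how the added read-latency term changes the estimate. The function body is copied from the hunk above. `ToyArch`, the stand-in `MemArea` and `BandwidthDirection` enums, and all bandwidth and latency numbers are hypothetical and are not taken from vela.

```python
from enum import Enum


class BandwidthDirection(Enum):
    # Hypothetical stand-in for vela's enum of the same name.
    Read = 0
    Write = 1


class MemArea(Enum):
    # Hypothetical stand-in with only two memory areas.
    Dram = 0
    Sram = 1


class ToyArch:
    """Hypothetical architecture object; every number here is made up."""

    def __init__(self):
        # Bytes transferred per cycle for each memory area (assumed values).
        self.memory_bandwidths_per_cycle = {MemArea.Dram: 8, MemArea.Sram: 16}
        # Access latency in cycles per memory area and direction (assumed values).
        self.memory_latency = {
            MemArea.Dram: {BandwidthDirection.Read: 250, BandwidthDirection.Write: 250},
            MemArea.Sram: {BandwidthDirection.Read: 32, BandwidthDirection.Write: 32},
        }


def measure_mem2mem_cycles(arch, from_mem_area, to_mem_area, to_transfer):
    # Same body as the patched function: the source side now pays its read latency.
    from_cycles = to_transfer // arch.memory_bandwidths_per_cycle[from_mem_area]
    from_cycles += arch.memory_latency[from_mem_area][BandwidthDirection.Read]
    to_cycles = to_transfer // arch.memory_bandwidths_per_cycle[to_mem_area]
    return max(from_cycles, to_cycles)


arch = ToyArch()
# 64 bytes: 64 // 8 + 250 = 258 source cycles vs 64 // 16 = 4 destination cycles.
small = measure_mem2mem_cycles(arch, MemArea.Dram, MemArea.Sram, 64)
# 4096 bytes: 4096 // 8 + 250 = 762 source cycles vs 4096 // 16 = 256 destination cycles.
large = measure_mem2mem_cycles(arch, MemArea.Dram, MemArea.Sram, 4096)
print(small, large)
```

With these assumed numbers the fixed latency dominates the small transfer (258 cycles, against 8 without the latency term), while it is a smaller share of the large one (762 against 512).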