Sen2Agri performance

Hi,

I’ve got a couple of questions concerning Sen2Agri’s resource use and performance:

We’re currently running Sen2Agri on a VM with 8 vCPUs and 32 GB RAM, which is below the recommended prerequisites stated in the SUM (additional resources are available if needed).

Our area covers 14 S2 tiles with a season from March to December of 2017, so a year’s worth of data (including Landsat-8).

However, during the whole processing chain, we never came close to maxing out any of the available resources. RAM usage typically stays below 5 GB, and CPU usage hovers around 20%.

Here’s a screenshot of our system overview tab (currently running one L4B process in automated mode and one L4A process through the command line):

system_overview

This looks considerably different from the example in the SUM (p. 48).

I’ve also compared our L2 processing times with the performance example given in Appendix C of the SUM:

On a more powerful machine, the MACCS processor seems to take ~8.5 mins per S2 tile to complete processing. We’re currently taking around 36 minutes per tile.

This all leads me to believe that some performance tweaking is needed, but so far all attempts have been unsuccessful.

So the questions are:

For automatic mode, is it as easy as installing Sen2Agri on a more powerful machine and letting SLURM do the rest? Or do some parameters need to be changed?

For command-line mode, there are some flags for controlling the number of threads/processes or the number of tiles processed in parallel.

For instance, demmaccs.py has the flags --processes-number-dem and --processes-number-maccs. I’ve tried to adjust both, but didn’t see any change in processing time.

Is the only way to run MACCS in parallel to create multiple sub-sites and merge them afterwards?

For CropMaskFused.py and CropTypeFused.py there are the flags max-parallelism and tile-threads-hint. These seem to have an effect on performance. Is there any way to change the default values so they don’t need to be passed explicitly on the command line? I’ve looked at the config DB tables, but I haven’t found a corresponding entry.

Many thanks,

Valentin

Hi,

There’s a couple of things here:

  • RAM usage: the RAM requirements come mostly from one step of the unsupervised L4A processor, which uses about 20 GB RAM / tile. We’d like to optimize this in the future, but that hasn’t been implemented yet. The other processors have more reasonable RAM requirements.
  • RAM usage (part 2): with noisy or otherwise poor data, the L4A and L4B processors end up creating very large models. While this could in theory be tuned via some configuration settings, we haven’t done an analysis yet to find better defaults. We’re indirectly using OpenCV for classification and it’s somewhat inefficient memory-wise: a 1 GB model can end up using 10 GB RAM. This is made worse when doing the classification for multiple tiles at once – the model is loaded for each instance, and these aren’t shared.
  • max-parallelism and tile-threads-hint: these can’t currently be configured in the database, but that should change in the next version. Note that you won’t get perfect scaling: adding threads stops paying off beyond 4 threads or so (because not every part of the processing pipeline is parallelized – see Amdahl’s Law), and increasing either value can make the processor I/O-bound (when there’s too much data to read from disk).
  • demmaccs.py flags: unfortunately, those flags have a minimal effect. The main issue is that the L2A processor isn’t run in parallel for multiple tiles of the same site. You can process more data if you have multiple sites, but not with a single site. We’d like this to change in the future.
  • MACCS performance: you could try checking what MACCS is doing by running htop or iotop. MACCS itself is multi-threaded, but seems limited to 4 threads or so. Since you’re running in a VM, my guess is that it’s mostly I/O-bound in your case (such processes show up with status D in htop or ps; see the snippet below).
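
As a quick sanity check from the shell (standard tools, nothing Sen2Agri-specific; a sketch only):

# list processes in uninterruptible sleep (status D), i.e. usually blocked on disk I/O;
# if the maccs processes show up here most of the time, the run is I/O-bound
$ ps -eo pid,stat,comm | awk '$2 ~ /^D/'

# iotop (as root) shows per-process disk throughput; -o hides idle processes
$ sudo iotop -o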

Thanks for your remarks.

The SUM lists a system performance example in Appendix C. I’m assuming this is how fast Sen2Agri should work given the necessary resources and a proper configuration.

Can you give some pointers as to how these times were achieved, or how the system was configured to get them?

Like, say I had a machine with 2 x Intel Xeon E5-2650 v3 @ 2.30 GHz, 40 threads, and 128 GB RAM: what can I do to process 14 tiles with MACCS in two hours?
And what are the recommended thread and parallel-tile flag values for the L4 processors?

Best regards

Can you give some pointers as to how these times were achieved, or how the system was configured to get them?

I suspect the timing is from our production server. We have little specific configuration (the corresponding example commands are sketched after the list):

  • Linux is installed on bare-metal, not in a virtual machine
  • L2A products were available either on a local low-redundancy RAID-5 array or on an SMB (Samba) share
  • For SMB shares, I recommend something like vers=2.1,cache=loose in the mount options
  • It didn’t matter that much on that server, but we ran tuned-adm profile throughput-performance, because on another server with the hpvsa (HP Dynamic Smart Array) driver we were seeing very poor (~20 MB/s) disk throughput with the default profile.
  • We disabled THP (transparent hugepages) via /sys/kernel/mm/transparent_hugepage/enabled. THP is problematic for performance under some circumstances.
  • We disabled automatic NUMA balancing via /proc/sys/kernel/numa_balancing because it was detrimental for performance.
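
For reference, a minimal sketch of how those settings can be applied, as root (the server, share and mount point are placeholders; the THP and NUMA changes don’t persist across reboots):

# mount an SMB share with the suggested options
$ mount -t cifs //server/l2a /mnt/l2a -o vers=2.1,cache=loose
# switch to the throughput-oriented tuned profile
$ tuned-adm profile throughput-performance
# disable transparent hugepages until the next reboot
$ echo never > /sys/kernel/mm/transparent_hugepage/enabled
# disable automatic NUMA balancing until the next reboot
$ echo 0 > /proc/sys/kernel/numa_balancing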

The times were measured using a previous version of MACCS. I’m not sure if its performance has changed in the meanwhile.

what can I do to process 14 tiles with MACCS in two hours

As I mentioned above, you should get better parallelization with multiple sites because of a limitation in our MACCS launcher script.

And what are the recommended thread and parallel-tile flag values for the L4 processors?

It depends on the amount of input data and the number of CPUs. From what I’ve seen, the L4 processors don’t really use more than 4 threads, so you might want to set tile-threads-hint to something between 2 and 4. This assumes that your site is large enough, e.g. 20 tiles for 2 threads on a 40-CPU system (although hyper-threading might or might not help), and 10 tiles with 4 threads. The default will be 4 threads per tile in the next version (reduced from 5). The defaults seem reasonable (to me) otherwise, without knowing more about the workload and system characteristics.
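
To make that concrete, an invocation might look like the following. This is a sketch only: the flag spellings are taken from this thread, the counts are examples for a 40-thread machine, and the required site and input arguments are elided; check the script’s --help for the exact names.

# 4 threads per tile, up to 10 tiles at once (required arguments elided)
$ CropMaskFused.py --tile-threads-hint 4 --max-parallelism 10 ...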

Note that if you use a low value you might encounter a “long tail” effect. If a tile straddles two UTM zones, the Landsat products might need to be reprojected, which slows down the processing. With fewer threads, even if you have a lot of tiles, you might find that the last couple of tiles take a long time (hours) to finish. For example, with tile-threads-hint set to 2, the last tile will only use two threads, even if the system has 40 logical threads available.

To mitigate this, you could try over-subscribing the system by e.g. using 4 threads with 20 concurrent tiles on a 40-way system. I haven’t tested such a configuration myself.

There’s also the issue of the unsupervised L4A processor, which is slower due to the issue mentioned above.

And of course, you should keep in mind that there’s a chance of saturating your I/O bandwidth, which would render any changes to the threading parameters mostly ineffective.


Thank you very much for the insightful info.

I see there are a lot of moving parts. Previously we processed the sites using only automatic mode, and I was under the impression that it doesn’t use any parallelization, which meant that computing a crop mask would take 3 days or more for around 13 tiles.

Now that we’re upscaling to the entire country, we’d be talking about 100 tiles or so, hence a proper adjustment of the settings is necessary.

It shouldn’t take that long, as far as I know. We’ve processed sites of 100 tiles or so, but it’s not directly comparable, because we used the stratification feature (to split the site into e.g. 5 regions and use a different model for each). I believe the running time was on the order of three days, but it will vary, of course.

Now that we’re upscaling to the entire country, we’d be talking about 100 tiles or so, hence a proper adjustment of the settings is necessary.

What is your system configuration?

This is the current configuration, with 32 GB RAM; we’ll upgrade to 24 vCPUs and 64 GB RAM:

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             8
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 61
Model name:            Intel Core Processor (Broadwell)
Stepping:              2
CPU MHz:               2199.996
BogoMIPS:              4399.99
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
NUMA node0 CPU(s):     0-7

When you split the region, did you process the 5 sub-regions individually for MACCS parallelization? I guess if so, there’s no need for stratification in L4A/L4B.

When you split the region, did you process the 5 sub-regions individually for MACCS parallelization? I guess if so, there’s no need for stratification in L4A/L4B.

No, we used that on the theory that it might improve the results because the size of the site means training data for one region doesn’t apply to the others (because of the different climate). I’m not convinced it really helps. But by doing that we were able to extract a bit more parallelism by processing multiple strata at once. I think that doesn’t happen in a normal run.

Sorry, I think I wasn’t clear. The time estimate from my post above didn’t include L2A processing.

No, I understood; maybe my questions were also ambiguous.

With the 24vCPU system:

Is there any consideration for sub-sites for MACCS? Would splitting it into more than 2 make sense?

And for L4A/B: do you reckon 4 threads with 6 tiles is a good starting point? If I wanted to over-subscribe, maybe increasing the concurrent tile count to 8 could be an option.

Yes, but unfortunately it’s hard to do. You’d have to prepare a shapefile for each region (the extents are used for searching the products, but if you already have the L1Cs it doesn’t matter), create multiple sites (the more, the better), set up tile filter lists for each of them (there’s a database table; I think it’s described somewhere in the manual, and there might be a helper script), and symlink the L1Cs if needed (see the sketch below).
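
The symlinking step could look something like this (all paths are placeholders for your archive layout):

# link already-downloaded L1C products into a sub-site's directory
# instead of copying them, to save disk space
$ for p in /mnt/archive/l1c/full_site/*.SAFE; do ln -s "$p" /mnt/archive/l1c/subsite_1/; done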

Do you reckon 4 threads with 6 tiles is a good starting point?

For the 24 vCPU VM, I assume? Yes, that sounds fine; you only need to set the thread count (it’s 5 by default, as I mentioned, which would get you only 4 tiles at once). 8 concurrent tiles should work, too. My main worry would be the I/O performance.

For Sentinel, isn’t it the actual polygon? I’m not sure how it is for Landsat, though. At the moment we only have Sentinel-2A L1C on hand; the rest needs to be downloaded.

So the max-parallelism flag doesn’t have any impact? If I target 8 tiles, I should set tile-threads-hint to 10 then, if I understand correctly.

Ah, no, it’s computed from the CPU and tile thread counts. For 4 threads with 8 tiles you’d need to set both.
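
Assuming simple integer division (which is consistent with the numbers earlier in this thread), the derived value works out like this:

# parallel tile count = CPUs / threads per tile, rounded down
$ echo $((24 / 5))    # the default of 5 threads per tile on 24 vCPUs
4
$ echo $((24 / 4))    # 4 threads per tile
6

So to get 4 threads with 8 tiles at once (over-subscribing the 24 vCPUs), both flags have to be set explicitly.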

Hi again,

I just realized that the times in the SUM are reported for old-style products with multiple tiles. For those, MACCS was indeed run in parallel for multiple tiles at once. On new-style products (with a single tile), only a single MACCS instance is started.

Thanks for clarifying!

Hi,

With the insights gained in the L2A thread, I have a follow-up question to your previous post:

Since in our case the L8 L1C products are the limiting factor, do you reckon it would be a good approach to define a site per L8 path (yielding ~7 individual sites) and filter out the overlaps for L2 processing?

Also, can you elaborate on what you mean by symlinking the L1 files?

That sounds reasonable.

I’m not sure any more :smile:. I probably meant that if you already have the L1Cs on disk, you’ll have to distribute them to the sub-sites. Since they can generally intersect multiple sub-sites (we, at least, used S2 tiles as a reference and added the needed L8 products), you’ll want to make symlinks instead of copying them, to save disk space.

And of course, after the L2As are ready, you might or might not want to merge back all of them into a single site.

Hi @lnicola,

I’m afraid I need to pick up this topic and ask you for your advice:

As planned, I set up 8 separate sites, one per Landsat-8 path, and started downloading and processing Sentinel-2 & Landsat-8 data.

In the beginning everything looked fine, but upon closer inspection I’m not sure the L2 processing works properly.

First, even though there are downloaded Landsat products available, MACCS is only processing Sentinel at the moment. I suspect there could be some prioritization at work? In my understanding, Sentinel and Landsat should be processed in parallel.

Then, the processing seems too slow. As stated above, we now have increased resources (which do seem to be used), but this is the L2 output for the last 2 days:

 S2B_MSIL2A_20170818T075609_N0205_R035_T36NYL_20170818T080616.SAFE                   | 2018-02-06 07:55:58.778893+00
 S2A_OPER_PRD_MSIL2A_PDMC_20161003T162407_R121_V20161003T081752_20161003T083125.SAFE | 2018-02-06 06:24:07.619155+00
 S2A_OPER_PRD_MSIL2A_PDMC_20161009T172748_R064_V20161009T083842_20161009T085021.SAFE | 2018-02-06 05:45:25.870406+00
 S2A_OPER_PRD_MSIL2A_PDMC_20161010T142809_R078_V20161010T080842_20161010T082711.SAFE | 2018-02-06 05:39:45.77675+00
 S2A_OPER_PRD_MSIL2A_PDMC_20161007T180053_R035_V20161007T080002_20161007T081556.SAFE | 2018-02-06 05:28:51.320767+00
 S2B_MSIL2A_20170818T075609_N0205_R035_T36NYN_20170818T080616.SAFE                   | 2018-02-06 04:59:30.389212+00
 S2A_OPER_PRD_MSIL2A_PDMC_20161003T160442_R121_V20161003T081752_20161003T083125.SAFE | 2018-02-06 03:01:54.553429+00
 S2A_OPER_PRD_MSIL2A_PDMC_20161009T173738_R064_V20161009T083842_20161009T085021.SAFE | 2018-02-06 02:21:02.730032+00
 S2A_OPER_PRD_MSIL2A_PDMC_20161010T081253_R035_V20161007T080002_20161007T081556.SAFE | 2018-02-06 02:00:59.41955+00
 S2A_OPER_PRD_MSIL2A_PDMC_20161003T164123_R121_V20161003T081752_20161003T083125.SAFE | 2018-02-06 01:53:28.40593+00
 S2B_MSIL2A_20170818T075609_N0205_R035_T36NYM_20170818T080616.SAFE                   | 2018-02-06 01:50:21.853611+00
 S2A_OPER_PRD_MSIL2A_PDMC_20161007T180533_R035_V20161007T080002_20161007T081556.SAFE | 2018-02-05 22:12:56.094008+00
 S2A_OPER_PRD_MSIL2A_PDMC_20161009T172352_R064_V20161009T083842_20161009T085021.SAFE | 2018-02-05 22:07:29.808063+00
 S2A_OPER_PRD_MSIL2A_PDMC_20161003T163554_R121_V20161003T081752_20161003T083125.SAFE | 2018-02-05 22:04:47.819048+00
 S2A_OPER_PRD_MSIL2A_PDMC_20161003T164250_R121_V20161003T081752_20161003T083125.SAFE | 2018-02-05 21:31:20.796908+00
 S2B_MSIL2A_20170815T074609_N0205_R135_T36NXL_20170815T075734.SAFE                   | 2018-02-05 21:28:50.266974+00
 S2A_OPER_PRD_MSIL2A_PDMC_20161003T162903_R121_V20161003T081752_20161003T083125.SAFE | 2018-02-05 12:36:46.866786+00
 S2A_OPER_PRD_MSIL2A_PDMC_20161006T142725_R021_V20161006T083002_20161006T084239.SAFE | 2018-02-05 11:18:24.606033+00
 S2B_MSIL2A_20170815T074609_N0205_R135_T36NYL_20170815T075734.SAFE                   | 2018-02-05 11:18:11.847942+00
 S2A_OPER_PRD_MSIL2A_PDMC_20161007T175840_R035_V20161007T080002_20161007T081556.SAFE | 2018-02-05 11:16:55.037181+00
 S2A_OPER_PRD_MSIL2A_PDMC_20161003T164403_R121_V20161003T081752_20161003T083125.SAFE | 2018-02-05 10:44:04.742515+00
 S2A_OPER_PRD_MSIL2A_PDMC_20161006T142557_R021_V20161006T083002_20161006T084239.SAFE | 2018-02-05 03:01:24.232745+00
 S2A_OPER_PRD_MSIL2A_PDMC_20161003T164404_R121_V20161003T081752_20161003T083125.SAFE | 2018-02-05 02:29:07.835879+00
 S2A_OPER_PRD_MSIL2A_PDMC_20161003T160244_R121_V20161003T081752_20161003T083125.SAFE | 2018-02-05 02:17:15.532008+00
 S2A_OPER_PRD_MSIL2A_PDMC_20161004T165054_R135_V20161004T074742_20161004T075653.SAFE | 2018-02-05 01:37:37.1084+00
 S2B_MSIL2A_20170726T074609_N0205_R135_T36NZM_20170726T075740.SAFE                   | 2018-02-05 00:19:16.922571+00

Some products seem to take incredibly long:

2018-02-06 05:16:41.702475:[31386]:Starting command: /opt/maccs/core/5.1/bin/maccs --input /mnt/archive/demmaccs_tmp/26431/36NYL --TileId 36NYL --output /mnt/archive/maccs_def/lspth170/l2a/S2B_MSIL2A_20170818T075609_N0205_R035_T36NYL_20170818T080616.SAFE//maccs_36NYL --mode L2NOMINAL --loglevel DEBUG --enableTest false --CheckXMLFilesWithSchema false --conf /usr/share/sen2agri/sen2agri-demmaccs/UserConfiguration
2018-02-06 07:54:44.737877:[31386]:Command finished OK (res = 0) in 2:38:03.034433 : /opt/maccs/core/5.1/bin/maccs --input /mnt/archive/demmaccs_tmp/26431/36NYL --TileId 36NYL --output /mnt/archive/maccs_def/lspth170/l2a/S2B_MSIL2A_20170818T075609_N0205_R035_T36NYL_20170818T080616.SAFE//maccs_36NYL --mode L2NOMINAL --loglevel DEBUG --enableTest false --CheckXMLFilesWithSchema false --conf /usr/share/sen2agri/sen2agri-demmaccs/UserConfiguration
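
For reference, times like these can be pulled out of the logs with something along these lines (the log path is a placeholder; the message format matches the excerpt above):

# print the wall-clock time of every finished MACCS run
$ grep -h 'Command finished OK' /path/to/demmaccs/logs/*.log | sed -n 's/.*in \([0-9:.]*\) :.*/\1/p'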

Is there anything I can do to speed up the process? With the current performance, this would take some serious time.

Could it be that 8 sites are too many for our setup? Could temporarily disabling some sites speed things up a bit? Or do you reckon adding computational resources would help?

Thanks for your insight

Sorry for the late reply, but I’m not sure what to say. Can you maybe install htop, iotop and perf and include their output? For perf you can run perf top for a while.
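
In case it’s useful, the commands could be along these lines (standard tools; perf record/report is an alternative to watching perf top live):

$ sudo perf top                        # live system-wide profile
$ sudo perf record -a -g -- sleep 60   # or record 60 s of samples across all CPUs
$ sudo perf report                     # then browse the recorded profile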

Hi,

Here are the screenshots (perf was running for about 45 mins, hope that’s enough):

htop

htop2

iotop

iotop

perf top

perf-top

I also took a closer look at the individual maccs processes; there seem to be 4 running at a time, all Sentinel-2.

Through the logs I got the distribution of the individual processing times:

s2ex

It’s quite spread out, but taking the time of processing into account, it looks like processing times were high initially and then settled around ~200 minutes, which is still quite long. I’ve added some colors for the individual UTM tiles, to check for any patterns:

exvstop_woL

Val