Transforming and Splitting huge EDI files with Smooks

March 29, 2011 claus

We have to transform huge EDI files into individual data sets (XML). Huge means up to 700K data sets per file, which is a file size of 250 MB. Each file has a global header and n data groups. Each data group has its own group-level header information and n data sets. The file structure looks like this:

file
 |-- header [1]
 |-- data group [0 - *]
 |     |-- header [1]
 |     |-- data sets [0 - *]

These are the issues:
- transform the EDI into XML
- get the global and parent information into each data set
- split the file into n data sets
- export the data sets to file system or JMS (in this sample we will use file export)

The problem is the size of the file. We can't transform it entirely in RAM; we have to stream the data in and stream the results out. This is where Smooks comes into play. Smooks is a library for transforming files. One possible combination is EDI in and XML out (check out the Smooks page for more options and samples). To tackle our issues, Smooks gives us a rich toolset:
- transforming the EDI into XML is done with a simple mapping language
- getting the global and parent information into each data set is easy by mixing DOM and SAX
- splitting the file into n data sets can be done with FreeMarker in combination with the DOM/SAX support
- exporting the data sets to the file system or JMS (in this sample we will use file export) is done via one of the export cartridges

So let's have a look at the documents.

The EDI file:

UNB+UNOC:3+107436557+104027544+20100420:1548+00402++KRZKO109973'
UNH+33097343800001+ABVO:03:0:0+109131438'
REC+288856+20080411+0+20080331+EUR+S'
INV+0275860003+1248M++803000160639131438+108533476'
NAD+MUSTERFALL+HERBERT+19480101+SCHOENSTR. 1+80339+MUENCHEN'
ZUP+649585800+20080312+2++++++2+20080325++1+1+1+1+1+1+1+1+1+1+++649585800'
EFP+4119092+2+58,32+2+1+++0'
BES+58,32+0,00'
INV+0273601015+10001++803000160649131438'
NAD+MUSTERFALL+HEIDI+19560101+ZEPPELINUSSTR. 1+80339+MUENCHEN'
ZUP+000000000+20080313+2++++++3+20080313++1+1+1+1+1+1+1+1+1+1+++000000000'
EFP+1448808+10+283,80+2+1+++0'
BES+283,80+0,00'
UNT+13+33097343800001'
UNH+59091046500002+ABVO:03:0:0+109131438'
REC+260649+20080624+0+20080331+EUR+S'
INV+0271716008+10001++803000160659131438+108533476'
NAD+MUSTERFALL+AUGUSTE+19270101+BAHNHOFWEG 1+80339+MUENCHEN'
ZUP+648429000+20080311+1++++++2+20080312++1+1+1+1+1+1+1+1+1+1+++648429000'
EFP+3222497+2+196,46+2+1+++0'
BES+196,46+0,00'
UNT+8+59091046500002'
...
UNZ+10+00402'

As you can see, between the first UNH and UNT segments (one data group) there are two INV segments, but only one REC segment. So a data set is always the data between an INV and a BES segment, plus the global information from the REC segment of its data group and from the UNB segment of the file.
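To make the data set boundaries concrete, here is a minimal plain-Java sketch of the grouping rule described above. It is only an illustration and not how Smooks works internally: the class and method names (`DataSetSketch`, `split`) are made up for this example, the whole input is held in memory, and the `?` escape character is ignored.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: groups EDI segments into data sets (INV .. BES). */
final class DataSetSketch {

    /** One data set plus the context it inherits from the file and its data group. */
    static final class DataSet {
        final String fileHeader;                          // UNB segment of the file
        final String invoice;                             // REC segment of the data group
        final List<String> segments = new ArrayList<>();  // INV .. BES

        DataSet(String fileHeader, String invoice) {
            this.fileHeader = fileHeader;
            this.invoice = invoice;
        }
    }

    static List<DataSet> split(String edi) {
        List<DataSet> result = new ArrayList<>();
        String unb = null;
        String rec = null;
        DataSet current = null;
        // Segments end with an apostrophe (escape handling is left out here).
        for (String raw : edi.split("'")) {
            String segment = raw.trim();
            if (segment.isEmpty()) {
                continue;
            }
            int plus = segment.indexOf('+');
            String code = plus < 0 ? segment : segment.substring(0, plus);
            switch (code) {
                case "UNB": unb = segment; break;   // global file header
                case "UNH": rec = null; break;      // a new data group starts
                case "REC": rec = segment; break;   // group-level information
                case "INV":                         // a new data set starts
                    current = new DataSet(unb, rec);
                    current.segments.add(segment);
                    break;
                case "BES":                         // the data set ends
                    if (current != null) {
                        current.segments.add(segment);
                        result.add(current);
                        current = null;
                    }
                    break;
                default:                            // NAD, ZUP, EFP, ...
                    if (current != null) {
                        current.segments.add(segment);
                    }
            }
        }
        return result;
    }
}
```

Applied to the sample file, the first data group would yield two data sets that share the same REC segment, and every data set would carry the same UNB segment.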

The mapping file:

<?xml version="1.0" encoding="UTF-8"?>
<medi:edimap xmlns:medi="http://www.milyn.org/schema/edi-message-mapping-1.2.xsd">
 
    <medi:import truncatableSegments="true" truncatableFields="true" truncatableComponents="true" resource="global-segment-definition.xml" namespace="def"/>
 
    <medi:description name="ABVO" version="3.0"/>
 
    <medi:delimiters segment="'" field="+" component=":" sub-component="~" escape="?"/>
 
    <medi:segments xmltag="ABVO">
 
        <!-- UNB+UNOC:3+107436557+104027544+20100420:1548+00402++KRZKO109973' -->
        <medi:segment minOccurs="0" maxOccurs="1" segref="def:UNB" segcode="UNB" xmltag="Header"/>
 
		<medi:segmentGroup xmltag="Datagroup" maxOccurs="-1">
 
			<!-- UNH+33097343800001+ABVO:03:0:0+109131438' -->
			<medi:segment minOccurs="1" maxOccurs="-1" segref="def:UNH" segcode="UNH" xmltag="DatasetHeader"/>
 
			<!-- REC+53408-011022+20081120+0+20080331+EUR+S' -->
			<medi:segment minOccurs="0" maxOccurs="-1" segcode="REC" xmltag="Rechnung">
				<medi:field xmltag="Rechnungsnummer"/>
				<medi:field xmltag="Rechnungsdatum"/>
                                ...
			</medi:segment>
 
			<medi:segmentGroup xmltag="Dataset" maxOccurs="-1">
 
				<!-- INV+0275860003+1248M++803000160639131438+108533476' -->
				<medi:segment minOccurs="0" maxOccurs="-1" segref="def:INV" segcode="INV" xmltag="InformationVersicherte" truncatable="true"/>
 
				...
 
			</medi:segmentGroup>
 
			<!-- UNT+8+66059903600010' -->
			<medi:segment minOccurs="0" segref="def:UNT" segcode="UNT" xmltag="EndsegmentDatensatz"/>
 
		</medi:segmentGroup>
 
		<!-- UNZ+10+00402' -->
		<medi:segment minOccurs="0" segref="def:UNZ" segcode="UNZ" xmltag="EndsegmentDokument"/>
    </medi:segments>
 
</medi:edimap>

With the segmentGroup element we can nest segments and so give the generated XML the structure we want, here the Datagroup/Dataset hierarchy.

The Smooks configuration:

<?xml version="1.0"?>
<smooks-resource-list xmlns="http://www.milyn.org/xsd/smooks-1.1.xsd"
	xmlns:core="http://www.milyn.org/xsd/smooks/smooks-core-1.3.xsd"
	xmlns:file="http://www.milyn.org/xsd/smooks/file-routing-1.2.xsd"
	xmlns:ftl="http://www.milyn.org/xsd/smooks/freemarker-1.1.xsd"
	xmlns:edi="http://www.milyn.org/xsd/smooks/edi-1.4.xsd">
 
        <!-- getting the mapping file into the configuration -->
	<edi:reader mappingModel="mappingmodels/abvo.xml" ignoreNewLines="true" />
 
	<!-- Filter the message using the SAX filter (i.e. not DOM, so no intermediate 
		DOM is built and we can process huge messages). -->
	<core:filterSettings type="SAX" />
 
        <!-- As we need information from the Header and Datagroup in each Dataset, 
                we have to define 3 different DOM models -->
	<resource-config selector="Header,Datagroup,Dataset">
		<resource>org.milyn.delivery.DomModelCreator</resource>
	</resource-config>
 
	<!-- Every time we hit the end of a <Dataset> element, apply this freemarker 
		template, outputting the result to the "datasetSplitStream" OutputStream, 
		which is the file output stream configured below. -->
	<ftl:freemarker applyOnElement="Dataset">
		<ftl:template><!--
			<#assign dataSet = .vars["Dataset"]>
			<#assign dataGroup = .vars["Datagroup"]>
			<dataset>
				<file id="${Header.Dateinummer}" name="${Header.Dateiname}">
						<sender>${Header.Absender}</sender>
							<receiver>${Header.Empfaenger}</receiver>
				</file> 
				<message ref="${dataGroup.DatasetHeader.Nachrichtenreferenz}" refNr="${dataGroup.DatasetHeader.ZuordnungsReferenzNummer}"/>
        		<invoice id="${dataGroup.Rechnung.Rechnungsnummer}">
        			<date>${dataGroup.Rechnung.Rechnungsdatum}</date>
        			<period>${dataGroup.Rechnung.Abrechnungszeitraum}</period>
        		</invoice>
        		<insurance id="${dataSet.InformationVersicherte.VersNummer}">
					<state>${dataSet.InformationVersicherte.VersStatus}</state>
				</insurance>
			</dataset>
		 --></ftl:template>
		<ftl:use>
			<!-- Output the templating result to the "datasetSplitStream" file output 
				stream... -->
			<ftl:outputTo outputStreamResource="datasetSplitStream" />
		</ftl:use>
	</ftl:freemarker>
 
	<!-- Create/open a file output stream. This is written to by the freemarker 
		template (above).. -->
	<file:outputStream resourceName="datasetSplitStream"
		openOnElement="Dataset">
		<file:fileNamePattern>abvo-${Header.Dateinummer}-${.vars["Datagroup"].DatasetHeader.Nachrichtenreferenz}-${.vars["Dataset"].Einzellfallnachweis.Kennzeichen}.xml</file:fileNamePattern>
		<file:destinationDirectoryPattern>target/out</file:destinationDirectoryPattern>
		<file:highWaterMark mark="300" />
	</file:outputStream>
</smooks-resource-list>

A very important part of working with huge files is the 'core:filterSettings' element. When it is set to SAX, all work is done on the streams and no DOM of the whole message is built. This means that even processing millions of messages works with a very small footprint.
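The idea behind SAX filtering can be shown with the SAX parser that ships with the JDK. The sketch below is a generic illustration and not Smooks code: the class and method names are made up, and it only counts `Dataset` elements. The point is that the handler sees one event at a time, so memory use does not grow with the size of the input.

```java
import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Illustration only: stream over XML with SAX instead of building a DOM. */
final class SaxStreamingSketch {

    /** Counts the elements with the given name while streaming over the input. */
    static int countElements(InputStream xml, final String elementName) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                // Called once per start tag; nothing is kept in memory afterwards.
                if (elementName.equals(qName)) {
                    count[0]++;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML stream", e);
        }
        return count[0];
    }
}
```

Smooks combines this streaming style with small DOM fragments (the DomModelCreator resource in the configuration above), so only the current Header, Datagroup and Dataset need to be held in memory.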

The Camel route:

public class EdiToXmlTest extends CamelTestSupport {
        ...
	public void testSmooksBigFile() throws Exception{
 
		// number of split messages we expect (one per data set)
		runs = 100000;
 
		RouteBuilder builder = new RouteBuilder() {
 
			@Override
			public void configure() throws Exception {
 
				// pick up the EDI file and hand it to Smooks for splitting
				from("file://target/in?noop=true")
				.log("starting splitting...")
				.to("smooks://src/main/resources/file-config.xml");
 
				// consume and delete the split files, logging throughput per 1000 messages
				from("file://target/out?delete=true")
				.to("log:smooks?level=INFO&groupSize=1000")
				.setBody(constant(""))
				.to("mock:out");
 
			}
		};
 
		context.addRoutes(builder);
 
		MockEndpoint out = getMockEndpoint("mock:out");
		out.setExpectedMessageCount(runs);
 
		// wait up to 5000 seconds for all split messages to arrive
		assertMockEndpointsSatisfied(5000, TimeUnit.SECONDS);
	}
...
}

In this test case we're transforming an EDI file with 100K messages in it. When you start the test, you will get something like this:

2090 [main] INFO org.apache.camel.impl.DefaultCamelContext - Route: route5 started and consuming from: Endpoint[file://src/test/resources/data/in?noop=true]
2117 [main] INFO org.apache.camel.impl.DefaultCamelContext - Route: route6 started and consuming from: Endpoint[file://target/out?delete=true]
2118 [main] INFO org.apache.camel.component.mock.MockEndpoint - Asserting: Endpoint[mock://out] is satisfied
3096 [Camel (camel-1) thread #0 - file://src/test/resources/data/in] INFO route5 - starting splitting...
7275 [Camel (camel-1) thread #1 - file://target/out] INFO smooks - Received: 1000 messages so far. Last group took: 3656 millis which is: 273,523 messages per second. average: 273,523
9293 [Camel (camel-1) thread #1 - file://target/out] INFO smooks - Received: 2000 messages so far. Last group took: 2018 millis which is: 495,54 messages per second. average: 352,485
12647 [Camel (camel-1) thread #1 - file://target/out] INFO smooks - Received: 3000 messages so far. Last group took: 3354 millis which is: 298,151 messages per second. average: 332,3
16192 [Camel (camel-1) thread #1 - file://target/out] INFO smooks - Received: 4000 messages so far. Last group took: 3545 millis which is: 282,087 messages per second. average: 318,142
20656 [Camel (camel-1) thread #1 - file://target/out] INFO smooks - Received: 5000 messages so far. Last group took: 4464 millis which is: 224,014 messages per second. average: 293,479
23674 [Camel (camel-1) thread #1 - file://target/out] INFO smooks - Received: 6000 messages so far. Last group took: 3018 millis which is: 331,345 messages per second. average: 299,177

You can see that Smooks is processing around 300 messages per second. This matches the 'highWaterMark' value of 300 that we set in the file:highWaterMark element (line 56 of the Smooks config). The high water mark is the maximum number of files that can exist in the destination folder at one time. To check the number of files, Smooks polls the folder periodically (you can change this period by setting the poll interval yourself). The default high water mark is 200 files. We've abused this property a little bit: because Camel deletes the files from the directory right after Smooks has added them, the high water mark acts as a kind of tuning knob for how many messages can flood your system per polling interval. If we decrease this value, processing slows down, but the footprint decreases as well. So this is a very interesting property to tune, depending on whether you want a fast but more resource-intensive system or a slower one with a small footprint.
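The throttling effect can be sketched in a few lines of plain Java. This is a simplified model and not the Smooks implementation: the names are made up, the file count is passed in as a supplier so the logic can be tried without a real directory, and the exact comparison and polling behaviour inside Smooks may differ.

```java
import java.io.File;
import java.util.function.IntSupplier;

/** Illustration only: a writer that waits while too many files are pending. */
final class HighWaterMarkSketch {

    /** Counts the entries in a directory (0 if it cannot be listed). */
    static int countFiles(File directory) {
        String[] names = directory.list();
        return names == null ? 0 : names.length;
    }

    /**
     * Blocks until fewer than 'mark' files are pending, checking every
     * 'pollMillis' milliseconds. Returns the number of checks that were needed.
     */
    static int awaitBelowMark(IntSupplier pendingFiles, int mark, long pollMillis) {
        int polls = 0;
        while (true) {
            polls++;
            if (pendingFiles.getAsInt() < mark) {
                return polls;   // room for another file, the writer may continue
            }
            try {
                Thread.sleep(pollMillis);   // wait for the consumer to delete files
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
    }
}
```

A writer would call something like `awaitBelowMark(() -> countFiles(new File("target/out")), 300, 1000)` before creating each file. The faster the consumer (here the Camel route) deletes files, the less time the writer spends waiting, which is why the mark bounds the number of messages released per polling interval.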

We ran some tests on a Linux machine (kernel 2.6.32-27-generic) with 7.8 GB RAM and an Intel Core 2 Duo E8400 CPU at 3.00 GHz. These are not performance tests in the classical sense, but they should show how the system behaves when we change the high water mark.

Here are the results for a 30 MB EDI file (100K data sets) and a high water mark of 100:

[Figure: Smooks with high water mark of 100]

As you can see, CPU and heap consumption are very low: CPU below 5% on average, heap below 25 MB on average.

Here are the results for a 30 MB EDI file (100K data sets) and a high water mark of 500:

[Figure: Smooks with high water mark of 500]

As you can see, CPU and heap consumption are moderate: CPU around 40% on average, heap below 50 MB on average. In return, processing with a high water mark of 500 was six times faster than with 100.


2 Comments → “Transforming and Splitting huge EDI files with Smooks”

  1. tejo, 1 month ago

    Can we change the segment delimiters dynamically, for example depending on the UNA information?
