We have to transform huge EDI files into single data sets (XML). Huge means up to 700K ‘data sets’, which corresponds to a file size of 250 MB. Each file has a global header and n data groups. Each data group has its own group-level header information and n data sets. The file structure looks like this:
file
|-- header [1]
|-- data group [0 - *]
|   |-- header [1]
|   |-- data sets [0 - *]
|
These are the issues:
- transform the EDI into XML
- get the global and parent information into each data set
- split the file into n data sets
- export the data sets to file system or JMS (in this sample we will use file export)
The problem is the size of the file. We can’t transform it in RAM as a whole; we need something like data stream in and data stream out. This is where Smooks comes into play. Smooks is a library for transforming files. One possible combination is EDI in and XML out (check out the Smooks page for more options and samples). For our issues Smooks offers a rich toolset:
- transform the EDI into XML with a simple mapping language
- getting the global and parent information into each data set is easy by mixing DOM and SAX
- splitting the file into n data sets can be done with FreeMarker in combination with the DOM/SAX support
- export the data sets to file system or JMS (in this sample we will use file export) via one of the export cartridges
So let’s have a look at the documents.
The EDI file:
UNB+UNOC:3+107436557+104027544+20100420:1548+00402++KRZKO109973'
UNH+33097343800001+ABVO:03:0:0+109131438'
REC+288856+20080411+0+20080331+EUR+S'
INV+0275860003+1248M++803000160639131438+108533476'
NAD+MUSTERFALL+HERBERT+19480101+SCHOENSTR. 1+80339+MUENCHEN'
ZUP+649585800+20080312+2++++++2+20080325++1+1+1+1+1+1+1+1+1+1+++649585800'
EFP+4119092+2+58,32+2+1+++0'
BES+58,32+0,00'
INV+0273601015+10001++803000160649131438'
NAD+MUSTERFALL+HEIDI+19560101+ZEPPELINUSSTR. 1+80339+MUENCHEN'
ZUP+000000000+20080313+2++++++3+20080313++1+1+1+1+1+1+1+1+1+1+++000000000'
EFP+1448808+10+283,80+2+1+++0'
BES+283,80+0,00'
UNT+13+33097343800001'
UNH+59091046500002+ABVO:03:0:0+109131438'
REC+260649+20080624+0+20080331+EUR+S'
INV+0271716008+10001++803000160659131438+108533476'
NAD+MUSTERFALL+AUGUSTE+19270101+BAHNHOFWEG 1+80339+MUENCHEN'
ZUP+648429000+20080311+1++++++2+20080312++1+1+1+1+1+1+1+1+1+1+++648429000'
EFP+3222497+2+196,46+2+1+++0'
BES+196,46+0,00'
UNT+8+59091046500002'
...
UNZ+10+00402'
As you can see, between the first UNH and UNT segments (one data group) there are 2 INV segments, but only one REC segment. So a data set is always the data between an INV and a BES segment, plus the global information from REC and UNB.
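To make the data set rule concrete, here is a minimal plain-Java sketch of it. This is not how Smooks works internally; it only illustrates the idea of reading one segment at a time, remembering the current UNB and REC segments as context, and emitting a data set whenever a BES segment closes an INV block. The class and method names (`EdiSplitSketch`, `DataSet`, `split`) are made up for this illustration, and the `?` escape character of the delimiter set is ignored for brevity.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.function.Consumer;

// Hypothetical illustration only; not part of Smooks or Camel.
class EdiSplitSketch {

    /** One data set: the INV..BES segments plus the UNB and REC context it belongs to. */
    static final class DataSet {
        final String unb;
        final String rec;
        final List<String> segments;

        DataSet(String unb, String rec, List<String> segments) {
            this.unb = unb;
            this.rec = rec;
            this.segments = segments;
        }
    }

    /** Reads segments one by one and hands each completed data set to the sink. */
    static void split(Reader in, Consumer<DataSet> sink) {
        String unb = null;           // file-level context
        String rec = null;           // group-level context
        List<String> current = null; // segments of the data set being collected
        try (Scanner scanner = new Scanner(in)) {
            scanner.useDelimiter("'"); // segment terminator (escape handling omitted)
            while (scanner.hasNext()) {
                String segment = scanner.next().trim();
                if (segment.isEmpty()) {
                    continue;
                }
                int plus = segment.indexOf('+');
                String code = plus < 0 ? segment : segment.substring(0, plus);
                switch (code) {
                    case "UNB": unb = segment; break;
                    case "UNH": rec = null; break; // a new data group starts
                    case "REC": rec = segment; break;
                    case "INV":
                        current = new ArrayList<>();
                        current.add(segment);
                        break;
                    case "BES":
                        if (current != null) {
                            current.add(segment);
                            sink.accept(new DataSet(unb, rec, current));
                            current = null; // only the context stays in memory
                        }
                        break;
                    default:
                        if (current != null) {
                            current.add(segment);
                        }
                }
            }
        }
    }

    public static void main(String[] args) {
        String edi = "UNB+UNOC:3+107436557+104027544+20100420:1548+00402++KRZKO109973'\n"
                + "UNH+33097343800001+ABVO:03:0:0+109131438'\n"
                + "REC+288856+20080411+0+20080331+EUR+S'\n"
                + "INV+0275860003+1248M++803000160639131438+108533476'\n"
                + "BES+58,32+0,00'\n"
                + "INV+0273601015+10001++803000160649131438'\n"
                + "BES+283,80+0,00'\n"
                + "UNT+13+33097343800001'\n";
        split(new StringReader(edi),
                ds -> System.out.println(ds.segments.get(0) + " belongs to " + ds.rec));
    }
}
```

Because only the current context and one data set are held at a time, memory use does not grow with the number of data sets, which is the same property we rely on in the Smooks setup.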
The mapping file:
<?xml version="1.0" encoding="UTF-8"?>
<medi:edimap xmlns:medi="http://www.milyn.org/schema/edi-message-mapping-1.2.xsd">

    <medi:import truncatableSegments="true" truncatableFields="true" truncatableComponents="true"
                 resource="global-segment-definition.xml" namespace="def"/>

    <medi:description name="ABVO" version="3.0"/>

    <medi:delimiters segment="'" field="+" component=":" sub-component="~" escape="?"/>

    <medi:segments xmltag="ABVO">

        <!-- UNB+UNOC:3+107436557+104027544+20100420:1548+00402++KRZKO109973' -->
        <medi:segment minOccurs="0" maxOccurs="1" segref="def:UNB" segcode="UNB" xmltag="Header"/>

        <medi:segmentGroup xmltag="Datagroup" maxOccurs="-1">

            <!-- UNH+33097343800001+ABVO:03:0:0+109131438' -->
            <medi:segment minOccurs="1" maxOccurs="-1" segref="def:UNH" segcode="UNH" xmltag="DatasetHeader"/>

            <!-- REC+53408-011022+20081120+0+20080331+EUR+S' -->
            <medi:segment minOccurs="0" maxOccurs="-1" segcode="REC" xmltag="Rechnung">
                <medi:field xmltag="Rechnungsnummer"/>
                <medi:field xmltag="Rechnungsdatum"/>
                ...
            </medi:segment>

            <medi:segmentGroup xmltag="Dataset" maxOccurs="-1">
                <!-- INV+0275860003+1248M++803000160639131438+108533476' -->
                <medi:segment minOccurs="0" maxOccurs="-1" segref="def:INV" segcode="INV"
                              xmltag="InformationVersicherte" truncatable="true"/>
                ...
            </medi:segmentGroup>

            <!-- UNT+8+66059903600010' -->
            <medi:segment minOccurs="0" segref="def:UNT" segcode="UNT" xmltag="EndsegmentDatensatz"/>

        </medi:segmentGroup>

        <!-- UNZ+10+00402' -->
        <medi:segment minOccurs="0" segref="def:UNZ" segcode="UNZ" xmltag="EndsegmentDokument"/>

    </medi:segments>

</medi:edimap>
With the segmentGroup tag we can structure the document as we want.
The Smooks configuration:
<?xml version="1.0"?>
<smooks-resource-list xmlns="http://www.milyn.org/xsd/smooks-1.1.xsd"
                      xmlns:core="http://www.milyn.org/xsd/smooks/smooks-core-1.3.xsd"
                      xmlns:file="http://www.milyn.org/xsd/smooks/file-routing-1.2.xsd"
                      xmlns:ftl="http://www.milyn.org/xsd/smooks/freemarker-1.1.xsd"
                      xmlns:edi="http://www.milyn.org/xsd/smooks/edi-1.4.xsd">

    <!-- getting the mapping file into the configuration -->
    <edi:reader mappingModel="mappingmodels/abvo.xml" ignoreNewLines="true" />

    <!-- Filter the message using the SAX Filter (i.e. not DOM, so no intermediate DOM,
         so we can process huge messages...) -->
    <core:filterSettings type="SAX" />

    <!-- As we need information from the Header and Datagroup we have to define 3 different DOM models -->
    <resource-config selector="Header,Datagroup,Dataset">
        <resource>org.milyn.delivery.DomModelCreator</resource>
    </resource-config>

    <!-- Every time we hit the end of a <Dataset> element, apply this freemarker template,
         outputting the result to the "datasetSplitStream" OutputStream, which is the file
         output stream configured below. -->
    <ftl:freemarker applyOnElement="Dataset">
        <ftl:template><!--
<#assign dataSet = .vars["Dataset"]>
<#assign dataGroup = .vars["Datagroup"]>
<dataset>
    <file id="${Header.Dateinummer}" name="${Header.Dateiname}">
        <sender>${Header.Absender}</sender>
        <receiver>${Header.Empfaenger}</receiver>
    </file>
    <message ref="${dataGroup.DatasetHeader.Nachrichtenreferenz}" refNr="${dataGroup.DatasetHeader.ZuordnungsReferenzNummer}"/>
    <invoice id="${dataGroup.Rechnung.Rechnungsnummer}">
        <date>${dataGroup.Rechnung.Rechnungsdatum}</date>
        <period>${dataGroup.Rechnung.Abrechnungszeitraum}</period>
    </invoice>
    <insurance id="${dataSet.InformationVersicherte.VersNummer}">
        <state>${dataSet.InformationVersicherte.VersStatus}</state>
    </insurance>
</dataset>
        --></ftl:template>
        <ftl:use>
            <!-- Output the templating result to the "datasetSplitStream" file output stream... -->
            <ftl:outputTo outputStreamResource="datasetSplitStream" />
        </ftl:use>
    </ftl:freemarker>

    <!-- Create/open a file output stream. This is written to by the freemarker template (above).. -->
    <file:outputStream resourceName="datasetSplitStream" openOnElement="Dataset">
        <file:fileNamePattern>abvo-${Header.Dateinummer}-${.vars["Datagroup"].DatasetHeader.Nachrichtenreferenz}-${.vars["Dataset"].Einzellfallnachweis.Kennzeichen}.xml</file:fileNamePattern>
        <file:destinationDirectoryPattern>target/out</file:destinationDirectoryPattern>
        <file:highWaterMark mark="300" />
    </file:outputStream>

</smooks-resource-list>
A very important part for working with huge files is the ‘core:filterSettings’ element. With the type set to SAX, all work is done on the streams, so no complete document tree is ever built in memory. This means that even processing millions of messages works with a very small footprint.
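The effect of stream-based filtering can be illustrated without Smooks. The following sketch uses the SAX parser from the JDK to count elements while the document streams through; at no point does a full DOM tree exist. The class name `SaxCountSketch` and the tiny sample document are assumptions made for this illustration, not part of our project.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; Smooks does its own SAX handling internally.
class SaxCountSketch {

    /** Counts start tags with the given name while streaming through the input. */
    static int countElements(InputStream xml, String elementName) {
        final int[] count = {0};
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(xml, new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName,
                                         Attributes attributes) {
                    // Called once per element as it is read; nothing is kept afterwards.
                    if (elementName.equals(qName)) {
                        count[0]++;
                    }
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        String xml = "<ABVO><Header/><Datagroup><Dataset/><Dataset/></Datagroup></ABVO>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println("Dataset elements: " + countElements(in, "Dataset"));
    }
}
```

In the Smooks configuration, the DomModelCreator resource adds small DOM fragments for the selected elements only, so the templates can still access Header and Datagroup data while the rest of the message is streamed.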
The Camel route:
public class EdiToXmlTest extends CamelTestSupport {
    ...
    public void testSmooksBigFile() throws Exception {
        runs = 100000;
        RouteBuilder builder = new RouteBuilder() {
            @Override
            public void configure() throws Exception {
                // Route 1: pick up the EDI file and hand it to Smooks for splitting
                from("file://target/in?noop=true")
                    .log("starting splitting...")
                    .to("smooks://src/main/resources/file-config.xml");

                // Route 2: consume (and delete) the data set files written by Smooks
                from("file://target/out?delete=true")
                    .to("log:smooks?level=INFO&groupSize=1000")
                    .setBody(constant(""))
                    .to("mock:out");
            }
        };
        context.addRoutes(builder);

        MockEndpoint out = getMockEndpoint("mock:out");
        out.setExpectedMessageCount(runs);

        assertMockEndpointsSatisfied(5000, TimeUnit.SECONDS);
    }
    ...
}
In this test case we’re transforming an EDI file with 100K messages in it. When you start the test you will get something like this:
2090 [main] INFO org.apache.camel.impl.DefaultCamelContext - Route: route5 started and consuming from: Endpoint[file://src/test/resources/data/in?noop=true]
2117 [main] INFO org.apache.camel.impl.DefaultCamelContext - Route: route6 started and consuming from: Endpoint[file://target/out?delete=true]
2118 [main] INFO org.apache.camel.component.mock.MockEndpoint - Asserting: Endpoint[mock://out] is satisfied
3096 [Camel (camel-1) thread #0 - file://src/test/resources/data/in] INFO route5 - starting splitting...
7275 [Camel (camel-1) thread #1 - file://target/out] INFO smooks - Received: 1000 messages so far. Last group took: 3656 millis which is: 273,523 messages per second. average: 273,523
9293 [Camel (camel-1) thread #1 - file://target/out] INFO smooks - Received: 2000 messages so far. Last group took: 2018 millis which is: 495,54 messages per second. average: 352,485
12647 [Camel (camel-1) thread #1 - file://target/out] INFO smooks - Received: 3000 messages so far. Last group took: 3354 millis which is: 298,151 messages per second. average: 332,3
16192 [Camel (camel-1) thread #1 - file://target/out] INFO smooks - Received: 4000 messages so far. Last group took: 3545 millis which is: 282,087 messages per second. average: 318,142
20656 [Camel (camel-1) thread #1 - file://target/out] INFO smooks - Received: 5000 messages so far. Last group took: 4464 millis which is: 224,014 messages per second. average: 293,479
23674 [Camel (camel-1) thread #1 - file://target/out] INFO smooks - Received: 6000 messages so far. Last group took: 3018 millis which is: 331,345 messages per second. average: 299,177
You can see that Smooks is processing around 300 messages per second. This matches the ‘highWaterMark’ property (the file:highWaterMark element) that we set to 300 in the Smooks config. The high water mark is the maximum number of files that may exist in the destination folder at one time. To check the number of files, Smooks polls the folder from time to time (you can change this period by setting the poll frequency yourself). The default high water mark is 200 files. We’ve abused this property a little bit: because Camel deletes the files from the directory right after Smooks has written them, the high water mark works as a kind of tuning knob for how many messages flood your system per polling interval. If we decrease this value the processing slows down, but the footprint decreases as well. So this is a very interesting property to tune if you want either a fast but more resource-intensive system, or a slower one with a small footprint.
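The throttling idea behind the high water mark can be sketched in a few lines of plain Java. This is not Smooks code: the class `HighWaterMarkSketch`, the method `awaitBelowMark` and the exact comparison (wait while the count is at or above the mark) are assumptions for illustration. A writer calls the method before creating the next file and is held back for whole poll intervals while the folder is full.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;

// Hypothetical illustration only; the real check lives inside the Smooks file routing cartridge.
class HighWaterMarkSketch {

    /**
     * Blocks while the reported file count is at or above the mark.
     * Returns how many poll intervals the caller had to wait.
     */
    static int awaitBelowMark(IntSupplier fileCount, int mark, long pollMillis) {
        int waits = 0;
        while (fileCount.getAsInt() >= mark) {
            waits++;
            try {
                Thread.sleep(pollMillis); // give the consumer time to remove files
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return waits;
    }

    public static void main(String[] args) {
        // Stand-in for the output folder: the consumer has removed one more file at every poll.
        AtomicInteger filesInFolder = new AtomicInteger(5);
        IntSupplier count = filesInFolder::getAndDecrement;

        int waits = awaitBelowMark(count, 3, 10L);
        System.out.println("writer waited for " + waits + " poll interval(s)");
    }
}
```

With a fast consumer the writer refills the folder up to the mark after each poll, so the throughput is bounded by roughly the mark per poll interval. That fits the roughly 300 messages per second we see with a mark of 300.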
We ran some tests on a Linux machine (2.6.32-27-generic) with 7.8 GB RAM and an Intel Core Duo E8400 CPU at 3.00 GHz. These are not performance tests in the classical sense, but they should show how the system behaves when we change the high water mark.
Here are the results for a 30 MB EDI file (100K data sets) and a high water mark of 100:
Here are the results for a 30 MB EDI file (100K data sets) and a high water mark of 500:
Can we change the segment delimiters dynamically, for example depending on the UNA information?
I can’t give you an answer to this question. Please ask on the mailing list at http://www.smooks.org. Thanks.