archive-be.com » BE » B » BENNYMICHIELSEN.BE

  • cloud – Benny Michielsen
    In a follow-up post I'll use HDInsight to process the actual data. The data source that has to be processed is a 5.12 GB text file containing 100 million email addresses, which I uploaded to blob storage. Just like Troy, I then created a batching solution to process the file in chunks. I didn't want to use WebJobs, so instead I created a cloud service. This service reads line numbers from a queue; the line numbers indicate which lines to read from the text file and send to the event hub (a sketch of how those messages could be enqueued follows below this excerpt). The code in the cloud service is pretty straightforward: open a file reader and move to the line number indicated in the incoming message from queue storage, then start reading the lines of the file that need to be processed and fire away at the event hub. I have not given the code much thought, so there are probably several ways this can be improved; I just wanted to get up and running. The event hub receives JSON objects containing the name of the website and the email address.

        using (var streamReader = new StreamReader(file.OpenRead()))
        {
            int row = 0;
            // Skip ahead to the first line of this batch.
            while (row < beginAt)
            {
                streamReader.ReadLine();
                row++;
            }
            // Read the lines that belong to this batch.
            while (row <= endAt)
            {
                row++;
                var line = streamReader.ReadLine();
                lines.Add(line);
            }
            // Serialize every address and fire it at the event hub.
            Parallel.ForEach(lines, line =>
            {
                var leakedEmail = JsonConvert.SerializeObject(new { leak = websitesource, address = line });
                eventHubClient.Send(new EventData(Encoding.UTF8.GetBytes(leakedEmail)));
            });
        }

    I created batches of 1 million rows, which means 100 batches had to be processed. I deployed the service and then observed the system. After seeing one instance running correctly, I scaled up the system to get 20 instances running simultaneously. The result was already quite nice: all of the instances together were sending around 4.2 million leaked email addresses every 5 minutes, with a peak…
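    The excerpt only shows the consuming side. As a rough sketch (not from the original post), assuming the classic Azure Storage SDK (Microsoft.WindowsAzure.Storage) and a hypothetical queue named "batches" with messages formatted as "beginAt;endAt", the enqueue side could look like this:

        using System;
        using Microsoft.WindowsAzure.Storage;
        using Microsoft.WindowsAzure.Storage.Queue;

        class BatchEnqueuer
        {
            static void Main()
            {
                var account = CloudStorageAccount.Parse("<storage-connection-string>");
                var queue = account.CreateCloudQueueClient().GetQueueReference("batches");
                queue.CreateIfNotExists();

                const int batchSize = 1000000;    // 1 million lines per batch
                const int totalLines = 100000000; // 100 million addresses in the file

                // One message per batch; the cloud service reads beginAt/endAt from the message body.
                for (int beginAt = 0; beginAt < totalLines; beginAt += batchSize)
                {
                    int endAt = beginAt + batchSize - 1;
                    queue.AddMessage(new CloudQueueMessage(string.Format("{0};{1}", beginAt, endAt)));
                }
            }
        }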

    Original URL path: http://blog.bennymichielsen.be/tag/cloud/ (2016-04-29)


  • event hub – Benny Michielsen
    The bolt keeps a list of tuples ready for processing; a purge happens when either a maximum count is reached or a specific amount of time has passed since the last purge. In the FinishBatch method a SqlBulkCopy is performed to insert the records into a SQL database.

        private void FinishBatch()
        {
            lastRun = DateTime.UtcNow;
            DataTable table = new DataTable();
            table.Columns.Add("Email");

            if (breachedAccounts.Any())
            {
                // Copy the buffered tuples into a DataTable for the bulk copy.
                foreach (var emailAddress in breachedAccounts)
                {
                    var row = table.NewRow();
                    row["Email"] = emailAddress.GetString(0);
                    table.Rows.Add(row);
                }

                try
                {
                    using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connection))
                    {
                        bulkCopy.DestinationTableName = "Emails";
                        bulkCopy.ColumnMappings.Add("Email", "Email");
                        bulkCopy.WriteToServer(table);
                    }

                    // The batch made it into SQL, acknowledge every tuple.
                    foreach (var emailAddress in breachedAccounts)
                    {
                        ctx.Ack(emailAddress);
                    }
                }
                catch (Exception ex)
                {
                    // Something went wrong, fail the tuples so they get replayed.
                    foreach (var emailAddress in breachedAccounts)
                    {
                        ctx.Fail(emailAddress);
                    }
                    Context.Logger.Error(ex.Message);
                }
                finally
                {
                    breachedAccounts.Clear();
                }
            }
        }

    Both the table storage bolt and the SQL bulk copy bolt have lines with ctx.Ack, ctx.Fail or Emit. These are the places where you can indicate whether processing of a tuple has succeeded or failed, or where you emit a new tuple to be processed further downstream. It also enables replaying tuples: if the SQL bulk copy fails, the tuple will be processed again at a later time. The Azure table spout expects your bolts to ack.

        topologyBuilder.SetBolt(
                "TableStorageBolt",
                TableStorageBolt.Get,
                new Dictionary<string, List<string>>
                {
                    { Constants.DEFAULT_STREAM_ID, new List<string> { "address" } }
                },
                256,
                true)
            .DeclareCustomizedJavaSerializer(javaSerializerInfo)
            .shuffleGrouping("EventHubSpout");

        topologyBuilder.SetBolt(
                "SQLBolt",
                SQLStorageBolt.Get,
                new Dictionary<string, List<string>>(),
                2,
                true)
            .shuffleGrouping("TableStorageBolt", Constants.DEFAULT_STREAM_ID)
            .addConfigurations(new Dictionary<string, string>
            {
                { "topology.tick.tuple.freq.secs", "1" }
            });

        topologyBuilder.SetTopologyConfig(new Dictionary<string, string>
        {
            { "topology.workers", partitionCount.ToString() }
        });

    The two bolts also need to be added to the topology. The TableStorageBolt is configured with a name, a method which can be used to create instances, and a description of its output; in this case I'm emitting the email address to the default stream. The parallelism hint is also configured, as well as the boolean flag which indicates that this bolt supports acking. Since this bolt sits behind the Java spout, we also need to configure how the data of the spout can be deserialized. The bolt is also configured to use shuffleGrouping, meaning that the load will be spread evenly among all instances. The SQLBolt is configured accordingly, with the additional configuration to support time ticks. After I created an HDInsight Storm cluster on Azure, I was able to publish the topology from within Visual Studio. The cluster itself was created in 15 minutes. It took quite some time to find the right parallelism for this topology, but with the code I've posted I got these results after one hour: table storage is slow; the 2 SQL bolts are able to keep up with the 256 table storage bolts. After one hour my topology was able to process only 10 million records. Even though I was able to send all 100 million events to the event hub in under two hours, processing would take 5 times longer. One of the causes for this is the test data. The partition and row key for the table…

    Original URL path: http://blog.bennymichielsen.be/tag/event-hub/ (2016-04-29)

  • Partitioning and wildcards in an Azure Data Factory pipeline – Benny Michielsen
    So if your files are all being stored in one folder, but each file has the time in its filename (myFile_2015_07_01.txt), you can't create a filter with the dynamic partitions (myFile_{Year}_{Month}_{Day}.txt). It's only possible to use the partitionedBy section in the folder structure, as shown above (a sketch of such a section follows below this excerpt). If you think this is a nice feature, go vote here.

    The price of the current setup is determined by a couple of things. First, we have a low-frequency activity; that's an activity that runs daily or less. The first 5 are free, so we have 25 activities remaining. The pricing of an activity is determined by the place where it occurs: on-premises or in the cloud. I'm assuming here it's an on-premises activity, since the files are not located in Azure. I've asked around whether this assumption is correct, but don't have a response yet. The pricing of an on-premises activity is 0.5586 per activity, so that would mean almost 14 for this daily snapshot each month. If we modified everything to run hourly, we'd have to pay 554.80 per month. You can find more info on the pricing on their website.

    In this scenario I've demonstrated how to get started with Azure Data Factory. The real power, however, lies in the transformation steps you can add. Instead of doing a simple copy, the data can be read, combined and stored in many different forms; a topic for a future post.

    Upside:
      - Rich editor
      - Fairly easy to get started
      - No custom application to write
      - On-premises support via the Data Management Gateway
      - No firewall settings need to be changed
    Downside:
      - Can get quite expensive

    Author BennyM, posted on 07…
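    For reference, a minimal sketch of what such a partitionedBy section could look like inside a dataset definition; the container name, partition names and date formats here are illustrative assumptions, not taken from the post:

        "location": {
            "type": "AzureBlobLocation",
            "folderPath": "mycontainer/{Year}/{Month}/{Day}",
            "partitionedBy": [
                { "name": "Year",  "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
                { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
                { "name": "Day",   "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }
            ],
            "linkedServiceName": "StorageLinkedService"
        }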

    Original URL path: http://blog.bennymichielsen.be/2015/07/07/partitioning-and-wildcards-in-an-azure-data-factory-pipeline/ (2016-04-29)

  • data factory – Benny Michielsen
    … the data. In an update published at the end of March, it was announced that you can also copy files. I wanted to try this out, and it proved to be a bit more cumbersome than I first imagined. Let's take a look.

    I want to create a very basic flow. Let's say I have an application which is populating files in a folder, and I now want to move the file into Azure blob storage. I can use the same use case as mentioned in my previous post: I'm placing data leaks in a folder and need them to be sent online for further processing. After creating a data factory in your Azure account, we'll need the following components:
      - Two linked services:
        - a connection to my on-premises folder
        - a connection to my blob storage account
      - Two datasets (also called tables):
        - a dataset containing info on where my data is stored on-premises and how many times per day it can be fetched
        - a dataset which has info on how and where to store the data in blob storage
      - One pipeline, which contains an activity which connects the datasets

    The connection to the on-premises file is handled by an application which you need to install on-premises. By navigating to the linked services slice you can add a data gateway. To configure a gateway you only need to provide a name; you can then download the data gateway application and install it on-premises. After installing the application you need to enter the key which can be viewed in the Azure portal. As far as the on-premises configuration goes, you are done. You do not need to configure any firewall ports, but you can only install it once on a PC. So far the wizards; now we need to create the pipeline. After clicking on the Author and deploy tile, the browser navigates to an online editor. You create linked services, datasets and pipelines by using JSON. When clicking on any of the menu options, you can select a template to get started. As mentioned earlier, I needed two linked services. You can create those via the new data store option. The first one I'll create is the on-premises file data store.

        {
            "name": "OnPremisesFileSystemLinkedService",
            "properties": {
                "host": "localhost",
                "gatewayName": "mydatagateway",
                "userId": "BennyM",
                "password": "****",
                "type": "OnPremisesFileSystemLinkedService"
            }
        }

    The options you need to configure are:
      - host: the name of the machine which contains the files/folder you need to connect to
      - gatewayName: has to match the name of the gateway which we created earlier
      - userId and password (or encrypted credentials): used to connect from the gateway to the target machine
      - type: needs to be OnPremisesFileSystemLinkedService
      - name: which we will use later

    The next one will be the connection to the Azure storage. You can get the template by clicking New data store and selecting Azure storage.

        {
            "name": "StorageLinkedService",
            "properties": {
                "connectionString": "DefaultEndpointsProtocol=https;AccountName=bennymdatafactory;AccountKey=****",
                "type": "AzureStorageLinkedService"
            }
        }

    The options you need…

    Original URL path: http://blog.bennymichielsen.be/tag/data-factory/ (2016-04-29)

  • Copying files with Azure Data Factory – Benny Michielsen
    You create linked services, datasets and pipelines by using JSON. When clicking on any of the menu options, you can select a template to get started. As mentioned earlier, I needed two linked services. You can create those via the new data store option. The first one I'll create is the on-premises file data store.

        {
            "name": "OnPremisesFileSystemLinkedService",
            "properties": {
                "host": "localhost",
                "gatewayName": "mydatagateway",
                "userId": "BennyM",
                "password": "****",
                "type": "OnPremisesFileSystemLinkedService"
            }
        }

    The options you need to configure are:
      - host: the name of the machine which contains the files/folder you need to connect to
      - gatewayName: has to match the name of the gateway which we created earlier
      - userId and password (or encrypted credentials): used to connect from the gateway to the target machine
      - type: needs to be OnPremisesFileSystemLinkedService
      - name: which we will use later

    The next one will be the connection to the Azure storage. You can get the template by clicking New data store and selecting Azure storage.

        {
            "name": "StorageLinkedService",
            "properties": {
                "connectionString": "DefaultEndpointsProtocol=https;AccountName=bennymdatafactory;AccountKey=****",
                "type": "AzureStorageLinkedService"
            }
        }

    The options you need to configure are:
      - name: which we will use later
      - connectionString: needs to match the connection string for your Azure storage
      - type: needs to be AzureStorageLinkedService

    Next one, the first dataset: the on-premises file.

        {
            "name": "OnPremisesFile",
            "properties": {
                "location": {
                    "type": "OnPremisesFileSystemLocation",
                    "folderPath": "c:\\Temp",
                    "linkedServiceName": "OnPremisesFileSystemLinkedService"
                },
                "availability": {
                    "frequency": "Day",
                    "interval": 1,
                    "waitOnExternal": {
                        "dataDelay": "00:10:00",
                        "retryInterval": "00:01:00",
                        "retryTimeout": "00:10:00",
                        "maximumRetry": 3
                    }
                }
            }
        }

    The options you need to configure are, again, a name which we'll use later, and:
      - type: has to be OnPremisesFileSystemLocation
      - folderPath: the folder I want to sync; note the double slashes
      - linkedServiceName: this has to be the same value which we used earlier when we created the data store for the on-premises gateway
      - availability: how many times the on-premises file or folder will be synchronized
    What's very important is the waitOnExternal. You have to configure this if the data is not produced by the data factory itself. In this case it's an external source, so I have to fill in some values.

    Our next dataset is the Azure blob.

        {
            "name": "AzureBlobDatasetTemplate",
            "properties": {
                "location": {
                    "type": "AzureBlobLocation",
                    "folderPath": "myblobcontainer",
                    "linkedServiceName": "StorageLinkedService"
                },
                "availability": {
                    "frequency": "Day",
                    "interval": 1
                }
            }
        }

    Fairly easy to configure: again a name, the type which has to be AzureBlobLocation, the folderPath which will be the path inside my Azure blob storage account which was configured in the linked service, and the linkedServiceName which has to match the name we used earlier. (A small sketch for checking what the copy wrote to that container follows below this excerpt.) Then the actual workflow, the pipeline.

        {
            "name": "CopyFileToBlobPipeline",
            "properties": {
                "activities": [
                    {
                        "type": "CopyActivity",
                        "transformation": {
                            "source": {
                                "type": "FileSystemSource"
                            },
                            "sink": {
                                "type": "BlobSink",
                                "writeBatchSize": 0,
                                "writeBatchTimeout": "00:00:00"
                            }
                        },
                        "inputs": [ { "name": "OnPremisesFile" } ],
                        "outputs": [ { "name": "AzureBlobDatasetTemplate" } ],
                        "policy": {
                            "timeout": "00:05:00",
                            "concurrency": 4
                        },
                        "name": "Ingress"
                    }
                ],
                "start": "2015-06-28T00:00:00Z",
                "end": "2015-06-30T00:00:00Z"
            }
        }

    I will not go over every property. What's important is that in the copy activity we tie our two datasets together. The start and end times indicate the time period our pipeline…
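    Not covered in the excerpt: once a slice has run, you can check the result yourself. A small sketch of my own (not from the post), assuming the classic Azure Storage SDK and the same container name as in the dataset above:

        using System;
        using Microsoft.WindowsAzure.Storage;
        using Microsoft.WindowsAzure.Storage.Blob;

        class ListCopiedBlobs
        {
            static void Main()
            {
                // Connection string for the storage account used in the linked service (placeholder).
                var account = CloudStorageAccount.Parse("<storage-connection-string>");
                var container = account.CreateCloudBlobClient().GetContainerReference("myblobcontainer");

                // List whatever the copy activity has written so far.
                foreach (IListBlobItem item in container.ListBlobs(useFlatBlobListing: true))
                {
                    Console.WriteLine(item.Uri);
                }
            }
        }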

    Original URL path: http://blog.bennymichielsen.be/2015/06/30/copying-files-with-azure-data-factory/ (2016-04-29)

  • Using HDInsight & Storm to process 100 million events – Benny Michielsen
        if (breachedAccounts.Count >= batchSize)
        {
            Context.Logger.Info("Max reached, time to purge");
            FinishBatch();
        }
        else
        {
            Context.Logger.Info("Not yet time to purge");
        }

    This bolt does not process every tuple it receives immediately. Instead it keeps a list of tuples ready for processing; a purge happens when either a maximum count is reached or a specific amount of time has passed since the last purge (a standalone sketch of this count-or-time rule follows below this excerpt). In the FinishBatch method a SqlBulkCopy is performed to insert the records into a SQL database.

        private void FinishBatch()
        {
            lastRun = DateTime.UtcNow;
            DataTable table = new DataTable();
            table.Columns.Add("Email");

            if (breachedAccounts.Any())
            {
                // Copy the buffered tuples into a DataTable for the bulk copy.
                foreach (var emailAddress in breachedAccounts)
                {
                    var row = table.NewRow();
                    row["Email"] = emailAddress.GetString(0);
                    table.Rows.Add(row);
                }

                try
                {
                    using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connection))
                    {
                        bulkCopy.DestinationTableName = "Emails";
                        bulkCopy.ColumnMappings.Add("Email", "Email");
                        bulkCopy.WriteToServer(table);
                    }

                    // The batch made it into SQL, acknowledge every tuple.
                    foreach (var emailAddress in breachedAccounts)
                    {
                        ctx.Ack(emailAddress);
                    }
                }
                catch (Exception ex)
                {
                    // Something went wrong, fail the tuples so they get replayed.
                    foreach (var emailAddress in breachedAccounts)
                    {
                        ctx.Fail(emailAddress);
                    }
                    Context.Logger.Error(ex.Message);
                }
                finally
                {
                    breachedAccounts.Clear();
                }
            }
        }

    Both the table storage bolt and the SQL bulk copy bolt have lines with ctx.Ack, ctx.Fail or Emit. These are the places where you can indicate whether processing of a tuple has succeeded or failed, or where you emit a new tuple to be processed further downstream. It also enables replaying tuples: if the SQL bulk copy fails, the tuple will be processed again at a later time. The Azure table spout expects your bolts to ack.

        topologyBuilder.SetBolt(
                "TableStorageBolt",
                TableStorageBolt.Get,
                new Dictionary<string, List<string>>
                {
                    { Constants.DEFAULT_STREAM_ID, new List<string> { "address" } }
                },
                256,
                true)
            .DeclareCustomizedJavaSerializer(javaSerializerInfo)
            .shuffleGrouping("EventHubSpout");

        topologyBuilder.SetBolt(
                "SQLBolt",
                SQLStorageBolt.Get,
                new Dictionary<string, List<string>>(),
                2,
                true)
            .shuffleGrouping("TableStorageBolt", Constants.DEFAULT_STREAM_ID)
            .addConfigurations(new Dictionary<string, string>
            {
                { "topology.tick.tuple.freq.secs", "1" }
            });

        topologyBuilder.SetTopologyConfig(new Dictionary<string, string>
        {
            { "topology.workers", partitionCount.ToString() }
        });

    The two bolts also need to be added to the topology. The TableStorageBolt is configured with a name, a method which can be used to create instances, and a description of its output; in this case I'm emitting the email address to the default stream. The parallelism hint is also configured, as well as the boolean flag which indicates that this bolt supports acking. Since this bolt sits behind the Java spout, we also need to configure how the data of the spout can be deserialized. The bolt is also configured to use shuffleGrouping, meaning that the load will be spread evenly among all instances. The SQLBolt is configured accordingly, with the additional configuration to support time ticks. After I created an HDInsight Storm cluster on Azure, I was able to publish the topology from within Visual Studio. The cluster itself was created in 15 minutes. It took quite some time to find the right parallelism for this topology, but with the code I've posted I got these results after one hour: table storage is slow; the 2 SQL bolts are able to keep up with the 256 table storage bolts. After one hour my topology was able to process only 10 million records. Even though I…
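    The snippet at the top of this excerpt only shows the count-based trigger; the time-based trigger described in the text (driven by the one-second tick tuples configured in the topology) is not included. As a standalone illustration of that count-or-time batching rule, not the post's actual SCP.NET bolt, and with made-up threshold values:

        using System;
        using System.Collections.Generic;

        // Flush a buffer either when it is full or when enough time has passed
        // since the last flush; the same rule the bolt applies to its tuples.
        class BatchBuffer
        {
            private readonly List<string> items = new List<string>();
            private DateTime lastRun = DateTime.UtcNow;
            private const int BatchSize = 1000;                                 // assumed value
            private static readonly TimeSpan MaxAge = TimeSpan.FromSeconds(5);  // assumed value

            public void Add(string item)
            {
                items.Add(item);
                if (items.Count >= BatchSize || DateTime.UtcNow - lastRun > MaxAge)
                {
                    Flush();
                }
            }

            private void Flush()
            {
                lastRun = DateTime.UtcNow;
                Console.WriteLine("Flushing {0} items", items.Count);
                items.Clear();
            }
        }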

    Original URL path: http://blog.bennymichielsen.be/2015/06/08/using-hdinsight-storm-to-process-100-million-events/ (2016-04-29)

  • Data leak processing – Benny Michielsen
    … move to the line number indicated in the incoming message from queue storage (a sketch of reading such a message follows below this excerpt), then start reading the lines of the file that need to be processed and fire away at the event hub. I have not given the code much thought, so there are probably several ways this can be improved; I just wanted to get up and running. The event hub receives JSON objects containing the name of the website and the email address.

        using (var streamReader = new StreamReader(file.OpenRead()))
        {
            int row = 0;
            // Skip ahead to the first line of this batch.
            while (row < beginAt)
            {
                streamReader.ReadLine();
                row++;
            }
            // Read the lines that belong to this batch.
            while (row <= endAt)
            {
                row++;
                var line = streamReader.ReadLine();
                lines.Add(line);
            }
            // Serialize every address and fire it at the event hub.
            Parallel.ForEach(lines, line =>
            {
                var leakedEmail = JsonConvert.SerializeObject(new { leak = websitesource, address = line });
                eventHubClient.Send(new EventData(Encoding.UTF8.GetBytes(leakedEmail)));
            });
        }

    I created batches of 1 million rows, which means 100 batches had to be processed. I deployed the service and then observed the system. After seeing one instance running correctly, I scaled up the system to get 20 instances running simultaneously. The result was already quite nice: all of the instances together were sending around 4.2 million leaked email addresses every 5 minutes, with a peak of 4.5 million at the end. This meant that in under two hours the entire file had been sent into my event hub. Between 12 and 1 PM I only had one instance running; I scaled up somewhere between 1 and 2 PM. What did this cost? Running 20 A1 instances of a cloud service: 3.5. Receiving 100 million event hub requests and doing some processing: 2. In the next post I'll show you the HDInsight part, which will take every event from the event hub and insert or update the address in table storage.

    Author BennyM, posted on 29/05/2015 08:06
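    The excerpt only shows reading the file and sending the events; how the cloud service picks up a line-number message from queue storage is not included. A minimal sketch of my own (not the post's code), assuming the classic Azure Storage SDK and a hypothetical queue named "batches" with messages formatted as "beginAt;endAt":

        using System;
        using Microsoft.WindowsAzure.Storage;
        using Microsoft.WindowsAzure.Storage.Queue;

        class BatchWorker
        {
            static void Main()
            {
                var account = CloudStorageAccount.Parse("<storage-connection-string>");
                var queue = account.CreateCloudQueueClient().GetQueueReference("batches");

                CloudQueueMessage message;
                // Hide each message for 10 minutes while the batch is being processed.
                while ((message = queue.GetMessage(TimeSpan.FromMinutes(10))) != null)
                {
                    var parts = message.AsString.Split(';');
                    int beginAt = int.Parse(parts[0]);
                    int endAt = int.Parse(parts[1]);

                    // ... open the blob, skip to beginAt and send lines up to endAt
                    //     to the event hub, as in the snippet above ...

                    queue.DeleteMessage(message);
                }
            }
        }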

    Original URL path: http://blog.bennymichielsen.be/2015/05/29/data-leak-processing/ (2016-04-29)

  • cloud service – Benny Michielsen
    … and send to the event hub. The code in the cloud service is pretty straightforward: open a file reader and move to the line number indicated in the incoming message from queue storage, then start reading the lines of the file that need to be processed and fire away at the event hub. I have not given the code much thought, so there are probably several ways this can be improved; I just wanted to get up and running. The event hub receives JSON objects containing the name of the website and the email address.

        using (var streamReader = new StreamReader(file.OpenRead()))
        {
            int row = 0;
            // Skip ahead to the first line of this batch.
            while (row < beginAt)
            {
                streamReader.ReadLine();
                row++;
            }
            // Read the lines that belong to this batch.
            while (row <= endAt)
            {
                row++;
                var line = streamReader.ReadLine();
                lines.Add(line);
            }
            // Serialize every address and fire it at the event hub.
            Parallel.ForEach(lines, line =>
            {
                var leakedEmail = JsonConvert.SerializeObject(new { leak = websitesource, address = line });
                eventHubClient.Send(new EventData(Encoding.UTF8.GetBytes(leakedEmail)));
            });
        }

    I created batches of 1 million rows, which means 100 batches had to be processed. I deployed the service and then observed the system. After seeing one instance running correctly, I scaled up the system to get 20 instances running simultaneously. The result was already quite nice: all of the instances together were sending around 4.2 million leaked email addresses every 5 minutes, with a peak of 4.5 million at the end. This meant that in under two hours the entire file had been sent into my event hub. Between 12 and 1 PM I only had one instance running; I scaled up somewhere between 1 and 2 PM. What did this cost? Running 20 A1 instances of a cloud service: 3.5. Receiving 100 million event hub requests and doing some processing: 2. In the next post I'll show you the HDInsight part, which will take every event from the event…

    Original URL path: http://blog.bennymichielsen.be/tag/cloud-service/ (2016-04-29)