In the last blog post, we introduced Singer, the open-source framework, as a powerful tool for ETL processes. This time, we’d like to discuss how you can implement the framework in your own projects.
How to start working with Singer
Starting a test run is rather simple. First, you need to create a python environment, for which step-by-step instructions to do so are available online.
As soon as you’ve done that, it’s time to create your first virtual environment inside python.
Please note before beginning that it’s a best practice to create and use an individual virtual environment for every tap and target. This avoids any conflicts between module requirements for the different modules.
The next step is to install the tap and target you’ve chosen into their corresponding virtual environment. This installation can be performed very easily using a pip install command. This example command installs the tap-salesforce to the load data from your Salesforce account:
After installing the tap, you’ll need to configure the connection to the Salesforce api.
This is done by creating a config.json inside your working directory containing all required configuration parameters. If searching for the file, remember that these can always be found on the tap-specific documentation and mainly consists of the api-credentials.
Singer uses a method called discover to identify all available data objects in your source system. This information is stored either in a catalog.json or in a properties.json, based on the tap you are using. For example, if you are using tap-salesforce, you can create a properties.json with the following command:
Der properties.json is used to select all the objects you want to load with your ETL process. Objects are selected by adding a selected: true row to the metadata section of every object you want to select inside the properties.json.
Now, your tap is ready to go and can either be combined with a target or be run as a standalone process, sending the data to stdout, using the following command:
Please note that the square brackets are not a part of the code and are only used to show the option of piping the source data to a target, in this case the target-csv.
Since the setup of a new tap or target is generally the same, you can easily add more source systems or target databases using the same process.
Though easy to install, your growing Singer instance may affect a few comfort features. A further complication that may arise is the eventual need to schedule your ELT processes to run on a daily basis for consistent staging purposes. Which might require a bit more work to accomplish. Luckily, there also is something for you in that case.
Ausblick
Overall, we consider the Singer framework a great environment for easy-to-use ETL processes. That said, while the process is straightforward, you may run into some problems when working with a larger number of different taps and targets. This is due to the presence of different virtual environments which make the commands grow longer and more complex with time.
Though, there is another open-source tool that uses the Singer standards and even improves upon its ease of use and flexibility.
This tool is called Meltano and we’ll discuss its benefits in the next newsletter!
Stay tuned!
-von Tim Kirschke (Scalefree)
Updates und Support erhalten
Bitte senden Sie Anfragen und Funktionswünsche an [email protected].
Für Anfragen zu Data Vault-Schulungen und Schulungen vor Ort wenden Sie sich bitte an [email protected] oder registrieren Sie sich unter www.scalefree.com.
Um die Erstellung von Visual Data Vault-Zeichnungen in Microsoft Visio zu unterstützen, wurde eine Schablone implementiert, die zum Zeichnen von Data Vault-Modellen verwendet werden kann. Die Schablone ist erhältlich bei www.visualdatavault.com.
Hi Tim,
could you please ealborate a bit more on the ‘T’.
I am having the perspective on ETL processes as the transformation of data into valuable information for, in that wording, information clients or customers. While seeing singer providing ins and outs – taps and targets connected via json interchange, where and how does the actual creation of new information based on sourced data and defined logics takes place. So the ‘and’ between the in and out?
Regards
Ingo
Hi Ingo and thanks for your interesting question!
Singer alone does not support any transformations on the way from the input to the output. But there are Tools like Meltano that wrap around Singer and add the functionality of using dbt for data transformation. You can read more about Meltano in two recent blog articles by a colleague here:
– Part 1 : https://www.scalefree.com/scalefree-newsletter/open-source-production-grade-data-integration-part-1/
– Part 2:https://www.scalefree.com/scalefree-newsletter/open-source-production-grade-data-integration-part-2/
These articles give you an introduction of Meltano, but cover the transformation aspect only briefly. But to learn more about that specific part, you can check out the Meltano documentation about their transformation implementation here: https://docs.meltano.com/guide/transformation .
I hope this helps you, feel free to reach out again!
Greetings, Tim