How to Integrate with Third-Party APIs

Learn everything you need to know about integrating with third-party APIs and how you can get started today.

By Pedram Navid on March 24th, 2022

There’s no escaping third-party integrations: every company has applications they use from vendors that need to read and write data. If you’ve ever had to implement these, you know how brittle and tricky they can be. Integrating with third-party services is often complex and error-prone, and there are a lot of factors to consider, especially within the context of your data stack.

In this post, we’ll cover the key steps needed to integrate with a third-party API. We’ll discuss understanding the API, authentication, reading, writing, deployment, monitoring, scheduling, and more. Once you’ve read this guide, you’ll have a good understanding of what it takes to integrate with a third-party application and be well on your way to becoming an integration expert!

Understanding the API interface

The first step in integrating with any third-party API is understanding the API endpoints. This is sometimes easier said than done. A good place to start is the available documentation from the third-party application. In the best-case scenario, there will be plenty of examples of clearly-documented endpoints and return values, fully explained rate limits, and authentication that’s easy to implement. In the worst-case scenario, a lot of this will be trial and error as you kludge along on your journey to understand how the API works.

You’ll want to set up a test account and have some test data ready so that you can start testing reading from and writing to the integration. Most applications have a trial account which should be enough to get started, but some will require a paid subscription or even a discussion with a sales rep before you can get going.

Using tools like Postman or Insomnia can help catalog the endpoints you’ll be using and make authentication simpler by storing cookies and headers. Capturing the requests and responses up-front will make development much easier after you get started.

Authenticating the API

Once you’ve gained a basic understanding of the API, you’ll immediately want to find out how authentication to the API works. Some APIs may work with a simple API token, such as a Bearer Token provided in the header.

Other applications may have more complex workflows. OAuth access grants, for example, may provide permissions scoped to specific resources and return tokens that are only valid for a set time and need to be refreshed periodically.

It’s best practice to read secrets from the environment, rather than hard-coding them into the application configuration. To aid in development, consider having multiple accounts, one for development and one for production usage.
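As a rough sketch, reading a token from the environment and attaching it as a Bearer header might look like the following in Python. The `EXAMPLE_API_TOKEN` variable and the base URL are placeholders, not a specific vendor’s API:

```python
import os

import requests

# Hypothetical base URL; substitute the API you're actually integrating with.
BASE_URL = "https://api.example.com/v1"

# Read the secret from the environment rather than hard-coding it.
API_TOKEN = os.environ["EXAMPLE_API_TOKEN"]

session = requests.Session()
session.headers.update({"Authorization": f"Bearer {API_TOKEN}"})

response = session.get(f"{BASE_URL}/contacts", timeout=10)
response.raise_for_status()
print(response.json())
```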

In more complex applications, you may wish to allow the user to set their own tokens, as these can change frequently, and some security policies may require you to regularly change access tokens. For example, Hightouch supports both tokens and OAuth authentication depending on the destination; for HubSpot, both methods are supported.

Reading from and writing to the API

Most APIs tend to follow a REST-like interface, where data from resources can be fetched using an HTTP GET request, while information can be updated using POST, PATCH, and PUT methods. When updating data, JSON is a common format used for sending data.
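A minimal sketch of that pattern, using Python’s requests library against a hypothetical contacts endpoint (the URL and field names are illustrative):

```python
import requests

# Hypothetical REST endpoint; real resource paths and field names vary by API.
url = "https://api.example.com/v1/contacts/42"
headers = {"Authorization": "Bearer <token>"}

# Read the current state of the resource with a GET...
current = requests.get(url, headers=headers, timeout=10)
current.raise_for_status()

# ...then update it by sending a JSON body with PATCH.
updated = requests.patch(
    url,
    headers=headers,
    json={"email": "ada@example.com", "first_name": "Ada"},
    timeout=10,
)
updated.raise_for_status()
```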

Depending on the development language, you may have to resort to using HTTP requests to fetch and send data. Many APIs will paginate data for you in order to avoid very large payloads. For these, you’ll often get a URL or a page number as part of the response, but sometimes you may have to iterate through the pages until nothing is returned.
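For example, a page-number-based pagination loop might look like this; the `page` and `per_page` parameters are only illustrative, since every API names and structures these differently:

```python
import requests

def fetch_all_contacts(base_url: str, headers: dict) -> list[dict]:
    """Iterate through pages until the API stops returning results."""
    contacts: list[dict] = []
    page = 1
    while True:
        resp = requests.get(
            f"{base_url}/contacts",
            headers=headers,
            params={"page": page, "per_page": 100},
            timeout=10,
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:  # an empty page means we've reached the end
            break
        contacts.extend(batch)
        page += 1
    return contacts
```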

In some cases, client libraries will be available for the application. Python and Java libraries are very common, and JavaScript is sometimes available as well. The largest applications may offer client libraries in a number of other languages, but for many smaller APIs, you’ll have to write your own custom library.

Querying source data

You’ll likely have internal data you want to write to a third-party system, and one important question is how you will query that data. There are many different models. You could rely on an event-based system that responds to particular actions in your application: when a user registers, you fire a UserCreated event, which is processed by your integration service and used to write the new user to the third-party system.

Event-based models are simple to implement but can be difficult to reason about. If a user changes their name or email, you will need a way to capture that change and ensure the downstream systems are kept up to date. If the data changes outside of the application itself, during a data migration or a backfill, then you will quickly see the weakness of event-based systems.

Another option is to query the data directly from the database or data store. In a database, you could rely on change-data-capture to get a feed of events directly from the database and monitor for relevant changes in order to write those back to the third-party application. Another option is a batch job that queries the data using a query language like SQL, and then writes the entire contents back to the application. More complex systems could keep a history of what was written previously, so that only new or changed rows are sent to the third-party application, reducing the number of API calls required.
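A simplified sketch of that last approach, assuming a hypothetical users table in SQLite and an in-memory record of what was previously sent; a real system would persist the sent-row state between runs:

```python
import hashlib
import json
import sqlite3

def rows_to_sync(conn: sqlite3.Connection, sent_hashes: dict[str, str]) -> list[dict]:
    """Return only rows that are new or changed since the last run.

    `sent_hashes` maps a primary key to the hash of the row as it was last
    written to the third-party application.
    """
    cursor = conn.execute("SELECT id, email, first_name, last_name FROM users")
    columns = [c[0] for c in cursor.description]
    changed = []
    for row in cursor:
        record = dict(zip(columns, row))
        digest = hashlib.sha256(
            json.dumps(record, sort_keys=True, default=str).encode()
        ).hexdigest()
        if sent_hashes.get(str(record["id"])) != digest:
            changed.append(record)
            sent_hashes[str(record["id"])] = digest
    return changed
```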

Fetching destination data

In some cases, you will need to fetch data before you can write to it. Perhaps in your source data, you have a contact and a person on your team who owns the relationship to that contact. However, your third-party system doesn’t allow you to assign ownership of a contact by name or email, but only by an internal ID. You would first need to query the list of owners from the third-party system, fetch the ID by matching on email, and then use that ID to write the contact back to the system with the owner.
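A sketch of that two-step flow, with hypothetical /owners and /contacts endpoints and field names:

```python
import requests

BASE_URL = "https://api.example.com/v1"  # hypothetical endpoint

def owner_id_by_email(session: requests.Session) -> dict[str, str]:
    """Fetch the destination's owners and index them by email."""
    resp = session.get(f"{BASE_URL}/owners", timeout=10)
    resp.raise_for_status()
    return {owner["email"]: owner["id"] for owner in resp.json()}

def write_contact(session: requests.Session, contact: dict, owners: dict[str, str]) -> None:
    """Resolve the owner's internal ID before writing the contact."""
    payload = {
        "email": contact["email"],
        "owner_id": owners.get(contact["owner_email"]),  # may be None if unmatched
    }
    resp = session.post(f"{BASE_URL}/contacts", json=payload, timeout=10)
    resp.raise_for_status()
```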

Relationships are often tricky to get right, and it’s important not to make assumptions about the quality of the data coming in. You might think emails are unique, but that may not be the case, so matching on them needs to be done carefully.

Mapping Fields

In order to write data from a source to a destination, you’ll need some type of mapping between the names of the fields in the source and the names in the destination. You may have a field called user_email in your source database, but in the destination that field may be named ContactEmail. For one or two fields, it can be tempting to hard-code these, but consider making field mappings configurable: it’s likely that your business needs will evolve and more data will be required for the integration in the future. Avoiding a code change every time a field changes reduces the time it takes to respond to business needs, and dynamic field mapping makes it easier to avoid spelling mistakes that can result in corrupted data.
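As a minimal illustration, a configurable mapping can be as simple as a dictionary applied to each row; in practice the mapping would live in a config file or database rather than code, and the field names here are only examples:

```python
# Mapping from source column names to destination field names.
FIELD_MAPPING = {
    "user_email": "ContactEmail",
    "first_name": "FirstName",
    "last_name": "LastName",
}

def map_fields(row: dict, mapping: dict[str, str]) -> dict:
    """Rename source fields to their destination names, dropping unmapped ones."""
    return {dest: row[source] for source, dest in mapping.items() if source in row}

mapped = map_fields(
    {"user_email": "ada@example.com", "first_name": "Ada", "signup_date": "2022-01-01"},
    FIELD_MAPPING,
)
# {'ContactEmail': 'ada@example.com', 'FirstName': 'Ada'}
```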

In the Hightouch application, there are drop-down fields for both the source and target. Field metadata is fetched from both applications, and the user is provided with a drop-down selector for mapping between systems. As an additional bonus, the Automapper will suggest mappings based on fields that share a common name, further reducing the manual input required from the user.

Rate limits, batching, and parallelizing

Many APIs enforce rate limits that cap the number of calls you can make to a given endpoint. These rate limits often encourage the use of a batch API over an event-based one, especially for larger workloads. It can be difficult to get rate limiting right, as your application may respond more slowly in development than it does in production.

One way to stay within rate limits is to send data in batches of a fixed size. The larger the batches, the fewer calls you have to make. Take care, though: a very large batch may also result in an error, as some APIs limit the size of the data you can send in a single request. Being able to tune the batch size is helpful as you experiment to find the right balance.
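A small helper for chunking rows into fixed-size batches might look like this; `batch_size` is the knob you would tune:

```python
from itertools import islice
from typing import Iterable, Iterator

def batched(rows: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Yield fixed-size batches; the last batch may be smaller."""
    iterator = iter(rows)
    while batch := list(islice(iterator, batch_size)):
        yield batch

rows = [{"id": i} for i in range(1000)]

# A tunable batch size trades the number of API calls against payload size.
for batch in batched(rows, batch_size=200):
    print(f"would send {len(batch)} rows in one request")
```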

You’ll often (but not always, because that would be too easy) receive a 429 HTTP status code, which indicates too many requests. Usually, the response will include a Retry-After header, which is a good way of knowing how long to pause before trying the request again. In the absence of that header, you may want to simply retry with exponential back-off, hoping that the request eventually succeeds if you wait long enough. Tuning the back-off and the maximum number of retries is also largely experimental and will depend on the particular destination you are writing to.
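A hedged sketch of that retry logic, assuming Retry-After is expressed in seconds when present (some APIs return an HTTP date instead):

```python
import time

import requests

def request_with_retries(method: str, url: str, max_retries: int = 5, **kwargs) -> requests.Response:
    """Retry on 429s, honouring Retry-After when present, otherwise backing off exponentially."""
    for attempt in range(max_retries):
        resp = requests.request(method, url, **kwargs)
        if resp.status_code != 429:
            return resp
        retry_after = resp.headers.get("Retry-After")
        # Fall back to exponential back-off (1s, 2s, 4s, ...) if no header is given.
        delay = float(retry_after) if retry_after else 2 ** attempt
        time.sleep(delay)
    raise RuntimeError(f"Gave up after {max_retries} attempts: {url}")
```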

For some systems, you may benefit from increased throughput by parallelizing requests. Some APIs don’t have rate limits that would affect your data writes but may be slow enough that parallelizing requests would improve throughput. Breaking your data out into batches and sending multiple batches out at once to the same endpoint can help improve the overall time it takes to write the data to the end systems.
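One way to do this in Python is a thread pool that submits several batches at once; the batch endpoint below is hypothetical, and `max_workers` is the lever for staying under rate limits:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

URL = "https://api.example.com/v1/contacts/batch"  # hypothetical batch endpoint

def send_batch(batch: list[dict]) -> int:
    resp = requests.post(URL, json={"records": batch}, timeout=30)
    resp.raise_for_status()
    return len(batch)

batches = [[{"id": i} for i in range(j, j + 200)] for j in range(0, 1000, 200)]

# Send several batches concurrently; tune max_workers to respect rate limits.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(send_batch, batch) for batch in batches]
    for future in as_completed(futures):
        print(f"wrote {future.result()} rows")
```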

Handling errors

There are many possible errors that can occur when writing to third-party systems. Network issues could mean that the system never got your request, or that it got the request but never responded successfully. If your requests are idempotent, then retrying is a perfectly valid solution; if not, careful thought needs to go into how failed requests are retried.

Often errors will come from the third-party API itself. Some APIs use HTTP status codes to indicate particular errors, while others (like GraphQL) may return a successful status but include error details in the response body. If the documentation for the third-party system is well written, these errors will be explained in detail and you’ll be able to create logic for handling them. If not, it may be worth sending bad data and invalid requests to see how the system reports errors. It’s better to find out upfront than when it really matters.

Errors may occur for the entire request or at the row level. For example, Salesforce may return errors for a batch of requests where some rows failed but others succeeded. Exposing row-level errors can be useful for users to debug issues, or to know when errors can be safely ignored.

Classifying errors is difficult but can be really helpful. Some errors can be solved by a simple retry, while others are fatal. Being able to expose relevant errors back to the users of your systems can help with the debugging process and give your users more observability into the system. Perhaps a permission change is causing errors in your application, and the business user simply needs to enable a new permission set. Exposing that error in your application in a friendly way can help them solve issues without escalating to the engineering team.
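A rough classification based on HTTP status codes might start like this; the exact grouping will depend on the destination and its documentation:

```python
import requests

# Rough classification: some errors are worth retrying, others need a human.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}

def classify_error(resp: requests.Response) -> str:
    if resp.status_code in RETRYABLE_STATUSES:
        return "retryable"
    if resp.status_code in (401, 403):
        # Surface these to the user: a credential or permission change is needed.
        return "permission"
    return "fatal"
```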

Logging your progress

Good logging is one of those things that you don’t realize you need until it is too late. Having a good system for logging requests and responses can help your users better debug potential issues. If data is missing from a destination system, being able to search the logs for the primary key to see if the request was ever made and if it was successful can help narrow down where the failure occurred.

Logging can also help your users understand the progress of active syncs, especially long-running ones. Capturing the timestamp of each step of the integration helps with timing and profiling, so you can better understand how long integration runs take. Monitoring these can help identify potential issues before they turn into larger problems.
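For instance, logging the primary key and duration around each write gives you something to search for later; a minimal sketch using the standard library:

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("integration")

def sync_row(row: dict) -> None:
    start = time.monotonic()
    logger.info("writing row primary_key=%s", row["id"])
    # ... make the third-party API call here ...
    elapsed_ms = (time.monotonic() - start) * 1000
    logger.info("wrote row primary_key=%s duration_ms=%.0f", row["id"], elapsed_ms)
```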

Deploying integrations

Once you’ve finished developing your integration service, the next step is to find a way to deploy it. There are many different designs here, and some of this will depend on how you built your application.

For event-based integrations, lambdas are a good starting point, as they’ll respond quickly to single events before terminating. You can configure most cloud lambdas to read from a message bus or queue, making it easy to scale up to meet demand and scale down when there’s no load.
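A bare-bones sketch of an AWS Lambda handler consuming events from an SQS queue; the UserCreated payload shape is assumed for illustration, not prescribed:

```python
import json

def handler(event, context):
    """AWS Lambda entry point for events delivered via an SQS trigger."""
    for record in event["Records"]:
        payload = json.loads(record["body"])
        if payload.get("type") == "UserCreated":
            # Write the new user to the third-party API here.
            print(f"syncing user {payload.get('user_id')}")
    return {"processed": len(event["Records"])}
```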

For batch-processing, you may wish to leverage orchestration tools like Airflow, Dagster, and Prefect. These can be scheduled to run at an hourly or daily cadence and can also capture application output for logging, with built-in capabilities for retrying on failures.
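As one illustration, an hourly Airflow DAG with retries might be sketched like this (Airflow 2’s TaskFlow API; the task bodies are stubs, and Dagster or Prefect would express the same idea differently):

```python
from datetime import datetime, timedelta

from airflow.decorators import dag, task

@dag(
    schedule_interval="@hourly",
    start_date=datetime(2022, 1, 1),
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
)
def third_party_sync():
    @task
    def query_source() -> list:
        # Query the warehouse for rows to sync (stubbed).
        return [{"id": 1, "email": "ada@example.com"}]

    @task
    def write_destination(rows: list) -> None:
        # Write each row to the third-party API (stubbed).
        print(f"writing {len(rows)} rows")

    write_destination(query_source())

third_party_sync()
```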

If your needs are more complex, you may want to consider deploying an entire application, which can benefit from containerization. As your needs evolve, you may wish to have more fine-grained control over scheduling and orchestration, but be mindful that even a basic scheduler is hard to implement correctly. Consider, for example, what happens if the time it takes your application to run is longer than the interval between schedules, or that you may have a requirement to backfill data following a fix. These can be tricky to get right, and often it’s better to use a solution that has these edge cases already accounted for.

Monitoring integrations

Being able to monitor the progress of your integrations is critical to ensure that they’re operating as expected. Outside of having access to logs, you may wish to expose metrics on the integrations themselves. Some high-level metrics that could be useful are the total rows queried in the source, the total operations performed on the destination, and the time it took for the query and writes to complete. You may also wish to break down this data by type of operation, for example, adding a row versus updating an existing one.

Deciding how to expose these metrics is case-dependent. Some teams use tools like Datadog to monitor their infrastructure. By writing these metrics to a Datadog instance, you can quickly create summary dashboards and even monitors to alert when certain thresholds are exceeded.
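For example, using the DogStatsD client from the datadog Python package; the metric names, values, and tags here are illustrative:

```python
from datadog import initialize, statsd

# Assumes a local DogStatsD agent; adjust the host/port for your environment.
initialize(statsd_host="localhost", statsd_port=8125)

# High-level metrics emitted after a sync run.
statsd.gauge("integration.rows_queried", 12430, tags=["destination:hubspot"])
statsd.increment("integration.rows_written", 12401, tags=["destination:hubspot", "operation:update"])
statsd.histogram("integration.sync_duration_seconds", 87.2, tags=["destination:hubspot"])
```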

Alerting is also an important feature to consider. If noticing an outage requires someone to actively check that the system is working, it’s easy to miss one. Being able to alert when an integration fails, or when a threshold of rows does not write properly, is critical for ensuring the health of your pipeline. Slack and other instant messaging platforms are commonly used for alerting, but tools like PagerDuty are sometimes useful as well.

Some integrations are more business-critical than others, so the ability to configure alerting by integration can be important to avoid notification fatigue.

Wrapping up

As you can see, building third-party integrations might appear simple at first, but the hidden complexity behind moving data between systems means there are many concerns and trade-offs to consider.

In today’s marketplace, third-party APIs are ubiquitous, and as your company grows and becomes more sophisticated, the demands for your data teams to integrate with third-party applications will continue to scale.

If you’d like to take a deeper look into this topic, check out: The Data Engineers Guide to 3rd Party APIs to learn more about the core challenges with third-party APIs, the three concerns with data systems, architectural patterns for data ingestion, and why you should build on top of a data warehouse.

If you’d like to use a platform that takes care of these concerns so you can focus exclusively on delivering value from your data, then sign up for a free account at hightouch.io.
