How we arrived at BFF microservices with a twist of GraphQL

Improving the communication between the backend and frontend to gain performance.

While developing the infrastructure of its complex SaaS platform, UserZoom has gone through continuous evolution. This article walks through the evolution one of the teams at UserZoom went through to improve the communication between the backend and frontend, both to gain performance and to improve the maintainability of the system. We will focus on two main topics: the GraphQL protocol and the Backend-for-frontend pattern.

Section one: Evolution of BFF through time and technologies

First iteration: Plain RESTful services

Back in 2016, when we started developing a platform, we decided to build it with RESTful microservices so we could separate the concerns of each part of the platform and keep the solution maintainable. An issue we didn’t foresee at the time was that the frontend consumed those microservice endpoints directly, which over time became a challenge we had to address. The microservice endpoints had to serve requests from two very different origins: internal (other microservices) and external (the frontend). This meant we had no separation of concerns in the endpoints, so we applied different palliative measures:

  • Masking errors for all the endpoints, so the frontend wouldn’t expose stack traces of the backend code.
  • Reducing the response payloads to only the information that the frontend needs to manage.

But these solutions were a slippery slope. On one hand, masking errors hid the real error when calling from an internal service, which made calls between microservices less debuggable. On the other hand, reducing the response payloads left some backend-to-backend communications without information they needed.

Additionally, each microservice had to validate the payload, path, and query of each endpoint, and as the platform grew this validation became increasingly harder to maintain and iterate on. The team also had to manually maintain documentation of all the endpoints, responses, and payloads, since RESTful doesn’t solve this out of the box. Of course, an automatic documentation strategy like Swagger can be applied, but it has to be maintained too.
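
To give a feel for that burden, here is a minimal sketch of the kind of per-endpoint validation we mean, using Hapi with Joi (the route, fields, and handler are hypothetical, not our actual code):

const Hapi = require('@hapi/hapi');
const Joi = require('joi');

const server = Hapi.server({ port: 3000 });

// Every endpoint needed its own hand-maintained schemas for
// params, query, and payload, multiplied across the platform.
server.route({
  method: 'GET',
  path: '/campaigns/{campaignId}',
  options: {
    validate: {
      params: Joi.object({ campaignId: Joi.string().required() }),
      query: Joi.object({ includeStats: Joi.boolean().default(false) }),
    },
  },
  // getCampaign is a hypothetical service call.
  handler: (request) => getCampaign(request.params.campaignId),
});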

Last but not least, our frontend had to coordinate calls to multiple endpoints to gather all the data needed to render each page. Any one of those calls could fail, and we had to decide how to react to each failure.

The biggest concern was that, since the endpoints consumed by the frontend were the same as the backend-to-backend ones, the frontend became tightly coupled with the backend. The backend couldn’t apply any strategy (like caching or reducing/transforming the payload) only for the frontend calls, because each microservice exposed N endpoints and didn’t know which ones were called from the frontend and which from the backend.

Second iteration: RESTful Backend-for-frontends

Two years later, in 2018, with all the lessons learned from the first iteration and all the challenges in maintaining the services and frontend, we decided to start applying the backend-for-frontend (also called BFF) pattern. This pattern promotes the creation of a new microservice that handles all communications with the UI, so the UI doesn’t communicate directly with the core services. Furthermore, the new microservice acts as a layer of translation between the logic of the system and the frontend logic.

The most important concept behind the BFF is that it has to be tailored to the frontend needs, not the backend ones. In the typical blog example, the right approach would be to create an endpoint that gives the frontend all the information required to render a post page: post title, post description, post author name with image link, post publish date, all the post comments, and for each comment the author, the comment itself, post time, etc. This way, the frontend post page can be rendered without the need for any other endpoints. This methodology brings a lot of clarity to the frontend and simplifies the core services, since they no longer need to take frontend needs into account.
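
As a sketch only (the route, the service clients, and the field names are hypothetical), such a BFF endpoint could look like this:

// Hypothetical BFF endpoint: one call returns everything the post page needs.
server.route({
  method: 'GET',
  path: '/bff/posts/{postId}/page',
  handler: async (request) => {
    const postId = request.params.postId;
    // Fan out to the core services (client names are illustrative).
    const [post, author, comments] = await Promise.all([
      postsService.getPost(postId),
      authorsService.getAuthorForPost(postId),
      commentsService.getCommentsForPost(postId),
    ]);
    // Return only what the page renders, already in the page's shape.
    return {
      title: post.title,
      description: post.description,
      publishDate: post.publishDate,
      author: { name: author.name, imageUrl: author.imageUrl },
      comments: comments.map((c) => ({
        author: c.authorName,
        text: c.text,
        postTime: c.postTime,
      })),
    };
  },
});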

The BFF approach can help on many different fronts:

  • Specific strategies of caching for that frontend only.
  • New services using the BFF pattern can be maintained and evolved by the frontend developers, so the team scalability is improved.
  • Core services and frontend services don't need to be aware of each other.
  • Reduces under-fetching and over-fetching, since the responses are more tailored to the frontend.
  • Frontend developers have the power of the backend in their hands to transform the payload in any way, making it easier to process by the frontend; this can reduce processing time on the frontend, which is normally a less powerful machine.
  • Decouples the core services from the public calls so much that the core services can now use better-performing protocols between them, like gRPC.

As you can see, the BFF brought a lot of benefits, but it was still RESTful and kept all the downsides already discussed:

  • Documentation: Since the whole service exists for the frontend’s use, the API documentation with payload and response specifications had to be very well maintained.
  • Over-fetching and under-fetching: Although the BFF pattern reduced this, we still had challenges in the area due to the RESTful protocol and how it works.
  • The RESTful convention uses endpoint names and HTTP verbs (i.e. GET /campaigns/{campaignId}) to express what each endpoint does. This can be subject to interpretation and misunderstanding; additionally, sometimes this logic is blurry, and it is difficult to decide which way to go.

Third iteration: Backend-for-frontend with GraphQL

In 2019, after applying the BFF pattern and experiencing its benefits in terms of separation of concerns and the many others listed above, we were ready for the next iteration, to tackle the remaining challenges. We decided to explore the possibility of using GraphQL on top of BFF. After completing a proof of concept with promising results, we started applying this new protocol to our BFFs.

Before going into the details of how we implemented GraphQL, here’s a summary of its most important features. GraphQL bases everything on one endpoint that you call with the POST HTTP verb and a payload. This payload contains two things: query, which has a special syntax (also known as GraphQL) and is what does the magic; and variables, which is the dynamic data to query, sort of like the parameters of a function. GraphQL models three different operation types. The first one is called query and is reserved for everything that you want to retrieve from the backend, whether it is just one element or many. The second one is called mutation, which is used to inform the backend about changes that you want to make. The third and last one is called subscription, which allows you to open a real-time connection with the service.

The GraphQL language allows you to communicate two relevant things to the backend: what you want, which specifies which query/mutation/subscription you want to invoke (you can chain multiple operations too); and how you want it, which means telling the server which attributes you want from each query/mutation/subscription. Here is an example of a query that we have in the system:

query allAvailablePeriodsAndTimeSlots($timeZone: String!) {
  getAllAvailablePeriodsDates(input: { timeZone: $timeZone }) {
    startTime
    endTime
  }
  getAvailablePeriodsForDay(input: { timeZone: $timeZone }) {
    startTime
    endTime
  }
}

And a mutation:

mutation createSchedulerParticipantAndTimeSlot(
  $participantInput: CreateSchedulerParticipantInput!,
  $slotInput: CreateTimeSlotInput!
) {
  createSchedulerParticipant(input: $participantInput) {
    email
    name
    timeZone
  }
  createTimeSlot(input: $slotInput) {
    startTime
    endTime
  }
}
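
Over the wire, both operations above travel to the same single endpoint as a plain HTTP POST. Here is a minimal sketch of what a client sends (the endpoint path and the timeZone value are illustrative):

// The whole protocol rides on one endpoint and one verb.
const response = await fetch('/graphql', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    // "query" holds the GraphQL document, like the examples above...
    query: `query allAvailablePeriodsAndTimeSlots($timeZone: String!) {
      getAllAvailablePeriodsDates(input: { timeZone: $timeZone }) {
        startTime
        endTime
      }
    }`,
    // ...and "variables" holds the dynamic data, like function parameters.
    variables: { timeZone: 'Europe/Madrid' },
  }),
});
const { data, errors } = await response.json();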

So the GraphQL language is what makes this protocol extremely flexible. Combined with the GraphQL types defined in the server, you only need to call one endpoint, and that endpoint can describe all the available queries and all the attributes they return. GraphQL servers activate this feature with an option called introspection.
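
For example, with introspection enabled, a client (or a tool built on top of it) can ask the server to describe itself with a query like this one:

query {
  __schema {
    queryType {
      fields {
        name
        description
      }
    }
  }
}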

The protocol also allows for a deeper analysis of the queries, which helps keep the API as clean as possible of queries that are no longer used. Furthermore, the type definition has a way of informing clients that an attribute or query is deprecated.
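
In the type definition this is just the @deprecated directive; a sketch with an illustrative field:

type TimePeriod {
  startTime: DateTime!
  endTime: DateTime!
  start: DateTime! @deprecated(reason: "Use startTime instead.")
}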

Thanks to all this, with GraphQL we get documentation of all the possibilities of the API by design, and we eliminate over-fetching and under-fetching, since the client now chooses which data it queries and which data it leaves out. This also helps an aggregation service like a BFF to query only for needed data: each attribute of a query can come from a different service, so the overhead of calling core services for data that is never used disappears.

Although GraphQL helped resolve a lot of concerns and made the communication between frontend and backend more performant, we are still planning to improve the monitoring of query usage and how we maintain and evolve the types in the server.

Section two: Extra powers gained by GraphQL

In the previous section, we discussed GraphQL and how it works, and there an operation type named subscription was mentioned. Subscriptions allow real-time bidirectional connections between the client and the server; behind the curtains, they use WebSockets to support this type of communication. In this section, we will explain the different challenges that we encountered during the implementation and how we implemented these subscriptions.

In real-time communication scenarios, we faced three main concerns:

  • Connecting the backend and frontend easily, and maintaining that connection through time.
  • Connecting the backend event platform with the service that manages real-time.
  • Filtering and matching real-time connections from users to the backend event platform.

While GraphQL subscriptions solve the first one in an extremely seamless and simple way, the complexity in our case lay in the second and third concerns.

The second concern was how to connect the backend event platform with the service where the real-time connections live. Back then, we had two different event platforms on our radar: Kafka and SNS/SQS. We had historically used SNS/SQS, but due to various challenges we had decided to move newer communications to Kafka. With two options in use, we had to assess which of the two platforms to use to integrate the GraphQL service with the event platform.

Event platforms in the backend normally need a mechanism to guarantee that a given group of consumers consumes each event only once. This is very important when these events are used to propagate updates between microservices. But in the case of real-time communications, each consumer in the group has to consume all the events, because each consumer (an instance of a microservice) holds a pool of connections with different users that have to receive the same update. This requirement meant that SNS/SQS wasn’t a good fit for us, because it would have required a different SQS queue per consumer, which would have been difficult to implement and maintain.

In Kafka, by contrast, this functionality is controlled with consumer groups: if multiple consumers share the same consumer group name, they form a group and consume each event only once. So each microservice instance just had to identify itself as a different consumer group, thereby allowing every instance to consume all the events.
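
This is not our exact setup, but the idea looks roughly like this with a Node.js client such as kafkajs, assuming each instance can derive a unique identifier (here, its hostname or PID):

const { Kafka } = require('kafkajs');

const kafka = new Kafka({ clientId: 'graphql-bff', brokers: ['kafka:9092'] });

// A unique groupId per instance means every instance receives every event,
// instead of the events being load-balanced across a shared group.
const consumer = kafka.consumer({
  groupId: `graphql-bff-${process.env.HOSTNAME || process.pid}`,
});

async function start() {
  await consumer.connect();
  await consumer.subscribe({ topic: 'campaignStatusLive' });
  await consumer.run({
    eachMessage: async ({ topic, message }) => {
      // Hand the event off to the in-process subscribers (hypothetical hook).
      onMessage(topic, JSON.parse(message.value.toString()));
    },
  });
}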

Apart from these challenges, connecting the event platform to the GraphQL BFF service was simple, since the Apollo GraphQL server has a utility package called graphql-subscriptions that helps you implement the connector between the system’s event platform and the subscription operations in GraphQL. The following is the interface that we needed to implement:

export abstract class PubSubEngine {
  public abstract publish(triggerName: string, payload: any): Promise<void>;
  public abstract subscribe(
    triggerName: string,
    onMessage: Function,
    options: Object
  ): Promise<number>;
  public abstract unsubscribe(subId: number);
  public asyncIterator<T>(triggers: string | string[]): AsyncIterator<T> {
    return new PubSubAsyncIterator<T>(this, triggers);
  }
}

And this was our implementation of the interface:

const { PubSub } = require('graphql-subscriptions');

// Bridges our backend event client with GraphQL subscriptions.
class EventDriverPubSub extends PubSub {
  constructor(eventDriver) {
    super();
    this.eventDriver = eventDriver;
  }

  // Publishing from the BFF is not supported in our setup.
  publish() {
    throw new Error('Not implemented');
  }

  unsubscribe(id) {
    this.eventDriver.unsubscribe(id);
  }

  subscribe(triggerName, onMessage) {
    return this.eventDriver.subscribe(triggerName, onMessage);
  }
}

module.exports = EventDriverPubSub;

Here, eventDriver is our backend client for Kafka. With that in place, we could connect the frontend to real-time updates from the backend using the asyncIterator utility method:

// Sample query (client)
subscription {
  availablePeriodsChanged {
    startTime
    endTime
  }
}

// Subscription resolver (server)
resolvers.Subscription = {
  availablePeriodsChanged: {
    subscribe: () => pubsub.asyncIterator(['campaignStatusLive', 'campaignStatusOffline']),
    resolve: (payload, args, context, info) => {
      return { startTime: new Date(), endTime: new Date() };
    },
  },
};

As a result, the frontend is subscribed to the availablePeriodsChanged event, which the backend fires each time the GraphQL service receives the Kafka events campaignStatusLive or campaignStatusOffline. However, this led us to the third concern: you don’t want to propagate all the backend events to the frontend, only the ones relevant to the specific end user.

To resolve the third concern, we had to filter the backend events so that each user only receives the ones they need. We used the apollo-server package, which includes the withFilter utility that can be used in each subscription resolver to do exactly this filtering.

The example below is the GraphQL definition of the subscription:

type TimePeriod {
  startTime: DateTime!
  endTime: DateTime!
}

input AvailablePeriodsSubscriptionInput {
  startTime: DateTime!
  endTime: DateTime!
}

extend type Subscription {
  availablePeriods(input: AvailablePeriodsSubscriptionInput!): [TimePeriod]
}

Here we define the information that the frontend needs to provide when subscribing to this event (AvailablePeriodsSubscriptionInput) and the response that the server sends to the frontend with each event ([TimePeriod]).

const { withFilter } = require('apollo-server');

module.exports = (server) => ({
  subscribe: withFilter(
    // Listen to the backend event stream...
    () => pubsub.asyncIterator(['scheduleUpdated']),
    // ...but only let through the events relevant to this connection.
    ({ eventName, event }, variables, { connection }) => {
      return isEventRelevantToContext(event, connection.context);
    }
  ),
  resolve: async (eventPayload, variables, { connection }) => {
    return eventPayload.availableTimes;
  },
});

The key piece here is the withFilter function, which lets the system run the isEventRelevantToContext comparison for every event. This comparison is what allows the system to avoid flooding the frontend with all the scheduleUpdated events, sending only the ones relevant to the user behind the connection.
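
The predicate itself can stay small; a hypothetical version (the attribute names are illustrative) could be:

// Only forward events that belong to the user behind this connection.
function isEventRelevantToContext(event, context) {
  return event.accountId === context.accountId
    && event.studyId === context.studyId;
}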

This is possible thanks to the connection object, which represents the real-time connection with the user. Since the user needs to authenticate with a JWT when connecting, we can extract information from it into the connection.context object, which helps identify the connection and its needs.
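
That context is built when the WebSocket connection is established. With Apollo Server 2 this happens in the subscriptions onConnect hook; here is a sketch, assuming typeDefs and resolvers are defined elsewhere, with illustrative claim names and secret handling:

const { ApolloServer } = require('apollo-server');
const jwt = require('jsonwebtoken');

const server = new ApolloServer({
  typeDefs,
  resolvers,
  subscriptions: {
    onConnect: (connectionParams) => {
      // Verify the JWT the client sends when opening the WebSocket.
      const claims = jwt.verify(connectionParams.authToken, process.env.JWT_SECRET);
      // Whatever we return here becomes connection.context in the resolvers.
      return { accountId: claims.accountId, studyId: claims.studyId };
    },
  },
});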

And this is how we used GraphQL BFF microservices to connect in real-time with the frontend. To summarize, the flow works in three steps:

1. Connect the backend event system with the GraphQL subscription.

2. Filter the backend events with the information for each connected user.

3. Process the request and generate the desired payload for the frontend to receive.

Section three: The road ahead

The journey through these steps was a good one, and we have enjoyed the possibilities that the BFF pattern and GraphQL bring us, as well as the power of microservices. We still have a lot to improve and figure out, since along the way we have left tech debt here and there. To name a few items:

  • Authentication: We handle authentication with Ambassador and an authentication service, and GraphQL is a challenge in this pattern, since you need to parse the HTTP payload to know what each request is trying to access. With GraphQL subscriptions and WebSockets, this poses an even bigger challenge, since we couldn’t figure out how to authenticate this type of request with Ambassador in a maintainable way.
  • We decided to start using TypeScript, as it will make the code more maintainable than plain JS, so we now face the challenge of migrating these GraphQL BFF services to TS.
  • We historically used Hapi when creating RESTful services, so when we started using GraphQL we built it on top of Hapi too, with Apollo Server. We have to evaluate whether we can remove yet another dependency and use Apollo Server alone, without the Hapi framework as an intermediary.
  • There are SaaS solutions that help you monitor which queries are used, but we didn’t feel comfortable sharing that information, so we have to find a long-term solution to have this data available, in order to deprecate and remove queries and attributes from the GraphQL BFF.

The plan is to keep looking for better ways to improve our stack and practices, as we have been doing so far.

Last but not least, I’d like to thank Edu Ponte, Jordi Ibañez, Miquel de Arcayne, and Klara Furstner for helping me with the review of this article and so much more every day. And to my whole team for covering for me while I was working on the copy. I would also like to give a special mention to Marc Anell for the big role he played in everything we explained in this article. As one of our company values says, there is no I in team, so consider what is explained above the effort of multiple people throughout the years.