A toLocaleString Mystery

16 July 2020

by Fotis Papadogeorgopoulos

Recently, at work, one of our tests started failing. Our site is available in 11 languages, and the months for Azerbaijani (with the Latin script) had inconsistent capitalisation!

After investigating, and a little bit of good guessing, it turned out to be an issue with the localisation data in the browser and Node.

This post digs into how I went about investigating that issue, with some diversions along the way. I hope it gives you a fun insight into how localisation data ends up in JavaScript APIs, and how to spot errors!

The bug

Let’s frame the problem.

We have a function that provides a list of months, localised for one of the languages and scripts that we support. For English US, that would be “January, February, March…”.

JavaScript environments, whether web browsers such as Chrome and Firefox, or Node, provide a set of APIs for localisation and internationalisation. Two common ones are the Intl namespace of APIs, and the Date object with its toLocaleString method. We use toLocaleString specifically to get a localised month, for each month of the calendar.
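As a minimal sketch of those two entry points (the fixed date and the en-US locale are my own choices for illustration, not part of our codebase), both calls below ask for the long month name of the same date:

```javascript
// A fixed date in the middle of January; months are zero-based in Date.
const date = new Date(1970, 0, 15);

// Option 1: Date.prototype.toLocaleString with formatting options.
const viaDate = date.toLocaleString('en-US', { month: 'long' });

// Option 2: an Intl.DateTimeFormat instance, which can be reused for many dates.
const formatter = new Intl.DateTimeFormat('en-US', { month: 'long' });
const viaIntl = formatter.format(date);

console.log(viaDate, viaIntl);
```

Both go through the same underlying locale data, so I would expect them to agree for a given runtime.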

However, the result of calling those APIs can vary depending on the data that each browser has available.

Because that variation can be unexpected (especially for people who have not worked with multiple locales or scripts before), last year we added a series of tests to verify the localisation of months.

Then, at some later point, our tests started failing:

AssertionError: expected [ Array(12) ] to deeply equal [ Array(12) ]
+ expected - actual

[
-  "yanvar"
+  "Yanvar"
    "Fevral"
-  "mart"
+  "Mart"
    "Aprel"
    "May"
    "İyun"
    "İyul"
    "Avqust"
    "Sentyabr"
    "Oktyabr"
    "Noyabr"
-  "dekabr"
+  "Dekabr"
]

In other words: the months for Azerbaijani with the Latin script, Yanvar (January), Mart (March) and Dekabr (December) were lower case, while all the other months were capitalised.
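A quick way to spot this kind of inconsistency without reading the diff by eye could look like the sketch below. The helper name is hypothetical, and the array is the "actual" side of the failing assertion above.

```javascript
// Hypothetical helper: return the names whose first letter is lower case.
function namesStartingLowerCase(names) {
  return names.filter((name) => {
    const first = Array.from(name)[0]; // first code point, not UTF-16 unit
    return first !== undefined
      && first === first.toLowerCase()
      && first !== first.toUpperCase();
  });
}

const actualMonths = [
  'yanvar', 'Fevral', 'mart', 'Aprel', 'May', 'İyun',
  'İyul', 'Avqust', 'Sentyabr', 'Oktyabr', 'Noyabr', 'dekabr',
];

console.log(namesStartingLowerCase(actualMonths)); // [ 'yanvar', 'mart', 'dekabr' ]
```

The double check against both toLowerCase and toUpperCase is there so that characters without a case (digits, for example) are not reported as lower case.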

First step, checking our own function

Before going down the path that the data might be wrong, let’s make sure that our own function is not doing anything absurd.

The function itself is provided below, a small wrapper around calling toLocaleString for 12 Dates.

function getArrayOfMonths(localeTag) {
  // Months for Date are zero-based: 0..11
  const months = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11].map((month) => {
    const dateobj = new Date(1970, month, 15);
    return dateobj.toLocaleString(localeTag, { month: 'long' });
  });
  return months;
}

(There are subtleties to getting a list of months this way, which may make the results wrong or unidiomatic. In our use, those are fine, but I am listing an example with noun cases at the end of the article.)

Running this function in Firefox and Node (with localisation data, more on that later!) brings up the same results:

// Node
// NODE_ICU_DATA=node_modules/full-icu node
// Welcome to Node.js v12.16.3.

> console.log(getArrayOfMonths('az-AZ'));

[
  'yanvar',   'Fevral',
  'mart',     'Aprel',
  'May',      'İyun',
  'İyul',     'Avqust',
  'Sentyabr', 'Oktyabr',
  'Noyabr',   'dekabr'
]

// Firefox

> console.log(getArrayOfMonths('az-AZ'));

Array(12) [ "yanvar", "Fevral", "mart", "Aprel", "May", "İyun", "İyul", "Avqust", "Sentyabr", "Oktyabr", … ]

Firefox and Node showing the same inconsistent capitalisation was my first hint. They are different engines, so both of them processing the data in the same odd way seemed like too much of a coincidence.

Chrome only prints out English months, but that is as-intended, because it does not support Azerbaijani in Intl/toLocaleString yet, and I did not specify a fallback.

Finding if a locale is supported with Intl

The Intl family of APIs is really powerful. It has a bunch of namespaces and constructors to account for different linguistic artefacts and locales. For example, there is Intl.DateTimeFormat for formatting dates and times (day month year? month day year? fight!).

One useful function is Intl.DateTimeFormat.supportedLocalesOf. It takes an array of locales as BCP 47 language tags, such as en-GB (English as used in Great Britain) or el-GR (Greek as used in Greece), and returns an array of the ones that are supported:

> console.log(Intl.DateTimeFormat.supportedLocalesOf(['az-AZ', 'en-GB', 'el-GR']))

[ 'az-AZ', 'en-GB', 'el-GR' ]
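Building on supportedLocalesOf, here is a rough sketch of picking a locale with an explicit fallback. The helper name and the en-US fallback are assumptions for illustration; this is not what our site does.

```javascript
// Hypothetical helper: return the first candidate the runtime claims to
// support for date formatting, or the given fallback if none is supported.
function firstSupportedLocale(candidates, fallback) {
  const supported = Intl.DateTimeFormat.supportedLocalesOf(candidates);
  return supported.length > 0 ? supported[0] : fallback;
}

const chosenLocale = firstSupportedLocale(['az-AZ', 'en-GB'], 'en-US');
const midJanuary = new Date(1970, 0, 15);
console.log(chosenLocale, midJanuary.toLocaleString(chosenLocale, { month: 'long' }));
```

toLocaleString also accepts an array of locale tags directly and picks the first one it can use, so the helper is mostly useful when you want to know which locale was chosen.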

Locales are an interaction of languages, regions and scripts. This post already has too many diversions, and I don’t feel qualified to give you good examples.

To account for these interactions, BCP 47 tags have optional components for scripts, region or country codes, variants, and also reserved extensions. I found this article from MDN on locale identification useful for a short explanation.

Azerbaijani (as far as my searching shows, and I might be wrong) has both a Latin and Cyrillic script. In BCP 47 format, these would be az-Latn-AZ and az-Cyrl-AZ respectively. As far as I can tell, az-AZ defaults to Latin, but I am not sure whether this is an artefact of a specific data source.
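If your runtime has Intl.Locale, it can split a tag into those components. This is a small sketch; what maximize() prints depends on the data the runtime ships, so I am deliberately not stating its output here.

```javascript
// Intl.Locale parses a BCP 47 tag into its components.
const explicitTag = new Intl.Locale('az-Latn-AZ');
console.log(explicitTag.language, explicitTag.script, explicitTag.region);

// Without a script subtag, `script` is undefined...
const shortTag = new Intl.Locale('az-AZ');
console.log(shortTag.script);

// ...and maximize() fills in the likely subtags from the runtime's data,
// which is one way to see what script az-AZ defaults to.
console.log(shortTag.maximize().toString());
```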

A past Chrome bug with supportedLocalesOf

When I started seeing issues with Azerbaijani in particular, I was already on my toes about issues with data.

About a year ago, we had run into a bug with Azerbaijani and Chrome. Chrome claimed it supported Azerbaijani, when asked via supportedLocalesOf, but would give placeholder months in practice.

In particular, this was the behaviour from this function back then (circa July 2019):

> Intl.DateTimeFormat.supportedLocalesOf(['az-AZ']);

// Means it is supported
['az-AZ']

> getArrayOfMonths('az-AZ')
[M0, M1, M2, M3, ... M11]

In other words, Chrome claimed to support az-AZ, but the months were these odd M0 to M11 months, which seemed like internal placeholders. If Azerbaijani was unsupported, I would expect supportedLocalesOf to not report it, and also the months to be in English GB (because that is my system locale, and I did not specify a fallback).
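A guard against this kind of placeholder output could be sketched as follows. The helper is hypothetical, and the pattern is only based on the M0 to M11 shape we observed.

```javascript
// Hypothetical sanity check: true when every entry looks like a placeholder
// such as "M0" or "M11" rather than a real month name.
function looksLikePlaceholderMonths(months) {
  return months.length > 0 && months.every((name) => /^M\d{1,2}$/.test(name));
}

console.log(looksLikePlaceholderMonths(['M0', 'M1', 'M11']));          // true
console.log(looksLikePlaceholderMonths(['Yanvar', 'Fevral', 'Mart'])); // false
```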

After double- and triple-checking with colleagues and different platforms, I filed a bug in Chromium, and it was confirmed! It was eventually fixed, and supportedLocalesOf reports Azerbaijani as unsupported.

Long story short, Azerbaijani being unsupported indicates to me that the localisation data might be incomplete.

I have referenced “the data” multiple times now; let’s dive into what that data is, and where it comes from.

Localisation data: ICU, CLDR, oh my

Let’s take a few different Intl APIs:

DateTimeFormat, uhm, formatting (as is bugging us so far)
Pluralization (e.g. apple, 2 of them = two apples, or more complex changes for languages that differentiate between “one”, “a handful”, and “many”)
Locale names (e.g. saying that “Greek” is “Ελληνικά” in Greek)

You can imagine that all of the underlying data (calendars, names of months, pluralization rules) must be coming from somewhere!
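To make the pluralisation and locale-name items a bit more concrete, here is a small sketch. The DisplayNames part is guarded because it may be missing in older runtimes, and its result depends on Greek data being available.

```javascript
// Plural categories: English only distinguishes "one" from "other".
const plurals = new Intl.PluralRules('en-US');
const categoryForOne = plurals.select(1);
const categoryForTwo = plurals.select(2);
console.log(categoryForOne, categoryForTwo);

// Locale names, guarded because Intl.DisplayNames is newer than the rest.
if (typeof Intl.DisplayNames === 'function') {
  const names = new Intl.DisplayNames(['el'], { type: 'language' });
  // With Greek data available, this should be the Greek word for "Greek".
  console.log(names.of('el'));
}
```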

Indeed, there is a standard resource for these in the ICU (International Components for Unicode) data. Quoting from the site:

ICU makes use of a wide variety of data tables to provide many of its services. Examples include converter mapping tables, collation rules, transliteration rules, break iterator rules and dictionaries, and other locale data. Additional data can be provided by users, either as customizations of ICU’s data or as new data altogether.

A related dataset is the CLDR (Unicode Common Locale Data Repository). Quoting again from the site:

The Unicode CLDR provides key building blocks for software to support the world’s languages, with the largest and most extensive standard repository of locale data available. This data is used by a wide spectrum of companies for their software internationalization and localization, adapting software to the conventions of different languages for such common software tasks. It includes: …

The ICU data-set uses CLDR itself for many things, with a few differences:

Data which is NOT sourced from CLDR includes:
Conversion Data
Break Iterator Dictionary Data ( Thai, CJK, etc )
Break Iterator Rule Data (as of this writing, it is manually kept in sync with the CLDR datasets)

This data comes in different formats, such as XML (LDML), categorised by locale (roughly, as far as I can tell). The ICU data seems more commonly used by higher-level libraries, because the format is more compact.

With this data available, browsers have enough information to provide the richer Intl and Date localisation APIs.

Handwaving

Here are some things I am handwaving at this point.

I use ICU and CLDR rather interchangeably. As far as I can tell, the ICU data is derived from the CLDR data. I found better links for the CLDR sources, so I am digging into those.

I am also not 100% clear on whether all browsers use the ICU/CLDR data at the moment, or use some other source. I could not find anything normative about the data source in the specs (I would find that surprising anyway), and I am bad at going through issue trackers.

I found one tracking issue about Firefox transitioning to the CLDR data, and at least my testing seems to support that. Perhaps the CLDR data version would be useful for browsers to expose? Not as an API, rather an `about:` config or something similar in the UI.

Node definitely uses the ICU data, and gets its own following section for it.

Excerpt from the CLDR Data

For example, here is the top-level directory structure from one download of the CLDR data:

> tree -L 1 cldr-common-35.1/
cldr-common-35.1/common/
├── annotations
├── annotationsDerived
├── bcp47
├── casing
├── collation
├── dtd
├── main
├── properties
├── rbnf
├── segments
├── subdivisions
├── supplemental
├── transforms
├── uca
└── validity

An excerpt from the main directory:

> cldr-exploration tree -L 1 cldr-common-35.1/common/main
cldr-common-35.1/common/main
├── af_NA.xml
├── af.xml
├── af_ZA.xml
├── agq_CM.xml
├── agq.xml
├── ak_GH.xml
├── ak.xml
├── am_ET.xml
├── am.xml
├── ar_001.xml
├── ar_AE.xml
├── ar_BH.xml

And here is part of the data for English (common/main/en.xml):

<monthWidth type="wide">
    <month type="1">January</month>
    <month type="2">February</month>
    <month type="3">March</month>
    <month type="4">April</month>
    <month type="5">May</month>
    <month type="6">June</month>
    <month type="7">July</month>
    <month type="8">August</month>
    <month type="9">September</month>
    <month type="10">October</month>
    <month type="11">November</month>
    <month type="12">December</month>
</monthWidth>
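If you want to poke at a snippet like this from JavaScript, a rough sketch is below. The helper is hypothetical and uses a regular expression; a real XML parser would be more robust.

```javascript
// Hand-copied excerpt in the same shape as the <monthWidth> element above.
const xmlExcerpt = `
<monthWidth type="wide">
    <month type="1">January</month>
    <month type="2">February</month>
    <month type="3">March</month>
</monthWidth>`;

// Hypothetical helper: collect { type, name } pairs from <month> elements.
function extractMonths(source) {
  const pattern = /<month type="(\d+)">([^<]*)<\/month>/g;
  const months = [];
  let match;
  while ((match = pattern.exec(source)) !== null) {
    months.push({ type: Number(match[1]), name: match[2] });
  }
  return months;
}

const parsedMonths = extractMonths(xmlExcerpt);
console.log(parsedMonths);
```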

ICU and Node

If you have tried to work with internationalisation in Node, you might have run into the ICU data yourself.

Up until version 13 (a few months ago), Node only had a base English locale loaded. The ICU data takes up space on the order of tens of megabytes, and so Node for the longest time did not ship with it.

To get correct localisations in Node, you had to either a) build Node yourself with the full-icu dataset loaded, or b) install the correct build of the icu data locally, and provide the path via NODE_ICU_DATA.

It was messy, and probably still exists as an arcane parameter in current and aging codebases. Watch tests fail because NODE_ICU_DATA is not supplied, ugh.
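To check what a given Node process actually has loaded, something like this sketch can help. The probe with the Spanish month name is a common trick, and the helper name is my own.

```javascript
// Version strings Node exposes when it is built with Intl support.
// They can be undefined on builds without ICU, so nothing below relies on them.
const icuVersion = process.versions.icu;
const cldrVersion = process.versions.cldr;

// Probe for non-English data: the Spanish "enero" only shows up when
// more than the base English locale is loaded.
function hasNonEnglishLocaleData() {
  try {
    const january = new Date(1970, 0, 15);
    const spanish = new Intl.DateTimeFormat('es', { month: 'long' });
    return spanish.format(january) === 'enero';
  } catch (err) {
    return false;
  }
}

console.log({ icuVersion, cldrVersion, fullData: hasNonEnglishLocaleData() });
```

Running this at the start of a test suite would at least turn a confusing month mismatch into a clearer "the locale data is missing" message.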

Node getting the full ICU data built-in from version 13 was one of my favourite features, and if you’ve read this far, at least someone else might now understand my excitement!

Either way, now that we have gone through all the abbreviations, we’re in a good spot to find the data and investigate it!

Digging into the CLDR data

Time to dig into the CLDR data, to validate whether the months in Azerbaijani show up capitalised, uncapitalised, or inconsistent.

To check for any changes (and in the case of our test, regressions), I downloaded CLDR versions 35.1, 36.1 and 37.

I started browsing through the directories and quickly got lost because my search skills are bad.

I then decided to go with a more drastic approach, and headed to the command line. In my case that is GNOME Terminal on Linux, but iTerm on macOS or Windows Subsystem for Linux would work just as well, if you want to follow along.

There is a nice utility called ripgrep which can search through files very fast. It is written in Rust and is lovely, though mostly I just use it out of habit nowadays.

Anyway, I went searching through the files. I used “Yanvar” capital case and “yanvar” lower case for the known issues, as well as “Oktyabr” capital case and “oktyabr” lower case as a control.

The results from ripgrep across three versions follow, and then a long-form explanation of them.

# Yanvar capital case - 1 result from version 35.1
>  az-AZ-exploration rg "Yanvar" cldr*/**/az.xml
cldr-common-35.1/common/main/az.xml
1412:  <month type="1">Yanvar</month>

# Yanvar lower case - two results for version 36.1 and 37, one for 35.1
>  az-AZ-exploration rg "yanvar" cldr*/**/az.xml
cldr-common-37.0/common/main/az.xml
1360:  <month type="1">yanvar</month>
1404:  <month type="1">yanvar</month>

cldr-common-36.1/common/main/az.xml
1360:  <month type="1">yanvar</month>
1404:  <month type="1">yanvar</month>

cldr-common-35.1/common/main/az.xml
1368:  <month type="1">yanvar</month>

# Oktyabr capital case - one result for each version
>  az-AZ-exploration rg "Oktyabr" cldr*/**/az.xml
cldr-common-37.0/common/main/az.xml
1413:  <month type="10">Oktyabr</month>

cldr-common-36.1/common/main/az.xml
1413:  <month type="10">Oktyabr</month>

cldr-common-35.1/common/main/az.xml
1421:  <month type="10">Oktyabr</month>

# Oktyabr lower case - one result for each version
>  az-AZ-exploration rg "oktyabr" cldr*/**/az.xml
cldr-common-37.0/common/main/az.xml
1369:  <month type="10">oktyabr</month>

cldr-common-36.1/common/main/az.xml
1369:  <month type="10">oktyabr</month>

cldr-common-35.1/common/main/az.xml
1377:  <month type="10">oktyabr</month>

We have a winner! From version 36 onward, we get “yanvar” as lower case for January, while “Fevral” for February stays capitalised for all versions. The same pattern repeats with March and December. Version 35, by comparison, has both Yanvar and Fevral (and all the other months) capitalised.

Data sources

Something I found interesting: the data for months appears in two places, once in a “months” entry, and once in a “calendar” entry (again, for the Gregorian calendar).

The “months” entry has consistent capitalisation throughout. They are all lower case; “yanvar”, “fevral” and so on.

This hints to me that Firefox and Node use the “calendar” entry for the names of the months in this case. It makes sense, because if you’ll recall our original function, we go through a Date object’s toLocaleString, which deals with dates directly, rather than canonical names or anything of that sort.

Bear in mind that there might not always be a one-to-one correspondence between what the browser does and the ICU or CLDR data. I think the data is a good place to start, but it might not tell the whole story.

Changelog, Contributions

I was curious as to what changed in version 36 onward.

Diving into the Changelog for the CLDR data version 36 we find the following line:

Additionally, the following locales had at least a 15% increase in basic coverage: az (Azerbaijani / Latin script), so (Somali / Latin script).

The inconsistent months might have been entered accidentally, or were caused somehow when the coverage was expanded.

Update: A simpler way of finding the data

Update, April 14, 2024:

After version 38.0, the unicode-org/cldr-json repository on GitHub started including JSON versions of the CLDR data. Each version is tagged, so you can find it directly (here is CLDR 44.1.0, for example). These can be viewed directly in the GitHub UI, and can make for a quick gut feeling check.

I would still suggest finding the changelog in the release notes, or PRs to the unicode-org/cldr repository for more context on changes, since the cldr-json repository does not have that history.
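For a sense of what the JSON looks like, here is a sketch of reading month names out of it. The sample object is hand-written in the shape I understand the Gregorian calendar files to have, with English values for illustration rather than anything copied from a release; treat the exact path as an assumption to check against the tag you are looking at.

```javascript
// Hand-written sample in the shape of a cldr-json Gregorian calendar file.
// The values are illustrative English names, not copied from a release.
const sampleData = {
  main: {
    en: {
      dates: {
        calendars: {
          gregorian: {
            months: {
              format: { wide: { 1: 'January', 2: 'February' } },
              'stand-alone': { wide: { 1: 'January', 2: 'February' } },
            },
          },
        },
      },
    },
  },
};

// Hypothetical helper: list the wide month names for a locale and context,
// ordered by their numeric key.
function wideMonths(data, locale, context) {
  const wide = data.main[locale].dates.calendars.gregorian.months[context].wide;
  return Object.keys(wide)
    .sort((a, b) => Number(a) - Number(b))
    .map((key) => wide[key]);
}

console.log(wideMonths(sampleData, 'en', 'format'));
console.log(wideMonths(sampleData, 'en', 'stand-alone'));
```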

Future steps

This is all many words, for a simple change in our codebase at least: change the test to match the data (3 line change), alongside a description about why that is ok (200 words in the PR, however many words is this post).

I am not keen on capitalising the months ourselves (today’s hotfix is tomorrow’s footgun), but we might do that specifically for Azerbaijani, with an inverse test case to notify us when the data is updated.
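If we did go down that road, the capitalisation itself could be a sketch like the one below. The helper is hypothetical; it upper-cases only the first letter and passes the locale along, which matters for the dotted and dotless i in Azerbaijani.

```javascript
// Hypothetical workaround: upper-case only the first letter, using the
// locale's own casing rules, and leave the rest of the name untouched.
function capitaliseFirst(name, localeTag) {
  const [first = '', ...rest] = Array.from(name);
  return first.toLocaleUpperCase(localeTag) + rest.join('');
}

console.log(capitaliseFirst('yanvar', 'az')); // Yanvar
console.log(capitaliseFirst('Fevral', 'az')); // Fevral
```

The inverse test would then assert that the raw, unpatched output still contains the lower-case names, so that it fails (and reminds us to delete the workaround) once the data is corrected.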

Another thing I am looking into is contributing the consistent capitalisation to the CLDR.

As of April 2024, contributions to CLDR are made by native speakers through a survey tool (which makes sense; who am I to say what the capitalisation of months in Azerbaijani should be!), and bug reports are made through Jira tickets. The contributing doc in the cldr repository describes more of this process.

Wrapping up

Long story short: sometimes, it is the data.

This whole process was some of the most fun I’ve had at work this month! I love it when the different layers (specs, JS APIs, JS hosts, CLDR data, bugs, messiness) fall into place. Localisation and internationalisation take a lot of material effort, so diving into it makes me appreciate it much more.

In this case, I am also fond of our team’s past decisions. We had the tests in place, and had already gone into the ICU/CLDR rabbit hole a year ago, filing the Chrome bug. This was both a time-saver, and brought a smile to my face.

I hope I managed to impart at least a glimpse of that fun to you, and that you found something interesting here.

I’ll be happy to discuss this post and any linked resources!

Appendix: When this method of getting months goes wrong

As mentioned many paragraphs ago, our implementation at work goes through a Date object’s toLocaleString to get the array of months.

However, because the formatting happens in the context of a date, languages with different cases might inflect the month.

When running this function for Greek, we get the following:

> console.log(getArrayOfMonths('el-GR'));

[
  'Ιανουαρίου',  'Φεβρουαρίου',
  'Μαρτίου',     'Απριλίου',
  'Μαΐου',       'Ιουνίου',
  'Ιουλίου',     'Αυγούστου',
  'Σεπτεμβρίου', 'Οκτωβρίου',
  'Νοεμβρίου',   'Δεκεμβρίου'
]

All of these months are in the genitive case (denoting possession). This is the equivalent of saying “x of January”, “y of February” and so on in English. In our site, we use this function in the context of birthdays, so it ends up being ok! If, however, we wanted to only list the months, it would technically be wrong (we’d need the nominative case). Make sure to test for your use-case, and beware of tutorials only assuming English language rules.
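To see whether a runtime inflects the month differently depending on context, one option is to compare a month-only format with the month part of a day-and-month format. This is only a sketch with a hypothetical helper; which form the month-only call produces depends on the engine and its data version (the Node output above shows the genitive), so I am not claiming a particular result for Greek.

```javascript
// Hypothetical helper: format the same month on its own and as part of a
// "day month" date, so the two forms can be compared side by side.
function monthInContexts(localeTag, monthIndex) {
  const date = new Date(1970, monthIndex, 15);

  const alone = new Intl.DateTimeFormat(localeTag, { month: 'long' }).format(date);

  const parts = new Intl.DateTimeFormat(localeTag, {
    day: 'numeric',
    month: 'long',
  }).formatToParts(date);
  const inDate = parts.find((part) => part.type === 'month').value;

  return { alone, inDate };
}

console.log(monthInContexts('en-US', 0));
console.log(monthInContexts('el-GR', 0));
```

For English both values are the same, since English does not inflect month names.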

I am not certain how I would go about listing the months in the nominative, at least with the Date object. Intl has a draft (Stage 3) family of APIs called Intl.DisplayNames that “enables the consistent translation of language, region and script display names”. Would something similar for month names be desirable? I’m not sure! Let me know if you know of an approach.